A method and system for feature word extraction from document set based on location information

A location-information-based extraction technique, applied to feature word extraction from document sets, addresses the low accuracy of feature word extraction and the resulting need for manual correction. It improves extraction accuracy and reduces the cost of manual correction.

Active Publication Date: 2021-02-19
TCL CORPORATION
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] The technical problem to be solved by the present invention is to provide a method and system for extracting feature words from document sets based on location information, addressing the problem that existing TF-IDF feature word extraction methods extract feature words from document sets with low accuracy and require manual correction.

Examples

Embodiment Construction

[0042] The present invention provides a method and system for extracting feature words from a document set based on location information. To make the purpose, technical solution, and effect of the present invention clearer and more definite, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

[0043] Term frequency (TF) refers to how often a given word appears in a document.

[0044]

[0045] Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word t_i is obtained by dividing the total number of documents |D| by the number of documents containing the word, |{j : t_i ∈ d_j}|, and taking the logarithm of the quotient:

[0046] $$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

[0047] The TF-IDF weight is:

[0048] tf·idf...
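
The definitions in [0043] to [0047] can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names and the toy documents are hypothetical, the natural logarithm is used because the text does not name a base, and normalizing the term count by document length is an assumption based on common convention.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length.

    The length normalization is an assumption (a common convention).
    """
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, docs):
    """Inverse document frequency: log(|D| / |{j : t_i in d_j}|)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(docs) / containing)


def tf_idf(term, doc_tokens, docs):
    """TF-IDF weight of `term` in one document of the set."""
    return tf(term, doc_tokens) * idf(term, docs)


# Hypothetical toy document set, already tokenized.
docs = [
    ["tv", "screen", "tv", "remote"],
    ["phone", "screen", "battery"],
    ["tv", "stand"],
]

weight = tf_idf("tv", docs[0], docs)  # 0.5 * log(3 / 2)
```

A word that appears in every document gets an IDF of log(1) = 0, so its TF-IDF weight is zero regardless of how often it occurs.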

Abstract

The invention discloses a method and system for extracting feature words from a document collection based on position information. The method comprises the following steps: vector space model processing is carried out on the document collection; the position information of each feature word in each document of the collection is obtained, and position weights are calculated from that position information; the word frequencies are weighted by the position weights, the TF-IDF weights of the feature words in the document collection are calculated from the weighted frequencies, and the TF-IDF weights are ranked to obtain the feature words of the document collection. Because the position weights are incorporated into the TF-IDF weights, the precision of feature word extraction is improved, the accuracy of automatic classification of the document collection is improved, and the cost of manual correction is reduced.
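
The steps in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed method: the zones ("title", "body"), the weight values, the normalization of the weighted frequency, and the summation of per-document weights into one score per word are all hypothetical, since the page does not show how the position weights are computed or combined.

```python
import math
from collections import defaultdict

# Hypothetical position weights: the source gives neither zones nor values.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_tf(doc):
    """Position-weighted term frequency for one document.

    `doc` maps a zone name to its token list. Each occurrence counts with
    the weight of its zone; the result is normalized by the total weight.
    """
    mass = defaultdict(float)
    for zone, tokens in doc.items():
        for tok in tokens:
            mass[tok] += POSITION_WEIGHTS[zone]
    total = sum(mass.values())
    return {tok: m / total for tok, m in mass.items()} if total else {}


def rank_feature_words(docs, top_k=3):
    """Rank words by weighted TF-IDF summed over the set (an assumption)."""
    n_docs = len(docs)
    tfs = [weighted_tf(d) for d in docs]
    df = defaultdict(int)  # number of documents containing each word
    for doc_tf in tfs:
        for tok in doc_tf:
            df[tok] += 1
    scores = defaultdict(float)
    for doc_tf in tfs:
        for tok, w in doc_tf.items():
            scores[tok] += w * math.log(n_docs / df[tok])
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy document set with a title zone and a body zone.
docs = [
    {"title": ["oled", "tv"], "body": ["the", "tv", "has", "an", "oled", "panel"]},
    {"title": ["phone", "battery"], "body": ["the", "phone", "battery", "lasts"]},
    {"title": ["tv", "stand"], "body": ["the", "stand", "holds", "a", "tv"]},
]

ranked = rank_feature_words(docs)
```

In this toy set, a word that occurs once in a title and once in the body carries four units of weight instead of two, so title words rise in the ranking relative to plain TF-IDF.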

Description

Technical field

[0001] The present invention relates to the technical field of extracting feature words from document sets, and in particular to a method and system for extracting feature words from document sets based on location information.

Background

[0002] In the information age, the volume of information grows every day. Feature word selection for a document set picks a subset of representative feature words from the original high-dimensional set of feature words. The selected feature words are then used in subsequent document set processing, which improves classification efficiency and mitigates the instability of classifiers in high-dimensional settings.

[0003] Commonly used feature word selection methods include TF-IDF, information gain, the chi-square test, and mutual information. Among them, the TF-IDF method is simple in form and structure and has a high accuracy rate. However, the traditional TF-IDF method mainly has the following ...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/35
CPC: G06F16/35
Inventors: 吴成龙, 王巍
Owner: TCL CORPORATION