Topic feature text keyword extraction method

A keyword extraction method based on topic features, applied in the field of natural language processing. It addresses the problems that existing methods cannot effectively reflect the importance of words or the distribution of feature words, cannot adjust weights well, and cannot extract low-frequency words. Its effects are a reduced influence of human subjective factors, avoidance of adverse effects, and good scalability.

Inactive Publication Date: 2018-11-06
10TH RES INST OF CETC
3 Cites, 47 Cited by

AI Technical Summary

Problems solved by technology

The disadvantages of the statistical method are that the amount of calculation is large; the extraction results include meaningless, incomplete strings, resulting in low accuracy; low-frequency words cannot be extracted; a large amount of ...

Examples

Embodiment Construction

[0022] Refer to Figure 1. According to the present invention, a method for extracting text keywords based on topic features includes the following steps. With text as the carrier of information, keyword extraction according to topic distribution characteristics is divided into a training stage and a test stage. The text keyword extraction algorithm model consists of the training text preprocessing module, the inverse document frequency calculation module, the topic model learning module and the global weight calculation module of the training stage, together with the test text preprocessing module, the local weight calculation module, and the comprehensive score calculation and sorting module of the test stage. The training text preprocessing module sequentially performs Chinese word segmentation, stop word removal and part-of-speech filtering on the input training text data, and then inputs the preprocessed training text data to the ...
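The preprocessing order described in [0022] (segmentation, then stop word removal, then part-of-speech filtering) can be illustrated with a minimal Python sketch. It assumes segmentation and tagging have already been done upstream by some segmenter that yields (word, tag) pairs. The stop list, the tag set and the function name `preprocess` are hypothetical and are not taken from the patent.

```python
# Illustrative assumptions: the patent does not list its stop words or
# the part-of-speech tags it keeps.
STOP_WORDS = {"the", "of", "is"}
KEEP_POS = {"n", "v", "a"}  # hypothetical tags: noun, verb, adjective


def preprocess(tagged_tokens, stop_words=STOP_WORDS, keep_pos=KEEP_POS):
    """Apply stop word removal, then part-of-speech filtering.

    `tagged_tokens` is a list of (word, pos_tag) pairs produced by an
    upstream word segmenter, which is outside this sketch.
    """
    # Step 1: drop stop words.
    without_stops = [(w, pos) for w, pos in tagged_tokens if w not in stop_words]
    # Step 2: keep only words whose tag is in the allowed set.
    return [w for w, pos in without_stops if pos in keep_pos]


if __name__ == "__main__":
    sample = [("radar", "n"), ("of", "u"), ("signal", "n"),
              ("very", "d"), ("strong", "a")]
    print(preprocess(sample))
```

The same function would serve both the training text preprocessing module and the test text preprocessing module, since the text describes the same three steps for each stage.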

Abstract

The invention discloses a topic feature text keyword extraction method that obtains better text keyword extraction results than the traditional TF-IDF method. According to the technical scheme, at the training stage, word segmentation, stop word removal, part-of-speech filtering and other preprocessing are performed on the training text, and the inverse document frequency of the words is obtained by statistical analysis. Meanwhile, a topic model method is used to learn a topic probability matrix of the words, which is then normalized. The topic distribution entropy of each word is calculated from the topic probability matrix, the global weight of each word is calculated by combining the inverse document frequency with the topic distribution entropy, and the global weight calculation results are output to the test stage. After a test text is preprocessed, the normalized term frequency of the words in the test text is obtained by statistical analysis and combined with the global weights from the training stage. Comprehensive scores of the words are calculated and sorted, and the several words with the highest scores are used as the automatic keyword extraction results for the current test text.
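The two-stage scoring in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented formula: the abstract does not say how inverse document frequency and topic distribution entropy are combined, nor how term frequency is normalized. Here the global weight is assumed to be IDF multiplied by one minus the entropy normalized by log K, the IDF is smoothed, and term frequency is divided by the largest count in the text. The word-topic probabilities are passed in as a plain dictionary, standing in for the output of a topic model. All function names are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """IDF per word over tokenized training documents (smoothing is an assumption)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}


def topic_entropy(topic_probs):
    """Shannon entropy of one word's topic distribution, normalized to sum to 1."""
    total = sum(topic_probs)
    return -sum((p / total) * math.log(p / total) for p in topic_probs if p > 0)


def global_weights(idf, word_topic):
    """Assumed combination: IDF * (1 - H / log K), so words spread evenly
    over all K topics get a weight near zero."""
    weights = {}
    for word, probs in word_topic.items():
        k = len(probs)
        h_norm = topic_entropy(probs) / math.log(k) if k > 1 else 0.0
        weights[word] = idf.get(word, 0.0) * (1.0 - h_norm)
    return weights


def extract_keywords(test_tokens, weights, top_n=3):
    """Score = normalized term frequency * global weight; return the top_n words."""
    counts = Counter(test_tokens)
    if not counts:
        return []
    max_count = max(counts.values())  # assumed normalization: divide by max count
    scores = {w: (c / max_count) * weights.get(w, 0.0) for w, c in counts.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    train = [["radar", "signal", "the"], ["the", "signal"], ["the", "antenna"]]
    word_topic = {"radar": [0.9, 0.1], "signal": [0.7, 0.3],
                  "the": [0.5, 0.5], "antenna": [0.2, 0.8]}
    gw = global_weights(inverse_document_frequency(train), word_topic)
    print(extract_keywords(["the", "radar", "the", "signal", "radar"], gw, top_n=2))
```

Under this assumed combination, a word that appears in every document and is spread uniformly across topics ("the" in the toy data) receives a global weight of zero regardless of how often it occurs in the test text, which is the kind of behavior the entropy term is meant to add on top of plain TF-IDF.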

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and in particular relates to a method for extracting text keywords based on topic distribution characteristics of words.

Background technique

[0002] Keyword extraction is the key to technologies such as information retrieval, text classification and clustering, and automatic abstract generation, and is an important means to quickly obtain document topics. Keywords are traditionally defined as a set of words or phrases that summarize the subject matter of a document. Keywords characterize the topic and key content of a document, and are the smallest unit to express the core content of a text. Keywords have very important applications in many fields, such as automatic summarization of documents, web page information extraction, classification and clustering of documents, search engines, etc. However, in most cases, the text does not directly provide keywords, so it is necess...

Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F40/279
Inventor: 彭易锦, 代翔, 黄细凤, 王侃, 杨拓
Owner 10TH RES INST OF CETC