Unsupervised keyword extraction method

A keyword extraction method in the field of text processing algorithms. It addresses the limitation that existing methods do not directly consider the relevance of phrases, which makes extraction quality difficult to improve further, and it enriches the semantic information available to the algorithm to improve the extraction results.

Pending Publication Date: 2019-11-19
SUN YAT SEN UNIV +1

AI Technical Summary

Problems solved by technology

The current mainstream graph-based unsupervised keyword extraction methods follow the idea of PageRank, whose core is that an important word raises the importance of the words most strongly correlated with it. However, these methods generally build word graphs, so phrase-level relevance is captured only indirectly, through the correlation between words (word co-occurrence or semantic similarity), and the correlation between phrases as a whole is not considered directly. This makes it difficult to further improve extraction quality.
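To illustrate the PageRank idea described above, here is a minimal Python sketch that ranks words on a co-occurrence graph. It shows the conventional word-graph approach, not the method of this invention. The function names, the co-occurrence window, and the damping factor of 0.85 are assumptions chosen for the example.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected weighted word graph; edge weight = co-occurrence count
    of two different words within `window` positions of each other."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v != w:
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration: each word passes its score to its neighbours
    in proportion to the edge weights."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            incoming = sum(score[m] * graph[m][n] / out_weight[m] for m in graph[n])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


# Hypothetical toy input: "hub" co-occurs with every other word.
scores = pagerank(cooccurrence_graph(["a", "hub", "b", "hub", "c"], window=1))
print(max(scores, key=scores.get))
```

Because every node here is a single word, a multi-word phrase can only be scored afterwards by combining the scores of its words, which is the indirectness the paragraph above points out.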

Examples

Embodiment 1

[0042] As shown in Figure 1, this embodiment is a faster and more efficient unsupervised keyword extraction method based on a word-phrase graph and the LDA topic model. The specific process is:

[0043] S1: Preprocess the document data, including removing stop words, part-of-speech tagging, and removing punctuation marks and illegal symbols, to obtain a word set W.
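A minimal Python sketch of the cleaning part of S1 follows. The stop-word list, the tokenisation regex, and the function name `preprocess` are assumptions made for this example; part-of-speech tagging is left out because it needs an external tagger, which this excerpt does not name.

```python
import re

# Tiny illustrative stop-word list (an assumption, not taken from the patent).
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "on",
              "is", "are", "to", "for", "with"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic (optionally hyphenated) tokens,
    which drops punctuation and other symbols, then remove stop words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [t for t in tokens if t not in stop_words]


tokens = preprocess("The graph-based method, is fast!")
word_set = set(tokens)  # the word set W of step S1
print(tokens)
```

Keeping the ordered token list as well as the set is a convenience for later steps that depend on word order, such as phrase chunking.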

[0044] S2: Perform noun phrase chunking (NP-chunking) using pattern matching combined with syntactic rules; specifically, use part-of-speech tags and the "adjective + noun" pattern to obtain a series of candidate key phrases.
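The chunking in S2 could be sketched as below. The sketch assumes the tokens have already been tagged with Penn Treebank style tags (JJ for adjectives, NN* for nouns) by some external tagger, and it reads the "adjective + noun" pattern as zero or more adjectives followed by one or more nouns. Both the tag set and that reading are assumptions; the excerpt does not give the exact rule.

```python
def chunk_noun_phrases(tagged):
    """Collect maximal runs matching (adjective)* (noun)+ from a list of
    (word, tag) pairs. Runs without any noun are discarded."""
    phrases, current, has_noun = [], [], False

    def flush():
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ"):
            if has_noun:
                # An adjective after a noun starts a new candidate phrase.
                flush()
            current.append(word)
        else:
            flush()
    flush()
    return phrases


# Hypothetical tagged sentence.
tagged = [("unsupervised", "JJ"), ("keyword", "NN"), ("extraction", "NN"),
          ("uses", "VBZ"), ("a", "DT"), ("topic", "NN"), ("model", "NN")]
print(chunk_noun_phrases(tagged))
```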

[0045] S3: Use the LDA topic model to calculate the word salience score of each word in the word set W obtained in S1, sort the words in descending order of score, and take the top k as the topic word set of the document.
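This excerpt does not show the formula behind the word salience score, so the sketch below uses one commonly used definition of term saliency for topic models as a stand-in: the word's overall probability multiplied by its distinctiveness, the divergence between P(topic | word) and P(topic). It assumes an LDA model has already been fitted elsewhere and takes its topic-word distributions and topic proportions as plain Python data. All names and inputs are hypothetical.

```python
import math


def word_salience(topic_word, topic_prior):
    """topic_word[t][w] = P(w | t), topic_prior[t] = P(t).
    Returns saliency(w) = P(w) * sum_t P(t | w) * log(P(t | w) / P(t))."""
    vocab = set().union(*(dist.keys() for dist in topic_word))
    scores = {}
    for w in vocab:
        p_w = sum(p_t * dist.get(w, 0.0) for dist, p_t in zip(topic_word, topic_prior))
        if p_w == 0.0:
            scores[w] = 0.0
            continue
        distinctiveness = 0.0
        for dist, p_t in zip(topic_word, topic_prior):
            p_t_given_w = p_t * dist.get(w, 0.0) / p_w
            if p_t_given_w > 0.0:
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_t)
        scores[w] = p_w * distinctiveness
    return scores


def top_k_topic_words(scores, k):
    """Sort by descending score (ties broken alphabetically) and keep k words."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Hypothetical two-topic model: "the" is spread evenly, so it is not salient.
topic_word = [{"graph": 0.8, "the": 0.2}, {"topic": 0.8, "the": 0.2}]
scores = word_salience(topic_word, [0.5, 0.5])
print(top_k_topic_words(scores, k=2))
```

Under this definition a word that is equally likely in every topic scores zero, which matches the intent of keeping only words with a high degree of topic relevance.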

[0046] S4: Use the candidate key phrases obtained in S2 and the topic words obtained in S3 to construct a phrase-word graph.
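The excerpt says that candidate phrases and topic words are the nodes of the graph but does not show how edges are defined. The sketch below therefore makes an explicit assumption: a topic word is linked to a candidate phrase when the word occurs in that phrase. This edge rule, the node labelling, and the function name are illustrative choices, not the construction claimed by the invention.

```python
def build_phrase_word_graph(candidate_phrases, topic_words):
    """Return an adjacency dict over two node types, ("phrase", p) and
    ("word", w). ASSUMED edge rule: a word is linked to each phrase
    that contains it as a token."""
    graph = {("phrase", p): set() for p in candidate_phrases}
    graph.update({("word", w): set() for w in topic_words})
    for p in candidate_phrases:
        tokens = set(p.split())
        for w in topic_words:
            if w in tokens:
                graph[("phrase", p)].add(("word", w))
                graph[("word", w)].add(("phrase", p))
    return graph


# Hypothetical inputs from S2 and S3.
phrases = ["topic model", "keyword extraction", "noisy phrase"]
graph = build_phrase_word_graph(phrases, ["topic", "keyword"])
for node, neighbours in graph.items():
    print(node, sorted(neighbours))
```

In this toy graph the phrase "noisy phrase" has no topic-word neighbour, so any ranking that passes importance along edges would give it no support from the topic words.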

[0047] S5: According to the graph structure constr...

Abstract

The invention provides an unsupervised keyword extraction method. The method obtains a set of topic words with high topic relevance through an LDA topic model and builds a phrase-word graph with the topic words and candidate phrases as nodes. The topic words are used to screen and promote candidate phrases with high topic relevance, which indirectly suppresses the influence of noisy candidate phrases on the result. For short texts, where candidate phrases are insufficient, the topic words supplement the semantic information, so the graph structure used by the algorithm carries richer semantic information. For long texts, where candidate phrases are too numerous and mixed with too much noise, the topic words play a screening role. As a result, keyword extraction is no longer sensitive to the length of the article, and extraction quality is further improved.

Description

Technical field

[0001] The invention relates to the field of text processing algorithms and, more specifically, to an unsupervised keyword extraction method.

Background art

[0002] With the rapid growth of text data (such as scientific literature, web pages, and social tweets), the analysis and mining of text data has become an important research field that attracts much attention. Within it, how to extract keywords (keyphrases, including words and phrases) that reflect the theme of a text document has always been a key basic problem and research hotspot in natural language processing, and its results can be widely used in application areas such as document retrieval, document summarization, text classification, topic detection, and question answering systems.

[0003] Among the automatic keyword extraction methods, the graph-based keyword extraction method is currently the most effective and widely studi...

Application Information

IPC(8): G06F16/33, G06F16/34, G06F17/27
CPC: G06F16/3344, G06F16/3346, G06F16/345
Inventor: 张兴宇, 潘炎, 印鉴, 潘文杰
Owner: SUN YAT SEN UNIV