Text keyword extraction method based on combination of Word2Vec and word co-occurrence

A keyword extraction technology based on the combination of Word2Vec and word co-occurrence, applied in the field of natural language processing. It addresses the low efficiency, poor results, and unsatisfactory accuracy of existing keyword extraction algorithms, and aims to achieve more accurate and more efficient keyword extraction.

Active Publication Date: 2018-01-09
NANJING UNIV OF POSTS & TELECOMM
5 cites, 31 cited by

AI Technical Summary

Problems solved by technology

Whether clustering or classification is used, lexical features alone do not carry enough information about word semantics, so the accuracy of these keyword extraction algorithms is unsatisfactory.
[0006] In summary, traditional keyword extraction methods suffer from poor extraction quality and low extraction efficiency.


Embodiment Construction

[0030] The technical solution of the present invention is described in further detail below with reference to the accompanying drawing:

[0031] Figure 1 is the overall flow chart of the method. Referring to Figure 1, the text keyword extraction method based on the combination of Word2Vec and word co-occurrence described in this embodiment comprises the following steps:

[0032] Step A): Divide the text into several clauses, perform word segmentation on the clauses, and perform part-of-speech tagging at the same time to obtain a vocabulary set;

[0033] Step B): Preprocess the vocabulary set: scan adjacent words to obtain word combinations, then use the stop-word list to filter out modal particles, auxiliary words, other unreasonable words, and any word combinations beginning with these words, yielding the preliminary candidate set D1;

[0034] Step C): the preliminary candidate set word ...
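Steps A and B can be illustrated with a minimal sketch. It is not the patent's implementation: the method relies on a word segmenter with part-of-speech tagging, whereas the hypothetical `segment` function below only lowercases and splits on whitespace, the `STOP_WORDS` list is invented for illustration, and word combinations are limited to adjacent pairs.

```python
import re

# Hypothetical stop list; the stop vocabulary list used by the method is not reproduced here.
STOP_WORDS = {"the", "of", "a", "is", "and"}


def split_clauses(text):
    """Step A (part): split the text into clauses on common punctuation marks."""
    parts = re.split(r"[,.;!?，。；！？]", text)
    return [p.strip() for p in parts if p.strip()]


def segment(clause):
    """Stand-in for a real segmenter: lowercase and split on whitespace (assumption)."""
    return clause.lower().split()


def preliminary_candidates(text, stop_words=STOP_WORDS):
    """Step B: collect single words and adjacent-word pairs per clause,
    dropping every candidate whose first word is in the stop list."""
    candidates = []
    seen = set()
    for clause in split_clauses(text):
        words = segment(clause)
        singles = [(w,) for w in words]
        pairs = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]
        for cand in singles + pairs:
            if cand[0] in stop_words:
                continue  # stop word, or a combination starting with one
            if cand not in seen:
                seen.add(cand)
                candidates.append(cand)
    return candidates


if __name__ == "__main__":
    sample = "The extraction of keywords is hard, keyword extraction helps."
    for cand in preliminary_candidates(sample):
        print(" ".join(cand))
```

Candidates are kept as tuples so that single words and combinations share one representation, and combinations never cross a clause boundary because pairs are built clause by clause.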


Abstract

The invention discloses a text keyword extraction method based on the combination of Word2Vec and word co-occurrence. The method uses the ICTCLAS word segmentation system to perform word segmentation and part-of-speech tagging on a text to obtain a vocabulary set. The vocabulary set is then preprocessed and unreasonable vocabulary combinations are filtered out, yielding a preliminary candidate set. The preliminary candidate set is fed into a trained Word2Vec model to obtain a word vector table, the distances between word vectors in the table are calculated, and k-means clustering is performed on the preliminary candidate set to obtain a secondary candidate set of keywords. The word co-occurrence rate of the secondary candidate set within the preliminary candidate set is then obtained according to the word vector distances. Different weight values are assigned to different vocabulary lengths, a weight is computed for each candidate from its word co-occurrence rate and vocabulary length, the candidates are sorted by weight, and the first m are taken as the final keywords. Because the method clusters the word vectors generated by Word2Vec and then combines word co-occurrence with other basic features, the extracted keywords are more accurate, and the method can adapt to keyword extraction from different texts.
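The clustering and ranking stage described in the abstract can be sketched roughly as follows. The abstract does not give the exact formulas, so several parts are assumptions made only for illustration: word vectors are passed in as a plain dictionary rather than read from a trained Word2Vec model, the secondary candidate of each cluster is taken to be the word nearest its centroid, the "co-occurrence rate" is approximated as the share of other candidates lying within a distance `radius`, and the final weight is the product of that rate and a per-length weight. The names `top_keywords`, `radius`, and `length_weights` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iters=20):
    """Plain k-means; the first k vectors seed the centroids (a simplification)."""
    k = min(k, len(vectors))
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_keywords(word_vectors, k, m, radius, length_weights):
    """Cluster candidates, pick one representative per cluster, score and rank them."""
    words = list(word_vectors)
    vectors = [word_vectors[w] for w in words]
    labels, centroids = kmeans(vectors, k)

    # Assumption: the secondary candidate of a cluster is the word nearest its centroid.
    secondary = []
    for c in range(len(centroids)):
        members = [w for w, lab in zip(words, labels) if lab == c]
        if members:
            secondary.append(min(members, key=lambda w: dist(word_vectors[w], centroids[c])))

    scored = []
    for w in secondary:
        # Assumption: rate = share of the other candidates within `radius` of w.
        near = sum(1 for o in words
                   if o != w and dist(word_vectors[w], word_vectors[o]) <= radius)
        rate = near / max(len(words) - 1, 1)
        # Assumption: weight = rate times a per-length factor (default 1.0).
        scored.append((w, rate * length_weights.get(len(w), 1.0)))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:m]]


if __name__ == "__main__":
    toy_vectors = {  # made-up 2-D vectors, not Word2Vec output
        "text": (0.0, 0.0), "word": (0.1, 0.0), "keyword": (0.0, 0.1),
        "rain": (5.0, 5.0), "cloud": (5.1, 5.0),
    }
    print(top_keywords(toy_vectors, k=2, m=2, radius=0.5,
                       length_weights={4: 1.0, 5: 1.2}))
```

In practice the vectors would come from the trained Word2Vec model, and the length weights would follow whatever scheme the full method specifies; the values here are placeholders.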

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a text keyword extraction method based on the combination of Word2Vec and word co-occurrence.

Background technique

[0002] Keyword extraction selects from a text the words most relevant to its meaning. These words also summarize the main content and central idea of the article. When writing a thesis, the author is generally required to provide several keywords, which help readers quickly determine whether the thesis is the one they need and serve as a preview.

[0003] Traditionally, keywords are assigned manually. Domain experts are invited to read specific documents and then select words as keywords according to the content of the text. The advantage is that the keywords are relatively accurate and generally fit well with the content of the arti...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27; G06K9/62
Inventors: 李晓飞, 刘佳雯, 韩光
Owner: NANJING UNIV OF POSTS & TELECOMM