Keyword automatic extraction method based on distributed expression word vector calculation

An automatic keyword extraction technology, applied in computing, computer components, character and pattern recognition, etc. It addresses imbalanced keyword information labeling, poor extraction of keyword groups, and the lack of a good general solution for keyword extraction, and it aims at higher extraction accuracy and improved extraction performance.

Status: Inactive
Publication Date: 2016-10-12
SHANGHAI UNIV
Cites: 5; Cited by: 37

AI Technical Summary

Problems solved by technology

[0004] An analysis of the state of research at home and abroad shows that current automatic keyword extraction techniques still have limitations:
[0005] (1) Existing automatic keyword extraction algorithms face many problems, such as polysemous words, redundant expression through synonyms, dynamic thesaurus updates, and the complexity of cross-domain content.
[0006] (2) Most automatic keyword extraction algorithms are based on small-scale experimental sample...

Examples

Embodiment Construction

[0031] Preferred embodiments of the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0032] The data sets in this embodiment are four collections of English papers, each from a different field of computer science, obtained from the IEEE digital library. Table 1 lists, for each data set, the number of papers, the number of keywords, and the number of words in the word vector vocabulary after training. From each data set, 50 papers are extracted as the test sample set, and the rest form the initial training set.
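The hold-out split described in [0032] can be sketched as follows. This is a minimal illustration, not the patent's procedure: the source does not say how the 50 test papers are chosen, so random sampling with a fixed seed is an assumption, and the function name `split_dataset` and the placeholder paper identifiers are hypothetical.

```python
import random


def split_dataset(papers, test_size=50, seed=0):
    """Hold out `test_size` papers as the test set; the rest form the training set.

    Random selection is an assumption; the source only states that 50 items
    per data set are extracted as test samples.
    """
    if len(papers) <= test_size:
        raise ValueError("dataset must contain more papers than test_size")
    shuffled = list(papers)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical data set of 200 paper identifiers.
papers = [f"paper_{i}" for i in range(200)]
train, test = split_dataset(papers)
print(len(train), len(test))  # 150 50
```

The same call would be repeated once per data set, so that each field keeps its own training and test sets.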

[0033] Table 1

[0034]

[0035] Among them, the Data Mining, Information Extraction, and Recommendation data sets each concentrate on a single field, so their corpora are relatively pure.

[0036] In this embodiment, the experiments on the automatic keyword extraction method use Google's Word2vec tool; the program is implemented in C and runs under Ubuntu. ...

Abstract

The invention relates to an automatic keyword extraction method based on distributed expression word vector calculation. The method generates features automatically and offers a better solution to automatic keyword extraction. The steps of the method are as follows: step 1, obtain the original training dataset; step 2, preprocess the training set and the test text, including removing punctuation, digits, and stop words and filtering word features; step 3, train a language model on the training set to convert it into a word vector table; step 4, use a distance calculation method to compute the distance from a keyword's word vector to the text under test; step 5, using different distance calculation methods, obtain the arithmetic average semantic distances between the distributed expression word vectors of all keywords in a field keyword set and the distributed expression word vectors of all words in the test text, then select and sort on that basis. The method offers a new approach to keyword extraction, makes full use of the semantic information in the dataset, and substantially improves the accuracy of automatic extraction.
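Steps 2, 4, and 5 of the abstract can be sketched roughly as below. This is one possible reading, not the patented implementation: it assumes each candidate keyword is scored by the arithmetic mean of its distances to every word of the test text, and that smaller means rank higher. The stop-word list, the 2-D toy vectors (standing in for the word vector table of step 3), and all function names are hypothetical.

```python
import math
import re

# Tiny illustrative stop-word list (assumption; the source does not give one).
STOP_WORDS = {"the", "a", "of", "and", "for", "in", "is", "to", "on"}


def preprocess(text):
    """Step 2: lowercase, drop punctuation and digits, remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def average_distance(keyword, words, vectors, dist):
    """Steps 4-5: mean distance from a keyword vector to the text's word vectors."""
    known = [w for w in words if w in vectors]  # skip out-of-vocabulary words
    if keyword not in vectors or not known:
        return None
    return sum(dist(vectors[keyword], vectors[w]) for w in known) / len(known)


def rank_keywords(candidates, text, vectors, dist=euclidean, top_k=3):
    """Sort candidate keywords by ascending average distance to the text."""
    words = preprocess(text)
    scored = []
    for kw in candidates:
        d = average_distance(kw, words, vectors, dist)
        if d is not None:
            scored.append((kw, d))
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Hypothetical 2-D vectors; real vectors would come from a trained model.
vectors = {
    "clustering": (0.9, 0.1),
    "mining": (0.8, 0.2),
    "data": (0.7, 0.3),
    "patterns": (0.85, 0.15),
    "recommendation": (0.1, 0.9),
}
text = "Mining patterns in data: 3 methods for clustering."
ranked = rank_keywords(["clustering", "recommendation"], text, vectors)
print([kw for kw, _ in ranked])  # ['clustering', 'recommendation']
```

Passing `dist=cosine_distance` swaps the metric without touching the rest of the pipeline, which mirrors the abstract's mention of comparing different distance calculation methods.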

Description

Technical field

[0001] The invention relates to a method for automatically extracting keywords based on distributed expression word vector calculation, belonging to the field of text mining (Text Mining).

Background technique

[0002] The continuous development of information technology has led to explosive growth of information in many fields, and a large amount of text information has been digitized. Electronic information resources such as digital libraries, electronic thesis databases, E-books, etc. have brought great convenience to people in collecting, storing and using information, and have become an indispensable part of modern life. With the continuous increase of electronic information, how to quickly and accurately obtain the required information from large-scale text information has become a huge challenge. Keyword extraction is an effective means to solve the above problems. It is one of the core technologies in the field of text mining and plays a very importa...

Application Information

IPC(8): G06F17/30, G06K9/62
CPC: G06F16/35, G06F18/24
Inventors: 朱文浩, 刘懿霆, 陈洁, 郭心怡, 丁庆功, 缪慧
Owner: SHANGHAI UNIV