Theme word vector and network structure-based theme keyword extraction method

A technology combining a network structure with an extraction method, applied in the field of keyword extraction, that addresses problems such as keyword extraction and topic identification and clustering.

Active Publication Date: 2018-05-18
SHANDONG UNIV OF SCI & TECH
3 Cites, 60 Cited by
AI Technical Summary

Problems solved by technology

Taking word frequency and word co-occurrence relationships into account, the algorithm can extract the keywords of a single document concisely and effectively. However, it cannot identify and cluster the topics of multiple documents, so it cannot extract the keywords of documents under a specific topic.

Examples

Embodiment Construction

[0041] The specific implementation of the present invention will be further described below in conjunction with the drawings and specific embodiments:

[0042] As shown in Figure 1, a topic keyword extraction method based on topic word vectors and network structure specifically includes:

[0043] Segment the original text corpus;

[0044] Perform topic clustering on the text corpus based on the LDA topic model, and obtain, for each topic, the set of the top 100 keywords ranked by topic relevance, KeywordsSet_1 = {k_1, ..., k_100};

[0045] Use word2vec to represent each word in the text corpus as a word vector, and obtain the semantic similarity between every two words by calculating the cosine value between the word vectors;

[0046] For each keyword in KeywordsSet_1, find the five words with the highest semantic similarity to it; the words in KeywordsSet_1 together with their top-5 most similar words form a new keyword set, KeywordsSet_2;

[0047] KeywordsSet_2 Each keywor...
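
The similarity and expansion steps in [0045] and [0046] can be illustrated with a minimal sketch. It assumes word vectors are already available as plain lists of floats (the source obtains them from word2vec; the toy vectors, the function names `cosine` and `expand_keywords`, and the alphabetical tie-breaking rule here are hypothetical and not taken from the patent).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

def expand_keywords(keywords_set_1, vectors, top_n=5):
    """Return keywords_set_1 plus each keyword's top_n most similar words.

    `vectors` maps word -> vector. Ties are broken alphabetically,
    which is an assumption of this sketch.
    """
    expanded = list(keywords_set_1)
    seen = set(expanded)
    for kw in keywords_set_1:
        if kw not in vectors:
            continue  # no vector for this keyword, so nothing to compare
        scored = [(cosine(vectors[kw], vec), w)
                  for w, vec in vectors.items() if w != kw]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, w in scored[:top_n]:
            if w not in seen:
                seen.add(w)
                expanded.append(w)
    return expanded

# Hypothetical 3-dimensional vectors standing in for word2vec output.
toy_vectors = {
    "topic": [1.0, 0.0, 0.0],
    "theme": [0.9, 0.1, 0.0],
    "subject": [0.8, 0.2, 0.1],
    "network": [0.0, 1.0, 0.0],
    "graph": [0.1, 0.9, 0.0],
    "fabric": [0.0, 0.0, 1.0],
}

# top_n=2 keeps the toy example small; the method described above uses 5.
keywords_set_2 = expand_keywords(["topic", "network"], toy_vectors, top_n=2)
print(keywords_set_2)
```

In practice the vectors would come from a trained word2vec model rather than a hand-written dictionary; the expansion logic itself does not depend on where the vectors originate.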

Abstract

The invention discloses a theme keyword extraction method based on theme word vectors and network structure, and particularly relates to the technical field of extracting keywords from texts. The method comprises the following steps: carrying out theme clustering on a text corpus on the basis of an LDA theme model, and obtaining, for each theme, the 100 keywords with the highest relevance to that theme; expressing each word in the text corpus as a word vector by utilizing word2vec, obtaining the semantic similarity between every two words through calculation, and finding, for each of the keywords, the 5 words with the highest semantic similarity to it, wherein the keywords together with their top-5 most similar words form a new keyword set; and constructing a keyword network and taking the top 20 words in each set as the keywords of the theme. According to the method, keywords which have relatively high word frequencies in documents can be extracted, and keywords which have relatively low word frequencies but are strongly associated with themes can also be effectively discovered.
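
The final step, building a keyword network and keeping the top-ranked words, can be sketched as follows. The text available here does not state how edges are formed or how nodes are ranked, so this sketch makes two assumptions of its own: an edge links two words whose similarity reaches a threshold, and words are ranked by weighted degree (the sum of their edge weights). The function name, the threshold value, and the toy similarity table are all hypothetical.

```python
from itertools import combinations

def rank_by_weighted_degree(words, similarity, threshold=0.5, top_n=20):
    """Rank words by the summed weight of their edges in a similarity network.

    Assumption of this sketch (not confirmed by the source): an edge exists
    when similarity(a, b) >= threshold, and its weight is that similarity.
    """
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    strength = {w: 0.0 for w in words}
    for a, b in combinations(words, 2):
        s = similarity(a, b)
        if s >= threshold:
            strength[a] += s
            strength[b] += s
    # Highest strength first; alphabetical order breaks ties.
    ranked = sorted(words, key=lambda w: (-strength[w], w))
    return ranked[:top_n]

# Hypothetical pairwise similarities standing in for word2vec cosine values.
toy_similarity = {
    frozenset(("lda", "topic")): 0.9,
    frozenset(("topic", "theme")): 0.8,
    frozenset(("lda", "theme")): 0.6,
    frozenset(("fabric", "topic")): 0.1,
}

def lookup(a, b):
    """Symmetric lookup; unknown pairs count as similarity 0."""
    return toy_similarity.get(frozenset((a, b)), 0.0)

top_words = rank_by_weighted_degree(
    ["lda", "topic", "theme", "fabric"], lookup, threshold=0.5, top_n=3
)
print(top_words)
```

Other centrality measures could be substituted for weighted degree without changing the surrounding pipeline; the choice here is only for brevity.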

Description

Technical field

[0001] The invention relates to the technical field of extracting keywords from texts, in particular to a method for extracting theme keywords based on a theme word vector and a network structure.

Background art

[0002] With the widespread application of representation learning in natural language processing, word2vec is used to represent words as vectors, which captures the semantic and syntactic regularities of words well. At the same time, topic models explain document-level topic aggregation well. Therefore, research on fusing topic models with word vector representations of topic keywords is becoming increasingly widespread.

[0003] LDA topic model: Among the various topic models proposed, LDA is a generative model that can summarize topic distributions. LDA is a three-level hierarchical Bayesian model, in which each item in the collection is modeled as a finite mixture of latent topics....

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/62
CPC: G06F16/3334, G06F16/35, G06F18/22, G06F18/2411
Inventor: 胡晓慧, 李超, 曾庆田, 戴明弟, 赵中英
Owner: SHANDONG UNIV OF SCI & TECH