A document keyword extraction method and device based on LDA and word vectors

A keyword extraction technique based on LDA and word vectors, applied in the field of document keyword extraction. It addresses three problems of existing approaches: extracted keywords that fail to highlight the core content of the document, keywords that are not comprehensive, and redundant keywords. It also avoids interference from noisy data, which improves precision and accuracy.
CN109766544A · Active · Publication Date: 2019-05-17 · HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI
Publication Date
2019-05-17

Abstract

The invention relates to the technical fields of natural language processing and deep learning, and in particular to a document keyword extraction method based on LDA and word vectors, which comprises the following steps: (A) judging whether the document title and content are consistent by using a title discriminator, and executing the next step if they are consistent; (B) calculating the weight of each topic in the document and the weight of each word in the document on each topic; (C) calculating the weight of each word in the document, and sorting by weight to generate a candidate keyword set for the document; (E) mapping the words into a word vector space; (F) calculating the distances between the word vectors in the word vector space, sorting the words by distance, and selecting the first M sorted words as the keywords of the document. The invention further discloses an extraction device. Compared with traditional methods, the extracted document keywords have high precision and reliability; documents whose title and content do not match are filtered out, which avoids interference from noisy data and further improves accuracy.
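
As a rough illustration of steps (B) through (F), the following Python sketch ranks words with toy data. It is not the patented procedure: the abstract does not give the word-weight formula or say which distances are compared, so the sketch assumes that a word's weight is the topic-weighted sum of its per-topic probabilities, and that candidates are ranked by cosine distance to the centroid of the candidate vectors. The function names, the toy topic distributions, and the two-dimensional vectors are all hypothetical, and the title discriminator of step (A) is not covered.

```python
import numpy as np


def candidate_keywords(theta, phi, vocab, n_candidates):
    """Steps (B)-(C), under an assumed formula: weight(w) = sum_k theta[k] * phi[k, w]."""
    word_weight = theta @ phi  # one weight per vocabulary word
    order = np.argsort(-word_weight, kind="stable")[:n_candidates]
    return [vocab[i] for i in order]


def rank_by_centroid_distance(candidates, vectors, m):
    """Steps (E)-(F), under an assumed criterion: cosine distance to the candidate centroid."""
    mat = np.array([vectors[w] for w in candidates], dtype=float)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit-length rows
    centroid = mat.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    dist = 1.0 - mat @ centroid  # smaller means closer to the centroid
    order = np.argsort(dist, kind="stable")[:m]
    return [candidates[i] for i in order]


# Toy inputs (hypothetical): in practice theta and phi would come from a
# trained LDA model and the vectors from a trained word-embedding model.
vocab = ["translation", "machine", "the", "corpus"]
theta = np.array([0.7, 0.3])              # topic weights in the document
phi = np.array([[0.5, 0.3, 0.1, 0.1],     # word weights on topic 0
                [0.1, 0.1, 0.2, 0.6]])    # word weights on topic 1
vectors = {
    "translation": [1.0, 0.0],
    "machine": [0.8, 0.6],
    "corpus": [0.0, 1.0],
}

candidates = candidate_keywords(theta, phi, vocab, n_candidates=3)
keywords = rank_by_centroid_distance(candidates, vectors, m=2)
print(candidates)
print(keywords)
```

With these toy inputs the low-weight word "the" drops out at the candidate stage, and the second stage keeps the two candidates that lie closest to the centroid of the candidate vectors. A different distance criterion would change only the second function.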

Description

Technical field

[0001] The present invention relates to the technical fields of natural language processing and deep learning, in particular to a document keyword extraction method and device based on LDA and word vectors.

Background art

[0002] Keywords describe the content of a text concisely and accurately, and generally consist of several words or phrases. Keyword extraction, also known as keyword tagging, refers to extracting a number of representative words or phrases from a text or a text collection to reflect its main semantic information; it is an important channel for obtaining information of interest. The advent of the Internet era places new requirements on keyword extraction. The extracted keywords should have three characteristics: significance, readability and comprehensiveness. Significance means that the extracted keywords should reflect the core content of the document. For example, "machine translation" is extracted from the d...
