Document keyword extraction method and device based on BERT model

A document keyword extraction technology, applied in the field of information processing, that addresses two problems of existing methods: the inability to handle synonyms (different words with the same meaning) and inaccurate extraction results. It reduces human workload, improves accuracy and recall, and solves the problem of ignoring semantics.

Active Publication Date: 2021-06-01
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0011] To address the deficiencies of the prior art, the purpose of the present invention is to provide a document keyword extraction method and device based on the BERT model. Using semantic information and document-set information, it extracts keywords that accurately and comprehensively express the central content of a document. The method transfers well between different document sets and requires no manual word segmentation or data labeling, which reduces human workload. It addresses the shortcomings of existing keyword extraction methods: they are easily affected by ambiguous words, cannot handle synonyms (different words with the same meaning), and produce extraction results that are not accurate enough.


Examples

Embodiment Construction

[0026] To make the above features and effects of the present invention clearer and easier to understand, specific examples are described in detail below with reference to the accompanying drawings.

[0027] While researching automatic keyword extraction methods, the inventors found that existing unsupervised document keyword extraction techniques do not use the semantic relationship between documents and words, which leads to inaccurate extraction results. Consider a case where a keyword that reflects the central content of a document appears only a few times in that document. The keyword's term frequency is then low, and few words co-occur with it, so it receives a low score and is ultimately not selected as a document keyword. For example, in an article disc...
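
The frequency problem described in paragraph [0027] can be illustrated with a minimal sketch. The toy document, the `term_frequency` helper, and the choice of plain relative frequency as the score are all assumptions made for illustration; they are not taken from the patent, and the toy document is not the patent's own (truncated) example.

```python
from collections import Counter


def term_frequency(words):
    """Return each word's share of the document's tokens."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy document: "smartphone" names the topic but occurs only once.
toy_doc = (
    "the phone has a great screen and the phone has a fast chip "
    "overall a solid smartphone"
).split()

scores = term_frequency(toy_doc)
# Rank by descending score, breaking ties alphabetically.
ranked = sorted(scores, key=lambda w: (-scores[w], w))

print(ranked[:3])            # most frequent words
print(scores["smartphone"])  # score of the topical word
```

In this sketch the topical word receives the lowest possible score simply because it occurs once, which is the kind of failure a purely frequency-based scorer is prone to.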

Abstract

A document keyword extraction method based on a BERT model comprises the following steps: encoding each document in a document set with the BERT model and extracting the attention weight that the document semantics generated by the BERT model assign to each sub-word; restoring the sub-words into words and aggregating the attention weights of the sub-words into the attention weights of the words; aggregating the attention weights of the same word at different positions in the document into a position-independent attention weight for that word, recorded as p(word_weight|doc); calculating the attention weight of each word over the document set, recorded as p(word_weight|corpus); and combining p(word_weight|doc) and p(word_weight|corpus) and selecting the N words with the highest final attention weight as the keywords of the document. The method uses the BERT model to extract the document semantic representation and compute the word attention weight distribution, from which the keywords are extracted. It also takes word frequency information into account, effectively addresses the problem that traditional unsupervised algorithms ignore semantics, and improves the accuracy and recall of keyword extraction.
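
The aggregation steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the per-sub-word attention weights are taken as given inputs rather than read from a real BERT model, sub-words are assumed to use the WordPiece `##` continuation prefix, sub-word and position weights are aggregated by summation, the corpus weight is a plain average over documents, and the combining formula in `top_keywords` is hypothetical because the abstract does not state it. All function names are invented for this sketch.

```python
from collections import defaultdict


def merge_subwords(tokens, weights):
    """Restore '##'-prefixed sub-words into words, summing their weights (assumed rule)."""
    words, word_weights = [], []
    for tok, w in zip(tokens, weights):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            word_weights[-1] += w
        else:
            words.append(tok)
            word_weights.append(w)
    return words, word_weights


def doc_word_weights(tokens, weights):
    """Position-independent weight per word, normalised to sum to 1."""
    words, word_weights = merge_subwords(tokens, weights)
    totals = defaultdict(float)
    for word, w in zip(words, word_weights):
        totals[word] += w  # same word at different positions is pooled
    norm = sum(totals.values())
    return {word: w / norm for word, w in totals.items()}


def corpus_word_weights(per_doc_weights):
    """Average each word's document-level weight over the document set (assumed rule)."""
    totals = defaultdict(float)
    for doc_w in per_doc_weights:
        for word, w in doc_w.items():
            totals[word] += w
    n_docs = len(per_doc_weights)
    return {word: w / n_docs for word, w in totals.items()}


def top_keywords(doc_w, corpus_w, n):
    """Hypothetical combination: favour words weighted higher in the doc than in the corpus."""
    scores = {word: p * p / corpus_w[word] for word, p in doc_w.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Made-up attention weights for two tiny documents.
doc1 = doc_word_weights(["key", "##word", "the", "model", "the"],
                        [0.3, 0.2, 0.1, 0.3, 0.1])
doc2 = doc_word_weights(["the", "fabric", "the"], [0.3, 0.4, 0.3])
corpus = corpus_word_weights([doc1, doc2])
print(top_keywords(doc1, corpus, 2))
```

With these made-up weights, the common word "the" is pushed down because it carries weight in every document, while words concentrated in one document rise. A real pipeline would replace the hard-coded weights with attention values extracted from the BERT encoder.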

Description

Technical field

[0001] The present invention relates to the technical field of information processing, in particular to a document keyword extraction method and device based on a BERT model.

Background art

[0002] With the vigorous development of Internet technology and the rapid growth of network information, document keyword extraction technology can be used to index the content characteristics of documents, build information retrieval, quickly extract the central content of documents, improve readers' review efficiency, and deal with information overload.

[0003] Keywords are words that can express the core content of a document, including words, terms, and phrases, which contain a certain amount of information and can promote the understanding of text content. From a technical point of view, document keyword extraction is the basic work of text mining research such as text retrieval, document comparison, abstract generation, document classification and clustering; Whether the ...

Application Information

IPC(8): G06F16/332, G06F16/33, G06F16/953, G06F40/30, G06N3/04, G06N3/08
CPC: G06F16/3329, G06F16/3346, G06F16/3344, G06F40/30, G06F16/953, G06N3/04, G06N3/08
Inventor: 程学旗, 郭嘉丰, 范意兴, 张儒清, 赵恒, 马新宇
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI