Method and system for automatically extracting keywords from multiple documents and computer program

A keyword extraction technology applied in computing, natural language data processing, and special data processing applications. It addresses overly general keywords, poor recognition of compound-word boundaries, and keywords that cannot express clear semantics. Its effects are improved readability, assured practicality, and lower labor costs.

Inactive Publication Date: 2018-06-29
GLOBAL TONE COMM TECH

AI Technical Summary

Problems solved by technology

[0003] In summary, the existing technology has the following problems: the Tf-Idf algorithm identifies the boundaries of compound words poorly, so the compound words it produces are often not semantically complete phrases; TextRank can extract only single words, not compound words, so the keywords it extracts are too general to express clear semantics.
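The single-word limitation described above can be illustrated with a minimal sketch. It assumes whitespace-tokenized text and a textbook Tf-Idf formula (term frequency times the log of inverse document frequency); the function name `tf_idf` and the toy documents are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document (textbook Tf-Idf)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "machine translation quality depends on machine translation data".split(),
    "the quality of the data depends on the source".split(),
    "the weather report depends on the season".split(),
]
scores = tf_idf(docs)
# The two best-scoring terms of the first document.
top = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
```

Here `top` holds `machine` and `translation` as two separate entries: scoring individual tokens gives no indication that the two words form one phrase, which is the boundary problem the paragraph above describes.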

Examples

Embodiment Construction

[0045] To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the examples. The specific embodiments described here only explain the present invention and do not limit it.

[0046] Compared with automatic keyword extraction from a single document, multiple documents provide more effective statistical support. This information is extremely important for extracting compound words. The quality of compound-word extraction depends not only on the quality of the unary words, but also on the semantic integrity of the compound words and the suitability of the collocations within them. This information needs to be obtained from a large number of documents, and the way it is obtained also determines the quality of compound-word extraction. The biggest challenge faced by the multi-document keyword automatic ext...
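One way to see how multiple documents supply statistical support for collocations is to pool adjacent-word counts over the whole collection. The sketch below uses pointwise mutual information (PMI) as the collocation score. PMI is a common stand-in chosen here for illustration; the excerpt does not say which statistic the invention uses, and the function name `adjacent_pmi` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def adjacent_pmi(docs):
    """Score every adjacent word pair by PMI, pooling counts over all documents."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))  # adjacent pairs within one document
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        pmi[(left, right)] = math.log(p_pair / (p_left * p_right))
    return pmi

# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "the neural network model beats the baseline".split(),
    "a neural network needs the data".split(),
    "the data and the model".split(),
]
pmi = adjacent_pmi(docs)
```

In this toy corpus the pair `("neural", "network")` scores higher than `("the", "neural")`, because `the` co-occurs with many different neighbors. With only one short document the counts would be too sparse to separate the two, which is the point about statistical support made above.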

Abstract

The invention belongs to the technical field of computer software and discloses a method and system for automatically extracting keywords from multiple documents, and a computer program. The method comprises: extracting keyword seeds; using statistical information about the relative positions of words in the text to measure whether a compound word is semantically complete; if so, determining the compound word to be a key compound word; if not, expanding it to the left and right. By discovering semantic completeness and cutting keyphrases at reasonable boundaries, the method greatly improves the readability of keywords automatically extracted from multiple documents. The extracted keywords have a larger average length and describe the themes of the documents better through richer and more complete semantics. By contrast, the keywords extracted by a Tf-Idf algorithm are finer-grained, their semantics are broad and unspecific, and they cannot represent the themes of the documents. Practicality is ensured and the labor cost of annotation is saved, because automatic keyword extraction can be performed without any annotated corpus.
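The seed-and-expand loop in the abstract can be sketched as follows. The abstract does not specify the completeness statistic, so this sketch makes a loudly labeled assumption: a phrase counts as incomplete on one side when a single neighboring word accounts for at least `threshold` of its occurrences on that side. The names `find_occurrences`, `dominant_neighbor`, and `expand_seed`, and the parameters `threshold` and `max_len`, are hypothetical.

```python
from collections import Counter

def find_occurrences(doc, phrase):
    """Return the start indices at which the token tuple `phrase` occurs in `doc`."""
    n = len(phrase)
    return [i for i in range(len(doc) - n + 1) if tuple(doc[i:i + n]) == phrase]

def dominant_neighbor(docs, phrase, side, threshold):
    """Return the neighbor on `side` covering >= threshold of occurrences, else None."""
    neighbors = Counter()
    total = 0
    for doc in docs:
        for i in find_occurrences(doc, phrase):
            total += 1
            j = i - 1 if side == "left" else i + len(phrase)
            if 0 <= j < len(doc):  # occurrences at a document edge have no neighbor
                neighbors[doc[j]] += 1
    if not neighbors:
        return None
    word, count = neighbors.most_common(1)[0]
    return word if count / total >= threshold else None

def expand_seed(docs, seed, threshold=0.8, max_len=4):
    """Grow a seed word left or right until neither side has a dominant neighbor."""
    phrase = (seed,)
    while len(phrase) < max_len:
        left = dominant_neighbor(docs, phrase, "left", threshold)
        right = dominant_neighbor(docs, phrase, "right", threshold)
        if left is None and right is None:
            break  # assumed test: varied context on both sides means complete
        if left is not None:
            phrase = (left,) + phrase
        else:
            phrase = phrase + (right,)
    return " ".join(phrase)

# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "we study statistical machine translation for news".split(),
    "statistical machine translation improves with data".split(),
    "a statistical machine translation system was built".split(),
]
result = expand_seed(docs, "machine")  # "statistical machine translation"
```

The seed `machine` is expanded left to `statistical machine` and then right to `statistical machine translation`, where the varied context on both sides stops the loop. A practical version would also need a minimum occurrence count, since a word seen only once always has a dominant neighbor under this test.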

Description

Technical field

[0001] The invention belongs to the technical field of computer software, and in particular relates to a method and system for automatically extracting keywords from multiple documents, and a computer program.

Background art

[0002] A word is the smallest unit for expressing semantics, and keywords are the set of words or phrases that best represent the theme of a piece of text. Automatic keyword extraction is a technology for automatically identifying meaningful and representative segments or words. It helps readers grasp the topic of an article quickly and accurately, and it is useful in scenarios such as automatic summarization, information retrieval, and information extraction. At present, keyword extraction methods fall into two categories according to whether they are supervised: 1) Unsupervised algorithm, which does not require heavy labeling work, and can automaticall...

Application Information

IPC(8): G06F17/27
CPC: G06F40/284, G06F40/289
Inventor: 巢文涵, 姜鑫, 宋俊平, 程国艮
Owner: GLOBAL TONE COMM TECH