Method for sequencing keywords based on entropy difference between word-spacing-appearing internal mode and external mode

A keyword ranking technique, applied in special data processing applications, instruments, and electrical digital data processing, that addresses the inability of existing methods to separate words that differ in importance.

Active Publication Date: 2013-10-02
BEIJING UNIV OF TECH

AI Technical Summary

Problems solved by technology

Distribution-based methods cannot distinguish words that have similar spatial distributions but significantly different importance.

Embodiment Construction

[0036] Step (1) Obtain the text

[0037] Obtain the text, which consists of a certain number of sentences.

[0038] The test corpus is "The Origin of Species" by Charles Darwin, and the keyword appendix provided by W.S. Dallas serves as the basis for evaluation.

[0039] Step (2) Text preprocessing

[0040] Step (2.1) Remove all punctuation and convert all letters to lowercase. The table of contents, glossary, and index are removed from the text.

[0041] Step (2.2) For English text, words are segmented on spaces. Stop words are removed first. Different forms of a word are treated as distinct words; for example, "organ" and "organs" are counted as two different words. Count the frequency m of each word and the total number of words N in the full text, then calculate the probability p = m / N of each word.
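
Steps (2.1) and (2.2) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the stop-word list `STOP_WORDS` and the function names are hypothetical, only ASCII punctuation is stripped, and N is assumed to be counted after stop-word removal, which the text does not state explicitly.

```python
import string
from collections import Counter

# Hypothetical stop-word list for illustration; the source does not specify one.
STOP_WORDS = frozenset({"the", "of", "and", "a", "an", "in", "to", "is", "are", "that"})


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, split on whitespace,
    and drop stop words. Word forms are kept as-is (no stemming)."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]


def word_probabilities(tokens):
    """Return {word: (m, p)} where m is the word frequency and p = m / N.
    Assumption: N is the number of tokens remaining after stop-word removal."""
    n_total = len(tokens)
    counts = Counter(tokens)
    return {word: (m, m / n_total) for word, m in counts.items()}


tokens = preprocess("The organ of the species. Organs vary; the organ adapts.")
stats = word_probabilities(tokens)
```

Because no stemming is applied, `stats` holds separate entries for "organ" and "organs", matching the treatment of word forms described in Step (2.2).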

[0042] Step (2.3) For Chinese text, segment words with commonly used word segmentation software. Segment Chinese text using a general word segmentat...

Abstract

The invention provides a method for ranking keywords based on the entropy difference between the internal mode and the external mode of word occurrence spacing, and belongs to the field of text information processing. The method assumes that the occurrence of a keyword is governed by two modes: an internal mode, which describes the statistical properties of the keyword's positions within a topic, and an external mode, which describes the statistical properties of the occurrence of topic clusters in the text. Experiments on real texts show that the greater the information entropy difference between the internal mode and the external mode of a word's occurrence spacing, the greater the probability that the word is a keyword.
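
The idea in the abstract can be illustrated with a rough Python sketch. The excerpt does not give the exact definitions, so everything specific here is an assumption: spacings between successive occurrences of a word are split at their mean, spacings at or below the mean are treated as the internal mode and those above it as the external mode, and the score is the Shannon entropy of the internal spacings minus that of the external spacings. The function names are hypothetical.

```python
import math
from collections import Counter


def spacings(tokens, word):
    """Distances between successive occurrences of `word` in the token list."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    if not values:
        return 0.0
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def entropy_difference(tokens, word):
    """Assumed score: entropy of short (internal) spacings minus entropy of
    long (external) spacings, split at the mean spacing. The split rule and
    the sign convention are assumptions, not taken from the source."""
    d = spacings(tokens, word)
    if not d:
        return 0.0
    mean = sum(d) / len(d)
    internal = [x for x in d if x <= mean]
    external = [x for x in d if x > mean]
    return entropy(internal) - entropy(external)


sample = ["a", "x", "a", "a", "x", "x", "x", "a", "a"]
score = entropy_difference(sample, "a")
```

Under these assumptions, candidate words would be ranked by descending score, so that words with a larger entropy difference appear first.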

Description

Technical field

[0001] The invention relates to a novel method for extracting and ranking text keywords, and belongs to the field of text information processing.

Background

[0002] As the Internet develops, the amount of information on the network keeps growing and the means of obtaining it become ever more convenient. At the same time, Internet users face the problem of information explosion. Solving this problem requires quickly finding the parts of interest within a massive amount of information, which in turn requires extracting keywords from text.

[0003] Traditional methods assume that a word identified as a keyword must have distinctive statistical characteristics. H. P. Luhn proposed the original keyword extraction method, in which keywords are ranked by word frequency after common and rare words are removed. Since then, word frequency based met...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 杨震, 司书勇, 雷建军, 范科峰, 赖英旭
Owner: BEIJING UNIV OF TECH