Method and device for extracting hot word phrases from document set

A phrase and document technology, applied in the field of information processing, that addresses the dependence of hot word phrase extraction on word segmentation quality and the poor extraction results that follow from it, and that improves the robustness of extraction.

Active Publication Date: 2014-10-01
TSINGHUA UNIV

AI Technical Summary

Problems solved by technology

Existing hot word phrase extraction technology usually relies on a word segmentation system. If the segmentation is poor, hot word phrase extraction is poor as well. Segmentation works well on sentences from regular documents such as newspapers, but it works badly on the irregular wording found on the Internet, so most Internet hot word phrases are not extracted well. Existing technology can also usually extract only phrases containing few words, such as two or three, and the extraction depends heavily on linguistic rules (such as grammar and syntax rules), which limits flexibility. In addition, existing technology usually adopts a strategy of expanding shorter words into longer ones, so most longer phrases, and phrases containing noise words (such as "的", "Let" and "To"), cannot be extracted successfully. In other words, existing hot word phrase extraction technology is not robust.



Examples


Embodiment Construction

[0036] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0037] Figure 1 shows a flow chart of method 1 for extracting hot word phrases from a document set according to an embodiment of the present invention. According to this embodiment, method 1 includes:

[0038] Step s101, performing word segmentation for each clause in the document set;

[0039] Step s102, for every phrase consisting of K consecutive words in each clause, judging the clarity of the phrase boundaries and/or the closeness of the relationship between the words in the phrase, where K is a positive integer that the user can set in advance as needed. Boundary clarity indicates how freely the phrase combines with the words to its left and right;

[0040] Step s103, based on the judgment result of the clarity of the phrase boundary and/or the closeness of the relationsh...
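The steps above can be illustrated with a minimal Python sketch. The text available here does not say how boundary clarity or closeness is computed, so the sketch makes its own assumptions: boundary clarity is approximated by the smaller of the left and right neighbour entropies, closeness by a pointwise-mutual-information style ratio, and clauses are assumed to be already segmented into word lists (step s101). The function names, the `<s>` and `</s>` edge markers and the thresholds are all hypothetical and are not taken from the patent.

```python
import math
from collections import Counter, defaultdict


def kgrams(tokens, k):
    """Yield (start_index, phrase) for every run of k consecutive tokens."""
    for i in range(len(tokens) - k + 1):
        yield i, tuple(tokens[i:i + k])


def entropy(counter):
    """Shannon entropy in bits of a Counter of neighbouring words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_phrases(clauses, k):
    """Step s102 (assumed measures): map phrase -> (boundary, closeness)."""
    phrase_count = Counter()
    word_count = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for tokens in clauses:
        word_count.update(tokens)
        for i, phrase in kgrams(tokens, k):
            phrase_count[phrase] += 1
            # Assumption: clause edges count as pseudo-neighbours.
            left[phrase][tokens[i - 1] if i > 0 else "<s>"] += 1
            right[phrase][tokens[i + k] if i + k < len(tokens) else "</s>"] += 1
    n_words = sum(word_count.values())
    n_phrases = sum(phrase_count.values())
    scores = {}
    for phrase, count in phrase_count.items():
        # Varied neighbours on both sides suggest a clear phrase boundary.
        boundary = min(entropy(left[phrase]), entropy(right[phrase]))
        # Observed phrase probability versus independent word probabilities.
        independent = math.prod(word_count[w] / n_words for w in phrase)
        closeness = math.log2((count / n_phrases) / independent)
        scores[phrase] = (boundary, closeness)
    return scores


def extract_hot_phrases(clauses, k, min_boundary, min_closeness):
    """Step s103 (assumed rule): keep phrases that pass both thresholds."""
    scores = score_phrases(clauses, k)
    return sorted(
        phrase
        for phrase, (boundary, closeness) in scores.items()
        if boundary >= min_boundary and closeness >= min_closeness
    )
```

For example, given the three clauses `["we", "love", "machine", "learning", "today"]`, `["they", "study", "machine", "learning", "now"]` and `["i", "teach", "machine", "learning", "daily"]`, calling `extract_hot_phrases(clauses, 2, 1.0, 0.0)` keeps only `("machine", "learning")`, because it is the only two-word phrase whose neighbours vary on both sides. Requiring both thresholds is one reading of the "and/or" in the text; either measure could also be used alone.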


Abstract

The invention discloses a method and a device for extracting hot word phrases from a document set. The method comprises: performing word segmentation on every clause in the document set through a word segmentation unit; judging, through a judgment unit, the distinctness of the phrase boundary and/or the closeness of the relation between the words in every phrase formed by fewer than K consecutive words in every clause, wherein K is a positive integer and the boundary distinctness indicates how freely a phrase combines with the words to its left and right; and extracting, through a hot word phrase extraction unit, at least some of the phrases formed by fewer than K consecutive words, based on the judgment result of the boundary distinctness and/or the closeness of the relation between the words in every phrase, to serve as the hot word phrases to be output. Compared with the prior art, hot word phrases can be accurately extracted from various corpora.

Description

Technical field

[0001] The invention relates to information processing technology, in particular to a method and device for extracting hot word phrases from document collections.

Background

[0002] With the explosive growth of Internet information, people increasingly want to use hot topics, such as "Development and Reform Commission", "Security Regulatory Commission" and "Ye Bao", to obtain hot information about the related objects and events. Therefore, how to better extract hot word phrases from various corpus resources has become an important topic in the field of natural language processing. Existing hot word phrase extraction technology usually relies on a word segmentation system. If the segmentation is poor, hot word phrase extraction is poor as well. Segmentation of sentences in regular documents such as newspapers has a bett...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 黄民烈, 朱小燕
Owner: TSINGHUA UNIV