
News keyword extraction method based on word frequency and multi-component grammar

A news keyword extraction technology, applied in the field of news text mining, that addresses problems such as dependence on the quality of the word segmentation system, inconsistent keyword definition standards, and difficulties in subsequent text processing.

Inactive Publication Date: 2008-06-11
TSINGHUA UNIV
0 Cites, 88 Cited by

AI Technical Summary

Problems solved by technology

Most research on keyword extraction focuses on improving accuracy without carefully studying which words count as keywords. As a result, keyword definitions follow inconsistent standards, and different methods are difficult to compare.
In addition, keyword extraction results are strongly affected by the quality of the word segmentation system. Most keyword extraction methods use word segmentation as their first processing step, so missed detections and errors in segmentation directly cause difficulties in subsequent text processing. For this reason, a keyword extraction method also needs to solve the problem of extracting unregistered (out-of-vocabulary) words.



Examples


Embodiment Construction

[0076] The method comprises the steps of:

[0077] (1) Analyze the linguistic and semantic features of the news, and give the definition of news keywords

[0078] (1.1) Study the characteristic parts of speech of keywords

[0079] This step manually analyzes the linguistic and semantic features of a collection of news texts, refers to the texts and keywords currently common on the Internet, and combines them with the six elements of news to summarize several types of news keywords.

[0080] News texts usually describe news events, and a news event generally includes the six 5W1H elements: When, What, Who, Where, Why and How. These six elements are exactly what readers care about, so news keywords should relate to them as much as possible; the six elements can be regarded as the target of keyword extraction. By analyzing the news text, we summarize the potential parts of speech of the six elements of news, that is, the possible ...


Abstract

A method for extracting news keywords based on word frequency and multi-component grammar (n-grams) is provided, belonging to the technical field of natural language processing. The method studies the characteristic parts of speech of keywords and uses computer-assisted mining to extract the potential part-of-speech patterns of keyword n-grams, which serve as the basis of the keyword extraction algorithm. When extracting news keywords, the method first mines multi-word phrases in the text according to the potential part-of-speech patterns to obtain a candidate keyword set, then mines potential unregistered keywords from the title and adds them to the candidate keyword set. The application proposes an improved single-text term frequency / inverse document frequency (tf/idf) formula, introduces target-oriented features, scores the candidate keywords to obtain their ranking, and outputs the keywords of the news document after optimizing the results. Compared with the traditional keyword extraction method based on tf/idf, the method achieves a higher recall rate at the same precision.
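The pipeline the abstract describes (part-of-speech pattern matching over n-grams, adding title phrases to the candidate set, then scoring) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the `POS_PATTERNS` set, the tag names, the `title_boost` multiplier, and the function names are hypothetical, and the scoring uses a standard smoothed tf-idf as a stand-in, since the record does not give the patent's improved formula or its target-oriented features. Input is assumed to be already segmented and tagged as `(word, tag)` pairs.

```python
import math
from collections import Counter

# Hypothetical part-of-speech patterns for candidate phrases.
# These are illustrative only; the patent's mined patterns are not given here.
POS_PATTERNS = {("n",), ("nr",), ("ns",), ("n", "n"), ("a", "n"), ("ns", "n")}
MAX_N = max(len(pattern) for pattern in POS_PATTERNS)


def extract_candidates(tagged_tokens):
    """Return every n-gram whose tag sequence matches a pattern in POS_PATTERNS."""
    candidates = []
    for start in range(len(tagged_tokens)):
        for n in range(1, MAX_N + 1):
            window = tagged_tokens[start:start + n]
            if len(window) < n:  # ran past the end of the text
                break
            tags = tuple(tag for _, tag in window)
            if tags in POS_PATTERNS:
                candidates.append(" ".join(word for word, _ in window))
    return candidates


def rank_keywords(body, title, doc_freq, num_docs, title_boost=2.0, top_k=5):
    """Rank candidates by smoothed tf-idf, boosting phrases found in the title.

    doc_freq maps a phrase to the number of corpus documents containing it.
    title_boost is an assumed stand-in for the patent's target-oriented features.
    """
    counts = Counter(extract_candidates(body))
    title_candidates = set(extract_candidates(title))
    # Title phrases join the candidate set even when absent from the body.
    counts.update(title_candidates)

    total = sum(counts.values())
    scores = {}
    for phrase, count in counts.items():
        tf = count / total
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(phrase, 0))) + 1
        boost = title_boost if phrase in title_candidates else 1.0
        scores[phrase] = tf * idf * boost
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    body = [("earthquake", "n"), ("hit", "v"), ("Sichuan", "ns"),
            ("province", "n"), ("rescue", "n"), ("team", "n"),
            ("arrived", "v"), ("earthquake", "n")]
    title = [("Sichuan", "ns"), ("earthquake", "n")]
    print(rank_keywords(body, title, doc_freq={}, num_docs=10))
```

Matching on tag sequences rather than on dictionary entries is what lets multi-word phrases and title-only phrases become candidates even when the segmenter's lexicon does not list them, which is the motivation the abstract gives for mining n-grams.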

Description

Technical field

[0001] The invention belongs to the field of text mining, in particular to news text mining.

Background technique

[0002] Keyword extraction is an important research topic in text information retrieval. The keyword extraction of Chinese news plays an extremely important role in understanding the important content of news and realizing the precise retrieval of related news events. Text keywords refer to several words or phrases that can summarize the text and are related to the semantic content of the text. Through keywords, people can quickly find the information they need. Furthermore, keywords can also provide rich semantic information for deeper text mining applications, such as text classification, text clustering, text retrieval, and topic mining.

[0003] At present, there are many keyword extraction methods at home and abroad, and they have been widely used. However, most of the research work is focused on improving the accuracy of keyword extract...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor 李涓子, 樊绮娜, 李军, 唐杰, 张鹏, 许斌
Owner TSINGHUA UNIV