Method and device for automatically discovering new words from document set

A technology for automatically discovering new words from a document set, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses the difficulty of recognizing new words and the failure to segment them as complete words, which in turn impairs new word recognition.

Active Publication Date: 2014-07-30
TSINGHUA UNIV
6 Cites, 19 Cited by
AI Technical Summary

Problems solved by technology

At present, new words have important applications in automatic summarization, text clustering/classification, information retrieval, etc. According to statistics, more than 1,000 new Chinese words appear on the Internet every year, and most of them are time-sensitive technical terms in various fields. Because most of these new words do not exist in the dictionary, it is difficult for existing word segmentation algorithms to identify them from a document set.

Take the new sentiment word "give force" (geili, an adjective) and the document "The performance is very good" (in which "good" renders the new word) as an example. Existing word segmentation algorithms usually segment the document as follows: performance/noun very/adverb give/verb force/noun. The new word "give force" is therefore not segmented as a complete word, which hinders its recognition as a new word.
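The segmentation failure described above can be illustrated with a toy dictionary-based segmenter. This is a minimal sketch, not the segmentation algorithm the patent refers to: forward maximum matching is chosen here only as a simple stand-in, the function name `forward_max_match` and the toy dictionaries are hypothetical, and the Chinese sentence 表现很给力 is an assumed reconstruction of the example document based on the segmentation shown in the text.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary-based segmentation (forward maximum matching).

    At each position, take the longest dictionary entry that matches;
    fall back to a single character when nothing in the dictionary matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Assumed example sentence: "performance / very / give-force".
sentence = "表现很给力"

# Toy dictionary that does not yet contain the new word 给力.
old_dictionary = {"表现", "很", "给", "力"}
print(forward_max_match(sentence, old_dictionary))
# ['表现', '很', '给', '力']

# Once the new word is in the dictionary, it is kept whole.
new_dictionary = old_dictionary | {"给力"}
print(forward_max_match(sentence, new_dictionary))
# ['表现', '很', '给力']
```

With the old dictionary the new word is split into "give" and "force", which mirrors the behavior the text attributes to existing segmenters; adding the word to the dictionary is exactly what requires it to be discovered first.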

Examples

Embodiment Construction

[0088] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0089] Figure 1 shows a flow chart of method 1 for automatically discovering new words from a document set according to an embodiment of the present invention. According to an embodiment of the present invention, method 1 includes:

[0090] Step S101, acquiring one or more templates;

[0091] Step S102, extracting words matching each template in the one or more templates from the document set;

[0092] Step S103, selecting at least a part of the templates from the one or more templates and adding them to the set of candidate templates;

[0093] Step S104, selecting at least a part of words from the extracted words that match each template in the one or more templates and adding them to the candidate word set;

[0094] Step S105, sorting the candidate words in the candidate word set based on the templates in the candidate template set, and adding a certain nu...
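Steps S101 to S105 can be sketched in code as follows. This is only an illustrative sketch under several assumptions that the text here does not confirm: a template is modeled as a (left context, right context) string pair around a word slot, a template becomes a candidate when it matches at least `min_template_hits` times, words found in a hypothetical `known_words` dictionary are excluded from the candidates, and candidates are ranked by how many candidate templates they match. The function name `discover_new_words` and all parameters are hypothetical.

```python
import re
from collections import Counter


def discover_new_words(documents, templates, known_words=frozenset(),
                       min_template_hits=2, top_n=1):
    """Toy template-based new word discovery following steps S101-S105."""
    # S101: the templates are passed in as (left, right) context pairs.
    # S102: extract the words that match each template in the document set.
    matches = {}
    for left, right in templates:
        pattern = re.compile(re.escape(left) + r"(\w+)" + re.escape(right))
        matches[(left, right)] = Counter(
            m.group(1) for doc in documents for m in pattern.finditer(doc)
        )

    # S103: keep templates that matched often enough (assumed selection rule).
    candidate_templates = [
        t for t in templates if sum(matches[t].values()) >= min_template_hits
    ]

    # S104: keep extracted words that are not already known (assumed rule).
    candidate_words = {
        w for t in candidate_templates for w in matches[t]
        if w not in known_words
    }

    # S105: rank candidates by the number of candidate templates they match
    # (assumed scoring), then take the top_n as the new word set.
    def score(word):
        return sum(1 for t in candidate_templates if word in matches[t])

    ranked = sorted(candidate_words, key=lambda w: (-score(w), w))
    return ranked[:top_n]


documents = [
    "the show was very geili indeed",
    "the team is so geili today",
    "he felt very tired indeed",
    "this is so geili today and very geili indeed",
    "a quite rare case",
]
templates = [("very ", " indeed"), ("so ", " today"), ("quite ", " case")]

print(discover_new_words(documents, templates, known_words={"tired", "rare"}))
# ['geili']
```

In this toy run the third template matches only once and is dropped in S103, and "geili" ranks first because it fills the slot of both remaining templates. The real selection and ranking criteria of the method may differ from the ones assumed here.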

Abstract

The invention discloses a method and a device for automatically discovering new words from a document set. A template acquisition unit acquires one or more templates; a word extraction unit extracts, from the document set, the words that match each of the one or more templates; a candidate template set adding unit selects at least a part of the one or more templates and adds them to a candidate template set; a candidate word set adding unit selects at least a part of the extracted words and adds them to a candidate word set; a new word set adding unit sorts the candidate words in the candidate word set according to the templates in the candidate template set, and adds a certain number of candidate words to a new word set according to the sorted order. Compared with the prior art, the method and the device can discover new words effectively.

Description

Technical field

[0001] The invention relates to natural language processing technology, in particular to a method and device for automatically discovering new words from a document set.

Background technique

[0002] In social networks, netizens like to use their own personalized language to express their views on politics, society, culture, etc. Generally, the more widely a personalized expression is spread, the more likely it is to become a new hot word on the Internet (referred to as a "new word"). At present, new words have important applications in automatic summarization, text clustering/classification, information retrieval, etc. According to statistics, more than 1,000 new Chinese words appear on the Internet every year, and most of them are time-sensitive technical terms in various fields. Because most of these new words do not exist in the dictionary, it is difficult for existing word segmentation algorithms to identify these new words from the document ...

Application Information

IPC(8): G06F17/27, G06F17/30
Inventor: 黄民烈, 朱小燕
Owner: TSINGHUA UNIV