New word discovering method and system thereof

A new-word discovery method and system, applied in special data processing applications and electrical digital data processing. It addresses the problems that suitable thresholds are difficult to set and that meaningful strings or new words whose frequency of use is not very high cannot be output. Its effect is to reduce the workload and time needed for manual collection and sorting of new words.

Active Publication Date: 2008-02-27
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0008] However, when judging whether a string is a meaningful string or a new word from its stability, independence, and integrity, it is difficult to set an appropriate threshold.
If the threshold is too small, the accuracy of new word discovery is very low and many meaningless garbage strings may be output; if the threshold is too large, some meaningful strings or new words in the corpus will not be output.
Methods based on the stability, independence, and integrity of strings can only identify the new words that appear frequently in a large-scale corpus. Some new words that have clear semantics and can be used independently are likely not to be output, because their frequency of use in the corpus is not very high.
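
The threshold trade-off described above can be illustrated with a minimal Python sketch. It uses a plain occurrence count as a stand-in for the stability, independence, and integrity measures, which the text does not define; the function names, the toy corpus, and the substring lengths are assumptions made only for illustration.

```python
from collections import Counter

def candidate_counts(corpus, min_len=2, max_len=4):
    """Count every substring of length min_len..max_len in each text."""
    counts = Counter()
    for text in corpus:
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

def filter_by_frequency(counts, threshold):
    """Keep the strings that occur at least `threshold` times."""
    return {s for s, c in counts.items() if c >= threshold}

# Toy corpus (hypothetical): "xyab" occurs twice, "ab" and "xy" three times.
corpus = ["abxyab", "xyabq", "zzxy"]
counts = candidate_counts(corpus)

loose = filter_by_frequency(counts, 1)   # small threshold: every substring passes
strict = filter_by_frequency(counts, 3)  # large threshold: "xyab" is dropped
```

With the small threshold every substring is kept, including meaningless ones; with the large threshold the less frequent candidate `"xyab"` is lost, which mirrors the two failure modes in the paragraph above.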



Embodiment Construction

[0049] In order to make the purpose, technical solution and advantages of the present invention clearer, a new word discovery method and system of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0050] The core idea of the present invention is as follows. When a character string appears frequently in the corpus, existing methods cannot effectively judge whether the character string is a new word; and when the corpus is not large enough, many new words cannot be effectively recognized because their frequency is not very high. The present invention therefore uses a search engine to perform exact search and/or fuzzy search on low-frequency character strings, which is equivalent to using the huge database indexed by the search engine as corp...
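
As a rough illustration of using search results as evidence, the following Python sketch applies two of the criteria named in the abstract: the exact-search result count, and the ratio of exact-search to fuzzy-search result counts. The search engine is not called; `exact_hits` and `fuzzy_hits` are hypothetical callables supplied by the caller, the threshold values are arbitrary, and combining the criteria with "or" is only one of the "and/or" options the abstract allows.

```python
from typing import Callable

def web_supports_new_word(
    candidate: str,
    exact_hits: Callable[[str], int],   # hypothetical: result count of an exact search
    fuzzy_hits: Callable[[str], int],   # hypothetical: result count of a fuzzy search
    min_exact: int = 1000,              # assumed threshold, not from the source
    min_ratio: float = 0.1,             # assumed threshold, not from the source
) -> bool:
    """Return True if the search counts suggest the candidate is a word."""
    exact = exact_hits(candidate)
    fuzzy = fuzzy_hits(candidate)
    if exact > min_exact:
        return True
    # Guard against division by zero when the fuzzy search returns nothing.
    return fuzzy > 0 and exact / fuzzy > min_ratio

# Usage with made-up counts standing in for a real search engine.
FAKE_EXACT = {"foobar": 5000, "barbaz": 40, "bazqux": 3}
FAKE_FUZZY = {"foobar": 9000, "barbaz": 200, "bazqux": 900}

decisions = {
    word: web_supports_new_word(
        word,
        exact_hits=lambda s: FAKE_EXACT[s],
        fuzzy_hits=lambda s: FAKE_FUZZY[s],
    )
    for word in FAKE_EXACT
}
```

Passing the search functions in as parameters keeps the sketch independent of any particular search engine or API.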


Abstract

The invention discloses a method for discovering new words, comprising the following steps: finding the character strings whose number of occurrences in the corpus exceeds a certain threshold; counting, for each such character string, the number of different characters and words that appear to its left and to its right at every position where it occurs in the corpus; if the number of different words or characters to the right or to the left of the character string is greater than a preset threshold, outputting the character string as a new word; otherwise, performing an exact search and a fuzzy search for the character string on a search engine, and outputting the character string as a new word if the number of results returned by the exact search is greater than a certain threshold, and/or the ratio of the number of exact-search results to the number of fuzzy-search results is greater than a certain threshold, and/or the number of kinds of characters and words to the left and right of the character string in the web pages returned by the exact search is greater than a certain threshold. The invention can find not only the new words that appear frequently in the corpus but also the new words that appear infrequently, and its accuracy in discovering new words is high.
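
The decision flow in the abstract can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the patented implementation: neighbours are counted at the character level only, the candidates are assumed to have already passed the frequency threshold, and `web_check` is a hypothetical callable standing in for the search-engine test.

```python
from typing import Callable, Iterable, List, Tuple

def neighbor_variety(corpus: Iterable[str], candidate: str) -> Tuple[int, int]:
    """Count the distinct characters seen directly left and right of candidate."""
    left, right = set(), set()
    for text in corpus:
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(text[start - 1])
            if end < len(text):
                right.add(text[end])
            start = text.find(candidate, start + 1)
    return len(left), len(right)

def discover(
    corpus: List[str],
    candidates: Iterable[str],
    min_variety: int,                   # assumed preset threshold
    web_check: Callable[[str], bool],   # hypothetical search-engine test
) -> List[str]:
    """Output candidates accepted by corpus evidence or, failing that, by web evidence."""
    found = []
    for cand in candidates:
        n_left, n_right = neighbor_variety(corpus, cand)
        if n_left > min_variety or n_right > min_variety:
            found.append(cand)          # varied context in the corpus is enough
        elif web_check(cand):
            found.append(cand)          # otherwise fall back to the search engine
    return found

# Toy data (hypothetical): "xy" has three distinct neighbours on each side.
corpus = ["axyb", "cxyd", "exyf", "gqrh"]
new_words = discover(corpus, ["xy", "qr", "gq"], min_variety=2,
                     web_check=lambda s: s == "qr")
```

Here `"xy"` is accepted from corpus evidence alone, `"qr"` only through the stubbed web check, and `"gq"` is rejected by both.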

Description

Technical field

[0001] The invention relates to the field of text information processing, and in particular to a method and system that use a search engine to assist in finding new words in a corpus.

Background technique

[0002] In natural language processing or computational linguistics, new words (neologisms) are words that have never appeared before, or new uses of words that have appeared before. New words are generally not included in dictionaries, so many people equate new words with unregistered (out-of-vocabulary) words.

[0003] With the progress of the times and the development of the economy, a large number of new words constantly emerge in all aspects of daily life. In particular, with the growing popularity of the Internet in China, new Internet words emerge endlessly, and new words from daily life also spread more quickly. According to reports, at least 1,000 new Chinese words or new usages appear in China every year. The speed at w...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventors: 龚才春, 黄玉兰
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI