Method and system for finding out new words

A new word discovery technique, applied in the field of new word discovery methods and systems, that addresses problems such as dictionaries and rules failing to achieve good results, word boundaries that are difficult to determine, and complicated processes.

Status: Inactive; Publication Date: 2011-11-02
SHENGLE INFORMATION TECH SHANGHAI

AI Technical Summary

Problems solved by technology

However, because the composition of new words varies widely, in many cases there is no general rule (for example, translated personal names, names of magic spells, and names of races in novels), so dictionaries and rules often fail to achieve good results.
[0004] 2. Word boundaries are difficult to deter



Examples


Embodiment Construction

[0045] The method and system for discovering new words proposed by the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0046] As shown in Figure 1, the present invention proposes a new word discovery method, comprising:

[0047] S1: extracting the bigram elements of the known background corpus according to the bigram language model, and counting the total word frequency and the number of distinct types of all the bigram elements in the known background corpus.
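Step S1 can be illustrated with a minimal Python sketch. It assumes that a "bigram element" is a pair of adjacent characters within a sentence, which is a common reading for Chinese text but is not confirmed by the excerpt. The function name `bigram_stats` and the two sample sentences are hypothetical and stand in for a real background corpus.

```python
from collections import Counter


def bigram_stats(sentences):
    """Count character bigrams in a corpus.

    Assumption: a bigram element is two adjacent characters in a sentence.
    Returns (counts, total_frequency, number_of_types).
    """
    counts = Counter()
    for sentence in sentences:
        # Slide a window of width 2 over the sentence.
        for i in range(len(sentence) - 1):
            counts[sentence[i:i + 2]] += 1
    total_frequency = sum(counts.values())  # sum of all bigram frequencies
    number_of_types = len(counts)           # number of distinct bigrams
    return counts, total_frequency, number_of_types


# Hypothetical toy background corpus (illustration only).
background = ["今天天气很好", "天气预报说明天下雨"]
counts, total, types = bigram_stats(background)
print(total, types, counts["天气"])
```

In this toy corpus the bigram "天气" occurs in both sentences, so its count is 2, while the total frequency and the number of types are taken over all bigrams.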

[0048] The known background corpus refers to a large-scale general corpus that contains enough grammatical and morphological phenomena, and can truly reflect the overall picture of modern Chinese in terms of characters, vocabulary, grammar, and semantics.

[0049] In this embodiment, the known background corpus is the Modern Chinese Corpus of the National Language Commission. Based on the bigram language model, the list of bi...



Abstract

The invention provides a method and system for discovering new words. The method comprises the following steps: extracting the bigram elements of a foreground corpus based on a bigram language model; obtaining statistical information for the foreground corpus; filtering the bigram elements according to the statistical information and a first preset rule; expanding the remaining bigram elements in the foreground corpus using an n-gram language model and a second preset rule, where the background corpus does not need to be re-counted when the n-gram elements are updated; avoiding rediscovery of new words that already exist in the background corpus; determining the boundaries of the new words according to the second preset rule; and removing garbage bigram and n-gram elements. The method is simple to use and reduces the burden of manual correction.
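The overall flow of filtering and then expanding bigrams can be sketched in Python as follows. The excerpt does not state the first and second preset rules, so the rules used here are hypothetical stand-ins: the filter keeps bigrams that occur at least `min_count` times in the foreground and never in the background, the expansion extends a candidate one character to the right while the longer n-gram is exactly as frequent as the current one, and garbage removal drops any candidate contained in a longer candidate. All function names and the toy corpora are illustrative assumptions, not the patented method.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count character n-grams (assumed unit: adjacent characters)."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def filter_bigrams(fg_counts, bg_counts, min_count=2):
    # Hypothetical first rule: frequent in foreground, unseen in background.
    return {g for g, c in fg_counts.items()
            if c >= min_count and g not in bg_counts}


def expand(seed, foreground, max_n=4):
    # Hypothetical second rule: grow rightwards while exactly one longer
    # n-gram has the same frequency. Only the foreground is counted here.
    current = seed
    while len(current) < max_n:
        n = len(current)
        current_count = ngram_counts(foreground, n)[current]
        longer = ngram_counts(foreground, n + 1)
        candidates = [g for g, c in longer.items()
                      if g.startswith(current) and c == current_count]
        if len(candidates) != 1:
            break  # boundary reached under this illustrative rule
        current = candidates[0]
    return current


def discover(foreground, background):
    fg = ngram_counts(foreground, 2)
    bg = ngram_counts(background, 2)  # background is counted once only
    words = {expand(seed, foreground) for seed in filter_bigrams(fg, bg)}
    # Hypothetical garbage removal: drop fragments of longer candidates.
    return {w for w in words if not any(w != o and w in o for o in words)}


# Hypothetical toy corpora (illustration only).
foreground = ["他施放了火球术", "火球术很强", "学会火球术"]
background = ["他很强", "学会了"]
new_words = discover(foreground, background)
print(new_words)
```

In this toy example the bigrams "火球" and "球术" pass the filter, "火球" grows to "火球术", and the fragment "球术" is dropped because it is contained in the longer candidate.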

Description

Technical field

[0001] The invention relates to the field of text information processing, in particular to a new word discovery method and system.

Background technique

[0002] Chinese (like other Asian languages such as Japanese) does not use spaces to mark word boundaries as Western languages do, so word segmentation is the primary task of Chinese language processing. However, with the rapid development of Internet content services (such as Weibo and online novels), new words continue to emerge on the Internet, and the word segmentation models used in systems such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and search engines need to be updated constantly. New words never stop emerging, so new word discovery has recently become a research hotspot. At present, new word discovery faces roughly three problems:

[0003] 1. Lack of effective basis. There is no clear definition of new words at present. In the prior art, dictionaries (as background corpus) are g...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor 吴悦
Owner SHENGLE INFORMATION TECH SHANGHAI