New word discovery method and system based on word vector representation in massive texts

A new word discovery technique based on word vectors, applied in special data processing applications, instruments, biological neural network models, etc. It addresses the high cost, poor portability, and complex statistical-indicator calculation of existing approaches, and achieves a simple, efficient implementation with a high accuracy rate.

Status: Inactive
Publication Date: 2017-09-15
UNIV OF ELECTRONICS SCI & TECH OF CHINA
Cites: 2; Cited by: 8

AI Technical Summary

Problems solved by technology

Supervised methods are mainly based on statistical learning. They require a large amount of labeled data and extensive feature selection; labeled data is often expensive to obtain, and feature selection requires rich experience. Unsupervised methods discover new words mainly through rules or through the calculation of statistical indicators. Rule-based methods require a large number of hand-written language rules and port poorly; a single simple statistical indicator is often ineffective, and some statistical indicators are complex to calculate.

Examples

Embodiment Construction

[0032] To make the purpose, technical solution, and advantages of the present invention clearer, specific implementations of the invention are described clearly and completely below.

[0033] Figure 1 is a flowchart of a method for discovering new words based on word vector representation in massive texts, according to an embodiment of the present invention.

[0034] The method comprises: step S1, preprocessing the corpus of the new word discovery task; step S2, performing n-gram word string mining on the preprocessed corpus to obtain the n-gram candidate word strings in the corpus; step S3, setting word vectors and pruning the n-gram candidate word strings according to the similarity between the word vectors of the words within each candidate, to obtain new words.
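
Steps S2 and S3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the frequency cutoff `min_count`, the similarity `threshold`, the choice of cosine similarity, and the rule "keep a candidate only if every adjacent pair of words is similar enough" are all assumptions made for illustration. The word vectors are hand-written toy values, not trained embeddings, and the input is assumed to be already preprocessed and segmented (step S1).

```python
import math
from collections import Counter


def mine_ngrams(sentences, max_n=2, min_count=2):
    """Step S2 (sketch): count contiguous n-grams (n >= 2) over segmented sentences.

    `min_count` is an assumed frequency cutoff, not taken from the source.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_by_similarity(candidates, vectors, threshold=0.5):
    """Step S3 (sketch): keep candidates whose adjacent words all have
    vector similarity >= threshold. The direction and value of the
    threshold are assumptions for this example."""
    kept = []
    for gram in candidates:
        if any(tok not in vectors for tok in gram):
            continue  # no vector available, so the candidate cannot be scored
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            kept.append("".join(gram))
    return sorted(kept)


# Toy, already-segmented corpus (stands in for the output of step S1).
sentences = [
    ["机器", "学习", "很", "有趣"],
    ["机器", "学习", "改变", "世界"],
    ["我", "喜欢", "机器", "学习"],
    ["学习", "很", "重要"],
]

# Hand-written 2-d vectors for illustration only; real vectors would be trained.
vectors = {
    "机器": [1.0, 0.2],
    "学习": [0.9, 0.3],
    "很": [-0.1, 1.0],
}

candidates = mine_ngrams(sentences)
new_words = prune_by_similarity(candidates, vectors)
print(candidates)
print(new_words)
```

In this toy run, two bigrams pass the frequency cutoff, and the pruning step discards the one whose component words have dissimilar vectors. In practice the vectors would come from an embedding model trained on the same corpus, and the threshold would need tuning.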

[0035] First, in step S1, the corpus of the new word discovery task is preprocessed. The purpose of this embodiment is to find new words in the corpus of the new word d...

Abstract

The invention relates to the field of Chinese natural language processing and provides a new word discovery method and system based on word vector representation in massive texts. The method comprises the following steps: preprocessing the corpus of a new word discovery task; performing n-gram word string mining on the preprocessed corpus; and setting word vectors and pruning according to the similarities between the word vectors of the words in each n-gram candidate word string, to obtain new words. Compared with the prior art, the technical scheme of the invention achieves higher accuracy; in addition, because it does not require large amounts of manually annotated data, it can be implemented more simply and efficiently.

Description

Technical field

[0001] The invention belongs to the field of Chinese natural language processing, and in particular relates to a new word discovery method and system based on word vector representation in massive texts.

Background

[0002] New word discovery is an important research topic in the field of Chinese natural language processing. Unlike English and many other Western languages, Chinese has no fixed separators between words, so word segmentation is usually a necessary first step in Chinese information processing tasks, and new word discovery is closely related to word segmentation. Sproat and Emerson pointed out that the appearance of new words greatly affects the accuracy of word segmentation tools, and that 60% of word segmentation errors are caused by new words. In the new word discovery task, there is no well-defined concept of "new words". In the field of Chinese word segmentation, there are two concepts of new wo...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06N3/02
CPC: G06F40/284, G06N3/02
Inventors: 袁华, 钱宇
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA