Method for extracting novel field words

The method belongs to the field of new word discovery in natural language processing. It addresses the problem that existing word segmentation algorithms cannot cover all new words in a field, with the stated effect of ensuring the word-formation rate of the extracted strings.

Inactive Publication Date: 2016-11-09
EAST CHINA NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0003] Chinese word segmentation is the basis of most information retrieval systems. However, existing word segmentation algorithms cannot cover all new words in a field. A user dictionary is then needed to improve segmentation accuracy and, in turn, the quality of the retrieval system. Timely discovery and updating of new field words is therefore of great significance.



Embodiment Construction

[0025] The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the invention are common knowledge in this field, and the invention places no special limitation on them.

[0026] The field new word extraction method of the present invention, which combines word2vec with Bootstrapping iteration, comprises the following steps:

[0027] Step 1: Obtain corpora from several fields and remove the control characters from them to obtain cleanly formatted field texts;

[0028] Step 2: Segment the field texts into sentences at punctuation marks to obtain the set S of single field sentences;

[0029] Step 3: Initialize the n-gram model and use it to segment the sentences of S into character strings, obtaining the string set W_0;
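Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the delimiter set, and the n-gram length bounds are assumptions, since the text above only speaks of "control characters", "punctuation marks", and "the n-gram model". The sketch returns the strings as a list with duplicates kept so that frequencies can be counted later; that choice is also an assumption.

```python
import re
import unicodedata
from typing import List

# Assumption: the source only says "punctuation marks"; this delimiter set is illustrative.
SENTENCE_DELIMS = re.compile(r"[。！？；，、：.!?;,:\s]+")


def clean_text(raw: str) -> str:
    """Step 1: drop control and format characters (Unicode categories C*)."""
    kept = []
    for ch in raw:
        if ch in "\n\r\t":
            kept.append(" ")  # turn line breaks and tabs into plain separators
        elif not unicodedata.category(ch).startswith("C"):
            kept.append(ch)
    return "".join(kept)


def split_sentences(text: str) -> List[str]:
    """Step 2: split at punctuation to obtain the single-sentence set S."""
    return [s for s in SENTENCE_DELIMS.split(text) if s]


def ngram_strings(sentences: List[str], min_n: int = 2, max_n: int = 4) -> List[str]:
    """Step 3: enumerate character n-grams of each sentence (candidate strings W_0)."""
    strings = []
    for s in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(s) - n + 1):
                strings.append(s[i:i + n])
    return strings


if __name__ == "__main__":
    text = clean_text("自然\x00语言处理。新词发现！")
    S = split_sentences(text)
    W0 = ngram_strings(S, max_n=3)
    print(S)
    print(W0)
```

Because n-grams are taken inside single sentences only, no candidate string spans a punctuation mark, which is the practical reason for doing Step 2 before Step 3.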

[...


Abstract

The invention discloses a method for extracting new field words that combines word2vec with Bootstrapping iteration. The method includes: preprocessing the field corpus; segmenting the preprocessed field texts with n-grams; computing six statistics for each segmented character string (word frequency, the numbers of left and right adjacent words, left and right word entropy, and mutual information); setting a group of parameters with k-means; carrying out a preliminary evaluation and filtering to obtain the first-round results; using a word vector space trained with word2vec and a group of field seed data to compute, for each candidate word, the sum of its cosine similarities to the seed set; and setting a threshold on that sum for a secondary evaluation that extracts the new field words. The method is suitable for extracting new words from large-scale field corpora and is highly portable. It also addresses at the root the difficulty of filtering out non-field words such as verb-object constructions and reduplications.
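The two evaluation rounds named in the abstract can be illustrated with a short Python sketch. Everything specific here is an assumption: the abstract does not define mutual information, so the sketch uses pointwise mutual information minimized over all binary split points and normalizes every probability by the total character count; the k-means parameter-setting step and the word2vec training are omitted, and the vectors passed to `seed_filter` are hypothetical stand-ins for a trained word vector space. The function names and the "keep if the sum reaches the threshold" rule are likewise illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a neighbour-count distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def candidate_stats(sentences, max_n=4):
    """Return {string: (freq, n_left, n_right, H_left, H_right, mi)}."""
    freq = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each string
    right = defaultdict(Counter)  # characters seen immediately after each string
    for s in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(s) - n + 1):
                w = s[i:i + n]
                freq[w] += 1
                if i > 0:
                    left[w][s[i - 1]] += 1
                if i + n < len(s):
                    right[w][s[i + n]] += 1
    total = sum(c for w, c in freq.items() if len(w) == 1)  # total characters
    stats = {}
    for w, f in freq.items():
        if len(w) < 2:
            continue
        # Assumed definition: minimum PMI over every way to split w in two.
        mi = min(
            math.log((f / total) / ((freq[w[:k]] / total) * (freq[w[k:]] / total)))
            for k in range(1, len(w))
        )
        stats[w] = (f, len(left[w]), len(right[w]),
                    entropy(left[w]), entropy(right[w]), mi)
    return stats


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def seed_filter(candidates, vectors, seeds, threshold):
    """Second round: keep candidates whose summed similarity to the seeds reaches threshold."""
    kept = []
    for w in candidates:
        if w not in vectors:
            continue
        score = sum(cosine(vectors[w], vectors[s]) for s in seeds if s in vectors)
        if score >= threshold:
            kept.append(w)
    return kept
```

A first-round filter would threshold the six-tuple returned by `candidate_stats`; the surviving strings are then passed to `seed_filter` together with the seed words of the target field.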

Description

Technical field

[0001] The invention relates to the field of new word discovery in natural language processing, and in particular to a method for extracting new words in a field.

Background art

[0002] With the rapid development of Internet technology, a large number of new words continue to emerge, including proper nouns, derived words, dialect words, industry words, transliterated words, words containing foreign letters, words used in Hong Kong and Taiwan, and related terms from other fields. These words are called neologisms.

[0003] Chinese word segmentation is the basis of most information retrieval systems. However, existing word segmentation algorithms cannot cover all new words in a field. A user dictionary is then needed to improve segmentation accuracy and, in turn, the quality of the retrieval system. Timely discovery and updating of new field words is therefore of great significance.

[0004] Tradit...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22
CPC: G06F40/131
Inventor: 杨燕, 马敬超, 贺樑
Owner: EAST CHINA NORMAL UNIV