Corpus expansion method and apparatus

A corpus expansion method and apparatus in the field of language processing. They address the missing paths and low sentence-formation probability that degrade practical use, and they achieve more complete sentence paths, a higher probability of forming sentences, and a better effect in actual application.

Active Publication Date: 2018-05-11
BEIJING SINOVOICE TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] In view of this, the present invention proposes a corpus expansion method and device to address a problem in the prior art: in practical applications a sparse corpus lacks paths, because many words or combinations that are actually needed may not appear, so the probability of forming a sentence is significantly reduced, which affects the use effect.



Examples


Embodiment 1

[0041] Referring to Figure 1, which is a flow chart of a corpus expansion method according to an embodiment of the present invention, the method may specifically include the following steps:

[0042] Step 101: train an n-gram language model and a neural network language model using first corpus data; the first corpus data is sparse corpus data.

[0043] In the embodiment of the present invention, after corpus data is obtained, it is preprocessed (for example, cleaned and word-segmented) to yield corpus data in units of phrases. The preprocessed sparse corpus is then fed to an n-gram language model training tool and a neural network language model training tool to train the n-gram language model and the neural network language model.
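The preprocessing described in [0043] could look roughly like the following minimal sketch. The function name `preprocess` is hypothetical, and whitespace splitting is only a stand-in for a real word segmenter; the source does not name the cleaning rules or the segmentation tool.

```python
import re

def preprocess(lines):
    """Clean raw lines and split them into word lists.

    Assumption: whitespace splitting stands in for a real word segmenter.
    """
    cleaned = []
    for line in lines:
        # Replace everything except word characters and whitespace with a space.
        text = re.sub(r"[^\w\s]", " ", line.lower())
        words = text.split()
        if words:  # skip lines that are empty after cleaning
            cleaned.append(words)
    return cleaned

corpus = preprocess(["I want to eat, today!", "   ", "I want tea."])
```

Each entry of `corpus` is a list of tokens that a language model training tool could consume.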

[0044] Specifically, taking the n-gram language model as an example, the occurrence of the nth word is assumed to depend only on the preceding n-1 words and not on any other words (this is also the assumption made in hidden Markov models). The pr...
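To illustrate the assumption in [0044], here is a minimal sketch for the bigram case (n = 2), where each word depends only on the one before it. It uses unsmoothed maximum-likelihood counts, which is an assumption made for brevity and not the estimation method of the source; the function names and the `<s>` and `</s>` boundary markers are also hypothetical. The sketch shows how a word combination absent from a sparse corpus drives the sentence probability to zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts

def sentence_probability(words, history_counts, bigram_counts):
    """Product of P(cur | prev); 0.0 if any bigram was never observed."""
    prob = 1.0
    padded = ["<s>"] + words + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        if bigram_counts[(prev, cur)] == 0:
            return 0.0  # missing path in a sparse corpus
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob

sentences = [["i", "want", "tea"], ["i", "want", "rice"]]
hist, bi = train_bigram(sentences)
p_seen = sentence_probability(["i", "want", "tea"], hist, bi)
p_unseen = sentence_probability(["i", "drink", "tea"], hist, bi)
```

Here `p_seen` is 0.5, while `p_unseen` is 0.0 because the pair ("i", "drink") never occurs in the toy training data.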

Embodiment 2

[0067] Referring to Figure 2, which is a flow chart of a corpus expansion method according to an embodiment of the present invention, the method may specifically include the following steps:

[0068] Step 201: train an n-gram language model and a neural network language model using the first corpus data; the first corpus data is sparse corpus data.

[0069] This step is the same as step 101 and will not be described in detail here.

[0070] Step 202: sort the predicted word data by the occurrence probability of each word in the predicted word data.

[0071] In practical applications, given any input word, or a sentence composed of multiple words, a trained neural network language model can calculate the probability distribution of the words that will appear after that word or phrase. For example, if a phrase such as "I want today" is input as the starting words, the language model will give a higher probability to the words that are likely to appear next, and give...
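The sorting in step 202 can be sketched as follows. The probability values are invented for illustration and do not come from any trained model, and the function name `top_predictions` is hypothetical. Keeping only the first `k` entries is also an assumption, since the source text is truncated before it says how the ranked list is used.

```python
def top_predictions(distribution, k=3):
    """Return the k most probable next words, highest probability first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical model output for the prefix "I want today"; values are made up.
next_word_probs = {"to": 0.42, "rice": 0.05, "rest": 0.31, "the": 0.12, "blue": 0.01}
best = top_predictions(next_word_probs, k=2)
```

With these made-up values, `best` holds the pairs for "to" and "rest", in that order.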

Embodiment 3

[0109] Referring to Figure 3, which is a structural block diagram of a corpus expansion device according to an embodiment of the present invention, the device includes:

[0110] The language model training module 301 is configured to train an n-gram language model and a neural network language model using the first corpus data; the first corpus data is sparse corpus data;

[0111] The second corpus data generation module 302 is configured to use the neural network language model to predict the word or phrase data that follows words or phrases in the first corpus data, and to generate the second corpus data;

[0112] The third corpus data generation module 303 is configured to input the second corpus data into the n-gram language model and, after filtering, generate the third corpus data;

[0113] The first corpus data updating module 304 is configured to add the third corpus data to the first corpus data to generate updated first corpus data.
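The data flow through modules 302 to 304 can be sketched as one function, assuming both language models have already been trained (module 301). Everything named here is hypothetical: `predict_next` and `ngram_score` are placeholder callables standing in for the two models, and the score threshold and duplicate check are assumptions, since the source says only that the second corpus data is filtered by the n-gram model.

```python
def expand_corpus(first_corpus, predict_next, ngram_score, threshold):
    """One round of expansion following modules 302-304.

    predict_next(words) -> list of candidate next words (neural LM stand-in)
    ngram_score(words)  -> float score (n-gram LM stand-in)
    """
    # Module 302: generate second corpus data by extending existing entries.
    second = [words + [nxt] for words in first_corpus for nxt in predict_next(words)]
    # Module 303: keep candidates the n-gram model scores at or above threshold.
    third = [words for words in second if ngram_score(words) >= threshold]
    # Module 304: add the filtered data to the first corpus, skipping duplicates.
    updated = list(first_corpus)
    for words in third:
        if words not in updated:
            updated.append(words)
    return updated

# Toy stand-ins so the sketch runs; neither is a trained model.
def toy_predict(words):
    return ["tea", "xyz"]

def toy_score(words):
    return 0.0 if "xyz" in words else 1.0

updated = expand_corpus([["i", "want"]], toy_predict, toy_score, threshold=0.5)
```

In this toy run the candidate ending in "xyz" is filtered out, so `updated` contains the original entry plus `["i", "want", "tea"]`.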

[0114] Referring to Figure 4, which is a schematic diagram of the relationship between modules in t...


Abstract

The invention provides a corpus expansion method and apparatus. The method comprises: training an n-gram language model and a neural network language model using first corpus data, wherein the first corpus data is sparse corpus data; using the neural network language model to predict the word or phrase data that follows words or phrases in the first corpus data, and generating second corpus data; inputting the second corpus data into the n-gram language model and filtering it to generate third corpus data; and adding the third corpus data to the first corpus data, thereby generating updated first corpus data. This solves the problem that, for sparse corpora in practical applications, the lack of necessary words leads to a low sentence-formation probability and degrades the use effect.

Description

Technical field

[0001] The invention relates to the technical field of language processing, and in particular to a corpus expansion method and device.

Background technique

[0002] The N-gram language model is the most commonly used language model in speech recognition at this stage, and can be obtained by performing statistical calculations on word-segmented text. It is widely used in natural language processing, and its main purpose is to calculate the probability of a given sentence.

[0003] However, the N-gram language model is very dependent on the amount of training data. At present, the N-gram language model is mainly optimized by adding more adaptive corpus.

[0004] A sparse corpus is a training corpus in which, because the corpus is limited, many words or combinations that are actually needed may not appear. The resulting lack of paths usually causes the probability of forming a sentence to drop significantly, which will significantly affect the subseq...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
CPC: G06F16/36, G06F40/216, G06F40/284
Inventor: 殷子墨, 李健
Owner: BEIJING SINOVOICE TECH CO LTD