Method and device used for training language model according to corpus sequence

A language model training technology, applied in natural language data processing, electrical digital data processing, and special data processing applications. It addresses the problem that a single smoothing algorithm cannot give the best results for language models of every order, and achieves a better modeling effect and improved accuracy.

Active Publication Date: 2014-01-15
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

In language model training, language models of different orders have different characteristics, so applying the same smoothing algorithm to grammars of every order cannot achieve the best results.

Embodiment Construction

[0025] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0026] Figure 1 shows a schematic diagram of a device for training a language model according to a corpus sequence, according to one aspect of the present invention. The model training device 1 includes sequence acquisition means 101, iterative execution means 102, algorithm determination means 103, model training means 104 and order update means 105.

[0027] The sequence acquisition means 101 obtains a corpus sequence to be used for training a target language model. Specifically, the sequence acquisition means 101 obtains the corpus sequence from a corpus, for example by calling an application program interface (API) provided by the corpus; or, the sequence acquisition means 101, for example by calling the corpus's API or by other methods, obtains the corpus information to be used fo...

Abstract

The invention provides a method and device for training a language model according to a corpus sequence. A corpus sequence to be used for training the target language model is acquired, and the initial order information of the target language model is set as the current training order. The following operations are then performed iteratively, with reference to the highest order information of the target language model, until the current training order exceeds the highest order: a smoothing algorithm corresponding to the target language model is determined according to the current training order; the target language model is trained on the corpus sequence using that smoothing algorithm to obtain an updated target language model; and the current training order is updated. Compared with the prior art, the method and device apply different smoothing algorithms to language models of different orders according to their characteristics, so that the strengths of each smoothing algorithm are exploited and a better modeling effect is achieved. Furthermore, the method and device can be combined with speech recognition to improve recognition accuracy.
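The iteration described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`train`, `choose_smoother`, `ngram_counts`), the choice of add-one smoothing for unigrams and a simplified Witten-Bell estimate (interpolated with a uniform distribution) for higher orders, and the toy corpus are all assumptions. The text above does not state which smoothing algorithm is assigned to which order.

```python
from collections import Counter

def ngram_counts(tokens, order):
    """Count every n-gram of the given order in the token sequence."""
    return Counter(tuple(tokens[i:i + order])
                   for i in range(len(tokens) - order + 1))

def add_one(counts, vocab_size):
    """Add-one smoothing: P(w | h) = (c(h, w) + 1) / (c(h) + V)."""
    totals = Counter()
    for gram, c in counts.items():
        totals[gram[:-1]] += c
    def prob(gram):
        return (counts.get(gram, 0) + 1) / (totals.get(gram[:-1], 0) + vocab_size)
    return prob

def witten_bell(counts, vocab_size):
    """Simplified Witten-Bell, interpolated with a uniform distribution
    (an assumption made to keep the sketch short)."""
    totals, types = Counter(), Counter()
    for gram, c in counts.items():
        totals[gram[:-1]] += c   # c(h): tokens seen after history h
        types[gram[:-1]] += 1    # T(h): distinct words seen after h
    def prob(gram):
        h = gram[:-1]
        if totals.get(h, 0) == 0:
            return 1.0 / vocab_size  # unseen history: fall back to uniform
        return (counts.get(gram, 0) + types[h] / vocab_size) / (totals[h] + types[h])
    return prob

def choose_smoother(order):
    """Hypothetical mapping from training order to smoothing algorithm."""
    return add_one if order == 1 else witten_bell

def train(tokens, initial_order, highest_order, choose_smoother):
    """Train one order at a time until the current order exceeds the highest."""
    vocab_size = len(set(tokens))
    model = {}
    order = initial_order                      # current training order
    while order <= highest_order:
        smoother = choose_smoother(order)      # determine the algorithm
        model[order] = smoother(ngram_counts(tokens, order), vocab_size)
        order += 1                             # update the training order
    return model

model = train("a b a b a c".split(), 1, 2, choose_smoother)
```

Here `model[2](("a", "b"))` gives the smoothed estimate of "b" following "a". Passing `choose_smoother` as a parameter keeps the per-order choice of algorithm separate from the training loop, which mirrors the separation between the algorithm determination means and the model training means in the description.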

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a technology for training a language model according to a corpus sequence.

Background technique

[0002] Language modeling obtains a statistical model of a language mainly by counting the distribution of grammars in a text corpus; the model describes the probability that a text string is natural language. In language model training, in order to assign some probability to grammars that never occur in the corpus, a smoothing algorithm is usually used to shift part of the probability of high-frequency grammars to low-frequency grammars, following the idea of "robbing the rich and helping the poor".

[0003] At present, there are many smoothing algorithms for language models, such as the Katz smoothing algorithm, the KN (Kneser-Ney) smoothing algorithm, the add-one smoothing algorithm, and the WB (Witten-Bell) smoothing algorithm. The more commonl...
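The "robbing the rich and helping the poor" idea from the background can be made concrete with add-one smoothing, one of the algorithms named above. The sketch below is illustrative only: the toy sentence and the helper names `mle` and `add_one` are assumptions, and bigrams stand in for the "grammars" of the text.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
bigrams = Counter(zip(tokens, tokens[1:]))   # c(h, w)
history = Counter(tokens[:-1])               # c(h)

def mle(h, w):
    """Unsmoothed estimate: unseen bigrams get probability zero."""
    return bigrams[(h, w)] / history[h]

def add_one(h, w):
    """Add-one estimate: every bigram gets one extra count."""
    return (bigrams[(h, w)] + 1) / (history[h] + len(vocab))

for w in vocab:
    print(w, mle("the", w), add_one("the", w))
```

Bigrams that were seen after "the" lose some probability under `add_one`, while bigrams that were never seen receive a small nonzero share, and the estimates over the vocabulary still sum to one.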

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/90332, G06F40/216
Inventor: 万广鲁
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD