Language model training method, query method and corresponding device

A language model training technology, applied in speech analysis, speech recognition, instruments, and related fields, addressing problems such as large training corpora, slow training speed, and the resulting difficulty of rapidly updating the language model of a voice search system.

Active Publication Date: 2014-06-18
BEIJING BAIDU NETCOM SCI & TECH CO LTD

Problems solved by technology

However, this method can only train the language model on the training corpus in a serial manner. When the training corpus is large, or the language model itself is too large, training becomes slow, which prevents rapid updating of the voice search system's language model.

Examples

Embodiment 1

[0072] Figure 1 is a flow chart of the language model training method provided by Embodiment 1 of the present invention. As shown in Figure 1, the method includes the following steps:

[0073] Step 101: Divide the training corpus into blocks to obtain N sets of training corpus, where N is a positive integer greater than 1.

[0074] To improve the update speed of the language model, the embodiment of the present invention changes the original serial processing of the training corpus into parallel processing. The training corpus is therefore first divided into blocks to obtain multiple groups of training corpus, so that the groups can be processed in parallel.
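As a minimal sketch of this blocking-and-parallelism step (the names split_corpus, train_on_block, and parallel_train are hypothetical, the use of Python's multiprocessing is an assumption for illustration, and train_on_block merely stands in for the per-block suffix sorting and tree building the patent describes):

```python
from multiprocessing import Pool

def split_corpus(sentences, n):
    """Divide the corpus into n roughly equal groups (any strategy works)."""
    size = max(1, (len(sentences) + n - 1) // n)  # ceiling division
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def train_on_block(block):
    # Placeholder: in the patent, each block is sorted via a recursive
    # suffix tree and then used to build n-ary word order trees.
    return {"num_sentences": len(block)}

def parallel_train(sentences, n=4):
    blocks = split_corpus(sentences, n)
    with Pool(processes=n) as pool:
        # The N groups are processed in parallel, one worker per block.
        return pool.map(train_on_block, blocks)
```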

[0075] Here, the division of the training corpus can be performed according to any strategy, as long as the training corpus can be divided into N groups. In addition, the training corpus used in this step can be the user input information of all time periods in the search text duri...

Embodiment 2

[0119] Figure 3 is a flow chart of the language model query method provided by Embodiment 2 of the present invention. As shown in Figure 3, the query method includes the following steps:

[0120] Step 301: Obtain the word sequence to be queried, and execute step 302 with the word sequence to be queried as the currently input word sequence.

[0121] Step 302: Adjust the currently input word sequence into a preset word order structure, so that the adjusted word sequence is ordered as follows: the penultimate word, the last word, and then the other words of the currently input word sequence arranged in reverse order.

[0122] The word order adjustment in this step makes the input word sequence match the word order structure of the Trie tree that stores the probability information.
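A minimal sketch of the reordering described in step 302 (the function name is hypothetical; the ordering follows the description above):

```python
def adjust_word_order(words):
    """Reorder a word sequence to match the Trie's preset structure:
    penultimate word, last word, then the remaining words reversed."""
    if len(words) < 2:
        return list(words)  # behavior for short sequences is an assumption
    return [words[-2], words[-1]] + list(reversed(words[:-2]))

# Example: ["w1", "w2", "w3", "w4", "w5"] -> ["w4", "w5", "w3", "w2", "w1"]
```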

[0123] Step 303: Query the adjusted word sequence on the Trie tree storing forward probability information obtained in Embodiment 1.
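As an illustration of step 303, a dictionary-based Trie walk over the adjusted sequence (the node layout and field names are assumptions; the excerpt only specifies that the Trie stores forward probability information):

```python
class TrieNode:
    # Hypothetical node layout: children keyed by word, with a forward
    # probability stored at nodes where a stored word sequence ends.
    def __init__(self):
        self.children = {}
        self.forward_prob = None

def lookup(root, adjusted_words):
    """Walk the Trie along the adjusted word sequence; return the stored
    forward probability, or None if the path does not exist."""
    node = root
    for word in adjusted_words:
        node = node.children.get(word)
        if node is None:
            return None
    return node.forward_prob
```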

[0124] Step 304: Determine whet...

Embodiment 3

[0137] Figure 4 is a structural diagram of the language model training device provided by Embodiment 3 of the present invention. As shown in Figure 4, the training device includes: a block processing unit 400, N recursive processing units 410, N word order tree building units 420, and a merge processing unit 430, where N is a positive integer greater than 1.

[0138] The block processing unit 400 divides the training corpus into blocks to obtain N groups of training corpus, and provides one group to each recursive processing unit 410.
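A rough sketch of how these units might be wired together (all names are assumptions and the per-unit work is stubbed; the excerpt does not disclose the units' internals):

```python
def sort_block(block):
    # Stub for a recursive processing unit 410: in the patent this sorts
    # a recursive suffix tree for its group of training corpus.
    return block

def build_word_order_trees(sorted_block):
    # Stub for a word order tree building unit 420.
    return {"tree_size": len(sorted_block)}

def train_device(sentences, n):
    # Block processing unit 400: divide the corpus into N groups.
    size = max(1, (len(sentences) + n - 1) // n)
    blocks = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    # Units 410 and 420 operate on the N groups (in parallel in the device).
    trees = [build_word_order_trees(sort_block(b)) for b in blocks]
    # Merge processing unit 430 would merge same-root trees into one Trie.
    return trees
```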

[0139] In the embodiment of the present invention, the original serial processing of the training corpus is changed into parallel processing, so the block processing unit 400 first divides the training corpus to obtain multiple groups of training corpus, so that subsequent processing can be performed on the multiple groups in parallel. The training corpus used by the block pr...


Abstract

The invention provides a language model training method, a query method, and a corresponding device. The training method comprises the following steps: partitioning the training corpus to obtain N groups of training corpus, where N is a positive integer greater than 1; performing, in parallel on the N groups obtained by partitioning, recursive suffix tree sorting, so as to obtain for each group a sorting result reflecting the reversed-order position of each word in each sentence; based on the sorting results, building n-ary word order trees according to a preset first word order structure with the penultimate word of each sentence as the root node, where n is one or more preset positive integers greater than 1; and merging the word order trees with the same root node and converting the word order, so as to obtain a Trie tree storing forward probability information. The word order of the Trie tree from root to leaf is as follows: the penultimate word of the sentence, the last word, and then the other words arranged in reverse order. With the method and device, the language model can be updated quickly.

Description

【Technical field】 [0001] The invention relates to the technical field of speech recognition in computer applications, and in particular to a language model training method, a query method, and a corresponding device. 【Background art】 [0002] Speech recognition refers to enabling machines to accurately recognize the content of speech in different situations, so as to carry out various human intentions based on the recognized information, for example performing voice searches. At present, with the continuous development of speech recognition technology, statistical language models have been widely used in fields such as speech recognition, information retrieval, and spoken language understanding. For large-vocabulary continuous speech recognition, the language model is a critical link in the recognition system and directly affects the performance and recognition results of the entire system. [0003] In technical applications such as voice search, l...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G10L15/06
Inventors: 贾磊, 万广鲁
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD