
A language model training method, training device and testing method

A language model training method, applied in the computer field, that addresses the increase in computing time and cost caused by retraining.

Active Publication Date: 2018-12-18
龙马智芯(珠海横琴)科技有限公司

AI Technical Summary

Problems solved by technology

[0005] In the prior art, the character list and word list built from the old corpus may not contain all the characters or words in a new corpus, so both lists must be regenerated from the old corpus plus the new corpus and the language model retrained, which greatly increases computation time and cost. To solve this problem, the present invention provides a language model training method, a training device and a testing method that greatly increase the probability that the character list and word list of the old corpus contain all the characters or words in a newly added corpus, saving training time as the amount of information grows.



Examples


Embodiment 1

[0041] Figure 1 is a schematic flowchart of a language model training method provided by Embodiment 1 of the present invention. It includes steps S11 to S12, as follows:

[0042] S11: Initialize the character list and word list with specific characters and words.

[0043] S12: Train the language model using the initialized character list and word list together with the original corpus to generate a trained language recognition model.

[0044] In this embodiment, the specific characters and words may be the most frequently used characters and words, obtained automatically from the user's frequency of use, or they may be the characters and words in an internal repository; this embodiment does not limit the choice. The internal repository contains the commonly used characters and words published officially by the state. After the character list and word list are initialized with the specific characters and words in this way, the probability that the old...
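The effect of step S11 can be illustrated with a minimal sketch. It assumes the list of known characters is simply the seed characters followed by any further characters seen in the original corpus; the function names, the seed string and the toy sentences are all hypothetical and are not taken from the patent.

```python
def build_char_list(seed_chars, corpus):
    """Return seed characters first, then any extra characters seen in the corpus."""
    char_list = list(dict.fromkeys(seed_chars))  # de-duplicate while keeping order
    seen = set(char_list)
    for sentence in corpus:
        for ch in sentence:
            if not ch.isspace() and ch not in seen:
                seen.add(ch)
                char_list.append(ch)
    return char_list


def coverage(char_list, corpus):
    """Fraction of the distinct non-space characters in corpus that are in char_list."""
    known = set(char_list)
    chars = {ch for sentence in corpus for ch in sentence if not ch.isspace()}
    if not chars:
        return 1.0
    return len(chars & known) / len(chars)


# Hypothetical toy data: a small seed of frequent characters and two tiny corpora.
seed_chars = "的一是不了人我在有他这中大来上国"
original_corpus = ["我在中国", "他来了"]
new_corpus = ["这是大人", "我不在"]

corpus_only = build_char_list("", original_corpus)      # no seeding
seeded = build_char_list(seed_chars, original_corpus)   # seeded as in S11

print("coverage without seed:", coverage(corpus_only, new_corpus))
print("coverage with seed:   ", coverage(seeded, new_corpus))
```

In this toy case the unseeded list covers only part of the new corpus, while the seeded list covers all of it, so the lists would not need to be regenerated when the new corpus arrives.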

Embodiment 2

[0047] Figure 2 is a schematic flowchart of a language model training method provided in Embodiment 2 of the present invention. This embodiment is an optimization of Embodiment 1: the trained language recognition model is trained incrementally. Specifically, when new corpus is received, the new corpus is counted and its character error rate and word error rate are statistically analyzed.
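The text does not say how the two error rates are computed. One common convention, assumed here purely for illustration, is edit distance between a reference transcript and the model's output, divided by the reference length, measured once over characters and once over whitespace-separated words. All names and the sample sentences below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance divided by total reference length, over paired sentences."""
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref), tokenize(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return total_edits / total_len if total_len else 0.0


def char_tokens(sentence):
    """Split into characters, ignoring whitespace."""
    return [c for c in sentence if not c.isspace()]


def word_tokens(sentence):
    """Split on whitespace; assumes the text is already word-segmented."""
    return sentence.split()


# Hypothetical reference and recognition output with one wrong character.
refs = ["今天 天气 很好"]
hyps = ["今天 天汽 很好"]

cer = error_rate(refs, hyps, char_tokens)
wer = error_rate(refs, hyps, word_tokens)
print(f"character error rate: {cer:.3f}, word error rate: {wer:.3f}")
```

A single wrong character counts as one error out of six characters but one error out of three words, which is why the two rates can cross their thresholds at different times.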

[0048] Further, when the amount of new corpus is not less than a set threshold, or when the character error rate or word error rate of the new corpus is not less than a set threshold, the language recognition model is incrementally trained.
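The trigger condition in [0048] amounts to a three-way "or" with "not less than" read as `>=`. The sketch below assumes separate thresholds for each quantity; the class name, field names and threshold values are hypothetical, since the patent text does not give concrete values.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical thresholds; the source does not specify values."""
    min_new_items: int
    char_error_rate: float
    word_error_rate: float


def should_train_incrementally(num_new, cer, wer, t):
    """True if the amount of new corpus or either error rate reaches its threshold."""
    return (num_new >= t.min_new_items
            or cer >= t.char_error_rate
            or wer >= t.word_error_rate)


limits = Thresholds(min_new_items=1000, char_error_rate=0.10, word_error_rate=0.20)

print(should_train_incrementally(1200, 0.02, 0.05, limits))  # enough new data
print(should_train_incrementally(50, 0.15, 0.05, limits))    # character error rate too high
print(should_train_incrementally(50, 0.02, 0.05, limits))    # nothing reaches a threshold
```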

[0049] Further, either a randomly selected part of the existing corpus or all of the existing corpus is used for incremental training of the language recognition model.

[0050] Further, let the total number of new corpus items be m; randomly select α*m old corpus items and mix the m new items with the α*m old items to generate a mixed ...
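The sampling step in [0050] can be sketched as follows. Because the paragraph is truncated, several details are assumptions rather than the patent's method: α*m is rounded to the nearest integer, the sample size is capped at the size of the old corpus, sampling is without replacement, and the mixed result is shuffled. The function and variable names are hypothetical.

```python
import random


def mix_corpora(new_corpus, old_corpus, alpha, seed=None):
    """Mix all m new items with about alpha*m randomly chosen old items."""
    rng = random.Random(seed)
    m = len(new_corpus)
    # Assumption: round alpha*m and never ask for more old items than exist.
    k = min(len(old_corpus), round(alpha * m))
    sampled_old = rng.sample(old_corpus, k)  # without replacement
    mixed = list(new_corpus) + sampled_old
    rng.shuffle(mixed)  # assumption: order of the mixed set is randomized
    return mixed


# Hypothetical toy corpora: 3 new items, 10 old items, alpha = 2.
new_items = ["new-1", "new-2", "new-3"]
old_items = [f"old-{i}" for i in range(10)]

mixed = mix_corpora(new_items, old_items, alpha=2, seed=0)
print(len(mixed), mixed)
```

With m = 3 and α = 2 the mixed set holds 3 new and 6 old items. A larger α keeps more of the old data in each incremental round at the price of longer training.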

Embodiment 3

[0061] Figure 3 is a schematic flowchart of a language model training method provided in Embodiment 3 of the present invention. This embodiment is an optimization of Embodiment 1: the trained language recognition model is trained incrementally. Specifically, when a new corpus is received, it is first analyzed to judge whether the source of the new corpus is the same.

[0062] If the source of the new corpus is the same, proceed as follows:

[0063] Count the new corpus and statistically analyze its character error rate and word error rate.

[0064] Further, when the amount of new corpus is not less than a set threshold, or when the character error rate or word error rate of the new corpus is not less than a set threshold, the language recognition model is incrementally trained.

[0065] Further, either a randomly selected part of the existing corpus or all of the existing corpus is used for incremental training of the language rec...


Abstract

The invention discloses a language model training method, a training device and a testing method. The training method includes initializing a character list and/or a word list with specific characters and/or words, training a language model using the character list and/or word list together with the originally stored corpus, and generating a trained language recognition model. The invention addresses a problem in the prior art: when the character list and word list of the old corpus do not contain all the characters or words in a new corpus, both lists must be regenerated from the old corpus plus the new corpus and the language model retrained, which greatly increases computation time and cost. The invention greatly increases the probability that the character list and word list of the old corpus contain all the characters or words in the new corpus, thus reducing training time.

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a language model training method, training device and testing method.

Background art

[0002] Existing language models are built from a large number of training sentences or phrases, and the character list and word list are likewise generated from the characters and words that appear in the corpus. When a new corpus is added (a new corpus here means one containing characters or words that have not appeared in the old corpus), the character list and word list must be regenerated from all of the old corpus plus all of the new corpus, and the language model must then be retrained on the entire corpus. This adds a great deal of computing time and cost.

[0003] For example, the character list derived from a corpus of 300 to 1200 hours typically holds about 3000-5000 characters, whereas there are about 8000 commonly used Chinese characters. When ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
CPC: G06F40/263, G06F40/279
Inventor: 郑权, 张峰, 聂颖
Owner: 龙马智芯(珠海横琴)科技有限公司