Automatic corpus selecting algorithm for statistical language model
Statistical language model technology, applied in computing, special data processing applications, instruments, etc., addresses the problems of time-consuming training corpus selection and suboptimal results. It shortens the processing time of corpus selection, with the effect of reducing corpus selection time and improving quality and accuracy.
Embodiment Construction
[0009] In Figure 1, the algorithm takes the original corpus list file as its input. It calculates the size of the original corpus from this file and distributes the original corpus evenly across a limited number of corpus subsets. It then trains a language model on each corpus subset and calculates the cross-entropy between each of these models and the benchmark reference language model. The results are sorted, the several subsets with the smallest cross-entropy are selected and merged into a result set, and a language model is trained on the result set to measure its accuracy. If the accuracy meets the requirement, the corpus selection algorithm ends; otherwise the process is iterated.
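The flow described for Figure 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: an add-one smoothed unigram model stands in for the language model, the accuracy measure is a caller-supplied function (`accuracy_fn`, hypothetical), iteration is assumed to re-run the selection on the merged result set, and all names and parameters (`select_corpus`, `n_subsets`, `keep`, `target`, `max_rounds`) are invented here.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (stand-in LM)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}


def cross_entropy(reference, model):
    """Cross-entropy H(reference, model) in bits; lower means a closer match."""
    return -sum(p * math.log2(model[w]) for w, p in reference.items())


def split_evenly(sentences, n_subsets):
    """Distribute sentences into contiguous subsets of near-equal size."""
    size = max(1, math.ceil(len(sentences) / n_subsets))
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def select_corpus(corpus, reference_sentences, accuracy_fn,
                  n_subsets=4, keep=2, target=0.9, max_rounds=5):
    """Keep the `keep` subsets closest to the reference model, then check accuracy."""
    vocab = {t for s in corpus + reference_sentences for t in s.split()}
    reference = train_unigram(reference_sentences, vocab)
    candidates = list(corpus)
    result = candidates
    for _ in range(max_rounds):
        subsets = split_evenly(candidates, n_subsets)
        # Sort subsets by cross-entropy against the reference, smallest first.
        ranked = sorted(
            subsets,
            key=lambda sub: cross_entropy(reference, train_unigram(sub, vocab)),
        )
        # Merge the best subsets into the result set.
        result = [s for sub in ranked[:keep] for s in sub]
        # Stop when the accuracy requirement is met or nothing is left to prune.
        if accuracy_fn(train_unigram(result, vocab)) >= target or len(result) <= keep:
            break
        candidates = result  # assumed iteration rule: reselect within the result set
    return result


# Toy data: weather sentences should match the weather-style reference text.
weather = ["rain falls today", "sun shines today", "rain and sun", "wind blows today"]
finance = ["stocks rise fast", "bonds fall fast", "stocks and bonds", "markets rise fast"]
corpus = finance[:2] + weather[:2] + finance[2:] + weather[2:]
reference_text = ["rain and wind today", "sun and rain today"]

# Placeholder accuracy check that accepts the first result set.
selected = select_corpus(corpus, reference_text, accuracy_fn=lambda model: 1.0)
print(selected)
```

In practice the unigram stand-in would be replaced by whatever language model toolkit is in use, and `accuracy_fn` by an evaluation against a held-out test set; the ranking and merging steps stay the same.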
[0010] Figure 2 shows the difference in final language model accuracy between the automatic corpus selection algorithm and manual selection. Here, the test is divided into different aspects, such as phrases, long sentences, regular vo...