Automatic corpus selecting algorithm for statistical language model

A statistical language model technology, applied in computing, special data processing applications, instruments, etc. It addresses the long selection time and suboptimal results of training corpus selection, reducing corpus selection time and improving the quality of the selected corpus and the accuracy of the resulting language model.

Inactive Publication Date: 2011-09-21
方圆 +1

AI Technical Summary

Problems solved by technology

[0003] To overcome the shortcomings of existing training corpus selection for statistical language models, namely that selection takes a long time and the result is not optimal, the present invention provides a new algorithm that reduces the processing time of corpus selection by a factor of hundreds while also improving the balance of the selected corpus and the accuracy of the resulting language model.



Examples

Embodiment Construction

[0009] In Figure 1, the algorithm takes the original corpus list file as its input. It calculates the size of the original corpus from this file and distributes the original corpus evenly across a limited number of corpus subsets. It then trains a language model on each corpus subset, calculates the cross-entropy of each model against the benchmark reference language model, and sorts the results. The several subsets with the smallest cross-entropy are merged into the result set, a language model is trained on the result set, and its accuracy is calculated. If the accuracy meets the requirements, the corpus selection algorithm ends; otherwise the process is iterated.
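The flow in [0009] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the language models are add-alpha smoothed unigram models, the "accuracy" check is replaced by a cross-entropy threshold (`target`), and each iteration re-splits the current result set. The function names (`train_unigram`, `cross_entropy`, `split_evenly`, `select_corpus`) and all parameters are hypothetical.

```python
import math
import random
from collections import Counter


def train_unigram(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed model type)."""
    counts = Counter(tok for s in sentences for tok in s.split() if tok in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def cross_entropy(ref, model):
    """H(ref, model) = -sum_w ref(w) * log2 model(w), in bits."""
    return -sum(p * math.log2(model[w]) for w, p in ref.items())


def split_evenly(corpus, n_subsets, rng):
    """Shuffle the corpus and deal it into n_subsets of (nearly) equal size."""
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def select_corpus(corpus, ref, n_subsets=4, keep=2, target=0.0, max_iters=5, seed=0):
    """Keep the `keep` subsets closest to the reference model, then iterate.

    Assumptions: `ref` maps every vocabulary word to a probability, and the
    stopping test uses cross-entropy <= target as a stand-in for accuracy.
    """
    rng = random.Random(seed)
    vocab = set(ref)
    result = list(corpus)
    score = cross_entropy(ref, train_unigram(result, vocab))
    for _ in range(max_iters):
        subsets = split_evenly(result, n_subsets, rng)
        # Rank subsets by the cross-entropy of their model against the reference.
        ranked = sorted(
            subsets, key=lambda s: cross_entropy(ref, train_unigram(s, vocab))
        )
        # Merge the lowest-cross-entropy subsets into the result set.
        result = [sent for s in ranked[:keep] for sent in s]
        score = cross_entropy(ref, train_unigram(result, vocab))
        if score <= target or len(result) < n_subsets:
            break
    return result, score
```

A call such as `select_corpus(sentences, ref_model, n_subsets=8, keep=3, target=5.0)` would return the selected sentences together with their final cross-entropy score. A real system would substitute an n-gram toolkit for `train_unigram` and a held-out accuracy test for the threshold.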

[0010] Figure 2 shows the difference in final language model accuracy between the automatic corpus selection algorithm and manual selection. Here, the test is divided into different aspects, such as phrases, long sentences, regular vo...



Abstract

The invention relates to an automatic corpus selection algorithm for a statistical language model, which can be used to increase both the speed and the quality of corpus selection. The algorithm comprises the steps of: randomly dividing the original corpus into a plurality of subsets of the same size; training a language model on each subset; calculating the cross-entropy of each subset, using a high-accuracy language model as the reference model; selecting the several subsets with the smallest cross-entropy values and merging them into a final corpus result set; and repeating the process until the language model of the corpus result set meets the accuracy standard.
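The abstract does not spell out the cross-entropy formula. One common definition, given here as an assumption rather than as the patent's own formula, compares the reference model with the model trained on subset i over a shared set of events V (words, or n-grams for higher-order models):

```latex
H(p_{\mathrm{ref}}, q_i) = -\sum_{w \in V} p_{\mathrm{ref}}(w)\,\log_2 q_i(w)
```

Here p_ref is the reference model's probability distribution and q_i is the distribution of the model trained on subset i. A smaller value means the subset's model lies closer to the reference, which is why the subsets with the smallest values are the ones kept.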

Description

Technical field

[0001] The invention relates to an improved method for automatically extracting the training corpus of a statistical language model in the field of natural language processing, and in particular improves the extraction speed for massive original corpora.

Background

[0002] At present, the training corpus of a statistical language model is selected manually: natural language processing professionals read a large amount of text and select the corpus set they consider best balanced, filtering out noise as far as possible. However, manually filtering text files exceeding 100 megabytes takes too long and the accuracy is not optimal, so the language model cannot be updated in time to reflect hot words quickly.

Contents of the invention

[0003] In order to overcome the shortcomings of the existing statistical language model that the training corpus selection takes a long time and the re...


Application Information

IPC(8): G06F17/27
Inventors: 方圆, 秦晓康
Owner 方圆