Automatic corpus selecting algorithm for statistical language model

A statistical language model technology, applied in computing and special data processing applications, that addresses the long time taken by training corpus selection and its suboptimal results. Its effects are a shorter corpus selection time and improved quality and accuracy.

Status: Inactive. Publication Date: 2011-09-21

方圆 +1

Cites: 0. Cited by: 3


Problems solved by technology

[0003] To overcome the shortcomings of existing training corpus selection for statistical language models, namely that selection takes a long time and the result is not optimal, the present invention provides a new algorithm. It can reduce the processing time of corpus selection by a factor of hundreds, and it can also better improve the balance of the selected corpus and the accuracy of the resulting language model.


Embodiment Construction

[0009] In Figure 1, the algorithm starts with the original corpus list file as its input. It calculates the size of the original corpus from this file and distributes the original corpus evenly across a limited number of corpus subsets. It then trains a language model on each corpus subset and calculates each model's cross-entropy against the benchmark reference language model. The results are sorted, the several subsets with the smallest cross-entropy are selected and merged into the result set, and a language model is trained on the result set to calculate its accuracy. If the accuracy meets the requirement, the corpus selection algorithm ends; otherwise the process is iterated.
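
The loop in [0009] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name the model type, so an add-one smoothed unigram model stands in for it; the accuracy measure is left to a caller-supplied `evaluate` function; and each iteration is assumed to re-divide the original corpus with a fresh shuffle. All function and parameter names (`select_corpus`, `k`, `keep`, `target`, `max_rounds`) are hypothetical.

```python
import math
import random
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (assumed model type)."""
    counts = Counter(w for s in sentences for w in s.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(reference, model):
    """H(reference, model) in bits; both are distributions over the same vocabulary."""
    return -sum(p * math.log2(model[w]) for w, p in reference.items())


def split_evenly(sentences, k, rng):
    """Shuffle a copy, then deal sentences round-robin into k near-equal subsets."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def select_corpus(corpus, reference, evaluate, k=4, keep=2, target=0.9,
                  max_rounds=10, seed=0):
    """Return (result_set, accuracy) from the first round that meets target,
    or from the best round seen if max_rounds is exhausted."""
    rng = random.Random(seed)
    vocab = set(reference)
    best = ([], float("-inf"))
    for _ in range(max_rounds):
        subsets = split_evenly(corpus, k, rng)
        # Rank subsets by cross-entropy against the reference model, lowest first.
        ranked = sorted(
            subsets,
            key=lambda s: cross_entropy(reference, train_unigram(s, vocab)),
        )
        # Merge the lowest-entropy subsets into the result set.
        result = [sent for subset in ranked[:keep] for sent in subset]
        accuracy = evaluate(train_unigram(result, vocab))
        if accuracy > best[1]:
            best = (result, accuracy)
        if accuracy >= target:
            break
    return best


# Toy usage: the reference model and the accuracy proxy are both made up.
demo_reference = {"the": 0.5, "cat": 0.3, "sat": 0.2}
demo_corpus = ["the cat sat", "the the cat", "zzz qqq", "xxx yyy",
               "the cat", "noise noise", "sat the cat", "junk"]
demo_result, demo_accuracy = select_corpus(
    demo_corpus,
    demo_reference,
    evaluate=lambda m: 2 ** -cross_entropy(demo_reference, m),
    target=0.35,
)
print(len(demo_result), round(demo_accuracy, 3))
```

The `max_rounds` cap is an addition for safety: the source only says the process iterates until the accuracy requirement is met, which would not terminate if the target were unreachable.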

[0010] Figure 2 identifies the difference in final language model accuracy between the automatic corpus selection algorithm and manual selection. Here, the test is divided into different aspects, such as phrases, long sentences, regular vo...


Abstract

The invention relates to an automatic corpus selection algorithm for a statistical language model, which can be used to increase both the speed and the quality of corpus selection. The algorithm comprises the steps of: randomly dividing the original corpus into a plurality of subsets of the same size; training a language model on each subset; calculating the cross-entropy of each subset, using a high-accuracy language model as the reference model; selecting the several subsets with the smallest entropy values and merging them into a final corpus result set; and repeating the process until the language model of the corpus result set reaches the required accuracy.
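
The abstract does not spell out its cross-entropy formula. One standard definition, given here as an assumption rather than as the patent's own, treats the reference model as a distribution p over the words w of a vocabulary V, and the model trained on subset i as a distribution q_i over the same vocabulary:

```latex
H(p, q_i) = -\sum_{w \in V} p(w) \log_2 q_i(w)
```

Under this reading, H(p, q_i) is never smaller than the entropy of p and equals it only when q_i = p, so a smaller value indicates that the subset's model is closer to the reference model.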

Description

Technical field

[0001] The invention relates to an improved method for automatically extracting the corpus for a statistical language model in the field of natural language processing, and in particular can improve the extraction speed for massive original corpora.

Background

[0002] At present, the known method of selecting the training corpus of a statistical language model is manual selection: natural language processing professionals read a large amount of text corpus, select the corpus set that they consider best balanced, and filter out as much noise as possible. However, manually filtering text files exceeding 100 megabytes takes too long, and the accuracy is not optimal. As a result, the language model cannot be updated in time to quickly reflect hot words.

Contents of the invention

[0003] In order to overcome the shortcomings of the existing statistical language model that the training corpus selection takes a long time and the re...


Application Information


IPC(8): G06F17/27

Inventors: 方圆, 秦晓康

Owner: 方圆
