Method and system for filtering bilingual corpora

A bilingual corpus filtering method and system that addresses reduced recall, reduced accuracy, and failure to cover the distribution of the corpus, thereby improving both the accuracy rate and the recall rate.

Inactive Publication Date: 2008-06-18
BEIJING KINGSOFT SOFTWARE +2
0 Cites, 27 Cited by

AI Technical Summary

Problems solved by technology

The feature threshold is set empirically, and the feature threshold may often be determined by the setter based on only a few corpus resources, which cannot cover the dis



Examples


Embodiment Construction

[0042] The invention provides a filtering method of a bilingual corpus, which is used to improve the universality, accuracy rate and recall rate of the corpus.

[0043] Referring to FIG. 1 and FIG. 2, FIG. 1 is a flow chart of the first embodiment of the bilingual corpus filtering method of the present invention, and FIG. 2 is a flow chart of establishing the classification model in FIG. 1.

[0044] The bilingual corpus filtering method described in the first embodiment of the present invention comprises the following steps:

[0045] S100. Determine the sentence length ratio feature value of the bilingual sentence pair.

[0046] Count the words or characters in each sentence of the bilingual sentence pair. The ratio of the count for one sentence to the count for the other is the sentence length ratio feature value.
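
The length ratio described in [0046] can be illustrated with a minimal Python sketch. The source does not say which sentence is the numerator or how tokens are counted, so the following are assumptions made only for illustration: the English side is counted in whitespace-separated words, the Chinese side in non-whitespace characters, and the ratio is English over Chinese. The function name sentence_length_ratio is hypothetical.

```python
def sentence_length_ratio(english: str, chinese: str) -> float:
    """Return (English word count) / (Chinese character count).

    Assumptions, not taken from the patent text: the direction of the
    ratio, whitespace tokenisation for English, and per-character
    counting for Chinese.
    """
    # English length: number of whitespace-separated words.
    en_len = len(english.split())
    # Chinese length: number of non-whitespace characters.
    zh_len = sum(1 for ch in chinese if not ch.isspace())
    if zh_len == 0:
        raise ValueError("Chinese sentence is empty; ratio is undefined")
    return en_len / zh_len


if __name__ == "__main__":
    # 4 English words against 5 Chinese characters.
    print(sentence_length_ratio("I like green tea", "我喜欢绿茶"))
```

A pair whose ratio falls far from the values seen in well-aligned pairs is a candidate for filtering; the patent leaves that decision to the classification model rather than to a fixed threshold.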

[0047] When the bilingual sen...


Abstract

The invention discloses a filtering method for a bilingual corpus, comprising the following steps: A. determine the sentence length ratio feature value of an English-Chinese bilingual sentence pair; B. count the words of each part of speech in the English-Chinese bilingual sentence pair, calculate how many of those words have a matching counterpart in a bilingual translation dictionary, and determine the mutual-translation feature value from the per-part-of-speech counts and the match counts; C. filter and classify the sentence pair using the sentence length ratio feature value and the mutual-translation feature value, according to a classification model established in advance from a training set. The invention also discloses a bilingual corpus filtering system. The method and system improve the universality, accuracy rate and recall rate of the corpus.
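
Step B of the abstract can be sketched in Python as follows. The visible text does not give the exact formula, so this sketch makes several assumptions: the English sentence arrives already part-of-speech tagged as (word, tag) pairs, the dictionary maps a lowercase English word to a set of Chinese translations, a word "matches" when any of its translations occurs as a substring of the Chinese sentence, and the feature for each part of speech is matches divided by word count. The name translation_features and the tag labels are hypothetical.

```python
from collections import Counter


def translation_features(tagged_english, chinese, dictionary):
    """Return {part_of_speech: matched_words / total_words}.

    tagged_english: iterable of (word, pos_tag) pairs (assumed input form).
    chinese:        the Chinese sentence as a plain string.
    dictionary:     dict mapping lowercase English word -> set of
                    Chinese translations (assumed structure).
    """
    totals = Counter()   # words seen per part of speech
    matches = Counter()  # words with a dictionary translation found
    for word, pos in tagged_english:
        totals[pos] += 1
        translations = dictionary.get(word.lower(), ())
        # Assumed matching rule: substring lookup in the Chinese sentence.
        if any(t in chinese for t in translations):
            matches[pos] += 1
    return {pos: matches[pos] / totals[pos] for pos in totals}


if __name__ == "__main__":
    tagged = [("I", "PRON"), ("like", "VERB"), ("green", "ADJ"),
              ("tea", "NOUN"), ("cups", "NOUN")]
    toy_dict = {"i": {"我"}, "like": {"喜欢"}, "green": {"绿", "绿色"},
                "tea": {"茶"}, "cups": {"杯子"}}
    print(translation_features(tagged, "我喜欢绿茶", toy_dict))
```

In step C these per-part-of-speech values, together with the sentence length ratio, would form the feature vector passed to the classification model trained in advance; the model type is not specified in the text shown here, so it is left out of the sketch.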

Description

Technical field

[0001] The invention relates to a corpus filtering method, in particular to a bilingual corpus filtering method and system.

Background technique

[0002] The great value of corpus resources for natural language processing research is increasingly recognized. This is especially true of the parallel bilingual corpus, a special corpus that contains mutual-translation information between two languages. Parallel bilingual corpora provide rich matching information between two languages and have important application value in translation knowledge acquisition, bilingual dictionary construction, statistical or example-based machine translation, word sense disambiguation, and so on; the role of high-quality corpora is especially prominent.

[0003] There are two main ways to build a corpus: one is the traditional method of manual collection; the other is to obtain the text-level aligned corpus through computer automatic sentence alignmen...


Application Information

IPC(8): G06F17/27
Inventor: 王刚, 高立琦, 刘挺, 王海洲
Owner: BEIJING KINGSOFT SOFTWARE