Method and device for mining comparable network language materials

Corpus and network technology, applied to special-purpose data processing and electronic digital data processing. It addresses the limited availability of comparable corpora, ambiguity in vocabulary translation, and insufficient coverage of bilingual knowledge, with the effects of enhancing analysis, improving accuracy, and improving utilization.

Active Publication Date: 2013-12-25
HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

However, the problems faced by the above three methods are: ambiguity in vocabulary translation, i

Example Embodiment

[0045] A method for mining comparable corpora on the network comprises the following steps:
(1) use a web crawler to obtain source-language web pages, and preprocess them to form source-language documents;
(2) construct a cross-language topic model from an existing bilingual corpus, analyze the cross-language topic probabilities of the source-language documents, and generate the corresponding target-language query words from the topic information of each source-language document;
(3) submit the target-language query words to a search engine, obtain target-language documents from the network, and select the top N documents to form the target-language candidate similar document set;
(4) analyze the cross-language topic probability distribution of the target-language candidate similar documents, calculate the similarity between the source-language document and each target-language candidate similar document from the KL divergence of their topic probability distributions, and filter out the source language document...
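The similarity test in step (4) can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the patented implementation: the text only states that similarity is derived from the KL divergence of the topic distributions, so the symmetrized form, the smoothing constant `eps`, the threshold `max_divergence`, and all function names here are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete topic distributions of equal length.

    eps is a hypothetical smoothing constant that avoids log(0).
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p).

    Symmetrizing is an assumption; the source only names KL divergence.
    """
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def filter_comparable(source_topics, candidates, max_divergence=0.1):
    """Keep candidates whose divergence from the source is small.

    candidates maps a document id to its topic distribution.
    Returns (doc_id, divergence) pairs, most similar first.
    max_divergence is an illustrative threshold, not a value from the source.
    """
    scored = [
        (doc_id, symmetric_kl(source_topics, topics))
        for doc_id, topics in candidates.items()
    ]
    kept = [(doc_id, score) for doc_id, score in scored if score <= max_divergence]
    return sorted(kept, key=lambda pair: pair[1])


source = [0.7, 0.2, 0.1]
candidates = {
    "doc_a": [0.7, 0.2, 0.1],  # same topic mix as the source
    "doc_b": [0.1, 0.2, 0.7],  # very different topic mix
}
matches = filter_comparable(source, candidates)
```

A lower divergence means a closer topic match, so the retained pairs are the ones that would go into the comparable corpus under these assumptions.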

Abstract

The invention relates to a method for mining comparable corpora on the network. The method comprises: acquiring source-language web pages with a web crawler and preprocessing them to obtain source-language documents; analyzing the cross-language topic probabilities of the source-language documents and generating corresponding target-language query words; submitting the target-language query words to a search engine and selecting the top N documents to form a target-language candidate similar document set; and computing the similarity between the source-language documents and the target-language candidate similar documents, filtering for documents with high similarity, and constructing a comparable corpus. The invention further discloses a device for implementing the method. The method and device have the following advantages: they avoid the ambiguity and long processing time caused by vocabulary translation; because the source-language documents come from specific website content acquired by the web crawler while the target-language documents come from the whole Internet, the utilization rate of the source-language documents is effectively increased; and because source-language documents are matched with similar target-language documents by topic distribution similarity, the accuracy of corpus construction is improved.
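The query-generation and top-N selection steps described in the abstract can be sketched as follows. This is a hedged illustration only: the source does not say how query words are derived from topic information, so scoring each target-language word by its topic-weighted probability is an assumption, and `generate_query_words`, `top_n_candidates`, and the `search` callable are hypothetical names rather than any real search-engine API.

```python
from collections import defaultdict


def generate_query_words(doc_topics, target_topic_words, num_words=3):
    """Pick target-language query words from a source document's topics.

    doc_topics: topic probabilities of the source-language document.
    target_topic_words: one dict per topic mapping target-language
        words to their probability under that topic.
    Each word is scored by sum_k P(topic k | doc) * P(word | topic k);
    this scoring rule is an assumption, not taken from the source.
    """
    scores = defaultdict(float)
    for topic_prob, word_probs in zip(doc_topics, target_topic_words):
        for word, prob in word_probs.items():
            scores[word] += topic_prob * prob
    # Sort by descending score, then alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:num_words]]


def top_n_candidates(search, query_words, n=10):
    """Submit the query and keep the first n results.

    search is a hypothetical callable that takes a query string and
    returns an ordered iterable of target-language documents.
    """
    results = search(" ".join(query_words))
    return list(results)[:n]


doc_topics = [0.8, 0.2]
target_topic_words = [
    {"football": 0.6, "match": 0.4},
    {"election": 0.7, "vote": 0.3},
]
query = generate_query_words(doc_topics, target_topic_words, num_words=2)


def fake_search(query_string):
    """Stand-in for a real search engine, used only for this example."""
    return [f"{query_string} result {i}" for i in range(1, 4)]


candidates = top_n_candidates(fake_search, query, n=2)
```

Because the query is built from topic probabilities rather than from word-by-word translation, this kind of approach sidesteps the vocabulary-translation ambiguity that the abstract mentions.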

Description

Technical field

[0001] The invention relates to the technical field of statistical machine translation and cross-language information retrieval, and in particular to a method and device for mining comparable corpora on a network.

Background

[0002] A comparable corpus is a collection of documents in different languages that are similar in content but are not translations of one another. Mining fine-grained translation equivalents from it, such as bilingual terms, named entities, and parallel sentence pairs, facilitates lexicography, cross-language information retrieval, statistical machine translation, and other fields. Compared with a parallel corpus, a comparable corpus requires only content similarity and relaxes the requirement that source-language and target-language documents be mutual translations, so it has the advantages of authentic language, broad sources, comprehensive domain coverage, novel content, and easy access. ...

Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 李淼, 朱泽德, 张健, 曾新华, 陈雷, 曾伟辉, 郑守国, 高会议, 胡泽林, 杨振新, 陈晟, 李华龙, 董瀚琳, 吴娜, 卞程飞, 翁士状
Owner: HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI