A Method of Mining Comparable Corpus from the Internet

The technology concerns corpora and networks and is applied in special data processing applications, instruments, and electronic digital data processing. It addresses problems such as insufficient coverage of bilingual knowledge, ambiguity in vocabulary translation, and the limitations of existing comparable corpora, with the effect of enhancing analysis, improving accuracy, and improving the utilization rate.

Active Publication Date: 2017-02-08
HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

The three existing methods face the following problems: ambiguity in vocabulary translation, insufficient coverage of bilingual knowledge, or comparable corpora that are limited to specific data sources.




Embodiment Construction

[0040] A method for mining comparable corpora from the web, the method comprising the following steps in sequence:
(1) use a web crawler to obtain source-language web pages, and preprocess them to form source-language documents;
(2) construct a cross-language topic model based on an existing bilingual corpus, use the model to analyze the cross-language topic probabilities of the source-language documents, and use the topic information of each source-language document to generate corresponding target-language query words;
(3) submit the target-language query words to a search engine to obtain target-language documents from the web, and select the top N documents to form the target-language candidate similar document set, where N is 10;
(4) analyze the cross-language topic probability distribution of the target-language candidate similar documents, and calculate the similarity between the source-language document and each target-language candidate similar document according to the KL...
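Step (4) can be illustrated with a minimal sketch. The source text is truncated at "KL...", so the sketch assumes this refers to the Kullback-Leibler divergence between topic distributions; the symmetric variant, the function names, and the toy distributions are assumptions for illustration, not details taken from the patent.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    eps guards against log(0) when q assigns zero mass to a topic (assumption).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def symmetric_kl(p, q):
    """Average of both KL directions; lower means more similar (assumed variant)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def rank_candidates(source_topics, candidates, top_n=10):
    """Score the top_n candidates against the source document.

    candidates: list of (doc_id, topic_distribution) in search-engine order.
    Returns (doc_id, divergence) pairs sorted from most to least similar.
    """
    scored = [
        (doc_id, symmetric_kl(source_topics, dist))
        for doc_id, dist in candidates[:top_n]
    ]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical topic distributions over three cross-language topics.
source = [0.7, 0.2, 0.1]
candidates = [
    ("a", [0.1, 0.2, 0.7]),
    ("b", [0.6, 0.3, 0.1]),
    ("c", [0.7, 0.2, 0.1]),
]
ranked = rank_candidates(source, candidates)
```

Because a lower divergence means a closer topic match, a threshold on the divergence could then decide which candidates are kept for the comparable corpus.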


Abstract

The invention relates to a method for mining comparable corpora from the web. The method includes: acquiring source-language web pages with a web crawler and preprocessing them to obtain source-language documents; analyzing the cross-language topic probabilities of the source-language documents and generating corresponding target-language query words; submitting the target-language query words to a search engine and selecting the top N documents to form a target-language candidate similar document set; and computing the similarity between each source-language document and the target-language candidate similar documents, filtering for documents with high similarity, and constructing a comparable corpus. The invention further discloses a device for implementing the method. The method and the device have the following advantages: they avoid the ambiguity and long processing time caused by vocabulary translation; the source-language documents come from the content of specific websites acquired by the web crawler while the target-language documents come from the whole internet, which effectively increases the utilization rate of the source-language documents; and source-language documents are matched with similar target-language documents by topic distribution similarity, which improves the accuracy of corpus construction.
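The query-generation step can be sketched as follows. The text does not say how query words are derived from the topic information, so this sketch assumes a common approach: score each target-language word by summing, over topics, the document's topic probability times the word's probability under that topic, and keep the highest-scoring words. The function name, the parameters, and the toy data are hypothetical.

```python
def generate_query_words(doc_topic_probs, topic_word_probs, num_words=5):
    """Pick target-language query words for one source-language document.

    doc_topic_probs: list of P(topic | document), one entry per topic.
    topic_word_probs: list of dicts, one per topic, mapping a
        target-language word to P(word | topic).
    Returns the num_words words with the highest marginal score
    sum_k P(word | k) * P(k | document) (assumed scoring rule).
    """
    scores = {}
    for p_topic, word_probs in zip(doc_topic_probs, topic_word_probs):
        for word, p_word in word_probs.items():
            scores[word] = scores.get(word, 0.0) + p_topic * p_word
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:num_words]]

# Hypothetical two-topic model with target-language word distributions.
doc_topics = [0.8, 0.2]
topic_words = [
    {"economy": 0.5, "market": 0.3, "bank": 0.2},
    {"football": 0.6, "match": 0.3, "market": 0.1},
]
query = generate_query_words(doc_topics, topic_words, num_words=3)
```

Under this rule the query words come directly from the target-language side of the topic model, so no word-by-word dictionary translation is needed, which is consistent with the stated goal of avoiding translation ambiguity.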

Description

Technical field

[0001] The invention relates to the technical field of statistical machine translation and cross-language information retrieval, in particular to a method for mining comparable corpora from the web.

Background

[0002] A comparable corpus is a collection of documents in different languages that are similar in content but are not translations of one another. Mining bilingual terms, named entities, parallel sentences, and other fine-grained translation equivalents from such a corpus supports lexicography, cross-language information retrieval, statistical machine translation, and other fields. Compared with a parallel corpus, a comparable corpus requires only that the content be similar and relaxes the requirement that source-language and target-language documents be translations of each other, so it has the advantages of authentic language, wide sources, broad domain coverage, novel content, and easy access. [0003] Th...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 李淼, 朱泽德, 张健, 曾新华, 陈雷, 曾伟辉, 郑守国, 高会议, 胡泽林, 杨振新, 陈晟, 李华龙, 董瀚琳, 吴娜, 卞程飞, 翁士状
Owner: HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI