A method and device for interactively extracting comparable corpus and bilingual dictionaries

A bilingual dictionary and interactive technology, applied in special data processing applications, instruments, electronic digital data processing, etc., which addresses problems such as the difficulty of identifying comparable corpora and the difficulty of extracting inter-translated vocabulary.

Active Publication Date: 2017-08-11
HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to overcome two defects: comparable corpora are difficult to identify when the domain seed bilingual dictionary is too small, and inter-translated vocabulary is difficult to extract when documents differ in their degree of comparability. To this end, it provides a method and device for interactively extracting comparable corpus and bilingual dictionaries.

Examples

Embodiment Construction

[0057] To give a further understanding of the structural features of the present invention and the effects achieved, the preferred embodiments are described in detail below with reference to the accompanying drawings:

[0058] As shown in Figure 1, a method for interactively extracting comparable corpus and bilingual dictionaries according to the present invention comprises the following steps:

[0059] The first step is the preprocessing process 101. Perform lemmatization, word segmentation, and stop word removal on the documents to obtain the preprocessed document collections and vocabulary collections.

[0060] For M source language documents and N target language documents, perform preprocessing such as lemmatization, word segmentation, and stop word removal using prior-art methods, obtaining the source language document set $D_S = \{d_m \mid 1 \le m \le M\}$, the target language document set $D_T = \{d_n \mid 1 \le n \le N\}$, source l...
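The preprocessing step can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the tiny `STOP_WORDS` and `LEMMAS` tables, the regex-based segmentation (which only suits space-delimited languages), and the function names `preprocess` and `build_collections`. The text only says that prior-art methods are used, so a real system would substitute language-specific tools.

```python
import re

# Hypothetical toy resources; real systems would use language-specific tools.
STOP_WORDS = {"the", "a", "of", "and", "is", "are"}
LEMMAS = {"dictionaries": "dictionary", "documents": "document",
          "extracted": "extract"}


def preprocess(text, stop_words=STOP_WORDS, lemmas=LEMMAS):
    """Segment text into words, lemmatize by lookup, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())      # naive word segmentation
    tokens = [lemmas.get(t, t) for t in tokens]    # lookup-based lemmatization
    return [t for t in tokens if t not in stop_words]


def build_collections(raw_docs):
    """Return the preprocessed document set and its sorted vocabulary set."""
    docs = [preprocess(d) for d in raw_docs]
    vocab = sorted({t for d in docs for t in d})
    return docs, vocab


# The same function would be applied separately to the target language documents.
source_docs, source_vocab = build_collections([
    "The dictionaries are extracted.",
    "Documents of the corpus",
])
print(source_docs)
print(source_vocab)
```

Running the same routine once per language yields one document set and one vocabulary set for each side, which is the input the later steps work on.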

Abstract

The invention relates to a method and device for interactively extracting a comparable corpus and a bilingual dictionary. It aims to overcome two defects: the difficulty of identifying a comparable corpus when the domain seed bilingual dictionary is too small, and the difficulty of extracting inter-translated vocabulary when documents differ in their degree of comparability. The method comprises the following steps: performing lemmatization, word segmentation, and stop word removal on the documents to obtain a preprocessed document set and a vocabulary set; constructing relations between source language documents and target language documents, between source language vocabulary and target language vocabulary, and between bilingual vocabulary pairs and bilingual document pairs; iteratively reinforcing and calculating the weights of the bilingual document pairs and the bilingual vocabulary pairs; and selecting the bilingual document pairs with the largest weights to construct the comparable corpus and the bilingual vocabulary pairs with the largest weights to construct the bilingual dictionary. Similarity between documents in different languages helps to judge similarity between words in different languages, and similarity between words in turn strengthens the similarity between documents; through this interactive, iterative reinforcement, the comparable corpus and the bilingual dictionary are extracted simultaneously.
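The interplay between document-pair weights and vocabulary-pair weights can be sketched as a mutual-reinforcement loop. This is only an illustration under stated assumptions: the patent's actual relation matrices and update formulas are not given in the text above, so the sketch substitutes a simple HITS-style update with a damping term that keeps the seed dictionary anchored. The names `reinforce`, `normalize`, `damping`, and the toy documents are all hypothetical.

```python
from itertools import product


def normalize(weights):
    """Scale weights to sum to 1; fall back to uniform if all are zero."""
    total = sum(weights.values())
    if total == 0:
        return {k: 1.0 / len(weights) for k in weights}
    return {k: v / total for k, v in weights.items()}


def reinforce(src_docs, tgt_docs, seed_pairs, iterations=10, damping=0.5):
    """Assumed stand-in update rule, not the formulas of the patent."""
    src_sets = [set(d) for d in src_docs]
    tgt_sets = [set(d) for d in tgt_docs]
    src_vocab = sorted(set().union(*src_sets))
    tgt_vocab = sorted(set().union(*tgt_sets))
    word_pairs = list(product(src_vocab, tgt_vocab))
    doc_pairs = list(product(range(len(src_docs)), range(len(tgt_docs))))

    # Seed dictionary entries form the prior over vocabulary pairs.
    prior = normalize({p: 1.0 if p in seed_pairs else 0.0 for p in word_pairs})
    word_w = dict(prior)
    doc_w = {}
    for _ in range(iterations):
        # A document pair is weighted by the vocabulary pairs it contains.
        doc_w = normalize({
            (m, n): sum(word_w[s, t] for s in src_sets[m] for t in tgt_sets[n])
            for m, n in doc_pairs
        })
        # A vocabulary pair is weighted by the document pairs it co-occurs in.
        spread = dict.fromkeys(word_pairs, 0.0)
        for (m, n), w in doc_w.items():
            for s, t in product(src_sets[m], tgt_sets[n]):
                spread[s, t] += w
        spread = normalize(spread)
        word_w = {p: damping * prior[p] + (1 - damping) * spread[p]
                  for p in word_pairs}
    return doc_w, word_w


# Toy preprocessed documents (hypothetical) and a one-entry seed dictionary.
src_docs = [["cat", "milk"], ["stock", "market"]]
tgt_docs = [["chat", "lait"], ["bourse", "marche"]]
doc_w, word_w = reinforce(src_docs, tgt_docs, {("cat", "chat")})
best_doc_pair = max(doc_w, key=doc_w.get)     # candidate for the comparable corpus
best_word_pair = max(word_w, key=word_w.get)  # candidate for the dictionary
```

On this toy input the seed entry pulls up the document pair that contains it, and that document pair in turn lifts the other vocabulary pairs it contains above pairs drawn from unrelated documents. Separating true translations from mere co-occurring words would need far more documents than this example has.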

Description

Technical field

[0001] The present invention relates to the technical field of cross-language information processing, in particular to a method and device for interactively extracting comparable corpus and bilingual dictionaries.

Background

[0002] Bilingual comparable corpora and bilingual dictionaries are two kinds of cross-lingual resources of different granularity, both of great value to cross-lingual information processing tasks such as statistical machine translation and cross-lingual information retrieval. A comparable corpus is composed of document pairs that are in different languages and have similar content but are not translations of each other. Mining translation-equivalent pairs of different granularities from it, such as inter-translated bilingual vocabulary, bilingual named entities, and parallel sentence pairs, can effectively solve the new problems faced by bilingual dictionary compilation. More fine-grained bilingual knowledge such as bilingual dictionar...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/28, G06F17/30
Inventors: 朱泽德, 王绍祺, 李淼, 张健, 陈雷, 杨振新, 卫林钰, 曾新华, 郑守国, 李华龙, 翁士状, 盛文溢, 高会议, 陈晟
Owner: HEFEI INSTITUTES OF PHYSICAL SCIENCE - CHINESE ACAD OF SCI