A Domain Adaptive Sentence Alignment System Based on Self-Bootstrapping

A self-bootstrapping sentence alignment technology, applied to text processing within natural language processing. It addresses low sentence alignment quality, lack of domain specificity, and time- and labor-intensive manual work; it saves resources, is simple to operate, and improves alignment quality.

Status: Inactive
Publication Date: 2017-02-15
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT +1
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0003] On the one hand, machine translation systems urgently need processed parallel corpora and aligned sentence pairs; on the other hand, the required preprocessing operations are cumbersome and too time- and labor-intensive to do manually. In addition, current sentence alignment suffers from low quality and a lack of domain specificity.

Examples

Embodiment Construction

[0029] As shown in Figure 1, the architecture of this system comprises four parts. The implementation of each part is as follows:

[0030] 1. Web page processing module

[0031] This module takes web page corpora as its main processing object. A web page corpus consists of parallel or comparable HTML files crawled directly from the web. After analyzing the format and related features of specific web pages, the module uses regular expressions to extract the corresponding text, both Chinese and English.
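The regular-expression extraction described in [0031] can be illustrated with a minimal sketch. The patent does not give its patterns, so the tag-stripping regex, the CJK character range, and the function names `strip_tags` and `split_by_language` are all assumptions chosen for illustration; real pages would need patterns tuned to their specific markup.

```python
import re

# Assumed patterns (not from the patent): drop <script>/<style> blocks whole,
# then treat every remaining tag as a segment boundary.
TAG_RE = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.DOTALL | re.IGNORECASE)
CJK_RE = re.compile(r"[\u4e00-\u9fff]")   # common CJK ideographs
LATIN_RE = re.compile(r"[A-Za-z]")


def strip_tags(html: str) -> list[str]:
    """Replace tags with newlines and return the non-empty text segments."""
    text = TAG_RE.sub("\n", html)
    return [seg.strip() for seg in text.split("\n") if seg.strip()]


def split_by_language(html: str) -> tuple[list[str], list[str]]:
    """Sort text segments into Chinese and English lists.

    A segment containing any CJK ideograph counts as Chinese; otherwise a
    segment containing Latin letters counts as English.
    """
    chinese, english = [], []
    for seg in strip_tags(html):
        if CJK_RE.search(seg):
            chinese.append(seg)
        elif LATIN_RE.search(seg):
            english.append(seg)
    return chinese, english


page = (
    "<html><body><p>机器翻译需要平行语料。</p>"
    "<p>Machine translation needs parallel corpora.</p>"
    "<script>var x = 1;</script></body></html>"
)
zh_text, en_text = split_by_language(page)
```

Classifying whole segments by script is a simplification: mixed-language segments are counted as Chinese here, which may or may not match what the patent's module does.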

[0032] 2. English processing module

[0033] Taking the characteristics of English punctuation into account, this module handles sentence splitting, tokenization, stemming, and related operations.

[0034] Tokenization is the process of separating English words from the punctuation that follows them. Punctuation attached to a word usually interferes with recognizing the word. Since English texts often have special punctuation marks (such as he’s s...
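As a rough illustration of separating words from trailing punctuation, the sketch below uses a single regular expression. The source text is truncated before it explains how special punctuation such as the apostrophe in "he's" is treated, so keeping apostrophe contractions as one token is purely an assumption here, as are the names `TOKEN_RE` and `tokenize`.

```python
import re

# Assumption: a word may carry apostrophe suffixes ("he's", "isn't") and is
# kept as one token; every other non-space, non-alphanumeric character
# becomes its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['’][A-Za-z]+)*|[^\sA-Za-z0-9]")


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens and standalone punctuation tokens."""
    return TOKEN_RE.findall(sentence)


tokens = tokenize("He's here, isn't he?")
# tokens == ["He's", "here", ",", "isn't", "he", "?"]
```

A system that instead splits contractions into two tokens would only need a different first alternative in the pattern.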

Abstract

Provided is a domain-adaptive sentence alignment system based on self-bootstrapping. The system comprises a web page processing module, a Chinese text processing module, an English text processing module, and a bilingual text processing module. First, corpora are extracted from different web pages and preprocessed accordingly; Chinese and English sentences are then aligned at the sentence level by a self-bootstrapping alignment algorithm that integrates multiple features. At the same time, mutually translated word pairs that reflect domain and topic information are extracted. The system improves sentence alignment quality and adapts well to different domains.
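The general idea of feeding alignment results back into the next alignment round can be sketched as below. This is not the patented algorithm: the two features (length ratio and lexicon coverage), their equal weighting, the greedy best-match search, the co-occurrence counting, and every name and threshold are assumptions made only to show the shape of such a loop.

```python
from collections import Counter


def score(zh, en, lexicon):
    """Combine two assumed features: length ratio and lexicon coverage."""
    length = min(len(zh), len(en)) / max(len(zh), len(en))
    en_set = set(en)
    hits = sum(1 for w in zh if any((w, e) in lexicon for e in en_set))
    return 0.5 * length + 0.5 * hits / len(zh)


def align(zh_sents, en_sents, lexicon):
    """Greedily link each Chinese sentence to its best-scoring English one."""
    links = []
    for i, zh in enumerate(zh_sents):
        j = max(range(len(en_sents)), key=lambda k: score(zh, en_sents[k], lexicon))
        links.append((i, j, score(zh, en_sents[j], lexicon)))
    return links


def extract_lexicon(zh_sents, en_sents, links, threshold, min_count):
    """Collect word pairs that co-occur in several confident sentence links."""
    counts = Counter()
    for i, j, s in links:
        if s >= threshold:
            counts.update({(z, e) for z in zh_sents[i] for e in en_sents[j]})
    return {pair for pair, c in counts.items() if c >= min_count}


def bootstrap(zh_sents, en_sents, seed, rounds=3, threshold=0.7, min_count=2):
    """Alternate alignment and lexicon extraction until nothing new is learned."""
    lexicon = set(seed)
    links = align(zh_sents, en_sents, lexicon)
    for _ in range(rounds):
        learned = extract_lexicon(zh_sents, en_sents, links, threshold, min_count)
        if learned <= lexicon:
            break
        lexicon |= learned
        links = align(zh_sents, en_sents, lexicon)
    return links, lexicon


zh = [["我", "喜欢", "猫"], ["我", "喜欢", "狗"], ["猫", "吃", "鱼"]]
en = [["cats", "eat", "fish"], ["i", "like", "dogs"], ["i", "like", "cats"]]
seed = {("我", "i"), ("猫", "cats"), ("狗", "dogs")}
links, lexicon = bootstrap(zh, en, seed)
```

In this toy run the word pairs learned from the first round raise the scores of the first two links in the second round. Plain co-occurrence counting also admits noisy pairs, so a realistic system would need a stricter association measure.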

Description

Technical field

[0001] The invention relates to a domain-adaptive sentence alignment system based on a self-bootstrapping method, and belongs to the field of text processing within natural language processing. The self-bootstrapping method refers to feeding the algorithm's results back into its conditions and achieving optimization through multiple iterations.

Background

[0002] In natural language processing, acquiring high-quality parallel corpora is an important problem with great significance for applications such as machine translation and cross-language retrieval. The Internet is a rich resource and a good source of corpora. However, because of the particular way the Internet stores and organizes information, making good use of its text requires extracting and preprocessing web page content. Whether a large set of well-preprocessed sentence pairs with high alignment quality can be obtained is a key factor aff...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F17/28
CPC: G06F16/3335, G06F16/374
Inventor: 程工, 刘春阳, 庞琳, 张旭, 巢文涵, 黄智, 李舟军
Owner: NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT