Distributed acquisition system for web bilingual parallel corpus resources

A collection-system and parallel-corpus technology, applied in transmission systems, special data processing applications, instruments, etc. It addresses the problems of few channels for obtaining corpora, low crawling efficiency, and small crawling scale, with the effects of improving crawling efficiency, saving computing resources, and resolving resource-occupation conflicts.

Inactive Publication Date: 2013-04-03
HARBIN INST OF TECH
4 Cites, 23 Cited by

AI Technical Summary

Problems solved by technology

[0006] The present invention provides a distributed collection system for web bilingual parallel corpus resources, which solves the problems of small crawling scale, few channels for obtaining corpora, and low crawling efficiency in existing systems.

Examples

Specific Embodiment 1

[0020] Embodiment 1: The distributed collection system for web bilingual parallel corpus resources described in this embodiment comprises:

[0021] A link repository module for storing the hyperlinks contained in the crawling task;

[0022] A screening filter module 1, which takes the link stream from the link repository module as input and judges whether each link satisfies the crawling conditions; if it does, the module then checks whether the link belongs to a non-bilingual site and decides according to the rules whether to crawl it;

[0023] A web crawler module 2, which obtains the download list from the screening filter module 1 and then downloads from the Internet the web pages corresponding to the URL links in the download list;

[0024] An original web page library module, which stores the original web pages downloaded by the web crawler module 2;

[0025] The bilingual dete...
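The data flow through the first four modules listed above (link repository, screening filter, web crawler, original web page library) might be sketched as follows. This is only an illustration under stated assumptions: the function name `run_crawl_pass`, the `should_crawl` predicate, and the injected `fetch` callable are hypothetical and do not come from the patent, and the later modules are left out because their description is truncated here.

```python
from typing import Callable, Dict, Iterable, List


def run_crawl_pass(link_repository: Iterable[str],
                   should_crawl: Callable[[str], bool],
                   fetch: Callable[[str], str]) -> Dict[str, str]:
    """One pass: link repository -> screening filter -> crawler -> page library.

    All names are hypothetical; `fetch` stands in for the real downloader.
    """
    # Screening filter module: turn the link stream into a download list.
    download_list: List[str] = [url for url in link_repository if should_crawl(url)]

    # Web crawler module: download each page on the list.
    # Original web page library module: keep the raw page, keyed by URL.
    original_page_library: Dict[str, str] = {}
    for url in download_list:
        original_page_library[url] = fetch(url)
    return original_page_library


# Usage with a stub fetcher so the sketch runs without network access.
links = ["http://example.com/en/a.html", "http://example.com/logo.png"]
pages = run_crawl_pass(
    links,
    should_crawl=lambda u: u.endswith(".html"),
    fetch=lambda u: "<html>stub for %s</html>" % u,
)
```

Passing the filter and the downloader in as callables keeps the sketch independent of any particular filtering rule set or HTTP client.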

Specific Embodiment 2

[0040] Embodiment 2: This embodiment further describes the link repository module of Embodiment 1. The module stores and maintains a large-scale library of crawled links, which records the URL address of each web page, its crawl status, and its crawl time.

[0041] This embodiment stores this meta-information in the crawl task list and uses it to decide whether to perform a crawl or an incremental update on a link.
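A link record carrying the three pieces of meta-information named above, together with the crawl-or-update decision, could look roughly like this. The class name `LinkRecord`, the `next_action` function, the seven-day `REFRESH_INTERVAL`, and the `"skip"` outcome are all assumptions made for illustration; the patent text does not specify how the decision is made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LinkRecord:
    """One entry in the link repository (field names are hypothetical)."""
    url: str
    crawled: bool = False                   # crawl status
    crawl_time: Optional[datetime] = None   # time of the last crawl, if any


# Assumed refresh policy: revisit pages older than seven days.
REFRESH_INTERVAL = timedelta(days=7)


def next_action(record: LinkRecord, now: datetime) -> str:
    """Return 'crawl', 'incremental_update', or 'skip' for a record."""
    if not record.crawled or record.crawl_time is None:
        return "crawl"                # never fetched: full crawl
    if now - record.crawl_time >= REFRESH_INTERVAL:
        return "incremental_update"   # stale copy: refresh it
    return "skip"                     # fresh copy: nothing to do (assumption)
```

For example, a record last crawled ten days before `now` would be scheduled for an incremental update, while a record with no crawl time would be scheduled for a first crawl.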

Specific Embodiment 3

[0042] Embodiment 3: This embodiment further describes the screening filter module 1 of Embodiment 1. The screening filter module 1 sequentially reads link entries from the link repository module and screens out a list of links to be crawled. The filtering strategy consists of custom filtering rules and blacklist rules: the filtering rules include general regular expressions, and the blacklist supplies the non-bilingual sites. After reading a record from the link repository module, the module applies the rules to judge whether to add the link to the crawling list that serves as the input of the web crawler module 2. Its other function is to update the link repository module regularly, eliminating redundant and worthless links according to the filtering rules and thereby improving the quality of the link repository.

[0043] In this embodiment, websites that have been identified as non-bilingual are dynamically added to the blacklist during the translation corpus collection process, and are directly ignored in the next ...
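A minimal sketch of such a filter, combining regular-expression rules with a blacklist that grows during collection, is given below. It rests on several assumptions not stated in the patent: the regular expressions are treated as reject patterns, the blacklist is keyed by host name, and the class and method names (`ScreeningFilter`, `should_crawl`, `mark_non_bilingual`, `prune`) are hypothetical.

```python
import re
from urllib.parse import urlparse


class ScreeningFilter:
    """Regex filtering rules plus a dynamic blacklist of non-bilingual hosts."""

    def __init__(self, reject_patterns, blacklist=None):
        # Assumption: a URL matching any of these patterns is not worth crawling.
        self.reject_patterns = [re.compile(p) for p in reject_patterns]
        self.blacklist = set(blacklist or [])

    def should_crawl(self, url):
        host = urlparse(url).hostname or ""
        if host in self.blacklist:
            return False  # known non-bilingual site: ignore directly
        return not any(p.search(url) for p in self.reject_patterns)

    def mark_non_bilingual(self, url):
        # Called when a site has been judged non-bilingual during collection.
        host = urlparse(url).hostname
        if host:
            self.blacklist.add(host)

    def prune(self, links):
        # Periodic repository cleanup: drop duplicates and rejected links.
        seen, kept = set(), []
        for url in links:
            if url not in seen and self.should_crawl(url):
                seen.add(url)
                kept.append(url)
        return kept
```

Keying the blacklist by host means a single non-bilingual judgment suppresses every later link from the same site, which matches the stated goal of ignoring such sites directly; a real system might prefer a coarser or finer key.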

Abstract

A distributed acquisition system for web bilingual parallel corpus resources relates to the technical field of corpus acquisition and solves the problems of small crawling scale, few channels for acquiring corpora, and low crawling efficiency in conventional systems. The system comprises a link repository module, a screening filter module, a web crawler module, an original web page library module, a bilingual detection module, a blacklist module, a bilingual web page library module, and a link extractor module. The invention overcomes these technical shortcomings by taking the Internet as the corpus acquisition target. It can effectively resolve resource-occupation conflicts in a distributed system, provides a general design framework for bilingual parallel corpus acquisition systems, continuously and dynamically adds non-bilingual sites to a blacklist, effectively crawls parallel corpora on the Internet, and greatly improves the efficiency of bilingual corpus crawling.

Description

Technical field

[0001] The invention relates to the technical field of corpus acquisition, in particular to a distributed acquisition system for bilingual parallel corpora.

Background

[0002] Statistical machine translation is one of the methods of machine translation. The basic idea is to build a statistical translation model through statistical analysis of a large number of parallel corpora, and then use this model for translation.

[0003] Parallel corpora play a vital role in statistical machine translation technology. Parallel corpora of sufficient quantity and good quality are a necessary condition for building a high-performance statistical machine translation system.

[0004] The construction and acquisition of bilingual parallel corpora are very difficult, and many countries have invested substantial manpower, material resources, and funding in them. However, the sources of bilingual parallel corpora are mainly concentrated in specific fields...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/28, G06F17/30, H04L29/08
Inventor: 徐志明, 张志超, 韩啸天
Owner: HARBIN INST OF TECH