Method for automatic fragmentation and classification of translation manuscripts based on a large-scale term corpus

A large-scale corpus technology, applied in special data processing applications, instruments, electronic digital data processing, etc., that addresses the shortcomings of existing methods for fragmentation classification of translated manuscripts, with the effect of improving classification efficiency, shortening classification time, and reducing query time.

Status: Inactive
Publication Date: 2013-05-15
IOL WUHAN INFORMATION TECH CO LTD
2 Cites, 16 Cited by

AI Technical Summary

Problems solved by technology

[0010] The present invention aims to provide a method for automatic fragmentation and classification of translated manuscripts based on a large-scale terminology corpus, in order to overcome the shortcomings of existing methods for fragmenting and classifying translated manuscripts.

Examples

Embodiment Construction

[0027] The present invention will be described in detail below with reference to the accompanying drawings and in combination with embodiments. Referring to Figure 1, the process of the embodiment includes:

[0028] S11: Extract the keywords of each paragraph of the translated manuscript, and establish the correspondence between each paragraph and the keywords it contains;

[0029] S12: Match each keyword of the translated manuscript in the term corpus one by one, and take the industry category attribute of the term matched by a keyword as the industry category attribute of that keyword in each segment that contains it;

[0030] S13: According to the correspondence, determine for each segment the industry category attribute that occurs most often among its keywords;

[0031] S14: Classify each segment under its most frequent industry category attribute.
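
Steps S11 to S14 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes keyword extraction has already produced a list of keywords per segment, and that the term corpus is a plain dictionary mapping each term to a single industry category. The names `classify_segments`, `term_corpus`, and the sample data are hypothetical.

```python
from collections import Counter


def classify_segments(segment_keywords, term_corpus):
    """Label each segment with the industry category matched by most of its keywords.

    segment_keywords: dict of segment id -> list of keywords (the S11 correspondence).
    term_corpus: dict of term -> industry category (assumed structure).
    Returns a dict of segment id -> category, or None when no keyword matched.
    """
    labels = {}
    for seg_id, keywords in segment_keywords.items():
        # S12: look up each keyword in the term corpus; unmatched keywords are skipped.
        categories = [term_corpus[k] for k in keywords if k in term_corpus]
        # S13: count how often each category occurs within this segment.
        counts = Counter(categories)
        # S14: classify the segment under its most frequent category.
        labels[seg_id] = counts.most_common(1)[0][0] if counts else None
    return labels


# Hypothetical sample data.
term_corpus = {
    "anesthesia": "medical",
    "catheter": "medical",
    "turbine": "machinery",
    "gearbox": "machinery",
}
segments = {
    1: ["anesthesia", "catheter", "turbine"],
    2: ["gearbox", "turbine", "report"],
    3: ["report"],
}
labels = classify_segments(segments, term_corpus)
```

The source does not say how ties between categories are resolved. In this sketch `Counter.most_common` returns the category encountered first among equally frequent ones, which is an arbitrary choice.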

[0032] Since the number of words in the document to be translated is much smaller than the number of words in ...

Abstract

The invention provides a method for automatic fragmentation and classification of a translation manuscript based on a large-scale term corpus. The method comprises: segmenting the translation manuscript into words, removing stop words, and obtaining a keyword set; extracting the keywords of each paragraph of the manuscript and establishing the correspondence between each paragraph and the keywords it contains; matching the keywords of the manuscript one by one in the term corpus, and taking the industry category attribute of the term matched by a keyword as the industry category attribute of that keyword in each paragraph containing it; determining, according to the correspondence, the industry category attribute that occurs most often in each paragraph; and classifying each paragraph under that attribute. Because the manuscript contains far fewer words than the term corpus, and because the term corpus can be searched in alphabetical order, no pattern-matching algorithm is needed when matching keywords in the term corpus. Lookup time is therefore greatly reduced, which shortens the fragmentation time of the manuscript and improves fragmentation efficiency.
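
The preprocessing and lookup described in the abstract can be illustrated with a sorted term list and binary search. This is a rough sketch under several assumptions: whitespace splitting stands in for real word segmentation, the stop-word list is a small made-up set, and the corpus is held as an alphabetically sorted list of terms with a parallel list of categories. Binary search is only one way to exploit alphabetical order; the source does not state which lookup structure is used. All names below are hypothetical.

```python
import bisect

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "is", "and", "in"}


def extract_keywords(paragraph):
    """Lowercase, split on whitespace, strip punctuation, drop stop words and duplicates."""
    keywords = []
    for word in paragraph.lower().split():
        word = word.strip(".,;:")
        if word and word not in STOP_WORDS and word not in keywords:
            keywords.append(word)
    return keywords


def lookup_category(sorted_terms, sorted_categories, keyword):
    """Find keyword in the alphabetically sorted term list by binary search."""
    i = bisect.bisect_left(sorted_terms, keyword)
    if i < len(sorted_terms) and sorted_terms[i] == keyword:
        return sorted_categories[i]
    return None  # keyword is not a term in the corpus


# Hypothetical corpus, sorted alphabetically by term.
corpus = sorted([
    ("turbine", "machinery"),
    ("catheter", "medical"),
    ("gearbox", "machinery"),
])
terms = [term for term, _ in corpus]
cats = [category for _, category in corpus]

keywords = extract_keywords("The catheter is in the turbine of a gearbox.")
matches = {k: lookup_category(terms, cats, k) for k in keywords}
```

Each lookup needs a number of comparisons that grows logarithmically with the corpus size, and only the manuscript's keywords are looked up, so the corpus is never scanned in full.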

Description

Technical field

[0001] The invention relates to the field of document division, in particular to a method for automatic fragmentation and classification of translated manuscripts based on a large-scale terminology corpus.

Background

[0002] At present, the production of a corpus in the prior art generally includes the following processes:

[0003] Collection of corpus: corpus material can come from national standards, industry standards and other standard documents, and can also come from officially published dictionaries, encyclopedias, periodicals, teaching materials, newspapers and other reference books, and from related documents published on authoritative websites; it can also be obtained from other terminology corpus networks, by exchanging corpus data and record carriers, and so on.

[0004] Standardization processing: According to the established standard format or rules, the corpus obtained from various sources is initially processed. For example, the duplicate checking of corpus, the unified convers...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 江潮
Owner: IOL WUHAN INFORMATION TECH CO LTD