Machine-translation-oriented multi-strategy segmentation method and device of English long sentence

A multi-strategy technique for machine translation, applied in the field of natural language processing, that addresses problems such as narrow coverage of language phenomena and improves translation quality.

Active Publication Date: 2015-11-18
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

[0007] The purpose of the present invention is to address two problems: existing rule-based sentence segmentation methods cover too few language phenomena, and existing machine-learning-based methods can segment only at commas already present in the sentence. To this end, the invention proposes a novel multi-strategy English long sentence segmentation method for machine translation.

Examples

Embodiment Construction

[0033] The present invention is further described below with reference to an embodiment.

[0034] As shown in Figure 1, the machine-translation-oriented multi-strategy English long sentence segmentation method of the present invention comprises a training step and an actual segmentation step, each described in detail below:

[0035] The first is the training step, which proceeds as follows:

[0036] Step 1: prepare the training corpus and preprocess it. Because the CRF is used to learn comma-position information from the corpus, the training corpus must consist of English sentences that contain many commas. In the experiment, about 450,000 English sentences, each containing at least two commas, were selected as the training corpus.

[0037] The corpus also requires the necessary preprocessing, such as removing garbled characters and special symbols and performing English tokenization.
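Step 1 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `clean`, `tokenize` and `build_corpus`, the set of characters treated as garbled, and the regex tokenizer are all assumptions made for illustration. Only the "at least two commas" selection criterion comes from the text.

```python
import re
from typing import Iterable, List

MIN_COMMAS = 2  # from the text: sentences must contain at least two commas


def clean(sentence: str) -> str:
    """Replace control characters and U+FFFD with spaces, then collapse whitespace.

    Which characters count as "garbled" is an assumption of this sketch.
    """
    sentence = re.sub(r"[\x00-\x1f\x7f\ufffd]", " ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def tokenize(sentence: str) -> List[str]:
    """Split into word tokens and single-character punctuation tokens.

    A deliberately simple stand-in for a real English tokenizer.
    """
    pattern = r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]"
    return re.findall(pattern, sentence)


def build_corpus(sentences: Iterable[str]) -> List[List[str]]:
    """Keep only cleaned, tokenized sentences with enough commas."""
    corpus = []
    for raw in sentences:
        tokens = tokenize(clean(raw))
        if tokens.count(",") >= MIN_COMMAS:
            corpus.append(tokens)
    return corpus
```

For example, `build_corpus(["When it rains, we stay in, and read.", "One, comma only."])` keeps only the first sentence, because the second contains a single comma.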

[0038] For the defi...

Abstract

The invention relates to a machine-translation-oriented multi-strategy method and device for segmenting long English sentences, and belongs to the technical field of natural language processing and machine translation. The method comprises a training step and a practical-use step. The training step proceeds as follows: first, an English training corpus is prepared and preprocessed; next, features are extracted from the corpus, including dependency syntax features, part-of-speech tagging features, comma position features and the like; finally, a feature template is created to train a CRF (Conditional Random Field) model, and a set of rules is designed that can handle simple phenomena relatively accurately. The practical-use step proceeds as follows: first, features are extracted from the long English sentence to be processed, using the same features as in the training step; next, the rule algorithm and the CRF model are each used independently to label comma positions; finally, a comma is added at each labeled position to complete the segmentation. Compared with the prior art, the method and device combine rules and statistics to segment long English sentences effectively and accurately, thereby improving machine translation quality.
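The practical-use step can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented procedure: the abstract does not say how the rule output and the CRF output are merged, so the sketch takes their union; the single rule in `rule_positions` is hypothetical; and `model_positions` is a caller-supplied stand-in for a trained CRF model.

```python
from typing import Callable, Iterable, List, Set


def rule_positions(tokens: List[str]) -> Set[int]:
    """Hypothetical rule: label a comma before "but" when none precedes it."""
    positions = set()
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() == "but" and tokens[i - 1] != ",":
            positions.add(i)
    return positions


def insert_commas(tokens: List[str], positions: Set[int]) -> List[str]:
    """Insert a comma token before every labeled index, avoiding duplicates."""
    out = []
    for i, tok in enumerate(tokens):
        if i in positions and i > 0 and tokens[i - 1] != ",":
            out.append(",")
        out.append(tok)
    return out


def segment(
    tokens: List[str],
    model_positions: Callable[[List[str]], Iterable[int]],
) -> List[str]:
    """Label positions with the rules and the model, then add the commas.

    Taking the union of both label sets is an assumption of this sketch.
    """
    positions = rule_positions(tokens) | set(model_positions(tokens))
    return insert_commas(tokens, positions)


# Usage with a stub in place of the CRF: it labels index 7 ("because").
tokens = "He tried hard but the plan failed because nobody helped".split()
segmented = segment(tokens, lambda toks: {7})
```

In a real system the stub would be replaced by a model that predicts a label per token from the extracted features; the insertion step itself would be unchanged.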

Description

Technical field

[0001] The invention relates to a pre-translation preprocessing method for machine translation, in particular to a machine-translation-oriented multi-strategy English long sentence segmentation method, and belongs to the technical field of natural language processing and machine translation.

Background

[0002] The Internet is now highly developed and reaches every part of the world. With its help, people of different nationalities who speak different languages can share information anytime and anywhere, and they are increasingly eager to obtain useful information from the Internet quickly and smoothly. Faced with the vast amount of information online, however, traditional human translation cannot keep up. Against this background, machine translation technology has a huge market, and scholars in many countries have done a great deal of research in this field.

[0003] In recent years, machine ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/28
Inventor: 冯冲, 杨书立, 黄河燕
Owner: BEIJING INSTITUTE OF TECHNOLOGY