
A multi-strategy English long sentence segmentation method for machine translation

A multi-strategy segmentation technique for machine translation, applied in the field of natural language processing, that addresses problems such as the narrow coverage of language phenomena and improves translation quality.

Active Publication Date: 2017-12-19
BEIJING INSTITUTE OF TECHNOLOGY
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0007] The purpose of the present invention is to solve two problems: existing rule-based sentence segmentation methods cover too few language phenomena, and existing machine-learning-based methods can segment only at commas already present in the sentence. To this end, it proposes a novel multi-strategy English long sentence segmentation method for machine translation.



Examples


Embodiment Construction

[0033] The present invention will be further described below in conjunction with embodiment.

[0034] As shown in Figure 1, the machine translation-oriented multi-strategy English long sentence segmentation method of the present invention comprises a training step and an actual segmentation step, each described in detail below:

[0035] The first is the training step, which proceeds as follows:

[0036] Step 1, prepare the training corpus and preprocess it. Because a CRF (conditional random field) is used to learn comma-position information from the corpus, the training corpus must consist of English sentences that contain many commas. In the experiment, about 450,000 English sentences, each containing at least two commas, were selected as the training corpus.
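The corpus selection described in this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the "C"/"O" label scheme, and the idea of stripping commas to obtain per-token comma-position labels are all assumptions made for the example.

```python
from typing import List, Tuple


def has_enough_commas(sentence: str, minimum: int = 2) -> bool:
    """Return True if the sentence contains at least `minimum` commas."""
    return sentence.count(",") >= minimum


def comma_labels(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Strip comma tokens and label each remaining token.

    "C" means a comma followed this token in the original sentence and
    "O" means none did. The label names are illustrative assumptions.
    """
    words: List[str] = []
    labels: List[str] = []
    for tok in tokens:
        if tok == ",":
            if labels:  # ignore a comma with no preceding token
                labels[-1] = "C"
        else:
            words.append(tok)
            labels.append("O")
    return words, labels


# Toy corpus of already-tokenized sentences (hypothetical data).
corpus = [
    "When it rains , the match is delayed , and fans go home .",
    "The match is delayed .",
]

# Keep only sentences with at least two commas, then derive labels.
training = [comma_labels(s.split()) for s in corpus if has_enough_commas(s)]
```

Only the first toy sentence passes the two-comma filter; its labels mark "rains" and "delayed" as the tokens that were followed by a comma.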

[0037] The corpus also requires the necessary preprocessing, such as removing garbled characters and special symbols and performing English tokenization.
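A minimal sketch of this kind of preprocessing is shown below. The patent text available here does not say which cleaning rules or tokenizer were used, so the character whitelist and the regex-based tokenizer are assumptions chosen only to make the idea concrete.

```python
import re
import unicodedata
from typing import List

# Characters outside this whitelist are treated as garbled or special
# symbols. The whitelist itself is an illustrative assumption.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]")

# A token is a word (optionally with internal apostrophes or hyphens)
# or a single punctuation mark.
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*|[.,;:!?\"()]")


def clean(text: str) -> str:
    """Normalize Unicode, drop disallowed characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = _DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Clean the text and split it into word and punctuation tokens."""
    return _TOKEN.findall(clean(text))


tokens = tokenize("Well\ufffd, it's a long-term plan\u200b!")
```

In practice an established English tokenizer would likely replace the regex; the point here is only the order of operations: clean first, then tokenize.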

[0038] For the definit...


Abstract

The invention relates to a machine translation-oriented multi-strategy English long sentence segmentation method and device, belonging to the technical field of natural language processing and machine translation. The method comprises two steps: training and actual use. In the training step, an English training corpus is first prepared and preprocessed; features are then extracted from the corpus, including dependency syntax features, part-of-speech tagging features, and comma position features; finally, a feature template is created to train the CRF model. In parallel, a number of rules are designed that handle simple phenomena more accurately. In the actual use step, features are first extracted from the long English sentence to be processed, using the same features as in the training step; the rule algorithm and the CRF model are then each used to mark comma positions; finally, commas are added at the marked positions to complete the segmentation. Compared with the prior art, combining rules with statistics segments long English sentences effectively and accurately and improves the quality of machine translation.
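The actual-use step can be sketched as below. This is a hedged illustration under several assumptions: the two taggers are hypothetical stand-ins (a toy rule and a stub in place of a trained CRF model), and taking the union of their marked positions is one plausible reading of how the two strategies are combined, not something the abstract specifies.

```python
from typing import Callable, Iterable, List, Set

# A tagger maps a token list to the set of token indices after which a
# comma should be inserted. Both taggers below are hypothetical.
Tagger = Callable[[List[str]], Set[int]]


def rule_tagger(tokens: List[str]) -> Set[int]:
    """Toy rule (an assumption, not from the patent): break before 'which'."""
    return {i - 1 for i, t in enumerate(tokens) if i > 0 and t.lower() == "which"}


def stub_crf_tagger(tokens: List[str]) -> Set[int]:
    """Placeholder for a trained CRF model; here it breaks before 'but'."""
    return {i - 1 for i, t in enumerate(tokens) if i > 0 and t.lower() == "but"}


def segment(tokens: List[str], taggers: Iterable[Tagger]) -> List[str]:
    """Merge the positions marked by every tagger and insert commas there."""
    marks: Set[int] = set()
    for tagger in taggers:
        marks |= tagger(tokens)  # union of strategies (assumed combination)
    out: List[str] = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        at_end = i + 1 >= len(tokens)
        # Do not add a comma at the sentence end or next to an existing one.
        if i in marks and tok != "," and not at_end and tokens[i + 1] != ",":
            out.append(",")
    return out


sentence = "He bought the car which was red but he never drove it ."
segmented = " ".join(segment(sentence.split(), [rule_tagger, stub_crf_tagger]))
```

Keeping each strategy behind the same `Tagger` signature means a real CRF model or further rules can be swapped in without changing `segment`.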

Description

Technical field

[0001] The invention relates to a pre-translation preprocessing method for machine translation, in particular to a machine translation-oriented multi-strategy English long sentence segmentation method, which belongs to the technical field of natural language processing machine translation.

Background technique

[0002] Today, the Internet is very developed and spread all over the world. With the help of the Internet, people from different nationalities and speaking different languages can share information anytime and anywhere, and people are increasingly eager to obtain useful information on the Internet quickly and smoothly. However, in the face of the vast amount of information on the Internet today, traditional human translation seems powerless. Therefore, under such a background, machine translation technology has a huge market, and scholars from various countries have also done a lot of research work in this field.

[0003] In recent years, machine ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27, G06F17/28
Inventors: 冯冲, 杨书立, 黄河燕
Owner: BEIJING INSTITUTE OF TECHNOLOGY