
Corpus word segmentation preprocessing method for machine translation

A machine translation preprocessing technology, applied in natural language data processing, neural learning methods, natural language translation, and related fields. It addresses the wasted vocabulary slots and coarse word segmentation granularity of existing preprocessing, improving word segmentation accuracy and eliminating redundant vocabulary occupation.

Pending Publication Date: 2022-08-09
四川语言桥信息技术有限公司

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to solve the problems of wasted vocabulary slots and coarse word segmentation granularity in existing word segmentation preprocessing.



Examples


Embodiment 1

[0035] As shown in Figure 1, this embodiment provides a corpus word segmentation preprocessing method for machine translation, comprising the following steps:

[0036] Step S1: Perform data cleaning on the original corpus according to language rules;

[0037] As a preferred scheme of this embodiment, the data cleaning in step S1 includes:

[0038] Removing empty lines; removing sentence pairs whose sentence endings are not aligned; removing HTML markup; removing rigid characters; removing sentences that contain a third-party language; removing garbled text; processing the sentences with an alignment algorithm and removing those with poor alignment; using the original text and its translation together as the key; and converting Traditional Chinese to Simplified Chinese.
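A few of the cleaning rules in [0038] can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function names, the punctuation table, and the use of U+FFFD as a garbled-text signal are all hypothetical, and the source-plus-translation key is assumed here to serve deduplication, which the text does not state explicitly. Third-language detection, alignment scoring, and Traditional-to-Simplified conversion are left out because they would need external tools.

```python
import re

# Hypothetical helpers; none of these names come from the patent text.
TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for HTML tags

# Map sentence-final punctuation (ASCII and full-width) to a shared class.
END_PUNCT = {".": ".", "。": ".", "?": "?", "?": "?", "!": "!", "!": "!"}


def end_class(sentence):
    """Return the punctuation class of the last character, or '' if none."""
    return END_PUNCT.get(sentence[-1], "") if sentence else ""


def clean_pairs(pairs):
    """Filter (source, translation) pairs with a subset of the S1 rules."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Remove HTML markup and surrounding whitespace.
        src = TAG_RE.sub("", src).strip()
        tgt = TAG_RE.sub("", tgt).strip()
        # Remove empty lines.
        if not src or not tgt:
            continue
        # Assumption: U+FFFD (replacement character) marks garbled text.
        if "\ufffd" in src or "\ufffd" in tgt:
            continue
        # Remove pairs whose sentence endings do not match.
        if end_class(src) != end_class(tgt):
            continue
        # Assumption: the (source, translation) key is used to drop duplicates.
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

Calling `clean_pairs` on a list of `(source, translation)` tuples returns the surviving pairs in their original order, so the filter can sit in front of the later normalization and segmentation steps.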

[0039] Step S2: Perform symbol standardization on the cleaned corpus;

[0040] In this embodiment, the symbolic standardization process described in th...



Abstract

The invention discloses a corpus word segmentation preprocessing method for machine translation and relates to the technical field of machine translation preprocessing. It solves the problems of wasted vocabulary slots and coarse word segmentation granularity in existing word segmentation preprocessing. The method comprises the following steps: performing data cleaning on an original corpus according to language rules; performing symbol standardization on the cleaned corpus; performing word segmentation on the symbol-standardized corpus; converting the letter case of the corpus, so that upper-case and lower-case variants of the same word no longer occupy separate vocabulary entries, to obtain a training corpus; and performing word segmentation on the training corpus based on the BPE algorithm to obtain an optimal vocabulary. The method has the advantages that the resulting vocabulary contains no duplicate entries and the word segmentation granularity is fine.
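The last two steps of the abstract, case conversion followed by BPE-based segmentation, can be illustrated with a minimal Python sketch of the standard byte-pair-encoding merge loop. It is not the patent's implementation: the function names, the `</w>` end-of-word marker, and the choice of lower-casing as the case conversion are assumptions made for illustration, and the input is assumed to be a list of tokens that contain no whitespace.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    # A callable replacement avoids backslash interpretation in `merged`.
    return {pattern.sub(lambda m: merged, w): f for w, f in vocab.items()}


def learn_bpe(tokens, num_merges):
    """Lower-case the tokens, then learn up to `num_merges` BPE merges."""
    # Case conversion: "Low", "low" and "LOW" collapse into one entry.
    word_freq = Counter(t.lower() for t in tokens)
    # Represent each word as space-separated characters plus an end marker.
    vocab = {" ".join(w) + " </w>": f for w, f in word_freq.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

For example, `learn_bpe(["Low", "low", "LOW"], 10)` treats the three spellings as one word with frequency 3, which is the vocabulary saving the abstract attributes to case conversion. The number of merges controls the final vocabulary size and therefore the segmentation granularity.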

Description

Technical field

[0001] The invention relates to the technical field of machine translation preprocessing, and more specifically to a corpus word segmentation preprocessing method for machine translation.

Background art

[0002] In the era of deep learning, choosing a vocabulary is essentially the first step in every natural language processing task, and the choice of vocabulary also affects the performance of the final model. To control computational complexity, a neural machine translation (NMT) system uses a fixed-size vocabulary, usually limited to between 30K and 80K entries. How to build the best vocabulary for a machine translation model is a question that deep learning practitioners have long explored.

[0003] The generation of neural machine translation models and the training of vocabularies both depend on the quality and standardized format of the original training data. The words generated by training data sho...

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F40/211, G06F40/216, G06F40/221, G06F40/289, G06F40/58, G06N3/08
CPC: G06F40/289, G06F40/211, G06F40/216, G06F40/221, G06F40/58, G06N3/08
Inventor: 朱宪超, 陈秋霖, 霍展羽
Owner: 四川语言桥信息技术有限公司