Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment

A bilingual sentence-pair alignment technique, applied in the computer field, that addresses the high maintenance cost of bilingual sentence alignment.

Active Publication Date: 2020-06-09
TENCENT TECH (SHENZHEN) CO LTD
7 Cites, 6 Cited by
AI Technical Summary

Problems solved by technology

[0005] Based on this, it is necessary to provide a bilingual sentence alignment method, device, computer-readable storage medium and computer equipment that address the high maintenance cost of bilingual sentence alignment.

Embodiment Construction

[0065] In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

[0066] Figure 1 is an application environment diagram of the bilingual corpus sentence alignment method in one embodiment. The application environment involves either the terminal 110 alone, or the terminal 110 and the server 120, which are connected via a network. When only the terminal 110 is involved, the terminal 110 obtains the parallel text to be aligned, together with the language type of the original text and the language type of the translated text in that parallel text; performs preprocessing on the parallel text to be aligned, calls the mon...

Abstract

The invention relates to a bilingual corpus sentence alignment method and device, a computer-readable storage medium and a computer device. The method comprises: obtaining a parallel text to be aligned, together with the language types of the original text and the translated text in it; preprocessing the parallel text to obtain parallel sentence pairs to be aligned; calling, from a group of monolingual word segmentation models trained with the SentencePiece algorithm, the models corresponding to the language types of the original and translated texts, and performing word segmentation to obtain a sentence fragment group for the original text and a sentence fragment group for the translated text; formatting both sentence fragment groups according to a preset format processing mode to obtain bilingual sentence pair groups; and calling a sentence alignment tool to align the bilingual sentence pair groups according to a bilingual dictionary, yielding a sentence-aligned parallel corpus. Because the monolingual word segmentation models for all languages are trained with the same SentencePiece algorithm, code coupling and maintenance difficulty are reduced, which lowers the maintenance cost.
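The flow in the abstract (look up a per-language segmenter, segment each sentence, then format the result for an aligner) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the two segmenters are simple stand-ins for SentencePiece-trained models, the names `MODEL_GROUP`, `split_sentences`, `segment_text` and `format_for_aligner` are hypothetical, and the "preset format" is assumed to be one sentence per line with space-separated pieces.

```python
import re
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]


def char_segmenter(sentence: str) -> List[str]:
    """Stand-in for a trained 'zh' model: one piece per non-space character."""
    return [ch for ch in sentence if not ch.isspace()]


def space_segmenter(sentence: str) -> List[str]:
    """Stand-in for a trained 'en' model: lowercase, split on whitespace."""
    return sentence.lower().split()


# Hypothetical "monolingual word segmentation model group", keyed by language
# type. In the patent each entry would be a SentencePiece-trained model.
MODEL_GROUP: Dict[str, Segmenter] = {
    "zh": char_segmenter,
    "en": space_segmenter,
}


def split_sentences(text: str) -> List[str]:
    """Naive preprocessing: cut after sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def segment_text(text: str, lang: str) -> List[List[str]]:
    """Call the segmenter matching the language type on every sentence."""
    segmenter = MODEL_GROUP[lang]
    return [segmenter(s) for s in split_sentences(text)]


def format_for_aligner(groups: List[List[str]]) -> List[str]:
    """Assumed preset format: one sentence per line, pieces joined by spaces."""
    return [" ".join(pieces) for pieces in groups]


if __name__ == "__main__":
    source_lines = format_for_aligner(segment_text("你好。再见。", "zh"))
    target_lines = format_for_aligner(segment_text("Hello world. Goodbye.", "en"))
    print(source_lines)
    print(target_lines)
```

Because every language goes through the same lookup and the same interface, adding a language only means adding one entry to the model group, which is the decoupling benefit the abstract points to.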

Description

Technical field

[0001] The present application relates to the field of computer technology, in particular to a bilingual sentence alignment method, device, computer-readable storage medium and computer equipment.

Background technique

[0002] When performing sentence-level alignment in bilingual parallel corpora with chapter-level alignment, a feasible approach is to use sentence length information and lexical information to judge the similarity of sentences in the two language parallel corpora.

[0003] For example, if the lengths of two sentences differ greatly, the similarity between the two sentences is low, and the possibility of being a parallel sentence pair is also small. Or, if the two sentences contain the same number or the same letter string, the similarity between the two sentences is higher, and the possibility that the two sentences are parallel sentence pairs is higher. And, when two sentences contain words of the same concept in two languages, the similarit...
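The two background cues described above, sentence length and shared numbers or letter strings, can be combined into a toy similarity score. The sketch below is only an illustration under assumptions: the function names, the raw character-length ratio, the Jaccard overlap and the equal weights are all choices made here and are not taken from the patent. The dictionary-based cue for words of the same concept is left out.

```python
import re

# Numbers and Latin letter strings tend to survive translation unchanged,
# so they serve as language-independent anchors.
ANCHOR_PATTERN = re.compile(r"[0-9]+|[A-Za-z]+")


def length_score(src: str, tgt: str) -> float:
    """Ratio of shorter to longer length, in [0, 1]; 1.0 means equal length."""
    longer = max(len(src), len(tgt))
    if longer == 0:
        return 1.0
    return min(len(src), len(tgt)) / longer


def shared_anchor_score(src: str, tgt: str) -> float:
    """Jaccard overlap of the numbers and letter strings in both sentences."""
    a = set(ANCHOR_PATTERN.findall(src))
    b = set(ANCHOR_PATTERN.findall(tgt))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(src: str, tgt: str, w_len: float = 0.5, w_anchor: float = 0.5) -> float:
    """Weighted sum of the two cues; the weights are arbitrary assumptions."""
    return w_len * length_score(src, tgt) + w_anchor * shared_anchor_score(src, tgt)


if __name__ == "__main__":
    english = "Version 2 released in 2020"
    print(similarity(english, "版本 2 于 2020 年发布"))  # likely translation
    print(similarity(english, "天气很好"))  # unrelated sentence
```

A practical aligner would normalize lengths by the expected length ratio of the language pair rather than comparing raw character counts, so the score here should be read as a relative signal only.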

Application Information

IPC(8): G06F40/211, G06F40/242, G06F40/279, G06F40/58, G06F40/103
CPC: Y02D10/00
Inventor: 鲁思祈
Owner: TENCENT TECH (SHENZHEN) CO LTD