Text data processing method and device, equipment and medium

A text data processing method and related technology, applied in the computer field, that addresses the problems of limited corpus size and limited machine translation quality.

Pending Publication Date: 2022-04-15
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

However, this method requires strict alignment of the English side, resulting in the size of the extracted corpus being much smaller than that of English-C




Example Embodiment

[0075] The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

[0076] It should be understood that artificial intelligence (AI) is the theory, methods, technology and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to unde...



Abstract

The embodiments of the invention provide a text data processing method, device, equipment and medium in the field of artificial intelligence. The method comprises: obtaining a first text pair and a second text pair; obtaining a first sub-text from the first text pair and a second sub-text from the second text pair; determining an edit distance between the first sub-text and the second sub-text; if the edit distance meets a similarity condition, generating a first target sub-text that is associated with the semantic information of the first sub-text and belongs to a third language type, and generating a second target sub-text that is associated with the semantic information of the second sub-text and belongs to a second language type; and generating a text sample pair according to the first text pair, the second text pair, the first target sub-text and the second target sub-text. In this way, text sample pairs composed of different language types can be generated, which increases the size of the corpus while ensuring its quality.
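The edit-distance check in the abstract can be illustrated with a minimal sketch. The source does not say how the sub-texts are tokenized or what the similarity condition is, so the word-level tokenization, the length normalization, the `max_ratio` threshold and both function names below are assumptions made only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def meets_similarity_condition(first_sub_text, second_sub_text, max_ratio=0.3):
    """Hypothetical similarity condition: normalized word-level edit distance
    at or below max_ratio. The threshold value is an assumption."""
    a = first_sub_text.lower().split()
    b = second_sub_text.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty sub-texts are treated as identical
    return edit_distance(a, b) / longest <= max_ratio


if __name__ == "__main__":
    print(meets_similarity_condition("The cat sat on the mat",
                                     "The cat sat on a mat"))
```

Under these assumptions, two sub-texts that differ by a single word out of six pass the check, while unrelated sentences do not.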

Description

Technical field

[0001] The present application relates to the field of computer technology, and in particular to a text data processing method, device, equipment and medium.

Background

[0002] Existing corpora usually contain English-Centric text pairs (i.e. sentence pairs), that is, pairs in which either the source language side or the target language side is English. Non-English sentence pairs (pairs in which both the source language and the target language are non-English languages) therefore have to be generated from English-Centric sentence pairs.

[0003] Currently, an extraction-based method is used to generate corpus data. When extracting parallel corpus data from an existing English-Centric corpus, this method usually aligns identical English sides to create a multi-way parallel corpus. For example, when the English sides of a "French-English" sentence pair and an "English-German" sentence pair are exactly the same, the three-way aligned sentence pair "French-English-German" can...
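The exact-match extraction described in the background can be sketched as follows. This is a minimal illustration of aligning identical English sides, not the method of any particular system. The function name, the tuple layout and the sample sentences are assumptions.

```python
from collections import defaultdict


def extract_exact_match(pairs_to_english, pairs_from_english):
    """Join (source, english) pairs with (english, target) pairs whose
    English sides are exactly identical, giving three-way aligned triples."""
    by_english = defaultdict(list)
    for english, target in pairs_from_english:
        by_english[english].append(target)

    triples = []
    for source, english in pairs_to_english:
        for target in by_english.get(english, []):
            triples.append((source, english, target))
    return triples


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    fr_en = [("Bonjour", "Hello"), ("Merci", "Thank you")]
    en_de = [("Hello", "Hallo"), ("Thanks", "Danke")]
    print(extract_exact_match(fr_en, en_de))
```

In this sample only the "Hello" pairs are aligned. "Thank you" and "Thanks" are not character-for-character identical, so that pair is dropped, which illustrates why strict alignment of the English side yields a comparatively small extracted corpus.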


Application Information

IPC(8): G06F40/58, G06F40/56, G06F40/49, G06F40/45, G06K9/62
Inventor 杨振, 许钰林, 孟凡东
Owner TENCENT TECH (SHENZHEN) CO LTD