
Text data processing method and device, equipment and medium

A text data processing technology, applied in the computer field, that addresses the problem of a limited corpus restricting machine translation quality.

Pending Publication Date: 2022-04-15
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

However, the extraction-based method requires strict alignment of the English side, so the extracted corpus is much smaller than the English-Centric corpus. The existing text pair generation method therefore produces only a limited amount of corpus, which limits the quality of machine translation.


Embodiment Construction

[0075] The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments of the application. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of this application.

[0076] It should be understood that artificial intelligence (AI for short) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the nature of intelligenc...


Abstract

An embodiment of the invention provides a text data processing method, device, equipment, and medium in the field of artificial intelligence. The method comprises the following steps: obtaining a first text pair and a second text pair, obtaining a first sub-text from the first text pair, and obtaining a second sub-text from the second text pair; determining an edit distance between the first sub-text and the second sub-text and, if the edit distance meets a similarity condition, generating a first target sub-text that is associated with the semantic information of the first sub-text and belongs to a third language type, and generating a second target sub-text that is associated with the semantic information of the second sub-text and belongs to a second language type; and generating a text sample pair according to the first text pair, the second text pair, the first target sub-text, and the second target sub-text. With this method, text sample pairs composed of different language types can be generated, so the amount of corpus can be increased while its quality is ensured.
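The edit-distance check described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the application: the whitespace tokenization, the normalization by the longer sequence, the `max_ratio` threshold, and the function names `edit_distance` and `meets_similarity_condition`. The generation of the target sub-texts is not shown, because the abstract does not say how it is performed.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def meets_similarity_condition(first_sub_text: str,
                               second_sub_text: str,
                               max_ratio: float = 0.2) -> bool:
    """Hypothetical similarity condition: normalized distance <= max_ratio.

    The threshold value and the normalization are assumptions.
    """
    tokens_a = first_sub_text.lower().split()
    tokens_b = second_sub_text.lower().split()
    longest = max(len(tokens_a), len(tokens_b), 1)
    return edit_distance(tokens_a, tokens_b) / longest <= max_ratio
```

Under these assumptions, two sub-texts that differ in one token out of six would pass the check, whereas an exact-match rule would reject them. Relaxing exact matching in this way is what allows more text pairs to be paired up.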

Description

Technical field

[0001] The present application relates to the field of computer technology, and in particular to a text data processing method, device, equipment and medium.

Background technique

[0002] Existing corpora usually contain English-Centric text pairs (i.e., sentence pairs in which either the source-language side or the target-language side is English). Non-English sentence pairs (i.e., sentence pairs in which both the source language and the target language are non-English languages) are generated from English-Centric sentence pairs.

[0003] Currently, an extraction-based method is used to generate corpus. When the extraction-based method extracts parallel corpus from an existing English-Centric corpus, it usually aligns identical English sides to create a multi-way parallel corpus. For example, when the English sides of a "French-English" sentence pair and an "English-German" sentence pair are exactly the same, the three-way aligned sentence pair of "French-English-German" can...
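The extraction-based baseline described in the background can be sketched as an exact-match join on the English side. This is a minimal illustration, not the application's implementation: the function name `extract_by_exact_english_match`, the tuple layout, and the sample sentences are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def extract_by_exact_english_match(
    fr_en_pairs: List[Tuple[str, str]],
    en_de_pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Build French-English-German triples where the English sides are identical."""
    # Index the German sentences by their exact English counterpart.
    german_by_english: Dict[str, List[str]] = defaultdict(list)
    for english, german in en_de_pairs:
        german_by_english[english].append(german)

    triples = []
    for french, english in fr_en_pairs:
        # Only an exact string match on the English side yields a triple.
        for german in german_by_english.get(english, []):
            triples.append((french, english, german))
    return triples


# Hypothetical sample data for illustration only.
fr_en = [("Bonjour le monde", "Hello world"),
         ("Merci beaucoup", "Thank you very much")]
en_de = [("Hello world", "Hallo Welt"),
         ("Thanks a lot", "Vielen Dank")]

triples = extract_by_exact_english_match(fr_en, en_de)
```

In this sample only the "Hello world" pair is aligned. The second French and German sentences are never paired, because their English sides are not character-for-character identical, which illustrates why strict alignment yields a much smaller corpus.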


Application Information

IPC(8): G06F40/58, G06F40/56, G06F40/49, G06F40/45, G06K9/62
Inventor 杨振许钰林孟凡东
Owner TENCENT TECH (SHENZHEN) CO LTD