BPE-Learn acceleration method for sub-word segmentation

A sub-word vocabulary technique, applied to BPE-Learn acceleration for sub-word segmentation, that addresses wasted GPU resources, long running times, and the inability to compute statistics over partitioned data, with the effect of improving GPU utilization and shortening statistics time.

Active Publication Date: 2020-05-19
沈阳雅译网络技术有限公司

AI Technical Summary

Problems solved by technology

With such a massive amount of data, the BPE-Learn process can take several hours or even more than ten hours, so data preprocessing before training takes up a lot of time. At the same time, it also consumes precious GPU resources, causing a certain amount of waste.
[0007] BPE-Learn computes statistics over the full corpus and does not allow the data to be partitioned for counting. An ordinary multi-process synchronous acceleration method can only obtain local byte pair frequencies, and these cannot represent the globally highest frequency.
[0008] Therefore, the traditional BPE-Learn algorithm cannot complete byte pair statistics in a short time on massive data, nor can it achieve multi-process acceleration of byte pair statistics through data partitioning or similar methods.
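
The point in [0007] can be illustrated with a small sketch. The two shards, their words, and the helper name `pair_counts` are hypothetical and chosen only to make the effect visible; they are not taken from the patent. Each shard's most frequent pair differs from the pair that is most frequent once the counts are summed.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = list(word)
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

# Two hypothetical shards of one word-frequency table (illustrative data).
shard_a = {"abab": 3, "xbc": 4}
shard_b = {"cdcd": 3, "ybc": 4}

local_a = pair_counts(shard_a)
local_b = pair_counts(shard_b)
global_counts = local_a + local_b  # summing Counters merges the shard results

best_a = max(local_a, key=local_a.get)
best_b = max(local_b, key=local_b.get)
best_global = max(global_counts, key=global_counts.get)

print(best_a, best_b, best_global)
```

Neither worker, looking only at its own shard, would pick the pair that wins globally, which is why the per-shard counts have to be summarized in one place before a merge is chosen.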




Detailed Description of Embodiments

[0036] The present invention will be further elaborated below in conjunction with the accompanying drawings of the description.

[0037] The present invention proposes a BPE-Learn acceleration method for subword segmentation. It uses multi-process statistics in an interactive mode to accelerate the BPE-Learn algorithm, and addresses the severe time cost of subword segmentation in neural machine translation training.

[0038] As shown in Figure 2, the BPE-Learn algorithm acceleration of the present invention comprises the following steps:

[0039] 1) Read in the training data, split it on spaces, count the number of times each word appears in the corpus, and record the counts as a vocabulary;

[0040] 2) Divide the vocabulary into N sub-tables, create an independent sub-process for each sub-table for byte pair statistics, and assign a communication queue to each sub-process for interaction with the main process;

[0041] 3) In the sub-process, the chara...
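
Steps 1) and 2) might be sketched as follows. This is a minimal illustration under assumptions: the function names `build_vocab` and `split_vocab`, the round-robin assignment of words to sub-tables, and the sample corpus are all hypothetical, and `str.split()` splits on any whitespace rather than only spaces. The sub-processes and communication queues of step 2) are left out, so only the partitioning itself is shown.

```python
from collections import Counter

def build_vocab(lines):
    """Step 1: split each line on whitespace and count every word."""
    vocab = Counter()
    for line in lines:
        vocab.update(line.split())
    return vocab

def split_vocab(vocab, n):
    """Step 2 (partitioning only): deal words round-robin into n sub-tables.

    Round-robin over sorted words is an assumption made for this sketch;
    the source does not say how the vocabulary is divided.
    """
    shards = [dict() for _ in range(n)]
    for i, (word, freq) in enumerate(sorted(vocab.items())):
        shards[i % n][word] = freq
    return shards

# Hypothetical three-line corpus.
corpus = ["low lower lowest", "low new newer", "lower low"]
vocab = build_vocab(corpus)
shards = split_vocab(vocab, 2)
print(shards)
```

Because each word lands in exactly one sub-table, the per-shard pair counts can later be summed without double counting.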


Abstract

The invention discloses a BPE-Learn acceleration method for sub-word segmentation. The method comprises the following steps: reading in training data, counting the frequency of occurrence of each word in the corpus, and recording the result as a vocabulary; dividing the vocabulary into N sub-tables; using characters as the basic units for byte pair statistics in each sub-process; having the sub-processes count byte pairs in their respective sub-tables at the same time and inform the main process through a communication queue when counting is finished; having the main process read a temporary file, summarize the statistical results of all sub-processes, select the byte pair with the highest frequency, store that byte pair in a file, and perform pruning; having each sub-process wait for a signal from the main process, set to zero the frequencies of byte pairs that were not segmented, calculate and update the byte pair frequencies of the source words, and return them to the main process; and ending the statistics when the byte pairs in the file stored by the main process meet the quantity requirement. The method shortens the byte pair statistics time of sub-word segmentation before training and increases the GPU utilization rate in neural machine translation model training.
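
The loop described in the abstract can be approximated in a single process. The sketch below is a simplification, not the patented procedure: the workers run sequentially instead of as sub-processes, results are passed in memory rather than through queues and temporary files, pruning is omitted, and all pair counts are recomputed each round instead of being updated incrementally. The names `count_pairs`, `apply_merge`, and `learn_bpe_sharded`, the tie-breaking rule, and the sample sub-tables are hypothetical.

```python
from collections import Counter

def count_pairs(shard):
    """Worker-side step: adjacent-pair frequencies for one sub-table."""
    counts = Counter()
    for symbols, freq in shard.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(shard, pair):
    """Worker-side step: rewrite each word with every occurrence of `pair` fused."""
    merged = {}
    for symbols, freq in shard.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe_sharded(shards, num_merges):
    """Main-process loop, simulated sequentially in one process."""
    merges = []
    while len(merges) < num_merges:
        total = Counter()
        for shard in shards:  # stands in for collecting each worker's result
            total.update(count_pairs(shard))
        if not total:
            break
        # Highest global frequency; ties broken by the pair itself (an assumption).
        best = max(total, key=lambda p: (total[p], p))
        merges.append(best)
        shards = [apply_merge(shard, best) for shard in shards]
    return merges

# Hypothetical sub-tables; words are stored as tuples of characters.
shards = [
    {tuple("low"): 5, tuple("lowest"): 2},
    {tuple("newer"): 6, tuple("wider"): 3},
]
merges = learn_bpe_sharded(shards, 3)
print(merges)
```

The selection happens only on the summed counts, so every shard applies the same merge in the same round, which keeps the result identical to counting over the undivided vocabulary.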

Description

Technical field

[0001] The invention relates to the field of machine translation, in particular to a BPE-Learn acceleration method for subword segmentation.

Background

[0002] Neural Machine Translation (NMT) is a machine translation technology that uses words as the smallest unit. Because the NMT system has high computational complexity, it limits the NMT vocabulary to a fixed size in order to keep resource and time consumption within a usable range. Words that have not appeared in the training corpus are called unregistered, or out-of-vocabulary (OOV), words. The NMT system replaces unregistered words with the unified tag UNK. This leads to inaccurate translations of unregistered words, and can even destroy the translation structure of the entire sentence. To avoid unregistered words, researchers proposed the subword segmentation method BPE. After subword segmentation, words are div...
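
The UNK replacement described in the background can be shown with a short sketch. The function names `build_fixed_vocab` and `replace_oov`, the vocabulary size of 3, and the sample sentences are hypothetical and serve only as illustration.

```python
from collections import Counter

def build_fixed_vocab(sentences, size):
    """Keep only the `size` most frequent words as the vocabulary."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, _ in counts.most_common(size)}

def replace_oov(sentence, vocab, unk="UNK"):
    """Replace every word outside the vocabulary with the unified UNK tag."""
    return " ".join(w if w in vocab else unk for w in sentence.split())

# Hypothetical training sentences and test sentence.
train = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_fixed_vocab(train, 3)
encoded = replace_oov("the bird sat", vocab)
print(encoded)
```

The word "bird" is lost entirely once it becomes UNK, which is the information loss that subword segmentation is meant to avoid.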


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/58, G06F40/44, G06F40/289
CPC: Y02D10/00
Inventors: 杜权, 刘兴宇, 朱靖波, 肖桐, 张春良
Owner 沈阳雅译网络技术有限公司