A method and apparatus for generating sample data
By using the first processing unit to generate encoded sequences during language model training and leveraging the parallel processing of the second processing unit, the problem of low efficiency in text data processing in existing technologies is solved, the efficiency of sample data generation is improved, and the training time of the language model is reduced.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING SANKUAI ONLINE TECH CO LTD
- Filing Date
- 2022-08-02
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, the method of using a central processing unit (CPU) to perform loop operations to process text data and generate sample data is inefficient, resulting in increased language model training time.
The first processing unit generates the encoded sequence of the text data to be processed, and the second processing unit replaces part of the encoding in parallel to generate sample data. The graphics processing unit (GPU) or field-programmable gate array (FPGA) and other sub-processing units are used for parallel processing.
It improves the efficiency of text data processing, reduces the training time of language models, avoids resource waste, and achieves full-word masking and dynamic masking.
[Figure: CN115392479B abstract drawing]
Description
Technical Field
[0001] This specification relates to the field of text processing, and in particular to a method and apparatus for generating sample data.
Background Technology
[0002] In the training process of a language model, the generation of sample data and the training of the model are separate. That is, the text data used to train the language model needs to be processed in advance to obtain sample data, and then the sample data is input into the language model to train the language model.
[0003] However, the current common method for processing text data is to use a central processing unit (CPU) to perform loop operations to process each piece of text data one by one, thereby obtaining each sample data. This method of processing text data based on loop operations is inefficient and increases the time required to train a language model.
[0004] Therefore, how to improve the efficiency of processing the text data used to train language models, and thereby reduce the training time of language models, is an urgent problem to be solved.
Summary of the Invention
[0005] This specification provides a method and apparatus for generating sample data to partially solve the aforementioned problems existing in the prior art.
[0006] The following technical solution is adopted in this specification:
[0007] This specification provides a method for generating sample data, including:
[0008] Obtain the text data to be processed;
[0009] For each piece of text data to be processed, the first processing unit generates the corresponding encoding sequence for that piece of text data.
[0010] The encoding sequence corresponding to each text data to be processed is imported into the second processing unit, so that the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed is replaced with the specified encoding in parallel by the second processing unit to generate each sample data.
[0011] Optionally, the second processing unit includes sub-processing units;
[0012] In parallel, the encoding of at least a portion of the characters in the encoding sequence corresponding to each piece of text data to be processed is replaced with a specified encoding, specifically including:
[0013] Each sub-processing unit processes each piece of text data in parallel to obtain sample data corresponding to each piece of text data; where:
[0014] For each sub-processing unit, the position of each word in the text data to be processed is determined within that sub-processing unit.
[0015] Based on the position of each word in the text data to be processed, construct the feature matrix corresponding to the text data to be processed.
[0016] Based on the feature matrix corresponding to the text data to be processed, at least some of the characters in the encoding sequence corresponding to the text data to be processed are replaced with the specified encoding to obtain the sample data corresponding to the text data to be processed.
[0017] Optionally, based on the position of each word in the text data to be processed, a feature matrix corresponding to the text data to be processed is constructed, specifically including:
[0018] For each word in the text data to be processed, determine the row vector corresponding to the word based on the positions, in the text data to be processed, of the characters contained in the word;
[0019] Based on the position of each word in the text data to be processed, the row vectors corresponding to each word in the text data to be processed are sorted to construct the feature matrix corresponding to the text data to be processed.
[0020] Optionally, before the second processing unit replaces the encoding of at least a portion of the characters in the encoding sequence corresponding to each piece of text data to be processed with the specified encoding in parallel, the method further includes:
[0021] For each sub-processing unit, the position of each word in the text data to be processed is randomly adjusted to obtain the rearrangement information corresponding to the text data to be processed.
[0022] The second processing unit replaces the encoding of at least a portion of the characters in the encoding sequence corresponding to each piece of text data to be processed with a specified encoding in parallel, specifically including:
[0023] Based on the rearrangement information corresponding to the text data to be processed and the feature matrix corresponding to the text data to be processed, at least some of the characters in the encoding sequence corresponding to the text data to be processed are replaced with the specified encoding.
[0024] Optionally, based on the rearrangement information corresponding to the text data to be processed and the feature matrix corresponding to the text data to be processed, at least some of the characters in the encoded sequence corresponding to the text data to be processed are replaced with a specified encoding, specifically including:
[0025] Based on the rearrangement information corresponding to the text data to be processed, the position order between each row vector in the feature matrix corresponding to the text data to be processed is adjusted to obtain the rearrangement matrix corresponding to the text data to be processed.
[0026] Based on the rearrangement matrix corresponding to the text data to be processed, at least some of the characters in the encoding sequence corresponding to the text data to be processed are replaced with the specified encoding.
[0027] Optionally, based on the rearrangement matrix corresponding to the text data to be processed, at least some of the characters in the encoded sequence corresponding to the text data to be processed are replaced with a specified encoding, specifically including:
[0028] The target words are determined based on the rearrangement matrix corresponding to the text data to be processed;
[0029] In the encoding sequence corresponding to the text data to be processed, the encoding of the characters contained in the target word is replaced with the specified encoding.
[0030] Optionally, the target words are determined based on the rearrangement matrix corresponding to the text data to be processed, specifically including:
[0031] For each row vector in the rearrangement matrix corresponding to the text data to be processed, determine the number of characters contained in the word corresponding to that row vector, and the sum of the number of characters contained in the words corresponding to the row vectors preceding that row vector in the rearrangement matrix;
[0032] Determine whether the sum of the number of characters in the word corresponding to the row vector and the number of characters in the words corresponding to the row vectors preceding the row vector in the rearrangement matrix exceeds a preset threshold.
[0033] If not, then the word corresponding to that row vector is determined to be the target word of the text data to be processed.
[0034] This specification provides a sample data generation apparatus, including:
[0035] The acquisition module is used to acquire the text data to be processed.
[0036] The first generation module is used to generate the encoding sequence corresponding to each text data to be processed through the first processing unit.
[0037] The second module is used to import the encoding sequence corresponding to each text data to be processed into the second processing unit, so that the second processing unit can replace the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding in parallel to generate each sample data.
[0038] This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method for generating sample data.
[0039] This specification provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the above-mentioned methods for generating sample data and training models.
[0040] The above-mentioned technical solutions adopted in this specification can achieve the following beneficial effects:
[0041] In the sample data generation method provided in this specification, each text data to be processed is first obtained. For each text data to be processed, a first processing unit generates an encoding sequence corresponding to the text data to be processed. The encoding sequence corresponding to each text data to be processed is then imported into a second processing unit. The second processing unit then replaces the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with a specified encoding in parallel, thereby generating each sample data.
[0042] As can be seen from the above method, this scheme divides the sample data generation process into two parts: preprocessing of the text data to be processed, and generation of the sample data. The first processing unit preprocesses the text data to be processed to obtain its encoding sequence. The second processing unit then generates sample data in parallel from the encoding sequence of each text data to be processed. This avoids the situation in which all processing of the text data is handed to the first processing unit while the second processing unit sits idle and wastes resources. Moreover, the second processing unit can process each text data to be processed in parallel, which improves the efficiency of processing the text data and reduces the time required to train the language model.
Attached Figure Description
[0043] The accompanying drawings, which are included to provide a further understanding of this specification and form part of this specification, illustrate exemplary embodiments and are used to explain this specification, but do not constitute an undue limitation thereof. In the drawings:
[0044] Figure 1 is a flowchart of a method for generating sample data provided in this specification.
[0045] Figure 2 is a schematic diagram of the feature matrix corresponding to the text data to be processed provided in this specification.
[0046] Figure 3 is a schematic diagram of the method for generating the rearrangement matrix provided in this specification.
[0047] Figure 4 is a schematic diagram of the sample data generation process provided in this specification.
[0048] Figure 5 is a schematic diagram of a sample data generation apparatus provided in this specification.
[0049] Figure 6 is a schematic diagram of an electronic device, provided in this specification, corresponding to Figure 1.
Detailed Implementation
[0050] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of them. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this specification.
[0051] The technical solutions provided in the various embodiments of this specification are described in detail below with reference to the accompanying drawings.
[0052] Figure 1 is a flowchart of a method for generating sample data provided in this specification, including the following steps:
[0053] S101: Obtain the text data to be processed.
[0054] In the pre-training process of a language model, the processing of the text data to be processed for training the language model is usually separated from the training process of the language model. That is, the sample data required for the pre-training of the language model needs to be processed in advance and saved so that it can be used in the pre-training process. In order to reduce the time required to train the language model, the efficiency of generating sample data is particularly important.
[0055] Based on this, in this specification, the server of the business platform can obtain various text data to be processed for training the language model, and process these text data to obtain sample data, and then train the language model using the sample data. The sample data may include at least one of the following: the encoding sequence after at least some characters of the text data to be processed are masked, the position encoding sequence of each masked character in the text data to be processed, and the supplementary encoding sequence of the position of each masked character in the text data to be processed.
[0056] For example, assume the text data to be processed is "efficient pre-training" and the word to be masked is "efficient". The encoding sequence after masking the characters of the word "efficient" is then "[101, 103, 103, 7564, 6378, 5298, 102]". Here, 101 is the encoding that marks the start position of the text data to be processed, 102 is the encoding that marks its end position, and 103 is the specified encoding that represents a masked character.
[0057] In this example, the position encoding sequence of the masked characters is "[1, 2, 0]". Here, 1 and 2 are the subscripts, in the masked encoding sequence, of the characters "高" and "效" that make up the masked word "efficient", and 0 is the padding encoding used to pad the sequence to the maximum length. The supplementary encoding sequence for the positions of the masked characters is "[1, 1, 0]". Here, 1 indicates that the entry at the same position of the position encoding sequence is not padding (the "1" and "2" in [1, 2, 0] are subscripts of masked characters), and 0 indicates that the entry at that position is padding (the "0" in [1, 2, 0] is the padding encoding used to pad to the maximum length).
[0058] Note that in the position encoding sequence "[1, 2, 0]", only "1" and "2" actually represent positions of masked characters. The padding encoding "0" is used because other text data may contain more masked characters. To give the position encoding sequences of all text data to be processed a uniform length for the subsequent training of the language model, developers can set a maximum length according to actual needs and pad each position encoding sequence to that length with the padding encoding "0". Other vectors and encoding sequences in this specification are likewise padded to the maximum length with the padding encoding 0 so that their lengths are uniform; this specification does not describe them one by one.
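The three sequences described in this example can be assembled as in the following minimal sketch. The function name `build_sample` and the parameter `max_masked` are hypothetical names chosen for illustration; the codes 103 and 0 are taken from the example above.

```python
MASK_ID = 103  # specified encoding for a masked character (from the example)
PAD_ID = 0     # padding encoding (from the example)


def build_sample(char_codes, masked_positions, max_masked):
    """Return (masked sequence, position sequence, supplementary sequence).

    `max_masked` is a hypothetical parameter: the maximum length to which
    the position sequence is padded.
    """
    if len(masked_positions) > max_masked:
        raise ValueError("more masked characters than the maximum length")
    masked = list(char_codes)
    for pos in masked_positions:
        masked[pos] = MASK_ID
    pad = max_masked - len(masked_positions)
    positions = list(masked_positions) + [PAD_ID] * pad
    # 1 marks a real masked position, 0 marks a padding entry.
    supplementary = [1] * len(masked_positions) + [0] * pad
    return masked, positions, supplementary


codes = [101, 7770, 3126, 7564, 6378, 5298, 102]
masked, positions, supplementary = build_sample(codes, [1, 2], max_masked=3)
```

The supplementary sequence is needed because the padding value 0 is also a valid subscript, so the position sequence alone cannot tell padding apart from a masked character at index 0.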
[0059] In this specification, the execution entity used to implement the sample data generation method can be a server set up in the business platform or a designated terminal device, such as a laptop computer or desktop computer. For ease of description, this specification will only use the server as the execution entity as an example to illustrate the sample data generation method provided in this specification.
[0060] S102: For each piece of text data to be processed, the first processing unit generates the encoding sequence corresponding to the text data to be processed.
[0061] For each piece of text data to be processed, the server can determine the corresponding encoding sequence through a first processing unit, which may refer to a central processing unit (CPU). The encoding sequence includes: a character encoding sequence and a word encoding sequence. The character encoding sequence is obtained by encoding each character in the text data, and the word encoding sequence is obtained by encoding each word in the text data.
[0062] For example, suppose the text data to be processed is "efficient pre-training". The character encoding sequence corresponding to this text data is "[101, 7770, 3126, 7564, 6378, 5298, 102, 0]". "101" is the marker for the start position of the text data (the text data starts at this position), "7770" is the encoding for "高", "3126" is the encoding for "效", "102" is the marker for the end position of the text data (the text data ends at this position), and "0" is the padding encoding used to pad the sequence to the maximum length.
[0063] In addition, to distinguish the words contained in the text data to be processed, the server can also determine the word encoding sequence corresponding to the text data through the first processing unit. For example, for the text data "efficient pre-training", the word encoding sequence is "[0, 1, 3, 6, 0, 0, 0, 0]", where each encoding is the index, in the character encoding sequence, of the first character of a word contained in the text data. The text data "efficient pre-training" contains four words: the marker word 101 for the start position (this word has the single character "101", whose index in the character encoding sequence is 0, that is, the first character of the character encoding sequence); "efficient" (the first character of this word is "高", whose index in the character encoding sequence is 1, meaning the second character of the character encoding sequence belongs to the word "efficient"); "pre-training" (the first character of this word is "pre", whose index in the character encoding sequence is 3, meaning the fourth character of the character encoding sequence belongs to the word "pre-training"); and the marker word 102 for the end position (this word has the single character 102, whose index in the character encoding sequence is 6, meaning the seventh character of the character encoding sequence belongs to the marker word for the end position). The word encoding sequence is padded to the maximum length with "0".
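As an illustration of what the first processing unit produces, the sketch below builds both sequences for the example. It assumes the text has already been segmented into words, and it uses a toy vocabulary: the codes for "高" and "效" come from the example, while the Chinese characters assigned to 7564, 6378 and 5298 are an assumption inferred from the order of the example sequence. The function name `encode` is hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # start marker, end marker, padding

# Toy vocabulary. Only 高 and 效 are confirmed by the text; the remaining
# three characters are assumed from the order of the example sequence.
TOY_VOCAB = {"高": 7770, "效": 3126, "预": 7564, "训": 6378, "练": 5298}


def encode(words, max_len):
    """Return (character encoding sequence, word encoding sequence)."""
    char_codes = [CLS_ID]
    word_starts = [0]  # the start marker is a one-character word at index 0
    for word in words:
        word_starts.append(len(char_codes))  # index of the word's first character
        char_codes.extend(TOY_VOCAB[ch] for ch in word)
    word_starts.append(len(char_codes))  # the end marker is also a word
    char_codes.append(SEP_ID)
    if len(char_codes) > max_len:
        raise ValueError("text is longer than the maximum length")
    char_codes += [PAD_ID] * (max_len - len(char_codes))
    word_starts += [PAD_ID] * (max_len - len(word_starts))
    return char_codes, word_starts


char_codes, word_starts = encode(["高效", "预训练"], max_len=8)
```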
[0064] As can be seen from the above, the server can determine from the word encoding sequence "[0, 1, 3, 6, 0, 0, 0, 0]" which characters of the character encoding sequence belong to which word: the first character belongs to the first word, the second and third characters belong to the second word, the fourth through sixth characters belong to the third word, and so on. In this way, the server can determine the positions, in the character encoding sequence, of the characters of each word in the text data.
[0065] In addition, the server can also determine the actual length of the character encoding sequence of each text data to be processed (the number of codes other than the padding code "0" in the character encoding sequence of the text data to be processed), so that when the second processing unit replaces the encoding of at least a portion of the characters in the character encoding sequence corresponding to each text data to be processed with the specified code in parallel, it can determine which codes in the character encoding sequence corresponding to the text data to be processed are the actual character codes and which codes are padding codes. For example, if the text data to be processed is "efficient pre-training", the character encoding sequence corresponding to the text data to be processed is "[101, 7770, 3126, 7564, 6378, 5298, 102, 0]", and the actual length of the text data to be processed is 7, the server can further determine, based on the actual length 7 of the text data to be processed, that the eighth "0" in the character encoding sequence corresponding to the text data to be processed is the padding code.
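The actual length and the character range of each word can be recovered from the two sequences as in the sketch below. It assumes that padding appears only at the end of a sequence and that the first entry of the word encoding sequence is always the start marker at index 0; the function names are hypothetical.

```python
PAD_ID = 0  # padding encoding


def actual_length(char_codes):
    """Number of codes other than the padding code."""
    return sum(1 for code in char_codes if code != PAD_ID)


def word_spans(word_starts, length):
    """Return a half-open (start, end) character range for each word.

    Assumes the first entry is the start marker at index 0 and that every
    later 0 is padding.
    """
    starts = [word_starts[0]] + [s for s in word_starts[1:] if s != PAD_ID]
    ends = starts[1:] + [length]  # each word ends where the next one starts
    return list(zip(starts, ends))


char_codes = [101, 7770, 3126, 7564, 6378, 5298, 102, 0]
word_starts = [0, 1, 3, 6, 0, 0, 0, 0]
length = actual_length(char_codes)
spans = word_spans(word_starts, length)
```

With zero-based indices, the span (3, 6) covers indices 3, 4 and 5, that is, the fourth through sixth characters of the character encoding sequence.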
[0066] It should be noted that the first processing unit mentioned above can be a central processing unit (CPU). After the server obtains the character encoding sequence and word encoding sequence of each text data to be processed through the first processing unit, it can save them. When the language model later needs to be trained, the second processing unit can obtain these character encoding sequences and word encoding sequences in batches and generate from them the sample data for training the language model. The second processing unit can be, for example, a graphics processing unit (GPU) or a field-programmable gate array (FPGA).
[0067] S103: Import the encoding sequence corresponding to each text data to be processed into the second processing unit, so that the second processing unit can replace the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding in parallel to generate each sample data.
[0068] After obtaining the encoding sequence corresponding to each text data to be processed, the server can import these encoding sequences into the second processing unit. The sub-processing units contained in the second processing unit determine, in parallel, at least some characters to be masked in each text data to be processed and, in the character encoding sequence of each text data, replace the encoding of those characters with the specified encoding to obtain the sample data corresponding to each text data to be processed. The sample data is then input into the language model, which predicts the masked characters from the unmasked characters in the masked encoding sequence, and the language model is trained based on the prediction results.
[0069] Specifically, for each sub-processing unit included in the second processing unit, the server can determine the position of each word in the text data to be processed based on the word encoding sequence of the text data processed by that sub-processing unit, and construct the feature matrix corresponding to the text data based on those positions, as shown in Figure 2.
[0070] Figure 2 is a schematic diagram of the feature matrix corresponding to the text data to be processed provided in this specification.
[0071] As can be seen from Figure 2, the feature matrix corresponding to the text data "efficient pre-training" has L rows and L columns, that is, the shape of the feature matrix is [L, L]. Each row vector corresponds to a word contained in the text data. For example, the first row vector [1, 0, 0, 0, 0, 0, 0, 0] corresponds to the marker word for the start position of the text data. From the word encoding sequence of the text data, it is known that this word has only one character, 101, and that this character is at the first position of the character encoding sequence. The first position of the first row vector is therefore "1", meaning that the character at the first position of the character encoding sequence belongs to the word corresponding to this row vector. The "0"s at the other positions of the first row vector indicate that the other characters of the character encoding sequence do not belong to the marker word for the start position.
[0072] As another example, the second row vector [0, 1, 1, 0, 0, 0, 0, 0] corresponds to the word "efficient" contained in the text data to be processed. From the word encoding sequence of the text data, it is known that the word "efficient" has two characters and that its first character "高", encoded as 7770, is at the second position of the character encoding sequence. The second and third positions of the second row vector are therefore "1". The "1" at the second position indicates that the character at the second position of the character encoding sequence belongs to the word "efficient", and the "1" at the third position indicates that the character at the third position also belongs to the word "efficient". The "0"s at the other positions of the second row vector indicate that the characters at the other positions of the character encoding sequence do not belong to the word "efficient".
[0073] In Figure 2 , the 5th to 8th row vectors are all used to pad the feature matrix of the text data to be processed to the maximum length.
[0074] From the above content, it can be seen that the server can, through this sub-processing unit, for each word contained in the text data to be processed, determine the row vector corresponding to the word according to the positions of each character contained in the word in the text data to be processed, and sort the row vectors corresponding to each word contained in the text data to be processed according to the positions of each word contained in the text data to be processed, so as to construct the feature matrix corresponding to the text data to be processed.
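The construction of the feature matrix can be sketched as follows. This is a plain-Python illustration for a single text data, not the parallel implementation on the second processing unit; the function name `feature_matrix` and the representation of each word as a half-open (start, end) character range are assumptions made for the example.

```python
def feature_matrix(spans, size):
    """Build a size x size 0/1 matrix with one row per word.

    Row i has a 1 in every column whose character belongs to word i.
    Rows beyond the last word stay all-zero and act as padding.
    """
    matrix = [[0] * size for _ in range(size)]
    for row, (start, end) in enumerate(spans):
        for col in range(start, end):
            matrix[row][col] = 1
    return matrix


# Character ranges of the four words in "efficient pre-training":
# start marker, "efficient", "pre-training", end marker.
spans = [(0, 1), (1, 3), (3, 6), (6, 7)]
matrix = feature_matrix(spans, size=8)
```

Because rows are filled in the order of the words, the row order of the matrix matches the order of the words in the text data.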
[0075] Furthermore, after the server obtains the feature matrix of the text data to be processed through this sub-processing unit, it can determine the target word of the text data to be processed according to the feature matrix of the text data to be processed. Furthermore, in the character encoding sequence of the text data to be processed, it can replace the encoding of the characters contained in the target word with the specified encoding.
[0076] Specifically, before the sub-processing units included in the second processing unit process the text data to be processed in parallel, the server can, through the second processing unit and for each sub-processing unit, randomly adjust the positions of the words contained in the text data to be processed by that sub-processing unit, so as to obtain the rearrangement information corresponding to the text data to be processed.
[0077] For example, assuming the text data to be processed is "efficient pre-training", it can be determined that the text data contains four words: the marker of the start position of the text data, "efficient", "pre-training", and the marker of the end position of the text data. These four words can be numbered to obtain the number array corresponding to the words in the text data, "[0, 1, 2, 3, 4, 5, 6, 7, 8]". Here, 0 is the number of the start-position marker, 1 is the number of "efficient", 2 is the number of "pre-training", 3 is the number of the end-position marker, and 4, 5, 6, 7, 8 are padding numbers used to pad to the maximum length. The order of the numbers in the array can then be adjusted by a preset method to obtain the rearranged number array corresponding to the text data to be processed, "[1, 2, 0, 4, 3, 7, 6, 5, 8]". This rearranged number array is the rearrangement information mentioned above.
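The text does not specify the preset method used to reorder the numbers. One plausible choice is a uniform random shuffle, sketched below as an assumption; the function name `rearrangement_info` and the `seed` parameter are hypothetical.

```python
import random


def rearrangement_info(size, seed=None):
    """Return a random permutation of the word numbers 0..size-1.

    A uniform shuffle is an assumption; the text only refers to
    "a preset method".
    """
    rng = random.Random(seed)
    order = list(range(size))
    rng.shuffle(order)
    return order


order = rearrangement_info(9, seed=0)
```

Drawing a fresh permutation each time a text data is processed means a different set of words can be masked on each pass, which is consistent with the dynamic masking mentioned in the summary.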
[0078] Furthermore, for each sub-processing unit included in the second processing unit, the server can use that sub-processing unit to adjust the positional order of the row vectors in the feature matrix corresponding to the text data to be processed, based on the rearrangement information corresponding to the text data to be processed, to obtain the rearrangement matrix corresponding to the text data to be processed, as shown in Figure 3.
[0079] Figure 3 is a schematic diagram of the method for generating the rearrangement matrix provided in this specification.
[0080] As can be seen from Figure 3, for the text data to be processed "efficient pre-training", the server can adjust the positional order of the row vectors in the feature matrix according to the rearrangement information of the text data, i.e., "[1, 2, 0, 4, 3, 7, 6, 5, 8]". That is, the row vector corresponding to each word is moved to the position given by the order of that word in the rearrangement information. As shown in Figure 3, the row vector corresponding to "efficient" is moved to the first row, the row vector corresponding to "pre-training" is moved to the second row, the row vector corresponding to the marker of the end position of the text data to be processed is moved to the fifth row, and so on, to obtain the rearrangement matrix corresponding to the text data to be processed.
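Reordering the rows amounts to indexing the feature matrix by the rearrangement information, as in this minimal sketch. The example order in the text has nine entries while the matrix of Figure 2 is described with eight rows, so the sketch uses a hypothetical eight-entry order whose first five entries match the example. The function name `rearrange_rows` is also hypothetical.

```python
def rearrange_rows(matrix, order):
    """Return a new matrix whose i-th row is matrix[order[i]]."""
    if sorted(order) != list(range(len(matrix))):
        raise ValueError("order must be a permutation of the row indices")
    return [matrix[index] for index in order]


feature = [
    [1, 0, 0, 0, 0, 0, 0, 0],  # start marker
    [0, 1, 1, 0, 0, 0, 0, 0],  # "efficient"
    [0, 0, 0, 1, 1, 1, 0, 0],  # "pre-training"
    [0, 0, 0, 0, 0, 0, 1, 0],  # end marker
    [0, 0, 0, 0, 0, 0, 0, 0],  # padding rows
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
order = [1, 2, 0, 4, 3, 7, 6, 5]  # hypothetical eight-entry order
rearranged = rearrange_rows(feature, order)
```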
[0081] Furthermore, the server can use this sub-processing unit to determine the target word of the text data to be processed based on the rearrangement matrix corresponding to the text data to be processed. Then, it can replace the encoding of the characters contained in the target word with the specified encoding in the character encoding sequence of the text data to be processed.
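Once a target word is known, its row vector can serve directly as a mask over the character encoding sequence. The sketch below illustrates this for one word, assuming the row vector and the character encoding sequence have the same length; the function name `mask_word` is hypothetical.

```python
MASK_ID = 103  # specified encoding for a masked character (from the example)


def mask_word(char_codes, row_vector):
    """Replace every code whose row-vector entry is 1 with MASK_ID."""
    return [
        MASK_ID if flag == 1 else code
        for code, flag in zip(char_codes, row_vector)
    ]


char_codes = [101, 7770, 3126, 7564, 6378, 5298, 102, 0]
efficient_row = [0, 1, 1, 0, 0, 0, 0, 0]  # row vector of "efficient"
masked = mask_word(char_codes, efficient_row)
```

Since a row vector marks every character of its word, all characters of the word are masked together, which corresponds to the full-word masking mentioned in the summary.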
[0082] There are many ways for the sub-processing unit to determine the target word of the text data to be processed based on the rearrangement matrix corresponding to the text data to be processed. For example, for each row vector in the rearrangement matrix, the sub-processing unit determines the number of characters contained in the word corresponding to that row vector, as well as the total number of characters contained in the words corresponding to the row vectors that precede it in the rearrangement matrix. It then determines whether the sum of these two quantities exceeds a preset threshold. If not, the word corresponding to the row vector is determined to be a target word of the text data to be processed.
[0083] For example, assume that the preset threshold is 2 and the text data to be processed is "efficient pre-training". In the rearrangement matrix of the text data to be processed, the word "efficient" corresponding to the first row vector contains 2 characters, and no row vectors precede it, so the preceding words contain 0 characters. The sum, 2, does not exceed the preset threshold of 2, so the word "efficient" corresponding to the first row vector can be identified as a target word. The word "pre-training" corresponding to the second row vector contains 3 characters, and the words corresponding to the row vectors before it in the rearrangement matrix contain 2 characters. The sum is 5, which exceeds the preset threshold of 2, so the word "pre-training" corresponding to the second row vector is not treated as a target word.
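The threshold check in this example can be reproduced with a short Python sketch. It assumes that each row of the rearrangement matrix holds 1 at the character positions of its word, so the row sum equals the word's character count; that layout and the name `select_target_rows` are assumptions for illustration.

```python
from itertools import accumulate

def select_target_rows(rearranged, threshold):
    """Return indices of rows whose running character total stays within threshold."""
    counts = [sum(row) for row in rearranged]  # characters per word
    running = list(accumulate(counts))         # total including the current row
    return [i for i, total in enumerate(running) if total <= threshold]

rearranged = [
    [0, 1, 1, 0, 0, 0, 0],  # "efficient": 2 characters
    [0, 0, 0, 1, 1, 1, 0],  # "pre-training": 3 characters
    [1, 0, 0, 0, 0, 0, 0],  # start marker
    [0, 0, 0, 0, 0, 0, 1],  # end marker
]
targets = select_target_rows(rearranged, threshold=2)
```

Here the running totals are 2, 5, 6, and 7, so only the first row ("efficient") is selected. The sketch does not exclude the start and end markers or padding rows from selection; the passage above does not say how those are handled.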
[0084] It should be noted that, as can be seen from the above, the server can use a sub-processing unit to extract the first few row vectors from the rearrangement matrix corresponding to the text data to be processed, such that the total number of characters does not exceed the preset threshold, and determine the words corresponding to these row vectors as the target words that need to be masked. Because the sub-processing unit rearranges the order of the row vectors in the feature matrix before each extraction, the position order of the row vectors in the rearrangement matrix is random each time target words are selected, and therefore the words extracted by the server through the sub-processing unit are random each time. For example, assuming the preset threshold is 3, the target word corresponding to the first row vector of the rearrangement matrix might be "efficient" the first time; after the rearrangement matrix is regenerated, the target word corresponding to the first row vector of the second rearrangement matrix might be "pre-training".
[0085] In the above content, the method by which the sub-processing unit determines the target word of the text data to be processed based on the rearrangement matrix corresponding to the text data to be processed can also be as follows: for each row vector in the rearrangement matrix corresponding to the text data to be processed, determine whether the number of rows corresponding to the row vector exceeds a preset threshold; if not, determine the word corresponding to the row vector as the target word of the text data to be processed.
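This row-count variant can be sketched in Python as follows, assuming rows are counted from 1 and that the rearrangement information lists, row by row, which word number sits in each row of the rearrangement matrix. The function name `target_words_by_row` is hypothetical.

```python
def target_words_by_row(perm, max_rows):
    """Return the word numbers found in rows 1..max_rows of the rearrangement matrix."""
    return [word for row, word in enumerate(perm, start=1) if row <= max_rows]

# With the rearrangement information from the earlier example and a
# threshold of 2 rows, words 1 ("efficient") and 2 ("pre-training") are chosen.
chosen = target_words_by_row([1, 2, 0, 4, 3, 7, 6, 5, 8], max_rows=2)
```

Unlike the character-count criterion, this variant fixes the number of selected words rather than the number of masked characters.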
[0086] For each piece of text data to be processed, after the server determines the target word corresponding to the text data to be processed, it can replace the encodings of the characters contained in the target word, in the character encoding sequence of the text data to be processed, with a preset specified code used to mark the masking position. This realizes the masking operation and yields the sample data for training the language model.
[0087] For example, suppose the text data to be processed is "efficient pre-training", and the character encoding sequence of the text data is "[101, 7770, 3126, 7564, 6378, 5298, 102, 0]". The target word of the text data is "efficient". Then, in the character encoding sequence of the text data to be processed, the encodings of the characters contained in the target word "efficient" can be replaced with the preset specified code to obtain the masked sample data "[101, 103, 103, 7564, 6378, 5298, 102, 0]", where "103" is the preset specified code used to mark the masking position.
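The replacement in this example can be written as a short Python sketch. It assumes that a target word is represented by a row vector holding 1 at the positions of its characters; the names `mask_target_words` and `MASK_ID` are illustrative and not from the specification.

```python
MASK_ID = 103  # the preset specified code from the example above

def mask_target_words(char_ids, target_rows, mask_id=MASK_ID):
    """Replace the encoding at every position marked by a target row."""
    masked = list(char_ids)  # copy so the input sequence is left unchanged
    for row in target_rows:
        for pos, flag in enumerate(row):
            if flag:
                masked[pos] = mask_id
    return masked

char_ids = [101, 7770, 3126, 7564, 6378, 5298, 102, 0]
efficient_row = [0, 1, 1, 0, 0, 0, 0, 0]  # positions of "efficient"
sample = mask_target_words(char_ids, [efficient_row])
# sample == [101, 103, 103, 7564, 6378, 5298, 102, 0]
```

Because every character position of the target word is replaced together, the whole word is masked rather than a single character.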
[0088] It is worth noting that in practical applications, the server can acquire each piece of text data to be processed. After the first processing unit determines the encoding sequence corresponding to each piece of text data to be processed, the second processing unit can generate the feature matrix corresponding to each piece of text data to be processed based on that encoding sequence. The feature matrices corresponding to the pieces of text data to be processed together form a tensor of size [batch_size, L, L]. This tensor contains batch_size feature matrices of size [L, L], where batch_size is the number of pieces of text data to be processed. Each feature matrix of size [L, L] corresponds to one piece of text data to be processed. Then, the server can process each feature matrix contained in the tensor through the sub-processing units contained in the second processing unit to obtain the sample data corresponding to each piece of text data to be processed.
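The [batch_size, L, L] layout can be illustrated with nested Python lists. This sketch only shows the shape of the data: the row layout (1 at the character positions of each word) is an assumption, the second text is made up for illustration, and on a GPU the tensor would normally live in an array library so that the matrices can be processed in parallel.

```python
def build_batch_tensor(batch_spans, max_len):
    """Return a nested list of shape [batch_size, max_len, max_len]."""
    tensor = []
    for spans in batch_spans:
        matrix = [[0] * max_len for _ in range(max_len)]
        for row, (start, end) in enumerate(spans):
            for pos in range(start, end):
                matrix[row][pos] = 1  # word `row` covers position `pos`
        tensor.append(matrix)
    return tensor

batch_spans = [
    [(0, 1), (1, 3), (3, 6), (6, 7)],  # "efficient pre-training"
    [(0, 1), (1, 4), (4, 5)],          # a second, made-up text
]
tensor = build_batch_tensor(batch_spans, max_len=8)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Each entry of `tensor` is an independent [L, L] feature matrix, so the per-text steps (rearranging rows, selecting target words, masking) do not depend on one another and can be assigned to separate sub-processing units.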
[0089] To facilitate understanding, this specification also provides a schematic diagram of the sample data generation process, as shown in Figure 4.
[0090] Figure 4 is a schematic diagram of the process for generating sample data provided in this specification.
[0091] As can be seen from Figure 4, after the server obtains each piece of text data to be processed, it can encode each piece of text data to be processed through the first processing unit to obtain the encoding sequence corresponding to each piece of text data. When the language model needs to be trained, the encoding sequence corresponding to each piece of text data to be processed can be imported into the second processing unit. The second processing unit generates rearrangement information corresponding to each piece of text data to be processed. Then, the feature matrix corresponding to each piece of text data to be processed can be constructed in parallel by the sub-processing units in the second processing unit. According to the pre-generated rearrangement information, the position order of the row vectors in the feature matrix corresponding to each piece of text data to be processed is adjusted to obtain the rearrangement matrix corresponding to each piece of text data to be processed. Then, sample data for training the language model can be generated according to the rearrangement matrix corresponding to each piece of text data to be processed.
[0092] It should be noted that all actions involving the acquisition of signals, information, or data in this specification are performed in accordance with the relevant data protection laws and regulations of the country where the device is located, and with the authorization granted by the owner of the relevant device.
[0093] As can be seen from the above method, this scheme divides the sample data generation process into two parts, which are executed by the first processing unit and the second processing unit of the server, respectively. The first processing unit preprocesses each text data to be processed to obtain the word encoding sequence and character encoding sequence of each text data to be processed. Then, each text data to be processed is processed in parallel by the sub-processing units of the second processing unit to obtain each sample data. Therefore, the situation where the second processing unit is idle when the first processing unit processes the text data to be processed can be avoided, thereby increasing the utilization rate of the second processing unit. Furthermore, since the second processing unit can determine the target words that need to be masked based on the rearrangement matrix and mask the characters contained in the target words, it can improve the processing efficiency of the text data to be processed and save the time required for training the language model, while realizing full word masking and dynamic masking.
[0094] The above describes the method for generating sample data provided by one or more embodiments of this specification. Based on the same approach, this specification also provides a corresponding sample data generation device, as shown in Figure 5.
[0095] Figure 5 is a schematic diagram of a sample data generation device provided in this specification. The device includes:
[0096] The acquisition module 501 is used to acquire the text data to be processed.
[0097] The first generation module 502 is used to generate an encoding sequence corresponding to each text data to be processed through the first processing unit.
[0098] The second generation module 503 is used to import the encoding sequence corresponding to each text data to be processed into the second processing unit, so that the second processing unit can replace the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding in parallel to generate each sample data.
[0099] Optionally, the second processing unit includes sub-processing units;
[0100] The second generation module 503 is specifically used to process each text data to be processed in parallel by each sub-processing unit to obtain sample data corresponding to each text data to be processed; wherein: for each sub-processing unit, the position of each word contained in the text data to be processed is determined by the sub-processing unit; based on the position of each word contained in the text data to be processed, a feature matrix corresponding to the text data to be processed is constructed; based on the feature matrix corresponding to the text data to be processed, at least some of the characters in the encoding sequence corresponding to the text data to be processed are replaced with a specified encoding to obtain sample data corresponding to the text data to be processed.
[0101] Optionally, the second generation module 503 is specifically used to: for each word contained in the text data to be processed, determine the row vector corresponding to the word according to the position of each character contained in the word in the text data to be processed; and sort the row vectors corresponding to each word contained in the text data to be processed according to the position of each word in the text data to be processed, so as to construct the feature matrix corresponding to the text data to be processed.
[0102] Optionally, the second generation module 503 is specifically used to: for each sub-processing unit, randomly adjust the position of each word in the text data to be processed in the text data to be processed, to obtain the rearrangement information corresponding to the text data to be processed; and based on the rearrangement information corresponding to the text data to be processed and the feature matrix corresponding to the text data to be processed, replace the encoding of at least some characters in the encoding sequence corresponding to the text data to be processed with the specified encoding.
[0103] Optionally, the second generation module 503 is specifically used to: adjust the position order between each row vector in the feature matrix corresponding to the text data to be processed according to the rearrangement information corresponding to the text data to be processed, to obtain the rearrangement matrix corresponding to the text data to be processed; and replace the encoding of at least some characters in the encoding sequence corresponding to the text data to be processed with the specified encoding according to the rearrangement matrix corresponding to the text data to be processed.
[0104] Optionally, the second generation module 503 is specifically used to: determine the target word based on the rearrangement matrix corresponding to the text data to be processed; and replace the encoding of the characters contained in the target word with a specified encoding in the encoding sequence corresponding to the text data to be processed.
[0105] Optionally, the second generation module 503 is specifically configured to: for each row vector in the rearrangement matrix corresponding to the text data to be processed, determine the sum of the number of characters contained in the word corresponding to the row vector and the number of characters contained in the words corresponding to the row vector preceding the row vector in the rearrangement matrix; determine whether the sum of the number of characters contained in the word corresponding to the row vector and the number of characters contained in the words corresponding to the row vector preceding the row vector in the rearrangement matrix exceeds a preset threshold; if not, determine that the word corresponding to the row vector is the target word of the text data to be processed.
[0106] This specification also provides a computer-readable storage medium storing a computer program that can be used to execute the method for generating sample data provided in Figure 1 above.
[0107] This specification also provides a schematic structural diagram, shown in Figure 6, of an electronic device corresponding to Figure 1. As shown in Figure 6, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, memory, and non-volatile memory, and may also include other hardware required for the business operations. The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it to implement the method for generating sample data described in Figure 1 above. Of course, in addition to software implementation, this specification does not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the following processing flow is not limited to logic units, but can also be hardware or logic devices.
[0108] In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to the circuit structure of diodes, transistors, switches, etc.) or software improvements (improvements to the methodology). However, with technological advancements, many methodological improvements today can be considered direct improvements to the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved methodology into the hardware circuit. Therefore, it cannot be said that a methodological improvement cannot be implemented using hardware physical modules. For example, a Programmable Logic Device (PLD) (such as a Field Programmable Gate Array (FPGA)) is such an integrated circuit whose logic function is determined by the user programming the device. Designers can program and "integrate" a digital system onto a PLD themselves, without needing chip manufacturers to design and manufacture dedicated integrated circuit chips. Furthermore, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented using "logic compiler" software. Similar to the software compiler used in program development, the original code before compilation must be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. 
Those skilled in the art should understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.
[0109] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.
[0110] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
[0111] For ease of description, the above devices are described in terms of function, divided into various units. Of course, in implementing this specification, the functions of each unit can be implemented in one or more software and / or hardware components.
[0112] Those skilled in the art will understand that embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0113] This specification is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this specification. It will be understood that each process and / or box of the flowchart illustrations and / or block diagrams, and combinations of processes and / or boxes in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0114] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0115] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0116] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0117] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0118] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0119] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0120] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0121] This specification can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This specification can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0122] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments.
[0123] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification.
Claims
1. A method of generating sample data, characterized in that the method includes: acquiring each text data to be processed; for each text data to be processed, generating an encoding sequence corresponding to the text data to be processed through the first processing unit; importing the encoding sequence corresponding to each text data to be processed into the second processing unit, so that the second processing unit replaces, in parallel, the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding, and generates each sample data; wherein the second processing unit includes sub-processing units; and replacing, in parallel, the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding specifically includes: processing each text data to be processed in parallel by each sub-processing unit to obtain sample data corresponding to each text data to be processed; wherein: for each sub-processing unit, the position of each word in the text data to be processed is determined; based on the position of each word in the text data to be processed, a feature matrix corresponding to the text data to be processed is constructed; based on the feature matrix corresponding to the text data to be processed, the encoding of at least some characters in the encoding sequence corresponding to the text data to be processed is replaced with the specified encoding to obtain sample data corresponding to the text data to be processed.
2. The method of claim 1, wherein constructing the feature matrix corresponding to the text data to be processed based on the position of each word in the text data to be processed specifically includes: for each word in the text data to be processed, determining the row vector corresponding to the word based on the position, in the text data to be processed, of each character contained in the word; and sorting the row vectors corresponding to the words in the text data to be processed based on the position of each word in the text data to be processed, so as to construct the feature matrix corresponding to the text data to be processed.
3. The method of claim 1, wherein, Before the second processing unit replaces the encoding of at least a portion of the characters in the encoding sequence corresponding to each text data to be processed with the specified encoding in parallel, the method further includes: for each sub-processing unit, randomly adjusting the position of each word in the text data to be processed processed by the sub-processing unit to obtain the rearrangement information corresponding to the text data to be processed; the second processing unit replaces the encoding of at least a portion of the characters in the encoding sequence corresponding to each text data to be processed with the specified encoding in parallel, specifically including: based on the rearrangement information corresponding to the text data to be processed and the feature matrix corresponding to the text data to be processed, replacing the encoding of at least a portion of the characters in the encoding sequence corresponding to the text data to be processed with the specified encoding.
4. The method of claim 3, wherein, Based on the rearrangement information corresponding to the text data to be processed and the feature matrix corresponding to the text data to be processed, the encoding of at least some characters in the encoding sequence corresponding to the text data to be processed is replaced with the specified encoding. Specifically, this includes: adjusting the position order between the row vectors in the feature matrix corresponding to the text data to be processed according to the rearrangement information corresponding to the text data to be processed to obtain the rearrangement matrix corresponding to the text data to be processed; and replacing the encoding of at least some characters in the encoding sequence corresponding to the text data to be processed with the specified encoding according to the rearrangement matrix corresponding to the text data to be processed.
5. The method of claim 4, wherein, Based on the rearrangement matrix corresponding to the text data to be processed, at least some of the characters in the encoding sequence corresponding to the text data to be processed are replaced with the specified encoding. Specifically, this includes: determining the target word based on the rearrangement matrix corresponding to the text data to be processed; and replacing the encoding of the characters contained in the target word in the encoding sequence corresponding to the text data to be processed with the specified encoding.
6. The method of claim 5, wherein, Based on the rearrangement matrix corresponding to the text data to be processed, the target word is determined, specifically including: for each row vector in the rearrangement matrix corresponding to the text data to be processed, determining the number of characters contained in the word corresponding to that row vector, and the sum of the number of characters contained in the words corresponding to the row vectors preceding that row vector in the rearrangement matrix; determining whether the sum of the number of characters contained in the word corresponding to that row vector and the sum of the number of characters contained in the words corresponding to the row vectors preceding that row vector in the rearrangement matrix exceeds a preset threshold; if not, then the word corresponding to that row vector is determined to be the target word of the text data to be processed.
7. A sample data generation apparatus, characterized in that the apparatus includes: an acquisition module, used to acquire each text data to be processed; a first generation module, used to generate the encoding sequence corresponding to each text data to be processed through the first processing unit; and a second generation module, used to import the encoding sequence corresponding to each text data to be processed into the second processing unit, so that the second processing unit replaces, in parallel, the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding to generate each sample data; wherein the second processing unit includes sub-processing units; and replacing, in parallel, the encoding of at least some characters in the encoding sequence corresponding to each text data to be processed with the specified encoding specifically includes: processing each text data to be processed in parallel by each sub-processing unit to obtain sample data corresponding to each text data to be processed; wherein: for each sub-processing unit, the position of each word in the text data to be processed is determined; based on the position of each word in the text data to be processed, a feature matrix corresponding to the text data to be processed is constructed; based on the feature matrix corresponding to the text data to be processed, the encoding of at least some characters in the encoding sequence corresponding to the text data to be processed is replaced with the specified encoding to obtain sample data corresponding to the text data to be processed.
8. A computer-readable storage medium, characterized in that, The storage medium stores a computer program, which, when executed by a processor, implements the method described in any one of claims 1 to 6.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method described in any one of claims 1 to 6.