Method and device for training model, equipment, medium and program product

A model training technology applied in the field of deep learning, which addresses problems such as increasingly complex models and achieves the effect of improving the generalization ability and robustness of the model

Active Publication Date: 2022-02-25
BEIJING BAIDU NETCOM SCI & TECH CO LTD
6 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, as the model becomes more and more complex, the problems of mode...

Method used

Thus, by maximizing the first KL divergence, the output result generated by the model based on the first perturbation data and the first word feature representation set deviates to the greatest extent from the output result generated by the model based on the first word feature representation set alone, which helps improve the generalization ability and robustness of the model.
Thus, by maximizing the fourth KL divergence, the output result generated by the model based on the first perturbation data, the first word feature representation set of the source language text, the second perturbation data, and the third word feature representation set of the target language text deviates to the greatest extent from the output result generated by the model based on the first word feature representation set and the third word feature representation set, which helps improve the generalization ability and robustness of the model.
Thus, by using complementary mask data during training to complementarily mask the perturbation data, and perturbing the first word feature representation set of the training text with the complementarily masked perturbation data, a perturbed second word feature representation set is generated for training the model, which can improve the generalization ability and robustness of the model.
Thus, by using complementary mask data during training to complementarily mask the perturbation data, two complementarily masked perturbation data are generated, and the results of the two local perturbations of the first word feature representation set or the third word feature representation set are weighted, so that the generated second word feature representation set and fourth word feature representation set can reflect more diverse adversarial perturbation directions, which can improve the generalization ability and robustness of the model.
Thus, by using complementary mask data during training to complementarily mask the perturbation data, and through the perturbat...

Abstract

The invention provides a method and device for training a model, equipment, a medium and a program product, and relates to the field of deep learning. According to the specific implementation scheme, the method comprises the steps of generating first perturbation data, wherein the first perturbation data is used for perturbing a first word feature representation set associated with a training text of a model; generating first mask data and first complementary mask data, wherein the first mask data is used for masking a first part of the data in the first perturbation data, and the first complementary mask data is used for masking the data other than the first part of the data in the first perturbation data; generating first masked perturbation data based on the first mask data and the first perturbation data; generating second masked perturbation data based on the first complementary mask data and the first perturbation data; and generating, based on the first masked perturbation data, the second masked perturbation data and the first word feature representation set, a second word feature representation set for training the model. Therefore, the generalization ability and robustness of the model can be improved.

Application Domain

Natural language translation; Character and pattern recognition; +3

Technology Topic

Engineering; Algorithm; +4


Examples

  • Experimental program(1)

Example Embodiment

[0024] Exemplary embodiments of the present disclosure are described below, including various details of the embodiments of the present disclosure to facilitate understanding, and they should be considered merely exemplary. Accordingly, those skilled in the art will appreciate that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Also, for the sake of clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
[0025] As mentioned above, model training suffers from the problem of poor robustness. Traditional perturbation-based model training methods mainly include the following: 1) word replacement, in which a portion of the words in the parallel corpus are randomly replaced with arbitrary words from the vocabulary; 2) word discarding, in which all-zero word vectors are randomly used in place of the true word vectors; 3) virtual adversarial training, in which a regularization term makes the model insensitive to perturbations of the input. However, the traditional model training methods do not provide a good solution to the problem that the model is prone to overfitting.
[0026] In order to solve one or more of the above problems and other potential problems, an exemplary embodiment of the present disclosure proposes a scheme for training a model for natural language processing, in which a computing device generates first perturbation data, the first perturbation data being used to perturb a first word feature representation set, the first word feature representation set being associated with a training text of the model. The computing device also generates first mask data and first complementary mask data, the first mask data being used to mask a first part of the data in the first perturbation data, and the first complementary mask data being used to mask the data other than the first part of the data in the first perturbation data. Subsequently, the computing device generates first masked perturbation data based on the first mask data and the first perturbation data, and generates second masked perturbation data based on the first complementary mask data and the first perturbation data. Next, the computing device generates, based on the first masked perturbation data, the second masked perturbation data, and the first word feature representation set, a second word feature representation set for training the model. According to the scheme of the present disclosure, the perturbation data is complementarily masked by using complementary mask data during training, and the first word feature representation set of the training text is perturbed by the complementarily masked perturbation data, thereby generating a perturbed second word feature representation set for training the model, which can improve the generalization ability and robustness of the model.
[0027] Hereinafter, specific embodiments of the present disclosure will be described in more detail in connection with the accompanying drawings.
[0028] FIG. 1 shows an example of an information processing environment 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the information processing environment 100 includes a computing device 110, a model 120 for natural language processing, a training text 130 of the model 120, a first word feature representation set 140 associated with the training text 130, first perturbation data 150, first mask data 160, first complementary mask data 170, and a second word feature representation set 180.
[0029] Computing device 110 can include a server, a desktop computer, a tablet computer, a personal computer, and the like.
[0030] The model 120 for natural language processing includes, for example, a text classification model, a translation model, and the like. The model 120 can be implemented as a neural network.
[0031] The training text 130 can include a sentence. The sentence can have a plurality of words, for example I words. For each word, a word feature representation, such as a word vector, can be generated. The number of elements of the word vector of each word may be the same, for example D. Thus, for the training text 130, a first word feature representation set 140 can be generated, and the first word feature representation set 140 can include a plurality of word feature representations. The first word feature representation set 140 can take, for example, the form of a matrix, each column of the matrix being a word vector, the number of columns of the matrix being the number of words, and the number of rows of the matrix being the number of elements of a word vector, so that the dimension of the matrix can be D×I. For example, if the training text 130 is "Today is sunny" and includes 5 words, and the number of elements of the word vector of each word is, for example, 10, then the first word feature representation set 140 is a 10×5 matrix.
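As a minimal sketch (the sizes, vocabulary, and embedding table below are illustrative assumptions, not part of the disclosure), the first word feature representation set can be built by stacking the word vectors of the training text column-wise:
```python
import torch

# Hypothetical sizes for illustration only: D-dimensional word vectors, I words.
D, I = 10, 5                                    # e.g. a 5-word sentence, 10 elements per vector
vocab_size = 30000                              # assumed vocabulary size

embedding = torch.nn.Embedding(vocab_size, D)   # word vector table
token_ids = torch.randint(0, vocab_size, (I,))  # ids of the I words of the training text

# e(x): first word feature representation set, arranged as a D x I matrix
e_x = embedding(token_ids).T                    # shape (D, I), here (10, 5)
print(e_x.shape)                                # torch.Size([10, 5])
```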
[0032] The dimension of the first perturbation data 150 can be the same as that of the first word feature representation set 140, for example D×I. The first perturbation data 150 may, for example, be randomly generated.
[0033] The dimensions of the first mask data 160 and the first complementary mask data 170 may be the same as the dimension of the first word feature representation set 140, for example D×I. The elements of the first mask data 160 and the first complementary mask data 170 can be binary, for example 0 denoting masking and 1 denoting no masking. For example, the first mask data 160 can be represented as M_x ∈ {0,1}^{D×I}, and the first complementary mask data 170 can be represented as (1_x − M_x), where 1_x ∈ {1}^{D×I}.
[0034] The computing device 110 is configured to: generate first perturbation data 150, the first perturbation data 150 being used to perturb the first word feature representation set 140, the first word feature representation set 140 being associated with the training text 130 of the model 120; generate first mask data 160 and first complementary mask data 170, the first mask data 160 being used to mask a first part of the data in the first perturbation data 150, and the first complementary mask data 170 being used to mask the data other than the first part of the data in the first perturbation data 150; generate first masked perturbation data based on the first mask data 160 and the first perturbation data 150, and generate second masked perturbation data based on the first complementary mask data 170 and the first perturbation data 150; and generate, based on the first masked perturbation data, the second masked perturbation data, and the first word feature representation set 140, a second word feature representation set 180 for training the model 120.
[0035] Thereby, by complementarily masking the perturbation data with complementary mask data during training, and perturbing the first word feature representation set of the training text with the complementarily masked perturbation data, a perturbed second word feature representation set is generated for training the model, which can enhance the generalization ability and robustness of the model.
[0036] FIG. 2 shows a flowchart of a method 200 for training a model for natural language processing in accordance with an embodiment of the present disclosure. For example, the method 200 can be performed by the computing device 110 shown in FIG. 1. It should be understood that the method 200 may also include additional blocks not shown and/or may omit blocks that are shown; the scope of the present disclosure is not limited in this respect.
[0037] At block 202, the computing device 110 generates first perturbation data 150, the first perturbation data 150 being configured to perturb the first word feature representation set 140, the first word feature representation set 140 being associated with the training text 130 of the model 120.
[0038] In some embodiments, the first perturbation data 150 maximizes the first KL (Kullback-Leibler) divergence between the following first item and second item: the output result generated via the model 120 based on the first word feature representation set 140; and the output result generated via the model 120 based on the first word feature representation set 140 and the first perturbation data 150. Maximizing the first KL divergence can be expressed by the following formula (1).
[0039] δ'_x = argmax_{δ_x} L_KL(x, δ_x, θ)    (1)
[0040] where x denotes the training text 130, δ_x denotes the perturbation data for the training text 130, and L_KL(x, δ_x, θ) denotes the first KL divergence, which is defined by the following formula (2). The first perturbation data 150 can be obtained by solving formula (1).
[0041] L_KL(x, δ_x, θ) = KL( f(e(x); θ) ∥ f(e(x) + δ_x; θ) )    (2)
[0042] Here, f denotes the model 120, θ denotes the current parameters of the model 120, and e(x) denotes the first word feature representation set 140 of the training text 130 x. The former term inside the KL brackets denotes the output result generated via the model 120 based on the first word feature representation set 140, and the latter term denotes the output result generated via the model 120 based on the first word feature representation set 140 and the perturbation data.
[0043] Specifically, the computing device 110 can generate a first set of partial derivatives of the first KL divergence with respect to the first word feature representation set 140. For example, the first set of partial derivatives of the first KL divergence can be obtained by the following formula (3).
[0044] A_i = ∂L_KL(x, δ_x, θ) / ∂e(x_i)    (3)
[0045] where i denotes the i-th word in the training text 130, with i greater than or equal to 1 and less than or equal to I, and A_i denotes the partial derivative for the i-th word, i.e., the derivative of the first KL divergence with respect to the word vector e(x_i) of the i-th word.
[0046] Subsequently, the computing device 110 can generate the first perturbation data 150 based on the first set of partial derivatives and the Frobenius norm of the first set of partial derivatives. For example, the first perturbation data 150 can be generated by the following formula (4).
[0047] δ'_{x_i} = ε · A_i / ‖A‖_F    (4)
[0048] where δ'_{x_i} denotes the perturbation for the i-th word in the training text 130, ‖A‖_F denotes the Frobenius norm (F-norm) of the first set of partial derivatives, and ε denotes a preset value, a scaling hyperparameter used to control the F-norm of the perturbation data.
[0049] Thereby, by maximizing the first KL divergence, the output result of the model based on the first perturbation data and the first word feature representation set deviates to the greatest extent from the output result of the model based on the first word feature representation set alone, which helps improve the generalization ability and robustness of the model.
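The following PyTorch-style sketch illustrates formulas (1)-(4) under the assumption that `model` maps a word feature representation set to an output probability distribution; the function name, the random initialization of the perturbation (cf. paragraph [0032]), and the small stabilizing constants are illustrative choices, not the disclosed implementation:
```python
import torch
import torch.nn.functional as F

def first_perturbation(model, e_x, epsilon=1.0):
    """Approximate the perturbation maximizing the first KL divergence (formulas (1)-(4)).

    e_x: first word feature representation set, shape (D, I).
    Returns perturbation data of the same shape, with Frobenius norm controlled by epsilon.
    """
    # Randomly initialized perturbation (paragraph [0032] allows random generation).
    delta = (torch.randn_like(e_x) * 1e-3).requires_grad_(True)

    p_clean = model(e_x).detach()          # f(e(x); theta)
    p_pert = model(e_x + delta)            # f(e(x) + delta_x; theta)

    # Formula (2): first KL divergence KL(f(e(x)) || f(e(x) + delta_x)).
    kl = F.kl_div((p_pert + 1e-12).log(), p_clean, reduction="sum")

    # Formula (3): partial derivatives A_i; the gradient w.r.t. delta equals the
    # gradient w.r.t. e(x) here because the two enter the model only as a sum.
    (A,) = torch.autograd.grad(kl, delta)

    # Formula (4): scale by the Frobenius norm of the set of partial derivatives.
    return epsilon * A / (A.norm(p="fro") + 1e-12)
```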
[0050] Returning to FIG. 2, at block 204, the computing device 110 generates the first mask data 160 and the first complementary mask data 170, the first mask data 160 being used to mask a first part of the data in the first perturbation data 150, and the first complementary mask data 170 being used to mask the data other than the first part of the data in the first perturbation data 150.
[0051] Taking a first word feature representation set that is a 3×3 matrix as an example, the three rows of the first mask data 160 are, for example, (0, 0, 0), (0, 1, 0), (0, 0, 0), and the three rows of the first complementary mask data 170 are then, for example, (1, 1, 1), (1, 0, 1), (1, 1, 1). That is, the first mask data 160 masks all data in the 3×3 matrix other than the element in the second row and second column, while the first complementary mask data 170 masks the element in the second row and second column, so the first mask data 160 and the first complementary mask data 170 are complementary.
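A small sketch of the masks (sampling each mask element from a Bernoulli distribution is an assumption; the disclosure only requires binary, complementary masks):
```python
import torch

D, I = 3, 3                                       # toy size matching the 3x3 example above
M_x = torch.bernoulli(torch.full((D, I), 0.5))    # first mask data, elements in {0, 1}
M_x_comp = 1.0 - M_x                              # first complementary mask data

delta_x = torch.randn(D, I)                       # first perturbation data
masked_1 = M_x * delta_x                          # first masked perturbation data
masked_2 = M_x_comp * delta_x                     # second masked perturbation data

# The two masks keep disjoint parts of the perturbation and together cover all of it.
assert torch.allclose(masked_1 + masked_2, delta_x)
```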
[0052] At block 206, the computing device 110 generates the first masked perturbation data based on the first mask data 160 and the first perturbation data 150.
[0053] For example, the first mask data 160 and the first perturbation data 150 can be multiplied element-wise, thereby generating the first masked perturbation data.
[0054] At block 208, the computing device 110 generates the second masked perturbation data based on the first complementary mask data 170 and the first perturbation data 150.
[0055] For example, the first complementary mask data 170 and the first perturbation data 150 can be multiplied element-wise, thereby generating the second masked perturbation data.
[0056] At block 210, the computing device 110 generates, based on the first masked perturbation data, the second masked perturbation data, and the first word feature representation set 140, a second word feature representation set for training the model 120.
[0057] For example, the first masked perturbation data and the first word feature representation set 140 can be added element-wise to generate a first intermediate word feature representation set, and the second masked perturbation data and the first word feature representation set 140 can be added element-wise to generate a second intermediate word feature representation set. Subsequently, the second word feature representation set 180 is generated based on the first intermediate word feature representation set and the second intermediate word feature representation set, for example by averaging or weighting.
[0058] Thereby, by complementarily masking the perturbation data with complementary mask data during training, and perturbing the first word feature representation set of the training text with the complementarily masked perturbation data, a perturbed second word feature representation set is generated for training the model, which can enhance the generalization ability and robustness of the model.
[0059] FIG. 3 shows a flow diagram of a method 300 for generating a second word feature representation set in accordance with an embodiment of the present disclosure. For example, the method 300 can be performed by the computing device 110 shown in FIG. 1. It should be understood that the method 300 may also include additional blocks not shown and/or may omit blocks that are shown; the scope of the present disclosure is not limited in this respect.
[0060] At block 302, the computing device 110 generates a first masked perturbation result based on the first masked perturbation data and the first word feature representation set 140.
[0061] The first masked perturbation result can be expressed by the following formula (5).
[0062] R_1(x) = e(x) + M_x ⊙ δ'_x    (5)
[0063] where R_1(x) denotes the first masked perturbation result, e(x) denotes the first word feature representation set 140, and M_x ⊙ δ'_x denotes the first masked perturbation data generated by multiplying the first mask data 160 and the first perturbation data 150 element-wise (⊙ denotes element-wise multiplication).
[0064] At block 304, the computing device 110 generates a second masked perturbation result based on the second masked perturbation data and the first word feature representation set 140.
[0065] The second masked perturbation result can be expressed by the following formula (6).
[0066] R_2(x) = e(x) + (1_x − M_x) ⊙ δ'_x    (6)
[0067] where R_2(x) denotes the second masked perturbation result, e(x) denotes the first word feature representation set 140, and (1_x − M_x) ⊙ δ'_x denotes the second masked perturbation data generated by multiplying the first complementary mask data 170 and the first perturbation data 150 element-wise.
[0068] At block 306, the computing device 110 generates the second word feature representation set based on the first masked perturbation result and the second masked perturbation result.
[0069] Specifically, the computing device 110 can generate a weight value based on a predetermined distribution function, for example a weight value λ ~ U(0, 1), where U denotes a uniform distribution.
[0070] Subsequently, the computing device 110 can generate a first weighted result based on the weight value and the first masked perturbation result. The computing device 110 can also generate a second weighted result based on the complementary weight value of the weight value and the second masked perturbation result. The complementary weight value is, for example, 1 − λ.
[0071] Next, the computing device 110 can generate the second word feature representation set based on the first weighted result and the second weighted result. The second word feature representation set can be generated by the following formula (7).
[0072] R(x) = λ · R_1(x) + (1 − λ) · R_2(x)    (7)
[0073] where R(x) denotes the second word feature representation set, λ denotes the weight value, R_1(x) denotes the first masked perturbation result, and R_2(x) denotes the second masked perturbation result.
[0074] Thereby, by weighting the results of the two local perturbations of the first word feature representation set, the generated second word feature representation set can reflect more diverse perturbation directions, thereby improving the generalization ability and robustness of the model.
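A sketch of formulas (5)-(7); the helper name and argument layout are assumptions for illustration:
```python
import torch

def perturbed_representation(e_x, delta_x, M_x):
    """Generate the second word feature representation set R(x) per formulas (5)-(7)."""
    result_1 = e_x + M_x * delta_x             # first masked perturbation result, formula (5)
    result_2 = e_x + (1.0 - M_x) * delta_x     # second masked perturbation result, formula (6)

    lam = torch.rand(1).item()                 # weight value lambda ~ U(0, 1)
    return lam * result_1 + (1.0 - lam) * result_2   # formula (7)
```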
[0075] In some embodiments, the computing device 110 can also generate a first output result via the model 120 based on the first word feature representation set. The first output result can be expressed as ω = f(e(x); θ). The computing device 110 can also generate a second output result via the model 120 based on the second word feature representation set. The second output result can be expressed as f(R(x); θ).
[0076] Next, the computing device 110 can generate a second KL divergence between the first output result and the second output result, and a third KL divergence between the second output result and the first output result. The second KL divergence can be expressed as KL(ω ∥ f(R(x); θ)). The third KL divergence can be expressed as KL(f(R(x); θ) ∥ ω).
[0077] Subsequently, the computing device 110 can generate a loss associated with the training text 130 based on the second KL divergence and the third KL divergence, the loss associated with the training text 130 being used to update the parameters of the model 120. For example, the second KL divergence and the third KL divergence can be averaged to generate the loss associated with the training text 130. For example, the loss can be expressed as (KL(ω ∥ f(R(x); θ)) + KL(f(R(x); θ) ∥ ω)) / 2.
[0078] In addition, for multiple training texts in batch training, the computing device 110 can average the multiple losses associated with the multiple training texts to generate a loss associated with the batch training, so as to update the parameters of the model 120.
[0079] Thereby, the loss is generated from the two KL divergences between the first output result, based on the unperturbed original training sample, and the second output result, based on the perturbed training sample, so that the loss is more symmetric, thereby improving the generalization ability and robustness of the model.
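A sketch of this symmetric loss, assuming `p_clean` and `p_pert` hold the output probability distributions ω and f(R(x); θ) for a batch of training texts, one row per text (the names and the stabilizing constant are illustrative):
```python
import torch.nn.functional as F

def symmetric_kl_loss(p_clean, p_pert, eps=1e-12):
    """Average of the second and third KL divergences."""
    kl_2 = F.kl_div((p_pert + eps).log(), p_clean, reduction="batchmean")   # KL(omega || f(R(x)))
    kl_3 = F.kl_div((p_clean + eps).log(), p_pert, reduction="batchmean")   # KL(f(R(x)) || omega)
    # reduction="batchmean" also averages over the texts in the batch (paragraph [0078]).
    return 0.5 * (kl_2 + kl_3)
```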
[0080] FIG. 4 shows a schematic block diagram of a model 400 in accordance with an embodiment of the present disclosure. As shown in FIG. 4, the model 400 includes an encoder 410 and a decoder 420. For example, the model 400 can be a translation model. The training text of the model 400 can include source language text, such as Chinese text, and the training label associated with the training text may include target language text, such as English text.
[0081] The first word feature representation set 430 can be used as the input of the encoder 410, and the third word feature representation set 450 can be used as the input of the decoder 420.
[0082] FIG. 5 shows a flow diagram of a method 500 for training a model for natural language processing in accordance with an embodiment of the present disclosure. For example, the method 500 can be performed by the computing device 110 shown in FIG. 1. It should be understood that the method 500 may also include additional blocks not shown and/or may omit blocks that are shown; the scope of the present disclosure is not limited in this respect.
[0083] At block 502, the computing device 110 generates the first perturbation data 150 and second perturbation data, the first perturbation data 150 being used to perturb the first word feature representation set 430, the first word feature representation set 430 being associated with the source language text, and the second perturbation data being used to perturb the third word feature representation set 450 associated with the target language text.
[0084] In some embodiments, the target language text can have a plurality of words, for example J words. For each word, a word feature representation, such as a word vector, can be generated. The number of elements of the word vector of each word may be the same, for example D. Thus, for the target language text, a third word feature representation set can be generated, and the third word feature representation set can include a plurality of word feature representations. The third word feature representation set can take, for example, the form of a matrix, each column of the matrix being a word vector, the number of columns of the matrix being the number of words, and the number of rows of the matrix being the number of elements of a word vector, so that the dimension of the matrix can be D×J. For example, if the target language text is "Today is sunny" and includes 3 words, and the number of elements of the word vector of each word is, for example, 10, then the third word feature representation set is a 10×3 matrix.
[0085] The dimension of the second perturbation data can be the same as that of the third word feature representation set, for example D×J. The second perturbation data may, for example, be randomly generated.
[0086] In some embodiments, the first perturbation data 150 and the second perturbation data maximize the fourth KL divergence between the following first item and second item: the output result generated via the model 400 with the first word feature representation set 430 as the input of the encoder 410 and the third word feature representation set 450 as the input of the decoder 420; and the output result generated via the model 400 with the first word feature representation set 430 plus the first perturbation data 150 as the input of the encoder 410 and the third word feature representation set 450 plus the second perturbation data as the input of the decoder 420. Maximizing the fourth KL divergence can be expressed by the following formula (8).
[0087] (δ'_x, δ'_y) = argmax_{δ_x, δ_y} L_KL(x, y, δ_x, δ_y, θ)    (8)
[0088] where x denotes the source language text, y denotes the target language text, δ_x denotes the perturbation data for the source language text, δ_y denotes the perturbation data for the target language text, and L_KL(x, y, δ_x, δ_y, θ) denotes the fourth KL divergence, which is defined by the following formula (9). The first perturbation data and the second perturbation data can be obtained by solving formula (8).
[0089] L_KL(x, y, δ_x, δ_y, θ) = KL( f(e(x), e(y); θ) ∥ f(e(x) + δ_x, e(y) + δ_y; θ) )    (9)
[0090] Here, f denotes the model 400, θ denotes the current parameters of the model 400, e(x) denotes the first word feature representation set 430 of the source language text, and e(y) denotes the third word feature representation set of the target language text. The former term inside the KL brackets denotes the output result generated via the model 400 with the first word feature representation set 430 as the input of the encoder 410 and the third word feature representation set 450 as the input of the decoder 420, and the latter term denotes the output result generated via the model 400 with the first word feature representation set 430 plus the first perturbation data 150 as the input of the encoder 410 and the third word feature representation set 450 plus the second perturbation data as the input of the decoder 420.
[0091] Specifically, the computing device 110 can generate a first set of partial derivatives of the fourth KL divergence with respect to the first word feature representation set 430. For example, the first set of partial derivatives of the fourth KL divergence can be obtained by the following formula (10).
[0092] A_i = ∂L_KL(x, y, δ_x, δ_y, θ) / ∂e(x_i)    (10)
[0093] where i denotes the i-th word in the source language text, with i greater than or equal to 1 and less than or equal to I, and A_i denotes the partial derivative for the i-th word, i.e., the derivative of the fourth KL divergence with respect to the word vector e(x_i) of the i-th word.
[0094] Subsequently, the computing device 110 can generate the first perturbation data 150 based on the first set of partial derivatives and the Frobenius norm of the first set of partial derivatives. For details, refer to formula (4); they are not repeated here.
[0095] Specifically, the computing device 110 can generate, based on the third word feature representation set 450, a second set of partial derivatives of the fourth KL divergence. For example, the second set of partial derivatives of the fourth KL divergence can be obtained by the following formula (11).
[0096] B_j = ∂L_KL(x, y, δ_x, δ_y, θ) / ∂e(y_j)    (11)
[0097] where j denotes the j-th word in the target language text, with j greater than or equal to 1 and less than or equal to J, and B_j denotes the partial derivative for the j-th word, i.e., the derivative of the fourth KL divergence with respect to the word vector e(y_j) of the j-th word.
[0098] Subsequently, the computing device 110 can generate the second perturbation data based on the second set of partial derivatives and the Frobenius norm of the second set of partial derivatives. For example, the second perturbation data can be generated by the following formula (12).
[0099] δ'_{y_j} = ε · B_j / ‖B‖_F    (12)
[0100] where δ'_{y_j} denotes the perturbation for the j-th word in the target language text, ‖B‖_F denotes the Frobenius norm (F-norm) of the second set of partial derivatives, and ε denotes a preset value, a scaling hyperparameter used to control the F-norm of the perturbation data.
[0101] Thereby, by maximizing the fourth KL divergence, the output result of the model based on the first perturbation data, the first word feature representation set of the source language text, the second perturbation data, and the third word feature representation set of the target language text deviates to the greatest extent from the output result of the model based on the first word feature representation set and the third word feature representation set, which helps improve the generalization ability and robustness of the model.
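A sketch of the joint case in formulas (8)-(12), assuming a callable `model(e_x, e_y)` that takes the encoder and decoder input representation sets and returns output probabilities (the function name and random initialization are illustrative assumptions):
```python
import torch
import torch.nn.functional as F

def joint_perturbations(model, e_x, e_y, epsilon=1.0):
    """Approximate the perturbations maximizing the fourth KL divergence (formulas (8)-(12))."""
    delta_x = (torch.randn_like(e_x) * 1e-3).requires_grad_(True)   # for the source side
    delta_y = (torch.randn_like(e_y) * 1e-3).requires_grad_(True)   # for the target side

    p_clean = model(e_x, e_y).detach()               # f(e(x), e(y); theta)
    p_pert = model(e_x + delta_x, e_y + delta_y)     # perturbed inputs, formula (9)

    kl_4 = F.kl_div((p_pert + 1e-12).log(), p_clean, reduction="sum")   # fourth KL divergence
    A, B = torch.autograd.grad(kl_4, (delta_x, delta_y))               # formulas (10) and (11)

    delta_x_new = epsilon * A / (A.norm(p="fro") + 1e-12)   # source perturbation, cf. formula (4)
    delta_y_new = epsilon * B / (B.norm(p="fro") + 1e-12)   # target perturbation, formula (12)
    return delta_x_new, delta_y_new
```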
[0102] At block 504, the computing device 110 generates the first mask data 160, the first complementary mask data 170, second mask data, and second complementary mask data. The first mask data 160 is configured to mask a first part of the data in the first perturbation data 150, and the first complementary mask data 170 is configured to mask the data other than the first part of the data in the first perturbation data 150. The second mask data is configured to mask a second part of the data in the second perturbation data, and the second complementary mask data is configured to mask the data other than the second part of the data in the second perturbation data.
[0103] The generation of the first mask data 160 and the first complementary mask data 170 can be performed with reference to the description above, and details are not repeated here.
[0104] The dimensions of the second mask data and the second complementary mask data may be the same as the dimension of the third word feature representation set 450, for example D×J. The elements of the second mask data and the second complementary mask data can be binary, for example 0 denoting masking and 1 denoting no masking. For example, the second mask data can be represented as M_y ∈ {0,1}^{D×J}, and the second complementary mask data can be represented as (1_y − M_y), where 1_y ∈ {1}^{D×J}.
[0105] At block 506, the computing device 110 generates the first masked perturbation data based on the first mask data 160 and the first perturbation data 150.
[0106] At block 508, the computing device 110 generates the second masked perturbation data based on the first complementary mask data 170 and the first perturbation data 150.
[0107] At block 510, the computing device 110 generates, based on the first masked perturbation data, the second masked perturbation data, and the first word feature representation set 430, a second word feature representation set 440 for training the model 400.
[0108] At block 512, the computing device 110 generates third masked perturbation data based on the second mask data and the second perturbation data.
[0109] For example, the second mask data and the second perturbation data can be multiplied element-wise, thereby generating the third masked perturbation data.
[0110] At block 514, the computing device 110 generates fourth masked perturbation data based on the second complementary mask data and the second perturbation data.
[0111] For example, the second complementary mask data and the second perturbation data can be multiplied element-wise, thereby generating the fourth masked perturbation data.
[0112] At block 516, the computing device 110 generates, based on the third masked perturbation data, the fourth masked perturbation data, and the third word feature representation set 450, a fourth word feature representation set 460 that is used, as the input of the decoder 420, to train the model 400.
[0113] It should be understood that blocks 506-510 and blocks 512-516 are shown in FIG. 5 as being performed in parallel, but this is only an example; blocks 506-510 and blocks 512-516 can also be performed sequentially, and the scope of the present disclosure is not limited in this respect.
[0114] Thus, by complementarily masking the perturbation data with complementary mask data during training, and perturbing the first word feature representation set and the third word feature representation set with the complementarily masked perturbation data, a perturbed second word feature representation set and a perturbed fourth word feature representation set are generated for training the model, which can improve the generalization ability and robustness of the model.
[0115] FIG. 6 shows a flow diagram of a method 600 for generating a fourth word feature representation set in accordance with an embodiment of the present disclosure. For example, the method 600 can be performed by the computing device 110 shown in FIG. 1. It should be understood that the method 600 may also include additional blocks not shown and/or may omit blocks that are shown; the scope of the present disclosure is not limited in this respect.
[0116] At block 602, the computing device 110 generates a third masked perturbation result based on the third masked perturbation data and the third word feature representation set 450.
[0117] The third masked perturbation result can be expressed by the following formula (13).
[0118] R_1(y) = e(y) + M_y ⊙ δ'_y    (13)
[0119] where R_1(y) denotes the third masked perturbation result, e(y) denotes the third word feature representation set 450, and M_y ⊙ δ'_y denotes the third masked perturbation data generated by multiplying the second mask data and the second perturbation data element-wise.
[0120] At block 604, the computing device 110 generates a fourth masked perturbation result based on the fourth masked perturbation data and the third word feature representation set 450.
[0121] The fourth masked perturbation result can be expressed by the following formula (14).
[0122] R_2(y) = e(y) + (1_y − M_y) ⊙ δ'_y    (14)
[0123] where R_2(y) denotes the fourth masked perturbation result, e(y) denotes the third word feature representation set 450, and (1_y − M_y) ⊙ δ'_y denotes the fourth masked perturbation data generated by multiplying the second complementary mask data and the second perturbation data element-wise.
[0124] At block 606, the computing device 110 generates the fourth word feature representation set 460 based on the third masked perturbation result and the fourth masked perturbation result.
[0125] Specifically, the computing device 110 can generate a weight value based on a predetermined distribution function, for example a weight value λ ~ U(0, 1), where U denotes a uniform distribution.
[0126] Subsequently, the computing device 110 can generate a third weighted result based on the weight value and the third masked perturbation result. The computing device 110 can also generate a fourth weighted result based on the complementary weight value of the weight value and the fourth masked perturbation result. The complementary weight value is, for example, 1 − λ.
[0127] Next, the computing device 110 can generate the fourth word feature representation set 460 based on the third weighted result and the fourth weighted result. The fourth word feature representation set 460 can be generated by the following formula (15).
[0128] R(y) = λ · R_1(y) + (1 − λ) · R_2(y)    (15)
[0129] where R(y) denotes the fourth word feature representation set 460, λ denotes the weight value, R_1(y) denotes the third masked perturbation result, and R_2(y) denotes the fourth masked perturbation result.
[0130] Thereby, by weighting the results of the two local perturbations of the third word feature representation set, the generated fourth word feature representation set can reflect more diverse perturbation directions, thereby improving the generalization ability and robustness of the model.
[0131] In some embodiments, the computing device 110 can also generate, via the model 400, a first output result based on the first word feature representation set 430 and the third word feature representation set 450, where the first word feature representation set is used as the input of the encoder 410 and the third word feature representation set is used as the input of the decoder 420. The first output result can be expressed as ω = f(e(x), e(y); θ).
[0132] The computing device 110 can also generate, via the model 400, a second output result based on the second word feature representation set 440 and the fourth word feature representation set 460, where the second word feature representation set is used as the input of the encoder 410 and the fourth word feature representation set is used as the input of the decoder 420. The second output result can be expressed as f(R(x), R(y); θ).
[0133] Next, the computing device 110 can generate a fifth KL divergence between the first output result and the second output result, and a sixth KL divergence between the second output result and the first output result. The fifth KL divergence can be expressed as KL(ω ∥ f(R(x), R(y); θ)). The sixth KL divergence can be expressed as KL(f(R(x), R(y); θ) ∥ ω).
[0134] Subsequently, the computing device 110 can generate a loss associated with the training text 130 based on the fifth KL divergence and the sixth KL divergence, to update the parameters of the model 400. For example, the fifth KL divergence and the sixth KL divergence can be averaged to generate the loss associated with the training text 130. For example, the loss can be expressed as (KL(ω ∥ f(R(x), R(y); θ)) + KL(f(R(x), R(y); θ) ∥ ω)) / 2.
[0135] Furthermore, for multiple training texts in batch training, the computing device 110 can average the multiple losses associated with the multiple training texts to generate a loss associated with the batch training, so as to update the parameters of the model 400.
[0136] Thereby, the loss is generated from the two KL divergences between the first output result, based on the unperturbed original training sample, and the second output result, based on the perturbed training sample, so that the loss is more symmetric, thereby improving the generalization ability and robustness of the model.
[0137] FIG. 7 shows a process 700 for generating a second word feature representation set or a fourth word feature representation set in accordance with an embodiment of the present disclosure.
[0138] The first mask data 710 is multiplied element-wise with the first perturbation data 730 to generate first masked perturbation data 740. The first complementary mask data 720 is multiplied element-wise with the first perturbation data 730 to generate second masked perturbation data 750. The first masked perturbation data 740 is added element-wise to the first word feature representation set 760 to generate a first masked perturbation result. The second masked perturbation data 750 is added element-wise to the first word feature representation set 760 to generate a second masked perturbation result. The first masked perturbation result multiplied by the weight value λ and the second masked perturbation result multiplied by the complementary weight value (1 − λ) are added to generate the second word feature representation set 770.
[0139] Similarly, the second mask data 710 is multiplied element-wise with the second perturbation data 730 to generate third masked perturbation data 740. The second complementary mask data 720 is multiplied element-wise with the second perturbation data 730 to generate fourth masked perturbation data 750. The third masked perturbation data 740 is added element-wise to the third word feature representation set 760 to generate a third masked perturbation result. The fourth masked perturbation data 750 is added element-wise to the third word feature representation set 760 to generate a fourth masked perturbation result. The third masked perturbation result multiplied by the weight value λ and the fourth masked perturbation result multiplied by the complementary weight value (1 − λ) are added to generate the fourth word feature representation set 770.
[0140] Thereby, by complementarily masking the perturbation data with complementary mask data, two complementarily masked perturbation data are generated, and the results of the two local perturbations of the first word feature representation set or the third word feature representation set are weighted, so that the generated second word feature representation set and fourth word feature representation set can reflect more diverse adversarial perturbation directions, which can improve the generalization ability and robustness of the model.
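Putting the pieces together for one illustrative training step of the translation model, reusing the helpers sketched earlier (all names are assumptions; whether the encoder and decoder sides share the same weight value λ is left open here):
```python
import torch

# e_x, e_y: first and third word feature representation sets; model as in the earlier sketches.
delta_x, delta_y = joint_perturbations(model, e_x, e_y)

M_x = torch.bernoulli(torch.full_like(e_x, 0.5))    # first mask data
M_y = torch.bernoulli(torch.full_like(e_y, 0.5))    # second mask data

R_x = perturbed_representation(e_x, delta_x, M_x)   # second word feature representation set
R_y = perturbed_representation(e_y, delta_y, M_y)   # fourth word feature representation set

# Fifth/sixth KL divergences form the loss used to update the model parameters.
loss = symmetric_kl_loss(model(e_x, e_y), model(R_x, R_y))
loss.backward()
```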
[0141] Embodiments of the present disclosure also provide a method for natural language processing. The method includes acquiring a text to be processed, and generating a processing result for the text to be processed via a model trained by the method according to the above-described embodiments of the present disclosure. Processing results include, for example but without limitation, text classification results, target language text, and the like. The text to be processed can include source language text.
[0142] Thereby, natural language processing can be performed using a model with improved generalization ability and robustness.
[0143] FIG. 8 shows a device 800 for training a model for natural language processing in accordance with an embodiment of the present disclosure. As shown in FIG. 8, the device 800 includes a first perturbation data generation module 810, a first mask data generation module 820, a first masking module 830, a second masking module 840, and a first word feature representation set generation module 850.
[0144] The first perturbation data generation module 810 is used to generate first perturbation data, the first perturbation data being used to perturb a first word feature representation set, the first word feature representation set being associated with a training text of the model.
[0145] The first mask data generation module 820 is used to generate first mask data and first complementary mask data, the first mask data being used to mask a first part of the data in the first perturbation data, and the first complementary mask data being used to mask the data other than the first part of the data in the first perturbation data.
[0146] The first masking module 830 is used to generate first masked perturbation data based on the first mask data and the first perturbation data.
[0147] The second masking module 840 is used to generate second masked perturbation data based on the first complementary mask data and the first perturbation data.
[0148] The first word feature representation set generation module 850 is used to generate, based on the first masked perturbation data, the second masked perturbation data, and the first word feature representation set, a second word feature representation set for training the model.
[0149] In some embodiments, the first perturbation data maximizes the first KL divergence between the following first item and second item: the output result generated via the model based on the first word feature representation set; and the output result generated via the model based on the first word feature representation set and the first perturbation data.
[0150] In some embodiments, the first perturbation data generation module 810 includes a first partial derivative generation sub-module and a first perturbation data generation sub-module. The first partial derivative generation sub-module is used to generate, based on the first word feature representation set, a first set of partial derivatives of the first KL divergence. The first perturbation data generation sub-module is used to generate the first perturbation data based on the first set of partial derivatives and the Frobenius norm of the first set of partial derivatives.
[0151] In some embodiments, the first word feature representation set generation module 850 includes a first masked perturbation result generation sub-module, a second masked perturbation result generation sub-module, and a second word feature representation set generation sub-module. The first masked perturbation result generation sub-module is used to generate a first masked perturbation result based on the first masked perturbation data and the first word feature representation set. The second masked perturbation result generation sub-module is used to generate a second masked perturbation result based on the second masked perturbation data and the first word feature representation set. The second word feature representation set generation sub-module is used to generate the second word feature representation set based on the first masked perturbation result and the second masked perturbation result.
[0152] In some embodiments, the second word feature representation set generation sub-module is also used to generate a weight value based on a predetermined distribution function; generate a first weighted result based on the weight value and the first masked perturbation result; generate a second weighted result based on the complementary weight value of the weight value and the second masked perturbation result; and generate the second word feature representation set based on the first weighted result and the second weighted result.
[0153] In some embodiments, the device 800 further includes a first output result generation module, a second output result generation module, a KL divergence generation module, and a loss generation module. The first output result generation module is used to generate, via the model, a first output result based on the first word feature representation set. The second output result generation module is used to generate, via the model, a second output result based on the second word feature representation set. The KL divergence generation module is used to generate a second KL divergence between the first output result and the second output result, and a third KL divergence between the second output result and the first output result. The loss generation module is used to generate, based on the second KL divergence and the third KL divergence, a loss associated with the training text, the loss associated with the training text being used to update the parameters of the model.
[0154] In some embodiments, the model includes an encoder and a decoder.
[0155] In some embodiments, the training text includes source language text, the training label associated with the training text includes target language text, and the device 800 further includes a second perturbation data generation module, a second mask data generation module, a third masking module, a fourth masking module, and a second word feature representation set generation module. The second perturbation data generation module is used to generate second perturbation data, the second perturbation data being used to perturb a third word feature representation set associated with the target language text. The second mask data generation module is used to generate second mask data and second complementary mask data, the second mask data being used to mask a second part of the data in the second perturbation data, and the second complementary mask data being used to mask the data other than the second part of the data in the second perturbation data. The third masking module is used to generate third masked perturbation data based on the second mask data and the second perturbation data. The fourth masking module is used to generate fourth masked perturbation data based on the second complementary mask data and the second perturbation data. The second word feature representation set generation module is used to generate, based on the third masked perturbation data, the fourth masked perturbation data, and the third word feature representation set, a fourth word feature representation set that is used, as the input of the decoder, to train the model.
[0156] In some embodiments, the first perturbation data and the second perturbation data maximize the fourth KL divergence between the following first item and second item: the output result generated via the model with the first word feature representation set as the input of the encoder and the third word feature representation set as the input of the decoder; and the output result generated via the model with the first word feature representation set plus the first perturbation data as the input of the encoder and the third word feature representation set plus the second perturbation data as the input of the decoder.
[0157] In some embodiments, the second perturbation data generation module includes a second partial derivative generation sub-module and a second perturbation data generation sub-module. The second partial derivative generation sub-module is used to generate, based on the third word feature representation set, a second set of partial derivatives of the fourth KL divergence. The second perturbation data generation sub-module is used to generate the second perturbation data based on the second set of partial derivatives and the Frobenius norm of the second set of partial derivatives.
[0158] In some embodiments, the device 800 further includes a first output result generation module, a second output result generation module, a KL divergence generation module, and a loss generation module. The first output result generation module is used to generate, via the model, a first output result based on the first word feature representation set and the third word feature representation set, where the first word feature representation set is used as the input of the encoder and the third word feature representation set is used as the input of the decoder. The second output result generation module is used to generate, via the model, a second output result based on the second word feature representation set and the fourth word feature representation set, where the second word feature representation set is used as the input of the encoder and the fourth word feature representation set is used as the input of the decoder. The KL divergence generation module is used to generate a fifth KL divergence between the first output result and the second output result, and a sixth KL divergence between the second output result and the first output result. The loss generation module is used to generate, based on the fifth KL divergence and the sixth KL divergence, a loss associated with the training text, the loss associated with the training text being used to update the parameters of the model.
[0159] In some embodiments, the model includes a translation model.
[0160] FIG. 9 shows a schematic block diagram of a device 900 for natural language processing in accordance with an embodiment of the present disclosure. As shown in FIG. 9, the device 900 includes a text acquisition module 910 and a processing result generation module 920. The text acquisition module 910 is used to obtain a text to be processed. The processing result generation module 920 is used to generate, via a model trained by the method according to the above-described embodiments of the present disclosure, a processing result for the text to be processed.
[0161] In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of users involved are in compliance with the provisions of the relevant laws and regulations, and do not violate public order and good customs.
[0162] According to an embodiment of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0163] FIG. 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0164] As shown in FIG. 10, the device 1000 includes a computing unit 1001, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data required for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
[0165] A plurality of components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard, a mouse, or the like; an output unit 1007, such as various types of displays, speakers, or the like; a storage unit 1008, such as a magnetic disk, an optical disc, or the like; and a communication unit 1009, such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
[0166] The computing unit 1001 can be a general-purpose and/or special-purpose processing component having processing and computing power. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1001 performs the respective methods and processing described above, for example the methods 200, 300, 500, 600. For example, in some embodiments, the methods 200, 300, 500, 600 can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the methods 200, 300, 500, 600 described above can be performed. Alternatively, in other embodiments, the computing unit 1001 can be configured to perform the methods 200, 300, 500, 600 in any other suitable manner (e.g., by means of firmware).
[0167] Various embodiments of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, which can be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor can be a special-purpose or general-purpose programmable processor, can receive data and instructions from a storage system, at least one input device, and at least one output device, and can transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
[0168] The program code for implementing the methods of the present disclosure can be written in any combination of one or more programming languages. The program code can be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code can be executed entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on a remote machine or server.
[0169] In the context of the present disclosure, the machine-readable medium may be a tangible medium, which may contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium can include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the above. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
[0170] In order to provide interaction with the user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or haptic feedback), and input from the user can be received in any form (including acoustic input, voice input, or haptic input).
[0171] The systems and techniques described herein can be implemented in a computing system including a back-end component (e.g., a data server), or a computing system including a middleware component (e.g., an application server), or a computing system including a front-end component (e.g., a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware, or front-end components. The components of the system can be connected to each other by digital data communication in any form or medium (e.g., a communication network). Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
[0172] A computer system can include clients and servers. The client and the server are generally remote from each other and typically interact through a communication network. The relationship between the client and the server is generated by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
[0173] It should be understood that steps can be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure can be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; this document is not limited in this respect.
[0174] The specific embodiments described above do not constitute a limitation on the scope of protection of the present disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.
