[0047] To make the objectives, technical solutions and effects of the present invention clearer, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
[0048] The embodiments of the present invention are described below with reference to the accompanying drawings.
[0049] In view of the above problems, an embodiment of the present invention provides a method for generating encoded text based on adversarial training. Please refer to Figure 1, which is a flow chart of a preferred embodiment of the method for generating encoded text based on adversarial training of the present invention. As shown in Figure 1, the method includes:
[0050] Step S100, constructing an adversarial generation network in advance;
[0051] Step S200, optimizing the adversarial generation network to generate an optimized adversarial generation network, wherein the dropout mask in each training iteration of the optimized adversarial generation network is not fixed;
[0052] Step S300, performing adversarial training on the pre-training model according to the adversarial generation network to generate a target pre-training model;
[0053] Step S400, inputting the bond information to be processed into the target pre-training model to generate encoded text.
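Steps S100 to S400 above can be sketched as the following pipeline. This is a minimal illustrative skeleton only: every function name and return value is a hypothetical placeholder, not an API defined by this disclosure.

```python
# Hypothetical skeleton of steps S100-S400; all names are illustrative.

def build_adversarial_network():              # S100: construct the network
    return {"k_steps": 3, "epsilon": 1.0}     # toy stand-in for a model/config

def optimize_adversarial_network(net):        # S200: unfix the dropout mask
    return dict(net, fixed_dropout_mask=False)

def adversarial_train(pretrained, net):       # S300: adversarial training
    return {"base": pretrained, "adv": net}   # toy "target pre-training model"

def generate_encoded_text(model, bond_info):  # S400: inference on bond data
    return f"encoded({bond_info})"            # placeholder encoding

net = optimize_adversarial_network(build_adversarial_network())
target = adversarial_train("BERT", net)
print(generate_encoded_text(target, "bond#1"))
```

The four functions correspond one-to-one to steps S100 through S400; any real implementation would replace each placeholder body with the operations described in the embodiments below.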
[0054] In specific implementation, the embodiments of the present invention are applied to bond information extraction. The invention optimizes the adversarial training method, is applicable to any pre-training network model similar to the BERT model, and has a particularly obvious effect on text data in the field of cash bond trading.
[0055] The mathematical formula for adversarial training is as follows:
[0056] min_θ E_(x,y)~D [ max_(‖r_adv‖≤ε) L(f(x + r_adv; θ), y) ]    (Equation 1)
[0057] Here, (x, y) denotes an input sample and its label, min E_(x,y)~D[·] represents the minimization of the model's mathematical expectation over the data distribution D, r_adv represents the perturbation of the word vector, θ represents the parameters of the model, and y is the real label. Adversarial training is essentially a min-max problem. The formula is divided into two parts: the maximization of the internal loss function, and the minimization of the external empirical risk. To describe the idea of adversarial training in one sentence: perform gradient ascent on the input (increasing the loss) to make the input deviate as far as possible from the original, and perform gradient descent on the parameters (decreasing the loss) to keep the model as accurate as possible.
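The min-max idea above can be illustrated with a toy example: one normalized gradient-ascent step on the input, followed by one gradient-descent step on the parameters at the perturbed input. The linear model, squared loss, and all constants are illustrative assumptions, not the model of this disclosure.

```python
import numpy as np

# Toy illustration of the min-max idea: gradient *ascent* on the input
# perturbation r_adv raises the loss; gradient *descent* on theta lowers it.
rng = np.random.default_rng(0)
x = rng.normal(size=4)          # stand-in for a "word vector" input
w = rng.normal(size=4)          # model parameters theta
y = 1.0                         # true label
eps, lr = 0.1, 0.01

def loss(w, x_in):
    return (w @ x_in - y) ** 2

# Inner maximization: one normalized gradient-ascent step on the input
g_x = 2 * (w @ x - y) * w                      # dL/dx for the squared loss
r_adv = eps * g_x / (np.linalg.norm(g_x) + 1e-12)
assert loss(w, x + r_adv) >= loss(w, x)        # perturbation raises the loss

# Outer minimization: gradient-descent step on w at the perturbed input
g_w = 2 * (w @ (x + r_adv) - y) * (x + r_adv)  # dL/dw
w_new = w - lr * g_w
assert loss(w_new, x + r_adv) <= loss(w, x + r_adv)
```

For this quadratic loss the ascent direction is exactly along the parameter vector, so the increase in loss is guaranteed; in a deep network both steps would use backpropagated gradients instead.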
[0058] This method no longer fixes the dropout mask; instead, it uses a probability-distribution metric or the mean square error to constrain the difference between the outputs of the model under two different dropout masks. Dropout can thus still play its own role, while the outputs of the model under the two different dropout masks are kept as consistent as possible. The dropout mask records which neuron positions are dropped; because the mask is resampled each time, the model does not know in advance which neurons will be lost.
[0059] In one embodiment, constructing the adversarial generation network in advance includes:
[0060] constructing, in advance, the adversarial generation network based on the FreeLB adversarial training method.
[0061] In specific implementation, FreeLB improves on PGD. Projected Gradient Descent (PGD) applies multiple perturbations, each taking only a small step; when the accumulated perturbation exceeds the specified range, it is projected back onto the ball of the permitted perturbation range. However, PGD uses only the gradient of the last perturbation step to update the parameters, whereas FreeLB uses the average of the gradients over the K iterations to update the parameters.
[0062] The mathematical formula of PGD is as follows:
[0063] r_(t+1) = Π_(‖r‖≤ε) ( r_t + α · g(r_t) / ‖g(r_t)‖ ),  where g(r_t) = ∇_r L(f(x + r_t; θ), y)    (Equation 2)
[0064] The mathematical formula of FreeLB is as follows:
[0065] min_θ E_(x,y)~D [ (1/K) Σ_(t=0)^(K−1) max_(‖r_t‖≤ε) L(f(x + r_t; θ), y) ]    (Equation 3)
[0066] where K represents the number of perturbations.
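The loop described above can be sketched as follows on a toy linear model. The model, step sizes, and the simple norm projection are illustrative assumptions; the two points shown are the projection back into the ε-ball (PGD) and the parameter update with the average of the K gradients rather than only the last one (FreeLB).

```python
import numpy as np

# Toy FreeLB-style loop: K ascent steps on the perturbation r, each projected
# back into the eps-ball, while the parameter gradient of *every* step is
# accumulated; theta is then updated with the averaged gradient (PGD would
# use only the last step's gradient instead).
rng = np.random.default_rng(1)
x, w, y = rng.normal(size=4), rng.normal(size=4), 1.0
K, alpha, eps, lr = 3, 0.05, 0.2, 0.01

r = np.zeros_like(x)
grad_sum = np.zeros_like(w)
for _ in range(K):
    err = w @ (x + r) - y
    g_r = 2 * err * w                           # gradient w.r.t. perturbation
    r = r + alpha * g_r / (np.linalg.norm(g_r) + 1e-12)
    if np.linalg.norm(r) > eps:                 # project back into the eps-ball
        r = eps * r / np.linalg.norm(r)
    grad_sum += 2 * (w @ (x + r) - y) * (x + r) # accumulate dL/dw at this step

w = w - lr * grad_sum / K                       # update with averaged gradient
assert np.linalg.norm(r) <= eps + 1e-9          # perturbation stays in range
```

In a real model the gradients would come from backpropagation and the projection would typically use the Frobenius norm per token, but the control flow is the same.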
[0067] In one embodiment, optimizing the adversarial generation network to generate an optimized adversarial generation network, wherein the dropout mask in each training iteration of the optimized adversarial generation network is not fixed, includes:
[0068] optimizing the adversarial generation network based on the FreeLB adversarial training method, and changing the model's dropout positions from fixed to not fixed, so that the dropout mask in each training iteration of the optimized adversarial generation network is not fixed; and
[0069] constraining the output of the model by the JS divergence to generate the optimized adversarial generation network.
[0070] In specific implementation, when FreeLB is used with pre-training models such as BERT, the BERT model uses dropout during fine-tuning to improve performance. However, using dropout makes the network inconsistent across the gradient-ascent steps in which FreeLB searches for the largest perturbation, which increases the input noise. Fine-tuning means that, after the model completes pre-training, the model's parameter weights are adjusted on downstream tasks so that the model achieves a higher accuracy on those tasks.
[0071] The most common practice with FreeLB is to fix the positions of the model's dropout so that the model is consistent during fine-tuning. However, fixing the dropout positions makes dropout lose its own purpose, because the dropout positions during fine-tuning may then no longer vary as they would during pre-training, so dropout cannot play its role. To solve this problem, this solution introduces the JS divergence. The JS divergence is a measure of the similarity of two probability distributions and is a variant based on the KL divergence.
[0072] JS(Y‖Z) = (1/2) KL(Y ‖ (Y+Z)/2) + (1/2) KL(Z ‖ (Y+Z)/2)    (Equation 4)
[0073] where KL is the KL divergence,
[0074] KL(Y‖Z) = Σ_(x∈X) Y(x) log( Y(x) / Z(x) )    (Equation 5)
[0075] So for N samples, the loss is,
[0076] Rs(θ) = (1/N) Σ_(i=1)^N JS( Y(x_i) ‖ Z(x_i) )    (Equation 6)
[0077] Here, x represents an input data point, X represents the set of data, Y(x) and Z(x) represent the probability distributions corresponding to the two outputs Y and Z respectively, and Rs(θ) represents the divergence term of the model. In this way, minimizing the JS divergence constrains the network output distributions of FreeLB at each gradient-ascent step and achieves convergence of the distributions. The objective function is as follows.
[0078] min_θ E_(x,y)~D [ (1/K) Σ_(t=0)^(K−1) max_(‖r_t‖≤ε) L(f(x + r_t; θ), y) + Rs(θ) ]    (Equation 7)
[0079] Equation 7 solves the problem of increased input noise caused by the use of dropout at each gradient-ascent step of FreeLB, and it also applies to any adversarial training method that performs multiple gradient-ascent steps to find the maximum perturbation.
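The JS-divergence constraint with unfixed dropout masks can be sketched as follows. The toy softmax "model" and all shapes are illustrative assumptions; the point is that two forward passes sample two independent dropout masks, and the JS divergence of the two output distributions serves as the Rs(θ) term to be minimized.

```python
import numpy as np

# Two forward passes with independently sampled dropout masks (masks NOT
# fixed); the JS divergence between the two output distributions is the
# Rs term to be minimized. The "model" is a toy softmax over a linear layer.
rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = 0.5 * (p + q)                # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def forward(x, W, drop_p=0.3):
    # a fresh inverted-dropout mask is sampled on every call
    mask = (rng.random(x.shape) > drop_p) / (1 - drop_p)
    return softmax(W @ (x * mask))

x = rng.normal(size=8)
W = rng.normal(size=(4, 8))
p, q = forward(x, W), forward(x, W)   # two passes, two different masks

rs = js(p, q)                          # divergence term Rs to be minimized
assert 0 <= rs <= np.log(2) + 1e-9     # JS is symmetric and bounded by log 2
assert js(p, p) < 1e-12                # identical outputs give zero divergence
```

Because the JS divergence is bounded and symmetric, it is a stable training signal even when the two dropout masks produce very different outputs, which is why it is preferred here over a raw KL term.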
[0080] This improvement to the FreeLB approach is applicable not only to pre-trained models such as BERT, but also to models with any network structure.
[0081] In one embodiment, adversarial training is performed on a pre-trained model according to an adversarial generation network to generate a target pre-trained model, including:
[0082] The BERT pre-training model is adversarially trained according to the adversarial generative network to generate the target pre-training model.
[0083] In specific implementation, the gradient attack scheme in the embodiment of the present invention is applicable to pre-training models similar to BERT; it is not limited to BERT and is also applicable to RoBERTa, XLM-RoBERTa, XLNet, and the like.
[0084] In one embodiment, performing adversarial training on the BERT pre-training model according to the adversarial generation network to generate the target pre-training model includes:
[0085] obtaining a bond information sample, inputting the bond information sample into the input layer of the BERT pre-training model, and generating the input text;
[0086] feeding the input text into the embedding layer of the BERT pre-training model; and
[0087] performing adversarial training on the embedding layer of the BERT pre-training model according to the adversarial generation network to generate the target pre-training model.
[0088] In specific implementation, in computer vision noise can be added to the original image without affecting the properties of the image. In the field of NLP (Natural Language Processing), noise cannot be added directly to the word encoding, because the raw word encoding is essentially a one-hot vector, and adding such noise to a one-hot vector would make the original sentence ambiguous. Therefore, a natural idea is to add the perturbation to the word embeddings (word vectors). The input text is fed into the embedding layer of the BERT pre-training model, and adversarial training is performed on the embedding layer of the BERT pre-training model according to the adversarial generation network to generate the target pre-training model.
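The reasoning above, i.e. why the perturbation is added to the dense word embeddings rather than to the one-hot encodings, can be illustrated numerically. The vocabulary, embedding table, and noise scales below are illustrative assumptions.

```python
import numpy as np

# Why perturb embeddings, not one-hot vectors: noise on a one-hot vector can
# change which word it denotes, while a small perturbation of the dense
# embedding leaves it closest to the original word.
rng = np.random.default_rng(3)
vocab = ["bond", "price", "yield", "coupon"]     # toy vocabulary
E = rng.normal(size=(4, 8))                      # toy embedding table

one_hot = np.array([1.0, 0.0, 0.0, 0.0])         # encodes "bond"
noisy_hot = one_hot + rng.normal(scale=0.8, size=4)
# After noise, the argmax may point at a different word entirely,
# so the sentence's meaning is corrupted rather than merely perturbed.

emb = E[0]                                 # dense embedding of "bond"
emb_adv = emb + 0.01 * rng.normal(size=8)  # small adversarial-style noise
nearest = int(np.argmin(np.linalg.norm(E - emb_adv, axis=1)))
assert vocab[nearest] == "bond"            # still closest to the original word
```

The dense perturbation stays inside the neighborhood of the original word vector, which is exactly the regime the ε-ball constraint in the formulas above enforces.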
[0089] In one embodiment, adversarial training is performed on the embedding layer of the BERT pre-training model according to the adversarial generation network to generate a target pre-training model, including:
[0090] According to the adversarial generation network, two gradient attacks are performed on the embedding layer of the BERT pre-training model to generate the perturbed gradient;
[0091] The parameters of the pre-training model are updated according to the perturbed gradient to generate the target pre-training model.
[0092] In specific implementation, conventional adversarial training performs the gradient attack on the embedding obtained after the three embeddings are added together. However, attacking the sum of the three is unreasonable, because the original intention of the gradient attack is to change the semantics of the input word so that the word is misunderstood by the model; that is, "university" is understood by the model as some meaning other than "university". In that case, simultaneously attacking the word vector (token embedding), the position vector (position embedding), and the segmentation vector (segment embedding) makes the whole attack more complicated, and the model also misunderstands the position and segment information. The segment embedding indicates which sentence the current word belongs to. Therefore, this scheme adopts a two-stage attack to generate the perturbed gradient, and updates the parameters of the pre-training model according to the perturbed gradient to generate the target pre-training model.
[0093] In one embodiment, the embedding of the BERT pre-training model is composed of the token embedding, the position embedding, and the segment embedding. In this case, performing, according to the adversarial generation network, two gradient attacks on the embedding layer of the BERT pre-training model to generate the perturbed gradient includes:
[0094] performing the first gradient attack on the token embedding of the BERT pre-training model according to the adversarial generation network to generate the first perturbed gradient; and
[0095] performing the second gradient attack on the position embedding of the BERT pre-training model according to the adversarial generation network to generate the second perturbed gradient.
[0096] In specific implementation, as shown in Figure 2 and Figure 3, in pre-training models such as BERT, the embedding input to the model is obtained by adding the token embedding, the position embedding, and the segment embedding. The first gradient attack is performed on the token embedding, and the perturbed gradient is used to update the parameters. As mentioned above, its purpose is to change the semantics of the input words so that the words are misunderstood by the model, thereby further enhancing the model's semantic understanding ability. The second gradient attack targets the position embedding; its purpose is to perturb the position of each word in the corpus. For example, the word order of "I like China" may be shuffled. This acts as a kind of data augmentation: making the model resistant to this perturbation allows it to understand the correct meaning of the sentence.
[0097] The mathematical formula of the first gradient attack is as follows:
[0098] min_θ E_(x,y)~D [ (1/K) Σ_(t=0)^(K−1) max_(‖r_adv‖≤ε) L(f(E_token + r_adv; θ), y) + Rs(θ) ]    (Equation 8)
[0099] Here, E_token represents the input word vector, r_adv represents the perturbation of the word vector, θ represents the parameters of the model, f(E_token + r_adv; θ) represents the model output, y is the true label, L(f(E_token + r_adv; θ), y) represents the loss between the model output and the true label, K represents the number of perturbations, max denotes the maximization of the loss, ε represents the maximum range of the word-vector perturbation, Rs(θ) represents the divergence term of the model, and min E_(x,y)~D[·] denotes the minimization of the model's mathematical expectation.
[0100] The mathematical formula of the second gradient attack is as follows:
[0101] min_θ E_(x,y)~D [ (1/K) Σ_(t=0)^(K−1) max_(‖r_adv‖≤ε) L(f(E_position + r_adv; θ), y) + Rs(θ) ]    (Equation 9)
[0102] Here, E_position represents the input position vector.
[0103] Combining the above methods, the model of the proposed new technical solution is expressed as follows:
[0104] min_θ E_(x,y)~D [ (1/K) Σ_(t=0)^(K−1) ( L_token(θ) + L_position(θ) ) + Rs(θ) ]    (Equation 10)
[0105] where
[0106] L_token(θ) = max_(‖r_adv‖≤ε) L(f(E_token + r_adv; θ), y)    (Equation 11)
[0107] L_position(θ) = max_(‖r_adv‖≤ε) L(f(E_position + r_adv; θ), y)    (Equation 12)
[0108] The embodiment of the present invention perturbs the token embedding and the position embedding separately, instead of perturbing the sum of the token embedding, the position embedding, and the segment embedding. Perturbing the token embedding increases the difficulty of the model's understanding of the text semantics, while perturbing the position embedding has an effect similar to shuffling the words in the text, augmenting the data and further improving the model's ability to understand the text.
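The two separate attacks can be sketched as follows on a toy model. The linear model and all constants are illustrative assumptions, and for simplicity the gradient is not recomputed between the two attacks, whereas the scheme above recomputes it at each gradient-ascent step.

```python
import numpy as np

# Toy sketch of the two separate attacks: the token embedding and the
# position embedding are perturbed independently, instead of perturbing the
# sum token + position + segment; the segment embedding is left untouched.
rng = np.random.default_rng(4)
dim = 8
e_token = rng.normal(size=dim)
e_position = rng.normal(size=dim)
e_segment = rng.normal(size=dim)
w, y = rng.normal(size=dim), 1.0
eps = 0.1

def loss(w, e_tok, e_pos, e_seg):
    return (w @ (e_tok + e_pos + e_seg) - y) ** 2

def attack(e, grad, eps):
    # one normalized gradient-ascent step within the eps-ball
    return e + eps * grad / (np.linalg.norm(grad) + 1e-12)

base = loss(w, e_token, e_position, e_segment)
err = w @ (e_token + e_position + e_segment) - y
g = 2 * err * w   # dL/dE; here not recomputed after the first attack

e_tok_adv = attack(e_token, g, eps)     # first attack: token embedding only
e_pos_adv = attack(e_position, g, eps)  # second attack: position embedding only

assert loss(w, e_tok_adv, e_position, e_segment) >= base
assert loss(w, e_tok_adv, e_pos_adv, e_segment) >= base
```

Because each attack moves only one of the three addends, the model sees perturbed semantics and perturbed word order as two separate signals, rather than one entangled perturbation of the summed embedding.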
[0109] The embodiment of the present invention provides a method for generating encoded text based on adversarial training, which optimizes the FreeLB adversarial training method and is applicable to any network model similar to the BERT model, especially to text data in the field of cash bond trading. The method no longer fixes the dropout mask; instead, it uses a probability-distribution metric or the mean square error to constrain the difference between the outputs of the model under two different dropout masks, so that dropout can still play its own role while the outputs under the two different dropout masks are kept as consistent as possible. Furthermore, this scheme perturbs the token embedding and the position embedding separately, instead of perturbing the sum of the token embedding, the position embedding, and the segment embedding. Perturbing the token embedding increases the difficulty of the model's understanding of the text semantics, while perturbing the position embedding has an effect similar to shuffling the words in the text, augmenting the data and further improving the model's ability to understand the text. The accuracy of transaction-element extraction in the secondary trading business of financial bonds is improved by 2% to 5%.
[0110] It should be noted that the above steps do not necessarily have to be performed in a fixed order. Those of ordinary skill in the art can understand from the description of the embodiments of the present invention that, in different embodiments, the above steps may have different execution orders; that is, they may be executed in parallel, interchangeably, and so on.
[0111] Another embodiment of the present invention provides an apparatus for generating encoded text based on adversarial training. As shown in Figure 4, the apparatus 1 includes:
[0112] a network building module 11, configured to construct an adversarial generation network in advance;
[0113] a network optimization module 12, configured to optimize the adversarial generation network to generate an optimized adversarial generation network, wherein the dropout mask in each training iteration of the optimized adversarial generation network is not fixed;
[0114] an adversarial training module 13, configured to perform adversarial training on the pre-training model according to the adversarial generation network to generate the target pre-training model; and
[0115] an encoding module 14, configured to input the bond information to be processed into the target pre-training model to generate encoded text.
[0116] For specific implementation manners, refer to the method embodiments, which will not be repeated here.
[0117] Another embodiment of the present invention provides an electronic device. As shown in Figure 5, the electronic device 10 includes:
[0118] one or more processors 110 and a memory 120. In Figure 5, one processor 110 is taken as an example. The processor 110 and the memory 120 may be connected by a bus or in other ways; Figure 5 takes the bus connection as an example.
[0119] The processor 110 is used to implement the various control logics of the electronic device 10, and may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a single-chip microcomputer, an ARM (Acorn RISC Machine) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. Also, the processor 110 may be any conventional processor, microprocessor, or state machine. The processor 110 may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors combined with a DSP core, or any other such configuration.
[0120] As a non-volatile computer-readable storage medium, the memory 120 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions corresponding to the method for generating encoded text based on adversarial training in the embodiments of the present invention. The processor 110 executes the various functional applications and data processing of the device 10 by running the non-volatile software programs, instructions and units stored in the memory 120, that is, implements the method for generating encoded text based on adversarial training in the above method embodiments.
[0121] The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the device 10, and the like. Additionally, the memory 120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 120 may optionally include memory located remotely from the processor 110, and such remote memory may be connected to the device 10 via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
[0122] One or more units are stored in the memory 120 and, when executed by the one or more processors 110, perform the method for generating encoded text based on adversarial training in any of the above method embodiments, for example, perform the method of steps S100 to S400 in Figure 1 described above.
[0123] Embodiments of the present invention provide a non-volatile computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions that are executed by one or more processors, for example, to perform the method of steps S100 to S400 in Figure 1 described above.
[0124] As examples, the non-volatile storage medium can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) acting as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memories of the operating environments described herein are intended to include one or more of these and/or any other suitable types of memory.
[0125] Another embodiment of the present invention provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method for generating encoded text based on adversarial training in the above method embodiments, for example, to perform the method of steps S100 to S400 in Figure 1 described above.
[0126] The above-described embodiments are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
[0127] From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the parts contributing to the related technologies, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods of the various embodiments or portions of the embodiments.
[0128] Conditional language such as "can," "could," "might," or "may," unless specifically stated otherwise or otherwise understood within the context as used, is generally intended to convey that particular embodiments can include (while other embodiments do not include) particular features, elements, and/or operations. Thus, such conditional language is also generally intended to imply that the features, elements, and/or operations are not required in any way for one or more embodiments, or that one or more embodiments must include logic for deciding, with or without input or prompting, whether these features, elements, and/or operations are included in, or are to be performed for, any particular embodiment.
[0129] What has been described in this specification and the accompanying drawings includes examples that can provide methods and apparatuses for generating encoded text based on adversarial training. Of course, it is not possible to describe every conceivable combination of elements and/or methods for the purpose of describing the various features of the present disclosure, but it will be appreciated that many additional combinations and permutations of the disclosed features are possible. Therefore, it will be apparent that various modifications can be made to the present disclosure without departing from its scope or spirit. In addition, or in the alternative, other embodiments of the present disclosure may be apparent from consideration of this specification and drawings, and from practice of the present disclosure as presented herein. It is intended that the examples presented in this specification and drawings be regarded in all respects as illustrative and not restrictive. Although specific terms are employed herein, they are used in a generic and descriptive sense and not for purposes of limitation.