Model training method and device, electronic equipment, storage medium and program product

By constructing correction sample pairs and expanding the samples using the initial training data, the problem of insufficient correction efficiency and stability in the fine-tuning training of large language models is solved, and the model's efficient correction capability under semantically similar inputs is realized.

CN122262684A · Pending · Publication Date: 2026-06-23 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies cannot effectively improve the efficiency and stability of error correction in the fine-tuning training of large language models, especially when faced with similar inputs but with slightly different expressions, which can easily lead to misjudgments.

Method used

Correction sample pairs are constructed and then expanded using the initial training data of the target model. The target model is trained on the resulting expanded correction sample pairs, which preserves the stability of the model parameters and the model's generalization ability.

Benefits of technology

It significantly improves the error correction efficiency and stability of the target model, avoids overfitting, and enhances the model's ability to correct semantically similar inputs.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides a model training method and device, electronic equipment, storage medium and program product, relates to the field of artificial intelligence, in particular to the field of large language model technology. The specific implementation scheme is: obtaining a bias sample, the bias sample including bias question information, corresponding bias reply information and corresponding true reply information, the bias reply information being obtained based on the bias question information by using a target model, the target model being trained based on preset initial training data; constructing a bias correction sample pair, the bias correction sample pair taking the bias question information and the bias reply information as negative samples and taking the bias question information and the true reply information as positive samples; performing sample expansion on the bias correction sample pair based on the initial training data to obtain an expanded bias correction sample pair; and training the target model based on the expanded bias correction sample pair. In this way, the bias correction efficiency and stability of the target model are improved.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence, and more particularly to the field of large language model technology. Specifically, this disclosure relates to a model training method, apparatus, electronic device, storage medium, and program product.

Background Technology

[0002] With the development of large language model technology, large language models can be applied in various scenarios, such as content moderation, risk identification, and business question answering. In the actual operation of large language models, sometimes incorrect output results may occur. To correct these errors, bias samples (i.e., bad cases) can be obtained based on the incorrect results, and the large language model can be fine-tuned based on these bias samples.

[0003] In existing methods, large language models are often fine-tuned based solely on biased samples, or training samples related to biased samples are manually selected and used to fine-tune the large language model along with the biased samples. However, these methods fail to produce good training results.

Summary of the Invention

[0004] This disclosure provides a model training method, apparatus, electronic device, storage medium, and program product to improve the error correction efficiency and stability of a target model.

[0005] According to one aspect of this disclosure, a model training method is provided, comprising: obtaining deviation samples, the deviation samples including deviation question information, corresponding deviation response information, and corresponding true response information, wherein the deviation response information is obtained by the target model based on the deviation question information, the difference between the deviation response information and the true response information satisfies a preset first difference condition, and the target model is trained based on preset initial training data; constructing a correction sample pair, in which the deviation question information and the deviation response information form the negative sample, and the deviation question information and the true response information form the positive sample; expanding the correction sample pair based on the initial training data to obtain an expanded correction sample pair; and training the target model based on the expanded correction sample pair.

[0006] According to another aspect of this disclosure, a model training apparatus is provided, comprising: an acquisition unit configured to acquire deviation samples, the deviation samples including deviation question information, corresponding deviation response information, and corresponding true response information, wherein the deviation response information is obtained by the target model based on the deviation question information, the difference between the deviation response information and the true response information satisfies a preset first difference condition, and the target model is trained based on preset initial training data; a construction unit configured to construct a correction sample pair, in which the deviation question information and the deviation response information form the negative sample, and the deviation question information and the true response information form the positive sample; an expansion unit configured to expand the correction sample pair based on the initial training data to obtain an expanded correction sample pair; and a training unit configured to train the target model based on the expanded correction sample pair.

[0007] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the methods described in the embodiments of this disclosure.

[0008] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are configured to cause the computer to perform the methods described in embodiments of this disclosure.

[0009] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the methods described in the embodiments of this disclosure.

[0010] This disclosure allows for the construction of correction sample pairs based on biased samples, followed by sample augmentation using the initial training data of the target model. The target model is then trained based on these augmented correction sample pairs. This approach yields augmented correction sample pairs even with only a small number of biased samples, significantly improving the correction efficiency of the target model, ensuring the controllability of the training process, and enhancing the generalization ability of the target model. Furthermore, augmenting the correction sample pairs based on the initial training data avoids large parameter variations in the target model during training, thus improving its stability.

[0011] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0012] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure.

[0013] Figure 1 is a system architecture diagram to which this disclosure applies.

[0014] Figure 2 is a flowchart of the model training method provided in this disclosure.

[0015] Figure 3 is a schematic diagram of the process of determining the expanded correction sample pairs provided in this disclosure.

[0016] Figure 4 is a schematic diagram of the process of training the target model provided in this disclosure.

[0017] Figure 5 is a schematic block diagram of the model training device provided in this disclosure.

[0018] Figure 6 is a block diagram of an electronic device used to implement the model training method of the embodiments of this disclosure.

Detailed Implementation

[0019] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0020] The terminology used in the embodiments of this invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms "a," "an," and "the" as used in the embodiments of this invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.

[0021] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.

[0022] Depending on the context, the word "if" as used here can be interpreted as "when," "upon," "in response to determining," or "in response to detecting." Similarly, depending on the context, the phrase "if it is determined" or "if (the stated condition or event) is detected" can be interpreted as "when it is determined," "in response to determining," "when (the stated condition or event) is detected," or "in response to detecting (the stated condition or event)."

[0023] In the field of large language model technology, the first step is to perform initial training on the large language model, followed by deployment. After deployment, the large language model may output incorrect responses to certain questions. In such cases, fine-tuning of the large language model is necessary. A common approach is to obtain biased samples based on the question information and the incorrect responses, and then use these biased samples for reinforcement training of the large language model. Specifically, the correct responses corresponding to the biased samples are obtained, and the question information and its corresponding correct responses from the biased samples are used as new training samples. Supervised fine-tuning is then employed to train the large language model. This approach is simple to implement, requires no changes to the target model's structure, and can complete targeted corrections for individual biased samples in a relatively short time.

[0024] However, this method of optimizing the fitting ability of large language models to individual biased samples may impair the output stability of large language models across the entire semantic domain, and may even cause the same type of error to recur after correction. For example, in content security review scenarios, when large language models encounter inputs that are highly similar in semantics to the question information in the biased samples but are expressed in slightly different ways (such as synonym rewriting or contextual perturbation), they may misjudge again.

[0025] In view of this, this disclosure provides a new approach. To facilitate understanding of this disclosure, the system architecture on which this disclosure is based will first be described. Figure 1 shows an exemplary system architecture that can be applied to embodiments of this disclosure. As shown in Figure 1, the system architecture may include a client and a server.

[0026] The server side and the client side are the two main components of an application service. The server side uses a server as its primary hardware infrastructure and may include one or more software service modules. The server side and the client side form a collaborative front-end and back-end.

[0027] The client can be deployed on a terminal device. In this embodiment of the disclosure, the client can be a local application, a mini-program, or a web application running in a browser on the terminal device.

[0028] Terminal devices can include, but are not limited to, smart mobile terminals, wearable devices, PCs (Personal Computers), and smart home devices. Smart mobile devices can include devices such as mobile phones, tablets, laptops, PDAs (Personal Digital Assistants), and connected car terminals. Wearable devices can include devices such as smartwatches, smart glasses, smart bracelets, VR (Virtual Reality) devices, AR (Augmented Reality) devices, and mixed reality devices (devices that support both virtual and augmented reality). Smart home devices can include devices such as smart TVs and smart refrigerators with displays.

[0029] A server can be a single server, a server cluster consisting of multiple servers, or a cloud server. A cloud server, also known as a cloud computing server or cloud host, is a hosting product in the cloud computing service system, designed to address the shortcomings of traditional physical hosts and Virtual Private Server (VPS) services, such as high management difficulty and weak service scalability.

[0030] It should be understood that the number of client and server components shown in Figure 1 is merely illustrative. Depending on implementation needs, there can be any number of client and server components.

[0031] As one feasible implementation, a user can input target question information on a user interface provided on the client side. In response to a user-triggered submission event for the target question information, the client side sends the target question information to the server side. The server side then calls the target model to generate target response information based on the target question information and returns the target response information to the client side. In response to a user-triggered negative feedback event for the target response information, the server side identifies the target question information as a biased question and the target response information as a biased response, and obtains the corresponding real response information to construct a bias sample.

[0032] After constructing the deviation samples, the server constructs the correction sample pairs based on the deviation samples, and expands the correction sample pairs using the initial training data of the target model to obtain the expanded correction sample pairs. Subsequently, the server trains the target model based on the expanded correction sample pairs and puts the trained target model online so that users can use the target model to complete tasks through the user interface provided by the client.

[0033] Figure 2 is a flowchart of a model training method provided in an embodiment of this disclosure. The model training method can be performed by the server side in the system shown in Figure 1. As shown in Figure 2, the method may include the following steps: Step 201: Obtain deviation samples. Deviation samples include deviation question information, corresponding deviation response information, and corresponding real response information. Deviation response information is obtained by using the target model based on deviation question information. The difference between deviation response information and real response information satisfies a preset first difference condition. The target model is trained based on preset initial training data.

[0034] Step 202: Construct a correction sample pair, in which the deviation question information and deviation response information are negative samples, and the deviation question information and the true response information are positive samples.

[0035] Step 203: Based on the initial training data, the correction sample pairs are expanded to obtain expanded correction sample pairs.

[0036] Step 204: Train the target model based on the expanded correction sample pairs.

[0037] As can be seen from the above process, this disclosure can construct correction sample pairs based on biased samples, then expand the correction sample pairs using the initial training data of the target model, and finally train the target model based on the expanded correction sample pairs. This method can obtain expanded correction sample pairs with only a small number of biased samples, significantly improving the correction efficiency of the target model, ensuring the controllability of the training process, and also improving the generalization ability of the target model. Furthermore, expanding the correction sample pairs based on the initial training data avoids large parameter changes in the target model during training, thus improving the stability of the target model.

[0038] The following describes in detail each step of the above process and the effects that can be further produced, with reference to the embodiments.

[0039] First, step 201, namely "obtaining deviation samples", will be described in detail with reference to the embodiments.

[0040] In the embodiments of this disclosure, the server can first obtain deviation samples, which include deviation question information, corresponding deviation response information, and corresponding real response information.

[0041] Here, biased question information generally refers to the question information input by the user; biased response information is the incorrect response information output by the target model (that is, the target model trained on the initial training data) based on that question information; and true response information refers to the correct response information that is subsequently obtained for that question information.

[0042] As one feasible approach, users can input target question information on a user interface provided on the client side. In response to a user-triggered submission event, the client side sends the target question information to the server. The server then uses the target model to generate target response information based on the target question information and returns it to the client side. Subsequently, in response to a negative feedback event triggered by the user for the target response information (e.g., the user clicking a "Dissatisfied" or "Error" button corresponding to the target response), the server identifies the target question information as biased question information and the target response information as biased response information.

[0043] Then, the server can input the biased question information into a large language model with a larger parameter scale than the target model to obtain the corresponding real response information, or obtain the real response information obtained by manually annotating the biased question information. Based on this, the server can construct biased samples based on the biased question information, biased response information, and real response information.

[0044] As can be seen, when a user provides negative feedback on the target response generated by the target model for a target question, the target question and the target response are automatically identified as the deviation question information and deviation response information. The corresponding real response is then obtained to complete the construction of the deviation sample. This approach automatically collects high-quality deviation samples from the actual operating scenarios of the target model, turning user feedback directly into a driver of target model optimization. This allows the target model to continuously learn from its mistakes, thereby steadily improving its generation performance and user satisfaction in real-world application environments.
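The feedback-driven collection described above can be sketched as follows. This is a minimal illustration only: the `BiasSample` container, the `on_negative_feedback` handler, and the `reference_answer` stand-in (which plays the role of a larger model or a human annotator) are hypothetical names that do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BiasSample:
    question: str      # deviation question information
    biased_reply: str  # incorrect reply produced by the target model
    true_reply: str    # correct reply (larger model or manual annotation)


def on_negative_feedback(
    question: str,
    reply: str,
    get_true_reply: Callable[[str], str],
) -> BiasSample:
    """Turn a negatively rated (question, reply) into a deviation sample."""
    return BiasSample(question, reply, get_true_reply(question))


# Hypothetical stand-in for a larger model or a human annotator.
def reference_answer(question: str) -> str:
    return {"Is 17 a prime number?": "Yes"}.get(question, "unknown")


sample = on_negative_feedback("Is 17 a prime number?", "No", reference_answer)
```

In a real deployment the callback would be invoked by the server when the client reports the negative feedback event; how the true reply is obtained is left open here, as in the text.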

[0045] As another feasible approach, users can input feedback information via a client application. This feedback includes deviation questions, deviation responses, and actual responses. The client application then sends this feedback to the server. The server parses the collected feedback to obtain the deviation questions, deviation responses, and actual responses, and constructs a deviation sample based on this information.

[0046] In addition to the above methods for obtaining biased samples, the server can also periodically obtain target question information and target response information from the target model's running logs. Based on the target question information, a large language model with a larger parameter scale than the target model can be used to obtain detection response information. By comparing the similarity between the detection response information and the target response information, it can be determined whether to treat the target question information as biased question information. In other words, if the similarity between the detection response information and the target response information is low, the target question information is identified as biased question information, the target response information is identified as biased response information, and the detection response information is identified as true response information.

[0047] It should be noted that the difference between the deviation response information and the true response information meets the preset first difference condition, which actually means that the deviation response information and the true response information are different. Specifically, it can be determined whether the deviation response information and the true response information are the same by whether the similarity between the deviation response information and the true response information exceeds the similarity threshold. If it exceeds the threshold, the deviation response information and the true response information are considered to be the same. If it does not exceed the threshold, the deviation response information and the true response information are considered to be different.
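As a rough sketch of this check, the snippet below treats two replies as different when their similarity does not exceed a threshold. The disclosure does not name a similarity measure for response text, so the character-level ratio from Python's `difflib` is used purely as a stand-in, and the default threshold of 0.8 is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real system might compare embeddings."""
    return SequenceMatcher(None, a, b).ratio()


def satisfies_first_difference(
    biased_reply: str, true_reply: str, threshold: float = 0.8
) -> bool:
    """True when the replies count as different (similarity <= threshold)."""
    return text_similarity(biased_reply, true_reply) <= threshold
```

For a classification task the same condition reduces to comparing the two category labels for inequality.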

[0048] When the target model is used to perform a classification task, the biased response information and the true response information can refer to the category of the biased question information. The difference between the biased response information and the true response information satisfies the preset first difference condition, that is, the category represented by the biased response information is different from the category represented by the true response information.

[0049] The following describes step 202, namely "constructing correction sample pairs", in detail with reference to the embodiments.

[0050] After obtaining the biased samples, corrective sample pairs can be constructed based on them. These corrective sample pairs include biased question information, biased response information, and true response information. In these corrective sample pairs, the biased question information and the biased response information are considered negative samples, and the biased question information and the true response information are considered positive samples. In other words, a corrective sample pair can explicitly represent the preference relationship where, for a given biased question information, the true response information is superior to the biased response information.
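A correction sample pair can be held in a small record like the one sketched below. The field names `chosen` and `rejected` follow common preference-data conventions and are an assumption; the disclosure only speaks of positive and negative samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionPair:
    question: str  # shared by the positive and the negative sample
    chosen: str    # positive sample: the true response information
    rejected: str  # negative sample: the biased response information


def build_correction_pair(
    question: str, biased_reply: str, true_reply: str
) -> CorrectionPair:
    """Encode the preference 'true reply is better than biased reply'."""
    if biased_reply == true_reply:
        raise ValueError("biased and true replies must differ")
    return CorrectionPair(question=question, chosen=true_reply, rejected=biased_reply)


pair = build_correction_pair("Is 17 a prime number?", "No", "Yes")
```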

[0051] The following describes in detail step 203, namely, "expanding the correction sample pairs based on the initial training data to obtain expanded correction sample pairs," with reference to the embodiments.

[0052] As one feasible approach, the server can use biased samples to train the target model directly. Specifically, the biased question information is input into the target model to obtain the predicted response information that the target model outputs for it. The target model is then trained with two goals: minimizing the difference between the predicted response information and the true response information, and maximizing the difference between the predicted response information and the biased response information.
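One way to fold the two training goals into a single scalar is a pairwise logistic preference loss. This is a swapped-in, commonly used formulation and not necessarily the loss used in the disclosure; the function name, the `beta` scale, and the use of per-reply log-probabilities under the target model are all assumptions.

```python
import math


def pairwise_preference_loss(
    logp_true: float, logp_biased: float, beta: float = 1.0
) -> float:
    """-log(sigmoid(beta * (logp_true - logp_biased))), computed stably.

    logp_true / logp_biased: log-probability the model assigns to the true
    and to the biased reply for the same question (assumed inputs).
    """
    margin = beta * (logp_true - logp_biased)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the model assigns more probability to the true reply relative to the biased one, which pushes in both directions the paragraph describes.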

[0053] However, directly fine-tuning the target model based on a small number of correction sample pairs can easily cause the model parameters of the target model to over-adapt to the new correction sample pairs, thereby disrupting the overall response capability of the target model learned based on the initial training data.

[0054] Therefore, as another feasible approach, the server can use the initial training data to augment the correction sample pairs, obtaining augmented correction sample pairs, and then train the target model based on these augmented correction sample pairs. The initial training data includes multiple initial training samples, which contain sample question information and corresponding sample response information.

[0055] Specifically, candidate samples related to the biased samples can be identified from the initial training data. Based on these candidate samples, the correction sample pairs are augmented to obtain augmented sample pairs. In each augmented sample pair, the sample question information of a candidate sample together with its corresponding sample response information forms the positive sample, and the same sample question information together with the biased response information forms the negative sample. The augmented correction sample pairs include both the correction sample pairs and the augmented sample pairs. The target model is trained based on the augmented correction sample pairs.
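The expansion step can be sketched as below, assuming candidate samples have already been selected and are given as `(question, reply)` tuples. The `Pair` type and `expand_pairs` function are hypothetical names; skipping a candidate whose reply equals the biased reply is an extra guard added here, not something the disclosure states.

```python
from typing import List, NamedTuple, Tuple


class Pair(NamedTuple):
    question: str
    chosen: str    # positive reply
    rejected: str  # negative reply


def expand_pairs(correction: Pair, candidates: List[Tuple[str, str]]) -> List[Pair]:
    """Return the correction pair plus one augmented pair per candidate."""
    expanded = [correction]
    for sample_question, sample_reply in candidates:
        # Assumption: a pair whose positive and negative replies coincide
        # carries no preference signal, so it is skipped.
        if sample_reply == correction.rejected:
            continue
        # Positive: the candidate's own reply. Negative: the biased reply.
        expanded.append(Pair(sample_question, sample_reply, correction.rejected))
    return expanded


correction = Pair("q", "good", "bad")
expanded = expand_pairs(correction, [("q1", "r1"), ("q2", "r2")])
```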

[0056] By selecting candidate samples related to the biased samples from the initial training data and utilizing the question and response information from these candidate samples, additional augmented sample pairs are constructed with the biased response information. This method can efficiently construct more relevant augmented sample pairs based on the acquired biased samples from the initial training data. This enhances the target model's bias correction capability while improving data utilization efficiency and the generalization ability of the target model, avoiding the overfitting problem that may result from training only on a single biased sample.

[0057] It should be noted that when determining candidate samples related to the biased samples from the initial training data, the initial training samples whose similarity to the biased question information meets a preset similarity condition can be used as candidate samples. For example, a text encoder can be used to convert the biased question information and the sample question information of all initial training samples into vectors, and then the similarity between the vector corresponding to the biased question information and the vector corresponding to each sample question information can be calculated. The initial training samples corresponding to the M sample question information with the highest similarity are selected as candidate samples (M is a positive integer). Alternatively, the initial training samples corresponding to the sample question information with similarity exceeding a similarity threshold can be selected as candidate samples. Or, the initial training samples corresponding to the M sample question information with the highest similarity are first selected as initial screening samples, and then the initial screening samples corresponding to the sample question information with similarity exceeding the similarity threshold are selected as candidate samples.

[0058] The similarity here can refer to cosine similarity, the reciprocal of the dot product, the reciprocal of the Euclidean distance, etc. The similarity threshold can be a fixed value or dynamically determined according to the similarity distribution of the current initial training samples.
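Two of the similarity measures mentioned here can be written in a few lines of standard-library Python. The `eps` term that avoids division by zero, and the "mean plus k standard deviations" rule used for the dynamic threshold, are illustrative assumptions; the disclosure does not say how a dynamic threshold is derived from the similarity distribution.

```python
import math
import statistics
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inverse_euclidean_similarity(
    u: Sequence[float], v: Sequence[float], eps: float = 1e-9
) -> float:
    """Reciprocal of the Euclidean distance; eps keeps identical vectors finite."""
    return 1.0 / (math.dist(u, v) + eps)


def dynamic_threshold(similarities: Sequence[float], k: float = 1.0) -> float:
    """Assumed rule: mean of the observed similarities plus k standard deviations."""
    return statistics.mean(similarities) + k * statistics.pstdev(similarities)
```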

[0059] In this way, the selection criteria for candidate samples are clearly defined as the similarity between the sample question information and the deviation question information must meet the preset similarity conditions. This ensures that the selected candidate samples for expansion are highly correlated with the deviation samples that need to be corrected at the question information level, making the expanded sample pairs more targeted and representative. This improves the accuracy and efficiency of the correction training and avoids the interference caused by introducing irrelevant samples to the correction target.

[0060] To further improve the efficiency of candidate sample determination, this disclosure can pre-construct a training sample vector retrieval library. The training sample vector retrieval library stores the vectors of the sample question information of all initial training samples after being transformed by a text encoder. Here, the text encoder can refer to the encoding layer of the target model or an independent text vector model.

[0061] When it is necessary to calculate the similarity between the biased question information and the sample question information, the biased question information can be vectorized using the same text encoder first, and then a similarity search can be performed in the training sample vector retrieval library. For example, a similarity threshold can be set, and the initial training samples whose similarity to the biased question information exceeds the similarity threshold can be used as candidate samples. Another example is to set a value M, and select the initial training samples corresponding to the M sample question information with the highest similarity as candidate samples.

[0062] For example, the M samples whose sample question information is most similar to the biased question information can first be selected from the initial training data as the initial screening samples. This process can be represented by the following formula:

$$S_{\text{init}} = \operatorname{TopM}\big(\operatorname{sim}(\mathbf{v}_{q_b}, \mathcal{V})\big) \quad \text{(Formula 1)}$$

where $S_{\text{init}}$ denotes the initial screening samples, $q_b$ denotes the biased question information, $\mathbf{v}_{q_b}$ denotes the vector corresponding to the biased question information, and $\mathcal{V}$ denotes the training sample vector retrieval library.

[0063] Then, from the initial screening samples, those whose sample question information has a similarity exceeding the similarity threshold are selected as candidate samples. This process can be represented by the following Formula 2.

[0064]

$$S_{\text{cand}} = \{\, q_i \in S_{\text{init}} \mid \operatorname{sim}(\mathbf{v}_{q_b}, \mathbf{v}_{q_i}) > \tau \,\} \quad \text{(Formula 2)}$$

where $S_{\text{cand}}$ denotes the candidate samples, $q_i$ denotes the sample question information included in the initial screening samples $S_{\text{init}}$, $\tau$ denotes the similarity threshold, and $\operatorname{sim}(\mathbf{v}_{q_b}, \mathbf{v}_{q_i})$ denotes the similarity between the vector corresponding to the biased question information and the vector corresponding to the sample question information included in the initial screening samples.

[0065] By constructing a training sample vector retrieval library as described above, candidate samples related to the current biased sample can be located more efficiently and accurately from a massive amount of initial training samples. This avoids the randomness and inefficiency of manual selection, and eliminates the need to vectorize the sample query information every time a candidate sample is selected.
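The two-stage selection described above (top-M screening followed by a similarity threshold) can be sketched against a pre-built vector library. Here the library is a plain dictionary from a sample identifier to its precomputed question vector, which is an assumption made for brevity; a production system would more likely use a dedicated vector index. The function and parameter names are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(
    query_vec: Sequence[float],
    library: Dict[str, Sequence[float]],
    m: int,
    tau: float,
) -> List[Tuple[str, float]]:
    """Stage 1: keep the m most similar samples. Stage 2: keep those above tau."""
    scored = ((cosine(query_vec, vec), sample_id) for sample_id, vec in library.items())
    top_m = heapq.nlargest(m, scored)  # initial screening samples
    return [(sample_id, score) for score, sample_id in top_m if score > tau]


# The library is built once, so sample questions are not re-encoded per query.
library = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
candidates = select_candidates([1.0, 0.0], library, m=3, tau=0.8)
```

In this toy library, sample "d" survives the top-3 screening but is dropped by the threshold, which is the case the second stage exists to handle.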

[0066] In this embodiment of the disclosure, the correction sample pair can be expanded based on the candidate samples to obtain expanded sample pairs; alternatively, heterolabeled samples can be further screened from the candidate samples, and the correction sample pair can be expanded based on the heterolabeled samples to obtain expanded sample pairs.

[0067] In other words, after the candidate samples are identified, this disclosure also provides a scheme for heterolabel screening of the candidate samples, in order to construct expanded sample pairs with stronger contrast signals. The purpose of this screening is to filter out candidate samples whose sample question information is similar to the biased question information but whose sample response information is consistent with the true response information. If the sample response information of a candidate sample is the same as the true response information, an expanded sample pair built from that sample response information and the biased response information still has some training effect, but it introduces no new, easily confused response information, so its effect on clarifying the boundaries between the different response information is limited.

[0068] Specifically, heterolabeled samples are determined from the candidate samples, where the difference between the sample response information in a heterolabeled sample and the true response information meets a preset second difference condition. This process can be represented by Formula 3: $S_{\text{het}} = \{\, (x_i, y_i) \in S_{\text{cand}} \mid y_i \neq y^{*} \,\}$ (Formula 3), where $S_{\text{het}}$ denotes the heterolabeled samples, $S_{\text{cand}}$ denotes the candidate samples, $x_i$ denotes the sample question information included in a candidate sample, $y_i$ denotes the sample response information corresponding to that sample question information, $y^{*}$ denotes the true response information, and $y_i \neq y^{*}$ indicates that the second difference condition is met.

[0069] In simple terms, candidate samples whose sample response information differs from the true response information are regarded as heterolabeled samples. Specifically, the similarity between the sample response information and the true response information can be used to determine whether the two are the same: if the similarity exceeds a similarity threshold, the sample response information is considered the same as the true response information; otherwise, it is considered different.

[0070] When the target model is used to perform a classification task, the sample response information and the true response information can refer to the categories of the sample question information and the biased question information. The difference between the sample response information and the true response information satisfies the preset second difference condition, that is, the category represented by the sample response information is different from the category represented by the true response information.
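For the classification case, the heterolabel screening reduces to a label comparison, which can be sketched as follows. The `Sample` type, the function name `select_heterolabeled`, and the optional `is_same` predicate are hypothetical names introduced for illustration; the predicate stands in for the similarity-based comparison mentioned for non-classification tasks, and the category labels other than "prose" are made up.

```python
from typing import Callable, List, NamedTuple


class Sample(NamedTuple):
    question: str  # sample question information
    response: str  # sample response information (a category label here)


def select_heterolabeled(
    candidates: List[Sample],
    true_response: str,
    is_same: Callable[[str, str], bool] = lambda a, b: a == b,
) -> List[Sample]:
    """Keep candidates whose response differs from the true response.

    With the default predicate this is the classification case: two
    responses are "the same" exactly when their category labels match.
    """
    return [s for s in candidates if not is_same(s.response, true_response)]


candidates = [
    Sample("What writing style is passage A?", "prose"),
    Sample("What writing style is passage B?", "poetry"),
    Sample("What writing style is passage C?", "essay"),
]
hetero = select_heterolabeled(candidates, true_response="prose")
hetero_labels = [s.response for s in hetero]
```

Passing a different `is_same` (for example, a thresholded text similarity) would cover free-text responses without changing the screening logic.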

[0071] Furthermore, the correction sample pairs are augmented based on the heterolabeled samples to obtain augmented sample pairs. As one feasible approach, the augmented sample pairs include a first augmented sample pair, which uses the sample question information and corresponding sample response information from the heterolabeled samples as positive samples, and the sample question information and biased response information from the heterolabeled samples as negative samples. In other words, the first augmented sample pair keeps the sample question information of the heterolabeled samples unchanged, and compares the original sample response information of the heterolabeled samples with the biased response information.

[0072] As another feasible approach, the expanded sample pairs include a second expanded sample pair. The second expanded sample pair uses the biased question information and the sample response information from the heterolabeled samples as positive samples, and the biased question information and the biased response information as negative samples. In other words, the second expanded sample pair keeps the biased question information unchanged and compares the sample response information from the heterolabeled samples with the biased response information.

[0073] As another feasible approach, the augmented sample pairs include the first and second augmented sample pairs mentioned above.

[0074] This approach can filter out heterolabeled samples from relevant candidate samples where the sample responses differ from the true responses of biased samples. This avoids a situation where a large number of candidate samples contain responses identical to the true responses, thus failing to form an effective comparative signal. Furthermore, this augmentation method guides the target model not only to learn and correct biases corresponding to biased samples but also to avoid generating incorrect responses to semantically similar questions. This significantly enhances the target model's ability to identify and correct diverse biases, thereby improving its robustness.

[0075] The following describes step 204, namely "training the target model based on the expanded correction sample pairs," in detail with reference to the embodiments.

[0076] After the expanded correction sample pairs are obtained, the target model is incrementally trained based on them. The goal of the training is to make the update direction of the model parameters of the target model satisfy the preference relationship defined by the expanded correction sample pairs. The training method can be any of contrastive learning, ranking learning, or preference learning.

[0077] Contrastive learning is a machine learning method that trains a target model to distinguish between positive and negative samples to learn data representation. Its core idea is to bring positive samples closer and push negative samples further apart in the representation space. Ranking learning aims to train a target model to rank a set of items (such as products, categories, responses, search results, etc.) so that the ranking results conform to a certain relevance criterion; that is, ranking learning focuses more on the relative order between items. Preference learning is a machine learning method that learns preferences from human (or expert) preference feedback. Its loss function encourages the target model to increase the difference between the log probability of responses to positive samples and the log probability of responses to negative samples.

[0078] To better illustrate the training process described above, the biased question information and the sample question information in the expanded correction sample pairs are collectively referred to as question information. Furthermore, the response information in the positive samples of the expanded correction sample pairs is called correct response information, and the response information in the negative samples is called incorrect response information. For example, since the correction sample pair takes the biased question information and the true response information as positive samples, the true response information is correct response information. Likewise, since the first expanded sample pair takes the sample question information and the corresponding sample response information from the heterolabeled samples as positive samples, that sample response information is correct response information. Finally, since the correction sample pair takes the biased question information and the biased response information as negative samples, the biased response information is incorrect response information.

[0079] As one example, the output of the target model needs to satisfy Formula 4: $P(y^{+} \mid x) > P(y^{-} \mid x)$ (Formula 4), where $y^{+}$ denotes the correct response information, $x$ denotes the question information, $y^{-}$ denotes the incorrect response information, $P(y^{+} \mid x)$ denotes the probability, confidence level, or equivalent score with which the target model outputs the correct response information given the input $x$, and $P(y^{-} \mid x)$ denotes the probability, confidence level, or equivalent score with which the target model outputs the incorrect response information given the input $x$.
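One way to turn the preference relationship of Formula 4 into a training signal is a pairwise logistic loss on log probabilities, sketched below. This specific loss is an assumption chosen for illustration; the disclosure only states that contrastive, ranking, or preference learning may be used and does not fix a loss function. The names `pairwise_preference_loss` and `satisfies_preference` are hypothetical.

```python
import math


def pairwise_preference_loss(logp_correct: float, logp_incorrect: float) -> float:
    """Logistic pairwise loss: -log(sigmoid(logp_correct - logp_incorrect)).

    The loss shrinks as the log probability of the correct response grows
    relative to that of the incorrect response.
    """
    margin = logp_correct - logp_incorrect
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def satisfies_preference(logp_correct: float, logp_incorrect: float) -> bool:
    """Check the ordering required by Formula 4 on log probabilities."""
    return logp_correct > logp_incorrect


# Hypothetical log probabilities from the target model for one sample pair.
before = pairwise_preference_loss(logp_correct=-3.0, logp_incorrect=-0.1)
after = pairwise_preference_loss(logp_correct=-0.1, logp_incorrect=-3.0)
```

Because the logarithm is monotonic, comparing log probabilities gives the same ordering as comparing the probabilities themselves.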

[0080] In summary, contrastive learning, ranking learning, or preference learning training methods are particularly suitable for training data in the form of positive and negative sample pairs. They can effectively enable the target model to learn the subtle differences and ranking relationships between "good responses" and "bad responses," thereby optimizing the model parameters of the target model more accurately and efficiently. This causes the internal representation or generation strategy to adjust in a direction that favors positive samples (true response information or sample response information) and moves away from negative samples (biased response information), thus accelerating the deviation correction and convergence process of the target model.

[0081] After training, the model parameters of the target model are updated. When processing inputs that are semantically similar to the biased question information of the biased sample, the probability of outputting biased response information will decrease, thereby correcting specific bias errors. Furthermore, since the training signal covers semantic neighbors, this correction has better generalization ability.

[0082] To illustrate more clearly the process of determining the expanded correction sample pairs in this disclosure, a schematic diagram is provided. As shown in Figure 3, a training sample vector retrieval library is constructed based on the initial training data; the library includes the vectors of the sample question information of each initial training sample. Based on the vector of the biased question information of the biased sample and the vectors of the sample question information of each initial training sample, candidate samples related to the biased sample are obtained from the initial training data. From these candidate samples, heterolabeled samples are further selected. Based on these heterolabeled samples, expanded sample pairs are obtained. Based on the biased sample, the correction sample pair is obtained. The expanded sample pairs and the correction sample pair together constitute the expanded correction sample pairs.

[0083] Furthermore, as shown in Figure 4, the target model is trained based on the expanded correction sample pairs, and the updated model parameters of the target model are obtained. Based on the updated model parameters, the parameters of the target model running online are optimized.

[0084] For example, suppose the biased sample includes biased question information (called query1), biased response information (called labelA), and true response information (called labelB). Based on the biased question information, three candidate samples are obtained, namely candidate sample 1 (the corresponding sample question information is query2 and the sample response information is labelB), candidate sample 2 (the corresponding sample question information is query2 and the sample response information is labelC), and candidate sample 3 (the corresponding sample question information is query3 and the sample response information is labelD).

[0085] Then, heterolabeled samples are selected from the candidate samples. Only candidate samples 2 and 3 have sample response information that differs from the true response information, so candidate samples 2 and 3 are selected as heterolabeled samples. Writing each pair as (question information, positive response information, negative response information), the first expanded sample pairs constructed based on the heterolabeled samples are (query2, labelC, labelA) and (query3, labelD, labelA). In addition, the second expanded sample pairs (query1, labelC, labelA) and (query1, labelD, labelA) can also be constructed based on the heterolabeled samples. The correction sample pair constructed based on the biased sample is (query1, labelB, labelA). Finally, the expanded correction sample pairs can include (query2, labelC, labelA), (query3, labelD, labelA), (query1, labelC, labelA), (query1, labelD, labelA), and (query1, labelB, labelA).
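The worked example above can be reproduced with a short sketch. The tuple layout (question, positive response, negative response) is inferred from how the pairs are listed; the `Sample` type and the function `build_expanded_correction_pairs` are hypothetical names, and the question identifier given to candidate sample 1 is a placeholder, since that sample is filtered out and its question never appears in any pair.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    question: str
    response: str


# (question, positive response, negative response)
Pair = Tuple[str, str, str]


def build_expanded_correction_pairs(
    bias_question: str,
    bias_response: str,
    true_response: str,
    candidates: List[Sample],
) -> List[Pair]:
    """Build first/second expanded pairs plus the correction pair."""
    # Heterolabel screening: drop candidates whose label equals the true label.
    hetero = [c for c in candidates if c.response != true_response]
    # First expanded pairs: keep the heterolabeled sample's own question.
    first = [(c.question, c.response, bias_response) for c in hetero]
    # Second expanded pairs: keep the biased question instead.
    second = [(bias_question, c.response, bias_response) for c in hetero]
    # Correction pair: true response preferred over the biased response.
    correction = [(bias_question, true_response, bias_response)]
    return first + second + correction


candidates = [
    Sample("query_c1", "labelB"),  # candidate sample 1 (placeholder question id)
    Sample("query2", "labelC"),    # candidate sample 2
    Sample("query3", "labelD"),    # candidate sample 3
]
pairs = build_expanded_correction_pairs("query1", "labelA", "labelB", candidates)
```

With these inputs, `pairs` contains the same five tuples, in the same order, as the list at the end of the example.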

[0086] It should be noted that the target model in this disclosure can be used for various tasks, such as question answering, code completion, and code rewriting. If the target model is used for question answering, the biased question information can refer to the question description input by the user into the target model (e.g., "Please give a city suitable for travel and give the reasons for the recommendation"), and the biased response information can refer to the answer to the question description (e.g., "City XX is suitable for travel because of its beautiful scenery"). If the target model is used for code completion, the biased question information can refer to the prefix and suffix codes of the position to be completed, and the biased response information can refer to the completed code obtained based on the prefix and suffix codes.

[0087] However, the model training method disclosed herein is particularly suitable for training target models for classification tasks. If the target model is used for classification tasks, the bias question information can refer to the given classification question (such as "What writing style does this passage belong to?"), and the bias response information can refer to the category output by the target model (such as "prose").

[0088] In scenarios involving various classification tasks such as content moderation and qualification determination, the target model needs to stably classify complex inputs into predetermined categories. As the number of categories increases, it becomes more common for categories to have similar semantics and overlapping boundaries. In such cases, the target model is prone to making judgment errors, which affects the stability of the output.

[0089] Therefore, the model training method provided in this disclosure can be used to train the target model. It should be noted that when the target model is used for a classification task, the output of the target model based on the input data is one of the categories in a preset category set. During the training phase, the target model can output the generation score or log probability of each category in the preset category set given the input. During the inference phase, the target model, based on the given input, obtains the generation score or log probability of each category in the preset category set and outputs the category with the highest generation score or log probability. Furthermore, this disclosure does not limit the size, structure, or parameter form of the target model, as long as it meets the requirements of this disclosure.
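The inference-phase rule described above (score every category in the preset set, then output the highest-scoring one) can be sketched as follows. The raw scores are made-up numbers standing in for the target model's per-category generation scores, the categories other than "prose" are hypothetical, and `log_softmax` and `predict_category` are illustrative names rather than part of the disclosure.

```python
import math
from typing import Dict


def log_softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw per-category scores into log probabilities."""
    peak = max(scores.values())  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scores.values()))
    return {label: v - log_norm for label, v in scores.items()}


def predict_category(scores: Dict[str, float]) -> str:
    """Return the category with the highest log probability."""
    log_probs = log_softmax(scores)
    return max(log_probs, key=log_probs.get)


# Hypothetical scores over a preset category set for one input.
raw_scores = {"prose": 2.1, "poetry": 0.3, "essay": 1.7}
log_probs = log_softmax(raw_scores)
prediction = predict_category(raw_scores)
```

During training, the same per-category log probabilities are what a pairwise objective would compare for the positive and negative categories of each sample pair.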

[0090] In this way, in scenarios where there are many categories in the preset category set and the categories have similar semantics, when a small number of high-value deviation samples are obtained after the target model is launched, the target model can be quickly repaired and online stability can be maintained in a short period of time.

[0091] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0092] The foregoing has described specific embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0093] According to another embodiment, a model training apparatus is provided. Figure 5 shows a schematic block diagram of the model training apparatus according to one embodiment; the apparatus is disposed on the server side in the architecture shown in Figure 1. As shown in Figure 5, the model training apparatus 500 includes an acquisition unit 501, a construction unit 502, an expansion unit 503, and a training unit 504, and further includes a generation unit 505. The main functions of each unit are as follows: The acquisition unit 501 is configured to acquire deviation samples, which include deviation question information, corresponding deviation response information, and corresponding true response information. The deviation response information is obtained by the target model based on the deviation question information. The difference between the deviation response information and the true response information satisfies a preset first difference condition. The target model is trained based on preset initial training data.

[0094] Construction unit 502 is configured to construct correction sample pairs, in which the deviation question information and deviation response information are negative samples, and the deviation question information and the true response information are positive samples.

[0095] The expansion unit 503 is configured to expand the correction sample pairs based on the initial training data to obtain expanded correction sample pairs.

[0096] Training unit 504 is configured to train the target model based on the expanded correction sample pairs.

[0097] As one possible implementation, the initial training data includes multiple initial training samples, each of which includes sample question information and corresponding sample response information.

[0098] The expansion unit 503, when expanding the correction sample pair based on the initial training data to obtain the expanded correction sample pairs, can be specifically configured to: determine candidate samples related to the deviation sample from the initial training data; and expand the correction sample pair based on the candidate samples to obtain expanded sample pairs, wherein the expanded sample pairs use the sample question information and corresponding sample response information in the candidate samples as positive samples, and the sample question information in the candidate samples and the deviation response information as negative samples, and the expanded correction sample pairs include the correction sample pair and the expanded sample pairs.

[0099] As one possible implementation, when determining candidate samples related to the biased samples from the initial training data, the expansion unit 503 can be specifically configured to: use the initial training samples whose similarity between the sample question information and the biased question information meets the preset similarity conditions as candidate samples.

[0100] As one possible implementation method, when expanding the correction sample pair based on the candidate samples to obtain expanded sample pairs, the expansion unit 503 can be specifically configured as follows: determining heterolabeled samples from the candidate samples, wherein the difference between the sample response information and the true response information in the heterolabeled samples satisfies a preset second difference condition; expanding the correction sample pair based on the heterolabeled samples to obtain expanded sample pairs, wherein the expanded sample pairs include a first expanded sample pair and / or a second expanded sample pair, wherein the first expanded sample pair uses the sample question information and the corresponding sample response information in the heterolabeled samples as positive samples and the sample question information and the deviation response information in the heterolabeled samples as negative samples, and the second expanded sample pair uses the deviation question information and the sample response information in the heterolabeled samples as positive samples and the deviation question information and the deviation response information as negative samples.

[0101] As one possible implementation method, the generation unit 505 can be specifically configured to: in response to a user-triggered submission event of target question information, call the target model to generate target response information based on the target question information.

[0102] The acquisition unit 501, when acquiring deviation samples, can be specifically configured to: respond to a negative feedback event triggered by the user for the target response information, determine the target question information as deviation question information, determine the target response information as deviation response information, and acquire the real response information corresponding to the deviation question information to construct deviation samples.
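The feedback-driven acquisition of deviation samples can be sketched as a small handler. Everything named here is hypothetical: `DeviationSample`, `on_negative_feedback`, and the `fetch_true_response` callback (which stands in for whatever labeling or review process supplies the true response information) are illustrative assumptions, and simple inequality is used as the first difference condition, as in the classification case.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DeviationSample:
    question: str            # deviation question information
    deviation_response: str  # response the target model actually produced
    true_response: str       # response it should have produced


def on_negative_feedback(
    target_question: str,
    target_response: str,
    fetch_true_response: Callable[[str], Optional[str]],
) -> Optional[DeviationSample]:
    """Build a deviation sample when a user flags a response as wrong."""
    true_response = fetch_true_response(target_question)
    # Skip if no true response is available, or if it matches the model's
    # output (the first difference condition would not be satisfied).
    if true_response is None or true_response == target_response:
        return None
    return DeviationSample(target_question, target_response, true_response)


# Hypothetical lookup of reviewed answers keyed by question.
reviewed = {"What writing style does this passage belong to?": "poetry"}
sample = on_negative_feedback(
    "What writing style does this passage belong to?", "prose", reviewed.get
)
```

Returning `None` when the reviewed answer agrees with the model's output keeps false alarms from user feedback out of the correction data.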

[0103] As one possible implementation, when training the target model based on the expanded correction sample pairs, the training unit 504 can be specifically configured to: train the target model using any one of contrastive learning, ranking learning, or preference learning based on the expanded correction sample pairs.

[0104] As one possible approach, the target model is used to perform classification tasks.

[0105] The difference between the deviation response information and the true response information satisfies the preset first difference condition, including: the category represented by the deviation response information is different from the category represented by the true response information.

[0106] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0107] Figure 6 shows a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0108] As shown in Figure 6, device 600 includes a computing unit 601, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 602 or a computer program loaded into random access memory (RAM) 603 from storage unit 608. RAM 603 may also store various programs and data required for the operation of device 600. The computing unit 601, ROM 602, and RAM 603 are interconnected via bus 604. Input / output (I / O) interface 605 is also connected to bus 604.

[0109] Multiple components in device 600 are connected to I / O interface 605, including: input unit 606, such as keyboard, mouse, etc.; output unit 607, such as various types of monitors, speakers, etc.; storage unit 608, such as disk, optical disk, etc.; and communication unit 609, such as network card, modem, wireless transceiver, etc. Communication unit 609 allows device 600 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0110] The computing unit 601 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as model training methods. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and / or installed on device 600 via ROM 602 and / or communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform model training methods by any other suitable means (e.g., by means of firmware).

[0111] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0112] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0113] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0114] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0115] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0116] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0117] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0118] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A model training method, comprising: obtaining a deviation sample, wherein the deviation sample includes deviation question information, corresponding deviation response information, and corresponding true response information, the deviation response information is obtained by a target model based on the deviation question information, the difference between the deviation response information and the true response information satisfies a preset first difference condition, and the target model is trained based on preset initial training data; constructing a correction sample pair, wherein the correction sample pair takes the deviation question information and the deviation response information as negative samples and takes the deviation question information and the true response information as positive samples; expanding the correction sample pair based on the initial training data to obtain expanded correction sample pairs; and training the target model based on the expanded correction sample pairs.

2. The method according to claim 1, wherein, The initial training data includes multiple initial training samples, and the initial training samples include sample question information and corresponding sample response information; The step of augmenting the correction sample pairs based on the initial training data to obtain augmented correction sample pairs includes: From the initial training data, determine candidate samples related to the biased samples; The correction sample pair is expanded based on the candidate sample to obtain an expanded sample pair. In the expanded sample pair, the sample question information and the corresponding sample response information in the candidate sample are positive samples, and the sample question information and the deviation response information in the candidate sample are negative samples. The expanded correction sample includes the correction sample pair and the expanded sample pair.

3. The method according to claim 2, wherein the determining, from the initial training data, candidate samples related to the deviation sample comprises: taking, as the candidate samples, initial training samples for which the similarity between the included sample question information and the deviation question information satisfies a preset similarity condition.

4. The method according to claim 2, wherein the expanding the correction sample pair based on the candidate samples to obtain an expanded sample pair comprises: determining differently labeled samples from the candidate samples, wherein a difference between the sample response information of a differently labeled sample and the true response information satisfies a preset second difference condition; and expanding the correction sample pair based on the differently labeled samples to obtain the expanded sample pair, wherein the expanded sample pair includes a first expanded sample pair and/or a second expanded sample pair; the first expanded sample pair takes the sample question information and the corresponding sample response information of the differently labeled sample as a positive sample, and the sample question information of the differently labeled sample and the deviation response information as a negative sample; and the second expanded sample pair takes the deviation question information and the sample response information of the differently labeled sample as a positive sample, and the deviation question information and the deviation response information as a negative sample.

5. The method according to claim 1, further comprising: in response to a submission event of target question information triggered by a user, invoking the target model to generate target response information based on the target question information; wherein the obtaining a deviation sample comprises: in response to a negative feedback event triggered by the user for the target response information, taking the target question information as the deviation question information, taking the target response information as the deviation response information, and obtaining the true response information corresponding to the deviation question information, so as to construct the deviation sample.

6. The method according to claim 1, wherein the training the target model based on the expanded correction sample pairs comprises: training the target model based on the expanded correction sample pairs using any one of contrastive learning, ranking learning, or preference learning.

7. The method according to claim 1, wherein the target model is used to perform a classification task; and the difference between the deviation response information and the true response information satisfying the preset first difference condition comprises: a category represented by the deviation response information being different from a category represented by the true response information.

8. A model training device, comprising: an acquisition unit configured to acquire a deviation sample, the deviation sample including deviation question information, corresponding deviation response information, and corresponding true response information, wherein the deviation response information is obtained by a target model based on the deviation question information, a difference between the deviation response information and the true response information satisfies a preset first difference condition, and the target model is trained based on preset initial training data; a construction unit configured to construct a correction sample pair, in which the deviation question information and the deviation response information serve as a negative sample, and the deviation question information and the true response information serve as a positive sample; an expansion unit configured to expand the correction sample pair based on the initial training data to obtain expanded correction sample pairs; and a training unit configured to train the target model based on the expanded correction sample pairs.

9. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.

10. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-7.

11. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-7.
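
As a non-limiting illustration, the pair construction of claim 1, the candidate selection and expansion of claims 2 and 3, and one possible training objective under claim 6 might be sketched in Python as follows. Everything named here is an assumption made for illustration and does not come from the disclosure: the `SamplePair` type, the `build_expanded_pairs` and `pairwise_loss` functions, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.6 threshold, the skipping of degenerate pairs, and the logistic pairwise loss as an example of preference learning.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class SamplePair:
    """A question with one preferred (positive) and one rejected (negative) response."""
    question: str
    positive: str
    negative: str


def similarity(a: str, b: str) -> float:
    # Assumption: a character-level ratio stands in for the "preset similarity
    # condition"; the claims do not name a similarity measure.
    return SequenceMatcher(None, a, b).ratio()


def build_expanded_pairs(dev_question, dev_response, true_response,
                         initial_data, threshold=0.6):
    """Return the correction pair plus expanded pairs from related initial samples."""
    # Claim 1: same deviation question, true response preferred over deviation response.
    pairs = [SamplePair(dev_question, true_response, dev_response)]
    for sample_question, sample_response in initial_data:
        # Claim 3: keep initial samples whose question is similar enough.
        if similarity(sample_question, dev_question) < threshold:
            continue
        # Assumption (not in the claims): skip pairs whose two sides would be identical.
        if sample_response == dev_response:
            continue
        # Claim 2: candidate's own response preferred over the deviation response.
        pairs.append(SamplePair(sample_question, sample_response, dev_response))
    return pairs


def pairwise_loss(score_positive: float, score_negative: float) -> float:
    # Assumption: a logistic pairwise loss, -log(sigmoid(s_pos - s_neg)), as one
    # common preference-learning objective; the claims do not fix a loss.
    margin = score_positive - score_negative
    return math.log1p(math.exp(-margin))


# Hypothetical classification-style data: (sample question, sample response).
initial_data = [
    ("How do I reset my password?", "account_help"),
    ("How can I reset my password", "account_help"),
    ("What is the weather today?", "weather"),
]
pairs = build_expanded_pairs(
    dev_question="How do I reset the password?",
    dev_response="billing",
    true_response="account_help",
    initial_data=initial_data,
)
```

In this sketch the unrelated weather sample falls below the similarity threshold and contributes no pair, while the two similar password questions each yield an expanded pair that keeps their original response as the positive side. The loss decreases as the model scores the positive response above the negative one, which is the property a contrastive, ranking, or preference objective would rely on.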