Data processing model training method and system, and question answering method

By selecting and expanding the answer reasoning steps in a large language model and using the REINFORCE algorithm for fine-tuning training, the problem of insufficient data collection in traditional methods is solved, the reasoning and answering capabilities of the model are improved, and it is adapted to multi-domain question answering tasks.

CN122240756A · Pending · Publication Date: 2026-06-19 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ALIBABA (CHINA) CO LTD
Filing Date
2024-12-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional supervised fine-tuning and reinforcement learning methods struggle to collect sufficient and effective model inference data, which limits the improvement of reasoning ability of large language models in long inference chain problems.

Method used

By inputting sample questions into a pre-trained data processing model, selecting the target answer reasoning step and the preceding answer reasoning step for data expansion, and using the REINFORCE algorithm to fine-tune the model, the target data processing model is obtained.

Benefits of technology

It improves the understanding and reasoning capabilities of data processing models, enhances their ability to answer questions, and adapts to the question-and-answer needs of different fields.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122240756A_ABST]

Abstract

This specification provides a data processing model training method and system, and a question-answering method. The data processing model training method includes: inputting a sample question into a pre-trained data processing model to obtain answer reasoning steps; selecting a target answer reasoning step and its preceding answer reasoning steps from the answer reasoning steps, and expanding the target answer reasoning step based on the preceding answer reasoning steps to obtain an expanded answer reasoning step; and fine-tuning the pre-trained data processing model based on the preceding answer reasoning steps, the expanded answer reasoning steps, the sample question, and the sample answer to the sample question to obtain the target data processing model. By expanding the target answer reasoning step and then performing targeted fine-tuning training on the pre-trained large language model, the target data processing model's ability to understand and answer questions is improved.

Description

Technical Field

[0001] This specification relates to the field of artificial intelligence technology, and in particular to a data processing model training method. One or more embodiments of this specification also relate to a data processing model training system, a data processing model training apparatus, a question-answering method, a question-answering device, a computing device, a computer-readable storage medium, and a computer program product.

Background Technology

[0002] With the continuous development of artificial intelligence technology, large language models (LLM) are being applied to various question-answering scenarios to perform multi-domain question-answering tasks. For example, the reasoning ability of large language models can be used to solve mathematical problems step by step, and the reasoning ability of large language models can be used to generate code.

[0003] When large language models use their reasoning abilities to solve mathematical and common-sense reasoning problems, traditional supervised fine-tuning (SFT) and reinforcement learning methods face the challenge of collecting sufficient and effective model reasoning data, because reasoning problems involve long reasoning chains and exhibit reward sparsity. Consequently, the reasoning capabilities of large language models cannot be significantly improved. How to collect more effective training data to enhance the reasoning abilities of large language models and improve their accuracy in question-answering tasks has therefore become a pressing issue.

Summary of the Invention

[0004] In view of the above, embodiments of this specification provide a data processing model training method. One or more embodiments of this specification also relate to a data processing model training system, a data processing model training apparatus, a question-answering method, a question-answering apparatus, a computing device, a computer-readable storage medium, and a computer program product, to address the technical deficiencies existing in the prior art.

[0005] According to a first aspect of the embodiments of this specification, a data processing model training method is provided, comprising: inputting a sample question into a pre-trained data processing model to obtain answer reasoning steps; selecting, from the answer reasoning steps, a target answer reasoning step and the preceding answer reasoning steps of the target answer reasoning step, and performing data expansion on the target answer reasoning step based on the preceding answer reasoning steps to obtain expanded answer reasoning steps; and fine-tuning the pre-trained data processing model based on the preceding answer reasoning steps, the expanded answer reasoning steps, the sample question, and the sample answer to the sample question, to obtain a target data processing model.

[0006] According to a second aspect of the embodiments of this specification, a data processing model training system is provided, including an edge-side device and a cloud-side device. The edge-side device is used to generate a data processing model training request based on training samples and a pre-trained data processing model, and to send the data processing model training request to the cloud-side device. The cloud-side device is used to: determine a sample question from the training samples and input the sample question into the pre-trained data processing model to obtain answer reasoning steps; select, from the answer reasoning steps, a target answer reasoning step and its preceding answer reasoning steps, and perform data expansion on the target answer reasoning step based on the preceding answer reasoning steps to obtain expanded answer reasoning steps; and fine-tune the pre-trained data processing model based on the preceding answer reasoning steps, the expanded answer reasoning steps, the sample question, and the sample answer to the sample question, to obtain a target data processing model.

[0007] According to a third aspect of the embodiments of this specification, a question-answering method is provided, applied to a server, comprising: receiving target question data and inputting the target question data into a target data processing model to obtain target answer reasoning steps corresponding to a target problem-solving strategy, wherein the target data processing model is obtained through the above data processing model training method; and executing the target answer reasoning steps step by step using the target data processing model to obtain target answer data corresponding to the target question data.

[0008] According to a fourth aspect of the embodiments of this specification, a data processing model training apparatus is provided, comprising: an input module configured to input a sample question into a pre-trained data processing model to obtain answer reasoning steps; a selection module configured to select a target answer reasoning step and its preceding answer reasoning steps from the answer reasoning steps, and to perform data expansion on the target answer reasoning step based on the preceding answer reasoning steps to obtain expanded answer reasoning steps; and a training module configured to fine-tune the pre-trained data processing model based on the preceding answer reasoning steps, the expanded answer reasoning steps, the sample question, and the sample answer to the sample question, to obtain a target data processing model.

[0009] According to a fifth aspect of the embodiments of this specification, a question-answering device is provided, applied to a server, comprising: a receiving module configured to receive target question data and input the target question data into a target data processing model to obtain target answer reasoning steps corresponding to a target problem-solving strategy, wherein the target data processing model is obtained through the above data processing model training method; and an execution module configured to use the target data processing model to execute the target answer reasoning steps step by step to obtain target answer data corresponding to the target question data.

[0010] According to a sixth aspect of the embodiments of this specification, a computing device is provided, comprising a memory and a processor, wherein the memory is used to store a computer program / instructions, and the processor is used to execute the computer program / instructions, which implement the steps of the above method when executed by the processor.

[0011] According to a seventh aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores a computer program / instructions that, when executed by a processor, implement the steps of the above-described method.

[0012] According to an eighth aspect of the embodiments of this specification, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described method.

[0013] This specification provides an embodiment of a data processing model training method that aims to improve the problem reasoning ability of the data processing model through fine-tuning training, so that user needs and problems can be better understood and analyzed. After the data processing model is pre-trained, a sample question is input into the pre-trained model to obtain answer reasoning steps. From the answer reasoning steps, a target answer reasoning step and its preceding answer reasoning steps are selected, and the target answer reasoning step is expanded based on the preceding answer reasoning steps to obtain expanded answer reasoning steps. This allows the data processing model to explore solutions to the problem using the target answer reasoning step, helping it find more effective training data. Based on the preceding answer reasoning steps, the expanded answer reasoning steps, the sample question, and the sample answer to the sample question, the pre-trained data processing model is fine-tuned to obtain the target data processing model. By performing data expansion on the target answer reasoning step and then fine-tuning the pre-trained data processing model, the target data processing model's ability to understand, reason, and solve problems is improved, which makes the model's learning more effective.

Attached Figure Description

[0014] Figure 1 is a schematic diagram illustrating the application of a data processing model training method provided in one embodiment of this specification;
Figure 2 is a flowchart illustrating a data processing model training method provided in one embodiment of this specification;
Figure 3 is a flowchart illustrating the processing procedure of a data processing model training method provided in one embodiment of this specification;
Figure 4 is a schematic diagram of the structure of a data processing model training system provided in one embodiment of this specification;
Figure 5 is a flowchart illustrating a question-and-answer method provided in one embodiment of this specification;
Figure 6 is a schematic diagram of the structure of a data processing model training device provided in one embodiment of this specification;
Figure 7 is a schematic diagram of the structure of a question-and-answer device provided in one embodiment of this specification;
Figure 8 is a structural block diagram of a computing device provided in one embodiment of this specification;
Figure 9 is a structural block diagram of an electronic device provided in one embodiment of this specification.

Detailed Implementation

[0015] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.

[0016] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.

[0017] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."

[0018] Furthermore, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this specification are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0019] In one or more embodiments of this specification, a large model refers to a deep learning model with a large number of model parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of model parameters. A large model can also be called a foundation model. It is pre-trained using large-scale unlabeled corpora to produce a pre-trained model with hundreds of millions of parameters. Such models can adapt to a wide range of downstream tasks and have good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.

[0020] In practical applications, large models only require a small number of samples to fine-tune the pre-trained model before they can be applied to different tasks. Large models can be widely used in fields such as Natural Language Processing (NLP) and Computer Vision. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as natural language processing tasks such as text-based sentiment classification, text summarization, and machine translation. The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.

[0021] First, the terms and concepts used in one or more embodiments of this specification will be explained.

[0022] LLM: Large Language Model. A large language model is a neural network pre-trained on a large amount of data that can be used to perform various natural language processing tasks, such as text generation, translation, summarization, question answering, etc.

[0023] SFT: Supervised Fine-Tuning. This is a training method where a pre-trained model is fine-tuned using labeled data on a specific task to improve its performance on that task.

[0024] The REINFORCE algorithm is a policy gradient algorithm used for policy optimization in reinforcement learning. It optimizes the expected reward of a neural network policy by interacting with the environment. The core idea of the REINFORCE algorithm is to update the policy parameters by gradient ascent along the gradient of the expected reward, thereby increasing the cumulative reward.
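As a minimal sketch of the update rule described above, the following Python example applies REINFORCE to a softmax policy over a small set of discrete actions. The two-armed bandit environment, the function names, the learning rate, and the reward probabilities are all assumptions made for illustration; they are not taken from this specification, which applies the algorithm to a language model rather than a bandit.

```python
import math
import random


def softmax(logits):
    """Convert policy parameters (logits) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One gradient-ascent update: theta += lr * reward * grad log pi(action).

    For a softmax policy, d log pi(action) / d theta_i = 1[i == action] - pi_i.
    """
    probs = softmax(logits)
    return [
        theta + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, theta in enumerate(logits)
    ]


def train_bandit(reward_probs, steps=2000, lr=0.1, seed=0):
    """Hypothetical environment: arm i pays reward 1 with probability reward_probs[i]."""
    rng = random.Random(seed)
    logits = [0.0] * len(reward_probs)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        logits = reinforce_step(logits, action, reward, lr)
    return softmax(logits)


if __name__ == "__main__":
    # Prints the learned action probabilities for two arms.
    print(train_bandit([0.2, 0.8]))
```

Actions that are followed by reward have their probability increased in proportion to the reward, which is the gradient-ascent behavior the paragraph describes. No baseline or variance reduction is included in this sketch.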

[0025] Jaccard similarity: A metric used to measure the similarity between two sets, defined as the ratio of the size of the intersection to the size of the union of the two sets.
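The definition above can be written directly as a short Python function. This is an illustrative sketch; the function name, the example keyword sets, and the convention of returning 1.0 for two empty sets are assumptions, not part of this specification.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical keyword sets for two products.
shoes = {"running", "shoes", "lightweight"}
sneakers = {"running", "sneakers", "lightweight", "mesh"}

if __name__ == "__main__":
    # Two shared keywords out of five distinct keywords in total.
    print(jaccard_similarity(shoes, sneakers))
```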

[0026] KL (Kullback-Leibler Divergence) constraint: Kullback-Leibler (KL) divergence, also known as relative entropy, is an important concept in information theory and statistics used to measure the difference between two probability distributions.
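For discrete distributions, KL divergence is KL(P || Q) = Σ p_i · log(p_i / q_i). The sketch below computes it in Python; the function name and the example distributions are assumptions for illustration. In fine-tuning settings a KL term is commonly used to limit how far an updated model drifts from a reference model, but how the constraint is applied here is not stated in this passage.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions given as lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q_i) is taken as 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    # KL divergence is not symmetric, so the two values differ.
    print(kl_divergence(p, q), kl_divergence(q, p))
```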

[0027] With the continuous development of artificial intelligence technology, Large Language Models (LLMs) are being applied to various question-answering scenarios to perform multi-domain question-answering tasks. In LLM applications, strong reasoning ability is crucial for enhancing the model's practical application capabilities, such as solving mathematical reasoning and common-sense reasoning problems. Due to the sparse reward nature of reasoning problems, traditional supervised fine-tuning (SFT) and reinforcement learning (RL) methods face the challenge of collecting sufficient and effective model reasoning data when improving reasoning capabilities. In contrast, methods using efficient exploration techniques can fully analyze and utilize the model's existing reasoning capabilities, collecting targeted data that is difficult to explore using traditional methods, thereby improving the model's reasoning ability. In practical applications, large language models with stronger reasoning capabilities can better understand and analyze user needs and questions, thus playing a greater role in various application domains, such as mathematical reasoning, code generation, and task planning.

[0028] For large models, performing reasoning with long chains, such as mathematical reasoning and code generation, has always been a challenging problem. If only the correctness of the answer is used as the reward, the reasoning problem itself has characteristics such as sparse rewards and a huge exploration space, making efficient exploration an important direction for optimizing reasoning capabilities.

[0029] Based on this, this specification provides a data processing model training method, a data processing model training device, a question-answering method, a question-answering device, a computing device, a computer-readable storage medium, and a computer program product, which will be described in detail in the following embodiments.

[0030] Referring to Figure 1, Figure 1 is a schematic diagram illustrating an application of a data processing model training method according to an embodiment of this specification. The data processing model training method includes the following steps. Input the sample question into the pre-trained data processing model to obtain the answer reasoning steps.

[0031] From the answer reasoning steps, select the target answer reasoning step and the preceding answer reasoning step of the target answer reasoning step, and expand the target answer reasoning step with data based on the preceding answer reasoning step to obtain the extended answer reasoning step.

[0032] Based on the preceding answer reasoning steps, the extended answer reasoning steps, the sample question, and the sample answer to the sample question, the pre-trained data processing model is fine-tuned to obtain the target data processing model.

[0033] After pre-training the data processing model, a sample problem is input into the pre-trained model to obtain answer reasoning steps. From these steps, a target answer reasoning step and its preceding answer reasoning steps are selected. Based on the preceding answer reasoning steps, the target answer reasoning step is expanded to obtain an extended answer reasoning step. This allows the data processing model to explore solutions to problems using the target answer reasoning step, helping it find more effective training data. From the answer reasoning steps, the preceding answer reasoning steps are selected. Based on these preceding answer reasoning steps, the extended answer reasoning steps, the sample problem, and the sample answer to the sample problem, the pre-trained data processing model is fine-tuned to obtain the target data processing model. By expanding the target answer reasoning step and then fine-tuning the pre-trained model, the target data processing model's ability to understand, reason, and solve problems is improved, enhancing the effectiveness of the model's learning process.

[0034] For example, to meet diverse user question-and-answer needs, training samples from different domains can be constructed. Specifically, in the scenario of solving math problems, math problem data can be provided to fine-tune the pre-trained data processing model. Math problem data includes the math problem, the solution approach, the corresponding solution steps, the answer for each solution step, and the answer to the math problem itself. When fine-tuning the pre-trained data processing model, a math problem is selected and input into the model to obtain the model's output answer reasoning steps. An extended search of solution methods is performed on each answer reasoning step to obtain the correct and incorrect solution steps, i.e., extended answer reasoning steps. When fine-tuning the pre-trained data processing model based on these extended answer reasoning steps, the model can be fine-tuned separately for the reasoning ability of each answer reasoning step. Furthermore, for the common sense domain, common sense questions and their corresponding answers can be selected to fine-tune the pre-trained data processing model, enabling the trained model to reason and solve common sense questions. In the legal field, legal questions and their corresponding answers can be selected to fine-tune the pre-trained data processing model, enabling the trained model to reason and solve legal questions.

[0035] Referring to Figure 2, Figure 2 is a flowchart of a data processing model training method according to an embodiment of this specification, which specifically includes the following steps.

[0036] Step 202: Input the sample question into the pre-trained data processing model to obtain the answer reasoning steps.

[0037] The sample questions can be from various domains such as e-commerce, education, law, scientific research, mathematics, code generation, and content recommendation. The sample questions are multi-step reasoning problems, meaning the sample answer contains at least two reasoning steps. The pre-trained data processing model can be a pre-trained large language model, i.e., a domain-specific data processing model obtained by pre-training the data processing model based on domain data, used to handle questions in a specific domain. The answer reasoning steps are the multiple reasoning steps obtained by the pre-trained data processing model in solving the sample question. Each reasoning step corresponds to a step answer, and the reasoning step at the end of the answer reasoning steps corresponds to the predicted answer to the sample question.

[0038] In practical applications, sample questions can serve as training samples during the pre-training of the data processing model. Based on these training samples, the pre-trained data processing model can be further fine-tuned. When predicting answers to sample questions, the pre-trained data processing model can first determine the answering strategy, then determine the solution steps based on the strategy, and finally solve the problem step by step according to the determined solution steps to obtain the answer reasoning steps.

[0039] For example, when fine-tuning a pre-trained data processing model capable of generating code based on given requirements using code generation problems in the code domain, the sample problem could be a code generation problem: "You need to write a product recommendation system for an e-commerce website. This system should recommend products that users might be interested in based on their purchase history and browsing history. You need to design and implement an algorithm that can calculate the similarity between products and recommend products to users based on the similarity." Inputting this code generation problem as a sample problem into the pre-trained data processing model yields the following reasoning steps from the model's output:

Step 1: Define the requirements and problem
Requirement: Develop a product recommendation system for an e-commerce website.

[0040] Problem Definition: Calculate the similarity between products based on a user's purchase history and browsing history. Recommend products that the user may be interested in based on the similarity score.

[0041] Step 2: Data Preparation and Preprocessing
Data collection: Collect users' purchase history and browsing records. Collect basic product information, such as product ID, name, category, and price.

[0042] Data preprocessing: Clean user data, removing duplicates and invalid data. Construct a user-product interaction matrix to represent user purchasing and browsing behavior. Standardize product data to ensure that features of different categories have equal weight in similarity calculations.

[0043] Step 3: Algorithm Design and Implementation
Algorithm selection: Choose a suitable similarity calculation method, such as cosine similarity or content-based similarity calculation. In this example, we assume that cosine similarity is used to calculate the similarity between products.

[0044] Algorithm Implementation: Calculate the cosine similarity between products based on the user-product interaction matrix. Construct a product similarity matrix to store the similarity information between products. Recommend products with high similarity to the user based on their purchase history and browsing history.

[0045] Step 4: Generate code based on the algorithm design
Step 5: Testing and Evaluation
Testing: The recommendation system is tested using a test dataset to ensure it can correctly calculate the similarity between items and recommend suitable items to the user. The system's stability and reliability are also tested to ensure it can operate normally under high concurrency and large data volumes.

[0046] Evaluation: Use appropriate evaluation metrics, such as precision, recall, and F1 score, to evaluate the performance of the recommendation system. Based on the evaluation results, optimize and improve the system to enhance its accuracy and user satisfaction.
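To make Step 3 of the example concrete, the following sketch computes item-to-item cosine similarity from a small user-product interaction matrix and ranks unseen products for one user. It is an illustration only and is not the code the model in this specification generates; the function names, the binary interaction matrix, and the scoring rule (sum of similarities to products the user has already interacted with) are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def item_similarity_matrix(interactions):
    """interactions[u][i] is 1 if user u bought or browsed product i, else 0."""
    n_items = len(interactions[0])
    columns = [[row[i] for row in interactions] for i in range(n_items)]
    return [
        [cosine_similarity(columns[i], columns[j]) for j in range(n_items)]
        for i in range(n_items)
    ]


def recommend(interactions, user, top_k=2):
    """Rank products the user has not seen by total similarity to products they have seen."""
    sims = item_similarity_matrix(interactions)
    seen = {i for i, x in enumerate(interactions[user]) if x}
    scores = {
        i: sum(sims[i][j] for j in seen)
        for i in range(len(sims))
        if i not in seen
    }
    # Higher score first; ties broken by product index.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_k]


# Hypothetical data: 3 users (rows) and 4 products (columns).
matrix = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]

if __name__ == "__main__":
    print(recommend(matrix, user=0))
```

For user 0, product 2 ranks ahead of product 3 because it shares a user with products 0 and 1, while product 3 shares none.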

[0047] In one or more embodiments provided in this specification, before fine-tuning the model, the data processing model needs to have the ability to reason about and solve domain problems. By pre-training the data processing model, the data processing model can be made to have the corresponding problem-solving ability. The specific implementation method is as follows.

[0048] The problem samples in the pre-training samples are input into the initial data processing model to obtain the predicted answer corresponding to the problem-solving strategy; the pre-training loss value is calculated based on the pre-training loss function, the predicted answer and the answer sample corresponding to the problem sample, and the parameters of the initial data processing model are tuned based on the pre-training loss value until the data processing model that meets the pre-training stopping condition is obtained.

[0049] The pre-training samples can be domain-specific question-and-answer data selected based on the question-and-answer requirements of the initial data processing model. The initial data processing model is the data processing model that has not undergone model training. The problem-solving strategy refers to the approach to solving the problem samples. For example, for the problem sample "What is the result of adding the integers from 1 to 10 in sequence?", the solution strategy could be to first calculate 1+2, then add that result to 3, and so on, until 10 has been added and the final sum is obtained. The predicted answer is the answer obtained by the initial data processing model in predicting the problem sample. The predicted answer includes prediction inference steps, which refer to the problem-solving steps of the predicted answer. The pre-training loss function is used to calculate the loss value of the model training during the pre-training process of the initial data processing model. The pre-training stopping condition can be that the model training has reached a preset number of iterations, that the prediction accuracy of the pre-trained data processing model has reached an accuracy threshold, or that a preset pre-training time has been reached. This embodiment does not impose any limitations on the pre-training stopping condition.

[0050] In practical applications, the pre-training samples contain multiple pairs of question and answer samples, which can support multiple rounds of iterative training of the initial data processing model. After the pre-training loss value is calculated based on the pre-training loss function, the predicted answer, and the answer sample corresponding to the question sample, the initial data processing model can be tuned based on the pre-training loss value to obtain an intermediate data processing model. The next question sample is selected from the pre-training samples and input into the intermediate data processing model for prediction, obtaining an intermediate predicted answer. The loss value is then calculated again based on the pre-training loss function, the intermediate predicted answer, and the answer sample corresponding to that question sample, and the intermediate data processing model is tuned based on the calculated loss value. This process is repeated until a pre-trained data processing model that meets the pre-training stopping condition is obtained.
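The iterative procedure and the three example stopping conditions (iteration count, accuracy threshold, elapsed time) can be sketched as a generic loop. Everything named here is hypothetical: `pretrain`, `ToyModel`, the squared-error loss, and the numeric samples stand in for a real model, loss function, and question-answer data, none of which are specified in this passage.

```python
import time


class ToyModel:
    """Stand-in for the initial data processing model: predicts answer = w * question."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x

    def update(self, x, prediction, answer):
        # Gradient step on the squared error (prediction - answer) ** 2 with respect to w.
        self.w -= self.lr * 2.0 * (prediction - answer) * x

    def accuracy(self, samples, tol=0.05):
        hits = sum(abs(self.predict(x) - y) < tol for x, y in samples)
        return hits / len(samples)


def pretrain(model, samples, max_iterations=1000, accuracy_threshold=0.95, max_seconds=60.0):
    """Tune the model on one sample per iteration until a stopping condition is met."""
    start = time.monotonic()
    iteration = 0
    while True:
        if iteration >= max_iterations:
            return "max_iterations", iteration
        if model.accuracy(samples) >= accuracy_threshold:
            return "accuracy_threshold", iteration
        if time.monotonic() - start >= max_seconds:
            return "max_seconds", iteration
        question, answer = samples[iteration % len(samples)]
        prediction = model.predict(question)
        model.update(question, prediction, answer)  # yields the "intermediate" model
        iteration += 1


# Hypothetical question/answer pairs generated by answer = 3 * question.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

if __name__ == "__main__":
    model = ToyModel()
    print(pretrain(model, samples), model.w)
```

Returning the reason for stopping alongside the iteration count makes it easy to see which of the conditions ended the run.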

[0051] Step 204: From the answer reasoning steps, select the target answer reasoning step and the preceding answer reasoning step of the target answer reasoning step, and perform data expansion on the target answer reasoning step based on the preceding answer reasoning step to obtain the expanded answer reasoning step.

[0052] The target answer reasoning step can be at least one selected reasoning step from the answer reasoning steps, or it can be all the reasoning steps in the answer reasoning steps. The preceding answer reasoning steps refer to the reasoning steps in the answer reasoning steps that precede the target answer reasoning step. Data expansion refers to expanding the target answer reasoning step from the perspective of the solution method. The extended answer reasoning steps include reasoning answers that can be either correct or incorrect.

[0053] Following the previous example, after the sample question is input into the pre-trained data processing model and the five answer reasoning steps output by the model are obtained, one of these steps can be chosen as the target answer reasoning step, or all five can be used as target answer reasoning steps. When Step 3 is chosen for data expansion, the algorithm can be expanded in various ways. One resulting extended answer reasoning step could be a content-based recommendation algorithm. Its principle is to recommend products based on product attributes and features (such as title, description, category, and price) and the user's historical behavior (such as purchase history and browsing history), on the assumption that users will like other products whose content is similar to products they have previously liked. The implementation steps are: feature extraction, which extracts key features from product descriptions, attributes, categories, brands, and so on; similarity calculation, which calculates the similarity between product features and user interests using common methods such as cosine similarity and Jaccard similarity, where accuracy can be improved by introducing word weights and topic clustering; and recommendation generation, which displays highly similar products to the user as recommendations. In this case, Steps 1 and 2 are the preceding answer reasoning steps of the target answer reasoning step.
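The selection of a target step and its preceding steps in this example can be sketched as a simple list split. The helper names, the 0-based index, and the stub generator are assumptions for illustration; in particular, `propose_alternatives` is a hypothetical placeholder for whatever produces the expanded steps, which this passage does not specify.

```python
def split_steps(steps, target_index):
    """Return (preceding_steps, target_step) for a 0-based target index."""
    if not 0 <= target_index < len(steps):
        raise IndexError("target_index is outside the list of reasoning steps")
    return steps[:target_index], steps[target_index]


def expand_target_step(preceding, target, propose_alternatives, n=2):
    """Collect n alternative versions of the target step, conditioned on the preceding steps.

    propose_alternatives is a hypothetical callable standing in for the real
    generator of alternatives (for example, sampling from the model).
    """
    return list(propose_alternatives(preceding, target, n))


def stub_generator(preceding, target, n):
    # Placeholder only: labels variants instead of generating real reasoning text.
    return [
        f"{target} (alternative {k + 1}, given {len(preceding)} prior steps)"
        for k in range(n)
    ]


steps = [
    "Step 1: Define the requirements and problem",
    "Step 2: Data Preparation and Preprocessing",
    "Step 3: Algorithm Design and Implementation",
    "Step 4: Generate code based on algorithm design",
    "Step 5: Testing and Evaluation",
]

# Choose Step 3 (index 2) as the target; Steps 1 and 2 become the preceding steps.
preceding, target = split_steps(steps, 2)
expanded = expand_target_step(preceding, target, stub_generator)

if __name__ == "__main__":
    print(preceding)
    print(expanded)
```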

[0054] In one or more embodiments provided in this specification, before selecting the target answer reasoning step, it is also necessary to determine whether the sample question meets the fine-tuning training conditions. If the sample question meets the fine-tuning training conditions, the target answer reasoning step is selected. The specific implementation method is as follows.

[0055] Based on the answer reasoning steps, the model-predicted answer corresponding to the sample question is determined; if the model-predicted answer determines that the sample question meets the fine-tuning training conditions, the step of selecting the target answer reasoning step from the answer reasoning steps is executed.

[0056] Here, the model-predicted answer refers to the answer obtained when the pre-trained data processing model makes a prediction for the sample question. The predicted answer may coincide with the sample answer corresponding to the sample question, that is, the correct answer to the sample question. The fine-tuning training condition can be based on the prediction accuracy or the prediction error rate of the sample question in the historical model prediction process. The predicted answer obtained by the pre-trained data processing model for the sample question can be used as a factor in evaluating the fine-tuning training condition.

[0057] When the model-predicted answer is used to determine whether the sample question meets the fine-tuning training condition, and the condition is based on the prediction accuracy of the sample question in the historical model prediction process, a prediction accuracy greater than 50% means that the sample question meets the fine-tuning training condition, and the target answer reasoning step can be selected from the answer reasoning steps.

[0058] Likewise, when the condition is based on the prediction error rate of the sample question in the historical model prediction process, a prediction error rate greater than 50% means that the sample question meets the fine-tuning training condition, and the target answer reasoning step can be selected from the answer reasoning steps.

[0059] In one or more embodiments provided in this specification, in order to enable the pre-trained data processing model to find errors that are prone to be made in sample problems with high accuracy when fine-tuning the pre-trained data processing model, the erroneous reasoning steps in the answer reasoning steps can be used as target answer reasoning steps for data expansion. The specific implementation method is as follows.

[0060] If the prediction accuracy of the sample question is greater than the accuracy threshold, the answer reasoning step includes predicting the correct answer; at least one incorrect reasoning step is identified in the answer reasoning step, and the at least one incorrect reasoning step is taken as the target answer reasoning step.

[0061] The prediction accuracy of the sample question can be obtained by repeatedly inputting the sample question into a pre-trained data processing model. The accuracy threshold can be set according to the model training requirements. An incorrect reasoning step can be either an incorrect answer or an error in the reasoning process itself.

[0062] In practical applications, the sample question can be input into the pre-trained data processing model multiple times. These multiple predictions yield multiple predicted answers, and the proportion of correct predicted answers gives the prediction accuracy. The accuracy threshold can be set to 50%. If the prediction accuracy of the sample question is 68%, which is greater than the accuracy threshold, then at least one incorrect reasoning step can be selected from the answer reasoning steps that the pre-trained model produces for the sample question. Data expansion is performed based on this incorrect reasoning step to obtain both incorrect and correct target answer reasoning steps, effectively improving the diversity and coverage of the dataset. The correct target answer reasoning step could be a recommendation algorithm based on association rules. The principle is to analyze the relationships between products purchased or browsed by users, discover association rules between products (e.g., "users who buy product A also tend to buy product B"), and recommend products accordingly. The implementation steps are: data preprocessing, cleaning and organizing user purchase or browsing records; association rule mining, using association rule mining algorithms to extract association rules between products from the data; rule evaluation, assessing metrics such as the confidence and support of the association rules to identify effective ones; and recommendation generation, recommending products that the user may be interested in based on the association rules.

[0063] In one or more embodiments provided in this specification, in order to enable the pre-trained data processing model to find more correct solutions in sample problems with high error rates when fine-tuning the pre-trained data processing model, the correct reasoning steps in the answer reasoning steps can be used as the target answer reasoning steps for data expansion. The specific implementation method is as follows.

[0064] If the prediction accuracy of the sample question is less than or equal to the accuracy threshold, the answer reasoning step includes predicting an incorrect answer; at least one correct reasoning step is determined to be included in the answer reasoning step, and the at least one correct reasoning step is taken as the target answer reasoning step.

[0065] Here, a correct reasoning step can be either a correct answer or a correct reasoning process.

[0066] In practical applications, sample questions can be input multiple times into a pre-trained data processing model. Based on these multiple predictions, the model generates multiple predicted answers. The accuracy of these predictions can be used to calculate the prediction accuracy. An accuracy threshold can be set to 50%. If the prediction accuracy of the sample question is less than or equal to the accuracy threshold (e.g., 18%), then during the current prediction by the pre-trained model, at least one correct reasoning step can be selected from the answer reasoning steps. Data expansion is then performed based on these correct reasoning steps to obtain both incorrect and correct target answer reasoning steps, effectively improving the diversity and coverage of the dataset. Fine-tuning the pre-trained model using both incorrect and correct target answer reasoning steps enhances the training effect and improves the model's problem reasoning and solution capabilities.
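
The accuracy check described in the two cases above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this specification: `sample_answer_fn` is a hypothetical callable standing in for one prediction by the pre-trained data processing model, the function names and the default of 8 samples are assumptions, and the 50% threshold is taken from the example.

```python
from typing import Callable


def prediction_accuracy(question: str, reference: str,
                        sample_answer_fn: Callable[[str], str],
                        num_samples: int = 8) -> float:
    """Fraction of repeated predictions whose answer equals the sample answer."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    hits = sum(sample_answer_fn(question).strip() == reference.strip()
               for _ in range(num_samples))
    return hits / num_samples


def target_step_kind(accuracy: float, threshold: float = 0.5) -> str:
    """Kind of reasoning step to expand, following the two cases in the text."""
    # Accuracy above the threshold: expand an incorrect step to surface likely errors.
    # Accuracy at or below the threshold: expand a correct step to find more solutions.
    return "incorrect" if accuracy > threshold else "correct"
```

With a prediction accuracy of 0.68 the sketch selects an incorrect step for expansion, and with 0.18 it selects a correct step, matching the two examples given.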

[0067] In one or more embodiments provided in this specification, when expanding the data for the reasoning steps of the target answer, multiple predictions can be made using a pre-trained data processing model to obtain expanded answer reasoning steps with different answers. The specific implementation method is as follows.

[0068] The preceding answer reasoning steps and the sample question are input into the data processing model to obtain an initial extended answer reasoning step for data expansion of the target answer reasoning step; the reasoning step answers included in the initial extended answer reasoning step are determined, and the reasoning step sample answers of the target answer reasoning step are determined based on the sample answers; the predicted reward of the initial extended answer reasoning step is determined by comparing the reasoning step answers and the reasoning step sample answers; the initial extended answer reasoning step and the predicted reward are used as the extended answer reasoning step.

[0069] The initial extended answer reasoning step refers to an alternative solution method, other than the target answer reasoning step, that the data processing model produces when the preceding answer reasoning steps and the sample question are input into it and it explores other ways of carrying out the target answer reasoning step. The reasoning step answer refers to the step answer corresponding to the initial extended answer reasoning step. The reasoning step sample answer refers to the accurate answer, among the sample answers, that is set for the target answer reasoning step. The predicted reward represents the accuracy of the reasoning step answer included in the initial extended answer reasoning step.

[0070] Continuing with the previous example, if the initial extended answer reasoning step arrives at a correct answer, the predicted reward is set to 1; if it arrives at an incorrect answer, the predicted reward is set to 0. This is the setting used for reinforcement learning fine-tuning. When reinforcement learning is used to optimize and fine-tune the pre-trained data processing model, the reward signal for the entire process is amplified. Setting the reward to 1 for a correct answer and 0 for an incorrect answer guides the model toward generating correct answers without requiring the entire reasoning process to be labeled. This simplifies the overall process by replacing process rewards with outcome rewards.
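
The expansion and reward assignment described above can be sketched as follows. This is only an illustration under stated assumptions: `continue_fn` is a hypothetical callable that stands in for the data processing model and returns a continuation together with the answer it reaches, the exact-match comparison and the default of 4 rollouts are assumptions, and the function names are not from the specification.

```python
from typing import Callable, List, Tuple


def outcome_reward(step_answer: str, step_sample_answer: str) -> int:
    """Outcome reward: 1 if the continuation's answer matches the sample answer, else 0."""
    return 1 if step_answer.strip() == step_sample_answer.strip() else 0


def expand_target_step(question: str,
                       preceding_steps: List[str],
                       step_sample_answer: str,
                       continue_fn: Callable[[str, List[str]], Tuple[str, str]],
                       num_rollouts: int = 4) -> List[Tuple[str, int]]:
    """Sample several continuations from the same prefix and attach a 0/1 reward to each."""
    expanded = []
    for _ in range(num_rollouts):
        # continue_fn is a hypothetical model call returning (continuation_text, answer).
        continuation, answer = continue_fn(question, preceding_steps)
        expanded.append((continuation, outcome_reward(answer, step_sample_answer)))
    return expanded
```

Each returned pair corresponds to one extended answer reasoning step together with its predicted reward; no label is needed for the intermediate reasoning itself.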

[0071] Step 206: Based on the preceding answer reasoning step, the extended answer reasoning step, the sample question, and the sample answer to the sample question, perform model fine-tuning training on the pre-trained data processing model to obtain the target data processing model.

[0072] The REINFORCE algorithm can be used to fine-tune the pre-trained data processing model to improve its inference capabilities.

[0073] In practical applications, after expanding the data of the reasoning steps for the target answer to obtain the expanded reasoning steps, the pre-trained data processing model can be fine-tuned based on the preceding reasoning steps, the expanded reasoning steps, the sample question, and the sample answer to the sample question. Multiple rounds of iterative training can be performed during the fine-tuning of the pre-trained data processing model.

[0074] In one or more embodiments provided in this specification, in order to improve the inference ability of the pre-trained data processing model and to enhance the stability and effectiveness of the training process during model fine-tuning training, the REINFORCE algorithm is used in combination with techniques such as KL constraints and relative advantage calculation. The specific implementation method is as follows.

[0075] The preceding answer reasoning step, the extended answer reasoning step, and the sample question are input into the pre-trained data processing model to obtain a fine-tuned predicted answer associated with the target answer reasoning step. The model fine-tuning loss value is calculated based on the model fine-tuning loss function, the fine-tuned predicted answer, and the sample answer of the sample question. The parameters of the pre-trained data processing model are tuned based on the model fine-tuning loss value until the target data processing model that meets the model fine-tuning training stopping condition is obtained.

[0076] In this context, the fine-tuned predicted answer refers to the predicted answer obtained when the pre-trained data processing model, taking the extended answer reasoning steps into account, predicts the sample question again. The model fine-tuning loss function is constructed based on the REINFORCE algorithm and is used to calculate the loss value of the pre-trained data processing model during fine-tuning training, so that the parameters can be adjusted accordingly. The stopping condition for model fine-tuning training can be reaching a preset number of iterations, the prediction accuracy of the pre-trained data processing model reaching an accuracy threshold during fine-tuning training, or reaching a preset training time. This embodiment does not impose any limitations on the stopping condition for model fine-tuning training. The target data processing model is the data processing model obtained after fine-tuning the pre-trained data processing model.

[0077] Following the previous example, we expand the reasoning steps for the target answer by adding data, resulting in the expanded reasoning steps: "Content-based recommendation algorithm, principle: Recommends products based on their attributes and features (such as title, description, category, price, etc.) and the user's historical behavior (such as purchase history, browsing history, etc.). It assumes that users will like other products that are similar in content to those they have liked in the past. Implementation steps: Feature extraction: Extract key features from product descriptions, attributes, categories, brands, etc. Similarity calculation: Calculate the similarity between product features and user interests. Common similarity calculation methods include cosine similarity and Jaccard similarity. The accuracy of similarity calculation can be improved by introducing the concept of word weight and topic clustering. Recommendation generation: Display products with high similarity as recommendation results to the user."

[0078] Next, the pre-reasoning steps for the target answer are determined: "Step 1: Define the requirements and problem. Requirements: Develop a product recommendation system for an e-commerce website. Problem definition: Calculate the similarity between products based on users' purchase history and browsing records. Recommend products that users may be interested in based on the similarity. Step 2: Data preparation and preprocessing. Data collection: Collect users' purchase history and browsing record data. Collect basic product information, such as product ID, name, category, price, etc. Data preprocessing: Clean the user data, removing duplicates and invalid data. Construct a user-product interaction matrix to represent users' purchase and browsing behavior. Standardize the product data to ensure that features of different categories have the same weight in the similarity calculation."

[0079] The preceding answer reasoning steps, the extended answer reasoning steps, and the sample question are input into the pre-trained data processing model to obtain the fine-tuned predicted answer associated with the target answer reasoning steps. The model fine-tuning loss value is calculated based on the model fine-tuning loss function, the fine-tuned predicted answer, and the sample answer of the sample question. The parameters of the pre-trained data processing model are then tuned based on the model fine-tuning loss value to obtain an intermediate data processing model. At this point, the extended answer reasoning steps input into the model can be changed, and the model fine-tuning training can continue. This process is repeated until the target data processing model that meets the model fine-tuning training stopping condition is obtained. The model fine-tuning loss function is shown in the following formula (1).

[0080]

[0081] Where y represents the sample answer (the subsequent answer reasoning step of the target answer reasoning step); x represents the preceding answer reasoning step and the sample question; π_θ denotes the pre-trained data processing model; π_θ(y|x) represents the conditional probability of the pre-trained data processing model generating a response y after inputting x; D represents the training set consisting of all sample questions; r represents the reward value set for the response; E represents the expected value.
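
Formula (1) itself is not reproduced in this text. As a hedged reconstruction only, a standard REINFORCE surrogate loss written with the symbols defined above would take the following form; the exact expression in the original filing, including any KL constraint or relative advantage terms, may differ.

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y)\, \log \pi_\theta(y \mid x) \,\big]
\]

With the sampled responses held fixed, the gradient of this loss is \(-\mathbb{E}\big[r(x,y)\,\nabla_\theta \log \pi_\theta(y \mid x)\big]\), so minimizing it raises the probability of responses with reward 1 and leaves responses with reward 0 without a direct gradient contribution.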

[0082] In one or more embodiments provided in this specification, after the target data processing model is obtained by fine-tuning the pre-trained data processing model, the target data processing model can be used to perform answer reasoning and prediction on the question data. The specific implementation method is as follows.

[0083] Receive problem data and input the problem data into the target data processing model to obtain the problem reasoning steps corresponding to the problem data solution strategy; use the target data processing model to execute the problem reasoning steps step by step to obtain the answer data corresponding to the problem data.

[0084] In this context, "question data" refers to the question data input into the target data processing model in an actual question-and-answer scenario. The question data can be, for example, a code generation question. "Question data solution strategy" refers to the problem-solving approach for the question data. "Question reasoning steps" refer to the problem-solving steps for the question data; each question reasoning step contains at least one solution step, and each solution step includes its answer. "Answer data" is the answer corresponding to the question data, and it may contain the solution steps.

[0085] Following the previous example, after fine-tuning the pre-trained data processing model to obtain the target data processing model, it can be used to solve code generation problems. Given the problem data: "Write a search function for an online bookstore system that allows users to search for books by title and author. The system needs to return all matching book information, and the search must be case-sensitive," inputting the problem data into the target data processing model for answer reasoning yields: "Step 1: Define the requirements and functions. Requirements: Write a search function for an online bookstore system. Function: Receive user-input search keywords (book title, author). Search the database for matching book information. Return all matching book information; the search is case-sensitive. Step 2: Design the algorithm and logic. Input processing: Receive user-input search keywords, ensuring they are of string type. Database design: Assume there is already a database table containing book information, with fields including id (book ID), title (book title), author (author), etc. Search logic: Construct an SQL query statement based on the keywords entered by the user. Case-sensitive search means that the = operator should be used in the SQL query. Separate searches should be performed for book title and author, and the results should be combined. Considering performance, strategies such as full table scan, index, or full-text search can be used, but in this example, we assume a simple SQL query. Step 3: Choose a programming language and write code. Programming language: Choose Python as the backend development language. Write code to obtain the answer code." The output code can serve as the answer data.
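
The answer code is not reproduced in the text. Purely as an illustration of the approach outlined in steps 1 to 3, a search function along those lines might look like the sketch below. It assumes SQLite through Python's standard `sqlite3` module, a `books` table with the `id`, `title`, and `author` fields named in the example, and exact matching; the function names and the demo rows are made up for illustration.

```python
import sqlite3
from typing import List, Tuple


def search_books(conn: sqlite3.Connection, keyword: str) -> List[Tuple[int, str, str]]:
    """Return books whose title or author equals the keyword, case-sensitively."""
    if not isinstance(keyword, str):
        raise TypeError("search keyword must be a string")
    # With SQLite's default BINARY collation, "=" on TEXT is case-sensitive.
    # UNION combines the separate title and author searches and removes duplicates.
    query = (
        "SELECT id, title, author FROM books WHERE title = ? "
        "UNION "
        "SELECT id, title, author FROM books WHERE author = ? "
        "ORDER BY id"
    )
    return conn.execute(query, (keyword, keyword)).fetchall()


def make_demo_db() -> sqlite3.Connection:
    """In-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
    conn.executemany(
        "INSERT INTO books (id, title, author) VALUES (?, ?, ?)",
        [(1, "Graph Theory", "Ann Lee"),
         (2, "graph theory", "Bo Chen"),
         (3, "Ann Lee", "Cy Diaz")],
    )
    return conn
```

Passing the keyword as a query parameter rather than formatting it into the SQL string avoids SQL injection; other database engines may need an explicit binary or case-sensitive collation to get the same matching behavior.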

[0086] In summary, one embodiment of this specification provides a data processing model training method, aiming to improve the problem reasoning ability of the data processing model through fine-tuning training, thereby enabling a better understanding and analysis of user needs and problems. After pre-training the data processing model, a sample problem is input into the pre-trained model to obtain answer reasoning steps. From the answer reasoning steps, a target answer reasoning step and its preceding answer reasoning steps are selected, and the target answer reasoning step is expanded based on the preceding answer reasoning steps to obtain extended answer reasoning steps. This allows the data processing model to explore solutions to problems using the target answer reasoning steps, assisting the model in finding more effective training data. From the answer reasoning steps, the preceding answer reasoning steps of the target answer reasoning step are selected, and the pre-trained data processing model is fine-tuned based on the preceding answer reasoning steps, extended answer reasoning steps, sample problems, and sample answers to sample problems to obtain the target data processing model. By expanding the reasoning steps of the target answer with data, and then fine-tuning the pre-trained data processing model, the target data processing model's ability to understand, reason, and solve problems is improved, thus enhancing the effectiveness of the data processing model's learning process.

[0087] The data processing model training method is further explained below with reference to Figure 3, taking the application of the method provided in this specification to the reasoning and solution of mathematical problems as an example. Figure 3 shows a flowchart of a data processing model training method according to an embodiment of this specification, which includes the following steps.

[0088] Step 302: Input the problem samples in the pre-training samples into the initial data processing model to obtain the predicted answer corresponding to the problem-solving strategy, wherein the predicted answer includes the prediction inference steps.

[0089] Data processing models can be used to solve mathematical problems. By pre-training the initial data processing model using training samples of mathematical problems, the initial data processing model can learn the ability to reason and solve mathematical problems.

[0090] In practical applications, when pre-training the initial data processing model, mathematical problem samples from the training samples are input into the initial data processing model. The initial data processing model predicts the solution approach (solution steps) for the mathematical problem samples. Based on the solution approach (solution steps), the problem is solved step by step to obtain the predicted answer to the mathematical problem sample.

[0091] Step 304: Calculate the pre-training loss value based on the pre-training loss function, the predicted answer, and the answer sample corresponding to the question sample. Then, tune the parameters of the initial data processing model based on the pre-training loss value until a data processing model that meets the pre-training stopping condition is obtained.

[0092] The loss value is calculated based on the sample labels (true answers to mathematical problems), predicted answers, and loss function in the training samples. The initial data processing model is then tuned based on the loss value to obtain the tuned data processing model.

[0093] Select new problem samples and continue iteratively training the parameter-tuned data processing model. After multiple rounds of training, a data processing model that meets the training stopping condition is obtained.

[0094] Step 306: Input the sample question into the pre-trained data processing model to obtain the answer reasoning steps.

[0095] When fine-tuning a pre-trained data processing model with mathematical problem-solving capabilities based on mathematical problem samples from the mathematical domain, the sample problem can be a mathematical problem such as: "How many positive integers less than 20 can be expressed as the sum of positive multiples of 2 and positive multiples of 3?". This mathematical problem is then input into the pre-trained data processing model, and the model's output reasoning steps are obtained as follows: 1. To understand the problem, we need to find out how many positive integers less than 20 can be expressed as the sum of positive integer multiples of 2 and positive integer multiples of 3.

[0096] 2: Define the general form: any number that can be represented as the sum of positive multiples of 2 and 3 can be written in the form \(2a+3b\), where \(a\) and \(b\) are positive integers.

[0097] 3: Find the minimum and maximum possible values of \(2a+3b\) below 20. The minimum value of \(2a+3b\) occurs when \(a=1\) and \(b=1\), which gives \(2(1)+3(1)=5\). The maximum value below 20 is when \(2a+3b\) is as close to 20 as possible without exceeding 20. Trying some combinations, we find \(2(6)+3(2)=12+6=18\), which is the maximum value below 20 that can be obtained using this method.

[0098] 4: Determine the range of possible values for \(2a+3b\) below 20. Since we cannot exceed 18, we must start from 5. Therefore, the possible values for \(2a+3b\) are 5, 6, 7, ..., 18.

[0099] 5: Calculate the possible values. To find how many positive integers less than 20 can be expressed as the sum of positive multiples of 2 and 3, we only need to count the integers from 5 to 18 (inclusive): \[18-5+1=14\]. Therefore, there are 14 positive integers less than 20 that can be expressed as the sum of positive multiples of 2 and 3. Final answer: 14.

[0100] Step 308: Select the target answer reasoning step from the answer reasoning steps.

[0101] In practical applications, all of the answer reasoning steps (1 to 5) can be used as target answer reasoning steps for exploration.

[0102] Step 310: Input the preceding answer reasoning steps and the sample question into the data processing model to obtain the initial extended answer reasoning steps for data expansion of the target answer reasoning steps.

[0103] Step 312: Determine the reasoning step answers included in the initial extended answer reasoning steps, and determine the reasoning step sample answers of the target answer reasoning steps based on the sample answers.

[0104] Step 314: Determine the predicted reward for the initial extended answer reasoning step by comparing the reasoning step answer with the reasoning step sample answer.

[0105] Step 316: Use the initial extended answer reasoning step and the predicted reward as the extended answer reasoning step.

[0106] In this step, problem-solving strategies are explored based on the answer reasoning steps. The preceding answer reasoning steps and the sample question are input into the pre-trained data processing model to explore the target answer reasoning step. The same input can be fed to the model multiple times to obtain different extended answer reasoning steps. The reward for a correct extended answer reasoning step is set to 1, and the reward for an incorrect extended answer reasoning step is set to 0.

[0107] In practical applications, we can explore each step from 1 to 5 to obtain the following incorrect extended answer reasoning steps: 1. To understand the problem, we need to find out how many positive integers less than 20 can be expressed as the sum of positive integer multiples of 2 and positive integer multiples of 3.

[0108] 2: Formulate an equation: For a number \(n\) that can be expressed as the sum of positive multiples of 2 and positive multiples of 3, we can write it as: \[n=2a+3b\], where \(a\) and \(b\) are positive integers.

[0109] 3: Find the minimum and maximum possible values of \(n\). The smallest positive integer that can be expressed as the sum of a positive multiple of 2 and a positive multiple of 3 is 5 (i.e., \(2*1+3*1\)). The largest integer less than 20 can be found by maximizing \(a\) and \(b\). This is \(14+5=19\) (which is \(2*7+3*2\)).

[0110] 4: Generate the set of possible values for \(n\). We now need to find all values from 5 to 19 (inclusive) that can be written as \(2a+3b\). Let's try generating these values: \(n=5\) (i.e., \(2*1+3*1\)); \(n=7\) (i.e., \(2*2+3*1\)); \(n=9\) (i.e., \(2*3+3*1\)); \(n=11\) (i.e., \(2*1+3*2\)); \(n=13\) (i.e., \(2*4+3*1\)); \(n=15\) (i.e., \(2*2+3*3\)); \(n=17\) (i.e., \(2*3+3*2\)); \(n=19\) (i.e., \(2*7+3*2\)). 5: Counting. From the list above, there are 8 positive integers less than 20 that can be represented as the sum of positive multiples of 2 and positive multiples of 3.

[0111] The answer is: \(\boxed{8}\). The reward for this continuation answer is 0 because it is incorrect.

[0112] In addition, the following correct extended answer reasoning steps can be obtained: 1. To understand the problem, we need to find out how many positive integers less than 20 can be expressed as the sum of positive integer multiples of 2 and positive integer multiples of 3.

[0113] 2: Define the general form: any number that can be represented as the sum of positive multiples of 2 and 3 can be written in the form \(2a+3b\), where \(a\) and \(b\) are positive integers.

[0114] 3: Find the range of \(a\) and \(b\). Because we are looking for numbers less than 20, we need to find the maximum possible values satisfying \(2a+3b<20\).

[0115] 4: Find the possible values of \(2a+3b\). To find them, let's list the \(a\) and \(b\) that meet the conditions. For \(a=1\), the value is \(2(1)+3b=2+3b\). As \(b\) increases, the possible values are as follows: \(b=1\), \(2+3(1)=5\); \(b=2\), \(2+3(2)=8\); \(b=3\), \(2+3(3)=11\); \(b=4\), \(2+3(4)=14\). For \(a=2\), the value is \(2(2)+3b=4+3b\). As \(b\) increases, the possible values are as follows: \(b=1\), \(4+3(1)=7\); \(b=2\), \(4+3(2)=10\); \(b=3\), \(4+3(3)=13\); \(b=4\), \(4+3(4)=16\). For \(a=3\), the value is \(2(3)+3b=6+3b\). As \(b\) increases, the possible values are as follows: \(b=1\), \(6+3(1)=9\); \(b=2\), \(6+3(2)=12\); \(b=3\), \(6+3(3)=15\). For \(a=4\) and above, the sum will exceed 20.

[0116] 5: Calculate the unique values. Now, let's list all the unique values found from the above combinations: \(5\), \(7\), \(8\), \(9\), \(10\), \(11\), \(12\), \(13\), \(14\), \(15\), \(16\), \(16\) (repeated), \(17\), \(18\), \(19\).

[0117] Since \(16\) appears twice, we have a total of 14 unique positive integers less than 20 that can be represented as the sum of positive multiples of 2 and positive multiples of 3.

[0118] Answer: 14, the reward for this answer is 1.
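
The final answers used to assign the rewards above can be checked independently by enumeration. The following sketch is not part of the described method; it simply tries every pair of positive integers \(a\) and \(b\), and the function name is an assumption.

```python
def representable_below(limit: int) -> list:
    """Positive integers n < limit with n = 2a + 3b for positive integers a and b."""
    found = set()
    for a in range(1, limit):
        for b in range(1, limit):
            n = 2 * a + 3 * b
            if n < limit:
                found.add(n)
    return sorted(found)


values = representable_below(20)
print(values)       # [5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(len(values))  # 14
```

The enumeration confirms that a continuation ending in 14 earns reward 1 and one ending in 8 earns reward 0. Because the reward depends only on the final answer, it does not check whether the intermediate steps of a continuation are themselves sound.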

[0119] The purpose of the exploration phase in problem-solving strategies is to assist the data processing model in finding more effective training data. The exploration method is as follows: From the collected reasoning answers for each question, select one answer as a reference and continue writing from each intermediate step of that answer to guide the model's generation. Problems with an accuracy rate below 50% are considered difficult. For more difficult problems, it's necessary to help the data processing model find the correct solution. To this end, a correct answer is randomly selected, and the data processing model continues writing the solution from an intermediate step, improving the model's accuracy. Similarly, for simple problems with an accuracy rate above 50%, the goal is to help the data processing model identify common errors. To this end, an incorrect answer is randomly selected, and the data processing model continues writing the solution from an intermediate step, finding more incorrect solutions. Through this exploration, training data in different problem-solving steps can be found, effectively improving the diversity and coverage of the dataset.
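
The exploration procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: each collected answer is represented as a dictionary with hypothetical `steps` and `correct` keys, the function names are not from the specification, and treating every non-empty proper prefix of the reference answer as a starting point for continuation is one reading of "each intermediate step".

```python
import random
from typing import Dict, List, Optional


def pick_reference(answers: List[Dict], difficult: bool,
                   rng: Optional[random.Random] = None) -> Optional[Dict]:
    """Randomly pick a correct answer for difficult questions, an incorrect one otherwise."""
    rng = rng or random.Random()
    pool = [a for a in answers if a["correct"] == difficult]
    # None signals that no suitable reference exists among the collected answers.
    return rng.choice(pool) if pool else None


def continuation_prefixes(steps: List[str]) -> List[List[str]]:
    """Prefixes ending at each intermediate step; the model continues writing from each."""
    return [steps[:k] for k in range(1, len(steps))]
```

For a five-step reference answer this yields four prefixes (steps 1, 1 to 2, 1 to 3, and 1 to 4), and each prefix is given to the model together with the question so that it writes the remaining steps.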

[0120] Step 318: From the answer reasoning steps, select the preceding answer reasoning steps of the target answer reasoning step, and based on the preceding answer reasoning steps, the extended answer reasoning steps, the sample questions, and the sample answers to the sample questions, fine-tune the pre-trained data processing model to obtain the target data processing model.

[0121] Based on the extended answer reasoning steps and other data, the pre-trained data processing model is fine-tuned through multiple iterations. The REINFORCE algorithm is used, employing the loss function corresponding to the above formula (1), to improve the reasoning ability of the data processing model during fine-tuning. Combined with techniques such as KL constraints and relative advantage calculation, the stability and effectiveness of the model training process are enhanced.
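
How a REINFORCE loss might be combined with a KL constraint and a relative advantage is sketched below. The specification does not spell out these details, so everything here is an assumption made for illustration: the relative advantage is taken to be the reward minus the mean reward of the continuations sampled from the same prefix, the KL constraint is a penalty on the sample mean of the log-probability difference between the current model and a frozen reference model, and the coefficient 0.05 and the function names are arbitrary.

```python
from typing import List, Sequence


def relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Reward minus the mean reward of continuations sampled from the same prefix."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def reinforce_loss(logprobs: Sequence[float],
                   ref_logprobs: Sequence[float],
                   rewards: Sequence[float],
                   kl_coef: float = 0.05) -> float:
    """Scalar value of -mean(advantage * log pi) + kl_coef * mean(log pi - log pi_ref)."""
    if not rewards or not (len(logprobs) == len(ref_logprobs) == len(rewards)):
        raise ValueError("inputs must be non-empty and of equal length")
    advantages = relative_advantages(rewards)
    policy_term = -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(rewards)
    # Sample-based estimate of the KL divergence from the reference model,
    # evaluated on responses drawn from the current model.
    kl_term = sum(lp - ref for lp, ref in zip(logprobs, ref_logprobs)) / len(rewards)
    return policy_term + kl_coef * kl_term
```

The sketch only computes the scalar loss value from sequence-level log-probabilities; an actual training run would compute the same quantity inside an automatic differentiation framework so that gradients flow through the log-probabilities of the model being fine-tuned. Subtracting the mean reward means that, with 0/1 outcome rewards, incorrect continuations receive a negative advantage instead of contributing nothing.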

[0122] In summary, one embodiment of this specification, through proactive exploration of problem-solving strategies, helps increase the model's chances of encountering positive feedback, thereby alleviating the sparse reward problem. For simple problems with high accuracy, analyzing incorrect answers helps the model identify and avoid common errors. This not only increases the diversity of training data but also enhances the model's understanding of error patterns, further enriching the reward signal and avoiding additional training and inference burdens. During reinforcement learning fine-tuning training, the REINFORCE algorithm combined with KL constraints and relative advantage calculation techniques is used to stabilize the model's performance. Through multiple iterations, the model's inference ability is continuously improved.

[0123] Corresponding to the above method embodiments, this specification also provides embodiments of a data processing model training system. Figure 4 shows a schematic diagram of the structure of a data processing model training system provided in one embodiment of this specification. As shown in Figure 4, the data processing model training system 400 includes an edge device 410 and a cloud-side device 420. The edge device 410 is used to generate a data processing model training request based on training samples and a pre-trained data processing model, and to send the data processing model training request to the cloud-side device 420. The cloud-side device 420 is used to determine a sample question in the training samples and input the sample question into the pre-trained data processing model to obtain answer reasoning steps; to select, from the answer reasoning steps, a target answer reasoning step and its preceding answer reasoning steps, and expand the target answer reasoning step based on the preceding answer reasoning steps to obtain expanded answer reasoning steps; and to fine-tune the pre-trained data processing model based on the preceding answer reasoning steps, the expanded answer reasoning steps, the sample question, and the sample answer to the sample question to obtain a target data processing model.

[0124] This specification provides an embodiment of a data processing model training system. When a client user corresponding to the edge device in this system requires fine-tuning of the data processing model, a data processing model training request can be generated based on training samples and a pre-trained data processing model. This request is then sent to the cloud-side device. The cloud-side device aims to improve the problem reasoning ability of the data processing model through fine-tuning training, enabling it to better understand and analyze user needs and problems.

[0125] After receiving a data processing model training request from the edge device, the cloud-side device parses the request to obtain a pre-trained data processing model and training samples. It inputs sample questions from the training samples into the pre-trained model to obtain answer reasoning steps. From these steps, it selects a target answer reasoning step and its preceding answer reasoning steps, and expands the target answer reasoning step based on the preceding steps to obtain extended answer reasoning steps. This allows the data processing model to explore solutions to problems using the target answer reasoning steps, helping it find more effective training data. Finally, it fine-tunes the pre-trained model based on the preceding answer reasoning steps of the target answer reasoning step, the extended answer reasoning steps, the sample questions, and the sample answers to the sample questions to obtain the target data processing model. This target data processing model is then sent to the edge device.
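The cloud-side data preparation described above can be sketched as follows. `TrainingRequest`, its fields, and `build_training_records` are hypothetical names; the model calls are injected as plain callables, and the sketch assumes that every step after the first is treated in turn as the target answer reasoning step, which is only one possible selection policy. The fine-tuning itself is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainingRequest:
    model_id: str                    # identifies the pre-trained model (assumed field)
    samples: list[tuple[str, str]]   # (sample question, sample answer) pairs


def build_training_records(
    request: TrainingRequest,
    generate_steps: Callable[[str], list[str]],
    expand: Callable[[str, list[str], str], list[str]],
) -> list[dict]:
    """Collect the inputs that the fine-tuning stage would consume."""
    records = []
    for question, answer in request.samples:
        steps = generate_steps(question)           # answer reasoning steps
        for k in range(1, len(steps)):
            preceding, target = steps[:k], steps[k]
            # Expand the target step, conditioned on the steps before it.
            for expanded in expand(question, preceding, target):
                records.append({
                    "question": question,
                    "answer": answer,
                    "preceding": preceding,
                    "expanded": expanded,
                })
    return records
```

Each record carries the four inputs named in this paragraph: the preceding answer reasoning steps, one extended answer reasoning step, the sample question, and the sample answer.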

[0126] By expanding the reasoning steps of the target answer with data, and then fine-tuning the pre-trained data processing model, the target data processing model's ability to understand, reason, and solve problems is improved, thus enhancing the effectiveness of the data processing model's learning process.

[0127] The above is an illustrative scheme of a data processing model training system according to this embodiment. It should be noted that the technical solution of this data processing model training system and the technical solution of the data processing model training method described above belong to the same concept. Details not described in detail in the technical solution of the data processing model training system can be found in the description of the technical solution of the data processing model training method described above. Corresponding to the above method embodiment, this specification also provides a question-and-answer method embodiment. Figure 5 shows a flowchart of a question-and-answer method provided in one embodiment of this specification. The question-and-answer method is applied to a server and specifically includes the following steps.

[0128] Step 502: Receive the target problem data and input the target problem data into the target data processing model to obtain the target answer reasoning steps corresponding to the target problem solving strategy, wherein the target data processing model is obtained through a data processing model training method.

Step 504: Execute the target answer reasoning steps step by step using the target data processing model to obtain the target answer data corresponding to the target question data.

[0129] In practical applications, after fine-tuning the pre-trained data processing model to obtain the target data processing model, the target data processing model can be used to answer questions. By reasoning about the target question data, the target answer data corresponding to the target question data can be obtained.

[0130] One embodiment of the question-answering method provided in this specification, after receiving target question data, inputs the target question data into a target data processing model. The target data processing model infers the answer steps based on the target question data, and determines the target answer inference steps corresponding to the target question data by analyzing the problem-solving approach. By executing the target answer inference steps step by step, the target answer data corresponding to the target question data can be obtained. By using the target data processing model obtained through fine-tuning training to process the target question data, the accuracy of the target answer data can be improved.
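A minimal sketch of this two-stage flow is given below. The function `answer_question` and its two callable parameters are hypothetical stand-ins for calls to the target data processing model; the specification does not prescribe this interface.

```python
from typing import Callable


def answer_question(
    question: str,
    plan_steps: Callable[[str], list[str]],
    run_step: Callable[[str, list[str], str], str],
) -> str:
    """Obtain the reasoning steps, then execute them one at a time.

    plan_steps : returns the target answer reasoning steps for a question
    run_step   : executes one step given the question and earlier results
    """
    steps = plan_steps(question)
    results: list[str] = []
    for step in steps:
        # Each step sees the results of all steps executed before it.
        results.append(run_step(question, list(results), step))
    # The result of the last step is taken as the target answer data.
    return results[-1] if results else ""
```

Passing the earlier results into each step keeps the execution sequential, which matches the step-by-step execution described in this embodiment.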

[0131] The above is an illustrative scheme of a question-answering method according to this embodiment. It should be noted that the technical solution of this question-answering method and the technical solution of the data processing model training method described above belong to the same concept. For details not described in detail in the technical solution of the question-answering method, please refer to the description of the technical solution of the data processing model training method described above.

[0132] Corresponding to the above method embodiments, this specification also provides embodiments of a data processing model training device. Figure 6 shows a schematic diagram of a data processing model training device according to one embodiment of this specification. As shown in Figure 6, the device includes: an input module 602 configured to input sample questions into a pre-trained data processing model to obtain answer reasoning steps; a selection module 604 configured to select a target answer reasoning step and a preceding answer reasoning step from the answer reasoning steps, and to perform data expansion on the target answer reasoning step based on the preceding answer reasoning step to obtain an expanded answer reasoning step; and a training module 606 configured to fine-tune the pre-trained data processing model based on the preceding answer reasoning steps, the extended answer reasoning steps, the sample question, and the sample answer to the sample question, to obtain the target data processing model.

[0133] Optionally, the input module 602 is further configured to: input the question samples from the pre-training samples into the initial data processing model to obtain the predicted answer corresponding to the problem-solving strategy; calculate the pre-training loss value based on the pre-training loss function, the predicted answer, and the answer sample corresponding to the question sample; and tune the parameters of the initial data processing model based on the pre-training loss value until a data processing model that meets the pre-training stopping condition is obtained.

[0134] Optionally, the selection module 604 is further configured to: determine, based on the answer reasoning steps, the model-predicted answer corresponding to the sample question; and, if it is determined from the model-predicted answer that the sample question satisfies the fine-tuning training conditions, execute the step of selecting the target answer reasoning step from the answer reasoning steps.

[0135] Optionally, the selection module 604 is further configured to: If the prediction accuracy of the sample question is greater than the accuracy threshold, the answer reasoning step includes predicting the correct answer; Identify at least one incorrect reasoning step included in the answer reasoning step, and use the at least one incorrect reasoning step as the target answer reasoning step.

[0136] Optionally, the selection module 604 is further configured to: If the prediction accuracy of the sample question is less than or equal to the accuracy threshold, the answer reasoning step includes predicting an incorrect answer; Identify at least one correct reasoning step included in the answer reasoning step, and use the at least one correct reasoning step as the target answer reasoning step.

[0137] Optionally, the selection module 604 is further configured to: The preceding answer reasoning steps and the sample question are input into the data processing model to obtain the initial extended answer reasoning steps for data expansion of the target answer reasoning steps; Determine the reasoning step answers included in the initial extended answer reasoning step, and determine the reasoning step sample answers of the target answer reasoning step based on the sample answers; The predicted reward for the initial extended answer reasoning step is determined by comparing the answer to the reasoning step sample with the answer to the reasoning step. The initial extended answer reasoning step and the predicted reward are used as the extended answer reasoning step.
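The reward assignment in this paragraph can be sketched as follows, under several assumptions that are not stated in this specification: the reward is binary, answers are compared after whitespace and case normalization, and the step answer is marked in the generated text by the literal string `Answer:`. All function names are hypothetical.

```python
def extract_answer(continuation: str) -> str:
    """Return the text after the last 'Answer:' marker, or '' if absent.

    The marker is an assumed output convention, not part of the specification.
    """
    _, found, tail = continuation.rpartition("Answer:")
    return tail if found else ""


def normalize(answer: str) -> str:
    """Collapse whitespace and ignore case before comparing answers."""
    return " ".join(answer.strip().lower().split())


def predicted_reward(step_answer: str, sample_answer: str) -> float:
    """Binary reward: 1.0 when the step answer matches the sample answer."""
    return 1.0 if normalize(step_answer) == normalize(sample_answer) else 0.0


def attach_rewards(continuations: list[str], sample_answer: str):
    """Pair each initial extended reasoning step with its predicted reward."""
    return [
        (text, predicted_reward(extract_answer(text), sample_answer))
        for text in continuations
    ]
```

The resulting pairs correspond to the extended answer reasoning steps of this paragraph, that is, the initial extended step together with its predicted reward.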

[0138] Optionally, the training module 606 is further configured to: The preceding answer reasoning step, the extended answer reasoning step, and the sample question are input into the pre-trained data processing model to obtain a fine-tuned predicted answer associated with the target answer reasoning step; The model fine-tuning loss value is calculated based on the model fine-tuning loss function, the fine-tuning predicted answer, and the sample answer of the sample question. The parameters of the pre-trained data processing model are then tuned based on the model fine-tuning loss value until the target data processing model that meets the model fine-tuning training stopping condition is obtained.

[0139] Optionally, the training module 606 is further configured to: Receive problem data and input the problem data into the target data processing model to obtain the problem reasoning steps corresponding to the problem data solution strategy; The target data processing model is used to execute the problem reasoning steps step by step to obtain the answer data corresponding to the problem data.

[0140] This specification provides a data processing model training device in one embodiment, aiming to improve the problem reasoning ability of the data processing model through fine-tuning training, enabling it to better understand and analyze user needs and problems. After pre-training the data processing model, a sample problem is input into the pre-trained model to obtain answer reasoning steps. From the answer reasoning steps, a target answer reasoning step and its preceding answer reasoning steps are selected, and the target answer reasoning step is expanded based on the preceding answer reasoning steps to obtain extended answer reasoning steps. This allows the data processing model to explore solutions to problems using the target answer reasoning steps, assisting the model in finding more effective training data. From the answer reasoning steps, the preceding answer reasoning steps of the target answer reasoning step are selected, and the pre-trained data processing model is fine-tuned based on the preceding answer reasoning steps, extended answer reasoning steps, sample problems, and sample answers to sample problems to obtain the target data processing model. By expanding the reasoning steps of the target answer with data, and then fine-tuning the pre-trained data processing model, the target data processing model's ability to understand, reason, and solve problems is improved, thus enhancing the effectiveness of the data processing model's learning process.

[0141] The above is an illustrative scheme of a data processing model training device according to this embodiment. It should be noted that the technical solution of this data processing model training device and the technical solution of the data processing model training method described above belong to the same concept. For details not described in detail in the technical solution of the data processing model training device, please refer to the description of the technical solution of the data processing model training method described above.

[0142] Corresponding to the above method embodiments, this specification also provides embodiments of a question-and-answer device. Figure 7 shows a schematic diagram of a question-and-answer device according to one embodiment of this specification. As shown in Figure 7, this device is used in a server and includes: a receiving module 702 configured to receive target problem data and input the target problem data into the target data processing model to obtain the target answer reasoning steps corresponding to the target problem solving strategy, wherein the target data processing model is obtained through a data processing model training method; and an execution module 704 configured to use the target data processing model to execute the target answer reasoning steps step by step to obtain the target answer data corresponding to the target question data.

[0143] One embodiment of the question-answering device provided in this specification, after receiving target question data, inputs the target question data into a target data processing model. The target data processing model infers the answer steps based on the target question data, and determines the target answer reasoning steps corresponding to the target question data by analyzing the problem-solving approach. By executing the target answer reasoning steps step by step, the target answer data corresponding to the target question data can be obtained. By using the target data processing model obtained through fine-tuning training to process the target question data, the accuracy of the target answer data can be improved.

[0144] The above is an illustrative scheme of a question-and-answer device according to this embodiment. It should be noted that the technical solution of this question-and-answer device and the technical solution of the question-and-answer method described above belong to the same concept. For details not described in detail in the technical solution of the question-and-answer device, please refer to the description of the technical solution of the question-and-answer method described above.

[0145] Figure 8 shows a structural block diagram of a computing device 800 provided according to one embodiment of this specification.

[0146] The computing device 800 includes a memory 810 and a processor 820. The memory 810 is used to store computer programs / instructions, and the processor 820 is used to execute the computer programs / instructions. When the computer programs / instructions are executed by the processor 820, they implement the steps of the data processing model training method.

[0147] In one or more embodiments of this specification, the computing device can be understood as an integrated smart terminal, including but not limited to a server, desktop computer, PC (Personal Computer), all-in-one model machine, mobile phone, tablet computer or other portable smart terminal, etc., and the computing device may have the model described in the above embodiments of this application pre-installed.

[0148] Specifically, this computing device can pre-install various types of models, including but not limited to models in natural language processing, visual processing, speech processing, code processing, and multimodal task processing, thus providing diverse model selection. In different product forms, this computing device can support one or more model usage methods, including but not limited to model training, model invocation, model fine-tuning, model deployment, model inference, and application. In some product forms, this computing device also supports model management, including but not limited to multi-type model management (supporting the management of discriminative, generative, and other model types), model version control (supporting the control of different model versions), and model evaluation (evaluating model performance and effectiveness based on model evaluation tools). In other product forms, this computing device can also create applications based on models, providing API (Application Programming Interface) calling capabilities. Users can call models into created applications through the API interface, and application management tools are also provided to manage and monitor the applications.

[0149] Furthermore, the computing device can also include data management (supporting the creation and management of model tuning datasets), a training center (providing abundant training resources to help users learn and master AI (Artificial Intelligence) technology), and basic control capabilities (providing enterprise-level basic control capabilities to ensure the security and efficient operation of the system). Through the above functions, it provides a comprehensive and integrated device for AI development, training, deployment, and application.

[0150] Figure 9 shows a structural block diagram of an electronic device 900 provided according to one embodiment of this specification.

[0151] The electronic device 900 includes a memory 910 and a processor 920 connected via a bus 930. The memory 910 is used to store computer programs / instructions, and the processor 920 is used to execute the computer programs / instructions, which, when executed by the processor 920, implement the steps of the method.

[0152] Specifically, the components of the electronic device 900 include, but are not limited to, a memory 910 and a processor 920. The processor 920 is connected to the memory 910 via a bus 930, and the database 950 is used to store data.

[0153] Electronic device 900 also includes access device 940, which enables electronic device 900 to communicate via one or more networks 960. Examples of these networks include Public Switched Telephone Network (PSTN), Local Area Network (LAN), Wide Area Network (WAN), Personal Area Network (PAN), or combinations of communication networks such as the Internet. Access device 940 may include one or more of any type of wired or wireless network interface (e.g., network interface card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Wi-MAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.

[0154] In one embodiment of this specification, the above-described components of the electronic device 900 and other components not shown in Figure 9 can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the electronic device shown in Figure 9 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.

[0155] Electronic device 900 can be any type of stationary or mobile electronic device, including mobile computers or mobile electronic devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable electronic devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary electronic devices such as desktop computers or personal computers (PCs). Electronic device 900 can also be a mobile or stationary server.

[0156] The above is an illustrative scheme of an electronic device according to this embodiment. It should be noted that the technical solution of this electronic device and the technical solution of the above method belong to the same concept, and all details not described in detail in the technical solution of the electronic device can be referred to the description of the technical solution of the above method.

[0157] An embodiment of this specification also provides a computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the above-described method.

[0158] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the above method belong to the same concept, and all details not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above method.

[0159] An embodiment of this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described method.

[0160] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product and the technical solution of the above method belong to the same concept, and all details not described in detail in the technical solution of the computer program product can be referred to in the description of the technical solution of the above method.

[0161] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0162] The computer program / instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0163] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.

[0164] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0165] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

1. A data processing model training method, comprising: inputting a sample question into a pre-trained data processing model to obtain answer reasoning steps; selecting, from the answer reasoning steps, a target answer reasoning step and the preceding answer reasoning step of the target answer reasoning step, and performing data expansion on the target answer reasoning step based on the preceding answer reasoning step to obtain an expanded answer reasoning step; and fine-tuning the pre-trained data processing model based on the preceding answer reasoning step, the expanded answer reasoning step, the sample question, and the sample answer to the sample question to obtain a target data processing model.

2. The data processing model training method according to claim 1, wherein the pre-training of the data processing model includes: inputting the question samples from the pre-training samples into the initial data processing model to obtain the predicted answer corresponding to the problem-solving strategy; calculating the pre-training loss value based on the pre-training loss function, the predicted answer, and the answer sample corresponding to the question sample; and tuning the parameters of the initial data processing model based on the pre-training loss value until a data processing model that meets the pre-training stopping condition is obtained.

3. The data processing model training method according to claim 1, wherein selecting the target answer reasoning step from the answer reasoning steps includes: determining, based on the answer reasoning steps, the model-predicted answer corresponding to the sample question; and, if it is determined from the model-predicted answer that the sample question satisfies the fine-tuning training conditions, executing the step of selecting the target answer reasoning step from the answer reasoning steps.

4. The data processing model training method according to claim 1, wherein selecting the target answer reasoning step from the answer reasoning steps includes: If the prediction accuracy of the sample question is greater than the accuracy threshold, the answer reasoning step includes predicting the correct answer; Identify at least one incorrect reasoning step included in the answer reasoning step, and use the at least one incorrect reasoning step as the target answer reasoning step.

5. The data processing model training method according to claim 4, wherein selecting the target answer reasoning step from the answer reasoning steps includes: If the prediction accuracy of the sample question is less than or equal to the accuracy threshold, the answer reasoning step includes predicting an incorrect answer; Identify at least one correct reasoning step included in the answer reasoning step, and use the at least one correct reasoning step as the target answer reasoning step.

6. The data processing model training method according to claim 1, wherein the step of expanding the target answer reasoning step based on the preceding answer reasoning step to obtain the expanded answer reasoning step includes: The preceding answer reasoning steps and the sample question are input into the data processing model to obtain the initial extended answer reasoning steps for data expansion of the target answer reasoning steps; Determine the reasoning step answers included in the initial extended answer reasoning step, and determine the reasoning step sample answers of the target answer reasoning step based on the sample answers; The predicted reward for the initial extended answer reasoning step is determined by comparing the answer to the reasoning step sample with the answer to the reasoning step. The initial extended answer reasoning step and the predicted reward are used as the extended answer reasoning step.

7. The data processing model training method according to claim 1, wherein the step of fine-tuning the pre-trained data processing model based on the preceding answer reasoning step, the extended answer reasoning step, the sample question, and the sample answer to the sample question to obtain the target data processing model includes: The preceding answer reasoning step, the extended answer reasoning step, and the sample question are input into the pre-trained data processing model to obtain a fine-tuned predicted answer associated with the target answer reasoning step; The model fine-tuning loss value is calculated based on the model fine-tuning loss function, the fine-tuning predicted answer, and the sample answer of the sample question. The parameters of the pre-trained data processing model are then tuned based on the model fine-tuning loss value until the target data processing model that meets the model fine-tuning training stopping condition is obtained.

8. The data processing model training method according to any one of claims 1-7, wherein the method further comprises: Receive problem data and input the problem data into the target data processing model to obtain the problem reasoning steps corresponding to the problem data solution strategy; The target data processing model is used to execute the problem reasoning steps step by step to obtain the answer data corresponding to the problem data.

9. A data processing model training system, comprising an edge device and a cloud-side device, wherein: the edge device is used to generate a data processing model training request based on training samples and a pre-trained data processing model, and send the data processing model training request to the cloud-side device; and the cloud-side device is used to determine a sample question from the training samples, input the sample question into the pre-trained data processing model to obtain an answer reasoning step; from the answer reasoning steps, select a target answer reasoning step and a preceding answer reasoning step, and perform data expansion on the target answer reasoning step based on the preceding answer reasoning step to obtain an expanded answer reasoning step; based on the preceding answer reasoning step, the expanded answer reasoning step, the sample question, and the sample answer to the sample question, perform model fine-tuning training on the pre-trained data processing model to obtain a target data processing model.

10. A question-answering method, applied to a server, comprising: Receive target problem data and input the target problem data into a target data processing model to obtain the target answer reasoning steps corresponding to the target problem solving strategy, wherein the target data processing model is obtained by the data processing model training method according to any one of claims 1-8; The target answer reasoning steps are executed step by step using the target data processing model to obtain the target answer data corresponding to the target question data.

11. A computing device, comprising: a memory and a processor; the memory is used to store computer programs / instructions, and the processor is used to execute the computer programs / instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 8 or 10.

12. An electronic device, comprising: a memory and a processor, the memory and the processor being connected via a bus; the memory is used to store computer programs / instructions, and the processor is used to execute the computer programs / instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 8 or 10.

13. A computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8 or 10.

14. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8 or 10.