A large model inference verification method, device, medium and equipment
By training a verification model that generates a thought chain (chain of thought) describing how to verify a result, the method addresses the problem of verifying the outputs of large language models in complex reasoning tasks. It enables general-purpose verification across tasks and domains, improves verification efficiency and accuracy, and reduces development costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG ANT SECRET TECH CO LTD
- Filing Date
- 2026-05-08
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, large language models (LLMs) are prone to "hallucinations" when outputting results in complex reasoning tasks. Existing verification methods lack universality and scalability, rely on manually customized rules, have high development costs, and are difficult to transfer between different tasks.
The method acquires the training dataset of a large model and, for each training sample, works backward from the output result to derive the steps that check whether it satisfies the boundary conditions of the input data; these steps form a thought chain. A verification large model is then trained by supervised fine-tuning with the thought chains as annotations, and it generates a verification path used to verify the execution results of the business large model.
This yields a general-purpose verification method that can be used across tasks and domains. Because verification does not require re-solving the complex problem, it improves the efficiency and accuracy of verification and reduces development costs.
Figure CN122242769A_ABST
Description
Technical Field
[0001] This specification relates to the field of computer technology, and in particular to a method, apparatus, storage medium and device for verifying large-scale model reasoning.
Background Technology
[0002] With the development of artificial intelligence (AI) technology, large language models (LLMs) have been widely applied in various fields.
[0003] However, when an LLM performs complex reasoning tasks, it may be limited by its training data, or the reasoning that the task requires may exceed its capabilities. In such cases, the LLM can only generate answers based on its existing data and capabilities, which may cause it to "hallucinate" when performing tasks, thereby reducing the accuracy and reliability of its results.
[0004] In existing technologies, to detect errors in LLM outputs promptly, a specific verification method is typically developed for each task. For example, mathematical problems might be verified using calculators, code generation might be validated by compiling and running test cases, and fact-based question answering might rely on relevant knowledge bases for verification. However, this approach relies heavily on manually designed verification methods and external tools, requires redevelopment for different tasks, and lacks versatility and scalability.
[0005] Therefore, how to verify the results of large-scale model inference in a more general way has become an urgent problem to be solved. Thus, this specification provides a method for verifying large-scale model inference.
Summary of the Invention
[0006] This specification provides a method, apparatus, storage medium, and electronic device for verifying large model inference, in order to partially solve the problems existing in the prior art.
[0007] The embodiments in this specification adopt the following technical solutions:
This specification provides a method for verifying large model inference, the method comprising:
obtaining the training dataset of a large model, wherein the training samples of the training dataset include input data and output results;
for each training sample, determining, based on the output result of that training sample, the steps that work backward from the output result to the boundary conditions of the input data, as the thought chain of the verification process;
performing supervised fine-tuning on the verification large model, using the training samples as input and the thought chains as annotations;
in response to a verification request for a task execution result of the business large model, sending the task input and the task execution result to the verification large model to obtain the thought chain output by the verification large model;
executing each step contained in the thought chain, determining the verification result, and executing the task based on the verification result.
[0008] This specification provides a large model inference verification device, the device comprising:
an acquisition module, used to acquire the training dataset of a large model, wherein the training samples of the training dataset include input data and output results;
a thought chain construction module, used to determine, for each training sample, the steps that work backward from the output result of the training sample to the boundary conditions of the input data, as the thought chain of the verification process;
a training module, used to perform supervised fine-tuning on the verification large model, using the training samples as input and the thought chains as annotations;
a verification module, used to respond to a verification request for a task execution result of the business large model by sending the task input and the task execution result to the verification large model and obtaining the thought chain output by the verification large model;
an execution module, used to execute each step contained in the thought chain, determine the verification result, and execute the task based on the verification result.
[0009] This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described large model inference verification method.
[0010] This specification provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the above-described large model inference verification method.
[0011] At least one of the technical solutions used in the embodiments of this specification can achieve the following beneficial effects: This specification discloses a large model inference verification method. The method starts from the output results in the training data used to train the large model and derives the steps that check whether they satisfy the boundary conditions of the input data; these steps serve as the thought chain of the verification process for the training data. Using the input data and output results as samples and the thought chains as annotations, a verification large model that generates thought chains is trained. When the execution results of a business large model are verified, this model generates a verification reasoning process from the business input and the execution result, and verification is then performed according to this process to determine whether the execution result is normal. Because the method verifies the correctness of a given solution rather than solving the problem itself, it avoids reasoning toward complex solutions. The thought chain describes how to verify whether the output satisfies the input constraints, which addresses the fundamental bottleneck of verification methods, namely their lack of universality.
Attached Figure Description
[0012] The accompanying drawings, which are included to provide a further understanding of this specification and form part of this specification, illustrate exemplary embodiments and are used to explain this specification, but do not constitute an undue limitation thereof. In the drawings:
Figure 1 is a flowchart of a large model inference verification method provided in the embodiments of this specification;
Figure 2 is a schematic diagram of a large model inference verification device provided in the embodiments of this specification;
Figure 3 is a schematic diagram of the structure of the electronic device provided in the embodiments of this specification.
Detailed Implementation
[0013] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of them. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this specification.
[0014] Large Language Models (LLMs), also known simply as large models, are models that perform logical reasoning on input natural language text and output natural language descriptions that users can understand. When LLMs are applied to tasks requiring rigorous verification, such as mathematical proofs, multi-step reasoning, or complex decision-making, their outputs often cannot be guaranteed to be correct, a phenomenon known as model "hallucination." Verifying the result of LLM reasoning is essentially a process of determining, given the input and the output, whether the output satisfies the input constraints. The structure of this process resembles that of nondeterministic polynomial time (NP) problems in computational complexity theory: generating the answer corresponds to the solving process, while checking the correctness of the answer corresponds to the verification process. In NP problems, verifying the correctness of a solution is often much easier than finding it. For the reasoning results of large models, however, the black-box nature of the reasoning process and the lack of factual evidence mean that the complexity of verification is no less than that of solving the problem again.
[0015] Existing methods for verifying whether LLM outputs contain "hallucinations" in complex reasoning tasks primarily rely on manually customized rules, such as equipping math problems with calculators, configuring compilers for code generation, and connecting fact-based question answering to knowledge bases. This problem-specific verification approach is equivalent to designing a separate validator for each task domain, which is not only costly and time-consuming to develop, but also difficult to migrate and reuse across different tasks.
[0016] More importantly, large models generate diverse inference results in open-domain scenarios, meaning that error forms can vary widely and new types of errors constantly emerge. Fixed verification rules struggle to detect these continuously evolving errors. While there are methods that suppress hallucinations through supervised fine-tuning (SFT) of LLMs, these methods require large amounts of high-quality manually labeled data, resulting in high training costs. Furthermore, differences between models make it difficult to guarantee that training samples transfer well from one model to another. Therefore, how to verify the inference results of LLMs in complex tasks without relying on manually customized rules or sacrificing model generality has become a problem restricting the application of large models.
[0017] The technical solutions provided in the various embodiments of this specification are described in detail below with reference to the accompanying drawings.
[0018] Figure 1 is a flowchart of the large model inference verification method provided in the embodiments of this specification, which specifically includes the following steps:
S100: Obtain the training dataset of a large model, wherein the training samples of the training dataset include input data and output results.
[0019] In the embodiments of this specification, the device that performs the large model inference verification method shown in Figure 1 can be any electronic device, such as a computer, a server, or a server cluster consisting of multiple servers. For ease of description, the following explanation uses a server as an example only.
[0020] To verify by reasoning whether LLM outputs contain "hallucinations", and to avoid re-solving the problem, which corresponds to the expensive solving side of an NP problem, a verification large model is needed to define the verification process that leads from the output back to the input data. This model must learn how to generate reasoning steps that verify the correctness of a given input and output. Training this verification large model requires a large number of training samples, which should be constructed from input data and the corresponding correct outputs.
[0021] Specifically, the server acquires the training dataset for the large model. Each training sample in the training dataset contains two parts: input data and output results. The input data is the original problem or task instruction input to the large model, such as a math word problem, a text to be summarized, or a detailed requirement for code generation. The output result is the standard answer or expected execution result corresponding to the input data. For example, in a math word problem solving scenario, the input data could be a problem containing specific numbers and operational relationships, while the output result would be the standard solution steps and the final numerical answer. Acquiring this input and output for large model training allows the training dataset to serve as the foundational data for constructing training samples.
[0022] Furthermore, in the embodiments described in this specification, the server may employ various methods to acquire the training dataset, and this specification does not impose any restrictions on these methods. For example, the server may collect real user requests and high-quality responses that have undergone manual review from historical business interaction logs, using them as raw training samples. Alternatively, in a simulation testing environment, based on preset rules or templates, it may batch generate input data covering various typical scenarios and simultaneously generate corresponding standard output results.
[0023] It's important to note that regardless of the method used, the output results must be accurate. Furthermore, to ensure the validation model has sufficient versatility and general applicability, the training samples in the dataset should cover as many scenarios as possible, with typical samples for each scenario. In other words, it needs to encompass multiple task domains to improve the model's adaptability to verification requirements in different business scenarios. On the other hand, for each task type, it needs to include typical samples that reflect different verification logics, enabling the model to learn verification methods for diverse input conditions within the same task type.
[0024] In this way, when training the model based on this training dataset, the large model can effectively learn the inherent constraint relationship between the output results and the input data, thus having a reliable verification capability in the application stage.
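As a minimal sketch of the dataset described for step S100, a training sample can be modeled as an input/output pair tagged with a task domain, and coverage can then be checked per domain. The `task_domain` field, the `underrepresented` helper, and the threshold are illustrative assumptions that do not appear in this specification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    task_domain: str    # hypothetical label, e.g. "math" or "code"
    input_data: str     # original problem or task instruction
    output_result: str  # standard answer, assumed to be correct


def domain_coverage(samples):
    """Count how many samples each task domain contributes."""
    return Counter(s.task_domain for s in samples)


def underrepresented(samples, min_per_domain):
    """Return the domains that have fewer than min_per_domain samples."""
    counts = domain_coverage(samples)
    return sorted(d for d, n in counts.items() if n < min_per_domain)


samples = [
    TrainingSample("math", "calculate the result of 25 multiplied by 16 plus 38", "438"),
    TrainingSample(
        "math",
        "A rectangle has a length of 12 meters and a width of 8 meters, find its area",
        "96 square meters",
    ),
    TrainingSample(
        "code",
        "write a function that computes the Fibonacci sequence",
        "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)",
    ),
]

print(domain_coverage(samples))
print(underrepresented(samples, min_per_domain=2))
```

A check of this kind only counts samples per domain; whether the samples within a domain reflect different verification logics would still need to be judged separately.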
[0025] S102: For each training sample, based on the output of the training sample, determine the steps that deduce the boundary conditions of the input data from the output, as the thought chain of the verification process.
[0026] In one or more embodiments of this specification, in order to enable the verification large model to learn how to determine the verification steps, the server needs to determine, for each training sample, the verification steps that work backward from the output result to check whether it conforms to the input data, and use these steps as the thought chain for verification.
[0027] Chain of Thought (CoT) refers to a technique in which a model generates a series of intermediate reasoning steps before outputting the final answer, thereby simulating the step-by-step thinking process humans use to solve complex problems. This explicit reasoning path breaks down multi-step logic into a traceable sequence of thought, allowing large models to not only output conclusions but also demonstrate the basis for those conclusions, thus improving the accuracy and interpretability of handling complex tasks.
[0028] In the embodiments described in this specification, the thought chain is still a series of ordered steps, but this sequence is no longer a reasoning path designed to generate an answer. It is instead a verification path used specifically to verify whether the output result is correct. Specifically, the thought chain starts from the output result and works backward to derive each verification node at which the result should satisfy the boundary conditions of the input data, forming a complete verification logic chain.
[0029] In the embodiments of this specification, the thought chain corresponding to a training sample is an explicit expression of the verification process. Traditional verification of large model outputs often relies on a black-box, holistic approach. In this embodiment, however, the thought chain transforms the implicit verification logic into an explicit sequence of steps, making the verification process executable.
[0030] It is important to note that in the embodiments of this specification, the thought chain shifts the focus from a problem-solving mindset to a verification mindset. Rather than focusing on how a model generates the correct answer, this embodiment borrows the core idea of NP problems, namely that checking a given answer is easier than finding one, and shifts the emphasis to quickly verifying the correctness of the answer. The thought chain is precisely this verification process. By breaking down complex verification tasks into a series of actionable atomic steps, it transforms the verification process, which originally required manually defined rules, into a learnable workflow.
[0031] Specifically, the server first determines, for each training sample, the input data and the corresponding output result that it contains.
[0032] The server sends the training samples to the sample annotation terminal, where relevant users manually annotate them. This annotation is a sequence of steps, where users deduce the validation nodes that the output result should satisfy based on the constraints contained in the input data, and describe each validation operation in natural language.
[0033] For example, a sample whose input is "calculate the result of 25 multiplied by 16 plus 38" and whose output is "438" is annotated as: "Step 1: calculate 25 multiplied by 16, the intermediate result should be 400; Step 2: add 38 to 400 to get 438; Step 3: compare the final result 438 with the output result, if they match, the verification is successful." In this approach, the server does not require any additional processing or tool binding for the steps in the thought chain, which exists in the form of natural language. The verification large model subsequently uses these natural language steps as learning targets and, through supervised fine-tuning, learns to generate similar verification steps from the input and output.
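The annotated sample above can be written down as a simple record, and its arithmetic can be replayed to confirm that the annotation is internally consistent. This is a sketch only: the `AnnotatedSample` class and its field names are assumptions for illustration, since the specification stores the thought chain purely as natural language.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedSample:
    input_data: str
    output_result: str
    thought_chain: List[str]  # ordered natural-language verification steps


sample = AnnotatedSample(
    input_data="calculate the result of 25 multiplied by 16 plus 38",
    output_result="438",
    thought_chain=[
        "Step 1: calculate 25 multiplied by 16, the intermediate result should be 400",
        "Step 2: add 38 to 400 to get 438",
        "Step 3: compare the final result 438 with the output result, "
        "if they match, the verification is successful",
    ],
)

# Replay the three annotated steps by hand.
intermediate = 25 * 16                         # Step 1
final = intermediate + 38                      # Step 2
verified = str(final) == sample.output_result  # Step 3

print(intermediate, final, verified)
```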
[0034] In addition, in the embodiments of this specification, the server can also determine the thought chain from historical verification records collected from historical business logs and from inference step samples obtained from public datasets.
[0035] Alternatively, the server can combine prompts for each training sample and send them to a general-purpose model, requesting the model to describe how to verify whether the output meets the constraints of the input data in a step-by-step reasoning manner. This yields a sequence of steps described in natural language by the general-purpose model. Then, a quality check is performed on the generated thought chain to identify potential errors. For example, the server can perform step-by-step verification based on the thought chain to determine if the verification result is correct. Since the output results determined in step S100 are all correct, the server can determine whether there are problems with the thought chain based on the verification results.
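The quality check described above relies on one fact: the output results from step S100 are known to be correct, so a thought chain that rejects them must itself be faulty. A minimal sketch follows, assuming (hypothetically) that each natural-language step has already been paired with an executable check; the specification does not say how that pairing is produced.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]    # hypothetical executable form of one step
Chain = List[Tuple[str, Check]]  # (natural-language step, executable check)


def chain_accepts(chain: Chain, output_result: str) -> bool:
    """Run every check in order; the chain accepts only if all checks pass."""
    return all(check(output_result) for _, check in chain)


def split_by_quality(output_result: str, chains: List[Chain]):
    """Separate chains that accept a known-correct output from suspect ones."""
    good, suspect = [], []
    for chain in chains:
        (good if chain_accepts(chain, output_result) else suspect).append(chain)
    return good, suspect


known_output = "438"  # correct answer to "25 multiplied by 16 plus 38"

sound_chain: Chain = [
    ("25 multiplied by 16 plus 38 should equal the output",
     lambda out: 25 * 16 + 38 == int(out)),
]
faulty_chain: Chain = [
    # The generated step misread 38 as 83, so it rejects a correct output.
    ("25 multiplied by 16 plus 83 should equal the output",
     lambda out: 25 * 16 + 83 == int(out)),
]

good, suspect = split_by_quality(known_output, [sound_chain, faulty_chain])
print(len(good), len(suspect))
```

This filter only catches chains that wrongly reject correct outputs; a chain that accepts everything would pass it, so it is a necessary check rather than a sufficient one.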
[0036] Alternatively, the server can determine the thought chain through multi-model voting. The server can send the same input data and output results to multiple different general-purpose models to obtain multiple thought chains. Then, semantic similarity analysis and step alignment are performed on these thought chains, and the thought chains with consistent expressions at the verification nodes are selected as annotations.
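One way to picture the voting step is to score each candidate chain by its average similarity to the others and keep the best-supported one. The sketch below substitutes token-set Jaccard similarity for the semantic similarity analysis named in the text and omits step alignment; both simplifications, and the function names, are assumptions for illustration.

```python
import re
from typing import List, Set


def tokens(chain: List[str]) -> Set[str]:
    """Lowercased word tokens of all steps in a chain."""
    return set(re.findall(r"\w+", " ".join(chain).lower()))


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set overlap, used here as a crude stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def vote(chains: List[List[str]]) -> int:
    """Index of the chain with the highest mean similarity to the others.

    Expects at least two chains.
    """
    def mean_similarity(i: int) -> float:
        others = [jaccard(chains[i], c) for j, c in enumerate(chains) if j != i]
        return sum(others) / len(others)

    return max(range(len(chains)), key=mean_similarity)


# Chains returned by three hypothetical general-purpose models.
chains = [
    ["compute 25 times 16 and expect 400", "add 38 and expect 438",
     "compare 438 with the output"],
    ["compute 25 times 16, expecting 400", "add 38, expecting 438",
     "compare 438 with the output"],
    ["check that the output is a positive integer"],
]

best = vote(chains)
print(best, chains[best])
```

The first two chains agree on the verification nodes and outscore the third, which checks something unrelated to the input constraints.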
[0037] Alternatively, the server can use different temperature parameters or prompt templates to generate multiple thought chains for the same training sample using the same general-purpose model. The server calculates the similarity between these generated results to determine which steps are consistently occurring core validation nodes, eliminating randomly generated noisy steps. After multiple rounds of generation and filtering, the server identifies the consistently occurring sequence of steps as the thought chain for that training sample.
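The filtering described above can be sketched as a frequency count over repeated generations: steps that recur in most runs are kept, and one-off steps are dropped as noise. For brevity this sketch treats steps as identical only if they match after whitespace and case normalization, which is a much stricter test than the similarity calculation in the text; the `min_fraction` threshold is also an assumed parameter.

```python
from collections import Counter
from typing import List


def normalize(step: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(step.lower().split())


def stable_steps(chains: List[List[str]], min_fraction: float = 0.6) -> List[str]:
    """Keep steps that occur in at least min_fraction of the generated chains.

    The result preserves the order in which steps first appear.
    """
    counts = Counter()
    for chain in chains:
        counts.update({normalize(s) for s in chain})  # count once per chain

    needed = min_fraction * len(chains)
    kept, seen = [], set()
    for chain in chains:
        for step in chain:
            key = normalize(step)
            if counts[key] >= needed and key not in seen:
                seen.add(key)
                kept.append(key)
    return kept


# Three generations for the same sample, e.g. at different temperatures.
generations = [
    ["Compute 25 * 16", "Add 38", "Compare with 438"],
    ["compute 25 * 16", "add 38", "restate the question"],
    ["compute 25 * 16", "add  38", "compare with 438"],
]

print(stable_steps(generations))
```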
[0038] S104: Using the training samples as input and the thought chain as annotation, perform supervised fine-tuning on the large validation model.
[0039] After obtaining the thought chains corresponding to each training sample, the server can perform supervised fine-tuning on the large validation model, enabling it to learn the ability to generate validation thought chains based on input data and output results.
[0040] The training objective of the validation model has shifted from traditional problem solving to validation. Therefore, the validation model is no longer merely a knowledge base that memorizes answers, but rather abstracts a universal validation methodology that spans tasks and domains by learning from massive amounts of validation logic. When faced with entirely new business scenarios, the validation model no longer relies on pre-defined rules, but can dynamically generate suitable validation paths based on the semantic features of the input data, achieving a leap from rule matching to path generation.
[0041] Furthermore, the thought chain generated by the large verification model makes the implicit verification logic explicit into an executable sequence of steps, which not only ensures the interpretability of the verification process, but also provides an operational basis for subsequent automated verification.
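For step S104, each training example pairs a prompt built from the input data and output result with the thought chain as the target text. The sketch below serializes one such example as a JSON line; the prompt wording and the `prompt`/`completion` field names are assumptions, since the specification does not fix a record format or a training framework.

```python
import json
from typing import List


def build_prompt(input_data: str, output_result: str) -> str:
    """Combine the sample's input and output into one model input (assumed layout)."""
    return (
        "Task input:\n" + input_data + "\n\n"
        "Task execution result:\n" + output_result + "\n\n"
        "List the steps that verify whether the result satisfies the input constraints."
    )


def to_sft_record(input_data: str, output_result: str, thought_chain: List[str]) -> str:
    """Serialize one supervised fine-tuning example as a JSON line."""
    record = {
        "prompt": build_prompt(input_data, output_result),
        "completion": "\n".join(thought_chain),  # the annotation to be learned
    }
    return json.dumps(record, ensure_ascii=False)


line = to_sft_record(
    "calculate the result of 25 multiplied by 16 plus 38",
    "438",
    [
        "Step 1: calculate 25 multiplied by 16, the intermediate result should be 400",
        "Step 2: add 38 to 400 to get 438",
        "Step 3: compare the final result 438 with the output result",
    ],
)
print(line)
```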
[0042] S106: In response to the verification request for the task execution result of the business big model, the task input and the task execution result are sent to the verification big model to obtain the thought chain output by the verification big model.
[0043] In the embodiments described in this specification, the verification large model trained through the aforementioned steps can be used to generate a verification thought chain for the output results of the business large model, so that subsequent steps can verify whether those output results are correct. The device that performs this verification process is usually also a server, but this specification does not limit whether it is the same server that trained the verification large model or a different one.
[0044] Specifically, firstly, the server's business environment continuously monitors the output of the business big model, or receives verification requests from the business big model. For example, when a business application obtains output through the configured business big model, the business process can encapsulate the output and task input into a verification request and send it to the server. After receiving the verification request, the server can parse the content of the verification request and extract the task input and task execution result. Here, the task input refers to the original request or question data that triggers the business big model to execute the task, and the task execution result refers to the answer or processing result generated and output by the business big model based on the task input.
[0045] Subsequently, the server organizes the extracted task input and task execution result according to the input format used during the training of the validation model, forming a complete prompt. For example, if the business scenario is a mathematical problem to be solved, and the task input is "A rectangle has a length of 12 meters and a width of 8 meters, find its area", and the task execution result is "96 square meters", the server concatenates these texts into a structured input and sends it to the validation model that has been fine-tuned.
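The request handling in step S106 can be sketched as parsing the verification request and filling the same prompt layout that was used during training. The JSON request shape, its key names, and the template text are hypothetical; the specification only requires that the inference-time format match the training-time format.

```python
import json
from typing import Tuple

# Assumed to be the same layout that was used when building the training prompts.
PROMPT_TEMPLATE = (
    "Task input:\n{task_input}\n\n"
    "Task execution result:\n{task_result}\n\n"
    "List the steps that verify whether the result satisfies the input constraints."
)


def parse_request(raw: str) -> Tuple[str, str]:
    """Extract the task input and task execution result from a verification request."""
    body = json.loads(raw)
    return body["task_input"], body["task_execution_result"]


def build_verification_prompt(raw: str) -> str:
    """Turn a raw verification request into the prompt for the verification model."""
    task_input, task_result = parse_request(raw)
    return PROMPT_TEMPLATE.format(task_input=task_input, task_result=task_result)


request = json.dumps({
    "task_input": "A rectangle has a length of 12 meters and a width of 8 meters, "
                  "find its area",
    "task_execution_result": "96 square meters",
})

prompt = build_verification_prompt(request)
print(prompt)
```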
[0046] After receiving the input, the large-scale verification model, based on its verification capabilities learned during the training phase, generates a sequence of steps, i.e., a thought chain, describing how to verify the correctness of the result for the current task input and execution result.
[0047] The verification large model generates the thought chain entirely from its internal parameters and the input content, without accessing any external knowledge bases or tools. After receiving the thought chain returned by the verification large model, the server uses it as the response to this verification request; the thought chain is then used to execute the verification steps. Through step S106, the server transforms an abstract verification request into a concrete and operable verification process.
[0048] S108: Execute each step contained in the determined thought chain, determine the verification result, and perform the task based on the verification result.
[0049] In one or more embodiments of this specification, after obtaining the thought chain, the server needs to perform a verification operation based on the descriptive thought chain. By sequentially executing each step in the thought chain, a final verification conclusion is obtained, and the subsequent direction of the business task is determined based on the verification conclusion.
[0050] Specifically, the server can use basic scripts to execute the verification steps in the thought chain. Using the thought chain as an operational guide, through a combination of manual interpretation and basic tools, each verification operation is completed sequentially, and the results are recorded. Ultimately, the direction of the business process is determined based on the execution conclusions.
[0051] After receiving the thought chain returned by the large verification model, the server can return it to the user terminal in a readable format. Since the thought chain exists as a sequence of natural language steps, the verification steps can be returned to the user terminal sequentially. For each step, the server can provide only a natural language description, allowing the user to complete the verification operation and return the result. The server can then record the execution result of each step and control the flow according to the thought chain's sequence. After all steps are executed, the server summarizes the final step's execution result, determines whether it matches the task input boundary conditions, and thus derives the verification result.
[0052] In addition, in the embodiments of this specification, the server can first determine the execution order of each step in the thought chain.
[0053] Then, following this execution order, for each step, the execution result of the previous step or the task execution result is used as input, and the execution result of the step is obtained according to the verification logic of the step.
[0054] Specifically, for each step, the server can identify the validation logic defined for that step and determine its input source. Depending on the validation logic, the input may be taken directly from the original task execution result, or it may need to incorporate boundary conditions from the task input. The server performs the validation operations for that step, such as numerical comparison, format checking, and logical judgment, and records the execution result of this step.
[0055] Subsequently, the server continues to execute the next step according to the execution order marked in the thought chain. For each step from the second step onward, the server uses the execution result of the previous step as the main input for this step. Depending on the specific verification logic of that step, it may also need to reference the original task execution result or specific fields from the task input again. The server executes each step sequentially, generating corresponding intermediate execution results, until all steps recorded in the thought chain have been executed.
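The sequential execution described above amounts to a fold over the ordered steps, where each step receives the previous result and may also consult the task input or the original execution result. In this sketch every step is a hypothetical Python callable; how natural-language steps become callables is outside what the specification describes here.

```python
import re
from typing import Any, Callable, List

# Hypothetical executable step:
# (previous result, task input, task execution result) -> step result
Step = Callable[[Any, str, str], Any]


def run_chain(steps: List[Step], task_input: str, task_result: str) -> List[Any]:
    """Execute steps in order and record every intermediate execution result."""
    results = []
    previous = task_result  # the first step starts from the task execution result
    for step in steps:
        previous = step(previous, task_input, task_result)
        results.append(previous)
    return results


def extract_dimensions(prev, task_input, task_result):
    # Boundary conditions come from the task input: the two lengths in meters.
    return [int(n) for n in re.findall(r"(\d+) meters", task_input)]


def recompute_area(prev, task_input, task_result):
    length, width = prev  # output of the previous step
    return length * width


def compare_with_result(prev, task_input, task_result):
    # References the original task execution result again.
    claimed = int(re.search(r"\d+", task_result).group())
    return prev == claimed


results = run_chain(
    [extract_dimensions, recompute_area, compare_with_result],
    "A rectangle has a length of 12 meters and a width of 8 meters, find its area",
    "96 square meters",
)
print(results)
```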
[0056] After each step is completed, the server can determine whether the execution result matches the task input. Specifically, the server can perform semantic matching between the execution result and the task input based on the task type.
[0057] For example, in a code generation scenario, the task input requires generating a function that can calculate the Fibonacci sequence, and the task execution result is a code snippet returned by the business large model. After the steps in the thought chain, such as compilation checks, test case execution, and boundary condition testing, have been executed sequentially, the final result might be an indicator recording the test pass rate. The server determines whether this indicator meets the expected standard implied by the task input, such as all test cases passing.
[0058] Finally, if the final execution result matches the expected boundary conditions of the task input, the server determines that the task execution result meets the requirements, and the verification result is normal. Conversely, if any verification step fails, or the final result does not match the expectation, the server determines that the task execution result is hallucinated or erroneous, and the verification result is abnormal.
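The Fibonacci example can be sketched as a pass-rate computation: load the returned snippet, run it against test cases that include boundary inputs, and compare the pass rate with the expected standard. The test cases and the function name `fib` are assumptions, and `exec` is used only for brevity; generated code would normally have to run in an isolated sandbox.

```python
from typing import Any, List, Tuple


def pass_rate(source: str, func_name: str, cases: List[Tuple[tuple, Any]]) -> float:
    """Fraction of test cases that the generated function passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # stands in for the compilation check
    except Exception:
        return 0.0
    func = namespace.get(func_name)
    if not callable(func):
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


# Assumed test cases, including the boundary inputs 0 and 1.
CASES = [((0,), 0), ((1,), 1), ((2,), 1), ((10,), 55)]

generated = """
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

buggy = generated.replace("return a", "return b")  # off-by-one variant

rate_ok = pass_rate(generated, "fib", CASES)
rate_buggy = pass_rate(buggy, "fib", CASES)
print(rate_ok, rate_buggy)
```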
[0059] Furthermore, after obtaining the verification result, the server can return the verification result, which the business application uses to determine whether to continue executing the business.
[0060] When the verification result is normal, the server returns the verified status along with the task execution result. Specifically, this return can be routed to the business process corresponding to the business large model that initiated the verification request. The relevant business system can then treat the execution result of the business large model as valid output based on the returned content and continue with subsequent processing steps. For example, it can present the answer to the user or submit the code to the code repository.
[0061] When the verification result is abnormal, the server can return a verification result indicating that the verification failed and trigger a blocking mechanism. The corresponding business system can then stop executing business processes based on the task execution result, preventing hallucinated or erroneous output from being used for business execution.
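The decision and routing logic described above can be sketched as two small functions: one that derives the normal or abnormal verdict, and one that decides whether the business process continues or is blocked. The response fields and action names are illustrative assumptions.

```python
from typing import List, Optional


def verification_result(step_passed: List[bool], final_matches: bool) -> str:
    """Normal only if every step passed and the final result matches expectations."""
    return "normal" if all(step_passed) and final_matches else "abnormal"


def route(verdict: str, task_result: str) -> dict:
    """Build the response returned to the business process (assumed shape)."""
    if verdict == "normal":
        # Verified: hand the result back so the business flow can continue.
        return {"status": "verified", "action": "continue", "task_result": task_result}
    # Not verified: withhold the result and trigger the blocking mechanism.
    blocked: Optional[str] = None
    return {"status": "verification_failed", "action": "block", "task_result": blocked}


ok = route(verification_result([True, True, True], final_matches=True), "96 square meters")
failed = route(verification_result([True, False, True], final_matches=True), "90 square meters")
print(ok)
print(failed)
```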
[0062] based on Figure 1 The large-scale model inference verification method described herein uses the output results from the inference training data used to train the large model to infer the steps that satisfy the boundary conditions of the input data. This serves as a thought chain for the verification process of the training data. Using the input data and output results as samples and the thought chain as annotations, a large-scale verification model generating the thought chain is trained. This allows for the generation of an inference process based on the business input and execution results when verifying the execution results of the large-scale business model. Verification is then performed according to this inference process to determine whether the execution results are normal. By avoiding the need for solving problems and instead verifying the correctness of the results through inference based on a given solution, the method avoids the inference of complex problem solutions. The thought chain characterizes whether the verification output satisfies the input constraints, thus overcoming the fundamental bottleneck of the lack of universality in verification methods.
[0063] In the embodiments of this specification, the verification steps in a thought chain often rely not only on the reasoning ability of the large model itself but also on external tools to complete specific operations. For example, accurate numerical calculation requires calling a calculator, verifying the correctness of a code snippet requires a code executor, and verifying factual information requires a search engine or knowledge base.
[0064] Therefore, after step S102, the server can also configure the tools required by the verification steps, so that these tools can be called for verification when the large verification model is applied.
[0065] First, the server obtains the thought chain corresponding to each labeled training sample. For each thought chain, the server parses each verification step it contains and identifies whether the verification operation described in that step requires an external tool. For example, if a step is described as "calculate the result of 25 multiplied by 16," the server recognizes that this operation requires arithmetic calculation and therefore categorizes it as requiring a calculator tool. If another step is described as "execute the above Python code and verify whether the output is the expected result," the server recognizes that this operation requires a code execution environment and therefore categorizes it as requiring a code executor tool.
[0066] Then, after the server has traversed all the steps in each thought chain, it obtains an initial list of tool requirements.
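As an illustration of the traversal above, the following sketch classifies each step with simple keyword patterns and collects an initial list of tool requirements. The patterns in `TOOL_PATTERNS` and the function names are hypothetical; this specification does not prescribe how the server recognizes the required tool, so keyword matching is only a stand-in.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical keyword patterns, one per tool type; checked in this order.
TOOL_PATTERNS = {
    "calculator": re.compile(r"\b(calculate|multipl\w*|divid\w*|sum|add\w*|subtract\w*)\b", re.I),
    "code_executor": re.compile(r"\b(execute|run)\b.*\bcode\b", re.I),
    "search_engine": re.compile(r"\b(search|look up|retrieve)\b", re.I),
}


def required_tool(step: str) -> Optional[str]:
    """Return the tool type a step appears to need, or None if it needs no tool."""
    for tool, pattern in TOOL_PATTERNS.items():
        if pattern.search(step):
            return tool
    return None


def initial_tool_requirements(thought_chains: List[List[str]]) -> List[Tuple[str, str]]:
    """Traverse every step of every thought chain and list (step, tool) pairs."""
    requirements = []
    for chain in thought_chains:
        for step in chain:
            tool = required_tool(step)
            if tool is not None:
                requirements.append((step, tool))
    return requirements


chains = [
    ["calculate the result of 25 multiplied by 16", "compare the two strings"],
    ["execute the above Python code and verify whether the output is the expected result"],
]
for step, tool in initial_tool_requirements(chains):
    print(tool, "<-", step)
```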
[0067] Subsequently, the server consolidates and unifies the tool requirements in the tool requirements list. Since different thought chains may use different methods to describe the same type of verification operation, for example, some steps require "calling a calculator to calculate", while others require "using a math library to evaluate", the server needs to cluster these semantically similar operations into the same tool category and design a general-purpose tool for each category that can cover the requirements of that type of operation.
[0068] For example, steps involving arithmetic operations can be grouped into a "calculator" tool, steps involving the code execution environment into a "code executor" tool, and steps requiring real-time information retrieval into a "search engine" tool. Through this grouping, the server can be configured with sufficiently versatile tools to meet the invocation requirements of verification steps within the generated thought chain in practical applications.
[0069] Finally, the server can build a tool library and its corresponding APIs for each tool within the merged tool categories. Specifically, for each tool category, the server can retrieve the corresponding implementation module based on the required functionality. For example, for a calculator tool, the server can encapsulate a function that parses arithmetic expressions and returns the calculation result. For a code executor tool, the server can build a secure code execution sandbox that supports execution and result recording of multiple programming languages. For a search engine tool, the server can interface with existing search engine application programming interfaces (APIs) and encapsulate unified query and result parsing logic.
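For the calculator tool mentioned above, one possible encapsulation is sketched below. It uses Python's standard `ast` module to parse an arithmetic expression and evaluate only numeric literals and basic operators. The function name `calculate` and the set of supported operators are assumptions made for illustration, not details given in this specification.

```python
import ast
import operator

# Operators the sketch accepts; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate an arithmetic expression string; raise ValueError for anything else."""

    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"invalid expression: {expression!r}") from exc
    return _eval(tree.body)


print(calculate("25*16"))  # 400
```

Walking the syntax tree instead of calling `eval` keeps function calls and attribute access out of the tool, which matters because the expression text originates from model output.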
[0070] After the implementation of each tool is completed, the server configures a unified calling interface for each tool, defining the format of input parameters, the specifications of output results, and the calling method.
[0071] Through the above steps, the server builds a library of calling tools that includes a variety of standardized tools, and provides a calling interface for each tool.
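A unified calling interface of this kind could look like the following sketch, in which every tool takes a string payload and every call returns the same result structure. `ToolLibrary`, `ToolResult` and the toy `multiply` tool are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolResult:
    """Uniform output format shared by all tools in the sketch."""
    ok: bool
    output: Any = None
    error: str = ""


class ToolLibrary:
    """Hypothetical tool library: tools are registered by name and called uniformly."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], Any]] = {}

    def register(self, name: str, fn: Callable[[str], Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, payload: str) -> ToolResult:
        if name not in self._tools:
            return ToolResult(False, error=f"unknown tool: {name}")
        try:
            return ToolResult(True, output=self._tools[name](payload))
        except Exception as exc:  # report tool failures instead of crashing the verifier
            return ToolResult(False, error=str(exc))


def multiply(expr: str) -> int:
    """Toy stand-in for a calculator tool: handles only 'a*b' with integers."""
    a, b = expr.split("*")
    return int(a) * int(b)


library = ToolLibrary()
library.register("calculator", multiply)
print(library.call("calculator", "25*16").output)  # 400
print(library.call("search_engine", "query").ok)   # False
```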
[0072] Furthermore, in the embodiments of this specification, once the server has determined the tool library, that library can support subsequent automated verification while the thought chains of the training samples remain unchanged: each is still a sequence of steps described in natural language that does not indicate which tool each step needs. In that case, the large verification model generates a plain-text thought chain in step S106, and the execution stage in step S108 must re-parse the semantics of each step and match the corresponding tool on the fly.
[0073] To further improve verification efficiency and reduce the complexity of executing step S108, the server can update the thought chain, enabling the large verification model to learn the association between verification steps and executable external tools. The server can integrate tool types and their API call information into the thought chain, making the thought chain itself an executable instruction sequence with tool annotations. This reduces the parsing burden during subsequent applications and improves the determinism and execution efficiency of the verification process.
[0074] Specifically, for each thought chain, the server determines the steps that deduce, from the output result, that the boundary conditions of the input data are satisfied.
[0075] Next, for each step, the type of tool required for that step is determined. The server can analyze the step to identify the type of tool needed for the verification operation described in that step. If the server determines that the step can be completed by the large model itself without calling external tools, then tool type labeling is not required.
[0076] Then, based on the determined tool type and the API configured in the API library, this step is updated. The server can query the API library to obtain the specific API information corresponding to the tool type. The API library pre-configures a unified API for each type of tool; for example, the calculator tool's API is defined to receive an arithmetic expression in string form and return a numerical result.
[0077] The server can embed this interface identifier or invocation method into the description of the corresponding step to update the step. For example, the updated step description is: "Invoke the calculator tool, enter the expression '25*16', and get the result 400." Through the update, the server transforms the originally plain text description of the step into an executable step with explicit tool invocation instructions.
[0078] Finally, the thought chain of the verification process is determined from the updated steps. When the training samples are then used for supervised fine-tuning of the large verification model, the model learns sequences of verification operations with tool annotations.
[0079] This ensures that in step S106, when the large verification model faces a new verification request in actual application, the generated thought chain will directly contain the tool type and interface information that should be called in each step. In the subsequent execution step S108, no additional semantic parsing and tool matching are required. The steps can be executed in the order of the calling instructions marked in the thought chain.
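One way to picture a tool-annotated step is as a small structured record that can be written into the thought chain and read back at execution time without further semantic parsing. The sketch below is illustrative only: the `Step` fields and the JSON serialization are assumptions, since this specification states only that the interface identifier or invocation method is embedded in the step description.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical annotated step: description plus optional tool call."""
    description: str
    tool: Optional[str] = None        # None means the model completes the step itself
    tool_input: Optional[str] = None


def to_line(step: Step) -> str:
    """Serialize one step so the tool annotation travels with the thought chain."""
    return json.dumps(asdict(step))


def from_line(line: str) -> Step:
    """Recover the step and its tool annotation at execution time."""
    return Step(**json.loads(line))


chain: List[Step] = [
    Step("Invoke the calculator tool with the expression '25*16'", "calculator", "25*16"),
    Step("Check that the computed value equals the value stated in the answer"),
]
lines = [to_line(s) for s in chain]
restored = [from_line(line) for line in lines]
print(restored[0].tool, restored[0].tool_input)  # calculator 25*16
```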
[0080] Furthermore, the server can parse each step in each thought chain separately and identify the verification operation corresponding to each step.
[0081] Next, the identified verification operations are clustered semantically, and from these clusters, those requiring external tools are selected. Through this process, the server can group operations that are different in description but essentially similar into the same category.
[0082] Clustering can be based on keywords of the operations, semantic vector similarity, or manually defined rules. For example, all operations involving mathematical operations such as addition, subtraction, multiplication, division, exponentiation, and root extraction, regardless of their specific numerical values, are clustered into the "Mathematical Calculation" category. All operations involving running code and obtaining output, regardless of the code language (Python, Java, or JavaScript), are clustered into the "Code Execution" category. All operations involving retrieving information from the internet or knowledge bases are clustered into the "Information Retrieval" category. Through clustering, the server integrates previously scattered operations into several representative clusters.
[0083] Then, for each selected cluster, the common semantics and common requirements of the tools called by the verification operations in that cluster are determined, a general-purpose tool covering each specific operation in that cluster is constructed, and the corresponding calling interface is configured.
[0084] In other words, after clustering, the server can filter out those clusters that require external tools to complete. Some validation operations may only rely on the reasoning capabilities of the large model itself, such as logical judgments and text comparisons, which do not require external tools. However, operations like mathematical calculations, code execution, and information retrieval require specialized external tools to complete accurately and efficiently. The server can mark these clusters that require external tools as categories for which tools need to be built.
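The clustering and selection described above can be sketched with a deliberately crude technique: each operation is assigned to the cluster whose seed vocabulary it shares the most words with. This vocabulary overlap is a stand-in for the keyword, semantic-vector or rule-based clustering named in this specification; the vocabularies in `CLUSTER_VOCAB` and the "Logical Judgment" cluster name are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical seed vocabularies per cluster.
CLUSTER_VOCAB: Dict[str, Set[str]] = {
    "Mathematical Calculation": {"calculate", "calculator", "multiply", "add",
                                 "subtract", "divide", "evaluate", "math"},
    "Code Execution": {"execute", "run", "code", "python", "java", "javascript"},
    "Information Retrieval": {"search", "retrieve", "lookup", "knowledge", "internet"},
    "Logical Judgment": {"compare", "check", "judge", "whether"},
}
# Clusters that cannot be completed by the model's own reasoning.
NEEDS_EXTERNAL_TOOL = {"Mathematical Calculation", "Code Execution", "Information Retrieval"}


def assign_cluster(operation: str) -> str:
    """Pick the cluster sharing the most words with the operation description."""
    tokens = set(re.findall(r"[a-z]+", operation.lower()))
    scores = {name: len(tokens & vocab) for name, vocab in CLUSTER_VOCAB.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unclustered"


def cluster_operations(operations: List[str]) -> Dict[str, List[str]]:
    clusters = defaultdict(list)
    for op in operations:
        clusters[assign_cluster(op)].append(op)
    return dict(clusters)


def select_tool_clusters(clusters: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the clusters for which an external tool must be built."""
    return {name: ops for name, ops in clusters.items() if name in NEEDS_EXTERNAL_TOOL}


ops = [
    "calling a calculator to calculate",
    "using a math library to evaluate",
    "run the Python code and obtain the output",
    "compare the two text strings",
]
print(select_tool_clusters(cluster_operations(ops)))
```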
[0085] For each selected cluster, the server further analyzes the common semantics and shared requirements of all verification operations within that cluster. For example, in the "mathematical calculation" cluster, although it includes different specific operations such as multiplication, addition, and equation solving, their common requirement is to receive a mathematical expression and return the calculation result. Based on this common requirement, the server determines a general-purpose calculator tool, which is used to parse various forms of mathematical expressions, perform corresponding operations, and return the calculation result. Similarly, for the code execution cluster, the server can determine a general-purpose code executor tool, which is used to receive code snippets from different programming languages, run them in a sandbox environment, and return execution output or error information. For the information retrieval cluster, the server determines a general-purpose search engine tool, which is used to receive query keywords, call external search engine or knowledge base interfaces, and return structured query results.
[0086] Ultimately, the server constructs a toolkit that has undergone semantic clustering and generalization abstraction. Each tool can cover a class of semantically similar verification operations and has a unified calling interface. This toolkit not only provides a standardized basis for tool types for updating the thought chain, but also provides reusable tools for the execution phase of the verification in the subsequent step S108.
[0087] In addition, assuming the server has built a tool library, in step S108, the server can determine whether it needs to call a tool in response to each step in the determined thought chain.
[0088] If so, the corresponding tool is invoked from the tool library to perform this step.
[0089] If not, the step is executed directly. That is, if the server determines that no tool needs to be called and the step can be completed by relying on the reasoning ability of the large model itself, the server directly executes the verification logic defined in the step.
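The per-step decision in step S108 can be sketched as a simple dispatch loop. The sketch assumes that each step already carries its tool annotation as a tuple, and it models direct execution by the large model as a callback named `reason_directly`; both are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# (description, tool name or None, tool input); a hypothetical step layout.
Step = Tuple[str, Optional[str], str]


def run_chain(steps: List[Step],
              tools: Dict[str, Callable[[str], Any]],
              reason_directly: Callable[[str], Any]) -> List[Any]:
    """Execute steps in order, calling a tool only when the step requires one."""
    outputs = []
    for description, tool, tool_input in steps:
        if tool is not None:
            outputs.append(tools[tool](tool_input))       # tool needed: call it from the library
        else:
            outputs.append(reason_directly(description))  # no tool: execute the step directly
    return outputs


# Toy stand-ins for a tool library and for the model's own reasoning.
tools = {"word_count": lambda text: len(text.split())}
steps = [
    ("Count the words in the answer", "word_count", "red blue blue red"),
    ("Check that the count equals the number of vertices", None, ""),
]
print(run_chain(steps, tools, lambda d: f"model handled: {d}"))
```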
[0090] Based on the foregoing, the verification process is described below using a specific example from the embodiments of this specification, a graph coloring verification scenario. The task input is a graph coloring problem: given an undirected graph G=(V,E) and k colors, determine whether there exists a color assignment scheme such that adjacent vertices have different colors. The graph has four vertices A, B, C, and D, and four edges (A,B), (A,C), (B,D), and (C,D). Two colors are given, red and blue. The output of the large business model is a specific coloring scheme in which vertex A is red, vertex B is blue, vertex C is blue, and vertex D is red.
[0091] The server then calls the large-scale verification model, constructing input information from the task input and task execution results, and inputs it into the large-scale verification model. The large-scale verification model outputs the following thought chain: First, identify adjacent vertex pairs; second, for each vertex pair, check whether the colors of the vertices are the same. If they are the same, the solution is determined to be incorrect; otherwise, continue verification until all vertex pairs pass the verification.
[0092] The server then executes the verification process according to this thought chain.
[0093] First, it is necessary to identify all adjacent vertex pairs that need to be checked, that is, to determine the four pairs of adjacent relationships (A, B), (A, C), (B, D), and (C, D) based on the edge set.
[0094] Next, check whether the colors of each pair of adjacent vertices are the same. Specifically, this includes the following steps: 1. For edge (A, B), vertex A is red and vertex B is blue, and the two have different colors.
[0095] 2. For edge (A,C), vertex A is red and vertex C is blue, so the colors are different.
[0096] 3. For edge (B,D), vertex B is blue and vertex D is red, so the colors are different.
[0097] 4. For edge (C,D), vertex C is blue and vertex D is red, so the colors are different.
[0098] Finally, after the above four checks, the server determined that all adjacent vertex pairs met the requirement of having distinct colors, and therefore determined that the task execution result was correct.
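The four checks above can be reproduced with a short sketch that follows the two steps of the thought chain: enumerate the adjacent vertex pairs from the edge set, then compare the colors of each pair. The function name `verify_coloring` is hypothetical. The sketch checks only what the thought chain states, so it does not separately check that every vertex uses one of the k given colors.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def verify_coloring(edges: List[Edge], coloring: Dict[str, str]) -> Tuple[bool, List[Edge]]:
    """Return (is_valid, conflicting_edges) for a proposed coloring."""
    # Step 1: the adjacent vertex pairs are exactly the edges.
    # Step 2: a pair fails if both endpoints carry the same color.
    conflicts = [(u, v) for u, v in edges if coloring[u] == coloring[v]]
    return (not conflicts), conflicts


edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
coloring = {"A": "red", "B": "blue", "C": "blue", "D": "red"}
print(verify_coloring(edges, coloring))  # (True, [])

# A wrong output from the business model is caught without solving the problem:
bad = {"A": "red", "B": "blue", "C": "blue", "D": "blue"}
print(verify_coloring(edges, bad))  # (False, [('B', 'D'), ('C', 'D')])
```

Verification takes one pass over the edges, whereas finding a valid k-coloring is in general a hard search problem. This is the asymmetry the method relies on.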
[0099] The above is an example of a large model inference verification method provided in this specification. Based on the same idea, this specification also provides corresponding devices, storage media and electronic devices.
[0100] Figure 2 is a schematic diagram of a large model inference verification device provided in an embodiment of this specification. The device includes: an acquisition module 201, used to acquire the training dataset of a large model, wherein the training samples of the training dataset include input data and output results; a thought chain construction module 202, used to determine, for each training sample, the steps that deduce from the output result of the training sample that the boundary conditions of the input data are satisfied, as the thought chain of the verification process; a training module 203, used to perform supervised fine-tuning of the large verification model using the training samples as input and the thought chains as annotations; a verification module 204, used to, in response to a verification request for the task execution result of the large business model, send the task input and the task execution result to the large verification model and obtain the thought chain output by the large verification model; and an execution module 205, used to execute each step contained in the determined thought chain, determine the verification result, and execute the task based on the verification result.
[0101] Optionally, the device further includes: a tool library module 206, used to determine the thought chain corresponding to each training sample; determine the various tools to be called for verification according to each step in the thought chain; and construct a calling tool library and calling interfaces for each tool in the calling tool library according to the various tools.
[0102] Optionally, the thought chain construction module 202 is used to determine each step that satisfies the boundary conditions of the input data by inferring from the output result; for each step, determine the type of tool that needs to be called; update the step according to the determined tool type and the calling interface configured in the calling tool library; and determine the thought chain as the verification process according to the adjusted steps.
[0103] Optionally, the execution module 205 is configured to, in response to executing each step contained in the determined thought chain, determine whether executing the step requires calling a tool; if so, call the corresponding tool from the tool library to execute the step.
[0104] Optionally, the execution module 205 is configured to, according to the determined execution order of each step in the thought chain, sequentially perform each step, taking the execution result of the previous step or the task execution result as input, and obtain the execution result of the step according to the verification logic of the step; after each step is completed, determine whether the obtained execution result matches the task input; if they match, determine that the task execution result satisfies the boundary conditions of the task input, and the verification result is normal; if they do not match, determine that the task execution result does not satisfy the boundary conditions of the task input, and the verification result is abnormal.
[0105] Optionally, the execution module 205 is configured to return the task execution result to the business process corresponding to the large business model for subsequent processing when the verification result is normal; and, when the verification result is abnormal, to return a verification result indicating that verification failed and to stop executing the business process corresponding to the large business model based on the task execution result.
[0106] Optionally, the tool library module 206 is used to parse each step in each thought chain and identify the verification operation corresponding to each step; cluster the identified verification operations according to semantics, and select the clusters that need to call external tools from the clusters obtained; for each selected cluster, determine the common semantics and common requirements of the calling tools according to the tools called by the verification operations in the cluster, construct a general-purpose tool covering each specific operation in the class, and configure the corresponding calling interface.
[0107] This specification also provides a computer-readable storage medium storing a computer program that, when executed by a processor, can be used to perform the large model inference verification method provided above.
[0108] Corresponding to the large model inference verification method shown in Figure 1, the embodiments of this specification also provide the electronic device whose schematic structure is shown in Figure 3. As shown in Figure 3, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may also include other hardware required for business operations. The processor reads the corresponding computer program from the non-volatile storage into memory and then runs it to implement the large model inference verification method described above.
[0109] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification.
Claims
1. A method for verifying large-scale model inference, the method comprising: Obtain the training dataset of a large model, wherein the training samples of the training dataset include input data and output results; For each training sample, based on the output of that training sample, determine the steps that deduce the boundary conditions of the input data from the output, as the thought chain of the verification process; Using the training samples as input and the thought chain as annotation, supervised fine-tuning is performed on the large validation model; In response to a verification request for the task execution result of the business big model, the task input and the task execution result are sent to the verification big model to obtain the thought chain output by the verification big model; Execute each step in the defined thought chain, determine the verification result, and perform the task based on the verification result.
2. The method of claim 1, further comprising: Determine the thought chain corresponding to each training sample; Based on each step in the aforementioned thought chain, determine the various tools required for verification; Based on the aforementioned tools, a tool library for invoking and the invoking interfaces for each tool in the tool library are constructed.
3. The method as described in claim 2, wherein determining the steps for deriving the boundary conditions satisfying the input data from the output results, as the thought process chain of the verification process, specifically includes: The steps for deriving the boundary conditions that satisfy the input data from the output results are determined. For each step, determine the type of tool that needs to be called; Update this step based on the determined tool type and the calling interface configured in the calling tool library; Based on the adjusted steps, determine the thought process chain for the verification process.
4. The method as described in claim 2, wherein executing the steps contained in the determined thought chain specifically includes: For each step in the identified thought chain, in response to executing that step, determine whether it is necessary to invoke a tool. If so, the corresponding tool is invoked from the tool library to perform this step.
5. The method as described in claim 1, wherein executing each step contained in the determined thought chain to determine the verification result specifically includes: Based on the execution order of each step in the determined thought chain, for each step, the execution result of the previous step or the task execution result is used as input, and the execution result of the step is obtained according to the verification logic of the step. After each step is completed, determine whether the execution result matches the task input. If they match, the task execution result is determined to satisfy the boundary conditions of the task input, and the verification result is normal. If they do not match, the task execution result is determined not to satisfy the boundary conditions of the task input, and the verification result is abnormal.
6. The method as described in claim 5, wherein executing the task based on the verification result specifically includes: When the verification result is normal, the task execution result is returned to the business process corresponding to the large business model for subsequent processing; When the verification result is abnormal, a verification result indicating that verification failed is returned, and execution of the business process corresponding to the large business model based on the task execution result is stopped.
7. The method as described in claim 2, wherein, based on each step in the thought chain, the various tools required for verification are determined, specifically including: Each step in each thought chain is analyzed separately, and the corresponding verification operation is identified for each step. The identified verification operations are clustered according to semantics, and the clusters that require calling external tools are selected from the clusters obtained. For each selected cluster, based on the tools called for verification operations in that cluster, determine the common semantics and common requirements of the calling tools, build a general-purpose tool that covers each specific operation in that cluster, and configure the corresponding calling interface.
8. A large-scale model reasoning verification device, the device comprising: The acquisition module is used to acquire the training dataset of a large model, wherein the training samples of the training dataset include input data and output results; The thought chain construction module is used to determine, for each training sample, the steps that deduce the boundary conditions of the input data from the output results of the training sample, as the thought chain of the verification process; The training module is used to perform supervised fine-tuning of the large validation model using the training samples as input and the thought chain as annotation. The verification module is used to respond to the verification request for the task execution result of the business big model, send the task input and the task execution result to the verification big model, and obtain the thought chain output by the verification big model; The execution module is used to execute each step contained in the determined thought chain, determine the verification result, and execute the task based on the verification result.
9. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in any one of claims 1-7.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in any one of claims 1-7.