A calibration method, device, equipment, computer readable storage medium and computer program product of a large language model

By introducing a structured correction strategy that combines confidence mapping and cognitive graph retrieval with multi-dimensional reward calculation, the problem of insufficient risk prediction and correction logic in action generation of large language models is solved, the accuracy and reliability of action execution are improved, and the model's continuous learning and task execution stability are realized.

CN122242764A · Pending · Publication Date: 2026-06-19 · TENCENT TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECH (BEIJING) CO LTD
Filing Date
2026-03-26
Publication Date
2026-06-19

Abstract

This application provides a calibration method, apparatus, device, computer-readable storage medium, and computer program product for a large language model. The method includes: responding to a task instruction and generating an initial action to be executed using a large language model in conjunction with the global observation state; calculating the action confidence and mapping it to a risk assessment value; if the risk assessment value exceeds a preset risk threshold, intercepting the initial action and extracting risk features; retrieving a diagnostic strategy template from a pre-constructed cognitive graph using the global observation state and risk features as indexes; generating a structured correction strategy and a corrected action based on the template and issuing the corrected action for execution; obtaining execution feedback and calculating immediate reward values under multiple indicator dimensions; mapping the expected value of goal achievement to a predictive reward value in conjunction with the task objective; performing weighted fusion of the multi-dimensional immediate reward values and the predictive reward value to obtain a global reward value; and using the global reward value as a feedback signal to iteratively update network parameters. This application can improve the reliability of action execution.

Description

Technical Field

[0001] This application relates to artificial intelligence technology, and more particularly to a method, apparatus, device, computer-readable storage medium, and computer program product for calibrating a large language model.

Background Technology

[0002] In related technologies, large language models, upon receiving task instructions, can generate actions to be executed based on the current observation state. To improve the accuracy of action execution, after the model outputs the action, it typically intervenes in the action through preset logic or feedback mechanisms, and uses feedback signals to adjust the model parameters.

[0003] Given the complexity of the task environment and the diversity of task objectives, it is important to improve the foresight of large language models in the action generation process and establish more granular feedback paths to achieve accurate model calibration.

Summary of the Invention

[0004] This application provides a calibration method, apparatus, device, computer-readable storage medium, and computer program product for large language models, which can achieve forward-looking risk avoidance and accurate calibration of large models, and improve the reliability of action execution in complex scenarios.

[0005] The technical solution of the embodiments of this application is implemented as follows. This application provides a method for calibrating a large language model, the method comprising: in response to a received task instruction, and in conjunction with the current global observation state, generating an initial action to be executed through the large language model; calculating the confidence level of the initial action to be executed under the global observation state, and performing a mapping calculation on the confidence level to obtain a risk assessment value; if the risk assessment value exceeds a preset risk threshold, intercepting the initial action to be executed and extracting the risk features of the initial action to be executed; using the global observation state and the risk features as an index, retrieving a matching diagnostic strategy template from a pre-constructed cognitive graph; based on the retrieved diagnostic strategy template, generating a structured correction strategy through the large language model, generating a corrected action to be executed according to the structured correction strategy, and sending the corrected action to the target execution environment; obtaining the execution result feedback of the target execution environment, calculating the immediate reward values of the execution result feedback under multiple preset indicator dimensions, quantifying, in combination with the final task objective, the expected value of the corrected action to be executed achieving the final task objective, and mapping the quantified expected value of goal achievement into a predictive reward value; performing weighted fusion of the immediate reward values under the multiple indicator dimensions with the predictive reward value to obtain a global reward value; and using the global reward value as a feedback signal to iteratively update the network parameters of the large language model.

[0006] This application provides an action control method for a large language model, the method comprising: in response to a received task instruction, and in conjunction with the current global observation state, generating an initial action to be executed through the large language model; calculating the confidence level of the initial action to be executed under the global observation state, and performing a mapping calculation on the confidence level to obtain a risk assessment value; if the risk assessment value exceeds a preset risk threshold, intercepting the initial action to be executed and extracting the risk features of the initial action to be executed; using the global observation state and the risk features as an index, retrieving a matching diagnostic strategy template from a pre-constructed cognitive graph; and, based on the retrieved diagnostic strategy template, generating a structured correction strategy through the large language model, generating a corrected action to be executed according to the structured correction strategy, and sending the corrected action to the target execution environment.

[0007] In the above technical solution, the step of generating an initial action to be executed by combining the current global observation state with the large language model includes: processing the task instruction, the dialogue history information in the global observation state, and the system context information through the large language model; and generating the initial action to be executed that includes a tool call instruction, a text reply, or an intent clarification request action.

[0008] In the above technical solution, the step of mapping the confidence level to obtain a risk assessment value includes: acquiring the internal state information and historical interaction trajectory data of the large language model; constructing an objective function representing execution uncertainty based on the internal state information and the historical interaction trajectory data; using the objective function representing execution uncertainty to map the confidence level, and determining the result of the mapping calculation as the risk assessment value.
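
As an illustration of the mapping described in [0008], the following minimal Python sketch assumes that the confidence level is the geometric-mean token probability of the generated action and that the historical interaction trajectory data is summarized as a single failure rate. The function names, the logistic form of the uncertainty function, and the constants are hypothetical; this application does not specify them.

```python
import math

def action_confidence(token_logprobs):
    """Geometric-mean token probability of the generated action (assumed definition)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def risk_from_confidence(confidence, historical_failure_rate,
                         sharpness=8.0, midpoint=0.5):
    """Map a confidence in [0, 1] to a risk assessment value in (0, 1).

    Hypothetical uncertainty function: a logistic curve over (1 - confidence),
    shifted upward by how often similar past actions failed.
    """
    uncertainty = (1.0 - confidence) + 0.5 * historical_failure_rate
    return 1.0 / (1.0 + math.exp(-sharpness * (uncertainty - midpoint)))
```

Under these assumptions the risk assessment value falls as confidence rises and rises with the historical failure rate, which is the monotonic behavior a threshold comparison relies on.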

[0009] In the above technical solution, the step of extracting the risk characteristics of the initial action to be executed includes: calculating the matching degree between the initial action to be executed and the system context information; identifying structural defects or parameter defects existing in the initial action to be executed; identifying logical consistency defects or tool invocation defects existing in the initial action to be executed based on the matching degree; extracting abnormal characteristics of at least one of the structural defects, parameter defects, logical consistency defects, and tool invocation defects; mapping the abnormal characteristics to corresponding expected execution exception types; and determining the expected execution exception types as the risk characteristics.
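
The extraction steps in [0009] can be sketched as follows, assuming the initial action is a tool call represented as a dictionary and the system context supplies a per-tool argument schema. The defect labels, the exception type names, and the schema format are hypothetical, and logical consistency defects are left out of this sketch because detecting them depends on the matching-degree calculation, which the text does not detail.

```python
# Hypothetical mapping from defect kind to expected execution exception type.
EXPECTED_EXCEPTION = {
    "structural_defect": "MalformedActionError",
    "tool_invocation_defect": "UnknownToolError",
    "parameter_defect": "InvalidArgumentError",
}

def extract_risk_features(action, tool_schemas):
    """Return the expected exception types for defects found in a tool-call action.

    tool_schemas maps a tool name to {argument_name: expected_python_type}.
    """
    defects = []
    if (not isinstance(action, dict) or "tool" not in action
            or not isinstance(action.get("arguments"), dict)):
        # The action does not even have the shape of a tool call.
        defects.append("structural_defect")
    else:
        schema = tool_schemas.get(action["tool"])
        if schema is None:
            # The named tool is not available in the system context.
            defects.append("tool_invocation_defect")
        else:
            args = action["arguments"]
            for name, expected_type in schema.items():
                if name not in args or not isinstance(args[name], expected_type):
                    defects.append("parameter_defect")
                    break
    return [EXPECTED_EXCEPTION[d] for d in defects]
```

An empty result means no defect was detected by these checks; it does not by itself mean the action is safe to execute.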

[0010] In the above technical solution, the method further includes: if the risk assessment value does not exceed the preset risk threshold, sending the initial action to be executed to the target execution environment, so as to execute the initial action to be executed in the target execution environment.
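
The branch between interception and direct dispatch can be sketched as a single comparison. The function and parameter names are hypothetical; `dispatch` and `intercept` stand in for whatever sends an action to the target execution environment or starts the correction flow.

```python
def gate_action(action, risk_value, risk_threshold, dispatch, intercept):
    """Intercept the action only when the risk strictly exceeds the threshold."""
    if risk_value > risk_threshold:
        return intercept(action)
    # "Does not exceed" includes the case where risk equals the threshold.
    return dispatch(action)
```

Note that a risk assessment value exactly equal to the threshold is dispatched, which follows the "does not exceed" wording in [0010].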

[0011] In the above technical solution, the step of retrieving a matching diagnostic strategy template from a pre-constructed cognitive graph using the global observation state and risk features as indexes includes: determining the global observation state as the current context index; obtaining the misclassification type corresponding to the risk features; and performing a retrieval operation in the cognitive graph based on the current context index and the misclassification type to obtain matching error diagnosis information and the diagnostic strategy template.
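
One way to picture the retrieval in [0011] is a lookup keyed by error type and ranked by context overlap. The sketch below assumes the cognitive graph is flattened into a list of nodes and that the global observation state has been reduced to a set of context tags; the `DiagnosticNode` type, its fields, and the overlap score are hypothetical simplifications, not the graph structure of this application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DiagnosticNode:
    error_type: str          # error type derived from the risk features
    context_tags: frozenset  # contexts in which this diagnosis applies
    diagnosis: str           # matching error diagnosis information
    template: str            # diagnostic strategy template

def retrieve_template(graph, context_tags, error_type):
    """Return the node of the given error type whose tags best overlap the context.

    Returns None when the graph holds no node for that error type.
    """
    candidates = [n for n in graph if n.error_type == error_type]
    if not candidates:
        return None
    context = frozenset(context_tags)
    return max(candidates, key=lambda n: len(n.context_tags & context))
```

A `None` result corresponds to a case the graph cannot handle, which the caller would have to treat separately.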

[0012] In the above technical solution, the step of generating a structured correction strategy based on the retrieved diagnostic strategy template through the large language model includes: generating error analysis logic based on the diagnostic strategy template; generating an error evidence search path; generating logical steps for correcting call parameters, or generating logical steps for replanning the tool call process; and assembling the error analysis logic, the error evidence search path, and the logical steps into the structured correction strategy.
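
The assembly step in [0012] amounts to packing three generated parts into one record. The following sketch assumes each part has already been produced as text by the large language model; the class name, field names, and the two `repair_kind` labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for the two kinds of logical steps named in the text.
REPAIR_KINDS = ("correct_parameters", "replan_tool_calls")

@dataclass
class StructuredCorrectionStrategy:
    error_analysis: str              # error analysis logic
    evidence_search_path: List[str]  # ordered places to look for error evidence
    repair_steps: List[str]          # logical steps for the repair
    repair_kind: str                 # one of REPAIR_KINDS

def assemble_strategy(error_analysis, evidence_search_path, repair_steps, repair_kind):
    """Assemble the generated parts into one structured correction strategy."""
    if repair_kind not in REPAIR_KINDS:
        raise ValueError(f"unknown repair kind: {repair_kind!r}")
    if not repair_steps:
        raise ValueError("a correction strategy needs at least one repair step")
    return StructuredCorrectionStrategy(
        error_analysis, list(evidence_search_path), list(repair_steps), repair_kind
    )
```

Keeping the parts in named fields rather than free text is what makes the strategy checkable before a corrected action is generated from it.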

[0013] In the above technical solution, generating the corrected action to be executed according to the structured correction strategy includes: generating a call correction action, generating a new tool call sequence action, or generating an intent clarification request action to request intent clarification from the user, according to the reflection process indicated by the structured correction strategy; and determining the call correction action, the new tool call sequence action, or the intent clarification request action as the corrected action to be executed.

[0014] In the above technical solution, the method further includes: receiving a subsequent task instruction containing target modality data; when triggering a structured self-diagnosis and correction process for the subsequent task instruction, extracting subsequent risk features for the subsequent task instruction; parsing the modality-specific error features contained in the subsequent risk features; querying the node mapping relationship corresponding to the modality-specific error features in the pre-expanded cognitive graph, and converting the modality-specific error features into underlying logical error features; using the underlying logical error features as an index, retrieving the corresponding subsequent diagnosis strategy template from the pre-expanded cognitive graph; and generating a subsequent structured correction strategy for the target modality data based on the subsequent diagnosis strategy template through the large language model.
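
The two-stage lookup in [0014] (modality-specific error feature to underlying logical error feature, then to a diagnostic strategy template) can be sketched with two tables. All keys, error names, and template texts below are invented for illustration; the application does not list concrete node mappings.

```python
# Hypothetical node mappings added when the cognitive graph is expanded for new modalities.
MODALITY_TO_LOGICAL = {
    ("image", "region_out_of_bounds"): "parameter_out_of_range",
    ("audio", "unsupported_sample_rate"): "parameter_out_of_range",
    ("image", "missing_caption_field"): "missing_required_parameter",
}

# Hypothetical templates indexed by underlying logical error feature.
LOGICAL_TEMPLATES = {
    "parameter_out_of_range": "Compare each argument with its documented bounds, then re-derive it.",
    "missing_required_parameter": "List the required fields, find the missing one, then regenerate the call.",
}

def retrieve_followup_template(modality, modality_error):
    """Convert a modality-specific error to a logical error, then fetch its template.

    Returns (logical_error, template), or None if no node mapping exists.
    """
    logical_error = MODALITY_TO_LOGICAL.get((modality, modality_error))
    if logical_error is None:
        return None
    return logical_error, LOGICAL_TEMPLATES[logical_error]
```

In this sketch an image error and an audio error resolve to the same logical error, so a template written once can be reused across modalities.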

[0015] This application provides a calibration device for a large language model, comprising: a first response module, used to respond to a received task instruction and, in conjunction with the current global observation state, generate an initial action to be executed through the large language model; a first calculation module, used to calculate the confidence level of the initial action to be executed under the global observation state, and to perform a mapping calculation on the confidence level to obtain a risk assessment value; a first interception module, used to intercept the initial action to be executed when the risk assessment value exceeds a preset risk threshold, extract the risk features of the initial action to be executed, and, using the global observation state and the risk features as an index, retrieve a matching diagnostic strategy template from a pre-constructed cognitive graph; a first correction module, used to generate a structured correction strategy based on the retrieved diagnostic strategy template through the large language model, generate a corrected action to be executed according to the structured correction strategy, and send the corrected action to the target execution environment; a first feedback module, used to obtain the execution result feedback of the target execution environment, calculate the immediate reward values of the execution result feedback under multiple preset indicator dimensions, quantify, in combination with the final task objective, the expected value of the corrected action to be executed achieving the final task objective, and map the quantified expected value of goal achievement into a predictive reward value; and a first update module, used to perform weighted fusion of the immediate reward values under the multiple indicator dimensions with the predictive reward value to obtain a global reward value, and to use the global reward value as a feedback signal to iteratively update the network parameters of the large language model.

[0016] This application provides an action control device for a large language model, comprising: a second response module, used to respond to a received task instruction and, in conjunction with the current global observation state, generate an initial action to be executed through the large language model; a second calculation module, used to calculate the confidence level of the initial action to be executed under the global observation state, and to perform a mapping calculation on the confidence level to obtain a risk assessment value; a second interception module, used to intercept the initial action to be executed when the risk assessment value exceeds a preset risk threshold, extract the risk features of the initial action to be executed, and, using the global observation state and the risk features as an index, retrieve a matching diagnostic strategy template from a pre-constructed cognitive graph; and a second correction module, used to generate a structured correction strategy based on the retrieved diagnostic strategy template through the large language model, generate a corrected action to be executed according to the structured correction strategy, and send the corrected action to the target execution environment.

[0017] This application provides an electronic device, the electronic device comprising: a memory, used to store computer-executable instructions or computer programs; and a processor, used to implement the calibration method or action control method of the large language model provided in the embodiments of this application when executing the computer-executable instructions or computer programs stored in the memory.

[0018] This application provides a computer-readable storage medium storing a computer program or computer-executable instructions, which, when executed by a processor, implements the calibration method or action control method of the large language model provided in this application.

[0019] This application provides a computer program product, including a computer program or computer executable instructions. When the computer program or computer executable instructions are executed by a processor, they implement the calibration method or action control method of the large language model provided in this application.

[0020] The embodiments of this application have the following beneficial effects: by introducing a risk assessment and interception mechanism based on confidence mapping before the initial action to be executed is issued, potentially erroneous actions can be identified proactively before they are executed, thereby improving the safety and predictability of action execution; by using the risk features and the global observation state to retrieve diagnostic strategy templates from a pre-constructed cognitive graph, and generating structured correction strategies accordingly, standardized logical guidance is provided for correcting the action to be executed, ensuring the accuracy and logical consistency of the corrected actions; by combining the immediate reward values under multiple indicator dimensions with the expected value of goal achievement, which represents future contribution, a predictive reward mechanism is constructed that takes into account both short-term execution feedback and long-term goal orientation. The weighted and fused global reward value is used as a feedback signal to guide the iteration of the model's network parameters, improving the continuous learning ability and calibration accuracy of the large language model in complex task scenarios and improving the final achievement rate of task execution.

Description of the Drawings

[0021] Figure 1 is a first structural schematic diagram of the calibration system architecture for a large language model provided in the embodiments of this application;
Figure 2A is a first structural schematic diagram of the electronic device provided in the embodiments of this application;
Figure 2B is a second structural schematic diagram of the electronic device provided in the embodiments of this application;
Figure 3A is a first flowchart of the calibration method for a large language model provided in the embodiments of this application;
Figure 3B is a second flowchart of the calibration method for a large language model provided in the embodiments of this application;
Figure 3C is a third flowchart of the calibration method for a large language model provided in the embodiments of this application;
Figure 3D is a fourth flowchart of the calibration method for a large language model provided in the embodiments of this application;
Figure 3E is a first flowchart of the action control method for a large language model provided in the embodiments of this application;
Figure 4 is a second structural schematic diagram of the calibration system architecture for a large language model provided in the embodiments of this application;
Figure 5 is a schematic diagram of the structured self-diagnosis and cognitive graph correction logic provided in the embodiments of this application;
Figure 6 is a schematic diagram illustrating the principle of multi-dimensional reward calculation and weighted fusion provided in the embodiments of this application;
Figure 7 is a third structural schematic diagram of the calibration system architecture for a large language model provided in the embodiments of this application.

[0022] It should be noted that the terms "first" and "second" mentioned above are only used to distinguish between different options and do not represent the degree of superiority or inferiority of the options or their priority in the implementation process.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0024] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0025] In the following description, the terms "first, second, third" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0026] In the embodiments of this application, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0027] Unless otherwise defined, all technical and scientific terms used in the embodiments of this application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the embodiments of this application is for the purpose of describing the embodiments of this application only and is not intended to limit this application.

[0028] In the implementation of this application, the collection and processing of relevant data should strictly comply with the requirements of relevant laws and regulations, obtain the informed consent or separate consent of the personal information subject, and carry out subsequent data use and processing within the scope of laws and regulations and the authorization of the personal information subject.

[0029] Before providing a further detailed description of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained, and the nouns and terms involved in the embodiments of this application shall be interpreted as follows.

[0030] 1) Large Language Model (LLM): A deep neural network model with a multi-layered network structure that has been iteratively optimized using massive amounts of training data. In the embodiments of this application, the large language model is used to combine observation state and contextual information to generate actions to be performed, perform error attribution analysis, and generate structured correction strategies.

[0031] 2) Knowledge Graph (KG): A data organization collection that represents entity concepts and their semantic relationships in the form of a graph structure. In the embodiments of this application, the knowledge graph is used to persistently store extracted error occurrence patterns, failure cause features, and corresponding repair strategies to support the matching and retrieval of diagnostic strategy templates.

[0032] 3) Application Programming Interface (API): A pre-defined protocol and communication channel for data interaction and function calls between different software programs. In this embodiment, the API serves as the entry point for external tools and as a comparison standard for verifying the legality of the execution action parameter format.

[0033] 4) Reinforcement Learning (RL): A machine learning method based on the interaction between a machine agent and its environment, and the optimization of action policies based on the obtained reward signals. In the embodiments of this application, reinforcement learning is used to use the global reward value as a feedback signal to iteratively update the parameters of the deep network to improve error correction and action generation capabilities.

[0034] 5) Directly-Aligned Policy Optimization (DAPO): A reinforcement learning algorithm that directly updates network weights using the reward signal gradient to make the policy distribution closely approximate a preset reward objective. In this embodiment, DAPO is used to calculate the model running gradient and adjust network parameters based on the policy optimization samples and the global reward value.

[0035] 6) Grounded Skill Policy Optimization (GSPO): An algorithm that decomposes a macro-level strategy into specific executable skill sequences at the underlying level and adjusts the network weights of each skill module accordingly. In this embodiment, GSPO is used to update the action generation strategy based on multi-dimensional reward signals and historical interaction experience.

[0036] 7) Human-in-the-loop: This involves introducing a human operator into the automated processing flow to review key decisions or take over anomaly handling. In this embodiment, human intervention is used to send a status report and request manual intervention instructions to the terminal when the number of error handling retries reaches a set limit or exceeds the known map repair capability.

[0037] 8) Token: A discrete semantic unit entity used for text segmentation and encoding computation in a natural language processing model. In this embodiment, the token is used as a data quantification indicator to measure the total amount of computing resources consumed by the model during action generation and task processing.

[0038] 9) Reward Shaping: An engineering method used in the training environment to modify the basic reward function numerically or add intermediate reward signals to accelerate policy convergence. In this embodiment, reward shaping is used to dynamically adjust the immediate reward weights in conjunction with the node update features of the cognitive graph, so as to guide the network to avoid specific execution risks.

[0039] 10) State Augmentation: A data processing operation that concatenates or incorporates auxiliary observation information into the state feature representation to improve the completeness of the input state representation. In the embodiments of this application, state augmentation is used to fuse historical failure modes and repair path feature vectors into the current global observation state, providing a structured contextual input for policy parameter optimization.

[0040] 11) Intelligent Assistant (Agent): An automated computing entity program capable of sensing the operating environment and autonomously formulating plans to execute control actions to achieve specific goals. In the embodiments of this application, the intelligent assistant is used to receive task instructions, generate tool call action instructions by integrating system context and error diagnosis conclusions, and realize closed-loop evolution optimization of self-decision strategy parameters.

[0041] 12) Structured Correction Strategy: A data set containing specific error attribution logic and step-by-step repair operation guidelines. In this embodiment, the structured correction strategy is used to guide the action generation network to rewrite and output the corrected actions to be executed according to strict logical diagnostic steps.

[0042] 13) Expected Goal Achievement Value: A quantized floating-point scalar value representing the probabilistic contribution of the action instruction at the current single time step to the global task completion rate in a multi-turn session. In the embodiments of this application, the expected goal achievement value is used to evaluate the predictive value of local execution actions in achieving the overall goal intention, in conjunction with the state transition probability of the external environment.

[0043] 14) Error Correction Interaction Feature Sequence: A collection of node data that records the entire sequence of events, including the original defect instruction, the attribution analysis and deduction process, and the final effective output action. In this embodiment, the error correction interaction feature sequence serves as the source of original features for extracting abstract error occurrence patterns and underlying logic repair strategies.

[0044] 15) Global Observation State: A set of associated information about the current external environment context and the internal running memory state records obtained by the computing entity when processing input instructions. In this embodiment, the global observation state is used as the underlying state data input for evaluating the potential execution risk of the initial action to be executed and generating action instructions that match the context.
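
Term 7) above describes handing control to a human operator once error-handling retries reach a set limit. A minimal sketch of that escalation rule follows; the callback names and the shape of the status report are hypothetical.

```python
def correct_with_escalation(attempt_correction, max_retries, request_human):
    """Retry correction up to max_retries times, then escalate to a human operator.

    attempt_correction(attempt_number) must return (succeeded, detail).
    request_human(status_report) stands in for sending the report to a terminal.
    """
    failures = []
    for attempt in range(1, max_retries + 1):
        succeeded, detail = attempt_correction(attempt)
        if succeeded:
            return detail
        failures.append(detail)
    # Retry limit reached: send a status report and ask for manual intervention.
    return request_human({"retries": max_retries, "failures": failures})
```

The status report carries the failed attempts so that the operator does not have to reconstruct what was already tried.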

[0045] In related technologies, when large language models are used to perform automated tasks, prompts are typically used to guide the model to generate actions to be executed or tool invocation instructions based on environmental observations. During the implementation of the embodiments of this application, the applicant discovered that related technical solutions may face the following challenges when handling complex tasks or high-risk decision-making scenarios: (1) Lack of forward-looking risk prediction in action execution: related technologies usually send the action directly to the execution environment after the model generates it, and trigger a passive remediation mechanism only after execution fails. When the action contains parameter errors or logical defects, this execution process without prediction tends to waste resources, and it is difficult to intercept the action proactively before it has a negative impact.

[0046] (2) Issues with the depth and consistency of error diagnosis and correction logic: When the model needs to correct erroneous actions, it often relies on simple retry logic or general reflection prompts, lacking in-depth diagnosis for specific error patterns. Due to the lack of structured knowledge guidance, the correction scheme generated by the model may have problems such as logical incoherence and unclear evidence acquisition paths, which affect the effectiveness of the correction actions.

[0047] (3) Issues with the fineness and long-term guidance of feedback signals: When training or fine-tuning the parameters of a model, the reward signals used in related technologies are often coarse (e.g., based solely on the final success or failure), making it difficult to comprehensively evaluate fine-grained indicators such as action format, parameter accuracy, and tool selection. At the same time, due to the lack of quantification of the probability of achieving the current action and the long-term task goal, the model struggles to learn forward-looking optimal behavioral strategies.

[0048] This application provides a calibration method, action control method, device, electronic device, computer-readable storage medium, and computer program product for a large language model, aiming to improve the accuracy and reliability of the large language model in the process of action generation and error correction, as detailed below: To address the aforementioned technical problem (1), this application introduces a risk assessment mechanism based on confidence mapping. By calculating the confidence level of the initial action to be executed under the current observation state and converting it into a risk assessment value, the action is immediately intercepted and risk features are extracted when the risk exceeds the limit. This solution realizes the transformation from "passive error correction" to "active defense," enabling effective interception before erroneous actions are executed, thereby reducing execution risk.

[0049] To address the aforementioned technical problem (2), this application proposes a structured correction scheme based on a cognitive graph. By retrieving diagnostic strategy templates from a pre-constructed cognitive graph using risk features as an index, the model is guided to generate a structured correction strategy that includes error analysis logic and evidence search paths. This scheme utilizes the expert experience and pattern knowledge stored in the graph to provide standardized logical guidance for the model, ensuring the depth of the correction process and the logical consistency of the corrective actions.

[0050] To address the aforementioned technical problem (3), this application's embodiments design a bootstrap reward mechanism that combines multi-dimensional immediate rewards with predictive feedback. It not only calculates immediate rewards from multiple indicator dimensions such as format, parameters, and tool selection, but also quantifies the expected value of goal achievement in conjunction with the final task objective and maps it to predictive rewards. Through weighted fusion of dynamic weights, this scheme provides the model with refined and long-term guiding feedback signals, significantly improving the model's calibration accuracy and learning efficiency during parameter iteration.
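The weighted fusion described in [0050] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the function name `fuse_rewards`, the dimension names, the linear form, and all numeric values are hypothetical, and the weights are passed in as fixed numbers here, whereas the application describes them as dynamic.

```python
from typing import Dict


def fuse_rewards(immediate: Dict[str, float],
                 weights: Dict[str, float],
                 predictive: float,
                 predictive_weight: float) -> float:
    """Fuse per-dimension immediate rewards with one predictive reward.

    The linear weighted sum and all names are illustrative assumptions.
    """
    if set(immediate) != set(weights):
        raise ValueError("each immediate reward dimension needs exactly one weight")
    immediate_part = sum(weights[name] * value for name, value in immediate.items())
    return immediate_part + predictive_weight * predictive


# Hypothetical values for the three indicator dimensions named in the text.
immediate_rewards = {"format": 1.0, "parameters": 0.5, "tool_selection": 1.0}
dimension_weights = {"format": 0.2, "parameters": 0.3, "tool_selection": 0.2}
global_reward = fuse_rewards(immediate_rewards, dimension_weights,
                             predictive=0.8, predictive_weight=0.3)
```

Keeping the predictive term separate from the immediate terms makes it easy to raise or lower the long-term guidance without touching the per-dimension weights.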

[0051] The following describes exemplary applications of the electronic devices provided in the embodiments of this application. These devices can be implemented as various types of terminals such as laptops, tablets, desktop computers, set-top boxes, smartphones, smart speakers, smartwatches, smart TVs, and in-vehicle terminals, or as servers. The following will describe exemplary applications when the device is implemented as a server.

[0052] Referring to Figure 1, Figure 1 is a first structural diagram of the calibration system architecture for a large language model provided in the embodiments of this application, which supports applications for calibrating a large language model. Figure 1 involves a server 100 and a terminal 200 connected via a network 300. The network 300 can be a wide area network (WAN), a local area network (LAN), or a combination of both.

[0053] In some embodiments, the embodiments of this application can be implemented solely by server 100. Server 100 obtains task instructions directly from an internal data source (such as a task queue), or receives task instructions and the current global observation state from terminal 200 via network 300. Server 100 independently executes the calibration method for the large language model provided in the embodiments of this application, completing action risk assessment, strategy correction, and closed-loop optimization of model parameters through internal algorithm logic. The generated corrected actions to be executed can then be used to optimize its task execution process, improve the accuracy of tool calls, and ensure reliable behavior under complex logic.

[0054] As a specific application scenario of this implementation, an automated task execution platform needs to implement highly reliable control over its multi-step planning decisions and external resource invocation behavior. The calibration method of the large language model in this embodiment is deployed on server 100. Server 100 can act as an intelligent decision-making and verification center, independently predicting and accurately correcting the risks of actions in the task sequence. For example, server 100 obtains a task instruction to be processed from the task flow, and the large language model, combined with the global observation state, generates an initial action to be executed. If the action is assessed internally by server 100 as having a risk assessment value exceeding a preset risk threshold, an interception mechanism is triggered, and risk features are extracted. Then, a diagnostic strategy template is retrieved from a pre-built cognitive graph. After executing the method of this embodiment, server 100 determines the corrected action to be executed based on a structured correction strategy. The platform can then utilize this calibrated action data from multiple dimensions: in terms of task execution security, the platform can eliminate potential logical defects before the action is sent to the execution environment through pre-interception and correction mechanisms, ensuring the robust operation of the business process. In terms of knowledge asset accumulation, once server 100 confirms that the corrected action to be executed has been successfully executed, it can extract abstract error occurrence patterns from the error correction interaction feature sequence and persistently store them in the cognitive graph, enabling the continuous expansion and evolution of knowledge nodes. 
Regarding intelligent evolution, server 100 utilizes execution result feedback obtained from terminal 200 or the target execution environment to calculate multi-dimensional immediate reward values and predictive reward values pointing to the final task goal. The weighted and fused global reward value is used as a feedback signal to iteratively update the network parameters of the large language model.

[0055] In this way, the independent processing of server 100 provides key technical support for the platform's backend operations such as multi-step logical planning, tool integration and invocation, and autonomous model evolution.

[0056] In other embodiments, the embodiments of this application can be implemented collaboratively by terminal 200 and server 100. For example, terminal 200 provides a task interaction interface or environmental awareness dashboard to the user, receives task instructions input by the user, and collects the current global observation state in real time. Terminal 200 then sends the task instructions and global observation state to server 100 via network 300. Server 100 executes the calibration method of the large language model provided in the embodiments of this application, determines whether to intercept the action by evaluating the risk assessment value of the initial action to be executed, and, after interception, generates a corrected action to be executed based on the structured correction strategy derived from the cognitive graph. The corrected action to be executed finally determined by server 100 is fed back to terminal 200 via network 300. Terminal 200, acting as the target execution environment, executes the corrected action to be executed, collects the execution result feedback, and sends it back to server 100 via network 300. Server 100 quantifies the multi-dimensional immediate reward values and the predictive reward value based on the execution result feedback, fuses them into the global reward value as a feedback signal, and iteratively updates the network parameters of the large language model. Through this device-cloud collaborative processing approach, terminal 200 can provide flexible task access and real-time environmental feedback, while server 100 can utilize its pre-built cognitive graph and computing resources to complete complex risk avoidance decisions and model calibration iterations, thereby improving the foresight and stability of the overall task processing.

[0057] Referring to Figure 2A, Figure 2A is a first structural schematic diagram of the electronic device provided in an embodiment of this application. The server 100 shown in Figure 2A includes at least one processor 110, a memory 130, and at least one network interface 120. The components of server 100 are coupled together via a bus system 140, which implements communication between these components. In addition to a data bus, the bus system 140 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 140 in Figure 2A.

[0058] The processor 110 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0059] The memory 130 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state storage, hard disk drives, optical disk drives, etc. The memory 130 may optionally include one or more storage devices physically located away from the processor 110.

[0060] The memory 130 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 130 described in this application embodiment is intended to include any suitable type of memory.

[0061] In some embodiments, memory 130 is capable of storing data to support various operations, examples of which include programs, modules, and data structures or subsets or supersets thereof, as illustrated below.

[0062] The operating system 131 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, and driver layer, for implementing various basic business functions and handling hardware-based tasks. The network communication module 132 is used to reach other electronic devices via one or more (wired or wireless) network interfaces; exemplary network interfaces include Bluetooth, WiFi, and Universal Serial Bus (USB). In some embodiments, the calibration device for the large language model provided in this application can be implemented in software. Figure 2A shows a calibration device 133 for a large language model stored in memory 130. This device can be software in the form of programs and plug-ins, and includes the following software modules: a first response module 1331, a first calculation module 1332, a first interception module 1333, a first correction module 1334, a first feedback module 1335, and a first update module 1336. These modules are logical and can therefore be arbitrarily combined or further split according to their implemented functions. The functions of each module will be described below.

[0063] In some embodiments, the action control device for the large language model provided in this application can be implemented in software. Figure 2B shows an action control device 134 for a large language model stored in memory 130. This device can be software in the form of programs and plug-ins, and includes the following software modules: a second response module 1341, a second calculation module 1342, a second interception module 1343, and a second correction module 1344. These modules are logical and can therefore be arbitrarily combined or further split according to the functions they implement. The functions of each module will be described below. It is worth noting that, apart from the action control device 134 of the large language model, the rest of the structure shown in Figure 2B can be the same as in Figure 2A.

[0064] It should be noted that the calibration device 133 of the large language model and the action control device 134 of the large language model can be integrated into one electronic device, that is, the electronic device can simultaneously implement the calibration method and the action control method of the large language model; alternatively, the calibration device 133 and the action control device 134 can be deployed on two separate electronic devices, in which case each electronic device implements either the calibration method or the action control method of the large language model.

[0065] The calibration method for the large language model provided in this application will be described below with reference to the exemplary application and implementation of the server provided in the embodiments of this application.

[0066] Referring to Figure 3A, Figure 3A is a first schematic flowchart of the calibration method for a large language model provided in the embodiments of this application, which is described below in conjunction with the steps shown in Figure 3A.

[0067] In step 101, in response to the received task instruction, and in conjunction with the current global observation state, an initial action to be executed is generated through the large language model.

[0068] Here, the global observation state refers to the external environment context information and internal running status records obtained when executing task instructions.

[0069] In some embodiments, referring to Figure 3B, step 101 can be implemented through the following steps 1011 to 1012, which are explained in detail below.

[0070] In step 1011, task instructions, dialogue history information in the global observation state, and system context information are processed through a large language model.

[0071] System context information refers to a set of data that reflects the constraints of the operating environment, the status of equipment, and the preset business logic.

[0072] As an example, step 1011 can be implemented as follows: receive task instructions and extract dialogue history information containing historical requests and responses and system context information from the global observation state; input the task instructions, dialogue history information and system context information into the encoder network, and extract semantic association features between the task instructions, dialogue history information and system context information by performing multi-head self-attention computation.

[0073] For example, if the received task instruction is a first query instruction, extract dialogue history information containing historical weather inquiry records and extract system context information containing current geographical location information; input the first query instruction, dialogue history information and system context information into the encoder network for multi-head self-attention calculation, determine the consistency between the location name mentioned in the first query instruction and the historical location name in the dialogue history information, and the degree of conformity between the operation required by the first query instruction and the permission constraints in the system context information, and obtain semantic association features.

[0074] In this embodiment of the application, step 1011 achieves the beneficial effect of integrating contextual background data for semantic analysis, and solves the technical problem of contextual intent understanding gaps caused by single task instructions.

[0075] In step 1012, an initial action to be executed is generated, which includes a tool invocation command, a text response, or an intent clarification request action.

[0076] As an example, step 1012 can be implemented in the following way: based on the semantic association features extracted in step 1011, the action sequence output by the decoding network is used to determine the execution requirements of the current task instruction. If the execution requirement requires external application interface support, a tool call instruction is generated. If the execution requirement meets the direct reply conditions, a text reply is generated. If the matching degree between the execution requirement and the preset intent template is lower than the preset intent matching degree threshold, an intent clarification request action is generated. The generated tool call instruction, text reply, or intent clarification request action is determined as the initial action to be executed.

[0077] For example, based on semantic association features, if the execution requirement is to call the first query interface and the first calendar application, a tool call instruction containing the first query interface call instruction and the first calendar application call instruction is generated, and the tool call instruction containing the first query interface call instruction and the first calendar application call instruction is determined as the initial action to be executed.
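The branching in step 1012 can be sketched as a small dispatcher. This is a hedged illustration only: the `ExecutionRequirement` type, the function name, the default threshold of 0.6, and the order in which the three conditions are checked are assumptions that the text does not fix.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExecutionRequirement:
    """Hypothetical summary of what the decoded action sequence requires."""
    intent_match: float        # match against preset intent templates, in [0, 1]
    required_tools: List[str]  # external application interfaces needed, if any


def choose_initial_action(req: ExecutionRequirement,
                          intent_threshold: float = 0.6) -> Dict[str, object]:
    # Assumption: the clarification check runs first, so an unclear intent
    # never triggers a tool call. The source does not specify this ordering.
    if req.intent_match < intent_threshold:
        return {"type": "intent_clarification_request"}
    if req.required_tools:
        return {"type": "tool_call", "tools": list(req.required_tools)}
    return {"type": "text_reply"}


# Mirrors the example in [0077]: two interfaces must be called.
action = choose_initial_action(
    ExecutionRequirement(intent_match=0.9,
                         required_tools=["first_query_interface",
                                         "first_calendar_application"]))
```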

[0078] In this embodiment of the application, step 1012 achieves the beneficial effect of generating an initial action to be executed that includes instructions, responses, or clarification requests, and solves the technical problem that the output format is too simple to cover external task scenarios.

[0079] Referring again to Figure 3A, in step 102, the confidence level of the initial action to be executed under the global observation state is calculated, and the confidence level is mapped to obtain the risk assessment value.

[0080] Confidence level refers to the predicted probability that the initial action to be executed can successfully complete the task instruction.

[0081] As an example, step 102 can be implemented as follows: extract the network probability distribution when the initial action to be executed is output, calculate the confidence level of the initial action to be executed by combining the global observation state, apply a preset nonlinear transformation function to the confidence level for mapping calculation, and convert the confidence level after mapping calculation into a risk assessment value.
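One way to picture step 102 is shown below. It is a sketch under explicit assumptions: confidence is taken as the geometric mean of the probabilities of the generated tokens, the "preset nonlinear transformation function" is stood in for by `1 - confidence ** gamma`, and the contribution of the global observation state is omitted. None of these choices is stated in the text.

```python
import math
from typing import Sequence


def action_confidence(token_probs: Sequence[float]) -> float:
    """Geometric mean of the probabilities of the generated tokens (assumed form)."""
    if not token_probs or any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("token probabilities must lie in (0, 1]")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)


def risk_from_confidence(confidence: float, gamma: float = 2.0) -> float:
    """Assumed nonlinear map: risk falls monotonically as confidence rises."""
    return 1.0 - confidence ** gamma


# Hypothetical probabilities of three generated tokens.
confidence = action_confidence([0.9, 0.8, 0.95])
risk = risk_from_confidence(confidence)
```

Any monotonically decreasing function would serve the same role; the exponent only controls how sharply low confidence is penalized.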

[0082] In some embodiments, referring to Figure 3C, the operation "mapping the confidence level to obtain the risk assessment value" in step 102 can be implemented through the following steps 1021 to 1023, which are explained in detail below.

[0083] In step 1021, the internal state information and historical interaction trajectory data of the large language model are obtained.

[0084] As an example, step 1021 can be implemented as follows: read the hidden layer activation value and attention weight distribution when generating the initial action to be executed from the running memory as internal state information, and retrieve and extract historical interaction trajectory data within a preset time period from the persistent storage medium.

[0085] For example, the hidden-layer activation values of the last network layer are read from the tensor state as internal state information, while historical interaction trajectory data containing a preset number of tool call success and failure records is read from the database.

[0086] In this embodiment of the application, step 1021 achieves the beneficial effect of comprehensive collection of multi-dimensional low-level feature data, and solves the technical problem of lack of objective observation basis for risk prediction.

[0087] In step 1022, an objective function representing the execution uncertainty is constructed based on internal state information and historical interaction trajectory data.

[0088] As an example, step 1022 can be implemented as follows: extract the information entropy feature from the internal state information, extract the error frequency feature from the historical interaction trajectory data, calculate the weighted combination parameters of the information entropy feature and the error frequency feature, and construct the objective function.

[0089] For example, extract the information entropy feature of the vocabulary probability distribution corresponding to the tokens of the tool call instruction in the internal state information, extract the error frequency feature of failed calls to the same tool interface in the historical interaction trajectory data, set the first parameter weight of the information entropy feature and the second parameter weight of the error frequency feature, and add the product of the information entropy feature and the first parameter weight to the product of the error frequency feature and the second parameter weight to construct the objective function representing the execution uncertainty according to the preset formula (1):

$$U = w_1 H + w_2 F \tag{1}$$

where $U$ denotes the objective function, $w_1$ denotes the first parameter weight, $H$ denotes the information entropy feature, $w_2$ denotes the second parameter weight, and $F$ denotes the error frequency feature.
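The weighted sum in formula (1) can be computed as in the following sketch. The function names, the use of natural-log Shannon entropy, the definition of error frequency as the failed fraction of past calls, and the sample values are all assumptions made for illustration.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a probability distribution; zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def error_frequency(call_succeeded: Sequence[bool]) -> float:
    """Fraction of past calls to the same tool interface that failed."""
    if not call_succeeded:
        return 0.0  # assumption: no history means no recorded errors
    failures = sum(1 for ok in call_succeeded if not ok)
    return failures / len(call_succeeded)


def execution_uncertainty(probs: Sequence[float],
                          call_succeeded: Sequence[bool],
                          entropy_weight: float,
                          error_weight: float) -> float:
    """Weighted sum of the entropy feature and the error frequency feature."""
    return (entropy_weight * shannon_entropy(probs)
            + error_weight * error_frequency(call_succeeded))


# Hypothetical inputs: a uniform distribution over four tokens and one
# failure in four past calls.
value = execution_uncertainty([0.25, 0.25, 0.25, 0.25],
                              [True, False, True, True],
                              entropy_weight=0.5, error_weight=0.5)
```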

[0090] In this embodiment of the application, step 1022 achieves the beneficial effect of quantifying the running state characteristics and historical records into a computable mathematical model, solving the technical problem that the uncertainty of action execution is difficult to quantify.

[0091] In step 1023, the confidence level is mapped using an objective function that characterizes the uncertainty of execution, and the result of the mapping calculation is determined as the risk assessment value.

[0092] As an example, step 1023 can be implemented as follows: substitute the confidence level calculated in step 102 into the objective function representing the execution uncertainty, use the objective function representing the execution uncertainty to perform normalization and inverse proportional operation on the confidence level to perform mapping calculation, obtain the result of the mapping calculation, and determine the result of the mapping calculation as the risk assessment value.

[0093] For example, a confidence level of 85% is substituted into the objective function representing execution uncertainty, and the negative logarithm of the confidence level is calculated using the objective function representing execution uncertainty as a mapping calculation, resulting in a mapping calculation result of 0.72. The mapping calculation result of 0.72 is then determined as the risk assessment value.

[0094] In this embodiment of the application, step 1023 achieves the beneficial effect of accurately converting confidence level into standardized risk indicator, and solves the technical problem that a single confidence level value cannot directly reflect the final execution risk level.

[0095] Referring again to Figure 3A, in step 103, if the risk assessment value exceeds the preset risk threshold, the initial action to be executed is intercepted, and the risk features of the initial action to be executed are extracted. Using the global observation state and risk features as indexes, a matching diagnostic strategy template is retrieved from the pre-constructed cognitive graph.

[0096] Here, the cognitive graph refers to a set of data nodes and relationships that is pre-built, dynamically updated, and continuously evolved through a bootstrapping mechanism based on error correction interaction experience. It stores error occurrence patterns, failure cause characteristics, and effective repair strategies in a graph structure.

[0097] As an example, step 103 can be implemented as follows: compare the risk assessment value with the preset risk threshold. If the risk assessment value is greater than the preset risk threshold, prevent the initial action to be executed from being sent to the target execution environment. Extract risk features from the intercepted initial action to be executed, and construct a query index by splicing the global observation state and risk features. Perform similarity matching in the pre-built cognitive graph using the query index to retrieve matching diagnostic strategy templates.

[0098] In some embodiments, referring to Figure 3D, the operation "extracting the risk features of the initial action to be executed" in step 103 can be implemented through the following steps 1031 to 1033, which are explained in detail below.

[0099] In step 1031, the matching degree between the initial action to be executed and the system context information is calculated; structural defects or parameter defects existing in the initial action to be executed are identified.

[0100] As an example, step 1031 can be implemented in the following way: compare the action format requirements of the initial action to be executed with the environmental constraint rules in the system context information, and calculate the matching degree between the initial action to be executed and the system context information. When the matching degree is lower than the preset matching degree threshold, verify the structure through a syntax tree analyzer, or execute parameter verification logic, to identify structural defects in the initial action to be executed that do not conform to the format specification, or parameter defects such as missing parameters or parameter values that are out of bounds.

[0101] For example, by comparing the first write instruction in the initial action to be executed with the current time permission configuration rules in the system context information, the matching degree between the timestamp of the first write instruction and the system context information is calculated. When the matching degree is lower than the preset matching degree threshold, it is determined that the first write interface requires three first mandatory fields while the first write instruction only contains two first parameter fields, thus identifying the parameter defects in the initial action to be executed.
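The parameter verification logic in step 1031 can be illustrated with a short check for missing mandatory fields and out-of-bounds values. The function name, the defect label format, and the field names are hypothetical; the interface that requires three mandatory fields follows the example in [0101].

```python
from typing import Any, Dict, List, Tuple


def find_parameter_defects(params: Dict[str, Any],
                           required_fields: List[str],
                           bounds: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return labels for missing mandatory fields and out-of-bounds values."""
    defects = []
    for field in required_fields:
        if field not in params:
            defects.append(f"missing:{field}")
    for field, (low, high) in bounds.items():
        if field in params and not (low <= params[field] <= high):
            defects.append(f"out_of_bounds:{field}")
    return defects


# The write interface requires three mandatory fields, but the write
# instruction carries only two of them (field names are illustrative).
defects = find_parameter_defects(
    params={"title": "trip", "start_hour": 9},
    required_fields=["title", "start_hour", "duration_minutes"],
    bounds={"start_hour": (0, 23)})
```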

[0102] In this embodiment of the application, step 1031 achieves the beneficial effects of initial alignment of the action command with the current operating environment and underlying syntax verification, and solves the technical problem that the recognition cannot be triggered due to inconsistencies in format specifications and missing parameters.

[0103] In step 1032, based on the matching degree, logical consistency defects or tool invocation defects existing in the initial action to be executed are identified; and abnormal features of at least one of structural defects, parameter defects, logical consistency defects and tool invocation defects are extracted.

[0104] As an example, step 1032 can be implemented in the following way: based on the matching degree being lower than the preset matching degree threshold, determine that the initial action to be executed violates the multi-step reasoning chain, identify the logical consistency defects existing in the initial action to be executed, or determine that the initial action to be executed references an unregistered application interface, identify the tool call defects existing in the initial action to be executed, and perform feature vectorization operation on at least one of the abnormal features of structural defects, parameter defects, logical consistency defects and tool call defects to extract the abnormal features represented by multi-dimensional vectors.

[0105] For example, based on a matching degree lower than a preset matching degree threshold, the initial action to be executed is determined to use the weather data of the first location to plan the itinerary of the second location after completing the query of the first location. Logical consistency defects in the initial action to be executed are identified, and text vector embedding processing is performed on structural defects, parameter defects, as well as the identified logical consistency defects and tool call defects to extract abnormal features containing defect location encoding vectors.
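As a simplified stand-in for the feature vectorization in step 1032, the sketch below encodes which of the four defect types were identified as a multi-hot vector. The text describes text vector embedding with defect location encoding; the multi-hot encoding, the type labels, and the function name are assumptions chosen to keep the example self-contained.

```python
from typing import Iterable, List

# The four defect types named in step 1032, in a fixed (assumed) order.
DEFECT_TYPES = ("structural", "parameter", "logical_consistency", "tool_invocation")


def vectorize_defects(found: Iterable[str]) -> List[float]:
    """Encode the identified defect types as a multi-hot feature vector."""
    found_set = set(found)
    unknown = found_set - set(DEFECT_TYPES)
    if unknown:
        raise ValueError(f"unknown defect types: {sorted(unknown)}")
    return [1.0 if defect in found_set else 0.0 for defect in DEFECT_TYPES]


abnormal_features = vectorize_defects(["parameter", "logical_consistency"])
```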

[0106] In this embodiment of the application, step 1032 achieves the beneficial effects of identifying multi-dimensional defects and extracting features in a structured manner, and solves the technical problem of difficulty in capturing contextual logical contradictions during the reasoning process.

[0107] In step 1033, the abnormal features are mapped to the corresponding expected execution exception types; the expected execution exception types are determined as risk features.

[0108] As an example, step 1033 can be implemented in the following way: input the extracted abnormal features into a preset abnormal type classification model, output the classification node labels corresponding to the abnormal features through the abnormal type classification model, use the classification node labels as the expected execution abnormal type, thereby mapping the abnormal features to the corresponding expected execution abnormal type, and determining the expected execution abnormal type as a risk feature describing the error attribute of the current action.

[0109] For example, anomaly features containing vector features of logical consistency defects are input into a pre-trained support vector machine model. The support vector machine model outputs the first classification node label, which is used as the first expected execution anomaly type. Thus, the anomaly features are mapped to the corresponding first expected execution anomaly type, and the first expected execution anomaly type is determined as a risk feature.

[0110] In this embodiment of the application, step 1033 achieves the beneficial effect of converting low-level multidimensional defect features into high-level abstract cognitive concepts, and solves the technical problem that scattered data features cannot be used as retrieval clues for a unified knowledge base.

[0111] In some embodiments, the following scheme can also be implemented for the judgment logic of the risk assessment value in step 103: if the risk assessment value does not exceed the preset risk threshold, the initial action to be executed is sent to the target execution environment so as to execute the initial action to be executed in the target execution environment.

[0112] As an example, this scheme can be implemented in the following way: compare the risk assessment value with the preset risk threshold. If the risk assessment value is less than or equal to the preset risk threshold, that is, if the risk assessment value does not exceed the preset risk threshold, encode and package the initial action to be executed and send it through the network communication interface to the target execution environment, so as to drive the target execution environment to execute the initial action to be executed.

[0113] For example, the first risk assessment value calculated is compared with the preset risk threshold set as the first threshold. If the first risk assessment value is less than or equal to the preset risk threshold, that is, if the first risk assessment value does not exceed the preset risk threshold, the initial action to be executed containing the query instruction is sent to the external service application server, which is the target execution environment, through the cross-process communication channel, so as to execute the initial action to be executed containing the query instruction in the external service application server.

[0114] In this embodiment, by sending the initial action to be executed to the target execution environment when the risk assessment value does not exceed the preset risk threshold, the beneficial effect of executing low-risk actions directly without diagnosis is achieved, and the technical problem of increased running delay caused by unconditional execution of diagnostic verification is solved.
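The two branches of the threshold judgment in step 103 can be sketched as a single gate. The function name, the callback parameters, and the threshold of 0.5 are hypothetical; the risk value 0.72 is borrowed from the example in [0093]. As in [0112], a risk equal to the threshold counts as not exceeding it.

```python
from typing import Any, Callable, Dict

Action = Dict[str, Any]


def gate_action(action: Action,
                risk: float,
                risk_threshold: float,
                dispatch: Callable[[Action], None],
                intercept: Callable[[Action], None]) -> str:
    """Route an action by comparing its risk with the preset threshold."""
    if risk > risk_threshold:
        intercept(action)  # held back for risk feature extraction and diagnosis
        return "intercepted"
    dispatch(action)       # risk <= threshold: sent to the target execution environment
    return "dispatched"


sent, held = [], []
outcome = gate_action({"type": "tool_call"}, risk=0.72, risk_threshold=0.5,
                      dispatch=sent.append, intercept=held.append)
```

Low-risk actions skip diagnosis entirely, which is what avoids the added latency described in [0114].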

[0115] In some embodiments, the operation "retrieving a matching diagnostic strategy template from the pre-constructed cognitive graph using the global observation state and risk features as indexes" in step 103 can be implemented through the following steps 1034 to 1036, which are described in detail below.

[0116] In step 1034, the global observation state is determined as the current context index.

[0117] As an example, step 1034 can be implemented as follows: extract the key business entity tags, timestamps, and task domain classification fields from the global observation state; concatenate the key business entity tags, timestamps, and task domain classification fields to construct a query string sequence; and determine the global observation state in the form of the query string sequence as the current context index.
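The concatenation in step 1034 can be sketched as follows. The dictionary keys, the `|` and `,` separators, the sorting of tags, and the sample values are assumptions; the text only states that the three kinds of fields are concatenated into a query string sequence.

```python
from typing import Any, Dict


def build_context_index(state: Dict[str, Any]) -> str:
    """Concatenate entity tags, timestamp, and task domain into one query string."""
    # Sorting the tags makes the index independent of extraction order (assumed choice).
    tags = ",".join(sorted(state["entity_tags"]))
    return "|".join([tags, state["timestamp"], state["task_domain"]])


# Hypothetical global observation state.
context_index = build_context_index({
    "entity_tags": ["weather", "calendar"],
    "timestamp": "2026-01-01T09:00:00Z",
    "task_domain": "travel_planning",
})
```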

[0118] For example, the first business entity label in the global observation state is extracted, dimensionality-reduced, and encoded into an environmental multidimensional vector. The global observation state containing the environmental multidimensional vector is then determined as the current context index for inputting a search query.

[0119] In this embodiment of the application, step 1034 achieves the beneficial effect of structured dimensionality reduction representation of complex observation environments, and solves the technical problem that the unstructured global observation state cannot directly participate in database query and retrieval.

[0120] In step 1035, the error classification type corresponding to the risk feature is obtained.

[0121] As an example, step 1035 can be implemented in the following way: perform semantic feature decoding on the received risk feature to obtain a decoding result, compare the decoding result level by level against the preset system error type outline dictionary, and obtain the hierarchically structured error classification type corresponding to the risk feature.

[0122] For example, semantic feature decoding is performed on risk features containing the first intent attribute to obtain the decoding result. By comparing the decoding result with the system error type outline dictionary, the logical error classification type corresponding to the risk feature can be obtained.

[0123] In this embodiment of the application, step 1035 achieves the beneficial effect of logically aligning risk features with the standardized nodes of the cognitive graph, solving the technical problem that non-standard risk descriptions cannot be accurately located in the graph.

[0124] In step 1036, based on the current context index and error classification type, a retrieval operation is performed in the cognitive graph to obtain matching error diagnosis information and diagnosis strategy templates.

[0125] As an example, step 1036 can be implemented as follows: construct a joint graph query search statement by combining the current context index and the error classification type; input the joint graph query search statement into the cognitive graph database storing pre-built nodes to perform a retrieval operation; calculate the Euclidean distance between the joint graph query search statement and the features of each knowledge node; find the matching data node with the smallest Euclidean distance; parse the matching error diagnosis information and the diagnostic strategy template containing the guidance step reflection instructions from the first matching data node.

[0126] For example, a joint graph query search statement is constructed by using the current context index containing the first query parameter and the logical error classification type. The retrieval operation is performed in the cognitive graph database to locate the first knowledge node. From the first knowledge node, error diagnosis information about incorrect parameter passing order is obtained, as well as a diagnostic strategy template containing the first preset logical steps.

[0127] In this embodiment of the application, step 1036 achieves the beneficial effect of automatically mapping the current error repair path using long-accumulated system prior knowledge, and solves the technical problem that when encountering complex errors, the same action is repeatedly executed based on a preset number of rounds, resulting in a repair success rate lower than a preset success rate threshold.
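The nearest-node retrieval described for step 1036 can be sketched as follows, assuming that the joint query and each knowledge node have already been encoded as numeric feature vectors. The node layout, the vectors, and the template contents are hypothetical.

```python
import math


def retrieve_template(query_vec, nodes):
    """Return (diagnosis, template) of the node closest to the query in Euclidean distance."""
    best = min(nodes, key=lambda node: math.dist(query_vec, node["features"]))
    return best["diagnosis"], best["template"]


# Hypothetical knowledge nodes of the cognitive graph.
nodes = [
    {
        "features": [0.9, 0.1],
        "diagnosis": "incorrect parameter passing order",
        "template": ["check parameter order", "reorder parameters"],
    },
    {
        "features": [0.1, 0.8],
        "diagnosis": "missing required parameter",
        "template": ["locate missing field", "supplement parameter"],
    },
]

diagnosis, template = retrieve_template([0.8, 0.2], nodes)
print(diagnosis, template)
```

A linear scan is used here for clarity; a graph database with a vector index would serve the same role at scale.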

[0128] See also Figure 3A. In step 104, based on the retrieved diagnostic strategy template, a structured correction strategy is generated through a large language model, and the corrected action to be executed is generated according to the structured correction strategy and sent to the target execution environment.

[0129] Among them, the structured correction strategy refers to a data dictionary with standardized fields that contains specific error analysis logic and step-by-step repair operation guidelines.

[0130] As an example, step 104 can be implemented as follows: Receive input data containing the initial action to be executed and the retrieved diagnostic strategy template; parse the diagnostic strategy template using a large language model; output a structured correction strategy conforming to preset field rules; parse the step-by-step repair operation guidance in the structured correction strategy to rewrite the initial action to be executed, generating a corrected action; and call the network communication interface to send the corrected action to the target execution environment. Simultaneously, after the corrected action is executed and results are fed back, continuously and automatically collect and extract abstract error patterns and effective repair strategies from error correction experience, persistently store them to update the data nodes of the cognitive graph, enabling the cognitive graph to exhibit non-static, continuous self-evolutionary growth.

[0131] In some embodiments, the step 104 of "generating a structured correction strategy based on the retrieved diagnostic strategy template through a large language model" can be implemented through the following steps 1041 to 1044, which are described in detail below.

[0132] In step 1041, error analysis logic is generated based on the diagnostic strategy template.

[0133] As an example, step 1041 can be implemented as follows: Extract the error analysis prompt word field from the diagnostic strategy template. Input the error analysis prompt word field as context control information into the large language model. Drive the large language model to infer and calculate the failure reason for the initial action to be executed being intercepted. Output text content containing specific error attribution dimensions (error analysis logic).

[0134] For example, extract the first error analysis prompt word field from the first diagnostic strategy template. Input the first error analysis prompt word field into the large language model. The large language model calculates the failure reason of the first initial action to be executed based on the attention mechanism. The output is the first text content containing the parameter type missing attribution dimension.

[0135] In this embodiment of the application, step 1041 achieves the beneficial effect of using prior diagnostic strategies to guide the model's inference and attribution, and solves the technical problem that the model's lack of targeted self-reflection leads to computing power consumption exceeding the preset resource threshold.

[0136] In step 1042, an error evidence search path is generated.

[0137] Among them, the error evidence search path refers to the sequence of data access instructions used to locate the key context information node that caused the current error in the global observation state.

[0138] As an example, step 1042 can be implemented as follows: Based on text content containing specific error attribution dimensions, identify the information nodes to be verified that contain key business parameters in the historical log sequence of the global observation state. Generate a corresponding data access instruction sequence according to the storage index level position of the information nodes to be verified. Determine the data access instruction sequence as the error evidence search path.

[0139] For example, based on the first text content containing the attribution dimension for missing parameter types, a first information node to be verified containing the first time entity parameter is identified in the first historical log sequence. According to the first storage index level position of the first information node to be verified in the cache database, a first data access instruction sequence is generated. This first data access instruction sequence is then determined as the first error evidence search path.

[0140] In this embodiment, step 1042 achieves the beneficial effect of structured localization of the distribution location of error root causes. This solves the technical problem of difficulty in tracing back and locating local feature-related causes in long text contexts.

[0141] In step 1043, a logical step is generated to modify the call parameters, or a logical step is generated to re-plan the tool call flow.

[0142] As an example, step 1043 can be implemented as follows: Locate the node data pointed to by the error evidence search path generated in step 1042 to determine the error level. If the error level is the parameter assignment level, generate logical steps to correct the call parameters in the parameter list. If the error level is the external interface interaction process level, generate logical steps to re-plan the tool call flow based on the execution order of the target external application interface.

[0143] For example, based on the data of the first information node to be verified pointed to by the first error evidence search path, the first error level is determined to be the parameter assignment level. For the first query instruction parameter list, a first logical step for correcting the call parameters is generated, which supplements the current system time variable.

[0144] In this embodiment, step 1043 achieves the beneficial effect of outputting adaptive repair operation instructions for different error levels. This solves the technical problem that a single preset modification template cannot simultaneously address local parameter correction and global logic flow reconstruction.

[0145] In step 1044, the error analysis logic, error evidence search path, and logical steps are assembled into a structured correction strategy.

[0146] As an example, step 1044 can be implemented as follows: the text description content output by the error analysis logic, the data instruction encoding of the generated error evidence search path, and the description text of the logic steps are mapped by field concatenation to the data format template according to preset key values. This is then assembled into a structured correction strategy.

[0147] For example, the first text description content, the first data instruction encoding, and the logical step description text of the first correction call parameters are concatenated in the field order of the data format template according to the key value of the first preset object. This is then assembled into a first structured correction strategy.

[0148] In this embodiment, step 1044 achieves the beneficial effect of standardized encapsulation of multidimensional reflection and repair result data. This solves the technical problem that fragmented intermediate reflection variable data cannot be directly passed to the execution mechanism.
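Steps 1043 and 1044 can be sketched together as follows: the error level selects the kind of logical step, and the three parts are then assembled into one dictionary with fixed keys. The level constants, key names, and sample values are hypothetical; the source only requires a data format template with preset key values.

```python
import json

# Hypothetical identifiers for the two error levels named in step 1043.
PARAMETER_LEVEL = "parameter_assignment"
INTERFACE_LEVEL = "interface_interaction"


def plan_logical_steps(error_level: str) -> list:
    """Choose the repair step type according to the error level."""
    if error_level == PARAMETER_LEVEL:
        return ["correct call parameters"]
    if error_level == INTERFACE_LEVEL:
        return ["re-plan tool call flow"]
    raise ValueError(f"unknown error level: {error_level}")


def assemble_strategy(analysis: str, evidence_path: list, steps: list) -> dict:
    """Map the three intermediate results onto preset keys (step 1044)."""
    return {
        "error_analysis": analysis,
        "evidence_path": evidence_path,
        "logical_steps": steps,
    }


strategy = assemble_strategy(
    "parameter type missing",
    ["logs[3].params.time"],
    plan_logical_steps(PARAMETER_LEVEL),
)
print(json.dumps(strategy, indent=2))
```

Because the result is a plain dictionary with fixed keys, it can be serialized and passed to the action generation step without further parsing of free text.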

[0149] In some embodiments, the step 104 of "generating the corrected action to be executed according to the structured correction strategy" can be achieved through the following steps 1045 to 1046, which are described in detail below.

[0150] In step 1045, following the reflection process indicated by the structured correction strategy, a call correction action is generated, a new tool call sequence action is generated, or an intent clarification request action is generated to request intent clarification from the user.

[0151] As an example, step 1045 can be implemented as follows: Parse the logical step field in the structured correction strategy to determine the action intent type of the reflection process. If the action intent type is the supplementary local parameter type, generate a call correction action. If the action intent type is the change service interface type, generate a new tool call sequence action. If the action intent type is the missing context information type, generate an intent clarification request action to pose a question to the input terminal.

[0152] For example, the first structured correction strategy is parsed to determine that the first action intent type is a supplementary first timestamp local parameter type. A first call correction action containing the first timestamp parameter is then generated.

[0153] In this embodiment of the application, step 1045 achieves the beneficial effect of generating the second-pass, targeted execution instruction strictly according to the analysis conclusions, and solves the technical problem that non-convergent random variations during action retries deviate from the diagnostic expectation.
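The three-way branching of step 1045 can be sketched as follows. The intent type strings and dictionary fields (`intent_type`, `parameter_patch`, `tool_sequence`, `question`) are hypothetical names introduced for the example.

```python
def generate_corrected_action(strategy: dict, initial_action: dict) -> dict:
    """Produce one of three action kinds according to the action intent type."""
    intent = strategy["intent_type"]
    if intent == "supplement_local_parameter":
        # Keep the original call and merge in the missing parameters.
        params = {**initial_action["params"], **strategy["parameter_patch"]}
        return {"kind": "call_correction", "tool": initial_action["tool"], "params": params}
    if intent == "change_service_interface":
        return {"kind": "new_tool_sequence", "calls": strategy["tool_sequence"]}
    if intent == "missing_context":
        return {"kind": "intent_clarification", "question": strategy["question"]}
    raise ValueError(f"unsupported intent type: {intent}")


initial = {"tool": "query_schedule", "params": {"user": "u1"}}
patch = {
    "intent_type": "supplement_local_parameter",
    "parameter_patch": {"timestamp": "2026-01-01T00:00:00Z"},
}
corrected = generate_corrected_action(patch, initial)
print(corrected)
```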

[0154] In step 1046, the action to be executed after correction, the action to be executed is ...

[0155] As an example, step 1046 can be implemented as follows: Input the call correction action generated in step 1045 into a format validator to perform a final format validity check. After the check passes, change the action's status from the intermediate generation state to the pending execution state. Determine the action in the pending execution state as the corrected action to be executed.

[0156] For example, the first call correction action is input into the first format validator. After passing the first final format validity check, the status flag of the first call correction action is set to the first pending state. The first call correction action with the first pending state is determined as the first corrected action to be executed.

[0157] In this embodiment, step 1046 achieves the beneficial effect of consistent binding of the final issued instruction entity state. This solves the technical problem of inconsistent call identifiers for different types of correction action issuance interfaces.
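The validity check and status change of step 1046 can be sketched as follows. The required field list and the status string are assumptions; the source does not state which fields the format validator inspects.

```python
# Hypothetical set of fields that a call correction action must carry.
REQUIRED_FIELDS = ("kind", "tool", "params")


def finalize_action(action: dict) -> dict:
    """Validate the action format, then mark it as pending execution."""
    missing = [field for field in REQUIRED_FIELDS if field not in action]
    if missing:
        raise ValueError(f"format check failed, missing fields: {missing}")
    # Return a copy so the intermediate action is left unchanged.
    return {**action, "status": "pending_execution"}


draft = {"kind": "call_correction", "tool": "query_schedule", "params": {"user": "u1"}}
ready = finalize_action(draft)
print(ready["status"])
```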

[0158] In some embodiments, in step 104, the following scheme may also be performed: if the execution result feedback indicates that the corrected action to be executed has been successfully executed, the error diagnosis information is obtained; the initial action to be executed, the error diagnosis information, and the corrected action to be executed are spliced together in time sequence to obtain an error correction interaction feature sequence; an abstract error occurrence pattern is extracted from the error correction interaction feature sequence; and an effective repair strategy corresponding to the abstract error occurrence pattern is extracted from the error correction interaction feature sequence.

[0159] Among them, the error correction interaction feature sequence refers to the structured time-series node data that records the entire process of generating the original defect, performing attribution analysis, and finally obtaining effective output actions.

[0160] As an example, whether the corrected action to be executed was executed successfully can be determined by parsing the execution result feedback returned by the target execution environment. If the response status code indicates success, the error analysis logic portion of the structured correction strategy is extracted as the error diagnosis information. The initial action to be executed, the error diagnosis information, and the corrected action to be executed are assembled in the timestamp order of the events to obtain the error correction interaction feature sequence. A masking network replaces proper names in the error correction interaction feature sequence with generic placeholders, and the abstract error occurrence pattern is extracted. From the response content corresponding to the corrected action to be executed, the effective repair strategy corresponding to the abstract error occurrence pattern is extracted.

[0161] For example, a first response status code with a value of 200 is obtained through parsing. If the first response status code indicates success, first error diagnosis information is extracted. A first error correction interaction feature sequence is obtained by combining the first timestamp sequence. The first error correction interaction feature sequence is processed using a first masking network. A first abstract error occurrence pattern with missing parameters is extracted. A first effective repair strategy for supplementing parameters is extracted from the first corrected action to be executed.

[0162] In the embodiments of this application, by executing the error correction interaction feature sequence construction and pattern feature extraction scheme, the beneficial effect of transforming single effective error correction experience into a high-dimensional reusable knowledge structure is achieved, which solves the technical problem that single-instance solving experience is destroyed with context cleanup, resulting in the inability to improve the long-term cognitive ability of the model.
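The sequence construction and abstraction described above can be sketched as follows. The source uses a masking network to replace proper names; this sketch substitutes a plain string replacement over a supplied list of names, which is a deliberate simplification. The event fields `ts` and `text` are hypothetical.

```python
import re


def build_sequence(initial_action: dict, diagnosis: dict, corrected_action: dict) -> list:
    """Order the three events by timestamp to form the interaction sequence."""
    return sorted([initial_action, diagnosis, corrected_action], key=lambda e: e["ts"])


def abstract_pattern(sequence: list, proper_names: list) -> str:
    """Replace each proper name with a generic placeholder."""
    text = " -> ".join(event["text"] for event in sequence)
    for i, name in enumerate(proper_names):
        text = re.sub(re.escape(name), f"<ENTITY_{i}>", text)
    return text


sequence = build_sequence(
    {"ts": 1, "text": "call get_weather(city)"},
    {"ts": 2, "text": "missing parameter date"},
    {"ts": 3, "text": "call get_weather(city, date)"},
)
pattern = abstract_pattern(sequence, ["get_weather"])
print(pattern)
```

After masking, the pattern no longer depends on the specific tool name, which is what allows it to be reused for other tools with the same kind of defect.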

[0163] In some embodiments, for an abstract error occurrence pattern and an effective repair strategy corresponding to the abstract error occurrence pattern, the following steps 10411 to 10414 may also be performed: In step 10411, the error occurrence type is parsed from the abstract error occurrence pattern; the failure cause characteristics are parsed from the abstract error occurrence pattern.

[0164] As an example, step 10411 can be implemented as follows: Input the abstract error occurrence pattern into a text classification tree. Separate the error occurrence types that describe surface non-compliance attributes. Use a text feature parser to segment failure cause features describing logical chain breakpoints from the abstract error occurrence pattern.

[0165] For example, the first abstract error occurrence pattern is input into the first text classification tree. The first error occurrence type intercepted by the required interface fields is separated. The first text feature parser is used to segment the first failure reason feature where the information retrieval order is reversed.

[0166] In this embodiment of the application, step 10411 achieves the beneficial effect of fine-grained dimensionality reduction and deconstruction of composite macro error patterns, and solves the technical problem that the matching degree falls below the preset matching degree threshold when hybrid coupled features are used directly for graph retrieval.

[0167] In step 10412, effective error correction practice characteristics are parsed from effective repair strategies; error occurrence types, failure cause characteristics, and effective error correction practice characteristics are compared with existing historical data nodes in the cognitive graph.

[0168] As an example, step 10412 can be implemented as follows: Utilize a natural language phrase extraction algorithm to process effective repair strategies, extracting effective error-correction practice features describing the core logic modification instructions of the actions. Combine the error occurrence type, failure cause features, and effective error-correction practice features into a feature vector of the graph to be inserted. Calculate the cosine similarity between the feature vector of the graph to be inserted and the graph embedding representation vector of existing historical data nodes in the cognitive graph.

[0169] For example, the first effective error correction practice features of the preceding insertion call action are extracted using the first algorithm. These features are combined to obtain the first graph feature vector to be inserted. The cosine similarity between the first graph feature vector to be inserted and the first graph embedding representation vector of the first existing historical data node is calculated, resulting in a comparison result of 0.95.

[0170] In this embodiment, step 10412 achieves the beneficial effect of structurally aligning newly generated knowledge elements with the system's existing knowledge reserves. This solves the technical problem of disordered insertion of redundant nodes into the graph database.

[0171] In step 10413, based on the comparison results, a knowledge feature fusion operation is performed on the error occurrence type, failure cause characteristics, and effective error correction practice characteristics, or a data conflict resolution operation is performed; the features after the knowledge feature fusion operation or data conflict resolution operation are persistently stored in the cognitive graph.

[0172] As an example, step 10413 can be implemented as follows: If the cosine similarity is greater than a first similarity preset threshold, perform a knowledge feature fusion operation. The knowledge feature fusion operation includes summing and updating the feature weights of each feature and existing historical data node features. If the cosine similarity is between a second similarity preset threshold and a first similarity preset threshold, and the node semantic labels are mutually exclusive, perform a data conflict resolution operation. The data conflict resolution operation includes retaining the latest timestamp feature. The processed target features are then persistently stored in the physical storage medium of the cognitive graph using a write instruction.

[0173] For example, if the cosine similarity of 0.95 is greater than the first similarity preset threshold of 0.90, the first knowledge feature fusion operation is performed. The first target feature is then persistently stored in the solid-state drive where the cognitive graph resides.

[0174] In this embodiment, step 10413 achieves the beneficial effect of incremental knowledge being filtered and written based on rules. This solves the technical problem of logical conflicts in the evolution of cognitive graph knowledge caused by multiple batches of execution actions.
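The threshold logic of step 10413 can be sketched as follows. The first threshold 0.90 is taken from the example above; the second threshold 0.70 and the `insert_new` fallback for all remaining cases are assumptions, since the source does not state them.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_decision(similarity: float, labels_conflict: bool,
                   first_threshold: float = 0.90, second_threshold: float = 0.70) -> str:
    """Decide between fusion, conflict resolution, and (assumed) plain insertion."""
    if similarity > first_threshold:
        return "fuse"
    if second_threshold <= similarity <= first_threshold and labels_conflict:
        return "resolve_conflict"
    return "insert_new"


def fuse(existing: dict, new: dict) -> dict:
    """Knowledge feature fusion: sum the feature weights."""
    return {**existing, "weight": existing["weight"] + new["weight"]}


def resolve_conflict(existing: dict, new: dict) -> dict:
    """Data conflict resolution: keep the feature with the latest timestamp."""
    return new if new["ts"] >= existing["ts"] else existing


print(merge_decision(0.95, labels_conflict=False))
print(fuse({"weight": 1.0, "ts": 1}, {"weight": 0.5, "ts": 2}))
```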

[0175] In step 10414, the data nodes of the cognitive graph are expanded by persistently storing the features.

[0176] As an example, step 10414 can be implemented as follows: Read the features written to persistent storage in step 10413. In the memory view of the relational structure of the cognitive graph, create new entity objects marked with globally unique identifiers as data nodes. Draw the edge weight vector connections between the newly established connected entities.

[0177] For example, read the first feature. Create a new entity object in the memory view that includes a parameter validation exception class flag. Draw the first edge weight vector connection with a numerical connection weight of 0.5.

[0178] In this embodiment, step 10414 achieves the beneficial effect of dynamically constructing a continuous learning knowledge network structure. This solves the technical problem of static diagnostic libraries lacking the ability to handle novel interface errors.
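The node expansion of step 10414 can be sketched with an in-memory structure as follows. The `CognitiveGraph` class and its methods are hypothetical; the edge weight 0.5 follows the example above.

```python
import uuid


class CognitiveGraph:
    """Minimal in-memory view: nodes keyed by unique id, weighted directed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, features: dict) -> str:
        """Create a new entity object marked with a globally unique identifier."""
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = features
        return node_id

    def connect(self, src: str, dst: str, weight: float) -> None:
        """Record a weighted edge between two existing nodes."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before connecting them")
        self.edges[(src, dst)] = weight


graph = CognitiveGraph()
error_node = graph.add_node({"flag": "parameter validation exception"})
fix_node = graph.add_node({"repair": "supplement parameter"})
graph.connect(error_node, fix_node, weight=0.5)
print(len(graph.nodes), graph.edges[(error_node, fix_node)])
```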

[0179] In some embodiments, steps 10415 to 10417 may also be performed on the expanded cognitive graph: In step 10415, a subsequent task instruction containing target modal data is received; when the structured self-diagnosis and correction process is triggered in response to the subsequent task instruction, subsequent risk features for the subsequent task instruction are extracted.

[0180] Target modal data refers to an input data sequence containing at least one of text, image, audio, or video (excluding the fusion of text data). The structured self-diagnosis and correction process refers to a closed-loop processing mechanism whereby, after generating the initial action to be executed, the model automatically calculates a risk assessment value. If the risk assessment value exceeds a preset risk threshold, it proactively intercepts and uses a cognitive graph to retrieve diagnostic strategy templates to generate structured correction strategies and corrective actions.

[0181] As an example, step 10415 can be implemented as follows: Receive subsequent task instructions containing target modality data. Generate subsequent initial actions to be executed for the subsequent task instructions. Calculate the subsequent risk assessment value for the subsequent initial actions to be executed. If the subsequent risk assessment value is greater than a preset risk threshold, trigger a structured self-diagnosis and correction process. When the structured self-diagnosis and correction process is triggered, input the target modality data into a preset multimodal feature encoder for tensor feature mapping to obtain a multidimensional tensor set containing values of abnormal distribution regions. Extract the values of the abnormal distribution regions from the multidimensional tensor set as subsequent risk features.

[0182] For example, a first follow-up task instruction containing first image data (target modality data) is received. A first follow-up initial action to be executed is generated in response to the first follow-up task instruction. A first follow-up risk assessment value of 0.8 is calculated for the first follow-up initial action. If the first follow-up risk assessment value of 0.8 is greater than a preset risk threshold of 0.6, a structured self-diagnosis and correction process is triggered. When the structured self-diagnosis and correction process is triggered, the first image data is input into a visual feature encoder (multimodal feature encoder) for tensor mapping, resulting in a first multidimensional tensor set containing values of the first abnormal distribution region. The values of the first abnormal distribution region are extracted from the first multidimensional tensor set as the first follow-up risk feature.

[0183] In this embodiment, step 10415 extends the forward-looking interception mechanism of the structured self-diagnosis and correction process to non-pure text input scenarios, thus solving the technical problem of only being able to predict risks for single text commands and being unable to identify anomalies in multimodal data input.

[0184] In step 10416, the modality-specific error features contained in the subsequent risk features are parsed; the node mapping relationship corresponding to the modality-specific error features in the expanded cognitive graph is queried, and the modality-specific error features are converted into underlying logical error features.

[0185] Among them, the underlying logic error feature refers to the general structured semantic vector used to characterize the failure of model inference after stripping away the apparent attributes of the data input format.

[0186] As an example, step 10416 can be implemented as follows: Utilize a pre-defined feature classifier to perform hierarchical parsing of the extracted subsequent risk features, separating out modality-specific error features bound to specific data formats. Use these modality-specific error features as query objects and perform a matching query in the node relationship storage table of the expanded cognitive graph. Obtain the node mapping relationships corresponding to the modality-specific error features in the expanded cognitive graph. Based on the transformation rules defined in the node mapping relationships, replace the data dimension labels of the modality-specific error features, converting them into cross-modality universal underlying logical error features.

[0187] For example, a pre-defined feature classifier is used to perform hierarchical analysis on the first subsequent risk feature, separating the first modality-specific error feature of "image resolution below the baseline threshold". This first modality-specific error feature is then used as the query object, and a matching query is performed in the node relationship storage table of the first expanded cognitive graph. The first node mapping relationship corresponding to the first modality-specific error feature in the first expanded cognitive graph is obtained. Based on the transformation rules defined in the first node mapping relationship, the data dimension labels of the first modality-specific error feature are replaced, transforming the first modality-specific error feature into the first underlying logical error feature of "missing key input elements".

[0188] In this embodiment, step 10416 achieves the beneficial effect of mapping isolated physical modal anomalies to a unified system logical semantics. This solves the technical problem of low reuse rate of underlying error correction rules due to the diverse manifestations of multimodal errors.

[0189] In step 10417, the corresponding subsequent diagnostic strategy template is retrieved from the expanded cognitive graph using the underlying logical error features as an index; based on the subsequent diagnostic strategy template, a subsequent structured correction strategy for the target modality data is generated through a large language model.

[0190] As an example, step 10417 can be implemented as follows: The transformed underlying logical error features are used as query index keys. These query index keys are input into the expanded cognitive graph, and knowledge nodes are traversed using a similarity distance algorithm. Subsequent diagnostic strategy templates matching the query index keys are retrieved. The corrective step guidance logic contained in the subsequent diagnostic strategy templates is extracted. This corrective step guidance logic is input as contextual instructions into the large language model. The large language model decodes and generates subsequent structured correction strategies containing action correction guidance to achieve correction for the target modality data.

[0191] For example, the first underlying logical error feature of "missing key input elements" is used as the query index key value. This query index key value is input into the first expanded cognitive graph, and knowledge nodes are traversed using a cosine similarity distance algorithm. A first follow-up diagnostic strategy template containing "requesting the replenishment of missing elements" is retrieved. The repair step guidance logic contained in the first follow-up diagnostic strategy template is extracted. The repair step guidance logic is input as a context instruction into the large language model. The large language model decodes and generates a first follow-up structured correction strategy of "requesting refocusing and uploading a clear image" to achieve correction for the first image data (target modality data).

[0192] In this embodiment, step 10417 achieves the beneficial effects of adaptive error reflection and strategy closure-loop generation across modal data formats. This solves the technical problem that anomalies caused by novel modal interactions cannot be captured by a unified reflection mechanism to generate targeted repair guidelines.
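Steps 10416 and 10417 can be sketched as two lookups: a modality-specific error is first converted into an underlying logical error, which then indexes a diagnostic strategy template. The source traverses knowledge nodes with a similarity distance algorithm; the exact-key dictionaries below are a simplification, and the table names are hypothetical. The sample strings are taken from the example above.

```python
from typing import Optional

# Hypothetical node mapping relationships: modality-specific -> underlying logical error.
NODE_MAPPING = {
    "image resolution below the baseline threshold": "missing key input elements",
}

# Hypothetical templates indexed by underlying logical error feature.
TEMPLATES = {
    "missing key input elements": "requesting the replenishment of missing elements",
}


def to_logical_error(modality_error: str) -> Optional[str]:
    """Step 10416: strip the modality-specific form of the error."""
    return NODE_MAPPING.get(modality_error)


def retrieve_followup_template(modality_error: str) -> Optional[str]:
    """Step 10417: use the logical error as the index into the template store."""
    logical_error = to_logical_error(modality_error)
    if logical_error is None:
        return None
    return TEMPLATES.get(logical_error)


print(retrieve_followup_template("image resolution below the baseline threshold"))
```

Because templates are keyed by the logical error rather than by the image-specific symptom, the same template can serve an audio or video input that maps to the same logical error.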

[0193] In step 105, the execution result feedback of the target execution environment is obtained, and the immediate reward values of the execution result feedback under multiple preset indicator dimensions are calculated. In combination with the final task objective, the target achievement expectation value of the corrected action to be executed with respect to the final task objective is quantified, and the quantified target achievement expectation value is mapped to a predictive reward value.

[0194] Among them, the target achievement expectation value refers to the scalar value used to characterize the probability contribution of the current single-round action instruction to the global task completion degree of the multi-round session.

[0195] As an example, step 105 can be implemented as follows: Receive the execution result feedback from the target execution environment through an asynchronous callback interface. Substitute the execution result feedback into the pre-configured scoring algorithm to calculate the immediate reward values under multiple preset indicator dimensions. Parse the total task planning description string containing the final task objective. Input the corrected action to be executed into the Markov state transition prediction model of the execution environment. Calculate the probability that the corrected action to be executed, as an intermediate behavior node, satisfies the final task objective to obtain the target achievement expectation value. Transform the target achievement expectation value into a predictive reward value through extreme value mapping transformation. The target achievement expectation value can be calculated using the following formula (2):

$$E(s_t, a_t) = \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \cdot M(s_{t+1}, G) \tag{2}$$

where $E(s_t, a_t)$ indicates the target achievement expectation value, $s_t$ indicates the current environment state, $a_t$ indicates the corrected action to be executed, $P(s_{t+1} \mid s_t, a_t)$ indicates the probability of transitioning from state $s_t$ to the next state $s_{t+1}$ under action $a_t$, and $M(s_{t+1}, G)$ indicates the semantic matching score between the next state $s_{t+1}$ and the final task objective $G$.
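The expectation in step 105 can be sketched numerically as follows, reading the definitions as a probability-weighted sum of semantic matching scores over candidate next states, which is an assumption about the form of the formula. The "extreme value mapping transformation" is interpreted here as min-max scaling with clipping, also an assumption. All function names and numbers are illustrative.

```python
def goal_expectation(transitions) -> float:
    """Sum of transition probability times matching score over next states.

    transitions: iterable of (probability, match_score) pairs.
    """
    transitions = list(transitions)
    total_probability = sum(p for p, _ in transitions)
    if abs(total_probability - 1.0) > 1e-6:
        raise ValueError("transition probabilities must sum to 1")
    return sum(p * score for p, score in transitions)


def predictive_reward(expectation: float, low: float = 0.0, high: float = 1.0) -> float:
    """Min-max scale the expectation into [0, 1], clipping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(expectation, low), high)
    return (clipped - low) / (high - low)


# Two candidate next states: a likely one that matches the goal well,
# and an unlikely one that matches poorly.
expectation = goal_expectation([(0.7, 0.9), (0.3, 0.2)])
print(round(expectation, 2), round(predictive_reward(expectation), 2))
```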

[0196] In some embodiments, the step 105 of “calculating the instant reward value of the execution result feedback under multiple preset indicator dimensions” can be achieved through the following steps 1051 to 1056, which are described in detail below.

[0197] In step 1051, the execution result feedback is parsed to obtain the environment response status it contains; based on the environment response status, the conformity of the corrected action to be executed with the application interface specification is verified.

[0198] Among them, the environment response status refers to the data payload returned by the target execution environment after executing the modified pending action, which includes network status codes, error description information, or execution result streams.

[0199] As an example, step 1051 can be implemented as follows: Receive a feedback response packet from the communication channel and extract the execution result feedback. Parse the execution result feedback and separate the environment response status containing the network status request code. Retrieve the constraint data stored in the registry. Compare and verify the execution parameter format of the modified action to be executed with the parameter length specification included in the application programming interface (API) specification set in the constraint data. Determine the conformity of the modified action to be executed with the API specification based on the verification result.

[0200] For example, extract the first execution result feedback and separate out the first environment response status, which contains the error network status code 404. Retrieve the first constraint data contained in the first interface registry. Compare the execution parameter format of the first identifier field contained in the first corrected action to be executed with the specific domain name prefix format specified in the first application interface specification. Based on the verification result, obtain the first conformity result, which indicates that the identifier field does not follow the specified domain name prefix format.
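
The conformity check of step 1051 can be illustrated with a minimal Python sketch. The `ApiSpec` class, the field name `id`, the prefix pattern, and the length limit are hypothetical placeholders and are not taken from this application; the sketch only shows one possible way to compare a parameter format against constraint data read from a registry.

```python
import re
from dataclasses import dataclass


@dataclass
class ApiSpec:
    """Hypothetical constraint data for one interface in the registry."""
    id_prefix_pattern: str  # required format of the identifier field
    max_param_length: int   # parameter length limit


def check_conformity(params: dict, spec: ApiSpec) -> bool:
    """Return True if the action's identifier conforms to the API spec."""
    identifier = params.get("id", "")
    if len(identifier) > spec.max_param_length:
        return False
    return re.match(spec.id_prefix_pattern, identifier) is not None


# Assumed example values: a 404 response and an identifier without the prefix.
spec = ApiSpec(id_prefix_pattern=r"^svc\.example\.", max_param_length=64)
feedback = {"status_code": 404, "error": "not found"}
action_params = {"id": "flight-123"}

conforms = check_conformity(action_params, spec)
print(feedback["status_code"], conforms)
```

In this sketch the identifier lacks the required prefix, so the check returns `False`, which corresponds to the non-conforming case described in the example.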

[0201] In this embodiment, step 1051 achieves the beneficial effect of using the objective execution feedback status obtained from real interface-level verification as the basis for judging the legality of the underlying format of the model output. This solves the technical problem of the model misjudging actual execution results due to the lack of underlying format legality verification.

[0202] In step 1052, the reward value for the format correctness dimension is determined based on the conformity; and the correctness and completeness of the parameter values contained in the corrected action to be executed are verified in conjunction with the environment response status.

[0203] As an example, step 1052 can be implemented as follows: Based on the Boolean result of the conformity check, assign a negative scalar penalty when the result is false and a positive scalar bonus when the result is true, thus determining the reward value for the format correctness dimension. Extract the error message returned in the environment response status. Extract the actually transmitted data from the parameter payload of the corrected action to be executed. Verify correctness, that is, whether the parameter values contained in the parameter payload fall within the specified set of constraint values. Verify completeness, that is, whether every required input field is filled, with no empty fields.

[0204] For example, because the Boolean result of the first conformity check is false, assign a first negative scalar penalty of -1 as the first reward value for the format correctness dimension. Extract the first value overflow error message contained in the first environment response status. Extract the actually transmitted data of the first latitude and longitude coordinate fields in the first corrected action to be executed. Verify the first correctness, that is, whether the input parameter value falls within the longitude constraint set. Verify the first completeness, that is, whether the first altitude field is free of empty or missing values.
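
As a rough illustration of step 1052, the following Python sketch maps the Boolean conformity result to a format reward and checks parameter correctness and completeness. The -1 penalty follows the example above; the +1 bonus, the field names, and the value ranges are assumptions made only for this sketch.

```python
def format_reward(conforms: bool) -> float:
    """Map the Boolean conformity result to a scalar reward.

    The -1 penalty follows the example; the +1 bonus is an assumption.
    """
    return 1.0 if conforms else -1.0


def check_parameters(params: dict, required: list, ranges: dict) -> tuple:
    """Return (correct, complete) for the parameter payload.

    complete: every required field is present and non-empty.
    correct:  every supplied ranged field lies inside its allowed interval.
    """
    def is_filled(name):
        return params.get(name) not in (None, "")

    complete = all(is_filled(name) for name in required)
    correct = all(
        low <= params[name] <= high
        for name, (low, high) in ranges.items()
        if is_filled(name)
    )
    return correct, complete


# Hypothetical payload: coordinates are in range, altitude is missing.
payload = {"longitude": 116.4, "latitude": 39.9, "altitude": None}
correct, complete = check_parameters(
    payload,
    required=["longitude", "latitude", "altitude"],
    ranges={"longitude": (-180.0, 180.0), "latitude": (-90.0, 90.0)},
)
print(format_reward(False), correct, complete)
```

Keeping the two checks separate allows a later step to distinguish an out-of-range value from a missing field.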

[0205] In this embodiment, step 1052 achieves the beneficial effects of negative evaluation of structural grammatical errors and in-depth data analysis and detection of the substantial effective content of core action information parameters. This solves the technical problem that coarse-grained evaluation feedback cannot accurately locate missing parameters and numerical anomalies, making it difficult to perform targeted correction and optimization of the network parameters of large language models.

[0206] In step 1053, the reward value for the parameter accuracy dimension is determined based on the correctness and completeness; and the target tool program called by the corrected action to be executed is verified in conjunction with the environment response status.

[0207] As an example, step 1053 can be implemented as follows: Calculate a comprehensive score from the feature values output by the correctness and completeness verification. Determine the reward value for the parameter accuracy dimension based on the comprehensive score. Obtain the returned identification string containing the target identity flag from the environment response status. Compare the request path generated by the corrected action to be executed with the returned identification string for consistency. Based on the comparison, verify the target tool program called by the corrected action to be executed.

[0208] For example, the first correctness and the first completeness are evaluated; because the first altitude field is missing, a comprehensive score of 0.5 is output. The first reward value for the parameter accuracy dimension is determined based on the comprehensive score of 0.5. The first returned identification string, which includes the first query program identifier, is obtained from the first environment response status. The request call path contained in the first corrected action to be executed is compared with the first returned identification string for first consistency. The first target query tool program called by the first corrected action to be executed is thereby verified.
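
One possible reading of step 1053 is sketched below in Python. The equal weighting of correctness and completeness is an assumption chosen so that a correct but incomplete payload scores 0.5 as in the example; the path layout and the tool identifier `query_schedule` are likewise hypothetical.

```python
def parameter_accuracy_reward(correct: bool, complete: bool,
                              w_correct: float = 0.5,
                              w_complete: float = 0.5) -> float:
    """Combine the two Boolean checks into one comprehensive score.

    Equal weights are an assumption; the source does not specify them.
    """
    return w_correct * float(correct) + w_complete * float(complete)


def verify_tool(request_path: str, returned_tool_id: str) -> bool:
    """Check that the last path segment matches the tool id in the response."""
    called_tool = request_path.rstrip("/").split("/")[-1]
    return called_tool == returned_tool_id


score = parameter_accuracy_reward(correct=True, complete=False)
tool_ok = verify_tool("/tools/query_schedule", "query_schedule")
print(score, tool_ok)
```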

[0209] In this embodiment, step 1053 achieves the beneficial effect of separating the verification of business attributes from the verification of the called logic component, so that each forms an independent evaluation indicator. This solves the technical problem that, when choosing among similar tools, a large language model is easily confused by overlapping reward targets and repeatedly switches tools because the penalty signals cannot be attributed to a specific cause.

[0210] In step 1054, the reward value for the accuracy dimension of tool selection is determined based on the verification results of the target tool program; the consistency between the corrected action to be executed and the structured correction strategy in the diagnostic logic is compared.

[0211] As an example, step 1054 can be implemented as follows: The verification results of the target tool program are converted into a Boolean scalar score. The reward value for the accuracy dimension of tool selection is determined based on the Boolean scalar score. The descriptive attribute text record data set is extracted from the structured correction strategy. The actual submitted text of the corrected action to be executed is extracted to form an action description string. A semantic similarity extraction network layer is used to connect the descriptive attribute text record data set and the action description string input text. The angular distance between the multidimensional geometric vector of the action description string and the multidimensional geometric vector of the descriptive attribute text record data set is calculated. The consistency of the corrected action to be executed and the structured correction strategy in diagnostic logic is determined based on the angular distance.

[0212] For example, the verification result of the first target query tool program is assigned a scalar score of 0.8. The first reward value for the tool selection accuracy dimension is determined based on the scalar score of 0.8. The descriptive attribute text record data set containing the first parameter change indication text included in the first structured correction strategy is extracted. The actually submitted text of the first corrected action to be executed is extracted to form the first action description string. The semantic similarity extraction network layer is used to compare the first descriptive attribute text record data set with the first action description string, and a cosine similarity of 0.92 (corresponding to a small angular distance) is calculated. The consistency of the first corrected action to be executed with the first structured correction strategy in diagnostic logic is determined based on the cosine similarity of 0.92.
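
The relationship between cosine similarity and angular distance used in step 1054 can be sketched as follows. The embedding vectors are assumed to have been produced already by some semantic encoder, which is not modeled here; only the vector comparison is shown.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def angular_distance(u: list, v: list) -> float:
    """Angle in radians between two vectors; 0 means identical direction."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos)


# Hypothetical embeddings of the strategy text and the action description.
strategy_vec = [0.9, 0.1, 0.4]
action_vec = [0.8, 0.2, 0.5]
print(cosine_similarity(strategy_vec, action_vec),
      angular_distance(strategy_vec, action_vec))
```

A higher cosine similarity corresponds to a smaller angular distance, so either quantity can serve as the consistency measure as long as the mapping to a reward is chosen accordingly.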

[0213] In this embodiment, step 1054 achieves the beneficial effect of supervising the accuracy of tool program decisions and forming a closed-loop check that the model's intent complies with the correction constraints. This solves the technical problem of an increased failure rate when random variations cause the generated action to deviate from the error correction framework established by reflective judgment.

[0214] In step 1055, the reward value of the reflective semantic consistency dimension is determined based on consistency; based on the execution result feedback, the contribution of the revised action to be executed to the final task objective is evaluated.

[0215] As an example, step 1055 can be implemented as follows: Based on the consistency judgment result, determine the reward value of the reflective semantic consistency dimension according to the scoring mapping rules. Extract entity data containing execution progress performance characteristics from the execution result feedback. Analyze the final task objective and obtain the stage achievement standards of the planned objective sequence set. Combine the stage achievement standards to evaluate the contribution of the revised actions to be executed to achieving the stage objectives in satisfying the final task objective.

[0216] For example, based on the first logical consistency judgment result, a first score of 0.92 is mapped to determine the first reward value for the reflective semantic consistency dimension. First performance feature entity data containing a timestamp verification success status is extracted from the first execution result feedback. The final task objective, namely the entire first schedule query process, is analyzed to obtain the first stage achievement standard, which takes date confirmation as a prerequisite. It is determined that the time input required by this standard was successfully completed by the first corrected action to be executed, and the first contribution level is evaluated as a 10% advancement of the execution progress.

[0217] In this embodiment, step 1055 achieves the beneficial effect of ensuring controlled execution of local optimization steps while also taking into account the overall correlation value and benefit measurement of macro-planning progress. This solves the technical problem of some instructions passing format and tool checks but failing to make substantial progress on global user-mandated requests, leading to prolonged stagnation.

[0218] In step 1056, the reward value for the final task completion dimension is determined based on the degree of contribution; the reward values for the format correctness dimension, parameter accuracy dimension, tool selection accuracy dimension, reflective semantic consistency dimension, and final task completion dimension are merged to obtain the immediate reward value under multiple indicator dimensions.

[0219] As an example, step 1056 can be implemented as follows: The contribution level is quantified using configured parameter weights to calculate the reward value for the final task completion dimension. The previously stored reward values for the format correctness dimension, parameter accuracy dimension, tool selection accuracy dimension, and reflective semantic consistency dimension are read from the storage array. These four reward values, together with the reward value for the final task completion dimension, are assembled into a multi-dimensional floating-point scalar set using a tensor concatenation instruction. This multi-dimensional floating-point scalar set is then used as the immediate reward value under multiple indicator dimensions.

[0220] For example, the first contribution value, representing a 10% advancement, is processed by a first preset smoothing function rule to calculate a first reward value of 0.1 for the final task completion dimension. The reward values stored in the first preset memory are then obtained: -1 for format correctness, 0.5 for parameter accuracy, 0.8 for tool selection accuracy, and 0.92 for reflective semantic consistency. These reward values (-1, 0.5, 0.8, 0.92) and the reward value of 0.1 for the final task completion dimension are then assembled into a five-dimensional floating-point scalar set using a first tensor concatenation instruction. This five-dimensional floating-point scalar set is then determined as the first immediate reward value under multiple indicator dimensions.
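
The five-dimensional immediate reward of the example can be represented as a simple named record, as in the sketch below. The class name and field names are illustrative only, and the identity mapping from a 10% contribution to 0.1 stands in for the unspecified smoothing function.

```python
from typing import NamedTuple


class ImmediateReward(NamedTuple):
    """Immediate reward under the five indicator dimensions."""
    format_correctness: float
    parameter_accuracy: float
    tool_selection_accuracy: float
    reflective_consistency: float
    task_completion: float


def completion_reward(contribution: float) -> float:
    """Placeholder for the preset smoothing rule (identity is an assumption)."""
    return contribution


stored = (-1.0, 0.5, 0.8, 0.92)  # rewards from steps 1052 to 1055
reward = ImmediateReward(*stored, completion_reward(0.10))
print(list(reward))
```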

[0221] In this embodiment, step 1056 achieves the beneficial effect of integrating a series of fine-grained feedback signals, such as the micro-level legal execution verification criteria and the global task planning stage rewards. This solves the technical problem of poor convergence caused by the lack of fine-grained guidance gradients for generating complex, multi-structured operational processes due to the single dimension and coarse granularity of the reinforcement learning reward feedback index.

[0222] In some embodiments, the "quantification of the expected value of the final task objective achieved by the corrected action to be executed" in step 105 can be achieved through the following steps 1057 to 1059, which are described in detail below.

[0223] In step 1057, the global user intent features contained in the final task objective are extracted.

[0224] Among them, global user intent features refer to semantic vectors extracted from multi-round interactive task instructions that represent the user's ultimate core needs and intents.

[0225] As an example, step 1057 can be implemented as follows: Parse the final task objective. Identify the action predicates and target objects in the final task objective using a semantic analysis model. Transform the action predicates and target objects into multi-dimensional global user intent features.

[0226] For example, the final task objective is "to book tickets for the first city and arrange the first hotel". From this final task objective, the first action intent of "booking tickets" and the first target intent of "arranging a hotel" are extracted. A semantic embedding algorithm is then used to map the first action intent of "booking tickets" and the first target intent of "arranging a hotel" into first global user intent features.

[0227] In this embodiment, step 1057 achieves the beneficial effect of feature extraction of macroscopic task objectives. This solves the technical problem that the complexity of the final task objective description makes it difficult to directly establish logical connections in the model.

[0228] In step 1058, the probability distribution of subsequent multi-round state transitions triggered by the modified action to be executed in the external environment is predicted.

[0229] As an example, step 1058 can be implemented as follows: The corrected action to be executed is input into a preset environment prediction model. The environment prediction model simulates the execution process of the corrected action in the external environment. Multiple candidate environment states are calculated for the external environment in subsequent preset rounds after the corrected action is executed. A corresponding transition probability value is assigned to each candidate environment state, constructing a state transition probability distribution for subsequent rounds.

[0230] For example, the first modified action to be executed is input into the first environment prediction model. The first environment prediction model simulates the execution of the first modified action in a first external environment. A first candidate state and a second candidate state are determined after execution. A first probability of 0.7 is assigned to the first candidate state, and a second probability of 0.3 is assigned to the second candidate state. The first subsequent multi-round state transition probability distribution is obtained.

[0231] In this embodiment, step 1058 achieves the beneficial effect of simulating and predicting future trends after an action is executed. This solves the technical problem that relying solely on immediate feedback leads to a lack of foresight in model learning.

[0232] In step 1059, the matching degree between the subsequent multi-round state transition probability distribution and the global user intent features is calculated as the target achievement expectation value.

[0233] As an example, step 1059 can be implemented as follows: Calculate the vector similarity between each candidate environment state in the subsequent multi-round state transition probability distribution and the global user intent features. Perform a weighted sum of the vector similarities using the corresponding transition probability values as weights to obtain a matching degree. Determine this matching degree as the target achievement expectation value.

[0234] For example, calculate the first similarity between the first candidate state and the first global user intent feature, and calculate the second similarity between the second candidate state and the first global user intent feature. Calculate the matching degree using formula (3):

$$E = \sum_{i=1}^{N} p_i \cdot \mathrm{sim}_i \qquad (3)$$

where $E$ denotes the matching degree (i.e., the target achievement expectation value), $N$ denotes the number of candidate environment states, $p_i$ denotes the transition probability value of the $i$-th candidate environment state, and $\mathrm{sim}_i$ denotes the similarity between the $i$-th candidate environment state and the global user intent features. The first target achievement expectation value is obtained using formula (3).
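
Steps 1057 to 1059 amount to a probability-weighted sum of similarities. The Python sketch below illustrates this with the probabilities 0.7 and 0.3 from the example of step 1058; the state and intent vectors are made-up two-dimensional embeddings, and cosine similarity is assumed as the vector similarity.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def goal_achievement_expectation(candidates: list, intent: list) -> float:
    """Sum of p_i * sim_i over candidate states.

    candidates: list of (transition_probability, state_vector) pairs.
    """
    return sum(p * cosine_similarity(state, intent) for p, state in candidates)


intent = [1.0, 0.0]                 # hypothetical global user intent feature
candidates = [(0.7, [1.0, 0.0]),    # first candidate state, matches the intent
              (0.3, [0.0, 1.0])]    # second candidate state, orthogonal to it
expectation = goal_achievement_expectation(candidates, intent)
print(expectation)
```

With these assumed vectors the first state contributes 0.7 × 1 and the second contributes 0.3 × 0, so the expectation is approximately 0.7.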

[0235] In this embodiment, step 1059 achieves the beneficial effect of aligning future state predictions with global intentions. This solves the decision bias problem that easily gets trapped in local optima and deviates from the global objective during model error correction.

[0236] In step 106, the real-time reward value and the predictive reward value under multiple indicator dimensions are weighted and fused to obtain the global reward value.

[0237] The global reward value refers to the total score that comprehensively considers both the immediate execution quality of the action and its contribution to achieving future goals.

[0238] As an example, step 106 can be implemented as follows: Obtain the immediate reward values across multiple indicator dimensions. Obtain the predictive reward value. Determine the immediate reward weights and the predictive weight. Multiply the immediate reward values by the immediate reward weights, and multiply the predictive reward value by the predictive weight. Sum the results of the multiplication operations to obtain the global reward value.
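
The weighted fusion of step 106 can be sketched in a few lines of Python. All weight values below are hypothetical and are chosen only to make the arithmetic easy to follow; they are not the weights that yield the figures quoted elsewhere in this description.

```python
def global_reward(immediate: list, immediate_weights: list,
                  predictive: float, predictive_weight: float) -> float:
    """Weighted sum of the immediate rewards plus the weighted predictive reward."""
    if len(immediate) != len(immediate_weights):
        raise ValueError("one weight is required per indicator dimension")
    immediate_part = sum(w * r for w, r in zip(immediate_weights, immediate))
    return immediate_part + predictive_weight * predictive


immediate = [-1.0, 0.5, 0.8, 0.92, 0.1]   # five indicator dimensions
weights = [0.1, 0.2, 0.2, 0.2, 0.1]       # assumed immediate reward weights
value = global_reward(immediate, weights, predictive=0.7, predictive_weight=0.2)
print(round(value, 3))
```

With these assumed weights the immediate part is 0.354 and the predictive part is 0.14, giving a global reward of about 0.494.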

[0239] In some embodiments, step 106 can be implemented by steps 1061 to 1063, which are described in detail below.

[0240] In step 1061, the execution performance indicators, satisfaction feedback, and node evolution characteristics of the cognitive graph are obtained.

[0241] Among them, the node evolution characteristics of the cognitive graph refer to the frequency of newly added error pattern nodes and the update magnitude of the connection weights between nodes.

[0242] As an example, step 1061 can be implemented as follows: extract execution time and resource consumption from task execution records to determine execution performance indicators; receive task completion ratings through a human-computer interaction interface to determine satisfaction feedback; and analyze node update records within a preset period from the cognitive graph to determine the node evolution characteristics of the cognitive graph.

[0243] For example, the first execution time of 3 seconds and the first memory consumption of 120MB are extracted and combined to obtain the first execution performance index. A satisfaction rating of 4 points is received. Five new error pattern nodes are obtained from the cognitive graph in the past hour, and the node evolution characteristics of the first cognitive graph are determined.

[0244] In this embodiment, step 1061 achieves the beneficial effect of comprehensive collection of multi-source evaluation data. This solves the technical problem that reward allocation lacks consideration of operational efficiency and knowledge evolution status.

[0245] In step 1062, based on the execution performance indicators, satisfaction feedback, and node evolution characteristics, the multiple immediate reward weights for the multiple indicator dimensions and the predictive weight for the predictive reward value are dynamically adjusted.

[0246] As an example, step 1062 can be implemented as follows: Establish a weight mapping function between multiple immediate reward weights and predictive weights and execution performance indicators, satisfaction feedback, and node evolution characteristics. Input the obtained execution performance indicators, satisfaction feedback, and node evolution characteristics into the weight mapping function. Output the dynamically adjusted multiple immediate reward weights and predictive weights.

[0247] For example, if the first execution performance indicator shows that resource consumption exceeds a preset resource threshold, the weight mapping function outputs a lowered first immediate reward weight. If the node evolution characteristics of the first cognitive graph show that the node update frequency is greater than a preset update frequency threshold, the function outputs an increased first predictive weight.
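
A weight mapping function of the kind described in step 1062 might look like the following Python sketch. The multiplicative step size, the thresholds, and the final renormalization so that all weights sum to 1 are assumptions of this sketch; the source only states the direction in which the weights move.

```python
def adjust_weights(immediate_weights: list, predictive_weight: float,
                   resource_usage: float, resource_threshold: float,
                   node_update_freq: float, freq_threshold: float,
                   step: float = 0.1) -> tuple:
    """Lower immediate weights on high resource use, raise the predictive
    weight on frequent graph updates, then renormalize to sum to 1."""
    immediate = list(immediate_weights)
    predictive = predictive_weight
    if resource_usage > resource_threshold:
        immediate = [w * (1.0 - step) for w in immediate]
    if node_update_freq > freq_threshold:
        predictive = predictive * (1.0 + step)
    total = sum(immediate) + predictive
    return [w / total for w in immediate], predictive / total


# Assumed inputs: 120 MB used against a 100 MB threshold,
# 5 new nodes per hour against a threshold of 3.
new_immediate, new_predictive = adjust_weights(
    [0.16] * 5, 0.2,
    resource_usage=120.0, resource_threshold=100.0,
    node_update_freq=5.0, freq_threshold=3.0,
)
print(new_immediate, new_predictive)
```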

[0248] In this embodiment, step 1062 achieves the beneficial effect of dynamically adjusting the reward mechanism to adapt to changes in the task environment. This solves the learning efficiency bottleneck problem caused by the inability of fixed weight allocation to flexibly cope with the training needs at different stages.

[0249] In step 1063, the dynamically adjusted multiple immediate reward weights and predictive weights are used to perform a weighted summation of the immediate reward value and the predictive reward value to obtain the global reward value.

[0250] As an example, step 1063 can be implemented as follows: Using a bootstrapping mechanism, dynamically adjust the multiple immediate reward weights and the predictive weight based on actual feedback data from environmental interactions. Perform a first weighted summation operation on the immediate reward values across multiple indicator dimensions using the immediate reward weights. Perform a second weighted summation operation on the predictive reward value using the predictive weight. Add the result of the first weighted summation operation to the result of the second weighted summation operation to obtain the global reward value, as shown in Figure 6.

[0251] For example, the reward values for the format correctness dimension and the parameter accuracy dimension are weighted and summed using the first set of immediate reward weights. The summation result is then added to the first predictive reward value weighted by the first predictive weight. This yields a first global reward value of 0.85.

[0252] In this embodiment, step 1063 achieves the beneficial effect of constructing a unified evaluation index that integrates short-term feedback and long-term contribution. This solves the problem of model behavior divergence caused by imbalanced reward signals during reinforcement learning.

[0253] In step 107, the global reward value is used as a feedback signal to iteratively update the network parameters of the large language model.

[0254] Here, network parameters refer to the values of the internal weight matrices that determine the predicted probability distribution of the large language model.

[0255] In some embodiments, step 107 can be implemented by steps 1071 to 1075, which are described in detail below.

[0256] In step 1071, the global observation state, the corrected action to be executed, and the global reward value are combined.

[0257] As an example, step 1071 can be implemented as follows: obtain the global observation state from step 101, the corrected action to be executed from step 104, and the global reward value from step 106. Then, concatenate the global observation state, the corrected action to be executed, and the global reward value according to their temporal order to obtain the combined result.

[0258] For example, the global observation state of the first task scenario, the first corrected action to be executed, and the first global reward value of 0.85 are concatenated to obtain the first combined result.

[0259] In this embodiment, step 1071 achieves the beneficial effect of explicitly linking the interaction process with the evaluation result, thus solving the technical problem of training data lacking evaluation criteria guidance.

[0260] In step 1072, the combined results are structured and stored as policy optimization samples.

[0261] As an example, step 1072 can be implemented as follows: convert the combination result into a preset tensor format data, assign a unique identifier to the tensor format data, and store the tensor format data with the unique identifier in the policy optimization sample database, thus identifying it as a policy optimization sample.

[0262] For example, the first combination result is transformed into a first tensor format data containing a state vector, action index, and reward scalar. This data is then stored in a first sample database and identified as a first policy optimization sample.

[0263] In this embodiment, step 1072 achieves the beneficial effect of standardizing and solidifying high-quality experience data. This solves the technical problem of disordered data distribution and inefficient utilization of model training data.

[0264] In step 1073, a sampling operation is performed on the stored policy optimization samples.

[0265] As an example, step 1073 can be implemented as follows: determine the current training step number, and determine the sampling ratio based on the training step number. Randomly select a preset number of samples from the policy optimization sample database according to the sampling ratio, and determine the selected samples as the sampling operation results of the current batch.

[0266] For example, the first batch of sampling results with 128 samples is randomly drawn from the first sample database according to the first sampling ratio.
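
Steps 1071 to 1073 resemble storing and sampling from an experience buffer. The sketch below assumes an in-memory list in place of the policy optimization sample database; `PolicySample` and `SampleStore` are hypothetical names, and the dependence of the sampling ratio on the training step is omitted for brevity.

```python
import random
import uuid
from dataclasses import dataclass, field


@dataclass
class PolicySample:
    """One combined result: state vector, action index, reward scalar."""
    state: list
    action: int
    reward: float
    sample_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class SampleStore:
    """In-memory stand-in for the policy optimization sample database."""

    def __init__(self):
        self._samples = []

    def add(self, state: list, action: int, reward: float) -> PolicySample:
        sample = PolicySample(state, action, reward)
        self._samples.append(sample)
        return sample

    def sample(self, batch_size: int, rng=None) -> list:
        """Draw up to batch_size distinct samples uniformly at random."""
        rng = rng if rng is not None else random
        k = min(batch_size, len(self._samples))
        return rng.sample(self._samples, k)


store = SampleStore()
for action_index in range(200):
    store.add(state=[0.0, 1.0], action=action_index, reward=0.85)
batch = store.sample(128, rng=random.Random(0))
print(len(batch))
```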

[0267] In this embodiment, step 1073 achieves the beneficial effects of unbiased sampling and dynamic utilization of training data. This solves the technical problem of learning curve fluctuations caused by insufficient utilization of past experience.

[0268] In step 1074, the gradient of network parameter update is calculated based on the sampled strategy optimization samples.

[0269] As an example, step 1074 can be implemented in the following way: based on the sampled policy optimization samples, when using the direct alignment policy optimization algorithm, the partial derivatives of the loss function with respect to the network parameters of each layer in the large language model are calculated using the policy gradient algorithm, and the obtained partial derivatives are determined as the network parameter update gradient; or, when using the concrete skill policy optimization algorithm, the macroscopic action policy contained in the sampled policy optimization samples is decomposed into multiple low-level specific executable skill sequences, and the partial derivatives of the loss function of the network weights of the corresponding specific skill modules are calculated for each low-level specific executable skill sequence, and the set of the partial derivatives of the loss function of each skill module is determined as the network parameter update gradient.

[0270] For example, based on the first policy optimization sample, when the direct alignment policy optimization algorithm is used, the first partial derivative of the loss function is calculated by backpropagation and used as the first network parameter update gradient. Alternatively, when the concrete skill policy optimization algorithm is used, the first policy optimization sample is decomposed into a first tool-calling skill sequence and a first parameter parsing skill sequence. The partial derivatives for the first skill module corresponding to the first tool-calling skill sequence and for the second skill module corresponding to the first parameter parsing skill sequence are calculated separately. These two partial derivatives are then combined to determine the first network parameter update gradient.
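
The source does not spell out the loss function of either optimization algorithm. As a stand-in, the sketch below uses a basic REINFORCE-style policy gradient on a toy softmax policy over discrete actions, where the loss is assumed to be the negative reward times the log-probability of the taken action; it illustrates how a global reward scales the gradient and is not the method of this application.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits: list, action: int, reward: float) -> list:
    """Gradient of loss = -reward * log pi(action) with respect to the logits.

    For a softmax policy this equals reward * (pi - one_hot(action)).
    """
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]


# Two equally likely actions; action 0 was taken and earned reward 0.85.
grad = policy_gradient([0.0, 0.0], action=0, reward=0.85)
print(grad)
```

A gradient descent step along this gradient raises the logit of the rewarded action and lowers the other, which is the qualitative behavior expected of a positive feedback signal.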

[0271] In this embodiment, step 1074 achieves the beneficial effect of accurately quantifying the transformation of execution experience into the driving force for model optimization. This solves the technical problems of vague direction and lack of mathematical basis for model strategy improvement.

[0272] In step 1075, the network parameters of the large language model are adjusted based on the network parameter update gradient.

[0273] As an example, step 1075 can be implemented as follows: determine a preset learning rate step size, multiply the network parameter update gradient by the preset learning rate step size, and use the calculation result to update the weight matrix of the large language model, thereby adjusting the network parameters of the large language model.

[0274] For example, the gradient of the first network parameter update is multiplied by the first learning rate step size, and the first weight matrix of the large language model is updated by subtraction using the gradient descent method, thus completing the operation of adjusting the network parameters of the large language model.
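
The update of step 1075 is ordinary gradient descent. A minimal sketch on a nested-list weight matrix follows; the learning rate of 0.1 and the matrix values are arbitrary illustrative choices.

```python
def gradient_descent_step(weights: list, grads: list, learning_rate: float) -> list:
    """Return a new weight matrix: w - learning_rate * g, element by element."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


weights = [[1.0, 2.0], [0.5, -0.5]]
grads = [[0.5, -1.0], [0.0, 2.0]]
updated = gradient_descent_step(weights, grads, learning_rate=0.1)
print(updated)
```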

[0275] In this embodiment, step 1075 achieves the beneficial effect of closed-loop optimization of the action decision-making ability of the large language model. This solves the long-term evolutionary problem of the large language model's inability to self-correct and improve based on actual error correction feedback.

[0276] The following will describe the action control method for a large language model provided in this application embodiment, with reference to the exemplary application and implementation of the server provided in the embodiments of this application.

[0277] Referring to Figure 3E, which is a first flowchart of the action control method for a large language model provided in this application embodiment, the method will be described with reference to the steps shown in Figure 3E.

[0278] In step 201, in response to the received task instruction, and in conjunction with the current global observation state, an initial action to be executed is generated through the large language model.

[0279] As an example, step 201 can be implemented as follows: The large language model receives task instructions. It then combines the global observation state to generate initial actions representing the task intent.

[0280] In some embodiments, the step 201 of "generating an initial action to be executed by combining the current global observation state with the large language model" can be achieved by processing task instructions, dialogue history information in the global observation state and system context information through the large language model; generating an initial action to be executed that includes tool call instructions, text replies or intent clarification request actions.

[0281] As an example, the large language model treats the task instruction as the current request. It retrieves stored dialogue history and system context information from the global observation state. The task instruction, dialogue history, and system context information are concatenated and input into the encoding layer of the large language model. The output of the large language model contains tool invocation instructions, text responses, or intent clarification request actions. These tool invocation instructions, text responses, or intent clarification request actions are then identified as the initial actions to be executed.

[0282] For example, upon receiving the task instruction to "check the weather," the system retrieves historical dialogue information regarding locations previously inquired about, as well as system context information regarding currently enabled location permissions. After processing this information, the large language model generates a tool call instruction containing "call the weather API" as the initial action to be executed.

[0283] In this embodiment, by comprehensively processing dialogue history and contextual information to generate actions, the beneficial effect of precise matching of action generation with the contextual environment is achieved. This solves the technical problem that isolated action generation leads to non-compliance with business logic or permission requirements.

[0284] In step 202, the confidence level of the initial action to be executed under the global observation state is calculated, and the confidence level is mapped to obtain the risk assessment value.

[0285] In some embodiments, the step 202 of "mapping the confidence level to obtain the risk assessment value" can be achieved by: obtaining the internal state information and historical interaction trajectory data of the large language model; constructing an objective function representing the execution uncertainty based on the internal state information and historical interaction trajectory data; using the objective function representing the execution uncertainty to map the confidence level, and determining the result of the mapping calculation as the risk assessment value.

[0286] As an example, the large language model obtains the neuron activation weights at the time of action generation from the kernel buffer as internal state information. Historical interaction trajectory data for the same type of action is extracted from the storage log. An objective function representing the uncertainty of execution is constructed. The confidence level from step 202 is input into the objective function and subjected to a nonlinear transformation. The output of the objective function is determined as the risk assessment value.

[0287] For example, obtain the activation vector of the first hidden layer as internal state information. Obtain the records of the past three failed executions of the first tool. Construct the first objective function. Substitute a confidence score of 0.6 into the first objective function. The mapping calculation yields a risk assessment value of 0.75.
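The application does not fix the form of the objective function, so the following is only one possible sketch of a nonlinear confidence-to-risk mapping. The logistic form, the equal weighting of the three terms, the `sharpness` parameter, and the inputs `failure_rate` (a proxy for the historical interaction trajectory) and `state_uncertainty` (a proxy for the internal state information) are all assumptions made for illustration; the sketch is not calibrated to reproduce the 0.6 to 0.75 example above.

```python
import math


def risk_from_confidence(confidence, failure_rate, state_uncertainty, sharpness=4.0):
    """Map a confidence in [0, 1] to a risk value in (0, 1).

    failure_rate and state_uncertainty are assumed to be normalised to [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Raw uncertainty grows with low confidence, frequent past failures,
    # and an unstable internal state (equal weights are an assumption).
    raw = (1.0 - confidence) + failure_rate + state_uncertainty
    # Logistic transform centred so that a raw uncertainty of 1.0 maps to 0.5.
    return 1.0 / (1.0 + math.exp(-sharpness * (raw - 1.0)))


low = risk_from_confidence(0.9, failure_rate=0.0, state_uncertainty=0.0)
high = risk_from_confidence(0.6, failure_rate=0.5, state_uncertainty=0.2)
print(low < high)
```

With this shape, the same confidence score yields a higher risk value when the tool has failed often in the past, which is the behaviour the paragraph above attributes to combining internal state and historical trajectories.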

[0288] In this embodiment, by combining internal states and historical trajectories to construct an uncertainty function, a multi-dimensional quantitative assessment of execution risk is achieved. This solves the technical problem that simply relying on probability scores cannot truly reflect the unreliability of the model in specific scenarios.

[0289] In step 203, if the risk assessment value exceeds the preset risk threshold, the initial action to be executed is intercepted, and the risk characteristics of the initial action to be executed are extracted. Using the global observation status and risk characteristics as indexes, a matching diagnostic strategy template is retrieved from the pre-constructed cognitive map.

[0290] In some embodiments, the "extracting risk features of the initial action to be executed" in step 203 can be achieved by: calculating the matching degree between the initial action to be executed and the system context information; identifying structural defects or parameter defects in the initial action to be executed; identifying logical consistency defects or tool invocation defects in the initial action to be executed based on the matching degree; extracting abnormal features of at least one of structural defects, parameter defects, logical consistency defects, and tool invocation defects; mapping the abnormal features to the corresponding expected execution anomaly types; and determining the expected execution anomaly types as risk features.

[0291] As an example, risk feature extraction can be achieved as follows: Calculate the matching degree between the initial action instruction to be executed and the configuration specifications in the system context information using a rule-matching algorithm. Identify whether there are structural defects such as missing required fields or parameter defects such as out-of-bounds values in the initial action to be executed. Simultaneously, identify logical consistency defects such as plaintext contradictions or tool call defects involving invalid application programming interfaces based on the matching degree. Convert the identified four types of defects into multidimensional tensors and extract anomalous features. Use a pre-trained classifier to map the anomalous features to the corresponding expected execution anomaly types. Determine the expected execution anomaly types as risk features.

[0292] For example, calculate the matching degree between the first invocation instruction and the current business permissions. Identify parameter defects where the second parameter is missing. Based on this matching degree, identify logical consistency defects where the first invocation instruction does not match the context environment. Extract anomaly features containing the aforementioned defect label codes. Input the anomaly features into a first classifier and map them to the first expected execution exception type, "interface interaction failure". Identify "interface interaction failure" as the first risk feature.
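A simplified sketch of the rule-matching part of this extraction is given below. The action schema, the tool registry `KNOWN_TOOLS`, and the defect labels are hypothetical. A plain lookup table stands in for the pre-trained classifier, and the logical consistency check is omitted for brevity, so this covers only three of the four defect types.

```python
REQUIRED_FIELDS = {"tool", "params"}               # hypothetical action schema
KNOWN_TOOLS = {"weather_api": {"days": (1, 7)}}    # hypothetical registry: parameter bounds

# Hypothetical lookup table standing in for the pre-trained classifier.
ANOMALY_TYPE = {
    "structural_defect": "malformed instruction",
    "parameter_defect": "interface interaction failure",
    "tool_call_defect": "interface interaction failure",
}


def find_defects(action):
    """Return the defect labels found in a tool-call action (a dict)."""
    if REQUIRED_FIELDS - set(action):
        return ["structural_defect"]               # a required field is missing
    spec = KNOWN_TOOLS.get(action["tool"])
    if spec is None:
        return ["tool_call_defect"]                # unknown or invalid interface
    for name, (low, high) in spec.items():
        value = action["params"].get(name)
        if value is None or not low <= value <= high:
            return ["parameter_defect"]            # missing or out-of-bounds parameter
    return []


def risk_features(action):
    """Map defects to expected execution anomaly types (the risk features)."""
    return sorted({ANOMALY_TYPE[d] for d in find_defects(action)})


print(risk_features({"tool": "weather_api", "params": {}}))
```

Mapping several defect labels to one anomaly type mirrors the standardized classification described above: downstream retrieval only needs the unified type, not the raw defect.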

[0293] In this embodiment, by comprehensively identifying four-dimensional defects and mapping them to unified type features, the beneficial effects of accurately deconstructing the kernel of action errors and completing standardized classification are achieved. This solves the technical problem that subsequent graph retrieval cannot be triggered after action interception due to the lack of standardized description of error attribution.

[0294] In some embodiments, the following scheme may also be implemented for the logic in step 203: if the risk assessment value does not exceed the preset risk threshold, the initial action to be executed is sent to the target execution environment so as to execute the initial action to be executed in the target execution environment.

[0295] As an example, issuing an action when the risk assessment value does not exceed a preset risk threshold can be achieved as follows: Determine whether the risk assessment value is less than or equal to the preset risk threshold. If the determination result is yes, send the initial action to be executed to the target execution environment. Trigger the target execution environment to perform actual business execution according to the instruction content of the initial action to be executed.

[0296] For example, if the second risk assessment value of 0.2 is determined to be less than the preset risk threshold of 0.6, the second initial action to be executed is sent to the business microservice environment (target execution environment). This causes the business microservice environment to execute a query operation on the first database.
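The threshold-based diversion can be sketched in a few lines. The function name `route_action` and the string labels are assumptions; the threshold of 0.6 and the risk values 0.2 and 0.75 are taken from the examples above.

```python
def route_action(risk_value, risk_threshold=0.6):
    """Intercept only when the risk value exceeds the threshold; otherwise execute."""
    return "intercept" if risk_value > risk_threshold else "execute"


print(route_action(0.2))   # low risk: sent to the target execution environment
print(route_action(0.75))  # high risk: intercepted for diagnosis
```

Note that a risk value equal to the threshold is released for execution, consistent with the "less than or equal to" determination described above.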

[0297] In this embodiment, by setting a threshold-based traffic diversion mechanism to allow low-risk actions, a beneficial effect of dynamically balancing processing efficiency and error avoidance security is achieved. This solves the technical problem of a surge in response latency and a decrease in throughput caused by forcibly blocking all instructions.

[0298] In some embodiments, step 203, "retrieving matching diagnostic strategy templates from a pre-built cognitive graph using the global observation state and risk features as indexes," can be achieved as follows: determining the global observation state as the current context index; obtaining the error classification type corresponding to the risk features; and performing a retrieval operation in the cognitive graph based on the current context index and the error classification type to obtain matching error diagnosis information and diagnostic strategy templates.

[0299] As an example, retrieving diagnostic strategy templates can be achieved as follows: Key attribute fields from the global observation state are extracted and used as the current context index. The misclassification type to which the aforementioned risk characteristics belong is obtained through table lookup parsing. The current context index and the misclassification type are used together as query conditions and input into the cognitive graph's search engine. A matching search is performed within the graph nodes to obtain the error diagnosis information and the corresponding diagnostic strategy template.

[0300] For example, the field currently defined as "financial query domain" is extracted as the first current context index. A table lookup is performed to obtain the first error classification type corresponding to the first risk feature of "interface interaction failure". The first current context index and the first error classification type are input into the first search engine. Matching yields first error diagnostic information containing modification permission prompts, and a first diagnostic strategy template with accompanying parameter completion instructions.
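The joint retrieval by context index and error classification type may be sketched as a keyed lookup. Here the cognitive graph is flattened into a dictionary for illustration only; the key names, the `domain` field, and the error type label `missing_or_invalid_parameter` are hypothetical, and a real graph store would expose its own query interface.

```python
from typing import NamedTuple, Optional


class DiagnosisEntry(NamedTuple):
    diagnostic_info: str
    strategy_template: str


# Hypothetical table: risk feature -> error classification type.
ERROR_TYPE_OF = {"interface interaction failure": "missing_or_invalid_parameter"}

# Hypothetical cognitive graph, flattened to (context index, error type) -> entry.
COGNITIVE_GRAPH = {
    ("financial query domain", "missing_or_invalid_parameter"): DiagnosisEntry(
        diagnostic_info="permission must be modified before the call",
        strategy_template="complete the missing call parameters",
    ),
}


def retrieve_template(global_state: dict, risk_feature: str) -> Optional[DiagnosisEntry]:
    """Use the context index and the error type together as the query condition."""
    context_index = global_state.get("domain")         # key attribute field
    error_type = ERROR_TYPE_OF.get(risk_feature)       # table lookup
    return COGNITIVE_GRAPH.get((context_index, error_type))


entry = retrieve_template({"domain": "financial query domain"}, "interface interaction failure")
print(entry.strategy_template)
```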

[0301] In this embodiment, by jointly retrieving the cognitive graph based on context and error type, the beneficial effect of accurately matching and calling the prior knowledge base with the current operational dilemma is achieved. This solves the technical problem that large language models lack standard references during error correction and can only randomly generate repair strategies under ruleless guidance.

[0302] In step 204, based on the retrieved diagnostic strategy template, a structured correction strategy is generated through a large language model, and the corrected action to be executed is generated according to the structured correction strategy and sent to the target execution environment.

[0303] In some embodiments, the "generating a structured correction strategy" in step 204 can be implemented in the following ways: generating error analysis logic based on a diagnostic strategy template; generating an error evidence search path; generating logical steps for correcting call parameters, or generating logical steps for replanning the tool call flow; assembling the error analysis logic, the error evidence search path, and the logical steps for correcting call parameters into a structured correction strategy, as shown in Figure 5.

[0304] As an example, generating a structured correction strategy can be achieved as follows: Parse the error analysis prompt word field in the diagnostic strategy template, drive the large language model to perform inference on the reasons for interception failure, and form error analysis logic. Locate the information node that caused the current error in the global observation state, and generate the corresponding data access instruction sequence as the error evidence search path. Determine the error level that caused the interception; if it belongs to the parameter assignment level, generate logical steps to correct the call parameters; if it belongs to the interface interaction process level, generate logical steps to re-plan the tool call process. Assemble the text description of the error analysis logic, the data instruction encoding of the error evidence search path, and the text description of the logical steps according to the key-value pair format template to form a structured correction strategy.

[0305] For example, based on the first diagnostic strategy template, the large language model executes the first error analysis logic for missing permissions. It locates the permission status node in the first log and generates the first error evidence search path. Determining the first error level to be the parameter assignment level, it generates the first corrective call parameter logic step of adding a digital signature certificate. The first error analysis logic, the first error evidence search path, and the first corrective call parameter logic step are then concatenated according to a preset format to assemble the first structured correction strategy.
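The key-value assembly can be illustrated with a small data class serialized to JSON. The field names, the level labels `parameter_assignment` and `interface_interaction`, and the choice of JSON as the key-value format are assumptions for this sketch; the application only requires a preset key-value template.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CorrectionStrategy:
    error_analysis: str          # text description of the error analysis logic
    evidence_search_path: list   # data access instructions locating the evidence
    correction_steps: list       # text description of the logical repair steps
    error_level: str             # level at which the intercepted error occurred


def build_strategy(error_analysis, evidence_path, error_level, fix_detail):
    """Choose the kind of repair step from the error level, then assemble the parts."""
    if error_level == "parameter_assignment":
        step = "correct call parameters: " + fix_detail
    elif error_level == "interface_interaction":
        step = "re-plan tool call flow: " + fix_detail
    else:
        raise ValueError(f"unknown error level: {error_level}")
    return CorrectionStrategy(error_analysis, list(evidence_path), [step], error_level)


strategy = build_strategy(
    "permission is missing for the call",
    ["log:permission_status"],
    "parameter_assignment",
    "add digital signature certificate",
)
payload = json.dumps(asdict(strategy), sort_keys=True)
print(payload)
```

Serializing to a fixed set of keys is what lets a downstream program parse the reflection result without interpreting free-form natural language.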

[0306] In this embodiment, the structured assembly of four parts achieves the beneficial effect of standardized encapsulation of multidimensional reflection and repair results. This solves the technical problem that reflection conclusions in natural language form output by large language models are difficult to be directly parsed and executed by downstream programs.

[0307] In some embodiments, the step 204 of "generating the corrected action to be executed according to the structured correction strategy" can be implemented in the following way: according to the reflection process indicated by the structured correction strategy, generate the call correction action, generate the new tool call sequence action, or generate the intent clarification request action to request intent clarification from the user; and determine the call correction action, the new tool call sequence action, or the intent clarification request action as the corrected action to be executed.

[0308] As an example, generating corrected actions to be executed can be achieved as follows: Parse the logical step field in the structured correction strategy to determine the specific action intent indicated by the reflection process. Generate a call correction action when the action intent is to supplement parameters. Generate a new tool call sequence action when the action intent is to change the execution flow. Generate an intent clarification request action when the action intent indicates insufficient context. Change the generated call correction action, new tool call sequence action, or intent clarification request action to the pending execution state, thus identifying it as the corrected action to be executed.

[0309] For example, the first reflection process in the first structured correction strategy is analyzed to determine that the intent of the first action is to supplement certificate parameters. A first call correction action containing the digital signature certificate is generated. The first call correction action is changed to a first pending execution state. The first call correction action after the change of state is determined as the first corrected pending execution action.
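The three-way dispatch on action intent may be sketched as follows. The dictionary layout of the strategy and of the actions, and the intent labels, are hypothetical names chosen for this illustration.

```python
def corrected_action(strategy: dict, original_action: dict) -> dict:
    """Turn the intent recorded in a structured correction strategy into an action."""
    intent = strategy["intent"]
    if intent == "supplement_parameters":
        # Call correction: keep the tool, merge in the supplementary parameters.
        params = {**original_action["params"], **strategy["extra_params"]}
        action = {"type": "call_correction", "tool": original_action["tool"], "params": params}
    elif intent == "change_flow":
        # New tool call sequence replacing the original call.
        action = {"type": "new_tool_sequence", "calls": list(strategy["calls"])}
    elif intent == "insufficient_context":
        # Ask the user to clarify instead of calling any tool.
        action = {"type": "clarification_request", "question": strategy["question"]}
    else:
        raise ValueError(f"unknown intent: {intent}")
    action["state"] = "pending_execution"
    return action


fixed = corrected_action(
    {"intent": "supplement_parameters", "extra_params": {"certificate": "digital-signature"}},
    {"tool": "finance_query", "params": {"account": "A1"}},
)
print(fixed)
```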

[0310] In this embodiment, by strictly adhering to a structured strategy to generate the final action, the beneficial effect of a high degree of consistency between the secondary execution action and the diagnostic conclusion is achieved. This solves the technical problem of secondary distortion caused by the error correction action deviating from the error correction intention during the actual generation stage.

[0311] In some embodiments, the above-described action control method may further include the following steps: receiving a subsequent task instruction containing target modality data; when triggering a structured self-diagnosis and correction process for the subsequent task instruction, extracting subsequent risk features for the subsequent task instruction; parsing the modality-specific error features contained in the subsequent risk features; querying the node mapping relationship corresponding to the modality-specific error features in the pre-expanded cognitive graph, and converting the modality-specific error features into underlying logical error features; using the underlying logical error features as an index, retrieving the corresponding subsequent diagnosis strategy template from the pre-expanded cognitive graph; and generating a subsequent structured correction strategy for the target modality data based on the subsequent diagnosis strategy template through a large language model.

[0312] Target modal data refers to a sequence of input data that includes non-plain text formats such as images or audio.

[0313] As an example, generalized control for multimodal data can be achieved as follows: Receive subsequent task instructions containing target modality data. When the risk assessment value of the subsequent task instruction exceeds a preset risk threshold, triggering a structured self-diagnosis and correction process, perform feature mapping on the target modality data to extract subsequent risk features. Use a feature classifier to perform hierarchical parsing of the subsequent risk features, separating modality-specific error features bound to the target modality data format. Use these modality-specific error features as query objects and perform a query in a pre-expanded cognitive graph. Obtain the node mapping relationship corresponding to the modality-specific error features. Replace the data dimension labels according to the node mapping relationship, converting the modality-specific error features into cross-modality generalized underlying logical error features. Input the underlying logical error features as indexes into the pre-expanded cognitive graph. Retrieve matching subsequent diagnostic strategy templates. Extract the repair steps contained in the subsequent diagnostic strategy templates. Input the repair steps into a large language model for decoding, generating subsequent structured correction strategies to achieve action correction and control for the target modality data.

[0314] For example, a first follow-up task instruction containing first speech data is received. Upon triggering the structured self-diagnosis and correction process, a first follow-up risk feature is extracted from the first speech feature data. A first feature classifier is used to separate the first modality-specific error feature where the first speech noise exceeds the standard. A query is performed in the first pre-expanded cognitive graph to obtain the first node mapping relationship. Based on the first node mapping relationship, the first modality-specific error feature is converted into a first low-level logical error feature where key parameters are unreadable. The first low-level logical error feature is used as the first index input to the graph. A first follow-up diagnostic strategy template prompting the user to re-record is retrieved. The first large language model decodes and generates a first follow-up structured correction strategy requesting recording permission based on the first follow-up diagnostic strategy template.
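The conversion from modality-specific error features to underlying logical error features can be sketched as two chained lookups. The table contents are hypothetical; the audio entry follows the speech example above, while the image entry is added purely to show that different modalities can share one logical error.

```python
# Hypothetical node mapping: (modality, modality-specific error) -> logical error.
MODALITY_TO_LOGICAL = {
    ("audio", "noise_exceeds_limit"): "key_parameter_unreadable",
    ("image", "blurred_text"): "key_parameter_unreadable",   # illustrative only
}

# Hypothetical templates indexed by the modality-independent logical error.
FOLLOW_UP_TEMPLATES = {
    "key_parameter_unreadable": "prompt the user to provide the input again",
}


def follow_up_template(modality, modality_error):
    """Return (logical error, template), or None when no mapping is known."""
    logical_error = MODALITY_TO_LOGICAL.get((modality, modality_error))
    if logical_error is None:
        return None
    return logical_error, FOLLOW_UP_TEMPLATES.get(logical_error)


print(follow_up_template("audio", "noise_exceeds_limit"))
```

Because the templates are keyed by the logical error rather than by the modality, a template written for text-based failures can be reused for speech or image inputs once the mapping exists.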

[0315] In this embodiment, by extending the process and mapping physical modal features to unified logical error features, the beneficial effect of cross-modal smooth generalization of the large language model pre-correction control mechanism is achieved. This solves the technical problem that systems relying solely on text logic verification cannot identify and correct errors in complex visual and auditory representations.

[0316] For a macroscopic structural description of the method provided in the embodiments of this application, see Figure 4, which is a second structural diagram of the calibration system architecture for a large language model provided in the embodiments of this application. In the architecture shown in Figure 4, the large language model receives a task instruction containing the first query command together with the global observation state, and generates an initial action to be executed. This initial action is passed to a node that determines whether the risk assessment value exceeds a preset risk threshold. If the determination is yes, the cognitive graph is accessed based on the extracted risk features and the current context index. The cognitive graph feeds back the first diagnostic strategy template to the large language model, driving it to generate a corrected action to be executed that includes the first call correction action, which is then sent to the target execution environment. If the determination is no, the initial action is sent directly to the target execution environment. The target execution environment generates execution result feedback and passes it to the reward calculation logic module. The reward calculation logic module outputs a global reward value and passes it back to the large language model.

[0317] In summary, the calibration method and action control method for the large language model provided in this application enable the large language model to proactively predict and intercept risks before actions are issued, and generate structured correction strategies based on cognitive graphs, effectively avoiding potential operational risks. Simultaneously, the global reward value is calculated using the execution result feedback from the target execution environment as a feedback signal to iteratively update the network parameters of the large language model. This allows the large language model to dynamically optimize network parameters based on feedback evaluation and error correction knowledge, continuously improving the foresight of action decisions and the accuracy of tool invocation.

[0318] The following will describe an exemplary application of the embodiments of this application in a real-world application scenario.

[0319] The method provided in this application can be applied to an intelligent development assistant that provides services to software developers. In this scenario, the assistant (agent) needs to invoke external tools such as code analysis tools, debuggers, or version control systems based on the developer's natural language instructions (i.e., the task instructions mentioned above) to complete complex tasks such as code review, performance analysis, defect location, and automated testing. The method provided in this application enables the intelligent development assistant to proactively predict and avoid potential operational errors before executing tasks, learn and summarize abstract error correction experiences from failed interactions, dynamically optimize its reward mechanism to better align with the developer's intentions, and ultimately continuously improve its ability to autonomously solve problems through continuous learning, thereby providing developers with a more reliable, efficient, and intelligent automated programming assistance service.

[0320] See Figure 7, which is a third structural diagram of the calibration system architecture for a large language model provided in this application embodiment. The following description refers to the steps shown in Figure 7. The process is mainly divided into four stages.

[0321] The first stage is the meta-reflection and proactive avoidance stage, in which the agent performs initial action generation, risk assessment, proactive avoidance decision-making, and generates and executes corrective actions when necessary.

[0322] In step 401, user input is received (i.e., the above-mentioned response to the received task instruction).

[0323] In some embodiments, the intelligent development assistant receives a task instruction from the developer. For example, the developer enters the instruction: "Analyze the performance bottlenecks of this business logic code and find the function that is called most frequently."

[0324] In step 402, preliminary actions are generated (i.e., the initial actions to be executed as described above).

[0325] In some embodiments, the core of the Agent, a Large Language Model (LLM), generates an initial action based on the received user input and the current system state. This action is typically a structured tool invocation command. The process is represented by formula (4), in which the LLM maps three inputs to the initial action: the current user input or environmental observation (i.e., the current global observation state mentioned above), which here is the developer's instruction; the previous dialogue history (i.e., the dialogue history information mentioned above); and the system context (i.e., the system context information mentioned above), such as the currently open files and the project environment. In this scenario, a possible initial action is: tool_call(analyzer.run, file='entry.logic', mode='fast_mode').

[0326] In step 403, the context is perceived and uncertainty is quantified.

[0327] In some embodiments, before the initial action is actually executed, a context-aware module analyzes the potential risks and uncertainties of the action and calculates a risk score (i.e., the risk assessment value mentioned above). This process is represented by formula (5), in which the risk score is the output of a function used to calculate uncertainty (i.e., the objective function representing the execution uncertainty mentioned above), evaluated based on the model's internal state (i.e., the internal state information mentioned above) and historical experience (i.e., the historical interaction trajectory data mentioned above).

[0328] For example, the context-aware module discovers that the code referenced in the system context runs on an older system environment, and that the fast_mode invoked by the initial action has known compatibility issues in this older environment. Based on this, the uncertainty quantization module calculates a relatively high risk score, such as 0.85, where "relatively high" means greater than the preset risk threshold.

[0329] In step 404, an active avoidance decision is made.

[0330] In some embodiments, the risk score calculated in step 403 is compared with a preset risk threshold (i.e., the preset risk threshold mentioned above). This decision-making process is represented by formula (6). If the risk score is greater than the threshold (for example, a preset threshold of 0.7), the action is judged to be high-risk and low-confidence, which triggers meta-reflection, and the process proceeds to step 405. If the risk score is not greater than the threshold, the action is judged to be low-risk with a high confidence level (i.e., the confidence level mentioned above), and the process proceeds to step 408.

[0331] In step 405, a meta-reflection strategy is generated.

[0332] In some embodiments, because meta-reflection has been triggered, the Agent first queries its self-evolving cognitive knowledge graph (KG) (i.e., the pre-built cognitive graph mentioned above) to retrieve diagnostic information (i.e., the matching error diagnostic information mentioned above) and meta-reflection strategy templates (i.e., the diagnostic strategy templates mentioned above) related to the current context and the predicted error type (i.e., the error classification type mentioned above). This retrieval process is represented by formula (7), whose inputs are the current context and the error type, and whose outputs are the diagnostic information and the meta-reflection strategy template. The retrieval function represents a semantic-based knowledge retrieval process, whose computation mainly includes the following. First, the current context (i.e., the current context index mentioned above) and the error type are encoded into a query vector. Then, the "error pattern" node in the knowledge graph that best matches the query semantically is found by calculating vector similarity ("best match" means that the cosine similarity score between the node's vector representation and the query vector is the highest, indicating that the error pattern it represents is semantically closest to the problem currently encountered). Finally, once the most relevant node (i.e., the node with the highest score) is determined, the associated diagnostic information and strategy template are extracted along the knowledge edges of that node as output.
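The best-match selection by cosine similarity can be sketched as follows. The node layout (a dictionary with `vector`, `diagnosis`, and `template` keys) and the two-dimensional toy vectors are assumptions for illustration; in practice the vectors would come from an embedding model, which is outside the scope of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_error_pattern(query_vec, nodes):
    """Return the error-pattern node whose vector is most similar to the query."""
    if not nodes:
        return None
    return max(nodes, key=lambda node: cosine(query_vec, node["vector"]))


# Hypothetical error-pattern nodes with toy vectors.
nodes = [
    {"vector": [0.9, 0.1], "diagnosis": "mode incompatible with old environment",
     "template": "check compatibility, then switch mode"},
    {"vector": [0.0, 1.0], "diagnosis": "missing permission",
     "template": "request permission, then retry"},
]
match = best_error_pattern([1.0, 0.0], nodes)
print(match["template"])
```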

[0333] Subsequently, based on the diagnostic information and the meta-reflection strategy template, the LLM core generates a specific meta-reflection strategy (i.e., the structured correction strategy mentioned above). This process is represented by formula (8). For example, the generated strategy may read: "Reflection: The current environment is an older system version V1.0, and fast_mode may not be applicable. Compatibility should be checked first, or compatible_mode should be used instead. Also, directly analyzing the entire file may time out; focus should be placed on core functions."

[0334] In step 406, a correction action is generated.

[0335] In some embodiments, driven by the meta-reflection strategy, the LLM core generates one or more correction actions (i.e., the corrected actions to be executed mentioned above) at the current time step t. This process is represented by formula (9). For example, the generated correction action can replace the incompatible `mode='cProfile'` with the more compatible `mode='profile'` to address the potential risks of the initial action in step 402; at the same time, the correction action can add the focus_func='process_data' parameter based on the reflection strategy to narrow the analysis scope and improve execution efficiency.

[0336] In step 407, a correction action is performed.

[0337] In some embodiments, the Agent performs the correction action generated in step 406.

[0338] In step 408, the initial action is performed directly.

[0339] In some embodiments, step 408 is the other branch of the decision made in step 404. If the risk score is not greater than the threshold, the Agent directly executes the preliminary action generated in step 402.

[0340] In the second stage, the agent updates its cognitive knowledge graph based on the results of the action execution, thereby accumulating and evolving knowledge.

[0341] In step 409, the action result is fed back.

[0342] In some embodiments, the environment (i.e., the target execution environment described above) provides feedback on the result of the executed action, whether that action is the correction action or the initial action (i.e., the execution result feedback mentioned above). The feedback includes an indicator of whether the execution succeeded or failed (success and failure experience) and related output information. When the feedback indicates that the execution failed, the process returns to step 405 to generate a new meta-reflection strategy, thereby forming a feedback loop for iterative error correction.

[0343] To ensure the effectiveness of this feedback loop and the rational use of resources, the loop does not execute indefinitely but is strictly controlled by a series of loop termination conditions. Each time an execution failure is determined, and before the process returns to step 405, the following loop termination conditions are checked: 1) The number of consecutive failures for the current problem is compared with a preset threshold (e.g., 3 times). If the number of consecutive failures reaches the threshold, the loop terminates.

[0344] 2) When the number of retries reaches the limit, or when the current error type is determined by querying the knowledge graph (KG) to be beyond its known repair capabilities, the loop will terminate and actively report the reason for the failure to the user, requesting human intervention (Human-in-the-loop).

[0345] 3) Real-time token consumption and task execution time are monitored. Once either reaches its preset limit, the loop is forcibly terminated to prevent excessive consumption of computing resources on unsolvable tasks.

[0346] The loop returns to step 405 and continues only if no termination condition is triggered.
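The three termination checks above can be sketched as a single guard that runs before each return to step 405. This is a minimal illustration, not the application's implementation: the names `LoopState`, `LoopLimits`, `StopReason`, and `check_termination` are hypothetical, and every limit other than the example failure threshold of 3 is an assumed value.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class StopReason(Enum):
    CONSECUTIVE_FAILURES = "consecutive-failure threshold reached"
    NEEDS_HUMAN = "retry limit reached or error beyond known repair capability"
    BUDGET_EXHAUSTED = "token or time budget exhausted"


@dataclass
class LoopState:
    consecutive_failures: int
    retries: int
    error_type: str
    tokens_used: int
    elapsed_seconds: float


@dataclass
class LoopLimits:
    max_consecutive_failures: int = 3  # example threshold given in the text
    max_retries: int = 5               # hypothetical value
    max_tokens: int = 50_000           # hypothetical value
    max_seconds: float = 120.0         # hypothetical value


def check_termination(state: LoopState, limits: LoopLimits,
                      repairable_error_types: Set[str]) -> Optional[StopReason]:
    """Return a stop reason, or None if the loop may return to step 405."""
    # Condition 1: too many consecutive failures on the current problem.
    if state.consecutive_failures >= limits.max_consecutive_failures:
        return StopReason.CONSECUTIVE_FAILURES
    # Condition 2: retry limit reached, or the knowledge graph holds no
    # known repair for this error type (modelled here as a set lookup).
    if (state.retries >= limits.max_retries
            or state.error_type not in repairable_error_types):
        return StopReason.NEEDS_HUMAN
    # Condition 3: token or wall-clock budget exhausted.
    if (state.tokens_used >= limits.max_tokens
            or state.elapsed_seconds >= limits.max_seconds):
        return StopReason.BUDGET_EXHAUSTED
    return None
```

A `NEEDS_HUMAN` result corresponds to the case in which the failure reason is reported to the user and human intervention is requested.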

[0347] In step 410, error patterns and strategies are extracted.

[0348] In some embodiments, abstract error patterns (i.e., the abstract error occurrence patterns extracted from the error correction interaction feature sequence as described above) and effective repair strategies (i.e., the effective repair strategies corresponding to the abstract error occurrence patterns as described above) are extracted from the action results. This extraction process can be represented by the following formula:

$$(E_t, F_t) = f_{extract}(o_t, D_t) \quad (10)$$

where $E_t$ is the error pattern, $F_t$ is the repair strategy, $o_t$ is the action result, and $D_t$ is the specific error diagnosis information for the current time step t. The function $f_{extract}$ is not a simple arithmetic operation, but involves multiple steps of semantic abstraction and structured mapping. First, through semantic abstraction and feature dimensionality reduction, the reasoning capabilities of the LLM are used to transform specific error instances (such as specific filenames and parameters) into general error patterns (such as "API call parameter mismatch"). Second, knowledge triples are generated: the abstracted patterns are mapped into triple structures (error pattern, recommended strategy, correction action template) that can be stored in the knowledge graph. Finally, clustering and knowledge alignment are performed: the newly extracted patterns are vectorized and their similarity to existing nodes in the knowledge graph is calculated. If the similarity is high, the nodes are merged and updated; if the similarity is low, the patterns are inserted as new knowledge nodes.

[0349] For example, the extracted error pattern may be "using fast_mode in older environments", and the corresponding repair strategy may be "replace with compatible_mode or check the environment version first".
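The final alignment step of the extraction (vectorize, compare with existing nodes, then merge or insert) can be sketched as follows. This is an illustrative sketch under stated assumptions: the node layout `KnowledgeNode`, the function `align`, the cosine-similarity measure, the running-mean merge, and the 0.9 threshold are all hypothetical choices, since the text only says that high similarity leads to a merge and low similarity to a new node.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class KnowledgeNode:
    error_pattern: str
    strategy: str
    action_template: str
    embedding: List[float]
    support: int = 1  # number of observations merged into this node


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align(graph: List[KnowledgeNode], candidate: KnowledgeNode,
          threshold: float = 0.9) -> KnowledgeNode:
    """Merge the candidate into its nearest node, or insert it as a new node."""
    best = max(graph, key=lambda n: cosine(n.embedding, candidate.embedding),
               default=None)
    if best is not None and cosine(best.embedding, candidate.embedding) >= threshold:
        # High similarity: fold the candidate into the existing node by
        # taking a running mean of the embeddings (an assumed merge rule).
        n = best.support
        best.embedding = [(n * a + b) / (n + 1)
                          for a, b in zip(best.embedding, candidate.embedding)]
        best.support = n + 1
        return best
    # Low similarity: store the triple as a new knowledge node.
    graph.append(candidate)
    return candidate
```

In practice the embeddings would come from a text encoder; the two-dimensional vectors used to exercise this sketch are placeholders.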

[0350] In step 411, a self-evolving knowledge graph is constructed.

[0351] In some embodiments, the error pattern (denoted $E_t$) and the repair strategy (denoted $F_t$) extracted in step 410 are used to update the knowledge graph $KG_t$. The update process can be represented by the following formula:

$$KG_{t+1} = f_{update}(KG_t, E_t, F_t, C_t) \quad (11)$$

where the function $f_{update}$ performs the update through a "matching-weighting-merging" process: it merges similar knowledge based on semantic similarity, adjusts the credibility weights of knowledge edges based on success or failure feedback, and performs knowledge dimensionality enhancement through logical reasoning. $C_t$ represents the current modal context; it is computed by encoding, aligning, and compressing multimodal inputs into a high-dimensional vector, which supports cross-modal generalization of knowledge.

[0352] Next, the knowledge graph updated in step 411 is provided to step 405 as meta-reflective knowledge, where it can be queried and used by the function that generates the meta-reflection strategy. This forms a closed loop of knowledge accumulation and application, enabling the agent's meta-reflective capabilities to continuously improve with experience.

[0353] The third stage is evolutionary rewards and predictive feedback, which calculates multi-dimensional rewards based on action results and user intent, and dynamically adjusts the reward weights themselves.

[0354] In step 412, the multidimensional reward is calculated.

[0355] In some embodiments, a multi-dimensional reward $R^{multi}_t$ (i.e., the immediate reward values under the aforementioned preset multiple indicator dimensions) is calculated based on the action result output in step 409. The reward dimensions include the correctness of the tool call format (i.e., the reward value of the above-mentioned format correctness dimension), parameter accuracy (i.e., the reward value of the above-mentioned parameter accuracy dimension), and final task completion rate (i.e., the reward value of the above-mentioned final task completion rate dimension), among others. The calculation is expressed by the following formula:

$$R^{multi}_t = f_{reward}(a^{exec}_t, o_t) \quad (12)$$

where $a^{exec}_t$ represents the action executed and submitted to the environment at the current time step t; it may be the corrected action executed in step 407, or the initial action executed directly in step 408 when meta-reflection is not triggered. $o_t$ represents the action result obtained from step 409. This result includes not only an indicator of task success or failure, but may also contain detailed error codes, log output, or data returned upon successful execution, providing a concrete basis for reward calculation. $f_{reward}$ is the multi-dimensional reward calculation function. Its calculation is not a simple arithmetic operation, but a decomposed evaluation that scores each preset reward dimension separately. First, for the "format correctness" dimension, the function verifies whether the structure of the action $a^{exec}_t$ is valid, through syntactic analysis or comparison with predefined application programming interface (API) specifications. If it matches perfectly, this dimension receives a fixed positive reward value (e.g., +1.0); otherwise, it receives 0 or a negative value. Second, for the "parameter accuracy" dimension, the function parses the content of the environmental feedback $o_t$. If the feedback contains specific error codes or keywords such as parameter errors or file not found, this dimension receives a negative reward value (e.g., -0.5). Finally, for the "final task completion" dimension, the function evaluates whether $o_t$ shows that the user's ultimate goal was achieved. For example, if the feedback includes the expected analytical results, this dimension receives a larger positive reward (e.g., +5.0); if the feedback is a timeout or task anomaly, it receives a larger negative penalty (e.g., -5.0). By aggregating these independently calculated dimension scores, the function $f_{reward}$ ultimately outputs a structured set of rewards $R^{multi}_t$.
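The decomposed evaluation described for formula (12) can be illustrated with a small scoring function. This is a sketch under stated assumptions, not the application's reward function: the action is assumed to be a dictionary with `tool` and `args` keys, the API specification a mapping from tool name to allowed parameter names, and `ActionResult`, `multi_dim_reward`, and the keyword list are hypothetical. The numeric values are the examples given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Keywords treated as evidence of a parameter problem (assumed list).
PARAM_ERROR_MARKERS = ("parameter error", "file not found")


@dataclass
class ActionResult:
    success: bool
    output: str = ""
    error_code: str = ""
    timed_out: bool = False
    anomaly: bool = False


def multi_dim_reward(action: dict, result: ActionResult,
                     api_spec: Dict[str, Set[str]],
                     expected_marker: str) -> Dict[str, float]:
    """Score each reward dimension independently and return the set of scores."""
    # Format correctness: the tool exists and only uses parameters in its spec.
    allowed = api_spec.get(action.get("tool"))
    format_ok = allowed is not None and set(action.get("args", {})) <= allowed
    r_format = 1.0 if format_ok else 0.0

    # Parameter accuracy: penalize feedback that reports a parameter problem.
    feedback = f"{result.error_code} {result.output}".lower()
    r_param = -0.5 if any(m in feedback for m in PARAM_ERROR_MARKERS) else 0.0

    # Final task completion: reward expected results, penalize timeout/anomaly.
    if result.success and expected_marker in result.output:
        r_task = 5.0
    elif result.timed_out or result.anomaly:
        r_task = -5.0
    else:
        r_task = 0.0

    return {
        "format_correctness": r_format,
        "parameter_accuracy": r_param,
        "task_completion": r_task,
    }
```

Keeping the dimensions as separate entries, rather than summing them here, leaves the weighting to the later fusion step.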

[0356] At the same time, a predictive reward $R^{pred}_t$ (i.e., the predictive reward value mentioned above) is calculated, which evaluates the contribution of the current action to achieving the user's ultimate long-term goal (i.e., the final task goal mentioned above). This calculation can be expressed by the following formula:

$$R^{pred}_t = f_{value}(a^{exec}_t, I_{user}) \quad (13)$$

where $a^{exec}_t$ is the action executed at the current time step t, and $I_{user}$ is a structured internal representation of the user's fundamental intent. $I_{user}$ is not the user's raw text input, but a high-level semantic target derived by a model from analyzing and refining the entire dialogue history. For example, in a performance analysis task, $I_{user}$ may be represented as a vector or data structure that contains semantic features such as "optimizing efficiency" and "locating bottlenecks". $f_{value}$ is a function used for value prediction; in some embodiments, it is a pre-trained neural network model. This model learns from a large number of "action-intent-end result" data pairs, enabling it to estimate, from the current action $a^{exec}_t$ and the user's long-term intent $I_{user}$, the probability or contribution of this action to ultimate success (i.e., the target achievement expectation value mentioned above).

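[0357] The final total reward $R^{total}_t$ (i.e., the global reward value mentioned above) is a weighted sum of the rewards across all dimensions (i.e., a weighted fusion of the immediate reward values under the multiple indicator dimensions and the predictive reward value), as shown in the following formula:

$$R^{total}_t = \sum_{i} w_i \, r_{t,i} + w_{pred} \, R^{pred}_t \quad (14)$$

where $r_{t,i}$ is the immediate reward value of the i-th reward dimension, $w_i$ is the weight of the i-th reward dimension, $R^{pred}_t$ is the predictive reward value, and $w_{pred}$ is the weight of the predictive reward.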
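As a worked instance of this weighted fusion, the following sketch computes the total reward from a dictionary of per-dimension immediate rewards and a scalar predictive reward. The function name `global_reward`, the dimension names, and the numeric weights are illustrative assumptions; the text does not give concrete weight values.

```python
from typing import Dict


def global_reward(immediate: Dict[str, float], predictive: float,
                  weights: Dict[str, float], w_pred: float) -> float:
    """Weighted sum of per-dimension immediate rewards plus the predictive reward."""
    missing = set(immediate) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for dimensions: {sorted(missing)}")
    immediate_part = sum(weights[dim] * value for dim, value in immediate.items())
    return immediate_part + w_pred * predictive


# Example with assumed weights: 0.2*1.0 + 0.3*(-0.5) + 0.5*5.0 + 0.5*0.8
example = global_reward(
    {"format_correctness": 1.0, "parameter_accuracy": -0.5, "task_completion": 5.0},
    predictive=0.8,
    weights={"format_correctness": 0.2, "parameter_accuracy": 0.3, "task_completion": 0.5},
    w_pred=0.5,
)
```

With these assumed inputs the four terms are 0.2, -0.15, 2.5, and 0.4, so `example` is approximately 2.95.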

[0358] In step 413, the reward weights evolve automatically.

[0359] In some embodiments, the reward weights $w_i$ and $w_{pred}$ used in step 412 are not fixed, but are dynamically and adaptively adjusted based on performance and knowledge graph updates (i.e., the aforementioned dynamic adjustment of the multiple immediate reward weights for the multiple indicator dimensions and of the predictive weight for the predictive reward value). This evolutionary process can be represented by the following formula:

$$(w_i^{new}, w_{pred}^{new}) = f_{evolve}(w_i, w_{pred}, P_t, \Delta KG_t) \quad (15)$$

where $P_t$ represents the agent's performance at the current time step t, and $\Delta KG_t$ represents information about the changes that have occurred in the knowledge graph at the current time step t. $\Delta KG_t$ does not refer to the entire updated knowledge graph, but to structured data containing incremental updates such as "what new error patterns have been added" or "which remediation strategies have proven effective." For example, it might indicate the discovery of a new "high-frequency and subtle" API misuse pattern. $f_{evolve}$ is a meta-learning function or a set of update rules used for the evolution of the reward weights. Its core role is to adjust the focus of the reward strategy according to $P_t$ and $\Delta KG_t$. In some embodiments, its calculation logic can be expressed as follows: when $\Delta KG_t$ shows that a certain type of error (such as API misuse) has become more prominent or has been identified as more important, $f_{evolve}$ accordingly increases the weight of the reward dimensions associated with that error (such as "format correctness"), so that the agent pays more attention to avoiding such errors in future learning; when $P_t$ shows that the agent consistently performs poorly on a certain dimension (such as "task completion") while the predictive reward remains very high, the weight $w_{pred}$ of the predictive reward may be reduced appropriately. This encourages the agent to focus more on the immediate success of the current task, achieving a balance between short-term and long-term goals.
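The two adjustment rules described for the weight evolution can be sketched as a rule-based update. This is only one possible reading: the text allows either a meta-learning function or a set of update rules, and the function name `evolve_weights`, the step sizes, the thresholds, and the final renormalization (so that all weights sum to 1) are assumptions introduced here for illustration.

```python
from typing import Dict, Iterable, Tuple


def evolve_weights(weights: Dict[str, float], w_pred: float,
                   prominent_dims: Iterable[str],
                   dim_performance: Dict[str, float],
                   mean_predictive: float,
                   boost: float = 0.1, decay: float = 0.1,
                   poor_threshold: float = 0.0,
                   high_predictive: float = 0.8) -> Tuple[Dict[str, float], float]:
    """Return updated (dimension weights, predictive weight), normalized to sum to 1."""
    new_weights = dict(weights)

    # Rule 1: knowledge-graph changes flag certain error types as more
    # prominent, so the reward dimensions tied to them gain weight.
    for dim in prominent_dims:
        if dim in new_weights:
            new_weights[dim] += boost

    # Rule 2: if some dimension keeps scoring poorly while the predictive
    # reward stays high, shift emphasis away from the predictive reward.
    lagging = any(score < poor_threshold for score in dim_performance.values())
    new_w_pred = w_pred
    if lagging and mean_predictive >= high_predictive:
        new_w_pred = max(0.0, w_pred - decay)

    # Assumed normalization step, so repeated updates cannot grow unboundedly.
    total = sum(new_weights.values()) + new_w_pred
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {k: v / total for k, v in new_weights.items()}, new_w_pred / total
```

Here `prominent_dims` stands in for the incremental knowledge graph information, and `dim_performance` with `mean_predictive` stands in for the performance signal.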

[0360] Through the evolutionary process in step 413, the reward mechanism itself achieves adaptive adjustment, enabling it to more accurately guide the agent to learn in a constantly changing environment and cognition.

[0361] In some embodiments, a tight bidirectional feedback loop is formed between steps 412 and 413. Specifically, the total reward calculated in step 412 and its per-dimension rewards serve as a learning adjustment signal input to step 413, providing a performance basis for the self-evolution of the reward weights. Conversely, the updated reward weights generated by step 413 through formula (15) are used immediately in the subsequent reward calculations of step 412, and thus directly affect the total reward in the next round. This rapid internal loop ensures that the reward mechanism can be fine-tuned in real time based on the performance of the current interaction, enabling immediate optimization of the reward strategy.

[0362] The fourth stage is the continuous learning and strategy optimization stage, where the agent uses the generated reward signals and contextual information to optimize its core policy model through reinforcement learning.

[0363] In step 414, the Agent strategy is optimized (i.e., the network parameters of the large language model are iteratively updated as described above).

[0364] In some embodiments, the total reward calculated in step 412 is provided to step 414 as the reward signal. Using experience that includes this reward signal, the parameters $\theta$ of the LLM core (i.e., the network parameters mentioned above) are optimized through reinforcement learning (RL) algorithms such as Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) or Group Sequence Policy Optimization (GSPO). This optimization process can be represented by the following formula:

$$\theta^{new} = f_{RL}(\theta, \mathcal{D}_t, KG_{t+1}) \quad (16)$$

where $\mathcal{D}_t$ represents the collected experience data (i.e., the policy optimization samples mentioned above), and $KG_{t+1}$ is the updated knowledge graph. $KG_{t+1}$ plays a key role here: through dynamic reward shaping, it makes the reward signal reflect newly discovered risks more accurately; through state augmentation, it provides context for reinforcement learning; and it guides the generation of higher-quality interaction experience, thereby improving the efficiency and effectiveness of policy optimization.

[0365] In step 415, the Agent policy is updated.

[0366] In some embodiments, after optimization in step 414, updated model parameters are obtained. These new parameters are applied to the core LLM of the Agent, thus forming a new and more powerful Agent strategy.

[0367] This updated agent strategy constitutes another key feedback loop. Specifically, when new user input reaches step 402 in the next round of interaction, the LLM core used to "generate the initial action" is already an optimized model. This means that the agent not only improves its error correction and reflection capabilities, but also fundamentally improves the quality and foresight of its initial action generation from the beginning, thereby enabling it to more proactively avoid historical failure patterns and achieve continuous evolution of its overall intelligence level.

[0368] The following continues to describe the exemplary structure of the large language model calibration device 133 provided in the embodiments of this application when implemented as software modules. In some embodiments, as shown in Figure 2A, the software modules of the calibration device 133 of the large language model stored in the memory 130 may include: the first response module 1331, used to respond to the received task instruction and generate an initial action to be executed through the large language model in combination with the current global observation state; the first calculation module 1332, used to calculate the confidence level of the initial action to be executed under the global observation state, and to perform a mapping calculation on the confidence level to obtain a risk assessment value; the first interception module 1333, used to intercept the initial action to be executed when the risk assessment value exceeds a preset risk threshold, extract the risk features of the initial action to be executed, and retrieve a matching diagnostic strategy template from the pre-constructed cognitive graph using the global observation state and the risk features as indexes; the first correction module 1334, used to generate a structured correction strategy based on the retrieved diagnostic strategy template through the large language model, generate a corrected action to be executed according to the structured correction strategy, and send it to the target execution environment; the first feedback module 1335, used to obtain the execution result feedback of the target execution environment, calculate the immediate reward values of the execution result feedback under multiple preset indicator dimensions, quantify, in combination with the final task goal, the target achievement expectation value of the corrected action to be executed for achieving the final task goal, and map the quantified target achievement expectation value to a predictive reward value; and the first update module 1336, used to perform weighted fusion of the immediate reward values under the multiple indicator dimensions and the predictive reward value to obtain a global reward value, and to use the global reward value as a feedback signal to iteratively update the network parameters of the large language model.

[0369] In some embodiments, the first response module 1331 is further configured to: process the task instruction, the dialogue history information in the global observation state, and the system context information through the large language model; and generate the initial action to be executed, which includes a tool call instruction, a text reply, or an intent clarification request action.

[0370] In some embodiments, the first calculation module 1332 is further configured to: acquire the internal state information and historical interaction trajectory data of the large language model; construct an objective function representing execution uncertainty based on the internal state information and the historical interaction trajectory data; perform a mapping calculation on the confidence level using the objective function representing execution uncertainty; and determine the result of the mapping calculation as the risk assessment value.
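One way to picture the mapping from confidence to a risk assessment value is sketched below. The application does not fix the form of the objective function, so everything here is an assumption for illustration: confidence is taken as the geometric mean of token probabilities (internal state information), the historical interaction trajectory is summarized as a failure rate for similar actions, and the risk is a convex combination of the two with a hypothetical mixing factor `alpha`.

```python
import math
from typing import Sequence


def action_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the generated action (assumed proxy)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def risk_score(confidence: float, historical_failure_rate: float,
               alpha: float = 0.7) -> float:
    """Map confidence to risk, blended with the failure rate of similar past actions."""
    c = min(max(confidence, 0.0), 1.0)
    h = min(max(historical_failure_rate, 0.0), 1.0)
    # Low confidence and a poor track record both push the risk towards 1.
    return alpha * (1.0 - c) + (1.0 - alpha) * h


def should_intercept(risk: float, threshold: float = 0.5) -> bool:
    """Intercept only when the risk strictly exceeds the preset threshold."""
    return risk > threshold
```

The strict comparison mirrors the branch in which an action whose risk does not exceed the threshold is sent directly to the target execution environment.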

[0371] In some embodiments, the first interception module 1333 is further configured to: calculate the matching degree between the initial action to be executed and the system context information; identify structural defects or parameter defects existing in the initial action to be executed; identify logical consistency defects or tool invocation defects existing in the initial action to be executed based on the matching degree; extract abnormal features of at least one of the structural defects, the parameter defects, the logical consistency defects, and the tool invocation defects; map the abnormal features to the corresponding expected execution exception types; and determine the expected execution exception types as the risk features.

[0372] In some embodiments, the first interception module 1333 is further configured to: send the initial action to be executed to the target execution environment when the risk assessment value does not exceed the preset risk threshold, so as to execute the initial action to be executed in the target execution environment.

[0373] In some embodiments, the first interception module 1333 is further configured to: determine the global observation state as the current context index; obtain the error classification type corresponding to the risk feature; and perform a retrieval operation in the cognitive graph based on the current context index and the error classification type to obtain matching error diagnosis information and the diagnosis strategy template.
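The retrieval described above, keyed by the current context index and the error classification type, can be sketched with an in-memory list standing in for the cognitive graph. The entry layout `GraphEntry`, the function `retrieve_template`, the representation of the global observation state as a set of tags, and the overlap-count ranking are hypothetical simplifications; a real graph store and matching rule may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class GraphEntry:
    error_type: str
    context_tags: FrozenSet[str]
    diagnosis: str
    strategy_template: str


def retrieve_template(graph: List[GraphEntry], context_index: FrozenSet[str],
                      error_type: str) -> Optional[GraphEntry]:
    """Return the entry of the given error type whose context overlaps most."""
    candidates = [e for e in graph if e.error_type == error_type]
    if not candidates:
        return None  # no known diagnosis for this error classification type
    return max(candidates, key=lambda e: len(e.context_tags & context_index))
```

The returned entry carries both the matching error diagnosis information and the diagnosis strategy template, which matches the two outputs named in the paragraph.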

[0374] In some embodiments, the first correction module 1334 is further configured to: generate error analysis logic based on the diagnostic strategy template; generate error evidence search path; generate logical steps for correcting call parameters, or generate logical steps for replanning the tool call flow; and assemble the error analysis logic, the error evidence search path, and the logical steps into the structured correction strategy.

[0375] In some embodiments, the first correction module 1334 is further configured to: generate a call correction action, generate a new tool call sequence action, or generate an intent clarification request action to request intent clarification from the user, according to the reflection process indicated by the structured correction strategy; and determine the call correction action, the new tool call sequence action, or the intent clarification request action as the corrected action to be executed.

[0376] In some embodiments, the first correction module 1334 is further configured to: obtain the error diagnosis information when the execution result feedback indicates that the corrected action to be executed has been successfully executed; concatenate the initial action to be executed, the error diagnosis information, and the corrected action to be executed in chronological order to obtain an error correction interaction feature sequence; extract an abstract error occurrence pattern from the error correction interaction feature sequence; and extract an effective repair strategy corresponding to the abstract error occurrence pattern from the error correction interaction feature sequence.

[0377] In some embodiments, the first correction module 1334 is further configured to: parse the error occurrence type from the abstract error occurrence pattern; parse the failure cause features from the abstract error occurrence pattern; parse the effective error correction practice features from the effective repair strategy; compare the error occurrence type, the failure cause features, and the effective error correction practice features with existing historical data nodes in the cognitive graph; based on the comparison results, perform a knowledge feature fusion operation on the error occurrence type, the failure cause features, and the effective error correction practice features, or perform a data conflict resolution operation; persistently store the features after the knowledge feature fusion operation or the data conflict resolution operation in the cognitive graph; and expand the data nodes of the cognitive graph through the persistently stored features.

[0378] In some embodiments, the first correction module 1334 is further configured to: receive a subsequent task instruction containing target modality data; when a structured self-diagnosis and correction process is triggered for the subsequent task instruction, extract subsequent risk features for the subsequent task instruction; parse the modality-specific error features contained in the subsequent risk features; query the node mapping relationship corresponding to the modality-specific error features in the expanded cognitive graph, and convert the modality-specific error features into underlying logical error features; use the underlying logical error features as an index to retrieve the corresponding subsequent diagnosis strategy template from the expanded cognitive graph; and generate a subsequent structured correction strategy for the target modality data based on the subsequent diagnosis strategy template through the large language model.

[0379] In some embodiments, the first feedback module 1335 is further configured to: parse the environmental response status included in the execution result feedback; verify the conformity of the corrected action to be executed with the application programming interface specification based on the environmental response status; determine the reward value for the format correctness dimension based on the conformity; verify the correctness and completeness of the parameter values included in the corrected action to be executed based on the environmental response status; determine the reward value for the parameter accuracy dimension based on the correctness and completeness; verify the target tool program called by the corrected action to be executed based on the environmental response status; determine the reward value for the tool selection accuracy dimension based on the verification result of the target tool program; compare the consistency between the corrected action to be executed and the structured correction strategy in terms of diagnostic logic; determine the reward value for the reflective semantic consistency dimension based on the consistency; evaluate the contribution of the corrected action to be executed to the final task objective based on the execution result feedback; determine the reward value for the final task completion dimension based on the contribution; and combine the reward values for the format correctness dimension, the parameter accuracy dimension, the tool selection accuracy dimension, the reflective semantic consistency dimension, and the final task completion dimension to obtain the immediate reward values under the multiple indicator dimensions.

[0380] In some embodiments, the first feedback module 1335 is further configured to: extract global user intent features contained in the final task objective; predict the probability distribution of subsequent multi-round state transitions triggered by the corrected action to be executed in the external environment; and calculate the matching degree between the subsequent multi-round state transition probability distribution and the global user intent features, as the expected value for achieving the objective.
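The matching degree between the predicted state-transition distribution and the user intent features can be illustrated as an expected similarity. This sketch assumes a discrete set of candidate successor states, each with a feature vector in the same space as the intent vector, and uses clipped cosine similarity as the matching measure. None of these choices, nor the names `goal_achievement_expectation` and `cosine`, are specified by the application.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def goal_achievement_expectation(transition_probs: Dict[str, float],
                                 state_features: Dict[str, List[float]],
                                 intent: List[float]) -> float:
    """Expected match between predicted successor states and the user intent."""
    total = sum(transition_probs.values())
    if total <= 0:
        raise ValueError("transition probabilities must have a positive sum")
    # Probability-weighted similarity; negative similarity is clipped to 0
    # so the result stays in [0, 1].
    return sum((p / total) * max(0.0, cosine(state_features[s], intent))
               for s, p in transition_probs.items())
```

For instance, if a state aligned with the intent has probability 0.75 and an orthogonal state has probability 0.25, the expectation is 0.75.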

[0381] In some embodiments, the first update module 1336 is further configured to: acquire performance indicators, satisfaction feedback, and node evolution characteristics of the cognitive graph; dynamically adjust the weights of multiple immediate rewards for the multiple indicator dimensions and the predictive weights for the predictive reward value based on the performance indicators, the satisfaction feedback, and the node evolution characteristics; and use the dynamically adjusted weights of multiple immediate rewards and the predictive weights to perform a weighted summation of the immediate reward value and the predictive reward value to obtain the global reward value.

[0382] In some embodiments, the first update module 1336 is further configured to: combine the global observation state, the corrected action to be executed, and the global reward value; store the combination result in a structured manner as a policy optimization sample; perform a sampling operation on the stored policy optimization sample; calculate the network parameter update gradient based on the sampled policy optimization sample; and adjust the network parameters of the large language model according to the network parameter update gradient.
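The storage and sampling of policy optimization samples can be sketched as a small bounded buffer. This is a minimal illustration under stated assumptions: `PolicySample`, `SampleBuffer`, and `advantages` are hypothetical names, observations and actions are reduced to strings, and the mean-baseline advantage is one common way to turn sampled rewards into a training signal. The gradient computation itself depends on the model and RL algorithm and is not shown.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PolicySample:
    observation: str   # global observation state (serialized)
    action: str        # corrected action to be executed (serialized)
    reward: float      # global reward value


class SampleBuffer:
    """Bounded store of policy optimization samples with uniform sampling."""

    def __init__(self, capacity: int = 1000, seed: Optional[int] = None) -> None:
        self._items = deque(maxlen=capacity)  # oldest samples are evicted first
        self._rng = random.Random(seed)

    def add(self, sample: PolicySample) -> None:
        self._items.append(sample)

    def sample(self, k: int) -> List[PolicySample]:
        k = min(k, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


def advantages(batch: List[PolicySample]) -> List[float]:
    """Reward minus the batch mean, a simple baseline for the gradient step."""
    if not batch:
        return []
    baseline = sum(s.reward for s in batch) / len(batch)
    return [s.reward - baseline for s in batch]
```

The resulting advantages would then weight the log-probability gradients of the sampled actions in whichever policy optimization algorithm is used.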

[0383] The following continues to describe the exemplary structure of the action control device 134 for the large language model provided in the embodiments of this application when implemented as software modules. In some embodiments, as shown in Figure 2B, the software modules of the action control device 134 of the large language model stored in the memory 130 may include: the second response module 1341, used to respond to the received task instruction and generate an initial action to be executed through the large language model in combination with the current global observation state; the second calculation module 1342, used to calculate the confidence level of the initial action to be executed under the global observation state, and to perform a mapping calculation on the confidence level to obtain a risk assessment value; the second interception module 1343, used to intercept the initial action to be executed when the risk assessment value exceeds a preset risk threshold, extract the risk features of the initial action to be executed, and retrieve a matching diagnostic strategy template from the pre-constructed cognitive graph using the global observation state and the risk features as indexes; and the second correction module 1344, used to generate a structured correction strategy based on the retrieved diagnostic strategy template through the large language model, generate a corrected action to be executed according to the structured correction strategy, and send it to the target execution environment.

[0384] In some embodiments, the second response module 1341 is further configured to: process the task instruction, the dialogue history information in the global observation state, and the system context information through the large language model; and generate the initial action to be executed, which includes a tool call instruction, a text reply, or an intent clarification request action.

[0385] In some embodiments, the second calculation module 1342 is further configured to: acquire the internal state information and historical interaction trajectory data of the large language model; construct an objective function representing execution uncertainty based on the internal state information and the historical interaction trajectory data; perform a mapping calculation on the confidence level using the objective function representing execution uncertainty; and determine the result of the mapping calculation as the risk assessment value.

[0386] In some embodiments, the second interception module 1343 is further configured to: calculate the matching degree between the initial action to be executed and the system context information; identify structural defects or parameter defects existing in the initial action to be executed; identify logical consistency defects or tool invocation defects existing in the initial action to be executed based on the matching degree; extract abnormal features of at least one of the structural defects, the parameter defects, the logical consistency defects, and the tool invocation defects; map the abnormal features to the corresponding expected execution exception types; and determine the expected execution exception types as the risk features.

[0387] In some embodiments, the second interception module 1343 is further configured to: send the initial action to be executed to the target execution environment when the risk assessment value does not exceed the preset risk threshold, so as to execute the initial action to be executed in the target execution environment.

[0388] In some embodiments, the second interception module 1343 is further configured to: determine the global observation state as the current context index; obtain the error classification type corresponding to the risk feature; and perform a retrieval operation in the cognitive graph based on the current context index and the error classification type to obtain matching error diagnosis information and the diagnosis strategy template.

[0389] In some embodiments, the second correction module 1344 is further configured to: generate error analysis logic based on the diagnostic strategy template; generate an error evidence search path; generate logical steps for correcting call parameters, or generate logical steps for replanning the tool call flow; and assemble the error analysis logic, the error evidence search path, and the logical steps into the structured correction strategy.

[0390] In some embodiments, the second correction module 1344 is further configured to: generate a call correction action, generate a new tool call sequence action, or generate an intent clarification request action to request intent clarification from the user, according to the reflection process indicated by the structured correction strategy; and determine the call correction action, the new tool call sequence action, or the intent clarification request action as the corrected action to be executed.

[0391] In some embodiments, the second correction module 1344 is further configured to: receive a subsequent task instruction containing target modality data; when a structured self-diagnosis and correction process is triggered for the subsequent task instruction, extract subsequent risk features for the subsequent task instruction; parse the modality-specific error features contained in the subsequent risk features; query the node mapping relationship corresponding to the modality-specific error features in the pre-expanded cognitive graph, and convert the modality-specific error features into underlying logical error features; use the underlying logical error features as an index to retrieve the corresponding subsequent diagnosis strategy template from the pre-expanded cognitive graph; and generate a subsequent structured correction strategy for the target modality data based on the subsequent diagnosis strategy template through the large language model.

[0392] This application provides a computer program product, which includes a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of an electronic device reads the computer-executable instructions from the computer-readable storage medium and executes the computer-executable instructions, causing the electronic device to perform the large language model calibration method or the large language model action control method described in this application embodiment.

[0393] This application provides a computer-readable storage medium storing computer-executable instructions or a computer program. When the computer-executable instructions or the computer program are executed by a processor, the processor executes the calibration method or action control method for a large language model provided in this application, for example, the calibration method of the large language model shown in Figure 3A, or the action control method of the large language model shown in Figure 3E.

[0394] In some embodiments, the computer-readable storage medium may be a memory such as RAM, ROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above-mentioned memories.

[0395] In some embodiments, computer-executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.

[0396] As an example, computer-executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple co-located files (e.g., files that store one or more modules, subroutines, or code sections).

[0397] As an example, computer-executable instructions can be deployed to execute on a single electronic device, or on multiple electronic devices located at one location, or on multiple electronic devices distributed across multiple locations and interconnected via a communication network.

[0398] In summary, in the calibration method and action control method of the large language model provided in this application, an initial action to be executed is generated through the large language model in response to the received task instruction and in combination with the current global observation state. The confidence of the initial action to be executed under the global observation state is calculated and mapped to a risk assessment value. If the risk assessment value exceeds a preset risk threshold, the initial action to be executed is intercepted and its risk features are extracted. This allows potential operational errors and logical defects to be predicted and avoided before the action is issued to the target execution environment. Subsequently, using the global observation state and the risk features as indexes, a matching diagnostic strategy template is retrieved from the pre-constructed cognitive graph, a structured correction strategy is generated through the large language model, and a corrected action to be executed is generated based on the structured correction strategy. This uses historical diagnostic experience to guide the current execution process and improves the accuracy of external tool calls and behavioral reliability. After the execution result feedback is obtained from the target execution environment, the immediate reward values of the execution result feedback under multiple preset indicator dimensions are calculated. In combination with the final task objective, the expected value of goal achievement of the corrected action to be executed with respect to the final task objective is quantified and mapped to a predictive reward value. The immediate reward values under the multiple indicator dimensions and the predictive reward value are then weighted and fused to obtain the global reward value, which provides a multi-dimensional quantitative evaluation of both the short-term compliance and the long-term goal contribution of the executed action. Finally, the global reward value is used as a feedback signal to iteratively update the network parameters of the large language model, so that the large language model continuously extracts effective repair strategies from the interaction sequence and its knowledge evolves dynamically. This forms a closed loop from action evaluation and active avoidance to adaptive policy optimization, and continuously improves the large language model's ability to autonomously process complex tasks and its forward-looking stability.
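The interception gate at the start of this loop can be sketched as follows. This is only an illustration under stated assumptions: the confidence is taken to be the geometric mean of token probabilities, and the mapping to risk is taken to be `1 - confidence ** sharpness`. Neither formula, nor the names `action_confidence`, `risk_from_confidence`, and `should_intercept`, comes from this application.

```python
import math
from typing import Sequence


def action_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence: geometric mean of the action's token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def risk_from_confidence(confidence: float, sharpness: float = 1.0) -> float:
    """Assumed mapping: risk decreases monotonically as confidence increases."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return 1.0 - confidence ** sharpness


def should_intercept(risk: float, threshold: float = 0.5) -> bool:
    """Intercept only when the risk strictly exceeds the preset threshold."""
    return risk > threshold
```

An action whose risk equals the threshold is passed through in this sketch, which matches the "exceeds a preset risk threshold" wording.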

[0399] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.

Claims

1. A calibration method for a large language model, characterized in that, The method includes: In response to the received task instruction, and in conjunction with the current global observation state, the initial action to be executed is generated through the large language model; Calculate the confidence level of the initial action to be executed under the global observation state, and perform a mapping calculation on the confidence level to obtain a risk assessment value; If the risk assessment value exceeds a preset risk threshold, the initial action to be executed is intercepted, and the risk features of the initial action to be executed are extracted; Using the global observation state and the risk features as indexes, a matching diagnostic strategy template is retrieved from the pre-constructed cognitive graph; Based on the retrieved diagnostic strategy template, a structured correction strategy is generated through the large language model, and a corrected action to be executed is generated according to the structured correction strategy and sent to the target execution environment; Obtain the execution result feedback of the target execution environment, calculate the immediate reward value of the execution result feedback under multiple preset indicator dimensions, and, in combination with the final task objective, quantify the expected value of goal achievement of the corrected action to be executed with respect to the final task objective, and map the quantified expected value of goal achievement into a predictive reward value; The immediate reward values under the multiple indicator dimensions are weighted and fused with the predictive reward value to obtain the global reward value; The global reward value is used as a feedback signal to iteratively update the network parameters of the large language model.

2. The method according to claim 1, characterized in that, The process of generating initial actions to be executed using the large language model, based on the current global observation state, includes: The task instructions, dialogue history information in the global observation state, and system context information are processed through the large language model. Generate the initial action to be executed, which includes a tool invocation command, a text response, or an intent clarification request.

3. The method according to claim 2, characterized in that, The process of mapping the confidence level to obtain the risk assessment value includes: Obtain the internal state information and historical interaction trajectory data of the large language model; Construct an objective function representing execution uncertainty based on the internal state information and the historical interaction trajectory data; The confidence level is mapped using the objective function representing the uncertainty, and the result of the mapping calculation is determined as the risk assessment value.

4. The method according to claim 1, characterized in that, The extraction of risk characteristics of the initial action to be executed includes: Calculate the matching degree between the initial action to be executed and the system context information; Identify structural or parameter defects in the initial action to be executed; Based on the matching degree, identify logical consistency defects or tool invocation defects in the initial action to be executed; Extract the abnormal features of at least one of the structural defects, parameter defects, logical consistency defects, and tool invocation defects; Map the abnormal features to the corresponding expected execution exception types; The expected execution anomaly type is identified as the risk characteristic.

5. The method according to claim 4, characterized in that, The method further includes: If the risk assessment value does not exceed the preset risk threshold, the initial action to be executed is sent to the target execution environment so that the initial action to be executed is performed in the target execution environment.

6. The method according to claim 1, characterized in that, The step of retrieving a matching diagnostic strategy template from the pre-constructed cognitive graph, using the global observation state and the risk features as indexes, includes: The global observation state is determined as the current context index; Obtain the error classification type corresponding to the risk features; Based on the current context index and the error classification type, a retrieval operation is performed in the cognitive graph to obtain a matching diagnostic strategy template.
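The retrieval in claim 6 can be sketched as a filter on the error classification type followed by a ranking on context similarity. The sketch assumes that the context index can be reduced to a set of tags and that Jaccard similarity is an acceptable ranking; the class `TemplateNode`, the function `retrieve_template`, and the similarity measure are illustrative choices, not details from this application.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class TemplateNode:
    """Hypothetical cognitive-graph node holding one diagnostic strategy template."""
    error_type: str
    context_tags: FrozenSet[str]
    template: str


def retrieve_template(
    nodes: Sequence[TemplateNode],
    context_tags: FrozenSet[str],
    error_type: str,
) -> Optional[TemplateNode]:
    """Filter by error classification type, then rank by context similarity."""
    candidates = [n for n in nodes if n.error_type == error_type]
    if not candidates:
        return None

    def jaccard(node: TemplateNode) -> float:
        union = node.context_tags | context_tags
        return len(node.context_tags & context_tags) / len(union) if union else 0.0

    return max(candidates, key=jaccard)
```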

7. The method according to claim 6, characterized in that, The step of generating a structured correction strategy based on the retrieved diagnostic strategy template using the large language model includes: Based on the diagnostic strategy template, error analysis logic is generated; Generate a path to find erroneous evidence; Generate the logical steps for correcting the call parameters, or generate the logical steps for replanning the tool call flow; The error analysis logic, the error evidence search path, and the logical steps are assembled into the structured correction strategy.

8. The method according to claim 7, characterized in that, The step of generating the corrected action to be executed based on the structured correction strategy includes: Following the reflection process indicated by the structured correction strategy, generate a call correction action, generate a new tool call sequence action, or generate an intent clarification request action to request intent clarification from the user; The call correction action, the new tool call sequence action, or the intent clarification request action is determined as the corrected action to be executed.

9. The method according to claim 8, characterized in that, The method further includes: If the execution result feedback indicates that the corrected action to be executed was successfully executed, a retrieval operation is performed in the cognitive graph based on the current context index and the error classification type to obtain matching error diagnosis information; By sequentially concatenating the initial action to be executed, the error diagnosis information, and the corrected action to be executed, an error correction interaction feature sequence is obtained. Extract abstract error occurrence patterns from the error correction interaction feature sequence; Effective repair strategies corresponding to the abstract error occurrence patterns are extracted from the error correction interaction feature sequence.

10. The method according to claim 9, characterized in that, The method further includes: The error occurrence type is parsed from the abstract error occurrence pattern; Extract failure cause characteristics from the abstract error occurrence pattern; Effective error correction practice characteristics are extracted from the effective repair strategies described above; Compare the error occurrence type, the failure cause characteristics, and the effective error correction practice characteristics with existing historical data nodes in the cognitive graph; Based on the comparison results, a knowledge feature fusion operation is performed on the error occurrence type, the failure cause characteristics, and the effective error correction practice characteristics, or a data conflict resolution operation is performed. The features after the knowledge feature fusion operation or the data conflict resolution operation are persistently stored in the cognitive graph; The data nodes of the cognitive graph are expanded by persistently storing the features.
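The choice between knowledge feature fusion and data conflict resolution in claim 10 can be sketched with a simple support counter. The rule used here (matching practices are fused, and a conflicting practice replaces the stored one only after the stored support is used up) is an assumption made for illustration; the claim does not specify how conflicts are resolved. The function `upsert_experience` and the node layout are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (error occurrence type, failure cause feature)


def upsert_experience(
    graph: Dict[Key, Dict[str, object]],
    error_type: str,
    cause: str,
    practice: str,
) -> str:
    """Compare with existing nodes, then insert, fuse, or resolve a conflict."""
    key = (error_type, cause)
    node = graph.get(key)
    if node is None:
        graph[key] = {"practice": practice, "support": 1}  # expands the graph
        return "inserted"
    if node["practice"] == practice:
        node["support"] += 1  # knowledge feature fusion
        return "fused"
    node["support"] -= 1  # data conflict: weaken the stored practice
    if node["support"] <= 0:
        graph[key] = {"practice": practice, "support": 1}
        return "replaced"
    return "kept_existing"
```

A plain dictionary stands in for persistent storage here; a real cognitive graph would also need durable writes.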

11. The method according to claim 10, characterized in that, The method further includes: Receive subsequent task instructions containing target modal data; When the structured self-diagnosis and correction process is triggered in response to the subsequent task instructions, the subsequent risk characteristics of the subsequent task instructions are extracted. Analyze the modality-specific error features contained in the subsequent risk features; The mapping relationship between nodes in the expanded cognitive graph and the modality-specific error features is queried, and the modality-specific error features are converted into underlying logical error features. Using the underlying logical error features as an index, the corresponding subsequent diagnostic strategy template is retrieved from the expanded cognitive graph; Based on the aforementioned subsequent diagnostic strategy template, a subsequent structured correction strategy for the target modality data is generated through the large language model.

12. The method according to claim 1, characterized in that, The step of calculating the immediate reward value of the execution result feedback under multiple preset indicator dimensions includes: The execution result feedback includes the environmental response status; Based on the environmental response status, verify the conformity of the corrected action to be executed with the application programming interface specification; The reward value for the format correctness dimension is determined based on the conformity; Based on the environmental response status, verify the correctness and completeness of the parameter values included in the corrected action to be executed; The reward value for the parameter accuracy dimension is determined based on the correctness and completeness; Based on the environmental response status, verify the target tool program invoked by the corrected action to be executed; The reward value for the tool selection accuracy dimension is determined based on the verification result of the target tool program; Compare the consistency of the corrected action to be executed with the structured correction strategy in terms of diagnostic logic; The reward value for the reflective semantic consistency dimension is determined based on the consistency; Based on the execution result feedback, evaluate the contribution of the corrected action to be executed to the final task objective; The reward value for the final task completion dimension is determined based on the degree of contribution; The reward values for the format correctness dimension, the parameter accuracy dimension, the tool selection accuracy dimension, the reflective semantic consistency dimension, and the final task completion dimension are combined to obtain the immediate reward values under the multiple indicator dimensions.
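The five reward dimensions named in claim 12 can be collected into one record, as in the sketch below. It assumes that the verification results have already been reduced to booleans, counts, and a progress fraction, and that each dimension is scored in the range 0 to 1. The class `Feedback`, its fields, and the scoring rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Feedback:
    """Hypothetical digest of the execution result feedback."""
    api_spec_ok: bool          # action conforms to the API specification
    params_correct: int        # number of correct, complete parameter values
    params_total: int          # number of parameters the call required
    tool_matches_target: bool  # invoked tool is the intended target tool
    follows_strategy: bool     # action is consistent with the correction strategy
    task_progress: float       # contribution to the final task objective


def immediate_rewards(fb: Feedback) -> Dict[str, float]:
    """Score each indicator dimension in [0, 1] (an assumed scale)."""
    param_score = fb.params_correct / fb.params_total if fb.params_total else 1.0
    return {
        "format_correctness": 1.0 if fb.api_spec_ok else 0.0,
        "parameter_accuracy": param_score,
        "tool_selection_accuracy": 1.0 if fb.tool_matches_target else 0.0,
        "reflective_semantic_consistency": 1.0 if fb.follows_strategy else 0.0,
        "final_task_completion": min(1.0, max(0.0, fb.task_progress)),
    }
```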

13. The method according to claim 1, characterized in that, The step of quantifying the expected value of goal achievement of the corrected action to be executed with respect to the final task objective includes: Extract the global user intent features contained in the final task objective; Predict the probability distribution of subsequent multi-round state transitions triggered by the corrected action to be executed in the external environment; The matching degree between the subsequent multi-round state transition probability distribution and the global user intent features is calculated as the expected value of goal achievement.
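One simple way to read the matching degree in claim 13 is as the probability mass that the predicted transitions place on states satisfying the user's intent. The sketch below makes that assumption explicit: the multi-round distribution is collapsed to a distribution over end states, and the intent features are represented as a set of acceptable states. The function `expected_goal_achievement` and this representation are illustrative, not the method defined in this application.

```python
from typing import AbstractSet, Mapping


def expected_goal_achievement(
    end_state_probs: Mapping[str, float],
    goal_states: AbstractSet[str],
) -> float:
    """Probability mass on end states that satisfy the user intent.

    end_state_probs is an assumed summary of the predicted multi-round state
    transitions; it is normalized here in case it does not sum to 1.
    """
    total = sum(end_state_probs.values())
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    matched = sum(p for state, p in end_state_probs.items() if state in goal_states)
    return matched / total
```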

14. The method according to claim 1, characterized in that, The step of performing weighted fusion of the immediate reward values under the multiple indicator dimensions with the predictive reward value to obtain the global reward value includes: Obtain execution performance indicators, satisfaction feedback, and node evolution characteristics of the cognitive graph; Based on the execution performance indicators, the satisfaction feedback, and the node evolution characteristics, dynamically adjust the multiple immediate reward weights for the multiple indicator dimensions and the predictive weight for the predictive reward value; Using the dynamically adjusted multiple immediate reward weights and the predictive weight, the immediate reward values and the predictive reward value are weighted and summed to obtain the global reward value.
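The weighted fusion in claim 14 can be sketched as a normalized weighted sum, together with one possible weight adjustment. The adjustment rule is an assumption: it uses only a satisfaction score in the range 0 to 1 and shifts weight toward the predictive reward when satisfaction is low, whereas the claim also names execution performance indicators and node evolution characteristics. The functions `fuse_rewards` and `adjust_weights` are hypothetical.

```python
from typing import Dict, Tuple


def fuse_rewards(
    immediate: Dict[str, float],
    predictive: float,
    immediate_weights: Dict[str, float],
    predictive_weight: float,
) -> float:
    """Normalized weighted sum of immediate and predictive rewards."""
    total = sum(immediate_weights[k] for k in immediate) + predictive_weight
    if total <= 0:
        raise ValueError("weights must have positive total")
    weighted = sum(immediate_weights[k] * v for k, v in immediate.items())
    return (weighted + predictive_weight * predictive) / total


def adjust_weights(
    immediate_weights: Dict[str, float],
    predictive_weight: float,
    satisfaction: float,
    step: float = 0.1,
) -> Tuple[Dict[str, float], float]:
    """Assumed rule: low satisfaction moves weight toward the predictive reward."""
    shift = step * (1.0 - satisfaction)
    moved = shift * sum(immediate_weights.values())
    scaled = {k: w * (1.0 - shift) for k, w in immediate_weights.items()}
    return scaled, predictive_weight + moved
```

The adjustment keeps the total weight constant, so the global reward stays on the same scale before and after the weights change.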

15. The method according to claim 1, characterized in that, The step of using the global reward value as a feedback signal to iteratively update the network parameters of the large language model includes: Combine the global observation state, the corrected action to be executed, and the global reward value; The combined results are stored in a structured manner as policy optimization samples; Perform a sampling operation on the stored policy optimization samples; Based on the sampled policy optimization samples, calculate a network parameter update gradient; The network parameters of the large language model are adjusted based on the network parameter update gradient.
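The store, sample, and update cycle in claim 15 can be illustrated on a toy scale. The sketch below replaces the large language model with a softmax policy over a few discrete actions and uses a plain REINFORCE-style gradient; both are stand-ins chosen for brevity, and the policy ignores the stored observation state. The names `SampleStore` and `policy_gradient_step` are hypothetical.

```python
import math
import random
from collections import deque
from typing import Deque, List, Tuple

Sample = Tuple[str, int, float]  # (observation state, action index, global reward)


class SampleStore:
    """Bounded store of policy optimization samples with seeded sampling."""

    def __init__(self, capacity: int = 1000, seed: int = 0) -> None:
        self._buf: Deque[Sample] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state: str, action: int, reward: float) -> None:
        self._buf.append((state, action, reward))

    def sample(self, k: int) -> List[Sample]:
        return self._rng.sample(list(self._buf), min(k, len(self._buf)))


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def policy_gradient_step(logits: List[float], batch: List[Sample], lr: float = 0.5) -> List[float]:
    """One ascent step on reward * log pi(action), averaged over the batch."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _state, action, reward in batch:
        for j in range(len(logits)):
            # d/d logit_j of log softmax(logits)[action] = 1[j == action] - p_j
            grad[j] += reward * ((1.0 if j == action else 0.0) - probs[j])
    n = max(len(batch), 1)
    return [x + lr * g / n for x, g in zip(logits, grad)]
```

With a positive reward stored for one action, a single step raises that action's probability; a real system would compute the gradient with respect to the model's network parameters instead of these toy logits.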

16. A method for action control of a large language model, characterized in that, The method includes: In response to the received task instruction, and in conjunction with the current global observation state, the initial action to be executed is generated through the large language model; Calculate the confidence level of the initial action to be executed under the global observation state, and perform a mapping calculation on the confidence level to obtain a risk assessment value; If the risk assessment value exceeds a preset risk threshold, the initial action to be executed is intercepted, and the risk features of the initial action to be executed are extracted. Using the global observation state and the risk features as indexes, a matching diagnostic strategy template is retrieved from the pre-constructed cognitive graph. Based on the retrieved diagnostic strategy template, a structured correction strategy is generated through the large language model. The corrected action to be executed is generated according to the structured correction strategy and sent to the target execution environment.

17. A calibration device for a large language model, characterized in that, The device includes: The first response module is used to respond to the received task instruction and, in combination with the current global observation state, generate an initial action to be executed through the large language model; The first calculation module is used to calculate the confidence level of the initial action to be executed under the global observation state, and to perform a mapping calculation on the confidence level to obtain a risk assessment value; The first interception module is used to intercept the initial action to be executed when the risk assessment value exceeds a preset risk threshold, extract the risk features of the initial action to be executed, and, using the global observation state and the risk features as indexes, retrieve a matching diagnostic strategy template from the pre-constructed cognitive graph; The first correction module is used to generate a structured correction strategy based on the retrieved diagnostic strategy template through the large language model, generate a corrected action to be executed according to the structured correction strategy, and send it to the target execution environment; The first feedback module is used to obtain the execution result feedback of the target execution environment, calculate the immediate reward value of the execution result feedback under multiple preset indicator dimensions, and, in combination with the final task objective, quantify the expected value of goal achievement of the corrected action to be executed with respect to the final task objective, and map the quantified expected value of goal achievement into a predictive reward value; The first update module is used to perform weighted fusion of the immediate reward values under the multiple indicator dimensions and the predictive reward value to obtain a global reward value, and to use the global reward value as a feedback signal to iteratively update the network parameters of the large language model.

18. An electronic device, characterized in that, The electronic device includes: A memory, configured to store computer-executable instructions or computer programs; A processor, configured to implement the method of any one of claims 1 to 15 or claim 16 when executing the computer-executable instructions or computer programs stored in the memory.

19. A computer-readable storage medium, characterized in that, It stores a computer program or computer-executable instructions for implementing the method of any one of claims 1 to 15 or claim 16 when executed by a processor.

20. A computer program product comprising a computer program or computer-executable instructions, characterized in that, When the computer program or computer-executable instructions are executed by a processor, they implement the method described in any one of claims 1 to 15 or claim 16.