Method and apparatus for evaluating accuracy of large model planning
A test interface library is built, predicted path sequences are generated through multiple rounds of interaction, and the sequences are evaluated against multi-dimensional constraint information. This addresses the shortcomings of existing evaluations of large model planning accuracy, enables fine-grained evaluation and optimization suggestions, and improves the accuracy of the model in application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies lack a standardized, quantifiable, and personalized evaluation system for the accuracy of large-scale model planning, which makes it difficult for large-scale models to meet specific inference needs in specific application scenarios.
By constructing a test interface library that is highly consistent with the online environment, the model to be evaluated is driven to interact with it in multiple rounds to generate a predicted path sequence. Based on the preset reference path sequence and its multi-dimensional constraint information, a multi-dimensional evaluation is performed, including path integrity, planning accuracy, execution order, and parameter semantic consistency.
It enables fine-grained and structured evaluation of the accuracy of large-scale model planning, avoids the misjudgment problem in traditional evaluation methods, improves the comprehensiveness and robustness of the evaluation, and provides optimization suggestions to improve model accuracy.
Figure CN122262003A_ABST
Description
Technical Field
[0001] This invention relates to the field of model evaluation technology, and in particular to a method and apparatus for evaluating the accuracy of large model planning.
Background Technology
[0002] With the widespread adoption of Large Language Models (LLMs), also known as large models, intelligent agents built upon these models are being applied across various industries. For example, intelligent agents embedded in healthcare systems can provide corresponding medical consultation services to patients based on their input of medical inquiries, not only saving healthcare staff costs but also improving the efficiency of healthcare services.
[0003] In practical applications, intelligent agents primarily rely on the core capabilities provided by the planning module of large models, such as goal decomposition, path planning, and dynamic adjustment, to infer the results corresponding to input information. However, before large models are put into use, there is usually a lack of a standardized, quantifiable, and personalized evaluation system for their planning accuracy. This makes it difficult for large models to meet specific inference needs in particular application scenarios.
[0004] Therefore, effectively assessing the planning accuracy of large-scale models before they are put into use is crucial for improving their application performance.
Summary of the Invention
[0005] In order to effectively evaluate the planning accuracy of large-scale models, embodiments of this application provide a method and apparatus for evaluating the planning accuracy of large-scale models.
[0006] In a first aspect, embodiments of this application provide a method for evaluating the accuracy of large-scale model planning, applied to an evaluation system. The method includes: acquiring target query information, which is used to instruct an online large-scale model to perform target planning operations; based on the target query information, driving the model to be evaluated to interact with a test interface library in multiple rounds and obtaining a predicted path sequence, so as to simulate the multi-step planning behavior of the online large-scale model in performing the target planning operation; and performing a multi-dimensional evaluation on the predicted path sequence based on a preset reference path sequence and its multi-dimensional constraint information to obtain a multi-dimensional evaluation result.
[0007] In an optional embodiment, before obtaining the target query information, the method further includes: configuring, according to preset planning requirement information, a reference path sequence and its multi-dimensional constraint information for the online large model to perform the target planning operation. The multi-dimensional constraint information includes path integrity constraint information, planning accuracy constraint information, execution order constraint information, and parameter semantic constraint information.
[0008] In one optional embodiment, driving the model to be evaluated to interact with the test interface library in multiple rounds based on the target query information to obtain a predicted path sequence includes: inputting the target query information into the model to be evaluated as the initial context, so that the model to be evaluated generates the interface identifier and parameter values to be called in the first round of interaction; calling, by the evaluation system, the corresponding test interface in the test interface library according to the interface identifier and parameter values to obtain the corresponding return value; updating the context state with the return value and using the updated context state as the input for the next round of interaction, and repeating the above process until a preset termination condition is met; and constructing a structured predicted path sequence from the interface identifier, parameter values, return value, and calling order generated in each round of interaction.
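The multi-round interaction described in this embodiment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Step` record, the `Model` callable signature, the round limit used as the termination condition, and the toy interfaces `query_order_status` and `request_return` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One round of interaction, recorded for the predicted path sequence."""
    order: int
    interface_id: str
    params: Dict[str, Any]
    return_value: Any


# Hypothetical model signature: given the context state, return the next
# (interface identifier, parameter values) pair, or None to stop planning.
Model = Callable[[Dict[str, Any]], Optional[Tuple[str, Dict[str, Any]]]]


def run_interaction(query: str, model: Model,
                    library: Dict[str, Callable[..., Any]],
                    max_rounds: int = 8) -> List[Step]:
    # The context state holds the target query and the history of calls so far.
    context: Dict[str, Any] = {"query": query, "history": []}
    for order in range(1, max_rounds + 1):  # assumed termination: round limit
        decision = model(context)
        if decision is None:  # assumed termination: model signals completion
            break
        interface_id, params = decision
        return_value = library[interface_id](**params)  # call the test interface
        context["history"].append(Step(order, interface_id, params, return_value))
    return context["history"]


# Toy stand-ins (illustrative only) for a model and a test interface library.
def toy_model(context: Dict[str, Any]) -> Optional[Tuple[str, Dict[str, Any]]]:
    history = context["history"]
    if not history:
        return "query_order_status", {"order_id": "A100"}
    if len(history) == 1 and history[-1].return_value["status"] == "delivered":
        return "request_return", {"order_id": "A100"}
    return None


toy_library = {
    "query_order_status": lambda order_id: {"order_id": order_id, "status": "delivered"},
    "request_return": lambda order_id: {"return_id": "R-" + order_id, "status": "approved"},
}

path = run_interaction("I want to return order A100", toy_model, toy_library)
```

Each `Step` keeps the interface identifier, parameter values, return value, and calling order together, so the resulting list can serve directly as a structured predicted path sequence.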
[0009] In one optional embodiment, the current context state includes the target query information and the historical sequence of invoked interfaces, wherein the interface sequence includes the interface identifier, parameter value, and corresponding return value of each interface.
[0010] In one optional embodiment, each test interface in the test interface library has the same interface identifier as the corresponding online interface, and the parameter values and return values are consistent in data structure and semantics.
[0011] In one optional embodiment, performing the multi-dimensional evaluation on the predicted path sequence based on the preset reference path sequence and its multi-dimensional constraint information to obtain a multi-dimensional evaluation result includes: inputting the predicted path sequence, the preset reference path sequence, and its multi-dimensional constraint information into an evaluation model, and using the evaluation model to calculate the matching degree between the predicted path sequence and the reference path sequence in terms of path integrity, planning accuracy, execution order, and parameter semantic consistency to obtain the multi-dimensional evaluation result.
[0012] In one optional embodiment, the reference path sequence is represented as a planned trajectory map containing multiple nodes, each node corresponding to an interface identifier and its parameter values, as well as the semantic parameter text of the corresponding parameter values. In this case, performing the multi-dimensional evaluation on the predicted path sequence based on the preset reference path sequence and its multi-dimensional constraint information includes: inputting the predicted path sequence, the preset reference path sequence, and its multi-dimensional constraint information into an evaluation model, and using the evaluation model to calculate the matching degree between the predicted path sequence and the planned trajectory map in terms of path integrity, node matching, sequence dependency, and parameter semantic consistency to obtain the multi-dimensional evaluation result.
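As one way to picture the sequence dependency check on a node-based planned trajectory, the sketch below treats dependencies as directed edges between interface identifiers and reports every edge that the predicted call order breaks. The edge representation and the function name `dependency_violations` are assumptions made for illustration; the application itself delegates this matching to an evaluation model.

```python
from typing import Dict, List, Tuple


def dependency_violations(predicted_ids: List[str],
                          edges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (before, after) edges that the predicted order does not respect."""
    # Position of the first call to each interface in the predicted sequence.
    first_pos: Dict[str, int] = {}
    for index, interface_id in enumerate(predicted_ids):
        first_pos.setdefault(interface_id, index)

    violations = []
    for before, after in edges:
        if after not in first_pos:
            continue  # a missing node is a path integrity issue, not an order issue
        if before not in first_pos or first_pos[before] > first_pos[after]:
            violations.append((before, after))
    return violations


# Hypothetical dependency edges for a return-and-refund flow.
edges = [("query_order_status", "request_return"),
         ("request_return", "process_refund")]

swapped = ["request_return", "query_order_status", "process_refund"]
in_order = ["query_order_status", "request_return", "process_refund"]
```

Keeping missing nodes out of the violation list separates the sequence dependency dimension from the path integrity dimension, so one defect is not counted twice.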
[0013] In an optional embodiment, the method further includes: generating optimization suggestion information based on the multidimensional evaluation results; adjusting the prompt template of the model to be evaluated or fine-tuning the model parameters based on the optimization suggestion information to improve the planning accuracy of the model to be evaluated.
[0014] In a second aspect, embodiments of this application provide an apparatus for evaluating the accuracy of large-scale model planning, comprising: an acquisition module for acquiring target query information, the target query information being used to instruct an online large-scale model to perform target planning operations; a scheduling module for driving the model to be evaluated to interact with a test interface library in multiple rounds based on the target query information and obtaining a predicted path sequence, thereby simulating the multi-step planning behavior of the online large-scale model in performing the target planning operation; and an evaluation module for performing multi-dimensional evaluation on the predicted path sequence based on a preset reference path sequence and its multi-dimensional constraint information, thereby obtaining a multi-dimensional evaluation result.
[0015] In a third aspect, embodiments of this application provide an electronic device, which includes: a memory for storing a computer program product; and a processor for executing the computer program product stored in the memory, wherein the computer program product, when executed, implements the method for evaluating the accuracy of large model planning described in the first aspect above.
[0016] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer program instructions, which, when executed, implement the method for evaluating the accuracy of large model planning described in the first aspect.
[0017] In summary, the method and apparatus for evaluating the planning accuracy of large models provided in this application construct a test interface library that is highly consistent with the online environment, and drive the model to be evaluated to interact with the interface library over multiple rounds under a dynamic context state, generating a predicted path sequence corresponding to the multiple rounds of interaction. This realistically reproduces the multi-step planning behavior of online large models when performing complex tasks. Furthermore, based on a preset reference path sequence and multi-dimensional constraint information covering path integrity, planning accuracy, execution order, and parameter semantic consistency, the generated predicted path sequence is evaluated in a fine-grained and structured manner, avoiding the misjudgments that traditional evaluation methods produce because they use a single evaluation dimension or rely on exact matching.
Brief Description of the Drawings
[0018] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a software architecture diagram of an evaluation system that performs the method for evaluating the accuracy of large model planning provided in the embodiments of this application.
[0020] Figure 2 is a flowchart of a method for evaluating the accuracy of large-scale model planning provided in an embodiment of this application.
[0021] Figure 3 is a structural block diagram of an apparatus for evaluating the accuracy of large-scale model planning provided in an embodiment of this application.
[0022] Figure 4 is a schematic structural diagram of an electronic device provided in an embodiment of this application.
Detailed Description
[0023] The embodiments of this specification will be further described in detail below with reference to the accompanying drawings and examples. Through these descriptions, the features and advantages of the embodiments of this specification will become clearer and more apparent.
[0024] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments. Although various aspects of embodiments are shown in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated otherwise.
[0025] Furthermore, the technical features involved in the different embodiments of this specification described below can be combined with each other as long as they do not conflict with each other.
[0026] With the widespread application of Large Language Model (LLM)-driven agents, the requirements for the planning accuracy of these models in vertical application scenarios (such as travel and government affairs) are becoming increasingly stringent. Therefore, to improve the performance of large models in specific application scenarios, it is necessary to effectively evaluate their planning accuracy through an evaluation system before deployment. However, traditional evaluation systems, such as those based on the Berkeley Function Calling Leaderboard (BFCL), have significant limitations in assessing the planning accuracy of large models.
[0027] For example, such an evaluation system does not model the call order and dependencies of the interfaces called during the multi-step planning operation of the large model, so it cannot accurately identify the call order and dependencies between interfaces. Furthermore, when checking the input and output parameters of each interface, the evaluation system relies solely on exact matching of the literal form or of the Abstract Syntax Tree (AST), without semantic understanding or equivalent substitution, which leads to parameter misjudgment. Additionally, the evaluation tasks performed by the system cannot cover the real call chains in vertical application scenarios, resulting in evaluation omissions and incorrect evaluation results. Moreover, the evaluation samples used by the system are typically static triples (query, function call, answer). This static, single-round, stateless structure cannot capture the call dependencies and context propagation between multiple interfaces, nor can it support dynamic decision-making; it is therefore difficult to simulate the multi-step planning behavior of an online large model performing planning operations.
[0028] Therefore, for large models used in complex application scenarios, traditional evaluation systems struggle to conduct multi-dimensional, fine-grained, and application-scenario-aligned quantitative evaluations of their planning accuracy, severely restricting algorithm iteration efficiency and product optimization effectiveness.
[0029] To address this, this application provides a method for evaluating the accuracy of large-scale model planning. This method simulates the multi-step planning behavior of an online large-scale model to obtain the corresponding predicted path sequence. Then, based on a preset reference path sequence for a given target and multi-dimensional constraint information, the matching degree between the predicted path sequence and the reference path sequence is evaluated in multiple dimensions through an evaluation model. This addresses the insufficient coverage of traditional evaluation systems in terms of path integrity, planning accuracy, execution order, and parameter semantic consistency.
[0030] The implementation principle of the method for evaluating the accuracy of large model planning provided in this application will be explained below through specific embodiments.
[0031] For ease of explanation, in this embodiment, any specific application scenario using the large model is referred to as the target scenario, and any online service included in the target scenario is referred to as the target service. The target service implements various user-specified tasks based on the large model. For example, in a government affairs scenario, the target service could be a "government affairs assistant," which implements various government affairs tasks specified by the user based on the large model; similarly, in a travel scenario, the target service could be a "navigation assistant," which implements various travel tasks specified by the user based on the large model.
[0032] In practical applications, the process of a large model performing planning operations is a step-by-step reasoning process. When the large model includes a planning module, this step-by-step reasoning process is executed by the planning module, which evaluates the state and selects action paths for each step of the reasoning to ultimately achieve the task objective (such as problem solving, dialogue management, code generation, intelligent agents, etc.). During this process, the large model generates a series of coherent and executable steps; therefore, the accuracy of these steps reflects the planning accuracy of the large model.
[0033] For ease of explanation, in this embodiment of the application, the sequence of steps generated by the large model performing planning operations is called the path sequence. Based on this, the sequence of steps generated by the model to be evaluated simulating the multi-step planning behavior of the online large model performing planning operations is called the predicted path sequence. The accuracy of the predicted path sequence reflects the planning accuracy of the model to be evaluated.
[0034] Based on this, in order to evaluate the accuracy of the predicted path sequence, in this embodiment of the application, a corresponding reference path sequence is pre-configured for each task objective under the target service according to the planning requirement information of the target service in the target application scenario. The planning requirement information defines the target planning operations that the online large model needs to perform to achieve each task objective, and the reference path sequence for each task objective includes the specific requirements and constraints for the online large model to perform the corresponding target planning operations.
[0035] In other words, the reference path sequence is equivalent to the standard step sequence that the online large model generates when performing the target planning operation. This standard step sequence can be used as a reference to evaluate the accuracy of the predicted path sequence.
[0036] In this embodiment, the specific forms of the planning requirement information, reference path sequence, and predicted path sequence are not limited; they can be text or structured data. Optionally, taking the case where the planning requirement information, reference path sequence, and predicted path sequence are all structured data as an example, the data can be represented in a format such as JavaScript Object Notation (JSON), Extensible Markup Language (XML), or a shell script. The formats of the planning requirement information, reference path sequence, and predicted path sequence can be the same or different, and this is not limited here.
[0037] Of course, the above types are only illustrative examples. In practical applications, they are not limited to these, and specific types can be selected according to actual needs.
[0038] In this embodiment, the specific content of the reference path sequence is not limited, and the reference path sequence varies with the planning requirements. For example, the content of the reference path sequence includes, but is not limited to, the interface identifier, parameter values, and return values of each interface required for the online large model to perform the target planning operation, as well as the calling order, dependencies, and termination conditions of those interfaces. The specific content is determined by the target planning operation and will not be detailed here.
[0039] Below, the task objective of "reordering" in an e-commerce scenario is taken as an example. Assume that the corresponding planning requirements include:
Process A: Check order status -> Request return -> Process refund;
Process B: Check inventory -> Place new order.
Based on this, corresponding reference path sequences can be configured for process A and process B respectively, so as to evaluate the accuracy of the predicted path sequences that the model to be evaluated generates when simulating the task objective of "reordering".
[0040] In this example, the reference path sequence configured for process A is illustrated using Table 1, and the reference path sequence configured for process B is illustrated using Table 2.
[0041] Table 1
[0042] Table 2
[0043] Referring to Table 1, to achieve the task objective of "reordering" through process A, it is necessary to first call the "Order Status Query Interface" and pass in the "Order Number" as the input parameter to query the corresponding order information; then, after querying the order information that meets the requirements, call the "Request Return Interface" and determine the input parameter based on the return value of the previous step to apply for a return for the order; further, if it is determined that the return value of the previous step indicates that the return status is "approved", call the "Process Refund Interface" and pass in the "Return Request Number" as the input parameter to process the refund for the order, until the return value indicates that the refund status is "completed", thus achieving the task objective of "reordering".
[0044] Referring to Table 2, to achieve the task objective of "reordering" through process B, it is necessary to first call the "Inventory Query Interface" and pass in at least the "Product Number" as an input parameter to query the inventory information corresponding to the product; further, if the previous step returns a valid inventory status, the "Create Order Interface" is called and the necessary user information and product information are passed in as input parameters to create a new order until the return value indicates that it contains order information with the order status of "created", thus achieving the task objective of "reordering".
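For illustration, the two processes described above could be encoded as structured JSON data along the following lines. The field names (`interface_id`, `params`, `depends_on`, `terminate_when`), the interface identifiers, and the parameter names are assumptions introduced for this sketch; they are not taken from Table 1 or Table 2.

```python
import json

# Hypothetical JSON-style encoding of the reference path sequences for
# process A and process B of the "reordering" task objective.
reference_paths = {
    "task_objective": "reordering",
    "processes": {
        "A": [
            {"order": 1, "interface_id": "query_order_status",
             "params": ["order_id"], "depends_on": []},
            {"order": 2, "interface_id": "request_return",
             "params": ["order_id"], "depends_on": ["query_order_status"]},
            {"order": 3, "interface_id": "process_refund",
             "params": ["return_request_id"], "depends_on": ["request_return"],
             "terminate_when": {"refund_status": "completed"}},
        ],
        "B": [
            {"order": 1, "interface_id": "query_inventory",
             "params": ["product_id"], "depends_on": []},
            {"order": 2, "interface_id": "create_order",
             "params": ["user_id", "product_id"], "depends_on": ["query_inventory"],
             "terminate_when": {"order_status": "created"}},
        ],
    },
}

# Round-trip through JSON text, as would happen when the configuration is
# stored in a file or passed between modules.
encoded = json.dumps(reference_paths, indent=2)
decoded = json.loads(encoded)


def interface_order(process: str) -> list:
    """Return the interface identifiers of a process in calling order."""
    steps = sorted(decoded["processes"][process], key=lambda step: step["order"])
    return [step["interface_id"] for step in steps]
```

Recording the dependency and termination condition on each step keeps the calling order, the dependencies, and the stopping rule in one place for the evaluator to read.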
[0045] In other words, the online large model can achieve the task objective of "reordering" in an e-commerce scenario by executing the above two step sequences. Based on this, when the accuracy of the predicted path sequence that the model under evaluation generates for the task objective of "reordering" is evaluated, if the evaluation results show that the predicted path sequence is the same as or equivalent to the above two step sequences, the planning accuracy of the model under evaluation is determined to meet the requirements, and the model can be deployed online.
[0046] It should be noted that the above information regarding task objectives and their planning requirements, as well as the reference path sequence configured based on the planning requirements, is merely illustrative. In actual applications, the corresponding content will differ depending on the specific task objectives. Furthermore, each step indicated by the reference path sequence, i.e., each interface call, accepts semantically equivalent parameters, that is, parameter values that differ in form but have the same or equivalent meaning and can be used for the same interface call.
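The idea of parameter values that differ in form but carry the same meaning can be illustrated with a small rule-based comparison. This is only a stand-in under stated assumptions: the application relies on an evaluation model for semantic matching, whereas the sketch below normalizes a few hypothetical date formats and ignores case and surrounding whitespace.

```python
from datetime import datetime
from typing import Any, Dict

# Assumed set of date spellings that should be treated as equivalent.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")


def normalize(value: Any) -> Any:
    """Map a parameter value to a canonical form for comparison."""
    if isinstance(value, str):
        text = value.strip()
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                continue  # not this date format, try the next one
        return text.casefold()  # plain strings: ignore case differences
    return value


def params_equivalent(predicted: Dict[str, Any], reference: Dict[str, Any]) -> bool:
    """True if both calls use the same parameter names with equivalent values."""
    if predicted.keys() != reference.keys():
        return False
    return all(normalize(predicted[key]) == normalize(reference[key])
               for key in reference)
```

An exact string comparison would reject `"2026/04/01"` against `"2026-04-01"`, which is the kind of misjudgment the semantic parameter dimension is meant to avoid.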
[0047] In addition, the above explanation only uses some steps as examples for the reference path sequences shown in Tables 1 and 2. The step sequences corresponding to other planning processes (such as invalid orders or insufficient inventory) are determined on a similar principle, and will not be elaborated further.
[0048] Furthermore, to ensure evaluation quality, in this embodiment, for each task objective, in addition to the pre-configured reference path sequence, corresponding multi-dimensional constraint information is also configured to evaluate the accuracy of the predicted path sequence from multiple dimensions. The specific content of the multi-dimensional constraint information is not limited; optionally, path integrity constraint information, planning accuracy constraint information, execution order constraint information, and parameter semantic constraint information are used as examples for illustration.
[0049] Based on this, the matching degree between the predicted path sequence and the reference path sequence can be evaluated in terms of path integrity, planning accuracy, execution order, and parameter semantic consistency, resulting in multi-dimensional evaluation results. This not only avoids misjudgment caused by a single standard, but also faithfully reflects the multi-step reasoning ability of the model to be evaluated, providing reliable feedback for model optimization.
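A rough sketch of how four separate scores might be computed is given below. The formulas are assumptions chosen for illustration (recall for path integrity, precision for planning accuracy, longest common subsequence for execution order, exact equality for parameters); the application does not specify them and uses an evaluation model instead.

```python
from typing import Any, Dict, List, Tuple

Call = Tuple[str, Dict[str, Any]]  # (interface identifier, parameter values)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two identifier lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def evaluate(predicted: List[Call], reference: List[Call]) -> Dict[str, float]:
    pred_ids = [interface_id for interface_id, _ in predicted]
    ref_ids = [interface_id for interface_id, _ in reference]
    if not ref_ids:
        raise ValueError("reference path sequence must not be empty")

    # Path integrity: share of reference interfaces that were called at all.
    integrity = sum(1 for r in ref_ids if r in pred_ids) / len(ref_ids)
    # Planning accuracy: share of predicted calls that belong to the reference.
    accuracy = sum(1 for p in pred_ids if p in ref_ids) / len(pred_ids) if pred_ids else 0.0
    # Execution order: how much of the reference order is preserved.
    order = lcs_length(pred_ids, ref_ids) / len(ref_ids)
    # Parameter consistency: exact match here; a semantic matcher could be swapped in.
    pred_params = dict(predicted)  # the last call wins if an interface repeats
    shared = [(iid, params) for iid, params in reference if iid in pred_params]
    consistency = (sum(1 for iid, params in shared if pred_params[iid] == params) / len(shared)
                   if shared else 0.0)

    return {"path_integrity": integrity, "planning_accuracy": accuracy,
            "execution_order": order, "parameter_consistency": consistency}


reference = [("query_order_status", {"order_id": "A100"}),
             ("request_return", {"order_id": "A100"}),
             ("process_refund", {"return_id": "R-A100"})]
predicted = [("request_return", {"order_id": "A100"}),
             ("query_order_status", {"order_id": "A100"}),
             ("process_refund", {"return_id": "R-A101"})]
scores = evaluate(predicted, reference)
```

In this example every reference interface is called, so path integrity and planning accuracy are both 1.0, while the swapped first two calls and the wrong refund parameter each lower only their own dimension. A single pass/fail check could not make that distinction.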
[0050] In practical applications, when a user wants to achieve a certain task objective through a target service, the user inputs target query information into the target service. After receiving the target query information, the target service passes it to the online large model as an input instruction. The online large model then performs target planning operations based on the received target query information and, through multi-step reasoning, obtains the result that matches the target query information. Finally, the target service outputs this result to the user to achieve the corresponding task objective.
[0051] During this process, the online large model needs to call a series of online interfaces and execute the planning function of each step through these interfaces.
[0052] Therefore, in order to simulate the multi-step interactive behavior of the online large model performing target planning operations, in this embodiment of the application, target query information corresponding to the task objective is constructed, wherein the target query information is used to instruct the online large model to perform target planning operations.
[0053] In addition, a test interface library is built as a simulation environment corresponding to the online interfaces called by the online large model when performing target planning operations. The test interface library includes multiple preset test interfaces, each of which corresponds one-to-one with an online interface called by the online large model when performing target planning operations. Each test interface has the same interface identifier and interface name as its corresponding online interface, and their parameter values and return values are consistent in data structure and semantics.
[0054] In this embodiment, the method for constructing the target query information is not limited. For example, in one optional approach, domain experts can manually construct the target query information based on the specific functions of the target service and the specific content of the task objective. Another optional approach is to input a pre-configured reference path sequence into a pre-trained language model, which then constructs the target query information corresponding to the reference path sequence through reverse reasoning based on its semantic understanding capabilities. The specific construction method can be determined according to actual needs and will not be elaborated upon here.
[0055] Optionally, when target query information is constructed through reverse reasoning by a pre-trained language model, diverse target query information with different forms but the same or equivalent semantics can be constructed for the same reference path sequence, to reflect the diverse ways users phrase queries in real application scenarios.
[0056] Alternatively, when generating target query information through a pre-trained language model, alternative query information can be constructed for each path selection based on the path selection features in multi-step planning behavior, in order to match the user's true intent.
[0057] For example, the reference path sequence input to the pre-trained language model is: find nearby restaurants, determine restaurants with available seats on Saturdays, and determine restaurants with an average cost of less than 200 yuan per person.
[0058] Based on this, when constructing target query information, the pre-trained language model can not only construct query information corresponding to the reference path sequence, such as "I want to eat at a restaurant on Saturday, please help me find a restaurant nearby with an average cost of no more than 200 yuan per person," but also analyze the query information required for the online large model to adjust the path selection at each step when the user's intent changes. This allows it to generate query information corresponding to the change in user intent, such as "I want to eat at a restaurant, please help me find a restaurant nearby with an average cost of no more than 200 yuan per person," which adapts to situations where the user changes time; or "I want to go to the airport on Saturday, please help me find the shortest route to the airport nearby," which adapts to situations where the user changes location and query task.
[0059] This can be understood as follows: for a reference path sequence, after the pre-trained language model derives by reverse reasoning the query information that exactly matches it, the model can also generate further query information that is semantically identical, equivalent, or similar by transforming, supplementing, and adjusting the parameter objects in it. All of this can be output together as the target query information, which reflects users' real query needs and also tests the planning accuracy of the model to be evaluated more thoroughly.
[0060] In this embodiment, each test interface performs only lightweight processing on the parameter values, such as format conversion and validity checks, without replicating the complex processing logic executed internally by the online interface. This makes the test interfaces easier to build and saves resources, and it allows each test interface to simulate the processing function of the online interface simply by receiving the same or equivalent parameter values and returning the same or equivalent return values, which keeps the processing flow simple and efficient.
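A test interface of this kind might look like the following sketch, which performs only light normalization and validation before returning canned data. The interface name `query_order_status`, the `_ORDERS` fixture, the error payloads, and the `call_test_interface` dispatcher are hypothetical and serve only to illustrate the idea.

```python
from typing import Any, Callable, Dict

# Canned data standing in for the online backend (illustrative values only).
_ORDERS: Dict[str, Dict[str, Any]] = {
    "A100": {"order_id": "A100", "status": "delivered"},
}


def query_order_status(order_id: str) -> Dict[str, Any]:
    """Test interface: same identifier and return shape as the assumed online one."""
    order_id = str(order_id).strip().upper()  # lightweight format conversion
    if not order_id:                          # lightweight validity check
        return {"error": "order_id is required"}
    return _ORDERS.get(order_id, {"error": "order not found"})


# The library maps interface identifiers to their test implementations.
TEST_INTERFACE_LIBRARY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "query_order_status": query_order_status,
}


def call_test_interface(interface_id: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a call from the model under evaluation to the matching test interface."""
    handler = TEST_INTERFACE_LIBRARY.get(interface_id)
    if handler is None:
        return {"error": f"unknown interface: {interface_id}"}
    return handler(**params)
```

Returning an error payload for an unknown identifier, rather than raising, lets the interaction continue so that a wrong call still shows up in the recorded path.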
[0061] It can be understood that the above configuration process is a preparatory work before executing the method for evaluating the accuracy of large model planning provided in the embodiments of this application. After the above information configuration is completed, the evaluation method can be executed to drive the model to be evaluated to interact with each test interface in the test interface library in multiple steps through the target query information, simulate the multi-step planning behavior of the online large model to perform target planning operations, thereby obtaining the predicted path sequence and evaluating its accuracy.
[0062] Optionally, the method for evaluating the accuracy of large model planning provided in this application embodiment is executed by an evaluation system. The specific form of the evaluation system is not limited, and it can be understood that, at the method execution level, the evaluation system is a software system.
[0063] In this embodiment of the application, the software structure of the evaluation system is not limited. Figure 1 shows a schematic diagram of the software architecture of an evaluation system.
[0064] As shown in Figure 1, the evaluation system includes an input module, a scheduling module, a test interface library, a model to be evaluated, an evaluation module, and an output module. The input module receives external information and inputs it into the scheduling module. This external information includes pre-configured target query information, a reference path sequence, and multi-dimensional constraint information. Based on the target query information, the scheduling module drives the model to be evaluated to interact in multiple steps with the test interfaces in the test interface library. It then generates a predicted path sequence from the results of these interactions and inputs the predicted path sequence, the reference path sequence, and the multi-dimensional constraint information into the evaluation module. The evaluation module performs a multi-dimensional evaluation of the matching degree between the predicted path sequence and the reference path sequence based on the multi-dimensional constraint information and outputs the resulting multi-dimensional evaluation results through the output module.
[0065] Optionally, based on the obtained multidimensional evaluation results, optimization suggestions can be generated for model optimization of the model to be evaluated.
[0066] It should be noted that the software structure of the evaluation system shown in Figure 1 is for illustrative purposes only. In practical applications, the software structure is not limited to this and may differ depending on actual needs, which will not be elaborated here.
[0067] It should be further noted that the above embodiments are intended to illustrate the basic process of the evaluation system in performing the evaluation method for the accuracy of large model planning provided in the embodiments of this application, and this process is only an optional method and is not a limiting description.
[0068] The implementation principle of the method for evaluating the accuracy of large-scale model planning provided in the embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0069] Figure 2 shows a flowchart of a method for evaluating the accuracy of large-scale model planning, provided as an embodiment of this application. As shown in Figure 2, the evaluation method includes: S102. Obtain target query information, which is used to instruct the online large model to perform target planning operations; S104. Based on the target query information, drive the model to be evaluated to interact with the test interface library in multiple rounds and obtain the predicted path sequence, so as to simulate the multi-step planning behavior of the online large model in performing target planning operations; S106. Based on the preset reference path sequence and its multidimensional constraint information, perform multidimensional evaluation on the predicted path sequence to obtain the multidimensional evaluation result.
[0070] In this embodiment, after obtaining the target query information used to instruct the online large model to perform target planning operations, the evaluation system can input the target query information into the model to be evaluated to drive the model to be evaluated to perform multiple rounds of interaction with the test interface library, and generate a predicted path sequence based on the results of the multiple rounds of interaction. Then, based on the preset reference path sequence and its multidimensional constraint information, the predicted path sequence is evaluated in multiple dimensions to obtain the corresponding multidimensional evaluation result.
[0071] In one optional embodiment, the process of the evaluation system driving the model to be evaluated to perform multiple rounds of interaction with the test interface library based on the target query information can be understood as follows: the evaluation system takes the received target query information as the initial context input to the model to be evaluated, and generates the interface identifier and its parameter value to be called in the first round of interaction through the model to be evaluated; then, according to the interface identifier and its parameter value, it calls the corresponding test interface in the test interface library to obtain the return value; further, it updates the context state with the return value and uses it as the input for the next round of interaction, and repeats the above process until the preset termination condition is met.
[0072] In this embodiment of the application, the preset termination conditions for multi-round interactions are not limited. Optionally, the preset termination conditions include, but are not limited to, the model to be evaluated explicitly outputting a planning completion signal, the number of test interfaces that have been called reaching a preset maximum number, and no new valid interface identifiers and their parameter values being generated in multiple consecutive rounds of interactions. Depending on the planning requirements, the preset termination conditions may be different, which will not be elaborated here.
[0073] In this embodiment, before the model to be evaluated interacts with the test interface library for the next round, it can dynamically adjust the interface identifier and its parameter values to be called in the next round of interaction based on the target query information and the updated context state, so as to simulate the step-by-step inference process of the online large model. The current context state corresponding to each round of interaction includes the target query information and the historical sequence of called interfaces, where the interface sequence includes the interface identifier, parameter values, and corresponding return values of each interface.
[0074] In this embodiment, during the multi-round interaction between the model to be evaluated and the test interface library, the evaluation system synchronously records the interface identifier, corresponding parameter value, return value, and calling order generated in each round of interaction. Finally, based on these records, a structured predicted path sequence is constructed. For the specific structure of the predicted path sequence, please refer to the description of the foregoing embodiment, which is not repeated here.
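The multi-round interaction described in paragraphs [0071] to [0074] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the names `Step`, `Context`, `run_interaction`, `scripted_planner`, and the two sample interfaces are hypothetical, the test interface library is modelled as a plain dictionary of callables, and a scripted function stands in for the model to be evaluated. The three termination conditions mirror the examples given in paragraph [0072].

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    order: int                 # calling order, starting at 1
    interface_id: str
    params: Dict[str, Any]
    return_value: Any


@dataclass
class Context:
    query: str                                          # target query information
    history: List[Step] = field(default_factory=list)   # interfaces called so far


# Hypothetical planner contract: return (interface_id, params), or None as the
# explicit "planning complete" signal.
Planner = Callable[[Context], Optional[Tuple[str, Dict[str, Any]]]]


def run_interaction(query, planner, library, max_calls=8, max_invalid_rounds=2):
    """Drive the planner against the test interface library and record each round."""
    ctx = Context(query=query)
    invalid_rounds = 0
    while len(ctx.history) < max_calls and invalid_rounds < max_invalid_rounds:
        decision = planner(ctx)
        if decision is None:          # model signals that planning is complete
            break
        interface_id, params = decision
        handler = library.get(interface_id)
        if handler is None:           # no valid interface identifier this round
            invalid_rounds += 1
            continue
        invalid_rounds = 0
        return_value = handler(**params)
        # Updating the history is what updates the context for the next round.
        ctx.history.append(Step(len(ctx.history) + 1, interface_id, params, return_value))
    return ctx.history


# Hypothetical test interfaces and a scripted stand-in for the model to be evaluated.
library = {
    "search_doctor": lambda department: {"doctor_id": "D-1"},
    "book_appointment": lambda doctor_id: {"status": "booked", "doctor_id": doctor_id},
}


def scripted_planner(ctx):
    if not ctx.history:
        return ("search_doctor", {"department": "cardiology"})
    if len(ctx.history) == 1:
        # The second call depends on the return value of the first call.
        return ("book_appointment", {"doctor_id": ctx.history[0].return_value["doctor_id"]})
    return None


path = run_interaction("Book a cardiology appointment", scripted_planner, library)
```

The returned list of `Step` records plays the role of the structured predicted path sequence: each entry carries the interface identifier, parameter values, return value, and calling order of one round.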
[0075] In this embodiment of the application, the process of the evaluation system performing multidimensional evaluation on the predicted path sequence based on the preset reference path sequence and its multidimensional constraint information can be understood as follows: the evaluation system inputs the predicted path sequence, the preset reference path sequence and its multidimensional constraint information into the evaluation model, and calculates the matching degree between the predicted path sequence and the reference path sequence in terms of path integrity, planning accuracy, execution order and parameter semantic consistency through the evaluation model.
[0076] Among them, the matching degree of the predicted path sequence and the preset reference path sequence in terms of path integrity refers to assessing whether the predicted path sequence has omitted any calling steps compared to the reference path sequence; the matching degree of the predicted path sequence and the preset reference path sequence in terms of planning accuracy refers to assessing whether the predicted path sequence has redundant calling steps compared to the reference path sequence; the matching degree of the predicted path sequence and the preset reference path sequence in terms of execution order refers to assessing whether the execution order of the predicted path sequence is incorrect compared to the reference path sequence; and the matching degree of the predicted path sequence and the preset reference path sequence in terms of parameter semantic consistency refers to assessing whether the predicted path sequence uses semantically incorrect parameter values in each calling step compared to the reference path sequence.
[0077] Based on this, the evaluation model can output a corresponding score for the evaluation results of each dimension, which together serve as the multi-dimensional evaluation result. For example, if the predicted path sequence matches the reference path sequence in every dimension, the output score is 1; otherwise, the output score is 0.
[0078] Of course, the above scoring rules are only illustrative examples. In actual applications, they are not limited to these rules. Specific scoring rules can be determined according to actual needs, and will not be elaborated here.
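As an illustration of the four dimensions described in paragraph [0076], the sketch below scores a predicted path against a reference path. It is a simplified rule-based stand-in, not the evaluation model of this application. The following are assumptions made for the example: each path is a list of dictionaries with `interface_id` and `params` keys, a 0/1 score is emitted per dimension, and parameter comparison defaults to exact equality (a semantic comparison function can be passed in as `same_param`).

```python
from collections import Counter


def first_occurrences(ids, allowed):
    """Keep the first occurrence of each id that is also in `allowed`, in order."""
    seen = []
    for i in ids:
        if i in allowed and i not in seen:
            seen.append(i)
    return seen


def evaluate(predicted, reference, same_param=lambda a, b: a == b):
    pred_ids = [s["interface_id"] for s in predicted]
    ref_ids = [s["interface_id"] for s in reference]

    # Path integrity: no reference step is omitted from the prediction.
    missing = Counter(ref_ids) - Counter(pred_ids)
    # Planning accuracy: the prediction contains no redundant step.
    redundant = Counter(pred_ids) - Counter(ref_ids)
    # Execution order: steps shared by both paths appear in the same order.
    in_order = first_occurrences(pred_ids, set(ref_ids)) == first_occurrences(ref_ids, set(pred_ids))

    # Parameter semantic consistency: shared steps use matching parameter values.
    pred_params = {}
    for s in predicted:
        pred_params.setdefault(s["interface_id"], s["params"])
    params_ok = all(
        same_param(pred_params[s["interface_id"]], s["params"])
        for s in reference
        if s["interface_id"] in pred_params
    )

    return {
        "path_integrity": int(not missing),
        "planning_accuracy": int(not redundant),
        "execution_order": int(in_order),
        "parameter_semantic_consistency": int(params_ok),
    }


# Hypothetical sample data.
reference = [
    {"interface_id": "search_doctor", "params": {"department": "cardiology"}},
    {"interface_id": "book_appointment", "params": {"doctor_id": "D-1"}},
]
predicted = [
    {"interface_id": "book_appointment", "params": {"doctor_id": "D-1"}},
    {"interface_id": "search_doctor", "params": {"department": "cardiology"}},
    {"interface_id": "send_sms", "params": {}},
]
result = evaluate(predicted, reference)
```

In this sample the prediction contains every reference step with correct parameters, but it adds a redundant `send_sms` call and swaps the order of the two reference steps, so only the planning accuracy and execution order dimensions are scored 0.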
[0079] In an optional embodiment, the evaluation model can represent the reference path sequence as a planned trajectory map containing multiple nodes, wherein each node corresponds to an interface identifier and its parameter value, as well as the semantic parameter text of the corresponding parameter value. It can be understood that the parameter value of the current node can be replaced with its corresponding semantic parameter text, and the two have the same or equivalent semantics.
[0080] In this embodiment, the order of nodes in the planned trajectory map (from the root node to the leaf node) corresponds to the interface call order indicated by the reference path sequence. When the interface corresponding to the current node is called, its return value is used as the parameter value of the branch node of the current node.
[0081] Based on this, the evaluation system's process of performing multidimensional evaluation on the predicted path sequence based on a preset reference path sequence and its multidimensional constraint information can also be understood as follows: the evaluation system inputs the predicted path sequence, the preset reference path sequence, and their multidimensional constraint information into the evaluation model. The evaluation model then calculates the matching degree between the predicted path sequence and the planned trajectory map in terms of path integrity, node matching, sequence dependency, and parameter semantic consistency, thereby obtaining the multidimensional evaluation result. This approach not only solves the problem of semantic fuzzy matching that traditional evaluation methods based on abstract syntax trees cannot perform, but also improves evaluation efficiency by utilizing a tree structure for rapid matching.
[0082] Of course, the above evaluation methods are only illustrative examples and are not limited to these in practical applications.
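The planned trajectory map of paragraphs [0079] to [0081] can be pictured with the following sketch. It is an assumption-laden illustration rather than the structure used by this application: `Node`, `build_trajectory`, `node_matches`, and `matched_depth` are hypothetical names, the semantic parameter text is stored as a list of equivalent strings per parameter, and a parameter is accepted if it equals either the reference value or one of those equivalent texts.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    interface_id: str
    params: Dict[str, Any]
    # Parameter name -> texts treated as semantically equivalent to the value.
    semantic_texts: Dict[str, List[str]] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def build_trajectory(reference):
    """Chain the reference steps from root to leaf in call order."""
    root, prev = None, None
    for step in reference:
        node = Node(step["interface_id"], step["params"], step.get("semantic_texts", {}))
        if prev is None:
            root = node
        else:
            prev.children.append(node)
        prev = node
    return root


def node_matches(node, step):
    if node.interface_id != step["interface_id"]:
        return False
    if node.params.keys() != step["params"].keys():
        return False
    return all(
        value == node.params[name] or value in node.semantic_texts.get(name, ())
        for name, value in step["params"].items()
    )


def matched_depth(root, predicted):
    """Count how many predicted steps follow the trajectory from the root."""
    candidates = [root] if root is not None else []
    depth = 0
    for step in predicted:
        node = next((c for c in candidates if node_matches(c, step)), None)
        if node is None:
            break
        depth += 1
        candidates = node.children
    return depth


# Hypothetical sample data.
reference = [
    {
        "interface_id": "search_doctor",
        "params": {"department": "cardiology"},
        "semantic_texts": {"department": ["heart department"]},
    },
    {"interface_id": "book_appointment", "params": {"doctor_id": "D-1"}},
]
root = build_trajectory(reference)
predicted = [
    {"interface_id": "search_doctor", "params": {"department": "heart department"}},
    {"interface_id": "book_appointment", "params": {"doctor_id": "D-1"}},
]
depth = matched_depth(root, predicted)
```

Because "heart department" is listed as semantic parameter text for "cardiology", the first step matches even though the literal parameter values differ, which is the kind of match an exact comparison would miss.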
[0083] Furthermore, in the above embodiments, the number of pieces of target query information input to the evaluation system is not limited. It can be understood that one or more pieces of target query information can be input to the evaluation system.
[0084] Optionally, when there is only one target query, the above evaluation method is executed once by the evaluation system. When there are multiple target queries, the above evaluation method is executed once by the evaluation system for each target query, and the execution processes corresponding to multiple target queries are executed in parallel.
[0085] Optionally, for concurrent execution scenarios, a preset number of concurrent processes can be set according to the processing performance of the model to be evaluated, in order to match the throughput of the model to be evaluated.
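One possible way to cap concurrency when several target queries are evaluated in parallel is a bounded worker pool. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; `evaluate_query` is a hypothetical placeholder for one full run of the evaluation method, and the value of `max_concurrency` is an assumption to be tuned to the throughput of the model to be evaluated.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_query(query):
    # Hypothetical placeholder for one complete evaluation run on one query.
    return {"query": query, "evaluated": True}


def evaluate_all(queries, max_concurrency=4, evaluate_one=evaluate_query):
    """Evaluate each query once, running at most `max_concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in the same order as the input queries.
        return list(pool.map(evaluate_one, queries))


results = evaluate_all(["query A", "query B", "query C"], max_concurrency=2)
```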
[0086] Based on the above, after obtaining the multidimensional evaluation results corresponding to the predicted path sequence through the evaluation model, the planning accuracy of the model to be evaluated in each dimension can be analyzed based on the multidimensional evaluation results. Then, if it is determined that the model to be evaluated needs to be optimized, targeted optimization suggestions can be generated.
[0087] Furthermore, based on this optimization suggestion information, the prompt template of the model to be evaluated can be adjusted or the model parameters can be fine-tuned to improve the planning accuracy of the model to be evaluated. For example, the optimization suggestion information can be provided to the R&D team so that they can optimize the model to be evaluated.
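As a rough illustration of how per-dimension results might be turned into optimization suggestion information, the sketch below maps each low-scoring dimension to a piece of advice. The dimension keys, the threshold, and the wording of every suggestion are assumptions made for this example and are not taken from this application.

```python
# Hypothetical suggestion texts, one per evaluation dimension.
SUGGESTIONS = {
    "path_integrity": "Steps are missing: list the required interfaces in the prompt template.",
    "planning_accuracy": "Redundant steps found: instruct the model to call only necessary interfaces.",
    "execution_order": "Steps are out of order: describe interface dependencies in the prompt template.",
    "parameter_semantic_consistency": "Parameter values deviate: add parameter examples or fine-tune on them.",
}


def suggest(result, threshold=1.0):
    """Return one suggestion for every dimension scoring below the threshold."""
    return [
        SUGGESTIONS[dimension]
        for dimension, score in result.items()
        if dimension in SUGGESTIONS and score < threshold
    ]


advice = suggest({
    "path_integrity": 1,
    "planning_accuracy": 0,
    "execution_order": 0,
    "parameter_semantic_consistency": 1,
})
```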
[0088] Of course, the multi-dimensional evaluation results output by the evaluation model can also be used to formulate evaluation strategies for similar models, or to conduct data analysis and operational risk control on the production data of similar models used online, etc. The specific usage can be selected according to actual needs, which will not be elaborated here.
[0089] In this embodiment, a test interface library highly consistent with the online environment is constructed, and the model to be evaluated is driven to interact with the interface library in a dynamic context state for multiple rounds, generating a predicted path sequence corresponding to the multiple rounds of interaction. This realistically reproduces the multi-step planning behavior of a large online model when performing complex tasks. Furthermore, based on a preset reference path sequence and multi-dimensional constraint information covering path integrity, planning accuracy, execution order, and parameter semantic consistency, the generated predicted path sequence is evaluated in a fine-grained and structured manner, avoiding the misjudgment problem caused by the single evaluation dimension or reliance on precise matching in traditional evaluation methods.
[0090] In summary, the method for evaluating the accuracy of large-scale model planning provided in this application not only accurately identifies specific defects in the model during the planning process, such as missing steps, disordered sequences, and parameter expression deviations, but also improves the comprehensiveness, robustness, and interpretability of the evaluation by aligning judgments at the semantic level. Furthermore, based on the multi-dimensional evaluation results, targeted optimization suggestions can be generated to guide downstream processes in adjusting prompt templates or fine-tuning the model, forming a closed loop of "evaluation, feedback, and optimization." This enhances the interface calling and multi-step inference capabilities of large-scale models in real-world application scenarios, thereby improving system quality.
[0091] It is understandable that the execution subject of each step in the above method can be the same device, or the method can be executed by different devices.
[0092] Furthermore, in some of the processes described in the above embodiments and figures, there are multiple operations that appear in a specific order. However, it should be clearly understood that these operations may not be executed in the order they appear in this document or may be executed in parallel. The sequence numbers of the operations, such as S102, S104, etc., are only used to distinguish the different operations, and the sequence numbers themselves do not limit the execution order.
[0093] In addition, these processes can include more or fewer operations, and these operations can be performed sequentially or in parallel.
[0094] It should be noted that the above embodiments are merely examples, and modifications can be made to the above embodiments in actual implementation. Those skilled in the art will understand that any modifications to the above embodiments that do not require creative effort fall within the protection scope of the embodiments in this specification, and will not be described in detail in the embodiments.
[0095] All the above-mentioned optional technical solutions can be referenced or combined with each other to form optional embodiments of this specification, and will not be described in detail here.
[0096] Based on the same inventive concept, embodiments of this application also provide a device for evaluating the accuracy of large-scale model planning. Figure 3 shows a schematic diagram of the evaluation device 300.
[0097] As shown in Figure 3, the device 300 may include an acquisition module 301, a scheduling module 302, and an evaluation module 303. The acquisition module 301 is used to acquire target query information, which is used to instruct the online large model to perform target planning operations. The scheduling module 302 is used to drive the model to be evaluated to interact with the test interface library in multiple rounds based on the target query information and obtain a predicted path sequence to simulate the multi-step planning behavior of the online large model in performing target planning operations. The evaluation module 303 is used to perform multi-dimensional evaluation on the predicted path sequence based on a preset reference path sequence and its multi-dimensional constraint information to obtain a multi-dimensional evaluation result.
[0098] In an optional embodiment, before acquiring the target query information, the acquisition module 301 is further configured to: configure a reference path sequence and its multi-dimensional constraint information for performing target planning operations on the online large model according to the preset planning requirement information. The multi-dimensional constraint information includes path integrity constraint information, planning accuracy constraint information, execution order constraint information, and parameter semantic constraint information.
[0099] In an optional embodiment, when driving the model to be evaluated to perform multiple rounds of interaction with the test interface library based on the target query information and obtaining a predicted path sequence, the scheduling module 302 is specifically configured to: input the target query information as the initial context into the model to be evaluated, so that the model to be evaluated generates the interface identifier and its parameter values to be called in the first round of interaction; call the corresponding test interface in the test interface library according to the interface identifier and its parameter values to obtain the corresponding return value; update the context state with the return value and use it as the input for the next round of interaction, repeating the above process until a preset termination condition is met; and construct a structured predicted path sequence based on the interface identifier, corresponding parameter values, return values, and calling order generated in each round of interaction.
[0100] In one optional embodiment, the current context state includes target query information and a sequence of historically invoked interfaces, wherein the interface sequence includes the interface identifier, parameter values, and corresponding return values of each interface.
[0101] In one optional embodiment, each test interface in the test interface library has the same interface identifier as the corresponding online interface, and the parameter values and return values are consistent in data structure and semantics.
[0102] In an optional embodiment, when performing a multidimensional evaluation on the predicted path sequence based on a preset reference path sequence and its multidimensional constraint information to obtain a multidimensional evaluation result, the evaluation module 303 is specifically configured to: input the predicted path sequence, the preset reference path sequence and its multidimensional constraint information into the evaluation model, so as to calculate the matching degree between the predicted path sequence and the reference path sequence in terms of path integrity, planning accuracy, execution order and parameter semantic consistency through the evaluation model, thereby obtaining a multidimensional evaluation result.
[0103] In an optional embodiment, the reference path sequence is represented as a planned trajectory map containing multiple nodes, each node corresponding to an interface identifier and its parameter value, as well as the semantic parameter text of the corresponding parameter value. When performing multidimensional evaluation on the predicted path sequence based on the preset reference path sequence and its multidimensional constraint information to obtain a multidimensional evaluation result, the evaluation module 303 is specifically configured to: input the predicted path sequence, the preset reference path sequence and its multidimensional constraint information into the evaluation model, so as to calculate the matching degree between the predicted path sequence and the planned trajectory map in terms of path integrity, node matching, sequence dependency and parameter semantic consistency through the evaluation model, and obtain the multidimensional evaluation result.
[0104] In an optional embodiment, the scheduling module 302 is further configured to: generate optimization suggestion information based on the multidimensional evaluation results; adjust the prompt template of the model to be evaluated or fine-tune the model parameters based on the optimization suggestion information to improve the planning accuracy of the model to be evaluated.
[0105] It should be noted that since the principle by which this evaluation device solves the problem is similar to that of the aforementioned evaluation method, the implementation of this evaluation device can be found in the corresponding section of the aforementioned evaluation method implementation description, and the details are not repeated here.
[0106] Based on the same inventive concept, this application also provides an electronic device. Referring to Figure 4, Figure 4 is a structural block diagram of an electronic device provided in an embodiment of this application.
[0107] As shown in Figure 4, the electronic device 400 may include a processor 401, a memory 402, and a program or instructions stored in the memory 402 and executable on the processor 401. When the program or instructions are executed by the processor 401, they implement the various processes of the above-described evaluation method embodiments and achieve the same technical effects. To avoid repetition, they will not be described again here.
[0108] It should be noted that the electronic devices in the embodiments of this application include mobile electronic devices and non-mobile electronic devices.
[0109] This application also provides a computer-readable storage medium storing a computer program thereon. When the computer program is executed by a processor, it implements the various processes of the above-described evaluation method embodiments and achieves the same technical effect. To avoid repetition, it will not be described again here.
[0110] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.
[0111] This specification is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0112] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device, which implements the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0113] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0114] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on its differences from other embodiments. In particular, the apparatus embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the description of the method embodiments. In this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Those skilled in the art can understand the specific meaning of the above terms in the embodiments of this specification according to the specific circumstances.
[0115] It should be noted that, unless otherwise specified, the embodiments and features described in this specification can be combined with each other. This specification is not limited to any single aspect, nor to any single embodiment, nor to any combination and / or substitution of these aspects and / or embodiments. Moreover, each aspect and / or embodiment of this specification can be used alone or in combination with one or more other aspects and / or embodiments thereof.
[0116] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this specification, and are not intended to limit them. Although this specification has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this specification, and they should all be covered within the scope of this specification.
Claims
1. A method for evaluating the accuracy of large-scale model planning, applied to an evaluation system, characterized in that the method includes: obtaining target query information, which is used to instruct the online large model to perform a target planning operation; based on the target query information, driving the model to be evaluated to interact with the test interface library in multiple rounds to obtain a predicted path sequence, so as to simulate the multi-step planning behavior of the online large model in performing the target planning operation; and, based on a preset reference path sequence and its multidimensional constraint information, evaluating the predicted path sequence in multiple dimensions to obtain a multidimensional evaluation result.
2. The method according to claim 1, characterized in that, before obtaining the target query information, the method also includes: configuring, according to preset planning requirement information, a reference path sequence and its multi-dimensional constraint information for the online large model to perform the target planning operation, wherein the multi-dimensional constraint information includes path integrity constraint information, planning accuracy constraint information, execution order constraint information, and parameter semantic constraint information.
3. The method according to claim 1, characterized in that, Based on the target query information, the model to be evaluated is driven to interact with the test interface library in multiple rounds to obtain a predicted path sequence, including: The target query information is used as the initial context input to the model to be evaluated, so as to generate the interface identifier and its parameter value to be called in the first round of interaction through the model to be evaluated; The evaluation system calls the corresponding test interface in the test interface library according to the interface identifier and its parameter values to obtain the corresponding return value; The return value is updated to the context state and used as the input for the next round of interaction. The above process is repeated until the preset termination condition is met. Based on the interface identifiers generated in each round of interaction, the corresponding parameter values, return values, and call order, a structured predicted path sequence is constructed.
4. The method according to claim 3, characterized in that, The current context state includes the target query information and the historical sequence of called interfaces, wherein the interface sequence includes the interface identifier, parameter value and corresponding return value of each interface.
5. The method according to claim 4, characterized in that, Each test interface in the test interface library has the same interface identifier as its corresponding online interface, and the parameter values and return values are consistent in data structure and semantics.
6. The method according to any one of claims 1-5, characterized in that, Based on a preset reference path sequence and its multidimensional constraint information, the predicted path sequence is evaluated in multiple dimensions to obtain multidimensional evaluation results, including: The predicted path sequence, the preset reference path sequence, and their multidimensional constraint information are input into the evaluation model. The evaluation model calculates the matching degree between the predicted path sequence and the reference path sequence in terms of path integrity, planning accuracy, execution order, and parameter semantic consistency, thereby obtaining a multidimensional evaluation result.
7. The method according to any one of claims 1-5, characterized in that, The reference path sequence is represented as a planned trajectory map containing multiple nodes, each node corresponding to an interface identifier and its parameter value, as well as the semantic parameter text of the corresponding parameter value; Based on a preset reference path sequence and its multidimensional constraint information, the predicted path sequence is evaluated in multiple dimensions to obtain multidimensional evaluation results, including: The predicted path sequence, the preset reference path sequence, and their multidimensional constraint information are input into the evaluation model. The evaluation model is used to calculate the matching degree between the predicted path sequence and the planned trajectory map in terms of path integrity, node matching, sequence dependency, and parameter semantic consistency, so as to obtain the multidimensional evaluation result.
8. The method according to any one of claims 1-5, characterized in that the method also includes: generating optimization suggestion information based on the multidimensional evaluation results; and, based on the optimization suggestion information, adjusting the prompt template of the model to be evaluated or fine-tuning the model parameters to improve the planning accuracy of the model to be evaluated.
9. A device for evaluating the accuracy of large-scale model planning, characterized in that the device includes: an acquisition module, used to acquire target query information, which is used to instruct the online large model to perform a target planning operation; a scheduling module, used to drive the model to be evaluated to interact with the test interface library in multiple rounds based on the target query information and obtain a predicted path sequence, so as to simulate the multi-step planning behavior of the online large model in performing the target planning operation; and an evaluation module, used to perform multidimensional evaluation on the predicted path sequence based on a preset reference path sequence and its multidimensional constraint information, and obtain a multidimensional evaluation result.
10. An electronic device, characterized in that, The electronic device includes: Memory, used to store computer program products; A processor is configured to execute a computer program product stored in the memory, wherein, when the computer program product is executed, it implements the method described in any one of claims 1-8.
11. A computer-readable storage medium storing a computer program, characterized in that, The computer program is configured to implement the method of any one of claims 1-8 when executed by a processor.