Task execution method and device based on language model, equipment and storage medium

By using a language model-based task execution method, multiple inference trajectories are generated and autonomously planned, solving the problems of flexibility and reliability in task orchestration in complex service systems, and achieving efficient task execution and system adaptability.

CN122195589A · Pending · Publication Date: 2026-06-12 · HANGZHOU ALIBABA INT INTERNET IND CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
HANGZHOU ALIBABA INT INTERNET IND CO LTD
Filing Date
2026-01-19
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve autonomous, flexible, efficient, and reliable task orchestration in complex, dynamic, and highly collaborative service systems. In particular, when faced with fuzzy inputs, cross-domain knowledge, and dynamic external disturbances, traditional automation solutions lack high-level semantic understanding and autonomous planning capabilities, leading to execution failures.

Method used

By employing a language model-based approach, multiple inference paths are generated through iterative reasoning. Combined with task descriptions and tool descriptions, the system autonomously plans execution paths and performs fault-tolerant recovery in the event of anomalies, thereby achieving autonomy, flexibility, and reliability in task execution.

Benefits of technology

It enables autonomous and flexible task orchestration and execution in complex tasks, reduces the complexity of post-inference processing, improves the efficiency and reliability of task execution, and enhances flexibility and fault tolerance when system interface changes.

✦ Generated by Eureka AI based on patent content.

Figure CN122195589A_ABST

Abstract

One or more embodiments of this application provide a language model-based task execution method, apparatus, device, and storage medium. The method comprises: inputting a task description of a target task and a tool description of available tools into a language model; executing the following steps iteratively until an iteration termination condition is met, so that the language model generates multiple reasoning trajectories leading from the task description and the tool description to a task execution result of the target task; and determining a target task execution result from the task execution results contained in the multiple reasoning trajectories and taking it as the task execution result of the target task. The iterated steps are: the language model reasons, based on a context containing the task description, the tool description, and the execution results of operations already executed, to determine an operation to be executed and the available tool corresponding to that operation; and the corresponding available tool is called to execute the operation, its execution result is obtained, and the context is updated based on that result.

Description

Technical Field

[0001] One or more embodiments of this application relate to the field of artificial intelligence, and more particularly to a task execution method, apparatus, device, and storage medium based on a language model.

Background Technology

[0002] In numerous application scenarios such as enterprise information systems, intelligent operation and maintenance, financial services, healthcare, and intelligent manufacturing, there is an increasing reliance on complex, dynamic, and highly collaborative service systems. This necessitates achieving a balance between efficiency, cost, and robustness while meeting service quality and compliance requirements. Correspondingly, user-initiated tasks are often complex, multi-stage, and cross-system dependent, making them impossible to complete with a single operation or tool call. Instead, the overall goal must be decomposed into a series of logically coherent sub-goals, and multiple heterogeneous systems must be invoked collaboratively as needed. This process is known as task orchestration, which focuses on dynamically planning execution paths based on task semantics, coordinating tool call order, handling intermediate result feedback, and providing fault tolerance or replanning in case of anomalies. In practical applications, achieving autonomous, flexible, efficient, and reliable task orchestration has become a significant challenge.

Summary of the Invention

[0003] One or more embodiments of this application provide the following technical solutions. This application provides a task execution method based on a language model, the method comprising: inputting the task description of a target task and the tool description of available tools into the language model; iteratively executing the following steps until an iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task; and determining a target task execution result from the task execution results of the target task contained in the multiple inference trajectories, and taking the target task execution result as the task execution result of the target task. The iterated steps are: the language model performs reasoning based on a context including the task description, the tool description, and the execution results of the executed operations, to determine the operation to be executed and the available tool corresponding to the operation; and the available tool corresponding to the operation is invoked to perform the operation, the execution result is obtained, and the context is updated based on the execution result.

[0004] This application also provides a language model-based task execution device, the device comprising: an inference module, which inputs the task description of the target task and the tool description of the available tools into the language model and iteratively executes the following steps until the iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task: the language model performs reasoning based on a context including the task description, the tool description, and the execution results of the executed operations to determine the operation to be executed and the available tool corresponding to the operation; and the available tool corresponding to the operation is invoked to execute the operation, the execution result is obtained, and the context is updated based on the execution result; and a determination module, which determines the target task execution result from the task execution results of the target task contained in the multiple inference trajectories and takes it as the task execution result of the target task.

[0005] This application also provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor executes the executable instructions to implement the steps of the method as described in any of the preceding descriptions.

[0006] This application also provides a computer-readable storage medium having computer instructions stored thereon, which, when executed by a processor, implement the steps of the method as described in any of the preceding claims.

[0007] In the above technical solution, a task description and a tool description of available tools for a given task can be input into a language model. The language model then generates multiple inference paths that deduce the task execution result from the task description and tool description. The target task execution result can then be determined from the task execution results contained in these multiple inference paths, and this target task execution result is identified as the task execution result of the given task. The process of generating an inference path by the language model can include iteratively executing the following steps until an iteration termination condition is met: the language model infers based on a context containing the task description, the tool description, and the execution results of already executed operations to determine the operation to be executed and the available tool corresponding to that operation; the available tool corresponding to the operation is invoked to execute the operation, obtaining the execution result, and the context is updated based on the execution result.

[0008] Using the above approach, on the one hand, inference trajectories can be generated based on task descriptions and tool descriptions, completing the entire process from understanding the task intent to calling the tool to execute the task, thus achieving autonomous and flexible task orchestration and execution; on the other hand, multiple inference trajectories can be generated independently based on the same task description and tool description, and the final task execution result can be determined from the task execution results contained in the multiple inference trajectories, without the need to align the intermediate inference steps in different inference trajectories. This preserves the exploration diversity and result consistency of the inference trajectories, while reducing the complexity of post-inference processing, thereby achieving efficient and reliable task orchestration and execution.

Attached Figure Description

[0009] The accompanying drawings used in describing the exemplary embodiments are explained as follows. Figure 1 is a schematic diagram illustrating a service system according to an exemplary embodiment of this application.

[0010] Figure 2 is a flowchart illustrating a language model-based task execution method according to an exemplary embodiment of this application.

[0011] Figure 3 is a schematic structural diagram of a device according to an exemplary embodiment of this application.

[0012] Figure 4 is a block diagram illustrating a language model-based task execution device according to an exemplary embodiment of this application.

Detailed Implementation

[0013] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with one or more embodiments of this application. Rather, they are merely examples consistent with some aspects of one or more embodiments of this application.

[0014] It should be noted that the steps of the corresponding methods are not necessarily performed in the order shown and described in this application in other embodiments. In some other embodiments, the methods may include more or fewer steps than those described in this application. Furthermore, a single step described in this application may be broken down into multiple steps in other embodiments; and multiple steps described in this application may be combined into a single step in other embodiments.

[0015] In numerous application scenarios such as enterprise information systems, intelligent operation and maintenance, financial services, healthcare, and intelligent manufacturing, there is an increasing reliance on complex, dynamic, and highly collaborative service systems. This requires balancing efficiency, cost, and robustness while meeting service quality and compliance requirements. Correspondingly, user-initiated tasks are often complex, multi-stage, and cross-system dependent. Tasks such as "analyzing customer churn and developing retention strategies," "handling sudden equipment failures and restoring production line operation," or "assessing loan application risks and generating compliance reports" cannot be completed through a single operation or a single tool call (e.g., database queries, APIs, functions). Instead, the overall goal must be decomposed into a series of logically coherent sub-goals, and multiple heterogeneous systems (e.g., databases, APIs, analytics engines, approval processes) must be called collaboratively as needed. This process is known as task orchestration, the core of which lies in dynamically planning execution paths based on task semantics, coordinating the order of tool calls, handling intermediate result feedback, and providing fault tolerance or replanning in case of anomalies.

[0016] The importance of task orchestration stems from the complexity and dynamism of service systems. On the one hand, data and functions are scattered across multiple independent systems, lacking a unified data model and semantic interface, forming information silos. On the other hand, tasks in practical applications often involve fuzzy inputs, conditional branches (e.g., if A, execute B; otherwise, try C), multi-objective trade-offs (e.g., trade-offs between cost, timeliness, and risk), and dynamic external disturbances (e.g., market changes, policy updates), requiring execution mechanisms to have high flexibility, adaptability, and robustness.

[0017] Traditional automation solutions typically rely on predefined workflow engines, hard-coded scripts, or rule-based decision trees for task orchestration. While these methods are stable and auditable in well-structured, fixed-boundary scenarios, they have fundamental limitations: they cannot handle task variations that are not explicitly modeled, struggle to adapt to system interface changes, or understand open-ended requests expressed in natural language. Especially when task objectives are vaguely stated (e.g., "Help me figure out why order fulfillment rates have recently declined"), involve cross-domain knowledge (e.g., combining financial data with logistics status), or require causal reasoning, traditional automation solutions often fail due to a lack of high-level semantic understanding and autonomous planning capabilities.

[0018] Supply Chain Management (SCM) serves as a prime example of task orchestration requirements. A modern supply chain is a complex service network encompassing global sourcing, multi-level production, distributed warehousing, cross-border logistics, and end-user sales. Its supporting systems typically include multiple heterogeneous systems such as ERP (Enterprise Resource Planning), WMS (Warehouse Management System), TMS (Transportation Management System), demand forecasting platforms, and supplier collaboration portals. These systems exhibit significant differences in data formats, update frequencies, and access permissions. In this context, a seemingly simple task actually involves complex multi-hop reasoning and dynamic orchestration logic.

[0019] Taking the task of "addressing the impact of a critical material delivery delay on this month's shipments" as an example: First, it's necessary to check the current inventory level, in-transit order status, and safety stock threshold for the material. Second, assess the impact of the delay on downstream production orders and simulate different alternative solutions such as activating alternative suppliers and adjusting production priorities. Next, it may be necessary to call the demand forecasting model to recalculate the fulfillment probability of customer orders and decide whether to postpone delivery based on customer level. If emergency procurement is required, trigger the sourcing process, compare prices, and verify the certification status of new suppliers. At the same time, it's also necessary to notify the logistics team to re-optimize delivery routes and update the estimated delivery time in the customer service system. Finally, integrate all the analysis results to generate a comprehensive report including the scope of impact, countermeasures, and risk warnings for management decision-making. The entire process involves more than ten cross-system tool calls and includes a large number of conditional judgments, data fusion, and feedback loops. Failure or deviation at any step may trigger a chain reaction, leading to global decision-making errors.

[0020] Language models (LMs) are well suited to serve as the core of intelligent task orchestration, owing to their strong semantic understanding and intent parsing, rich domain knowledge, flexible program generation and planning, ability to fuse heterogeneous multimodal information, and capacity for interactive iteration and adaptation to feedback. These capabilities make them particularly suitable for the autonomous orchestration of complex tasks.

[0021] However, current language model-based intelligent task orchestration frameworks still have several problems: (1) they rely on manually written standard operating procedures (SOPs), that is, most systems must be given task templates or a legal action space in advance, so they cannot generalize to composite anomalies or new service scenarios; (2) inference trajectories are fragile and uncontrollable; (3) effective fault tolerance and backtracking mechanisms are lacking; and (4) tool calls are disconnected from service semantics, so the model may misuse an interface because it misunderstands professional terms such as "lead time" or "order fulfillment rate," causing execution to fail.

[0022] Furthermore, methods that improve the accuracy of a model's invocation on a specific toolset through instruction fine-tuning or reinforcement learning usually require a large amount of high-quality labeled data, and once the underlying system interface changes (e.g., API field adjustments, permission policy updates), the model needs to be retrained, resulting in high maintenance costs.

[0023] One or more embodiments of this application provide a technical solution for implementing task execution based on a language model. The task orchestration and execution mechanism in this technical solution does not rely on a preset workflow or SOP, can autonomously perform multi-step logical reasoning based on natural language task description, dynamically generate and execute tool call sequences, and has error detection and recovery capabilities at runtime.

[0024] In the above technical solution, a task description and a tool description of available tools for a given task can be input into a language model. The language model then generates multiple inference paths that deduce the task execution result from the task description and tool description. The target task execution result can then be determined from the task execution results contained in these multiple inference paths, and this target task execution result is identified as the task execution result of the given task. The process of generating an inference path by the language model can include iteratively executing the following steps until an iteration termination condition is met: the language model infers based on a context containing the task description, the tool description, and the execution results of already executed operations to determine the operation to be executed and the available tool corresponding to that operation; the available tool corresponding to the operation is invoked to execute the operation, obtaining the execution result, and the context is updated based on the execution result.

[0025] Using the above approach, on the one hand, inference trajectories can be generated based on task descriptions and tool descriptions, completing the entire process from understanding the task intent to calling the tool to execute the task, thus achieving autonomous and flexible task orchestration and execution; on the other hand, multiple inference trajectories can be generated independently based on the same task description and tool description, and the final task execution result can be determined from the task execution results contained in the multiple inference trajectories, without the need to align the intermediate inference steps in different inference trajectories. This preserves the exploration diversity and result consistency of the inference trajectories, while reducing the complexity of post-inference processing, thereby achieving efficient and reliable task orchestration and execution.
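The application does not state the rule for choosing the target result from the trajectories. As one plausible rule, borrowed from self-consistency decoding rather than taken from this application, the sketch below takes a majority vote over the trajectories' final results and never compares their intermediate steps. `Trajectory`, `select_target_result`, and the sample results are hypothetical, for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    """One inference trajectory (hypothetical structure for illustration)."""
    # (operation, tool name, execution result) for each executed step
    steps: List[Tuple[str, str, str]] = field(default_factory=list)
    # final task execution result; None if the trajectory never finished
    final_result: Optional[str] = None


def select_target_result(trajectories: List[Trajectory]) -> Optional[str]:
    """Majority vote over final results; intermediate steps are never compared."""
    results = [t.final_result for t in trajectories if t.final_result is not None]
    if not results:
        return None
    # most_common(1) -> [(result, count)]; ties resolve to the result seen first
    return Counter(results).most_common(1)[0][0]


runs = [
    Trajectory(final_result="postpone order 17"),
    Trajectory(final_result="switch supplier"),
    Trajectory(final_result="postpone order 17"),
    Trajectory(final_result=None),  # e.g. stopped at the step limit
]
print(select_target_result(runs))  # postpone order 17
```

Because only final results are compared, trajectories that reach the same answer through different tool-call sequences still count as agreeing, which is what allows the intermediate steps to stay unaligned.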

[0026] Please refer to Figure 1, which is a schematic diagram of a service system provided in an exemplary embodiment.

[0027] As shown in Figure 1, the above service system may include a server and at least one client that accesses the server via any type of wired or wireless network.

[0028] The aforementioned server can correspond to a server containing a single physical host, or a server cluster consisting of multiple independent physical hosts; alternatively, it can correspond to a virtual server, cloud server, etc., hosted by a host cluster.

[0029] The aforementioned client can correspond to terminal devices such as smartphones, tablets, laptops, desktop computers, PCs (Personal Computers), PDAs (Personal Digital Assistants), wearable devices (e.g., smart glasses, smartwatches), smart in-vehicle devices, or game consoles.

[0030] In practical applications, the client can upload information about the task to be executed and information about the tools that can be invoked to the server, and the server can then execute the specific steps of the method proposed in this application.

[0031] In some embodiments, the server may be equipped with a language model, which can work together with various functional components or subsystems on the server to orchestrate and execute tasks.

[0032] A language model is a natural language processing model based on deep learning technology, possessing powerful language understanding and generation capabilities. A language model typically refers to a deep learning model trained on large amounts of text data, which can be used to understand the meaning of natural language text or generate natural language text. Language models can handle various natural language tasks, such as text classification, named entity recognition (NER), question answering, and dialogue, and are an important pathway to artificial intelligence.

[0033] In the field of Natural Language Processing (NLP), large-scale text datasets are often referred to as corpora. Corpora can contain various types of text data, such as literary works, academic papers, legal documents, news reports, everyday conversations, emails, and online forum posts. By learning from the text data in corpora, language models can acquire and understand the rules and patterns of natural language, thereby achieving effective processing and generation of human language.

[0034] Language models typically employ the Transformer architecture; that is, language models are usually deep learning models based on the Transformer architecture. Deep learning models based on the Transformer architecture are a class of neural network models that utilize the Transformer architecture, and these models perform exceptionally well in fields such as natural language processing.

[0035] The Transformer is a neural network model used for sequence-to-sequence modeling. It does not rely on recurrent structures, which enables parallel training and inference and thus speeds up model processing. Deep learning models based on the Transformer architecture typically use multi-layered Transformer encoders to extract features from the input sequence and a Transformer decoder to transform the extracted features into an output sequence. These models also employ self-attention mechanisms to capture long-range dependencies in the input sequence, and residual connections and normalization methods to accelerate training and improve model performance.
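For reference, in the original Transformer formulation the self-attention mentioned above is scaled dot-product attention, where $Q$, $K$ and $V$ are the query, key and value matrices projected from the input sequence and $d_k$ is the key dimension. Each sublayer is wrapped with a residual connection and layer normalization as shown on the right (the original post-normalization arrangement; many current models normalize before the sublayer instead).

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
x_{\text{out}} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
```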

[0036] A pre-trained model is a language model pre-trained on large-scale unlabeled text data. Pre-trained models are general-purpose models; they are not designed or optimized for specific tasks. To adapt pre-trained models to specific application scenarios and task requirements, fine-tuning is needed to improve the model's performance on specific tasks. The final language model deployed is usually a model that has undergone further fine-tuning based on the pre-trained model, using supervised learning on labeled text data. Pre-training and fine-tuning are complementary processes; pre-training enables the model to possess broad language understanding capabilities, while fine-tuning makes the model more specialized and accurate for specific tasks.

[0037] In other words, the training process of a language model can be divided into two stages: pre-training and fine-tuning. In the pre-training stage, unsupervised learning (e.g., self-supervised learning) can be used to pre-train on large-scale, unlabeled text datasets (e.g., online encyclopedias, online articles, books, etc.). Specifically, the model learns to predict missing parts of the text or the next word from context, thereby acquiring semantic, syntactic, and other statistical regularities and language structures; the prediction loss is minimized through backpropagation and an optimization algorithm (e.g., gradient descent), iteratively updating the model parameters and gradually improving the model's ability to understand language. During the fine-tuning phase, a suitable supervised learning task (e.g., text classification, named entity recognition, question answering systems, dialogue systems, etc.) can be selected based on the specific application scenario and task requirements. A task-specific text dataset is prepared, the pre-trained model is used as the starting point, and supervised learning is performed on this dataset: backpropagation and an optimization algorithm (e.g., gradient descent) minimize a loss that measures the model's performance on the specific task, iteratively updating the model parameters to gradually improve its performance. In practical applications, fine-tuning can flexibly choose supervised, unsupervised, or semi-supervised learning methods based on the specific application scenario and the type of available data.
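As a worked statement of the "predict the next word" objective described above: for a training sequence of tokens $x_1, \dots, x_T$, pre-training typically minimizes the negative log-likelihood below (for the "missing parts" variant, the sum runs over the masked positions instead), and each optimization step moves the parameters $\theta$ against the gradient with learning rate $\eta$. Plain gradient descent is shown; practical training usually uses adaptive variants. Fine-tuning reuses the same update with a task-specific loss in place of $\mathcal{L}$.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)
```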

[0038] The language comprehension ability learned by a language model during the pre-training and fine-tuning phases enables it to perform logical inference, knowledge reasoning, or problem-solving by understanding, analyzing, and synthesizing textual information when faced with complex problems or tasks. This ability is often referred to as the reasoning ability of a language model.

[0039] In practical applications, the pre-trained language model is usually referred to as the base model of the language model, and the fine-tuned language model is referred to as the service model of the language model.

[0040] Language models typically perform specific tasks under the guidance of prompts. A prompt can be an initial text or text fragment provided to the language model, such as a sentence, a question, or a dialogue, designed to guide or stimulate the model to produce the corresponding output. Prompts are key tools for guiding the model's output and can be very simple or quite complex, including instructions, examples, and descriptions of the expected output format. Prompts explicitly tell the language model what task is expected of it, such as answering a question, simulating a dialogue, writing an article, or translating text. Simultaneously, prompts provide the language model with necessary background information and context, enabling it to understand the logic, style, theme, or stance that should be followed when generating content. Furthermore, prompts can stimulate the language model to demonstrate its inherent knowledge or specific language abilities, such as explaining complex concepts, citing rules, or mimicking a particular author's writing style.

[0041] Since language models are primarily used for understanding and generating human language through text processing, prompts are usually presented in text form. However, in practical applications, language models can also accept other forms of input as prompts, such as images, audio, or even video, provided that the language model is designed or trained to process multimodal data (e.g., text, images, audio, video, etc.).

[0042] In practical applications, the server may host a single language model that is used to perform a variety of different tasks, or multiple language models, each capable of performing one or more specific tasks.

[0043] It should be noted that the aforementioned server can also host various functional components or subsystems, such as prompt generation components and tool invocation components. These components or subsystems can work in conjunction with the language model hosted on the server to jointly achieve intelligent task orchestration and execution.

[0044] To improve the adaptability and response accuracy of the service system, the Retrieval-Augmented Generation (RAG) approach can be adopted, combining information retrieval and model generation. This allows the service system to answer questions not only by relying on the knowledge its language model gained from static corpora during training, but also by first retrieving information from a large document set based on the question, then understanding and answering the question based on the retrieved documents. In other words, the document set can be combined with the language model, and relevant information is retrieved from the document set in real time during generation to help the model produce more accurate and comprehensive answers or decisions. Because the generation process takes into account both the retrieved information and the context of the question, it helps ensure that the generated content meets actual needs and is accurate, reliable, coherent, and natural.

[0045] Specifically, the aforementioned server can also incorporate a knowledge base and an information retrieval component. This knowledge base is external to the language model on the server; that is, the data in this knowledge base is not knowledge learned by the language model during training, but rather serves as auxiliary information in the language model's reasoning process, assisting it in generating answers. During the language model's reasoning process, the information retrieval component can retrieve information from the knowledge base based on prompts, using the retrieved relevant information to assist the language model in generating answers.
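A minimal sketch of the retrieval step, assuming a small in-memory knowledge base of text passages and a crude word-overlap score in place of whatever retriever (for example, a vector index) the information retrieval component actually uses. `tokenize`, `retrieve`, `augment_prompt`, and the sample passages are hypothetical; the passages define terms that, per the background above, a model might otherwise misread.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, knowledge_base: List[str], k: int = 2) -> List[str]:
    """Return up to k passages ranked by how often they contain the query's terms."""
    query_terms = set(tokenize(query))
    scored = []
    for passage in knowledge_base:
        counts = Counter(tokenize(passage))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, passage))
    # Highest score first; the sort is stable, so ties keep knowledge-base order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def augment_prompt(query: str, knowledge_base: List[str], k: int = 2) -> str:
    """Put the retrieved passages in front of the question sent to the model."""
    passages = retrieve(query, knowledge_base, k)
    reference = "\n".join(f"- {p}" for p in passages)
    return f"Reference information:\n{reference}\n\nQuestion: {query}"


kb = [
    "Lead time is the interval between placing a purchase order and receiving the goods.",
    "Order fulfillment rate is the share of customer orders shipped complete and on time.",
    "Safety stock is extra inventory held to absorb demand or supply variability.",
]
print(augment_prompt("What does lead time mean?", kb, k=1))
```

The retrieved passages only enrich the prompt; the language model itself is unchanged, which matches the description of the knowledge base as auxiliary information external to the model.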

[0046] Please refer to Figure 2, which is a flowchart of a language model-based task execution method provided in an exemplary embodiment.

[0047] The above-mentioned language model-based task execution method can be applied to, for example, the service system shown in Figure 1, which includes a language model. Specifically, this method may include the following steps: Step 201: Input the task description of the target task and the tool description of the available tools into the language model, and iterate through the following steps until the iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task: the language model infers, based on the context containing the task description, the tool description, and the execution results of the executed operations, the operation to be executed and the available tool corresponding to the operation; the available tool corresponding to the operation is invoked to execute the operation, the execution result is obtained, and the context is updated based on the execution result.

[0048] In this embodiment, for any task that requires intelligent task orchestration and execution (referred to as the target task), the task description of the target task and the tool description (tool schema) of the available tools can be input into the language model, and the language model generates multiple inference trajectories from the task description and tool description to the task execution result of the target task.

[0049] In practical applications, a prompt generation component in the system can construct an appropriate prompt based on the task description and tool description and input it into the language model. Guided by the prompt, the language model generates multiple inference trajectories from the task description and tool description to the task execution result.

[0050] Generating an inference trajectory with the language model involves the intermediate steps by which the language model proceeds from the task description and tool description (the model input) to the task execution result (the model output). These intermediate steps can alternate between reasoning steps and operation steps. In a reasoning step, the language model reasons over the current context to determine the operation to be performed. In an operation step, the language model either instructs a tool invocation component in the system to invoke an appropriate available tool to perform the operation, or invokes that tool directly.

[0051] Specifically, the following steps can be executed iteratively until the iteration termination condition is met, so that the language model generates an inference trajectory from the task description and tool description to the task execution result. Reasoning step: the language model reasons based on the current context, which contains the task description, the tool description, and the execution results of the operations already executed, to determine the operation to be executed now and the available tool corresponding to that operation. Operation step: the available tool corresponding to the operation is invoked to execute the operation, the execution result is obtained, and the context is updated based on that result; that is, the new result is added to the execution results of the already executed operations.
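The reasoning/operation loop above can be sketched as follows. This is a minimal illustration, not the application's implementation: the `Decision` structure, the `run_trajectory` function, and the scripted `toy_model` are hypothetical stand-ins for a real language model and tool set, and the context is kept as a simple list of strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Decision:
    """Hypothetical output of one reasoning step."""
    tool: Optional[str]               # available tool to invoke, or None
    arguments: Dict[str, object]      # arguments for that tool
    final_answer: Optional[str] = None  # set once the model considers the task done


# A "model" is any callable that maps the current context to a Decision.
Model = Callable[[List[str]], Decision]


def run_trajectory(task: str, tool_desc: str, model: Model,
                   tools: Dict[str, Callable[..., object]],
                   max_iters: int = 10) -> Optional[str]:
    # Initial context: the task description plus the tool description.
    context: List[str] = [f"TASK: {task}", f"TOOLS: {tool_desc}"]
    for _ in range(max_iters):
        # Reasoning step: decide the next operation from the current context.
        decision = model(context)
        if decision.final_answer is not None:
            return decision.final_answer  # termination: task result generated
        # Operation step: invoke the chosen tool and add its result to the context.
        result = tools[decision.tool](**decision.arguments)
        context.append(f"RESULT {decision.tool}: {result}")
    return None  # termination: iteration limit reached without a result


def toy_model(context: List[str]) -> Decision:
    # Scripted stand-in: call "add" once, then answer with the observed result.
    results = [c for c in context if c.startswith("RESULT")]
    if not results:
        return Decision(tool="add", arguments={"a": 2, "b": 3})
    return Decision(tool=None, arguments={},
                    final_answer=results[-1].split(": ", 1)[1])
```

For example, `run_trajectory("add 2 and 3", "add(a, b)", toy_model, {"add": lambda a, b: a + b})` returns `"5"`. Both termination conditions described in [0056] appear in the loop: an explicit final answer, or the iteration cap.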

[0052] In practical applications, the language model can notify the tool invocation component in the system to invoke the available tool corresponding to the operation to be performed, or the language model can directly invoke the available tool corresponding to the operation to be performed. The prompt generation component in the system can update the context based on the execution result of the operation and re-input the updated context into the language model, which will then use the updated context as the current context for reasoning.

[0053] It should be noted that once the iteration termination condition is met, the language model can output the execution result of the last operation as the task execution result, or it can integrate the execution results of all operations into the task execution result and output that.

[0054] In one embodiment shown, the operation step may specifically include: the language model generating an instruction for invoking the available tool corresponding to the operation to be performed; and a tool invocation component in the system executing the instruction, thereby invoking that available tool to perform the operation and obtaining the execution result of the operation.
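The split between instruction generation and instruction execution in [0054] can be sketched as a small dispatcher. The instruction format `{"name": ..., "arguments": {...}}` and the `ToolInvoker` class are hypothetical, chosen only to mirror the tool description format of [0058]; the application does not specify how instructions are encoded.

```python
import json
from typing import Any, Callable, Dict


class ToolInvoker:
    """Hypothetical tool invocation component that executes model-generated instructions."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        # Map a tool name (as it appears in the tool description) to a callable.
        self._registry[name] = fn

    def execute(self, instruction: str) -> Any:
        # Assumed instruction format: {"name": <tool name>, "arguments": {...}}
        call = json.loads(instruction)
        name = call["name"]
        if name not in self._registry:
            raise KeyError(f"unknown tool: {name}")
        # Invoke the tool with keyword arguments and return its execution result.
        return self._registry[name](**call.get("arguments", {}))
```

Keeping execution outside the model lets the system check or log every call before it runs, which fits the fault-tolerance goals stated earlier in the application.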

[0055] In one embodiment, the generation process of multiple inference trajectories can be recorded in an execution log. This execution log can include the intermediate operation sequence from the task description and tool description to the task execution result in each inference trajectory. Taking one inference trajectory as an example, suppose that during the generation of this inference trajectory, the inference steps first determine that available tool 1 needs to be called to execute operation 1, and then the operation steps complete the calling of available tool 1 to execute operation 1. Then, the inference steps determine that available tool 2 needs to be called to execute operation 2, and then the operation steps complete the calling of available tool 2 to execute operation 2. This iterative process outputs the task execution result. Therefore, {operation 1, operation 2} can be recorded in the execution log as the intermediate operation sequence from the task description and tool description to the task execution result in this inference trajectory.
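One possible shape for the execution log in [0055] is a mapping from a trajectory identifier to its ordered operation sequence. The `ExecutionLog` class and its method names are hypothetical; the application only states what the log records, not how it is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExecutionLog:
    """Hypothetical execution log keyed by trajectory id."""
    # Intermediate operation sequence of each trajectory, in execution order.
    sequences: Dict[int, List[str]] = field(default_factory=dict)
    # Final task execution result of each trajectory (None if none was produced).
    results: Dict[int, Optional[str]] = field(default_factory=dict)

    def record_operation(self, trajectory_id: int, operation: str) -> None:
        self.sequences.setdefault(trajectory_id, []).append(operation)

    def record_result(self, trajectory_id: int, result: Optional[str]) -> None:
        self.results[trajectory_id] = result


# The example trajectory above: tool 1 performs operation 1, then tool 2 performs operation 2.
log = ExecutionLog()
log.record_operation(0, "operation 1")
log.record_operation(0, "operation 2")
log.record_result(0, "task execution result")
```

After these calls, `log.sequences[0]` is `["operation 1", "operation 2"]`, the intermediate operation sequence recorded for that trajectory.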

[0056] In one embodiment shown, the iteration termination condition may include: the task execution result of the target task has been generated in the inference trajectory; or the number of iterations has reached a preset threshold.

[0057] In practical applications, the task description can be natural-language text. A tool description can be a structured description of an available tool used for function calling; such a description generally includes the tool's name, a description of its function, and its parameter format, which guide the language model to invoke external tools (e.g., database queries, APIs, or functions) correctly when needed.

[0058] For example, a tool description in JSON Schema form can look like this:
{
  "type": "function",
  "function": {
    "name": "get_current_weather",
    "description": "Get current weather information for a specified city",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City name, such as 'Beijing' or 'New York'"
        },
        "unit": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"],
          "description": "Temperature unit"
        }
      },
      "required": ["location"]
    }
  }
}
In the tool description above, `name` is the tool's unique identifier (e.g., the function name); `description` gives a brief overview of the tool's functionality, helping the model understand when to invoke it; and `parameters` is the JSON Schema of the input parameters. Within `parameters`, `type` indicates the data type (e.g., `object` for the parameter set, `string` for an individual parameter), `properties` defines each parameter, `required` lists the required parameters, and `enum` restricts a parameter to a set of allowed values.
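Before a tool is invoked, the arguments produced by the language model can be checked against the tool description. The sketch below is an assumption, not part of the application: `check_arguments` is a hypothetical helper that checks only the `required`, `type`, and `enum` keywords discussed above. It is not a full JSON Schema validator (a dedicated validator library would be used for that in practice).

```python
from typing import Any, Dict, List

# The tool description above, written as a Python dict.
WEATHER_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather information for a specified city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string",
                             "description": "City name, such as 'Beijing' or 'New York'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"],
                         "description": "Temperature unit"},
            },
            "required": ["location"],
        },
    },
}

# A few JSON Schema type names mapped to Python types (not exhaustive).
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
               "boolean": bool, "object": dict, "array": list}


def check_arguments(tool: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the arguments pass these checks."""
    params = tool["function"]["parameters"]
    props = params.get("properties", {})
    problems: List[str] = []
    # `required`: every listed parameter must be present.
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            # Stricter than JSON Schema's default, which allows extra properties.
            problems.append(f"unknown parameter: {name}")
            continue
        # `type`: the value must match the declared type.
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}")
        # `enum`: the value must be one of the allowed values.
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} not in {spec['enum']}")
    return problems
```

A non-empty result could be fed back into the context as the execution result of a failed operation, so the model can correct its call in the next reasoning step.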

[0059] In one embodiment shown, the task description of the target task and the tool description of the available tools can be input into the language model, and the above steps can be executed independently multiple times until the iteration termination condition is met, so that the language model independently generates multiple inference trajectories from the task description and tool description to the task execution result of the target task. This avoids producing highly similar inference trajectories, breaks the determinism of a single generation, and encourages the model to explore inference trajectories that differ semantically but are each logically reasonable.

[0060] It should be noted that the language model can independently generate multiple inference trajectories by randomly sampling the next token from the predicted probability distribution during generation. In this case, although the trajectories are generated from the same task description and tool description, they are typically not identical. This allows the language model to explore various possible reasoning routes, thereby improving robustness and accuracy on complex tasks.
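Random sampling of the next token can be illustrated with a softmax over a toy set of logits. The temperature parameter, the `sample_next_token` name, and the example token labels are assumptions added for illustration; the application only states that the next token is sampled from the predicted distribution.

```python
import math
import random
from typing import Dict


def sample_next_token(logits: Dict[str, float], temperature: float,
                      rng: random.Random) -> str:
    """Sample one token from softmax(logits / temperature)."""
    if temperature <= 0:
        # Greedy decoding: deterministic, so repeated runs give identical trajectories.
        return max(logits, key=logits.get)
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in scaled.values()]
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(list(scaled.keys()), weights=weights, k=1)[0]


# Hypothetical logits over three candidate next steps.
logits = {"call_tool_A": 2.0, "call_tool_B": 1.5, "answer_now": 0.5}
```

With `temperature=0` every call returns `"call_tool_A"`, so all trajectories would coincide; with a positive temperature, different calls can pick different steps, which is the source of diversity between trajectories described above.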

[0061] In one embodiment shown, the task description of the target task and the tool description of the available tools can be input into the language model, and the above steps can be executed independently and repeatedly in parallel until the iteration termination condition is met, so that the language model generates multiple independent inference trajectories from the task description and tool description to the task execution result of the target task in parallel, thereby balancing computational efficiency, computational overhead, and inference diversity.

[0062] Alternatively, when computational resources are limited, the task description of the target task and the tool description of the available tools can be input into the language model, and the above steps can be executed independently and sequentially multiple times until the iteration termination condition is met, so that the language model generates the multiple independent inference trajectories from the task description and tool description to the task execution result of the target task one after another.
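The parallel and sequential modes of [0061] and [0062] can share one driver. In this sketch, `generate_trajectories` is a hypothetical helper and `toy_trajectory` is a stand-in for a full reasoning/operation loop; giving each trajectory its own seeded random generator is an assumption that keeps trajectories independent and makes the two modes produce the same results.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# A trajectory function takes a per-trajectory RNG and returns that trajectory's result.
TrajectoryFn = Callable[[random.Random], str]


def generate_trajectories(fn: TrajectoryFn, n: int, parallel: bool = True,
                          base_seed: int = 0, max_workers: int = 4) -> List[str]:
    # One independent RNG per trajectory, so results do not depend on thread scheduling.
    rngs = [random.Random(base_seed + i) for i in range(n)]
    if parallel:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fn, rngs))  # map preserves input order
    # Sequential mode: one trajectory at a time, lower peak resource use.
    return [fn(rng) for rng in rngs]


def toy_trajectory(rng: random.Random) -> str:
    # Stand-in for a full trajectory: returns a sampled final result.
    return rng.choice(["A", "A", "B"])
```

Threads suit this workload when each trajectory mostly waits on model or tool calls over the network; the `max_workers` cap is one way to trade throughput against computational overhead.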

[0063] Step 202: Determine the target task execution result from the task execution results of the target task contained in the multiple inference trajectories, and take the target task execution result as the task execution result of the target task.

[0064] In this embodiment, after multiple inference trajectories have been generated from the same task description and tool description, the target task execution result can be determined from the task execution results of the target task (usually the final answer output by each inference trajectory) contained in these trajectories, and the target task execution result is taken as the task execution result of the target task.

[0065] In one embodiment shown, the target task execution result can be determined, based on a voting mechanism, from the task execution results of the target task contained in the multiple generated inference trajectories. For example, suppose 5 inference trajectories are generated, 3 of which contain result A as the task execution result and the other 2 of which contain result B. Result A then has 3 votes and result B has 2 votes, so result A, which has more votes, is determined as the target task execution result.
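The voting mechanism can be sketched as a simple majority vote over the final results. Two details are assumptions not specified in the application: trajectories that produced no result (e.g., hit the iteration cap) are ignored, and ties go to the result encountered first.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(results: List[Optional[str]]) -> Optional[str]:
    """Return the most common final result; trajectories without a result are ignored."""
    votes = Counter(r for r in results if r is not None)
    if not votes:
        return None
    # most_common orders equal counts by first occurrence, so ties go to the earliest result.
    return votes.most_common(1)[0][0]
```

For the example above, `majority_vote(["A", "A", "B", "A", "B"])` returns `"A"` (3 votes to 2). Only final results are compared, so the intermediate steps of different trajectories never need to be aligned.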

[0066] In one embodiment shown, before inputting the task description and tool description into the language model, a RAG approach can be used to retrieve relevant documents from an external knowledge base based on the task description, and the retrieved relevant documents, task description, and tool description can be used together as the initial context input into the language model.
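Assembling the initial context with retrieval can be sketched as follows. The term-overlap scoring is a deliberately simple stand-in for a real retriever (such as embedding search); the function names, the prompt layout, and the knowledge-base contents in the usage note are all hypothetical.

```python
import re
from typing import List, Set


def _terms(text: str) -> Set[str]:
    # Lowercased alphanumeric terms.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task: str, knowledge_base: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many terms they share with the task."""
    query = _terms(task)
    scored = [(len(query & _terms(doc)), doc) for doc in knowledge_base]
    scored = [s for s in scored if s[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep knowledge-base order
    return [doc for _, doc in scored[:k]]


def build_initial_context(task: str, tool_desc: str, knowledge_base: List[str]) -> str:
    # Retrieved documents, task description, and tool description form the initial context.
    docs = retrieve(task, knowledge_base)
    refs = "\n".join(f"- {d}" for d in docs)
    return f"REFERENCE DOCUMENTS:\n{refs}\nTASK: {task}\nTOOLS: {tool_desc}"
```

For a supply chain task such as "Check supplier lead time for part X", a document like "Supplier lead time for part X is 14 days" would rank first, and unrelated documents with no shared terms are left out of the context.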

[0067] In one embodiment shown, the target task can be a supply chain management task.

[0068] In the above technical solution, the task description of a given task and the tool description of the available tools can be input into a language model. The language model then generates multiple inference trajectories that proceed from the task description and tool description to the task execution result. The target task execution result can then be determined from the task execution results contained in these trajectories and taken as the task execution result of the given task. Generating an inference trajectory can include iteratively executing the following steps until an iteration termination condition is met: the language model reasons based on a context containing the task description, the tool description, and the execution results of already executed operations to determine the operation to be executed and the available tool corresponding to that operation; the available tool corresponding to the operation is invoked to execute the operation, the execution result is obtained, and the context is updated based on the execution result.

[0069] Using the above approach, on the one hand, inference trajectories can be generated based on task descriptions and tool descriptions, completing the entire process from understanding the task intent to calling the tool to execute the task, thus achieving autonomous and flexible task orchestration and execution; on the other hand, multiple inference trajectories can be generated independently based on the same task description and tool description, and the final task execution result can be determined from the task execution results contained in the multiple inference trajectories, without the need to align the intermediate inference steps in different inference trajectories. This preserves the exploration diversity and result consistency of the inference trajectories, while reducing the complexity of post-inference processing, thereby achieving efficient and reliable task orchestration and execution.

[0070] Corresponding to the aforementioned embodiments of the language model-based task execution method, this application also provides embodiments of a language model-based task execution device.

[0071] Please refer to Figure 3, which is a schematic diagram illustrating the structure of a device according to an exemplary embodiment of this application. At the hardware level, the device includes a processor 301, an internal bus 302, a network interface 303, a memory 304, and a non-volatile memory 305, and may also include other necessary hardware. One or more embodiments of this application can be implemented in software, for example, by the processor 301 reading the corresponding computer program from the non-volatile memory 305 into the memory 304 and then running it. Of course, besides a software implementation, one or more embodiments of this application do not exclude other implementations, such as logic devices or a combination of hardware and software. That is, the execution entity of the following processing flow is not limited to individual logic modules, but can also be hardware or logic devices.

[0072] Please refer to Figure 4, which is a block diagram illustrating a language model-based task execution device in an exemplary embodiment of this application.

[0073] The above language model-based task execution device can be applied to the device shown in Figure 4 to implement the technical solution of this application. The language model-based task execution device may include: an inference module 401, which inputs the task description of the target task and the tool description of the available tools into the language model and iteratively executes the following steps multiple times until the iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task: the language model performs reasoning based on a context including the task description, the tool description, and the execution results of the executed operations to determine the operation to be executed and the available tool corresponding to the operation; the available tool corresponding to the operation is invoked to execute the operation, the execution result is obtained, and the context is updated based on the execution result; and a determination module 402, which determines the target task execution result from the task execution results of the target task contained in the multiple inference trajectories and takes the target task execution result as the task execution result of the target task.

[0074] In one embodiment shown, the generation process of the multiple inference trajectories is recorded in an execution log; the execution log includes the sequence of intermediate operations in each inference trajectory from the task description and the tool description to the task execution result.

[0075] In one embodiment shown, iteratively executing the following steps multiple times until the iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task, includes: executing the following steps independently multiple times until the iteration termination condition is met, so that the language model independently generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task.

[0076] In one embodiment shown, iteratively executing the following steps multiple times until the iteration termination condition is met includes: executing the following steps iteratively, multiple times in parallel, until the iteration termination condition is met.

[0077] In one embodiment shown, determining the target task execution result from the task execution results of the target task contained in the plurality of inference trajectories includes: Based on a voting mechanism, the target task execution result is determined from the task execution results of the target task contained in the multiple inference trajectories.

[0078] In one embodiment shown, invoking the available tool corresponding to the operation to perform the operation and obtain an execution result includes: the language model generating an instruction for invoking the available tool corresponding to the operation; and executing the instruction to invoke the available tool corresponding to the operation, perform the operation, and obtain the execution result.

[0079] In one embodiment shown, the iteration termination condition includes: the task execution result of the target task has been generated in the inference trajectory; or the number of iterations has reached a preset threshold.

[0080] In one embodiment shown, the device further includes a retrieval module, which, before the task description and the tool description are input into the language model, retrieves relevant documents from an external knowledge base based on the task description and uses the relevant documents, the task description, and the tool description together as the context input into the language model.

[0081] In one embodiment shown, the target task is a supply chain management task.

[0082] For the device embodiments, they basically correspond to the method embodiments; therefore, relevant details can be found in the descriptions of the method embodiments. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of the technical solution of this application according to actual needs.

[0083] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or physical entities, or by products with certain functions. A typical implementation device is a computer, which can take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email sending and receiving device, game console, tablet computer, wearable device, or any combination of these devices.

[0084] In a typical configuration, a computer includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0085] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0086] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0087] It should be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0088] The foregoing has described specific embodiments of this application. Other embodiments are within the scope of this application. In some cases, the actions or steps described in this application may be performed in a different order than those shown in the embodiments and still achieve the desired results. Furthermore, the processes depicted in the accompanying drawings do not necessarily require a specific or sequential order to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

[0089] The terminology used in one or more embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the scope of one or more embodiments of this application. The singular forms “a,” “an,” and “the” are also intended to include the plural forms unless the context clearly indicates otherwise. The term “and / or” refers to and includes any or all possible combinations of one or more associated listed items.

[0090] The terms "an embodiment," "some embodiments," "example," "specific example," or "one implementation," as used in one or more embodiments of this application, refer to specific features or characteristics described in connection with that embodiment, which are included in at least one embodiment of this application. Illustrative descriptions of these terms do not necessarily refer to the same embodiment. Furthermore, the described specific features or characteristics may be combined in a suitable manner in one or more embodiments of this application. In addition, different embodiments and specific features or characteristics from different embodiments may be combined without contradiction.

[0091] It should be understood that although the terms first, second, third, etc., may be used to describe various information in one or more embodiments of this application, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "upon," "when," or "in response to determining."

[0092] The above description is merely a preferred embodiment of one or more embodiments of this application and is not intended to limit the scope of one or more embodiments of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this application should be included within the protection scope of one or more embodiments of this application.

[0093] The user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

Claims

1. A task execution method based on a language model, the method comprising: inputting the task description of a target task and the tool description of available tools into the language model, and iteratively executing the following steps multiple times until an iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task: the language model performs reasoning based on a context including the task description, the tool description, and the execution results of the executed operations to determine the operation to be executed and the available tool corresponding to the operation; and the available tool corresponding to the operation is invoked to perform the operation, the execution result is obtained, and the context is updated based on the execution result; and determining the target task execution result from the task execution results of the target task contained in the multiple inference trajectories, and taking the target task execution result as the task execution result of the target task.

2. The method according to claim 1, wherein the generation process of the multiple inference trajectories is recorded in an execution log; the execution log includes the intermediate operation sequence in each inference trajectory from the task description and the tool description to the task execution result.

3. The method according to claim 1, wherein iteratively executing the following steps multiple times until the iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task, includes: executing the following steps independently multiple times until the iteration termination condition is met, so that the language model independently generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task.

4. The method according to claim 1, wherein iteratively executing the following steps multiple times until the iteration termination condition is met includes: executing the following steps iteratively, multiple times in parallel, until the iteration termination condition is met.

5. The method according to claim 1, wherein determining the target task execution result from the task execution results of the target task contained in the plurality of inference trajectories comprises: Based on a voting mechanism, the target task execution result is determined from the task execution results of the target task contained in the multiple inference trajectories.

6. The method according to claim 1, wherein invoking an available tool corresponding to the operation to perform the operation and obtain an execution result includes: The language model generates instructions for invoking available tools corresponding to the operation; Execute the instruction, invoke the available tool corresponding to the operation to perform the operation, and obtain the execution result.

7. The method according to claim 1, wherein the iteration termination condition includes: the task execution result of the target task has been generated in the inference trajectory; or the number of iterations has reached a preset threshold.

8. The method according to claim 1, further comprising: Before inputting the task description and the tool description into the language model, relevant documents are retrieved from an external knowledge base based on the task description, and the relevant documents, the task description, and the tool description are used together as context input into the language model.

9. The method according to claim 1, wherein the target task is a supply chain management task.

10. A task execution device based on a language model, the device comprising: an inference module, which inputs the task description of a target task and the tool description of available tools into the language model and iteratively executes the following steps multiple times until an iteration termination condition is met, so that the language model generates multiple inference trajectories from the task description and the tool description to the task execution result of the target task: the language model performs reasoning based on a context including the task description, the tool description, and the execution results of the executed operations to determine the operation to be executed and the available tool corresponding to the operation; the available tool corresponding to the operation is invoked to execute the operation, the execution result is obtained, and the context is updated based on the execution result; and a determination module, which determines the target task execution result from the task execution results of the target task contained in the multiple inference trajectories and takes the target task execution result as the task execution result of the target task.

11. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor implements the method as described in any one of claims 1 to 9 by executing the executable instructions.

12. A computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the method as described in any one of claims 1 to 9.

13. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 9.