Inference method, system, device and medium based on heterogeneous large language model

By conducting autonomous capability assessment and structured consultation in executor model instances, and utilizing system prompts and tools to configure constrained expert models as advisors, the problem of incomplete decision information and inefficient upgrade methods in heterogeneous reasoning of large language models is solved, achieving high-efficiency, low-cost reasoning quality and the ability to quickly adapt to new scenarios.

CN122242745APending Publication Date: 2026-06-19SHENZHEN SHENMA INNOVATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN SHENMA INNOVATION TECHNOLOGY CO LTD
Filing Date
2026-03-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing heterogeneous reasoning solutions for large language models, incomplete decision information leads to high classification error rates, inefficient upgrade methods, and wasteful transfer of task execution rights. Furthermore, existing consultation and collaboration solutions require additional training and are difficult to adapt quickly to new models and scenarios.

Method used

By assessing autonomous capabilities in executor model instances, and using system prompts and tools to configure constrained expert models as advisors to provide guidance, the executor models complete reasoning tasks, avoiding state transitions and redoing between model instances. Structured consultation requests and deterministic middleware are used for inter-model collaboration.

Benefits of technology

It reduces the classification error rate of model selection, lowers the cost of inference execution, improves inference quality, reduces training and maintenance costs, and enables rapid adaptation to new models and new scenarios.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122242745A_ABST
    Figure CN122242745A_ABST
Patent Text Reader

Abstract

This application discloses a reasoning method, system, device, and medium based on a heterogeneous large language model. The method is applied to a heterogeneous large language model reasoning system, which includes executor model instances and expert model instances. The method includes: obtaining a reasoning request and assigning it to an executor model instance; determining, based on the context information of the reasoning request, whether the executor model instance has the capability to complete the reasoning request; if not, sending an upgrade consultation request for the reasoning request to an expert model instance; generating a guidance scheme corresponding to the upgrade consultation request through the expert model instance and sending the guidance scheme to the executor model instance; and performing reasoning processing on the reasoning request according to the guidance scheme through the executor model instance to obtain a reasoning result. This solution can reduce the classification error rate in model execution capability assessment and reduce the inference execution cost of the model.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to reasoning methods, systems, devices and media based on heterogeneous large language models. Background Technology

[0002] In recent years, the reasoning capabilities of Large Language Models (LLMs) have continued to improve, but the accompanying reasoning costs have also increased significantly. Taking the commercial LLM market as an example, the cost difference per inference between high-performance large models (such as GPT-4o, Claude Opus, and DeepSeek-R1) and low-cost lightweight models (such as GPT-4o-mini, ClaudeHaiku, and DeepSeek-V3) can be 10 to 50 times. In practical applications, a large number of inference requests (typically accounting for 70% to 80% of the total) can be completed with high quality without high-performance models; only a small portion of requests require strong reasoning capabilities. Therefore, maximizing the use of low-cost models while ensuring inference quality has become a core technical challenge for the large-scale deployment of large language models.

[0003] To address the aforementioned challenges, existing technologies have developed a model selection paradigm centered on "pre-routing." The basic architecture of this paradigm is as follows: before an inference request reaches a large language model instance, an external routing module determines the difficulty level of the request based on its characteristics, and then allocates the request to a model instance with the corresponding cost level for execution.

[0004] RouteLLM (UC Berkeley, 2024) trains router models to learn request difficulty classification rules, predicting whether a request is suitable for a strong or weak model upon arrival. MartianModel Router uses machine learning models to analyze the surface features of requests (such as text length, keywords, and domain tags) to predict request complexity and route to the corresponding model. INFERENCEDYNAMICS provides real-time request pre-routing services, allocating models based on request features before the request enters the inference pipeline. AutoMix (2024) proposes complexity-based strategic routing, first trying a small model and upgrading to a larger model if the output quality is substandard. FrugalGPT (Stanford University, 2023) proposes a cascading strategy, trying models sequentially from low to high cost until the output quality meets a threshold.

[0005] The aforementioned pre-routing schemes share a core architectural feature: model selection decisions are made by external modules before the request reaches the execution model. This architectural feature leads to a fundamental limitation: incomplete decision information. Specifically, when making model selection decisions, the external routing module only has surface-level characteristics of the request (text length, keywords, syntactic complexity, etc.) available to it, rather than a deep understanding of the problem. The true difficulty of a request often requires a complete understanding of the problem context to accurately determine—a seemingly simple coding request may involve complex architectural decisions, while a complexly worded question may only require a simple fact lookup. There is a structural discrepancy between the difficulty judgment made by the external routing module based on surface characteristics and the true difficulty.

[0006] This structural bias manifests in two types of errors in a real system: (a) false negatives (failure to upgrade when required) – complex requests are incorrectly classified as simple requests and assigned to a low-cost model, resulting in substandard output quality and requiring rework or manual intervention; (b) false positives (unnecessary upgrades) – simple requests are incorrectly classified as complex requests and assigned to a high-cost model, resulting in unnecessary cost incurred. Both types of errors stem from the same root cause: the external routing module makes decisions under conditions of incomplete information.

[0007] From an information theory perspective, when making model selection decisions, the external routing module only has surface-level information about the request, far less than the information obtainable after a complete understanding of the problem. This incompleteness of decision information leads to a theoretical upper bound on the classification accuracy of the routing module—no matter how optimized the routing algorithm is, the quality of decisions based on incomplete information cannot surpass the quality of decisions based on complete information. Existing pre-routing schemes all make decisions under conditions of incomplete information, which is the fundamental reason why their classification error rate has a theoretical lower bound.

[0008] Mixture of Experts (MoE) is another heterogeneous inference scheme. MoE sets up multiple expert subnetworks within a single model, using gating networks to select which subnetworks to activate during inference. However, MoE is a micro-architectural design within the model, with expert selection granularity at the neural network layer level, which is fundamentally different at an abstract level from the model instance-level heterogeneous scheduling discussed in this application. MoE cannot solve the cost optimization problem across model instances.

[0009] Multi-agent frameworks (such as CrewAI, AutoGen, and MetaGPT) support task delegation mechanisms that allow tasks to be transferred from one agent to another. When a low-capacity agent is unable to complete a task, the entire task can be delegated to a high-capacity agent. This mechanism is characterized by a transfer of task execution control—the high-capacity agent takes over the task and completes it independently, while the low-capacity agent no longer participates. This results in high-cost models undertaking not only the knowledge-intensive work of analysis and guidance but also the operationally intensive execution work (such as writing code line by line and calling tools), with costs far exceeding those required for simply providing guidance.

[0010] The aforementioned cascading strategies (AutoMix, FrugalGPT, etc.) employ a "try a small model first → detect failure → upgrade to a large model to redo" approach. Recent self-routing solutions have further developed this direction: DiSRouter (2025) trains the self-evaluation capabilities of each large language model through supervised fine-tuning and reinforcement learning, enabling the model to autonomously decide whether to answer or forward to a larger model; SATER (2025) trains small models to reject tasks beyond their capabilities, with rejected queries directly routed to the large model for processing. These solutions move the upgrade decision from the external router to the execution model itself, improving the completeness of decision information. However, all the aforementioned cascading and self-routing solutions share a fundamental limitation: the upgrade method still involves the large model completely redoing the original task, rather than providing guidance based on the existing work of the small model. This means that the execution cost of the small model is completely wasted (its intermediate results are not utilized by the large model), and the large model bears the full execution cost (not just the cost of providing guidance). Furthermore, both DiSRouter and SATER require additional training to enhance the self-evaluation capabilities, increasing deployment complexity.

[0011] Current technologies generally agree that to achieve autonomous upgrade decisions for executor model instances, it is necessary to enhance the model's self-evaluation capabilities through supervised fine-tuning and reinforcement learning training. This approach results in high deployment costs, long cycles, and difficulty in quickly adapting to new models and scenarios.

[0012] The recently developed SWE-Protege (2026) has implemented a collaborative consultation model in software engineering where "experts provide guidance, while the executor retains the right to execute." The smaller model (protege) acts as the primary executor, selectively consulting the larger model for inference guidance via the `ask_expert` tool, with the task execution always residing with the protege. However, SWE-Protege requires two stages of training: SFT (Supervised Fine-tuning) and RL (Reinforcement Learning), to teach the smaller model when to trigger expert consultation. Furthermore, the expert role constraint is an implicit effect of the training objective rather than an explicit constraint at the architectural level, and its consultation requests are in an unstructured format.

[0013] Furthermore, existing technologies generally fail to recognize the application value of the memoryless nature of large language models in heterogeneous model scheduling. Large language models are stateless reasoning systems—their output is entirely determined by the current input, without retaining internal state across calls. Theoretically, this characteristic supports real-time switching of model instances within a single session, requiring only that the new model instance receives complete context information to achieve seamless switching, eliminating the need for complex state serialization, state synchronization, or checkpoint recovery mechanisms found in traditional distributed systems. However, existing heterogeneous reasoning solutions do not systematically utilize this characteristic to design intra-session model switching protocols: pre-routing schemes select models before the session begins, without involving intra-session switching; cascading schemes involve model upgrades accompanied by task redoing, discarding the work of previous models; multi-agent delegation schemes require context reconstruction for task transfer, accompanied by a transfer of execution rights. Therefore, existing technologies lack a real-time, seamless intra-session model switching mechanism based on the memoryless nature of large language models, implemented through a context injection protocol.

[0014] Based on the above analysis, existing inference schemes for heterogeneous large language models have the following shortcomings: (1) Inappropriate decision timing: The pre-routing scheme makes a model selection decision before the request arrives at the execution model. At this time, the decision information is incomplete, resulting in a theoretical lower bound for the classification error rate and a high classification error rate in the model execution capability assessment.

[0015] (2) Inefficient upgrade method: The upgrade method of the cascade strategy is to completely redo the large model, which wastes the existing work results of the small model. Moreover, the large model bears the complete execution cost rather than just providing guidance, resulting in high inference execution cost.

[0016] (3) Waste of task execution rights transfer: The multi-agent delegation mechanism transfers the entire task to the high-cost model, which bears all the costs of analysis guidance and operation execution.

[0017] (4) Existing consultation collaboration solutions require additional training: SWE-Protege implements a consultation collaboration model where "experts only provide guidance, and executors retain the right to execute," but it requires two-stage training (SFT+RL) to teach executors when to upgrade. This training process involves backpropagation updates of model parameters, requiring a large amount of labeled data and computational resources, increasing data preparation and training costs, and limiting the ability to deploy in real time under new models and new scenarios. Its expert role constraints are implicit effects of the training process and lack explicit guarantees at the architectural level. In addition, although self-routing solutions such as DiSRouter and SATER move the decision point to the executor, their upgrade method is still to completely redo the original task with the large model. In summary, there is currently no solution that can achieve heterogeneous model consultation collaboration based on structured consultation protocols without requiring model parameter-level training and with architectural-level expert role constraints. Summary of the Invention

[0018] This application provides inference methods, systems, devices, and media based on heterogeneous large language models, aiming to solve the problems of high classification error rate, high inference execution cost, waste of task execution rights transfer, and the need for additional training in existing consultation and collaboration schemes in the prior art.

[0019] In a first aspect, embodiments of this application provide a reasoning method based on a heterogeneous large language model. The method is applied to a heterogeneous large language model reasoning system, which includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance. The method includes: Obtain the inference request and assign the inference request to the executor model instance; Based on the context information of the inference request, it is determined whether the executor model instance has the ability to complete the inference request. If the executor model instance does not have the ability to complete the inference request, then the executor model instance sends an upgrade consultation request for the inference request to the expert model instance. The expert model instance generates a guidance plan corresponding to the upgrade consultation request and sends the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. Using the executor model instance, the inference request is processed according to the guidance scheme to obtain the inference result corresponding to the inference request; If the executor model instance has the ability to complete the inference request, then the executor model instance performs inference processing on the inference request based on the context information of the inference request to obtain the inference result corresponding to the inference request.

[0020] Secondly, embodiments of this application also provide a heterogeneous large language model inference system. The heterogeneous large language model inference system includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance, wherein: The executor model instance is used to obtain a reasoning request; determine whether the executor model instance has the ability to complete the reasoning request based on the context information of the reasoning request; if the executor model instance does not have the ability to complete the reasoning request, then send an upgrade consultation request for the reasoning request to the expert model instance. The expert model instance is used to generate a guidance plan corresponding to the upgrade consultation request and send the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. The executor model instance is further configured to perform reasoning processing on the reasoning request according to the guidance scheme to obtain the reasoning result corresponding to the reasoning request; if the executor model instance has the ability to complete the execution of the reasoning request, it performs reasoning processing on the reasoning request based on the context information of the reasoning request to obtain the reasoning result corresponding to the reasoning request.

[0021] Thirdly, embodiments of this application also provide a computer device, which includes a memory and a processor. The memory stores a computer program, and the computer device deploys a heterogeneous large language model inference system. The heterogeneous large language model inference system includes an executor model service module and an expert model service module. The executor model service module includes at least one low-cost executor model instance, and the expert model service module includes at least one high-cost expert model instance. When the processor executes the computer program, it implements the above-described method.

[0022] Fourthly, embodiments of this application also provide a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, can implement the above-described method.

[0023] This application provides a reasoning method, system, device, and medium based on a heterogeneous large language model. The method is applied to a heterogeneous large language model reasoning system, which includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance. Firstly, in this application embodiment, the step of determining whether the executor model instance has the ability to complete the reasoning request (whether expert consultation is needed) occurs after the request is fully understood. This step is made by the executor model instance itself based on the context information of the reasoning request, resulting in high completeness of the decision information involved in determining the execution capability of the reasoning request, thereby reducing the classification error rate. Secondly, when the executor model instance does not have the execution capability of the current reasoning request, it will... The first aspect involves requesting guidance from high-cost expert model instances, enabling executor model instances to complete reasoning under the guidance of expert model instances, resulting in high-quality reasoning. Secondly, in this embodiment, the high-cost expert model instance only generates guidance schemes; that is, the task execution right in this embodiment is always held by the executor model instance and is not transferred to the expert model instance. The high-cost model is only used for knowledge-intensive analysis and guidance and does not undertake operation-intensive execution work. The token consumption for a single consultation is far lower than that for the complete execution of the original task, thereby reducing the cost of reasoning execution. Thirdly, this application evaluates the execution capability of the executor model instance based on system prompts, eliminating the need to train and maintain external routing models or classifiers, thus eliminating the development cost, training data requirements, and continuous maintenance cost of routing models. Attached Figure Description

[0024] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0025] Figure 1 A schematic block diagram of a heterogeneous large language model inference system provided in the embodiments of this application; Figure 2 A flowchart illustrating the reasoning method based on a heterogeneous large language model provided in this application embodiment; Figure 3 A schematic diagram of a sub-process of the reasoning method based on a heterogeneous large language model provided in the embodiments of this application; Figure 4 Another sub-process diagram of the reasoning method based on a heterogeneous large language model provided in the embodiments of this application; Figure 5A schematic diagram of the multi-turn interaction timing within a single request of the inference method based on a heterogeneous large language model provided in the embodiments of this application; Figure 6 A schematic diagram of the degradation protection mechanism for the reasoning method based on a heterogeneous large language model provided in the embodiments of this application; Figure 7 A schematic block diagram of a computer device provided in an embodiment of this application. Detailed Implementation

[0026] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0027] It should be noted that any AI models, software tools, or components not belonging to this company appearing in the embodiments of this application are merely illustrative examples and do not represent actual use. The user personal information involved in the embodiments of this application is obtained by an entity authorized (with the knowledge and consent) or fully authorized by all parties through various legal and compliant means. The collection, storage, use, processing, transmission, provision, and disclosure of the information, data, and signals involved all comply with relevant laws and regulations and do not violate public order and good morals.

[0028] It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integrals, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integrals, steps, operations, elements, components and / or collections thereof.

[0029] It should also be understood that the terminology used in this application specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0030] It should also be further understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0031] This application provides inference methods, systems, devices, and media based on heterogeneous large language models.

[0032] The execution entity of the reasoning method based on the heterogeneous large language model can be the heterogeneous large language model reasoning system provided in the embodiments of this application, or a computer device that integrates the heterogeneous large language model reasoning system. The heterogeneous large language model reasoning system can be implemented in hardware or software, and the computer device can be a terminal or a server.

[0033] Please see Figure 1 , Figure 1 This is a schematic diagram of a framework for a heterogeneous large language model inference system provided in an embodiment of this application. The heterogeneous large language model inference system includes an executor model service module 10 and an expert model service module 20. The executor model service module 10 contains at least one low-cost executor model instance, and the expert model service module 20 contains at least one high-cost expert model instance, wherein: The executor model service module 10 is configured to receive all inference requests. Each executor model instance has complete tool calling capabilities (such as reading and writing files, executing commands, searching code, etc.) and can independently complete most operation tasks.

[0034] The system prompts for expert model instances explicitly define their role as "advisor," which can analyze problems, provide solutions, and explain principles, but cannot directly manipulate files, invoke tools, or modify system status.

[0035] The role constraints of the expert model instance are implemented through a dual mechanism: At the system prompt level, the system prompt explicitly defines the role of the expert model instance as an advisor: the operations that the expert model instance can perform include analyzing the root causes of problems, providing solutions, explaining technical principles, and assessing the risks of solutions; the operations that the expert model instance cannot perform include directly manipulating code files, directly calling external tools, directly modifying system state, and directly generating final deliverables; At the tool configuration level, the available tool list of the expert model instance does not include operation tools, only read-only tools are configured or no tools are configured, so that the expert model instance is structurally unable to directly execute operation tasks at the architecture level; This allows the high-cost reasoning capabilities of the expert model instance to be used only for knowledge-intensive analysis and guidance, while the operation-intensive execution tasks are completed by the low-cost executor model instance.

[0036] In one specific embodiment, the system prompts for expert model instances include the following role constraints: "Your role is technical advisor. Your responsibilities are:" 1. Analyze the root causes of the problem; 2. Provide solutions and implementation strategies; 3. Explain the technical principles and architectural considerations; 4. Assess the risks and alternative options of the proposed solution; You do not perform the following operations directly: - Write or modify code files; - Call external tools or APIs; - Modify system configuration or status.

[0037] Furthermore, such as Figure 1 As shown, the heterogeneous large language model reasoning system also includes a consultation agent module 30, which serves as a deterministic middleware between the executor model service module 10 and the expert model service module 20. This module is responsible for: (a) parsing the upgrade request from the output of the executor model instance; (b) extracting key information from the current session context of the executor model instance and constructing a structured consultation request; (c) sending the structured consultation request (upgrade consultation request) to the expert model instance; and (d) receiving the guidance scheme from the expert model instance and injecting it as a tool call result into the subsequent context of the executor model instance. The consultation agent module 30 is a deterministic procedure and does not contain probabilistic reasoning.

[0038] Figure 2 This is a flowchart illustrating the reasoning method based on a heterogeneous large language model provided in an embodiment of this application. For example... Figure 2 As shown, the method includes the following steps S110-S160.

[0039] S110. Obtain the inference request and assign the inference request to the executor model instance.

[0040] In this embodiment, after the inference request arrives at the system, it is directly assigned to the executor model instance. That is, the inference request goes directly from the system entry point to the executor model instance without going through any external pre-routing or classification modules. The executor model instance receives the complete context information of the request, which includes user instructions, historical dialogues, tool call results, and intermediate states (intermediate inference processes).

[0041] S120. Based on the context information of the inference request, determine whether the executor model instance has the ability to complete the inference request. If yes, proceed to step S160; otherwise, proceed to steps S130-S150.

[0042] In this embodiment, the executor model instance begins processing the inference request. During processing, based on a complete understanding of the problem, the executor model instance continuously evaluates its ability to complete the request with high quality. This evaluation is a natural component of the executor model's inference process. When the model encounters analytical needs, architectural decisions, or complex reasoning that exceed its capabilities during processing, uncertainty signals will naturally be generated.

[0043] It should be noted that the autonomous capability assessment is implemented through technical means at the system configuration level, rather than relying on spontaneous behavior of the model.

[0044] Further, step S120 includes: determining the capability assessment instruction corresponding to the executor model instance in the system prompt word through the executor model instance, the capability assessment instruction including multiple capability assessment dimensions; and performing autonomous capability assessment by the executor model instance based on the context information of the inference request and the multiple capability assessment dimensions to determine whether the executor model instance has the capability to complete the inference request.

[0045] This autonomous assessment is achieved by configuring capability assessment instructions in the system prompts of the executor model instance. The capability assessment instructions utilize the ability of the large language model to follow instructions in the system prompts learned during the pre-training phase to define the assessment dimensions and post-assessment behavior rules. The executor model instance does not require supervised fine-tuning or reinforcement learning training at the model parameter level for the upgrade decision task. The configuration of the capability assessment instructions is an operation at the system configuration level rather than a model training process.

[0046] Specifically, the system prompts for the executor model instance contain explicit capability assessment instructions. These instructions specify multiple capability assessment dimensions (such as problem complexity, required depth of professional knowledge, and the boundaries of its own reasoning ability) and post-assessment behavioral rules (if the capability is sufficient, it completes independently; if the capability is insufficient, it calls the ask_expert tool). Current mainstream commercial large language models (such as the GPT-4 series, Claude series, and DeepSeek series) all possess highly reliable adherence to the instructions in the system prompts. This capability is a fundamental ability learned during the pre-training phase and does not require supervised fine-tuning or reinforcement learning training at the model parameter level for the upgrade decision task. The system prompt configuration is a system configuration-level operation, implemented by modifying the input text, and does not involve modification of internal model parameters or weights. It is fundamentally different from the training process that requires backpropagation to update model parameters. Therefore, this application can activate the autonomous assessment behavior (whether it possesses the capability to complete the reasoning request) of the executor model instance through system prompt configuration. Its technical essence is: utilizing the existing instruction adherence capability of the large language model as the execution basis, and controlling the assessment logic through deterministic system prompt configuration to achieve predictable and reproducible upgrade decision behavior.

[0047] S130. Send an upgrade consultation request for the inference request to the expert model instance through the executor model instance.

[0048] In this embodiment, if the executor is assessed as having insufficient capability (i.e., the executor model instance does not have the capability to complete the inference request), it enters the upgrade consultation path, triggers an upgrade request, and sends an upgrade consultation request for the inference request to the expert model instance.

[0049] In some embodiments, the list of available tools for the executor model instance includes expert consultation tools; see [link to relevant documentation]. Figure 3 Step S130 includes: S1301. The expert consultation tool is invoked through the executor model instance. When the expert consultation tool is invoked, the upgrade consultation request is constructed by the consultation agent module based on the context information of the reasoning request. S1302, The upgrade consultation request is sent to the expert model instance through the consultation agent module.

[0050] This upgrade consultation request is a structured consultation request, containing the following structured fields: The problem description field describes the specific difficulties or uncertainties encountered by the executor model instance. The context summary field provides key background information for the current task, including completed steps, intermediate results obtained, and current execution status. The preliminary analysis fields include the executor model instance's initial understanding of the problem and the attempted solutions; The structured consultation request is parsed, extracted, and formatted by the backend deterministic program (consultation agent module) from the output of the executor model instance, rather than being directly constructed by the executor model instance.

[0051] In one specific embodiment, a structured consultation request includes the following fields: { "question": "A description of the specific problem encountered by the executor" "context_summary": A summary of the current task context, completed steps, and key intermediate results. "worker_analysis": "The worker's initial analysis of the problem and the approaches already attempted." "code_references" : [ {"file": "Related file path", "symbol": "Related function / class name", "line_hint": "line number hint"} } In the structured consultation request described above, `question`, `context_summary`, and `worker_analysis` are required fields, corresponding to the problem description field, context summary field, and preliminary analysis field. `code_references` is an optional extended field that is automatically extracted and appended from the session context by the consultation agent module when code-related tasks are involved, helping expert model instances quickly locate relevant code positions.

[0052] In one specific embodiment, the consultation agent module employs a two-phase state machine to construct the upgrade consultation request, thereby controlling transmission costs while ensuring information integrity. In this case, constructing the upgrade consultation request based on the context information of the inference request by the consultation agent module includes: The consultation agent module extracts and caches context summary information from the current session context information. The context summary information includes current task description information, executed operation steps information, key intermediate result information, and related code reference information. The consultation agent module sends the context summary information to the executor model instance. The executor model instance determines the problem description information based on the context summary information and sends the problem description information to the consultation agent module. The consultation agent module constructs an upgrade consultation request based on the context summary information and the problem description information.

[0053] Specifically, in the first phase (caching context): when the executor model instance triggers the first ask_expert (expert consultation tool) call, the consultation agent module extracts and caches key information from the current session context, including: the current task description, the executed operation steps, key intermediate results, and relevant code references. This phase does not immediately send this information to the expert model instance; instead, it returns a cached context summary to the executor, prompting it to provide specific questions in the next round of calls.

[0054] Phase Two (Sending Consultation): The executor model instance provides a specific problem description in the second `ask_expert` call. The consultation agent module merges the context summary cached in Phase One with the specific problem in Phase Two, constructs a complete structured consultation request, and sends it to the expert model instance.

[0055] The advantages of the two-stage design are: (a) the context summary is extracted from the session history by a deterministic procedure, rather than being restated by the executor model instance itself, thus avoiding information omissions or distortions that may occur during model restatement; (b) the specific problem is explicitly proposed by the executor model instance in the second stage, ensuring the accuracy of the problem description.

[0056] If any stage fails, the consultation agent module performs a degradation process, passing on the acquired partial information to the expert model instance as best it can, rather than failing completely. This ensures the robustness of the consultation process.

[0057] S140. Generate a guidance plan corresponding to the upgrade consultation request through the expert model instance, and send the guidance plan to the executor model instance, wherein the expert model instance does not execute the operation task in the reasoning request when processing the upgrade consultation request.

[0058] In this embodiment, after receiving an upgrade consultation request, the expert model instance analyzes the problem based on its strong reasoning ability and returns a guidance solution. Typical content of the guidance solution includes: root cause analysis of the problem, recommended solution strategies, suggested implementation steps, and risk points to be aware of. The expert model instance does not directly perform any operations, including not writing code, calling tools, or modifying files.

[0059] In this embodiment, the role constraints of the expert model instance are implemented through two levels: system prompts and tool configuration.

[0060] At the system prompt level: the system prompts for expert model instances clearly define their responsibility boundaries for analysis, guidance, interpretation, and evaluation. This constraint leverages the ability of large language models to adhere to system prompts, ensuring that expert model instances consciously abide by their role boundaries in the vast majority of cases.

[0061] At the tool configuration level: the list of available tools for an expert model instance does not include operation-type tools (such as file reading and writing, command execution, etc.). Even if the expert model instance attempts to perform an operation, it cannot do so due to the lack of tool interfaces, creating a hard constraint. In a specific embodiment, the expert model instance is only configured with read-only tools (such as searching code, viewing documentation), and not with write-type tools (such as editing files, executing commands).

[0062] The economic benefits of role constraints: Expert model instances only output analysis and guidance text, and the number of output tokens for a single consultation is typically much smaller than the number of output tokens for fully executing the original task. Taking programming tasks as an example, fully implementing a function may require generating hundreds of lines of code (operation-intensive), while analyzing the problem and providing implementation strategies only requires a few paragraphs of text (knowledge-intensive). Since the inference cost of large language models is positively correlated with the number of tokens, the guidance cost is significantly lower than the execution cost.

[0063] As can be seen, the role constraints of the expert model instances in the expert model service module are implemented through a dual mechanism: the system prompts of the expert model instances limit their responsibilities to analyzing problems, providing solutions, explaining principles, and assessing risks, and do not directly perform operational tasks; the available tool list of the expert model instances does not include operational tools, and only read-only tools are configured or no tools are configured, so that the expert model instances are structurally unable to directly perform operational tasks at the architecture level.

[0064] S150. Using the executor model instance, the inference request is processed according to the guidance scheme to obtain the inference result corresponding to the inference request.

[0065] Specifically, please refer to Figure 4 Step S150 includes: S1501. Receive the guidance scheme through the consultation agent module, and inject the guidance scheme as a tool invocation result into the context information of the executor model instance to obtain enhanced context information; S1502. The inference request is processed by the executor model instance based on the enhanced context information to obtain the inference result corresponding to the inference request.

[0066] In this embodiment, the executor model instance continues to process the original request in an enhanced context containing guidance, and performs operations according to the strategies and steps of the guidance to obtain high-quality inference results.

[0067] S160. The executor model instance performs reasoning processing on the reasoning request based on the context information of the reasoning request to obtain the reasoning result corresponding to the reasoning request.

[0068] In this embodiment, if the executor is assessed as having sufficient capability, it enters the autonomous solution path, where the executor completes the task independently without triggering any upgrade operations. The cost of this path is only the inference cost of the executor model instance.

[0069] Furthermore, embodiments of this application support multiple rounds of interaction within a single request.

[0070] Specifically, the sequence diagram of multiple interactions within a single request is as follows: Figure 5 As shown: In one specific embodiment, the multi-turn interaction is implemented based on a tool invocation mechanism: the available tool list of the executor model instance includes an expert consultation tool (e.g., named ask_expert). When the executor model instance invokes this tool during inference, the backend consultation agent module receives and parses the tool invocation, constructs a structured consultation request and sends it to the expert model instance, injecting the expert's response as the return result of the tool invocation into the executor's context. This process is transparent to the executor model instance; from the executor's perspective, ask_expert has the same interface pattern as other tool invocations (such as reading files and executing commands).

[0071] Each round of multi-round consultation is completed within the same HTTP request processing flow. The context of the executor model instance continues to accumulate, and by the second round of consultation, the executor's context already includes the questions and guidance from the first round, making subsequent consultations progressive.

[0072] Furthermore, the system allows setting a maximum consultation round limit (default value is 2-3 rounds). Once the limit is reached, the executor model instance will attempt to complete the task based on the acquired guidance information. In actual operation, most requests requiring upgrades receive sufficient guidance after 1-2 consultation rounds.

[0073] The quality of upgrade decisions made by executor model instances directly impacts the system's cost-quality balance. Decision bias manifests in two ways: (a) Over-escalation (false positive): When an implementer is capable enough, they unnecessarily trigger escalation consultations, resulting in additional expert costs. This type of bias reduces cost efficiency but does not affect quality.

[0074] (b) Inadequate escalation (false negative): The executor fails to trigger escalation consultation due to insufficient capacity, resulting in substandard output quality. This type of deviation saves consultation costs but compromises quality.

[0075] Therefore, this application iteratively optimizes and upgrades the decision threshold through a quality feedback mechanism. The method further includes: Record the decision results of each upgrade decision for the executor model instance. The decision results include: whether to trigger an upgrade, the task completion quality after the upgrade, and the task completion quality without the upgrade. The upgrade decision refers to determining whether the executor model instance has the execution capability to complete the inference request. Based on the historical statistics of the decision results, identify the deviation patterns of the upgrade decisions. The deviation patterns include the quality loss caused by failing to upgrade when it should have been upgraded and the cost waste caused by unnecessary upgrades. Typical cases of the deviation patterns are injected as minority sample examples into the system prompt words of the executor model instance.

[0076] The specific implementation method is as follows: Step 1 (Record Decisions): Record the executor's upgrade decision (whether to trigger an upgrade) and the final task completion quality (evaluated through acceptance criteria verification, user feedback, or automated testing) in each request processing.

[0077] Step 2 (Analyze Deviation Patterns): Periodically analyze decision logs to identify typical deviation cases, such as "failure to upgrade when encountering a multi-file architecture decision, resulting in improper implementation" or "unnecessary upgrades triggered by simple configuration modifications, wasting costs."

[0078] Step 3 (Update Decision Strategy): Inject the identified typical deviation cases as few-shot examples into the system prompts for the executor model instances, in the form of: "The following are reference cases for escalation decisions:" - When faced with cross-module architecture decisions, consult experts → Example: [Specific Case] - When only a single file needs partial modification, no upgrade is required → Example: [Specific Case]” This feedback mechanism leverages the in-context learning capabilities of large language models to fine-tune the upgrade decision threshold by adjusting examples in the system prompts, without retraining the model.

[0079] Furthermore, in this embodiment, the autonomous capability assessment and upgrade decision of the executor model instance are integrated with the attention scheduling kernel: Specifically, the executor model instance runs on top of the attention scheduling kernel, which manages multiple attention threads within the executor model instance, each attention thread corresponding to an independent inference processing procedure; when the executor model instance does not have the ability to complete the inference request, the attention scheduling kernel suspends the current attention thread of the executor model instance, creates an expert consultation thread, and manages the lifecycle of the expert consultation thread; after the expert model instance sends the guidance plan to the executor model instance, the attention scheduling kernel creates an execution guidance thread in the executor model instance, and resumes the task execution of the inference request based on the execution guidance thread.

[0080] When the executor model instance is deployed in the multi-core coordination layer of a multi-agent system, the upgrade request is routed to the expert model instance through the multi-core coordination layer. The multi-core coordination layer manages the lifecycle of the relevant threads in the executor model instance through the interface of the attention scheduling kernel. The executor model instance is unaware of the existence of the multi-core coordination layer.

[0081] At this time, there are multiple expert model instances, and the executor model instance is deployed on the multi-core coordination layer. Sending an upgrade consultation request for the inference request to the expert model instance through the executor model instance includes: determining the expert model instance that needs to be called from among the multiple expert model instances through the multi-core coordination layer; sending an upgrade consultation request for the inference request to the expert model instance that needs to be called through the executor model instance; wherein, the multi-core coordination layer manages the lifecycle of consultation-related threads in the executor model instance through the interface of the attention scheduling kernel.

[0082] More specifically, when the executor model instance runs on top of the attention scheduling kernel, the upgrade consultation process is deeply integrated with the attention thread scheduling. The attention scheduling kernel is a scheduling middleware for large language model inference, which abstracts each inference process of the large language model into an attention thread, and manages the creation, suspension, resumption and termination of attention threads through a deterministic scheduler.

[0083] Within the attention scheduling kernel framework, each inference process of the executor model instance corresponds to a running attention thread. When the executor triggers an upgrade request during inference, the processing flow is as follows: Step 1: The output of the executor model instance contains calls to the ask_expert tool; Step 2: The consultation agent module receives and parses the tool call, identifying it as a system upgrade call; Step 3: The attention scheduling kernel pauses the currently running thread and saves its state to the TCB; Step 4: The consultation agent module constructs a structured consultation request and sends it to the expert model instance; Step 5: The expert model instance returns a guidance solution; Step 6: The attention scheduling kernel creates an "execution guidance" thread in the executor and injects the guidance scheme as the context of that thread; Step 7: The scheduler resumes execution of the original thread, and the executor continues processing in a context containing the guidance plan; This integration method makes upgrade consultation a standard system call of the attention scheduling kernel, providing a unified interface with operations such as thread creation and phase switching. The attention scheduling kernel is an independent scheduling middleware layer. Its core abstraction is as follows: each inference call of the large language model is treated as an attention thread. Thread control blocks (TCBs) record thread states (including current context, execution phase, and suspension reasons), and a deterministic scheduler manages the thread lifecycle transitions (creation → running → paused → resumed → terminated). This kernel does not depend on a specific large language model implementation and can be adapted to any large language model instance that supports tool call interfaces. It should be noted that the integration of the upgrade consultation mechanism and the attention scheduling kernel in this application is an optional extension.

[0084] When this application is applied to a system containing multiple agents, a multi-core coordination layer (organizational coordination layer) exists in the system, responsible for managing the collaborative relationships among multiple agent instances. The core functions of the multi-core coordination layer include: (a) maintaining the registration information and availability status of multiple agent instances (including executor model instances and expert model instances); (b) receiving cross-agent requests from any agent instance (such as the executor's upgrade consultation request in this application) and routing them to the target agent instance; and (c) creating, managing, and terminating relevant attention threads in the target agent through the attention scheduling kernel interface of each agent instance, thereby achieving cross-agent thread lifecycle management. The multi-core coordination layer is transparent to individual agents; each agent triggers cross-agent operations only through standard tool call interfaces (such as ask_expert), unaware of the existence and routing logic of the multi-core coordination layer.

[0085] In the coordination layer of a multi-agent system, when multiple executor model instances and expert model instances are managed uniformly, the executor's upgrade request is routed to the expert model instance through the coordination layer. The coordination layer manages the lifecycle of relevant threads within the executor's kernel through the attention scheduling kernel interface. The executor is unaware of the coordination layer's existence; from the executor's perspective, the upgrade operation is completed via the `ask_expert` tool call, with the coordination layer transparently handling routing and thread management behind the scenes.

[0086] Furthermore, this application also provides a degradation protection mechanism. After obtaining the inference request, the method further includes: when the executor model instance is unavailable, routing the inference request to the expert model instance for inference processing; when the expert model instance is unavailable and the executor model instance does not have the ability to complete the execution of the inference request, not sending the upgrade consultation request to the expert model instance through the executor model instance, but directly performing inference processing on the inference request through the executor model instance, and marking the confidence level in the inference result.

[0087] Specifically, such as Figure 6 As shown, the system's degradation protection mechanism in the event of component failure is as follows: Scenario 1 (Executor Model Instance Unavailable): When the executor model instance becomes unavailable due to failure, overload, or offline, inference requests are directly routed to the expert model instance for processing. In this scenario, the expert model instance is not bound by roles and processes requests in full execution mode. System costs degrade to the level of using the expert model at full capacity, but service continuity is guaranteed.

[0088] Scenario 2 (Expert Model Instance Unavailable): When the expert model instance is unavailable, the executor model instance continues to receive and process requests normally. For requests that the executor assesses as having sufficient capability (e.g., accounting for 70%-80%), the processing flow is completely unaffected. For requests that the executor assesses as having insufficient capability, since escalation consultation cannot be triggered, the executor will try its best to complete the task based on its own capabilities, and will indicate the confidence level in the output, prompting the user that manual review may be required.

[0089] The confidence level annotation is implemented as follows: the system prompt for the executor model instance includes a downgrade mode instruction. When an expert model instance is detected as unavailable, this instruction is activated, requiring the executor model instance to append confidence level metadata to its output. This confidence level metadata includes the following fields: a confidence level field, with values ​​of "high," "medium," or "low," corresponding to the executor's self-assessment of its output quality; a uncertainty cause field, describing the specific sources of uncertainty identified by the executor during processing (e.g., "involves cross-module architecture decisions, without expert confirmation"); and a suggested review scope field, marking the parts of the output that require manual review. The confidence level annotation is implemented through a formatted output instruction in the system prompt, with the executor model instance appending confidence level information to the end of the output according to a predefined structured format. System quality is downgraded for this type of request, but it does not affect the processing of most requests.

[0090] Both degradation scenarios are triggered automatically without manual intervention. The system monitors the availability of each model instance through a heartbeat detection mechanism, automatically switching to the corresponding degradation mode when a fault occurs, and automatically switching back to normal mode after the fault is recovered.

[0091] Furthermore, the heterogeneous large language model inference system also includes a model selection module. When there are multiple executor model instances, before assigning the inference request to the executor model instance, the method further includes: The model selection module selects the executor model to be invoked from multiple executor model instances based on a preset model selection strategy. When there are multiple expert model instances, before sending the guidance scheme to the executor model instance, the method further includes: The model selection module selects the expert model instance to be invoked from multiple expert model instances based on a preset model selection strategy. The model selection strategy includes at least one of the following: performance scoring strategy, probability allocation strategy, user priority strategy, and scene-aware selection strategy.

[0092] Specifically, when the executor model service module or expert model service module contains multiple model instances of the same type (e.g., the executor model pool contains multiple low-cost model instances such as Claude Haiku, GPT-4o-mini, and DeepSeek-V3), the system needs to select a specific model instance from the same cost category to process the request. This application provides performance scoring strategies, probability allocation strategies, user priority strategies, and scene-aware selection strategies. Each strategy can be used individually or in combination. The following provides a detailed description of each strategy: Performance scoring strategy: The performance scoring strategy is based on pre-scoring ranking and a dynamic degradation recovery mechanism. The specific implementation is as follows: Step 1 (Pre-scoring): The system maintains a model performance scoring table and scores each model instance in the same cost category across multiple dimensions. Scoring dimensions include: Ability scoring: Performance scores based on standard test sets (such as code generation, logical reasoning, knowledge question answering, etc.); Price cost: the reasoning cost per unit of token; Cost-effectiveness index: the ratio of capability score to price cost; In one specific embodiment, the executor model pool contains three model instances: Model A: Ability score 85, price $0.0003 / 1K tokens, cost-effectiveness 283; Model B: Ability score 80, price $0.0004 / 1K tokens, cost-effectiveness 200; Model C: Ability score 75, price $0.0002 / 1K tokens, cost-effectiveness 375; Based on cost-effectiveness: C > A > B; Step 2 (Preferred Allocation): When an inference request arrives, the system prioritizes allocating the request to the model instance with the highest cost-effectiveness score (model C in this example).

[0093] Step 3 (Health Monitoring): The system continuously monitors the service quality indicators of each model instance, including: Response time: The delay from when the request is sent to when the first token is returned; Availability: The percentage of requests that are successfully processed; Error rate: The percentage of failures caused by timeouts, rate limiting, or service outages; Step 4 (Dynamic Degradation): When the service quality metrics of the optimal model instance fall below a preset threshold (e.g., response time exceeds 5 seconds, availability is below 95%, error rate is above 5%), the system automatically degrades subsequent requests to the suboptimal model instance. The degradation decision is based on sliding window statistics. The system maintains the quality metrics of the most recent N requests (e.g., N=100), and triggers degradation when the metrics within the window continue to deteriorate.

[0094] Step 5 (Periodic Monitoring and Recovery): During the degradation period, the system periodically sends probe requests to the optimal model instance (e.g., every 30 seconds) to check whether its service quality has returned to normal. These probe requests are lightweight test requests and do not affect normal business traffic. When N consecutive probes (e.g., N=3) show that the service quality has returned to normal, the system automatically switches traffic back to the optimal model instance.

[0095] Step 6 (Seamless Switching): Model instance switching leverages the memoryless nature of large language models. Large language models are stateless reasoning systems; their output is entirely determined by the current input and they do not retain internal state across calls. Therefore, as long as the complete conversation context (including historical dialogues, tool call results, intermediate states, etc.) is injected into the new model instance, the new model instance can seamlessly continue processing without the user's awareness.

[0096] This application, based on a deep understanding of the memoryless nature of large language models, creatively applies this characteristic to the dynamic scheduling of heterogeneous model instances, achieving a novel technical effect. Specifically, service instance switching in traditional distributed systems typically requires complex state transition mechanisms (such as session state serialization, state synchronization protocols, checkpoint recovery, etc.). These mechanisms involve multiple technical steps, including state capture, state transmission, and state reconstruction, resulting in switching latency typically in the order of seconds or even longer. Large language models, due to their stateless nature, naturally support zero-cost instance switching, requiring only that the new instance receives the same context input as the old instance, without any state serialization or state synchronization operations. This application identifies the application value of the memoryless nature of large language models in heterogeneous scheduling scenarios and designs a specific model switching protocol based on this characteristic. It achieves millisecond-level model instance switching through a context injection mechanism, completely transparent to the user. This technical insight allows the model switching in this application to be dynamically performed based on service quality during a single session, without being limited by traditional state transition mechanisms. In existing technologies, model selection for pre-routing schemes occurs before the session begins and does not involve switching during the session; model switching in cascading schemes involves task redoing, wasting the work of preceding models. This application is the first technical solution that systematically utilizes the memoryless characteristics of large language models to achieve real-time, seamless model switching within a session. This application innovatively solves the technical challenge of state transition in the dynamic scheduling of heterogeneous models.

[0097] For example, in one specific embodiment, a user is performing a multi-turn dialogue programming task with an executor model instance. During the fifth round of dialogue, the model's response time increases from an average of 1.2 seconds to 6.8 seconds due to insufficient computing power. The system detects the degraded service quality and automatically switches the sixth round and subsequent requests to a backup model instance. During the switch, the system injects context information such as the complete dialogue history of the previous five rounds, the content of read files, and the results of executed commands into the backup model instance. The backup model continues to process the sixth round request based on the complete context. The user is unaware of the model switch throughout the process, and the task quality remains unaffected.

[0098] Probability allocation strategy: The probability-based allocation strategy is applicable to multiple model instances with similar performance scores, achieving load balancing through probability allocation. The specific implementation is as follows: Step 1 (Identifying Same Groups): The system identifies model instances whose performance score differences are less than a preset threshold (e.g., score difference <5%) and groups them into the same score group.

[0099] Step 2 (Obtaining Computing Capacity): The system obtains the computing capacity information for each model instance. The computing capacity can be determined in the following ways: API provider-declared concurrency limits (such as maximum requests per minute); Measured throughput (number of requests that can be processed per unit of time); User-configured capacity weights; Step 3 (Calculate Probability): Suppose there are n model instances in the same rating group, and the computing power capacity of model instance i is C. i The probability that the model instance is selected is: P i = C i / (C1 + C2 + ... + C n ); In one specific embodiment, the cost-effectiveness scores of Model A and Model B in the executor model pool are 283 and 280 respectively, with a difference of only 1%, and they are grouped into the same score group. Model A has a computing power capacity of 70 req / min, and Model B has a computing power capacity of 30 req / min. Therefore: P A = 70 / (70+30) = 0.7; P B = 30 / (70+30) = 0.3; Each time a request arrives, the system generates a random number r∈[0,1). If r<0.7, it is assigned to model A; otherwise, it is assigned to model B.

[0100] The advantages of the probability allocation strategy are: (a) fully utilizing the computing power capacity of each model instance to avoid overloading a single model; (b) reducing the risk of single point of failure by distributing traffic for models with similar scores; and (c) achieving natural load balancing without the need for complex scheduling algorithms.

[0101] User priority strategy: The user priority strategy allows users to customize the model selection order based on business needs and preferences. The specific implementation is as follows: Step 1 (User Configuration): The system provides a configuration interface that allows users to specify priorities for model instances within the same cost category. In a specific embodiment, the user configuration is as follows: { "worker_models" : [ {"model" : "claude-haiku" , "priority" : 1} , {"model" : "gpt-4o-mini" , "priority" : 2} , {"model" : "deepseek-v3" , "priority" : 3} ] } Step 2 (Priority-based allocation): When an inference request arrives, the system first tries the model instance with priority=1. If the model instance is available and the service quality is normal, it is allocated to that model; otherwise, it tries the model instance with priority=2, and so on.

[0102] Step 3 (Degradation and Recovery): The degradation and recovery mechanism of the user priority strategy is similar to that of the performance scoring strategy. When the high-priority model is unavailable or the service quality deteriorates, it will automatically degrade to the second-priority model. The availability of the high-priority model will be checked periodically, and it will automatically switch back after recovery.

[0103] Application scenarios for user priority strategies include: (a) users have a preference for the output style of a specific model; (b) users have a business agreement with a specific API provider and give priority to using the provider's model; and (c) users limit the range of models to be selected based on compliance requirements (such as data sovereignty).

[0104] Scene-aware selection strategy: The scenario-aware selection strategy dynamically selects the best-performing model instance for the specific scenario of the current task. This strategy is based on the following technical insight: even within the same cost category, different model instances exhibit significant performance differences across different task scenarios. These differences stem from variations in data distribution and optimization objectives during the pre-training phase.

[0105] The specific implementation method is as follows: Step 1 (Scenario Performance Evaluation): The system performs scenario-based performance evaluations on model instances within the same cost category for typical task scenarios in the target application domain. Taking automated programming applications as an example, typical scenarios include: code writing scenarios, automated testing scenarios, code review scenarios, and architecture design scenarios. A standard test set is built for each scenario to evaluate the task completion quality, response speed, and cost efficiency of each model instance in that scenario, establishing a scenario-model performance matrix. Specifically, for each scenario, based on the task completion quality, response speed, and cost efficiency of each model in that scenario, the optimal model instance for that scenario is determined, and the correspondence between the scenario and the optimal model instance is constructed, thus obtaining the scenario-model performance matrix.

[0106] For example, in a specific commercial deployment implementation, the applicant used an internal proprietary automated programming task test dataset to evaluate the performance of two high-cost model instances (Model A and Model B) from the expert model pool under different scenarios. The test dataset contained 300 task samples extracted from real projects and manually labeled, covering typical scenarios such as code writing, automated testing, code review, and architecture design. The evaluation results are shown in Table 1. Table 1: Step 2 (Scene Recognition): The system identifies the task scenario type from the current task context during runtime. Scenario recognition is based on deterministic rules and does not require training an additional classifier model. The recognition criteria include: (a) Tool call pattern analysis: consecutive writefile and editfile calls indicate a code writing scenario, while generatetest, runtest, and analyzecoverage calls indicate a test scenario; (b) Task description keyword matching: "implement," "addfeature," and "refactor" correspond to code writing, while "test," "coverage," and "edge case" correspond to test scenarios; (c) Context state analysis: whether the current processing stage is the function implementation stage or the quality assurance stage.

[0107] Step 3 (Dynamic Model Selection): Based on the identified scene type, query the scene-model performance matrix for the best-performing model instance in that scene and route the request to that model instance. In the above embodiment, the code-written scene is routed to model A, and the automatic test scene is routed to model B.

[0108] Step 4 (Seamless Scene Switching): When the processing of a single session spans multiple scenes (e.g., code writing followed by test generation), the system leverages the memoryless nature of the large language model to achieve model switching between scenes. Specifically, when the scene recognition module detects a scene transition (e.g., the executor switches from calling the code writing tool to calling the test generation tool), the system switches the complete session context from the current model instance and injects it into the optimal model instance for the new scene. Since the large language model does not retain internal state across calls, as long as the new model instance receives the complete context, processing can seamlessly continue without the user's awareness.

[0109] The technical advantages of the scene-aware selection strategy are: (a) maximizing the relative advantages of each model, so that each task scenario is handled by the model with the strongest performance in that scenario; (b) scene recognition based on deterministic rules, without the need to train and maintain classifier models; (c) utilizing the memoryless nature of zero-cost switching, so that scene selection does not affect the continuity of interaction; (d) strong scalability. In the above commercial deployment, after adopting the scene-aware selection strategy, the overall task completion quality of the system is improved by about 6% compared with using a single model, without increasing the inference cost.

[0110] Strategy Combination: The above four strategies can be used in combination. In one specific embodiment, the system employs the following combination strategy: First, a user priority strategy is applied to filter out model instances that users are allowed to use; Within permissible limits, the application scenario-aware selection strategy selects the optimal model based on the current task scenario. If there are multiple model instances with similar performance in this scenario, a performance scoring strategy should be applied to rank them. For model instances with similar scores, a probability allocation strategy is applied to achieve load balancing; The model selection strategy is implemented independently in the executor model service module and the expert model service module. The executor model pool and the expert model pool each maintain independent scoring tables, health monitoring indicators, scenario-model performance matrices and user configurations, without interfering with each other.

[0111] This embodiment can reduce the overall reasoning cost. First, the executor model instance independently processes requests within its capabilities without triggering step S130. The processing path of these requests does not involve the reasoning calculation of the expert model instance. Secondly, the output guidance scheme of the expert model instance is limited to analysis and guidance text. This limitation is achieved through the system prompt word role constraint and tool configuration constraint of the expert model instance, so that the number of output tokens of a single consultation of the expert model instance is less than the number of output tokens required to fully execute the original request. The overall system cost model is as follows: Define the following parameters: C w The average cost for an executor model instance to process a single request; C e The average cost of processing a single consultation using an expert model instance; C ref : Average cost of fully processing a single request using an expert model instance (reference baseline); S RSelf-resolved rate (the percentage of requests that can be completed independently by the executor without requiring an upgrade); K = C ref / C w The ratio of the expert's total execution cost to the executor's cost; Overall average system cost: Cost avg =S R × C w + (1-S R ) × (C w + C e )= C w + (1-S R ) ×C e ; Note C e (Expert consultation cost) is much less than C ref (Expert's complete execution cost), because the expert only provides guidance and does not perform the operation. Let C be the cost. e ≈ β × C ref β represents the proportion of consultation cost to the total execution cost. A typical value for β ranges from 0.1 to 0.3, and its rationale is based on the following analysis: In consultation mode, expert model instances only output analysis and guidance text (such as root cause analysis, recommended strategies, and implementation step suggestions), without generating operational output (such as complete code, tool call sequences, and file modification content). The output token quantity of the former is typically 10% to 30% of that of the latter, while the inference cost of large language models is approximately linearly related to the output token quantity.

[0112] but: Cost avg = C w + (1-S R ) × β ×C ref ; Cost reduction percentage = 1 - Cost avg / C ref =1 -(1 / K + (1-S R )×β); In one specific embodiment, K=10 (the executor cost is 1 / 10 of the expert's full execution cost), S R =75% (75% of requests are completed independently by the executor), β=0.2 (expert consultation costs are 20% of the total execution cost), then: Cost reduction percentage = 1 - (1 / 10 + 0.25 × 0.2) = 1 - (0.1 + 0.05) = 85%; In one specific embodiment, the inventors constructed an internal verification system based on the executor-first architecture of this application. This system ran continuously for 20 days, handling real software development tasks. The verification system was configured as follows: the executor model instances used lightweight commercial large language models (e.g., but not limited to, MiniMax series models) with a cost approximately 1 / 20th that of expert models; the expert model instances used large commercial models with high inference capabilities (e.g., but not limited to, GLM-5 and Claude Opus-4-6 models). Verification scenarios included mixed programming tasks such as code writing, debugging, and architecture design, as well as C language audio algorithm optimization tasks and fully automated embedded C software programming and debugging tasks (e.g., UDP <-> RS232 protocol conversion, with the target hardware being a 32-bit RISC-V architecture development board).

[0113] During the 20-day operation, the system processed approximately 1 billion expert model instance inference tokens. Statistics show that approximately 76% of the requests were completed independently by the executor model instances (i.e., the self-resolved rate S). R The average number of tokens output per consultation for expert model instances is approximately 0.76, with 24% of requests triggering upgrade consultations. The average number of tokens output per consultation for expert model instances is approximately 18% of the total tokens required to complete the same task (i.e., β≈0.18). The overall system inference cost is reduced by approximately 82% compared to using only expert model instances, consistent with the predicted value of the theoretical cost model. No significant differences were found in task completion quality after manual sampling review. These data validate the effectiveness of the proposed solution in a real production environment, and this effect is not dependent on the selection of a specific model, but rather stems from the executor-first architecture and consultation collaboration mechanism of this application.

[0114] To further understand the reasoning method based on heterogeneous large language models provided in this application, the following description, combined with specific application scenarios, illustrates the reasoning method based on heterogeneous large language models provided in this application: Example 1: Fully automated programming scenario.

[0115] In the fully automated programming engine, executor model instances (low-cost models) are responsible for operation-intensive tasks such as code writing, debugging, and testing. Most programming requests (such as implementing well-defined functions, fixing clearly located defects, and writing standardized tests) can be completed independently within the capabilities of the executor. When encountering situations requiring high-level architectural decisions (such as module splitting strategies and data model design) or complex technical analysis (such as concurrency race condition analysis and security vulnerability assessment), the executor triggers escalation consultation.

[0116] # Tool definition for executor model instance tools = [ read_file, write_file, execute_command , search_code , ask_expert, # Read file # Write to file # Execute command # Search code # Consult an expert (Upgraded consultation portal) ] The following example illustrates the specific execution process of the fully automated programming scenario in Example 1 through a complete end-to-end interaction example.

[0117] User request (inference request): "Add batch import functionality to the user management module to support importing user data from CSV files."

[0118] Step S1 (Executor receives first): The request is directly assigned to the executor model instance (low-cost model, such as Claude Haiku). The executor receives the complete context, including the user request text, the current project's file structure, and existing code.

[0119] Step S2 (Executor's Autonomous Capability Assessment): The executor model instance begins processing the request. The executor first calls the `search_code` tool to locate the existing user management module's code structure and the `read_file` tool to read relevant files. After understanding the existing code, the executor assesses that the task involves CSV (a common text file format) parsing, data validation, batch database writing, and error handling. The executor determines that CSV parsing and basic data validation are within its capabilities, but the transactional consistency strategy for batch writing (such as row-by-row writing, batch transactions, and phased commits) involves architectural decisions and exceeds its capabilities.

[0120] Step S3 (Condition-triggered upgrade consultation): The executor model instance calls the ask_expert tool, and the output is as follows: { Question: "Database write strategy for batch user imports: Existing code uses ORM (Object-Relational Mapping) to insert data row by row. In a batch import scenario, what transaction strategy should be adopted to balance performance and data consistency?" "context_summary": "The user management module is located in src / modules / user / , uses TypeORM, and currently uses a single-user creation interface that inserts records one by one. The CSV may contain thousands of records." "worker_analysis": "Three initial solutions are considered: (1) INSERT row by row + single transaction, simple but transaction length is too long with large data volumes; (2) Batch INSERT, good performance but error location is difficult; (3) Batch commit (e.g., one transaction for every 100 rows), a compromise but requires handling partial failures. Solution 3 is preferred, but the batch size and partial failure handling strategy are uncertain." } The consultation agent module receives and parses the tool call, extracts supplementary information from the session context (read file content, database configuration, etc.), merges it with the executor's output to construct a structured consultation request, and sends it to the expert model instance (high-cost model, such as Claude Opus).

[0121] Step S4 (Expert Guidance Return): After the expert model instance analysis, a guidance solution is returned. "Root cause analysis: The core challenge of batch import is striking a balance between performance, consistency, and observability."

[0122] Recommended strategy: Use the phased submission approach (Option 3), with specific suggestions as follows: 1. Set the batch size to 200 rows (the performance inflection point for TypeORM batch INSERT is usually between 100 and 500 rows). 2. Each batch uses an independent transaction; in case of failure, only the current batch is rolled back. 3. Maintain import result report: Record the number of successful imports, the number of failed imports, and the reasons for each failure. 4. Perform format validation on the CSV data first (first pass), then perform business validation and write (second pass) to avoid interruptions due to format errors during the write process. Risk warning: Pay attention to unique constraint conflicts (such as duplicate email addresses). It is recommended to provide two strategies for users to choose from: "Skip Duplicates" and "Overwrite Update". Step S5 (Executor digests and executes the guidance): The consultation agent module injects the expert guidance as the return result of the ask_expert tool into the executor context. Following the guidance, the executor model instance sequentially calls the write_file tool to create the CSV parsing module, batch import service classes, and import result report generation logic, and calls the execute_command tool to run tests to verify the functionality's correctness. Throughout the entire implementation, the executor holds all tool call permissions; the expert model instance does not directly write any code or call any tools.

[0123] In this example, the expert model instance only outputs guidance text for approximately 200 tokens (knowledge-intensive), while the complete implementation process of the executor model instance involves code generation and tool calls for approximately 2000 tokens (operation-intensive). If the expert model instance were to perform this task entirely, its cost would be approximately 20 times that of the executor; using the consultation model of this application, the expert only undertakes approximately 10% of the token amount, and the total cost is approximately 15% of that of pure expert execution.

[0124] Example 2: Intelligent Customer Service Scenario.

[0125] In the intelligent customer service system, the executor model instance handles routine customer inquiries (such as product information queries, order status queries, and standardized process guidance). When encountering complaints requiring professional legal judgment, interpretation of complex refund rules, or cross-system collaborative exception handling, the executor triggers escalation consultation to obtain expert guidance before continuing to interact with the customer.

[0126] Example 3: Medical auxiliary diagnosis scenario.

[0127] In medical AI-assisted systems, executor model instances handle standardized information collection (medical history inquiry, examination item suggestions, and basic vital sign analysis). When encountering atypical symptom combinations, complex drug interaction assessments, or rare disease differential diagnoses, the executor triggers escalation consultations. Expert model instances provide differential diagnostic analysis and further examination suggestions, and the executor integrates expert guidance into interactions with patients or doctors.

[0128] Example 4: Fully automated development scenario for embedded software.

[0129] In embedded software development tasks, the executor model instance is responsible for operation-intensive tasks such as code writing, compilation and debugging, and protocol implementation. In a specific embodiment, the task objective is to implement bidirectional conversion between the UDP network protocol and the RS232 serial port protocol. The target hardware platform is a development board based on a 32-bit RISC-V architecture (e.g., but not limited to the AG32F407 development board, which is comparable to the performance level of the ARM Cortex-M4). This task involves low-level hardware operations, real-time constraints, and multi-protocol interaction, exhibiting typical characteristics of embedded systems.

[0130] During execution, the executor model instance (using a low-cost model, such as but not limited to the MiniMax series model) independently completed the following sub-tasks: UDP socket initialization and configuration, RS232 serial port parameter settings (baud rate, data bits, stop bits, parity bits), data buffer allocation and management, basic send / receive logic implementation, and protocol format conversion code writing. When encountering issues requiring in-depth technical analysis, upgrade consultation was triggered: (a) Conflict analysis between DMA transfer and interrupt priority configuration. In high-speed data flow scenarios, the priority inversion between DMA transfer completion interrupt and serial port receive interrupt may occur, requiring analysis of interrupt nesting strategies; (b) Buffer overflow risk assessment under high-speed data flow. The arrival rate of UDP packets may exceed the serial port sending rate, requiring the design of flow control mechanisms; (c) Task scheduling strategy selection under real-time requirements. In an RTOS (Real-Time Operating System) environment, the priority allocation of protocol conversion tasks, data send / receive tasks, and monitoring tasks needs to balance real-time performance and throughput.

[0131] Expert model examples (using high-cost models, such as but not limited to GLM-5 or Claude Opus-4-6 models) provide technical guidance for the above consultations, including: DMA channel allocation suggestions (it is recommended that UDP reception use DMA channel 1 and RS232 transmission use DMA channel 2 to avoid channel conflicts), double buffering mechanism design scheme (using ping-pong buffers, one buffer for DMA writing and the other for serial port reading, switching alternately), RTOS task priority configuration strategy (setting protocol conversion tasks to medium priority, data transmission and reception tasks to high priority, and monitoring tasks to low priority), and interrupt priority grouping suggestions (DMA interrupt priority is higher than serial port interrupt to ensure data integrity).

[0132] The executor model instance was implemented with expert guidance, including DMA initialization code, dual-buffer switching logic, RTOS task creation and priority setting, and interrupt service routine writing. Finally, hardware-in-the-loop testing was used to verify functional correctness. Test scenarios included: lossless forwarding of continuous high-speed UDP data streams (1000 packets / second), stability verification under serial port baud rate switching (9600 to 115200), and memory leak detection during long-term operation (24 hours).

[0133] In this case, the consultation cost of the expert model instance is approximately 15% of the cost required to fully implement the task, while the executor model instance handles approximately 85% of the code writing and debugging work. Compared to using only expert model instances, the total cost is reduced by approximately 78%. This embodiment verifies the effectiveness of the proposed solution in embedded system development scenarios. This effectiveness is not dependent on a specific hardware platform or development toolchain, but rather stems from the executor-first architecture of this application, where the executor undertakes the operationally intensive code writing and debugging work, while experts provide knowledge-intensive guidance only when in-depth technical analysis is required.

[0134] Example 5: Automatic programming application of scenario-based model selection.

[0135] In fully automated programming applications, the model selection mechanism of this application is reflected not only in the cost hierarchy selection of the executor-expert, but also in the dynamic selection of model instances based on task scenarios within the same cost hierarchy. This scenario-based selection leverages the memoryless nature of large language models to switch in real time to the optimal model instance for the current task type at different processing stages of a single session.

[0136] In a specific commercial deployment embodiment, the system's expert model pool contains two high-cost model instances: Model A and Model B. The applicant systematically evaluated the two models based on an internally proprietary automated programming task test dataset. This dataset contains 300 labeled samples extracted from real-world projects, covering multiple technical areas such as front-end development, back-end API, database operations, and system integration. The evaluation data shows significant differences in performance between the two models across different automated programming application scenarios: in code writing scenarios (including new feature implementation, code refactoring, API integration, etc.), Model A scored 92 points in task completion quality, while Model B scored 85 points, a performance difference of 7 points (approximately 8%); in automated testing scenarios (including test case generation, test strategy design, boundary condition identification, etc.), Model B scored 89 points in task completion quality, while Model A scored 76 points, a performance difference of 13 points (approximately 13%). This performance difference stems from the differences in data distribution and optimization objectives during the pre-training phase, resulting in relative advantages for each model in specific task types.

[0137] Based on the aforementioned evaluation data, the system implements a scenario-aware model selection strategy: When an executor model instance triggers an upgrade consultation, the consultation agent module identifies the task type from the current task context—by analyzing the executor's tool call history (e.g., consecutive write_file and edit_file calls indicate a code writing scenario, while generate_test and analyze_coverage calls indicate a testing scenario) and task description keywords (e.g., "implement" and "refactor" correspond to code writing, while "test" and "coverage" correspond to testing scenarios), classifying the task into a code writing scenario or an automated testing scenario. Based on the task classification results, the system dynamically selects the expert model instance with the best performance in that scenario: code writing scenarios are routed to model A, and automated testing scenarios are routed to model B.

[0138] The technology used in this scenario-based selection is based on the memoryless nature of large language models. Since large language models do not retain internal state across calls, during a single session, when a task transitions from the code writing phase to the testing phase, the system can switch and inject the complete session context (including written code, historical interactions between the executor and expert, intermediate results, etc.) from model A instance to model B instance, achieving a seamless switch within milliseconds. From the user's perspective, the entire session is guided by a single "expert," but the system dynamically selects the optimal model instance based on the task scenario, ensuring that each task stage is handled by the most powerful model for that scenario.

[0139] The system-level advantages of this embodiment are: (a) maximizing the relative advantages of each model and avoiding performance waste caused by a "one-size-fits-all" model selection; (b) seamless switching based on memoryless characteristics makes scenario selection completely transparent to the user and does not affect the continuity of interaction; (c) the scenario recognition logic is implemented by a deterministic program and does not rely on additional classifier training; (d) when the model pool expands to more model instances, the granularity of scenario classification can be further refined (e.g., subdividing code writing into sub-scenarios such as "front-end UI implementation," "back-end API implementation," and "database operations"), making model selection more accurate. In the above commercial deployment, after adopting the scenario selection strategy, the overall task completion quality of the system improved by about 6% compared to using a single expert model, without increasing inference costs.

[0140] Compared to existing pre-routing schemes: In the pre-routing scheme, the available information for the routing module is the surface characteristics of the request upon arrival (the text itself, length, keywords, etc.). Let the true difficulty of the request be determined by its intrinsic complexity C. intrinsic The decision is based on the surface feature F observed by the routing module. surface C intrinsic Noisy proxy variable: F surface= g(C intrinsic + noise. The routing module is based on F surface When making classification decisions, the accuracy of classification is limited by noise, which has a theoretical upper limit.

[0141] In this application, the available information for the executor model instance is the complete problem context (including request text, historical dialogues, tool execution results, intermediate reasoning processes, etc.), and the executor has a deep understanding of the problem (rather than merely observing surface features). The executor has a deep understanding of C. intrinsic The estimation is based on a complete understanding of the problem, and its estimation accuracy is higher than that based on F. surface Indirect inference. It is evident that the upgrade decision in this application is structurally superior to the pre-routing decision, with the advantage stemming from the timing of the decision (full understanding of the post-decision vs. understanding of the pre-decision), rather than the superiority of the decision algorithm. Even using the exact same decision algorithm, the information advantage of the post-decision makes its accuracy higher than that of the pre-decision.

[0142] Compared to existing cascading strategies: The upgrade method of the cascaded strategy is that the high-cost model completely redoes the original task (the work results of the low-cost model are discarded), while the upgrade method of this application is that the high-cost model only provides guidance (the work results of the low-cost model are retained and continue to be promoted under guidance).

[0143] In this application, the executor model instance, as a low-cost model, has already completed a full contextual understanding, basic analysis, and partial operation execution (such as code writing, tool invocation, and data preprocessing) of the inference request before triggering the upgrade consultation. These preliminary work results are not discarded due to the upgrade consultation but are passed on as core foundations to subsequent processes. The high-cost expert model instance focuses only on knowledge-intensive core links, outputting lightweight guidance solutions such as root cause analysis, solution strategies, and implementation step suggestions. It does not participate in the execution of any operation-intensive tasks, and its output token quantity is significantly reduced compared to the token quantity required to completely redo the original task, significantly reducing the inference consumption of the high-cost model. At the same time, the executor model instance, based on its existing work results, digests and absorbs the expert guidance solutions and continues to promote task implementation. This not only avoids the waste of the low-cost model's preliminary work but also minimizes the ineffective computing power consumption of the high-cost model through the division of labor mode of "low-cost execution + high-cost limited guidance," ultimately achieving a significant reduction in the overall inference cost of the system, while maintaining the same task completion quality as the complete execution of the high-cost model.

[0144] In summary, the method is applied to a heterogeneous large language model inference system, which includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance. Firstly, in this embodiment, the step of determining whether the executor model instance has the ability to complete the inference request (whether expert consultation is needed) occurs after the request is fully understood. This step is made by the executor model instance itself based on the context information of the inference request, resulting in high completeness of the decision information involved in determining the execution capability of the inference request, thereby reducing the classification error rate. Secondly, when the executor model instance does not have the execution capability of the current inference request, The system requests guidance from high-cost expert model instances, enabling executor model instances to complete inference under the guidance of expert model instances, resulting in high-quality inference. Thirdly, in this embodiment, the high-cost expert model instance only generates guidance; that is, the task execution right in this embodiment is always held by the executor model instance and is not transferred to the expert model instance. The high-cost model is only used for knowledge-intensive analysis and guidance and does not undertake operation-intensive execution work. The token consumption for a single consultation is far lower than that for the complete execution of the original task, thereby reducing the cost of inference execution. Fourthly, this application evaluates the execution capability of the executor model instance based on system prompts, eliminating the need to train and maintain external routing models or classifiers, thus eliminating the development cost, training data requirements, and continuous maintenance costs of routing models.

[0145] It is particularly noteworthy that the technical solution of this application overcomes the technical bias in this field. Existing technologies (including DiSRouter, SATER, SWE-Protege, etc.) generally believe that to achieve reliable autonomous upgrade decisions for executor model instances, it is necessary to enhance the model's self-evaluation ability through training methods at the model parameter level, such as supervised fine-tuning (SFT) and reinforcement learning (RL). This application takes the opposite approach, discovering that the instruction-following capabilities already possessed by current mainstream commercial large language models are sufficient to support reliable upgrade decisions—this capability can be activated through deterministic instruction configuration in system prompts, without requiring any training at the model parameter level. The system prompt configuration is a system configuration-level operation that does not change the model's internal parameters or weights, and is technically completely different from the model training process that requires a large amount of labeled data and computational resources. This discovery breaks the technical bias that "autonomous upgrade decisions must rely on model training," enabling the solution to be deployed immediately to any new model and new scenario, eliminating the data preparation, computational, and time costs required for SFT+RL training, and possessing a decisive advantage in engineering practice.

[0146] The core differences between this application and the closest prior art, SWE-Protege, lie in the following three aspects: (1) No training at the model parameter level is required. The instruction following ability learned by the large language model in the pre-training stage is utilized to achieve autonomous evaluation and upgrade decision through the deterministic instruction configuration in the system prompt words. The system prompt word configuration is an operation at the system configuration level, which is achieved by modifying the input text. It does not involve the modification of the model's internal parameters or weights. It is completely different from the SFT+RL training process that requires the backpropagation algorithm to update the model parameters. It eliminates the data preparation cost, computation cost and time cost required for training, and enables the solution to be deployed to new models and new scenarios in an instant. (2) By using a dual mechanism of system prompt word role limitation and toolset removal, the expert role is hard-constrained at the architecture level. This dual constraint solves the technical problem that the probability of the output of a large language model may cause the failure of a single soft constraint. The system prompt word constraint provides the first layer of protection by utilizing the model's instruction compliance capability, while the tool configuration constraint provides the second layer of hard protection at the system architecture level. Even if the model tries to violate the role constraint, it cannot perform the operation due to the lack of tool interface. In contrast, the expert role constraint of SWE-Protege is only an implicit effect of the training objective and lacks explicit guarantee at the architecture level. (3) A standardized structured consultation agreement is defined, which includes three mandatory fields: problem description field, context summary field, and preliminary analysis field. This makes the consultation process mechanically parsable, auditable, and reproducible, and supports automated quality monitoring and cost accounting. In contrast, SWE-Protege's consultation request is in an unstructured format.

[0147] Based on the above methodology, this application proposes a heterogeneous large language model reasoning method based on the autonomous upgrading decision of the executor. Its core architecture is "executor-first": all reasoning requests are first received and fully understood by low-cost executor model instances. After understanding the problem, the executor autonomously assesses whether its ability is sufficient. If the ability is sufficient, it completes the task independently (autonomous solution path). If the ability is insufficient, it sends a structured consultation request to a high-cost expert model instance to obtain guidance and then implements the task itself (upgrade consultation path).

[0148] The cost of the autonomous solution path is only the reasoning cost of the executor model instance. The cost of the upgraded consultation path is the sum of the reasoning cost of the executor model instance and the consultation cost of the expert model instance. Since the expert model instance only provides guidance solutions (knowledge-intensive, few tokens) and does not perform operational tasks (operation-intensive, large tokens), its consultation cost is significantly lower than the cost of fully executing the original task.

[0149] Table 2 shows the essential differences and detailed comparisons between this application and the prior art: Table 2: Large language models are memoryless probabilistic systems; their output is entirely determined by the current input and they do not retain internal states across calls. Therefore, their output quality is determined by the completeness of the input information. Extending this proposition to the model selection decision scenario: model selection is essentially a classification task, determining whether the current request exceeds the capabilities of a low-cost model. The accuracy of this classification task is positively correlated with the amount of information available to the classifier. In external pre-routing schemes, the available information for the classifier (routing module) is the surface features of the request, with an information completeness DI_router < 1. In this application's scheme, the available information for the classifier (the executor model instance itself) is the complete question context (including the request text, historical dialogue, tool execution results, intermediate inference processes, etc.), with an information completeness DI_worker ≈ 1. Since DI_worker > DI_router, the classification accuracy of this application's scheme is theoretically superior to that of the external pre-routing scheme.

[0150] Figure 1 This is a schematic block diagram of a heterogeneous large language model inference system provided in an embodiment of this application. Figure 1 As shown, corresponding to the above-described reasoning method based on heterogeneous large language models, this application also provides a heterogeneous large language model reasoning system. This heterogeneous large language model reasoning system includes modules for executing the above-described reasoning method based on heterogeneous large language models, and can be configured in a terminal or server. Specifically, please refer to... Figure 1 The heterogeneous large language model inference system includes an executor model service module 10 and an expert model service module 20. The executor model service module 10 contains at least one low-cost executor model instance, and the expert model service module 20 contains at least one high-cost expert model instance, wherein: The executor model instance is used to obtain a reasoning request; determine whether the executor model instance has the ability to complete the reasoning request based on the context information of the reasoning request; if the executor model instance does not have the ability to complete the reasoning request, then send an upgrade consultation request for the reasoning request to the expert model instance. The expert model instance is used to generate a guidance plan corresponding to the upgrade consultation request and send the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. The executor model instance is further configured to perform reasoning processing on the reasoning request according to the guidance scheme to obtain the reasoning result corresponding to the reasoning request; if the executor model instance has the ability to complete the execution of the reasoning request, it performs reasoning processing on the reasoning request based on the context information of the reasoning request to obtain the reasoning result corresponding to the reasoning request.

[0151] In some embodiments, when the executor model instance performs the step of determining whether the executor model instance has the ability to complete the inference request based on the context information of the inference request, it is specifically used for: The system prompt word is used to determine the capability assessment instruction corresponding to the executor model instance, and the capability assessment instruction includes multiple capability assessment dimensions. Based on the context information of the inference request and the multiple capability assessment dimensions, an autonomous capability assessment is performed to determine whether the executor model instance has the capability to complete the inference request.

[0152] In some embodiments, the list of available tools for the executor model instance includes expert consultation tools, and the heterogeneous large language model inference system further includes a consultation agent module 30; when the executor model instance executes the step of sending an upgrade consultation request for the inference request to the expert model instance if the executor model instance does not have the ability to complete the inference request, the specific steps are as follows: If the executor model instance does not have the ability to complete the inference request, the expert consultation tool is invoked. When the expert consultation tool is invoked, the consultation agent module 30 constructs the upgrade consultation request based on the context information of the inference request; and the consultation agent module 30 sends the upgrade consultation request to the expert model instance.

[0153] In some embodiments, when the consultation agent module 30 performs the step of constructing the upgrade consultation request based on the context information of the reasoning request, it is specifically used for: Extract and cache context summary information from the current session context information. The context summary information includes current task description information, executed operation steps information, key intermediate result information, and related code reference information. Send the context summary information to the executor model instance. Determine the problem description information based on the context summary information through the executor model instance, and send the problem description information to the consultation agent module 30. Construct an upgrade consultation request based on the context summary information and the problem description information.

[0154] In some embodiments, when the executor model instance performs the step of reasoning the inference request according to the guidance scheme to obtain the inference result corresponding to the inference request, it is specifically used for: The consulting agent module 30 receives the guidance scheme and injects it as a tool invocation result into the context information of the executor model instance to obtain enhanced context information; based on the enhanced context information, it performs reasoning processing on the reasoning request to obtain the reasoning result corresponding to the reasoning request.

[0155] In some embodiments, the heterogeneous large language model inference system further includes: The recording module is used to record the decision results of each upgrade decision of the executor model instance. The decision results include: whether to trigger an upgrade, the task completion quality after the upgrade, and the task completion quality before the upgrade. The upgrade decision refers to determining whether the executor model instance has the execution capability to complete the inference request. The identification module is used to identify deviation patterns in upgrade decisions based on historical statistics of the decision results. The deviation patterns include quality loss caused by failure to upgrade when it should have been upgraded and cost waste caused by unnecessary upgrades. An injection module is used to inject typical cases of the deviation pattern as few-sample examples into the system prompt words of the executor model instance.

[0156] In some embodiments, the executor model instance runs on top of an attention scheduling kernel, which manages multiple attention threads within the executor model instance, with each attention thread corresponding to an independent inference process. When the executor model instance does not have the ability to complete the inference request, the attention scheduling kernel pauses the current attention thread of the executor model instance, creates an expert consultation thread, and manages the lifecycle of the expert consultation thread. After the expert model instance sends the guidance scheme to the executor model instance, the attention scheduling kernel creates an execution guidance thread in the executor model instance and resumes the task execution of the inference request based on the execution guidance thread.

[0157] In some embodiments, there are multiple expert model instances, and the executor model instance is deployed on a multi-core coordination layer. When the executor model instance performs the step of sending an upgrade consultation request for the inference request to the expert model instance, it is specifically used for: The multi-core coordination layer determines the expert model instance that needs to be invoked from multiple expert model instances; it sends an upgrade consultation request for the inference request to the expert model instance that needs to be invoked; wherein, the multi-core coordination layer manages the lifecycle of consultation-related threads in the executor model instance through the interface of the attention scheduling kernel.

[0158] In some embodiments, the heterogeneous large language model inference system further includes: The degradation module is used to route the inference request to the expert model instance for inference processing when the executor model instance is unavailable; when the expert model instance is unavailable and the executor model instance does not have the ability to complete the inference request, the module does not send the upgrade consultation request to the expert model instance through the executor model instance, but directly performs the inference request through the executor model instance and marks the confidence level in the inference result.

[0159] In some embodiments, the heterogeneous large language model inference system further includes a model selection module, wherein there are multiple executor model instances: the model selection module is used to select the executor model to be invoked from multiple executor model instances based on a preset model selection strategy; when there are multiple expert model instances, the module selects the expert model instance to be invoked from multiple expert model instances based on a preset model selection strategy. The model selection strategy includes at least one of the following: performance scoring strategy, probability allocation strategy, user priority strategy, and scene-aware selection strategy.

[0160] In summary, the heterogeneous large language model inference system includes an executor model service module 10 and an expert model service module 20. The executor model service module 10 contains at least one low-cost executor model instance, and the expert model service module 20 contains at least one high-cost expert model instance. Firstly, in this embodiment, the step of determining whether an executor model instance has the ability to complete the inference request (whether it needs to consult an expert) occurs after the request is fully understood. This step is made by the executor model instance itself based on the context information of the inference request, resulting in high completeness of the decision information involved in determining the execution capability of the inference request, thereby reducing the classification error rate. Secondly, when an executor model instance does not have the ability to execute the current inference request, it will consult a high-cost expert model instance. The expert model instance requests guidance, enabling the executor model instance to complete inference under the guidance of the expert model instance, resulting in high-quality inference. Thirdly, in this embodiment, the high-cost expert model instance only generates guidance; that is, the task execution right in this embodiment is always held by the executor model instance and is not transferred to the expert model instance. The high-cost model is only used for knowledge-intensive analysis and guidance and does not undertake operation-intensive execution work. The token consumption for a single consultation is far lower than that for the complete execution of the original task, thereby reducing the cost of inference execution. Fourthly, this application evaluates the execution capability of the executor model instance based on system prompts, eliminating the need to train and maintain external routing models or classifiers, thus eliminating the development cost, training data requirements, and continuous maintenance costs of routing models.

[0161] It should be noted that those skilled in the art can clearly understand that the specific implementation process of the above heterogeneous large language model inference system and each unit can be referred to the corresponding description in the foregoing method embodiments. For the sake of convenience and brevity, it will not be repeated here.

[0162] The aforementioned heterogeneous large language model reasoning system can be implemented as a computer program, which can perform tasks such as... Figure 7 It runs on the computer device shown.

[0163] Please see Figure 7 , Figure 7 This is a schematic block diagram of a computer device 700 provided in an embodiment of this application. The computer device 700 can be a terminal or a server. The server can be a standalone server or a server cluster composed of multiple servers.

[0164] The computer device is equipped with a heterogeneous large language model inference system, which includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance.

[0165] See Figure 7 The computer device 700 includes a processor 702, a memory, and a network interface 705 connected via a system bus 701. The memory may include a non-volatile storage medium 703 and internal memory 704.

[0166] The non-volatile storage medium 703 may store an operating system 7031 and a computer program 7032. The computer program 7032 includes program instructions that, when executed, cause the processor 702 to perform a reasoning method based on a heterogeneous large language model.

[0167] The processor 702 provides computing and control capabilities to support the operation of the entire computer device 700.

[0168] The internal memory 704 provides an environment for the execution of the computer program 7032 in the non-volatile storage medium 703. When the computer program 7032 is executed by the processor 702, the processor 702 can execute a reasoning method based on a heterogeneous large language model.

[0169] This network interface 705 is used for network communication with other devices. Those skilled in the art will understand that... Figure 7 The structure shown is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device 700 to which the present application is applied. The specific computer device 700 may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0170] The processor 702 is used to run a computer program 7032 stored in the memory to perform the following steps: Obtain the inference request and assign the inference request to the executor model instance; Based on the context information of the inference request, it is determined whether the executor model instance has the ability to complete the inference request. If the executor model instance does not have the ability to complete the inference request, then the executor model instance sends an upgrade consultation request for the inference request to the expert model instance. The expert model instance generates a guidance plan corresponding to the upgrade consultation request and sends the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. Using the executor model instance, the inference request is processed according to the guidance scheme to obtain the inference result corresponding to the inference request; If the executor model instance has the ability to complete the inference request, then the executor model instance performs inference processing on the inference request based on the context information of the inference request to obtain the inference result corresponding to the inference request.

[0171] It should be understood that in the embodiments of this application, the processor 702 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0172] It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program includes program instructions and can be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the embodiments of the above methods.

[0173] Therefore, this application also provides a storage medium. This storage medium can be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program includes program instructions. When executed by a processor, the program instructions cause the processor to perform the following steps: Obtain the inference request and assign the inference request to the executor model instance; Based on the context information of the inference request, it is determined whether the executor model instance has the ability to complete the inference request. If the executor model instance does not have the ability to complete the inference request, then the executor model instance sends an upgrade consultation request for the inference request to the expert model instance. The expert model instance generates a guidance plan corresponding to the upgrade consultation request and sends the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. Using the executor model instance, the inference request is processed according to the guidance scheme to obtain the inference result corresponding to the inference request; If the executor model instance has the ability to complete the inference request, then the executor model instance performs inference processing on the inference request based on the context information of the inference request to obtain the inference result corresponding to the inference request.

[0174] The storage medium can be any computer-readable storage medium capable of storing program code, such as a USB flash drive, portable hard drive, read-only memory (ROM), magnetic disk, or optical disk.

[0175] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

[0176] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of each unit is merely a logical functional division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

[0177] The steps in the methods of this application embodiment can be adjusted, merged, or deleted according to actual needs. The units in the apparatus of this application embodiment can be merged, divided, or deleted according to actual needs. Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0178] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.

[0179] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A reasoning method based on a heterogeneous large language model, characterized in that, The method is applied to a heterogeneous large language model inference system, which includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance. The method includes: Obtain the inference request and assign the inference request to the executor model instance; Based on the context information of the inference request, it is determined whether the executor model instance has the ability to complete the inference request. If the executor model instance does not have the ability to complete the inference request, then the executor model instance sends an upgrade consultation request for the inference request to the expert model instance. The expert model instance generates a guidance plan corresponding to the upgrade consultation request and sends the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. Using the executor model instance, the inference request is processed according to the guidance scheme to obtain the inference result corresponding to the inference request; If the executor model instance has the ability to complete the inference request, then the executor model instance performs inference processing on the inference request based on the context information of the inference request to obtain the inference result corresponding to the inference request.

2. The method of claim 1, wherein, The step of determining whether the executor model instance has the ability to complete the inference request based on the context information of the inference request, through the executor model instance, includes: The system prompt word corresponding to the executor model instance is determined by the executor model instance, and the capability assessment instruction includes multiple capability assessment dimensions. The executor model instance performs an autonomous capability assessment based on the context information of the inference request and multiple capability assessment dimensions to determine whether the executor model instance has the capability to complete the inference request.

3. The method of claim 1, wherein, The list of available tools for the executor model instance includes expert consultation tools, and the heterogeneous large language model inference system also includes a consultation agent module; if the executor model instance does not have the ability to complete the inference request, then the executor model instance sends an upgrade consultation request for the inference request to the expert model instance, including: If the executor model instance does not have the ability to complete the inference request, then the expert consultation tool is invoked through the executor model instance. When the expert consultation tool is invoked, the consultation agent module constructs the upgrade consultation request based on the context information of the reasoning request; The upgrade consultation request is sent to the expert model instance through the consultation agent module.

4. The method of claim 3, wherein, The step of constructing the upgrade consultation request based on the context information of the reasoning request through the consultation agent module includes: The consultation agent module extracts and caches context summary information from the current session context information. The context summary information includes current task description information, executed operation steps information, key intermediate result information, and relevant code reference information. The context summary information is sent to the executor model instance through the consultation agent module; The problem description information is determined by the executor model instance based on the context summary information, and the problem description information is sent to the consultation agent module; The consultation agent module constructs an upgrade consultation request based on the context summary information and the problem description information.

5. The method of claim 3, wherein, The step of processing the inference request through the executor model instance according to the guidance scheme to obtain the inference result corresponding to the inference request includes: The guidance scheme is received through the consultation agent module, and the guidance scheme is injected into the context information of the executor model instance as a tool invocation result to obtain enhanced context information; The inference request is processed by the executor model instance based on the enhanced context information to obtain the inference result corresponding to the inference request.

6. The method of claim 1, wherein, The method further includes: Record the decision results of each upgrade decision of the executor model instance. The decision results include: whether to trigger the upgrade, the task completion quality after the upgrade, and the task completion quality before the upgrade. The upgrade decision refers to determining whether the executor model instance has the execution capability to complete the inference request. Based on historical statistics of the decision results, identify deviation patterns in upgrade decisions, including quality loss caused by failure to upgrade when necessary and cost waste caused by unnecessary upgrades. Typical cases of the aforementioned deviation pattern are injected as few-sample examples into the system prompts of the executor model instance.

7. The method of claim 1, wherein, The executor model instance runs on top of the attention scheduling kernel, which manages multiple attention threads within the executor model instance. Each attention thread corresponds to an independent inference processing process. When the executor model instance does not have the ability to complete the inference request, the attention scheduling kernel pauses the current attention thread of the executor model instance, creates an expert consultation thread, and manages the lifecycle of the expert consultation thread. After the expert model instance sends the guidance scheme to the executor model instance, the attention scheduling kernel creates an execution guidance thread in the executor model instance and resumes the task execution of the inference request based on the execution guidance thread.

8. The method of claim 7, wherein, There are multiple expert model instances, and the executor model instance is deployed on a multi-core coordination layer. Sending an upgrade consultation request for the inference request to the expert model instance through the executor model instance includes: The multi-core coordination layer determines the expert model instance that needs to be invoked from multiple expert model instances. The executor model instance sends an upgrade consultation request for the inference request to the expert model instance that needs to be invoked. The multi-core coordination layer manages the lifecycle of the relevant threads in the executor model instance through the interface of the attention scheduling kernel.

9. The method of claim 1, wherein, After obtaining the inference request, the method further includes: When the executor model instance is unavailable, the inference request is routed to the expert model instance for inference processing; When the expert model instance is unavailable and the executor model instance does not have the ability to complete the inference request, the upgrade consultation request is not sent to the expert model instance through the executor model instance. Instead, the inference request is directly processed through the executor model instance, and the confidence level is marked in the inference result.

10. The method according to claim 1, characterized in that, The heterogeneous large language model inference system further includes a model selection module. When there are multiple executor model instances, before allocating the inference request to the executor model instance, the method further includes: The model selection module selects the executor model to be invoked from multiple executor model instances based on a preset model selection strategy. When there are multiple expert model instances, before sending the guidance scheme to the executor model instance, the method further includes: The model selection module selects the expert model instance to be invoked from multiple expert model instances based on a preset model selection strategy. The model selection strategy includes at least one of the following: performance scoring strategy, probability allocation strategy, user priority strategy, and scene-aware selection strategy.

11. The method according to claim 1, characterized in that, The role constraints of the expert model instance are implemented through a dual mechanism: At the system prompt level, the system prompt explicitly defines the role of the expert model instance as an advisor: the operations that the expert model instance can perform include analyzing the root causes of problems, providing solutions, explaining technical principles, and assessing the risks of solutions; the operations that the expert model instance cannot perform include directly manipulating code files, directly calling external tools, directly modifying system state, and directly generating final deliverables; At the tool configuration level, the available tool list of the expert model instance does not include operation tools, only read-only tools are configured or no tools are configured, so that the expert model instance is structurally unable to directly execute operation tasks at the architecture level; This allows the high-cost reasoning capabilities of the expert model instance to be used only for knowledge-intensive analysis and guidance, while the operation-intensive execution tasks are completed by the executor model instance.

12. The method according to claim 1, characterized in that, The upgrade consultation request is a structured consultation request, containing the following structured fields: The problem description field describes the specific difficulties or uncertainties encountered by the executor model instance. The context summary field provides key background information for the current task, including completed steps, intermediate results obtained, and current execution status. The preliminary analysis fields include the executor model instance's initial understanding of the problem and the attempted solutions; The structured consultation request is obtained by the consultation agent module parsing, extracting and formatting the output of the executor model instance, rather than being a network request directly constructed by the executor model instance.

13. A heterogeneous large language model reasoning system, characterized in that, The heterogeneous large language model inference system includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance, wherein: The executor model instance is used to obtain a reasoning request; determine whether the executor model instance has the ability to complete the reasoning request based on the context information of the reasoning request; if the executor model instance does not have the ability to complete the reasoning request, then send an upgrade consultation request for the reasoning request to the expert model instance. The expert model instance is used to generate a guidance plan corresponding to the upgrade consultation request and send the guidance plan to the executor model instance. When the expert model instance processes the upgrade consultation request, it does not execute the operation task in the reasoning request. The executor model instance is further configured to perform reasoning processing on the reasoning request according to the guidance scheme to obtain the reasoning result corresponding to the reasoning request; if the executor model instance has the ability to complete the execution of the reasoning request, it performs reasoning processing on the reasoning request based on the context information of the reasoning request to obtain the reasoning result corresponding to the reasoning request.

14. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, The computer device is equipped with a heterogeneous large language model inference system, which includes an executor model service module and an expert model service module. The executor model service module contains at least one low-cost executor model instance, and the expert model service module contains at least one high-cost expert model instance. When the processor executes the computer program, it implements the inference method based on the heterogeneous large language model as described in any one of claims 1-12.

15. A storage medium, characterized in that, The storage medium stores a computer program, which includes program instructions that, when executed by a processor, cause the processor to perform the reasoning method based on a heterogeneous large language model as described in any one of claims 1-12.