Model switching method and related device

Adaptively selecting and switching AI models through a model management system addresses the problem that large model applications struggle to meet complex business needs, and improves task-solving capability and cost efficiency.

WO2026129643A1 · PCT designated stage · Publication Date: 2026-06-25 · HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD
Filing Date
2025-07-18
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

In existing technologies, large model applications struggle to meet business needs in complex business scenarios, and always invoking the same fixed model reduces end-to-end accuracy, increases cost, and lengthens response time.

Method used

The application provides a model switching method in which a model management system adaptively selects and switches AI models. It selects a suitable model by matching features extracted from query requests, and uses integer programming and mixed integer programming to match features to models and to configure model parameters, achieving adaptive model switching and parameter configuration.

Benefits of technology

The method improves the ability to solve tasks ranging from simple to complex, increases end-to-end accuracy, shortens response latency, reduces cost, and meets business needs.

Figure CN2025109238_25062026_PF_FP_ABST

Abstract

A model switching method, comprising: receiving a first query request, and extracting a first feature on the basis of the first query request; selecting, from among a plurality of AI models on the basis of the first feature, a first model matching the first feature, each model among the plurality of AI models having one or more matched features; using the first model among the plurality of AI models to perform inference on the first query request so as to obtain a first model generation result, and providing the first model generation result on a user interface; receiving a second query request, and extracting a second feature on the basis of the second query request; selecting, from among the plurality of AI models on the basis of the second feature, a second model matching the second feature; and using the second model to perform inference on the second query request so as to obtain a second model generation result, and providing the second model generation result on the user interface. In this way, adaptive model switching is achieved, the same model is not fixedly invoked, the advantages of the AI models are fully utilized, the capability of solving tasks from simple tasks to complex tasks is improved, and service requirements are satisfied.

Description

A model switching method and related equipment

[0001] This application claims priority to Chinese Patent Application No. 202411880278.2, filed with the State Intellectual Property Office of China on December 17, 2024, entitled "A Model Switching Method and Related Equipment", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence (AI) technology, and in particular to a model switching method, a model management system, a computing chip, a computing device cluster, a computer-readable storage medium, and a computer program product.

Background Technology

[0003] With the continuous development of artificial intelligence (AI) technology, especially the rapid development of large language models (LLMs), large AI models represented by LLMs (referred to simply as large models) are gradually showing the potential to approach human intelligence. Accordingly, more and more research uses an LLM as the core controller to build large model applications with human decision-making capabilities.

[0004] For production-level large model applications, when a user asks a question, the backend AI agent can call the model multiple times to complete the corresponding business logic. However, in complex business scenarios, large model applications still struggle to meet business requirements.

Summary of the Invention

[0005] This application provides a model switching method for multiple AI models used to provide inference capabilities. Each model has one or more matching features. Based on the features of each model, the method adaptively selects and calls a language model, achieving adaptive model switching without consistently calling the same model. This fully leverages the strengths of each AI model, improving the ability to solve tasks from simple to complex, and meeting business needs. This application also provides a corresponding model management system, computing device cluster, computing chip, computer-readable storage medium, and computer program product.

[0006] In a first aspect, this application provides a model switching method. The model switching method can be executed by a model management system, which supports adaptive model selection and switching. The model management system can be software, either standalone or integrated into other software as a plugin, component, functional module, or mini-program. For example, the model management system can be integrated into an AI agent or a large model application. The model management system can be provided to customers as a software package for self-deployment. Alternatively, it can be provided to users as a cloud service: a user who subscribes to the cloud service can call it through its API, thereby achieving model switching. In some possible implementations, the model management system may include hardware that executes the model switching method; for example, the model management system may include a computing device cluster with model switching capabilities or functions, and the computing device cluster executes the model switching method of this application when running.

[0007] Specifically, the model management system can receive a first query request and extract a first feature based on the first query request. Based on the first feature, the model management system can select, from among multiple AI models, a first model that matches the first feature. Each of the multiple AI models has one or more matched features. The model management system can use the first model to perform inference on the first query request, obtain a first model generation result, and provide the first model generation result on the user interface. Similarly, the model management system can receive a second query request and extract a second feature based on the second query request. The first query request and the second query request are different query requests. Based on the second feature, the model management system can select, from among the multiple AI models, a second model that matches the second feature. The second feature is different from the first feature, and the second model and the first model are different models among the multiple AI models. The model management system can use the second model to perform inference on the second query request, obtain a second model generation result, and provide the second model generation result on the user interface.

[0008] This method can adaptively select the AI model to be called for inference on a query request based on the one or more features matched by each of the multiple AI models. This achieves adaptive model switching and avoids calling the same AI model in a fixed way. In this way, the advantages of each AI model can be fully utilized, and the ability to solve tasks from simple to complex can be improved to meet business needs.
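As a non-limiting illustration of the selection step described above, the following Python sketch assumes a keyword-based feature extractor and a hand-written table of matched features. The names `extract_feature`, `select_model`, `MODEL_FEATURES`, `KEYWORD_TO_FEATURE`, and the model identifiers are hypothetical and do not appear in this application; a real system may extract features in an entirely different way.

```python
# Hypothetical table: each model has one or more matched features.
MODEL_FEATURES = {
    "model_a": {"code_generation", "math"},
    "model_b": {"chitchat", "summarization"},
}

# Hypothetical keyword rules standing in for a real feature extractor.
KEYWORD_TO_FEATURE = {
    "function": "code_generation",
    "integral": "math",
    "summarize": "summarization",
}


def extract_feature(query: str) -> str:
    """Return the first feature whose keyword occurs in the query."""
    lowered = query.lower()
    for keyword, feature in KEYWORD_TO_FEATURE.items():
        if keyword in lowered:
            return feature
    return "chitchat"  # assumed default feature


def select_model(feature: str) -> str:
    """Return the first model whose matched features contain the feature."""
    for model, features in MODEL_FEATURES.items():
        if feature in features:
            return model
    raise LookupError(f"no model matches feature {feature!r}")


# Two different query requests yield different features and different models.
first_model = select_model(extract_feature("Write a function that reverses a list"))
second_model = select_model(extract_feature("Summarize this meeting transcript"))
```

In this sketch the two queries are routed to different models solely because their extracted features differ, which is the switching behavior the method describes.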

[0009] Furthermore, this method can monitor in real time the price, performance, and functionality of the language models that model providers offer through OpenAPI. This avoids the cost no longer being the lowest after a price change, and avoids the application becoming unavailable when an OpenAPI is unavailable. Moreover, this method takes into account the limits that model providers place on the number of tokens or calls within a given period, enabling more reasonable model selection and switching and supporting global optimization across multiple call modes. In addition, this method can dynamically update the optimization objective as the application's user base continues to grow.

[0010] In some possible implementations, the model management system can determine the model parameters of the first model based on the first feature extracted from the first query request, so as to use the first model to perform inference on the first query request. Similarly, the model management system can determine the model parameters of the second model based on the second feature extracted from the second query request, so as to use the second model to perform inference on the second query request.

[0011] This allows for adaptive selection of model parameters based on the model parameters corresponding to the features (e.g., the mapping relationship between features and parameters), solving the problem of unreasonable parameter configuration caused by manual selection of model parameters, and further improving end-to-end accuracy, shortening response latency, and reducing costs.
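The mapping relationship between features and parameters can be pictured as a simple lookup. The sketch below is only an assumption about how such a mapping might be stored; `ModelParams`, `FEATURE_PARAMS`, `DEFAULT_PARAMS`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    temperature: float
    top_p: float


# Hypothetical feature-to-parameter mapping; the values are illustrative only.
FEATURE_PARAMS = {
    "code_generation": ModelParams(temperature=0.0, top_p=1.0),
    "creative_writing": ModelParams(temperature=0.9, top_p=0.95),
}
DEFAULT_PARAMS = ModelParams(temperature=0.7, top_p=1.0)


def params_for(feature: str) -> ModelParams:
    """Look up the parameters mapped to a feature, falling back to defaults."""
    return FEATURE_PARAMS.get(feature, DEFAULT_PARAMS)
```

For example, `params_for("code_generation")` returns the low-temperature configuration, while an unmapped feature falls back to `DEFAULT_PARAMS`.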

[0012] In some possible implementations, the model management system can obtain a prompt template for the first model and fill in the prompt template according to the first feature to obtain a prompt. For example, the model management system can insert the first feature at the corresponding position of the prompt template according to the filling requirements in the prompt template of the first model, thereby obtaining the prompt. The model management system inputs the prompt into the first model for inference and obtains the first model generation result.

[0013] This allows for the generation of prompts that match the first model, improving the quality of the generated prompts and consequently enhancing the quality of the model's generated results, thus meeting business requirements.
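A minimal sketch of prompt filling, assuming the template is stored as a Python `string.Template` with one slot for the feature and one for the query text. The template wording, `PROMPT_TEMPLATES`, and `build_prompt` are hypothetical.

```python
from string import Template

# Hypothetical per-model prompt templates with $feature and $query slots.
PROMPT_TEMPLATES = {
    "model_a": Template("You are an expert in $feature.\nQuestion: $query"),
}


def build_prompt(model: str, feature: str, query: str) -> str:
    """Fill the model's template with the extracted feature and the query."""
    return PROMPT_TEMPLATES[model].substitute(feature=feature, query=query)


prompt = build_prompt("model_a", "math", "What is 2 + 2?")
```

`Template.substitute` raises `KeyError` when a slot is left unfilled, which makes a missing feature visible before the prompt reaches the model.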

[0014] In some possible implementations, the model management system can also extract metadata for each of the multiple AI models, where the multiple AI models include at least one of open-interface models or self-deployed models. Then, the model management system can evaluate the multiple AI models using an evaluation corpus to obtain an evaluation metric value for each model. Next, based on the metadata and evaluation metric values of each model, the model management system can solve a multi-model optimization problem to obtain one or more features matched to each model.

[0015] This method extracts metadata from AI models and evaluates them using an evaluation corpus. Then, it substitutes the metadata and evaluation metrics into a multi-model optimization problem to obtain one or more features matched by each of the multiple AI models, thus helping to select and switch adaptive models.

[0016] In some possible implementations, the model management system can solve the multi-model optimization problem by integer programming based on the metadata of each of the multiple AI models and the evaluation index value of each of the multiple AI models, thereby obtaining one or more features matched by each of the multiple AI models.

[0017] This allows us to use integer programming to solve problems and obtain one or more features that each AI model matches, providing a reference for adaptive model selection and adaptive model switching.
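To make the integer programming step concrete, the sketch below poses a toy assignment problem: each feature is assigned to exactly one model so as to maximize quality minus weighted cost, subject to an assumed per-model capacity. For self-containment it enumerates all assignments instead of calling an integer programming solver, which is feasible only at toy scale. All data (`QUALITY`, `COST`, `CAPACITY`, `COST_WEIGHT`) and the problem formulation itself are hypothetical.

```python
from itertools import product

FEATURES = ["math", "code", "chat"]
MODELS = ["model_a", "model_b"]

# Hypothetical evaluation metric values (higher is better) per (feature, model).
QUALITY = {
    ("math", "model_a"): 0.9, ("math", "model_b"): 0.6,
    ("code", "model_a"): 0.8, ("code", "model_b"): 0.7,
    ("chat", "model_a"): 0.7, ("chat", "model_b"): 0.7,
}
# Hypothetical metadata: relative cost per call and maximum features per model.
COST = {"model_a": 0.5, "model_b": 0.1}
CAPACITY = {"model_a": 2, "model_b": 2}
COST_WEIGHT = 0.2


def solve():
    """Enumerate every one-model-per-feature assignment and keep the best."""
    best_score, best_assignment = float("-inf"), None
    for choice in product(MODELS, repeat=len(FEATURES)):
        # Capacity constraint: model m may take at most CAPACITY[m] features.
        if any(choice.count(m) > CAPACITY[m] for m in MODELS):
            continue
        score = sum(QUALITY[f, m] - COST_WEIGHT * COST[m]
                    for f, m in zip(FEATURES, choice))
        if score > best_score:
            best_score, best_assignment = score, dict(zip(FEATURES, choice))
    return best_assignment


assignment = solve()
```

Inverting `assignment` gives the features matched by each model. In this toy instance the more expensive model keeps the two features where its quality advantage outweighs its cost, and the cheaper model takes the third.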

[0018] In some possible implementations, the model management system can solve the multi-model optimization problem by using mixed integer programming based on the metadata of each of the multiple AI models and the evaluation metric value of each of the multiple AI models, thereby obtaining one or more features and model parameters matched by each of the multiple AI models.

[0019] This method can use mixed integer programming to obtain one or more features and model parameters that match each of the multiple AI models, thereby providing a reference for adaptive model selection and adaptive model parameter configuration.

[0020] In some possible implementations, the decision variables solved by mixed-integer programming include combinations of the following variables:

[0021] Whether model m of provider p is selected for the k-th model call for the j-th question of the i-th user; or

[0022] The generation temperature parameter when model m of provider p is selected; or

[0023] The top_p parameter when model m of provider p is selected; or

[0024] The frequency penalty parameter when model m of provider p is selected; or

[0025] The presence penalty parameter when model m of provider p is selected; or

[0026] The top_logprobs parameter when model m of provider p is selected.

[0027] This method enables adaptive model selection and automatic parameter configuration by setting variables related to model selection, as well as the model's generation temperature parameter, top_p parameter, frequency penalty parameter, presence penalty parameter, and top_logprobs parameter, thus meeting business requirements.
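One way to write the listed decision variables down is shown below. The symbols, the one-model-per-call constraint, and all bounds are assumptions introduced for illustration; this application does not fix a notation, and actual parameter ranges depend on the provider.

```latex
\begin{aligned}
x_{i,j,k,p,m} &\in \{0, 1\}
  && \text{1 if model } m \text{ of provider } p \text{ is selected for call } k
     \text{ of question } j \text{ of user } i \\
\sum_{p}\sum_{m} x_{i,j,k,p,m} &= 1
  && \text{assumed constraint: exactly one model per call} \\
T_{p,m} &\in [T_{\min}, T_{\max}]
  && \text{generation temperature} \\
P_{p,m} &\in (0, 1]
  && \text{top\_p} \\
F_{p,m} &\in [F_{\min}, F_{\max}]
  && \text{frequency penalty} \\
R_{p,m} &\in [R_{\min}, R_{\max}]
  && \text{presence (existence) penalty} \\
L_{p,m} &\in \{0, 1, \dots, L_{\max}\}
  && \text{top\_logprobs}
\end{aligned}
```

The binary selection variable and the integer top_logprobs variable sit alongside continuous sampling parameters, which is why the problem is a mixed integer program and not a pure integer program.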

[0028] In some possible implementations, the optimization objective of the multi-model optimization problem is a joint optimization objective determined based on multiple individual objectives. These individual objectives include at least one of minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio.

[0029] This method fully considers the needs of different dimensions. By determining a joint optimization objective based on multiple single objectives, and using this joint optimization objective as the optimization objective for the multi-model optimization problem, solutions that satisfy the needs of different dimensions can be obtained. This allows for the recommendation of optimal models that are more in line with real-world scenarios. Consequently, the quality of model generation or overall performance can be improved.
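A joint optimization objective can be formed from single objectives in many ways; a weighted sum of normalized metrics is one common choice and is used here purely as an assumed example. `CANDIDATES`, `WEIGHTS`, `MINIMIZE`, and the figures in them are hypothetical.

```python
# Hypothetical per-model measurements; units and values are illustrative.
CANDIDATES = {
    "model_a": {"cost": 10.0, "latency": 2.0, "quality": 0.9},
    "model_b": {"cost": 1.0, "latency": 0.5, "quality": 0.7},
}
# Assumed weights for the single objectives; they sum to 1.
WEIGHTS = {"cost": 0.2, "latency": 0.2, "quality": 0.6}
MINIMIZE = {"cost", "latency"}


def joint_score(name: str) -> float:
    """Weighted sum of min-max normalized objectives (higher is better)."""
    total = 0.0
    for metric, weight in WEIGHTS.items():
        values = [c[metric] for c in CANDIDATES.values()]
        low, high = min(values), max(values)
        if high == low:
            norm = 0.5  # metric does not distinguish the candidates
        else:
            norm = (CANDIDATES[name][metric] - low) / (high - low)
        # Flip minimized metrics so that a larger score is always better.
        total += weight * (1.0 - norm if metric in MINIMIZE else norm)
    return total


best = max(CANDIDATES, key=joint_score)
```

With these weights quality dominates and `model_a` wins; shifting weight toward cost and latency would favor `model_b`, which illustrates how the joint objective trades the single objectives against one another.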

[0030] In some possible implementations, the model management system may also provide a configuration interface through which at least one of the optimization objective, decision variables, and constraints of the multi-model optimization problem can be received.

[0031] This method provides a configuration interface to support users in configuring optimization objectives, decision variables, and constraints for multi-model optimization problems online, thereby assisting in solving for the optimal model, implementing adaptive model selection, and adaptive model switching.

[0032] In some possible implementations, one or more features matched by each of the multiple AI models are used to describe one or more of the following: business, domain, agent, downstream task, and task context. Here, business / scenario / domain / downstream task can be different expressions of features for different applications; task context is a derived feature of the downstream task; and agent is the business logic / service that carries a specific business domain and downstream task.

[0033] This method determines the business, domain, agent, downstream task or task context that each AI model matches among multiple AI models. It can then match features extracted from query requests with AI models, providing a reference for adaptive model selection or switching.

[0034] In some possible implementations, the model management system can also receive trigger conditions for model switching. These trigger conditions are used to switch to a new model in response to the second query request. The new model can be an AI model different from the first model among multiple AI models, such as a second model.

[0035] This method sets a trigger condition for model switching. When the trigger condition for model switching is received, the step of responding to the second query request in the model switching method is executed. This can achieve compatibility with existing solutions and has high availability.

[0036] In some possible implementations, the triggering conditions for model switching include:

[0037] The multiple AI models have undergone version updates or new AI models have been added; or

[0038] The performance, cost, or functionality of the multiple AI models has changed; or

[0039] At least one of the multiple AI models has a feature that it matches; or

[0040] A model switching instruction has been received.

[0041] The model switching instruction can be a user-triggered instruction that instructs the model management system to switch models. By setting the above trigger conditions, this method achieves adaptive model switching that is seamless for the user, thus improving the user experience.
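The trigger conditions can be checked by comparing two snapshots of the model catalog. The sketch below is an assumption about one possible check: `ModelSnapshot` and `switch_triggered` are hypothetical names, price stands in for "performance, cost, or functionality", and the sketch treats a change in a model's matched features as a trigger, which is an interpretation of the third listed condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSnapshot:
    version: str
    price_per_million_tokens: float
    features: frozenset


def switch_triggered(old: dict, new: dict, instruction_received: bool = False) -> bool:
    """Return True if any of the trigger conditions holds."""
    if instruction_received:   # explicit model switching instruction
        return True
    if set(old) != set(new):   # a model was added (or removed)
        return True
    # Version, price, or matched features changed for some model.
    return any(old[name] != new[name] for name in old)
```

Because `ModelSnapshot` is a frozen dataclass, the equality comparison covers every recorded field, so adding a new monitored attribute automatically extends the check.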

[0042] In some possible implementations, when the quality of the second model's generated result does not meet certain conditions, such as a quality score below a set threshold, the model management system can select a third model from multiple AI models based on the second features extracted from the second query request. The third model differs from the second model. The model management system can use the third model to infer the second query request, obtain the third model's generated result, and provide this result to the user interface.

[0043] In this way, when the answer to a query request does not meet the conditions, different language models are used to re-infer the same query request, thereby improving the quality of the answer.
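The re-inference behavior can be sketched as a loop over candidate models that stops at the first result meeting the quality threshold. The function `answer_with_fallback` and its callable parameters `infer` and `score` are hypothetical placeholders for the inference call and the quality scorer, neither of which is specified here.

```python
from typing import Callable, Iterable, Tuple


def answer_with_fallback(
    query: str,
    candidates: Iterable[str],
    infer: Callable[[str, str], str],
    score: Callable[[str], float],
    threshold: float,
) -> Tuple[str, str]:
    """Try candidate models in order until a result meets the threshold."""
    last = None
    for model in candidates:
        result = infer(model, query)
        last = (model, result)
        if score(result) >= threshold:
            return last
    if last is None:
        raise ValueError("no candidate models were provided")
    return last  # best effort: the last attempt, even if below the threshold
```

Passing the second model followed by a third model as `candidates` reproduces the case described above: the third model is consulted only when the second model's result scores below the threshold.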

[0044] In a second aspect, this application provides a model management system. The system includes:

[0045] The interaction module is used to receive the first query request;

[0046] The model selection module is used to extract a first feature based on the first query request, and select a first model that matches the first feature from multiple artificial intelligence (AI) models based on the first feature extracted from the first query request, wherein each of the multiple AI models has one or more matching features;

[0047] The inference module is used to perform inference on the first query request using the first model among the multiple AI models to obtain the first model generation result;

[0048] The interaction module is also used to provide the first model generation result on the user interface;

[0049] The interaction module is also used to receive a second query request;

[0050] The model selection module is further configured to extract a second feature based on the second query request, wherein the first query request and the second query request are different query requests, and select a second model that matches the second feature from multiple AI models based on the second feature extracted from the second query request, wherein the second feature is different from the first feature, and the second model and the first model are different models among the multiple AI models;

[0051] The inference module is also used to perform inference on the second query request using the second model to obtain the second model generation result;

[0052] The interaction module is also used to provide the second model generation result on the user interface.

[0053] In some possible implementations, the model selection module is further used for:

[0054] Based on the first feature extracted from the first query request, the model parameters of the first model are determined so that the first model can be used to infer the first query request; and / or,

[0055] Based on the second feature extracted from the second query request, the model parameters of the second model are determined so that the second model can be used to infer the second query request.

[0056] In some possible implementations, the inference module is specifically used for:

[0057] Obtain the prompt template of the first model;

[0058] Fill in the prompt template of the first model according to the first feature to obtain a prompt;

[0059] Input the prompt into the first model for inference to obtain the first model generation result.

[0060] In some possible implementations, the system further includes:

[0061] The model calculation module is used to extract metadata of each of the multiple AI models, where the multiple AI models include at least one of open-interface models or self-deployed models. The module evaluates the multiple AI models using an evaluation corpus to obtain the evaluation metric value of each of the multiple AI models. Based on the metadata and evaluation metric value of each of the multiple AI models, the module solves the multi-model optimization problem to obtain one or more features matched by each of the multiple AI models.

[0062] In some possible implementations, the model calculation module is specifically used for:

[0063] Based on the metadata and evaluation metrics of each of the multiple AI models, the multi-model optimization problem is solved using integer programming to obtain one or more features matched by each of the multiple AI models.

[0064] In some possible implementations, the model calculation module is specifically used for:

[0065] Based on the metadata and evaluation metrics of each of the multiple AI models, the multi-model optimization problem is solved by mixed integer programming to obtain one or more features and model parameters that match each of the multiple AI models.

[0066] In some possible implementations, the decision variables solved by mixed-integer programming include combinations of the following variables:

[0067] Whether model m of provider p is selected for the k-th model call for the j-th question of the i-th user; or

[0068] The generation temperature parameter when model m of provider p is selected; or

[0069] The top_p parameter when model m of provider p is selected; or

[0070] The frequency penalty parameter when model m of provider p is selected; or

[0071] The presence penalty parameter when model m of provider p is selected; or

[0072] The top_logprobs parameter when model m of provider p is selected.

[0073] In some possible implementations, the optimization objective of the multi-model optimization problem is a joint optimization objective determined based on multiple single objectives, wherein the single objective includes at least one of minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio.

[0074] In some possible implementations, the interaction module is also used for:

[0075] Provide a configuration interface;

[0076] Receive, through the configuration interface, at least one of the optimization objective, decision variables, and constraints of the multi-model optimization problem.

[0077] In some possible implementations, one or more features matched by each of the plurality of AI models are used to describe one or more of the following: business, domain, agent, downstream task, task context.

[0078] In some possible implementations, the interaction module is also used for:

[0079] Receive a trigger condition for model switching, where the trigger condition is used to switch to a new model in response to the second query request.

[0080] In some possible implementations, the triggering conditions for the model switching include:

[0081] The multiple AI models have undergone version updates or new AI models have been added; or

[0082] The performance, cost, or functionality of the multiple AI models has changed; or

[0083] At least one of the multiple AI models has a feature that it matches; or

[0084] A model switching instruction has been received.

[0085] In some possible implementations, the model selection module is further used for:

[0086] When the quality of the result generated by the second model does not meet the conditions, based on the second feature extracted from the second query request, a third model is selected from multiple AI models, the third model being different from the second model;

[0087] The inference module is also used to perform inference on the second query request using the third model to obtain the third model generation result;

[0088] The interaction module is also used to provide the third model generation result on the user interface.

[0089] In a third aspect, this application provides a computing device cluster. The computing device cluster includes at least one computing device, and the at least one computing device includes at least one processor and at least one memory. The at least one processor and the at least one memory communicate with each other. The at least one processor is used to execute instructions stored in the at least one memory to cause the computing device or the computing device cluster to perform the model switching method described in the first aspect or any implementation thereof.

[0090] In a fourth aspect, this application provides a computer-readable storage medium storing instructions that instruct a computing device or a computing device cluster to perform the model switching method described in the first aspect or any implementation thereof.

[0091] In a fifth aspect, this application provides a computer program product containing instructions that, when run on a computing device or a computing device cluster, cause the computing device or the computing device cluster to execute the model switching method described in the first aspect or any implementation thereof.

[0092] In a sixth aspect, this application provides a computing chip, such as a graphics processing unit (GPU), a tensor processing unit (TPU), a deep learning processing unit (DPU), or another AI-related processor, to run AI models and / or to execute the model switching method of this application and / or to implement a model management system.

[0093] Based on the implementation methods provided in the above aspects, this application can be further combined to provide more implementation methods.

Attached Figure Description

[0094] To more clearly illustrate the technical methods of this application, the accompanying drawings used will be briefly described below.

[0095] Figure 1 is a schematic diagram of the architecture of a model management system provided in this application;

[0096] Figure 2 is a flowchart illustrating the mapping relationship between features and models provided in this application;

[0097] Figure 3 is a schematic diagram of solving a multi-model optimization problem provided in this application;

[0098] Figure 4 is a flowchart of a model switching method provided in this application;

[0099] Figure 5 is a flowchart of an agent collaboration method in a multi-node dynamic SOP mode provided in this application;

[0100] Figure 6 is a schematic diagram of the structure of a computing device provided in this application;

[0101] Figure 7 is a schematic diagram of the structure of a computing device cluster provided in this application;

[0102] Figure 8 is a schematic diagram of another computing device cluster provided in this application;

[0103] Figure 9 is a schematic diagram of another computing device cluster provided in this application.

Detailed Implementation

[0104] The terms "first" and "second" used in the embodiments of this application are for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature.

[0105] First, some technical terms involved in the embodiments of this application will be introduced.

[0106] Artificial intelligence (AI) is the ability to correctly interpret external data, learn knowledge from that data, and use that knowledge to achieve specific goals and tasks. A significant branch of AI is natural language processing (NLP). NLP uses algorithms such as machine learning (ML) and deep learning (DL) to build language models (LM), enabling computers to interpret, process, and understand human language.

[0107] A language model is a probability distribution model of words in natural language. Language models can be used to predict the next most likely word based on the input context (e.g., several preceding context words). Language models can also be categorized by parameter size into small language models and large language models (LLMs). Small language models are small-scale language models, often simply called small models, while large language models are large-scale language models, often simply called large models. Large models can include, but are not limited to, generative pre-trained transformer (GPT) models with a large parameter size, while small models can include, but are not limited to, bidirectional encoder representations from transformers (BERT) models.

[0108] Language models, represented by LLMs, can serve as core controllers to construct intelligent agents with human decision-making capabilities, also known as AI agents. An AI agent is a computer program based on a language model that possesses planning and thinking abilities, memory capabilities, and the ability to use tool functions, enabling it to autonomously complete given tasks.

[0109] AI agents can be used to build applications with human decision-making capabilities, such as production-level large model applications. Currently, in a large model application, the model information in the business workflow is usually specified manually in advance. The model information includes, but is not limited to, the provider, the model identity (ID) of the large / small model, model parameters, prompts, and context.

[0110] When a user raises a question, the backend AI agent can call the LLM multiple times for inference to complete the corresponding business logic. However, the model that the AI agent calls is usually a fixed, manually specified model, such as the AI model identified by a model ID, even though different AI models vary in their applicable downstream tasks, model generation quality, cost / price, model generation speed / throughput, model generation latency, context length, knowledge update time, etc. Model generation quality can include, but is not limited to, human evaluation scores and automated evaluation scores from benchmark sets. Benchmark sets can include, but are not limited to, the Massive Multitask Language Understanding (MMLU) dataset, the MATH dataset for evaluating mathematical problem-solving ability, and the HumanEval dataset for evaluating AI models trained on code. The model's input and output granularity can be tokens, where a token is the basic unit in which a language model processes text. Correspondingly, the price can be the price per million tokens, the model generation speed or throughput can be the number of incremental tokens output per second, and the model generation latency can include the first-token generation latency, the incremental token generation latency, or the total generation latency. For the same AI model, the model generation speed and price can vary between different application programming interface (API) providers, or even for the same API provider at different times.
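The token-based price and latency measures can be made concrete with a small calculation. The sketch below assumes that input and output tokens are priced separately and that total latency is first-token latency plus streaming time for the remaining tokens; both are simplifying assumptions, and the function names and example figures are hypothetical.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost of one call, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def total_latency(first_token_s: float, output_tokens: int,
                  tokens_per_second: float) -> float:
    """First-token latency plus time to stream the remaining tokens."""
    remaining = max(output_tokens - 1, 0)
    return first_token_s + remaining / tokens_per_second
```

For instance, a call with 2,000 input tokens at 1.0 per million and 500 output tokens at 4.0 per million costs 0.004 in the same currency unit, and 101 output tokens at 50 tokens per second after a 0.5 s first-token delay take 2.5 s in total.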

[0111] When the AI agent calls the same model repeatedly, it means that the AI agent does not adaptively select the appropriate AI model (such as the optimal model), which leads to a decrease in end-to-end accuracy, an increase in cost, and a longer response time, making it difficult to meet business needs.

[0112] In view of this, this application provides a model switching method. The model switching method can be executed by a model management system. The model management system supports adaptive model selection and switching. The model management system can be software, which can be standalone software or integrated into other software as a plugin, component, functional module, mini-program, etc. For example, the model management system can be integrated into an AI Agent or a large model application. The model management system can be provided to customers as a software package for self-deployment. Alternatively, the model management system can be provided to users as a cloud service; users subscribe to the cloud service and can use the cloud service's API to call the cloud service, thereby achieving model switching. In some possible implementations, the model management system may include hardware that executes the model switching method. For example, the model management system may include a cluster of computing devices with model switching capabilities or functions; when the computing device cluster is running, it executes the model switching method of this application.

[0113] Specifically, the model management system (e.g., an AI Agent integrating model selection and switching capabilities) provides multiple AI models for inference. Each of these AI models has one or more matching features. For a first query request, the model management system can select a first model matching the first feature extracted from the first query request (e.g., query 1). The model management system uses the first model to infer the first query request and obtains the first model-generated result. The model management system can provide the first model-generated result on the user interface. For a second query request (e.g., query 2), such as the (i+1)th question in a multi-turn question-and-answer session where i is greater than or equal to 1, the model management system can select a second model matching the second feature extracted from the second query request. The second feature is different from the first feature, and the second model and the first model are different models among the multiple AI models. The model management system can use the second model to infer the second query request, obtain the second model-generated result, and provide the second model-generated result on the user interface.

[0114] This method adaptively selects the AI model invoked for inference on a query request based on one or more features matched by each of multiple AI models. This adaptive model switching avoids consistently using the same AI model, thus fully leveraging the strengths of each model and compensating for their weaknesses. This enhances the ability to solve tasks ranging from simple to complex, meeting business needs. Furthermore, the method also supports adaptive selection of model parameters based on one or more features matched by the model parameters of each of the multiple AI models, further improving end-to-end accuracy, reducing response latency, and lowering costs.

[0115] To make the technical solution of this application clearer and easier to understand, the model management system of this application will be described below with reference to the accompanying drawings.

[0116] Referring to Figure 1, which shows a schematic diagram of the architecture of a model management system 10, the model management system 10 includes a model information acquisition module 102, a model calculation module 104, a model selection module 106, and a model inference module 108. Among them, the model information acquisition module 102 and the model calculation module 104 are optional modules, and will be described in detail below from the perspective of functional modularization.

[0117] The model information acquisition module 102 is used to collect model information from multiple AI models. AI models can include self-deployed models or models with open interfaces. Models with open interfaces can be AI models provided by model providers (such as LLM providers) or API providers through their open APIs. These AI models can also be categorized into L0 models and L1 models based on their training stage. L0 models are typically basic pre-trained models, including self-deployed L0 models or L0 models provided by model providers through OpenAPI. L1 models are usually AI models obtained by training L0 models using industry data. L1 models can be obtained through Supervised Fine-Tuning (SFT) training. Therefore, L1 models can be SFT models. L1 models can also be obtained through Reinforcement Learning from Human Feedback (RLHF) algorithms. The RLHF algorithm can include various Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Odds Ratio Preference Optimization (ORPO) algorithms. These DPO, PPO, and ORPO algorithms can be collectively referred to as XPO algorithms. Optionally, L1 models can also be categorized based on the provider: self-deployed L1 models or L1 models provided by model providers through OpenAPI.

[0118] Model information may include at least one of the following: model identifier, provider identifier, invocation mode, model constraints, model price / cost, model performance, and model functionality (which may also be referred to as model availability). The invocation mode can be self-deployment or third-party interface invocation. Model constraints may include context constraints (context length threshold) or the maximum number of interface invocations within a time period. In some possible implementations, model information may also include a dataset. The dataset may include data extracted from the training set corpus and evaluation set corpus, used to solve the multi-model optimization problem. In other words, the dataset may include a dataset independent of the training set and evaluation set, specifically used to solve the multi-model optimization problem. Alternatively, the dataset may include both a training set and an evaluation set. Data in the training set and evaluation set can also be used to solve the multi-model optimization problem. Furthermore, the training set can be used to train the AI ​​model, for example, by performing SFT on an L0 model using the training set corpus, and the evaluation set is used to evaluate the AI ​​model. The evaluation set may include a performance evaluation set and a functional evaluation set. Evaluation metrics may include model generation speed or throughput, and model generation latency, which characterize model performance. The model generation latency can include the latency of generating the first token, the latency of generating incremental tokens, or the total generation latency. Evaluation metrics can also include model generation quality, accuracy, or recall, which characterize the model's functionality.

[0119] It should be noted that the model information acquisition module 102 can collect model information of multiple AI models in real time, such as real-time monitoring of AI model price, model performance, and model function, so that when model information changes, the application can smoothly switch AI models without the user's awareness.

[0120] The model calculation module 104 is used to calculate one or more features matched by each of the multiple AI models. These features describe one or more of the following: business, domain, agent, downstream task, and task context. Here, business, scenario, domain, and downstream task can be different names that different applications use for the same feature. Task context is a feature derived from the downstream task, and an agent is the business logic / service that carries a certain business domain and downstream task. Further, the one or more features matched by each model can be represented by a feature-model mapping relationship, which the model calculation module 104 can construct. The feature-model mapping relationship can be a mapping between feature vectors and model identifiers. The AI model identified by the model identifier to which a feature vector, or a set of feature vectors, is mapped can be the optimal model that satisfies the model constraints.
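As a minimal sketch of how a mapping between feature vectors and model identifiers might be queried, the following example picks the model whose stored feature vector is most similar to the feature vector of a query. The three-dimensional vectors, the model identifiers, and the use of cosine similarity are assumptions made purely for illustration; this application does not prescribe a particular matching rule.

```python
import math

# Hypothetical feature-model mapping: each entry pairs a feature vector
# with the identifier of the model assumed to be optimal for that feature.
FEATURE_MODEL_MAP = [
    ((1.0, 0.0, 0.0), "model-1-1"),  # e.g. a "code generation" feature
    ((0.0, 1.0, 0.0), "model-2-1"),  # e.g. a "math reasoning" feature
    ((0.0, 0.0, 1.0), "model-3-1"),  # e.g. a "casual chat" feature
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_model(query_feature, mapping=FEATURE_MODEL_MAP):
    """Return the model ID whose stored feature vector is most similar."""
    _, best_model = max(
        mapping, key=lambda entry: cosine_similarity(query_feature, entry[0])
    )
    return best_model
```

In this sketch, a query whose feature vector lies closest to the first stored vector would be routed to the hypothetical "model-1-1".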

[0121] In specific implementation, the model calculation module 104 can extract metadata from multiple AI models, evaluate the multiple AI models using the evaluation corpus, obtain the evaluation index value of each model, and then solve the multi-model optimization problem based on the metadata and evaluation index value of each model to obtain one or more features matched by each model, thereby establishing a mapping relationship between features and models. The model management system 10 supports configuring the optimization objective, decision variables, and constraints of the multi-model optimization problem. The model calculation module 104 can solve the multi-model optimization problem based on the optimization objective, decision variables, and constraints to obtain one or more features matched by each model. For example, the model management system 10 also includes an interaction module (not shown in Figure 1), which provides a configuration interface. Developers can configure the optimization objective, decision variables, and constraints of the multi-model optimization problem through the configuration interface. The interaction module also receives the optimization objective, decision variables, and constraints of the multi-model optimization problem through the configuration interface. Furthermore, the model management system 10 also supports online configuration of the calling mode of each model among the multiple AI models.

[0122] The solution to a multi-model optimization problem (such as the optimal solution or a local optimal solution) can be the optimal model corresponding to a business, scenario, domain, or downstream task, task context, or agent. The model calculation module 104 can construct a mapping relationship between features and models based on at least one feature of the business, scenario, domain, or downstream task, task context, or agent and the model identifier of the optimal model.

[0123] Furthermore, the solution to the multi-model optimization problem may also include model parameters, such as the optimal model parameters corresponding to the business, scenario, domain, or downstream task, task context, or agent. These optimal model parameters may be the optimal parameters of the optimal model. For example, the model calculation module 104 can solve the multi-model optimization problem using mixed integer programming based on the metadata of each model in the multiple AI models and the evaluation metric values ​​of each model in the multiple AI models, to obtain one or more features and model parameters matched by each model in the multiple AI models. Correspondingly, the model calculation module 104 can also construct a feature-parameter mapping relationship based on at least one feature in the business, scenario, domain, or downstream task, task context, or agent and the optimal model parameters. The feature-parameter mapping relationship may include the mapping relationship between feature vectors and optimal model parameters. In some possible implementations, the solution to the multi-model optimization problem may also include SFT suggestions from self-deployed models or open interface AI models when existing models cannot meet the optimization objective. The SFT suggestions include suggested SFT parameters. Among them, the SFT parameters may include SFT hyperparameters, including the number of epochs, learning rate, and batch size.

[0124] In this application, the computation mode of the model computation module 104 may include offline computation, real-time computation, or near real-time computation. Offline computation typically stores data in a reliable storage system for batch processing. Real-time computation is typically used to process real-time data streams. The latency of real-time computation is typically in the order of seconds or even milliseconds. Near real-time computation is a computation mode between offline and real-time computation; the latency of near real-time computation is typically slightly greater than that of real-time computation, for example, in the order of seconds or minutes. The latency of offline computation is typically in the order of hours or days.

[0125] The model calculation module 104 can employ one of the calculation modes, or a combination of multiple calculation modes. For example, the model calculation module 104 can perform offline calculations for data with small changes, and perform real-time calculations for data with frequent changes.

[0126] The model selection module 106 is used to adaptively select the AI model used for inference, and can select different AI models for different query requests. Specifically, the interaction module receives a first query request; the model selection module 106 extracts a first feature from the first query request and, based on that feature, selects from the multiple AI models a first model that matches the first feature; and the model inference module 108 uses the first model to perform inference on the first query request, obtaining the first model generation result. The interaction module also provides the first model generation result on the user interface.

[0127] Similarly, the interaction module is also used to receive a second query request. The second query request can be a new query request initiated by the user. For example, the second query request can be a new query request entered by the user based on feedback or response to the first query request. Or, for example, the second query request can be a query request that the user re-initiates in a new round of queries. The model selection module 106 is also used to extract a second feature based on the second query request, and based on the second feature extracted from the second query request, to select a second model from multiple AI models that matches the second feature. The inference module 108 is used to infer the second query request using the second model to obtain the second model generation result. The interaction module is also used to provide the second model generation result on the user interface.
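The flow described for the first and second query requests can be sketched as follows. This is only an illustration under stated assumptions: the keyword-based feature extractor, the feature names, the model identifiers, and the placeholder `infer` function are all hypothetical and stand in for the real feature extraction and inference of the model selection module 106 and the model inference module 108.

```python
from dataclasses import dataclass

# Hypothetical keyword rules standing in for a real feature extractor.
KEYWORD_FEATURES = {
    "code": ("def ", "function", "bug", "compile"),
    "math": ("integral", "equation", "prove", "solve"),
}

# Hypothetical feature-to-model mapping, e.g. computed offline.
FEATURE_TO_MODEL = {
    "code": "model-1-1",
    "math": "model-2-1",
    "general": "model-3-1",
}


@dataclass
class GenerationResult:
    model_id: str
    text: str


def extract_feature(query: str) -> str:
    """Return the first feature whose keywords occur in the query."""
    lowered = query.lower()
    for feature, keywords in KEYWORD_FEATURES.items():
        if any(keyword in lowered for keyword in keywords):
            return feature
    return "general"


def select_model(feature: str) -> str:
    """Look up the model matching the feature, with a general fallback."""
    return FEATURE_TO_MODEL.get(feature, FEATURE_TO_MODEL["general"])


def infer(model_id: str, query: str) -> GenerationResult:
    # Placeholder: a real system would call the selected model here.
    return GenerationResult(model_id, f"[{model_id}] answer to: {query}")


def handle_query(query: str) -> GenerationResult:
    """Extract a feature, select the matching model, and run inference."""
    return infer(select_model(extract_feature(query)), query)


first = handle_query("Why does this function fail to compile?")
second = handle_query("Solve the equation x + 2 = 5.")
```

Because the two queries yield different features in this sketch, they are served by different models, which is the switching behavior described above.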

[0128] The model selection module 106 can also reselect the AI model when the model information acquisition module 102 detects a change in the model information, so that model switching is smooth and imperceptible to the user.

[0129] Based on the model management system 10 in Figure 1, this application also provides a model switching method. The method relies on the one or more features matched by each of the multiple AI models, which can be calculated by the model calculation module 104. When the model calculation module 104 has real-time or near-real-time calculation capabilities, it can perform this calculation at the time the AI model used for inference is selected. Alternatively, the model calculation module 104 can calculate the features matched by each AI model offline and store them in the form of a mapping relationship, so as to achieve adaptive model selection and switching. For ease of description, this application takes as an example the case where the features matched by each AI model are calculated offline to construct a mapping relationship between features and models.

[0130] Referring to the flowchart of the mapping relationship between features and the model shown in Figure 2, this process can be implemented by the model calculation module 104, and specifically includes the following steps:

[0131] S202, Model calculation module 104 obtains the models with open interfaces provided by the model provider.

[0132] An open interface can be an API provided by a model provider for implementing inference tasks using an AI model, such as an OpenAPI. An open interface provides a request body / request parameters and a return body / return parameters. Users can configure the request and return parameters, call the open interface to drive the AI model provided by the model provider to perform inference, and obtain the values of the return parameters. The AI model can be a language model, such as GPT. In some examples, the AI model can also be a machine learning model such as a Support Vector Machine (SVM), an eXtreme Gradient Boosting (XGBoost) model, or a Random Forest model, or a BERT classification model, embedding model, or rerank model. Furthermore, the AI model can also include a multimodal model, such as a multimodal model for processing images, videos, and audio. For ease of description, this application takes as an example the case where the multiple selectable and switchable AI models are multiple language models.

[0133] As shown in Figure 2, the model calculation module 104 can obtain the model provider's identifier and the models with open interfaces provided by the model provider. The model provider's identifier may include the model provider's name. In some examples, the model provider's identifier may also include other characters or strings that can uniquely identify the model provider, such as the model provider's number. A model provider can provide one or more models, or one or more versions of a model. For example, model provider 1 can provide model 1-1 and model 1-2, model provider 2 can provide model 2-1, model provider 3 can provide model 3-1, and model provider 4 can provide model 4-1.

[0134] S204, Model Calculation Module 104 obtains the self-deployed model.

[0135] A self-deployed model is a model that is deployed by the user themselves. Self-deployed models can be deployed for privacy protection purposes, or they can be models that the user develops and deploys themselves. Similar to models provided by model providers through open interfaces, self-deployed models can be language models, machine learning models such as SVM, XGBoost, and Random Forest, or BERT classification models, embedding models, rerank models, or multimodal models used for image, video, and audio processing.

[0136] For ease of description, this application takes as an example the case where both the open-interface models provided by the model provider and the self-deployed models obtained by the model calculation module 104 are language models such as LLMs.

[0137] S206, Model Calculation Module 104 acquires training set corpus and evaluation set corpus.

[0138] The training set is used for model training. The evaluation set is typically used to evaluate model performance or functionality. Evaluation metrics characterizing model performance can include at least one of model generation latency or model generation speed. Model generation latency can be further divided into first token generation latency, incremental token generation latency, and total generation latency. First token generation latency and incremental token generation latency can be applied to any evaluation scenario, while total generation latency is used for comparisons when the number of output tokens is the same. Model generation speed, also known as throughput, can be the number of tokens generated per unit of time. Evaluation metrics characterizing model functionality can include, but are not limited to, model generation quality, accuracy (or precision), and recall. In addition to objective metrics such as accuracy and recall, evaluation metrics can also include subjective metrics, including but not limited to like rate and dislike rate.

[0139] The training corpus can include data from different businesses, scenarios, domains, downstream tasks, task contexts, or intelligent agents. Similarly, the evaluation corpus can include data from different businesses, scenarios, domains, downstream tasks, task contexts, or intelligent agents. The corpus can include question-answer pairs (questions and answers) used for training or evaluation. The questions can be user-generated questions, and the answers can be genuine responses or responses generated by AI model inference.

[0140] The evaluation corpus may include application-built evaluation corpora or standard evaluation corpora. Application-built evaluation corpora can be constructed during the operation of an application (such as a large model application) based on user-submitted questions and the model's responses to those questions. Standard evaluation corpora may include at least one of the MMLU dataset, MATH dataset, or HumanEval dataset.

[0141] The steps S202, S204, and S206 described above can be executed in parallel or sequentially in a set order; this application does not impose any restrictions on this. S202 and S204 are optional steps. For example, the model calculation module 104 may not execute S202; correspondingly, the multiple models available for selection or switching may include only multiple self-deployed models. As another example, the model calculation module 104 may not execute S204; the multiple models available for selection or switching may then include only models with open interfaces provided by the model provider. It should also be noted that S208 can be executed after S202 and S204, and S210 can be executed after S206.

[0142] S208, Model Calculation Module 104 extracts metadata for each model from multiple language models.

[0143] For each language model among multiple language models, the model computation module 104 can extract the metadata of that language model. This metadata may include the context length threshold supported by the language model, the input token price, the output token price, the maximum number of API calls within a given time period, the maximum number of tokens allowed within a given time period (max_tokens), or model parameters. Model parameters may include at least one of the following: temperature sampling parameters, top_p sampling parameters, top_k sampling parameters, frequency penalty parameters, and presence penalty parameters. The top_p sampling parameter is a nucleus sampling parameter, indicating that the model randomly selects tokens from the smallest set whose cumulative probability is greater than or equal to "p". The top_k sampling parameter indicates sampling from the k tokens with the highest scores or probabilities, so that tokens other than the single most probable one also have a chance of being selected. It should be noted that the model parameters of different models can differ, specifically in the number and type of model parameters. In some examples, different models may also use different names for model parameters that have the same meaning.
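To make the two sampling parameters concrete, the following sketch filters a next-token probability distribution by top_k and by top_p and then renormalizes it. The token names and probabilities are made up for illustration, and real inference engines typically apply these filters to logits over a full vocabulary rather than to a small dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability is greater than or equal to p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Illustrative next-token distribution (hypothetical values).
next_token_probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}

top2 = top_k_filter(next_token_probs, 2)
nucleus = top_p_filter(next_token_probs, 0.9)
```

A sampler would then draw the next token at random from the filtered distribution, so a smaller k or p narrows the set of tokens that can be selected.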

[0144] S210, Model calculation module 104 uses the evaluation set corpus to evaluate multiple language models and obtains evaluation index values for the multiple language models.

[0145] Evaluation metrics can include indicators from different dimensions. In some examples, evaluation metrics may include model generation speed or throughput, and model generation latency, which indicate model performance. Model generation latency includes the latency of generating the first token, the latency of generating incremental tokens, or the total generation latency. In other examples, evaluation metrics may include model generation quality, accuracy, or recall, which characterize model functionality. It should be noted that accuracy and recall are usually objective metrics; in practical applications, evaluation metrics may also include subjective metrics such as likes and dislikes.

[0146] The following takes the evaluation of one of the multiple language models as an example. Specifically, the model computation module 104 can input questions from the evaluation set corpus into the language model for inference, obtaining at least one of the following: the first token generation latency, the incremental token generation latency, or the total generation latency. Furthermore, the model computation module 104 can also obtain the number of generated tokens to determine throughput or model generation speed. It should be noted that the evaluation set may include a performance evaluation set. The model computation module 104 can evaluate the language model based on the performance evaluation set corpus, obtaining evaluation metric values indicating model performance. These performance metric values include model generation latency (such as first token generation latency, incremental token generation latency, or total generation latency), throughput, or model generation speed.
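The performance metrics named above can be derived from timestamps recorded during one inference call, as in the following sketch. The function name, the timestamp inputs, and the choice to compute throughput as generated tokens divided by total latency are assumptions for illustration; other definitions (for example, counting only incremental tokens) are equally possible.

```python
def latency_metrics(request_time, token_times):
    """Derive latency and throughput figures for one model call.

    request_time: time the request was sent, in seconds.
    token_times: arrival time of each generated token, in seconds, in order.
    """
    if not token_times:
        raise ValueError("at least one generated token is required")
    count = len(token_times)
    first_token_latency = token_times[0] - request_time
    total_latency = token_times[-1] - request_time
    # Average gap between consecutive tokens after the first one.
    if count > 1:
        incremental_latency = (token_times[-1] - token_times[0]) / (count - 1)
    else:
        incremental_latency = 0.0
    # Assumed definition: all generated tokens divided by total latency.
    throughput = count / total_latency if total_latency > 0 else float("inf")
    return {
        "first_token_latency": first_token_latency,
        "incremental_token_latency": incremental_latency,
        "total_latency": total_latency,
        "throughput_tokens_per_second": throughput,
    }


# Hypothetical measurement: request at t=0.0 s, five tokens arriving later.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
```

Averaging these per-call values over the questions of a performance evaluation set would give the evaluation metric values for one language model.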

[0147] Furthermore, the model calculation module 104 can obtain the model generation results of the language model and evaluate the model function of the language model based on the model generation results and the answers in the evaluation set corpus. For example, the model calculation module 104 can determine the accuracy of the language model based on the model generation results and the answers in the evaluation set corpus. It should be noted that the evaluation set may include a functional evaluation set. The model calculation module 104 can evaluate the language model based on the functional evaluation set corpus to obtain evaluation index values indicating the model function. The evaluation index values indicating the model function include accuracy and recall. For example, the model calculation module 104 can input questions from the functional evaluation set corpus into the language model for inference, obtain the model generation results, and then determine the accuracy of the language model based on the model generation results and the answers corresponding to the questions in the functional evaluation set corpus.
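A minimal sketch of such a functional evaluation is shown below. It assumes that a model generation result counts as correct when it matches the reference answer after simple normalization, and that recall is computed for one chosen answer label; both choices are illustrative assumptions, since this application does not fix a scoring rule.

```python
def normalize(text):
    """Lowercase the text and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def accuracy(predictions, references):
    """Fraction of model outputs that match the reference answers."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


def recall(predictions, references, positive_label):
    """Fraction of items whose reference is positive_label that the model
    also answered with positive_label."""
    target = normalize(positive_label)
    relevant = [
        pred for pred, ref in zip(predictions, references)
        if normalize(ref) == target
    ]
    if not relevant:
        return 0.0
    hits = sum(normalize(pred) == target for pred in relevant)
    return hits / len(relevant)


# Hypothetical question-answer evaluation data.
qa_accuracy = accuracy(["Paris", "4", "blue"], ["paris", "4", "red"])
yes_recall = recall(["yes", "no", "yes", "no"], ["yes", "yes", "no", "yes"], "yes")
```

For free-form answers, exact matching is usually too strict, so a real evaluation would more likely rely on a task-specific scorer or on human or automated quality scores.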

[0148] S212, Model calculation module 104 obtains the calling modes of multiple language models configured by the user.

[0149] The calling modes of the language models can include self-deployment or API calls. The model calculation module 104 can obtain, from the interaction module, the calling modes of multiple language models configured by the user through the configuration interface. Specifically, the interaction module can provide a configuration interface, which can include a list of multiple language models. The interaction module can receive the calling modes configured by the user for the language models in the list. The user can configure the calling modes by selecting from a drop-down list that includes the following calling modes: self-deployment and API calls. The user selects self-deployment or API calls for each language model in the list, thereby configuring its calling mode. Correspondingly, the model calculation module 104 can obtain the calling modes of multiple language models configured by the user from the interaction module.

[0150] S214, The model calculation module 104 obtains at least one of the optimization objective, decision variables and constraints of the multi-model optimization problem configured by the user.

[0151] Specifically, the interaction module provides a configuration interface, such as displaying a configuration interface to the user, through which the optimization objective, decision variables, and constraints of the multi-model optimization problem are received. The model calculation module 104 can obtain the optimization objective, decision variables, and constraints of the multi-model optimization problem configured by the user from the interaction module.

[0152] The optimization objective refers to the goal that a multi-model optimization problem needs to achieve or satisfy. The optimization objective of a multi-model optimization problem can be a joint optimization objective determined based on multiple individual objectives. These individual objectives include at least one of the following: minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio. For multiple individual objectives, the model computation module 104 can introduce weights to balance these individual objectives.

[0153] The total cost is determined by the prices of input and output tokens, typically based on the price and quantity of input tokens, and the price and quantity of output tokens. Response latency can be the average response latency, such as the average model call latency. Furthermore, the main component of model call latency is the initial token generation latency; therefore, minimizing response latency can be simplified to minimizing the initial token generation latency. In some scenarios, minimizing response latency can also be simplified to minimizing the incremental token generation latency. Generally, higher throughput for the language model is better; therefore, maximizing generation speed is incorporated into the optimization objective. While ensuring generation quality, users want the total cost to be as low as possible; therefore, maximizing cost-effectiveness is incorporated into the optimization objective. Given the same generation speed, users prefer models with lower prices; therefore, maximizing the speed-to-price ratio is incorporated into the optimization objective.
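The cost calculation described here can be sketched as follows, assuming prices are quoted per million tokens. The function names and the example prices are hypothetical.

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_million, output_price_per_million):
    """Cost of one model call from token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


def total_cost(calls):
    """Sum the cost of several calls, each given as a tuple of
    (input_tokens, output_tokens, input_price, output_price)."""
    return sum(call_cost(*call) for call in calls)


# Hypothetical example: two calls priced at 3.0 / 15.0 and 0.5 / 1.5
# currency units per million input / output tokens.
example_total = total_cost([
    (2000, 500, 3.0, 15.0),
    (1000, 1000, 0.5, 1.5),
])
```

Because output tokens are often priced higher than input tokens, a call with a long generated answer can dominate the total even when its prompt is short.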

[0154] The optimization objective for a multi-model optimization problem can be referenced by the following formula:

[0155] Where $w_1$, $w_2$, $w_3$, $w_4$, and $w_5$ are the weights of different single objectives, and $x_{ijkpm}$ indicates whether to select model m from provider p for the k-th model call in response to the j-th question from the i-th user. $x_{ijkpm}$ can be a binary variable, taking the value 0 or 1. $i$ is the user ID, and the total number of users is $N$. $j$ denotes the j-th question asked by the i-th user, and the total number of questions is $n_i$. $k$ denotes the k-th call for the j-th question from the i-th user, with a total of $n_{ij}$ calls. $p$ denotes the model provider, and the total number of providers is $n_p$. $m$ denotes a model provided by provider p, and the number of such models is $n_{pm}$. Here, i, j, k, p, and m are user-defined parameters. It should be noted that the optimization objective shown in formula (1) is only an example of this application; in actual applications, users can also customize optimization objectives and constraints through the configuration interface according to their business needs.

[0156] $IT_{ijk}$ represents the number of input tokens for the k-th model call in response to the j-th question from the i-th user. $IP_{pm}$ represents the price of an input token for model m of provider p. $OT_{ijk}$ represents the number of output tokens for the k-th model call in response to the j-th question from the i-th user. $OP_{pm}$ represents the price of an output token for model m of provider p. $L_{ijk}$ represents the response latency of the k-th model call in response to the j-th question from the i-th user. $Q_{ijk}$ represents the generation quality (abbreviated as Q) of the k-th model call for the j-th question from the i-th user. $S_{ijk}$ represents the generation speed of the k-th model call for the j-th question from the i-th user. The remaining two ratio terms represent the cost-effectiveness and the speed-price ratio, respectively.

[0157] Among them, $IT_{ijk}$, $OT_{ijk}$, $IP_{pm}$, and $OP_{pm}$ are model call parameters.

[0158] The above optimization objective can be simplified to:

[0159] Here, Cost represents the total cost, Latency represents the response latency, Quality represents the generation quality, and Speed represents the generation speed. The remaining two ratio terms represent the cost-effectiveness and the speed-price ratio, respectively.

[0160] Constraints can include those related to generation quality, generation speed, generation price, context length, or generation latency. Each constraint can also be categorized as a single-invocation constraint or an average constraint across multiple invocations. Examples of constraints are provided below.

[0161] Quality-related constraints can include quality constraints for a single call, as shown below:

[0162] Among them, Q_ijk is the quality of the k-th model call for the j-th question from the i-th user and can be represented by a quality score; this parameter is a model call parameter. Q_min represents the quality threshold for a single call; the quality of any single call is greater than Q_min.

[0163] Alternatively, quality-related constraints may include an average quality constraint across multiple calls, as shown below:

[0164] Among them, Q_avg represents the average quality threshold across multiple calls.

[0165] Similarly, constraints related to generation speed can include speed constraints for a single call and average speed constraints for multiple calls, as shown below:

[0166] Among them, S_ijk represents the generation speed of the k-th model call for the j-th question from the i-th user, and is a model call parameter. S_min represents the speed threshold for a single call; the speed of any single call is greater than S_min.

[0167] Among them, S_avg represents the average speed threshold across multiple calls.

[0168] Constraints related to the generated price can include price constraints for a single call and average price constraints for multiple calls, as shown below:

[0169] Among them, P_max represents the price threshold for a single call; the price of any single call is less than P_max.

[0170] Among them, P_avg represents the average price threshold across multiple calls.

[0171] Constraints related to context length can be: IT_ijk + OT_ijk ≤ C_pm (9)

[0172] Among them, C_pm represents the context length threshold of model m provided by provider p, and this parameter is a model call parameter. Formula (9) above indicates that the sum of the number of input tokens and the number of output tokens in a model call does not exceed the context length threshold.

[0173] Constraints related to generation latency can include latency constraints for a single call and average latency constraints for multiple calls, as shown below:

[0174] Among them, L_ijk represents the response latency of the k-th model call for the j-th question from the i-th user; L_ijk is a model call parameter. L_max represents the latency threshold for a single call; the latency of any single call is less than L_max.

[0175] Among them, L_avg represents the average latency threshold across multiple calls.
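
The single-call and average constraints described above can be checked with a few comparisons. The following Python sketch is illustrative: the class and field names are hypothetical, the thresholds are treated as inclusive bounds (an assumption, since the formulas themselves are not reproduced here), and the average constraint is shown for quality only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    quality: float       # Q_ijk
    speed: float         # S_ijk
    price: float         # price of this call
    latency: float       # L_ijk
    input_tokens: int    # IT_ijk
    output_tokens: int   # OT_ijk


@dataclass
class Limits:
    q_min: float
    s_min: float
    p_max: float
    l_max: float
    q_avg: float
    context_length: int  # C_pm


def single_call_violations(call: Call, lim: Limits) -> list:
    """Names of the single-call constraints that this call violates."""
    checks = {
        "quality": call.quality >= lim.q_min,
        "speed": call.speed >= lim.s_min,
        "price": call.price <= lim.p_max,
        "latency": call.latency <= lim.l_max,
        "context": call.input_tokens + call.output_tokens <= lim.context_length,
    }
    return [name for name, ok in checks.items() if not ok]


def average_quality_ok(calls: list, lim: Limits) -> bool:
    """Average constraint across multiple calls, shown for quality only."""
    return mean(c.quality for c in calls) >= lim.q_avg


limits = Limits(q_min=0.6, s_min=20.0, p_max=1.0, l_max=5.0, q_avg=0.75, context_length=4096)
good = Call(quality=0.9, speed=30.0, price=0.5, latency=2.0, input_tokens=1000, output_tokens=500)
weak = Call(quality=0.5, speed=30.0, price=0.5, latency=2.0, input_tokens=4000, output_tokens=500)
```

Here `weak` violates both the quality threshold and the context length limit, while `good` violates nothing; the average over both calls (0.7) falls below `q_avg`.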

[0176] Furthermore, the constraints may also include constraints related to the decision variables. In some possible implementations, the decision variables may include x_ijkpm. The constraints related to x_ijkpm are as follows:

[0177] Wherein, the constraint restricts x_ijkpm to the value 0 or 1, which indicates whether provider p's model m is used for the k-th model call for the j-th question of the i-th user.

[0178] The constraints related to x_ijkpm may also include:

[0179] The constraint in formula (13) is used to indicate that a model is selected for each call.
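
For reference, the two constraints on x_ijkpm described above can be written out as follows. This is a reconstruction from the surrounding text, assuming that exactly one provider-model pair is selected per call; the numbering follows the references to formulas (12) and (13) in this description.

```latex
x_{ijkpm} \in \{0, 1\} \qquad \forall\, i, j, k, p, m \tag{12}

\sum_{p=1}^{n_p} \sum_{m=1}^{n_{pm}} x_{ijkpm} = 1 \qquad \forall\, i, j, k \tag{13}
```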

[0180] The constraints related to x_ijkpm may also include a limit on the number of calls within a given time period. Specifically, for each user i, the call count constraint within a given time period T can be:

[0181] n_i(T) represents the number of questions asked by user i within time period T, n_ij(T) represents the number of times user i invokes the model for the j-th question within time period T, and R_pm represents the maximum number of times a user can call model m of provider p within a given time period T.

[0182] The constraints related to x_ijkpm may also include a limit on the number of tokens used within a given time period. Specifically, for each user i, within a given time period T, the total number of input and output tokens of the model shall not exceed the upper limit of tokens. The specific constraint is illustrated in the following formula:

[0183] Among them, T_pm is the upper limit of tokens for model m provided by provider p.
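
The call count limit R_pm and the token limit T_pm for one user within a time period T can be checked by simple counting. The Python sketch below is a minimal illustration; the function name, the tuple layout of a call record, and the treatment of a missing limit as unlimited are assumptions.

```python
from collections import Counter


def within_usage_limits(calls, call_limits, token_limits):
    """Check one user's calls in period T against per-model limits.

    calls: iterable of (provider, model, input_tokens, output_tokens).
    call_limits: dict (provider, model) -> R_pm.
    token_limits: dict (provider, model) -> T_pm.
    """
    call_count = Counter()
    token_count = Counter()
    for provider, model, input_tokens, output_tokens in calls:
        call_count[(provider, model)] += 1
        token_count[(provider, model)] += input_tokens + output_tokens
    for key in call_count:
        # A model without a configured limit is treated as unlimited.
        if call_count[key] > call_limits.get(key, float("inf")):
            return False
        if token_count[key] > token_limits.get(key, float("inf")):
            return False
    return True


calls = [("p1", "m1", 100, 50), ("p1", "m1", 200, 100), ("p2", "m3", 10, 10)]
```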

[0184] S215, Model Calculation Module 104 solves the multi-model optimization problem based on the metadata of each model in the multiple language models and the evaluation index value of each model in the multiple language models.

[0185] Specifically, the model calculation module 104 can solve the multi-model optimization problem using integer programming based on the metadata and evaluation index values of each model in the multiple language models, thereby obtaining the mapping relationship between features and models. Integer programming refers to programming in which variables take integer values. When all decision variables take integer values, for example, when the decision variables only include x_ijkpm, the integer programming can be pure integer programming; when some decision variables take integer values and some take real values, the integer programming can be mixed integer programming.

[0186] S216. Model calculation module 104 determines whether the optimal solution has been obtained. If yes, execute S218; otherwise, execute S222.

[0187] The following section provides a detailed explanation of the process of solving multi-model optimization problems using integer programming.

[0188] Referring to Figure 3, which shows a schematic diagram of solving the multi-model optimization problem, the process specifically comprises the following steps:

[0189] S302, Model Calculation Module 104 obtains the initial model parameters.

[0190] The multi-model optimization problem can be solved by constructing an optimization model corresponding to the multi-model optimization problem. The initial model parameters are the initial parameters of the optimization model, including but not limited to the provider and the model parameters of the language models provided by the provider, the number of users, the number of questions, the generation quality, and the price.

[0191] S304, Model Calculation Module 104 defines the optimization model for the multi-model optimization problem based on the decision variables.

[0192] The optimization objective of the optimization model can be a joint optimization objective determined based on multiple individual objectives. These individual objectives include at least one of minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio. In some examples, the joint optimization objective can be a weighted sum of multiple individual objectives, as shown in equation (1) or (2).

[0193] The constraints of the optimization model include constraints related to generation quality, generation speed, generation price, context length, or generation delay, as shown in equations (3) to (11). In addition, the constraints of the optimization model may also include constraints related to decision variables, as shown in equations (12) to (15).

[0194] Based on this, the multi-model optimization problem can be transformed into finding the values of the decision variables that achieve the optimization objective while satisfying the constraints of the optimization model.

[0195] S306, Model Calculation Module 104 generates an initial feasible solution for the optimization model.

[0196] In practice, the model calculation module 104 can randomly generate an initial feasible solution or generate an initial feasible solution based on a heuristic method to ensure that the user's call limit and other model constraints are met. For example, the model calculation module 104 can prioritize allocating high-quality language models to users, or prioritize selecting low-priced language models when the budget is low.
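
One possible heuristic for a starting assignment is sketched below in Python. It picks, for each call, the lowest-priced model that meets the quality requirement of that call and falls back to the highest-quality model when none qualifies. This specific rule, the function name, and the data layout are assumptions chosen for illustration; the description only states that a random or heuristic initial solution can be used.

```python
def initial_solution(required_quality, models):
    """Greedy starting assignment.

    required_quality: list with the minimum quality needed for each call.
    models: dict model_id -> (price, quality).
    Returns one model_id per call.
    """
    best_quality_id = max(models, key=lambda m: models[m][1])
    assignment = []
    for required in required_quality:
        eligible = [m for m in models if models[m][1] >= required]
        if eligible:
            # Cheapest model that still meets the quality requirement.
            assignment.append(min(eligible, key=lambda m: models[m][0]))
        else:
            # No model qualifies: fall back to the highest-quality model.
            assignment.append(best_quality_id)
    return assignment


models = {"small": (1.0, 0.6), "large": (5.0, 0.9)}
start = initial_solution([0.5, 0.8, 0.95], models)
```

The result is only a starting point for the subsequent optimization; the fallback case may still violate a quality constraint and would then be repaired or rejected by the solver.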

[0197] S308, Model Calculation Module 104 uses the branch and bound method to optimize the initial feasible solution. If the solution is successful and the solution time is less than the target time, execute S310; if the solution time reaches the target time, execute S314.

[0198] The model calculation module 104 can start from the initial feasible solution, branch the decision variables to generate subproblems, and then solve them separately. When solving the subproblems, the model calculation module 104 can select the model provider and model for the subproblems, and determine whether to continue branching by solving the lower bound.
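
The branching and bounding described above can be illustrated on a reduced problem: choose one model per call so that the total cost is minimal while the average quality across calls stays at or above a threshold. The Python sketch below is a simplified illustration under these assumptions (a single cost objective, a single average quality constraint, hypothetical names); it is not the complete optimization model of this application.

```python
def branch_and_bound(cost, quality, q_avg):
    """Minimize total cost subject to an average quality constraint.

    cost[k][m] and quality[k][m] are the cost and quality of using
    model m for call k. Returns (best_cost, assignment), or
    (inf, None) when no assignment is feasible.
    """
    n = len(cost)
    # Optimistic completions of the remaining calls, used for bounding.
    min_cost_tail = [0.0] * (n + 1)
    max_quality_tail = [0.0] * (n + 1)
    for k in range(n - 1, -1, -1):
        min_cost_tail[k] = min_cost_tail[k + 1] + min(cost[k])
        max_quality_tail[k] = max_quality_tail[k + 1] + max(quality[k])

    best = [float("inf"), None]

    def visit(k, cost_so_far, quality_so_far, chosen):
        # Bound: even the cheapest completion cannot beat the incumbent.
        if cost_so_far + min_cost_tail[k] >= best[0]:
            return
        # Bound: even the best completion misses the average threshold.
        if quality_so_far + max_quality_tail[k] < q_avg * n:
            return
        if k == n:
            best[0], best[1] = cost_so_far, list(chosen)
            return
        # Branch: fix the model for call k, trying cheaper models first.
        for m in sorted(range(len(cost[k])), key=lambda m: cost[k][m]):
            chosen.append(m)
            visit(k + 1, cost_so_far + cost[k][m], quality_so_far + quality[k][m], chosen)
            chosen.pop()

    visit(0, 0.0, 0.0, [])
    return best[0], best[1]


cost = [[1, 3], [2, 5]]
quality = [[0.5, 1.0], [0.5, 1.0]]
best_cost, assignment = branch_and_bound(cost, quality, q_avg=0.75)
```

In this example the cheapest assignment that reaches an average quality of 0.75 uses the more expensive model for the first call and the cheaper model for the second.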

[0199] The target time can be a time threshold used to measure whether the solution process is slow. When the solution time is greater than or equal to the target time, it indicates that the solution process is slow. The model calculation module 104 can adjust the weights of individual objectives in the joint optimization objective or adjust the constraints to accelerate the solution process.

[0200] It should be noted that the branch and bound method is only one specific implementation of integer programming and mixed integer programming. In practical applications, other algorithms, such as heuristic algorithms and reinforcement learning algorithms, can also be used to implement integer programming or mixed integer programming. Among them, heuristic algorithms can include, but are not limited to, genetic algorithms, simulated annealing, and particle swarm optimization.

[0201] Furthermore, the method of this application may also omit S306 described above. For example, the model calculation module 104 may employ a branch and bound method, a heuristic algorithm, or a reinforcement learning algorithm to solve the problem. Generating an initial feasible solution before using the branch and bound method, heuristic algorithm, or reinforcement learning algorithm helps to accelerate convergence, improve solution efficiency, and shorten solution time.

[0202] S310, Model Calculation Module 104 checks whether the currently solved decision variables are all globally optimal solutions. If not, then execute S312.

[0203] Specifically, the model calculation module 104 can verify whether the currently solved decision variables satisfy all constraints, especially the global average constraint. If satisfied, the model calculation module 104 can add the currently solved decision variables to the solution set of the multi-objective optimization problem. Furthermore, the model calculation module 104 can also use Pareto optimization to process the solution set and obtain a set of non-dominated solutions. If not satisfied, then S312 is executed.
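
Obtaining the non-dominated solutions from a solution set can be sketched as follows. The sketch assumes that each solution is summarized by a (cost, latency, quality) tuple, where lower cost and latency and higher quality are better; the tuple layout and function name are illustrative.

```python
def pareto_front(solutions):
    """Return the solutions that are not dominated by any other solution.

    Each solution is a (cost, latency, quality) tuple.
    """
    def dominates(a, b):
        no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
        strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
        return no_worse and strictly_better

    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


solutions = [(1.0, 1.0, 0.9), (2.0, 2.0, 0.8), (1.0, 2.0, 0.95)]
front = pareto_front(solutions)
```

The second solution is dropped because the first is cheaper, faster, and of higher quality; the third survives because it offers the highest quality.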

[0204] S312, Model Calculation Module 104 adjusts the weight of a single objective in the joint optimization objective or adjusts the constraints.

[0205] Specifically, the model calculation module 104 can employ a local search approach to adjust the weights of individual objectives within the optimization goals. Alternatively, when the feasible region is empty, the model calculation module 104 can relax secondary constraints in the constraint conditions. Secondary constraints refer to constraints with relatively low importance. This ensures both the quality and speed of language model generation.
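
Relaxing secondary constraints when the feasible region is empty can be sketched as dropping the least important constraint until at least one candidate becomes feasible. In the Python sketch below, the numeric importance values, the predicate representation, and the function name are assumptions made for illustration.

```python
def relax_until_feasible(candidates, constraints):
    """Drop the least important constraints until a candidate is feasible.

    constraints: list of (name, importance, predicate); a higher
    importance means the constraint is relaxed later.
    Returns (feasible_candidates, names_of_dropped_constraints).
    """
    active = sorted(constraints, key=lambda c: c[1], reverse=True)
    dropped = []
    while True:
        feasible = [c for c in candidates if all(pred(c) for _, _, pred in active)]
        if feasible or not active:
            return feasible, dropped
        # The last element has the lowest importance.
        dropped.append(active.pop()[0])


candidates = [
    {"quality": 0.9, "latency": 3.0},
    {"quality": 0.7, "latency": 1.0},
]
constraints = [
    ("quality", 2, lambda c: c["quality"] >= 0.85),
    ("latency", 1, lambda c: c["latency"] <= 2.0),
]
feasible, dropped = relax_until_feasible(candidates, constraints)
```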

[0206] S314, Model calculation module 104 solves the problem based on the adjusted weights or constraints.

[0207] The specific implementation of the model calculation module 104 based on the adjusted weights can be found in the descriptions of the relevant content in S306 and S308 mentioned above, and will not be repeated here.

[0208] Figure 3 is merely one specific implementation of solving the multi-model optimization problem. In practical applications, other methods can also be used to solve the multi-model optimization problem, and this application does not impose any restrictions on this.

[0209] In some possible implementations, the multi-model optimization problem can be used not only to determine the optimal model but also to determine the optimal model parameters. The optimal model parameters can be the optimal parameters of the optimal model, or model parameters determined along with the optimal model. To determine the optimal model parameters, decision variables related to the model parameters can be added when defining the optimization model. Accordingly, during the solution process, the model computation module 104 can use mixed integer programming.

[0210] Specifically, the decision variables may also include at least one of the following: the generation temperature parameter when selecting provider p's model m; the top_p parameter when selecting provider p's model m; the frequency penalty parameter when selecting provider p's model m; the presence penalty parameter when selecting provider p's model m; or the top_logprobs parameter when selecting provider p's model m. The top_logprobs parameter is used to return the most likely tokens at a specified token position, each with an associated log probability.

[0211] When decision variables related to model parameters are added to the optimization model, the optimization objective can also be updated to depend on the model parameters. Different models can have different parameters; the following example uses one set of model parameters. Specifically, generation quality, generation speed, cost-effectiveness, quality-speed ratio, and speed-price ratio may be affected by the model parameters, and the related single objectives can be updated as follows:

[0212] Among them, Q(Tem_ijkpm, TP_ijkpm, FP_ijkpm, PP_ijkpm, Log_ijkpm) represents the quality function related to the model parameters. x_ijkpm indicates whether model m of provider p is selected in the k-th model call for the j-th question of the i-th user. Tem_ijkpm represents the temperature parameter when model m of provider p is selected. TP_ijkpm represents the top_p parameter when model m of provider p is selected. FP_ijkpm represents the frequency penalty parameter when model m of provider p is selected. PP_ijkpm represents the presence penalty parameter when model m of provider p is selected. Log_ijkpm represents the top_logprobs parameter when model m of provider p is selected.

[0213] Among them, S(Tem_ijkpm, TP_ijkpm, FP_ijkpm, PP_ijkpm, Log_ijkpm) represents the generation speed function related to the model parameters.

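[0214] Where the cost-effectiveness term is related to the model parameters.

[0215] Where the quality-speed ratio term is related to the model parameters.

[0216] Where the speed-price ratio term is related to the model parameters.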

[0217] The above explains the updating of single objectives in different dimensions. When the joint optimization objective introduces single objectives related to generation quality, generation speed, cost-effectiveness, quality-speed ratio, and speed-price ratio, the above formulas (16) to (20) can be used for joint optimization.

[0218] The constraints related to generation quality and generation speed can be updated as follows:

[0219] Formulas (21) and (22) are illustrated using single-call constraints, while the average constraint for multiple calls can be found in the aforementioned description.

[0220] The model calculation module 104 can determine the multi-model optimization problem based on the updated optimization objectives and constraints. Accordingly, the model calculation module 104 can solve the multi-model optimization problem by using mixed integer programming based on the metadata of each model in the multiple AI models and the evaluation index values of each model in the multiple AI models, to obtain one or more features and model parameters matched by each model in the multiple AI models.

[0221] S218, the model calculation module 104 constructs the mapping relationship between the query request and the model identifier of the optimal model, and constructs the mapping relationship between the query request and the parameters of the optimal model.

[0222] For the questions in the evaluation dataset, the model calculation module 104 can determine multiple query requests and, based on the aforementioned steps, determine the optimal model and optimal model parameters for each query request. The model calculation module 104 can construct a mapping relationship between query requests and optimal model identifiers based on the multiple query requests and the optimal model identifier corresponding to each query request. Similarly, based on the multiple query requests and the optimal model parameters corresponding to each query request, it can construct a mapping relationship between query requests and optimal model parameters.

[0223] S220, the model calculation module 104 constructs the mapping relationship between features and models, as well as the mapping relationship between features and parameters.

[0224] The model calculation module 104 can extract features from the query requests corresponding to the questions in the evaluation set corpus, obtaining features extracted from the query requests. The model calculation module 104 can then construct a mapping relationship between features and models based on the features extracted from the query requests and the mapping relationship between the query requests and the model identifiers of the optimal models. Similarly, the model calculation module 104 can construct a mapping relationship between features and parameters based on the features extracted from the query requests and the mapping relationship between the query requests and the parameters of the optimal models.
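
Deriving a feature-model mapping from the query-level mapping can be sketched as follows. The aggregation rule (each feature is mapped to the model that was optimal for the largest number of queries carrying that feature), the identifiers, and the data layout are assumptions of this sketch; the description does not prescribe how conflicting queries are reconciled.

```python
from collections import Counter, defaultdict


def build_feature_model_map(query_features, query_best_model):
    """Map each feature to the model most often optimal for queries with it.

    query_features: dict query_id -> list of features extracted from the query.
    query_best_model: dict query_id -> model identifier of the optimal model.
    """
    votes = defaultdict(Counter)
    for query_id, features in query_features.items():
        model_id = query_best_model.get(query_id)
        if model_id is None:
            continue  # No optimal model was found for this query.
        for feature in features:
            votes[feature][model_id] += 1
    return {feature: counts.most_common(1)[0][0] for feature, counts in votes.items()}


query_features = {
    "q1": ["code generation"],
    "q2": ["code generation", "python"],
    "q3": ["code generation"],
    "q4": ["contract review"],
}
query_best_model = {"q1": "model-a", "q2": "model-a", "q3": "model-b", "q4": "model-c"}
feature_model_map = build_feature_model_map(query_features, query_best_model)
```

A feature-parameter mapping could be built in the same way by replacing the model identifier with the optimal parameter set of each query.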

[0225] The steps S218 and S220 described above are optional steps in the embodiments of this application. The method of this application may also omit steps S218 to S220. For example, the model selection module 106 may directly perform adaptive model selection and parameter configuration based on one or more features and model parameters matched by each of the multiple AI models obtained in S215, or it may directly perform adaptive model selection based on one or more features matched by each of the multiple AI models obtained in S215.

[0226] S222, Model calculation module 104 determines whether the language model allows relaxation. If yes, then execute S224; otherwise, execute S226.

[0227] Specifically, the model calculation module 104 can determine whether the language model allows relaxation by referring to the configuration file of the language model. Relaxation of the language model can be relaxation of the constraints or of the optimization objective of the language model. Users (such as developers) can configure whether the language model allows relaxation before optimization, and during the optimization process the model calculation module 104 can read the configuration file to determine whether the language model allows relaxation.

[0228] In some possible implementations, the model calculation module 104 can provide a judgment window through an interactive module, allowing users to manually determine whether the language model allows relaxation. Users can determine whether language model relaxation is allowed based on business requirements. It should be noted that this application also supports modifying whether language model relaxation is allowed. For example, if a user pre-configures the language model to disallow relaxation, as the optimization process progresses, the user can modify the language model configuration file to configure the language model to allow relaxation.

[0229] If relaxation is allowed, the model calculation module 104 can execute S224 and solve the problem again; if relaxation is not allowed, the model calculation module 104 can provide fine-tuning suggestions, such as suggested supervised fine-tuning (SFT) parameters, and perform the relevant steps in S226 to S234 to fine-tune the language model.

[0230] S224, Model calculation module 104 updates the optimization objective or constraints of the optimization model. Then execute S215.

[0231] Specifically, the interaction module can provide a configuration interface, such as displaying a configuration interface to the user. The user can update the optimization objective and constraints of the optimization model through the configuration interface. The interaction module receives the updated optimization objective or constraints through the configuration interface, and the model calculation module 104 can obtain the updated optimization objective or constraints from the user through the configuration interface from the interaction module.

[0232] Thus, the model calculation module 104 can update the multi-model optimization problem based on the updated optimization objective and constraints, and solve the updated multi-model optimization problem based on the metadata of each model in the multiple language models and the evaluation index values of each model in the multiple language models. For the solution process, refer to the relevant content described above, which will not be repeated here.

[0233] The above describes the specific implementation by which the model calculation module 104 solves the multi-model optimization problem based on the metadata and evaluation metrics of each of the multiple AI models, obtaining one or more features matched by each of the multiple AI models. In practical applications, one or more features matched by each model can also be obtained through other methods. This application does not limit this.

[0234] S226, Model Calculation Module 104 selects the base model and training corpus.

[0235] S228. The model calculation module 104 determines whether the calling mode of the base model is self-deployment. If yes, then execute S230; otherwise, execute S234.

[0236] S230, the model calculation module 104 fine-tunes the base model using the training set corpus to obtain an updated self-deployed model.

[0237] S232, the model calculation module 104 adds the updated self-deployed model to the multiple language models for metadata extraction and model evaluation.

[0238] S234, Model Calculation Module 104 calls the fine-tuning interface to fine-tune the language model provided by the provider.

[0239] It should be noted that, to better match real-world user scenarios, the model calculation module 104 should compute as quickly as possible, ideally at a near-real-time or real-time (second-level) speed. To this end, the model calculation module 104 can split the optimization model into two parts: a simple model and a complex model. The simple model can select the optimal model from the models of a limited scenario, such as quickly choosing from multiple historically best models, so that the calculation completes within seconds. The complex model can select the optimal model from the models of the full scenario, or select the optimal model parameters, and typically takes minutes or even hours to compute. With this split, a rapid update can first be performed using the simple model, followed by a precise update using the complex model, thus enabling rolling updates.

[0240] Based on the aforementioned model management system 10, this application also provides a model switching method. The model switching method of this application will be described below with reference to the accompanying drawings.

[0241] Referring to the flowchart of a model switching method shown in Figure 4, this method can be executed by the model management system 10 shown in Figure 1. The model management system 10 includes an interaction module (not shown in Figure 1), a model selection module 106, and an inference module 108. Further, the model management system 10 may also include a model information acquisition module 102 and a model calculation module 104. The method specifically includes the following steps:

[0242] S402, Model selection module 106 receives the first query request.

[0243] The first query request can be generated based on a question posed by the user, and can be denoted as query1. It's important to note that each question a user asks can generate a new query. In complex problems or business scenarios, users may initiate multiple rounds of questions based on previous answers. Therefore, the first query request can be a new query generated based on the response or answer of the previous query. In some examples, the first query request can also be a new query that does not include the response or answer of the previous query. For instance, if the time between the user's initial query and the most recent query exceeds the maximum time interval, the first query request may not include the response or answer of the previous query.
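
The decision of whether a new query carries the previous answer can be sketched as a simple time check. In the Python sketch below, the function name, the dictionary layout of a query, and the default interval of 1800 seconds are hypothetical; the description only states that a maximum time interval exists.

```python
def build_query(question, previous_answer, seconds_since_last_query, max_interval_s=1800.0):
    """Build a query, attaching the previous answer only within the time window."""
    query = {"question": question}
    if previous_answer is not None and seconds_since_last_query <= max_interval_s:
        # Follow-up within the window: keep the earlier answer as context.
        query["previous_answer"] = previous_answer
    return query


follow_up = build_query("And for Java?", "Use a list comprehension.", 60.0)
fresh = build_query("And for Java?", "Use a list comprehension.", 7200.0)
```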

[0244] In specific implementation, the interaction module in the model management system 10 can provide a user interface, which may include a graphical user interface (GUI) or a command user interface (CUI). Users can initiate questions through the GUI or CUI, thereby triggering a query request. To distinguish this query request from other query requests, this application identifies it as the first query request. The model selection module 106 can receive the first query request sent or forwarded by the interaction module.

[0245] S404, Model selection module 106 extracts the first feature based on the first query request.

[0246] The first feature refers to the feature extracted based on the first query request. Considering that the model selection module 106 can extract features for different query requests, to facilitate the distinction between features extracted based on the first query request and features extracted based on the second query request, this application refers to them as the first feature and the second feature, respectively. The first feature can be used to describe one or more of the following: business, scenario, domain or downstream task, task context or intelligent agent. Among these, business, scenario, domain or downstream task can be different expressions of the feature by different applications, for example, different expressions for the same feature. The task context is a feature derived from the downstream task.

[0247] In some possible implementations, the model selection module 106 can parse the first query request to obtain the first feature. For example, if the first query request specifies an agent, a downstream task, or includes task context in the request body, then the model selection module 106 can obtain features such as the agent, downstream task, or task context by parsing the first query request. As another example, when the first query request initiated by the user in the integrated development environment (IDE) interface or web interface includes business information, the model selection module 106 can parse the first query request to obtain features such as business information.
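
Parsing a query request to obtain features can be as simple as reading fields from the request body. The Python sketch below assumes a JSON request body; the field names `agent`, `downstream_task`, `task_context`, and `business` are hypothetical and are not defined by this application.

```python
import json

# Hypothetical names of the request fields that carry features.
FEATURE_KEYS = ("agent", "downstream_task", "task_context", "business")


def parse_features(raw_request: str) -> dict:
    """Return the non-empty feature fields found in a JSON request body."""
    body = json.loads(raw_request)
    return {key: body[key] for key in FEATURE_KEYS if body.get(key)}


raw = json.dumps({
    "question": "Refactor this function",
    "agent": "code-assistant",
    "downstream_task": "code refactoring",
    "task_context": "",
})
features = parse_features(raw)
```

Fields that are absent or empty are skipped, so the caller can fall back to intent recognition when the parsed result is empty.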

[0248] In other possible implementations, the model selection module 106 can perform intent recognition on the first query request to obtain the first feature. Specifically, the model selection module 106 can extract information from the first query request using a feature extraction network to obtain the first feature. The feature extraction network can be a neural network, including but not limited to recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. In some examples, the model selection module 106 can also utilize a language model to extract the first feature from the first query request.

[0249] S406, Model selection module 106 determines a first model that matches the first feature from multiple language models based on the first feature extracted from the first query request.

[0250] Each of the multiple language models has one or more matching features. When a feature extracted from a query request (such as a first query request) matches one or more features of a language model, the model selection module 106 can select the language model that successfully matches the features extracted from the query request.

[0251] The model selection module 106 can determine a first model from multiple language models through semantic retrieval or keyword matching. For example, the model selection module 106 can extract keywords from the first feature extracted from the first query request, and extract keywords from one or more features possessed by each of the multiple language models. Then, it matches the keywords extracted from the first feature with keywords extracted from matching features of the language models to determine the first model. The first model can be the language model whose keywords successfully match those extracted from the first feature. When one or more matching features of a language model are represented by a feature-model mapping relationship, the model selection module 106 can extract keywords from both the feature-model mapping relationship and the first feature, and obtain the model identifier of the first model from the successfully matched mapping records, thereby determining the first model. As another example, the model selection module 106 can perform semantic analysis on the first feature extracted from the first query request, retrieve the feature-model mapping relationship based on the semantic analysis results, and obtain semantically matched mapping records. The model selection module 106 can determine the first model from the semantically matched mapping records.
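
The keyword matching variant can be sketched in Python as follows. The tokenization rule, the use of the largest keyword overlap as the matching criterion, and the model identifiers are assumptions of this sketch; semantic retrieval would replace the overlap count with a similarity score between embeddings.

```python
import re


def keywords(text: str) -> set:
    """Lowercased word tokens of a feature description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_model(feature: str, feature_model_map: dict, default=None):
    """Return the model identifier whose mapped feature shares the most keywords."""
    wanted = keywords(feature)
    best_id, best_overlap = default, 0
    for mapped_feature, model_id in feature_model_map.items():
        overlap = len(wanted & keywords(mapped_feature))
        if overlap > best_overlap:
            best_id, best_overlap = model_id, overlap
    return best_id


mapping = {"code generation": "model-a", "legal contract review": "model-b"}
first_model = select_model("Python code generation task", mapping)
```

When no mapping record shares any keyword with the feature, the function returns the `default` value, which a caller could set to a general-purpose model.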

[0252] In some possible implementations, the model selection module 106 may also determine the model parameters of the first model based on the first feature extracted from the first query request, so as to use the first model to perform inference on the first query request. Specifically, after the model calculation module 104 determines the optimal model parameters corresponding to the feature, the model selection module 106 can determine the optimal model parameters corresponding to the first feature based on the first feature and the optimal model parameters corresponding to the feature. These optimal model parameters can be the model parameters of the first model. Alternatively, after the model calculation module 104 constructs the mapping relationship between features and parameters, the model selection module 106 can determine the model parameters of the first model based on the mapping relationship between features and parameters and the first feature extracted from the first query request.

[0253] S408, the inference module 108 uses the first model among the multiple language models to perform inference on the first query request and obtain the first model generation result.

[0254] Specifically, the inference module 108 can input the prompts constructed based on the first query request into the first model, and perform inference through the first model to obtain the result generated by the first model. It should be noted that this embodiment uses the first model as a language model example for illustration. In actual applications, if the first model is not a language model but other machine learning models, then it is not necessary to construct prompts or input prompts into the first model for inference.

[0255] Specifically, the model management system 10 can obtain the prompt template of the first model and fill in the template according to the first query request and the first feature, thereby obtaining the prompt constructed based on the first query request. The inference module in the model management system 10 can input this prompt into the first model for inference, thereby obtaining the first model generation result.
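
Filling a prompt template can be sketched with the `string.Template` class of the Python standard library. The template text and the placeholder names `$downstream_task` and `$query` are hypothetical; an actual prompt template would be specific to the first model.

```python
from string import Template


def build_prompt(template_text: str, query: str, features: dict) -> str:
    """Fill the model's prompt template with the query and its features."""
    fields = {"query": query, **features}
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_text).safe_substitute(fields)


template_text = "You are assisting with $downstream_task.\nQuestion: $query"
prompt = build_prompt(
    template_text,
    "How do I reverse a list?",
    {"downstream_task": "code generation"},
)
```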

[0256] S409, the interaction module provides the first model generation result in the user interface.

[0257] The first model generation result can be an intermediate result, such as the derivation process for a user's question. This derivation process can be an intermediate step in a chain of thought. The user can configure whether to display the intermediate chain of thought. When the user configures the intermediate chain of thought to be displayed, the interaction module can show it to the user. When the user has not configured it to be displayed, or has configured it not to be displayed, the interaction module may refrain from displaying it. In some examples, the first model generation result can also be the final result, such as an answer to a user's question. The interaction module can provide the first model generation result in the user interface, for example by displaying it to the user, to respond to the user's question.

[0258] S410, the model selection module 106 receives the second query request.

[0259] The second query request can be generated based on a question raised by the user, and can be denoted as query2. Specifically, the second query request can be generated based on questions raised by the user in subsequent rounds. It should be noted that subsequent rounds can be related to or unrelated to the previous round. For example, when a user asks consecutive questions about a complex issue, subsequent rounds can be related to the previous round. Conversely, when a user asks questions about other areas, businesses, or scenarios, subsequent rounds can be unrelated to the previous round. It should also be noted that the model management system 10 can provide services to multiple users simultaneously; therefore, the second query request can also be generated based on questions raised by other users.

[0260] In specific implementation, the interaction module in the model management system 10 can provide a user interface, which may include a GUI or a command-line interface (CUI). Users can initiate questions through the GUI or CUI, thereby triggering a query request. To distinguish this query request from other query requests, this application identifies it as a second query request. The model selection module 106 can receive or forward the second query request sent by the interaction module.

[0261] S412, the model selection module 106 extracts the second feature based on the second query request.

[0262] Specifically, the model selection module 106 can parse the second query request to obtain the second features. For example, if the second query request specifies an agent, a downstream task, or includes task context in the request body, the model selection module 106 can obtain features such as the agent, downstream task, or task context by parsing the second query request. As another example, when a user initiates a second query request through an IDE interface or web interface that includes features such as business logic, the model selection module 106 can parse the second query request to obtain features such as business logic.

[0263] For the scenario in which the user does not specify an agent or a downstream task when asking the question, the model selection module 106 can perform intent recognition on the second query request to obtain the second feature. Specifically, the model selection module 106 can extract information from the second query request using a feature extraction network to obtain the second feature. Alternatively, the model selection module 106 can extract the second feature from the second query request using a language model. For the specific implementation of extracting the second feature based on the second query request, refer to the description of S404, which is not repeated here.
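The two extraction paths described above (parsing explicit fields, then falling back to intent recognition) can be sketched as follows. The request layout, the field names, and the keyword-based `recognize_intent` function are assumptions; in particular, the keyword lookup is only a stand-in for the feature extraction network or language model mentioned in the text.

```python
# Hypothetical field names for features that a request may carry explicitly.
FEATURE_KEYS = ("agent", "downstream_task", "task_context", "business_logic")

def recognize_intent(text: str) -> dict:
    """Placeholder intent recognizer: a keyword lookup standing in for a
    feature extraction network or a language model."""
    keywords = {"bug": "code_repair", "translate": "translation", "summar": "summarization"}
    lowered = text.lower()
    for keyword, task in keywords.items():
        if keyword in lowered:
            return {"downstream_task": task}
    return {"downstream_task": "general_qa"}

def extract_features(request: dict) -> dict:
    """Parse explicit features first; fall back to intent recognition when
    neither an agent nor a downstream task is specified."""
    features = {key: request[key] for key in FEATURE_KEYS if request.get(key)}
    if "agent" not in features and "downstream_task" not in features:
        features.update(recognize_intent(request.get("question", "")))
    return features

explicit = extract_features({"agent": "coder", "question": "Fix this bug"})
inferred = extract_features({"question": "Please translate this paragraph"})
```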

[0264] S414, the model selection module 106 determines a second model from multiple language models based on the second feature extracted from the second query request.

[0265] In this system, each of the multiple language models has one or more matching features. When a feature extracted from a query request (such as the second query request) matches one or more features of a language model, the model selection module 106 can select that language model as the second model for inference. It should be noted that the model selection module 106 can determine the second model from the multiple language models through semantic retrieval or keyword matching. For the specific implementation of semantic retrieval or keyword matching, refer to the description of S406, which is not repeated here.
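A minimal sketch of the keyword-matching variant is shown below. The model names and feature sets are hypothetical, and ranking by the size of the overlap is an assumed tie-breaking rule that the application does not specify; semantic retrieval would replace the set intersection with an embedding similarity search.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping from each model to the features it matches.
MODEL_FEATURES: Dict[str, Set[str]] = {
    "small_fast_model": {"general_qa", "summarization"},
    "code_model": {"code_repair", "code_review"},
    "reasoning_model": {"math", "planning", "code_review"},
}

def select_model(query_features: Set[str],
                 model_features: Dict[str, Set[str]]) -> Optional[str]:
    """Return the model whose feature set overlaps most with the query features,
    or None when no model matches any feature."""
    best_name, best_overlap = None, 0
    for name, features in model_features.items():
        overlap = len(query_features & features)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name

first_choice = select_model({"summarization"}, MODEL_FEATURES)
second_choice = select_model({"code_review", "planning"}, MODEL_FEATURES)
```

With these sample sets, two successive queries with different features resolve to different models, which is the switching behavior described in the text.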

[0266] In some possible implementations, the model selection module 106 may also receive a model switching trigger condition. The model switching trigger condition is used to switch to a new model in response to the second query request. Specifically, the model selection module 106 may trigger a model switch upon receiving the second query request, or upon detecting a model switching trigger condition. The process of triggering a model switch may involve reselecting a language model. Specifically, the model selection module 106 may determine a second model from multiple language models based on a second feature extracted from the second query request and one or more features possessed by each of the multiple language models. Alternatively, the model selection module 106 may determine a second model from multiple language models based on a second feature extracted from the second query request and the mapping relationship between features and models.

[0267] In some possible implementations, the trigger condition for model switching may include any of the following: the multiple language models undergo version updates, or a new language model is added; the performance, cost, or functionality of the multiple language models changes; at least one of the features matched by the multiple AI models changes (e.g., the mapping relationship between features and models changes); or a model switching instruction is received. The model switching instruction may be a user-initiated instruction that directs a model switch.
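One way to detect these trigger conditions is to compare two snapshots of the model registry. The sketch below assumes a hypothetical `RegistrySnapshot` structure; how versions, metrics, and mappings are actually stored is not specified in the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegistrySnapshot:
    """Hypothetical snapshot of what the model management system knows."""
    model_versions: tuple   # (model name, version) pairs; a new model adds a pair
    model_metrics: tuple    # (model name, performance/cost/functionality descriptor) pairs
    feature_mapping: tuple  # (feature, model name) pairs

def should_switch(old: RegistrySnapshot, new: RegistrySnapshot,
                  switch_instruction: bool = False) -> bool:
    """Return True when any of the listed trigger conditions holds."""
    return (
        switch_instruction                             # explicit switching instruction
        or old.model_versions != new.model_versions    # version update or new model
        or old.model_metrics != new.model_metrics      # performance, cost, or functionality change
        or old.feature_mapping != new.feature_mapping  # matched features changed
    )

before = RegistrySnapshot(
    (("model_a", "v1"),), (("model_a", "cost=1.0"),), (("code_repair", "model_a"),)
)
after = RegistrySnapshot(
    (("model_a", "v2"),), (("model_a", "cost=1.0"),), (("code_repair", "model_a"),)
)
```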

[0268] S416, the inference module 108 uses the second model to perform inference on the second query request and obtains the second model generation result.

[0269] Specifically, the inference module 108 can input the prompt constructed based on the second query request into the second model and perform inference through the second model to obtain the second model generation result.

[0270] Similar to the inference for the first query request, the model management system 10 can obtain the prompt template of the second model and fill in the template according to the second feature to obtain the prompt constructed based on the second query request. The inference module 108 in the model management system 10 can then input this prompt into the second model for inference, thereby obtaining the second model generation result.

[0271] S418, the interaction module provides the second model generation result in the user interface.

[0272] The second model generation result can be an intermediate result, such as the derivation process for a user's question. This derivation process can be an intermediate step in a chain of thought. The user can configure whether the intermediate chain of thought is displayed. When the user enables this display, the interaction module can show the intermediate chain of thought to the user. When the user has not configured the display, or has disabled it, the interaction module can omit it. In some examples, the second model generation result can also be a final result, such as an answer to the user's question. In that case, the interaction module can display the second model generation result through the user interface to respond to the user's question.

[0273] Furthermore, if the quality of the second model generation result does not meet the conditions, for example because its accuracy is low, the model selection module 106 can also select a third model from the multiple AI models based on the second feature extracted from the second query request. The third model is different from the second model. For example, the model selection module 106 can reuse the second feature extracted in S412 and look up the mapping relationship between features and models using the second feature to determine the third model. Correspondingly, the inference module 108 can use the third model to perform inference on the second query request and obtain the third model generation result. The interaction module can provide the third model generation result on the user interface.
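The fallback from the second model to a third model can be sketched as a loop over candidate models that match the same feature. The `run_model` and `quality_ok` callables are hypothetical stand-ins for the inference module and for whatever quality condition is configured.

```python
def infer_with_fallback(query, candidates, run_model, quality_ok):
    """Try candidate models in order and stop at the first acceptable result.

    If no candidate passes the quality check, the last model and its result
    are returned so the caller can decide what to do next.
    """
    used, result = None, None
    for used in candidates:
        result = run_model(used, query)
        if quality_ok(result):
            break
    return used, result

# Stub inference and quality check, for illustration only.
def fake_run(model, query):
    return f"{model}:{'good' if model == 'third_model' else 'poor'}"

def fake_quality(result):
    return result.endswith("good")

chosen, answer = infer_with_fallback(
    "second query", ["second_model", "third_model"], fake_run, fake_quality
)
```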

[0274] The embodiment shown in Figure 4 illustrates multiple AI models as an example of multiple language models. In practical applications, the multiple models available for user inference can not only be multiple language models, but also other models or combinations of other models. For example, multiple AI models can include, but are not limited to, machine learning models such as SVM, XGBoost, and Random Forest, or BERT classification models, embedding models, and rerank models. Furthermore, AI models can also include multimodal models, such as multimodal models for processing images, videos, and audio.

[0275] Based on the above description, this application provides a model switching method. Taking language models as an example of AI models, the method adaptively selects and calls a suitable language model based on the one or more features matched by each model. Because the same language model is not called for every request, adaptive model switching is achieved. This makes full use of the strengths of each language model while compensating for its weaknesses, improves the ability to solve tasks ranging from simple to complex, and meets business needs. Furthermore, the method supports adaptive selection of model parameters and calling modes based on the mapping relationship between features and parameters. This avoids the unreasonable parameter configuration that results from manual selection of model parameters, further improving end-to-end accuracy, reducing response latency, and lowering costs.

[0276] Furthermore, this method can monitor in real time, through OpenAPI, the price, performance, and functionality of the language models offered by model providers. This avoids situations in which price fluctuations change the cost (so that the selected model is no longer the lowest-cost choice) or in which OpenAPI unavailability makes the application unavailable. Moreover, this method takes into account the limits that model providers place on the number of tokens or calls within a given period, which enables more reasonable model selection and switching and supports global optimization across multiple call modes. In addition, this method can dynamically update the optimization objective as the application's user base continues to grow.
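As an illustration of taking provider limits into account, the sketch below skips any model whose token quota for the current period would be exceeded. The model names, limits, and usage figures are invented for the example, and a real system would also track call counts and reset the counters per period.

```python
def pick_within_quota(ranked_models, used_tokens, token_limits, needed_tokens):
    """Return the first model, in preference order, that still has enough
    token quota left in the current period; None if every quota is exhausted."""
    for name in ranked_models:
        if used_tokens.get(name, 0) + needed_tokens <= token_limits[name]:
            return name
    return None

# Hypothetical preference order (cheapest first), limits, and current usage.
ranked = ["cheap_model", "mid_model", "premium_model"]
limits = {"cheap_model": 1000, "mid_model": 5000, "premium_model": 20000}
used = {"cheap_model": 950, "mid_model": 1200}

choice = pick_within_quota(ranked, used, limits, needed_tokens=200)
```

Here the cheapest model is skipped because 950 + 200 exceeds its limit of 1000, so the next model in the ranking is chosen.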

[0277] To make the technical solution of this application clearer and easier to understand, the model switching method of this application is described below in conjunction with a specific application scenario. The model switching method of this application can be used for the development and operation of large model applications such as single-agent and multi-agent applications.

[0278] The following example illustrates the application of a multi-agent architecture. In a multi-agent architecture, multiple agents typically execute tasks according to a Standard Operating Procedure (SOP). An SOP can be a pre-set static SOP or a dynamic SOP generated based on business requirements.

[0279] Referring to Figure 5, which shows a flowchart of a multi-node dynamic SOP mode for agent collaboration, the model management system 10 can be integrated into the agent execution device as a plug-in. The method includes the following steps:

[0280] S501, the agent execution device receives the query request and performs feature extraction through the model selection module 106 to obtain the business, domain, scenario, downstream task, or agent.

[0281] Specifically, the agent execution device can parse the query request through the model selection module 106. If the query request specifies an agent and a downstream task, and the request body includes task context, the agent execution device can obtain features such as the downstream task, task context, or agent. If the query request does not specify an agent or a downstream task, and the request body does not include task context, the agent execution device can perform intent recognition on the query request to obtain the business, domain, scenario, downstream task, or agent.

[0282] S502, the agent execution device obtains the task context.

[0283] The agent execution device can obtain the task context based on the query request. When the query request is generated in response to a question posed by a user, the task context may include historical question-answer pairs. Furthermore, the task context may also include the response to the previous query or to the previous step or operation (e.g., knowledge base retrieval).

[0284] S503, The agent execution device loads the next agent to be loaded from among the multiple agents associated with the SOP, according to the SOP.

[0285] The agent execution device progressively loads and executes the agents associated with the Standard Operating Procedure (SOP) through a loop, ensuring task continuity and efficient system operation. Specifically, the agent execution device can load the next agent from among the multiple agents associated with the SOP, according to the SOP. During the initialization phase, the next agent is the first agent to be executed in the SOP. During the execution phase, the next agent is the agent that follows the current one in the SOP, such as the second agent.

[0286] S504, The agent execution device loads the next operation node in the agent.

[0287] Specifically, when an agent is loaded, the agent execution device can load the agent's operation nodes to achieve refined task execution. For example, if the next agent is the first agent, the agent execution device can load the first operation node of the first agent sequentially according to the standard operating procedure (SOP). Similarly, if the next agent is the second agent, the agent execution device can load the second operation node of the second agent sequentially according to the SOP. When loading operation nodes, the next operation node can be loaded only after the current operation node has been successfully executed.

[0288] S506, the agent execution device obtains the node type and node information of the operation node.

[0289] Node types can be operation types, including knowledge query, data processing, API call, or model inference. Node information can include input and output parameters, execution logic, and dependencies of tasks assigned to the operation node. Dependencies of tasks assigned to the operation node include dependencies between the agent and other agents among multiple agents associated with the SOP, dependencies between the operation node and other operation nodes within the agent, or dependencies between parameters used to execute the operation node.

[0290] In some possible implementations, the node type and node information of each agent's operation nodes are defined as JavaScript Object Notation (JSON) or YAML metadata in a configuration file or database. Based on this, the agent execution device can obtain the node type and node information by reading this metadata.
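A minimal sketch of reading such metadata is given below. The JSON schema (the `agent`, `nodes`, `type`, `inputs`, `outputs`, and `depends_on` keys) is an assumption for illustration; the application only states that node type and node information are stored as JSON or YAML.

```python
import json

# Hypothetical metadata for one agent with two operation nodes.
NODE_METADATA = """
{
  "agent": "retrieval_agent",
  "nodes": [
    {"id": "n1", "type": "knowledge_query",
     "inputs": ["question"], "outputs": ["passages"], "depends_on": []},
    {"id": "n2", "type": "model_inference",
     "inputs": ["question", "passages"], "outputs": ["answer"], "depends_on": ["n1"]}
  ]
}
"""

def load_nodes(raw: str) -> dict:
    """Parse the metadata and index the operation nodes by their id."""
    document = json.loads(raw)
    return {node["id"]: node for node in document["nodes"]}

def calls_language_model(node: dict) -> bool:
    """Decide from the node type whether the node needs a language model."""
    return node["type"] == "model_inference"

nodes = load_nodes(NODE_METADATA)
```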

[0291] S508, the agent execution device determines whether the operation node calls a language model. If yes, S510 is executed. If no, S514 is executed.

[0292] S510, the agent execution device calls the model selection module 106 to determine the language model corresponding to the query request from multiple language models, based on the mapping relationship between features and models as well as the business, domain, scenario, downstream task, agent, or task context.

[0293] S511, the agent execution device calls the inference module 108 to perform inference on the query request using the language model corresponding to the query request, and obtains the model generation result.

[0294] S512, the agent execution device determines whether the operation node is the last operation node of the agent. If yes, then execute S514; otherwise, return to execute S504.

[0295] Specifically, each agent can use a Directed Acyclic Graph (DAG) structure to manage the execution order of operation nodes, ensuring the dependencies and order of operation node execution. Based on this, the agent's execution device can identify whether the current operation node is the agent's last operation node according to the DAG.
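The DAG-based ordering and the last-node check can be sketched with the standard library `graphlib` module (Python 3.9 or later). The node identifiers and dependencies are hypothetical, and treating "last" as "no other node depends on it" is an assumed interpretation.

```python
from graphlib import TopologicalSorter

# Hypothetical operation nodes; each key maps to the nodes it depends on.
DEPENDENCIES = {
    "n1": set(),
    "n2": {"n1"},
    "n3": {"n1"},
    "n4": {"n2", "n3"},
}

def execution_order(dependencies: dict) -> list:
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())

def is_last_node(node: str, dependencies: dict) -> bool:
    """A node is treated as last when no other node depends on it."""
    return all(node not in predecessors for predecessors in dependencies.values())

order = execution_order(DEPENDENCIES)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which gives a simple validity check on the configured graph.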

[0296] S514. The agent execution device determines whether the agent is the last agent in the SOP. If yes, execute S516; otherwise, return to execute S503.

[0297] S516, The agent execution device returns the answer obtained by the agent's execution.

[0298] After all agents have completed their tasks, the agent execution device can determine whether the user's query (or question) has been resolved, and decide whether replanning or further processing is necessary. If the question is resolved, the agent execution device can return the generated answer to the user, completing the entire multi-agent collaboration process.
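The loop formed by steps S503 to S516 can be sketched as two nested iterations, over agents and over each agent's operation nodes. This is a simplified sketch under stated assumptions: the SOP layout, the stub `select_model` and `infer` callables, and the handling of nodes that do not call a language model are all illustrative, and replanning is omitted.

```python
# Hypothetical static SOP: an ordered list of agents, each with ordered nodes.
SOP = [
    {"name": "retrieval_agent", "nodes": [{"type": "knowledge_query"}]},
    {"name": "answer_agent",
     "nodes": [{"type": "data_processing"}, {"type": "model_inference"}]},
]

def run_sop(sop, query, select_model, infer):
    """Walk the agents and their operation nodes in order and return the answer."""
    context = []   # accumulated task context (previous step results)
    answer = None
    for agent in sop:                                  # S503: load the next agent
        for node in agent["nodes"]:                    # S504: load the next operation node
            if node["type"] == "model_inference":      # S508: does the node call a model?
                model = select_model(query, context)   # S510: choose a model
                answer = infer(model, query, context)  # S511: run inference
                context.append(answer)
            else:
                # Simplification: non-model nodes only record that they ran.
                context.append(node["type"] + " done")
    return answer                                      # S516: return the answer

answer = run_sop(
    SOP,
    "What is our refund policy?",
    select_model=lambda query, context: "stub_model",
    infer=lambda model, query, context: f"{model} answered using {len(context)} context items",
)
```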

[0299] The above describes the method of this application from the perspective of application runtime. This method can also be applied to the development state to develop corresponding large-scale model applications, improving the development efficiency of complex large-scale model applications and guiding SFT model development. Multiple interaction methods are supported when developing large-scale model applications, including selection recommendation or question-and-answer recommendation. During the development of large-scale model applications, the mapping relationship between features and models can be configured. Thus, when users cold-start large-scale model applications, they can use the feature-model mapping relationship configured during development to select and switch models. Subsequently, as business data accumulates and is continuously input into the model management system 10, the model calculation module 104 of the model management system 10 can update the feature-model mapping relationship, and perform model selection and switching based on the updated mapping relationship, which better matches the user's real business scenarios.

[0300] Corresponding to the aforementioned model switching method, this application also provides a model management system 10. As shown in Figure 2, the model management system 10 includes:

[0301] The interaction module is used to receive the first query request;

[0302] The model selection module 106 is used to extract a first feature based on the first query request, and select a first model that matches the first feature from a plurality of artificial intelligence (AI) models based on the first feature extracted from the first query request, wherein each of the plurality of AI models has one or more matching features;

[0303] The inference module 108 is used to perform inference on the first query request using the first model among the plurality of AI models to obtain the first model generation result;

[0304] The interaction module is also used to provide the first model generation result on the user interface;

[0305] The interaction module is also used to receive a second query request;

[0306] The model selection module 106 is further configured to extract a second feature based on the second query request, wherein the first query request and the second query request are different query requests, and select a second model that matches the second feature from multiple AI models based on the second feature extracted from the second query request, wherein the second feature is different from the first feature, and the second model and the first model are different models among the multiple AI models;

[0307] The inference module 108 is also used to perform inference on the second query request using the second model to obtain the second model generation result;

[0308] The interaction module is also used to provide the second model generation result on the user interface.

[0309] For example, the interaction module, model selection module 106, and inference module 108 described above can be implemented in hardware, in software, or in a combination of both.

[0310] In one implementation scenario, the interaction module, model selection module 106, and inference module 108 can be applications running on computing devices, such as computing engines. These applications can also be deployed on cloud computing architectures, specifically hosted on virtualization services for user access. Virtualization services can include virtual machine (VM) services, bare metal server (BMS) services, or container services. VM services utilize virtualization technology to create a pool of virtual machine (VM) resources across multiple physical hosts, providing VMs to users on demand. BMS services create a pool of BMS resources across multiple physical hosts, providing BMS services to users on demand. Container services create a pool of container resources across multiple physical hosts, providing containers to users on demand. A VM is a simulated virtual computer, or logically a computer. A BMS is a scalable, high-performance computing service with computing performance indistinguishable from traditional physical machines, featuring secure physical isolation. Containers are a kernel virtualization technology that provides lightweight virtualization to isolate user space, processes, and resources. It should be understood that the VM service, BMS service, and container service mentioned above are merely specific examples. In practical applications, virtualization services can also include other lightweight or heavyweight virtualization services, which are not specifically limited here.

[0311] In one implementation example, the interaction module, model selection module 106, and inference module 108 may include hardware resources for deploying these modules, including at least one computing device, such as a server. Optionally, the interaction module, model selection module 106, and inference module 108 may also be devices that execute these modules, implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be implemented using a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

[0312] In some possible implementations, the model selection module 106 is also used for:

[0313] Based on the first feature extracted from the first query request, the model parameters of the first model are determined so that the first model can be used to infer the first query request; and / or,

[0314] Based on the second feature extracted from the second query request, the model parameters of the second model are determined so that the second model can be used to infer the second query request.

[0315] In some possible implementations, the inference module 108 is specifically used for:

[0316] Obtain the prompt template of the first model;

[0317] Fill in the prompt template of the first model according to the first feature to obtain a prompt;

[0318] Input the prompt into the first model for inference to obtain the first model generation result.

[0319] In some possible implementations, the model management system 10 also includes:

[0320] The model calculation module 104 is used to extract metadata of each of multiple AI models, including at least one of open interface models or self-deployed models. Then, the multiple AI models are evaluated using an evaluation set corpus to obtain the evaluation index value of each of the multiple AI models. Based on the metadata and evaluation index value of each of the multiple AI models, the multi-model optimization problem is solved to obtain one or more features matched by each of the multiple AI models.

[0321] Similar to the model selection module 106, the model calculation module 104 can be implemented in hardware or in software.

[0322] In one implementation scenario, the model calculation module 104 can be an application running on a computing device, such as a computing engine. This application can also be deployed on a cloud computing architecture, specifically hosted on virtualization services such as VMs, BMS, or containers for user access. In another implementation scenario, when implemented in hardware, the model calculation module 104 can include hardware resources for deploying the module, including at least one computing device, such as a server. Furthermore, the model calculation module 104 can also include devices implemented using ASICs or PLDs to execute the module.

[0323] In some possible implementations, the model calculation module 104 is specifically used for:

[0324] Based on the metadata and evaluation metrics of each of the multiple AI models, the multi-model optimization problem is solved using integer programming to obtain one or more features matched by each of the multiple AI models.
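To make the integer programming formulation concrete, the sketch below assigns each feature to exactly one model so that total quality is maximized under a cost budget. All numbers are invented, and exhaustive enumeration is used only as a dependency-free stand-in for an integer programming solver; it is not the solver described in this application and does not scale beyond toy sizes.

```python
from itertools import product

# Hypothetical evaluation index values per (feature, model) and per-call costs.
QUALITY = {
    "summarization": {"small": 0.80, "large": 0.90},
    "code_repair": {"small": 0.50, "large": 0.85},
}
COST = {"small": 1.0, "large": 5.0}
BUDGET = 6.0  # assumed constraint on the summed cost of the chosen models

def solve_assignment(quality, cost, budget):
    """Enumerate every feature-to-model assignment (one binary choice per pair)
    and keep the feasible assignment with the highest total quality."""
    features = list(quality)
    models = list(cost)
    best_value, best_assignment = float("-inf"), None
    for choice in product(models, repeat=len(features)):
        total_cost = sum(cost[model] for model in choice)
        if total_cost > budget:
            continue  # violates the budget constraint
        value = sum(quality[f][m] for f, m in zip(features, choice))
        if value > best_value:
            best_value, best_assignment = value, dict(zip(features, choice))
    return best_assignment

mapping = solve_assignment(QUALITY, COST, BUDGET)
```

With these sample values, the budget rules out using the large model for both features, so the large model is reserved for the feature where it improves quality the most.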

[0325] In some possible implementations, the model calculation module 104 is specifically used for:

[0326] Based on the metadata and evaluation metrics of each of the multiple AI models, the multi-model optimization problem is solved by mixed integer programming to obtain one or more features and model parameters that match each of the multiple AI models.

[0327] In some possible implementations, the decision variables solved by mixed-integer programming include combinations of the following variables:

[0328] Whether model m from provider p is selected for the k-th model call on the j-th question from the i-th user; or

[0329] The temperature parameter used for generation when model m from provider p is selected; or

[0330] The top_p parameter when model m from provider p is selected; or

[0331] The frequency penalty parameter when model m from provider p is selected; or

[0332] The presence penalty parameter when model m from provider p is selected; or

[0333] The top_logprobs parameter when model m from provider p is selected.

[0334] In some possible implementations, the optimization objective of the multi-model optimization problem is a joint optimization objective determined based on multiple single objectives, wherein the single objective includes at least one of minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio.
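One common way to combine several single objectives into a joint objective is a weighted sum, sketched below. The weighted-sum form, the weights, and the assumption that every metric has already been normalized to the range 0 to 1 are illustrative choices; the application does not state how the joint objective is constructed.

```python
# Sign convention: objectives to minimize count negatively, objectives to
# maximize count positively. Metric names are hypothetical.
SIGNS = {"total_cost": -1.0, "latency": -1.0, "quality": 1.0, "speed": 1.0}

def joint_objective(metrics: dict, weights: dict) -> float:
    """Weighted sum of normalized single objectives; higher is better."""
    return sum(weights[name] * SIGNS[name] * metrics[name] for name in weights)

weights = {"total_cost": 0.4, "latency": 0.2, "quality": 0.3, "speed": 0.1}
candidate_a = {"total_cost": 0.2, "latency": 0.3, "quality": 0.70, "speed": 0.8}
candidate_b = {"total_cost": 0.9, "latency": 0.6, "quality": 0.95, "speed": 0.4}

score_a = joint_objective(candidate_a, weights)
score_b = joint_objective(candidate_b, weights)
```

Under these weights the cheaper, faster candidate scores higher even though its quality is lower; shifting weight toward `quality` would reverse the preference, which is the kind of trade-off a configuration interface could expose.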

[0335] In some possible implementations, the interaction module is also used for:

[0336] Provides a configuration interface;

[0337] The configuration interface receives at least one of the optimization objective, decision variables, and constraints for the multi-model optimization problem.

[0338] In some possible implementations, one or more features matched by each of the plurality of AI models are used to describe one or more of the following: business, domain, agent, downstream task, task context.

[0339] In some possible implementations, the interaction module is also used for:

[0340] The system receives a trigger condition for model switching, which is used to switch to a new model in response to the second query request.

[0341] In some possible implementations, the triggering conditions for the model switching include:

[0342] The multiple AI models undergo version updates, or a new AI model is added; or

[0343] The performance, cost, or functionality of the multiple AI models changes; or

[0344] At least one of the features matched by the multiple AI models changes; or

[0345] A model switching instruction is received.

[0346] In some possible implementations, the model selection module 106 is further configured to:

[0347] When the quality of the result generated by the second model does not meet the conditions, based on the second feature extracted from the second query request, a third model is selected from multiple AI models, the third model being different from the second model;

[0348] The inference module 108 is also used to perform inference on the second query request using the third model to obtain the third model generation result;

[0349] The interaction module is also used to provide the third model generation result on the user interface.

[0350] This application also provides a computing device 600. As shown in Figure 6, the computing device 600 includes: a bus 602, a processor 604, a memory 606, and a communication interface 608. The processor 604, the memory 606, and the communication interface 608 communicate with each other via the bus 602. The computing device 600 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 600.

[0351] Bus 602 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 6, but this does not imply that there is only one bus or one type of bus. Bus 602 can include pathways for transmitting information between various components of computing device 600 (e.g., memory 606, processor 604, communication interface 608).

[0352] Processor 604 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

[0353] The memory 606 may include volatile memory, such as random access memory (RAM). The memory 606 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 606 stores executable program code, which the processor 604 executes to implement the aforementioned model switching method. Specifically, the memory 606 stores the instructions used by the model management system 10 to execute the model switching method. For example, the memory 606 may store instructions for implementing the functions of the interaction module, model calculation module 104, model selection module 106, and inference module 108. Further, the memory 606 may also store instructions for implementing the functions of the model information acquisition module 102.

[0354] The communication interface 608 uses transceiver modules, such as, but not limited to, network interface cards and transceivers, to enable communication between the computing device 600 and other devices or communication networks.

[0355] This application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.

[0356] As shown in Figure 7, the computing device cluster includes at least one computing device 600. The memory 606 of one or more computing devices 600 in the computing device cluster may store the same instructions from the model management system 10 for executing model switching methods.

[0357] In some possible implementations, one or more computing devices 600 in the computing device cluster can also be used to execute some of the instructions used by the model management system 10 to execute the model switching method. In other words, a combination of one or more computing devices 600 can jointly execute the instructions used by the model management system 10 to execute the model switching method.

[0358] It should be noted that the memory 606 in different computing devices 600 in the computing device cluster can store different instructions for executing some functions of the model management system 10.

[0359] Figure 8 illustrates one possible implementation. As shown in Figure 8, two computing devices 600A and 600B are connected via a communication interface 608. The memory in computing device 600A stores instructions for executing the functions of the interaction module. The memory in computing device 600B stores instructions for executing the functions of the model selection module 106 and the inference module 108. Furthermore, the memory in computing device 600A may also store instructions for executing the functions of the model information acquisition module 102, and the memory in computing device 600B may also store instructions for executing the functions of the model calculation module 104. In other words, the memory 606 of computing devices 600A and 600B jointly stores the instructions used by the model management system 10 to execute the model switching method.

[0360] The connection method between the computing devices shown in Figure 8 takes into account that the model switching method provided in this application requires substantial resources to select the AI model used for inference and to perform inference on the query request using the selected AI model. Therefore, the functions implemented by the model selection module 106 and the inference module 108 are considered to be executed by an independent computing device, such as computing device 600B.

[0361] It should be understood that the functions of computing device 600A shown in Figure 8 can also be performed by multiple computing devices 600. Similarly, the functions of computing device 600B can also be performed by multiple computing devices 600.
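As a purely illustrative, non-limiting sketch of the division of instructions described for Figure 8, the placement can be written as a table that maps each computing device to the modules whose instructions its memory stores. The identifiers REQUIRED_MODULES, PLACEMENT, missing_modules, and devices_for are hypothetical names introduced here and do not appear in this application; the check only confirms that the devices jointly store instructions for every module.

```python
# Hypothetical placement table mirroring Figure 8. Device IDs and module
# names are labels chosen for this sketch, not identifiers from the application.
REQUIRED_MODULES = {
    "interaction",
    "model_information_acquisition",
    "model_calculation",
    "model_selection",
    "inference",
}

PLACEMENT = {
    "600A": {"interaction", "model_information_acquisition"},
    "600B": {"model_selection", "inference", "model_calculation"},
}


def missing_modules(placement, required):
    """Return the modules that no device in the cluster stores instructions for."""
    stored = set().union(*placement.values())
    return required - stored


def devices_for(module, placement):
    """Return the sorted IDs of devices that store instructions for a module."""
    return sorted(dev for dev, mods in placement.items() if module in mods)


if __name__ == "__main__":
    # An empty set means the devices jointly cover the whole method.
    print(missing_modules(PLACEMENT, REQUIRED_MODULES))
    print(devices_for("inference", PLACEMENT))
```

Because a module may appear under more than one device ID, the same table can also express the case in which the functions of computing device 600A or 600B are performed by multiple computing devices 600.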

[0362] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN), a local area network (LAN), or the like. Figure 9 illustrates one possible implementation. As shown in Figure 9, two computing devices 600C and 600D are connected via a network. Specifically, each computing device is connected to the network through its communication interface. In this possible implementation, the memory 606 in computing device 600C stores instructions for executing the functions of the interaction module, while the memory 606 in computing device 600D stores instructions for executing the functions of the model selection module 106 and the inference module 108.

[0363] The connection method between the computing devices shown in Figure 9 takes into account that the model switching method provided in this application requires substantial resources to select the AI model used for inference and to respond to the query request through the AI model. Therefore, the functions implemented by the model selection module 106 and the inference module 108 are considered to be executed by an independent computing device, such as computing device 600D.

[0364] It should be understood that the functions of computing device 600C shown in Figure 9 can also be performed by multiple computing devices 600. Similarly, the functions of computing device 600D can also be performed by multiple computing devices 600.

[0365] An embodiment of this application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device can store data, or a data storage device, such as a data center, containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to execute the model switching method described above as applied to the model management system 10.

[0366] This application also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any usable medium. When the computer program product runs on at least one computing device, it causes the at least one computing device to perform the model switching method described above.

[0367] This application also provides a computing chip, such as a graphics processing unit (GPU), a tensor processing unit (TPU), a deep learning processing unit (DPU), or another AI-related processor, to run AI models and/or to execute the model switching method of this application and/or to implement the model management system of this application.

[0368] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.

Claims

1. A model switching method, characterized in that the method comprises: receiving a first query request, and extracting a first feature based on the first query request; selecting, based on the first feature extracted from the first query request, a first model matching the first feature from multiple artificial intelligence (AI) models, wherein each of the multiple AI models has one or more matched features; performing inference on the first query request using the first model among the multiple AI models to obtain a first model generation result; providing the first model generation result on a user interface; receiving a second query request, and extracting a second feature based on the second query request, wherein the first query request and the second query request are different query requests; selecting, based on the second feature extracted from the second query request, a second model matching the second feature from the multiple AI models, wherein the second feature is different from the first feature, and the second model and the first model are different models among the multiple AI models; performing inference on the second query request using the second model to obtain a second model generation result; and providing the second model generation result on the user interface.
2. The method according to claim 1, characterized in that the method further comprises: determining, based on the first feature extracted from the first query request, model parameters of the first model, so that the first model is used to perform inference on the first query request; and/or determining, based on the second feature extracted from the second query request, model parameters of the second model, so that the second model is used to perform inference on the second query request.
3. The method according to claim 1 or 2, characterized in that performing inference on the first query request using the first model among the multiple AI models to obtain the first model generation result comprises: obtaining a prompt template of the first model; performing prompt filling based on the first feature and the prompt template of the first model to obtain a prompt; and inputting the prompt into the first model for inference to obtain the first model generation result.
4. The method according to any one of claims 1 to 3, characterized in that the method further comprises: extracting metadata of each of the multiple AI models, wherein the multiple AI models include at least one of open interface models or self-deployed models; evaluating the multiple AI models using an evaluation set corpus to obtain an evaluation index value of each of the multiple AI models; and solving a multi-model optimization problem based on the metadata of each of the multiple AI models and the evaluation index value of each of the multiple AI models, to obtain the one or more features matched by each of the multiple AI models.
5. The method according to claim 4, characterized in that solving the multi-model optimization problem based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features matched by each of the multiple AI models, comprises: solving the multi-model optimization problem by integer programming based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features matched by each of the multiple AI models.
6. The method according to claim 4, characterized in that solving the multi-model optimization problem based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features matched by each of the multiple AI models, comprises: solving the multi-model optimization problem by mixed integer programming based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features and the model parameters matched by each of the multiple AI models.
7. The method according to claim 6, characterized in that the decision variables solved by mixed integer programming include a combination of the following variables: whether model m of provider p is selected in the k-th model call for the j-th question of the i-th user; or the generation temperature parameter when model m of provider p is selected; or the top_p parameter when model m of provider p is selected; or the frequency penalty parameter when model m of provider p is selected; or the presence penalty parameter when model m of provider p is selected; or the top_logprobs parameter when model m of provider p is selected.
8. The method according to any one of claims 4 to 7, characterized in that the optimization objective of the multi-model optimization problem is a joint optimization objective determined based on multiple single objectives, and the single objectives include at least one of: minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio.
9. The method according to any one of claims 4 to 8, characterized in that the method further comprises: providing a configuration interface; and receiving, through the configuration interface, at least one of the optimization objective, decision variables, and constraints of the multi-model optimization problem.
10. The method according to any one of claims 1 to 9, characterized in that the one or more features matched by each of the multiple AI models are used to describe one or more of the following: business, domain, agent, downstream task, and task context.
11. The method according to any one of claims 1 to 10, characterized in that the method further comprises: receiving a trigger condition for model switching, wherein the trigger condition is used to switch to a new model in response to the second query request.
12. The method according to claim 11, characterized in that the trigger condition for model switching includes: the multiple AI models have undergone a version update or a new AI model has been added; or the performance, cost, or functionality of the multiple AI models has changed; or at least one of the multiple AI models has a feature that it matches; or a model switching instruction has been received.
13. The method according to any one of claims 1 to 12, characterized in that the method further comprises: when the quality of the second model generation result does not meet a condition, selecting a third model from the multiple AI models based on the second feature extracted from the second query request, wherein the third model is different from the second model; performing inference on the second query request using the third model to obtain a third model generation result; and providing the third model generation result on the user interface.
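For illustration only, and not as the claimed implementation, the flow of claims 1, 3, and 13 can be sketched as follows. The keyword-based extract_feature, the ModelEntry record, the quality_ok check, and the stub generate callables are all assumptions standing in for a real feature extractor, model registry, quality evaluation, and inference call. Raising an error when no untried model remains is likewise a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass
class ModelEntry:
    name: str
    matched_features: Set[str]       # features this model is matched to
    prompt_template: str             # uses the {feature} and {query} placeholders
    generate: Callable[[str], str]   # stand-in for a real inference call


def extract_feature(query: str) -> str:
    """Toy feature extractor (assumption): a keyword test, not a real classifier."""
    return "code" if "function" in query.lower() else "chat"


def select_model(feature: str, models: List[ModelEntry],
                 exclude: Iterable[str] = ()) -> Optional[ModelEntry]:
    """Return the first model matched to the feature that has not been tried yet."""
    for model in models:
        if feature in model.matched_features and model.name not in exclude:
            return model
    return None


def answer(query: str, models: List[ModelEntry],
           quality_ok: Callable[[str], bool] = lambda text: bool(text.strip())
           ) -> Tuple[str, str]:
    """Select a model by feature, fill its prompt template, and fall back on low quality."""
    feature = extract_feature(query)
    tried: Set[str] = set()
    while True:
        model = select_model(feature, models, tried)
        if model is None:
            raise LookupError(f"no remaining model matches feature {feature!r}")
        prompt = model.prompt_template.format(feature=feature, query=query)
        result = model.generate(prompt)
        if quality_ok(result):
            return model.name, result
        tried.add(model.name)  # quality condition not met: switch to another model


# Hypothetical registry: "coder-a" always returns an empty (low-quality) result.
MODELS = [
    ModelEntry("coder-a", {"code"}, "[{feature}] {query}", lambda p: ""),
    ModelEntry("coder-b", {"code"}, "[{feature}] {query}", lambda p: "def f(): ..."),
    ModelEntry("chat-a", {"chat"}, "[{feature}] {query}", lambda p: "hello"),
]

if __name__ == "__main__":
    print(answer("Write a function that adds two numbers", MODELS))
    print(answer("Hi there", MODELS))
```

In this sketch the two queries yield different features and are therefore served by different models, and the first query additionally shows the switch to a further model when the earlier result fails the quality check.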
14. A model management system, characterized in that the system comprises: an interaction module, configured to receive a first query request; a model selection module, configured to extract a first feature based on the first query request, and select, based on the first feature extracted from the first query request, a first model matching the first feature from multiple artificial intelligence (AI) models, wherein each of the multiple AI models has one or more matched features; and an inference module, configured to perform inference on the first query request using the first model among the multiple AI models to obtain a first model generation result; wherein the interaction module is further configured to provide the first model generation result on a user interface; the interaction module is further configured to receive a second query request; the model selection module is further configured to extract a second feature based on the second query request, wherein the first query request and the second query request are different query requests, and select, based on the second feature extracted from the second query request, a second model matching the second feature from the multiple AI models, wherein the second feature is different from the first feature, and the second model and the first model are different models among the multiple AI models; the inference module is further configured to perform inference on the second query request using the second model to obtain a second model generation result; and the interaction module is further configured to provide the second model generation result on the user interface.
15. The system according to claim 14, characterized in that the model selection module is further configured to: determine, based on the first feature extracted from the first query request, model parameters of the first model, so that the first model is used to perform inference on the first query request; and/or determine, based on the second feature extracted from the second query request, model parameters of the second model, so that the second model is used to perform inference on the second query request.
16. The system according to claim 14 or 15, characterized in that the inference module is specifically configured to: obtain a prompt template of the first model; perform prompt filling based on the first feature and the prompt template of the first model to obtain a prompt; and input the prompt into the first model for inference to obtain the first model generation result.
17. The system according to any one of claims 14 to 16, characterized in that the system further comprises: a model calculation module, configured to extract metadata of each of the multiple AI models, wherein the multiple AI models include at least one of open interface models or self-deployed models; evaluate the multiple AI models using an evaluation set corpus to obtain an evaluation index value of each of the multiple AI models; and solve a multi-model optimization problem based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features matched by each of the multiple AI models.
18. The system according to claim 17, characterized in that the model calculation module is specifically configured to: solve the multi-model optimization problem by integer programming based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features matched by each of the multiple AI models.
19. The system according to claim 17, characterized in that the model calculation module is specifically configured to: solve the multi-model optimization problem by mixed integer programming based on the metadata and the evaluation index value of each of the multiple AI models, to obtain the one or more features and the model parameters matched by each of the multiple AI models.
20. The system according to claim 19, characterized in that the decision variables solved by mixed integer programming include a combination of the following variables: whether model m of provider p is selected in the k-th model call for the j-th question of the i-th user; or the generation temperature parameter when model m of provider p is selected; or the top_p parameter when model m of provider p is selected; or the frequency penalty parameter when model m of provider p is selected; or the presence penalty parameter when model m of provider p is selected; or the top_logprobs parameter when model m of provider p is selected.
21. The system according to any one of claims 17 to 20, characterized in that the optimization objective of the multi-model optimization problem is a joint optimization objective determined based on multiple single objectives, and the single objectives include at least one of: minimizing total cost, minimizing response latency, maximizing generation quality, maximizing generation speed, maximizing performance-price ratio, or maximizing speed-price ratio.
22. The system according to any one of claims 17 to 21, characterized in that the interaction module is further configured to: provide a configuration interface; and receive, through the configuration interface, at least one of the optimization objective, decision variables, and constraints of the multi-model optimization problem.
23. The system according to any one of claims 14 to 22, characterized in that the one or more features matched by each of the multiple AI models are used to describe one or more of the following: business, domain, agent, downstream task, and task context.
24. The system according to any one of claims 14 to 23, characterized in that the interaction module is further configured to: receive a trigger condition for model switching, wherein the trigger condition is used to switch to a new model in response to the second query request.
25. The system according to claim 24, characterized in that the trigger condition for model switching includes: the multiple AI models have undergone a version update or a new AI model has been added; or the performance, cost, or functionality of the multiple AI models has changed; or at least one of the multiple AI models has a feature that it matches; or a model switching instruction has been received.
26. The system according to any one of claims 14 to 25, characterized in that the model selection module is further configured to: when the quality of the second model generation result does not meet a condition, select a third model from the multiple AI models based on the second feature extracted from the second query request, wherein the third model is different from the second model; the inference module is further configured to perform inference on the second query request using the third model to obtain a third model generation result; and the interaction module is further configured to provide the third model generation result on the user interface.
27. A computing device cluster, characterized in that the computing device cluster includes at least one computing device, the at least one computing device including at least one processor and at least one memory, the at least one memory storing computer-readable instructions; and the at least one processor executes the computer-readable instructions to cause the computing device cluster to perform the model switching method according to any one of claims 1 to 13.
28. A computer-readable storage medium, characterized in that the computer-readable storage medium includes computer-readable instructions, and the computer-readable instructions are used to implement the model switching method according to any one of claims 1 to 13.
29. A computer program product, characterized in that the computer program product includes computer-readable instructions, and the computer-readable instructions are used to implement the model switching method according to any one of claims 1 to 13.
30. A computing chip, characterized in that the computing chip is used to execute computer-readable instructions to implement the model switching method according to any one of claims 1 to 13.
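For illustration only, the feature-to-model matching of claims 4, 5, and 8 can be viewed as a small 0/1 program: a binary variable x[f, m] equals 1 when feature f is matched to model m, each feature is matched to exactly one model, and a joint objective trades generation quality against cost. The sketch below solves this by exhaustive enumeration rather than with an integer programming solver, which is only practical for tiny instances. The QUALITY and COST tables, the cost_weight, and the budget are invented values for the example, not data from this application.

```python
from itertools import product

# Hypothetical evaluation index values: QUALITY[model][feature] in [0, 1],
# and a hypothetical per-call cost for each model.
QUALITY = {
    "small": {"chat": 0.80, "code": 0.50},
    "large": {"chat": 0.85, "code": 0.90},
}
COST = {"small": 1.0, "large": 5.0}


def assign_features(quality, cost, features, cost_weight=0.02, budget=None):
    """Match each feature to one model, maximizing sum(quality - cost_weight * cost).

    This is the 0/1 program with x[f, m] = 1 when feature f is matched to
    model m and sum over m of x[f, m] = 1 for every f, solved here by
    enumerating every assignment instead of calling a solver.
    Returns a dict feature -> model, or None if no assignment fits the budget.
    """
    models = sorted(quality)
    best_score, best = float("-inf"), None
    for choice in product(models, repeat=len(features)):
        total_cost = sum(cost[m] for m in choice)
        if budget is not None and total_cost > budget:
            continue  # violates the optional total-cost constraint
        score = sum(quality[m][f] - cost_weight * cost[m]
                    for f, m in zip(features, choice))
        if score > best_score:
            best_score, best = score, dict(zip(features, choice))
    return best


if __name__ == "__main__":
    print(assign_features(QUALITY, COST, ["chat", "code"]))
    print(assign_features(QUALITY, COST, ["chat", "code"], budget=2.0))
```

Under these invented numbers the cheaper model is matched to the chat feature and the larger model to the code feature, while a tight budget forces both features onto the cheaper model. Adding continuous parameters such as temperature as further decision variables would turn the problem into the mixed integer form of claim 6, which this enumeration does not cover.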