A multi-agent collaboration platform and method for distributed large model orchestration

Through a multi-agent collaboration platform with distributed large model orchestration, the complex resource allocation and scheduling problems in multimodal task processing of traditional centralized architectures are solved, and efficient collaborative learning and stable service are achieved in complex dynamic network environments, meeting the real-time and reliability requirements of users.

CN121542043B · Active · Publication Date: 2026-06-12 · TIANJIN UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TIANJIN UNIV
Filing Date: 2025-11-20
Publication Date: 2026-06-12

IPC: G06F9/50; G06N3/006; G06N3/0455; G06N3/092; G06N5/04

CPC: G06F9/5027; G06F9/5061; G06F9/5083; G06N3/006; G06N3/0455; G06N3/092; G06N5/041

AI Tagging

Application Domain

Resource allocation; Artificial life

Abstract

The application provides a multi-agent collaboration platform and method for distributed large model orchestration. The platform comprises a forward task execution loop system, a multi-agent layer, a task matching module, an evaluation candidate module, and a backward model learning loop system. The forward task execution loop system is composed of a distributed environment inference unit, a task stage sequence unit, and an edge server; the backward model learning loop system is composed of an experience replay buffer, an adaptive collaboration module, and a synchronization coordination module; the multi-agent layer is composed of a decision network and an evaluation network. The application addresses the resource allocation imbalance and scheduling complexity caused by dispersed user demand and diverse task stages in distributed edge computing. Through intelligent orchestration and adaptive decision-making at task-stage granularity, it optimizes the collaborative relationship between server selection and model selection, improves the overall response speed of the system, and reduces end-to-end inference latency, so as to meet high service quality requirements.


Description

Technical Field

[0001] This invention belongs to the field of multi-agent reinforcement learning technology, and in particular relates to a multi-agent collaborative platform and method for orchestrating distributed large models.

Background Technology

[0002] With breakthroughs in 5G / 6G communication, IoT, and generative AI technologies (such as large language models and multimodal large models), the global data volume is growing exponentially. User demand for intelligent services has expanded from simple text interaction to multimodal scenarios integrating text, images, audio, and video (such as virtual human interaction, immersive meetings, and autonomous driving decision-making). These applications require the simultaneous invocation of multiple AI models (such as speech recognition, image generation, and semantic understanding) and demand millisecond-level end-to-end response times, placing extremely high demands on the real-time performance and collaborative capabilities of computing architectures.

[0003] Currently, the bottlenecks of traditional centralized architectures are becoming increasingly apparent, specifically manifested in the following ways:

[0004] Centralized computing power and network latency: Traditional cloud computing relies on remote data centers, and data needs to be transmitted through multiple network hops to the central node for processing, resulting in increased link latency. For example, video processing tasks need to upload high-definition streams to the cloud, which is susceptible to network fluctuations and makes it difficult to meet the latency requirements of ≤20ms in scenarios such as AR / VR.

[0005] Rigid resource allocation: The centralized architecture adopts a static resource allocation model, which cannot flexibly adapt to sudden traffic (such as real-time bullet screen analysis and special effects rendering in large-scale live broadcasts), and is prone to response delays or service interruptions.

[0006] Low efficiency of cross-model collaboration: Multimodal tasks require the chaining of different models (e.g., first the visual model detects objects, and then the language model generates descriptions). In a centralized architecture, data exchange between models depends on frequent IO transmission, which can easily lead to performance bottlenecks and high risk of single point of failure.

[0007] Existing technologies face several challenges in distributed environments. Heterogeneous resources are difficult to schedule: edge devices, cloud servers, and terminal devices differ significantly in computing power and storage capacity, so heterogeneous resources such as GPUs and NPUs must be coordinated dynamically to keep task parallelism efficient. Data synchronization between distributed nodes (such as model parameter updates) may lead to inconsistent state, while strong consistency protocols (such as distributed consensus algorithms) increase communication overhead and hurt real-time performance. Highly differentiated user needs (such as customized AI assistants) require private data to be processed on local edge nodes while the global model is updated at the same time; traditional architectures struggle to balance personalization and global performance.

[0008] Addressing the stringent requirements of specific industry scenarios presents a significant technical challenge for those skilled in the art. In the Industrial Internet of Things (IIoT): predictive maintenance of equipment requires real-time analysis of sensor video and vibration data; delays in central cloud response could lead to production accidents. In smart healthcare: remote surgical assistance systems must simultaneously process endoscopic video streams, voice commands, and medical record text, demanding end-to-end latency of less than 10ms for multimodal model collaboration and 100% service reliability. In metaverse interaction: real-time motion capture and rendering in virtual scenes require combining computer vision and physics engines; traditional architectures struggle to support low-latency interaction with thousands of concurrent users.

[0009] The current technological paradigm has shifted from "single-model cloud processing" to "multi-modal edge cloud collaboration". However, the existing computing architecture is still limited by resource fragmentation, scheduling lag and inefficient cross-modal collaboration. There is an urgent need to build a new generation of computing power foundation that takes into account real-time performance, elasticity and security through innovations such as unified management of heterogeneous computing, dynamic task partitioning and lightweight cross-modal protocols.

Summary of the Invention

[0010] This invention aims to address the problems of resource allocation imbalance and scheduling complexity caused by dispersed user demands and diverse task stages in distributed edge computing. By implementing intelligent orchestration and adaptive decision-making at the task stage level, the collaborative relationship between server selection and model selection is optimized, improving the overall system response speed and reducing end-to-end inference latency, thereby meeting high standards for service quality and reliability.

[0011] Furthermore, this invention addresses the limitations of multi-model collaboration in large-scale, multi-user, and complex multi-stage tasks by proposing a technical approach that maintains robust learning and efficient collaboration under uncertain and dynamic network conditions. By constructing an adaptive update mechanism with measurable uncertainty and a coordination optimization strategy oriented towards inter-stage impacts, this invention aims to alleviate the bottlenecks of existing technologies in personalized services and resource optimization, providing users with a more stable, faster, and scalable distributed intelligent service experience.

[0012] This invention employs the following technical solution: a multi-agent collaborative platform for distributed large-scale model orchestration. The platform comprises: a forward task execution loop system, a multi-agent layer, a task matching module, an evaluation candidate module, and a backward model learning loop system. The forward task execution loop system consists of a distributed environment inference unit, a task stage sequence unit, and an edge server. The backward model learning loop system consists of an experience replay buffer, an adaptive collaboration module, and a synchronization coordination module. The multi-agent layer consists of a decision network and an evaluation network. Wherein:

[0013] The forward task execution loop system processes the input tasks into a phase sequence and unifies the quality and latency metrics of the phased tasks.

[0014] Based on the task state information of the current stage, the multi-agent layer computes a joint distribution over the feasible domain, ranks the candidates by combined communication, computation, and quality indicators, and outputs the selection probabilities of the preferred and the equivalent alternative computing servers or models.

[0015] The task matching module sorts the selection probabilities of stage tasks with servers or models to obtain the most feasible server or model path for the current stage task.

[0016] The backward model learning loop system optimizes the multi-agent layer selection strategy by updating the policy parameters of the multi-agent layer through an adaptive proximal policy optimization algorithm.

[0017] Furthermore, the process of constructing the forward task execution loop system includes:

[0018] The forward task execution loop system uses a topology graph $G = (V, E)$ to describe computing and communication in a unified way;

[0019] where the node set $V$ includes edge servers, necessary gateways, and user-side devices, and the edge set $E$ denotes the available data transmission links;

[0020] The forward task execution loop system constructs an execution solution set for accomplishing the task, where a deployment indicator $d_{s,m}$ is defined for each server-model pair: $d_{s,m} = 1$ if and only if model $m$ has been correctly deployed on server $s$ and is currently available, that is, the preconditions for operation are met; if the model is not deployed or is temporarily unavailable, then $d_{s,m} = 0$;

[0021] The forward task execution loop system verifies reachability based on the bandwidth $B$ and the data volume $D$, and expresses queue stability by the constraint $\lambda_k < \mu_{s,m}$; together these implement the "server-model" feasible region mask, which is applied explicitly at the policy output; where $\lambda_k$ is the arrival intensity of task stage $k$ and $\mu_{s,m}$ is the service intensity of server $s$ running model $m$.
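The three feasibility checks above (deployment, reachability, queue stability) can be illustrated with a minimal sketch. All names here (`feasible_mask`, `max_transfer_s`, the server and model labels) are hypothetical, and the reachability rule "transfer time within a deadline" is an assumption; the source only states that reachability is verified from bandwidth and data volume.

```python
def feasible_mask(servers, models, deployed, bandwidth_mbps, data_mbit,
                  max_transfer_s, arrival_rate, service_rate):
    """Return {(server, model): 0 or 1} for a single task stage."""
    mask = {}
    for s in servers:
        bw = bandwidth_mbps.get(s, 0.0)
        # Reachability (assumed rule): the stage data must transfer in time.
        reachable = bw > 0 and data_mbit / bw <= max_transfer_s
        for m in models:
            is_deployed = (s, m) in deployed
            # Queue stability: arrival rate strictly below service rate.
            stable = arrival_rate < service_rate.get((s, m), 0.0)
            mask[(s, m)] = int(is_deployed and reachable and stable)
    return mask


# Hypothetical example data for one stage.
servers = ["edge-a", "edge-b"]
models = ["asr-small", "asr-large"]
deployed = {("edge-a", "asr-small"), ("edge-a", "asr-large"),
            ("edge-b", "asr-small")}
bandwidth_mbps = {"edge-a": 100.0, "edge-b": 10.0}
service_rate = {("edge-a", "asr-small"): 20.0,
                ("edge-a", "asr-large"): 4.0,
                ("edge-b", "asr-small"): 20.0}

mask = feasible_mask(servers, models, deployed, bandwidth_mbps,
                     data_mbit=40.0, max_transfer_s=1.0,
                     arrival_rate=5.0, service_rate=service_rate)
```

In this example only one pair survives: the large model on `edge-a` fails the stability check, and `edge-b` fails reachability because its link is too slow for the data volume.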

[0022] Furthermore, the forward task execution loop system processes the input tasks according to a phase sequence and unifies the quality and latency metrics of the phased tasks; including:

[0023] Decompose the complex task into ordered stages $k = 1, \dots, K$ to obtain the stage tasks; for each stage $k$, the stage task specifies the output data volume $D_k$, the arrival rate $\lambda_k$, the deployed model set, and the stage-permitted model set $\mathcal{M}_k$;

[0024] Take the intersection of the deployed model set and the stage-permitted model set to generate the candidate pairs $(s, m)$ of server $s$ and model $m$;

[0025] The quality-delay of the stage task is weighted and optimized using the probabilistic decision variables according to the following formula:

[0026] ,

[0027] where the formula combines the sum of the quality scores of the stage tasks with the total delay of the stage tasks; probabilistic decision variables are also introduced, each subject to its own constraint;

[0028] The probabilistic decision variables are used to keep the arrival rate of the output task queue of the forward task execution loop system below the execution rate, which guarantees that the queue is stable.

[0029] Candidates must also satisfy the model type matching requirement of the stage; every candidate remaining after this step is executable at the current moment, which reduces the cost of subsequent correction.
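Candidate generation and the quality-delay trade-off can be sketched as follows. The formula itself is not legible in this text, so the linear form `w_quality * quality - w_delay * delay`, the weights, and all function and variable names are illustrative assumptions rather than the patent's definition.

```python
def candidate_pairs(deployed_on, permitted_models):
    """Intersect each server's deployed models with the stage-permitted set."""
    return sorted(
        (server, model)
        for server, models in deployed_on.items()
        for model in models & permitted_models
    )


def expected_utility(x, quality, delay, w_quality, w_delay):
    """Expected weighted quality-minus-delay under selection probabilities x."""
    if abs(sum(x.values()) - 1.0) > 1e-9:
        raise ValueError("selection probabilities must sum to 1")
    return sum(p * (w_quality * quality[pair] - w_delay * delay[pair])
               for pair, p in x.items())


# Hypothetical stage: object detection, two servers.
deployed_on = {"edge-a": {"det-s", "det-l"}, "edge-b": {"det-s", "cap-s"}}
pairs = candidate_pairs(deployed_on, permitted_models={"det-s", "det-l"})

x = {("edge-a", "det-l"): 0.5, ("edge-a", "det-s"): 0.25,
     ("edge-b", "det-s"): 0.25}
quality = {("edge-a", "det-l"): 0.9, ("edge-a", "det-s"): 0.7,
           ("edge-b", "det-s"): 0.7}
delay = {("edge-a", "det-l"): 0.2, ("edge-a", "det-s"): 0.1,
         ("edge-b", "det-s"): 0.3}  # seconds

utility = expected_utility(x, quality, delay, w_quality=1.0, w_delay=2.0)
```

The captioning model `cap-s` is dropped by the intersection because it is not permitted for this stage, which mirrors the type-matching filter described above.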

[0030] Furthermore, the process by which the multi-agent layer, based on the current task state information, computes the joint distribution over the feasible domain, ranks the candidates by combined communication, computation, and quality indicators, and computes the selection probabilities of the preferred and the equivalent alternative computing servers or models includes:

[0031] For the agent of each stage $k$, compute the marginal distributions under the current state $o_k$: the probability of selecting server $s$, i.e., $\pi_k^{S}(s \mid o_k)$, and the probability of selecting model $m$, i.e., $\pi_k^{M}(m \mid o_k)$;

[0032] The first server-model pairwise score is obtained from the agent's marginal distributions using the following formula:

[0033] $p_k(s, m) = \pi_k^{S}(s \mid o_k)\,\pi_k^{M}(m \mid o_k)$;

[0034] The second server-model pairwise score is obtained by multiplying the first score by the feasible region mask. The feasible region mask is: $F_k(s, m) \in \{0, 1\}$;

[0035] The optimized server and model strategy is obtained by normalizing the second server-model pairwise scores according to the following formula:

[0036] $\tilde{\pi}_k(s, m) = \dfrac{p_k(s, m)\,F_k(s, m)}{\sum_{s'} \sum_{m'} p_k(s', m')\,F_k(s', m')}$;

[0037] where $p_k(s, m)$ is the original preference, i.e., the unconstrained joint probability given by the policy network, reflecting the agent's subjective inclination toward the combination $(s, m)$; $F_k(s, m)$ is the feasible domain mask of stage $k$, equal to 1 when the combination $(s, m)$ is permitted in engineering practice and 0 when it is not; the denominator $\sum_{s'} \sum_{m'} p_k(s', m')\,F_k(s', m')$ is the weighted sum over all combinations of servers $s'$ and models $m'$ and serves as the normalization factor, so that only feasible combinations are accumulated;

[0038] The optimized server and model strategy is verified through the probabilistic decision variables.
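The masking and renormalization step can be sketched in a few lines. This is a minimal illustration that assumes the unconstrained joint preference is the product of the server and model marginals; the function name and example values are hypothetical.

```python
def masked_joint_policy(p_server, p_model, mask):
    """Combine marginals, zero out infeasible pairs, and renormalize."""
    # Unconstrained joint preference (assumed: product of the marginals).
    raw = {(s, m): ps * pm
           for s, ps in p_server.items()
           for m, pm in p_model.items()}
    # Apply the 0/1 feasibility mask; missing pairs count as infeasible.
    masked = {pair: score * mask.get(pair, 0) for pair, score in raw.items()}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("no feasible server-model pair for this stage")
    return {pair: value / total for pair, value in masked.items()}


p_server = {"a": 0.6, "b": 0.4}
p_model = {"x": 0.5, "y": 0.5}
mask = {("a", "x"): 1, ("a", "y"): 0, ("b", "x"): 1, ("b", "y"): 1}

policy = masked_joint_policy(p_server, p_model, mask)
best_pair = max(policy, key=policy.get)
```

Because the infeasible pair receives exactly zero probability before normalization, the agent can never sample it, while the relative preferences among the feasible pairs are preserved.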

[0039] Furthermore, the task matching module sorts the selection probabilities of stage tasks with servers or models to obtain the most feasible server or model path for the current stage task; including:

[0040] The optimized server and model strategy is corrected online along the path, taking into account the margin of the target link and the load of the target server:

[0041] Equivalent replacements are made within the candidate set using adjacent paths or nodes with similar capabilities;

[0042] Without changing the stage task type, the optimized server and model strategy is lightly downgraded: a model whose computational intensity is currently available is selected so as to restore the stability condition;

[0043] The number of retries is limited; if the threshold is exceeded, an upstream backoff or bypass strategy is triggered to correct the error.

[0044] The reasons for selection, the correction trajectory, and key operational indicators are written into the operational platform for subsequent learning and updates.
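The correction steps above can be read as a bounded fallback procedure. The sketch below is one possible interpretation, not the patented procedure: the order of the two passes, the `lighter_model` lookup table, and the event names in the log are all assumptions introduced for illustration.

```python
def correct_path(ranked_pairs, is_stable, lighter_model, max_retries=3):
    """Pick a runnable (server, model) pair with a bounded number of attempts."""
    log = []       # (event, server, model) records kept for later learning
    attempts = 0

    # Pass 1: preferred pair first, then equivalent replacements in rank order.
    for server, model in ranked_pairs:
        if attempts >= max_retries:
            break
        attempts += 1
        if is_stable(server, model):
            log.append(("selected", server, model))
            return (server, model), log
        log.append(("rejected", server, model))

    # Pass 2: light downgrade, same stage task type but a lighter model.
    for server, model in ranked_pairs:
        if attempts >= max_retries:
            break
        lighter = lighter_model.get(model)
        if lighter is None:
            continue
        attempts += 1
        if is_stable(server, lighter):
            log.append(("downgraded", server, lighter))
            return (server, lighter), log
        log.append(("rejected", server, lighter))

    # Retry budget exhausted: hand over to the backoff or bypass strategy.
    log.append(("backoff", None, None))
    return None, log


stable_now = {("edge-b", "det-small")}
ranked = [("edge-a", "det-large"), ("edge-b", "det-large")]
lighter = {"det-large": "det-small"}

choice, log = correct_path(ranked, lambda s, m: (s, m) in stable_now,
                           lighter, max_retries=4)
gave_up, _ = correct_path(ranked, lambda s, m: (s, m) in stable_now,
                          lighter, max_retries=2)
```

With four attempts allowed, the sketch falls back to the lighter model on `edge-b`; with only two, it exhausts the budget and returns `None`, which is where an upstream backoff or bypass would take over.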

[0045] Furthermore, the backward model learning loop system optimizes the multi-agent layer selection policy by updating the policy parameters of the multi-agent layer through an adaptive proximal policy optimization algorithm, including:

[0046] The matching between the server and the model is optimized by introducing the following cross-stage coordination model:

[0047] ;

[0048] where the terms denote, in order: the coordination advantage of the stage; the local advantage; the marginal impact of the stage's current decision on downstream stages, obtained by counterfactual estimation; the influence weight; and the coordination strength;

[0049] Server and model adaptation is stabilized using the current clipping threshold;

[0050]

[0051] where the terms denote, in order: the adaptive step size; the uncertainty; the environmental change; the peer consistency; the weights; and the Sigmoid function;

[0052] The server and model are adaptively stabilized using the clipping threshold according to the following formula;

[0053] ;

[0054] where the terms denote, in order: the current clipping threshold; its initial value; the shrinkage coefficient; and the variance of the parameter fluctuation over the most recent steps;

[0055] The updated gradients are compressed and aggregated asynchronously, and the new parameters are then returned to the multi-agent layer, completing one closed loop.
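The three update ingredients named above (coordination advantage, adaptive step size, adaptive clipping threshold) can be illustrated with a small sketch. The formulas are not legible in this text, so every functional form below is an assumption chosen only to match the verbal definitions: an additive advantage correction, a Sigmoid-gated step size, and a clipping threshold that shrinks as recent parameter variance grows. Only `ppo_clipped_term` follows the standard PPO clipped surrogate.

```python
import math
from statistics import pvariance


def coordination_advantage(local_adv, downstream_effects, weights, beta):
    """Assumed form: local advantage plus weighted downstream impact."""
    return local_adv + beta * sum(
        w * d for w, d in zip(weights, downstream_effects))


def adaptive_step(base_step, uncertainty, env_change, consistency,
                  w=(1.0, 1.0, 1.0)):
    """Assumed form: Sigmoid gate that shrinks the step under uncertainty."""
    z = w[2] * consistency - w[0] * uncertainty - w[1] * env_change
    return base_step / (1.0 + math.exp(-z))


def clip_threshold(eps0, kappa, recent_params):
    """Assumed form: threshold shrinks as recent parameter variance grows."""
    return eps0 / (1.0 + kappa * pvariance(recent_params))


def ppo_clipped_term(ratio, advantage, eps):
    """Standard PPO clipped surrogate for one sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


adv = coordination_advantage(1.0, [0.5, -0.25], [0.5, 0.5], beta=0.4)
eps = clip_threshold(eps0=0.2, kappa=1.0, recent_params=[0.0, 2.0])
step = adaptive_step(1e-3, uncertainty=0.0, env_change=0.0, consistency=0.0)
term = ppo_clipped_term(ratio=1.5, advantage=adv, eps=eps)
```

The design intent being illustrated is that noisy or fast-changing conditions lead to smaller, more tightly clipped policy updates, while stable conditions allow larger ones.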

[0056] This invention also provides a multi-agent cooperation method for orchestrating distributed large models, comprising the following steps:

[0057] S1. The input tasks are arranged into a phase sequence with unified quality and latency metrics; where:

[0058] 101. Decompose the complex task into ordered stages $k = 1, \dots, K$ to obtain the stage tasks; for each stage $k$, the stage task specifies the output data volume $D_k$, the arrival rate $\lambda_k$, the deployed model set, and the stage-permitted model set $\mathcal{M}_k$;

[0059] 102. Take the intersection of the deployed model set and the stage-permitted model set to generate the candidate pairs $(s, m)$ of server $s$ and model $m$;

[0060] The quality-delay of the stage task is weighted and optimized using the probabilistic decision variables according to the following formula:

[0061] ,

[0062] where the formula combines the sum of the quality scores of the stage tasks with the total delay of the stage tasks; probabilistic decision variables are also introduced, each subject to its own constraint;

[0063] 103. The probabilistic decision variables are used to keep the arrival rate of the output task queue of the forward task execution loop system below the execution rate, which guarantees that the queue is stable.

[0064] 104. Candidates must also satisfy the model type matching requirement of the stage; every candidate remaining after this step is executable at the current moment, which reduces the cost of subsequent correction.

[0065] S2. Based on the current task state information, compute the joint distribution over the feasible domain, rank the candidates by combined communication, computation, and quality indicators, and compute the selection probabilities of the preferred and the equivalent alternative computing servers or models; where:

[0066] 201. For the agent of each stage $k$, compute the marginal distributions under the current state $o_k$: the probability of selecting server $s$, i.e., $\pi_k^{S}(s \mid o_k)$, and the probability of selecting model $m$, i.e., $\pi_k^{M}(m \mid o_k)$;

[0067] 202. Based on the agent's marginal distributions, obtain the first server-model pairwise score using the following formula:

[0068] $p_k(s, m) = \pi_k^{S}(s \mid o_k)\,\pi_k^{M}(m \mid o_k)$;

[0069] The second server-model pairwise score is obtained by multiplying the first score by the feasible region mask. The feasible region mask is: $F_k(s, m) \in \{0, 1\}$;

[0070] The optimized server and model strategy is obtained by normalizing the second server-model pairwise scores according to the following formula:

[0071] $\tilde{\pi}_k(s, m) = \dfrac{p_k(s, m)\,F_k(s, m)}{\sum_{s'} \sum_{m'} p_k(s', m')\,F_k(s', m')}$;

[0072] where $p_k(s, m)$ is the original preference, i.e., the unconstrained joint probability given by the policy network, reflecting the agent's subjective inclination toward the combination $(s, m)$; $F_k(s, m)$ is the feasible domain mask of stage $k$, equal to 1 when the combination $(s, m)$ is permitted in engineering practice and 0 when it is not; the denominator $\sum_{s'} \sum_{m'} p_k(s', m')\,F_k(s', m')$ is the weighted sum over all combinations of servers $s'$ and models $m'$ and serves as the normalization factor, so that only feasible combinations are accumulated;

[0073] The optimized server and model strategy is verified through the probabilistic decision variables;

[0074] S3. Sort the stage tasks and the selection probabilities of servers or models to obtain the most feasible server or model path for this stage task.

[0075] S4. Optimize the multi-agent layer selection strategy by updating the policy parameters of the multi-agent layer using an adaptive proximal policy optimization algorithm; where:

[0076] 401. The matching between the server and the model is optimized by introducing the following cross-stage coordination model:

[0077] ;

[0078] where the terms denote, in order: the coordination advantage of the stage; the local advantage; the marginal impact of the stage's current decision on downstream stages, obtained by counterfactual estimation; the influence weight; and the coordination strength;

[0079] 402. Adjust the step size with the following value-adaptive factor so that the learning direction of the current-stage agent stays consistent with that of the agents of the other stages as a whole;

[0080] ;

[0081] where the terms denote, in order: the adaptive step size; the uncertainty; the environmental change; the consistency indicator; the weights; and the Sigmoid function;

[0082] ;

[0083] where the terms denote, in order: the adaptive step size; the uncertainty; the environmental change; the peer consistency; the weights; and the Sigmoid function;

[0084] 403. Adaptively stabilize the server and model using the clipping threshold given by the following formula;

[0085] ;

[0086] where the terms denote, in order: the current clipping threshold; its initial value; the shrinkage coefficient; and the variance of the parameter fluctuation over the most recent steps.

[0087] Beneficial effects

[0088] This invention implements intelligent orchestration and resource scheduling for multi-stage, multi-modal tasks in a distributed computing environment, achieving significant technical benefits in terms of scalability, personalized services, latency and communication costs, stability, and generalization capabilities, as detailed below:

[0089] 1. Scalability and Environmental Adaptability: A distributed decision-making architecture of "task phase - multi-agent" is adopted. While each agent perceives computing / bandwidth / deployment constraints locally, global optimization is achieved through lightweight coordination. This mechanism maintains stable convergence and efficient inference under different scales and topologies such as 25, 51, and 101 nodes, verifying its scalability and deployment adaptability in large-scale heterogeneous clusters.

[0090] 2. Personalized and intelligent task allocation: A unified model quality scoring framework and a QoS trade-off function (balancing inference quality and end-to-end latency) are introduced to make "server selection - model selection" decisions at stage granularity. The strategy can be adjusted dynamically according to user preference weights and task characteristics, which significantly improves the accuracy of task-model matching and the diversity of services compared with static or single-objective methods.

[0091] 3. End-to-end latency and throughput optimization: By jointly optimizing the communication and computation links through value-adaptive PPO updates, action masks, and constraint penalties, the system maintains low latency and high success rate even under high concurrency. In comparative experiments, the task success rate reached a maximum of 97.0% (cross-method comprehensive evaluation), the average response time was reduced by 24.2% compared to the random and heuristic baselines, and further reduced by 6.5% compared to the state-of-the-art reinforcement learning baseline (MAPPO). The average communication overhead was reduced by 21.2%, maintaining stable service quality even under bandwidth-constrained or unstable link conditions.

[0092] 4. Stability and benefits of collaborative decision-making: The proposed Coordination Advantage Function (CAF) quantifies the impact between adjacent stages through counterfactual inference, which prevents decisions from settling into local optima or hindering one another and improves cross-stage consistency. Combined with adaptive clipping and policy enhancement weights, oscillation during training is significantly reduced, the reward distribution during inference is more concentrated with lower variance, and the system is more robust under complex topologies and load fluctuations.

[0093] 5. Communication cost and synchronization overhead control: By combining gradient compression and prioritized experience sampling in multi-agent asynchronous parameter synchronization, the volume of gradients transmitted across nodes and the frequency of synchronization are significantly reduced. While convergence is ensured, the network usage of distributed training / inference is effectively suppressed, and overall resource utilization is improved.

[0094] 6. Load balancing and node health improvement: In multi-topology evaluation, compared with common baselines (MAPPO, greedy and random), this invention achieves better results in the "server usage diversity" index, which is reflected in the relief of pressure on hot nodes and the increase in the activation rate of idle nodes, thereby reducing the resource waste caused by the coexistence of overload and long-term idle time, and helping to extend the stable operation cycle of edge nodes.

[0095] 7. Zero-shot cross-environment generalization capability: Under the setting of "training-test" network environment separation, the present invention can still maintain a task success rate of 98.0% on unseen topologies, indicating that the proposed value adaptive learning paradigm has good cross-domain transfer capability, which can reduce the retraining cost and deployment time in new environments.

[0096] In summary, this invention can significantly reduce end-to-end latency and communication overhead while maintaining a high task success rate, and achieve stable and transferable intelligent orchestration effects in complex, dynamic and heterogeneous distributed environments, providing engineerable high-performance foundational capabilities for mobile and edge intelligent applications.

Attached Figure Description

[0097] Figure 1 is a schematic diagram of the structure of a multi-agent collaborative platform for distributed large model orchestration according to the present invention.

[0098] Figure 2 is a flowchart of a multi-agent collaborative platform for distributed large model orchestration according to the present invention.

[0099] Figure 3 shows the convergence of the training rewards of the present invention over one million training steps on three network topologies (S1: 25 nodes; S2: 51 nodes; S3: 101 nodes).

[0100] Figure 4 is a comparison of the Actor network loss of the present invention.

[0101] Figure 5 is a comparison of the Critic network loss of the present invention.

[0102] Figure 6 shows the adaptive learning rate scheduling strategy used during training on the three network topologies.

[0103] Figure 7 shows the statistical distribution of the inference rewards of the present invention on the three network topologies over 200 test rounds.

[0104] Figure 8 is a communication efficiency analysis diagram of the present invention.

[0105] Figure 9 shows the average reward of the present invention across five model configurations.

[0106] Figure 10 shows the task completion rate of the different ablation schemes of the present invention.

[0107] Figure 11 is a comprehensive comparison of the performance of the different algorithms.

Detailed Implementation

[0108] The present invention is described in detail below with reference to Figures 1 to 11:

[0109] This invention enables the efficient completion of complex AI inference tasks (such as video analysis and language processing) in a distributed environment composed of multiple edge servers (such as cloud computing nodes and mobile base stations). The system dynamically selects the most suitable server and model components for each task stage through collaborative learning among multiple agents, ultimately optimizing overall efficiency (such as reducing latency and resource consumption) while ensuring Quality of Service (QoS). The entire process can be divided into two main loops: forward task execution and backward model learning.

[0110] As shown in Figure 1, this invention organizes data and control flows in a closed loop: "task orchestration -> parallel decision-making -> online execution -> collaborative learning -> robust deployment." Solid lines represent data transmission and decision distribution at runtime, while dashed lines represent gradient backpropagation and parameter synchronization during training. The process first screens candidate solutions for feasibility (deployment availability, type matching, connectivity and capacity, load stability). On this basis, multiple agents make joint server and model decisions in parallel. Before deployment, rapid verification and, where necessary, equivalent replacement or light downgrading are applied to eliminate the risk of instantaneous fluctuations. At runtime, the chosen solution is executed, and metrics such as latency, quality, success rate, and communication overhead are collected. During training, the strategy is updated using coordination-aware advantage signals and adaptive stabilization mechanisms, and robust deployment is ultimately achieved through gradual rollout and automatic rollback. Specifically:

[0111] The platform includes: a forward task execution loop system, a multi-agent layer, a task matching module, an evaluation candidate module, and a backward model learning loop system. The forward task execution loop system consists of a distributed environment inference unit, a task stage sequence unit, and an edge server. The backward model learning loop system consists of an experience replay buffer, an adaptive collaboration module, and a synchronization coordination module. The multi-agent layer consists of a decision network and an evaluation network. The task flow (to the right) is: task entry -> agent decision -> execution -> obtaining reward / output result. The learning flow (downward and then leftward) is: storing experience -> sampling -> PPO algorithm calculation update -> parameter server synchronization -> updating agent policy.
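The learning flow described above (store experience, sample, update, synchronize) starts from an experience replay buffer. The sketch below shows only that first component, as a minimal bounded buffer with uniform sampling; the class name, the transition layout, and the uniform sampling rule are assumptions, and the PPO update and parameter-server synchronization are deliberately left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self._items.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw up to batch_size distinct transitions uniformly at random."""
        items = list(self._items)
        return rng.sample(items, min(batch_size, len(items)))

    def contents(self):
        return list(self._items)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    # Hypothetical transition: the action is a (server, model) pair.
    buffer.store(state=i, action=("edge-a", "det-s"),
                 reward=float(i), next_state=i + 1)

batch = buffer.sample(2, rng=random.Random(0))
```

After five stores into a buffer of capacity three, only the three most recent transitions remain, so sampling always reflects recent behavior of the agents.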

[0112] The forward task execution loop system processes input tasks into a stage sequence and unifies the quality and latency metrics of the stage tasks; it then constructs an execution solution set for accomplishing the task, where a deployment indicator $d_{s,m}$ is defined for each server-model pair: $d_{s,m} = 1$ if and only if model $m$ has been correctly deployed on server $s$ and is currently available, that is, the preconditions for operation are met; if the model is not deployed or is temporarily unavailable, then $d_{s,m} = 0$;

[0113] The forward task execution loop system verifies reachability based on the bandwidth $B$ and the data volume $D$, and expresses queue stability by the constraint $\lambda_k < \mu_{s,m}$; together these implement the "server-model" feasible region mask, which is applied explicitly at the policy output; where $\lambda_k$ is the arrival intensity of task stage $k$ and $\mu_{s,m}$ is the service intensity of server $s$ running model $m$.

[0114] Based on the current task state information, the multi-agent layer computes a joint distribution over the feasible domain, ranks the computing servers or models by combined communication, computation, and quality indicators, and selects the preferred and the equivalent alternative computing servers or models according to their selection probabilities.

[0115] The task matching module sorts the stage tasks with the selection probabilities of servers or models to obtain the most feasible server or model path for the current stage task.

[0116] The process by which the forward task execution loop system arranges input tasks into a stage sequence and unifies the quality and latency metrics of the stage tasks includes:

[0117] Decompose the complex task into ordered stages $k = 1, \dots, K$ to obtain the stage tasks; for each stage $k$, the stage task specifies the output data volume $D_k$, the arrival rate $\lambda_k$, the deployed model set, and the stage-permitted model set $\mathcal{M}_k$;

[0118] Take the intersection of the deployed model set and the stage-permitted model set to generate the candidate pairs $(s, m)$ of server $s$ and model $m$;

[0119] The quality-delay of the stage task is weighted and optimized using the probabilistic decision variables according to the following formula:

[0120] ,

[0121] where: one term represents the aggregate quality score of the stage tasks and the other represents the total delay of the stage tasks; probabilistic decision variables are introduced at the same time, each subject to its own constraint;

[0122] The probabilistic decision variables constrain task allocation so that the arrival rate of the output task queue of the forward task execution loop system stays below the execution rate, which keeps the queue stable.

[0123] The model type matching requirement is then applied; every candidate remaining after this step is executable at the current moment, which reduces the cost of subsequent correction.
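The candidate generation and queue-stability filtering of paragraphs [0117] to [0123] can be illustrated with a minimal Python sketch. The names `Stage`, `candidate_pairs`, `deployed`, and `service_rate` are hypothetical and do not appear in the specification; the stability test assumes the simple condition that a stage's arrival rate must be strictly below the service rate of the chosen server-model pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    # Hypothetical container for one stage task (names are assumptions).
    name: str
    arrival_rate: float          # tasks per second arriving at this stage
    permitted_models: frozenset  # models permitted for this stage's task type


def candidate_pairs(stage, deployed, service_rate):
    """Return (server, model) pairs that are deployed, permitted, and queue-stable.

    deployed: dict server -> set of models currently deployed and available
    service_rate: dict (server, model) -> tasks per second the pair can serve
    """
    pairs = []
    for server, models in deployed.items():
        # Intersection of the deployed set and the stage-permitted set.
        for model in sorted(models & stage.permitted_models):
            # Assumed stability condition: arrival rate below service rate.
            if stage.arrival_rate < service_rate.get((server, model), 0.0):
                pairs.append((server, model))
    return pairs


stage = Stage("detect", arrival_rate=4.0, permitted_models=frozenset({"A", "B"}))
deployed = {"edge1": {"A", "C"}, "edge2": {"B"}}
service_rate = {("edge1", "A"): 10.0, ("edge2", "B"): 3.0}
print(candidate_pairs(stage, deployed, service_rate))  # [('edge1', 'A')]
```

In this example, model C is dropped because it is not permitted for the stage, and the pair (edge2, B) is dropped because its service rate is below the arrival rate.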

[0124] The process by which the multi-agent layer, based on the current task state information, forms the joint distribution within the feasible region, ranks candidates by combined communication, computation, and quality indicators, and computes the selection probabilities of the preferred and equivalent alternative computing servers or models includes:

[0125] For the agent of each stage, compute the marginal distributions in the current state: the probability of selecting a server in the current state, and the probability of selecting a model in the current state;

[0126] The first pairwise score of a server and a model is obtained from the agent's marginal distributions using the following formula:

[0127] ;

[0128] The second pairwise score of the server and model is obtained by multiplying the first pairwise score by the feasible region mask. The feasible region mask is: ;

[0129] The optimized server and model strategy is obtained by normalizing the second pairwise score of the server and model according to the following formula:

[0130] ;

[0131] where the formula uses: the original preference, that is, the unconstrained joint probability given by the policy network, which reflects the agent's own inclination towards the combination; the feasible region mask of the stage, whose value is 1 when the combination is permitted in engineering practice and 0 when it is not; and a denominator that is the weighted sum over all server and model combinations and serves as the normalization factor, so that only feasible combinations are accumulated;

[0132] The optimized server and model strategy is verified against the probabilistic decision variables.
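The masking and normalization described in paragraphs [0125] to [0131] can be sketched as follows. This is only an illustration: the specification's formulas are not legible in this text, so the sketch assumes the first pairwise score is the product of the two marginals, and the function name `masked_joint_policy` and the dictionary-based data layout are hypothetical.

```python
def masked_joint_policy(p_server, p_model, mask):
    """Combine marginals into a joint distribution restricted to feasible pairs.

    p_server: dict server -> selection probability
    p_model:  dict model -> selection probability
    mask:     dict (server, model) -> 1 if feasible, 0 otherwise
              (pairs missing from the dict are treated as infeasible)
    """
    # First pairwise score: product of marginals (an assumed form).
    score = {(s, m): ps * pm
             for s, ps in p_server.items()
             for m, pm in p_model.items()}
    # Second pairwise score: first score multiplied by the feasible region mask.
    masked = {pair: value * mask.get(pair, 0) for pair, value in score.items()}
    # Normalization factor: only feasible combinations contribute.
    total = sum(masked.values())
    if total == 0:
        raise ValueError("no feasible server-model pair")
    return {pair: value / total for pair, value in masked.items()}


p_server = {"edge1": 0.75, "edge2": 0.25}
p_model = {"A": 0.5, "B": 0.5}
mask = {("edge1", "A"): 1, ("edge2", "B"): 1}
policy = masked_joint_policy(p_server, p_model, mask)
print(policy[("edge1", "A")], policy[("edge2", "B")])  # 0.75 0.25
```

Infeasible pairs receive probability 0, and the remaining probability mass is redistributed over feasible pairs in proportion to the policy network's original preference.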

[0133] The process by which the task matching module ranks the stage tasks against the selection probabilities of servers or models to obtain the most feasible server or model path for the current stage task includes:

[0134] Online path correction is applied to the optimized server and model strategy according to the target link margin and the target server load:

[0135] Equivalent replacement is performed within the candidate set using adjacent paths or nodes with similar capabilities;

[0136] Without changing the stage task type, the optimized server and model strategy is lightly downgraded, and a model whose computational intensity is available is selected to restore the stability condition;

[0137] The system performs a limited number of retries. If the threshold is exceeded, an upstream backoff or bypass strategy is triggered to correct the error.

[0138] The reasons for selection, the correction trajectory, and key operational indicators are written into the operational platform for subsequent learning and updates.
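The correction procedure of paragraphs [0134] to [0138] (bounded retries, fallback to an equivalent candidate, and a recorded correction trajectory) can be sketched as below. The function `correct_path`, its parameters, and the `try_execute` callback are hypothetical; the specification does not state the retry limit or the order in which alternatives are tried, so walking a ranked list is an assumption.

```python
def correct_path(ranked, try_execute, max_retries=2):
    """Try ranked (server, model) candidates with a bounded retry budget each.

    ranked:      list of (server, model) pairs, most preferred first
    try_execute: callable (server, model) -> bool, True on successful execution
    Returns (chosen pair or None, trace of attempts for later learning).
    """
    trace = []
    for pair in ranked:
        for attempt in range(1, max_retries + 1):
            ok = try_execute(*pair)
            trace.append((pair, attempt, ok))  # correction trajectory
            if ok:
                return pair, trace
        # Retry budget spent: move on to the next equivalent candidate.
    # No candidate succeeded: the caller would trigger upstream backoff or bypass.
    return None, trace


ranked = [("edge1", "A"), ("edge2", "A")]
chosen, trace = correct_path(ranked, lambda server, model: server != "edge1")
print(chosen, len(trace))  # ('edge2', 'A') 3
```

The returned trace holds the pair, attempt number, and outcome of every attempt, which corresponds to the correction trajectory that the text says is written to the operational platform.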

[0139] The backward model learning loop system optimizes the selection policy of the multi-agent layer by updating the policy parameters of the multi-agent layer through an adaptive proximal policy optimization algorithm, including:

[0140] The matching between the server and the model is optimized by introducing the following cross-stage coordination model:

[0141] ;

[0142] where the quantities in the formula are: the coordination advantage of the stage; the local advantage; the marginal impact of the stage's current decision on a downstream stage, obtained by counterfactual estimation; the influence weight; and the coordination strength;

[0143] The step size is adjusted by the following value-adaptive factor so that the learning direction of the agent of the current stage remains consistent with that of the other stages as a whole;

[0144]

[0145] where the quantities in the formula are: the adaptive step size; the uncertainty; the degree of environmental change; the peer consistency; the weights; and the Sigmoid function;

[0146] The adaptation of the server and model is stabilized using the current clipping threshold;

[0147] ;

[0148] where the quantities in the formula are: the current clipping threshold; its initial value; the shrinkage coefficient; and the variance of parameter fluctuation over the most recent steps;

[0149] The updated gradients are compressed and asynchronously aggregated, and the new parameters are returned to the multi-agent layer, completing one closed loop.
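As an illustration of the cross-stage coordination model of paragraphs [0140] to [0142], the sketch below combines a local advantage with weighted downstream impacts. The additive form, the difference-based counterfactual estimate, and all names (`counterfactual_impact`, `coordination_advantage`) are assumptions made for this example; the specification's own formula is not legible in this text.

```python
def counterfactual_impact(value_with_decision, value_with_baseline):
    """Marginal impact on a downstream stage: outcome under the actual decision
    minus outcome under a baseline decision (an assumed estimator)."""
    return value_with_decision - value_with_baseline


def coordination_advantage(local_adv, impacts, weights, strength):
    """Assumed form: local advantage plus strength times the weighted impacts."""
    if len(impacts) != len(weights):
        raise ValueError("one influence weight is needed per downstream impact")
    return local_adv + strength * sum(w * d for w, d in zip(weights, impacts))


impacts = [counterfactual_impact(1.5, 1.0), counterfactual_impact(0.75, 1.0)]
adv = coordination_advantage(1.0, impacts, weights=[0.5, 0.5], strength=2.0)
print(adv)  # 1.25
```

With a coordination strength of 0 the result reduces to the local advantage, so the strength parameter controls how much a stage's update accounts for its effect on later stages.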

[0150] As shown in Figure 2, the present invention also employs the following process:

[0151] Task flow: the forward task execution loop. This loop describes the "work path" of a task from input to completion.

[0152] Step 1: Task Input: Tasks enter the distributed inference environment in the form of a "task stage sequence". This means that a task may be broken down into multiple stages that must be executed sequentially (e.g., object detection first, then image recognition, and finally result output). Different AI models or model components are stored on the server.

[0153] 101. Decompose the complex task into ordered stages to obtain stage tasks, where each stage is given its output data volume, its arrival rate, the set of deployed models, and the set of models permitted for the stage;

[0154] 102. Intersect the set of deployed models with the set of models permitted for the stage to generate candidate pairs;

[0155] The quality-delay of the stage task is weighted and optimized using the probabilistic decision variables according to the following formula:

[0156] ,

[0157] where: one term represents the aggregate quality score of the stage tasks and the other represents the total delay of the stage tasks; probabilistic decision variables are introduced at the same time, each subject to its own constraint;

[0158] 103. The probabilistic decision variables constrain task allocation so that the arrival rate of the output task queue of the forward task execution loop system stays below the execution rate, which guarantees the stability of the output task queue.

[0159] 104. The model type matching requirement is then applied; every candidate remaining after this step is executable at the current moment, which reduces the cost of subsequent correction.

[0160] Step 2: Agent Decision-Making Process: The current task state information (State s_t) is sent to the "Multi-Agent Layer". Each agent (Agent[i]) in the layer works based on the Actor-Critic algorithm structure. Actor: responsible for making decisions based on the current state, i.e., choosing which server and which model component ((π_s,π_m)) to use. Critic: responsible for evaluating how good the decision made by the Actor is (outputting a value function). Figure labels: Multi-Agent Layer; Agent[i] (Actor-Critic), an agent using the Actor-Critic algorithm; State s_t, the environment state at time step t; Server & Model Selection; Part 1-4 / Model AD, the model fragments or complete models available.

[0161] 201. For the agent of each stage, compute the marginal distributions in the current state: the probability of selecting a server in the current state, and the probability of selecting a model in the current state;

[0162] 202. Based on the agent's marginal distributions, obtain the first pairwise score of a server and a model using the following formula:

[0163] ;

[0164] The second pairwise score of the server and model is obtained by multiplying the first pairwise score by the feasible region mask. The feasible region mask is: ;

[0165] The optimized server and model strategy is obtained by normalizing the second pairwise score of the server and model according to the following formula:

[0166] ;

[0167] The optimized server and model strategy is verified against the probabilistic decision variables;

[0168] Step 3: Execution and Feedback: Based on the agent's selection, the task is executed on the specified server and model. Upon completion, two results are generated: an output and a reward. Figure labels: Output; Reward; QoS, penalties; Latency.

[0169] Output: The result of the inference task. Reward signal: The environment generates a reward based on the task's performance (e.g., latency, QoS). Poor performance (e.g., timeout) may also incur penalties. This reward signal is crucial for the agent's learning.
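A reward signal of the kind described above (based on latency and QoS, with a penalty on timeout) could be sketched as follows. The linear form, the weights, the deadline, and the penalty value are all illustrative assumptions; the specification does not give the reward formula in this text, and the name `stage_reward` is hypothetical.

```python
def stage_reward(quality, latency_s, deadline_s,
                 w_quality=1.0, w_latency=1.0, timeout_penalty=5.0):
    """Assumed reward: weighted quality minus weighted latency,
    with an extra penalty when the stage misses its deadline."""
    reward = w_quality * quality - w_latency * latency_s
    if latency_s > deadline_s:
        reward -= timeout_penalty  # poor performance (timeout) is penalized
    return reward


print(stage_reward(quality=0.5, latency_s=0.25, deadline_s=1.0))  # 0.25
print(stage_reward(quality=0.5, latency_s=2.0, deadline_s=1.0))   # -6.5
```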

[0170] Learning flow: The backward model learning loop, also known as adaptive PPO update.

[0171] Step 4: Experience Storage Process: The records (state, decision, reward, new state) generated by the agent in each decision step are called "experiences". These experiences are stored uniformly in the Experience Buffer.

[0172] Step 5: Experience Sampling Process: During learning, the system randomly samples a small batch of historical experience data from the buffer. This breaks the correlation between consecutive data points, making the learning process more stable.
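Steps 4 and 5 can be illustrated with a minimal experience buffer that stores (state, decision, reward, new state) records and returns uniformly sampled mini-batches. The class name `ReplayBuffer`, the fixed capacity, and the uniform sampling are assumptions for this sketch.

```python
import random
from collections import deque, namedtuple

# One experience record: state, decision (action), reward, new state.
Experience = namedtuple("Experience", "state action reward next_state")


class ReplayBuffer:
    def __init__(self, capacity, seed=None):
        # A bounded deque drops the oldest experience once capacity is reached.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._items.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


buf = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buf.store(state=step, action="edge1/A", reward=float(step), next_state=step + 1)
batch = buf.sample(2)
print(len(buf), len(batch))  # 3 2
```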

[0173] Step 6: Adaptive PPO Update (VAMAPPO Core): The sampled data batches are sent to the Adaptive PPO Update module. The role of PPO: the PPO algorithm uses this data to calculate how to update the agent's decision network (Actor) and evaluation network (Critic), aiming to maximize the agent's future cumulative reward. "Adaptive" and "VAMAPPO": the PPO used here is an improved version, potentially incorporating advanced features such as gradient compression and multi-agent collaborative optimization to adapt to distributed environments. Gradient compression and synchronization: the calculated model updates (compressed gradients) are sent to the Parameter Server (Grad sync). The Parameter Server aggregates updates from all agents, synchronizes the global model parameters, and ensures co-evolution among all agents. Figure labels: Experience Buffer, the experience replay buffer; Sample batch; Adaptive PPO Update (VAMAPPO Core), the adaptive proximal policy optimization update; Compressed gradients; ParameterServer (Grad sync), the parameter server performing gradient synchronization; θ_μ Update, the update of the policy parameters θ_μ (i.e., the Actor network parameters of the agent);

[0174] 401. The matching between the server and the model is optimized by introducing the following cross-stage coordination model:

[0175] ;

[0176] where the quantities in the formula are: the coordination advantage of the stage; the local advantage; the marginal impact of the stage's current decision on a downstream stage, obtained by counterfactual estimation; the influence weight; and the coordination strength;

[0177] 402. The step size is adjusted using the following value-adaptive factor, thereby updating the learning intensity and the peer consistency;

[0178]

[0179] where the quantities in the formula are: the adaptive step size; the uncertainty; the degree of environmental change; the peer consistency; the weights; and the Sigmoid function;

[0180] 403. The adaptation of the server and model is stabilized using the following formula for the clipping threshold;

[0181] ;

[0182] where the quantities in the formula are: the current clipping threshold; its initial value; the shrinkage coefficient; and the variance of parameter fluctuation over the most recent steps.
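Steps 402 and 403 can be illustrated by the sketch below. The specification's formulas are not legible in this text, so both functional forms are assumptions: the step size is taken as a base rate scaled by a Sigmoid of a weighted sum of uncertainty, environmental change, and peer consistency, and the clipping threshold is taken as the initial value divided by one plus the shrinkage coefficient times the recent parameter-fluctuation variance. Only `clipped_surrogate` follows the standard PPO clipped objective for a single sample. All function names are hypothetical.

```python
import math
from statistics import pvariance


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_step(base_lr, uncertainty, env_change, consistency, weights):
    """Assumed form: base step scaled by a Sigmoid of the weighted indicators.
    Signs of the weights are chosen by the caller."""
    w_u, w_e, w_c = weights
    return base_lr * sigmoid(w_u * uncertainty + w_e * env_change + w_c * consistency)


def adaptive_clip(eps0, shrink, recent_param_deltas):
    """Assumed form: the threshold shrinks as recent parameter fluctuation grows."""
    var = pvariance(recent_param_deltas) if len(recent_param_deltas) > 1 else 0.0
    return eps0 / (1.0 + shrink * var)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


lr = adaptive_step(0.01, 0.0, 0.0, 0.0, weights=(-1.0, -1.0, 1.0))
eps = adaptive_clip(0.2, shrink=1.0, recent_param_deltas=[0.0, 2.0])
print(lr, eps)  # 0.005 0.1
print(clipped_surrogate(ratio=1.5, advantage=2.0, eps=0.2))
```

In this sketch a larger fluctuation variance narrows the clipping interval, which limits how far a single update can move the policy when recent updates have been unstable.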

[0183] Step 7: Policy Update Process: The updated parameters (θ_μ Update) are sent back to the "multi-agent layer". Thus, all agents now possess a "smarter" new policy, enabling them to make better decisions in the next round of task execution.
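The gradient compression and parameter-server synchronization of Step 6, together with the parameter return of Step 7, can be sketched as follows. The specification does not name a compression scheme; top-k sparsification is used here purely as an assumed example, and `compress_top_k` and `ParameterServer` are hypothetical names.

```python
def compress_top_k(grad, k):
    """Keep the k largest-magnitude gradient entries as {index: value}.
    Top-k sparsification is an assumed compression scheme."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


class ParameterServer:
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def push(self, sparse_grad):
        # Apply each agent's compressed gradient as it arrives (asynchronous style).
        for i, g in sparse_grad.items():
            self.params[i] -= self.lr * g

    def pull(self):
        # Agents fetch the synchronized parameters for the next round.
        return list(self.params)


server = ParameterServer(params=[0.0, 0.0, 0.0], lr=0.5)
server.push(compress_top_k([0.5, -2.0, 0.25], k=1))  # agent 1 sends {1: -2.0}
server.push(compress_top_k([1.0, 0.25, 0.0], k=1))   # agent 2 sends {0: 1.0}
print(server.pull())  # [-0.5, 1.0, 0.0]
```

Each agent transmits only one of three gradient entries in this example, which is the kind of reduction in communication volume that gradient compression is intended to provide.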

[0184] This end-to-end mechanism ensures operational feasibility and stability while continuously approaching the global optimum through adaptive and cross-stage collaboration. It can steadily improve task success rate, shorten response latency, and reduce communication overhead under conditions of heterogeneous resources and network fluctuations.

[0185] Figure 3 illustrates the training convergence characteristics of the VAMAPPO framework of this invention under different network topologies. The performance changes are presented in three stages: initial exploration, rapid improvement, and steady-state convergence. The horizontal axis represents the number of training steps, and the vertical axis represents the average reward per round. The legend corresponds to three types of topologies: S1 (25 nodes), S2 (51 nodes), and S3 (101 nodes). The comparison in Figure 3 shows that curve S1 rises rapidly from an initial low value and then plateaus (from approximately 2.32 to approximately 6.10), reflecting sample efficiency and early learning ability. S2 eventually reaches a higher steady-state level (approximately 6.34), indicating that the upper limit of the return is better under medium-scale conditions with better connectivity. S3 has a slightly lower convergence value (approximately 5.76) but the smallest fluctuation, indicating that the strategy performs more robustly in large-scale heterogeneous environments. Overall, this set of curves reflects that, under the combined effect of feasible region control, coordination advantage, and adaptive clipping mechanisms, the strategy can achieve stable convergence under various scales and connectivity patterns, providing a reusable parameter baseline for subsequent online deployment and indicating that convergence and transferability can still be maintained in complex topologies.

[0186] Figure 4 illustrates a comparison of Actor (policy network) losses under the three network topologies. The curves show a trend of continuous decline followed by stable convergence, which reflects the quality of policy learning: the S2 curve eventually converges to approximately 0.022, while S1 and S3 stabilize at approximately 0.023 and 0.024, respectively. Among these, S3 exhibits the smoothest descent trajectory and the smallest oscillation, demonstrating robust updates even in large-scale heterogeneous environments. These results indicate that VAMAPPO's feasible region control and coordination advantage effectively suppress policy divergence and ensure repeatable convergence across multiple topologies.

[0187] Figure 5 shows the comparison of Critic (value network) losses under the same three topologies. All three curves converge to the same order of magnitude (approximately 0.020) in the later stages of training, indicating the consistency and reliability of value estimation; S3 has the fastest convergence speed and the smoothest curve, followed by S1 and S2. Taken together, Figure 4 and Figure 5 show that both the policy side and the value side achieve stable descent and low-variance convergence, providing parameter guarantees for decision stability and sample efficiency during subsequent online inference.

[0188] Figure 6 shows the evolution of the adaptive learning rate for the three types of topologies. The curves generally exhibit a "fast at first, then stable" rhythm: a larger step size is maintained in the early stages to accelerate exploration and increase returns, while automatic contraction occurs in the mid-to-late stages to reduce policy jitter. The learning rate curve for S3 is the smoothest, without obvious jagged edges; S2 contracts more sensitively in the middle, which helps to quickly cross the loss basin; S1 enters steady state slightly earlier. These results demonstrate that adaptive clipping and step size adjustment work together to effectively balance exploration efficiency and convergence stability.

[0189] Figure 7 shows the reward distributions for the three topologies during the testing phase (200 rounds). S3 has the highest mean (approximately 32.64), the smallest variance, and the fewest outliers, demonstrating consistency and robustness under large-scale topologies; S2 has a slightly lower mean (approximately 32.48) but its fluctuations are manageable; S1 has the lowest mean (approximately 32.22) and a more dispersed distribution, indicating that the reward is more sensitive to environmental perturbations when resources or connectivity are limited. This distribution comparison verifies the cross-scale generalization stability of VAMAPPO.

[0190] Figure 8 presents a communication efficiency analysis of the three types of topologies (subgraphs S1, S2, and S3), with the horizontal axis representing communication overhead and the vertical axis representing reward. All three show a significant negative correlation: lower communication overhead leads to higher reward. Compared to S1, S2 has a more concentrated point cloud and improved efficiency (approximately 7.1%), indicating that under better connectivity, multi-objective optimization is more likely to find a cost-effective solution combining "near-end computation + short-path transmission." S3 falls between the two, but exhibits more high-reward points in low-overhead regions, showing that large-scale topologies provide more opportunities for in-situ processing and nearest-neighbor scheduling.

[0191] Figure 9 shows the ablation experiments (average returns) of the value adaptation mechanism. The full model achieved the highest return (approximately 31.63 ± 5.63); the returns of the w/o Attention and Minimal configurations decreased significantly (approximately 2.30 ± 2.58 and 2.37 ± 3.18, respectively), indicating that multi-head attention is crucial for heterogeneous resource selection and key channel identification; the w/o Enhancement (approximately 31.04 ± 6.10) and w/o Uncertainty (approximately 30.71 ± 6.40) configurations resulted in moderate declines, suggesting that these components primarily improve sample utilization and steady-state update quality. The results demonstrate the necessity and synergistic gains of each sub-module.

[0192] Figure 10 shows the task completion rates for the above ablation configurations. The complete model achieved the highest completion rate (approximately 94.6%) with the least fluctuation; the completion rates of the w/o Attention and Minimal configurations dropped to single digits (approximately 4.8% and 5.0%, respectively), making stable delivery difficult in a dynamic distributed environment; the rates of the w/o Enhancement and w/o Uncertainty configurations decreased slightly (approximately 92.8% and 92.6%, respectively), remaining at a high level. These results indicate that attention and feasible region control determine whether stable completion is possible, while enhancement weights and uncertainty determine how good the completion is.

[0193] Figure 11 shows a comprehensive comparison with multiple baselines in new environments (success rate, response latency, communication overhead, and server usage diversity). This method achieves the highest success rate (approximately 97.0%), ahead of MAPPO (approximately 94.0%) and METAPPO (approximately 89.5%), as well as the shortest average response latency (approximately 0.257 s) and the lowest communication overhead (approximately 1.287 s), significantly outperforming heuristic and random strategies. Regarding server usage diversity, while VAMAPPO (approximately 1.071) is not as high as migration-focused algorithms, it significantly outperforms conventional baselines, balancing load distribution against end-to-end performance. These results demonstrate the portability and performance advantages of this invention in unfamiliar topologies.

[0194] In summary, this invention proposes a distributed model orchestration method based on Value Adaptive Multi-Agent Proximal Policy Optimization (VAMAPPO), aimed at the efficient execution of multi-stage, multi-modal generation tasks in a distributed environment. The system uses "task stages" as intelligent agent units, simultaneously deciding on server and model selection under a unified QoS objective; combining variational uncertainty measurement, coordinated advantage functions, and adaptive clipping mechanisms, it achieves end-to-end optimization of the communication-computation link and cross-stage collaboration. Following this overall path, this invention achieves verifiable technical breakthroughs in the following aspects:

[0195] 1. Stage-based multi-agent architecture and unified QoS-driven joint decision-making: The complex pipeline is decomposed into a parallel "stage-agent" system. A unified QoS function is used to simultaneously characterize inference quality and latency benefits, and the probability distribution of "server allocation-model selection" is directly output in the action space. With the help of constraint penalties and action masks, engineering constraints such as deployment feasibility, link capacity, and queuing stability are explicitly satisfied during the learning process, realizing full-link linkage optimization of heterogeneous computing power and network conditions.

[0196] 2. Value Adaptive Update Mechanism Driven by Variational Inference: By randomizing the policy parameters, variational inference is introduced to obtain decision uncertainty, which is then mapped together with the degree of environmental variability and agent consistency as an adaptive update factor to dynamically adjust the policy step size and learning intensity. This significantly suppresses training oscillations in non-stationary, multi-constraint environments and improves robustness and convergence efficiency during online deployment.

[0197] 3. Coordination Advantage Function and Counterfactual Impact Modeling: The Coordination Advantage Function (CAF) is innovatively constructed. Based on local advantages, a "neighboring stage impact term" based on counterfactual inference is introduced. The term is adaptively weighted according to the relationship such as the scale of stage data, which quantifies the marginal contribution of a stage decision to the returns of subsequent stages, thereby significantly improving the global optimality and interpretability of cross-stage collaboration.

[0198] 4. Communication-computation integration enables constrained optimization and low-overhead synchronization: propagation and transmission delays, queuing delays, and model quality scores are modeled simultaneously in the end-to-end objective; asynchronous parameter synchronization with gradient compression and prioritized experience sampling are adopted on the training side, which significantly reduces the network occupancy and synchronization frequency of distributed inference, and maintains a high success rate and low latency under bandwidth-limited or unstable link conditions.

[0199] 5. Stable PPO with adaptive clipping and policy augmentation weights: the clipping threshold is bound to the variance of historical parameter changes to form an adaptive clipping interval; at the same time, policy augmentation weights are constructed based on the advantage, focusing on high-value samples without violating the trust region constraint, balancing exploration and convergence speed, and improving learning efficiency and training stability in large-scale heterogeneous scenarios.

[0200] 6. Unified scoring and decision fusion for multimodal models: A unified model quality scoring framework covering text, image, and video generation (including pairwise comparison / BT models and multidimensional perception metrics) is proposed. The quality score and latency are jointly injected into the reward and policy network, supporting measurable, comparable, and transferable selection and orchestration among multiple types of models.

[0201] 7. Zero-shot cross-topology generalization and portability: Through stage-level agent reuse and value adaptation paradigm, it can maintain high task success rate and low response latency under unseen network topologies and deployment strategies, significantly reducing the cost of commissioning and retraining in new environments, demonstrating strong cross-environment generalization ability and engineering portability.

[0202] The aforementioned breakthroughs enable the present invention to achieve unified optimization of quality, latency, communication, and constraints in multimodal distributed orchestration, and to obtain quantifiable performance advantages and stable industrial deployment capabilities in complex, dynamic, and heterogeneous edge environments.

[0203] Although the present invention has been described above, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many modifications under the guidance of the present invention without departing from the spirit of the present invention, and these modifications are all within the protection scope of the present invention.

Claims

1. A multi-agent cooperative system for orchestrating distributed large models, characterized in that the system comprises: a forward task execution loop system, a multi-agent layer, a task matching module, an evaluation candidate module, and a backward model learning loop system; the forward task execution loop system consists of a distributed environment inference unit, a task stage sequence unit, and an edge server; the backward model learning loop system consists of an experience replay buffer, an adaptive collaboration module, and a synchronization coordination module; the multi-agent layer consists of a decision network and an evaluation network; wherein: the forward task execution loop system arranges the input tasks into a stage sequence and unifies the quality and latency metrics of the stage tasks; the multi-agent layer, based on the task state information of the current stage, forms the joint distribution within the feasible region, ranks candidates by combined communication, computation, and quality indicators, and computes the selection probabilities of the preferred and equivalent alternative computing servers or models; the task matching module ranks the stage tasks against the selection probabilities of servers or models to obtain the most feasible server or model path for the current stage task; and the backward model learning loop system optimizes the selection policy of the multi-agent layer by updating the policy parameters of the multi-agent layer through an adaptive proximal policy optimization algorithm.

2. The multi-agent cooperative system for orchestrating distributed large models according to claim 1, characterized in that the process of constructing the forward task execution loop system includes: the forward task execution loop system performs data computation and communication over a topology graph, in which the node set includes edge servers, necessary gateways, and user-side devices, and the edge set indicates the available data transmission links; the forward task execution loop system constructs an execution solution set for accomplishing the task, in which a server-model pair is marked as meeting the preconditions for operation if and only if the model has been correctly deployed on the server and is currently available, and is marked as not meeting them if the model is not deployed or is temporarily unavailable; the forward task execution loop system verifies reachability based on bandwidth and data volume, expresses queue stability as a constraint between the arrival intensity of the task stage and the service intensity of the server running the model, and implements the "server-model" feasible region mask, which is applied explicitly at the policy output.

3. The multi-agent cooperative system for orchestrating distributed large models according to claim 2, characterized in that the process by which the forward task execution loop system arranges the input tasks into a stage sequence and unifies the quality and latency metrics of the stage tasks includes: decomposing the complex task into ordered stages to obtain stage tasks, where each stage is given its output data volume, its arrival rate, the set of deployed models, and the set of models permitted for the stage; intersecting the set of deployed models with the set of models permitted for the stage to generate candidate pairs; weighting and optimizing the quality and latency of the stage tasks using the probabilistic decision variables according to the following formula: ; where: one term represents the aggregate quality score of the stage tasks and the other represents the total delay of the stage tasks, and the probabilistic decision variables introduced at the same time are each subject to their own constraint; constraining task allocation through the probabilistic decision variables so that the forward task execution loop system outputs a stable task queue; and applying the model type matching requirement, so that every candidate remaining after this step is executable at the current moment, which reduces the cost of subsequent correction.

4. The multi-agent cooperative system for orchestrating distributed large models according to claim 1, characterized in that the process by which the multi-agent layer, based on the current task state information, forms the joint distribution within the feasible region, ranks candidates by combined communication, computation, and quality indicators, and computes the selection probabilities of the preferred and equivalent alternative computing servers or models includes: for the agent of each stage, computing the marginal distributions in the current state, namely the probability of selecting a server in the current state and the probability of selecting a model in the current state; obtaining the first pairwise score of a server and a model from the agent's marginal distributions using the following formula: ; obtaining the second pairwise score of the server and model by multiplying the first pairwise score by the feasible region mask, where the feasible region mask is: ; obtaining the optimized server and model strategy by normalizing the second pairwise score of the server and model according to the following formula: ; where the formula uses: the original preference, that is, the unconstrained joint probability given by the policy network, which reflects the agent's own inclination towards the combination; the feasible region mask of the stage, whose value is 1 when the combination is permitted in engineering practice and 0 when it is not; and a denominator that is the weighted sum over all server and model combinations and serves as the normalization factor, so that only feasible combinations are accumulated; and verifying the optimized server and model strategy against the probabilistic decision variables.

5. The multi-agent cooperative system for orchestrating distributed large models according to claim 1, characterized in that the process by which the task matching module ranks the stage tasks against the selection probabilities of servers or models to obtain the most feasible server or model path for the current stage task includes: applying online path correction to the optimized server and model strategy according to the target link margin and the target server load, wherein: equivalent replacement is performed within the candidate set using adjacent paths or nodes with similar capabilities; without changing the stage task type, the optimized server and model strategy is lightly downgraded, and a model whose computational intensity is available is selected to restore the stability condition; a limited number of retries is performed, and if the threshold is exceeded, an upstream backoff or bypass strategy is triggered to correct the error; and the reasons for selection, the correction trajectory, and key operational indicators are written into the operational platform for subsequent learning and updates.

6. The multi-agent cooperative system for orchestrating distributed large models according to claim 1, characterized in that the backward model learning loop system optimizes the selection policy of the multi-agent layer by updating the policy parameters of the multi-agent layer through an adaptive proximal policy optimization algorithm, including: optimizing the matching between the server and the model by introducing the following cross-stage coordination model: ; where the quantities in the formula are: the coordination advantage of the stage; the local advantage; the marginal impact of the stage's current decision on a downstream stage, obtained by counterfactual estimation; the influence weight; and the coordination strength; adjusting the step size by the following value-adaptive factor so that the learning direction of the agent of the current stage remains consistent with that of the other stages as a whole: ; where the quantities in the formula are: the adaptive step size; the uncertainty; the degree of environmental change; the consistency indicator; the weights; and the Sigmoid function; stabilizing the adaptation of the server and model based on the current clipping threshold: ; where the quantities in the formula are: the current clipping threshold; its initial value; the shrinkage coefficient; and the variance of parameter fluctuation over the most recent steps; and compressing and asynchronously aggregating the updated gradients, and returning the new parameters to the multi-agent layer to complete one closed loop.

7. A multi-agent collaborative method for orchestrating distributed large models, characterized in that the method is implemented on the system according to any one of claims 1-6 and includes the following steps:
S1. The input tasks are arranged into a stage sequence with unified quality and latency metrics, wherein:

101. Decompose the complex task into ordered stages to obtain the stage tasks, where each stage task includes: the given output data volume of the stage, the arrival rate, the set of deployed models, and the set of models licensed for the stage;
102. Intersect the set of deployed models with the set of models licensed for the stage to generate candidate pairs; the quality and delay of the stage tasks are jointly optimized by weighting, using probabilistic decision variables, according to the following formula: [formula not reproduced], where the first term is the aggregate quality score over the stage tasks and the second term is the total delay over the stage tasks; the probabilistic decision variables introduced at the same time each satisfy their respective constraints [constraints not reproduced];
103. By constraining task allocation through the probabilistic decision variables, the forward task execution loop system outputs a stable task queue;
104. Candidates are matched against the model type requirement; every candidate remaining after this step is executable at the current moment, which reduces the cost of subsequent correction.
S2. Based on the current task state information, rank the feasible region using the joint distribution and the combined communication, computation, and quality indicators, and compute the selection probabilities of the preferred and the equivalent alternative computing servers or models, wherein:
201. Compute the marginal distributions of the agent of each stage under the current state, namely the probability of selecting a server in the current state and the probability of selecting a model in the current state: [formulas not reproduced];
202. Based on the marginal distributions of the agent, a first server-model pairwise score is obtained according to the following formula: [formula not reproduced]; a second server-model pairwise score is obtained by multiplying the first pairwise score by the feasible-region mask, the feasible-region mask being: [formula not reproduced]; the optimized server and model strategy is obtained by normalizing the second server-model pairwise score according to the following formula: [formula not reproduced]; the optimized server and model strategy is verified through the probabilistic decision variables.
S3. Sort the selection probabilities of the stage tasks with respect to servers or models to obtain the most feasible server or model path for the current stage task.
S4. Optimize the selection policy of the multi-agent layer by updating the policy parameters of the multi-agent layer using an adaptive proximal policy optimization algorithm, wherein:
401. The matching between servers and models is optimized by introducing the following cross-stage coordination model: [formula not reproduced]; where the terms denote, respectively: the coordination advantage of the stage; the local advantage; the marginal impact of the stage's current decision on downstream stages, obtained by counterfactual estimation; the influence weight; and the synergy strength;
402. The consistency between the learning direction of the agent and that of the other stages as a whole is adjusted through the following value-adaptive step-size factor: [formula not reproduced]; where the terms denote, respectively: the adaptive step size; the uncertainty; the environmental change; the peer consistency; the weights; and the Sigmoid function;
403. The adaptation of servers and models is stabilized through the current clipping threshold: [formula not reproduced]; where the terms denote, respectively: the current clipping threshold; its initial value; the shrinkage coefficient; and the variance of parameter fluctuation over a recent window of steps.
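Steps 102, 201-202, and S3 of the method can be sketched in Python as follows. Since the formulas are not reproduced in this text, the sketch rests on stated assumptions: the first pairwise score is taken to be the product of the server and model marginals, the feasible-region mask is a 0/1 indicator built from the candidate pairs of step 102, and normalization divides by the sum of masked scores. The function names and the server and model identifiers are hypothetical.

```python
def candidate_pairs(deployed, licensed):
    """Step 102: intersect each server's deployed models with the stage's
    licensed models. deployed maps server -> set of model names."""
    return {(s, m) for s, models in deployed.items() for m in models & licensed}


def pair_policy(p_server, p_model, feasible):
    """Steps 201-202: masked and normalized server-model selection policy."""
    scores = {}
    for s, ps in p_server.items():
        for m, pm in p_model.items():
            first = ps * pm                             # assumed first score
            mask = 1.0 if (s, m) in feasible else 0.0   # feasible-region mask
            scores[(s, m)] = first * mask               # second score
    total = sum(scores.values())
    if total == 0.0:
        return {}  # no feasible pair at this moment
    return {pair: v / total for pair, v in scores.items() if v > 0.0}


def best_path(policy):
    """Step S3: pick the pair with the highest selection probability."""
    return max(policy, key=policy.get) if policy else None


deployed = {"edge-a": {"llm-7b", "vit"}, "edge-b": {"llm-7b"}}
licensed = {"llm-7b"}
feasible = candidate_pairs(deployed, licensed)
policy = pair_policy(
    p_server={"edge-a": 0.25, "edge-b": 0.75},
    p_model={"llm-7b": 0.5, "vit": 0.5},
    feasible=feasible,
)
path = best_path(policy)
```

Because the mask zeroes out infeasible pairs before normalization, the resulting probabilities sum to one over executable pairs only, which is the property step 104 relies on to keep later correction cheap.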