A cloud-edge collaborative inference optimization method and device based on deep reinforcement learning

By using a deep reinforcement learning-based cloud-edge collaborative inference optimization method, the problems of resource waste and latency accumulation in cloud-edge collaborative inference strategies in dynamic heterogeneous network environments are solved. This method enables intelligent scheduling between heterogeneous edge nodes and the cloud, improving inference accuracy and resource utilization efficiency.

CN122198112A · Pending · Publication Date: 2026-06-12 · GUANGXI NORMAL UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GUANGXI NORMAL UNIV
Filing Date: 2026-02-14
Publication Date: 2026-06-12


IPC: G06N5/04; G06N3/092; G06N3/098; G06N7/01

AI Tagging

Application Domain

Mathematical models; Biological models




AI Technical Summary

Technical Problem

Existing cloud-edge collaborative inference strategies cannot flexibly cope with network congestion and node overload in dynamic heterogeneous edge network environments, resulting in resource waste and latency accumulation. Furthermore, they lack the ability to proactively decide on task difficulty and cannot achieve multi-objective collaborative optimization of latency, energy consumption, cloud service costs, and inference accuracy.

Method used

A cloud-edge collaborative inference optimization method based on deep reinforcement learning is adopted. A D3QN agent dynamically decides the inference path within a hybrid action space. Combined with a Markov decision process model and a composite reward function, it achieves cross-layer collaborative scheduling and multi-objective optimization. Multi-trajectory random sampling and a self-consistency verification mechanism improve the accuracy of the reliability judgment.

Benefits of technology

It avoids the latency accumulation and resource waste caused by ineffective task flow at the edge, and achieves low latency, low energy consumption, and high inference reliability in dynamic environments, giving an adaptive balance between service quality and resource overhead.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a cloud-edge collaborative inference optimization method and device based on deep reinforcement learning. The method comprises the following steps: if a detected task is a new inference task, acquiring the first global state space and the deployment status of the M-layer logic models of the cascaded inference model chain, transmitting the task to the optimal starting heterogeneous edge node, executing multi-trajectory random sampling inference there using a chain-of-thought-based input representation, generating and verifying the inference result, and triggering cross-layer collaborative scheduling if the inference result is unreliable; selecting, by the D3QN agent, the optimal action from the candidate actions; if the optimal action is the cloud server, performing inference on the cloud server, and if it is a heterogeneous edge node on which the M+1-layer logic model is deployed, transmitting the task to that node and again executing multi-trajectory random sampling inference using the chain-of-thought-based input representation. The method breaks through the limitation of fixed physical flow paths in traditional cascaded inference, enhances the flexibility of cascaded inference in dynamic environments, and optimizes overall system performance while guaranteeing task accuracy.


Description

Technical Field

[0001] This invention relates to the fields of edge computing and artificial intelligence technology, and in particular to a cloud-edge collaborative inference optimization method and apparatus based on deep reinforcement learning.

Background Technology

[0002] With the rapid development of deep learning technology, artificial intelligence applications, represented by large language models, have become widespread. To reduce transmission latency and protect user privacy, moving inference tasks from the cloud to edge networks has become an inevitable trend. Given that edge devices are limited by storage and computing power and cannot independently handle large models, cascaded inference, as an emerging paradigm that reduces computational overhead through collaboration between models of different sizes, is widely adopted. This paradigm typically chains models of different sizes in a logical hierarchy, with tasks initially processed by smaller models and only escalated to subsequent models when confidence levels are insufficient.

[0003] However, in real-world cloud-edge collaborative application scenarios, existing static cascaded inference strategies still face significant challenges. First, edge network environments are highly dynamic and heterogeneous: node hardware configurations differ widely, and real-time load and network bandwidth fluctuate sharply over time. Traditional solutions typically pre-define fixed physical execution paths and cannot detect network congestion or node overload, which easily leads to long-tail latency. Second, existing methods lack flexible cloud-edge collaboration mechanisms. For some extremely challenging inference tasks, forcing them through trial and error at the edge can cause severe resource waste and latency accumulation, because these methods lack the proactive decision-making capability to route a task directly to the cloud based on its difficulty. Furthermore, most existing optimization solutions focus on a single performance metric and ignore the complex trade-offs between latency, energy consumption, cloud service costs, and inference accuracy, making it difficult to meet diverse quality of service requirements.

[0004] Therefore, there is an urgent need for an optimization method that can perceive the environmental state and intelligently decide inference paths, so as to enhance the flexibility and efficiency of edge large-model systems in dynamic environments.

Summary of the Invention

[0005] Therefore, the present invention aims to at least partially address the shortcomings of the prior art, and proposes a cloud-edge collaborative inference optimization method and device based on deep reinforcement learning. It aims to achieve multi-objective collaborative optimization of latency, energy consumption, cloud service cost and inference accuracy in a dynamic and heterogeneous edge network environment by realizing adaptive inference path planning.

[0006] In a first aspect, the present invention provides a cloud-edge collaborative inference optimization method based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M sequentially arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. The method includes:

[0007] The detected inference task is judged. If it is a new inference task, the initial global state space and the deployment status of the M-layer logic models of the cascaded inference model chain are obtained, and the D3QN agent is used to transmit the new inference task to the optimal starting heterogeneous edge node. At that node, multi-trajectory random sampling inference is performed using a chain-of-thought-based input representation to generate an inference result, and the reliability of the result is verified. If the result is reliable, it is taken as the final inference result. If it is unreliable, cross-layer collaborative scheduling is triggered, and the current global state space and hybrid action space are constructed for the new inference task that triggered the cross-layer collaborative scheduling.

[0008] The current global state space is input into the D3QN agent and combined with the hybrid action space. The value function of the current global state space and the advantage function of each candidate action are evaluated separately, and the optimal action is selected from the candidate actions. The candidate actions in the hybrid action space include at least the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed, and the cloud server. The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on the system cost function.

[0009] If the optimal action is the cloud server, inference is performed on the cloud server to obtain the final inference result. If the optimal action is a heterogeneous edge node on which the M+1 layer logic model is deployed, the new inference task that triggered cross-layer collaborative scheduling is transmitted to that node, where it enters the M+1 layer logic model and multi-trajectory random sampling inference is performed again using the chain-of-thought-based input representation. This process is repeated until the final inference result is generated.
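The control flow of the three steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions: every name (`run_task`, `StepResult`, `pick_start`, `pick_next`, `infer_on_edge`, `infer_on_cloud`, `CLOUD`) is hypothetical, the task is assumed to start at level 1, and the fallback to the cloud when the last edge level is still unreliable is an assumption rather than something the text states.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

CLOUD = "cloud"  # hypothetical label for the cloud-server action


@dataclass
class StepResult:
    answer: str
    reliable: bool  # outcome of the reliability check on this level's result


def run_task(
    task: str,
    num_levels: int,
    pick_start: Callable[[str], str],
    pick_next: Callable[[str, int], str],
    infer_on_edge: Callable[[str, str, int], StepResult],
    infer_on_cloud: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Return the final answer and the list of nodes the task visited."""
    level = 1  # assumption: a new task starts at the first model level
    node = pick_start(task)  # stands in for the agent's start-node choice
    path = [node]
    while True:
        result = infer_on_edge(task, node, level)
        if result.reliable:
            return result.answer, path
        # Unreliable result: cross-layer collaborative scheduling is triggered.
        if level >= num_levels:
            # Assumption: with no higher edge level left, go to the cloud.
            path.append(CLOUD)
            return infer_on_cloud(task), path
        target = pick_next(task, level)  # stands in for the agent's action
        path.append(target)
        if target == CLOUD:
            return infer_on_cloud(task), path
        node, level = target, level + 1
```

The callables are injected so that the loop stays independent of how the agent, the edge models, and the cloud model are actually implemented.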

[0010] In a second aspect, the present invention provides a cloud-edge collaborative inference optimization device based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M sequentially arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. The device includes:

[0011] A judgment module, used to judge the detected inference task. If it is a new inference task, the module obtains the initial global state space and the deployment status of the M-layer logic models of the cascaded inference model chain, and uses the D3QN agent to transmit the new inference task to the optimal starting heterogeneous edge node. At that node, multi-trajectory random sampling inference is performed using a chain-of-thought-based input representation to generate an inference result, and the reliability of the result is verified. If the result is reliable, it is taken as the final inference result. If it is unreliable, cross-layer collaborative scheduling is triggered, and the current global state space and hybrid action space are constructed for the new inference task that triggered the cross-layer collaborative scheduling.

[0012] A selection module, used to input the current global state space into the D3QN agent and combine it with the hybrid action space, evaluate the value function of the current global state space and the advantage function of each candidate action separately, and select the optimal action from the candidate actions. The candidate actions in the hybrid action space include at least the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed, and the cloud server. The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on the system cost function.

[0013] An inference module, used to perform inference on the cloud server to obtain the final inference result when the optimal action is the cloud server. If the optimal action is a heterogeneous edge node on which the M+1 layer logic model is deployed, the new inference task that triggered cross-layer collaborative scheduling is transmitted to that node, where it enters the M+1 layer logic model and multi-trajectory random sampling inference is performed again using the chain-of-thought-based input representation. This process is repeated until the final inference result is generated.

[0014] In a third aspect, the present invention provides a cloud-edge collaborative inference optimization device based on deep reinforcement learning, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the cloud-edge collaborative inference optimization method based on deep reinforcement learning as described in the first aspect.

[0015] In a fourth aspect, the present invention also provides a storage medium storing a computer program which, when executed, implements the steps of the cloud-edge collaborative inference optimization method based on deep reinforcement learning as described in the first aspect.

[0016] This invention provides a cloud-edge collaborative inference optimization method and apparatus based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M incrementally arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. The method includes: judging the detected inference task; if it is a new inference task, obtaining the initial global state space and the deployment status of the M-layer logic models of the cascaded inference model chain; using a D3QN agent to transmit the new inference task to the optimal starting heterogeneous edge node; performing multi-trajectory random sampling inference at the optimal starting heterogeneous edge node using a chain-of-thought-based input representation to generate an inference result; and verifying the reliability of the inference result. If the result is reliable, it is taken as the final inference result; if it is unreliable, cross-layer collaborative scheduling is triggered, and the current global state space and hybrid action space are constructed for the new inference task that triggered cross-layer collaborative scheduling. The current global state space is input into the D3QN agent and combined with the hybrid action space. The value function of the current global state space and the advantage function of each candidate action are evaluated separately, and the optimal action is selected from the candidate actions. The candidate actions in the hybrid action space include at least the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed, and the cloud server. The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on the system cost function. If the optimal action is the cloud server, inference is performed on the cloud server to obtain the final inference result. If the optimal action is a heterogeneous edge node on which the M+1 layer logic model is deployed, the new inference task that triggered cross-layer collaborative scheduling is transmitted to that node, where it enters the M+1 layer logic model and multi-trajectory random sampling inference is performed again using the chain-of-thought-based input representation. This process is repeated until the final inference result is generated. The method provided by this invention overcomes the limitations of fixed physical flow paths in traditional cascaded inference. By constructing a direct cloud server access option in the hybrid action space, the system can schedule intelligently for high-difficulty tasks or edge network congestion while respecting the model deployment constraints of the heterogeneous edge nodes. This avoids the latency accumulation and resource waste caused by ineffective task flow at the edge. By drawing on the multi-step reasoning capability of the large model, the chain-of-thought-based multi-trajectory sampling and self-consistency verification mechanism significantly improves the judgment accuracy of intermediate layers compared with single-path inference. At the same time, a composite reward function that includes consistency gain and system cost guides the agent to actively choose the path that maximizes inference reliability while pursuing low latency and low energy consumption, achieving an adaptive balance between service quality and resource overhead. The D3QN algorithm is used to solve the complex cross-layer scheduling problem. A global state space that includes the real-time load of the entire network and the historical performance of the models gives the agent a deep perception of the dynamic network environment. By separating the value function from the advantage function, multi-objective collaborative optimization of inference latency, energy consumption, and cloud service cost in heterogeneous dynamic environments is achieved.

Brief Description of the Drawings

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the structures shown in these drawings without creative effort.

[0018] Figure 1 is a flowchart of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0019] Figure 2 is a schematic diagram of the overall logical flow of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0020] Figure 3 is a system architecture diagram of the cloud-edge collaborative inference system to which the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention is applied;

[0021] Figure 4 is a schematic diagram of inference path planning in heterogeneous dynamic scenarios under the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0022] Figure 5 is a schematic diagram of the logical structure of the D3QN agent in the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0023] Figure 6 is a sub-flowchart of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0024] Figure 7 is another flowchart of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0025] Figure 8 is a schematic diagram of another sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0026] Figure 9 is a schematic diagram of another sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0027] Figure 10 is a schematic diagram of another sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in the embodiments of this application;

[0028] Figure 11 is a schematic diagram of another sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning of the present invention;

[0029] Figure 12 is a schematic diagram of the program modules of the cloud-edge collaborative inference optimization device based on deep reinforcement learning of the present invention.

Detailed Description of the Embodiments

[0030] To make the objectives, features, and advantages of this invention more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0031] Please refer to Figure 1 and Figure 2. Figure 1 is a flowchart of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this embodiment of the application, and Figure 2 is a schematic diagram of the overall logical flow of the method in this embodiment. The method is applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M sequentially arranged logical hierarchy models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchy models. In this embodiment, the cloud-edge collaborative inference optimization method based on deep reinforcement learning includes:

[0032] Step 101: Judge the detected inference task. If it is a new inference task, obtain the initial global state space and the deployment status of the M-layer logic models of the cascaded inference model chain, and use the D3QN agent to transmit the new inference task to the optimal starting heterogeneous edge node. At that node, perform multi-trajectory random sampling inference using a chain-of-thought-based input representation to generate an inference result, and verify whether the result is reliable. If it is reliable, take it as the final inference result. If it is unreliable, trigger cross-layer collaborative scheduling and construct the current global state space and hybrid action space for the new inference task that triggered the cross-layer collaborative scheduling.

[0033] In this embodiment, please refer to Figure 3, which is a system architecture diagram of the cloud-edge collaborative inference system. The edge gateway, as the unified access point of the cloud-edge collaborative inference system, is responsible for receiving external inference tasks and performing initial scheduling. Multiple heterogeneous edge nodes form a heterogeneous edge node set. The hardware resource attributes of each heterogeneous edge node are defined as a vector whose components represent the CPU frequency, GPU frequency, and memory bus frequency, respectively; these heterogeneous attributes directly determine the computing and processing speed of the node. Each heterogeneous edge node maintains a first-in-first-out (FIFO) inference task queue, whose state at decision step t represents the backlog of computational workload at the node. Each time a scheduling decision is triggered, whether by the arrival of a new inference task or by a cross-layer scheduling trigger, a new decision step t is entered. The decision step is a temporal concept denoting the "t-th round of decision-making"; each time an inference task triggers scheduling, it corresponds to one decision step t. The cloud server, as a special node, possesses unlimited computing resources and the ability to deploy the entire model. Its queuing and inference latency are negligible, and its inference results are considered absolutely reliable, meaning its consistency score is always zero.

[0034] Simultaneously, a cascaded inference model chain consisting of M logical levels is constructed. The models at each level are arranged in strictly increasing order of reasoning ability and parameter size. Each model has two static attributes: the computational complexity per unit of data, and the compression ratio of its output feature map, with a special definition for the compression ratio of the input layer. Finally, the system generates a model deployment matrix, each entry of which indicates whether a given edge node has a given model deployed; this determines whether the node can perform the inference task at the corresponding level.
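The system model of the two paragraphs above can be pictured with a few plain data structures. The sketch below is illustrative only: the class and field names (`EdgeNode`, `LevelModel`, `can_run`, the GHz units) are hypothetical, and the deployment matrix is assumed to be a list of rows, one per node, with one 0/1 column per model level.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class EdgeNode:
    name: str
    cpu_ghz: float      # CPU frequency
    gpu_ghz: float      # GPU frequency
    mem_bus_ghz: float  # memory bus frequency
    # FIFO inference task queue; each entry is one task's pending workload.
    queue: Deque[float] = field(default_factory=deque)

    def backlog(self) -> float:
        """Total computational workload still waiting at this node."""
        return sum(self.queue)


@dataclass
class LevelModel:
    level: int                  # 1-based position in the cascaded chain
    complexity_per_unit: float  # computation per unit of input data
    compression_ratio: float    # compression ratio of the output feature map


def can_run(deployment: List[List[int]], node_idx: int, level: int) -> bool:
    """True if the node (row) has the model of the given level (column)."""
    return deployment[node_idx][level - 1] == 1


node = EdgeNode("edge-A", cpu_ghz=2.4, gpu_ghz=1.2, mem_bus_ghz=3.2)
node.queue.append(2.0)   # tasks join at the tail
node.queue.append(3.5)
first = node.queue.popleft()  # and leave from the head (FIFO)
```

Using `collections.deque` keeps both ends of the queue cheap to update, which matches the first-in-first-out behaviour described for each node.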

[0035] In this embodiment, the system judges the detected inference task. If it is determined to be a new inference task, the edge gateway obtains the initial global state space and the deployment status of the M-layer logic models on the heterogeneous edge nodes, where each heterogeneous edge node has at least one logical level model deployed. The specific deployment location of each of the M logic models on the heterogeneous edge nodes is then determined, and the D3QN agent selects the optimal starting node from among the heterogeneous edge nodes on which the models are deployed and transmits the task to that node. To avoid the random errors of a single inference path, a chain-of-thought-based input representation is used to construct the prompt at the optimal starting node, and multi-trajectory random sampling inference is then performed, guiding the model to generate its reasoning step by step and produce an inference result. To quantify the reliability of the inference result, self-consistency verification is performed. If the inference result is reliable, it is used as the final inference result; otherwise, cross-layer collaborative scheduling is triggered, and a current global state space and hybrid action space are constructed for the new inference task, for use in the next scheduling step. By drawing on the multi-step reasoning capability of a large model, the chain-of-thought-based multi-trajectory sampling and self-consistency verification mechanism significantly improves the accuracy of intermediate-level decisions compared with single-path inference. Furthermore, a composite reward function that includes consistency gain and system cost guides the agent to actively choose the path that maximizes inference reliability while pursuing low latency and low energy consumption, achieving an adaptive balance between service quality and resource overhead.
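One common way to realize multi-trajectory sampling with self-consistency verification is majority voting over the sampled answers. The sketch below follows that pattern and is not taken from the patent: the score is assumed here to be the share of trajectories that agree with the majority answer, the threshold of 0.6 and the prompt suffix are arbitrary illustrative choices, and `sample_once` is a hypothetical stand-in for one stochastic model call.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / len(answers)


def sample_and_verify(
    question: str,
    sample_once: Callable[[str, random.Random], str],
    num_trajectories: int = 5,
    threshold: float = 0.6,
    seed: int = 0,
) -> Tuple[str, float, bool]:
    """Sample several reasoning trajectories and judge the result's reliability."""
    rng = random.Random(seed)
    # Chain-of-thought style prompt: ask the model to reason step by step.
    prompt = f"{question}\nLet's think step by step."
    answers = [sample_once(prompt, rng) for _ in range(num_trajectories)]
    answer, score = self_consistency(answers)
    # A score below the threshold would trigger cross-layer scheduling.
    return answer, score, score >= threshold
```

For example, `self_consistency(["4", "4", "5", "4", "3"])` yields the answer `"4"` with three of five trajectories in agreement.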

[0036] Step 102: Input the current global state space into the D3QN agent and combine it with the hybrid action space. Evaluate the value function of the current global state space and the advantage function of each candidate action separately, and select the optimal action from the candidate actions. The candidate actions in the hybrid action space include at least the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed, and the cloud server. The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on the system cost function.

[0037] In this embodiment, please refer to Figure 4, which is a schematic diagram of inference path planning in a heterogeneous dynamic scenario in this application embodiment. The D3QN agent needs to decide, based on the current environment, whether to transfer the new inference task that triggered cross-layer scheduling to another heterogeneous edge node for continued inference, or to upload it directly to the cloud server for processing. Specifically, the current global state space is input into the policy network of the D3QN agent and combined with the hybrid action space to evaluate the value function of the current global state space and the advantage function of each candidate action, so as to select the optimal action from the candidate actions.
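The way a state value and per-action advantages combine into a choice can be illustrated with the standard dueling aggregation, Q(s, a) = V(s) + A(s, a) - mean(A). The text above only says that the two functions are evaluated and an optimal action is selected, so the mean-subtraction form and the function names below are assumptions; the numbers stand in for the outputs of a trained network.

```python
from typing import Dict, Hashable


def dueling_q_values(
    state_value: float, advantages: Dict[Hashable, float]
) -> Dict[Hashable, float]:
    """Combine V(s) and A(s, a) into Q(s, a) with mean-centred advantages."""
    mean_adv = sum(advantages.values()) / len(advantages)
    return {a: state_value + adv - mean_adv for a, adv in advantages.items()}


def select_action(state_value: float, advantages: Dict[Hashable, float]) -> Hashable:
    """Greedy choice over the candidate actions of the hybrid action space."""
    q = dueling_q_values(state_value, advantages)
    return max(q, key=q.get)


# Only feasible candidates appear as keys, so infeasible nodes are never chosen.
choice = select_action(2.0, {"edge-B": 1.0, "edge-C": -1.0, "cloud": 0.0})
```

Restricting the dictionary to the current candidate actions is one simple way to respect deployment constraints without a separate masking step.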

[0038] To overcome the limitations of fixed physical flow paths in traditional cascaded inference and to ensure that scheduling decisions are feasible, the hybrid action space is built from the logical hierarchy level at which the inference task that triggered cross-layer collaborative scheduling currently stands without having been completed. Using the model deployment matrix, the available heterogeneous edge nodes with next-level inference capability are identified in the edge network, forming a dynamic set of all heterogeneous edge nodes on which the next-level model is deployed. The cloud server is then added as a direct access option with full inference capability, yielding a hybrid action space that changes dynamically with the state of the inference task. In this action space, each edge-node action denotes a heterogeneous edge node on which the (j+1)th-layer inference logic model has been deployed, and one further action denotes the cloud server. By constructing this direct-to-cloud option within the hybrid action space, the system can schedule intelligently for challenging tasks or edge network congestion while respecting the deployment constraints of the heterogeneous edge models. This avoids the latency accumulation and resource waste caused by ineffective task flow at the edge.
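The construction described above can be sketched as a filter over the deployment matrix followed by appending the cloud option. The names (`hybrid_action_space`, `CLOUD`) and the matrix layout (one row per node, one 0/1 column per level) are assumptions made for this illustration.

```python
from typing import List

CLOUD = "cloud"  # hypothetical label for the direct cloud-server option


def hybrid_action_space(
    deployment: List[List[int]],
    node_names: List[str],
    current_level: int,
) -> List[str]:
    """Candidate targets for a task whose current level (1-based) was unreliable."""
    next_level = current_level + 1
    num_levels = len(deployment[0]) if deployment else 0
    actions: List[str] = []
    if next_level <= num_levels:
        # Keep only the nodes on which the next-level model is deployed.
        actions = [
            name
            for name, row in zip(node_names, deployment)
            if row[next_level - 1] == 1
        ]
    # The cloud server is always available as a direct access option.
    actions.append(CLOUD)
    return actions


deployment = [
    [1, 0, 0],  # node A hosts level 1 only
    [1, 1, 0],  # node B hosts levels 1 and 2
    [0, 1, 1],  # node C hosts levels 2 and 3
]
candidates = hybrid_action_space(deployment, ["A", "B", "C"], current_level=1)
```

Because the list is rebuilt from the task's current level each time, the action space shrinks as the task climbs the chain and reduces to the cloud alone after the last level.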

[0039] To give the system the ability to perceive the dynamic network environment and to trade off inference accuracy against overall system overhead adaptively, a current global state space consisting of four feature vectors is constructed. The first is the task state, which includes the location and execution level of the currently scheduled task and the size of the intermediate data after compression by the preceding layers:

[0040]

[0041] Here, the intermediate data size is determined by the amount of raw input data for inference task k and the data compression ratio.

[0042] The second is the vector of real-time task queue lengths of all edge nodes, reflecting the load distribution across the entire network. The third records the historical average consistency score of each model level over a recent period, helping the agent predict model performance trends. The fourth is the vector of available bandwidth from the current node to the other candidate nodes.
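Assembling these four parts into one flat input for the agent might look like the sketch below. The function names are hypothetical, and the intermediate data size is assumed to be the raw input size multiplied by the compression ratios of the layers already traversed; the patent's own formula is not reproduced in this text, so that product is an assumption.

```python
import math
from typing import List, Sequence


def compressed_size(raw_size: float, ratios: Sequence[float]) -> float:
    """Assumed form: raw input size times the ratios of the preceding layers."""
    return raw_size * math.prod(ratios)


def build_state(
    node_idx: int,
    level: int,
    data_size: float,
    queue_lengths: Sequence[float],
    avg_consistency: Sequence[float],
    bandwidths: Sequence[float],
) -> List[float]:
    """Concatenate the four state components into one feature vector."""
    task_features = [float(node_idx), float(level), float(data_size)]
    return (
        task_features
        + [float(q) for q in queue_lengths]    # load of every edge node
        + [float(c) for c in avg_consistency]  # recent score per model level
        + [float(b) for b in bandwidths]       # bandwidth to candidate nodes
    )


state = build_state(
    node_idx=1,
    level=2,
    data_size=compressed_size(100.0, [0.5, 0.5]),
    queue_lengths=[3, 0, 1],
    avg_consistency=[0.9, 0.8],
    bandwidths=[10.0, 5.0],
)
```

In practice the components would usually be normalized before being fed to a network; that step is left out to keep the sketch short.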

[0043] In this embodiment, refer to Figure 5, which illustrates the logical structure of the D3QN agent. The D3QN agent is pre-trained using existing Markov decision process modeling methods and a composite reward function constructed from the system cost function, where the system cost is a weighted sum of its cost components. Once training has converged, when the agent faces a new inference task that has triggered cross-layer scheduling in actual deployment, it evaluates each candidate next hop based solely on the current global state and selects the optimal action at each step. These optimal single-step decisions chain together automatically to form the overall optimal scheduling path. Training therefore yields D3QN network parameters that encode the optimal scheduling strategy and can be deployed directly. The system cost function and the composite reward function serve as reward signals for the Markov decision model and guide the iteration of the D3QN network parameters only during the training phase, so that the agent learns to prefer paths with low system cost and high inference reliability; they do not participate in path selection during the deployment phase.

[0044] To guide the D3QN agent toward multi-objective collaborative optimization, this application defines in detail a system cost function and a composite reward function built upon it. The system cost function comprehensively measures cross-node transmission latency, queuing latency, computation processing latency, transmission and inference energy consumption, and cloud service traffic costs. The composite reward function built upon it consists of a positive incentive from the consistency score gain and a negative penalty from the system cost function, and it guides the optimization of the scheduling strategy. Assume that the system schedules the inference of a given model layer to a given heterogeneous edge node for execution, with a given decision interval. First, the task queue of each edge node is updated according to the following rule:

[0045]

[0046] where the terms denote the overall computing and processing speed of the node and the computational load of the newly assigned task. The load of the new task is added only if the node is the scheduled target node; otherwise it is 0. Then the single-step transmission delay is calculated:

[0047]

[0048] where the bandwidth term denotes the link bandwidth between the sending node and the target node at the decision step; if the two nodes are the same, the transmission delay is 0. The queuing delay and the inference delay are, respectively:

[0049] ,

[0050] The corresponding energy consumption includes transmission energy consumption and inference energy consumption, and the formulas are as follows:

[0051] ,

[0052] where the first term denotes the transmission power of the sending node and the second is the effective capacitance coefficient. In addition, cloud service costs are incurred only if the target node is the cloud:

[0053]

[0054] Based on the above indicators, the single-step weighted system cost function and the final composite reward function are defined as follows:

[0055] ,

[0056]

[0057] where the first term is the increase in the consistency score after performing the action, and its coefficient is the positive incentive coefficient. The system cost term is a weighted cost function covering transmission latency, queuing latency, computation processing latency, energy consumption, and cloud service traffic costs. This composite reward function guides the D3QN agent to actively avoid congested nodes and to decide intelligently whether paying cloud service costs is worth the higher consistency gain, while pursuing low latency and low energy consumption. It also incentivizes the agent to choose paths that significantly improve inference reliability.
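Because the original formulas did not survive in this text, the following is only a sketch of one plausible form of the queue update, the single-step cost, and the composite reward. Every formula here is an assumption based on common edge computing cost models (delay as data size over bandwidth, queuing and computation delay as workload over processing speed, transmission energy as power times time, computation energy as capacitance coefficient times squared speed times workload, and a per-bit cloud price); the names `Weights`, `step_cost`, `reward`, and `update_queue` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    delay: float = 1.0   # weight of total latency
    energy: float = 1.0  # weight of total energy
    cloud: float = 1.0   # weight of cloud service cost
    gain: float = 10.0   # positive incentive coefficient


def update_queue(queue_cycles, cpu_hz, interval_s, task_cycles, is_target):
    """Drain the queue over one decision interval, then add the new task
    only if this node is the scheduled target."""
    drained = max(queue_cycles - cpu_hz * interval_s, 0.0)
    return drained + (task_cycles if is_target else 0.0)


def step_cost(data_bits, bandwidth_bps, queue_cycles, task_cycles, cpu_hz,
              tx_power_w, kappa, is_cloud, cloud_price_per_bit, same_node, w):
    """Weighted single-step system cost under the assumed formulas."""
    t_trans = 0.0 if same_node else data_bits / bandwidth_bps
    t_queue = queue_cycles / cpu_hz
    t_comp = task_cycles / cpu_hz
    e_trans = tx_power_w * t_trans
    e_comp = kappa * cpu_hz ** 2 * task_cycles
    c_cloud = cloud_price_per_bit * data_bits if is_cloud else 0.0
    return (w.delay * (t_trans + t_queue + t_comp)
            + w.energy * (e_trans + e_comp)
            + w.cloud * c_cloud)


def reward(score_gain, cost, w):
    """Composite reward: consistency gain incentive minus system cost."""
    return w.gain * score_gain - cost


w = Weights()
edge_cost = step_cost(data_bits=8.0, bandwidth_bps=4.0, queue_cycles=6.0,
                      task_cycles=4.0, cpu_hz=2.0, tx_power_w=0.5, kappa=0.25,
                      is_cloud=False, cloud_price_per_bit=0.5,
                      same_node=False, w=w)
edge_reward = reward(score_gain=0.5, cost=edge_cost, w=w)
```

The numbers are toy values chosen only to make the arithmetic easy to follow; realistic units (bits, Hz, watts) would give very different magnitudes and would need the weights to be rescaled.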

[0058] Step 103: If the optimal action is the cloud server, inference is performed through the cloud server to obtain the final inference result. If the optimal action is the heterogeneous edge node with the M+1 layer logic model deployed, the new inference task that triggers cross-layer collaborative scheduling is transmitted to the heterogeneous edge node with the M+1 layer logic model deployed. The task then enters the M+1 layer logic model and performs multi-trajectory random sampling inference again using the thought chain-based input representation method. This process is repeated until the final inference result is generated.

[0059] In this embodiment, the pre-trained D3QN agent selects the optimal action based on the current global state space and the hybrid action space constructed for the new inference task that triggers cross-layer collaborative scheduling. The optimal action is either the cloud server or a heterogeneous edge node on which the M+1 layer logic model is deployed.

[0060] If the optimal action is the cloud server, the new inference task that triggers cross-layer collaborative scheduling is transmitted directly to the cloud server. The cloud server executes the cloud pass-through strategy, skipping all remaining logical levels of the cascaded inference model chain and completing the inference directly with cloud computing power, and the consistency score is set to 1.0. This strategy handles high-difficulty tasks and edge network congestion scenarios.

[0061] If the optimal action is a heterogeneous edge node on which the M+1 layer logic model is deployed, the edge collaboration strategy is executed. The new inference task that triggers cross-layer collaborative scheduling is transmitted to that heterogeneous edge node, where the M+1 layer logic model performs multi-trajectory random sampling inference using a thought chain-based input representation method. This triggers a new round of thought chain inference and consistency verification, i.e., a return to step 101. The process is repeated, continuing inference at the edge with minimal communication cost, until the final inference result is obtained.
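The loop formed by steps 101 to 103 can be sketched as below. This is a simplified illustration under several assumptions: the model inference, the consistency check, the agent's decision, and the cloud inference are passed in as hypothetical callables (`infer`, `consistency`, `choose_action`, `cloud_infer`), the same data is forwarded at each hop instead of compressed intermediate data, and a score equal to the threshold is treated as reliable, a case the text does not specify.

```python
from collections import Counter

CLOUD = "cloud"  # hypothetical label for the cloud pass-through action


def run_task(data, start_node, min_score, infer, consistency,
             choose_action, cloud_infer):
    """Run one task until a reliable edge answer or a cloud answer is found."""
    node, level = start_node, 1
    path = [(node, level)]
    while True:
        answers = infer(node, level, data)     # step 101: sampled trajectories
        score, answer = consistency(answers)   # self-consistency verification
        if score >= min_score:
            return answer, path
        action = choose_action(node, level)    # step 102: agent picks next hop
        if action == CLOUD:                    # step 103: cloud pass-through
            path.append((CLOUD, None))
            return cloud_infer(data), path
        node, level = action, level + 1        # step 103: edge collaboration
        path.append((node, level))


def majority(answers):
    """Stand-in consistency check: share of the most common exact answer."""
    answer, count = Counter(answers).most_common(1)[0]
    return count / len(answers), answer


def demo(min_score):
    return run_task(
        data="question",
        start_node=0,
        min_score=min_score,
        infer=lambda node, level, data:
            ["a", "b", "c"] if level == 1 else ["x", "x", "y"],
        consistency=majority,
        choose_action=lambda node, level: 2 if level == 1 else CLOUD,
        cloud_infer=lambda data: "cloud-answer",
    )


edge_result = demo(min_score=0.6)
cloud_result = demo(min_score=0.9)
```

With the lower threshold the task finishes at the second level on an edge node; with the higher threshold the stub agent falls back to the cloud server.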

[0062] Specifically, the embodiments of this application overcome the limitations of fixed physical flow paths in cascaded inference. First, a multi-trajectory sampling and self-consistency verification mechanism based on the thought chain representation is introduced to quantify the reliability of inference results at different levels. Second, the collaborative inference path planning problem is modeled as a Markov decision process model, defining the current global state space and designing a hybrid action space that includes edge collaboration and cloud direct access. Finally, a deep reinforcement learning algorithm is used to solve the problem. By maximizing a composite reward function that includes latency, energy consumption, cloud service costs, and consistency gains, the optimal flow path of the task between heterogeneous edge nodes and the cloud is dynamically decided, effectively enhancing the flexibility of cascaded inference in dynamic environments and optimizing the overall system performance while ensuring task accuracy.

[0063] This application provides a cloud-edge collaborative inference optimization method based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M incrementally arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. The method includes: judging the detected inference task; if it is a new inference task, obtaining the initial global state space and the deployment status of the M-layer logical models of the cascaded inference model chain; using a D3QN agent to transmit the new inference task to the optimal starting heterogeneous edge node; performing multi-trajectory random sampling inference through the optimal starting heterogeneous edge node using a thought chain-based input representation method to generate an inference result; and verifying the reliability of the inference result. If reliable, obtaining the final inference result; if unreliable, triggering cross-layer collaborative scheduling; and constructing the current global state and hybrid action space for the new inference task that triggered cross-layer collaborative scheduling. The current global state space is input into the D3QN agent and combined with the hybrid action space. The value function of the current global state space and the advantage function of each candidate action are evaluated respectively, and the optimal action is selected from the candidate actions. The hybrid action space includes at least multiple candidate actions of heterogeneous edge nodes with M+1 layer logic models deployed with the cascaded inference model chain and the cloud server. 
The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on this basis. If the optimal action is the cloud server, inference is performed through the cloud server to obtain the final inference result. If the optimal action is the heterogeneous edge node with the M+1 layer logic model deployed, the new inference task triggering cross-layer collaborative scheduling is transmitted to the heterogeneous edge node with the M+1 layer logic model deployed. The task enters the M+1 layer logic model and performs multi-trajectory random sampling inference again using the thought chain-based input representation method. This process is repeated until the final inference result is generated. The method provided by this invention overcomes the limitations of fixed physical flow paths in traditional cascaded inference. By constructing a cloud server direct access option in a hybrid action space, the system can perform intelligent scheduling for high-difficulty tasks or edge network congestion scenarios while adapting to the deployment constraints of heterogeneous edge node models. This effectively avoids the accumulation of latency and waste of resources caused by ineffective flow at the edge. Based on the multi-trajectory sampling and self-consistency verification mechanism of the thought chain, the multi-step inference capability of the large model significantly improves the judgment accuracy of intermediate layers compared to single-path inference. At the same time, by constructing a composite reward function that includes consistency gain and system cost, the agent is guided to actively choose the path that maximizes the reliability of inference while pursuing low latency and low energy consumption, achieving an adaptive balance between service quality and resource overhead. The D3QN algorithm is used to solve complex cross-layer scheduling problems. 
By constructing a global state space that includes the real-time load of the entire network and the historical performance of the model, the agent is endowed with a deep perception capability of the dynamic network environment. By separating the value function and the advantage function, multi-objective collaborative optimization of inference latency, energy consumption and cloud service cost in heterogeneous dynamic environments is achieved.

[0064] Further, refer to Figure 6, which is a schematic diagram of a sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this embodiment of the application. In this embodiment, the step of judging the monitored inference task includes:

[0065] Step 201: Process the received inference task using an event-driven mechanism, and process it accordingly based on the type of the inference task;

[0066] Step 202: If the inference task is the new inference task, obtain the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain;

[0067] Step 203: If the inference task is a new inference task that triggers cross-layer scheduling, directly perform multi-trajectory random sampling inference using the input representation method based on the thought chain.

[0068] In this embodiment, when the cloud-edge collaborative inference system detects an external inference task, it processes the task through an event-driven mechanism and executes different logic depending on the task type. If the task is a new inference task, then when it arrives at the edge gateway, the edge gateway obtains the initial global state space at that moment together with the deployment status of the M-layer logic model of the cascaded inference model chain on the heterogeneous edge nodes. For a new inference task, the M-layer logic model is the first-layer logic model of the cascaded inference model chain.

[0069] If the received inference task is a new inference task that triggers cross-layer scheduling, the M+1 layer logic model of the corresponding heterogeneous edge node is used directly to perform multi-trajectory random sampling inference using the input representation method based on the thought chain, triggering a new round of thought chain inference and consistency verification. In other words, if the first-layer logic model of the cascaded inference model chain cannot complete the inference and the task is kept on the heterogeneous edge nodes rather than sent directly to the cloud server, the next-layer logic model of the chain performs the inference, and this process is repeated until the inference is completed.

[0070] Further, refer to Figure 7, which is another sub-process diagram of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this embodiment. In this embodiment, the step of obtaining, for a new inference task, the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain, transmitting the task data of the new inference task to the optimal starting heterogeneous edge node using the D3QN agent, and performing multi-trajectory random sampling inference through that node using the input representation method based on the thought chain to generate inference results includes:

[0071] Step 301: When the new inference task arrives at the edge gateway, the new inference task is parsed to obtain the original input data and the preset requirements, wherein the preset requirements are the minimum consistency score requirements;

[0072] Step 302: Obtain the initial global state space and the deployment status of the M-layer logic model on the heterogeneous edge nodes, and use the D3QN agent to determine the optimal starting heterogeneous edge node from the heterogeneous edge nodes where the M-layer logic model has been deployed, and transmit the new inference task to the optimal starting heterogeneous edge node.

[0073] Step 303: When the optimal starting heterogeneous edge node receives the original input data of the new inference task, it uses the M-layer logic model and the input representation method based on the thought chain to perform multi-trajectory random sampling inference on the original input data and generate inference results, wherein the inference results contain the set of output answers from different inference trajectories.

[0074] In this embodiment, when the cloud-edge collaborative inference system detects a new inference task from outside, it transmits the task to the edge gateway. When the edge gateway receives the k-th new inference task, the system parses the task to obtain the size of its original input data and its minimum consistency score requirement. The edge gateway then acquires the initial global state space and the deployment status of the M-th layer logic model of the cascaded inference model chain on the heterogeneous edge nodes, where M is 1. The D3QN agent determines the optimal starting heterogeneous edge node from among the heterogeneous edge nodes on which the M-th layer logic model is deployed and transmits the original input data of the new inference task to it. On receiving the data, the optimal starting heterogeneous edge node uses its M-th layer logic model to perform multi-trajectory random sampling inference on the original input data using a thought chain-based input representation method, which guides the model to generate the inference process step by step. To avoid relying on a single inference path, this embodiment performs Z rounds of random sampling inference to obtain a set of output answers from different inference trajectories.

[0075] Further, refer to Figure 8, which is another sub-process diagram of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this embodiment. In this embodiment, the step of verifying the reliability of the inference result, obtaining the final inference result if it is reliable, and triggering cross-layer collaborative scheduling if it is unreliable includes:

[0076] Step 401: Perform semantic similarity calculation and cluster analysis on the inference results to obtain a consistency score;

[0077] Step 402: Compare the consistency score with the minimum consistency score requirement. If the consistency score is greater than the minimum consistency score requirement, the inference result is used as the final inference result and output. If the consistency score is less than the minimum consistency score requirement, cross-layer collaborative scheduling is triggered.

[0078] In this embodiment, to quantify the reliability of the inference results, the cloud-edge collaborative inference system performs self-consistency verification on the generated inference results. Specifically, by performing semantic similarity calculation and cluster analysis on the inference results, a consistency score is obtained. This consistency score is then compared to the minimum consistency score requirement for the new inference task. If the consistency score is greater than the minimum consistency score requirement, it indicates that the current M-layer logic model can provide a high-confidence result, and the cloud-edge collaborative inference system directly outputs this inference result as the final inference result and terminates the task. If the consistency score is less than the minimum consistency score requirement, it indicates that the current M-layer logic model's inference capability is insufficient, requiring the triggering of cross-layer collaborative scheduling.

[0079] Further, refer to Figure 9, which is a schematic diagram of another sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this application embodiment. In this embodiment, the step of performing semantic similarity calculation and cluster analysis on the inference results to obtain a consistency score includes:

[0080] Step 501: Compare the answers in the output answer set pairwise using a semantic similarity function, group the answers with high semantic similarity into the same cluster, and obtain a cluster set, wherein the cluster set includes multiple answer clusters;

[0081] Step 502: The proportion of sampled inference trajectories whose answers fall into the largest answer cluster in the cluster set is used as the consistency score of the M-layer logic model. If the consistency score is greater than the minimum consistency score requirement, the representative answer of the largest answer cluster is used as the inference result.

[0082] In this embodiment, a semantic similarity function is defined. The answers in the output answer set are compared pairwise, and answers with high semantic similarity are grouped into the same cluster, yielding a cluster set whose c-th element is the c-th answer cluster. The consistency score of the current M-layer logic model's inference is defined as the proportion of samples contained in the largest answer cluster out of the total number of samples:

[0083]

[0084] Then a judgment is made. If the consistency score is greater than the minimum consistency score requirement, the current M-layer logic model can provide a high-confidence result, and the system directly outputs the representative answer of the largest cluster as the final inference result and ends the task. If the consistency score is less than the minimum consistency score requirement, the inference ability of the current level is insufficient, and the subsequent cross-layer collaborative scheduling process is triggered.
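The consistency score described above, the size of the largest answer cluster divided by the number of sampled answers, can be sketched as follows. The greedy clustering strategy (comparing each answer with the first member of each existing cluster) and the toy similarity function `same_normalized` are assumptions for illustration; the text only states that answers are compared pairwise with a semantic similarity function, which in practice would likely be an embedding-based measure.

```python
from typing import Callable, List, Tuple


def cluster_answers(answers: List[str],
                    similar: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar, otherwise open a new cluster."""
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            if similar(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def consistency_score(answers: List[str],
                      similar: Callable[[str, str], bool]) -> Tuple[float, str]:
    """Return (share of the largest cluster, its representative answer).

    Requires at least one sampled answer.
    """
    clusters = cluster_answers(answers, similar)
    largest = max(clusters, key=len)
    return len(largest) / len(answers), largest[0]


def same_normalized(a: str, b: str) -> bool:
    """Toy stand-in for a semantic similarity test."""
    return a.strip().lower() == b.strip().lower()


samples = ["42", " 42 ", "forty-two", "42", "41"]
score, answer = consistency_score(samples, same_normalized)
needs_scheduling = score < 0.8  # compare with a hypothetical minimum score
```

Here three of the five sampled answers agree, so the score is 0.6; with a minimum requirement of 0.8 the task would trigger cross-layer collaborative scheduling.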

[0085] Further, refer to Figure 10, which is another sub-process diagram of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this application embodiment. In this embodiment, the step of constructing the current global state space and the hybrid action space, inputting the current global state space into the D3QN agent and combining it with the hybrid action space, evaluating the value function of the current global state space and the advantage function of each candidate action respectively, and selecting the optimal action from each candidate action includes:

[0086] Step 601: Aggregate and calculate the value function of the current global state space and the advantage function of each candidate action to obtain the Q value corresponding to each candidate action;

[0087] Step 602: Correct the multiple Q values using the Double DQN mechanism, and then use the ε-greedy strategy to select, from the corrected Q values, the candidate action corresponding to the largest Q value as the optimal action.

[0088] In this embodiment, the D3QN agent feeds the constructed current global state space into a deep neural network whose back end splits into two streams: one evaluates the value function of the current global state space, and the other evaluates the advantage function of each candidate action. The Q value of each candidate action is calculated with an aggregation formula. The aggregated Q value contains both the "basic state value" and the "relative action advantage" and serves as the core quantitative basis for selecting the optimal scheduling action. The value function is a single scalar representing the overall basic value of the scheduling environment in the current global state space, that is, the basic long-term reward level obtainable regardless of which legitimate candidate action is chosen. The advantage function is a vector whose dimension matches the number of legitimate candidate actions in the hybrid action space. It represents the scheduling value advantage of each candidate action relative to the others, that is, the additional value of a candidate action compared with the average action level in the current state. Training the D3QN agent with the system cost function and the composite reward function built upon it unifies the quantification of "inference gains" and "system costs". It defines a "good action" (high improvement and low cost, giving a reward that is positive and high) and a "bad action" (low improvement and high cost, giving a reward that is negative and low). This reward is the core signal for updating the parameters of the value function and the advantage function, ensuring that their outputs match the actual scheduling requirements. Without the feedback signals provided by these two functions, the outputs of the value function and the advantage function would be merely random values and could not guide the selection of the optimal action; once anchored by the reward, they become quantitative indicators of "low cost, high return". Specifically, the D3QN agent quantifies the system losses in computation, transmission, and queuing incurred while scheduling a new inference task under cross-layer collaborative scheduling and uses them as the negative penalty term of the composite reward function. It combines the positive return from the task's consistency score gain with the negative penalty from the system cost function to calculate the immediate reward of a single scheduling action. The system cost function thus supplies the loss term and the consistency score gain supplies the incentive term of the composite reward, and the Q values are learned from this reward. The reward function guides the agent to actively avoid congested nodes and to decide intelligently whether paying cloud service costs is worth the higher consistency gain, while pursuing low latency and low energy consumption.
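The text does not give the aggregation formula itself. A common choice in dueling network architectures is the mean-subtracted form, Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a); whether this application uses exactly that form is an assumption. A minimal sketch with plain Python lists in place of network outputs:

```python
from typing import List


def dueling_q(value: float, advantages: List[float]) -> List[float]:
    """Combine a scalar state value and per-action advantages into Q values
    using the mean-subtracted dueling aggregation (assumed form)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# One advantage per legitimate candidate action, e.g. two edge nodes
# followed by the cloud server (hypothetical ordering).
q_values = dueling_q(value=2.0, advantages=[1.0, -1.0, 3.0])
best_action = max(range(len(q_values)), key=q_values.__getitem__)
```

Subtracting the mean advantage keeps the split between state value and action advantage identifiable, so the value stream cannot be offset arbitrarily against the advantage stream.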

[0089] The Double DQN mechanism uses the current network to select actions and the target network to evaluate the value of those actions, which effectively mitigates the overestimation of Q values. Based on the output Q values, the D3QN agent adopts the ε-greedy strategy to select the optimal action, which determines the next-hop heterogeneous edge node or the cloud server. In this application, the D3QN algorithm is used to solve the complex cross-layer scheduling problem. By constructing a global state space that includes the real-time load of the entire network and the historical performance of the model, the agent is given a deep perception of the dynamic network environment. By separating the value function and the advantage function and applying the Double DQN correction, the problem of Q-value overestimation is alleviated, and multi-objective collaborative optimization of inference latency, energy consumption and cloud service cost is achieved in a heterogeneous dynamic environment.
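The two mechanisms named in this paragraph can be sketched in a few lines. In the Double DQN target, the online (current) network chooses the next action and the target network supplies its value; ε-greedy selection explores with probability ε and otherwise takes the action with the largest Q value. The function names, the discount factor `gamma`, and the use of plain lists in place of network outputs are assumptions for illustration.

```python
import random
from typing import List


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def double_dqn_target(reward: float, next_q_online: List[float],
                      next_q_target: List[float], gamma: float,
                      done: bool) -> float:
    """Training target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    return reward + gamma * next_q_target[argmax(next_q_online)]


def epsilon_greedy(q_values: List[float], epsilon: float,
                   rng: random.Random) -> int:
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return argmax(q_values)


target = double_dqn_target(reward=1.0,
                           next_q_online=[0.2, 0.9, 0.1],
                           next_q_target=[5.0, 2.0, 7.0],
                           gamma=0.5, done=False)
greedy_choice = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0,
                               rng=random.Random(0))
```

In the example, a plain DQN target would use the target network's own maximum (7.0), whereas the Double DQN target uses the target network's value for the action preferred by the online network (2.0), which is how the overestimation is reduced.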

[0090] Further, refer to Figure 11, which is a schematic diagram of another sub-process of the cloud-edge collaborative inference optimization method based on deep reinforcement learning in this embodiment. In this embodiment, the current global state space includes at least the intermediate data of the new inference task that triggers cross-layer collaborative scheduling. The step of transmitting the new inference task that triggers cross-layer collaborative scheduling to the heterogeneous edge node on which the M+1 layer logic model is deployed, entering the M+1 layer logic model, and again performing multi-trajectory random sampling inference using an input representation method based on the thought chain includes:

[0091] Step 701: Transmit the intermediate data of the new inference task that triggers cross-layer collaborative scheduling to the heterogeneous edge node that deploys the M+1 layer logical model, and enter the M+1 layer logical model;

[0092] Step 702: Based on the intermediate data of the new inference task that triggers cross-layer collaborative scheduling, as parsed by the edge gateway, perform multi-trajectory random sampling inference on the task again using the input representation method based on the thought chain.

[0093] Step 703: Based on the execution result of the new inference task that triggers cross-layer collaborative scheduling, update the queue status of all heterogeneous edge nodes and iteratively optimize the D3QN agent.

[0094] In this embodiment, when the optimal action is a heterogeneous edge node on which the M+1 layer logic model is deployed, the cloud-edge collaborative inference system executes the edge collaboration strategy. Taking advantage of the feature compression between layers, it extracts the intermediate data of the new inference task that triggers cross-layer collaborative scheduling and transmits this data to the heterogeneous edge node, so that inference continues at the edge with minimal communication cost. The heterogeneous edge node invokes its M+1 layer logic model and performs multi-trajectory random sampling inference on the intermediate data using a thought chain-based input representation, initiating a new round of thought chain inference and consistency verification. For example, if the first-layer logic model of the cascaded inference model chain cannot complete the new inference task, the current global state space is constructed for the task that triggered cross-layer collaborative scheduling. The D3QN agent evaluates this state to determine the optimal action, which is either a heterogeneous edge node with the second-layer logic model or the cloud server. If the optimal action is a heterogeneous edge node with the second-layer logic model, that model again performs multi-trajectory random sampling inference on the intermediate data of the task using a thought chain-based input representation.

[0095] Meanwhile, based on the actual feedback from executing the new inference task that triggers cross-layer collaborative scheduling, the cloud-edge collaborative inference system updates the queue status of each heterogeneous edge node and stores the quadruple of state, action, reward and next state in the experience replay pool for the iterative optimization of the neural network parameters, thereby continuously improving the adaptability of the path planning strategy to dynamic environments. In this quadruple, the first element is the current global state space, the second is the optimal action, the third is the immediate reward value calculated by the cloud-edge collaborative inference system from the composite reward function, and the fourth is the global state space for the next decision, as updated by the system after the optimal action is executed.
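An experience replay pool of the kind described here can be sketched as a fixed-capacity buffer of (state, action, reward, next state) tuples from which random mini-batches are drawn for training. The class name `ReplayPool`, the capacity, and the tuple-based state encoding are assumptions; the text does not specify the pool size or the sampling scheme.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

Transition = Tuple[tuple, int, float, tuple]  # (state, action, reward, next)


class ReplayPool:
    """Fixed-capacity pool that discards the oldest transition when full."""

    def __init__(self, capacity: int) -> None:
        self._buf: Deque[Transition] = deque(maxlen=capacity)

    def push(self, state: tuple, action: int, reward: float,
             next_state: tuple) -> None:
        self._buf.append((state, action, reward, next_state))

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        """Draw a uniform random mini-batch without replacement."""
        k = min(batch_size, len(self._buf))
        return rng.sample(list(self._buf), k)

    def __len__(self) -> int:
        return len(self._buf)


pool = ReplayPool(capacity=3)
for step in range(5):
    pool.push(state=(step,), action=step % 2, reward=-float(step),
              next_state=(step + 1,))
batch = pool.sample(batch_size=2, rng=random.Random(0))
```

Uniform sampling from the pool breaks the correlation between consecutive scheduling decisions, which is the usual motivation for replaying stored experience rather than training on each transition as it arrives.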

[0096] Furthermore, this application embodiment also provides a cloud-edge collaborative inference optimization device 800 based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M sequentially arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. Figure 12 is a schematic diagram of the program modules of the cloud-edge collaborative inference optimization device based on deep reinforcement learning in this embodiment of the application. In this embodiment, the cloud-edge collaborative inference optimization device 800 based on deep reinforcement learning includes:

[0097] Judgment module 801: Used to judge the monitored inference task. If it is a new inference task, it obtains the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain. It then uses the D3QN agent to transmit the new inference task to the optimal starting heterogeneous edge node. Through the optimal starting heterogeneous edge node, it performs multi-trajectory random sampling inference using the input representation method based on the thought chain to generate inference results and verify whether the inference results are reliable. If reliable, it obtains the final inference result. If unreliable, it triggers cross-layer collaborative scheduling and constructs the current global state and hybrid action space for the new inference task that triggers cross-layer collaborative scheduling.

[0098] Selection module 802: used to input the current global state space into the D3QN agent and, in combination with the hybrid action space, evaluate the value function of the current global state space and the advantage function of each candidate action, and select the optimal action from the candidate actions. The hybrid action space includes at least the candidate actions corresponding to the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed and the candidate action corresponding to the cloud server. The D3QN agent is pre-trained using a Markov decision process model, a system cost function and a composite reward function built on this basis.
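
The selection step can be illustrated with the aggregation commonly used in dueling networks, Q(s, a) = V(s) + A(s, a) - mean(A(s, ·)), followed by ε-greedy selection. The sketch below assumes that aggregation form. The action names, the numeric values and the helper functions `dueling_q_values` and `select_action` are hypothetical, and a trained network would supply the value and advantage estimates.

```python
import random


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V + A - mean(A)."""
    mean_adv = sum(advantages.values()) / len(advantages)
    return {a: state_value + adv - mean_adv for a, adv in advantages.items()}


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else take argmax Q."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return max(q_values, key=q_values.get)


# Hypothetical hybrid action space: two edge nodes hosting the next-layer
# model plus the cloud server. The numbers are made up for illustration.
advantages = {"edge_node_2": 0.5, "edge_node_5": -0.25, "cloud": -0.25}
q = dueling_q_values(state_value=1.0, advantages=advantages)
best = select_action(q, epsilon=0.0, rng=random.Random(0))
```

Subtracting the mean advantage keeps the split between V and A identifiable. In Double DQN training, the online network picks the next action and the target network evaluates it, which is omitted here for brevity.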

[0099] Inference module 803: If the optimal action is the cloud server, it performs inference through the cloud server to obtain the final inference result. If the optimal action is the heterogeneous edge node with the M+1 layer logic model deployed, it transmits the new inference task that triggers cross-layer collaborative scheduling to the heterogeneous edge node with the M+1 layer logic model deployed, enters the M+1 layer logic model, and performs multi-trajectory random sampling inference again using the thought chain-based input representation method. This process is repeated until the final inference result is generated.
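
The repeated escalation described above can be sketched as a loop that moves a task up the model chain until a result passes the reliability check or the cloud server is chosen. All callables are hypothetical placeholders: `choose` stands in for the D3QN agent. Starting at layer 1 and falling back to the cloud when no higher layer exists are assumptions of this sketch, not statements of this application.

```python
def run_cascade(task, num_layers, min_score, edge_infer, cloud_infer, choose):
    """Escalate a task through the model chain until a result is accepted.

    edge_infer(layer, task) -> (answer, consistency_score)
    cloud_infer(task)       -> answer
    choose(layer, score)    -> "cloud" or "edge" (stand-in for the agent)
    """
    layer = 1
    path = []
    while True:
        answer, score = edge_infer(layer, task)
        path.append(("edge", layer))
        if score > min_score:
            return answer, path  # reliable: this is the final result
        # Unreliable: cross-layer collaborative scheduling is triggered.
        if layer == num_layers or choose(layer, score) == "cloud":
            path.append(("cloud", None))
            return cloud_infer(task), path
        layer += 1  # hand the task to the next-layer model


# Toy stand-ins: higher layers give more consistent answers.
def toy_edge(layer, task):
    return f"answer-from-layer-{layer}", 0.3 * layer


def toy_cloud(task):
    return "answer-from-cloud"


result, path = run_cascade("q", num_layers=3, min_score=0.8,
                           edge_infer=toy_edge, cloud_infer=toy_cloud,
                           choose=lambda layer, score: "edge")
```

In this toy run the task is rejected at layers 1 and 2 and accepted at layer 3. With a `choose` that returns "cloud", the task would skip the remaining edge layers after the first rejection.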

[0100] This application provides a cloud-edge collaborative inference optimization device based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M logical hierarchical models arranged sequentially in increasing order, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. Upon detecting a new inference task, the system obtains the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain. The D3QN agent is then used to transmit the new inference task to the optimal starting heterogeneous edge node. Through this node, multi-trajectory random sampling inference is performed using the thought chain-based input representation method to generate inference results, and the reliability of these results is verified. If reliable, the final inference result is obtained. If unreliable, cross-layer collaborative scheduling is triggered, and the current global state space and hybrid action space are constructed for the new inference task that triggers cross-layer collaborative scheduling.
The current global state space is input into the D3QN agent and combined with the hybrid action space to evaluate the value function of the current global state space and the advantage function of each candidate action, and the optimal action is selected from the candidate actions. The hybrid action space includes at least the candidate actions corresponding to the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed and the candidate action corresponding to the cloud server. The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on this basis. If the optimal action is the cloud server, inference is performed through the cloud server to obtain the final inference result. If the optimal action is a heterogeneous edge node on which the M+1 layer logic model is deployed, the new inference task that triggers cross-layer collaborative scheduling is transmitted to that node, enters the M+1 layer logic model, and undergoes multi-trajectory random sampling inference again using the thought chain-based input representation method. This process is repeated until the final inference result is generated. The method provided by this invention overcomes the limitation of fixed physical flow paths in traditional cascaded inference. By including a direct cloud server access option in the hybrid action space, the system can schedule intelligently for high-difficulty tasks or edge network congestion while respecting the deployment constraints of the models on heterogeneous edge nodes. This avoids the latency accumulation and resource waste caused by ineffective flow at the edge.
With the thought chain-based multi-trajectory sampling and self-consistency verification mechanism, the multi-step inference capability of the large model is exploited, so that intermediate layers judge more accurately than with single-path inference. At the same time, a composite reward function that includes consistency gain and system cost guides the agent to choose the path that maximizes inference reliability while pursuing low latency and low energy consumption, achieving an adaptive balance between service quality and resource overhead. The D3QN algorithm is used to solve the complex cross-layer scheduling problem. A global state space that includes the real-time load of the entire network and the historical performance of the models gives the agent a detailed view of the dynamic network environment. By separating the value function from the advantage function, multi-objective collaborative optimization of inference latency, energy consumption and cloud service cost is achieved in heterogeneous dynamic environments.
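
To make the trade-off between consistency gain and system cost concrete, the fragment below sketches one possible composite reward as a weighted consistency gain minus a weighted sum of latency, energy and cloud service cost. The linear form, every weight and the sample numbers are assumptions chosen for illustration. They are not the reward function defined in this application.

```python
def system_cost(latency_s, energy_j, cloud_fee,
                w_latency=1.0, w_energy=0.5, w_fee=2.0):
    """Weighted system cost; the weights are hypothetical."""
    return w_latency * latency_s + w_energy * energy_j + w_fee * cloud_fee


def composite_reward(score_before, score_after, cost, w_gain=10.0):
    """Reward = weighted consistency gain minus system cost (assumed form)."""
    return w_gain * (score_after - score_before) - cost


# Toy comparison: escalate to the next edge layer or go straight to the cloud.
edge_cost = system_cost(latency_s=0.5, energy_j=1.0, cloud_fee=0.0)
cloud_cost = system_cost(latency_s=1.5, energy_j=0.0, cloud_fee=0.25)
r_edge = composite_reward(0.5, 0.75, edge_cost)
r_cloud = composite_reward(0.5, 1.0, cloud_cost)
```

With these made-up numbers the cloud option costs more but yields a larger consistency gain, so its reward is higher. Different weights would shift the balance toward the edge.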

[0101] Furthermore, this application also provides a cloud-edge collaborative inference optimization device based on deep reinforcement learning, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the cloud-edge collaborative inference optimization method based on deep reinforcement learning as described above.

[0102] Furthermore, this application also provides a storage medium storing a computer program thereon, which, when executed by a processor, implements the various steps in the cloud-edge collaborative inference optimization method based on deep reinforcement learning as described above.

[0103] In the various embodiments of this invention, the functional modules can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium.

[0104] Based on this understanding, the technical solution of this invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0105] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, as some steps can be performed in other orders or simultaneously according to the present invention. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to the present invention. In the above embodiments, the descriptions of each embodiment have their own emphasis; for parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0106] For those skilled in the art, based on the ideas of the embodiments of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A cloud-edge collaborative inference optimization method based on deep reinforcement learning, applied to a cloud-edge collaborative inference system and a cascaded inference model chain, wherein the cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server, and the cascaded inference model chain includes M sequentially arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models, characterized in that, The method includes: The detected inference task is judged. If it is a new inference task, the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain are obtained. The new inference task is transmitted to the optimal starting heterogeneous edge node using the D3QN agent. Multi-trajectory random sampling inference is performed through the optimal starting heterogeneous edge node using the thought chain-based input representation method to generate inference results. The reliability of the inference results is verified. If reliable, the final inference result is obtained. If unreliable, cross-layer collaborative scheduling is triggered, and the current global state space and hybrid action space are constructed for the new inference task that triggers cross-layer collaborative scheduling. The current global state space is input into the D3QN agent and combined with the hybrid action space. The value function of the current global state space and the advantage function of each candidate action are evaluated respectively, and the optimal action is selected from the candidate actions. The hybrid action space includes at least the candidate actions corresponding to the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed and the candidate action corresponding to the cloud server.
The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on this basis. If the optimal action is the cloud server, inference is performed through the cloud server to obtain the final inference result. If the optimal action is the heterogeneous edge node with the M+1 layer logic model deployed, the new inference task that triggers cross-layer collaborative scheduling is transmitted to the heterogeneous edge node with the M+1 layer logic model deployed. The task then enters the M+1 layer logic model and performs multi-trajectory random sampling inference again using the thought chain-based input representation method. This process is repeated until the final inference result is generated.

2. The method according to claim 1, characterized in that, The judgment of the detected inference task includes: The received inference task is handled using an event-driven mechanism and processed according to its type; If the inference task is the new inference task, the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain are obtained; If the inference task is a new inference task that triggers cross-layer collaborative scheduling, multi-trajectory random sampling inference is performed directly using the thought chain-based input representation method.

3. The method according to claim 1, characterized in that, The process of obtaining, if it is a new inference task, the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain, transmitting the new inference task to the optimal starting heterogeneous edge node using the D3QN agent, and performing multi-trajectory random sampling inference through the optimal starting heterogeneous edge node using the thought chain-based input representation method to generate inference results includes: When the new inference task arrives at the edge gateway, the new inference task is parsed to obtain the original input data and the preset requirement, wherein the preset requirement is the minimum consistency score requirement; The initial global state space and the deployment status of the M-layer logic model on the heterogeneous edge nodes are obtained, the D3QN agent is used to determine the optimal starting heterogeneous edge node from the heterogeneous edge nodes on which the M-layer logic model has been deployed, and the new inference task is transmitted to the optimal starting heterogeneous edge node; When the optimal starting heterogeneous edge node receives the original input data of the new inference task, it uses the M-layer logic model and the thought chain-based input representation method to perform multi-trajectory random sampling inference on the original input data to generate inference results, wherein the inference results contain a set of output answers for the different inference trajectories.

4. The method according to claim 3, characterized in that, The process of verifying whether the inference results are reliable, obtaining the final inference result if reliable, and triggering cross-layer collaborative scheduling if unreliable includes: Semantic similarity calculation and cluster analysis are performed on the inference results to obtain a consistency score; The consistency score is compared with the minimum consistency score requirement. If the consistency score is greater than the minimum consistency score requirement, the inference result is used as the final inference result and output. If the consistency score is less than the minimum consistency score requirement, cross-layer collaborative scheduling is triggered.

5. The method according to claim 4, characterized in that, The process of calculating semantic similarity and performing cluster analysis on the inference results to obtain a consistency score includes: The answers in the output answer set are compared pairwise using a semantic similarity function; Answers with high semantic similarity are grouped into the same cluster to obtain a cluster set, wherein the cluster set includes multiple answer clusters; The proportion of inference trajectories whose answers fall in the largest answer cluster of the cluster set is used as the consistency score of the M-layer logic model; If the consistency score is greater than the minimum consistency score requirement, the largest answer cluster is used as the inference result.

6. The method according to claim 1, characterized in that, The process of constructing the current global state space and the hybrid action space, inputting the current global state space into the D3QN agent and combining it with the hybrid action space, evaluating the value function of the current global state space and the advantage function of each candidate action, and selecting the optimal action from the candidate actions includes: The value function of the current global state space and the advantage function of each candidate action are aggregated to obtain the Q value corresponding to each candidate action; The Q values are corrected using the Double DQN mechanism, and then the ε-greedy strategy selects, from the corrected Q values, the candidate action corresponding to the largest Q value as the optimal action.

7. The method according to claim 6, characterized in that, The current global state space includes at least the intermediate data of the new inference task that triggers cross-layer collaborative scheduling; The process of transmitting the new inference task that triggers cross-layer collaborative scheduling to the heterogeneous edge node on which the M+1 layer logic model is deployed, entering the M+1 layer logic model, and again performing multi-trajectory random sampling inference using the thought chain-based input representation method includes: The intermediate data of the new inference task that triggers cross-layer collaborative scheduling is transmitted to the heterogeneous edge node on which the M+1 layer logic model is deployed and enters the M+1 layer logic model; Based on the intermediate data, parsed by the edge gateway, of the new inference task that triggers cross-layer collaborative scheduling, multi-trajectory random sampling inference is performed again using the thought chain-based input representation method; Based on the execution result of the new inference task that triggers cross-layer collaborative scheduling, the queue status of all heterogeneous edge nodes is updated, and the D3QN agent is iteratively optimized.

8. A cloud-edge collaborative inference optimization device based on deep reinforcement learning, characterized in that, The device is applied to a cloud-edge collaborative inference system and a cascaded inference model chain. The cloud-edge collaborative inference system includes an edge gateway, multiple heterogeneous edge nodes, and a cloud server. The cascaded inference model chain includes M sequentially arranged logical hierarchical models, M≥1, and each heterogeneous edge node is deployed with at least one of the logical hierarchical models. The device includes: Judgment module: used to judge the detected inference task. If it is a new inference task, it obtains the initial global state space and the deployment status of the M-layer logic model of the cascaded inference model chain. It then uses the D3QN agent to transmit the new inference task to the optimal starting heterogeneous edge node. Through the optimal starting heterogeneous edge node, it performs multi-trajectory random sampling inference using the thought chain-based input representation method to generate inference results and verifies whether the inference results are reliable. If reliable, it obtains the final inference result. If unreliable, it triggers cross-layer collaborative scheduling and constructs the current global state space and hybrid action space for the new inference task that triggers cross-layer collaborative scheduling. Selection module: used to input the current global state space into the D3QN agent and combine it with the hybrid action space, evaluate the value function of the current global state space and the advantage function of each candidate action respectively, and select the optimal action from the candidate actions. The hybrid action space includes at least the candidate actions corresponding to the heterogeneous edge nodes on which the M+1 layer logic model of the cascaded inference model chain is deployed and the candidate action corresponding to the cloud server.
The D3QN agent is pre-trained using a Markov decision process model, a system cost function, and a composite reward function built on this basis. The inference module is used to perform inference through the cloud server when the optimal action is the cloud server to obtain the final inference result. If the optimal action is the heterogeneous edge node with the M+1 layer logic model deployed, the new inference task that triggers cross-layer collaborative scheduling is transmitted to the heterogeneous edge node with the M+1 layer logic model deployed. The task then enters the M+1 layer logic model and performs multi-trajectory random sampling inference again using the thought chain-based input representation method. This process is repeated until the final inference result is generated.

9. A cloud-edge collaborative inference optimization device based on deep reinforcement learning, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements each step of the cloud-edge collaborative inference optimization method based on deep reinforcement learning as described in any one of claims 1-7.

10. A storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements each step of the cloud-edge collaborative inference optimization method based on deep reinforcement learning as described in any one of claims 1-7.