A Spatiotemporally Aware Multi-Agent Joint Task Scheduling Method for Edge Computing

By combining hybrid sequence prediction and graph neural networks in a multi-agent reinforcement learning approach, the problems of time-varying load, local observation, and dynamic task sequences in edge computing are solved, achieving efficient resource scheduling and improved system performance.

CN122309085A · Pending · Publication Date: 2026-06-30 · QINGHAI UNIV FOR NATITIES +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
QINGHAI UNIV FOR NATITIES
Filing Date
2026-04-30
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In edge computing environments, existing scheduling algorithms cannot effectively cope with time-varying loads, the coordination difficulties caused by local observation, or the curse of dimensionality in decision-making brought about by dynamic task sequences, which degrades system performance.

Method used

The method combines hybrid sequence prediction, graph neural networks, and multi-agent reinforcement learning. It predicts future loads using ARIMA and GRU models, achieves global state awareness using a graph neural network, performs fine-grained scheduling through a pointer network, and applies an action masking mechanism to ensure that every decision is feasible.

Benefits of technology

It achieves proactive resource planning, improves task completion time, resource utilization and energy efficiency, solves problems of time-varying load, local observation and dynamic task sequence, and significantly improves system performance.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a spatiotemporally aware multi-agent joint task scheduling method for edge computing, belonging to the field of resource scheduling technology. The method includes the following steps: S1, acquiring historical task arrival sequences and inputting them into a hybrid load prediction module to generate future load prediction values; S2, inputting the future load prediction values into a global state awareness module, using a graph neural network to transform the topology of the edge network into a high-dimensional feature vector, generating agent state observations; S3, inputting the agent state observations into a joint decision-making module, performing fine-grained scheduling through a pointer network to generate a scheduling scheme. This method can proactively predict task load, achieve global state awareness through a graph neural network, and perform fine-grained scheduling using a pointer network. Experimental results show that compared with existing methods, this method significantly improves task completion time, resource utilization, and energy efficiency.

Description

Technical Field

[0001] This invention belongs to the field of resource scheduling technology, specifically relating to a spatiotemporally aware multi-agent joint task scheduling method for edge computing.

Background Technology

[0002] Edge computing has emerged as a distributed computing paradigm. It pushes computing resources down to the edge of the wireless access network, allowing tasks to be processed closer to the data source. Research shows that edge computing can significantly reduce end-to-end latency and alleviate core network congestion. However, the physical resources of edge nodes are typically limited. Unlike the unlimited resource pool in the cloud, computing, storage, and energy are scarce for edge servers. Furthermore, edge environments are highly heterogeneous and dynamic. Without efficient scheduling mechanisms, tasks can easily back up on computing nodes, leading to a sharp decline in system performance. Achieving efficient task scheduling in resource-constrained edge networks is a current research hotspot. Despite numerous achievements, existing solutions still face three significant challenges when dealing with complex real-world scenarios.

[0003] The first issue is the lack of proactive response due to the time-varying nature of the load. Edge task traffic exhibits strong spatiotemporal correlation, encompassing long-term periodic trends and short-term random fluctuations. Most existing scheduling algorithms are reactive and cannot anticipate future computing power demands. When sudden surges in traffic arrive, the system often lacks the time to scale up or reserve resources. This lag directly leads to a deterioration in service quality.

[0004] Secondly, there are coordination difficulties caused by local observation. Multi-cluster edge networks are typical distributed systems. Due to communication overhead limitations, a single scheduling agent can usually only obtain the local resource status. It cannot grasp the load distribution of the entire network in real time. This local observation limits the achievement of the global optimum. Some nodes may drop tasks due to overload, while neighboring nodes are idle. The lack of global topology awareness makes cross-domain coordination extremely difficult.

[0005] Thirdly, there is the curse of dimensionality in decision-making caused by dynamic task sequences. In actual operation, the length of the arriving task queue changes in real time. However, traditional reinforcement learning algorithms typically rely on a fixed-dimensional action space. Standard neural networks struggle to directly handle such variable-length input sequences. Furthermore, agents are prone to outputting actions that violate physical constraints, such as allocating resources beyond the limit. This leads to difficulties in model training convergence and infeasible decisions.

Summary of the Invention

[0006] To address the aforementioned shortcomings in existing technologies, this invention provides a spatiotemporally aware multi-agent joint task scheduling method for edge computing. This method combines hybrid sequence prediction, graph neural networks (GNNs), and multi-agent reinforcement learning to proactively predict task load, achieve global state awareness through a GNN, and use a pointer network for fine-grained scheduling. This addresses three problems in existing technologies: the lack of proactive response due to time-varying load, the coordination difficulties caused by local observation, and the curse of dimensionality in decision-making brought about by dynamic task sequences.

[0007] To achieve the aforementioned objectives, the present invention employs the following technical solution: a spatiotemporally aware multi-agent joint task scheduling method for edge computing, comprising the following steps:

[0008] S1. Obtain the historical task arrival sequence and input it into the hybrid load prediction module to generate future load prediction values;

[0009] S2. Input the future load prediction value into the global state perception module, and use the graph neural network to transform the topology of the edge network into a high-dimensional feature vector to generate the state observation of the agent.

[0010] S3. Input the state observations of the agent into the joint decision-making module, perform fine-grained scheduling through the pointer network, and generate a scheduling scheme.

[0011] Furthermore: In S1, the hybrid load prediction module includes an ARIMA model and a GRU model. The ARIMA model is used to capture the long-term periodicity of task traffic, while the GRU model is used to deeply mine the short-term nonlinear characteristics in the sequence. S1 includes the following sub-steps:

[0012] S11. Input the historical task arrival sequence into the ARIMA model, extract and model the linear trend in the sequence, and output a robust linear prediction benchmark and residual sequence.

[0013] S12. Process the residual sequence according to the two-level preprocessing mechanism to generate standardized residual features;

[0014] S13. Input the standardized residual features into the GRU model, output the standardized predicted value of the next residual, and map it back to the real physical dimensions through inverse transformation to obtain the next residual.

[0015] S14. Based on the next residual and the linear prediction baseline, predict the task arrival amount of the next time slot as the future load forecast value.

[0016] Furthermore: In S11, the specific workflow of the ARIMA model is as follows:

[0017] Based on the Akaike information criterion (AIC), dynamic order optimization is performed over the candidate space to output a robust linear prediction benchmark; the historical fitted residual sequence is then calculated to separate out the nonlinear fluctuation characteristics.

[0018] The candidate space is specified over the orders $(p, d, q)$, where $p$, $d$, and $q$ correspond to the autoregression order, differencing order, and moving average order, respectively. The residual sequence comprises the original residual values at several time points, where the original residual value $e_t$ at time point $t$ is given by:

[0019] $$e_t = y_t - \hat{y}_t$$

[0020] In the formula, $e_t$ is the residual value at time point $t$, $y_t$ is the actual task count at time point $t$, $\hat{y}_t$ is the fitted value output by the model at time step $t$, and $t$ is the time index.

[0021] Furthermore: S12 includes the following sub-steps:

[0022] S121. Perform exponential moving average denoising on the residual sequence to generate denoised residual features:

[0023] $$\tilde{e}_t = \alpha e_t + (1 - \alpha)\,\tilde{e}_{t-1}$$

[0024] In the formula, $\alpha$ is the smoothing coefficient, $\tilde{e}_t$ is the denoised residual feature at time step $t$, and $\tilde{e}_{t-1}$ is the denoised residual feature at time step $t-1$;

[0025] S122. Perform sliding window normalization on the denoised residual features to generate the standardized residual features $z_t$:

[0026] $$z_t = \frac{\tilde{e}_t - \mu_t}{\max(\sigma_t, \epsilon)}$$

[0027] In the formula, $W$ is the size of the context window, $\mu_t$ is the local mean, $\max(\cdot)$ takes the maximum value, $\sigma_t$ is the local standard deviation, and $\epsilon$ is a tiny positive constant.

[0028] Furthermore: S14 includes the following sub-steps:

[0029] S141. Generate the initial combined prediction by linearly superposing the next-step residual $\hat{e}_{t+1}$ on the linear prediction benchmark $\hat{y}^{\mathrm{lin}}_{t+1}$, that is, $\hat{y}^{(0)}_{t+1} = \hat{y}^{\mathrm{lin}}_{t+1} + \hat{e}_{t+1}$;

[0030] S142. Construct a smooth anchor point and perform weak fusion based on a first weight to generate the fused prediction value;

[0031]

[0032]

[0033] In the formulas, the three statistics are the mean, the last observation, and the standard deviation of the historical arrival sequence over the most recent 12 time slots;

[0034] S143. Define a dynamic stabilization bandwidth and strictly limit the fused prediction value to a physically reasonable range;

[0035]

[0036]

[0037] In the formulas, the truncation function restricts an input value to a given range between a lower and an upper bound, and the bandwidth is set by a dynamic stabilization bandwidth scaling factor;

[0038] S144. To align with the physical semantics of discrete task scheduling, output the task arrival count for the next time slot as a non-negative integer, which serves as the future load prediction value;

[0039]

[0040] In the formula, the rounding function converts the clipped prediction to an integer.

[0041] Furthermore, in S2, the topology of the edge network is modeled as an undirected weighted graph $G = (V, E)$, where $V$ is the set of edge nodes of the edge network and $E$ is the set of physical communication connections between edge nodes. An adjacency matrix $A \in \mathbb{R}^{N \times N}$ is defined to describe the connection relationship between nodes, where $\mathbb{R}^{N \times N}$ denotes an $N \times N$ real matrix and $N$ is the number of edge nodes. If edge node $i$ is directly connected to edge node $j$, then the adjacency matrix entry $A_{ij} = 1$; otherwise $A_{ij} = 0$.

[0042] Furthermore, S3 includes the following sub-steps:

[0043] S31. Input the task sequence and the agent's state observation into the pointer network, use the attention mechanism to calculate the weight of each task in the task sequence, and select the task with the highest priority to determine the task execution order.

[0044] S32. Based on the determined task execution order, a masking mechanism is introduced in the action output stage through the policy network to calculate the probability distribution of scheduling actions and then generate a scheduling scheme.

[0045] S33. An on-policy proximal policy optimization (PPO) framework based on a single policy is adopted. During the continuous interaction between the agent and the environment, the total loss function is constructed with the goal of maximizing the cumulative reward, and the policy network and value network are updated to optimize the scheduling policy.

[0046] Further, S31 specifically involves: extracting features from the task sequence using the attention mechanism of the pointer network to generate the features of the tasks to be processed; combining the agent's state observation with these task features through an encoder to generate the corresponding hidden representations; and calculating the correlation between the current hidden state $h_t$ and the encoded features of each candidate task to generate the weight $u_j^t$ of each task in the task sequence;

[0047] $$u_j^t = v^{\top} \tanh\left(W_1 c_j + W_2 h_t\right)$$

[0048] In the formula, $v$, $W_1$, and $W_2$ are learnable weight matrices, $\top$ is the transpose symbol, $c_j$ is the encoded feature of candidate task $j$, $j$ is the task index, and $t$ is the decision step index.

[0049] Furthermore: In S32, the probability distribution $P_{i,k}$ of scheduling actions is calculated as:

[0050] $$P_{i,k} = \frac{\exp\left(s_{i,k} + m_{i,k}\right)}{\sum_{k'} \exp\left(s_{i,k'} + m_{i,k'}\right)}$$

[0051] In the formula, $s_{i,k}$ is the raw action score output by the policy network for assigning task $i$ to target edge node $k$, $m_{i,k}$ is the mask value for scheduling task $i$ to edge node $k$, $m_{i,k'}$ is the mask value for scheduling task $i$ to candidate edge node $k'$, $s_{i,k'}$ is the raw action score output by the policy network for assigning task $i$ to candidate edge node $k'$, $k'$ is the candidate edge node index, $k$ is the target edge node index, and $i$ is the task index;

[0052] The masking mechanism is as follows: if either of the following conditions is met, the mask value $m_{i,k}$ is set to negative infinity; otherwise $m_{i,k} = 0$:

[0053] (1) The available CPU or GPU computing power of edge node $k$ is lower than the minimum requirement of the task;

[0054] (2) The communication link between the source node and the target node is interrupted, or its congestion level exceeds the threshold.

[0055] Furthermore: In S33, the total loss function of the single-policy on-policy proximal policy optimization framework is given by:

[0056]

[0057] In the formula, the three terms are the policy function loss, the value function loss, and the policy entropy used to encourage exploration. The value function loss weight coefficient controls the intensity of value network updates, and the entropy regularization weight coefficient controls the exploration intensity of the policy;

[0058]

[0059]

[0060] In the formula, the expectation operator averages the results over a batch of samples; the remaining quantities are the current value network's prediction of the state value at a given time slot, the target state value, the old value network's state value prediction at the same time slot, the value clipping threshold, and the clipped value;

[0061]

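[0062] In the formula, $\min$ takes the minimum value; the probability ratio between the new and old policies is $r_t(\theta) = \pi_{\theta}(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, where $\pi_{\theta}$ is the new policy and $\pi_{\theta_{\mathrm{old}}}$ is the old policy; and $\hat{A}_t$ is the advantage function;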
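The loss terms described in [0055] to [0062] can be sketched as follows. This is a minimal illustration of the widely used clipped PPO objective, not the patent's exact formulas, which did not survive extraction. The sign convention of the total loss, the squared-error form of the value loss, and the default values of `clip_eps`, `value_eps`, `c_value`, and `c_entropy` are all assumptions.

```python
import math


def clip(x, lo, hi):
    """Restrict x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def ppo_losses(logp_new, logp_old, adv, v_new, v_old, v_target, entropy,
               clip_eps=0.2, value_eps=0.2, c_value=0.5, c_entropy=0.01):
    """Return (total, policy_loss, value_loss) averaged over a batch.

    All hyperparameter defaults are illustrative assumptions.
    """
    n = len(adv)
    policy_terms, value_terms = [], []
    for i in range(n):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(logp_new[i] - logp_old[i])
        unclipped = ratio * adv[i]
        clipped = clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv[i]
        policy_terms.append(min(unclipped, clipped))

        # Value clipping: keep the new prediction close to the old one.
        v_clipped = v_old[i] + clip(v_new[i] - v_old[i], -value_eps, value_eps)
        value_terms.append(max((v_new[i] - v_target[i]) ** 2,
                               (v_clipped - v_target[i]) ** 2))

    policy_loss = -sum(policy_terms) / n   # minimizing this maximizes the surrogate
    value_loss = sum(value_terms) / n
    # Assumed sign convention: entropy is subtracted so that more exploration lowers the loss.
    total = policy_loss + c_value * value_loss - c_entropy * entropy
    return total, policy_loss, value_loss
```

Taking the minimum of the clipped and unclipped terms removes any incentive to move the probability ratio outside the clipping interval, which is what keeps each policy update small.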

[0063] For advantage function estimation, generalized advantage estimation (GAE) is adopted. First, the original reward is standardized; a scaling factor and a cutoff threshold are then applied to clip it, yielding the clipped reward $\tilde{r}_t$;

[0064]

[0065] In the formula, the input is the standardized reward. The temporal-difference error is then calculated as $\delta_t = \tilde{r}_t + \gamma V(s_{t+1}) - V(s_t)$, where $\gamma$ is the discount factor measuring the degree to which future rewards influence current decisions, $V(s_t)$ is the state value estimate at time slot $t$, and $V(s_{t+1})$ is the state value estimate at time slot $t+1$. The advantage function is derived recursively, and the target state value is defined. Before being sent to the network for update, the advantage function is further standardized to zero mean and unit variance;
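The reward preprocessing and advantage recursion described in [0063] to [0065] can be sketched as below. This is the standard GAE recursion under stated assumptions: the trace parameter `lam`, the order in which scaling and clipping are applied, the target value defined as advantage plus state value, and all default values are hypothetical, since the corresponding formulas are missing from the text.

```python
from statistics import mean, pstdev


def shape_rewards(raw, scale=1.0, cutoff=5.0, eps=1e-8):
    """Standardize rewards, scale them, then clip to [-cutoff, cutoff] (assumed order)."""
    mu, sigma = mean(raw), pstdev(raw)
    return [max(-cutoff, min(cutoff, scale * (r - mu) / (sigma + eps))) for r in raw]


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards[t] and values[t] belong to time slot t; last_value is the value
    estimate of the state that follows the final slot.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]  # temporal-difference error
        running = delta + gamma * lam * running              # recursive accumulation
        advantages[t] = running
    # Assumed definition of the target state value.
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets


def standardize(xs, eps=1e-8):
    """Rescale a batch to zero mean and unit variance."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / (sigma + eps) for x in xs]
```

Clipping the standardized reward bounds the size of any single temporal-difference error, which is one common way to keep value targets from being dominated by rare extreme slots.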

[0066] The original reward is designed as a weighted combination of multiple system costs and utility incentives;

[0067]

[0068] In the formula, the cost and utility terms are the task delay cost, the system energy consumption, the penalty for overdue service, the average queuing delay, the tail delay, the transmission delay, the task turnaround time, the number of tasks completed, and the resource utilization. Each cost term has its own weighting coefficient: one each for the task delay cost, the system energy consumption, the overdue (breach-of-agreement) penalty, the average queuing delay, the tail 95th-percentile delay, the transmission delay, and the task turnaround time. The number of tasks completed and the resource utilization each have a positive incentive coefficient;
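A weighted combination of this kind can be sketched as follows. The sign convention (utilities added, costs subtracted), the dictionary keys, and every numeric weight are assumptions made for illustration; the source formula is not available.

```python
def scheduling_reward(costs, utilities, cost_weights, utility_weights):
    """Weighted utilities minus weighted costs for one time slot.

    costs / utilities map a term name to its measured value; the weight
    dictionaries map the same names to non-negative coefficients.
    """
    penalty = sum(cost_weights[name] * value for name, value in costs.items())
    bonus = sum(utility_weights[name] * value for name, value in utilities.items())
    return bonus - penalty


# Hypothetical values for a single time slot.
example_reward = scheduling_reward(
    costs={"task_delay": 2.0, "energy": 1.0, "tail_delay_p95": 4.0},
    utilities={"tasks_completed": 3.0, "resource_utilization": 0.8},
    cost_weights={"task_delay": 0.5, "energy": 1.0, "tail_delay_p95": 0.25},
    utility_weights={"tasks_completed": 1.0, "resource_utilization": 2.0},
)
```

Keeping each cost under its own coefficient makes it possible to trade latency against energy by changing weights alone, without touching the learning algorithm.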

[0069] In S33, the policy network and value network are updated as follows: the networks are updated once each time a set number of rounds has accumulated, with the learning rate following a cosine annealing schedule and the action exploration intensity following a linear annealing decay.

[0070]

[0071]

[0072] In the formulas, the quantities are the action exploration intensity at the current update, the linear annealing schedule factor, the initial exploration intensity, the final exploration intensity, the current number of network updates, and the total number of updates over which the exploration intensity completes its linear decay.
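The two schedules named in [0069] to [0072] can be sketched as below. The exact formulas are missing from the text, so this uses the common forms: a linear interpolation clamped at the end of the decay period, and a half-cosine learning rate curve. The function names, the clamp, and the `lr_min` floor are assumptions.

```python
import math


def exploration_intensity(update_idx, decay_updates, start, end):
    """Linearly anneal exploration from start to end over decay_updates updates."""
    factor = min(1.0, update_idx / decay_updates)  # linear annealing schedule factor
    return start + factor * (end - start)


def cosine_lr(update_idx, total_updates, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate, falling from lr_max to lr_min."""
    progress = min(1.0, update_idx / total_updates)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

Clamping the factor at 1 keeps the exploration intensity at its final value once the decay period is over, instead of letting it continue past the intended floor.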

[0073] The beneficial effects of this invention are as follows: This invention provides a spatiotemporally aware multi-agent joint task scheduling method for edge computing, which innovatively integrates hybrid sequence prediction, graph representation learning, and multi-agent decision-making techniques. Compared with existing technologies, it has the following advantages:

[0074] (1) A proactive resource planning mechanism based on a hybrid model is proposed. To address the scheduling lag problem, this invention designs a hybrid load prediction module based on the ARIMA and GRU models. The ARIMA model is used to capture the long-term periodicity of task traffic. The GRU model is used to deeply mine the short-term nonlinear characteristics in the sequence. This hybrid mechanism realizes the transformation from passive response to proactive planning. The system can perceive load peaks in advance based on the prediction results and dynamically adjust the resource reservation strategy, solving the problem of lack of initiative caused by the time-varying nature of load in the prior art.

[0075] (2) This invention constructs a global state representation method based on graph neural networks. Addressing the limitations of local observation, this invention models the edge cluster network as a graph structure. Utilizing the message passing mechanism of graph neural networks, agents can aggregate feature information from neighboring nodes. This method endows each agent with an approximately global perspective. Agents can perceive the resource distribution of the entire network with low communication overhead, thereby making more accurate collaborative offloading decisions and solving the problem of collaborative difficulties caused by local observation in existing technologies.

[0076] (3) This invention designs a fine-grained scheduling algorithm that integrates pointer networks. For dynamically changing task queues, we integrate pointer networks into a multi-agent reinforcement learning architecture. Utilizing its unique attention mechanism, the agent can flexibly select the optimal task from a variable-length candidate queue. Simultaneously, this invention introduces an action masking mechanism to filter invalid decisions. This ensures that resource allocation instructions are always within physical constraints, significantly improving the utilization efficiency of computing resources and solving the problem of the curse of decision dimensionality caused by dynamic task sequences in existing technologies.

[0077] (4) The experimental results show that, compared with existing methods, this method achieves significant improvements in task completion time, resource utilization, and energy efficiency.

Attached Figure Description

[0078] Figure 1 is a flowchart of the spatiotemporally aware multi-agent joint task scheduling method for edge computing according to the present invention.

[0079] Figure 2 is a schematic diagram of the spatiotemporally aware resource scheduling framework proposed in this invention.

[0080] Figure 3 shows the experimental results comparing the global performance of the proposed method with two representative benchmark strategies on the core evaluation indicators.

[0081] Figure 4 shows the experimental results comparing the global performance of the proposed method with two representative benchmark strategies in terms of system throughput and resource completion rate.

Detailed Implementation

[0082] The specific embodiments of the present invention are described below to enable those skilled in the art to understand the present invention. However, it should be understood that the present invention is not limited to the scope of the specific embodiments. For those skilled in the art, various changes are obvious as long as they are within the spirit and scope of the present invention as defined and determined by the appended claims. All inventions utilizing the concept of the present invention are protected.

[0083] As shown in Figure 1, in one embodiment of the present invention, a spatiotemporally aware multi-agent joint task scheduling method for edge computing includes the following steps:

[0084] S1. Obtain the historical task arrival sequence and input it into the hybrid load prediction module to generate future load prediction values;

[0085] S2. Input the future load prediction value into the global state perception module, and use the graph neural network to transform the topology of the edge network into a high-dimensional feature vector to generate the state observation of the agent.

[0086] S3. Input the state observations of the agent into the joint decision-making module, perform fine-grained scheduling through the pointer network, and generate a scheduling scheme.

[0087] As shown in Figure 2, this invention proposes a spatiotemporally aware resource scheduling framework that combines hybrid sequence prediction, GNN, and multi-agent reinforcement learning. The proposed method can proactively predict task load, achieve global state awareness through GNN, and use a pointer network for fine-grained scheduling.

[0088] In S1, the hybrid load prediction module includes an ARIMA model and a GRU (Gated Recurrent Unit) model. The ARIMA model is used to capture the long-term periodicity of task traffic, while the GRU model is used to deeply mine the short-term nonlinear characteristics in the sequence. S1 includes the following sub-steps:

[0089] S11. Input the historical task arrival sequence into the ARIMA model, extract and model the linear trend in the sequence, and output a robust linear prediction benchmark and residual sequence.

[0090] S12. Process the residual sequence according to the two-level preprocessing mechanism to generate standardized residual features;

[0091] S13. Input the standardized residual features into the GRU model, output the standardized predicted value of the next residual, and map it back to the real physical dimensions through inverse transformation to obtain the next residual.

[0092] S14. Based on the next residual and the linear prediction baseline, predict the task arrival amount of the next time slot as the future load forecast value.

[0093] In this embodiment, to significantly improve the prediction accuracy and system robustness of edge computing environments under highly non-stationary load scenarios, a cascaded prediction architecture combining linear trend modeling, residual deep learning, online adaptive updating, and output stabilization is proposed. Given a historical task arrival sequence, the system accurately predicts the number of tasks arriving in the next time slot.

[0094] In S11, the workflow of the ARIMA model is as follows:

[0095] The ARIMA model performs dynamic order optimization over the candidate space based on the Akaike Information Criterion (AIC) to output a robust linear prediction benchmark; it then calculates the historical fitted residual sequence to separate out the nonlinear fluctuation characteristics. When the underlying statistical analysis library is unavailable, it automatically degrades to an autoregressive integrated (ARI) approximation search to preserve the availability of edge nodes.

[0096] The candidate space is specified over the orders $(p, d, q)$, where $p$, $d$, and $q$ correspond to the autoregression order, differencing order, and moving average order, respectively. The residual sequence comprises the original residual values at several time points, where the original residual value $e_t$ at time point $t$ is given by:

[0097] $$e_t = y_t - \hat{y}_t$$

[0098] In the formula, $e_t$ is the residual value at time point $t$, $y_t$ is the actual task count at time point $t$, $\hat{y}_t$ is the fitted value output by the model at time step $t$, and $t$ is the time index;
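The fallback path mentioned in [0095] can be illustrated with a NumPy-only sketch that selects an autoregressive order by AIC on the differenced series and returns the fitted residuals. It is a simplified stand-in under stated assumptions, not the patent's implementation: the differencing order `d` is fixed rather than searched, there is no moving average term, the AIC uses the Gaussian least-squares form, and the name `select_ar_order` and the bound `p_max` are hypothetical.

```python
import numpy as np


def select_ar_order(series, d=1, p_max=3):
    """Pick the AR order p in 0..p_max with the lowest AIC on the d-times differenced series.

    Returns (best_p, residuals). Every candidate is fitted on the same
    target samples so that the AIC values are comparable.
    """
    y = np.diff(np.asarray(series, dtype=float), n=d)
    target = y[p_max:]
    n = len(target)
    best = None
    for p in range(p_max + 1):
        # Design matrix: intercept plus p lagged copies of the series.
        lags = [y[p_max - k: len(y) - k] for k in range(1, p + 1)]
        X = np.column_stack([np.ones(n)] + lags)
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ coef            # e_t = y_t - fitted value
        rss = float(resid @ resid)
        aic = n * np.log(max(rss / n, 1e-12)) + 2 * (p + 1)
        if best is None or aic < best[0]:
            best = (aic, p, resid)
    return best[1], best[2]
```

The residuals returned here play the role of the sequence $e_t$ that is passed on to the preprocessing stage and then to the GRU.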

[0099] S12 includes the following steps:

[0100] S121. Perform exponential moving average (EMA) denoising on the residual sequence to generate denoised residual features:

[0101] $$\tilde{e}_t = \alpha e_t + (1 - \alpha)\,\tilde{e}_{t-1}$$

[0102] In the formula, $\alpha$ is the smoothing coefficient, which is set to a fixed value in this embodiment, $\tilde{e}_t$ is the denoised residual feature at time step $t$, and $\tilde{e}_{t-1}$ is the denoised residual feature at time step $t-1$;

[0103] S122. Perform sliding window normalization on the denoised residual features to generate the standardized residual features $z_t$:

[0104] $$z_t = \frac{\tilde{e}_t - \mu_t}{\max(\sigma_t, \epsilon)}$$

[0105] In the formula, $W$ is the size of the context window, $\mu_t$ is the local mean, $\max(\cdot)$ takes the maximum value, $\sigma_t$ is the local standard deviation, and $\epsilon$ is a tiny positive constant. In this embodiment, for each context window of length 8, the local mean $\mu_t$ and standard deviation $\sigma_t$ are calculated over its 12 most recent historical samples and used for the standardization transformation; the tiny constant $\epsilon$ is introduced to prevent division-by-zero errors in numerical computation.
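The two-level preprocessing of S121 and S122 can be sketched in plain Python as follows. The 12-sample history comes from the embodiment; the smoothing coefficient passed as `alpha`, the use of the population standard deviation, the initialization of the EMA with the first residual, and the value of `eps` are assumptions.

```python
from statistics import mean, pstdev


def ema_denoise(residuals, alpha):
    """Exponential moving average: new = alpha * current + (1 - alpha) * previous."""
    smoothed = []
    prev = None
    for e in residuals:
        prev = e if prev is None else alpha * e + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def sliding_normalize(values, history=12, eps=1e-6):
    """Standardize each value with the mean and std of its most recent `history` samples."""
    out = []
    for t, v in enumerate(values):
        window = values[max(0, t - history + 1): t + 1]
        mu = mean(window)
        sigma = pstdev(window)
        # max(sigma, eps) guards against division by zero on flat windows.
        out.append((v - mu) / max(sigma, eps))
    return out


# Hypothetical residuals; alpha = 0.3 is an illustrative choice only.
features = sliding_normalize(ema_denoise([4.0, -2.0, 7.0, 1.0, -3.0], alpha=0.3))
```

Because the statistics are local to each window, a slow drift in the residual level does not push the GRU inputs out of the range it was trained on.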

[0106] In S13, to further enhance the model's dynamic adaptability to time-varying environments, an online residual network update mechanism was designed for the hybrid load prediction module. After each prediction during the training phase, the system constructs a sliding window sample using the latest real residual sequence, extracts the most recent 24 sets of samples for incremental fine-tuning, and performs mean squared error (MSE) gradient descent for four optimization steps. This mechanism achieves continuous correction of the residual prediction model without significantly increasing the computational overhead of edge devices.

[0107] S14 includes the following steps:

[0108] S141. Generate the initial combined prediction by linearly superposing the next-step residual $\hat{e}_{t+1}$ on the linear prediction benchmark $\hat{y}^{\mathrm{lin}}_{t+1}$, that is, $\hat{y}^{(0)}_{t+1} = \hat{y}^{\mathrm{lin}}_{t+1} + \hat{e}_{t+1}$;

[0109] S142. To prevent prediction overshoot caused by extreme sudden noise, an adaptive stabilization band clipping mechanism is introduced. First, a smooth anchor point is constructed, and weak fusion is performed based on a first weight to generate the fused prediction value;

[0110]

[0111]

[0112] In the formulas, the three statistics are the mean, the last observation, and the standard deviation of the historical arrival sequence; in this embodiment they are computed over the most recent 12 time slots.

[0113] S143. Define a dynamic stabilization bandwidth and strictly limit the fused prediction value to a physically reasonable range;

[0114]

[0115]

[0116] In the formulas, the truncation function restricts an input value to a given range between a lower and an upper bound, and the bandwidth is set by a dynamic stabilization bandwidth scaling factor;

[0117] S144. To align with the physical semantics of discrete task scheduling, output the task arrival count for the next time slot as a non-negative integer, which serves as the future load prediction value;

[0118]

[0119] In the formula, the rounding function converts the clipped prediction to an integer.
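Steps S141 to S144 can be sketched as one function. Only the overall flow (superpose, fuse with an anchor, clip to a band, round to a non-negative integer) and the 12-slot statistics come from the text. The anchor as the average of the recent mean and the last observation, the band as a multiple of the recent standard deviation centered on the anchor, and the defaults `fuse_weight` and `band_scale` are assumptions, because the original formulas are missing.

```python
from statistics import mean, pstdev


def stabilize_forecast(linear_pred, residual_pred, history,
                       fuse_weight=0.1, band_scale=3.0, window=12):
    """Combine, fuse, clip, and round a next-slot arrival forecast."""
    recent = history[-window:]
    mu, last, sigma = mean(recent), recent[-1], pstdev(recent)

    combined = linear_pred + residual_pred                        # S141: linear superposition
    anchor = 0.5 * (mu + last)                                    # S142: assumed anchor form
    fused = (1.0 - fuse_weight) * combined + fuse_weight * anchor

    band = band_scale * sigma                                     # S143: assumed bandwidth form
    lower, upper = max(0.0, anchor - band), anchor + band
    clipped = min(max(fused, lower), upper)

    return max(0, round(clipped))                                 # S144: non-negative integer
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer; a deployment that needs round-half-up would have to replace it.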

[0120] In S2, the topology of the edge network is modeled as an undirected weighted graph $G = (V, E)$, where $V$ is the set of edge nodes of the edge network and $E$ is the set of physical communication connections between edge nodes. To support subsequent global state awareness based on graph neural networks (GNNs), an adjacency matrix $A \in \mathbb{R}^{N \times N}$ is defined to describe the connection relationship between nodes, where $\mathbb{R}^{N \times N}$ denotes an $N \times N$ real matrix and $N$ is the number of edge nodes. If edge node $i$ is directly connected to edge node $j$, then the adjacency matrix entry $A_{ij} = 1$; otherwise $A_{ij} = 0$. This graph-based modeling approach enables the system to effectively capture spatial dependencies between edge nodes. By passing messages between neighboring nodes, the global state awareness module allows each agent to perceive the resource distribution across the entire network.

[0121] To capture the characteristics of edge nodes themselves, this embodiment introduces an adjacency matrix with self-loops, $\tilde{A} = A + I$, where $I$ is the identity matrix. The initial input feature vector of each edge node fuses multi-dimensional information, including not only the current CPU/GPU utilization, queue length, and remaining bandwidth, but also the explicitly integrated future load predictions. This feature construction method ensures that the input data possesses both spatial immediacy and temporal foresight.

[0122] To effectively extract spatial dependency features from non-Euclidean topological graphs, this embodiment employs a Graph Convolutional Network (GCN). The core mechanism of GCN is "message passing," where each node aggregates feature information from its neighbors to update its own latent representation. This embodiment uses multiple stacked graph convolutional layers. The node feature matrix $H^{(l+1)}$ updated by layer $l$ is given by:

[0123] $$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

[0124] Here, $H^{(l)}$ is the node feature matrix of layer $l$, $H^{(0)} = X$ is the initial feature matrix, $W^{(l)}$ is the learnable weight matrix of layer $l$, and $\sigma(\cdot)$ is a non-linear activation function (such as ReLU). $\tilde{D}$ is the degree matrix of $\tilde{A}$ and is used for symmetric normalization to prevent feature values from exploding or vanishing during multi-layer propagation. An $L$-layer graph convolution aggregates the originally isolated node features multiple times, so that the final output embedding of each node not only encodes its own resource state but also implicitly contains the load distribution information within its $L$-hop neighborhood. The high-dimensional feature vectors produced by GCN encoding constitute the state observations of the agents. This design is crucial, as it transforms the unstructured network topology into structured semantic vectors, addressing the problem of information asymmetry in multi-agent cooperation. For example, when all the neighbors of a node are under high load, GCN significantly alters the feature representation of that node through aggregation operations, thereby suppressing the tendency of agents to offload tasks to that region. This global perception mechanism based on graph neural networks provides policy inputs rich in spatial context information for subsequent joint decision-making, and is a key technical support for achieving network-wide load balancing.
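The propagation rule of a single GCN layer, with self-loops and symmetric degree normalization as described above, can be sketched in NumPy. The three-node path topology, the scalar feature, and the identity weight used in the example are hypothetical and chosen only to make the hop-by-hop spread of information visible.

```python
import numpy as np


def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    deg = A_tilde.sum(axis=1)                 # diagonal of the degree matrix of A_tilde
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)     # ReLU activation


# Hypothetical topology: three edge nodes in a line, 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
# One scalar feature per node; only node 2 carries load.
H0 = np.array([[0.0], [0.0], [1.0]])
W = np.array([[1.0]])

H1 = gcn_layer(A, H0, W)   # after one layer, node 0 still sees nothing
H2 = gcn_layer(A, H1, W)   # after two layers, node 2's load reaches node 0
```

In this toy case node 0 is two hops from node 2, so its embedding only becomes non-zero after the second layer, which matches the statement that an $L$-layer stack encodes the $L$-hop neighborhood.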

[0125] S3 includes the following steps:

[0126] S31. Input the task sequence and the agent's state observation into the pointer network, use the attention mechanism to calculate the weight of each task in the task sequence, and select the task with the highest priority to determine the task execution order.

[0127] S32. Based on the determined task execution order, a masking mechanism is introduced in the action output stage through the policy network to calculate the probability distribution of scheduling actions and then generate a scheduling scheme.

[0128] S33. A single-policy, on-policy Proximal Policy Optimization (PPO) framework is adopted. During the continuous interaction between the agent and the environment, the total loss function is constructed with the objective of maximizing the cumulative reward, and the policy network and value network are updated to optimize the scheduling policy, so that it adapts to changes in the edge environment.

[0129] S31 specifically involves: extracting features from the task sequence using the attention mechanism of the pointer network to generate the features of the tasks to be processed; combining the agent's state observation with these task features through an encoder to generate the corresponding hidden representations; and calculating the correlation between the current hidden state $h_i$ and the encoded feature $e_j$ of each candidate task to generate the weight $u_j^i$ of each task in the task sequence:

[0130] $$u_j^i = v^{\top}\tanh\left(W_1 e_j + W_2 h_i\right)$$

[0131] where $W_1$, $W_2$ and $v$ are learnable weight parameters, $\top$ is the transpose symbol, $e_j$ is the encoded feature of candidate task $j$, $j$ is the task index, and $i$ indexes the decision steps. Through this pointing mechanism, the agent can flexibly select the optimal task from a variable-length candidate queue without being limited by a preset action space.
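As a non-limiting illustration, the pointer-style scoring of a variable-length task queue can be sketched as follows. The additive-attention form, the softmax normalization of the scores, the two-dimensional features, and all parameter values are assumptions chosen for the example; in practice the parameters would be learned.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def pointer_weights(task_feats, hidden, w1, w2, v):
    """Score each candidate as v^T tanh(W1 e_j + W2 h), then softmax the scores."""
    proj_h = matvec(w2, hidden)
    scores = []
    for e_j in task_feats:
        pre = [a + b for a, b in zip(matvec(w1, e_j), proj_h)]
        scores.append(sum(vi * math.tanh(p) for vi, p in zip(v, pre)))
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Hypothetical parameters and a queue of three candidate tasks.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
V = [1.0, -0.5]
hidden = [0.1, 0.2]
queue = [[0.2, 0.5], [0.9, 0.1], [0.4, 0.4]]

weights = pointer_weights(queue, hidden, W1, W2, V)
chosen = max(range(len(queue)), key=lambda j: weights[j])  # task picked at this step
```

Because the function iterates over whatever candidates are present, the same parameters serve queues of any length, which is the property the pointer mechanism is used for here.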

[0132] In S32, to ensure the legality of decisions, this invention introduces an action masking mechanism that filters out actions violating physical constraints, and calculates the probability distribution of scheduling actions $P(k \mid j)$ as:

[0133] $$P(k \mid j) = \frac{\exp\left(z_{j,k} + m_{j,k}\right)}{\sum_{k'} \exp\left(z_{j,k'} + m_{j,k'}\right)}$$

[0134] where $z_{j,k}$ is the raw action score output by the policy network for assigning task $j$ to target edge node $k$; $m_{j,k}$ is the mask value for scheduling task $j$ to edge node $k$; $m_{j,k'}$ is the mask value for scheduling task $j$ to candidate edge node $k'$; $z_{j,k'}$ is the raw action score output by the policy network for assigning task $j$ to candidate edge node $k'$; $k'$ is the index of candidate edge nodes; $k$ is the index of the target edge node; and $j$ is the task index. The scheduling action probability distribution gives the probability that task $j$ is scheduled to target edge node $k$. Through this modified Softmax function, the probability of selecting an invalid decision is forced to zero. This mechanism not only accelerates the convergence of the reinforcement learning model but also ensures that the resource allocation instructions output by the system always remain within physical capacity limits. After the task and its target node are selected, the decoder further determines the fine-grained computing power allocation, thereby completing the resource scheduling process.

[0135] The masking mechanism is as follows: if either of the following conditions is met, the mask value $m_{j,k}$ is set to negative infinity; otherwise $m_{j,k} = 0$:

[0136] (1) the available CPU or GPU computing power of edge node $k$ is lower than the minimum requirement of task $j$;

[0137] (2) the communication link between the source node and the target node is interrupted, or its congestion level exceeds the threshold.

[0138] In this embodiment, the action masking mechanism scans the remaining resources of all edge nodes in real time at each decision step. Mask generation is based on two indicators. The first is resource availability: if the available CPU or GPU computing power of node $k$ is lower than the minimum requirement of task $j$, the node is marked as unreachable. The second is network connectivity: if the communication link between the task's source node and the target node is interrupted, or its congestion exceeds a threshold, the corresponding scheduling action is also filtered out.
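As a non-limiting illustration, the two masking conditions and the masked Softmax can be sketched as follows. The dictionary field names, the per-target `link_up` and `congestion` lists, the threshold value, and the raw scores are hypothetical and serve only to make the example runnable.

```python
import math

NEG_INF = float("-inf")

def build_mask(task, nodes, link_up, congestion, max_congestion):
    """Return 0.0 for feasible target nodes and -inf for infeasible ones."""
    mask = []
    for k, node in enumerate(nodes):
        # Condition (1): not enough CPU or GPU computing power.
        short = node["cpu"] < task["cpu"] or node["gpu"] < task["gpu"]
        # Condition (2): link from the source node is down or too congested.
        bad_link = (not link_up[k]) or congestion[k] > max_congestion
        mask.append(NEG_INF if (short or bad_link) else 0.0)
    return mask

def masked_softmax(scores, mask):
    """Softmax over scores + mask; masked entries get probability exactly 0."""
    shifted = [s + m for s, m in zip(scores, mask)]
    top = max(shifted)
    if top == NEG_INF:
        raise ValueError("no feasible edge node for this task")
    exps = [math.exp(x - top) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical task and four candidate edge nodes.
task = {"cpu": 4, "gpu": 1}
nodes = [{"cpu": 4, "gpu": 0},   # lacks GPU
         {"cpu": 8, "gpu": 2},   # feasible
         {"cpu": 2, "gpu": 1},   # lacks CPU
         {"cpu": 16, "gpu": 4}]  # feasible
link_up = [True, True, True, True]
congestion = [0.0, 0.3, 0.2, 0.5]

mask = build_mask(task, nodes, link_up, congestion, max_congestion=0.8)
probs = masked_softmax([2.0, 0.5, 3.0, 0.5], mask)
```

Here nodes 0 and 2 carry the highest raw scores but are masked, so all probability mass is shared between the two feasible nodes.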

[0139] In S33, the total loss function $L^{\text{total}}$ of the single-policy, on-policy Proximal Policy Optimization (PPO) framework is expressed as:

[0140] $$L^{\text{total}} = L^{\text{policy}} + c_v L^{\text{value}} - c_e H$$

[0141] where $L^{\text{policy}}$ is the policy function loss, $L^{\text{value}}$ is the value function loss, $H$ is the policy entropy used to encourage exploration, $c_v$ is the value function loss weight coefficient, used to control the intensity of value network updates, and $c_e$ is the entropy regularization weight coefficient, used to control the exploration intensity of the policy.

[0142] During backpropagation, gradient norm clipping is applied. In addition, if the approximate KL divergence between the old and new policies exceeds a preset threshold, the training iterations that optimize the policy network and value network are terminated early, which effectively suppresses overshoot in policy updates.

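[0143] $$L^{\text{value}} = \mathbb{E}_t\left[\max\left(\left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2,\ \left(V_t^{\text{clip}} - V_t^{\text{targ}}\right)^2\right)\right]$$

[0144] $$V_t^{\text{clip}} = V_{\theta_{\text{old}}}(s_t) + \operatorname{clip}\left(V_\theta(s_t) - V_{\theta_{\text{old}}}(s_t),\ -\epsilon_v,\ \epsilon_v\right)$$

[0145] where $\mathbb{E}_t$ is the expectation operator, which averages over a batch of samples; $V_\theta(s_t)$ is the current value network's prediction for the state at time slot $t$; $V_t^{\text{targ}}$ is the target state value; $V_{\theta_{\text{old}}}(s_t)$ is the state value prediction output by the old value network at time slot $t$; $\epsilon_v$ is the value clipping threshold; and $V_t^{\text{clip}}$ is the clipped value. To prevent drastic fluctuations in value network updates, this embodiment introduces the clipped value into the value function loss.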
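As a non-limiting illustration, the clipped value loss, the combination of the loss terms, and the KL-based early termination can be sketched as follows. The max-of-squares form of the value loss, the simple log-ratio estimator of the approximate KL divergence, the sign convention of the total loss, and every numeric value are assumptions made for this example.

```python
def clip(x, lo, hi):
    """Restrict x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def clipped_value_loss(v_new, v_old, v_target, eps_v):
    """Mean of max(unclipped, clipped) squared errors over a batch."""
    terms = []
    for vn, vo, vt in zip(v_new, v_old, v_target):
        v_clip = vo + clip(vn - vo, -eps_v, eps_v)
        terms.append(max((vn - vt) ** 2, (v_clip - vt) ** 2))
    return sum(terms) / len(terms)

def approx_kl(logp_old, logp_new):
    """Batch mean of log pi_old - log pi_new (a common KL estimate)."""
    return sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)

def total_loss(policy_loss, value_loss, entropy, c_v, c_e):
    """Policy loss plus weighted value loss minus weighted entropy bonus."""
    return policy_loss + c_v * value_loss - c_e * entropy

def epochs_before_stop(kl_per_epoch, kl_limit):
    """Count the optimisation epochs that run before the KL guard trips."""
    done = 0
    for kl in kl_per_epoch:
        if kl > kl_limit:
            break          # terminate this round of updates early
        done += 1
    return done

# Hypothetical single-sample batch.
l_value = clipped_value_loss([2.0], [1.0], [3.0], eps_v=0.5)
l_total = total_loss(-1.2, l_value, entropy=0.7, c_v=0.5, c_e=0.01)
n_epochs = epochs_before_stop([0.001, 0.004, 0.03, 0.002], kl_limit=0.02)
```

Taking the larger of the clipped and unclipped errors means a large jump of the value prediction away from the old prediction cannot reduce the loss, which is what limits the update size.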

[0146] $$L^{\text{policy}} = -\mathbb{E}_t\left[\min\left(\rho_t A_t,\ \operatorname{clip}\left(\rho_t,\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right]$$

[0147] where $\min$ takes the minimum value; $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies; $\pi_\theta$ is the new policy; $\pi_{\theta_{\text{old}}}$ is the old policy; $\epsilon$ is the policy clipping threshold; and $A_t$ is the advantage function. For advantage estimation, this embodiment employs generalized advantage estimation (GAE). To improve training stability, the original reward is first standardized and then amplitude-limited using a scaling factor $\kappa$ and a cutoff threshold $c$, giving the limited reward $\hat{r}_t$:

[0148] $$\hat{r}_t = \operatorname{clip}\left(\kappa\,\tilde{r}_t,\ -c,\ c\right)$$

[0149] where $\tilde{r}_t$ is the standardized reward. The temporal-difference error is then calculated as $\delta_t = \hat{r}_t + \gamma V(s_{t+1}) - V(s_t)$, where $\gamma$ is the discount factor, which measures the degree to which future rewards influence current decisions, $V(s_t)$ is the state value estimate at time slot $t$, and $V(s_{t+1})$ is the state value estimate at time slot $t+1$. The advantage function is derived recursively as $A_t = \delta_t + \gamma\lambda A_{t+1}$, where $\lambda$ is the GAE parameter. The target state value is defined as $V_t^{\text{targ}} = A_t + V(s_t)$. Before being sent to the network for the update, the advantage function $A_t$ is further standardized to zero mean and unit variance.
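As a non-limiting illustration, the reward limiting, the recursive advantage estimation, and the clipped policy loss can be sketched as follows. The multiplication by the scaling factor before clipping, the GAE parameter `lam`, the bootstrap value `last_value`, and the small constant added to the standard deviation are assumptions of this example.

```python
import math
import statistics

def limit_rewards(raw, scale, cutoff):
    """Standardize rewards, scale them, then clip to [-cutoff, cutoff]."""
    mu = statistics.fmean(raw)
    sd = statistics.pstdev(raw)
    out = []
    for r in raw:
        z = (r - mu) / (sd + 1e-8)
        out.append(max(-cutoff, min(cutoff, scale * z)))
    return out

def gae(rewards, values, last_value, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]   # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    targets = [a + v for a, v in zip(adv, values)]        # target state values
    return adv, targets

def standardize(xs):
    """Shift and scale to zero mean and (approximately) unit variance."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / (sd + 1e-8) for x in xs]

def clipped_policy_loss(logp_new, logp_old, adv, eps):
    """Negative batch mean of min(ratio * A, clip(ratio) * A)."""
    terms = []
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        terms.append(min(ratio * a, clipped * a))
    return -sum(terms) / len(terms)

# Hypothetical two-step trajectory.
adv, targets = gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
norm_adv = standardize(adv)
limited = limit_rewards([0.0, 10.0, -10.0], scale=1.0, cutoff=0.5)
loss = clipped_policy_loss([0.5], [0.0], [1.0], eps=0.2)
```

With a positive advantage and a probability ratio above `1 + eps`, the clipped term is selected, so the update receives no further incentive to move the policy in that direction.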

[0150] To comprehensively guide model learning, the original reward $r_t$ is designed as a weighted combination of multiple system costs and utility incentives:

[0151] $$r_t = -\left(w_1 C_{\text{delay}} + w_2 E + w_3 P_{\text{sla}} + w_4 D_{\text{queue}} + w_5 D_{\text{p95}} + w_6 D_{\text{tx}} + w_7 D_{\text{turn}}\right) + w_8 N_{\text{done}} + w_9 U$$

[0152] where $C_{\text{delay}}$ is the task delay cost, $E$ is the system energy consumption, $P_{\text{sla}}$ is the service default penalty, $D_{\text{queue}}$ is the average queuing delay, $D_{\text{p95}}$ is the tail (95th percentile) delay, $D_{\text{tx}}$ is the transmission delay, $D_{\text{turn}}$ is the task turnaround delay, $N_{\text{done}}$ is the number of completed tasks, and $U$ is the resource utilization; $w_1$ is the weighting coefficient of the task delay cost, $w_2$ of the system energy consumption, $w_3$ of the service default penalty, $w_4$ of the average queuing delay, $w_5$ of the tail 95th percentile delay, $w_6$ of the transmission delay, and $w_7$ of the task turnaround delay; $w_8$ is the positive incentive coefficient for the number of completed tasks, and $w_9$ is the positive incentive coefficient for resource utilization. $N_{\text{done}}$ and $U$ act as positive incentives. This composite reward function steers the optimization toward a combined optimum across Service Level Agreement (SLA) compliance, end-to-end latency, cluster throughput, and energy efficiency.
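As a non-limiting illustration, the composite reward can be sketched as incentives minus weighted costs. The term names, the weight values, and the two sets of per-step statistics below are hypothetical.

```python
# Cost terms are subtracted; bonus terms are added (names are illustrative).
COST_TERMS = ("delay_cost", "energy", "sla_penalty", "mean_queue_delay",
              "p95_delay", "tx_delay", "turnaround")
BONUS_TERMS = ("completed", "utilization")

def raw_reward(stats, weights):
    """Weighted positive incentives minus weighted system costs."""
    cost = sum(weights[k] * stats[k] for k in COST_TERMS)
    bonus = sum(weights[k] * stats[k] for k in BONUS_TERMS)
    return bonus - cost

# Hypothetical weights: unit weight on every cost, tuned incentive weights.
weights = {k: 1.0 for k in COST_TERMS}
weights.update({"completed": 0.5, "utilization": 2.0})

fast_step = {"delay_cost": 1.0, "energy": 0.5, "sla_penalty": 0.0,
             "mean_queue_delay": 2.0, "p95_delay": 6.0, "tx_delay": 0.5,
             "turnaround": 3.0, "completed": 10, "utilization": 0.6}
# Same step, but with a longer tail delay and an SLA default penalty.
slow_step = dict(fast_step, p95_delay=9.0, sla_penalty=2.0)

r_fast = raw_reward(fast_step, weights)
r_slow = raw_reward(slow_step, weights)
```

Raising the weights on the default penalty and tail delay relative to the completion incentive is one way to obtain a policy that favors deadline safety over raw throughput.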

[0153] In S33, the policy network and value network are updated as follows: the networks are updated once every $N$ accumulated rounds, the learning rate follows a cosine annealing strategy, and the action exploration intensity follows a linear annealing decay:

[0154] $$\varepsilon_k = \varepsilon_{\text{start}} + f_k\left(\varepsilon_{\text{end}} - \varepsilon_{\text{start}}\right)$$

[0155] $$f_k = \min\left(\frac{k}{K},\ 1\right)$$

[0156] where $\varepsilon_k$ is the action exploration intensity at the $k$-th update, $f_k$ is the linear annealing schedule factor, $\varepsilon_{\text{start}}$ is the initial exploration intensity, $\varepsilon_{\text{end}}$ is the final exploration intensity, $k$ is the current number of network updates, and $K$ is the total number of updates over which the exploration intensity completes its linear decay. This mechanism maintains a high and uniform exploration ratio in the early stages of training to avoid becoming trapped in local optima, and gradually converges in the later stages to improve the determinism and stability of the policy.
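As a non-limiting illustration, the two schedules can be sketched as follows. The minimum learning rate `lr_min`, the clamping of the cosine schedule after `total_updates`, and all numeric values are assumptions of this example.

```python
import math

def exploration_intensity(k, eps_start, eps_end, decay_updates):
    """Linear annealing from eps_start to eps_end over decay_updates updates."""
    frac = min(k / decay_updates, 1.0)      # schedule factor in [0, 1]
    return eps_start + (eps_end - eps_start) * frac

def cosine_lr(k, total_updates, lr_max, lr_min=0.0):
    """Cosine annealing of the learning rate from lr_max down to lr_min."""
    frac = min(k / total_updates, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * frac))

# Hypothetical settings: decay exploration over the first 100 updates.
eps_schedule = [exploration_intensity(k, 0.2, 0.01, 100) for k in (0, 50, 100, 250)]
lr_schedule = [cosine_lr(k, 100, 3e-4) for k in (0, 50, 100)]
```

After `decay_updates` is reached the exploration intensity stays at its final value, which matches the intent of converging to a more deterministic policy late in training.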

[0157] To enhance the reproducibility of experiments and the reliability of model selection, a deterministic evaluation module is embedded in the training pipeline. After each policy update, the system performs validation on a fixed set of random seeds. A new model is accepted only if the improvement in evaluation score exceeds the minimum gain threshold and does not violate system safety constraints (Guardrails); if performance stagnates for an extended period, triggering early stopping, the system automatically reloads the historical optimal parameters. Furthermore, during evaluation, the online updates of the hybrid load prediction module are frozen, and the random number states of the training and evaluation environments are strictly isolated to prevent the evaluation process from contaminating the training environment.
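As a non-limiting illustration, the acceptance rule and early stopping described above can be sketched as follows. The function name, the `patience` counter, and the example scores are hypothetical; the evaluation itself (fixed seeds, frozen predictor) is assumed to have already produced one score and one guardrail flag per policy update.

```python
def run_selection(eval_scores, violates, min_gain, patience):
    """Return (index of best accepted model, whether early stopping fired)."""
    best_idx, best_score, stale = None, float("-inf"), 0
    for i, (score, bad) in enumerate(zip(eval_scores, violates)):
        # Accept only if the gain exceeds the threshold and no guardrail is violated.
        if not bad and score - best_score > min_gain:
            best_idx, best_score, stale = i, score, 0
        else:
            stale += 1
            if stale >= patience:
                # Performance has stagnated: stop and reload the best parameters.
                return best_idx, True
    return best_idx, False

# Hypothetical evaluation scores after five successive policy updates.
scores = [-9.0, -8.0, -7.95, -7.92, -7.96]
best, stopped = run_selection(scores, [False] * 5, min_gain=0.1, patience=3)
```

In this run the second model is the last one whose gain exceeds the threshold, so training stops after three stagnant evaluations and that model's parameters are the ones restored.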

[0158] In this embodiment, to comprehensively evaluate the effectiveness of the method of the present invention, two representative benchmark scheduling algorithms are selected for comparative analysis. The first is the First-Come, First-Served (FCFS) algorithm. This strategy uses a first-in, first-out (FIFO) queue locally at each edge node, without cross-node collaborative offloading or task priority allocation. The second is an urgency-based greedy scheduling algorithm. This heuristic evaluates the urgency of each task's deadline together with the remaining resource margin of the target node, and greedily allocates the task to the node with the most abundant computing power. The method proposed in this invention integrates hybrid load prediction, graph-structured spatial awareness, joint action generation, and an online fine-tuning mechanism based on proximal policy optimization.
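As a non-limiting illustration, the two baselines can be sketched as follows. The task fields, the single scalar capacity per node, and the choice to serve tasks in order of earliest deadline are assumptions of this sketch and are not the benchmark implementations used in the experiments.

```python
def fcfs_order(tasks):
    """FIFO baseline: serve tasks strictly in order of arrival."""
    return [t["id"] for t in sorted(tasks, key=lambda t: t["arrival"])]

def greedy_assign(tasks, capacity):
    """Urgency-based greedy: earliest deadline first, most free capacity wins."""
    free = dict(capacity)
    plan = {}
    for task in sorted(tasks, key=lambda t: t["deadline"]):
        fits = [n for n, c in free.items() if c >= task["demand"]]
        if not fits:
            plan[task["id"]] = None          # no node can host the task now
            continue
        node = max(fits, key=lambda n: free[n])
        free[node] -= task["demand"]
        plan[task["id"]] = node
    return plan

# Hypothetical workload and two edge nodes.
tasks = [{"id": "a", "arrival": 0, "deadline": 9, "demand": 4},
         {"id": "b", "arrival": 1, "deadline": 3, "demand": 6},
         {"id": "c", "arrival": 2, "deadline": 5, "demand": 5}]
order = fcfs_order(tasks)
plan = greedy_assign(tasks, {"n1": 8, "n2": 6})
```

The greedy baseline fills the roomiest node first and leaves the least urgent task unplaced in this example, which illustrates the aggressive packing behavior that the learned policy is later compared against.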

[0159] For the evaluation index system, this embodiment quantifies four dimensions: end-to-end latency, service quality, system throughput, and underlying physical energy efficiency. The core items of the latency dimension are the average queuing latency, the 95th percentile tail queuing latency, and the average task turnaround latency. The service quality dimension is measured mainly by the service level agreement (SLA) default rate and the task completion rate, reflecting the system's ability to cope with stringent deadline constraints. System throughput quantifies the total number of tasks successfully processed and delivered by the edge cluster per unit time. At the physical resource level, the average computing power utilization and the overall energy efficiency index are used to evaluate the ratio of output to power consumption of the underlying devices. Furthermore, this embodiment introduces a comprehensive latency score as a global evaluation reference. This index is a weighted combination of the multi-dimensional latency penalties; a higher value indicates better overall scheduling performance.
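As a non-limiting illustration, several of these indicators can be computed from per-task records as follows. The record fields, the nearest-rank percentile definition, the treatment of unfinished tasks as defaults, and the synthetic data are assumptions of this sketch; at least one finished task is assumed.

```python
def percentile_nearest_rank(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))   # integer ceiling of q*n/100
    return ordered[rank - 1]

def summarize(tasks, horizon):
    """Aggregate latency, service-quality, and throughput indicators."""
    done = [t for t in tasks if t["finished"]]
    queue = [t["queue_delay"] for t in done]
    late = sum(1 for t in tasks
               if not t["finished"] or t["turnaround"] > t["deadline"])
    return {
        "mean_queue_delay": sum(queue) / len(queue),
        "p95_queue_delay": percentile_nearest_rank(queue, 95),
        "mean_turnaround": sum(t["turnaround"] for t in done) / len(done),
        "sla_default_rate": late / len(tasks),
        "completion_rate": len(done) / len(tasks),
        "throughput": len(done) / horizon,
    }

# Synthetic records: 20 finished tasks with queue delays 1..20 time units.
records = [{"finished": True, "queue_delay": d, "turnaround": d + 1, "deadline": 15}
           for d in range(1, 21)]
report = summarize(records, horizon=10)
```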

[0160] The analysis focuses on the global performance of the proposed algorithm and the benchmark strategies on the key evaluation metrics. The experimental results are shown in Figure 3, where FCFS denotes the First-Come, First-Served algorithm, Optimus denotes the urgency-based greedy scheduling algorithm, and GPMS (Graph Prediction Multi-agent Scheduling) denotes the scheduling method proposed in this invention. The proposed scheduling algorithm exhibits significant advantages in end-to-end latency and service level agreement (SLA) guarantees. In terms of queuing and turnaround latency, the average queuing latency of the proposed algorithm is reduced to 2.179, a decrease of approximately 39.8% compared with the FCFS strategy. The 95th percentile queuing latency, which measures the long-tail effect, is reduced to 6.658, an improvement of 28.8%. The average task turnaround latency is also reduced by 27.2%. For service quality, the algorithm's service default rate is only 0.022, a decrease of approximately 63.6% and 66.7% compared with the FCFS and greedy scheduling strategies, respectively. The comprehensive latency score confirms this: the algorithm achieves a score of -7.421, outperforming both benchmark algorithms.

[0161] In terms of system throughput and task completion rate, as shown in Figure 4, the experiment reveals a reasonable performance trade-off. Compared with the absolute throughput of 1.167 for the greedy scheduling strategy, the system throughput of the algorithm of this invention decreases slightly to 1.078, and the overall task completion rate also declines slightly. This behavior stems from the weight preferences in the multi-objective reward function underlying the reinforcement learning. To strictly guarantee the latency floor of high-priority tasks and suppress large-scale timeout defaults, the agent learns strong risk avoidance and global coordination when making decisions. The system abandons the aggressive allocation approach of blindly stacking tasks to saturate single-node resources, and instead adopts more robust resource reservation and spatially balanced scheduling. Trading a small amount of throughput for latency stability is consistent with the stringent requirements for high-reliability service quality in complex edge computing scenarios.

[0162] In summary, this invention provides a systematic experimental evaluation and performance analysis of the proposed spatiotemporally aware resource scheduling framework. Through multi-dimensional quantitative comparison with traditional heuristic baseline algorithms, the significant advantages of the proposed algorithm in suppressing tail queuing latency, shortening task turnaround time, and reducing service level agreement (SLA) default rates are verified. Experimental data show that this scheduling strategy can maximize the overall service quality and operational stability of the system while allowing for minimal throughput trade-offs.

[0163] In the description of this invention, the above are merely preferred embodiments and are not intended to limit the scope of protection of this invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A spatiotemporally aware multi-agent joint task scheduling method for edge computing, characterized in that it comprises the following steps: S1. obtaining a historical task arrival sequence and inputting it into a hybrid load prediction module to generate future load prediction values; S2. inputting the future load prediction values into a global state perception module, and using a graph neural network to transform the topology of the edge network into high-dimensional feature vectors to generate the state observations of the agents; S3. inputting the state observations of the agents into a joint decision-making module, and performing fine-grained scheduling through a pointer network to generate a scheduling scheme.

2. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 1, characterized in that, in S1, the hybrid load prediction module includes an ARIMA model and a GRU model; the ARIMA model is used to capture the long-term periodicity of task traffic, while the GRU model is used to mine the short-term nonlinear characteristics in the sequence; S1 includes the following sub-steps: S11. inputting the historical task arrival sequence into the ARIMA model, extracting and modeling the linear trend in the sequence, and outputting a robust linear prediction baseline and a residual sequence; S12. processing the residual sequence according to a two-level preprocessing mechanism to generate standardized residual features; S13. inputting the standardized residual features into the GRU model, outputting the standardized predicted value of the next residual, and mapping it back to the real physical scale through inverse transformation to obtain the next residual; S14. based on the next residual and the linear prediction baseline, predicting the task arrival count of the next time slot as the future load prediction value.

3. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 2, characterized in that, in S11, the workflow of the ARIMA model is as follows: based on the Akaike information criterion, dynamic order selection is performed over a candidate space to output a robust linear prediction baseline; the historical fitting residual sequence is then calculated to separate the nonlinear fluctuation characteristics; wherein the candidate space is a specified set of order triples $(p, d, q)$, in which $p$, $d$ and $q$ correspond to the autoregressive order, the differencing order, and the moving-average order, respectively; the residual sequence includes the original residual values at several time points, where the original residual value $e_t$ at time point $t$ is expressed as $e_t = y_t - \hat{y}_t$, in which $e_t$ is the residual value at time point $t$, $y_t$ is the actual number of tasks at time point $t$, $\hat{y}_t$ is the fitted value output by the model for time point $t$, and $t$ is the time index.

4. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 3, characterized in that S12 includes the following steps: S121. performing exponential moving average denoising on the residual sequence to generate denoised residual features, $\tilde{e}_t = \alpha e_t + (1-\alpha)\,\tilde{e}_{t-1}$, where $\alpha$ is the smoothing coefficient, $\tilde{e}_t$ is the denoised residual feature at time point $t$, and $\tilde{e}_{t-1}$ is the denoised residual feature at time point $t-1$; S122. performing sliding window normalization on the denoised residual features to generate the standardized residual features $z_t$, $z_t = \left(\tilde{e}_t - \operatorname{mean}(\tilde{e}_{t-w+1:t})\right) / \max\left(\operatorname{std}(\tilde{e}_{t-w+1:t}),\ \varepsilon\right)$, where $w$ is the size of the context window, $\operatorname{mean}(\cdot)$ takes the mean, $\max(\cdot)$ takes the maximum value, $\operatorname{std}(\cdot)$ takes the standard deviation, and $\varepsilon$ is a very small constant.

5. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 4, characterized in that S14 includes the following steps: S141. generating an initial combined prediction by linear superposition of the next residual and the linear prediction baseline; S142. constructing a smooth anchor point and performing weak fusion with the initial combined prediction based on a first weight to generate a fused prediction value, wherein the smooth anchor point is built from the mean, the last observation, and the standard deviation of the historical arrival sequence over the most recent 12 time slots; S143. defining a dynamic stable bandwidth and strictly limiting the fused prediction value to a physically reasonable range by means of a truncation function, wherein the truncation function restricts its input value to a given range of lower and upper bounds, and the bandwidth is scaled by a dynamic stable bandwidth scaling factor; S144. to align with the physical semantics of discrete task scheduling, outputting the task arrival count of the next time slot in non-negative integer form by means of a rounding function, as the future load prediction value.

6. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 1, characterized in that, in S2, the topology of the edge network is modeled as an undirected weighted graph $G = (V, E)$, where $V$ is the set of edge nodes of the edge network and $E$ is the set of physical communication connections between edge nodes; an adjacency matrix $A \in \mathbb{R}^{N \times N}$ is defined to describe the connection relationships between nodes, where $\mathbb{R}^{N \times N}$ denotes an $N \times N$ matrix and $N$ is the number of edge nodes; if edge node $i$ is directly connected to edge node $j$, then the adjacency matrix entry for edge node $i$ and edge node $j$ is $A_{ij} = 1$; otherwise $A_{ij} = 0$.

7. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 5, characterized in that S3 includes the following steps: S31. inputting the task sequence and the agent's state observation into the pointer network, using the attention mechanism to calculate the weight of each task in the task sequence, and selecting the task with the highest priority to determine the task execution order; S32. based on the determined task execution order, introducing a masking mechanism in the action output stage of the policy network to calculate the probability distribution of scheduling actions and then generate a scheduling scheme; S33. adopting a single-policy, on-policy proximal policy optimization framework, in which, during the continuous interaction between the agent and the environment, the total loss function is constructed with the objective of maximizing the cumulative reward, and the policy network and value network are updated to optimize the scheduling policy.

8. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 7, characterized in that S31 specifically involves: extracting features from the task sequence using the attention mechanism of the pointer network to generate the features of the tasks to be processed; combining the agent's state observation with these task features through an encoder to generate the corresponding hidden representations; and calculating the correlation between the current hidden state $h_i$ and the encoded feature $e_j$ of each candidate task to generate the weight $u_j^i$ of each task in the task sequence, $u_j^i = v^{\top}\tanh\left(W_1 e_j + W_2 h_i\right)$, where $W_1$, $W_2$ and $v$ are learnable weight parameters, $\top$ is the transpose symbol, $e_j$ is the encoded feature of candidate task $j$, $j$ is the task index, and $i$ is the index of decision steps.

9. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 8, characterized in that, in S32, the probability distribution of scheduling actions $P(k \mid j)$ is calculated as $P(k \mid j) = \exp\left(z_{j,k} + m_{j,k}\right) / \sum_{k'} \exp\left(z_{j,k'} + m_{j,k'}\right)$, where $z_{j,k}$ is the raw action score output by the policy network for assigning task $j$ to target edge node $k$, $m_{j,k}$ is the mask value for scheduling task $j$ to edge node $k$, $m_{j,k'}$ is the mask value for scheduling task $j$ to candidate edge node $k'$, $z_{j,k'}$ is the raw action score output by the policy network for assigning task $j$ to candidate edge node $k'$, $k'$ is the index of candidate edge nodes, $k$ is the index of the target edge node, and $j$ is the task index; the masking mechanism is as follows: if either of the following conditions is met, the mask value $m_{j,k}$ is set to negative infinity, otherwise $m_{j,k} = 0$: (1) the available CPU or GPU computing power of edge node $k$ is lower than the minimum requirement of the task; (2) the communication link between the source node and the target node is interrupted, or its congestion level exceeds the threshold.

10. The spatiotemporally aware multi-agent joint task scheduling method for edge computing according to claim 9, characterized in that, in S33, the total loss function $L^{\text{total}}$ of the single-policy, on-policy proximal policy optimization framework is expressed as $L^{\text{total}} = L^{\text{policy}} + c_v L^{\text{value}} - c_e H$, where $L^{\text{policy}}$ is the policy function loss, $L^{\text{value}}$ is the value function loss, $H$ is the policy entropy used to encourage exploration, $c_v$ is the value function loss weight coefficient, used to control the intensity of value network updates, and $c_e$ is the entropy regularization weight coefficient, used to control the exploration intensity of the policy; the value function loss is $L^{\text{value}} = \mathbb{E}_t\left[\max\left(\left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2, \left(V_t^{\text{clip}} - V_t^{\text{targ}}\right)^2\right)\right]$ with $V_t^{\text{clip}} = V_{\theta_{\text{old}}}(s_t) + \operatorname{clip}\left(V_\theta(s_t) - V_{\theta_{\text{old}}}(s_t), -\epsilon_v, \epsilon_v\right)$, where $\mathbb{E}_t$ is the expectation operator, which averages over a batch of samples, $V_\theta(s_t)$ is the current value network's prediction for the state at time slot $t$, $V_t^{\text{targ}}$ is the target state value, $V_{\theta_{\text{old}}}(s_t)$ is the state value prediction output by the old value network at time slot $t$, $\epsilon_v$ is the value clipping threshold, and $V_t^{\text{clip}}$ is the clipped value; the policy function loss is $L^{\text{policy}} = -\mathbb{E}_t\left[\min\left(\rho_t A_t, \operatorname{clip}\left(\rho_t, 1-\epsilon, 1+\epsilon\right) A_t\right)\right]$, where $\min$ takes the minimum value, $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\pi_\theta$ is the new policy, $\pi_{\theta_{\text{old}}}$ is the old policy, $\epsilon$ is the policy clipping threshold, and $A_t$ is the advantage function; for advantage function estimation, generalized advantage estimation is adopted: the original reward is first standardized and then amplitude-limited using a scaling factor $\kappa$ and a cutoff threshold $c$ to obtain the limited reward $\hat{r}_t = \operatorname{clip}\left(\kappa\,\tilde{r}_t, -c, c\right)$, where $\tilde{r}_t$ is the standardized reward; the temporal-difference error is then calculated as $\delta_t = \hat{r}_t + \gamma V(s_{t+1}) - V(s_t)$, where $\gamma$ is the discount factor, which measures the degree to which future rewards influence current decisions, $V(s_t)$ is the state value estimate at time slot $t$, and $V(s_{t+1})$ is the state value estimate at time slot $t+1$; the advantage function is derived recursively as $A_t = \delta_t + \gamma\lambda A_{t+1}$, where $\lambda$ is the generalized advantage estimation parameter; the target state value is defined as $V_t^{\text{targ}} = A_t + V(s_t)$; before being sent to the network for the update, the advantage function $A_t$ is further standardized to zero mean and unit variance; the original reward $r_t$ is designed as a weighted combination of multiple system costs and utility incentives, $r_t = -\left(w_1 C_{\text{delay}} + w_2 E + w_3 P_{\text{sla}} + w_4 D_{\text{queue}} + w_5 D_{\text{p95}} + w_6 D_{\text{tx}} + w_7 D_{\text{turn}}\right) + w_8 N_{\text{done}} + w_9 U$, where $C_{\text{delay}}$ is the task delay cost, $E$ is the system energy consumption, $P_{\text{sla}}$ is the service default penalty, $D_{\text{queue}}$ is the average queuing delay, $D_{\text{p95}}$ is the tail delay, $D_{\text{tx}}$ is the transmission delay, $D_{\text{turn}}$ is the task turnaround delay, $N_{\text{done}}$ is the number of completed tasks, $U$ is the resource utilization, $w_1$ is the weighting coefficient of the task delay cost, $w_2$ is the system energy consumption weighting coefficient, $w_3$ is the service default penalty weighting coefficient, $w_4$ is the average queuing delay weighting coefficient, $w_5$ is the tail 95th percentile delay weighting coefficient, $w_6$ is the transmission delay weighting coefficient, $w_7$ is the task turnaround delay weighting coefficient, $w_8$ is the positive incentive coefficient for the number of completed tasks, and $w_9$ is the positive incentive coefficient for resource utilization; in S33, the policy network and value network are updated as follows: the networks are updated once every $N$ accumulated rounds, the learning rate follows a cosine annealing strategy, and the action exploration intensity follows a linear annealing decay, $\varepsilon_k = \varepsilon_{\text{start}} + f_k\left(\varepsilon_{\text{end}} - \varepsilon_{\text{start}}\right)$ with $f_k = \min\left(k / K, 1\right)$, where $\varepsilon_k$ is the action exploration intensity at the $k$-th update, $f_k$ is the linear annealing schedule factor, $\varepsilon_{\text{start}}$ is the initial exploration intensity, $\varepsilon_{\text{end}}$ is the final exploration intensity, $k$ is the current number of network updates, and $K$ is the total number of updates over which the exploration intensity completes its linear decay.