Distributed task offloading method and device based on reinforcement learning and storage medium

By training a reinforcement learning model offline and fine-tuning it online in an edge computing system, the differences between the simulation environment and the actual environment and the cold start problem are solved, the task offloading decision is optimized, and the resource utilization and user experience of the edge computing system are improved.

CN116467005B · Active · Publication Date: 2026-06-16 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTH CHINA UNIV OF TECH
Filing Date
2023-03-02
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing task offloading methods based on reinforcement learning suffer from poor performance in edge computing systems due to differences between simulation and real-world environments, as well as cold-start issues caused by online learning, and cannot effectively optimize task offloading decisions.

Method used

Actor and Critic models are trained with an offline reinforcement learning algorithm on task logs collected by heuristic methods, and are then updated through an online fine-tuning algorithm. The optimization problem is defined from a system model and cast as a decentralized partially observable Markov decision process (Dec-POMDP), and the trained models are deployed on the edge nodes.

Benefits of technology

It improves the performance and stability of the task offloading model, reduces performance degradation during the online learning phase, and enhances the resource utilization and user experience of the edge computing system.



Abstract

The application discloses a distributed task offloading method, device and storage medium based on reinforcement learning. The method comprises: modeling the system and defining an optimization problem using the system model; converting the optimization problem into a decentralized partially observable Markov decision process and determining the observations, actions and rewards; deploying a heuristic algorithm in the system and collecting task execution logs; training Actor and Critic models on the collected task execution logs using an offline training algorithm; deploying the Actor and Critic models trained in the offline stage to the system; and, during system operation, updating the Actor and Critic models according to an online fine-tuning algorithm. In the offline stage, the application uses the task execution logs of a heuristic method to train an offline reinforcement learning model, thereby warm-starting the model; during system operation, it updates the model according to the online fine-tuning algorithm, improving the model's performance. The application can be widely applied in the technical field of computation offloading.

Description

Technical Field

[0001] This invention relates to the field of computation offloading technology, and in particular to a distributed task offloading method, apparatus and storage medium based on reinforcement learning.

Background Technology

[0002] With the widespread adoption of IoT devices, such as wearables and smartphones, many new computationally intensive and latency-sensitive applications have emerged, including facial recognition and augmented reality. Because these applications have high network bandwidth requirements, transmitting data to the cloud for processing and then returning the results would place enormous pressure on the core network. Furthermore, network latency would degrade the end-user experience. Edge computing (EC) moves computing and storage facilities to the network edge, processing data near the end user, thereby reducing network latency and improving the user experience.

[0003] However, edge nodes have limited computing resources, their workloads are dynamic, and the load on geographically distributed edge nodes is unbalanced. A single edge node can rarely provide sufficient computing resources to nearby users at all times. Collaboration between edge nodes can improve overall resource utilization and user experience; for example, heavily loaded edge nodes can forward some of their tasks to less loaded edge nodes, thus achieving load balancing across geographically distributed edge nodes. Most research on task offloading aims to minimize task latency or energy consumption. However, each operational unit of a network system, such as an edge node or a transmission link, is susceptible to failure when processing tasks. Therefore, the task offloading scheduler of each edge node needs to intelligently balance latency and reliability during task execution.

[0004] Existing methods for collaborative edge computing can be broadly categorized into centralized and distributed approaches. Centralized methods require global system information to make offloading decisions for each task arriving at each edge node. Updating rapidly changing global information, such as information on randomly arriving tasks and the workload of each edge node, incurs significant transmission overhead, making it unsuitable for large-scale edge network systems. Distributed methods can be categorized into heuristic algorithms, game theory, and reinforcement learning. Traditional distributed heuristic algorithms, based on greedy optimization, tend to get stuck in suboptimal solutions in the long run. Game theory methods rely on strong assumptions to solve for Nash equilibria (NE), making it difficult to characterize real-world system scenarios.

[0005] Reinforcement learning-based task offloading methods have the following advantages. First, reinforcement learning methods can make offloading decisions in dynamic network environments. Second, reinforcement learning methods optimize a discounted cumulative reward, which allows long-term goals to be optimized. However, existing reinforcement learning-based task offloading methods fall into two categories when implemented in edge computing systems, and each has a drawback. First, some methods train the reinforcement learning model in a simulation environment and then deploy it to the actual system. Since it is difficult for a simulation environment to characterize a real-world system, models that perform well in simulation may not be effective in real-world deployments. Second, some methods use online learning, where the reinforcement learning model is trained and deployed directly in the edge network. Online learning encounters the cold start problem, which severely impacts the user experience before the reinforcement learning model converges.

[0006] Offline reinforcement learning provides a practical method for deploying reinforcement learning to edge computing. It trains a reinforcement learning model using pre-collected datasets without interacting with the environment. Therefore, the offline-trained model can be directly deployed to edge computing environments for task offloading. However, because the task offloading algorithm that generates the data may be suboptimal, and the model cannot be corrected by interacting with the environment, offline reinforcement learning models cannot achieve optimal performance. Therefore, further fine-tuning of the reinforcement learning model in a real-world environment is necessary.

Summary of the Invention

[0007] In order to at least partially solve one of the technical problems existing in the prior art, the present invention aims to provide a distributed task offloading method, apparatus and storage medium based on reinforcement learning.

[0008] The technical solution adopted in this invention is:

[0009] A distributed task offloading method based on reinforcement learning includes the following steps:

[0010] Model the system and use the system model to define the optimization problem;

[0011] Transform the optimization problem into a decentralized partially observable Markov decision process, and determine the observations, actions, and rewards;

[0012] Deploy heuristic algorithms in the system to collect task execution logs;

[0013] Based on the collected task execution logs, train the Actor and Critic models using an offline training algorithm;

[0014] The Actor and Critic models trained offline are deployed to the system; during system operation, the Actor and Critic models are periodically updated based on the online fine-tuning algorithm using new task operation logs.

[0015] Furthermore, the system model includes edge nodes, transmission links, and tasks;

[0016] The modeling of the system includes:

[0017] The system model is modeled as a graph G = (V, E); where V represents the set of edge nodes and E represents the set of transmission links;

[0018] The preset time is divided into multiple equal time slots, each denoted as $t$; $K_i(t)$ denotes the task received by edge node $i$ in time slot $t$, represented by the tuple $\{s_i(t), c_i(t), d_i(t)\}$, where $s_i(t)$ represents the size of the task input data, $c_i(t) = \delta_i(t) \times s_i(t)$ represents the CPU cycles required to complete the task, $\delta_i(t)$ represents the computation density, and $d_i(t)$ represents the deadline of the task;

[0019] Each edge node processes arriving tasks locally or offloads tasks to other edge nodes. The processing time of task $K_i(t)$ at edge node $j$ is expressed as:

[0020] $$T^{exec}_{ij}(t) = \frac{c_i(t)}{F_j}$$

[0021] In the formula, $F_j$ represents the computing capacity of edge node $j$;

[0022] The time required to send the data of task $K_i(t)$ from edge node $i$ to edge node $j$ is expressed as:

[0023] $$T^{trans}_{ij}(t) = \frac{s_i(t)}{r_{ij}}$$

[0024] In the formula, $r_{ij}$ represents the transmission rate between edge node $i$ and edge node $j$;

[0025] The processing time of task $K_i(t)$ at edge node $j$ is $T^{exec}_{ij}(t)$. During this period, the failure rate of edge node $j$ is $\alpha_j$, and $k$ denotes the number of failures that occur during execution. The probability of $k$ failures is calculated as follows:

[0026] $$P(k) = \frac{\left(\alpha_j T^{exec}_{ij}(t)\right)^{k}}{k!} \, e^{-\alpha_j T^{exec}_{ij}(t)}$$

[0027] When $k = 0$, no failure occurs during the processing time $T^{exec}_{ij}(t)$; therefore, the reliability of task $K_i(t)$ at edge node $j$ is calculated as follows:

[0028] $$R^{exec}_{ij}(t) = e^{-\alpha_j T^{exec}_{ij}(t)}$$

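[0029] The reliability of transmitting task $K_i(t)$ from edge node $i$ to edge node $j$ is expressed as follows:

[0030] $$R^{trans}_{ij}(t) = e^{-\beta_{ij} T^{trans}_{ij}(t)}$$

[0031] where $\beta_{ij}$ represents the failure rate of the transmission link between edge node $i$ and edge node $j$, and $T^{trans}_{ij}(t)$ is the transmission time of the task over that link.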

[0032] Furthermore, the optimization problem defined using a system model includes:

[0033] For each task $K_i(t)$, the decision variable of edge node $i$ is the vector $x_i(t)$, where $x_{ij}(t) = 1$ indicates that the task is offloaded to edge node $j$. Each task can only be executed on one edge node, expressed as:

[0034] $$\sum_{j \in V} x_{ij}(t) = 1, \quad x_{ij}(t) \in \{0, 1\}$$

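[0035] The task completion time is expressed as:

[0036] $$T_i(t) = \sum_{j \in V} x_{ij}(t)\left(T^{tw}_{ij}(t) + T^{trans}_{ij}(t) + T^{ew}_{ij}(t) + T^{exec}_{ij}(t)\right)$$

[0037] where $T^{tw}_{ij}(t)$, $T^{trans}_{ij}(t)$, $T^{ew}_{ij}(t)$ and $T^{exec}_{ij}(t)$ represent the transmission wait time, transmission time, execution wait time, and execution time, respectively; the task must be completed before the deadline, as shown below:

[0038] $$T_i(t) \le d_i(t)$$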

[0039] Tasks that exceed the deadline will be discarded;

[0040] The system's goal is to maximize the success rate of tasks within a preset time, expressed as:

[0041] $$\max \; \frac{|K_{succ}|}{|K_{succ}| + |K_{drop}| + |K_{fail}|}$$

[0042] where $|K_{succ}|$, $|K_{drop}|$ and $|K_{fail}|$ represent the number of successful tasks, the number of dropped tasks, and the number of failed tasks, respectively.

[0043] Furthermore, the determination of observations includes:

[0044] In time slot $t$, each edge node $i$ observes local information in the edge network; the local information includes the task arrival probability $\lambda_i$, the failure rate $\alpha_i$ of the edge node, the computing capacity $F_i$ of the edge node, the length $l_i$ of the execution queue of the edge node, the lengths $l_{ij}$ of the receive queues between edge node $i$ and the other $N-1$ edge nodes, and the transmission rates $r_{ij}$ from edge node $i$ to the other $N-1$ edge nodes. When a task arrives, the local information also includes the task's data size $s_i(t)$, task complexity $c_i(t)$, and task deadline $d_i(t)$; when no task arrives, $s_i(t)$, $c_i(t)$ and $d_i(t)$ are filled with 0;

[0045] In time slot $t$, the observation of edge node $i$ is defined as the following tuple:

[0046] $$o_i(t) = \left(\lambda_i, \alpha_i, F_i, l_i, \{l_{ij}\}_{j \ne i}, \{r_{ij}\}_{j \ne i}, s_i(t), c_i(t), d_i(t)\right)$$

[0047] Furthermore, the determining action includes:

[0048] In time slot $t$, if a task arrives at edge node $i$, edge node $i$ chooses an action $a_i(t)$ based on its current observation $o_i(t)$ to determine the edge node $j$ to which the task is offloaded; each edge node therefore requires at least $N$ discrete actions to represent all edge nodes.

[0049] In a dynamic edge network environment, edge node $i$ may not receive a task from an end user in time slot $t$. To construct the DRL state transition tuple $(o_i(t), a_i(t), r(t), o_i(t+1))$, edge node $i$ still needs to select an action in this time slot. Therefore, an additional action is added to indicate that the edge node makes no offloading decision when no task arrives.

[0050] Furthermore, the determination of the reward includes:

[0051] A task may take multiple time slots to complete; for task $K_i(t)$, if the task is successfully executed, edge node $i$ receives a feedback of +1 in the time slot in which the task is completed; if the task is dropped or fails to execute, edge node $i$ receives a feedback of -1.

[0052] In the decentralized partially observable Markov decision process, all edge nodes share a team reward; in each time slot, the team reward is defined as the sum of the feedback from all edge nodes in that time slot.

[0053] Furthermore, the step of training the Actor and Critic models using an offline training algorithm includes:

[0054] By utilizing the task execution logs of all edge nodes, a shared reinforcement learning model is trained, and then the reinforcement learning model is deployed on all edge nodes to perform distributed task offloading.

[0055] A reinforcement learning algorithm for offline training is designed based on discrete SAC (Soft Actor-Critic). SAC uses two Q-networks to avoid overestimation, where $\phi_1$ and $\phi_2$ represent the parameters of the two Q-networks, respectively. Let $\bar{\phi}_1$ and $\bar{\phi}_2$ represent the target network parameters of these two Q-networks, and let $\theta$ represent the Actor network parameters. The soft-value function is defined as follows:

[0056]

[0057] The target value is then expressed as:

[0058]

[0059] The Q-network is updated by minimizing the following mean squared error using gradient descent:

[0060]

[0061] The following CQL regularization terms are used to optimize the Q-network:

[0062]

[0063] The Q-network is then updated by minimizing, via gradient descent, the mean squared error above combined with the CQL regularization term weighted by $\lambda_{cql}$:

[0064]

[0065] For updating the Actor network, the following soft-value function is maximized through gradient ascent:

[0066]

[0067] The Critic network is updated via gradient descent with the following mean squared error objective:

[0068]

[0069] where $o_i(t)$ represents the observation of agent $i$ at time $t$, $r(t)$ represents the reward at time $t$, $\gamma$ represents the discount factor, $B$ represents the batch size, $N$ represents the number of agents, $a_i(t)$ represents the action of agent $i$ at time $t$, $Q_{\phi}(\cdot)$ denotes the Q-network function, $\lambda_{cql}$ denotes the weight of the CQL regularization term, $R_{cql}$ represents the CQL regularization term, $V(\cdot)$ represents the soft-value function, $V_{\omega}(\cdot)$ represents the Critic network function, and $y$ represents the target value.
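The quantities named in the paragraphs above can be illustrated with a minimal sketch. It assumes the standard discrete SAC soft value and the common log-sum-exp form of the CQL regularizer for discrete actions; the function names, the `temperature` parameter, and the batch layout are illustrative assumptions and are not taken from the patent text.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_value(q1, q2, probs, temperature):
    """Discrete SAC soft value: sum_a pi(a) * (min(Q1, Q2)(a) - temperature * log pi(a)).

    q1 and q2 are the two target Q-network outputs for one observation;
    probs is the Actor's action distribution for the same observation.
    """
    total = 0.0
    for qa, qb, p in zip(q1, q2, probs):
        if p > 0.0:  # an action with zero probability contributes nothing
            total += p * (min(qa, qb) - temperature * math.log(p))
    return total


def td_target(reward, next_value, gamma, done=False):
    """Target value y = r + gamma * V(o'), with no bootstrap at episode end."""
    return reward if done else reward + gamma * next_value


def cql_penalty(q_values, logged_action):
    """Assumed CQL term: logsumexp_a Q(o, a) - Q(o, a_logged)."""
    return logsumexp(q_values) - q_values[logged_action]


def q_loss(batch, lambda_cql):
    """Mean squared TD error plus the weighted CQL penalty.

    batch is a list of (q_values, logged_action, target) triples.
    """
    mse = sum((q[a] - y) ** 2 for q, a, y in batch) / len(batch)
    reg = sum(cql_penalty(q, a) for q, a, _ in batch) / len(batch)
    return mse + lambda_cql * reg
```

The penalty pushes down Q-values of actions that the logged heuristic did not take, which is the usual motivation for adding a conservative term when training from fixed logs.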

[0070] Furthermore, the online fine-tuning algorithm updates the Actor and Critic models, including:

[0071] By using GAE, the bias introduced by the Critic and the variance introduced by the return can be balanced, which addresses the overestimation problem caused by the distribution shift of online data.
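To illustrate how GAE trades off Critic bias against return variance, the following sketch computes advantages with the standard backward recursion. The default values of `gamma` and `lam` are assumptions chosen for illustration, and the sketch treats the input as a single uninterrupted trajectory.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the team reward at step t, values[t] is the Critic's
    estimate for the observation at step t, and last_value is the Critic's
    estimate for the observation after the final step.
    lam = 0 gives the one-step TD error (low variance, Critic bias);
    lam = 1 gives the discounted return minus the value (high variance).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

For example, `gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=1.0)` accumulates both rewards into the first advantage, while `lam=0.0` keeps only the immediate TD error at each step.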

[0072] Another technical solution adopted in this invention is:

[0073] A distributed task offloading device based on reinforcement learning, comprising:

[0074] At least one processor;

[0075] At least one memory for storing at least one program;

[0076] When the at least one program is executed by the at least one processor, the at least one processor implements the method described above.

[0077] Another technical solution adopted in this invention is:

[0078] A computer-readable storage medium storing a processor-executable program, which, when executed by a processor, performs the method described above.

[0079] The beneficial effects of this invention are as follows: in the offline stage, this invention uses the task execution logs of a heuristic method to train an offline reinforcement learning model, thereby warm-starting the model; during system operation, the model is updated according to an online fine-tuning algorithm to improve its performance.

Attached Figure Description

[0080] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the following description is provided with accompanying drawings of the relevant technical solutions in the embodiments of the present invention or the prior art. It should be understood that the accompanying drawings described below are only for the purpose of clearly illustrating some embodiments of the technical solutions of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0081] Figure 1 is a flowchart illustrating the steps of a distributed task offloading method based on reinforcement learning in an embodiment of the present invention.

Detailed Implementation

[0082] The embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention. The step numbers in the following embodiments are set only for ease of explanation, and there is no limitation on the order between the steps. The execution order of each step in the embodiments can be adaptively adjusted according to the understanding of those skilled in the art.

[0083] In the description of this invention, it should be understood that the orientation descriptions, such as up, down, front, back, left, right, etc., are based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limiting this invention.

[0084] In the description of this invention, "several" means one or more, "multiple" means two or more, "greater than," "less than," and "exceeding" are understood to exclude the stated number, while "above," "below," and "within" are understood to include the stated number. The use of "first" and "second" in the description is merely for distinguishing technical features and should not be construed as indicating or implying relative importance, or implicitly indicating the number of indicated technical features, or implicitly indicating the order of the indicated technical features.

[0085] In the description of this invention, unless otherwise explicitly defined, terms such as "set up," "install," and "connect" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this invention in conjunction with the specific content of the technical solution.

[0086] Terminology Explanation:

[0087] Actor: A neural network that selects actions based on the state.

[0088] Critic: A neural network that scores states.

[0089] GAE: Generalized Advantage Estimation.

[0090] Return: the cumulative discounted reward.

[0091] O2O-DRL: An offline-to-online deep reinforcement learning algorithm.

[0092] Offloading tasks across multiple edge computing nodes can improve resource utilization and user experience. However, centralized task offloading requires global system information, which leads to additional transmission overhead and does not scale to large edge networks. Existing distributed methods, including heuristics and game theory, rely on greedy optimization or strict assumptions, which are not suitable for dynamic edge computing environments. Existing reinforcement learning-based task offloading methods rely on simulation environments to train models and then deploy them to real-world systems. Due to significant differences between real-world systems and simulation environments, these methods fail to achieve good performance. On the other hand, directly training and deploying models in real-world systems leads to the cold-start problem, which severely impacts user experience before the model converges. This invention proposes an offline-to-online reinforcement learning method, O2O-DRL. Specifically, in the offline phase, it uses task execution logs from heuristic methods to train an offline reinforcement learning model for warm-starting. Because of the distribution shift between offline and online data, directly using offline methods for online fine-tuning can corrupt the model trained offline. To address this issue, this invention uses an on-policy reinforcement learning method to fine-tune the model, which avoids overestimation of the value function.

[0093] As shown in Figure 1, this embodiment provides a distributed task offloading method based on reinforcement learning, including the following steps:

[0094] S1. Model the system and use the system model to define the optimization problem.

[0095] (1) System Model

[0096] This embodiment considers multiple heterogeneous edge nodes in an edge network. The system is modeled as a graph $G = (V, E)$, where $V = \{i \mid 1 \le i \le N\}$ represents the set of edge nodes and $E$ represents the set of transmission links. This model can be applied to wireless or wired network scenarios. A time period is divided into multiple equal time slots, each denoted as $t$, where $t \in \{1, 2, \dots, T\}$. Each edge node $i \in V$ may receive computationally intensive and latency-sensitive tasks from end users in each time slot. It is assumed that the arrival of a task from an end user at edge node $i$ in a time slot follows a Bernoulli distribution with probability $\lambda_i$. $K_i(t)$ denotes the task received by edge node $i$ in time slot $t$, represented by the tuple $\{s_i(t), c_i(t), d_i(t)\}$, where $s_i(t)$ represents the size of the task input data, $c_i(t) = \delta_i(t) \times s_i(t)$ represents the CPU cycles required to complete the task, $\delta_i(t)$ represents the computation density, i.e., the CPU cycles required per bit of data, and $d_i(t)$ represents the task's deadline. Each edge node can process arriving tasks locally or offload them to other edge nodes. $C_i$ represents the number of CPU cores of edge node $i$, and $F$ represents the frequency of each CPU core. We assume that tasks can be executed in parallel on the edge nodes. $F_i$ represents the computing capacity of edge node $i$, where $F_i = C_i \times F$. The processing time of task $K_i(t)$ at edge node $j$ can be expressed as:

[0097] $$T^{exec}_{ij}(t) = \frac{c_i(t)}{F_j}$$

[0098] Each edge node maintains an execution queue, and task scheduling is based on FIFO (First-In, First-Out). We assume that each task consumes all the resources of the edge node. If a task is being executed on an edge node, other tasks on that node must wait in the execution queue until the task is completed. The queue threshold of edge node $i$ is $L_i$, which is positively correlated with the number of its CPU cores, i.e., $L_i = \mu C_i$, where $\mu$ is a parameter. If the number of tasks exceeds the queue threshold, the edge node is overloaded. $r_{ij}$ represents the transmission rate between edge node $i$ and edge node $j$. The time required to send the data of task $K_i(t)$ from edge node $i$ to edge node $j$ can be expressed as:

[0099] $$T^{trans}_{ij}(t) = \frac{s_i(t)}{r_{ij}}$$

[0100] Each edge node maintains $N-1$ receive queues to receive tasks sent by other edge nodes and $N-1$ send queues to send tasks to other edge nodes. When task $K_i(t)$ is received by edge node $j$, it is placed in the execution queue to await execution and is not forwarded to other edge nodes. This is because, in this fully connected network, the result of forwarding a task multiple times can be achieved with a single forwarding, which reduces transmission time. The reliability of the system depends on the probability that each operational unit, such as an edge node or a transmission link, fails while processing a task. These failures include hardware and software failures. The failures of each operational unit can be modeled with a Poisson distribution. In this embodiment, the processing time of task $K_i(t)$ at edge node $j$ is $T^{exec}_{ij}(t)$. During this period, assume that failures occur according to a Poisson distribution, that the failure rate of edge node $j$ is $\alpha_j$, and that $k$ denotes the number of failures during execution. The probability of $k$ failures can be calculated as follows:

[0101] $$P(k) = \frac{\left(\alpha_j T^{exec}_{ij}(t)\right)^{k}}{k!} \, e^{-\alpha_j T^{exec}_{ij}(t)}$$

[0102] When $k = 0$, no failure occurs during the processing time $T^{exec}_{ij}(t)$. Therefore, the reliability of task $K_i(t)$ at edge node $j$ can be calculated as follows:

[0103] $$R^{exec}_{ij}(t) = e^{-\alpha_j T^{exec}_{ij}(t)}$$

[0104] Assume that the failure of a task does not affect the execution of subsequent tasks. Similarly, the reliability of transmitting task $K_i(t)$ from edge node $i$ to edge node $j$ can be expressed as:

[0105] $$R^{trans}_{ij}(t) = e^{-\beta_{ij} T^{trans}_{ij}(t)}$$

[0106] where $\beta_{ij}$ represents the failure rate of the transmission link between edge node $i$ and edge node $j$, and $T^{trans}_{ij}(t)$ is the transmission time of the task over that link.
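The system model above can be sketched in a few lines of Python. The class and function names are illustrative assumptions; the formulas follow the definitions in the text (processing time as CPU cycles divided by computing capacity, transmission time as data size divided by transmission rate, and Poisson-distributed failures).

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    """A task K_i(t); field names are illustrative."""
    size_bits: float  # s_i(t), size of the input data
    density: float    # delta_i(t), CPU cycles required per bit
    deadline: float   # d_i(t)

    @property
    def cycles(self) -> float:
        """c_i(t) = delta_i(t) * s_i(t)."""
        return self.density * self.size_bits


def exec_time(task: Task, capacity: float) -> float:
    """Processing time at a node with computing capacity F_j (cycles per second)."""
    return task.cycles / capacity


def trans_time(task: Task, rate: float) -> float:
    """Transmission time over a link with rate r_ij (bits per second)."""
    return task.size_bits / rate


def failure_prob(k: int, failure_rate: float, duration: float) -> float:
    """Poisson probability of exactly k failures during the given duration."""
    x = failure_rate * duration
    return (x ** k / math.factorial(k)) * math.exp(-x)


def reliability(failure_rate: float, duration: float) -> float:
    """Probability of zero failures, i.e. failure_prob with k = 0."""
    return math.exp(-failure_rate * duration)
```

The same `reliability` function covers both node execution (with failure rate `alpha_j` and the processing time) and link transmission (with failure rate `beta_ij` and the transmission time).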

[0107] (2) Problem Definition

[0108] a) Decision variables

[0109] In time slot $t$, when a task submitted by an end user arrives at edge node $i$, the node can process the task locally or offload it to another edge node for execution. The binary (0-1) decision variable $x_{ij}(t)$ indicates whether task $K_i(t)$ is offloaded to edge node $j$ for execution: $x_{ij}(t) = 1$ indicates that task $K_i(t)$ is offloaded to edge node $j$, and $x_{ij}(t) = 0$ otherwise. For generality, $x_{ii}(t) = 1$ indicates that edge node $i$ executes task $K_i(t)$ locally.

[0110] b) Constraints

[0111] For each task $K_i(t)$, the decision variable of edge node $i$ is the vector $x_i(t)$, where $x_{ij}(t) = 1$ indicates that the task is offloaded to edge node $j$. Each task can only be executed on one edge node, which can be expressed as:

[0112] $$\sum_{j \in V} x_{ij}(t) = 1, \quad x_{ij}(t) \in \{0, 1\}$$

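[0113] The task completion time can be expressed as:

[0114] $$T_i(t) = \sum_{j \in V} x_{ij}(t)\left(T^{tw}_{ij}(t) + T^{trans}_{ij}(t) + T^{ew}_{ij}(t) + T^{exec}_{ij}(t)\right)$$

[0115] where $T^{tw}_{ij}(t)$, $T^{trans}_{ij}(t)$, $T^{ew}_{ij}(t)$ and $T^{exec}_{ij}(t)$ represent the transmission wait time, transmission time, execution wait time, and execution time, respectively. The task needs to be completed before the deadline, which can be represented as:

[0116] $$T_i(t) \le d_i(t)$$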

[0117] Tasks that exceed the deadline will be discarded.

[0118] c) Objective

[0119] The system's goal is to maximize the success rate of tasks over a period of time, which can be expressed as:

[0120] $$\max \; \frac{|K_{succ}|}{|K_{succ}| + |K_{drop}| + |K_{fail}|}$$

[0121] where $|K_{succ}|$, $|K_{drop}|$ and $|K_{fail}|$ represent the number of successful tasks, the number of dropped tasks, and the number of failed tasks, respectively.
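As a rough illustration of the constraints and objective, the sketch below sums the four time components, classifies a task outcome, and computes the success rate. The outcome labels and the rule that the deadline check is applied before the failure check are assumptions made for this example.

```python
def completion_time(trans_wait, trans, exec_wait, exec_time):
    """Completion time: transmission wait + transmission + execution wait + execution."""
    return trans_wait + trans + exec_wait + exec_time


def outcome(completion, deadline, unit_failed):
    """Classify a task as 'succ', 'drop' or 'fail'.

    A task that exceeds its deadline is dropped; otherwise it fails if a node
    or link failure occurred while it was being processed.
    """
    if completion > deadline:
        return "drop"
    return "fail" if unit_failed else "succ"


def success_rate(outcomes):
    """|K_succ| / (|K_succ| + |K_drop| + |K_fail|); 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return outcomes.count("succ") / len(outcomes)
```

For a locally executed task, the two transmission components would simply be zero.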

[0122] S2. Transform the optimization problem into a decentralized partially observable Markov decision process, and determine the observations, actions, and rewards.

[0123] The above problem can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP). The three most important elements are defined below: observation, action, and reward.

[0124] (1) Observation

[0125] In time slot $t$, each edge node $i$ observes its local information within the edge network, including the task arrival probability $\lambda_i$, the failure rate $\alpha_i$ of the edge node, the computing capacity $F_i$ of the edge node, the length $l_i$ of the execution queue of the edge node, the lengths $l_{ij}$ of the receive queues between edge node $i$ and the other $N-1$ edge nodes, and the transmission rates $r_{ij}$ from edge node $i$ to the other $N-1$ edge nodes. When a task arrives, the observation also includes the task's data size $s_i(t)$, the task's complexity $c_i(t)$, and the task's deadline $d_i(t)$. To maintain a consistent input dimension for the neural network, $s_i(t)$, $c_i(t)$ and $d_i(t)$ are padded with 0 when no task arrives. Therefore, in time slot $t$, the observation of edge node $i$ can be defined as the following tuple:

[0126] $$o_i(t) = \left(\lambda_i, \alpha_i, F_i, l_i, \{l_{ij}\}_{j \ne i}, \{r_{ij}\}_{j \ne i}, s_i(t), c_i(t), d_i(t)\right)$$

[0127] Because the observation includes variables of different orders of magnitude, we normalize each element by dividing it by its maximum value. This keeps the value of each element within the range [0, 1], so that each element is equally important for training the reinforcement learning model. The state of the Dec-POMDP consists of the observations of all edge nodes.
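A minimal sketch of assembling and normalizing one edge node's observation is shown below. The argument names, the flat list layout, and the handling of a zero maximum are assumptions for illustration.

```python
def build_observation(arrival_prob, failure_rate, capacity, exec_queue_len,
                      recv_queue_lens, rates, task=None):
    """Assemble the local observation of one edge node as a flat list.

    recv_queue_lens and rates each hold N-1 entries (one per other node).
    task is a (size, cycles, deadline) tuple, or None when no task arrived,
    in which case the three task fields are padded with 0.
    """
    size, cycles, deadline = task if task is not None else (0.0, 0.0, 0.0)
    return [arrival_prob, failure_rate, capacity, exec_queue_len,
            *recv_queue_lens, *rates, size, cycles, deadline]


def normalize(observation, max_values):
    """Divide each element by its maximum value so every entry lies in [0, 1]."""
    return [x / m if m > 0 else 0.0 for x, m in zip(observation, max_values)]
```

With N = 3 edge nodes, the observation has 4 + 2 + 2 + 3 = 11 entries, and its length stays the same whether or not a task arrived.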

[0128] (2) Actions

[0129] In time slot $t$, if a task arrives at edge node $i$, the node chooses an action $a_i(t)$ based on its current observation $o_i(t)$. This action determines the edge node $j$, possibly edge node $i$ itself, to which the task is offloaded. Therefore, each edge node requires at least $N$ discrete actions to represent all edge nodes. In a dynamic edge network environment, edge node $i$ may not receive a task from an end user in time slot $t$. However, to construct the DRL state transition tuple $(o_i(t), a_i(t), r(t), o_i(t+1))$, edge node $i$ still needs to choose an action in this time slot. Therefore, we add an action to indicate that the edge node makes no offloading decision when no task arrives. With this modification, our method can be applied to most real-world scenarios where tasks arrive randomly.

[0130] To control which actions an agent may choose, we introduce "optional actions," an array of 0s and 1s that is used for action-space masking. Specifically, for each unavailable action (one whose entry in the "optional actions" array is 0), we replace the corresponding output of the neural network with a very large negative value (e.g., $-1 \times 10^{8}$). After the softmax function, the probability of choosing that action becomes 0.
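The masking step can be sketched as follows, assuming a mask value of -1e8 and plain Python lists for the network outputs; the function name and constant are illustrative.

```python
import math

MASK_VALUE = -1e8  # assumed "very large negative value" for unavailable actions


def masked_softmax(logits, available):
    """Softmax over logits after masking actions whose availability flag is 0."""
    masked = [x if flag == 1 else MASK_VALUE for x, flag in zip(logits, available)]
    m = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Because `math.exp` underflows to 0.0 for such a large negative argument, a masked action receives probability exactly 0 whenever at least one action is available. If every action were masked, the result would degenerate to a uniform distribution, so callers should keep at least one action available.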

[0131] In the initial stage, the neural network follows a random policy. We can guide the reinforcement learning model toward a better solution by controlling the "optional actions." For example, a heavily loaded edge node may not want to receive tasks from other edge nodes. Therefore, when an edge node needs to make a task offloading decision, it queries each of the other edge nodes to check whether the number of tasks at that node exceeds its queue threshold. If so, the "optional action" corresponding to that edge node is set to 0. Each reply requires only 1 bit of data, so the overhead of this query is negligible.
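One way to build the "optional actions" array from the queried queue states is sketched below. It assumes N offloading actions followed by one extra no-task action, that local execution always remains available, and that the no-task action is the only available action when no task arrived; these conventions are assumptions for illustration.

```python
def available_actions(self_index, queue_lengths, thresholds, task_arrived):
    """Return N + 1 availability flags: one per edge node plus a final no-task action.

    self_index is the 0-based index of the deciding node; queue_lengths[j] and
    thresholds[j] describe the execution queue of node j.
    """
    n = len(queue_lengths)
    if not task_arrived:
        return [0] * n + [1]  # only the no-task action may be chosen
    mask = []
    for j in range(n):
        overloaded = queue_lengths[j] > thresholds[j]
        # Another node that exceeds its queue threshold is excluded as a target.
        mask.append(0 if (overloaded and j != self_index) else 1)
    return mask + [0]
```

The resulting array can be passed directly as the availability flags of a masked softmax over the Actor's outputs.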

[0132] (3) Rewards

[0133] A task may take multiple time slots to complete. For task $K_i(t)$, if the task is successfully executed, edge node $i$ receives a feedback of +1 in the time slot in which the task is completed; if the task is dropped or fails, edge node $i$ receives a feedback of -1. In the decentralized partially observable Markov decision process, all edge nodes share a team reward. In each time slot, the team reward is defined as the sum of the feedback from all edge nodes in that time slot.
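The team reward for a single time slot can be computed as in the sketch below; the outcome labels and the dictionary layout are illustrative assumptions.

```python
def team_reward(outcomes_by_node):
    """Sum the +1 / -1 feedback of all edge nodes for one time slot.

    outcomes_by_node maps a node index to the list of task outcomes
    ('succ', 'drop' or 'fail') that were resolved at that node in this slot.
    """
    feedback = {"succ": 1, "drop": -1, "fail": -1}
    return sum(feedback[o]
               for outcomes in outcomes_by_node.values()
               for o in outcomes)
```

A node with no task resolved in the slot contributes 0, so the team reward is 0 in a slot where nothing completes.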

[0134] S3. Deploy heuristic algorithms in the system and collect task execution logs.

[0135] Heuristic-based task offloading methods can be directly deployed in real systems. However, these methods lack a learning mechanism and cannot learn from historical task execution logs. We can collect these logs and use them to train an offline reinforcement learning model. This model can mimic the policy of the heuristic method and, through its learning mechanism, achieve better performance than the heuristic method. The advantages of using heuristic task execution logs to train a reinforcement learning model can be summarized in two points. First, we can warm-start the reinforcement learning model, so that it performs well in the initial stage; by contrast, a reinforcement learning model with a randomly initialized neural network follows a random policy, which would seriously affect the user experience. Second, heuristic task execution logs can guide the reinforcement learning model to converge to a better solution.
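As an illustration of log collection, the sketch below pairs a hypothetical greedy heuristic (choose the node with the smallest estimated completion time) with a transition record serialized as one JSON line. The text does not specify which heuristic is deployed or how logs are stored, so the heuristic, the record fields, and the JSON format are all assumptions.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Transition:
    """One logged decision of an edge node; field names are illustrative."""
    node: int
    slot: int
    observation: list
    action: int
    reward: float
    next_observation: list


def greedy_action(exec_times, trans_times, queue_waits):
    """Hypothetical heuristic: pick the node with the lowest estimated completion time."""
    costs = [e + t + w for e, t, w in zip(exec_times, trans_times, queue_waits)]
    return min(range(len(costs)), key=costs.__getitem__)


def to_log_line(transition):
    """Serialize a transition as a single JSON line for the task execution log."""
    return json.dumps(asdict(transition))
```

Transitions logged this way already have the (observation, action, reward, next observation) shape that an offline training algorithm consumes.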

[0136] S4. Based on the collected task execution logs, train the Actor and Critic models using an offline training algorithm.

[0137] This embodiment proposes an offline-to-online reinforcement learning method, O2O-DRL, to solve the problem of distributed task offloading between edge nodes.

[0138] An intuitive approach is for each edge node to train its own reinforcement learning model using its own task execution logs. However, each edge node has limited observations and cannot perceive the task offloading decisions of other edge nodes, making it difficult for each edge node's reinforcement learning model to converge to a better solution. To address this, we train a shared reinforcement learning model using the task execution logs of all edge nodes and then deploy this model on all edge nodes for distributed task offloading. However, the observations as defined in the Dec-POMDP are insufficient to distinguish one edge node from another. Therefore, we add a one-hot encoding of the edge node index to the observations. For example, the one-hot encoding of edge node i is an array in which the entry at index i-1 is 1 and all other entries are 0.
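The one-hot augmentation can be sketched as follows. The function names and the flat-list observation layout are assumptions for illustration; the indexing rule (entry i-1 set to 1 for edge node i) follows the text.

```python
def one_hot(node_index, num_nodes):
    """One-hot encoding of a 1-based edge node index."""
    if not 1 <= node_index <= num_nodes:
        raise ValueError("node_index must be in 1..num_nodes")
    encoding = [0] * num_nodes
    encoding[node_index - 1] = 1
    return encoding


def augment_observation(observation, node_index, num_nodes):
    """Append the node's one-hot encoding so a shared model can tell nodes apart."""
    return list(observation) + one_hot(node_index, num_nodes)


# Edge node 2 of 4, with a hypothetical two-element local observation.
augmented = augment_observation([0.3, 0.7], node_index=2, num_nodes=4)
```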

[0139] Since training is an offline process, we can train the reinforcement learning model offline on a high-performance edge node or cloud server. Once the model is trained, we can deploy it to each edge node for distributed task offloading.

[0140] We design an offline-training reinforcement learning algorithm based on discrete SAC. SAC uses two Q-networks to avoid overestimation. φ1 and φ2 represent the parameters of the two Q-networks, respectively. Each Q-network has a corresponding target network with its own parameters. Let θ represent the Actor network parameters. The soft-value function is defined as follows:

[0141]

[0142] The target value can then be expressed as:

[0143]
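For orientation, the soft value and target value of a discrete-action SAC are commonly computed as in the sketch below. This is the standard textbook form, not necessarily the exact formulas of this embodiment; the temperature `alpha`, the `done` flag, and the function names are assumptions.

```python
import math


def soft_value(action_probs, q1_values, q2_values, alpha):
    """Expected soft value under the policy for one observation.

    Uses the minimum of the two Q estimates to limit overestimation and
    adds the entropy bonus -alpha * log(pi) per action.
    """
    value = 0.0
    for prob, q1, q2 in zip(action_probs, q1_values, q2_values):
        if prob > 0.0:  # skip zero-probability actions to avoid log(0)
            value += prob * (min(q1, q2) - alpha * math.log(prob))
    return value


def td_target(reward, gamma, next_soft_value, done=False):
    """Target value: reward plus the discounted soft value of the next observation."""
    return reward + (0.0 if done else gamma * next_soft_value)


v_next = soft_value([0.5, 0.5], [1.0, 2.0], [1.5, 1.0], alpha=0.2)
y = td_target(reward=1.0, gamma=0.99, next_soft_value=v_next)
```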

[0144] We can update the Q-network by minimizing the following mean squared error using gradient descent:

[0145]

[0146] However, there is a distribution shift between the heuristics for generating task logs and the offline reinforcement learning strategy. To address this issue, we employ the following CQL regularization term to optimize the Q-network:

[0147]

[0148] Note that we minimize the regularization term. The first term therefore reduces the Q-values corresponding to unseen observations, and the second term increases the Q-values corresponding to the actions taken in the offline data. Incorporating the CQL regularization term, we update the Q-network by minimizing the following regularized mean squared error objective through gradient descent:

[0149]
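A common discrete-action form of the CQL regularizer is the log-sum-exp of the Q-values over all actions minus the Q-value of the logged action. The sketch below uses that form as an assumption and adds it to the mean squared error with a weight; the exact weighting and scaling used in this embodiment are not shown here, and all names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cql_regularizer(q_values_batch, logged_actions):
    """Mean over the batch of logsumexp_a Q(o, a) - Q(o, a_logged)."""
    total = 0.0
    for q_values, action in zip(q_values_batch, logged_actions):
        total += logsumexp(q_values) - q_values[action]
    return total / len(logged_actions)


def regularized_q_loss(q_values_batch, logged_actions, targets, cql_weight):
    """Mean squared error to the targets plus the weighted CQL term."""
    mse = sum(
        (q_values[action] - target) ** 2
        for q_values, action, target in zip(q_values_batch, logged_actions, targets)
    ) / len(targets)
    return mse + cql_weight * cql_regularizer(q_values_batch, logged_actions)


loss = regularized_q_loss([[0.5, 1.5], [2.0, 0.0]], [1, 0], [1.0, 2.5], cql_weight=1.0)
```

Minimizing the first part of the regularizer pushes down Q-values in general, while the subtracted part pushes up the Q-values of actions that actually appear in the logs.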

[0150] To update the Actor network, we need to maximize the following soft-value function through gradient ascent:

[0151]

[0152] After deploying the reinforcement learning model to the edge network, the reinforcement learning-based scheduler accumulates new task execution logs. To fine-tune the reinforcement learning model on-policy during the online phase, we train an additional Critic network offline. We update the Critic network using gradient descent with the following mean squared error objective:

[0153]

[0154] The Critic network is used only in the online phase. The reward varies over a large range from one time slot to another, which severely impairs the training of the neural network. Therefore, we normalize the rewards in the offline data to improve the convergence speed and performance of the algorithm. The algorithm for offline training of the reinforcement learning model is given in Algorithm 1.
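The reward scaling step can be sketched as follows. The description does not state which scaling is used; standardizing to zero mean and unit variance is an assumption made here for illustration, as are the function name and the `eps` guard.

```python
import statistics


def normalize_rewards(rewards, eps=1e-8):
    """Standardize rewards to zero mean and (approximately) unit variance.

    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Team rewards from several time slots of a hypothetical offline log.
scaled = normalize_rewards([-3, 0, 3, 8, -5])
```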

[0155] S5. Deploy the Actor and Critic models trained in the offline phase into the system; during system operation, periodically update the Actor and Critic models based on the online fine-tuning algorithm using new task operation logs.

[0156] Once the model is deployed to a production system, we can periodically collect the data generated during the online phase and adjust the model accordingly. Directly fine-tuning a model with the algorithm used for offline training can lead to performance degradation in the initial stage of online fine-tuning, potentially damaging the model trained offline. By training an additional Critic network offline, an on-policy actor-critic algorithm can be used to update the model during online fine-tuning. Using GAE (Generalized Advantage Estimation), the bias introduced by the Critic network and the variance introduced by the return are balanced, addressing the overestimation problem caused by the distribution shift of online data.

[0157] The GAE can be solved iteratively as follows:

[0158]

[0159] where,

[0160]

[0161] This embodiment updates the Actor network based on PPO by minimizing the following loss function:

[0162]

[0163] Where r(θ) is the probability ratio between the new and old policies, and η(x) is the clip function.

[0164] The Critic network is updated by minimizing the following loss function:

[0165]

[0166] While we can directly deploy offline-trained models on edge systems for online task offloading, we can further improve their performance through online fine-tuning. Once a reinforcement learning model is deployed in the system, each edge node accumulates new task execution logs during operation. We can periodically collect this data to fine-tune the reinforcement learning model. Note that only the Actor network needs to be deployed on the edge nodes for online task offloading, and the Actor network is a 3-layer fully connected network. Therefore, the Actor network model size is less than 100 KB, and each model update incurs only a small transmission overhead. Furthermore, we update the model parameters only after collecting sufficient data.

[0167] The challenge of online fine-tuning lies in the distribution shift between online and offline data. The Q-network and Critic network are trained on offline data, which can lead to overestimation for observations encountered in the online phase that were not seen in the offline data. Because we additionally train a Critic network offline, we can use an on-policy Actor-Critic reinforcement learning algorithm to fine-tune the model. The advantage of this is that we can use GAE to compute the advantage function, thus balancing the bias introduced by the value-based estimate and the variance introduced by the return-based estimate. Our method avoids the performance degradation that occurs in the initial stage of online fine-tuning.

[0168] We can calculate GAE recursively as follows:

[0169]

[0170]

[0171] Therefore, we can calculate the target value as follows:

[0172]
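The recursive GAE computation and the resulting target values can be sketched as below, using the standard GAE recursion. The parameter name `lam`, the bootstrap argument `last_value`, and the choice of target (advantage plus Critic value) are assumptions; the embodiment's exact formulas are not reproduced here.

```python
def compute_gae(rewards, values, last_value, gamma, lam):
    """Compute GAE advantages and value targets for one trajectory.

    rewards[t] and values[t] are the reward and Critic estimate at step t;
    last_value is the Critic estimate for the observation after the last step.
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error (low variance, biased by the Critic).
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive accumulation; lam trades bias against variance.
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    targets = [adv + value for adv, value in zip(advantages, values)]
    return advantages, targets


advantages, targets = compute_gae(
    rewards=[1.0, -1.0, 1.0], values=[0.5, 0.2, 0.4], last_value=0.0,
    gamma=0.99, lam=0.95,
)
```

With `lam` set to 0 the advantage reduces to the one-step temporal-difference error, and with `lam` set to 1 it becomes the discounted return minus the Critic value, which is the bias-variance trade-off referred to above.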

[0173] We use the clipping method of PPO to update the Actor network. Therefore, we update the Actor network by maximizing the following clipped objective through gradient ascent:

[0174]

[0175] Where r(θ) is the probability ratio between the new and old policies, which can be calculated as follows:

[0176]

[0177] η(x) is the clip function.
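The clipped PPO update can be sketched in its standard form as below. The clipping range `epsilon`, the use of log-probabilities to form the ratio, and the function names are assumptions. The function returns the negated objective, so minimizing this loss is equivalent to maximizing the clipped objective.

```python
import math


def clip(x, low, high):
    """Clip function: restrict x to the interval [low, high]."""
    return max(low, min(high, x))


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negated clipped surrogate objective, averaged over the batch."""
    total = 0.0
    for new_lp, old_lp, advantage in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio new / old policy
        unclipped = ratio * advantage
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
        total += min(unclipped, clipped)
    return -total / len(advantages)


loss = ppo_clip_loss(
    new_log_probs=[math.log(0.6), math.log(0.1)],
    old_log_probs=[math.log(0.3), math.log(0.2)],
    advantages=[1.0, -0.5],
)
```

Clipping keeps each update close to the policy that generated the online data, which fits the stated goal of not damaging the offline-trained model.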

[0178] We use gradient descent to minimize the following mean squared error objective to update the Critic network:

[0179]

[0180] As in the offline phase, we normalize the rewards in the online data to accelerate model convergence and improve model performance.

[0181] This embodiment also provides a distributed task offloading device based on reinforcement learning, including:

[0182] At least one processor;

[0183] At least one memory for storing at least one program;

[0184] When the at least one program is executed by the at least one processor, the at least one processor implements the method shown in Figure 1.

[0185] This embodiment of a distributed task offloading device based on reinforcement learning can execute a distributed task offloading method based on reinforcement learning provided in the method embodiment of the present invention. It can execute any combination of implementation steps of the method embodiment and has the corresponding functions and beneficial effects of the method.

[0186] This application also discloses a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the method shown in Figure 1.

[0187] This embodiment also provides a storage medium storing instructions or programs that can execute the distributed task offloading method based on reinforcement learning provided in the method embodiment of the present invention. When the instructions or programs are run, any combination of implementation steps of the method embodiment can be executed, and the method has the corresponding functions and beneficial effects.

[0188] In some alternative embodiments, the functions / operations mentioned in the block diagrams may not occur in the order shown in the operation diagrams. For example, depending on the functions / operations involved, two consecutively shown blocks may actually be executed substantially simultaneously, or the blocks may sometimes be executed in reverse order. Furthermore, the embodiments presented and described in the flowcharts of this invention are provided by way of example to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is altered and sub-operations described as part of a larger operation are executed independently.

[0189] Furthermore, although the invention has been described in the context of functional modules, it should be understood that, unless otherwise stated, one or more of the described functions and / or features may be integrated into a single physical device and / or software module, or one or more functions and / or features may be implemented in a separate physical device or software module. It is also understood that a detailed discussion of the actual implementation of each module is unnecessary for understanding the invention. Rather, given the properties, functions, and internal relationships of the various functional modules in the apparatus disclosed herein, the actual implementation of the module will be understood within the scope of conventional skill of an engineer. Therefore, those skilled in the art can implement the invention as set forth in the claims using ordinary techniques without excessive experimentation. It is also understood that the specific concepts disclosed are merely illustrative and not intended to limit the scope of the invention, which is determined by the full scope of the appended claims and their equivalents.

[0190] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, essentially, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0191] The logic and / or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.

[0192] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer disks (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, the computer-readable medium can even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.

[0193] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0194] In the foregoing description of this specification, references to terms such as "one embodiment," "another embodiment," or "some embodiments" indicate that a specific feature, structure, material, or characteristic described in connection with an embodiment or example is included in at least one embodiment or example of the present invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0195] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

[0196] The above is a detailed description of the preferred embodiments of the present invention. However, the present invention is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention. All such equivalent modifications or substitutions are included within the scope defined by the claims of this application.

Claims

1. A distributed task offloading method based on reinforcement learning, characterized in that it includes the following steps: modeling the system and using the system model to define an optimization problem; converting the optimization problem into a distributed partially observed Markov decision process, and determining the observations, actions, and rewards; deploying a heuristic algorithm in the system and collecting task execution logs; training the Actor and Critic models using an offline training algorithm based on the collected task execution logs; deploying the Actor and Critic models trained in the offline phase into the system; and, during system operation, updating the Actor and Critic models according to an online fine-tuning algorithm. The system model includes edge nodes, transmission links, and tasks. Modeling the system includes: modeling the system as a graph defined by the set of edge nodes and the set of transmission links; dividing the preset time into multiple equal time slots; representing the task received by an edge node in a time slot by a tuple, the associated quantities being the size of the task input data, the CPU cycles required to complete the task, the computation density, and the deadline of the task; each edge node processes arriving tasks locally or offloads them to other edge nodes; the processing time of a task at an edge node is expressed by a formula in terms of the computing power of that edge node; the time required to send a task from one edge node to another is expressed by a formula in terms of the transmission rate between the two edge nodes; during the processing time of a task at an edge node, the edge node has a failure rate, and the probability that a given number of failures occurs during the execution period is calculated from it; when the number of failures is zero, no failure occurred during the processing time, from which the reliability of the task at the edge node is calculated; the reliability of transmitting a task from one edge node to another is expressed by a formula in terms of the failure rate of the transmission link between them. The optimization problem defined using the system model includes: for each task of each edge node, the decision variable is a vector indicating the edge node to which the task will be offloaded; each task can be executed on only one edge node; the task completion time is expressed in terms of the transmission wait time, transmission time, execution wait time, and execution time; the task must be completed before its deadline, and tasks that exceed the deadline are discarded; the goal of the system is to maximize the task success rate within the preset time, expressed in terms of the number of successful tasks, the number of dropped tasks, and the number of failed tasks. Training the Actor and Critic models using the offline training algorithm includes: training a shared reinforcement learning model using the task execution logs of all edge nodes, and then deploying the reinforcement learning model on all edge nodes to perform distributed task offloading; designing an offline-training reinforcement learning algorithm based on discrete SAC (Soft Actor-Critic), wherein SAC uses two Q-networks to avoid overestimation, and separate parameters are defined for the two Q-networks, for the target networks of the two Q-networks, and for the Actor network; defining the soft-value function and the target value; updating the Q-network by minimizing a mean squared error using gradient descent; using a CQL regularization term to optimize the Q-network, and updating the Q-network by minimizing, through gradient descent, the mean squared error incorporating the CQL regularization term; updating the Actor network by maximizing the soft-value function through gradient ascent; and updating the Critic network via gradient descent with a mean squared error objective; wherein the quantities appearing in these formulas are the observation of each agent at each time step, the reward at each time step, the discount factor, the batch size, the number of agents, the action of each agent at each time step, the network function, the weight of the CQL regularization term, the CQL regularization term, the soft-value function, the Critic network function, and the target value. Updating the Actor and Critic models according to the online fine-tuning algorithm includes: using GAE to balance the bias introduced by the Critic and the variance introduced by the return, so as to address the overestimation problem caused by the distribution shift of online data.

2. The distributed task offloading method based on reinforcement learning according to claim 1, characterized in that determining the observations includes: in each time slot, each edge node observes local information in the edge network; the local information includes the task arrival probability, the failure rate of this edge node, the computing power of this edge node, the length of the execution queue of this edge node, the lengths of the receive queues between this edge node and each of the other edge nodes, and the transmission rates from this edge node to each of the other edge nodes; when a task arrives, the local information also includes the task's data size, task complexity, and task deadline; when no task arrives, these three entries are filled with 0; in each time slot, the observation of an edge node is defined as the tuple of the above items.

3. The distributed task offloading method based on reinforcement learning according to claim 1, characterized in that determining the actions includes: in a time slot, if a task has arrived at an edge node, the edge node chooses an action based on its current observation to decide which edge node the task is offloaded to; therefore, each edge node needs at least as many discrete actions as there are edge nodes, one representing each edge node; in a dynamic edge network environment, an edge node may not receive any task from end users in a given time slot; in order to construct the state transition tuple of the DRL, the edge node still needs to choose an action in that time slot, so an additional action is added to indicate that the edge node makes no offloading decision when no task arrives.

4. The distributed task offloading method based on reinforcement learning according to claim 1, characterized in that determining the reward includes: a task may take multiple time slots to complete; if a task is executed successfully, the corresponding edge node receives a feedback of +1 in the time slot in which the task is completed; if the task is dropped or fails to execute, the edge node receives a feedback of -1; in the distributed partially observed Markov decision process, all edge nodes share a team reward; in each time slot, the team reward is defined as the sum of the feedback from all edge nodes in that time slot.

5. A distributed task offloading device based on reinforcement learning, characterized in that it includes: at least one processor; and at least one memory for storing at least one program; wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method of any one of claims 1-4.

6. A computer-readable storage medium storing a processor-executable program, characterized in that the processor-executable program, when executed by a processor, is used to perform the method of any one of claims 1-4.