Distributed multi-agent task offloading migration method and system based on priority experience replay and meta-learning
By employing a distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning, the slow convergence speed and high computational complexity of deep reinforcement learning algorithms in high-mobility vehicle environments are addressed, achieving efficient task migration and load balancing in vehicular edge computing networks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING UNIV
- Filing Date
- 2023-11-10
- Publication Date
- 2026-06-30
AI Technical Summary
Existing deep reinforcement learning algorithms suffer from problems such as limited initial samples, slow convergence speed, and high computational complexity in high-mobility and high-dynamic environments where vehicles span multiple MEC servers, leading to a decrease in QoS and an increase in latency for in-vehicle applications.
A distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning is adopted. By using a priority experience replay cache pool and meta-learning algorithm, the convergence speed of the DRL algorithm is optimized, the dependence of the model on high-priority experience is adjusted, and the task migration and load balancing in the vehicle edge computing network are optimized.
It improves the model's computational performance and convergence speed, optimizes task migration and load balancing in the vehicle edge computing network, reduces quantization errors and computational complexity, and improves decision-making efficiency in the vehicle environment.
Figure CN117492864B_ABST
Description
Technical Field
[0001] This invention belongs to the field of vehicle networking technology and relates to a distributed multi-agent task offloading and migration method and system based on priority experience replay and meta-learning.
Background Technology
[0002] With the rapid development of the Internet of Vehicles (IoV), a series of in-vehicle applications covering information services, driving safety, and traffic efficiency have emerged. Furthermore, leveraging 5G technology and edge intelligence, by applying mobile edge computing technology to the IoV, in-vehicle edge computing technology can provide in-vehicle users with low-latency, high-bandwidth, and highly reliable application services.
[0003] In existing technologies, the high mobility of vehicles spanning multiple MEC (Mobile Edge Computing) servers and the increasing number of vehicles can lead to load imbalances among MEC servers, thereby reducing QoS (Quality of Service) and causing significant latency. Driven by edge intelligence, most research in the past two years has focused on solving offloading decisions using Deep Reinforcement Learning (DRL). The rapid development of artificial intelligence has led to the widespread application of Deep Reinforcement Learning algorithms, which combine the perception capabilities of deep learning with the decision-making capabilities of reinforcement learning, in areas such as autonomous driving, automatic translation, dialogue systems, and video detection.
[0004] While the DRL algorithm has demonstrated strong capabilities in solving complex Markov Decision Processes (MDPs) with high-dimensional state-action spaces, it still suffers from two major drawbacks when addressing decision-making problems in real-time V2X (vehicle-to-everything) communication. First, if the DQN algorithm (a classic reinforcement learning algorithm) is used as the primary technique for handling the discrete-continuous hybrid action space, discretizing the continuous actions leads to quantization errors and performance degradation because the DQN output depends on the selection of the optimal action. Furthermore, high-dimensional quantization in the continuous action space results in an exponential increase in computational complexity. Second, considering the high mobility and dynamic nature of the vehicle environment, existing DRL algorithms may experience mismatch problems when the environment changes, meaning that the algorithm cannot quickly make correct decisions in dynamic environments.
Summary of the Invention
[0005] The purpose of this invention is to provide a distributed multi-agent task offloading and migration method and system based on priority experience replay and meta-learning, so as to solve the problems of small initial samples and slow convergence speed, and optimize performance.
[0006] To achieve the above objectives, the basic solution of this invention is: a distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning, comprising the following steps:
[0007] Initialize the in-vehicle computing network, priority experience replay buffer pool, random noise and in-vehicle edge computing system environment, and receive the initial state of the in-vehicle edge computing system environment. The vehicle state vector for each time slot is s.
[0008] Each base station equipped with a server is defined as an intelligent agent. For each intelligent agent, actions are selected based on task offloading decisions and task migration decisions. The action vector generated by each vehicle in each time slot is a.
[0009] Each agent performs an action, calculates the reward r, and obtains new state information;
[0010] Calculate sampling priority;
[0011] The state vector s, action vector a, reward r, and new state s' are combined into an experience tuple, and the experience tuple is stored in the priority experience replay buffer pool.
[0012] When the number of samples in the priority experience replay buffer pool reaches the preset value, each agent samples experiences from the pool according to the sampling probability and places them in the sampling buffer pool (Buf_sample);
[0013] Calculate the sampling weights;
[0014] Calculate the minimum loss function based on the sampling weights and update the initial network;
[0015] The parameters of each agent are updated according to the gradient formula until the preset training rounds are reached. The parameters of the target network and the global network are then updated based on the parameters of each agent.
[0016] The working principle and beneficial effects of this basic solution are as follows: This technical solution designs a priority factor, replaces traditional experience replay with priority experience replay, adjusts the model's dependence on high-priority experience, and improves the DRL algorithm based on meta-learning to alleviate the problems of few initial samples and slow convergence speed, thereby optimizing the convergence speed and improving the model's computational performance. Based on intelligent edge computing, it utilizes multi-agent deep reinforcement learning to solve the problem of partially offloading computationally intensive tasks, and simultaneously optimizes the task migration and load balancing problems in vehicular edge computing networks caused by the highly dynamic environment, uneven spatiotemporal distribution of demand, and imbalance between resource supply and demand.
[0017] Furthermore, the method by which each agent selects actions based on task offloading decisions and task migration decisions is as follows:
[0018] In each time slot, base station agent e observes the vehicle environment within the coverage area of its base station and collects environmental parameters as the observation state. The state vector of each time slot is s. The agent makes decisions according to its policy, and the action vector generated in each time slot is a. Each action comprises a vehicle offloading decision and a task migration decision on the VEC server.
[0019] State space: the state $S_t$ of the system at time t is represented as:
[0020]
[0021] where E is the number of vehicular edge computing servers in the system. The state of agent e at time t comprises the task information of each vehicle n∈N arriving at time t, the current channel state between vehicle n and the base station, and the computing load of the servers within the service range of the current base station at time t. The task information of vehicle n consists of its task data volume, the total number of CPU cycles $C_n$ required to compute the task, and the tolerable delay $T_{tolerance}$ of the task. N is the total number of vehicles in the vehicle network at time t, and N is a positive integer;
[0022] Action space: the offloading and migration decisions made by agent e at time t after observing its local environment constitute its action. The system action space set $a_t$ is represented as:
[0023]
[0024] where the first component is the decision on the proportion of its task that vehicle v offloads to the VEC servers within the current service range at time t, and the second component is the migration decision at time t on whether to migrate the tasks currently on vehicular edge computing server e to server e′.
[0025] This optimizes task migration and load balancing among the servers providing service, which facilitates data processing.
[0026] Furthermore, the method for calculating the reward r for each agent's action is as follows:
[0027] In a distributed strategy, each agent's goal is to maximize its own utility, and the system's reward function is defined as the agent's utility at time t.
[0028] The optimization problem of the system is modeled as minimizing the overall task processing latency, minimizing migration costs, and balancing the load on the vehicle-mounted edge computing server. The reward function r of the agent is defined as:
[0029] $r = -(\lambda_1 T + \lambda_2 C_{migrate} + (1-\lambda_2)\,LBF)$
[0030] where $\lambda_1$ and $\lambda_2$ are user-defined weight parameters with $\lambda_2 \in [0,1]$, T is the number of training iterations, $C_{migrate}$ is the migration cost, and LBF is the load balancing factor.
[0031] The migration cost $C_{migrate}$ is represented as:
[0032]
[0033] where $x_j$ is the total data size of task j, $\lambda_j$ is the proportion of task j that is migrated, and $\lambda_j x_j$ is the amount of data to be migrated. m≠m′ indicates that the computing task is migrated from the m-th MEC to the m′-th MEC, while m=m′ indicates that the computing task is not migrated.
[0034] The computing load $L_m$ of the m-th MEC server is represented as:
[0035]
[0036] where M is the number of MEC servers, J is the number of tasks, and $L_m$ is the sum of the computational tasks on the m-th MEC server. If task j is processed by server m, then...
[0037] The average computational load of all MECs is represented as:
[0038]
[0039] To determine whether the computational load is fairly distributed among the MECs in the system, load balancing is measured by the deviation of the computational load, and the Load Balancing Factor (LBF) is defined as:
[0040]
[0041] The optimization problem of the system is modeled as minimizing the overall task processing latency of the system, minimizing migration costs, and balancing the load on the VEC server.
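As a rough illustration of how the reward in [0029] could be evaluated, the Python sketch below combines its three terms. The equations for the migration cost and for the LBF are not reproduced in the text above, so two forms are assumed here: the migration cost is taken as the sum of λ_j·x_j over the tasks with m≠m′, and the LBF is taken as the population standard deviation of the per-server loads. The function names and the tuple layout are hypothetical.

```python
from statistics import pstdev


def migration_cost(tasks):
    """Sum the migrated data volume lam_j * x_j over tasks that change server.

    Each task is a tuple (x_j, lam_j, m, m_prime). ASSUMPTION: the cost is
    this plain sum; the source equation is not available.
    """
    return sum(lam * x for x, lam, m, m_prime in tasks if m != m_prime)


def load_balancing_factor(loads):
    """ASSUMPTION: LBF is the population standard deviation of server loads."""
    return pstdev(loads)


def reward(t_term, tasks, loads, lam1, lam2):
    """r = -(lam1 * T + lam2 * C_migrate + (1 - lam2) * LBF), as in [0029]."""
    if not 0.0 <= lam2 <= 1.0:
        raise ValueError("lam2 must lie in [0, 1]")
    c_migrate = migration_cost(tasks)
    lbf = load_balancing_factor(loads)
    return -(lam1 * t_term + lam2 * c_migrate + (1.0 - lam2) * lbf)


# One task migrates half of its 10 units of data; the other stays in place.
example_tasks = [(10.0, 0.5, 0, 1), (4.0, 1.0, 2, 2)]
example_reward = reward(1.0, example_tasks, [2.0, 2.0, 2.0], lam1=1.0, lam2=0.5)
```

With perfectly balanced loads the LBF term vanishes, so the reward in this example is driven only by the T term and the migration cost.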
[0042] Furthermore, the calculation scheme for sampling priority is as follows:
[0043] Calculate the target network loss (Loss):
[0044]
[0045] where $y_e$ is the target value for training the neural network on server e; $Q_{\theta_{e,c}}$ is the Q-value computed by the critic network, with e the server index, c the critic network index, and $\theta_{e,c}$ the parameters of the critic network; μ is the function that produces actions; s′ is the new state reached after action a is performed in state s; and $a = \{a_1, \ldots, a_e, \ldots, a_E\}$ is the action to be performed;
[0046] Considering both the loss and the number of training iterations T for which an experience has been extracted, an importance factor $\delta_i$ is designed to measure the number of times the experience has been extracted:
[0047]
[0048] where i is the experience index; $T_i$ is the number of times the experience has been used; Dis is the step size in the episode executed by the neural network; and ε is a weight parameter.
[0049] Each experience in the experience buffer is assigned a selected probability p(i) based on the importance factor and the target network loss value Loss(i), with the priority being:
[0050]
[0051] where
[0052] $p(i) = \alpha \cdot Loss(i) + \delta_i$
[0053] p(i) is the probability that the experience is selected; σ is the amplification factor applied to the priority: the larger σ is, the more strongly extraction depends on the size of p(i). A probability offset prevents starvation of experiences whose p(i) is too small for them to be drawn; α is the weight parameter; n∈N represents a vehicle, N is the total number of vehicles in the vehicle network at time t, and N is a positive integer.
[0054] The sampling priority is calculated using the loss function. The larger the loss, the greater the difference between the evaluation value and the actual value of the target network for this experience. In this case, the sampling frequency needs to be increased to update the values of the target network and the evaluation network as soon as possible to achieve the best training effect.
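The following Python sketch shows one way the priority p(i) = α·Loss(i) + δ_i could be turned into sampling probabilities. The priority equation itself is not reproduced in the text, so the normalization used here, P(i) proportional to (p(i) + offset)^σ, is an assumption borrowed from common prioritized replay practice. The importance factor δ_i is passed in as a precomputed value because its formula is likewise unavailable. All function names are hypothetical.

```python
import random


def raw_priority(loss, delta, alpha):
    """p(i) = alpha * Loss(i) + delta_i, as stated in the text."""
    return alpha * loss + delta


def sampling_probabilities(losses, deltas, alpha, sigma, offset=1e-3):
    """ASSUMPTION: P(i) is (p(i) + offset) ** sigma, normalized to sum to 1.

    The offset keeps experiences with a tiny p(i) from never being drawn;
    sigma controls how strongly sampling follows the priorities.
    """
    scaled = [
        (raw_priority(loss, delta, alpha) + offset) ** sigma
        for loss, delta in zip(losses, deltas)
    ]
    total = sum(scaled)
    return [value / total for value in scaled]


def sample_indices(probs, k, rng=random):
    """Draw k experience indices (with replacement) according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=k)


probs = sampling_probabilities([0.1, 1.0, 0.5], [0.0, 0.0, 0.0], alpha=1.0, sigma=1.0)
uniform = sampling_probabilities([0.1, 1.0, 0.5], [0.0, 0.0, 0.0], alpha=1.0, sigma=0.0)
batch = sample_indices(probs, k=5, rng=random.Random(0))
```

Setting σ to 0 recovers uniform sampling, while larger values concentrate sampling on high-loss experiences, which matches the role the text gives to σ.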
[0055] Furthermore, the target Q-value $y_e$ of agent e is calculated as:
[0056]
[0057] where $r_e$ is the reward of agent e and γ is the discount rate; the Q-value is computed by the critic network, with e the agent index and c the critic network index (there are two critic networks, so c is either 1 or 2); μ′ is the function that produces the action to be executed; $s_e$ is the current state of the agent; and ϵ is random noise.
[0058] It is simple to operate and easy to use.
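For orientation, the sketch below computes a target Q-value in the way the standard TD3 algorithm does: the reward plus the discounted minimum of the two target critics, evaluated at the target actor's action with clipped noise added. The target equation is not reproduced in the text, so treating it as the standard TD3 form is an assumption, as are the scalar state and action, the noise parameters, and every function name.

```python
import random


def td3_target(reward, next_state, target_actor, target_critics, gamma,
               noise_std=0.1, noise_clip=0.5, rng=random):
    """Standard TD3-style target: r + gamma * min_c Q'_c(s', mu'(s') + noise).

    ASSUMPTION: this mirrors textbook TD3, not necessarily the exact
    equation of the patent. States and actions are scalars for brevity.
    """
    # Clipped Gaussian noise smooths the target policy.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = target_actor(next_state) + noise
    # Taking the smaller of the two critic estimates limits overestimation.
    q_min = min(critic(next_state, next_action) for critic in target_critics)
    return reward + gamma * q_min


# Toy stand-ins for the target networks (hypothetical).
toy_actor = lambda s: 2.0 * s
toy_critics = [lambda s, a: s + a, lambda s, a: s + a + 1.0]
y = td3_target(1.0, 1.0, toy_actor, toy_critics, gamma=0.5, noise_std=0.0)
```

Using two critics (c = 1, 2) and taking their minimum is what the "twin" in TD3 refers to.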
[0059] Furthermore, the method for calculating the sampling weights using the annealing factor is as follows:
[0060] The importance sampling weight $w_{sample}$ is set as:
[0061]
[0062] Where S is the number of samples in the experience buffer; p(t) is the sampling probability; β∈[0,1] is the annealing factor used to control the impact of priority experience replay on the convergence result; and max(w) is the maximum sampling weight among all sampling weights.
[0063] By incorporating the annealing factor into priority experience replay, the model performance is optimized.
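A minimal Python sketch of importance sampling weights built from the quantities named in [0062] follows. The weight equation is not reproduced in the text, so the form used here, (1 / (S·p))^β divided by the largest weight, is an assumption taken from common prioritized replay practice. The linear schedule for β is also only an illustration of how an annealing factor might be varied during training; its start and end values are hypothetical.

```python
def importance_weights(probs, buffer_size, beta):
    """ASSUMPTION: w = (1 / (S * p)) ** beta, normalized by max(w).

    probs       sampling probabilities of the drawn experiences
    buffer_size S, the number of samples in the experience buffer
    beta        annealing factor in [0, 1]
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    raw = [(1.0 / (buffer_size * p)) ** beta for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]


def annealed_beta(step, total_steps, beta_start=0.4, beta_end=1.0):
    """Hypothetical linear schedule moving beta from beta_start to beta_end."""
    fraction = min(1.0, step / total_steps)
    return beta_start + fraction * (beta_end - beta_start)


no_correction = importance_weights([0.5, 0.25, 0.25], buffer_size=4, beta=0.0)
full_correction = importance_weights([0.5, 0.25, 0.25], buffer_size=4, beta=1.0)
```

With β = 0 every weight equals 1, so prioritized sampling is left uncorrected; with β = 1 frequently sampled experiences are down-weighted in full.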
[0064] Furthermore, the method for calculating the minimum loss function and updating the initial network is as follows:
[0065] The Q-loss function $L(\theta_{e,c=1,2})$ to be minimized for the initial networks Critic 1 and Critic 2 is defined as:
[0066]
[0067] The importance sampling weight $w_{sample}$ acts as an adjustment factor: it amplifies or reduces the influence of each sample when the loss function is calculated, correcting the prediction bias caused by non-uniform sampling and thus guiding the network toward more accurate predictions.
[0068] Introduce a loss function to optimize network performance.
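The effect of the importance sampling weight on the critic loss can be sketched as a weighted mean squared error between the target values and the critic's Q-values. The loss equation is not reproduced in the text, so this particular form is an assumption, and the function name is hypothetical.

```python
def weighted_critic_loss(targets, q_values, weights):
    """ASSUMPTION: loss = mean over the batch of w_i * (y_i - Q_i) ** 2."""
    if not (len(targets) == len(q_values) == len(weights)):
        raise ValueError("targets, q_values and weights must have equal length")
    if not targets:
        raise ValueError("batch must not be empty")
    total = sum(w * (y - q) ** 2 for y, q, w in zip(targets, q_values, weights))
    return total / len(targets)


# The first sample has error 1 but weight 0.5; the second has no error.
loss = weighted_critic_loss([1.0, 2.0], [0.0, 2.0], [0.5, 1.0])
```

A sample drawn with high probability carries a small weight, so its error contributes less to the loss than the same error on a rarely drawn sample.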
[0069] Furthermore, the method for updating the target network parameters using the parameters of each agent is as follows:
[0070] Using a meta-learning algorithm, the parameters of the actor network, critic network, and their corresponding target network constructed for each task are optimized through global shared initialization of parameters; each task obtains its own network weights by sampling different experiences from the priority experience replay cache pool.
[0071] Since the network loss function for each task is differentiable, and gradient descent is used to update the network weights based on sampling experience, then based on multiple gradient updates, the network parameters for task j are updated as follows:
[0072]
[0073] where the experience set is obtained by priority sampling from the experience replay buffer Buf of task j, with p standing for Priority; the critic and actor network weights carry an iteration index and a task index j, where k denotes the current iteration and k-1 the previous iteration; $\beta_c$ is the individual update learning rate of the critic network and $\beta_a$ is the individual update learning rate of the actor network; the remaining terms are the loss functions of the critic network and the actor network and the gradients of those losses with respect to the corresponding parameters, along which gradient descent is performed;
[0074] Every d steps, the gradient is calculated to update the parameters φ of the actor network. The parameters φ of the actor network are updated through the policy gradient:
[0075]
[0076] where the averaged term is the mean of the gradient updates over the network parameters; $\mu_e$ is the action acquisition function, which takes the state $s_e$ as input and returns the action a, i.e. $a = \mu(s_e) + \epsilon$; Q is the Q-network; μ is the function for acquiring actions, with the same meaning as μ′; $s_e$ is the current state of the agent; and ϵ is random noise.
[0077] The agent updates the target network parameters:
[0078] $\theta'_{e,c} \leftarrow \tau\theta_{e,c} + (1-\tau)\theta'_{e,c}$
[0079] $\phi' \leftarrow \tau\phi + (1-\tau)\phi'$
[0080] where $\theta'_{e,c}$ are the target critic network parameters being updated and $\theta_{e,c}$ are the current critic network parameters, with e the agent index and c the critic network index; φ′ are the target actor network parameters being updated and φ are the current actor network parameters; τ is an adjustable weight parameter.
[0081] The convergence speed of the DRL (Deep Reinforcement Learning) algorithm is optimized based on MAML (Model-Agnostic Meta-Learning), thereby improving network performance.
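The target network update in [0078] and [0079] is a soft (Polyak) update, which can be sketched in a few lines of Python. Parameters are represented as flat lists of floats purely for illustration, and the function name is hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each parameter pair."""
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_params, online_params)
    ]


# Halfway blend of the target parameters toward the online parameters.
blended = soft_update([0.0, 10.0], [1.0, 0.0], tau=0.5)
```

A small τ makes the target networks trail the trained networks slowly, which tends to stabilize the target values used in the critic loss.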
[0082] Furthermore, the method for updating global network parameters is as follows:
[0083] By aggregating the adaptability of the trained policy to new sampling experiences for each task and performing a global update, the loss functions of each agent are summed to obtain the optimized global network parameters. The loss function is:
[0084]
[0085] where the terms denote, respectively, the loss function of the critic network for task j, the loss function of the actor network for task j, the network weights of the actor network for task j, and the network weights of the critic network for task j.
[0086] Based on gradient descent, the optimization function for the global network parameters is:
[0087]
[0088]
[0089]
[0090] Once both individual-level and global-level updates are complete, the algorithm enters the next round to continue updating the global network parameters;
[0091] In the meta-training phase, the network parameters for a new task are initialized at the start of the time step to the trained global critic parameters θ and the trained global actor parameters, and are then updated.
[0092] Update global network parameters to reduce quantization error and improve network performance.
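The two-level structure described above (individual updates per task from a shared initialization, followed by a global update that sums the task losses) can be illustrated with a toy first-order sketch in Python. Everything here is an assumption made for illustration: each task loss is a scalar quadratic (θ − c_j)², the global step uses a first-order approximation that ignores second derivatives, and the learning rates and function names are hypothetical. It is not the update rule of the patent, whose equations are not reproduced in the text.

```python
def task_grad(theta, centre):
    """Gradient of the toy task loss (theta - centre) ** 2."""
    return 2.0 * (theta - centre)


def inner_update(theta, centre, lr_inner, steps=1):
    """Individual-level update: a few gradient steps from the shared init."""
    for _ in range(steps):
        theta = theta - lr_inner * task_grad(theta, centre)
    return theta


def meta_train(theta, centres, lr_inner=0.1, lr_outer=0.05, rounds=100):
    """Global-level update using summed task gradients (first-order form)."""
    for _ in range(rounds):
        # Each task adapts its own copy of the shared parameters.
        adapted = [inner_update(theta, c, lr_inner) for c in centres]
        # Sum the task gradients evaluated at the adapted parameters.
        meta_grad = sum(task_grad(a, c) for a, c in zip(adapted, centres))
        theta = theta - lr_outer * meta_grad
    return theta


theta_global = meta_train(0.0, [1.0, 2.0, 3.0])
```

In this toy setting the shared initialization settles at the point from which every task can be reached with equally little adaptation, here the mean of the task optima.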
[0093] The present invention also provides a distributed multi-agent task offloading and migration system based on priority experience replay and meta-learning, including M base stations set along a one-way lane, each base station being equipped with K antennas and each base station being connected to a server, and N mobile vehicle agents along the road, each vehicle carrying a single antenna with limited computing resources, where K, M, and N are all positive integers and K>N.
[0094] The vehicle and server perform task offloading and migration according to the method described in this invention.
[0095] Using this system, based on the priority experience replay distributed algorithm, the convergence speed of the network algorithm is optimized, the dependence of the model on high-priority experience is adjusted, and the task offloading and migration of the agent is realized.
Attached Figure Description
[0096] Figure 1 is a flowchart illustrating the distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning of the present invention.
Detailed Implementation
[0097] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0098] In the description of this invention, it should be understood that the terms "longitudinal", "lateral", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this invention.
[0099] In the description of this invention, unless otherwise specified and limited, it should be noted that the terms "installation", "connection" and "linking" should be interpreted broadly. For example, they can refer to mechanical or electrical connections, or internal connections between two components. They can be direct connections or indirect connections through an intermediate medium. Those skilled in the art can understand the specific meaning of the above terms according to the specific circumstances.
[0100] This invention discloses a distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning. It proposes a distributed multi-agent twin delayed deep deterministic policy gradient algorithm based on priority experience replay and, borrowing the idea of the simulated annealing (SA) algorithm, introduces an annealing factor into the priority experience replay. The result is a distributed MATD3 based on annealing-controllable priority experience replay (APER-MATD3), which adjusts the model's dependence on high-priority experiences. Furthermore, it optimizes the convergence speed of the DRL (Deep Reinforcement Learning) algorithm based on MAML (Model-Agnostic Meta-Learning).
[0101] As shown in Figure 1, the distributed multi-agent task offloading and migration method includes the following steps:
[0102] Initialize the in-vehicle computing network, priority experience replay buffer pool, random noise and in-vehicle edge computing system environment, and receive the initial state of the in-vehicle edge computing system environment. The vehicle state vector for each time slot is s.
[0103] Each base station equipped with a server is defined as an intelligent agent. For each intelligent agent, actions are selected based on task offloading decisions and task migration decisions. The action vector generated by each vehicle in each time slot is a.
[0104] Each agent performs an action (in the neural network, the agent's current state is s, and after performing an action in state s, it will obtain a new state s', and so on, continuously training), calculates the reward r, and obtains new state information;
[0105] Calculate sampling priority;
[0106] The state vector s, action vector a, reward r, and new state s' are combined into an experience tuple, and the experience tuple is stored in the priority experience replay buffer pool.
[0107] When the number of samples in the priority experience replay buffer pool reaches the preset value, each agent samples experiences from the pool according to the sampling probability and places them in the sampling buffer pool (Buf_sample);
[0108] Calculate the sampling weights;
[0109] Calculate the minimum loss function based on the sampling weights and update the initial network;
[0110] The parameters of each agent are updated according to the gradient formula until the preset training rounds (the number of iterations) are reached. The parameters of the target network and the global network are then updated using the parameters of each agent.
[0111] In a preferred embodiment of the present invention, the method for initializing the network, priority experience replay buffer pool, random noise, and vehicular edge computing system environment, and receiving the initial state of the vehicular edge computing system environment (a system model constructed considering factors such as multi-agent, vehicle, task information, and channel conditions) is as follows:
[0112] For each base station agent e, initialize three main networks and three target networks, where e is the index of the base station agent and is a positive integer. The main networks comprise two critic networks and one actor network, with random network parameters $\theta_{e,1}$, $\theta_{e,2}$ and $\phi_e$ (the actor network of base station agent e). The target networks comprise two target critic networks $Q'_{\theta_{e,1}}$ and $Q'_{\theta_{e,2}}$ and one target actor network. The three initialized main network parameters are copied to the target networks, i.e. $\theta'_{e,1} = \theta_{e,1}$, $\theta'_{e,2} = \theta_{e,2}$, $\phi'_e = \phi_e$, and the priority experience replay buffer pool Buf is initialized;
[0113] A random noise ξ is initialized for action exploration, and the initial state of each base station equipped with a VEC server is received. This state includes the status of each vehicle task within the base station's coverage area, the communication status between the vehicles and the base station, and the resource status of the VEC server. For each agent e, a random action is selected, where $\varepsilon_e$ is the noise and the remaining term is the output of the actor network for the input state.
[0114] In a preferred embodiment of the present invention, the method by which each agent selects an action based on task unloading decisions and task migration decisions is as follows:
[0115] In each time slot, base station agent e observes the vehicle environment within the coverage area of its base station and collects environmental parameters as the observation state. The state vector of each time slot is s. The agent makes decisions according to its policy, and the action vector generated in each time slot is a. Each action comprises a vehicle offloading decision and a task migration decision on the VEC server.
[0116] State space: the state $S_t$ of the system at time t is represented as:
[0117]
[0118] where E is the number of vehicular edge computing servers in the system. The state of agent e at time t comprises the task information of each vehicle n∈N arriving at time t, the current channel state between vehicle n and the base station, and the computing load of the servers within the service range of the current base station at time t. The task information of vehicle n consists of its task data volume, the total number of CPU cycles $C_n$ required to compute the task, and the tolerable delay $T_{tolerance}$ of the task. N is the total number of vehicles in the vehicle network at time t, and N is a positive integer;
[0119] Action space: the offloading and migration decisions made by agent e at time t after observing its local environment constitute its action. The system action space set $a_t$ at time t is represented as:
[0120]
[0121] where the first component is the decision on the proportion of its task that vehicle v offloads to the VEC servers within the current service range at time t, and the second component is the migration decision at time t on whether to migrate the tasks currently on vehicular edge computing server e to server e′.
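To make the two components of an agent's action concrete, the Python sketch below models an action as a set of per-vehicle offloading proportions plus an optional migration target. The class, its field names, the use of string vehicle identifiers, and the clipping helper are all hypothetical; the only facts taken from the text are that the offloading decision is a proportion and that the migration decision names a destination server e′ or no migration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def clip_ratio(raw):
    """Clamp a raw actor output to a valid offloading proportion in [0, 1]."""
    return max(0.0, min(1.0, raw))


@dataclass(frozen=True)
class AgentAction:
    """Action of one base station agent in one time slot (illustrative)."""

    # Vehicle id -> proportion of its task offloaded to the VEC servers.
    offload_ratio: Dict[str, float]
    # Index of the destination server e', or None when nothing is migrated.
    migrate_to: Optional[int] = None

    def __post_init__(self):
        for vehicle, ratio in self.offload_ratio.items():
            if not 0.0 <= ratio <= 1.0:
                raise ValueError(f"offload ratio for {vehicle} outside [0, 1]")


# Noisy actor outputs are clipped before the action is built.
action = AgentAction({"v1": clip_ratio(0.4), "v2": clip_ratio(1.3)}, migrate_to=2)
```

Keeping the continuous proportion and the discrete migration choice in one object reflects the hybrid discrete-continuous action space that the background section describes.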
[0122] In a preferred embodiment of the present invention, the method for calculating the reward r for each agent performing an action is as follows:
[0123] In a distributed strategy, each agent's goal is to maximize its own utility, and the system's reward function is defined as the agent's utility at time t.
[0124] If we model the system optimization problem as minimizing the overall system task processing latency, minimizing migration costs, and balancing the load on the vehicle-mounted edge computing server, then the maximum utility that the agent can obtain is the solution of minimizing this optimization problem. The agent's reward function r is defined as:
[0125] $r = -(\lambda_1 T + \lambda_2 C_{migrate} + (1-\lambda_2)\,LBF)$
[0126] where $\lambda_1$ and $\lambda_2$ are user-defined weight parameters with $\lambda_2 \in [0,1]$, T is the number of training iterations, $C_{migrate}$ is the migration cost, and LBF is the load balancing factor.
[0127] The migration cost $C_{migrate}$ is represented as:
[0128]
[0129] where $x_j$ is the total data size of task j, $\lambda_j$ is the proportion of task j that is migrated, and $\lambda_j x_j$ is the amount of data to be migrated. m≠m′ indicates that the computing task is migrated from the m-th MEC to the m′-th MEC, while m=m′ indicates that the computing task is not migrated.
[0130] The computing load $L_m$ of the m-th MEC server is represented as:
[0131]
[0132] where M is the number of MEC servers, J is the number of tasks, and $L_m$ is the sum of the computational tasks on the m-th MEC server. If task j is processed by server m, then...
[0133] The average computational load of all MECs is represented as:
[0134]
[0135] To determine whether the computational load is fairly distributed among the MECs in the system, load balancing is measured by the deviation of the computational load, and the Load Balancing Factor (LBF) is defined as:
[0136]
[0137] In a preferred embodiment of the present invention, the sampling priority is calculated as follows:
[0138] By combining the multi-agent TD3 (Twin Delayed Deep Deterministic policy gradient algorithm) algorithm with the distributed characteristics of the vehicle edge computing environment, the distributed multi-agent TD3 algorithm is introduced into the VEC system to solve the joint optimization problem of vehicle task offloading and migration and VEC server load balancing.
[0139] In PER (Priority Experience Replay), each experience is assigned a priority. When sampling experiences for learning, those with higher priority are sampled more frequently. In this way, PER can improve the utilization of experiences that are more valuable for learning, thereby increasing learning efficiency.
[0140] In the MATD3 (multi-agent twin delayed deep deterministic policy gradient) algorithm, each agent extracts its own experience to train its own neural network. To facilitate the extraction of higher-quality experience, the experience content stored in the experience cache includes not only the current state, the action taken, the next state, and the reward, but also data such as the target evaluation network loss (Loss), the number of training iterations (T), and the priority of the current experience. Priority is the sole indicator of the importance of experience and is the basis for extraction.
[0141] Calculate the target network loss (Loss):
[0142] where $y_e$ is the target value for training the neural network on server e; $Q_{\theta_{e,c}}$ is the Q-value computed by the critic network, with e the server index, c the critic network index, and $\theta_{e,c}$ the parameters of the critic network; μ is the function that produces actions; s′ is the new state reached after action a is performed in state s; and $a = \{a_1, \ldots, a_e, \ldots, a_E\}$ is the action to be performed;
[0143] The larger the loss, the greater the difference between the evaluation value and the actual value of the target network for this experience. It is necessary to increase the sampling frequency in order to update the values of the target network and the evaluation network as soon as possible to achieve the best training effect.
[0144] When prioritizing experience, a key factor δ is designed to measure the number of experience extractions by comprehensively considering both Loss and the number of training iterations T. i :
[0145]
[0146] Where i is the experience index; $T_i$ represents the number of times the experience has been used; Dis represents the step size in each episode (the step size of the neural network in each exploration round); and ε is a weight parameter that keeps the value of $\delta_i$ from becoming too large, which facilitates subsequent calculations;
[0147] Each experience in the experience buffer is assigned a selection probability p(i) based on the importance factor $\delta_i$ and the target network loss value Loss(i), and its priority is:
[0148]
[0149] where
[0150] $p(i) = \alpha \cdot Loss(i) + \delta_i$
[0151] p(i) represents the selection probability of the experience; σ represents the amplification factor applied to the priority: the larger σ is, the more strongly experience extraction depends on the magnitude of p(i). σ is an importance sampling factor that changes over time and represents the degree to which the probability is affected. In the initial stage, when the network does not yet have a reasonable model, training converges slowly; increasing σ causes valuable experiences to be replayed more extensively, helping the network converge quickly on the basis of high-priority experiences. As the decision model gradually approaches the optimal model, σ can be decreased to avoid overfitting the policy through continuous replay of high-priority experiences, thereby increasing the robustness of the decision model.
[0152] The probability offset prevents starvation, in which an experience is almost never drawn because p(i) is too small; α is a weighting parameter that prevents large differences between the values of Loss(i) and $\delta_i$, which would be detrimental to the calculation and evaluation of results. The larger the values of Loss(i) and the importance factor, the greater the probability that the experience will be selected. n∈N represents a vehicle, where N is the total number of vehicles in the vehicle network at time t and is a positive integer.
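As a non-limiting illustration, the priority calculation can be sketched as follows. Only $p(i) = \alpha \cdot Loss(i) + \delta_i$ is taken from the text. The formulas for $\delta_i$ and Priority(i) are not reproduced here, so the sketch takes the importance factors as given inputs and assumes a normalized power form with an additive offset for the priority. The names `selection_scores`, `priorities`, and `offset` are hypothetical.

```python
def selection_scores(losses, deltas, alpha):
    """p(i) = alpha * Loss(i) + delta_i, as stated in the text."""
    return [alpha * loss + delta for loss, delta in zip(losses, deltas)]


def priorities(scores, sigma, offset):
    """ASSUMED form: p(i)**sigma normalized over the buffer, plus an offset.

    This is a common prioritized-replay stand-in, not the patent's formula.
    Scores must be positive so that the normalizer is non-zero.
    """
    powered = [p ** sigma for p in scores]
    total = sum(powered)
    return [x / total + offset for x in powered]


losses = [0.9, 0.1, 0.4]
deltas = [0.05, 0.20, 0.10]  # hypothetical importance factors
p = selection_scores(losses, deltas, alpha=0.5)
prio = priorities(p, sigma=2.0, offset=0.01)
```

Under this assumed form, a larger σ concentrates the priority on the experiences with the largest p(i), while the offset keeps every experience drawable.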
[0153] After the agent interacts with the environment, it stores the obtained single experience in the experience cache pool. At this time, the Loss and Priority of the experience are empty, and the number of samplings T is 0. Every K steps of exploration, the loss of these K steps of experience is calculated in batches and written into the experience pool, ensuring that the loss of all experiences has been calculated before sampling training.
[0154] M experiences are extracted from the experience cache pool. The priority (Priority(i)) of each experience is calculated, and each experience is added to a minibatch with probability (Priority(i)). This extraction of M experiences is repeated until the minibatch reaches a specified size. After training, the network parameters of the actor network, critic network, and corresponding target network have changed. Considering algorithm complexity and computational cost, the loss of the extracted minibatch experiences is only updated during training; the loss of other experiences in the experience cache pool is not recalculated.
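The storage and extraction steps described above can be sketched as follows, under stated assumptions. The `Experience` record, `fill_missing_losses`, and `sample_minibatch` are hypothetical names; the loss and priority functions are passed in as placeholders; priorities are assumed to lie in (0, 1]; and the same experience may be accepted more than once, which the text does not specify.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experience:
    state: tuple
    action: tuple
    reward: float
    next_state: tuple
    loss: Optional[float] = None      # empty until the batched loss pass
    priority: Optional[float] = None  # empty until computed during sampling
    times_sampled: int = 0            # T in the text, starts at 0


def fill_missing_losses(buffer, loss_fn):
    """Batch-compute Loss for experiences stored since the last pass."""
    for exp in buffer:
        if exp.loss is None:
            exp.loss = loss_fn(exp)


def sample_minibatch(buffer, priority_fn, m, batch_size, rng):
    """Draw M candidates per round; keep each with probability Priority(i)."""
    minibatch = []
    while len(minibatch) < batch_size:
        for exp in rng.sample(buffer, min(m, len(buffer))):
            if len(minibatch) == batch_size:
                break
            exp.priority = priority_fn(exp)
            if rng.random() < exp.priority:
                exp.times_sampled += 1
                minibatch.append(exp)
    return minibatch


rng = random.Random(0)
buffer = [Experience((i,), (0,), 0.0, (i + 1,)) for i in range(20)]
# Toy loss and priority functions, for illustration only.
fill_missing_losses(buffer, loss_fn=lambda e: 0.1 + 0.04 * e.state[0])
batch = sample_minibatch(buffer, lambda e: min(1.0, e.loss),
                         m=8, batch_size=5, rng=rng)
```

Only the experiences that reach the minibatch would have their loss refreshed after training, which matches the cost consideration stated above.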
[0155] More preferably, the target Q-value $y_e$ of agent e is calculated as:
[0156]
[0157] Where $r_e$ is the reward function for agent e; γ is the discount rate, a weight parameter set by the user; Q is the Q-value calculated by the critic network, where e is the agent's index and c is the critic network's index (there are two critic networks, so c is either 1 or 2); μ′ is the function for obtaining the action to be executed, with $a_i \sim \mu_i(s_i) + \epsilon$, where $a_i$ is the action obtained and $s_i$ is a state; $s_e$ represents the current state of the agent; and $\epsilon$ represents random noise.
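For illustration only, the target value can be sketched under the assumption that it follows the standard TD3 form: the reward plus the discounted minimum of the two target critics, evaluated at a noise-perturbed target action. The exact formula of this embodiment is not reproduced here and may differ. The function `td3_target` and the parameters `noise_std` and `noise_clip` are hypothetical, and scalar states and actions are used for brevity.

```python
import random


def td3_target(reward, gamma, next_state, target_actor, target_critics,
               noise_std=0.2, noise_clip=0.5, rng=random):
    """y = r + gamma * min_c Q'_c(s', mu'(s') + noise)  (assumed TD3 form)."""
    # Clipped Gaussian noise smooths the target action.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = target_actor(next_state) + noise
    # Taking the smaller of the two critic estimates limits overestimation.
    q_next = min(critic(next_state, next_action) for critic in target_critics)
    return reward + gamma * q_next


# Toy target networks, for illustration only.
y = td3_target(
    reward=1.0, gamma=0.9, next_state=1.0,
    target_actor=lambda s: 2.0 * s,
    target_critics=[lambda s, a: s + a, lambda s, a: s + a + 1.0],
    noise_std=0.0,
)
```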
[0158] In a preferred embodiment of the present invention, the method for calculating the sampling weight using the annealing factor is as follows:
[0159] In priority-based experience replay, the sampling process is non-uniform, meaning that the sampling probability is different for each transition. While this characteristic allows us to extract more important or valuable experiences more frequently, it can also bias the predictions of reinforcement learning algorithms.
[0160] To overcome this problem, this patent introduces importance sampling and an annealing factor β to adjust the sampling probability of each transition, thereby balancing the impact of different transitions and reducing prediction bias.
[0161] The importance sampling weight $w_{sample}$ is set as:
[0162]
[0163] Where S is the number of samples in the experience buffer; p(t) is the sampling probability; β∈[0,1] is the annealing factor used to control the impact of priority experience replay on the convergence result; and max(w) is the maximum sampling weight among all sampling weights.
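A minimal sketch of the weight calculation follows, assuming the usual prioritized-replay form $(S \cdot p(t))^{-\beta}$ divided by the largest weight; this form matches the quantities defined above but is an assumption, since the formula itself is not reproduced here. The function name `importance_weights` is hypothetical, and the maximum is taken over the weights passed in.

```python
def importance_weights(probs, buffer_size, beta):
    """Assumed form: w = (1 / (S * p(t))) ** beta, normalized by max(w)."""
    raw = [(1.0 / (buffer_size * p)) ** beta for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]


weights = importance_weights([0.5, 0.25], buffer_size=4, beta=1.0)
uniform = importance_weights([0.5, 0.25], buffer_size=4, beta=0.0)
```

With β = 0 every weight equals 1, so no correction is applied; with β = 1 the non-uniform sampling probabilities are fully compensated.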
[0164] In a preferred embodiment of the present invention, the method for calculating the minimum loss function and updating the initial network is as follows:
[0165] The Q-loss function $L(\theta_{e,c=1,2})$ of the initial networks Critic 1 and Critic 2, which is to be minimized, is defined as:
[0166]
[0167] The importance sampling weight $w_{sample}$ acts as an adjustment factor: it amplifies or reduces the influence of each sample when the loss function is calculated, correcting the prediction bias caused by non-uniform sampling and thus guiding the network toward more accurate predictions.
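The role of the weight as an adjustment factor can be illustrated with the following sketch, which assumes the loss is the mean of the weighted squared differences between the target value and the critic's Q-value. The exact loss formula is not reproduced here, and `weighted_critic_loss` is a hypothetical name.

```python
def weighted_critic_loss(q_values, targets, weights):
    """Mean of w * (y - Q)^2 over the minibatch (assumed form)."""
    terms = [w * (y - q) ** 2 for q, y, w in zip(q_values, targets, weights)]
    return sum(terms) / len(terms)


loss = weighted_critic_loss(q_values=[1.0, 2.0],
                            targets=[2.0, 2.0],
                            weights=[0.5, 1.0])
```

A sample that was drawn with a high probability receives a small weight, so its error contributes less to the loss than it would under uniform averaging.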
[0168] In a preferred embodiment of the present invention, the method for updating the target network parameters through the parameters of each agent is as follows:
[0169] The meta-based DRL algorithm incorporates the concept of Model-Agnostic Meta-Learning (MAML) into DRL. Utilizing a meta-learning algorithm, it optimizes the actor network, critic network, and their corresponding target network parameters for each task through globally shared parameter initialization. Each task obtains its network weights by sampling different experiences from a priority experience replay cache.
[0170] Since the network loss function for each task is differentiable and gradient descent is used to update the network weights based on the sampled experiences, after multiple gradient updates the network parameters for task j are updated as follows:
[0171]
[0172] The symbols in the update formulas have the following meanings: the experience set is obtained by priority sampling from the experience replay buffer Buf of task j, where p denotes Priority, i.e., the priority; $\theta$ carrying the iteration index k and the task index j represents the network weights of the critic network of task j in the k-th iteration, and $\varphi$ carrying the same indices represents the network weights of the actor network in the k-th iteration, where k denotes the current iteration and k-1 the previous iteration; $\beta_c$ represents the individual-update learning rate of the critic network, and $\beta_a$ represents the individual-update learning rate of the actor network; the two loss terms represent the loss function of the critic network and the loss function of the actor network, respectively; and the two gradient terms indicate that the critic parameters and the actor parameters, respectively, are solved by gradient descent;
[0173] Every d steps, the gradient is calculated to update the parameters φ of the actor network. The parameters φ of the actor network are updated through the policy gradient:
[0174]
[0175] Where the averaging term represents the mean over the gradient updates of the network parameters; the gradient term indicates that a gradient update is performed on the network parameters; $\mu_e$ is the action acquisition function, which takes the state $s_e$ as input and yields the action a, i.e., $a = \mu_e(s_e) + \epsilon$; Q is the Q-network; μ is the function for acquiring actions and has the same meaning as μ′; $s_e$ represents the current state of the agent; and $\epsilon$ represents random noise.
[0176] Agent e updates the target network parameters:
[0177] $\theta'_{e,c} \leftarrow \tau\theta_{e,c} + (1-\tau)\theta'_{e,c}$
[0178] $\varphi' \leftarrow \tau\varphi + (1-\tau)\varphi'$
[0179] Where $\theta'_{e,c}$ on the left-hand side is the updated target critic network parameter, $\theta_{e,c}$ is the newly obtained critic network parameter, and $\theta'_{e,c}$ on the right-hand side is the current target critic network parameter; e represents the agent index and c represents the critic network index, with c ∈ {1, 2}; $\varphi'$ is the target actor network parameter and $\varphi$ is the newly obtained actor network parameter; τ is an adjustable weight parameter. During the update, the newly obtained parameters cannot be adopted in full, as this would lead to overfitting. Therefore, the weight parameter τ is used to weight the current target network parameters against the newly obtained parameters to obtain the final new target network parameters.
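The two update rules above can be sketched as a single helper applied to flat lists of parameters. The name `soft_update` and the list representation are hypothetical; an actual implementation would apply the same rule to every tensor of the actor and critic networks.

```python
def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_params, online_params)]


target = soft_update(target_params=[0.0, 1.0],
                     online_params=[1.0, 1.0],
                     tau=0.1)
```

A small τ moves the target network only slightly toward the newly trained parameters at each step, while τ = 1 would copy them outright.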
[0180] A more preferred method for updating global network parameters is as follows:
[0181] By aggregating the adaptability of the trained policy to new sampling experiences for each task and performing a global update, the loss functions of each agent are summed to obtain the optimized global network parameters. The loss function is:
[0182]
[0183] The terms of this loss function represent, respectively: the loss function of the critic network for task j; the loss function of the actor network for task j; the network weights of the actor network for task j; and the network weights of the critic network for task j.
[0184] Based on gradient descent, the optimization function for the global network parameters is:
[0185]
[0186]
[0187]
[0188] Once both individual-level and global-level updates are complete, the algorithm enters the next round to continue updating the global network parameters;
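The interplay of the individual-level and global-level updates can be illustrated with the following toy sketch. It is a first-order approximation of a MAML-style update on a one-dimensional quadratic loss, not the update formulas of this embodiment, which are not reproduced here. The names `adapt_to_task`, `global_update`, `task_lr`, and `global_lr`, as well as the toy loss $0.5(\theta - \text{target})^2$, are assumptions made for illustration.

```python
def adapt_to_task(theta, target, task_lr, steps):
    """Individual-level update: gradient steps on 0.5 * (theta - target)^2."""
    for _ in range(steps):
        theta = theta - task_lr * (theta - target)
    return theta


def global_update(theta, task_targets, global_lr, task_lr, inner_steps=1):
    """Global-level update: sum each task's loss gradient after adaptation.

    First-order approximation: the gradient is evaluated at the adapted
    parameters and second-order terms are ignored.
    """
    grad_sum = 0.0
    for target in task_targets:
        adapted = adapt_to_task(theta, target, task_lr, inner_steps)
        grad_sum += adapted - target
    return theta - global_lr * grad_sum


theta = 5.0
task_targets = [1.0, 2.0, 3.0]  # hypothetical per-task optima
for _ in range(200):
    theta = global_update(theta, task_targets, global_lr=0.1, task_lr=0.1)
```

In this toy setting the shared initialization settles at the mean of the per-task optima, from which each task can be reached in a few individual-level steps.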
[0189] In the meta-training phase, the network parameters of a new task are initialized at the start of the time step to the trained global parameters, namely θ and its actor-network counterpart, and are then updated from that initialization.
[0190] The present invention also provides a distributed multi-agent task offloading and migration system based on priority experience replay and meta-learning, including M base stations set along a one-way lane, each base station being equipped with K antennas and each base station being connected to a server, and N mobile vehicle agents along the road, each vehicle carrying a single antenna with limited computing resources, where K, M, and N are all positive integers and K>N.
[0191] The vehicle and server perform task offloading and migration according to the method described in this invention.
[0192] Using this system, based on the priority experience replay distributed algorithm, the convergence speed of the network algorithm is optimized, the dependence of the model on high-priority experience is adjusted, and the task offloading and migration of the agent is realized.
[0193] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0194] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims
1. A distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning, characterized in that it includes the following steps:
Initialize the in-vehicle computing network, the priority experience replay buffer pool, the random noise, and the vehicle edge computing system environment, and receive the initial state of the vehicle edge computing system environment, the vehicle state vector for each time slot being s;
Define each base station equipped with a server as an intelligent agent; each agent selects actions based on task offloading and task migration decisions, and an action vector is generated for each vehicle in each time slot;
Each agent performs its action, calculates the reward r, and obtains new state information;
Calculate the sampling priority;
Form an experience tuple from the state vector s, the action vector, the reward r, and the new state s′, and store it in the priority experience replay buffer pool;
When the number of samples in the priority experience replay buffer pool reaches the preset value, each agent samples from the pool according to the sampling probability and places the samples in the sampling buffer pool;
Calculate the sampling weights;
Calculate the minimum loss function based on the sampling weights and update the initial network;
Update the parameters of each agent according to the gradient formula until the preset number of training rounds is reached, and then update the parameters of the target network and the global network using the parameters of each agent;
The method by which each agent selects actions based on task offloading and task migration decisions is as follows:
In each time slot, the agent at each base station observes the vehicle environment within the coverage area of the base station and collects environmental parameters as the observation state, which gives the state vector for that time slot;
The agent makes decisions based on the policy and generates an action vector in each time slot, each action including the vehicle offloading decision and the task migration decision on the vehicle edge computing server;
State space: the state of the system at time t is represented as the set of the states of all agents, the number of agents being the number of vehicle edge computing servers in the system. The state of an agent at time t comprises the task information that has arrived at each vehicle at time t, the current channel condition between the vehicle and the base station, and the computing load of the servers within the current base station's service range at time t. The task information comprises the amount of task data of the vehicle, the total number of CPU cycles required to compute the task, and the tolerable delay of the task. N is the total number of vehicles in the vehicle-to-everything (V2X) network at time t and is a positive integer;
Action space: the action of an agent at time t is the offloading and migration decision it makes while continuously observing the local environment, so the action space of the system at time t is represented as the set of the actions of all agents. Each action comprises the decision at time t on the proportion of a vehicle's task to be offloaded to the vehicle edge computing server within the current service area, and the migration decision on whether the task on the current vehicle edge computing server is migrated between servers.
2. The distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning as described in claim 1, characterized in that the method for calculating the reward r for each agent's action is as follows:
In a distributed strategy, each agent's goal is to maximize its own utility, and the system's reward function is defined as the agent's utility at time t. The system optimization problem is modeled as minimizing the overall system task processing latency, minimizing migration costs, and balancing the load on the vehicle edge computing servers. The agent's reward function is defined as a combination of weighted terms, where the weight parameters are all user-defined and take values in [0,1], and the weighted terms are the number of training iterations for which the experience is extracted, the migration cost, and the load balancing factor;
The migration cost is expressed in terms of the total amount of data of task j, the proportion of task j to be migrated, and the amount of data to be migrated, together with a migration indicator: one value of the indicator means that the computation task is migrated from the m-th MEC to the m′-th MEC, and the other means that the computation task is not migrated;
The computing load of the m-th MEC server is expressed as the sum of the computation tasks on the m-th MEC server, where M represents the number of MEC servers and J represents the number of tasks. If task j is processed by server m, then…;
The average computational load of all MECs is then calculated. To determine whether the computational load is fairly distributed among the MECs in the system, load balancing is measured by the deviation of the computational load, which defines the Load Balancing Factor (LBF).
3. The distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning as described in claim 1, characterized in that the sampling priority is calculated as follows:
Calculate the target network loss (Loss), where $y_e$ is the target value for training the neural network on server e; $Q_{e,c}$ is the Q-value calculated by the critic network, where e is the server index and c is the critic network index; $\theta_{e,c}$ are the parameters of the critic network; μ is the function for obtaining actions; s′ is the new state obtained after action a is executed in state s; and $a = \{a_1, \ldots, a_e, \ldots, a_E\}$ represents the actions to be performed;
Considering both Loss and the number of times T the experience has been extracted for training, design an importance factor $\delta_i$ to measure how often an experience is extracted, where i is the experience index, $T_i$ indicates the number of times the experience has been used, and Dis represents the step size in the episode, i.e., the step size of the neural network in each exploration round;
Assign a selection probability p(i) to each experience in the experience buffer based on the importance factor and the target network loss value Loss(i), and determine the priority, where $p(i) = \alpha \cdot Loss(i) + \delta_i$; σ is the amplification factor applied to the priority, and the larger its value, the more strongly experience extraction depends on the magnitude of p(i); the probability offset, which lies in (0,1), prevents starvation, in which an experience is almost never drawn because p(i) is too small; α and ε are weight parameters; n∈N represents a vehicle, where N is the total number of vehicles in the vehicle-to-everything (V2X) network at time t and is a positive integer.
4. The distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning as described in claim 3, characterized in that the target Q-value $y_e$ of agent e is calculated, where $r_e$ is the reward function for agent e; γ is the discount rate; Q is the Q-value calculated by the critic network, where e is the agent's index and c is the critic network's index (there are two critic networks, so c is either 1 or 2); μ′ is the function for obtaining the action to be executed; and $s_e$ is the current state of the agent.
5. The distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning as described in claim 3, characterized in that the method for calculating sampling weights using the annealing factor is as follows: the importance sampling weight $w_{sample}$ is set, where S is the number of samples in the experience buffer; p(t) is the sampling probability; β∈[0,1] is the annealing factor used to control the impact of priority experience replay on the convergence result; and max(w) is the largest sampling weight among all sampling weights.
6. The distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning as described in claim 5, characterized in that the method for calculating the minimum loss function and updating the initial network is as follows: the Q-loss function $L(\theta_{e,c=1,2})$ of the initial networks Critic 1 and Critic 2 is minimized; the importance sampling weight $w_{sample}$ acts as an adjustment factor that amplifies or reduces the influence of each sample when the loss function is calculated, correcting the prediction bias caused by non-uniform sampling and thus guiding the network toward more accurate predictions.
7. The distributed multi-agent task offloading and migration method based on priority experience replay and meta-learning as described in claim 6, characterized in that the method for updating the target network parameters using the parameters of each agent is as follows:
Using a meta-learning algorithm, the parameters of the actor network, the critic network, and their corresponding target networks constructed for each task are optimized through globally shared parameter initialization; each task obtains its own network weights by sampling different experiences from the priority experience replay cache pool;
Since the network loss function for each task is differentiable and gradient descent is used to update the network weights based on the sampled experiences, after multiple gradient updates the network parameters for task j are updated, where the experience set is obtained by priority sampling from the experience replay buffer Buf of task j, p denoting Priority, i.e., the priority; $\theta$ carrying the iteration index k and the task index j represents the network weights of the critic network in the k-th iteration, and $\varphi$ carrying the same indices represents the network weights of the actor network in the k-th iteration, k denoting the current iteration and k-1 the previous iteration; $\beta_c$ represents the individual-update learning rate of the critic network and $\beta_a$ that of the actor network; the two loss terms represent the loss function of the critic network and the loss function of the actor network, respectively; and the two gradient terms indicate that the critic parameters and the actor parameters, respectively, are solved by gradient descent;
Every d steps, the gradient is calculated and the parameters $\varphi$ of the actor network are updated via the policy gradient, where the gradient term indicates that a gradient update is performed on the network parameters; $\mu_e$ is the action acquisition function, which takes the state $s_e$ as input and yields the action a, i.e., $a = \mu_e(s_e) + \epsilon$; Q is the Q-network; μ is the function for acquiring actions and has the same meaning as μ′; $s_e$ is the current state of the agent; and $\epsilon$ is random noise;
The agent updates the target network parameters: $\theta'_{e,c} \leftarrow \tau\theta_{e,c} + (1-\tau)\theta'_{e,c}$ and $\varphi' \leftarrow \tau\varphi + (1-\tau)\varphi'$, where $\theta'_{e,c}$ on the left-hand side is the updated target critic network parameter, $\theta_{e,c}$ is the newly obtained critic network parameter, and $\theta'_{e,c}$ on the right-hand side is the current target critic network parameter, e representing the agent index and c the critic network index; $\varphi'$ is the target actor network parameter and $\varphi$ is the newly obtained actor network parameter; and τ is an adjustable weighting parameter.
8. A distributed multi-agent task offloading and migration system based on priority experience replay and meta-learning, characterized in that it includes M base stations set up along a one-way lane, each base station being equipped with K antennas and connected to a server, and N mobile vehicle agents along the road, each vehicle carrying a single antenna and having limited computing resources, where K, M, and N are all positive integers and K>N; the vehicles and servers perform task offloading and migration according to the method described in any one of claims 1-7.