A vehicle path optimization method and system based on deep reinforcement learning

By combining deep reinforcement learning and heuristic methods, an attention model with a dynamic encoder-decoder architecture was designed, which solved the problem of difficult graph structure differentiation in PDP and achieved efficient path optimization.

CN117371637B · Active · Publication Date: 2026-06-26 · CHONGQING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING UNIV
Filing Date
2023-11-20
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing vehicle routing optimization methods struggle to effectively address pick-up and delivery problems (PDP) with pairing and priority relationships, especially in large-scale scenarios where they cannot distinguish between different graph structures. Existing models also lack a good global context graph structure at the decoder end.

Method used

Combining deep reinforcement learning and heuristic methods, we design an attention model with a dynamic encoder-decoder architecture. By training a neural network to guide a local search algorithm, and using a proximal policy optimization algorithm to train the model, we can dynamically explore the structural features of problem instances and iteratively improve the solution.

Benefits of technology

It effectively solves the PDP problem by finding high-quality path sequences in large-scale scenarios through an iterative search process, dynamically capturing new instance information, distinguishing different graph structures, and improving the efficiency and quality of path optimization.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to a vehicle path optimization method and system based on deep reinforcement learning, which comprises the following steps: S1, acquiring node feature embedding based on a problem instance; S2, acquiring position feature embedding based on the problem instance; S3, using a multilayer perceptron to fuse the node and position feature embedding information and completing the encoding; S4, using a removal decoder to realize the removal operation of a PDP node pair; S5, using a repair decoder to realize the reinsertion operation of the removed node pair; S6, performing corresponding actions according to the output of the removal and repair decoders to realize state transition and iteratively improving the solution; S7, using a dynamic mode to update the graph embedding in real time according to the output of the decoder; and S8, training the model by using a proximal policy optimization algorithm and finally outputting a path optimization sequence. The application combines deep reinforcement learning and a heuristic method, trains a neural network to guide a local search algorithm, learns an optimization strategy in a step-by-step manner and iteratively improves the solution.

Description

Technical Field

[0001] This invention belongs to the field of vehicle routing optimization technology, and relates to a vehicle routing optimization method and system based on deep reinforcement learning.

Background Technology

[0002] The Vehicle Routing Problem (VRP) is one of the most widely studied problems in the transportation and operations field, aiming to find a set of routes that minimize total cost. A typical VRP has a simple structure and can be viewed as a basic discrete combinatorial optimization problem, which has been repeatedly studied for many years. Many optimization problems can be transformed into VRP forms; therefore, VRP can be used to demonstrate the optimization performance of different algorithms, aiding in the design of efficient algorithms. Furthermore, VRP has significant practical applications in the real world, and VRP variants have emerged in many fields, particularly in the logistics industry, which is booming with globalization. Effective route optimization can save substantial costs, which is crucial for improving delivery efficiency and increasing corporate economic benefits.

[0003] In reality, customers may have their own delivery points, such as for city-wide delivery services and ride-sharing, unlike VRP where all customers share a single warehouse. Route planning for all such applications can be naturally described as a pickup and delivery problem (PDP), a representative variant of VRP. Typically, PDPs are characterized by pairing and prioritization relationships, where the pickup node must precede the delivery node. PDPs are primarily used to optimize the routes of pickup and delivery requests and are ubiquitous in logistics, robotics, food delivery services, and other fields. While current mainstream approaches primarily address typical VRP problems, research specifically targeting PDPs has received increasing attention from researchers in recent years.

[0004] Currently, optimization algorithms for solving VRP and its variant PDP can be broadly classified into three categories: exact algorithms, heuristic algorithms, and deep reinforcement learning algorithms.

[0005] 1) Exact Algorithms. Exact algorithms are those that can find the optimal solution to a problem. In early research, branch and bound and dynamic programming were commonly used exact algorithms for solving the Traveling Salesman Problem (TSP). Later, various exact solvers were proposed, such as the state-of-the-art combinatorial optimization problem exact solver CPLEX, the general integer programming solver Gurobi, and the highly specialized solver Concorde. Concorde is widely considered to be the fastest existing exact solver for the Traveling Salesman Problem. Exact algorithms theoretically guarantee optimality, but because of their high computational complexity and long solution times, they cannot solve large-scale instances.

[0006] 2) Heuristic Algorithms. Heuristic algorithms trade optimality for computational efficiency. They are typically problem-specific and designed by iteratively applying a simple, handcrafted rule (called a heuristic). Google OR-Tools is a highly optimized program for solving VRPs that applies various heuristics, such as simulated annealing, greedy descent, and tabu search, to navigate the search space and optimize the solution using local search techniques. The 2-Opt algorithm is a move-based heuristic that replaces two edges to reduce the path length. The state-of-the-art heuristic solver LKH3 can find an optimal solution in time comparable to the Gurobi solver. Heuristic algorithms can find near-optimal solutions with polynomial computational complexity; however, designing a good heuristic algorithm is not easy, as it requires extensive problem-specific expert knowledge and handcrafted features.

[0007] 3) Deep Reinforcement Learning Algorithms. Deep reinforcement learning algorithms possess strong learning and generalization capabilities, avoid the cumbersome parameter tuning of exact and heuristic algorithms, and reduce the dependence on problem-specific and domain-specific knowledge. Pointer networks were first used to solve VRPs through supervised learning. Since high-quality labels for VRPs are difficult to obtain, reinforcement learning algorithms, which learn policies directly from problem instances, do not require labeled solutions, thus expanding the range of VRPs that can be solved. Attention-based models use a Transformer architecture to encode nodes and a pointer-like attention mechanism for decoding. Some research combines deep reinforcement learning with heuristics to train neural networks to guide local search algorithms, learning optimization strategies incrementally and iteratively improving solutions. For example, the NeuRewriter model learns a policy to select heuristics and rewrite local components of the current solution. Recently, a novel neural network integrating a heterogeneous attention mechanism has also been proposed. This heterogeneous attention mechanism specifically defines the attention of each role of a node while considering the priority constraints of the PDP. There are also models based on an improved graph attention network, mainly used to learn large neighborhood search heuristics for solving VRPs.

[0008] The pickup and delivery problem (PDP) is an NP-hard combinatorial optimization problem with priority constraints. Real-world path optimization scenarios are often extremely complex, with decision variables and constraints potentially reaching millions, leading to an exponential increase in the size of the feasible solution set. As the problem size increases, exact algorithms require excessive time to obtain the optimal solution, while heuristic algorithms cannot guarantee optimality and their computational cost is also high. Current neural methods have achieved some success on typical vehicle routing problems (VRPs), but most deep reinforcement learning-based solutions can only handle typical VRPs with a shared warehouse, and perform poorly in handling the pairing and priority relationships in PDPs. Furthermore, in large-scale real-world scenarios, the graph embeddings of neural methods have highly similar amplitudes and are spatially very close. Current models cannot distinguish different graph structures in such scenarios and lack a good global context graph structure at the decoder end. Therefore, current methods cannot be directly applied to solve PDPs, and research on PDPs remains a challenging optimization task, requiring the development of more suitable models based on the characteristics of PDPs.

Summary of the Invention

[0009] In view of this, the purpose of this invention is to provide a vehicle path optimization method and system based on deep reinforcement learning. This method and system combine deep reinforcement learning and heuristic methods to solve pickup and delivery problems (PDPs). A neural network is trained to guide a local search algorithm, learning optimization strategies in a progressive manner and iteratively improving the solution. The core lies in designing an attention model with a dynamic encoder-decoder architecture to dynamically explore the structural features of problem instances, using an efficient proximal policy optimization algorithm to train the model, and finally obtaining a high-quality path sequence.

[0010] To achieve the above objectives, the present invention provides the following technical solution:

[0011] A vehicle path optimization method based on deep reinforcement learning, comprising the following steps:

[0012] S1. Define the vehicle routing optimization problem: the PDP under study is defined on graph G = (V, E), where the node set V = P ∪ D ∪ {x0} represents positions and the edge set E represents the journeys between positions; obtain node feature embeddings based on problem instances;

[0013] S2. Based on the problem instance, obtain the location feature embedding;

[0014] S3. Use a multi-layer perceptron mechanism to fuse the embedded information of nodes and location features to complete the encoding;

[0015] S4. Use the removal decoder to implement the removal operation of PDP node pairs;

[0016] S5. Use the repair decoder to implement the re-insertion operation for removed node pairs;

[0017] S6. Perform corresponding actions based on the output of the removal and repair decoders to achieve state transition and iteratively improve the solution;

[0018] S7. Update the graph embedding in real time using a dynamic method based on the decoder output;

[0019] S8. Train the model using the proximal policy optimization (PPO) algorithm, output the path optimization sequence, and obtain the optimized solution for vehicle path optimization.

[0020] In this invention, the PDP under study is defined on graph G = (V, E), where the node set V = P ∪ D ∪ {x0} represents locations, and the edge set E represents the journeys between locations. For n one-to-one pickup and delivery requests, a PDP instance contains |V| = 2n + 1 distinct locations, where node x0 represents the warehouse, set P = {1+, 2+, ..., n+} contains the pickup nodes, and set D = {1-, 2-, ..., n-} contains the delivery nodes. Each pickup node i+ has goods that need to be transported to its delivery node i-. In this scenario, it is assumed that all goods at a pickup node are delivered entirely to the paired delivery node by a vehicle with infinite capacity. Under this setup, the PDP describes a process in which, starting from the warehouse, the vehicle visits every pickup and delivery node exactly once to perform a service and finally returns to the warehouse, with the goal of finding the shortest route that satisfies all requests. This invention considers a representative variant of the PDP, the PDTSP. The solution δ is defined as a cyclic sequence (x_0, ..., x_{2n+1}), where x_0 and x_{2n+1} are both the warehouse node, and the remaining entries are a permutation of the nodes in P ∪ D. This permutation is subject to a priority constraint, which requires that each pickup node i+ be visited before its delivery node i-.
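The definition above can be made concrete with a short feasibility check. The following is a minimal sketch, not part of the patent text: the integer node ids (0 for the warehouse x0, i for pickup i+, i + n for delivery i-) and the function names `tour_length` and `is_feasible` are illustrative assumptions.

```python
import math

def tour_length(coords, tour):
    """Total Euclidean length of a tour given as a list of node ids."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:]))

def is_feasible(tour, n):
    """Check the PDTSP constraints for n requests.

    Assumed id scheme: 0 is the warehouse x0, i (1..n) is pickup i+,
    and i + n is the paired delivery i-.
    """
    if tour[0] != 0 or tour[-1] != 0:
        return False                      # must start and end at the warehouse
    inner = tour[1:-1]
    if sorted(inner) != list(range(1, 2 * n + 1)):
        return False                      # every customer node exactly once
    pos = {node: idx for idx, node in enumerate(inner)}
    # priority constraint: pickup i+ before its delivery i-
    return all(pos[i] < pos[i + n] for i in range(1, n + 1))

coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0), 4: (0.5, 0.5)}
good = [0, 1, 2, 3, 4, 0]   # n = 2: pickups 1, 2 and deliveries 3, 4
bad = [0, 3, 1, 2, 4, 0]    # delivery 3 (1-) is visited before pickup 1 (1+)
```

Here `is_feasible(good, 2)` holds while `is_feasible(bad, 2)` does not, because the second tour violates the priority constraint for request 1.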

[0021] In this method, the route construction process can essentially be viewed as a decision sequence for solving the PDP. Therefore, the PDP can be naturally formulated into a reinforcement learning form and solved accordingly. This invention models this route construction process as a Markov decision process (MDP), defining its state space S, action space A, state transition T, and reward function R, as follows:

[0022] State S: At time t, the state is defined by: 1) the features of the current solution δ_t; 2) the action history; 3) the objective value of the best solution found so far. That is:

[0023]

[0024] Here, δ_t is described from the following two aspects: l(x) contains the two-dimensional coordinates of node x, i.e., the node feature, and p_t(x) is the index position of x in δ_t, i.e., the node position feature; H(t, K) stores the K most recent actions at time t (if any); f(·) is the objective function.

[0025] Action A: For action a_t = {(i+, i-), (j, k)}, where j, k ∈ V \ {i+, i-}, the agent removes the node pair (i+, i-) and then re-inserts nodes i+ and i- after nodes j and k, respectively.

[0026] State transition T: A deterministic transition rule is used to perform action a_t.

[0027] Reward R: The reward is the immediate reduction in cost relative to the objective value of the best solution found so far, that is, the amount by which the minimum cost decreases at the next time step. The objective of the agent is to maximize the expected total cost reduction relative to δ0, with a discount factor γ < 1.
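As an illustration of the transition and reward, the sketch below applies one remove-and-reinsert action to a tour and computes a reward of the form f(best) − min(f(next), f(best)). The exact reward formula is not legible in the source, so this form is an assumption consistent with the description; `apply_action` and `reward` are hypothetical names, and node ids are plain integers with 0 as the warehouse.

```python
def apply_action(tour, pair, j, k):
    """Remove the pair (i+, i-) and re-insert i+ after node j and i- after node k.

    `tour` starts and ends at the warehouse (node 0). Feasibility of (j, k)
    is the caller's responsibility: j must not come after k in the tour.
    """
    i_plus, i_minus = pair
    reduced = [x for x in tour if x not in (i_plus, i_minus)]
    # Insert i- first, then i+, so that j == k yields (..., j, i+, i-, ...).
    reduced.insert(reduced.index(k) + 1, i_minus)
    reduced.insert(reduced.index(j) + 1, i_plus)
    return reduced

def reward(best_cost, next_cost):
    """Immediate cost reduction relative to the best cost so far (assumed form)."""
    return best_cost - min(next_cost, best_cost)

tour = [0, 1, 3, 2, 4, 0]                 # pairs (1, 3) and (2, 4)
moved = apply_action(tour, (1, 3), 2, 4)  # 1 goes after 2, 3 goes after 4
same_slot = apply_action(tour, (2, 4), 0, 0)
```

With this form the reward is never negative: a move that does not beat the best known cost simply earns zero.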

[0028] Furthermore, in step S1, the specific implementation steps for obtaining the node feature embedding are as follows:

[0029] S11. Define h_i as the linear projection, with output dimension d_h = 128, of the node feature l(x_i) of any x_i ∈ V;

[0030] S12. Learn the linear projection through parameters W_x and b_x to calculate the initial d_h-dimensional node embeddings;

[0031] S13. The encoder used is based on the Transformer architecture. Each attention layer consists of two sub-layers: a multi-head attention (MHA) layer that performs message passing between nodes and a fully connected feedforward (FF) layer. Each sub-layer adds a residual connection and batch normalization (BN) processing. Based on this, the core formula for obtaining node feature embeddings is as follows:

[0032]

[0033]
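Steps S11 and S12 amount to an affine map from two-dimensional coordinates to a 128-dimensional vector. The following minimal sketch shows only that initial projection, with randomly initialized (untrained) parameters; the Transformer layers of S13 are omitted, and the helper names `init_linear` and `project` are assumptions made for illustration.

```python
import random

D_H = 128  # embedding dimension d_h stated in the text

def init_linear(in_dim, out_dim, rng):
    """Randomly initialised weights W_x (out_dim x in_dim) and bias b_x."""
    bound = 1.0 / in_dim ** 0.5
    W = [[rng.uniform(-bound, bound) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [rng.uniform(-bound, bound) for _ in range(out_dim)]
    return W, b

def project(W, b, feature):
    """h_i = W_x l(x_i) + b_x for one node feature l(x_i)."""
    return [sum(w * f for w, f in zip(row, feature)) + bias
            for row, bias in zip(W, b)]

rng = random.Random(0)
W_x, b_x = init_linear(2, D_H, rng)
coords = [(0.5, 0.5), (0.1, 0.9), (0.8, 0.3)]    # l(x_i): 2-D coordinates
h = [project(W_x, b_x, xy) for xy in coords]     # one 128-d vector per node
```

In a trained model W_x and b_x would be learned jointly with the rest of the network rather than sampled at random.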

[0034] Furthermore, in step S2, in order to better utilize the cyclic nature of the PDP problem, this invention employs cyclic position encoding to encode positional features. This enables the Transformer to encode cyclic sequences more accurately. The specific implementation steps for obtaining the positional feature embedding are as follows:

[0035] S21. Using cyclic positional encoding, the positional feature embedding with output dimension d_g = 128 is initialized as follows:

[0036]

[0037] Here, the superscript d in g_i^(d) denotes the d-th dimension of g_i;

[0038] S22, The scalar z(i) is defined as follows:

[0039]

[0040] S23. The angular frequency is defined as follows:

[0041]
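The formulas for S21 to S23 are not legible in the source, so the sketch below does not reproduce them. It only illustrates the general idea of a cyclic encoding under a simplifying assumption: each pair of dimensions uses a sine and a cosine whose period divides the tour length, so the encoding wraps around and the last position is as close to the first as any two neighbors are. The function name and the choice of frequencies are hypothetical.

```python
import math

def cyclic_positional_encoding(num_positions, dim=128):
    """Simplified cyclic encoding: position i and position i + N coincide."""
    table = []
    for i in range(num_positions):
        row = []
        for d in range(dim):
            cycles = d // 2 + 1   # whole number of cycles per loop (assumption)
            angle = 2.0 * math.pi * cycles * i / num_positions
            row.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

g = cyclic_positional_encoding(7)   # g[i] plays the role of g_i
```

A standard (non-cyclic) sinusoidal encoding would place the first and last positions far apart, which does not match the cyclic sequence structure of a PDP solution.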

[0042] Furthermore, in step S3, fusing the node and location feature embeddings by directly adding them as h_i + g_i might introduce unwanted noise into the Transformer. Therefore, this invention uses a multilayer perceptron (MLP), taking the node feature embeddings as the primary embedding set and the location feature embeddings as the auxiliary embedding set, thus synthesizing the heterogeneous primary and auxiliary attention relationships into a comprehensive attention relationship. The implementation of fusing node and location feature embeddings using an MLP is as follows:

[0043] S31. Node feature embedding is used to generate the multi-head raw self-attention score, as defined below:

[0044]

[0045] where the two projection matrices are trainable matrices for each head m, with m = 4.

[0046] S32. Location feature embedding is used to generate multi-head auxiliary attention scores, as defined below:

[0047]

[0048] where the projection matrices are trainable matrices for each head m;

[0049] S33. The relevant definitions of comprehensive attention are as follows:

[0050]

[0051] S34. The attention scores α^self and α^aux are fed into a three-layer MLP with structure (2m × 2m × m) to calculate the following comprehensive multi-head attention score:

[0052]

[0053] S35. Normalize the score using softmax. The attention value for each head is further calculated using the following formula:

[0054]

[0055] S36. Through a trainable matrix, the final node embedding output is obtained as shown below:

[0056]
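Steps S34 and S35 can be sketched as follows. For every node pair (i, j), the m self-attention scores and m auxiliary scores are concatenated into a 2m vector, passed through an MLP with layer sizes 2m, 2m, m, and the result is normalized with softmax over j for each head. This is a minimal sketch with random, untrained weights; the ReLU activation, the placeholder scores, and all function names are assumptions, since the source formulas are not legible.

```python
import math
import random

M = 4  # number of heads, as stated in the text

def linear(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(v - top) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scores(alpha_self, alpha_aux, params):
    """alpha_*[head][i][j] -> fused, row-normalised weights [head][i][j]."""
    (W1, b1), (W2, b2) = params
    n = len(alpha_self[0])
    fused = [[[0.0] * n for _ in range(n)] for _ in range(M)]
    for i in range(n):
        for j in range(n):
            x = ([alpha_self[m][i][j] for m in range(M)]
                 + [alpha_aux[m][i][j] for m in range(M)])
            hidden = [max(0.0, v) for v in linear(W1, b1, x)]  # 2m -> 2m, ReLU
            out = linear(W2, b2, hidden)                       # 2m -> m
            for m in range(M):
                fused[m][i][j] = out[m]
    return [[softmax(row) for row in head] for head in fused]

rng = random.Random(0)

def rand_layer(n_in, n_out):
    W = [[rng.gauss(0, 0.3) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

params = (rand_layer(2 * M, 2 * M), rand_layer(2 * M, M))
n = 5
alpha_self = [[[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)] for _ in range(M)]
alpha_aux = [[[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)] for _ in range(M)]
weights = fuse_scores(alpha_self, alpha_aux, params)
```

Fusing at the level of attention scores, rather than adding the two embeddings, keeps the node and position information separate until the MLP decides how to weight them.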

[0057] Furthermore, in step S4, the decoder consists of two custom decoders, a removal decoder and a repair decoder, which automatically perform the removal and re-insertion operations on pickup and delivery node pairs, respectively, and allow a pair of pickup and delivery nodes to be operated on simultaneously in the neighborhood search. This feature better handles the priority relationship of the PDP. The implementation of the removal decoder is as follows:

[0058] S41. For each x_i ∈ V, calculate a score λ_i representing the closeness of node x_i to its neighboring nodes:

[0059]

[0060] where pre(x_i) and suc(x_i) refer to the predecessor and successor nodes of x_i, respectively;

[0061] S42. Obtain λ_{i,1} to λ_{i,m} using a multi-head technique similar to that used at the encoder end;

[0062] S43. The decoder uses a three-layer MLP, MLP_λ, to aggregate the scores of each pickup and delivery node pair (i+, i-):

[0063]

[0064] The MLP structure is (2m+4, 32, 32, 1); the scalar c(i) counts how often the request pair (i+, i-) was selected for removal in the past K steps; and the indicator 1_{last(t)=i′} is a binary variable indicating whether request i′ was selected in the t-th most recent step;

[0065] S44. After an activation layer with C = 6, the distribution is normalized using Softmax and then used to sample a node pair (i+, i-) as the removal action.
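Step S44 can be illustrated with the sketch below, which turns raw pair scores into a sampling distribution. The source only states that an activation layer with C = 6 precedes the Softmax; bounding the scores with C · tanh(·) is an assumption made here, and the function names and input scores are hypothetical.

```python
import math
import random

def removal_distribution(pair_scores, C=6.0):
    """Bound raw pair scores with C * tanh (assumed), then apply softmax."""
    clipped = [C * math.tanh(s) for s in pair_scores]
    top = max(clipped)
    exps = [math.exp(v - top) for v in clipped]
    total = sum(exps)
    return [e / total for e in exps]

def sample_pair(probs, rng):
    """Sample the index of the request (i+, i-) to remove."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = removal_distribution([0.2, -1.0, 3.0])   # placeholder scores for 3 requests
chosen = sample_pair(probs, random.Random(0))
```

Bounding the scores before the Softmax keeps every request selectable with non-negligible probability, which helps exploration during training.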

[0066] Furthermore, in step S5, given a removed request (i+, i-), the repair decoder outputs a joint distribution for re-inserting the two nodes into the solution. The implementation of the repair decoder is as follows:

[0067] S51. For a node x_α, define two scores μ^p(x_α, x_β) and μ^s(x_α, x_β), representing the preference of x_α for accepting node x_β as its new predecessor and successor node, respectively:

[0068]

[0069]

[0070] where, like the encoder, the repair decoder also uses multiple heads;

[0071] S52. Based on the predecessor and successor scores, the decoder uses MLP_μ to predict the distribution of re-inserting node i+ after node j and node i- after node k:

[0072]

[0073]

[0074]

[0075] The MLP structure is (4m, 32, 32, 1); note that pre(·) and suc(·) are taken in the new solution, from which nodes i+ and i- have been removed.

[0076] S53. Before normalizing with Softmax, mask infeasible actions as −∞;

[0077] S54. Based on the obtained distribution, sample a node pair (j, k) as the insertion action, which re-inserts the node pair (i+, i-) into the solution at the corresponding positions.
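The masking in S53 can be sketched as follows. An insertion choice (j, k) is treated as feasible only if j does not come after k in the reduced solution, so that i+ still precedes i-. This feasibility rule, the placeholder scores, and the function name are assumptions made for illustration; the learned scores of S51 and S52 are not modeled.

```python
import math

NEG_INF = float("-inf")

def reinsertion_distribution(reduced_tour, scores):
    """Joint distribution over (j, k): i+ goes after node j, i- after node k.

    `reduced_tour` is the solution with the pair removed, starting and ending
    at the warehouse 0. `scores[(j, k)]` are raw decoder scores.
    """
    order = reduced_tour[:-1]             # candidate predecessors, in visiting order
    pos = {node: idx for idx, node in enumerate(order)}
    masked = {
        (j, k): (scores[(j, k)] if pos[j] <= pos[k] else NEG_INF)
        for j in order for k in order
    }
    top = max(masked.values())
    exps = {jk: math.exp(v - top) for jk, v in masked.items()}  # exp(-inf) == 0.0
    total = sum(exps.values())
    return {jk: e / total for jk, e in exps.items()}

nodes = (0, 2, 4)
flat_scores = {(j, k): 0.0 for j in nodes for k in nodes}   # placeholder scores
insert_probs = reinsertion_distribution([0, 2, 4, 0], flat_scores)
```

With equal scores, the six feasible choices each receive probability 1/6 and the three infeasible ones receive exactly zero, so sampling can never violate the priority constraint.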

[0078] Furthermore, in step S7, since the state of an instance changes according to the decisions made by the model in different decoding steps, the node features should be updated accordingly. Therefore, this invention proposes to use dynamic encoding to update the graph embedding in real time to better represent the node feature embedding. The specific implementation steps are as follows:

[0079] S71. When the agent reaches a delivery node, return to step S3, use the encoder to update the embedding of the remaining nodes, and use the dynamic attention mechanism to focus attention on nodes that have not yet been visited in order to capture new instance information; otherwise, continue to implement the decoding action.

[0080] S72. Aggregate the updated graph embeddings into each individual embedding to obtain the final node feature embeddings.
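Step S72 does not specify how the graph embedding is aggregated into the node embeddings. One common choice, assumed here purely for illustration, is to mean-pool the embeddings of the unvisited nodes into a graph-level vector and add it to every node embedding. The function name and the `visited` flags are hypothetical.

```python
def aggregate_graph_embedding(node_embeddings, visited):
    """Add the mean embedding of the unvisited nodes to every node embedding.

    Mean pooling and element-wise addition are assumptions; the source only
    states that the graph embedding is aggregated into each node embedding.
    """
    dim = len(node_embeddings[0])
    remaining = [h for h, seen in zip(node_embeddings, visited) if not seen]
    if not remaining:                     # nothing left to attend to
        remaining = node_embeddings
    graph = [sum(h[d] for h in remaining) / len(remaining) for d in range(dim)]
    return [[h[d] + graph[d] for d in range(dim)] for h in node_embeddings]

embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
updated = aggregate_graph_embedding(embeddings, [True, False, False])
```

Restricting the pooled vector to unvisited nodes means the global context changes as the solution evolves, which is the behavior the dynamic update is meant to provide.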

[0081] Furthermore, in step S8, in order to obtain a higher quality solution, the Proximal Policy Optimization (PPO) algorithm is used to train the model. The specific implementation of training the model using the PPO algorithm is as follows:

[0082] Input:

[0083] f) Initial policy network parameters θ

[0084] g) Initial value function parameters φ

[0085] h) Clipping threshold ε

[0086] i) Initial learning rates η_θ, η_φ

[0087] j) Learning rate decay parameter β

[0088] The main implementation steps of the PPO algorithm are as follows:

[0089] Step 1: For each batch, randomly generate a batch of training instances D_b on the fly;

[0090] Step 2: Initialize a set of random solutions {δ_i} corresponding to the instances D_b;

[0091] Step 3: Run the policy π_θ for n time steps to collect experience, where a_{t′} ~ π_θ(a_{t′} | s_{t′});

[0092] Step 4: Update time step t to t+n, and save the current policy network π_θ and value function v_φ as the old policy π_old and the old value function v_old, respectively;

[0093] Step 5: Use the value function v_φ to calculate the estimated return at time step t+1;

[0094] Step 6: For t′ ∈ {t, t−1, ..., t−n}, calculate the estimated returns and the estimated advantages;

[0095] Step 7: PPO performs K update epochs on D_b, with its objective clipped by the threshold ε to penalize large policy changes whose probability ratio is far from 1, as shown in the following formula:

[0096]

[0097] Step 8: The value estimate is clipped around the previous value estimate, as shown in the following formula:

[0098]

[0099] Step 9: To achieve better performance, define the baseline loss as shown in the following formula:

[0100]

[0101] Step 10: Finally, the parameters of the policy network π_θ and the value network v_φ are updated by the Adam optimizer with learning rates βη_θ and βη_φ, respectively, and the trained policy network is finally obtained.
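The quantities in Steps 5 to 9 follow the usual PPO pattern. The sketch below shows per-sample versions of a bootstrapped n-step return, the clipped policy objective, and a clipped value loss. These are the widely used textbook forms and are offered as an assumption, since the patent's own formulas are not legible in the source; all function names are hypothetical.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns, computed backwards from a bootstrap value estimate."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def clipped_policy_objective(ratio, advantage, eps):
    """PPO clipped surrogate for one sample (to be maximised).

    `ratio` is pi_theta(a|s) / pi_old(a|s); `advantage` is return minus value.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def clipped_value_loss(value, old_value, target, eps):
    """Squared error with the new value estimate clipped around the old one."""
    clipped = old_value + max(-eps, min(eps, value - old_value))
    return max((value - target) ** 2, (clipped - target) ** 2)

returns = n_step_returns([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
objective = clipped_policy_objective(ratio=1.5, advantage=2.0, eps=0.1)
value_loss = clipped_value_loss(value=2.0, old_value=1.0, target=3.0, eps=0.2)
```

Taking the minimum in the policy objective removes any incentive to push the probability ratio outside the interval [1 − ε, 1 + ε], which is what keeps each update close to the old policy.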

[0102] The present invention also provides a vehicle path optimization system based on deep reinforcement learning.

[0103] The beneficial effects of this invention are as follows:

[0104] This invention employs efficient neural neighborhood search to solve the pickup and delivery problem (PDP). Through an iterative search process, it effectively transforms a solution into another candidate solution within the current neighborhood. It combines deep reinforcement learning and heuristics, training a neural network to guide the local search algorithm, progressively learning optimization strategies and iteratively improving the solution. The core of this invention lies in designing an attention model with a dynamic encoder-decoder architecture to dynamically explore the structural features of problem instances, focusing attention on unvisited nodes to dynamically capture new instance information and better distinguish different graph structures. An efficient proximal policy optimization algorithm is used to train the model, ultimately yielding a high-quality path sequence.

[0105] Other advantages, objectives, and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.

Attached Figure Description

[0106] To make the objectives, technical solutions, and advantages of the present invention clearer, the preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, wherein:

[0107] Figure 1 is an overall flowchart of the method described in this invention;

[0108] Figure 2 is a schematic diagram of a PDP in a real-world scenario.

Detailed Implementation

[0109] The technical solution of the present invention will now be described in detail with reference to the accompanying drawings.

[0110] Figure 1 is an overall flowchart of the method described in this invention. As shown in Figure 1, this invention provides a vehicle path optimization algorithm based on deep reinforcement learning. The algorithm includes the following steps: S1, obtaining node feature embeddings based on problem instances; S2, obtaining location feature embeddings based on problem instances; S3, fusing node and location feature embedding information using a multilayer perceptron to complete encoding; S4, using a removal decoder to remove PDP node pairs; S5, using a repair decoder to re-insert removed node pairs; S6, performing corresponding actions based on the outputs of the removal and repair decoders to achieve state transition and iteratively improve the solution; S7, updating the graph embeddings in real time using a dynamic method based on the decoder output; S8, training the model using a proximal policy optimization algorithm to finally output the path optimization sequence.

[0111] To better illustrate the implementation process of the algorithm proposed in this invention, a schematic diagram of a PDP with one warehouse and six pickup and delivery nodes is provided, based on a real-world scenario. As shown in Figure 2, x0 is the warehouse node, nodes 1+, 2+, 3+ are the pickup nodes, nodes 1-, 2-, 3- are the delivery nodes, and 1+ and 1-, 2+ and 2-, 3+ and 3- are paired. In the initial state, due to the priority constraints of the PDP, nodes 1-, 2-, 3- are masked, and the solution is δ: (x0). In the first step, node 1+ is selected. After selection, node 1+ is masked and its paired node 1- becomes unmasked; the solution is δ: (x0, 1+). In the second step, node 2+ is selected, and the state and actions are updated as in the first step. With the shortest path as the objective, the optimal solution is finally obtained as δ: (x0, 1+, 2+, 2-, 1-, 3+, 3-, x0).
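The masking rule used in this example can be written down directly. In the sketch below, pickup i+ is encoded as the integer +i and delivery i- as −i; this encoding and the function names are illustrative assumptions. A delivery node becomes available only after its pickup node has been visited.

```python
def available_nodes(n, visited):
    """Unmasked nodes: unvisited pickups, plus deliveries whose pickup was visited."""
    nodes = []
    for i in range(1, n + 1):
        if i not in visited:
            nodes.append(i)        # pickup i+ not yet visited
        elif -i not in visited:
            nodes.append(-i)       # pickup done, delivery i- now unmasked
    return nodes

def respects_masks(route, n):
    """Check that every step of a route picks a currently unmasked node."""
    visited = set()
    for node in route:
        if node not in available_nodes(n, visited):
            return False
        visited.add(node)
    return len(visited) == 2 * n

example_route = [1, 2, -2, -1, 3, -3]   # the order from the example, without x0
```

The route from the example passes the check, while any route that starts with a delivery node is rejected at its first step.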

[0112] In this embodiment, the present invention verifies the performance of the proposed model in solving the PDP through experiments. Specifically, when customers are divided into pairs of pickup and delivery nodes, the vehicle departs from the warehouse, visits all customers once under the pairing relationship and priority constraints, ensuring that the pickup node is visited before the delivery node, and finding the optimal path with the goal of reducing the total travel time.

[0113] Experimental setup

[0114] 1) Dataset

[0115] For the PDP problem, this invention, based on the setup in existing work, uses a two-dimensional uniform distribution within the range of 0-1 to randomly and independently generate the locations of warehouse and customer (pairs of pickup and delivery nodes). The distance between two nodes is calculated based on Euclidean space. Three sets of instances are independently generated for training, validation, and testing. In terms of scale, the problem size is set to |V| = 21, |V| = 51, and |V| = 101 (including one warehouse and all pairs of pickup and delivery nodes), and are respectively named PDP21, PDP51, and PDP101.
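Instance generation as described can be sketched in a few lines. The index convention (0 for the warehouse, 1..n for pickups, n+1..2n for the paired deliveries), the seed, and the function name are assumptions made for illustration.

```python
import math
import random

def generate_instance(num_requests, rng):
    """Sample 2n + 1 locations uniformly in the unit square.

    Returns the coordinates and the full Euclidean distance matrix.
    """
    coords = [(rng.random(), rng.random()) for _ in range(2 * num_requests + 1)]
    dist = [[math.dist(a, b) for b in coords] for a in coords]
    return coords, dist

coords, dist = generate_instance(10, random.Random(42))   # a PDP21-sized instance
```

Setting `num_requests` to 10, 25, and 50 gives instances with |V| = 21, 51, and 101, matching the three problem sizes.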

[0116] 2) Comparison Model

[0117] To evaluate the effectiveness of the proposed model, this invention selected comparison methods from four categories: exact algorithms, heuristic algorithms, construction-based deep reinforcement learning algorithms, and improvement-based deep reinforcement learning algorithms. CPLEX was selected as the exact algorithm, LKH as the heuristic algorithm, AM, POMO, and DRL as the construction-based deep reinforcement learning algorithms, and DACT and N2S as the improvement-based deep reinforcement learning algorithms.

[0118] 3) Model Setup and Evaluation Indicators

[0119] This invention uses three metrics as evaluation standards: path length (Obj.), optimality gap (Gap), and solution time (Time). Path length: the objective is to minimize the total path length. Optimality gap: the difference between the obtained solution and that of the reference solver, expressed as a percentage. Solution time: the time taken to solve the test instances.
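The optimality gap is a simple relative difference. A minimal sketch follows, assuming the gap is computed from objective values against a reference objective and averaged over instances; the function names and the sample numbers are hypothetical and are not experimental results.

```python
def optimality_gap(obj, reference_obj):
    """Gap in percent between a method's objective and the reference objective."""
    return 100.0 * (obj - reference_obj) / reference_obj

def mean_gap(objs, reference_objs):
    """Average gap over a set of test instances."""
    gaps = [optimality_gap(o, r) for o, r in zip(objs, reference_objs)]
    return sum(gaps) / len(gaps)

example_gap = optimality_gap(10.5, 10.0)                 # illustrative numbers only
example_mean = mean_gap([10.5, 21.0], [10.0, 20.0])
```

A gap of zero means the method matches the reference solver on that instance; a negative gap would mean it found a shorter route.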

[0120] For the model proposed in this invention, the initial solution δ0 is constructed sequentially in a random manner. Model training and testing are run on a server with two NVIDIA 2080Ti GPUs and 40 CPUs. The training environment for the model is set up in PyCharm and implemented using the PyTorch framework. The training time of the model proposed in this invention varies depending on the size of the problem.

[0121] This experiment uses the Adam optimizer. Other specific hyperparameters and their values are shown in Table 1.

[0122] Table 1. Hyperparameter values of the model proposed in this invention.

[0123]

[0124] Analysis of experimental results:

[0125] All baselines were evaluated on a test dataset with 2000 instances. Since it is difficult to make an absolutely fair time comparison between Python code running neural methods on GPUs and code running traditional methods on CPUs, this invention follows the guidelines in Accorsi et al. to perform a better comparison, allowing each method to fully utilize the server under its best settings. This invention reports experimental results for the proposed model (OUR) and the comparison methods, with the optimality gap measured relative to the heuristic solver LKH, as detailed in Table 2.

[0126] Table 2 shows the experimental results for instance sizes |V| = 21, 51, and 101.

[0127]

[0128] As can be seen from the experimental results in Table 2, compared with construction-based methods, the model proposed in this invention achieves the best results across all three problem scales. Among improvement-based methods, DACT provides better experimental results on PDP21, but its performance drops significantly with increasing problem scale. On PDP51, N2S provides better results, but on PDP101, within an acceptable timeframe, the model proposed in this invention outperforms models such as DACT and N2S. This may be because, when dealing with large-scale problems, the graph embeddings of neural methods have highly similar amplitudes and are spatially very close. Current models cannot distinguish different graph structures in this scenario and thus lack a good global context graph structure at the decoder end. The model proposed in this invention considers the pairing and priority relationships of the PDP, and uses a dynamic encoder-decoder approach to update the graph embedding in real time, focusing attention on unvisited nodes and dynamically capturing new instance information. This approach better solves large-scale PDPs.

[0129] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications can be made to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and all such modifications should be covered within the scope of the claims of the present invention.

Claims

1. A vehicle path optimization method based on deep reinforcement learning, characterized in that the method includes the following steps:
S1. Define the vehicle path optimization problem on a graph: the PDP under study is defined on a graph whose nodes represent locations and whose edges represent the journeys between locations; obtain the node feature embedding based on the problem instance;
S2. Obtain the position feature embedding based on the problem instance;
S3. Use a multilayer perceptron (MLP) to fuse the node and position feature embeddings and complete the encoding;
S4. Use the removal decoder to perform the removal operation on a PDP node pair;
S5. Use the repair decoder to perform the reinsertion operation on the removed node pair;
S6. Execute the corresponding actions according to the outputs of the removal and repair decoders to achieve state transition, and iteratively improve the solution;
S7. Update the graph embedding in real time in a dynamic manner according to the decoder output;
S8. Train the model using the proximal policy optimization (PPO) algorithm, and finally output the path optimization sequence to obtain the optimal solution of the vehicle path optimization.

In step S1, the node feature embedding is obtained as follows:
S11. Define a linear projection of the node features of any node to the output dimension;
S12. Learn the linear projection through its two parameters to compute the initial node embedding of that dimension;
S13. The encoder is based on the Transformer architecture. Each attention layer consists of two sub-layers: a multi-head attention (MHA) layer that performs message passing between nodes, and a fully connected feed-forward (FF) layer. Each sub-layer adds a residual connection and batch normalization (BN). On this basis, the node feature embedding is obtained by the core formula: [formula not reproduced].

In step S2, cyclic positional encoding is used to encode the position features, and the position feature embedding is obtained as follows:
S21. With cyclic positional encoding and a given output dimension, the position feature embedding is initialized as: [formula not reproduced], where the superscript denotes the dimension index;
S22. The scalar used in the encoding is defined as: [formula not reproduced];
S23. The angular frequency is defined as: [formula not reproduced].

In step S3, the node and position feature embeddings are fused by the multilayer perceptron as follows:
S31. The node feature embedding is used to generate the original multi-head self-attention scores: [formula not reproduced], where two trainable matrices are defined for each head and their dimensions are set to fixed values;
S32. The position feature embedding is used to generate the multi-head auxiliary attention scores: [formula not reproduced], where two further trainable matrices are defined for each head;
S33. The quantities used by the comprehensive attention are defined as: [formula not reproduced];
S34. The self-attention scores and the auxiliary attention scores are fed into a three-layer MLP of a specified structure to compute the comprehensive multi-head attention scores: [formula not reproduced];
S35. The scores are normalized using softmax and then used to compute the attention value of each head: [formula not reproduced];
S36. The final node embedding is output through a trainable matrix: [formula not reproduced].

In step S4, the decoder consists of two custom decoders, a removal decoder and a repair decoder, which automatically perform the removal and the reinsertion of pickup and delivery node pairs, respectively. This allows a pickup node and its delivery node to be operated on simultaneously in the neighborhood search, which better handles the priority relationship in the PDP. The removal decoder is implemented as follows:
S41. For each node, compute a score representing the closeness of the node to its neighboring nodes: [formula not reproduced], defined in terms of the predecessor and successor nodes of that node;
S42. A multi-head technique similar to that of the encoder is used to obtain the scores of all heads;
S43. The decoder aggregates the scores of each pickup and delivery node pair with a three-layer MLP: [formula not reproduced], where the MLP structure is: [formula not reproduced]; a scalar counts how frequently the request was removed over a past window of steps, and a binary variable indicates whether the request was selected at a given recent step;
S44. After an activation layer, the distribution is normalized using softmax and then used to sample a node pair as the removal action.

In step S5, the repair decoder is implemented as follows:
S51. Define two scores for a pair of nodes, indicating respectively the degree of preference of one node for accepting the other as its new predecessor and as its new successor: [formula not reproduced]; like the encoder, the repair decoder also uses multiple heads;
S52. Based on the predecessor and successor scores, the decoder uses an MLP to predict the distribution over reinserting one removed node after one candidate node and the other removed node after another candidate node: [formula not reproduced], where the MLP structure is: [formula not reproduced];
S53. Before normalizing with softmax, infeasible actions are masked;
S54. Based on the resulting distribution, a node pair is sampled as the new insertion action, which indicates where the removed node pair is reinserted into the solution.
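The removal and reinsertion actions of steps S4 to S6, together with the masking of infeasible actions in step S53, can be illustrated with a minimal Python sketch. This is not the claimed decoder: the random scores stand in for the decoder outputs, and the route layout (depot 0 at index 0), the function names, and the index convention (pickup inserted after index j, delivery after index k, with k >= j) are assumptions made for illustration.

```python
import math
import random


def remove_pair(route, pickup, delivery):
    """Removal action: drop one pickup-delivery pair from the route."""
    return [n for n in route if n not in (pickup, delivery)]


def masked_softmax(scores, mask):
    """Softmax in which masked-out (False) entries receive probability 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("no feasible action")
    top = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def reinsert_pair(partial_route, pickup, delivery, j, k):
    """Reinsertion action: pickup after index j, delivery after index k."""
    if k < j:
        raise ValueError("delivery would precede its pickup")
    route = list(partial_route)
    route.insert(k + 1, delivery)  # insert the later position first
    route.insert(j + 1, pickup)    # so that index j is still valid
    return route


rng = random.Random(0)
route = [0, 1, 3, 2, 4]             # depot 0; pairs (1, 2) and (3, 4)
partial = remove_pair(route, 1, 2)  # -> [0, 3, 4]

# Enumerate all (j, k) actions and disable those violating precedence.
actions = [(j, k) for j in range(len(partial)) for k in range(len(partial))]
mask = [k >= j for j, k in actions]
scores = [rng.uniform(-1.0, 1.0) for _ in actions]  # stand-in decoder scores
probs = masked_softmax(scores, mask)

j, k = rng.choices(actions, weights=probs, k=1)[0]
new_route = reinsert_pair(partial, 1, 2, j, k)
print(new_route)
```

Because masked actions receive zero probability, every sampled insertion keeps the pickup node ahead of its delivery node, so the precedence constraint never has to be repaired afterwards.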

2. The vehicle path optimization method based on deep reinforcement learning according to claim 1, characterized in that: in step S7, the graph embedding is updated in real time using dynamic encoding, with the following specific steps: S71. When the agent reaches a delivery node, return to step S3, use the encoder to update the embeddings of the remaining nodes, and use the dynamic attention mechanism to focus attention on the nodes that have not yet been visited, so as to capture new instance information; otherwise, continue with the decoding action; S72. Aggregate the updated graph embedding into each individual node embedding to obtain the final node feature embeddings.

3. The vehicle path optimization method based on deep reinforcement learning according to claim 2, characterized in that: in step S8, to obtain higher-quality solutions, the proximal policy optimization (PPO) algorithm is used to train the model, specifically as follows:
Input: the initial policy network parameters, the initial value function parameters, the clipping threshold, the initial learning rate, and the learning rate decay parameter.
The main steps of the PPO algorithm are as follows:
Step 1: For each batch, randomly generate a batch of training instances on the fly;
Step 2: Initialize a set of random solutions corresponding to the instances;
Step 3: Run the policy for n time steps to collect experience;
Step 4: Update the time step, and save the current policy network and value function as the old policy and the old value function, respectively;
Step 5: Use the value function to compute the estimated return at the current time step;
Step 6: Compute the estimated returns and the estimated advantages;
Step 7: PPO performs a number of update epochs, in which the objective is clipped by the threshold to penalize large policy changes whose probability ratio is far from 1: [formula not reproduced];
Step 8: The value estimates are clipped around the previous value estimates: [formula not reproduced];
Step 9: To achieve better performance, the baseline loss is defined as: [formula not reproduced];
Step 10: Finally, the parameters of the policy network and the value network are each updated by the Adam optimizer, and the trained policy network is obtained.
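Steps 5 to 9 follow the general pattern of PPO with a clipped objective and a clipped value estimate, which can be sketched in plain Python as follows. The sketch uses the standard PPO forms rather than the exact formulas of this claim: computing the advantage from the old value estimate, reusing the same threshold `eps` for value clipping, and all function names and numbers are assumptions made for illustration.

```python
import math


def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns computed backward from a bootstrapped value estimate."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(high, x))


def ppo_losses(new_logp, old_logp, new_values, old_values, returns, eps):
    """Clipped policy loss and clipped value (baseline) loss, averaged over steps."""
    policy_terms, value_terms = [], []
    for lp, lp_old, v, v_old, ret in zip(new_logp, old_logp, new_values, old_values, returns):
        advantage = ret - v_old            # estimated advantage (assumed form)
        ratio = math.exp(lp - lp_old)      # probability ratio of new to old policy
        surrogate = min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)
        policy_terms.append(-surrogate)    # minimize the negative clipped objective
        v_clipped = clip(v, v_old - eps, v_old + eps)  # stay near the old estimate
        value_terms.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    n = len(policy_terms)
    return sum(policy_terms) / n, sum(value_terms) / n


# Hypothetical three-step rollout.
returns = n_step_returns([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
policy_loss, value_loss = ppo_losses(
    new_logp=[-0.9, -1.1, -0.7],
    old_logp=[-1.0, -1.0, -1.0],
    new_values=[1.8, 2.1, 3.5],
    old_values=[1.5, 2.0, 3.0],
    returns=returns,
    eps=0.2,
)
print(returns, policy_loss, value_loss)
```

Taking the minimum of the clipped and unclipped terms removes any incentive to push the probability ratio outside the interval from 1 - eps to 1 + eps, which is what limits the size of each policy update.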

4. A vehicle path optimization system based on deep reinforcement learning, characterized in that the system employs the method according to any one of claims 1 to 3.