Traveling salesman cut path planning method and device based on reinforcement learning

By combining reinforcement learning with masking and attention mechanisms, this method optimizes processing path planning in industrial manufacturing, solving the problems of long computation time and local optima in traditional methods, and achieving fast and efficient path planning.

CN117787522B · Active · Publication Date: 2026-06-12 · NORTHWESTERN POLYTECHNICAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NORTHWESTERN POLYTECHNICAL UNIV
Filing Date: 2023-12-28
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for processing path planning in industrial manufacturing suffer from long computation times, a tendency to get trapped in local optima, and difficulty handling complex, changing manufacturing environments. Furthermore, traditional deep learning methods require annotated data and lack flexibility.

Method used

We employ a reinforcement learning-based approach combined with a masking mechanism to generate training data through self-learning, and integrate an attention mechanism to process variable-length sequences, thereby optimizing the cutting path planning.

Benefits of technology

This improves the applicability and efficiency of the model for problems of different scales, avoids local optima, enables rapid solution of processing path planning, and reduces computation time.

Figure CN117787522B_ABST

Abstract

The application discloses a reinforcement learning-based traveling salesman cutting path planning method and device, comprising: reading a target layout containing multiple part graphics; determining the optimal perforation point according to a preset algorithm, so that the perforation points of the parts can be connected effectively under the current process conditions; after the perforation points are confirmed, converting the cutting path planning problem into a Traveling Salesman Problem (TSP), inputting the information of each perforation point to construct an environment simulation model, and obtaining current state information, which includes at least perforation point information, node mask information, and current node information; inputting the current state information into a trained strategy model to obtain the current path strategy and predict the next perforation point, and inputting the current path strategy into the environment simulation model to obtain new state information, until the optimal processing sequence that cuts through all perforation points is obtained. This facilitates outputting the optimal processing sequence and improves planning efficiency.

Description

Technical Field

[0001] This invention relates to the field of industrial manufacturing automation technology, and in particular to a method and apparatus for planning a traveling salesman cutting path based on reinforcement learning.

Background Technology

[0002] In the industrial manufacturing sector, particularly in areas involving precision machining and drilling, the demand for machining path planning is growing. This demand stems from the pursuit of improving production efficiency, reducing material waste, and optimizing energy use. Therefore, the machining path planning problem is transformed into a classic Traveling Salesman Problem (TSP), which involves finding the shortest path connecting all drilling points.

[0003] Currently, the processing path planning problem in industrial scenarios mainly uses traditional methods, including dynamic programming, ant colony optimization, genetic algorithms, simulated annealing, and tabu search. However, these methods also face several challenges: First, they may encounter convergence difficulties, especially when dealing with complex and variable manufacturing environments. Second, the computational process of these methods can be very time-consuming, particularly when planning a large number of processing paths; these algorithms may take hours or even a whole day to arrive at a solution, which is clearly unacceptable in a fast-paced production environment. Finally, due to the limitations of these methods, they cannot guarantee a globally optimal solution in all cases; they are computationally time-consuming and prone to getting trapped in local optima.

[0004] In the field of deep learning, some scholars have pointed out that while pointer networks and graph neural networks can solve the Traveling Salesman Problem (TSP), these are supervised learning methods that require labeled data, and such data, especially optimal solutions, is difficult to obtain. Even when this data is available, the model can only learn up to the level of the known solutions and cannot guarantee finding the truly optimal solution. Reinforcement learning methods, on the other hand, do not require pre-prepared labeled data. They learn through trial and error and can discover better solutions.

[0005] While existing attention-based reinforcement learning methods have shown advantages in solving the Traveling Salesman Problem (TSP), they also have limitations. In particular, when dealing with different numbers of nodes, the model needs to be retrained for each different number of nodes, which limits its flexibility to some extent.

Summary of the Invention

[0006] The purpose of this invention is to provide a method and apparatus for traveling salesman cutting path planning based on reinforcement learning. This reinforcement learning-based processing path planning method incorporates a masking mechanism, further optimizing the cutting path and improving planning efficiency. Drawing on the masking mechanism used in natural language processing, it can effectively handle sequences of variable length. By integrating the masking mechanism into the model, the model can flexibly adapt to different numbers of nodes, thus overcoming the limitations of existing technologies. This improvement not only enhances the model's ability to handle problems of different scales but also greatly enhances its applicability and efficiency in various industrial manufacturing environments.

[0007] This invention provides a reinforcement learning-based method for traveling salesman cutting path planning, comprising:

[0008] Read the target layout drawing containing multiple part graphics;

[0009] The optimal perforation point is determined based on a preset algorithm to ensure the effectiveness of connecting the perforation points of each part under the current process conditions;

[0010] After the perforation points are confirmed, the cutting path planning problem is transformed into a Traveling Salesman Problem (TSP) solution operation. The environmental simulation model is constructed by inputting the information of each perforation point to obtain the current state information. The current state information includes at least the perforation point information, node mask information, and current node information.

[0011] The current state information is input into the trained policy model to obtain the current path policy to predict the next perforation point. The current path policy is then input into the environmental simulation model to obtain new state information, until the optimal processing sequence result of cutting through all perforation points is obtained.

[0012] Preferably, the training steps of the trained policy model include:

[0013] Randomly generate node position coordinates as the training dataset;

[0014] The state information of the training dataset is input into the preset strategy model. The path is randomly selected based on the probability distribution of each node to obtain the first path order result of the part cutting.

[0015] The state information of the training dataset is input into the preset benchmark model, and a greedy algorithm is used to select, at each step, the node with the highest selection probability to obtain the second path order result of part cutting.

[0016] The first path order result is compared with the second path order result. Based on the comparison result, the parameters of the preset strategy model are optimized and updated, or the parameters of the preset strategy model are replaced with the preset benchmark model.
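The data generation and comparison steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`random_instance`, `path_length`, `advantages`, `should_replace_baseline`) are hypothetical, coordinates are assumed to be uniform in the unit square, paths are assumed to be open (no return to the start), and the baseline replacement test is assumed to be a simple comparison of mean path lengths.

```python
import math
import random
from statistics import mean
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def random_instance(n: int, rng: random.Random) -> List[Point]:
    """Randomly generated node coordinates (assumed uniform in the unit square)."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def path_length(coords: Sequence[Point], order: Sequence[int]) -> float:
    """Length of the open path that visits coords in the given order."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(order, order[1:]))


def advantages(policy_lengths: Sequence[float],
               baseline_lengths: Sequence[float]) -> List[float]:
    """Per-instance difference: sampled strategy length minus greedy baseline length."""
    return [lp - lb for lp, lb in zip(policy_lengths, baseline_lengths)]


def should_replace_baseline(policy_lengths: Sequence[float],
                            baseline_lengths: Sequence[float]) -> bool:
    """Assumed rule: copy parameters when the strategy model is shorter on average."""
    return mean(policy_lengths) < mean(baseline_lengths)
```

A negative advantage means the sampled path beat the greedy baseline on that instance, which is the signal the parameter update relies on.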

[0017] Preferably, the step of optimizing and updating the preset strategy model parameters based on the comparison results or replacing the preset strategy model parameters with a preset benchmark model further includes:

[0018] If [condition not reproduced in the source text], then update the parameter $\theta_{BL} = \theta$; otherwise, do not update;

[0019] The current cutting path planning strategy is updated based on the gradient calculated using the loss function, as shown in the following formula:

[0020] [formula not reproduced in the source text]

[0021] That is, by comparing the performance of the current preset strategy model with the performance of the preset benchmark model, the evaluation result is obtained, and the selection of the current cutting path planning strategy is further optimized based on the evaluation result;

[0022] Where $L(\pi_i)$ denotes the path length output by the preset strategy model, the corresponding baseline term (symbol not reproduced in the source text) denotes the path length output by the preset baseline model, $\theta$ denotes the parameters of the preset strategy model, and $p_\theta(\pi_i)$ denotes the probability of the current strategy $\pi_i$.
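The formula referenced in [0019] is not reproduced in this text. For orientation only, a commonly used policy gradient estimator with a greedy rollout baseline (REINFORCE with a baseline) is shown below. It is consistent with the symbols defined in [0022], but it is an assumption and not the patent's own formula; the batch size $B$ and the baseline path symbol $\pi_i^{BL}$ are introduced here for illustration.

```latex
\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{B} \sum_{i=1}^{B}
  \bigl( L(\pi_i) - L(\pi_i^{BL}) \bigr) \, \nabla_\theta \log p_\theta(\pi_i)
```

Under this form, a sampled path $\pi_i$ that is shorter than the baseline path yields a negative weight, so a gradient descent step increases the probability $p_\theta(\pi_i)$ of that path.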

[0023] Preferably, determining the optimal perforation point according to a preset algorithm includes:

[0024] Determine any node in the non-part area of the target layout diagram as the first node, and the first node is the starting position of the tool head;

[0025] Based on the nearest neighbor algorithm and combined with the current process requirements, starting from the position of the first node, traverse all nodes on all parts and obtain the second node on the adjacent part with the best path from the first node. The current process requirements include the shortest distance between perforation points, the shortest distance between perforation points and the edge of the part, or constraints related to perforation points.

[0026] Using the second node as the starting node in the current state, obtain the third node on the adjacent part with the best path distance from the second node;

[0027] Repeat the above steps until all parts are identified as having a unique perforation point.

[0028] Preferably, the step of transforming the cutting path planning problem into a Traveling Salesman Problem (TSP) solution after confirming the piercing point further includes:

[0029] The perforation point information is input into the environmental simulation model, and the perforation point data is automatically filled in and marked with the node mask information;

[0030] After the current node is visited, mark the mask information of the current node;

[0031] The state information is obtained from the environmental simulation model and input into the preset strategy model to obtain the probability of each node being selected and to randomly output the path order of the current target node.

[0032] The state information is obtained from the environmental simulation model and input into the preset benchmark model. The node with the highest probability of being selected is selected in each node in a greedy manner, and the path order of the current target node is output.

[0033] After each node selection is completed, update the simulation environment, confirm the visited and unvisited paths among all nodes, and repeat the above steps of outputting the best path order of the target nodes until the entire target layout diagram is traversed and the node paths are selected. Output the path order of all target nodes from the preset strategy model and the preset baseline model respectively.

[0034] The path lengths for the two strategies are calculated based on the path order of all target nodes output by the two models. The path lengths between the two models are compared to update the model parameters. Finally, the model selected by the cutting path planning strategy is updated according to the direction of the model with the smallest output path length.

[0035] Preferably, the environmental simulation model includes:

[0036] The node location information includes the location information of the perforation point of each part, and is configured to the maximum number of nodes. If the actual number of nodes is less than the maximum number of nodes, a mask is used to fill the gaps to reach the preset number of nodes.

[0037] Node mask information is used to mark whether each node is accessible.

[0038] The state transition equation is used to update the environment state based on the current state information and the policy.

[0039] The policy gradient equation is used to output the path length of the current cutting path planning strategy and update the model parameters. The current cutting path planning strategy is the optimized and updated policy model.
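The padding, masking, and state transition described above can be sketched as a small environment class. This is a hypothetical illustration: the class name `CuttingEnv`, its methods, and the use of `(0.0, 0.0)` as the padding coordinate are assumptions, and the policy gradient part of the environment model is left out.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class CuttingEnv:
    """Hypothetical environment: padded coordinates plus a 0/1 node mask."""

    def __init__(self, points: List[Point], max_nodes: int) -> None:
        if len(points) > max_nodes:
            raise ValueError("more perforation points than max_nodes")
        pad = max_nodes - len(points)
        # Pad the coordinates up to the fixed size; padding slots are dummies.
        self.coords: List[Point] = list(points) + [(0.0, 0.0)] * pad
        # 1 = may still be visited, 0 = padding or already visited.
        self.mask: List[int] = [1] * len(points) + [0] * pad
        self.current: Optional[int] = None
        self.length = 0.0

    def state(self) -> dict:
        """Current state: perforation points, node mask, and current node."""
        return {"coords": self.coords, "mask": list(self.mask), "current": self.current}

    def step(self, node: int) -> dict:
        """State transition: move to `node`, accumulate travel, mark it visited."""
        if self.mask[node] == 0:
            raise ValueError(f"node {node} is masked")
        if self.current is not None:
            self.length += math.dist(self.coords[self.current], self.coords[node])
        self.mask[node] = 0
        self.current = node
        return self.state()

    def done(self) -> bool:
        """True once no selectable node remains."""
        return not any(self.mask)
```

Because padding slots start with mask value 0, one fixed-size state can represent layouts with different numbers of perforation points.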

[0040] Preferably, the network architecture of both the preset strategy model and the preset baseline model includes an encoder and a decoder, wherein the encoder is composed of multiple identical layers stacked together, and the encoding steps include:

[0041] The current state information is input into the encoder, i.e., input from the linear embedding layer, and the current state information is mapped to a high-dimensional space to obtain the embedded feature vector.

[0042] The embedded feature vector is input into the first attention layer, the relationship between the input features is calculated, the interaction between different perforation points is captured, and the first feature matrix is output.

[0043] The output of the first attention layer is added to the input, and then the first normalization operation is performed. The feature vector after the first normalization is output to eliminate the difference in the units of different features.

[0044] The feature vector after the first normalization is input into the feedforward network for further nonlinear transformation. Then, the output of the feedforward network is added to the input, and a second normalization operation is performed to output the encoded target vector.
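The encoding steps above (linear embedding, attention, add and normalize, feedforward, add and normalize) can be sketched in plain Python. This is a simplified illustration under several assumptions: a single attention head and a single layer instead of a stack, layer normalization as the normalization operation, a ReLU feedforward network, no bias terms, and random untrained weights. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row: List[float], eps: float = 1e-5) -> List[float]:
    mu = sum(row) / len(row)
    var = sum((v - mu) ** 2 for v in row) / len(row)
    return [(v - mu) / math.sqrt(var + eps) for v in row]


def add_norm(x: Matrix, y: Matrix) -> Matrix:
    """Residual connection followed by per-node normalization."""
    return [layer_norm([a + b for a, b in zip(rx, ry)]) for rx, ry in zip(x, y)]


def self_attention(h: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Scaled dot-product attention between all pairs of nodes."""
    q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)


def feed_forward(h: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    hidden = [[max(0.0, v) for v in row] for row in matmul(h, w1)]  # ReLU
    return matmul(hidden, w2)


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode(coords: Sequence[Tuple[float, float]], d_model: int = 8,
           d_ff: int = 16, seed: int = 0) -> Matrix:
    """Embed 2-D coordinates, then apply one attention + feedforward layer."""
    rng = random.Random(seed)
    h = matmul([list(p) for p in coords], rand_matrix(2, d_model, rng))
    wq, wk, wv = (rand_matrix(d_model, d_model, rng) for _ in range(3))
    w1, w2 = rand_matrix(d_model, d_ff, rng), rand_matrix(d_ff, d_model, rng)
    h = add_norm(h, self_attention(h, wq, wk, wv))
    return add_norm(h, feed_forward(h, w1, w2))
```

Calling `encode` on a list of n coordinate pairs returns n vectors of length `d_model`, one encoded vector per perforation point.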

[0045] Preferably, the decoding steps of the decoder include:

[0046] The encoded target vector is used as the input to the decoder, i.e., input from the second attention layer, to calculate the relationship between the input features, capture the interaction between different perforation points, and output the second feature matrix;

[0047] The second feature matrix is input into the linear transformation layer, and the output of the multi-head attention layer is merged into a single vector.

[0048] The single vector is input into the third attention layer, and the single vector is used as the query vector. Each node vector is used as the key vector. The query vector and the key vector are multiplied to obtain the attention score of each node.

[0049] The output of the third attention layer is used as the input of the mask layer, and the nodes that have been visited among all nodes are filtered according to the node mask information in the environment simulation model.

[0050] The attention scores, filtered by the mask layer, are input into the Softmax layer and transformed into a probability distribution by the Softmax function. The final output is the behavior policy that maps the current state s to an action a, that is, the probability of selecting each perforation point.
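The last three decoding steps (query-key scores, mask filtering, Softmax) can be sketched as one function. This is an illustrative sketch: the function name `selection_probabilities` is hypothetical, the scores are plain dot products as described in the text, and masked nodes are assumed to be suppressed by setting their score to negative infinity before the Softmax.

```python
import math
from typing import List, Sequence


def selection_probabilities(query: Sequence[float],
                            keys: Sequence[Sequence[float]],
                            mask: Sequence[int]) -> List[float]:
    """Dot-product scores -> mask filter -> softmax.

    mask[i] == 0 means node i is padding or already visited.
    """
    if not any(mask):
        raise ValueError("no selectable node left")
    scores = [
        sum(q * k for q, k in zip(query, key)) if m else -math.inf
        for key, m in zip(keys, mask)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

A masked node receives probability exactly 0, so it cannot be chosen by either sampling or greedy selection.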

[0051] This invention provides a reinforcement learning-based traveling salesman cutting path planning device, comprising:

[0052] The acquisition module is used to read the target layout drawing containing multiple part graphics;

[0053] The perforation point confirmation module is used to determine the optimal perforation point according to a preset algorithm to ensure the effectiveness of connecting the perforation points of each part under the current process conditions;

[0054] The TSP conversion module is used to convert the cutting path planning problem into a Traveling Salesman Problem (TSP) solution operation after the perforation point is confirmed, and to build an environmental simulation model and input the information of each perforation point to obtain the current state information. The current state information includes at least the perforation point information, node mask information and current node information.

[0055] The strategy output module is used to input the current state information into the trained strategy model to obtain the current path strategy to predict the next perforation point, and input the current path strategy into the environment simulation model to obtain new state information, until the optimal processing sequence result of cutting through all perforation points is obtained.

[0056] The present invention also provides an electronic device, comprising:

[0057] The memory is used to store the processing program;

[0058] The processor, when executing the processing program, implements the reinforcement learning-based traveling salesman path planning method as described in this embodiment.

[0059] Compared with the prior art, the present invention has the following beneficial effects:

[0060] This invention proposes a reinforcement learning-based processing path planning method that incorporates a masking mechanism. Inspired by the masking mechanism in natural language processing, this method effectively handles sequences of variable length. By integrating the masking mechanism into the model, it allows for flexible adaptation to different numbers of nodes, overcoming the limitations of existing technologies. This improvement not only enhances the model's ability to handle problems of varying scales but also significantly improves its applicability and efficiency in various industrial manufacturing environments. Furthermore, this method addresses the issues of long computation times and susceptibility to local optima in traditional algorithms. Once trained, the model solves new problems very quickly in practical applications, even for thousands of processing path planning graphs, providing results almost instantly. Simultaneously, this method also addresses the issues of data annotation and the inability to guarantee finding better solutions in traditional deep learning. The self-guided learning process of reinforcement learning in this method endows the model with the ability to transcend the limitations of initial training data, enabling it to move towards the true optimal solution. Therefore, this invention represents a significant technical optimization of existing methods, providing a more flexible and efficient solution to the processing path planning problem.

[0061] This invention integrates masking technology with attention-based reinforcement learning methods to achieve the effect of handling a variable number of processing points, enabling the model to flexibly handle processing tasks of different scales without retraining the model for each different task volume.

[0062] This invention does not rely on externally provided training datasets or pre-labeled data. Instead, it is able to generate all the data required for the training process itself.

[0063] This invention introduces a path-cutting planning strategy model, which generates strategies by randomly selecting paths, thus effectively avoiding the problem of generating only local optima. Furthermore, it collaborates with a baseline model, continuously adjusting and optimizing the strategy by comparing the paths generated by each model.

Attached Figure Description

[0064] Figure 1 is a schematic diagram illustrating the steps of the reinforcement learning-based traveling salesman path planning method in one embodiment of the present invention;

[0065] Figure 2 is a target layout diagram in one embodiment of the present invention;

[0066] Figure 3 is an example diagram illustrating the determination of perforation points in one embodiment of the present invention;

[0067] Figure 4 is an example diagram illustrating the solution of the Traveling Salesman Problem (TSP) in one embodiment of the present invention;

[0068] Figure 5 is a flowchart of the entire model training process for reinforcement learning-based Traveling Salesman Problem (TSP) cutting path planning in one embodiment of the present invention;

[0069] Figure 6 is a network architecture diagram of the encoder and decoder in one embodiment of the present invention;

[0070] Figure 7 is a flowchart of the model inference process after training in one embodiment of the present invention.

Detailed Implementation

[0071] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0072] The term "comprising" and its variations as used herein are open-ended inclusion, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.

[0073] It should be noted that the concepts of "first" and "second" mentioned in this application are only used to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0074] It should be noted that the terms "a" and "a plurality of" used in this application disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0075] Example 1

[0076] The problem of planning the machining path during cutting can be simplified into two sub-problems. First, determine the locations of the piercing points. After determining the locations of the piercing points, the problem of determining the machining path is transformed into a classic Traveling Salesman Problem (TSP), which is to find the shortest path connecting all the piercing points.

[0077] While determining the piercing point beforehand might affect the cutting path, a step-by-step strategy is often easier to implement in practice, especially when dealing with complex or variable manufacturing tasks. From a process safety perspective, pre-determining the piercing point location helps avoid thermal melting issues during processing. Furthermore, in some precision machining scenarios, cutting or piercing the material can cause stress concentration, affecting the accuracy and stability of the final product; optimizing the piercing point location first can effectively mitigate this problem. In summary, dividing the machining path planning into two steps not only simplifies the problem but also reflects a comprehensive consideration of multiple factors.

[0078] When determining the location of piercing points, the main consideration is finding the optimal piercing point on the part surface. The nearest neighbor method can be used, and constraints such as the minimum distance between piercing points and the minimum distance between a piercing point and the part edge should be taken into account. When solving the Traveling Salesman Problem (TSP), the main consideration is finding the optimal path between piercing points to minimize processing time and energy consumption. This invention primarily focuses on how to solve the TSP.

[0079] As shown in Figure 1, this invention provides a reinforcement learning-based traveling salesman cutting path planning method, comprising:

[0080] S1: Read the target layout drawing containing multiple part graphics;

[0081] S2: Determine the optimal perforation point based on a preset algorithm to ensure the effectiveness of the perforation point connection for each part under the current process conditions. The constraints of the current process include: 1. Edge distance: Perforation points should maintain an appropriate distance to avoid damage to the material edges. 2. Material properties: Consider the type and properties of the material being cut, such as hardness and thickness. 3. Thermal melting effect: Consider the impact of the heat generated during cutting on the surrounding area to avoid material damage. The effectiveness of this implementation refers to the optimal selection of the perforation point under the constraints.

[0082] S3: After confirming the perforation points, the cutting path planning problem is transformed into a Traveling Salesman Problem (TSP) solution operation, and an environmental simulation model is constructed. Information of each perforation point is input to obtain the current state information, which includes perforation point information, node mask information, and current node information, etc.

[0083] S4: Input the current state information into the trained policy model to obtain the path policy under the current state to predict the next perforation point. Input the current path policy into the environment simulation model to obtain new state information until the optimal processing sequence result after cutting through all perforation points is obtained.
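Steps S3 and S4 amount to a loop in which the state is passed to the strategy model, a node is chosen, and the state is updated. The sketch below shows that loop under stated assumptions: `nearest_first_policy` is a hand-written stand-in for the trained strategy model (it simply prefers nearby unvisited nodes), `plan_order` is a hypothetical name, and the next node is assumed to be chosen greedily at inference time.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def nearest_first_policy(state: dict) -> List[float]:
    """Stand-in for the trained model: favors unvisited nodes near the current one."""
    coords, mask, current = state["coords"], state["mask"], state["current"]
    weights = [
        math.exp(-math.dist(coords[current], p)) if m else 0.0
        for p, m in zip(coords, mask)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def plan_order(coords: List[Point], start: int,
               policy: Callable[[dict], List[float]]) -> List[int]:
    """Query the policy, pick a node, update the mask, until all nodes are visited."""
    mask = [1] * len(coords)
    mask[start] = 0
    order, current = [start], start
    while any(mask):
        probs = policy({"coords": coords, "mask": mask, "current": current})
        # Assumed greedy decoding: take the most probable next perforation point.
        current = max(range(len(coords)), key=lambda i: probs[i])
        mask[current] = 0
        order.append(current)
    return order
```

Swapping `nearest_first_policy` for a trained model with the same call signature leaves the loop unchanged.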

[0084] In various embodiments of this application, the environmental simulation model includes: node location information, which contains the location of each part's perforation point and is configured to the maximum number of nodes (if the actual number of nodes is less than the maximum, masked padding entries fill the gap to reach the preset number of nodes); node mask information, used to mark whether each node is accessible; a state transition equation, used to update the environment state based on the current state information and the strategy, that is, visited nodes are marked as 0 in the node mask information, indicating that they have been visited and will not be visited again; and a policy gradient equation, used to output the path length of the current cutting path planning strategy and update the model parameters. The current cutting path planning strategy is the optimized and updated strategy model. In this embodiment, optimizing and updating the strategy model means either optimizing and updating the strategy model parameters or replacing the strategy model parameters with those of the baseline model. A complete round of training proceeds as follows: the input data is fed to the strategy model, which obtains a path order by random sampling; once the strategy model has output the complete path order, the baseline model obtains its path order greedily; the results of the two models are then compared, and the strategy model parameters are either optimized and updated or replaced with the baseline model parameters.

[0085] As shown in Figure 2, this embodiment uses multiple different layout patterns. The patterns can be cut in different orders, so we need to decide which pattern to cut first in order to determine the cutting path and minimize its length. In a complete cutting process, the cutting head starts from the origin, moves to a ring to be cut, uses a vertex of the ring as the piercing point (the initial node for cutting the ring), pierces through that point, and cuts the entire ring. The cutting head then selects the next ring using a path selection algorithm, cuts that ring, and repeats these steps until all rings are cut. The cutting path consists of the length of the outlines to be cut and the length of the idle stroke. The outline length is fixed, but the idle stroke, that is, the distance the cutting head travels to the next outline after finishing one part, is variable. Because the piercing point positions of the rings and the cutting order of the patterns can differ, the cutting path length differs as well. The idle stroke length can therefore be optimized, and the optimization consists of determining the cutting order of the rings. The entire cutting path planning can be divided into two steps: 1. determining the perforation points; 2. based on the determined perforation points, solving the Traveling Salesman Problem (TSP) to determine the cutting order of the graphics.
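The split between the fixed outline length and the variable idle stroke can be made concrete with a short calculation. In this sketch the function names are hypothetical, and two assumptions are made: each ring is a closed contour, so the cutting head finishes a ring at its piercing point, and the head does not return to the origin after the last ring.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def idle_stroke_length(origin: Point, pierce_points: Sequence[Point],
                       order: Sequence[int]) -> float:
    """Non-cutting travel: origin -> first piercing point -> ... -> last one."""
    stops = [origin] + [pierce_points[i] for i in order]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))


def total_path_length(outline_lengths: Sequence[float], idle_stroke: float) -> float:
    """Fixed outline length plus the order-dependent idle stroke."""
    return sum(outline_lengths) + idle_stroke
```

For piercing points `(3, 4)`, `(3, 0)`, `(0, 4)` and origin `(0, 0)`, the order `[0, 1, 2]` gives an idle stroke of 14, while `[1, 0, 2]` gives 10. The outline lengths are the same in both cases, so only the cutting order changes the total.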

[0086] To determine the piercing point, in several embodiments of this application, step S2, determining the optimal piercing point according to a preset algorithm, includes:

[0087] Determine any node in the non-part area of the target layout diagram as the first node, and the first node is the starting position of the tool head;

[0088] Based on the nearest neighbor algorithm and combined with the current process requirements, starting from the position of the first node, traverse all nodes on all parts and obtain the second node on the adjacent part with the best path from the first node. The current process requirements include the shortest distance between perforation points, the shortest distance between perforation points and the edge of the part, or constraints related to perforation points; that is, based on process requirements such as the shortest distance between perforation points and the shortest distance between perforation points and the edge of the part, obtain the second node on the adjacent part with the best path from the first node.

[0089] Using the second node as the starting node in the current state, and based on process requirements such as the shortest distance between perforation points and the shortest distance between perforation points and the edge of the part, obtain the third node on the adjacent part with the optimal path from the second node.

[0090] Repeat the above steps until all parts are identified as having a unique perforation point.

[0091] Those skilled in the art will understand that the selection of perforation points is as shown in Figure 3: the black node in the lower left corner can be understood as the initial position of the cutting head, and the black nodes on the parts can be understood as the determined piercing points. The piercing points can be determined using the nearest neighbor algorithm. The specific steps are: starting from the initial position of the cutting head, traverse the points on the remaining rings, find the point closest to the current node, take this point as the new current node, and repeat. This continues until every ring has a unique piercing point. In this embodiment, the rings refer to parts composed of nodes.
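The nearest neighbor selection of piercing points can be sketched as follows. This is a minimal illustration, assuming each ring is given as a non-empty list of candidate vertices and using plain Euclidean distance. The process constraints mentioned in the text (edge distance, material properties, thermal effects) are not modeled, and the function name `select_pierce_points` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def select_pierce_points(start: Point,
                         rings: Sequence[Sequence[Point]]) -> List[Tuple[int, Point]]:
    """Return (ring index, chosen vertex) pairs in the order the rings are reached."""
    remaining = set(range(len(rings)))
    current, chosen = start, []
    while remaining:
        # Closest candidate vertex over all rings that have no piercing point yet.
        ring_idx, vertex = min(
            ((r, v) for r in remaining for v in rings[r]),
            key=lambda rv: math.dist(current, rv[1]),
        )
        chosen.append((ring_idx, vertex))
        remaining.remove(ring_idx)
        current = vertex  # the chosen point becomes the new current node
    return chosen
```

Each ring contributes exactly one vertex, which matches the stopping condition that every ring ends up with a unique piercing point.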

[0092] This embodiment first determines the perforation points and then solves the Traveling Salesman Problem (TSP). The specific implementation steps are as follows. The perforation point information is input into the environmental simulation model, which automatically pads the perforation point data and marks it with node mask information; the padding entries must also be marked in the mask, with a value of 0 indicating that an entry is inaccessible. After the current node is visited, its mask information is marked accordingly. The state information is obtained from the environmental simulation model and input into the preset strategy model to obtain the probability of each node being selected, and the path order of the current target node is output by random sampling. The state information is likewise input into the preset baseline model, which greedily selects the node with the highest selection probability and outputs the path order of the current target node. After each node selection, the simulation environment is updated to record which nodes have been visited and which have not, and the above steps are repeated until the entire target layout has been traversed and all node paths have been selected. The preset strategy model and the preset baseline model then each output the path order of all target nodes. The path length of each strategy is calculated from these path orders, the two path lengths are compared, and the model parameters are updated; finally, the model used by the cutting path planning strategy is updated in the direction of the model with the smaller output path length.

[0093] Those skilled in the art will understand that when a sequence of perforation points is input into the model, the model outputs, based on the position information of the perforation points, a cutting order that passes through all perforation points, i.e., the cutting order of the parts. The model training process is as follows. When perforation point information is input into the model, the environmental simulation model automatically pads the data: if the number of perforation points is less than the specified number, mask-padded entries fill the gap, and a mask value of 0 for a node indicates that the point cannot be accessed. After a node is accessed, its mask value is also set to 0, indicating that the node has been accessed and will not be accessed again. When the state information s is input into the model, the policy model outputs the probability of selecting each node; nodes with higher probabilities are selected more often, and nodes with lower probabilities less often. The baseline model selects nodes greedily, i.e., it selects the node with the highest probability. After a node is selected, the environment is updated to mark which nodes have been accessed and which have not. After all node paths in the entire graph have been selected, the policy model and the baseline model each output a node path order. The path lengths of the two strategies are calculated from these orders and compared, and the model parameters are updated, always in the direction that yields the shorter path length. That is, the model takes nodes V7, V3, V13, V18, V32, and V25 as input and outputs a node traversal sequence of V3-V7-V11-V13-V18-V25-V32, as shown in Figure 4.

[0094] In various embodiments of this application, the training steps of the trained policy model include: randomly generating node position coordinates as a training dataset. The training data in this embodiment is generated automatically; it does not rely on externally provided training datasets or pre-labeled data, and all the data required for training can be generated by the mechanism itself. The state information of the training dataset is input into the preset policy model, and a path is randomly selected based on the probability distribution over the nodes to obtain the first path order result for part cutting, whose length is $L(\pi_i)$. The state information of the training dataset is also input into the preset baseline model, which uses a greedy algorithm to select the node with the highest selection probability at each step to obtain the second path order result for part cutting. For example, suppose the probability of selecting node 1 is 50%, node 2 is 20%, and node 3 is 30%. The baseline model outputs node 1, which has the highest probability. The policy model, by contrast, behaves like a roulette wheel: node 1 is selected with relatively high probability and node 2 with relatively low probability, but every node has a chance of being selected.
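The difference between the two selection rules in the 50% / 20% / 30% example can be illustrated with a few lines of Python. The function names `greedy_select` and `roulette_select` are hypothetical, and the fixed seed is only there to make the sketch reproducible.

```python
import random

def greedy_select(probs):
    """Baseline model behaviour: always take the most probable node."""
    return max(range(len(probs)), key=lambda i: probs[i])

def roulette_select(probs, rng):
    """Policy model behaviour: draw a node in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.2, 0.3]   # nodes 1, 2, 3 from the example (0-based indices)
rng = random.Random(0)    # fixed seed, for reproducibility only

draws = [roulette_select(probs, rng) for _ in range(10_000)]
# Empirical selection frequency of each node under roulette sampling.
freq = [draws.count(i) / len(draws) for i in range(len(probs))]
```

Greedy selection always returns index 0 (node 1), whereas the sampled frequencies approach the three probabilities, so nodes 2 and 3 are still explored.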

[0095] Compare the first path order result with the second path order result, and based on the comparison either optimize and update the preset strategy model parameters or replace the preset baseline model parameters with those of the preset strategy model. If $L(\pi_i) < L(\pi_i^{BL})$, then update the parameter $\theta^{BL} = \theta$; otherwise, no update is made. This can be understood as follows: if the policy model is better, the baseline model copies the parameters of the policy model; if the baseline model is better, the policy model updates its parameters in the direction of shorter paths. The gradient is calculated from the loss function to update the current cutting path planning policy, as shown in the following formula:

[0096] [Formula not reproduced in this text.]

[0097] That is, by comparing the performance of the current preset strategy model with the performance of the preset benchmark model, the evaluation result is obtained, and the selection of the current cutting path planning strategy is further optimized based on the evaluation result;

[0098] Here, $L(\pi_i)$ denotes the path length output by the preset strategy model, $L(\pi_i^{BL})$ denotes the path length output by the preset baseline model, $\theta$ denotes the parameters of the preset strategy model, and $p_\theta(\pi_i)$ denotes the probability of the current strategy $\pi_i$.
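The gradient formula of paragraph [0096] is not legible in this text. One form that is consistent with the symbols defined in [0098] and with the comparison rule in [0095] is the policy-gradient estimate with a greedy baseline shown below. It is offered as an assumption rather than as the patent's verbatim formula; the batch size $B$ and the averaging over a batch are introduced here for illustration.

```latex
\nabla_\theta \mathcal{L}(\theta) \;\approx\; \frac{1}{B} \sum_{i=1}^{B}
  \bigl( L(\pi_i) - L(\pi_i^{BL}) \bigr)\, \nabla_\theta \log p_\theta(\pi_i)
```

Under this form, when the sampled path is shorter than the baseline path, i.e., $L(\pi_i) < L(\pi_i^{BL})$, the coefficient is negative, so a gradient-descent step increases $\log p_\theta(\pi_i)$ and makes that path more likely; when the sampled path is longer, its probability is decreased.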

[0099] The model training principle is explained as follows. The flowchart of the model training process in this embodiment is shown in Figure 5 and includes the following steps:

[0100] 1. Extract the location information of the perforation points from industrial manufacturing data, and construct an environmental simulation model based on it. The environmental simulation model includes four components. Node location information: the locations of the perforation points, which represent the parts to be cut. A maximum number of nodes can be specified, such as 512 or another appropriate value; if the actual number of nodes is less than the maximum, mask-padded entries fill the gap to reach the specified number. Node mask information: each node is marked, where 0 indicates inaccessible and 1 indicates accessible. State transition equation: mainly used to update the environmental state. Policy gradient equation: mainly used to output the path length of the current policy, which is later used to update the model parameters.
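A minimal sketch of such an environment is given below, covering the padding, the node mask, and the state transition. The class name `CutEnv`, its method names, the dummy coordinate used for padding, and the small `max_nodes` value are assumptions made for illustration; the text only specifies the mask convention (0 inaccessible, 1 accessible) and the idea of padding up to a maximum node count.

```python
class CutEnv:
    """Hypothetical environment simulation model (all names are assumptions)."""

    def __init__(self, points, max_nodes=8):
        if len(points) > max_nodes:
            raise ValueError("more perforation points than max_nodes")
        pad = max_nodes - len(points)
        # Padding entries get a dummy coordinate and start out masked (0).
        self.coords = list(points) + [(0.0, 0.0)] * pad
        self.mask = [1] * len(points) + [0] * pad   # 1 = accessible
        self.current = -1                           # no node visited yet

    def state(self):
        """Current state: node coordinates, node mask, current node index."""
        return {"coords": self.coords, "mask": list(self.mask),
                "current": self.current}

    def step(self, node):
        """State transition: visit `node` and mark it inaccessible."""
        if self.mask[node] == 0:
            raise ValueError(f"node {node} is padded or already visited")
        self.mask[node] = 0
        self.current = node
        return self.state()

    def done(self):
        """True once no accessible node remains."""
        return not any(self.mask)

env = CutEnv([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)], max_nodes=5)
env.step(0)
state_after_two = env.step(2)
```

Because padded entries and visited nodes share the mask value 0, the policy can treat both in the same way when it filters candidates.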

[0101] 2. Based on the state transition equation in the environmental simulation model, obtain the current state (`state`), which includes the perforation point information, the node mask information, and the current node information. Input the current state into the policy model and the baseline model, where $\pi_i^{BL}$ denotes the path output by the baseline model and $\pi_i$ denotes the path output by the policy model. The two models have similar structures: both contain an encoder and a decoder to process instances and generate paths. They differ in the path selection strategy. Policy model: the policy model selects paths randomly according to the probability distribution over paths. This lets the model explore multiple possible paths instead of relying solely on a greedy strategy, which helps it handle the uncertainty of the problem and avoid local optima. Baseline model: the baseline model uses a greedy strategy, selecting the action or path with the highest probability at each step. This simplifies path selection and is usually computationally cheaper, although it may sacrifice some global performance. The policy model improves global performance by exploring multiple paths, while the baseline model provides a reasonable performance benchmark that evaluates and guides the learning of the policy model and makes it easier to train.

[0102] 3. Input the path policy output by the model into the environment simulation model, update the environment state `state` using the state transition equation, and repeat step 2 to obtain the path policy under the new state. Calculate the path length of the current policy using the policy gradient equation; this length is later used to update the parameters (θ) of the baseline model and the policy model.

[0103] 4. If $L(\pi_i) < L(\pi_i^{BL})$, then update the parameter $\theta^{BL} = \theta$; otherwise, no update is made. The reason is that when the path generated by the policy model is shorter than the path generated by the baseline model, the parameters of the baseline model are replaced with those of the policy model, so that the better-performing policy model is used for future path generation. If the path generated by the policy model is longer, the baseline parameters are left unchanged to prevent performance degradation. This helps to gradually improve the performance of the policy model.

[0104] 5. The formula in [0096] represents the calculation of the gradient, i.e., the definition of the loss function, which is obtained by comparing the performance of the current policy with that of the greedy policy. If the current policy performs better, the gradient encourages the policy to update in that direction to improve performance. If the greedy policy performs better, the gradient encourages the policy to update in the direction of the greedy policy to better explore the policy space.

[0105] 6. Steps 4 and 5 can be understood as follows: if the path randomly selected by the policy model is shorter than the path derived by the baseline model using a greedy strategy, the path explored by the policy model is effective. In this case, the policy responsible for generating that path is strengthened, meaning that the model is more likely to take similar actions in similar situations in the future. Conversely, if the path generated by the policy model is inferior to the path of the baseline model, the relevant policy is weakened. This mechanism ensures that the model not only finds satisfactory solutions but also keeps optimizing its decision-making during exploration. Through this feedback loop, the model gradually improves its path planning strategy and achieves better path planning results in subsequent processing tasks.
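The comparison logic of steps 3 to 6 can be sketched without a neural network, using only path lengths. Everything below is illustrative: the function names, the parameter dictionaries `theta` and `theta_bl`, and the choice of an open path (no return to the start node) are assumptions not fixed by the text.

```python
import math

def path_length(coords, order):
    """Length of the open path that visits coords in the given order."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(order, order[1:]))

def reinforce_weight(policy_len, baseline_len):
    """Coefficient applied to grad log p(path) in the loss.

    Negative when the sampled path is shorter than the greedy one, so that
    minimising the loss raises the probability of the sampled path.
    """
    return policy_len - baseline_len

def maybe_replace_baseline(policy_len, baseline_len, theta, theta_bl):
    """Copy the policy parameters into the baseline if the policy path is shorter."""
    return dict(theta) if policy_len < baseline_len else theta_bl

coords = [(0.0, 0.0), (0.0, 3.0), (4.0, 3.0), (4.0, 0.0)]
sampled_order = [0, 1, 2, 3]   # path drawn by the policy model (assumed)
greedy_order = [0, 2, 1, 3]    # path produced by the baseline model (assumed)

l_policy = path_length(coords, sampled_order)
l_baseline = path_length(coords, greedy_order)
weight = reinforce_weight(l_policy, l_baseline)

theta, theta_bl = {"w": 1.0}, {"w": 0.5}   # stand-in parameters
theta_bl = maybe_replace_baseline(l_policy, l_baseline, theta, theta_bl)
```

In this example the sampled path is the shorter one, so the weight is negative (the sampled path is reinforced) and the baseline takes over the policy parameters.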

[0106] In this implementation, the network architectures of both the preset strategy model and the preset baseline model are neural networks with an encoder and a decoder, as shown in Figure 6. The input perforation point state information mainly consists of the list of coordinates of the perforation points together with the mask information. The output is the behavior policy, i.e., the mapping from state s to action A, which is the probability of selecting each perforation point. The encoder is composed of multiple identical stacked layers: in Figure 6, the area enclosed by the dashed lines can be repeated N times, for example N = 3, for feature extraction. The encoder works as follows. The current state information is input into the encoder at the linear embedding layer, which maps it to a high-dimensional space to obtain an embedded feature vector. The embedded feature vector is input into the first attention layer, which calculates the relationships between the input features, captures the interactions between different perforation points, and outputs the first feature matrix. The output of the first attention layer is added to its input, and a first normalization operation is performed to output the first normalized feature vector, eliminating the dimensional differences between features. The first normalized feature vector is input into the feedforward network for further nonlinear transformation. The output of the feedforward network is then added to its input, and a second normalization operation is performed to output the encoded target vector. In this embodiment, the encoder input, i.e., the input state s, includes the position information of the perforation points, the mask information, and other related features.

[0107] The decoder is described as follows. The encoded target vector is the input to the decoder and enters at the second attention layer, which calculates the relationships between the input features, captures the interactions between different perforation points, and outputs the second feature matrix. The second feature matrix is input to the linear transformation layer, which merges the outputs of the multi-head attention layer into a single vector. The single vector is input to the third attention layer, where it serves as the query vector and each node vector serves as a key vector; the query vector is multiplied by each key vector to obtain the attention score of each node. The output of the third attention layer is the input to the mask layer, which filters out the visited nodes according to the node mask information in the environmental simulation model. The filtered attention scores are input to the Softmax layer, which converts them into a probability distribution. The final output is the behavior policy mapping the current state s to action A, i.e., the probability of selecting each perforation point.

[0108] Those skilled in the art will understand that the encoder structure used in this embodiment is as follows:

[0109] Linear Embedding Layer: Maps the input state s to a high-dimensional space to obtain the embedding vector.

[0110] The first attention layer is a multi-head attention mechanism: it calculates the relationship between input features, enabling the model to capture the interaction between different perforation points.

[0111] Add & Normalize: The output of the attention layer is added to the input and then normalized to stabilize the learning process and prevent network degradation.

[0112] Feedforward Network: Performs further nonlinear transformations on the output of the attention layer to improve the model's expressive power.
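The data flow through one encoder layer (attention, add and normalize, feedforward, add and normalize) can be traced with a small pure-Python sketch. It is deliberately simplified and rests on several assumptions: a single attention head with identity projections instead of learned multi-head weights, per-node layer normalization as the normalization operation, a bare ReLU as a stand-in for the feedforward network, and a fixed hypothetical `embed` map in place of the learned linear embedding.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def embed(coords):
    """Stand-in linear embedding of (x, y) into 4 dimensions (fixed weights)."""
    return [[x, y, x + y, x - y] for x, y in coords]

def self_attention(H):
    """Single-head scaled dot-product attention with identity projections."""
    d = len(H[0])
    out = []
    for q in H:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H])
        out.append([sum(wj * v[c] for wj, v in zip(w, H)) for c in range(d)])
    return out

def add_and_norm(X, Y, eps=1e-5):
    """Residual connection followed by per-node layer normalization."""
    out = []
    for x, y in zip(X, Y):
        z = [a + b for a, b in zip(x, y)]
        mu = sum(z) / len(z)
        var = sum((v - mu) ** 2 for v in z) / len(z)
        out.append([(v - mu) / math.sqrt(var + eps) for v in z])
    return out

def feed_forward(H):
    """Stand-in position-wise nonlinearity; a real layer has learned weights."""
    return [[max(0.0, v) for v in h] for h in H]

def encoder_layer(H):
    H1 = add_and_norm(H, self_attention(H))     # attention + add & normalize
    return add_and_norm(H1, feed_forward(H1))   # feedforward + add & normalize

encoded = embed([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
for _ in range(3):                              # N = 3 stacked layers
    encoded = encoder_layer(encoded)
```

The output keeps one vector per perforation point, which is what lets the decoder later use each node vector as a key.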

[0113] The decoder structure used in this embodiment is as follows:

[0114] The second attention layer used in this implementation is a multi-head attention mechanism: it accepts the output of the encoder and the output of the decoder from the previous step, calculates the relationships between the input features, and enables the model to capture the interactions between different perforation points.

[0115] Linear Transformation: Combines the outputs of the multi-head attention layer into a single vector.

[0116] The third attention layer used in this implementation is a single-head attention mechanism: the merged vector is used as the query (q), each node vector is used as the key (k), and the query (q) is multiplied by the key (k) to obtain the attention score of each node.

[0117] Masking layer: Based on the node masking information in the environment simulation model, the attention score of inaccessible nodes is set to negative infinity to prevent the model from selecting these nodes.

[0118] Softmax layer: The attention score is converted into a probability distribution using the Softmax function, which helps the model decide which node to select in the next step.
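The last three decoder stages (single-head scoring, masking, Softmax) can be written compactly. The sketch below is an assumption-laden illustration: the function name `select_probabilities` is hypothetical, the query and keys are plain lists rather than learned projections, and no scaling or clipping is applied to the scores.

```python
import math

def select_probabilities(query, keys, mask):
    """Score each node by q·k, mask inaccessible nodes, then apply Softmax.

    mask[i] == 1 means node i is accessible, 0 means inaccessible.
    """
    scores = [
        sum(q * k for q, k in zip(query, key)) if m == 1 else -math.inf
        for key, m in zip(keys, mask)
    ]
    finite = [s for s in scores if s != -math.inf]
    if not finite:
        raise ValueError("no accessible node left")
    top = max(finite)                            # subtract max for stability
    exps = [math.exp(s - top) for s in scores]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
mask = [1, 1, 0]                                 # node 2 was already visited
probs = select_probabilities(query, keys, mask)
```

Although node 2 has the largest raw score, its probability is exactly zero after masking, so it cannot be selected again.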

[0119] The cutting planning method described in this embodiment involves the model inference process shown in Figure 7, i.e., the process of inputting instances into the trained model to obtain the optimal path sequence. The process is straightforward: given the perforation point information for each part, the model quickly calculates the optimal processing order of the parts. The model can handle variations in part size or in the number of perforation points. It can also process data from multiple parts simultaneously, within seconds, which significantly accelerates the path planning process. In this embodiment, the attention mechanism is a method by which the model assigns different weights to parts of the input data in order to identify the key information. Masking is a technique used in machine learning to filter out unnecessary parts of a sequence so that the model focuses on the important information.
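The inference loop can be sketched as a rollout that repeatedly asks the policy for node probabilities and updates the mask. In the sketch below the trained policy model is replaced by a hypothetical `stand_in_policy` that simply favours nearby nodes, and the highest-probability node is taken at each step; both choices are assumptions, since the text does not state how the trained model's output is decoded at inference time.

```python
import math

def stand_in_policy(coords, mask, current):
    """Placeholder for the trained policy model: favours nearby nodes.

    Returns a probability for every node; masked nodes get probability 0.
    """
    scores = [-math.dist(coords[current], c) if m else -math.inf
              for c, m in zip(coords, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def plan_order(coords, start=0, policy=stand_in_policy):
    """Roll the policy out until every perforation point has been visited."""
    mask = [1] * len(coords)
    mask[start] = 0                      # the start node counts as visited
    order, current = [start], start
    while any(mask):
        probs = policy(coords, mask, current)
        current = max(range(len(coords)), key=lambda i: probs[i])
        mask[current] = 0                # environment update after selection
        order.append(current)
    return order

order = plan_order([(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
```

Swapping `stand_in_policy` for a trained model with the same call signature would leave the loop unchanged, which reflects the separation between the policy and the environment simulation model described earlier.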

[0120] Example 2

[0121] Based on the same concept, the present invention provides a reinforcement learning-based traveling salesman path cutting planning device, comprising:

[0122] The data acquisition module is used to read the target layout drawing containing multiple part graphics;

[0123] The perforation point confirmation module is used to determine the optimal perforation point according to a preset algorithm to ensure the effectiveness of connecting the perforation points of each part under the current process conditions;

[0124] The TSP conversion module is used to convert the cutting path planning problem into a Traveling Salesman Problem (TSP) solution operation after the perforation point is confirmed. It inputs the information of each perforation point to build an environmental simulation model and obtains the current state information. The current state information includes at least the perforation point information, node mask information and current node information.

[0125] The strategy output module is used to input the current state information into the trained strategy model to obtain the current path strategy to predict the next perforation point, and input the current path strategy into the environment simulation model to obtain new state information, until the optimal processing sequence result of cutting through all perforation points is obtained.

[0126] It should be noted that the division of the various modules in this device / system embodiment is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software through processing element calls; they can also be implemented entirely in hardware; or some units can be implemented by processing element calls to software, while others can be implemented in hardware.

[0127] The implementation principles of the above-mentioned acquisition module, perforation point confirmation module, TSP conversion module and strategy output module have been described in the foregoing embodiments, so they will not be repeated here.

[0128] Example 3

[0129] Based on the same concept, an electronic device is also provided in some embodiments of this application. This electronic device includes a memory and a processor, wherein the memory stores a processing program, and the processor executes the processing program according to instructions. When the processor executes the processing program, the reinforcement learning-based traveling salesman path planning method described in the foregoing embodiments is implemented. For example, a graphics card (GPU) primarily accelerates the training process during model training, and loading the model for prediction onto the GPU also accelerates the prediction process.

[0130] In some embodiments of this application, a readable storage medium is also provided, which can be a non-volatile readable storage medium or a volatile readable storage medium. The readable storage medium stores instructions that, when executed on a computer, cause an electronic device containing such a readable storage medium to perform the aforementioned reinforcement learning-based traveling salesman path cutting planning method.

[0131] It is understood that, for the aforementioned reinforcement learning-based traveling salesman path planning methods, if they are all implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes: USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.

[0132] A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transfer a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.

[0133] The program code for executing the technical solutions disclosed in this application can be written in any combination of one or more programming languages. These programming languages include object-oriented programming languages, such as Python and C++, and conventional procedural programming languages, such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0134] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A reinforcement learning based traveling salesman cut-path planning method, characterized in that, include: Read the target layout drawing containing multiple part graphics; The optimal perforation point is determined based on a preset algorithm to ensure the effectiveness of connecting the perforation points of each part under the current process conditions; After confirming the perforation points, the cutting path planning problem is transformed into a Traveling Salesman Problem (TSP) solution operation. An environmental simulation model is constructed by inputting information for each perforation point to obtain the current state information. This current state information includes at least perforation point information, node mask information, and current node information. The environmental simulation model includes: node location information, including the location of each perforation point on the part, configured with a maximum number of nodes. If the actual number of nodes is less than the maximum number, a mask is used to fill the gaps to reach the preset number of nodes; node mask information, used to mark whether each node is accessible; a state transition equation, used to update the environmental state based on the current state information and the strategy; and a strategy gradient equation, used to output the path length of the current cutting path planning strategy and update the model parameters. The current cutting path planning strategy is an optimized and updated strategy model. The current state information is input into the trained policy model to obtain the current path policy to predict the next perforation point. The current path policy is input into the environmental simulation model to obtain new state information until the optimal processing sequence result of cutting through all perforation points is obtained. 
The training steps of the trained strategy model include: randomly generating node position coordinates as a training dataset; inputting the state information of the training dataset into a preset strategy model and randomly selecting paths based on the probability distribution of each node to obtain a first path order result for part cutting; inputting the state information of the training dataset into a preset benchmark model and using a greedy algorithm to select the node with the highest selection probability among each node to obtain a second path order result for part cutting; comparing the first path order result with the second path order result, and optimizing and updating the parameters of the preset strategy model or replacing the parameters of the preset strategy model with those of the preset benchmark model based on the comparison results; Both the preset strategy model and the preset baseline model have network architectures including encoders and decoders. The encoder is composed of multiple identical layers stacked together. 
The encoding steps include: inputting the current state information into the encoder, i.e., inputting from the linear embedding layer, mapping the current state information to a high-dimensional space to obtain an embedded feature vector; inputting the embedded feature vector into a first attention layer, calculating the relationship between input features, capturing the interaction between different perforation points, and outputting a first feature matrix; adding the output of the first attention layer to the input and performing a first normalization operation, outputting a first normalized feature vector to eliminate the dimensional differences between different features; inputting the first normalized feature vector into a feedforward network for further nonlinear transformation operations, and then adding the output of the feedforward network to the input and performing a second normalization operation to output the encoded target vector. The decoding steps of the decoder include: taking the encoded target vector as input to the decoder, i.e., inputting it from the second attention layer, calculating the relationship between input features, capturing the interaction between different perforation points, and outputting a second feature matrix; inputting the second feature matrix into a linear transformation layer, merging the outputs of the multi-head attention layer into a single vector; inputting the single vector into a third attention layer, using the single vector as a query vector, and each node vector as a key vector, multiplying the query vector and the key vector to obtain the attention score value of each node; taking the output of the third attention layer as input to a mask layer, filtering all visited nodes according to the node mask information in the environmental simulation model; inputting the attention score value filtered by the mask layer into a Softmax layer, converting it into a probability distribution through the Softmax function, and finally outputting the mapping of the behavior policy from the current state s to action A, i.e., the probability of finally selecting a perforation point.

2. The travel salesman path planning method based on reinforcement learning according to claim 1, characterized in that the step of optimizing and updating the preset strategy model parameters based on the comparison results, or replacing the preset strategy model parameters with the preset benchmark model, further includes: if $L(\pi_i) < L(\pi_i^{BL})$, then update the parameters $\theta^{BL} = \theta$; otherwise, no update is made; the current cutting path planning strategy is updated based on the gradient calculated using the loss function, as shown in the following formula: [formula not reproduced in this text]; that is, by comparing the performance of the current preset strategy model with the performance of the preset benchmark model, the evaluation result is obtained, and the selection of the current cutting path planning strategy is further optimized based on the evaluation result; wherein $L(\pi_i)$ indicates the path length output by the preset strategy model, $L(\pi_i^{BL})$ represents the path length output by the preset baseline model, θ represents the parameters in the preset baseline model, and $p_\theta(\pi_i)$ represents the probability of the current policy $\pi_i$.

3. The travel salesman path planning method based on reinforcement learning according to claim 1, characterized in that the step of determining the optimal perforation point according to a preset algorithm includes: determining any node in the non-part area of the target layout diagram as the first node, the first node being the starting position of the tool head; based on the nearest neighbor algorithm and combined with the current process requirements, starting from the position of the first node, traversing all nodes on all parts and obtaining the second node on the adjacent part with the best path from the first node, wherein the current process requirements include the shortest distance between perforation points, the shortest distance between perforation points and the edge of the part, or constraints related to perforation points; using the second node as the starting node in the current state, obtaining the third node on the adjacent part with the best path distance from the second node; and repeating the above steps until all parts are identified as having a unique perforation point.

4. The travel salesman path planning method based on reinforcement learning according to claim 1, characterized in that, The step of transforming the cutting path planning problem into a Traveling Salesman Problem (TSP) solution after confirming the piercing point further includes: The perforation point information is input into the environmental simulation model, and the perforation point data is automatically filled in and marked with the node mask information; After the current node is visited, mark the mask information of the current node; The state information is obtained from the environmental simulation model and input into the preset strategy model to obtain the probability of each node being selected and to randomly output the path order of the current target node. The state information is obtained from the environmental simulation model and input into the preset benchmark model. The node with the highest probability of being selected is selected in each node in a greedy manner, and the path order of the current target node is output. After each node selection is completed, update the simulation environment, confirm the visited and unvisited paths among all nodes, and repeat the above steps of outputting the best path order of the target nodes until the entire target layout diagram is traversed and the node paths are selected. Output the path order of all target nodes from the preset strategy model and the preset baseline model respectively. The path lengths for the two strategies are calculated based on the path order of all target nodes output by the two models. The path lengths between the two models are compared to update the model parameters. Finally, the model selected by the cutting path planning strategy is updated according to the direction of the model with the smallest output path length.

5. A travel salesman path planning device based on reinforcement learning, characterized in that, The apparatus for implementing the reinforcement learning-based traveling salesman path planning method as described in claim 1 includes: The acquisition module is used to read the target layout drawing containing multiple part graphics; The perforation point confirmation module is used to determine the optimal perforation point according to a preset algorithm to ensure the effectiveness of connecting the perforation points of each part under the current process conditions; The TSP conversion module is used to convert the cutting path planning problem into a Traveling Salesman Problem (TSP) solution operation after the perforation points are confirmed. It inputs the information of each perforation point to build an environmental simulation model and obtains the current state information, which includes perforation point information, node mask information and current node information. The strategy output module is used to input the current state information into the trained strategy model to obtain the current path strategy to predict the next perforation point, and input the current path strategy into the environment simulation model to obtain new state information, until the optimal processing sequence result of cutting through all perforation points is obtained.

6. An electronic device, characterized in that it includes: a memory for storing the processing program; and a processor which, when executing the processing program, implements the reinforcement learning-based traveling salesman cutting path planning method as described in any one of claims 1 to 4.