A real-time network routing optimization method and device based on reinforcement learning

By transforming the network routing problem into a traveling salesman problem and utilizing a routing model trained with reinforcement learning, combined with attention mechanisms and beam search, the optimal route selection problem under dynamic changes in the network environment is solved, thereby improving network communication efficiency.

CN116527564B · Active · Publication Date: 2026-06-30 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2023-05-11
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing network routing optimization methods are ill-suited to the dynamic changes in the network environment, especially in large-scale networks where they fail to meet the requirements of real-time and dynamic operation, making it difficult to select the optimal route.

Method used

A reinforcement learning-based network routing optimization method is adopted, in which the network routing problem is transformed into a traveling salesman problem. A network routing model trained by reinforcement learning on traveling salesman problem instances plans the route, and the optimal route is selected dynamically by combining an attention mechanism with beam search.

Benefits of technology

It enables dynamic and real-time selection of the best route in complex network environments, improving network communication efficiency, avoiding routing path instability caused by changes in network status, and enhancing the accuracy and efficiency of data packet transmission.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a real-time network routing optimization method and apparatus based on reinforcement learning. The method includes: inputting information of a source network device node, information of a target network device node, and information of a current network device node into a pre-trained network routing model to obtain the shortest routing path from the current network device node to the target network device node output by the network routing model; and transmitting a data packet from the current network device node to the first network device node in the shortest routing path, wherein the first network device node is the next-hop network device node of the current network device node. In this application embodiment, routing optimization is achieved by transforming the network routing problem into a traveling salesman problem. Simultaneously, reinforcement learning can better perceive the dynamics and real-time nature of the network environment, enabling adaptive selection of the optimal route in complex network environments, thereby improving the communication efficiency of the entire network.

Description

Technical Field

[0001] This application relates to the field of network routing optimization, and in particular to a real-time network routing optimization method and apparatus based on reinforcement learning.

Background Technology

[0002] With the continuous emergence of new network applications, new demands and challenges have been placed on network transmission. Network routing optimization is a crucial issue in achieving goals such as real-time performance and low latency.

[0003] Traditional approaches to network routing optimization, such as the Routing Information Protocol (RIP), Open Shortest Path First (OSPF), and the Border Gateway Protocol (BGP), are ill-suited to dynamic changes in the network environment. Furthermore, large-scale networks involve massive network state matrices and network characteristic data; various heuristic algorithms can yield approximate or suboptimal routes on such data, but they cannot meet the real-time and dynamic requirements of the network. Therefore, how to select the optimal route dynamically and in real time in complex network environments is a pressing technical problem that needs to be solved by those skilled in the art.

Summary of the Invention

[0004] In view of the above problems, embodiments of this application provide a real-time network routing optimization method and apparatus based on reinforcement learning, so as to overcome the above problems or at least partially solve the above problems.

[0005] A first aspect of this application discloses a real-time network routing optimization method based on reinforcement learning, the method comprising:

[0006] Information of the source network device node, information of the target network device node, and information of the current network device node is input into a pre-trained network routing model to obtain the shortest routing path from the current network device node to the target network device node output by the network routing model;

[0007] The data packet is transmitted from the current network device node to the first network device node in the shortest routing path, where the first network device node is the next-hop network device node of the current network device node;

[0008] The network routing model is trained by reinforcement learning, using Traveling Salesman Problem (TSP) instances as training samples. The device nodes in the network correspond to the city nodes in the TSP. The transmission delay of data packets from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay, which corresponds to minimizing the travel distance.

[0009] Optionally, the network routing model is constructed according to the following steps:

[0010] Construct a training dataset from instances of the Traveling Salesman Problem, wherein each training sample in the training dataset is a shortest routing path composed of multiple network device nodes;

[0011] Introduce the Critic network as a benchmark for comparison with the Actor network in the network routing model;

[0012] Based on the reinforcement learning algorithm, the training data in the training dataset is simultaneously input into the Critic network and the Actor network for training, and the results of the Actor network and the Critic network are obtained respectively.

[0013] The result of the Critic network is subtracted from the result of the Actor network to obtain the loss function value. The parameters of the network routing model are updated based on the automatic derivative of the loss function value with respect to the parameters of the network routing model. After training, the trained network routing model is obtained.

[0014] Optionally, the network routing model selects any network device node in the shortest routing path based on an attention mechanism in the following manner:

[0015] Only one network device node is selected in each time step, and the selected network device node is masked to ensure that the selected network device node is not selected again.

[0016] Optionally, after transmitting the data packet from the current network device node to the next-hop network device node in the shortest routing path, the method further includes:

[0017] Taking the next-hop network device node as the current network device node, return to the step of inputting the information of the source network device node, the target network device node, and the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, and repeat until the data packet is transmitted to the target network device node.

[0018] Optionally, after transmitting the data packet from the current network device node to the next-hop network device node in the shortest routing path, the method further includes:

[0019] If the network condition at the next moment is found to be the same as the network condition at the current moment, then the second network device node in the shortest routing path is selected as the network device node for the next moment, and the data packet is transmitted to the second network device node in the shortest routing path.

[0020] If the network condition at the next moment is found to be inconsistent with the network condition at the current moment, the next-hop network device node is taken as the current network device node, and the method returns to the step of inputting the information of the source network device node, the information of the target network device node, and the information of the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, repeating until the data packet is transmitted to the target network device node.

[0021] Optionally, the method further includes:

[0022] Beam search is used to determine the shortest search path from the current network device node to the target network device node;

[0023] Compare the shortest search path with the shortest routing path;

[0024] The shorter of the shortest search path and the shortest routing path is selected as the optimal path.

[0025] Transmitting data packets from the current network device node to the first network device node in the shortest routing path includes:

[0026] If the optimal path is the shortest routing path, the data packet will be transmitted from the current network device node to the first network device node in the shortest routing path;

[0027] Also includes:

[0028] If the optimal path is the shortest search path, the data packet will be transmitted from the current network device node to the first network device node in the shortest search path.

[0029] A second aspect of this application discloses a real-time network routing optimization apparatus based on reinforcement learning, the apparatus comprising:

[0030] The planning module is used to input the information of the source network device node, the target network device node, and the current network device node into a pre-trained network routing model to obtain the shortest route path from the current network device node to the target network device node output by the network routing model.

[0031] The transmission module is used to transmit data packets from the current network device node to the first network device node in the shortest routing path, wherein the first network device node is the next-hop network device node of the current network device node;

[0032] The network routing model is trained by reinforcement learning, using Traveling Salesman Problem (TSP) instances as training samples. The device nodes in the network correspond to the city nodes in the TSP. The transmission delay of data packets from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay, which corresponds to minimizing the travel distance.

[0033] A third aspect of this application discloses an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the real-time network routing optimization method based on reinforcement learning as described in the first aspect of this application.

[0034] A fourth aspect of this application discloses a computer-readable storage medium storing a computer program / instructions thereon, which, when executed by a processor, implements the real-time network routing optimization method based on reinforcement learning as described in the first aspect of this application.

[0035] A fifth aspect of this application discloses a computer program product including computer-readable code that, when executed, implements the reinforcement learning-based real-time network routing optimization method as described in the first aspect of this application.

[0036] The embodiments of this application have the following advantages:

[0037] In this embodiment, the network routing model obtains the shortest route from the current network device node to the target network device node based on the information of the source network device node, the target network device node, and the current network device node. Then, the data packet is transmitted from the current network device node to the first network device node in the shortest route, thus achieving network routing selection. Since the data packet only needs to be transmitted to the first network device node in the shortest route, the problem of the shortest route changing due to dynamic changes in network status during data packet transmission is avoided. Furthermore, the network routing model is trained by reinforcement learning, using Traveling Salesman Problem instances as training samples. Routing optimization is therefore achieved by transforming the network routing problem into a Traveling Salesman Problem. At the same time, reinforcement learning can better perceive the dynamics and real-time nature of the network environment. The method thus achieves dynamic, real-time, and adaptive selection of the optimal route in complex network environments, thereby improving the overall network communication efficiency.

Attached Figure Description

[0038] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0039] Figure 1 is a flowchart illustrating the steps of a real-time network routing optimization method based on reinforcement learning, as provided in an embodiment of this application;

[0040] Figure 2 is a schematic diagram illustrating the construction process of a network routing model provided in an embodiment of this application;

[0041] Figure 3 is a network structure diagram of an attention mechanism provided in an embodiment of this application;

[0042] Figure 4 is a schematic diagram of the structure of a real-time network routing optimization device based on reinforcement learning provided in an embodiment of this application.

Detailed Implementation

[0043] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0044] Referring to Figure 1, which shows a flowchart of the steps of a real-time network routing optimization method based on reinforcement learning provided in an embodiment of this application, the method may specifically include steps S110 and S120:

[0045] Step S110: Input the information of the source network device node, the target network device node, and the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model.

[0046] In this embodiment, a network device node refers to a router or switch node in a network. A network contains multiple network device nodes connected according to a certain topology. Current network device node information refers to information related to all network device nodes in the current network. Specifically, this information includes network device node configuration information and connection relationships among network device nodes. A source network device node is the network device node that sends data packets, and a target network device node is the network device node that receives data packets. Data packets sent by the source network device node need to be forwarded through multiple intermediate network device nodes before reaching the target network device node.

[0047] In practice, the network routing model performs path planning based on the input information of the source network device node, the target network device node, and the current network device node, to obtain the shortest route from the current network device node to the target network device node (i.e., the path with the shortest transmission delay). The current network device node refers to the network device node where the current data packet resides. Specifically, if the data packet has not yet started transmission, the current network device node refers to the source network device node; in this case, the shortest route path refers to the shortest route between the source and target network device nodes. If the data packet has already been transmitted to an intermediate network device node, the current node refers to that intermediate network device node; in this case, the shortest route path refers to the shortest route between the intermediate network device node and the target network device node.

[0048] Step S120: Transmit the data packet from the current network device node to the first network device node in the shortest routing path, where the first network device node is the next-hop network device node of the current network device node.

[0049] The network routing model is trained by reinforcement learning, using Traveling Salesman Problem (TSP) instances as training samples. The device nodes in the network correspond to the city nodes in the TSP. The transmission delay of data packets from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay, which corresponds to minimizing the travel distance.

[0050] In this embodiment of the application, considering that the network state is dynamically changing, after obtaining the shortest route path from the current network device node to the target network device node, it is only necessary to transmit the data packet to the first network device node in the shortest route path, thereby avoiding the problem that the shortest route path changes due to the dynamic changes in the network state during the data packet transmission process.

[0051] In this model, the first network device node represents the network device node that sends the data packet, i.e., the source network device node, and the second network device node represents the network device node that receives the data packet, i.e., the target network device node. Based on the similarity between the Traveling Salesman Problem (TSP) and the network routing problem, and using heuristic solutions to the TSP, the network routing problem is transformed into a TSP for solution. Simultaneously, deep reinforcement learning combines the intelligent perception capabilities of deep learning with the autonomous decision-making capabilities of reinforcement learning. It can achieve direct control from the original input to the output through end-to-end autonomous learning, and can dynamically meet the needs of constantly changing network states in real time. Therefore, the method based on the embodiments of this application can better perceive the dynamics and real-time nature of the network environment.
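The correspondence between the two problems can be illustrated with a minimal sketch. It assumes that link delays are stored in a hypothetical dictionary `delay_ms` keyed by directed node pairs; the names `delay_ms` and `path_delay` and all delay values are illustrative and do not come from the application.

```python
from typing import Dict, List, Tuple

# Hypothetical per-link transmission delays in milliseconds (illustrative values).
# Each network device node plays the role of a TSP city, and each link delay
# plays the role of the travel distance between two cities.
delay_ms: Dict[Tuple[str, str], float] = {
    ("A", "B"): 2.0, ("B", "C"): 3.0, ("C", "D"): 1.5,
    ("D", "E"): 2.5, ("B", "F"): 1.0, ("F", "D"): 2.0,
}


def path_delay(path: List[str], delays: Dict[Tuple[str, str], float]) -> float:
    """Total transmission delay of a path, i.e. the quantity to be minimized."""
    return sum(delays[(u, v)] for u, v in zip(path, path[1:]))


# Two candidate routes from A to E; the objective picks the one with less delay.
candidates = [["A", "B", "C", "D", "E"], ["A", "B", "F", "D", "E"]]
best = min(candidates, key=lambda p: path_delay(p, delay_ms))
```

Here the objective function is simply the sum of link delays along a path, mirroring the sum of travel distances in a TSP tour segment.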

[0052] In an optional embodiment, in order to adaptively select the best route in a complex network environment, after the data packet is transmitted in step S120 from the current network device node to the next-hop network device node in the shortest route path, the method further includes step S130:

[0053] Step S130: Using the next-hop network device node as the current network device node, return to step S110: Input the information of the source network device node, the target network device node, and the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.

[0054] In step S120, to avoid the shortest route path changing due to dynamic changes in network status during data packet transmission, the data packet is only transmitted to the first network device node in the shortest route path. Then, in step S130, a new shortest route path planning is started with the first network device node (next-hop network device node) in the shortest route path as the current network device node to obtain a new shortest route path. The data packet is then transmitted to the first network device node in the new shortest route path, thereby achieving real-time route optimization.

[0055] For example, if the shortest route path is obtained as A→B→C→D→E in step S110 (A is the current network device node, and E is the target network device node), then after transmitting the data packet from network device node A to network device node B in step S120, step S130 is executed. Network device node B is taken as the current network device node, and step S110 is executed to perform a new shortest route path planning, resulting in a new shortest route path B→F→D→E. The data packet is then transmitted from network device node B to network device node F. The above method is repeated until the data packet is transmitted to the target network device node E.
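The plan, forward one hop, replan loop of steps S110 to S130 can be sketched as follows. This is a sketch under stated assumptions: Dijkstra's algorithm stands in for the trained network routing model, which the application does not specify at code level, and the names `plan_shortest_path`, `route_hop_by_hop`, `graph1`, and `graph2` and the `max_hops` guard are hypothetical. The two graph snapshots are chosen so that the network changes after the first hop, reproducing the A→B then B→F→D→E example above.

```python
import heapq
from typing import Callable, Dict, List

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: link delay}


def plan_shortest_path(graph: Graph, current: str, target: str) -> List[str]:
    """Stand-in planner (Dijkstra); the application uses a trained model here."""
    best = {current: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, current)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, delay in graph.get(node, {}).items():
            cand = dist + delay
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    if target not in best:
        raise ValueError(f"no path from {current} to {target}")
    path = [target]
    while path[-1] != current:
        path.append(prev[path[-1]])
    return path[::-1]


def route_hop_by_hop(get_graph: Callable[[], Graph], source: str, target: str,
                     max_hops: int = 64) -> List[str]:
    """Plan (S110), forward to the next hop only (S120), then replan (S130)."""
    current, visited = source, [source]
    while current != target:
        if len(visited) > max_hops:
            raise RuntimeError("hop limit exceeded")
        path = plan_shortest_path(get_graph(), current, target)  # S110
        current = path[1]                                        # S120
        visited.append(current)                                  # S130 loops back
    return visited


# Snapshot before the first hop: A->B->C->D->E is the shortest path.
graph1: Graph = {"A": {"B": 1.0}, "B": {"C": 1.0, "F": 3.0}, "C": {"D": 1.0},
                 "F": {"D": 3.0}, "D": {"E": 1.0}, "E": {}}
# Snapshot after the first hop: link B->C has degraded, so B->F->D->E wins.
graph2: Graph = {"A": {"B": 1.0}, "B": {"C": 10.0, "F": 3.0}, "C": {"D": 1.0},
                 "F": {"D": 3.0}, "D": {"E": 1.0}, "E": {}}

snapshots = iter([graph1] + [graph2] * 10)
visited = route_hop_by_hop(lambda: next(snapshots), "A", "E")
```

Because only the first hop of each planned path is ever used, a change in link delays between hops is picked up at the next planning call without any special handling.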

[0056] In this embodiment, the shortest route is planned by using the network device node where the data packet resides (the next-hop network device node) as the current network device node. This method of planning and transmitting data node by node effectively avoids the problem in related technologies where data transmission is directly based on the shortest route, leading to changes in the shortest route due to dynamic changes in network conditions and increased transmission latency. This enables dynamic, real-time, and adaptive selection of the optimal route in complex network environments, improving the overall network communication efficiency.

[0057] In one alternative embodiment, the network routing model selects any network device node in the shortest routing path based on an attention mechanism in the following manner: only one network device node is selected per time step, and the selected network device node is masked to ensure that the selected network device node is not selected repeatedly.

[0058] In this embodiment, only one network device node is selected per time step, specifically meaning that only one network device node is selected for data packet transmission at each time point. A previously selected network device node refers to a network device node that has already transmitted data packets. Previously selected network device nodes are masked, and therefore, they will not be selected again during the next shortest path planning. For example, if the shortest path obtained in step S110 is A→B→C→D→E (A is the current network device node, and E is the target network device node), and in step S120, the data packet is transmitted from network device node A to network device node B, then network device node A is a previously selected network device node. When shortest path planning is performed again, network device node A will not be selected again for shortest path planning.

[0059] Specifically, the network routing model selects any network device node in the shortest path based on an attention mechanism, according to the probability of each network device node being selected. For example, if there are 10 network device nodes, during the first shortest path planning, the corresponding network device node is selected based on the selection probability of each of the 10 network device nodes. For network device nodes that have already been selected, their selection probability is set to 0, so that they will not be selected in the next shortest path planning.
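The masking step can be sketched as a softmax over per-node scores in which already-selected nodes receive probability 0. The scores below are hypothetical stand-ins for the output of the attention mechanism, and the function name `masked_selection_probs` is illustrative; the application does not give the scoring formula.

```python
import math
from typing import List, Set


def masked_selection_probs(scores: List[float], selected: Set[int]) -> List[float]:
    """Softmax over node scores, with already-selected nodes forced to 0."""
    masked = [-math.inf if i in selected else s for i, s in enumerate(scores)]
    finite = [m for m in masked if m != -math.inf]
    if not finite:
        raise ValueError("all nodes are masked")
    peak = max(finite)  # subtract the maximum for numerical stability
    exps = [0.0 if m == -math.inf else math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, 2.0, 1.0, 0.1]  # hypothetical attention scores for 4 nodes
selected = {1}                 # node 1 was chosen in an earlier time step
probs = masked_selection_probs(scores, selected)
# Greedy choice among the remaining nodes; sampling from probs is the alternative.
next_node = max(range(len(probs)), key=probs.__getitem__)
```

Setting the masked score to negative infinity before normalization is what guarantees a selection probability of exactly 0, so the remaining probabilities still sum to 1.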

[0060] In an optional embodiment, to achieve efficient shortest path planning and further improve the communication efficiency of the entire network, different shortest path planning methods are flexibly adopted for data packet transmission according to changes in the network. Specifically, after the data packet is transmitted in step S120 from the current network device node to the next-hop network device node in the shortest path, the method further includes steps S140 and S150:

[0061] Step S140: If the network condition at the next moment is consistent with the network condition at the current moment, then the second network device node in the shortest routing path is selected as the network device node for the next moment, and the data packet is transmitted to the second network device node in the shortest routing path.

[0062] Step S150: If the network condition at the next moment is inconsistent with the network condition at the current moment, take the next-hop network device node as the current network device node and return to step S110: Input the information of the source network device node, the information of the target network device node, and the information of the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.

[0063] In this embodiment, the current network condition refers to the network condition at the time of shortest path planning, i.e., the network condition corresponding to the current shortest path. The next network condition refers to the network condition at the next time shortest path planning is performed. When it is detected that the next network condition is consistent with the current network condition, step S140 is executed. The consistency between the next and current network conditions indicates that the network has not changed, and the shortest path at this time is still the shortest path applicable to the next network condition. Therefore, there is no need to re-plan the shortest path at the next time; the second network device node in the current shortest path is directly selected as the network device node for the next time, and the data packet is transmitted to the second network device node.

[0064] For example, if the shortest route path obtained in step S110 is A→B→C→D→E (A is the current network device node, and E is the target network device node), then after transmitting the data packet from network device node A to network device node B in step S120, if no change in network conditions is detected, network device node C is directly selected as the chosen network device node, and the data packet is transmitted from network device node B to network device node C. Since there is no need to replan the shortest route path in the next moment, and the network device nodes in the current shortest route path are directly used for data packet transmission, the efficiency of shortest route path planning is improved.

[0065] If the network condition at the next time step is inconsistent with the current network condition, step S150 is executed. The inconsistency indicates a network change, meaning the current shortest path is not the same as the shortest path corresponding to the next time step's network condition. Therefore, the next-hop network device node needs to be used as the current network device node for shortest path planning. By replanning the shortest path when the network condition changes, the problem of increased packet transmission delay caused by dynamic changes in network status during data packet transmission is avoided.

[0066] In this embodiment, different shortest route planning methods are flexibly adopted for data packet transmission based on changes in the network conditions. If the network conditions at the next moment are the same as the current moment, there is no need to replan the shortest route; data packets are directly transmitted using network device nodes within the current shortest route, thus improving the efficiency of shortest route planning. If the network conditions at the next moment are different from the current moment, a new shortest route is replanned, enabling dynamic, real-time, and adaptive selection of the optimal route in complex network environments, thereby improving the overall communication efficiency of the network.
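The choice between steps S140 and S150 can be sketched as a loop that reuses the remainder of the last planned path while the observed network condition is unchanged, and replans otherwise. The planner and the state observer are passed in as callables; the names `route_with_conditional_replan`, `observe_state`, and `tables`, the state labels `"s0"` and `"s1"`, and the `max_hops` guard are hypothetical, and a lookup table stands in for the trained model.

```python
from typing import Callable, Hashable, List, Tuple

Planner = Callable[[Hashable, str, str], List[str]]


def route_with_conditional_replan(plan: Planner,
                                  observe_state: Callable[[], Hashable],
                                  source: str, target: str,
                                  max_hops: int = 64) -> Tuple[List[str], int]:
    """Return the nodes visited and the number of planning calls made."""
    current, visited = source, [source]
    remaining: List[str] = []  # unconsumed tail of the last planned path
    planned_state = None       # network condition the last plan was made under
    plan_calls = 0
    while current != target:
        if len(visited) > max_hops:
            raise RuntimeError("hop limit exceeded")
        state = observe_state()
        if not remaining or state != planned_state:
            # S150 (or the initial S110): condition changed, so replan.
            remaining = plan(state, current, target)[1:]
            planned_state = state
            plan_calls += 1
        # S140: condition unchanged, so take the next node of the existing path.
        current = remaining.pop(0)
        visited.append(current)
    return visited, plan_calls


# Hypothetical planner output per network condition and current node.
tables = {"s0": {"A": ["A", "B", "C", "D", "E"]},
          "s1": {"C": ["C", "G", "E"]}}
states = iter(["s0", "s0", "s1", "s1"])  # condition changes before the third hop
visited, plan_calls = route_with_conditional_replan(
    lambda state, cur, tgt: list(tables[state][cur]),
    lambda: next(states), "A", "E")
```

In this run four hops are taken but the planner is invoked only twice, which is the efficiency gain the embodiment describes.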

[0067] In an optional embodiment, to improve the accuracy of shortest route planning, beam search is combined with shortest route planning. Specifically, the method further includes steps A1 to A5:

[0068] Step A1: Use beam search to determine the shortest search path from the current network device node to the target network device node.

[0069] Step A2: Compare the shortest search path with the shortest routing path.

[0070] Step A3: Select the shorter path from the shortest search path and the shortest routing path as the optimal path.

[0071] Step A4: If the optimal path is the shortest routing path, the data packet is transmitted from the current network device node to the first network device node in the shortest routing path.

[0072] Step A5: If the optimal path is the shortest search path, the data packet is transmitted from the current network device node to the first network device node in the shortest search path.

[0073] In step A1, the beam search method is used to find the shortest search path from the current network device node to the target network device node based on the current network device node, the target network device node, and the available network device nodes in the network. Then, the shortest search path determined in step A1 by beam search is compared with the shortest routing path obtained in step S110 from the network routing model to determine the optimal path, i.e., the path with the smaller data packet transmission delay. If the optimal path is the shortest routing path, step A4 is executed; otherwise, step A5 is executed.

[0074] For example, A is the current network device node, and B is the target network device node. The shortest search path obtained in step A1 is A→E→D→C→B, and the shortest routing path obtained in step S110 is A→F→D→H→B. The transmission delay T1 of the data packet on the shortest search path A→E→D→C→B and the transmission delay T2 on the shortest routing path A→F→D→H→B are estimated, and the path with the smaller delay is selected as the optimal path. If T1 is smaller, the data packet is transmitted from the current network device node A to the network device node E; if T2 is smaller, the data packet is transmitted from the current network device node A to the network device node F.
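Steps A1 to A5 can be sketched with a simple beam search over partial paths ranked by accumulated delay. The graph, its link delays, the beam width, and the names `beam_search_path` and `pick_next_hop` are assumptions made for illustration; the application does not specify how the beam is scored. The model's path is supplied as a fixed list rather than produced by a trained model.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: link delay}


def path_delay(graph: Graph, path: List[str]) -> float:
    """Total delay along a path."""
    return sum(graph[u][v] for u, v in zip(path, path[1:]))


def beam_search_path(graph: Graph, start: str, target: str, beam_width: int = 2,
                     max_hops: int = 16) -> Optional[Tuple[float, List[str]]]:
    """Step A1: keep the beam_width lowest-delay partial paths at each depth."""
    beam: List[Tuple[float, List[str]]] = [(0.0, [start])]
    best: Optional[Tuple[float, List[str]]] = None
    for _ in range(max_hops):
        candidates = []
        for delay, path in beam:
            for nxt, link in graph.get(path[-1], {}).items():
                if nxt in path:
                    continue  # do not revisit a node
                cand = (delay + link, path + [nxt])
                if nxt == target:
                    if best is None or cand[0] < best[0]:
                        best = cand
                else:
                    candidates.append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:beam_width]
    return best


def pick_next_hop(graph: Graph, model_path: List[str],
                  search: Optional[Tuple[float, List[str]]]) -> str:
    """Steps A2 to A5: forward along whichever path has the smaller delay."""
    if search is not None and search[0] < path_delay(graph, model_path):
        return search[1][1]  # A5: first node of the shortest search path
    return model_path[1]     # A4: first node of the shortest routing path


graph: Graph = {"A": {"E": 1.0, "F": 2.0}, "E": {"D": 1.0}, "F": {"D": 1.0},
                "D": {"C": 1.0, "H": 3.0}, "C": {"B": 1.0}, "H": {"B": 1.0},
                "B": {}}
model_path = ["A", "F", "D", "H", "B"]     # assumed output of step S110
search = beam_search_path(graph, "A", "B")  # step A1
next_hop = pick_next_hop(graph, model_path, search)
```

With these illustrative delays the search path has T1 = 4.0 and the routing path has T2 = 7.0, so the packet is forwarded to E.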

[0075] In this embodiment, the beam search method and the network routing planning method of the network routing model are combined to select the shorter path of the two methods as the optimal path, thereby improving the accuracy of the shortest route planning, reducing the transmission delay of data packets, and improving the communication efficiency of the entire network.

[0076] In an optional embodiment, the network routing model is a pre-built model for shortest route planning, said network routing model being constructed according to steps B1 to B4:

[0077] Step B1: Construct a training dataset with examples of the Traveling Salesman Problem, where each training data point in the training dataset is a shortest route path consisting of multiple network device nodes.

[0078] Step B2: Introduce the Critic network as a benchmark for comparison with the Actor network in the network routing model.

[0079] Step B3: Based on the reinforcement learning algorithm, the training data in the training dataset is simultaneously input into the Critic network and the Actor network for training, and the results of the Actor network and the Critic network are obtained respectively.

[0080] Step B4: Subtract the result of the Critic network from the result of the Actor network to obtain the loss function value. Update the parameters of the network routing model based on the automatic derivative of the loss function value with respect to the parameters of the network routing model. After training, the trained network routing model is obtained.
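The loss computation of step B4 can be sketched as follows. Only the subtraction (Actor result minus Critic result) comes from the text; weighting it by the log-probability of the sampled route, and training the Critic with a mean-squared error, are assumptions borrowed from standard policy-gradient practice. All function names are hypothetical, and the gradient step itself is left to an automatic differentiation framework.

```python
from typing import List, Sequence


def advantage(actor_delays: Sequence[float], critic_estimates: Sequence[float]) -> List[float]:
    """Actor result minus Critic result for each training instance (step B4)."""
    return [a - c for a, c in zip(actor_delays, critic_estimates)]


def policy_loss(actor_delays: Sequence[float], critic_estimates: Sequence[float],
                log_probs: Sequence[float]) -> float:
    """Assumed surrogate loss: mean of advantage times log-probability of the sampled route."""
    adv = advantage(actor_delays, critic_estimates)
    return sum(a * lp for a, lp in zip(adv, log_probs)) / len(adv)


def critic_loss(actor_delays: Sequence[float], critic_estimates: Sequence[float]) -> float:
    """Assumed mean-squared error pulling the Critic toward the observed route delays."""
    adv = advantage(actor_delays, critic_estimates)
    return sum(a * a for a in adv) / len(adv)
```

Using the Critic as a baseline in this way lowers the variance of the gradient estimate: routes that are worse than the Critic expected are made less likely, and routes that are better are made more likely.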

[0081] In this embodiment, the network routing optimization problem is transformed into a Traveling Salesman Problem (TSP). That is, the device nodes in the network correspond to the city nodes in the TSP, and the transmission delay of a data packet from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay, which corresponds to minimizing the travel distance. TSP instances are used as training data for the network routing model. The network routing model is based on the Asynchronous Advantage Actor-Critic (A3C) algorithm, and its main theoretical basis is the Markov Decision Process, which allows the model to better perceive the dynamic and real-time nature of the network environment. Specifically, at time t, a training data item is randomly sampled from the training dataset and input into the network routing model being trained. The action (the next-hop network device node for data packet transmission) for time t+1 is then selected based on the network state distribution at time t (the probability of each network device node being selected).
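The correspondence between the routing problem and the TSP can be sketched as a small data structure. The class `RoutingInstance` and its methods are hypothetical names, and the exhaustive search is included only to make the objective concrete for tiny instances; it is what the trained model is meant to approximate, not how the model works.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class RoutingInstance:
    """Hypothetical container: nodes play the role of TSP cities, delays the role of distances."""
    nodes: List[str]
    delay: Dict[Tuple[str, str], float]  # (from_node, to_node) -> transmission delay

    def cost(self, route: List[str]) -> float:
        """Objective value: total transmission delay along the route (inf if a link is missing)."""
        return sum(self.delay.get((u, v), float("inf")) for u, v in zip(route, route[1:]))

    def brute_force_best(self, source: str, target: str) -> List[str]:
        """Exhaustively minimise the objective; only feasible for tiny instances."""
        middle = [n for n in self.nodes if n not in (source, target)]
        routes = ([source, *perm, target]
                  for k in range(len(middle) + 1)
                  for perm in permutations(middle, k))
        return min(routes, key=self.cost)
```

The number of candidate routes grows factorially with the number of nodes, which is why a learned policy is used in place of exhaustive search.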

[0082] Furthermore, the network routing model employs an encoder-decoder structure and uses an attention mechanism for network device node selection. This attention mechanism ensures that only one network device node is selected per time step, and previously selected network device nodes are masked to prevent them from being selected again. Specifically, the shortest route planning process is as follows: the encoder encodes the input training data into feature vectors and performs attention operations on these feature vectors to obtain attention-weighted feature vectors, which represent the probability of each node being selected; then, the decoder decodes the attention-weighted feature vectors to obtain the shortest route path.
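The one-node-per-step selection with masking can be sketched as below. The scoring function `score_fn` stands in for the attention scores of the trained model and is purely hypothetical, as are the function names; the greedy choice of the most probable node is an assumption, and sampling from the distribution is an equally common alternative.

```python
import math
from typing import Callable, List, Sequence


def masked_softmax(scores: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over scores where masked (already selected) nodes get probability 0."""
    live = [s for s, m in zip(scores, mask) if not m]
    peak = max(live)  # subtract the max for numerical stability
    exps = [0.0 if m else math.exp(s - peak) for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(score_fn: Callable[[int, int], float], n_nodes: int,
                  start: int, target: int) -> List[int]:
    """Select one node per time step, masking selected nodes, until the target is chosen."""
    route = [start]
    mask = [False] * n_nodes
    mask[start] = True
    while route[-1] != target:
        scores = [score_fn(route[-1], j) for j in range(n_nodes)]
        probs = masked_softmax(scores, mask)
        nxt = max(range(n_nodes), key=lambda j: probs[j])  # greedy choice
        route.append(nxt)
        mask[nxt] = True  # a selected node can never be selected again
    return route
```

Since every step masks one more node and the target stays unmasked until it is chosen, the loop ends after at most `n_nodes` steps.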

[0083] The calculation of the probability of each network device node being selected includes: 1) calculating the similarity between the decoder query vector and the encoder input data feature vector; 2) normalizing the similarity to obtain the probability of each node being selected. Specifically, this is expressed as follows:

[0084]

[0085] $p(\pi_i \mid \pi_{<i}, X_i) = \mathrm{softmax}(a_i)$

[0086] Where W1, W2, and V are all trainable parameters of the network routing model, $a_i$ is the similarity between the decoder query vector and the encoder input data feature vector, and $p$ is the probability of a network device node being selected.
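The expression for the similarity itself does not appear in paragraph [0084] of this text. As an illustration only, an additive attention score of the kind used in pointer networks, which uses exactly three trainable parameters W1, W2, and V, could take the form below. The symbols $e_j$ (encoder feature vector of node $j$), $q_i$ (decoder query vector at step $i$), and the double index $a_{i,j}$ are assumptions introduced here and are not taken from this application.

```latex
a_{i,j} =
\begin{cases}
V^{\top} \tanh\left(W_1 e_j + W_2 q_i\right), & \text{if node } j \text{ has not been selected before step } i, \\
-\infty, & \text{otherwise,}
\end{cases}
\qquad
p(\pi_i = j \mid \pi_{<i}, X_i) = \frac{\exp(a_{i,j})}{\sum_{k} \exp(a_{i,k})}
```

Setting the score of an already selected node to $-\infty$ gives it probability zero after normalization, which is one way to realize the masking described above.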

[0087] Figure 2 illustrates the construction process of the network routing model. First, the network routing optimization problem is transformed into a Traveling Salesman Problem (TSP), and a training dataset of TSP instances is constructed. This training dataset is simultaneously input into the Actor network and the Critic network in the network routing model for training. The Actor network and the Critic network are solved using the A3C algorithm, with the encoder and decoder processed using an attention mechanism, which yields the outputs (rewards) of the Actor network and the Critic network, respectively. The loss function is calculated from the results of the Actor network and the Critic network, and the parameters of the network routing model are updated based on the loss function to obtain the trained network routing model.

[0088] Figure 3 illustrates the network structure of the attention mechanism in this embodiment. First, the linear embedding layer applies linear embedding to the input network device nodes ($CN_1, \ldots, CN_L$) to obtain an embedding value ($L(CN_1), \ldots, L(CN_L)$) for each network device node. The linear embedding values are summed and then fed into the attention mechanism layer (encoder and decoder) for processing. Then, for each network device node, the similarity $a_i$ between the decoder query vector and the encoder's input data feature vector is calculated, and the intermediate variable $c_i$ is further calculated. Finally, based on the intermediate variables, the probability of each network device node being selected is calculated.

[0089] In this embodiment, the network routing model obtains the shortest route from the current network device node to the target network device node based on the information of the source network device node, the target network device node, and the current network device node. Then, the data packet is transmitted from the current network device node to the first network device node in the shortest route, thereby realizing network routing selection.

[0090] Since the data packet only needs to be transmitted to the first network device node in the shortest route, the problem of the shortest route changing due to dynamic changes in network status during data packet transmission is avoided. Furthermore, the network routing model is based on reinforcement learning and trained using Traveling Salesman Problem instances as training samples, so route optimization is achieved by transforming the network routing problem into a Traveling Salesman Problem. At the same time, reinforcement learning can better perceive the dynamic and real-time nature of the network environment. This enables dynamic, real-time, and adaptive selection of the optimal route in complex network environments, thereby improving the overall communication efficiency of the network.
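The hop-by-hop forwarding described above can be sketched as a loop that re-plans at every node and forwards only to the first node of each planned path. The `Planner` signature stands in for the pre-trained network routing model and is hypothetical, and the `max_hops` limit is an added safeguard that does not come from this application.

```python
from typing import Callable, List

# Hypothetical planner signature: (source, target, current) -> planned path starting at current.
Planner = Callable[[str, str, str], List[str]]


def forward_packet(plan: Planner, source: str, target: str, max_hops: int = 64) -> List[str]:
    """Re-plan at every hop and forward only to the first node of each planned path."""
    current = source
    visited = [current]
    while current != target:
        if len(visited) > max_hops:  # safeguard against routing loops
            raise RuntimeError("hop limit exceeded")
        path = plan(source, target, current)
        current = path[1]  # path[0] is the current node; path[1] is the next hop
        visited.append(current)
    return visited
```

Because the rest of each planned path is discarded after one hop, a change in network status between hops only affects the next call to the planner, not a route that was fixed in advance.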

[0091] Referring to Figure 4, a schematic representation of a real-time network routing optimization device based on reinforcement learning according to an embodiment of this application is shown. As shown in Figure 4, the device includes:

[0092] Planning module 41 is used to input the information of the source network device node, the target network device node, and the current network device node into a pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model.

[0093] Transmission module 42 is used to transmit data packets from the current network device node to the first network device node in the shortest routing path, wherein the first network device node is the next-hop network device node of the current network device node;

[0094] The network routing model is trained using Traveling Salesman Problem (TSP) instances as training samples and trained using reinforcement learning. The device nodes in the network correspond to the city nodes in the TSP. The transmission delay of data packets from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay, which corresponds to minimizing the travel distance.

[0095] In an optional embodiment, the apparatus further includes a construction module for constructing the network routing model, the construction module comprising:

[0096] The first construction submodule is used to construct a training dataset with examples of the Traveling Salesman Problem, wherein each training data in the training dataset is a shortest route path composed of multiple network device nodes;

[0097] The second construction submodule is used to introduce the Critic network as a benchmark for comparison with the Actor network in the network routing model;

[0098] The third construction submodule is used to simultaneously input the training data in the training dataset into the Critic network and the Actor network for training based on the reinforcement learning algorithm, and obtain the results of the Actor network and the results of the Critic network respectively.

[0099] The fourth construction submodule is used to subtract the result of the Critic network from the result of the Actor network as the loss function value, update the parameters of the network routing model according to the loss function value, and obtain the trained network routing model after training.

[0100] In an optional embodiment, the network routing model selects any network device node in the shortest routing path based on an attention mechanism in the following manner:

[0101] Only one network device node is selected in each time step, and the selected network device node is masked to ensure that the selected network device node is not selected again.

[0102] In an optional embodiment, the device further includes:

[0103] The first return module is used to take the next-hop network device node as the current network device node and return to the following step: inputting the information of the source network device node, the information of the target network device node, and the information of the current network device node into a pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.

[0104] In an optional embodiment, the device further includes:

[0105] The sending module is used to transmit data packets to the second network device node in the shortest routing path when the network condition at the next moment is consistent with the network condition at the current moment.

[0106] The second return module is used to, when the network condition at the next moment is detected to be inconsistent with the network condition at the current moment, take the next-hop network device node as the current network device node and return to the following steps: input the information of the source network device node, the target network device node, and the information of the current network device node into a pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.
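The behaviour of the sending module and the second return module can be sketched as below: the packet keeps following the already planned path while the network condition is unchanged, and the model is consulted again only when a change is detected. The `plan` and `network_changed` callables, the function name, and the `max_hops` safeguard are hypothetical; how a change in network condition is detected is not specified here.

```python
from typing import Callable, List, Tuple

# Hypothetical planner signature: (source, target, current) -> planned path starting at current.
Planner = Callable[[str, str, str], List[str]]


def forward_with_change_detection(plan: Planner, network_changed: Callable[[], bool],
                                  source: str, target: str,
                                  max_hops: int = 64) -> Tuple[List[str], int]:
    """Follow the planned path; re-plan only when the network condition has changed."""
    current = source
    visited = [current]
    path = plan(source, target, current)
    replans = 1
    while current != target:
        if len(visited) > max_hops:  # safeguard against routing loops
            raise RuntimeError("hop limit exceeded")
        current = path[1]  # forward to the next hop of the planned path
        path = path[1:]    # the remaining path now starts at the new current node
        visited.append(current)
        if current != target and network_changed():
            # Condition differs from the previous moment: plan again from here.
            path = plan(source, target, current)
            replans += 1
    return visited, replans
```

Compared with re-planning at every hop, this variant saves model calls when the network is stable, at the cost of depending on a reliable change detector.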

[0107] In an optional embodiment, the device further includes:

[0108] The search module is used to determine the shortest search path from the current network device node to the target network device node using beam search;

[0109] The comparison module is used to compare the shortest search path with the shortest routing path;

[0110] The selection module is used to select the shorter path from the shortest search path and the shortest routing path as the optimal path;

[0111] The transmission module includes a first transmission submodule and a second transmission submodule;

[0112] The first transmission submodule is used to transmit data packets from the current network device node to the first network device node in the shortest routing path when the optimal path is the shortest routing path;

[0113] The second transmission submodule is used to transmit data packets from the current network device node to the first network device node in the shortest search path when the optimal path is the shortest search path.

[0114] This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the reinforcement learning-based real-time network routing optimization method described in this application.

[0115] This application also provides a computer-readable storage medium storing a computer program / instruction thereon, which, when executed by a processor, implements the reinforcement learning-based real-time network routing optimization method described in this application.

[0116] This application also provides a computer program product, including computer-readable code, which, when executed, implements the reinforcement learning-based real-time network routing optimization method described in this application.

[0117] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0118] This application is described with reference to flowcharts and/or block diagrams of methods and apparatus according to embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0119] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0120] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0121] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.

[0122] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0123] The above provides a detailed description of a real-time network routing optimization method and apparatus based on reinforcement learning provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The description of the above embodiments is only for the purpose of helping to understand the method and its core ideas. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A real-time network routing optimization method based on reinforcement learning, characterized in that, The method includes: The information of the source network device node, the target network device node, and the current network device node are input into a pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model. The data packet is transmitted from the current network device node to the first network device node in the shortest routing path, where the first network device node is the next-hop network device node of the current network device node; The network routing model is trained using Traveling Salesman Problem (TSP) instances as training samples and trained using reinforcement learning. The device nodes in the network correspond to the city nodes in the TSP. The transmission delay time of the data packets transmitted in the network from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay time, which corresponds to minimizing the travel distance. The method further includes, after transmitting the data packet from the current network device node to the next-hop network device node in the shortest routing path: If the network condition at the next moment is found to be the same as the network condition at the current moment, then the second network device node in the shortest routing path is selected as the network device node for the next moment, and the data packet is transmitted to the second network device node in the shortest routing path. 
If the network condition at the next moment is found to be inconsistent with the network condition at the current moment, the next-hop network device node is taken as the current network device node, and the method returns to the following step: inputting the information of the source network device node, the information of the target network device node, and the information of the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.

2. The method of claim 1, wherein, The network routing model is constructed according to the following steps: Construct a training dataset with examples of the Traveling Salesman Problem, wherein each training data in the training dataset is a shortest route path composed of multiple network device nodes; Introduce the Critic network as a benchmark for comparison with the Actor network in the network routing model; Based on the reinforcement learning algorithm, the training data in the training dataset is simultaneously input into the Critic network and the Actor network for training, and the results of the Actor network and the Critic network are obtained respectively. The result of the Critic network is subtracted from the result of the Actor network to obtain the loss function value. The parameters of the network routing model are updated based on the automatic derivative of the loss function value with respect to the parameters of the network routing model. After training, the trained network routing model is obtained.

3. The method of claim 1, wherein, The network routing model selects any network device node in the shortest routing path based on an attention mechanism in the following manner: Only one network device node is selected in each time step, and the selected network device node is masked to ensure that the selected network device node is not selected again.

4. The method of claim 1, wherein, After transmitting the data packet from the current network device node to the next-hop network device node in the shortest routing path, the method further includes: Taking the next-hop network device node as the current network device node, the return step is as follows: input the information of the source network device node, the target network device node, and the current network device node into the pre-trained network routing model to obtain the shortest route from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.

5. The method of claim 4, wherein, The method further includes: Beam search is used to determine the shortest search path from the current network device node to the target network device node; Compare the shortest search path with the shortest routing path; The shorter of the shortest search path and the shortest routing path is selected as the optimal path. Transmitting data packets from the current network device node to the first network device node in the shortest routing path includes: If the optimal path is the shortest routing path, the data packet will be transmitted from the current network device node to the first network device node in the shortest routing path; Also includes: If the optimal path is the shortest search path, the data packet will be transmitted from the current network device node to the first network device node in the shortest search path.

6. A real-time network routing optimization apparatus based on reinforcement learning, characterized in that, The device includes: The planning module is used to input the information of the source network device node, the target network device node, and the current network device node into a pre-trained network routing model to obtain the shortest route path from the current network device node to the target network device node output by the network routing model. The transmission module is used to transmit data packets from the current network device node to the first network device node in the shortest routing path, wherein the first network device node is the next-hop network device node of the current network device node; The network routing model is trained using Traveling Salesman Problem (TSP) instances as training samples and trained using reinforcement learning. The device nodes in the network correspond to the city nodes in the TSP. The transmission delay time of the data packets transmitted in the network from the first network device node to the second network device node corresponds to the travel distance from the starting city node to the ending city node in the TSP. The objective function is to minimize the transmission delay time, which corresponds to minimizing the travel distance. The method further includes, after transmitting the data packet from the current network device node to the next-hop network device node in the shortest routing path: If the network condition at the next time step is consistent with the network condition at the current time step, then the second network device node in the shortest routing path is selected as the network device node for the next time step, and the data packet is transmitted to the second network device node in the shortest routing path. 
If the network condition at the next time step is inconsistent with the network condition at the current time step, then the next-hop network device node is selected as the current network device node, and the process returns to the following step: inputting the information of the source network device node, the target network device node, and the current network device node into the pre-trained network routing model to obtain the shortest routing path from the current network device node to the target network device node output by the network routing model, until the data packet is transmitted to the target network device node.

7. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the real-time network routing optimization method based on reinforcement learning as described in any one of claims 1-5.

8. A computer readable storage medium having stored thereon computer programs / instructions, characterized in that, When the computer program / instruction is executed by the processor, it implements the real-time network routing optimization method based on reinforcement learning as described in any one of claims 1-5.

9. A computer program product comprising computer readable code, characterized in that, When the computer-readable code is executed, it implements the real-time network routing optimization method based on reinforcement learning as described in any one of claims 1-5.