Method and device for multi-machine scheduling of train inspection robots based on genetic algorithm and reinforcement learning, equipment and medium
By using a multi-machine scheduling method based on genetic algorithms and reinforcement learning, the path and task allocation of the train inspection robot are optimized, solving the problem of coordinated scheduling of multiple robots in rail transit and improving the efficiency of task execution and problem solving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TIANJIN PINGGAO SHUZHI ELECTRICAL EQUIPMENT CO LTD
- Filing Date
- 2026-05-28
- Publication Date
- 2026-06-23
AI Technical Summary
In rail transit, the collaborative scheduling problem of multiple train inspection robots suffers from high computational resource consumption and low solution efficiency in complex environments, making it difficult to apply to real-world scenarios.
A multi-machine scheduling method based on genetic algorithm and reinforcement learning is adopted. By constructing a fitness evaluation function, initializing and screening the population with genetic algorithm, and constructing a reward function and Q-value function with reinforcement learning framework, a scheduling strategy is generated to optimize the path and task allocation of the inspection robot.
It improves the operational efficiency of multiple train inspection robots in complex environments, reduces computational resource consumption, and enhances solution efficiency.
Figure CN122264489A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent operation and maintenance of rail transit trains, and in particular to a multi-machine scheduling method, device, equipment and medium for train inspection robots based on genetic algorithms and reinforcement learning.
Background Technology
[0002] As emerging information technologies such as 5G communication, AI, cloud computing, blockchain and big data mature and are deployed in engineering construction, the demand for intelligent and efficient operation and maintenance in the rail transit industry is increasing. Because a subway inspection depot has multiple inspection lines, and the structure of the underside, sides and other areas of a subway car is extremely complex, it is difficult for a single robot to fully cover the inspection needs. Multiple robots are therefore often needed to work together to complete the maintenance work.
[0003] For the robot cooperative scheduling problem, some studies are limited to small-scale scenarios with a limited number of defect types and inspection robots. Once the problem scale increases, the number of candidate solutions grows exponentially, and finding an exact solution for the model takes a significant amount of time. Other studies have improved the solution efficiency to some extent but still consume substantial computational resources, making them difficult to apply directly to real-world scenarios.
[0004] Therefore, how to improve the operational efficiency of multiple inspection robots in complex environments is a pressing technical problem that needs to be solved.
Summary of the Invention
[0005] In view of this, the purpose of this invention is to provide a multi-machine scheduling method, apparatus, device, and medium for train inspection robots based on genetic algorithms and reinforcement learning, which can improve the operational efficiency of multiple train inspection robots in complex environments. The specific solution is as follows:
Firstly, this application provides a multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning, including:
A fitness evaluation function of the train inspection robot is constructed based on the path length of the train inspection robot's travel path, the execution time of the inspection task, and the power consumption;
The inspection robot population is initialized using a genetic algorithm to obtain an initial population, an offspring population is generated based on the initial population, and the offspring population is screened based on the fitness evaluation function to obtain the target population;
A reward function is constructed using a reinforcement learning framework based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population. A Q-value function is then constructed using the reward function, the state set corresponding to the target population, and the action set. This Q-value function is used to generate a scheduling strategy for the inspection robot, and the scheduling strategy is used to perform multi-robot scheduling of the inspection robot. The action set is a set configured with different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and a set representing the population diversity of the target population using Hamming distance.
[0006] Optionally, constructing the fitness evaluation function of the train inspection robot based on the path length of the robot's travel path, the execution time of the inspection task, and the power consumption includes: The corresponding path length is determined based on the set of stopping points of the inspection robot on its travel path; The stopping time of the train inspection robot at the target inspection point on the target train is determined, and the travel time of the train inspection robot is determined based on the spacing of the target inspection points and the number of stopping points, so as to determine the execution time of the corresponding inspection task based on the stopping time and the travel time; The corresponding power consumption is determined based on the travel energy consumption of the inspection robot per preset unit distance, the spacing, the power of the robot's robotic arm and camera, and the stopping time; The fitness evaluation function of the inspection robot is constructed using the path length, execution time, power consumption, and a preset adjustable factor.
[0007] Optionally, the step of initializing the population of the inspection robot to obtain an initial population includes: The number of inspection points at the undercarriage of the target train and the number of dispatchable train inspection robots are set to complete the first initialization operation; The second initialization operation is performed by initializing the population of the inspection robot, including its parameters, population size, maximum number of iterations, crossover probability, and mutation probability. Based on the first initialization operation and the second initialization operation, a corresponding initial population is obtained, and the current fitness value of the initial population is determined based on the fitness evaluation function. The current Hamming distance of the initial population is determined so as to determine the current Hamming distance as the initial state of the initial population.
[0008] Optionally, the step of generating an offspring population based on the initial population and screening the offspring population based on the fitness evaluation function to obtain the target population includes: The scene features where the inspection robot is located are determined, and the scene features are mapped to selection operators, crossover operators, and mutation operators. Based on a greedy strategy and the selection operators, crossover operators, and mutation operators, selection operations, crossover operations, and mutation operations are performed on the initial population to obtain the offspring population. The offspring population is screened based on the fitness evaluation function to obtain a screened population. The selection operation, the crossover operation, and the mutation operation are then iteratively performed on the screened population until the preset convergence criterion or the preset iteration limit is met, at which point the iteration stops and the corresponding target population is obtained.
[0009] Optionally, constructing a reward function based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population includes: Based on the iteration progress of the target population and the action set corresponding to the target population, an action reward is constructed. A diversity reward is constructed based on the average Hamming distance of the target population and the iteration progress. An evolutionary reward is constructed based on the fitness value of the target population; Collision penalty is constructed based on the preset maximum moving speed and current moving speed of the inspection robot, as well as a preset fixed value; A reward function is constructed using the action reward, the diversity reward, the evolution reward, and the collision penalty.
[0010] Optionally, constructing the Q-value function using the reward function, the state set corresponding to the target population, and the action set includes: The target reward value corresponding to the target action in the action set is determined based on the reward function. Determine the target state in the state set corresponding to the target population, and determine the maximum Q value corresponding to the target state based on a preset Q table; A Q-value function is constructed based on the state set, the action set, the preset learning rate, the target reward value, the preset discount factor, and the maximum Q-value.
[0011] Optionally, the reinforcement learning framework is a framework that includes a current value network, a target value network, and a loss function for updating the target parameters in the current value network; The current value network is a network that receives the current state corresponding to the target population and outputs the Q value corresponding to the current state. The target value network is a network that receives the next state of the current state and outputs the Q value corresponding to the next state.
[0012] Secondly, this application provides a multi-machine scheduling device for train inspection robots based on genetic algorithms and reinforcement learning, comprising:
The function construction module is used to construct the fitness evaluation function of the inspection robot based on the path length of the inspection robot's travel path, the execution time of the inspection task, and the power consumption;
The population determination module is used to initialize the inspection robot population using a genetic algorithm to obtain an initial population, generate an offspring population based on the initial population, and screen the offspring population based on the fitness evaluation function to obtain a target population;
The multi-machine scheduling module is used to construct a reward function based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population using a reinforcement learning framework. It then constructs a Q-value function using the reward function, the state set corresponding to the target population, and the action set. The Q-value function is used to generate a scheduling strategy for the inspection robots, and the scheduling strategy is used to perform multi-machine scheduling of the inspection robots. The action set is a set configured with different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and a set that uses the Hamming distance as the population diversity of the target population.
[0013] Thirdly, this application provides an electronic device, comprising: Memory, used to store computer programs; A processor is used to execute the computer program to implement the aforementioned multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning.
[0014] Fourthly, this application provides a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning.
[0015] In this application, a fitness evaluation function for the inspection robot is constructed based on the path length of its travel path, the execution time of its inspection task, and its power consumption. A genetic algorithm is used to initialize the inspection robot population, resulting in an initial population. An offspring population is generated from the initial population and screened based on the fitness evaluation function to obtain a target population. A reward function is constructed using a reinforcement learning framework based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population. A Q-value function is constructed using the reward function, the state set corresponding to the target population, and the action set. The Q-value function is used to generate a scheduling strategy for the inspection robot, and this strategy is used to perform multi-robot scheduling of the inspection robot. The action set is a set configured with different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and a set representing the population diversity of the target population using Hamming distance. In this way, this application can improve the operational efficiency of multiple inspection robots in complex environments.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0017] Figure 1 is a flowchart of a multi-machine scheduling method for a train inspection robot based on genetic algorithm and reinforcement learning disclosed in this application;
Figure 2 is a flowchart of a genetic algorithm disclosed in this application;
Figure 3 is a schematic diagram of a deep reinforcement learning framework disclosed in this application;
Figure 4 is a schematic diagram of the structure of a multi-machine scheduling device for a train inspection robot based on genetic algorithms and reinforcement learning disclosed in this application;
Figure 5 is a structural diagram of an electronic device disclosed in this application.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] Currently, some research on the robot cooperative scheduling problem is limited to small-scale scenarios with a limited number of defect types and inspection robots. As the problem scales up, the number of solutions increases exponentially, requiring significant time to find accurate solutions for the model. Other studies have improved solution efficiency to some extent, but still suffer from computational resource consumption issues, making them difficult to apply directly to real-world scenarios. Therefore, this application provides a multi-robot scheduling method, device, equipment, and medium for train inspection robots based on genetic algorithms and reinforcement learning, which can improve the operational efficiency of multiple train inspection robots in complex environments.
[0020] As shown in Figure 1, this invention discloses a multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning, including:
Step S11: Construct the fitness evaluation function of the train inspection robot based on the path length of the train inspection robot's travel path, the execution time of the inspection task, and the power consumption.
[0021] In this embodiment, the path length first needs to be determined based on the set of stops along the train inspection robot's travel path. Specifically, when performing train inspection tasks, the train inspection robot plans a travel path based on the target train's type, length, and pre-set inspection points. This travel path includes multiple stops, each corresponding to a location where an inspection operation needs to be performed. The path length can be obtained by summing the distances between adjacent stops in the aforementioned set of stops.
[0022] Furthermore, it is necessary to determine the stopping time of the train inspection robot at the target inspection points on the target train, and to determine the travel time of the train inspection robot based on the spacing between the target inspection points and the number of stopping points. The execution time of the corresponding inspection task is then determined based on the stopping time and the travel time. Specifically, the train inspection robot needs to stay at each target inspection point for a certain period of time to complete operations such as image acquisition, data reading, or fault identification. This stopping time can be set according to the specific inspection requirements of the inspection point. Simultaneously, based on the spacing between adjacent target inspection points and the total number of stopping points, the travel time consumed by the train inspection robot moving between different stopping points can be calculated. Adding the stopping time and the travel time together yields the execution time required for the train inspection robot to complete the entire inspection task.
[0023] In this embodiment, the corresponding power consumption also needs to be determined based on the travel energy consumption of the inspection robot per preset unit distance, the spacing, the power of the robot's robotic arm and camera, and the stopping time. Specifically, the inspection robot consumes a corresponding amount of travel energy for each unit distance it moves. Multiplying this per-unit travel energy consumption by the spacing of each segment and summing the results yields the total energy consumption during travel. Furthermore, when the inspection robot performs its inspection task at a stopping point, the robotic arm and camera are operating. The energy consumption during the inspection process can be calculated based on the power of the robotic arm, the power of the camera, and the stopping time. Adding the travel energy consumption to the inspection energy consumption yields the total power consumption of the inspection robot for this inspection task.
[0024] Finally, a fitness evaluation function for the inspection robot is constructed using the path length, execution time, power consumption, and a preset adjustable factor. This fitness evaluation function allows for a quantitative assessment of the scheduling scheme represented by each individual in the genetic algorithm, thus providing a basis for subsequent selection, crossover, and mutation operations.
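The construction described in Step S11 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the formula of this application: the function names, the one-dimensional stop coordinates, the constant travel speed, the weighted-sum form, and the conversion of cost to fitness via 1 / (1 + cost) (so that a higher value is better) are all hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple


def path_length(stops: Sequence[float]) -> float:
    """Sum of distances between adjacent stops (1-D track positions, assumed)."""
    return sum(abs(b - a) for a, b in zip(stops, stops[1:]))


def execution_time(stops: Sequence[float], dwell_times: Sequence[float],
                   speed: float) -> float:
    """Total stopping time plus travel time at a constant speed (assumed)."""
    return sum(dwell_times) + path_length(stops) / speed


def power_consumption(stops: Sequence[float], dwell_times: Sequence[float],
                      energy_per_unit_distance: float,
                      arm_power: float, camera_power: float) -> float:
    """Travel energy plus energy used by the arm and camera while stopped."""
    travel = energy_per_unit_distance * path_length(stops)
    inspection = (arm_power + camera_power) * sum(dwell_times)
    return travel + inspection


def fitness(stops: Sequence[float], dwell_times: Sequence[float], speed: float,
            energy_per_unit_distance: float, arm_power: float,
            camera_power: float,
            weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Hypothetical fitness: reciprocal of a weighted cost, higher is better."""
    w_len, w_time, w_power = weights  # stand-ins for the adjustable factors
    cost = (w_len * path_length(stops)
            + w_time * execution_time(stops, dwell_times, speed)
            + w_power * power_consumption(stops, dwell_times,
                                          energy_per_unit_distance,
                                          arm_power, camera_power))
    return 1.0 / (1.0 + cost)
```

In practice the three terms have different units and magnitudes, so the adjustable factors would also have to normalize them; the sketch leaves that to the caller's choice of `weights`.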
[0025] Step S12: Initialize the inspection robot population using a genetic algorithm to obtain an initial population, generate an offspring population based on the initial population, and screen the offspring population based on the fitness evaluation function to obtain the target population.
[0026] In this embodiment, the number of inspection points on the undercarriage of the target train and the number of dispatchable train inspection robots are set to complete the first initialization operation. Specifically, the undercarriage structure of the target train is complex, including multiple key components such as bogies, braking devices, and traction motors. Each component corresponds to a location point that needs to be inspected. The number of inspection points is set according to the train model and inspection standards. Simultaneously, the number of dispatchable train inspection robots is determined based on the actual scale of the train inspection task and on-site conditions. This number is limited by factors such as the robot's charging station capacity, communication bandwidth, and available space. This first initialization operation lays the foundation for subsequent population coding and scheduling scheme generation.
[0027] Furthermore, the inspection robots need to undergo population initialization, including initializing their parameters, population size, maximum number of iterations, crossover probability, and mutation probability to complete the second initialization operation. Specifically, each inspection robot individual includes parameters such as its initial position, moving speed, and detection capability. The population size represents the number of candidate solutions simultaneously retained in the genetic algorithm. The maximum number of iterations determines the upper limit of the algorithm's operation, and the crossover probability and mutation probability control the frequency of generating new individuals during the genetic operation. Through the above second initialization operation, the various control parameters required for the operation of the genetic algorithm are established.
[0028] Furthermore, based on the first and second initialization operations, a corresponding initial population is obtained. The current fitness value of the initial population is determined based on the fitness evaluation function, and the current Hamming distance of the initial population is determined, thus setting the current Hamming distance as the initial state of the initial population. Specifically, the initial population consists of multiple coded individuals, each representing a scheduling scheme for a train inspection robot. The fitness value reflects the comprehensive performance of each scheduling scheme in three dimensions: path length, execution time, and power consumption. The Hamming distance measures the degree of difference between individuals in the population. Using the current Hamming distance as the initial state of the initial population provides a reference for subsequent assessment of population diversity.
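As an illustration of how the Hamming distance can measure population diversity, the following Python sketch computes the pairwise Hamming distance between equal-length chromosomes and averages it over the population. The function names and the use of the average over all pairs are assumptions for illustration; the application does not specify how the population-level value is aggregated at this step.

```python
from itertools import combinations
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of gene positions at which two chromosomes differ."""
    if len(a) != len(b):
        raise ValueError("chromosomes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def average_hamming_distance(population: Sequence[Sequence[int]]) -> float:
    """Mean pairwise Hamming distance; 0.0 for fewer than two individuals."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)
```

A larger average indicates that individuals differ in more gene positions, i.e. the population is more diverse.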
[0029] After obtaining the initial population, the scene characteristics of the inspection robot are determined, and these scene characteristics are mapped to selection, crossover, and mutation operators. Based on a greedy strategy and the selection, crossover, and mutation operators, selection, crossover, and mutation operations are performed on the initial population to obtain an offspring population. Specifically, scene characteristics may include information such as the target train's track layout, the distribution density of undercarriage inspection points, and the location of the inspection robot's charging stations. These characteristics are mapped to corresponding genetic operators, enabling the algorithm to adapt to different actual working environments. The greedy strategy prioritizes retaining individuals with higher fitness values during the selection operation, and new offspring individuals are generated through crossover and mutation operations, thus forming the offspring population.
[0030] Once the offspring population is determined, it is screened based on the fitness evaluation function to obtain a screened population. The selection, crossover, and mutation operations are then iteratively performed on this screened population until a preset convergence criterion or a preset iteration limit is met, at which point the iteration stops and the target population is obtained. Specifically, during the screening process, individuals with fitness values reaching a preset threshold are retained, while individuals with lower fitness values are eliminated. The preset convergence criterion is met when the change in the optimal fitness value of the population over several consecutive generations is less than the preset convergence threshold. Alternatively, the iteration stops when the number of iterations reaches the preset iteration limit. In either case, the population obtained when the iteration stops is determined as the target population.
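The stopping rule described above can be sketched in Python as follows. The function names, the window of five generations, and the threshold value are hypothetical; the application only states that a convergence threshold and an iteration limit are preset.

```python
from typing import Sequence


def has_converged(best_fitness_history: Sequence[float],
                  window: int, threshold: float) -> bool:
    """True if the best fitness varied by less than `threshold`
    over the last `window` generation-to-generation steps."""
    if len(best_fitness_history) <= window:
        return False  # not enough generations observed yet
    recent = best_fitness_history[-(window + 1):]
    return max(recent) - min(recent) < threshold


def should_stop(best_fitness_history: Sequence[float], generation: int,
                max_generations: int, window: int = 5,
                threshold: float = 1e-6) -> bool:
    """Stop on the iteration limit or on convergence, whichever comes first."""
    return (generation >= max_generations
            or has_converged(best_fitness_history, window, threshold))
```

A genetic-algorithm loop would append the best fitness of each generation to the history and call `should_stop` once per generation.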
[0031] Step S13: Construct a reward function based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population using a reinforcement learning framework. Then, construct a Q-value function using the reward function, the state set corresponding to the target population, and the action set. Use the Q-value function to generate a scheduling strategy for the inspection robot, and use the scheduling strategy to perform multi-robot scheduling of the inspection robot. The action set is a set formed by configuring different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and a set representing the population diversity of the target population using Hamming distance.
[0032] In this embodiment, the constructed reward function includes action reward, diversity reward, evolutionary reward, and collision penalty.
[0033] First, action rewards are constructed based on the iteration progress of the target population and the action set corresponding to the target population. Specifically, the iteration progress reflects the proportion of the current generation of the genetic algorithm to the maximum number of iterations, and each action in the action set corresponds to a configuration scheme for the use of the population crossover algorithm. When the reinforcement learning framework selects a certain action, it configures the crossover operation in the genetic algorithm according to that action, and observes the improvement in the fitness value of the target population in subsequent iterations after the configuration, quantifying this improvement as an action reward.
[0034] Secondly, a diversity reward is constructed based on the average Hamming distance of the target population and the iteration progress. Specifically, the average Hamming distance is used to characterize the average degree of difference between individuals in the target population; the larger the value, the better the population diversity. As the iteration progress increases, population diversity usually gradually decreases. When the average Hamming distance is lower than a preset diversity threshold, the diversity reward is set to a negative value to guide the reinforcement learning framework to select actions that can maintain population diversity.
[0035] Third, an evolutionary reward is constructed based on the fitness value of the target population. Specifically, the optimal fitness value of the target population in the current iteration is compared with the optimal fitness value of the target population in the previous generation, and the increase in fitness value can be calculated. This increase is used as the evolutionary reward. If the fitness value does not change or decreases, the evolutionary reward can be set to zero or a negative value.
[0036] Fourth, a collision penalty is constructed based on the preset maximum moving speed and current moving speed of the inspection robot, as well as a preset fixed value. Specifically, when the inspection robot is moving under the vehicle, there is a risk of collision if the relative speed between adjacent robots is too large. The ratio of the current moving speed to the preset maximum moving speed is calculated. When this ratio exceeds a preset safety threshold, a collision penalty is generated in combination with the preset fixed value. This collision penalty is used to suppress the scheduling strategy generated by the reinforcement learning framework that may lead to robot collisions.
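One possible reading of the collision penalty is sketched below in Python. The function name, the default safety ratio of 0.8, the default fixed value of 10.0, and the choice to scale the fixed value by the speed ratio are all assumptions; the application only says that the penalty is generated from the speed ratio in combination with a preset fixed value.

```python
def collision_penalty(current_speed: float, max_speed: float,
                      safety_ratio: float = 0.8,
                      fixed_value: float = 10.0) -> float:
    """Return a non-negative penalty magnitude (to be subtracted from the reward).

    The penalty is zero while current_speed / max_speed stays at or below
    the safety ratio; above it, the fixed value is scaled by the ratio
    (hypothetical form).
    """
    if max_speed <= 0:
        raise ValueError("max_speed must be positive")
    ratio = current_speed / max_speed
    if ratio > safety_ratio:
        return fixed_value * ratio
    return 0.0
```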
[0037] Finally, a reward function is constructed using the action reward, the diversity reward, the evolution reward, and the collision penalty. This reward function evaluates the reward generated by performing a specific action in a specific state; the Q-value function built on it estimates the corresponding long-term return.
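The combination of the four terms can be sketched in Python as follows. This is only an illustration: the additive combination, the weighting of the diversity penalty by the remaining iteration progress, the clipping of the evolutionary reward at zero, and all function and parameter names are assumptions not stated in the application.

```python
def diversity_reward(avg_hamming: float, progress: float,
                     diversity_threshold: float, scale: float = 1.0) -> float:
    """Negative when diversity drops below the threshold, as described above.

    Weighting by (1 - progress) penalizes early loss of diversity more
    strongly; this weighting is a hypothetical choice.
    """
    if avg_hamming < diversity_threshold:
        return -scale * (1.0 - progress)
    return 0.0


def evolution_reward(best_fitness: float, previous_best_fitness: float) -> float:
    """Increase of the best fitness over the previous generation, or zero."""
    return max(best_fitness - previous_best_fitness, 0.0)


def total_reward(action_reward: float, diversity: float,
                 evolution: float, collision_penalty: float) -> float:
    """Sum of the three rewards minus the collision penalty magnitude."""
    return action_reward + diversity + evolution - collision_penalty
```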
[0038] Furthermore, based on the reward function, a target reward value corresponding to the target action in the action set is determined; a target state in the state set corresponding to the target population is determined, and a maximum Q value corresponding to the target state is determined based on a preset Q-table; a Q-value function is constructed based on the state set, the action set, the preset learning rate, the target reward value, the preset discount factor, and the maximum Q value. Specifically, the preset learning rate controls the speed at which new information covers old information, and the preset discount factor is used to balance the importance of current rewards and future rewards.
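The quantities listed above (learning rate, reward, discount factor, and maximum Q value of the target state) are the ingredients of the standard tabular Q-learning update, which the following Python sketch uses as an assumption about the intended form. The state discretization in `make_state`, the table layout, and every entry of `ACTIONS` except the ratio (0.6, 0.3, 0.1) given later in this description are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

# Each action is a usage-ratio configuration for (PMX, LOX, C1).
ACTIONS = [(0.6, 0.3, 0.1), (0.3, 0.6, 0.1), (0.3, 0.1, 0.6)]

QTable = DefaultDict[Tuple[Hashable, int], float]


def make_state(progress: float, avg_hamming: float,
               chromosome_length: int, bins: int = 10) -> Tuple[int, int]:
    """Discretize (iteration progress, normalized diversity) into buckets."""
    p = min(int(progress * bins), bins - 1)
    d = min(int(avg_hamming / chromosome_length * bins), bins - 1)
    return (p, d)


def q_update(q_table: QTable, state: Hashable, action: int, reward: float,
             next_state: Hashable, actions: Sequence[int],
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    max_next = max(q_table[(next_state, a)] for a in actions)
    old = q_table[(state, action)]
    q_table[(state, action)] = old + alpha * (reward + gamma * max_next - old)
    return q_table[(state, action)]


q_table: QTable = defaultdict(float)  # unseen (state, action) pairs start at 0
```

Here an action is referred to by its index into `ACTIONS`, so `actions` would be `range(len(ACTIONS))`.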
[0039] It should be noted that the reinforcement learning framework in this embodiment includes a current value network, a target value network, and a loss function for updating the target parameters in the current value network. The current value network receives the current state corresponding to the target population and outputs the Q-value corresponding to the current state. The target value network receives the next state of the current state and outputs the Q-value corresponding to the next state. In other words, by minimizing the loss function and continuously updating the target parameters in the current value network, the estimated Q-value function gradually approaches the true value, thereby generating a scheduling strategy for the train inspection robot.
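The application does not give the loss function explicitly. A common choice in frameworks with a current value network and a target value network is the mean squared temporal-difference error, which the following Python sketch illustrates on plain numbers instead of real networks. The function names, the batch layout, and the use of mean squared error are assumptions.

```python
from typing import List, Sequence, Tuple

# (q_current, reward, next_q_values_from_target_network, done)
Transition = Tuple[float, float, Sequence[float], bool]


def td_target(reward: float, next_q_values: Sequence[float],
              gamma: float, done: bool = False) -> float:
    """Reward plus the discounted maximum Q value of the next state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(batch: List[Transition], gamma: float = 0.9) -> float:
    """Mean squared difference between the TD target and the current Q value."""
    errors = [(td_target(r, next_q, gamma, done) - q) ** 2
              for q, r, next_q, done in batch]
    return sum(errors) / len(errors)
```

In a full implementation `q_current` would come from the current value network and `next_q_values` from the target value network, and the loss would be minimized with respect to the current network's parameters only.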
[0040] As can be seen from the above, in this application, the fitness evaluation function of the inspection robot is constructed based on the path length of its travel path, the inspection task execution time, and the power consumption. A genetic algorithm is used to initialize the inspection robot population to obtain an initial population. After generating an offspring population based on this initial population, the fitness evaluation function is used to screen the offspring population to obtain the target population. Based on a reinforcement learning framework, a reward function is constructed by combining the iteration progress, Hamming distance, fitness value, and corresponding action set of the target population. Then, a Q-value function is constructed using this reward function, the state set, and the action set corresponding to the target population. The Q-value function is used to generate a scheduling strategy for the inspection robots, and this scheduling strategy is then used to achieve multi-robot scheduling of the inspection robots. In this way, this application can improve the operational efficiency of multiple inspection robots in complex environments.
[0041] The technical solution of the embodiments of this application will be described in detail below. Specifically, the present invention adopts the design idea of an adaptive genetic algorithm combined with reinforcement learning, and includes the following steps. First, based on the robot's travel path to the work area, the set of stopping points along the path is taken and the total path length L (i.e., the path length) is calculated.
[0042] The time required for the robot to perform one inspection task (i.e., the execution time) is calculated as the sum of the robot's dwell times at each inspection point plus the robot's travel time. The travel time is determined from m, the number of stopping points in the stopping point set, the spacing between the points, and v, the robot's moving speed.
[0043] A robot power consumption function (i.e., the power consumption) is designed. Its terms are the energy the robot consumes per unit distance travelled, the power of the robotic arm, the power of the camera, and the time required for a single stop to perform an inspection task.
[0044] A multi-objective fitness evaluation function is constructed, in which three adjustable factors weight the path length, the task time, and the power consumption, respectively. Each adjustable factor is determined by a formula whose parameters are the maximum upper limit and the minimum lower limit that can be set for the weighting factor.
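To make the three objectives concrete, the following sketch computes a path length, an execution time, a power consumption, and a combined fitness value. The exact formulas are not legible in this text, so everything here is an assumption chosen for illustration: travel time is taken as path length divided by speed, energy as travel energy plus arm and camera power multiplied by total dwell time, and the fitness as a plain weighted sum with fixed weights. All function and parameter names are hypothetical.

```python
import math

def path_length(stops):
    """Total length L of the polyline through the stopping points."""
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))

def execution_time(stops, dwell_times, speed):
    """Dwell time at every stop plus travel time (assumed: L / v)."""
    return sum(dwell_times) + path_length(stops) / speed

def power_consumption(stops, dwell_times, energy_per_metre,
                      arm_power, camera_power):
    """Assumed model: travel energy plus arm and camera energy while stopped."""
    travel = energy_per_metre * path_length(stops)
    working = (arm_power + camera_power) * sum(dwell_times)
    return travel + working

def fitness(stops, dwell_times, speed, energy_per_metre,
            arm_power, camera_power, weights):
    """Assumed weighted sum of the three objectives (lower is better)."""
    w_len, w_time, w_power = weights
    return (w_len * path_length(stops)
            + w_time * execution_time(stops, dwell_times, speed)
            + w_power * power_consumption(stops, dwell_times,
                                          energy_per_metre,
                                          arm_power, camera_power))

# Illustrative numbers only.
stops = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
value = fitness(stops, dwell_times=[2.0, 2.0, 2.0], speed=0.5,
                energy_per_metre=1.0, arm_power=10.0, camera_power=5.0,
                weights=(0.5, 0.3, 0.2))
print(value)
```

In practice the three terms have different units and magnitudes, so some normalization would likely be needed before weighting; that step is left out here.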
[0045] The initial population uses integer encoding. Let N be the number of points to be detected on the train carriage and M be the number of schedulable robots, so that M-1 virtual tasks are introduced as separators. The chromosome of an individual therefore has a length of N+M-1 genes. Each gene represents a task number, and the gene position indicates the task execution order. The genes lying between one virtual task number and the previous virtual task number are the detection point tasks assigned to one robot.
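The encoding can be illustrated with a small decoding routine. The sketch assumes that detection points are numbered 1..N and that the M-1 virtual tasks are numbered N+1..N+M-1; this numbering convention and the function name `decode` are assumptions for the example.

```python
def decode(chromosome, n_points):
    """Split a chromosome into one ordered task list per robot.

    Genes greater than n_points are treated as virtual tasks, which act
    only as separators between consecutive robots.
    """
    routes = [[]]
    for gene in chromosome:
        if gene > n_points:
            routes.append([])        # virtual task: start the next robot
        else:
            routes[-1].append(gene)  # real task, kept in execution order
    return routes

# N = 6 detection points, M = 3 robots, so 2 virtual tasks (7 and 8)
# and a chromosome of length N + M - 1 = 8.
chromosome = [3, 1, 7, 5, 2, 8, 4, 6]
routes = decode(chromosome, n_points=6)
print(routes)  # [[3, 1], [5, 2], [4, 6]]
```

Because the virtual tasks are ordinary genes, permutation operators such as crossover and mutation change both the task order and the split between robots at the same time.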
[0046] The usage ratios of the three crossover methods, PMX (Partially Mapped Crossover), LOX (LeftOrdered Crossover), and C1 (Cycle Crossover), are set to 0.6, 0.3, and 0.1, respectively. Individuals in the population are randomly selected for crossover operations to increase population diversity and avoid premature convergence. The specific operations are as follows. PMX: swap a segment of the two parents' paths, establish a conflict mapping relationship, and repair duplicated genes one by one. LOX: preserve the order of a segment in the middle of the first parent's path, remove the genes of that segment from the second parent, and then insert the preserved segment with the remaining genes left-aligned to generate new offspring. C1: starting from number 1, find the first cycle formed between the first parent and the second parent, keep the first parent's genes at the positions within the cycle, and fill the remaining positions with the second parent's genes to generate offspring.
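As an illustration of the PMX step, the following sketch implements one common formulation of partially mapped crossover with fixed cut points. It is a textbook variant and not necessarily the exact procedure of the application; the function name and the example parents are assumptions.

```python
def pmx(parent_a, parent_b, cut1, cut2):
    """Partially mapped crossover producing a single child.

    The child keeps parent_a[cut1:cut2]. Every other position takes the
    gene of parent_b, and a gene that would duplicate one in the copied
    segment is repaired by following the segment's position-wise mapping.
    """
    size = len(parent_a)
    child = [None] * size
    child[cut1:cut2] = parent_a[cut1:cut2]
    segment = set(parent_a[cut1:cut2])
    # Conflict mapping: gene of parent_a -> gene of parent_b, same position.
    mapping = dict(zip(parent_a[cut1:cut2], parent_b[cut1:cut2]))
    for i in list(range(cut1)) + list(range(cut2, size)):
        gene = parent_b[i]
        while gene in segment:       # duplicate: repair via the mapping
            gene = mapping[gene]
        child[i] = gene
    return child

parent_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
parent_b = [9, 3, 7, 8, 2, 6, 5, 1, 4]
child = pmx(parent_a, parent_b, cut1=3, cut2=7)
print(child)  # [9, 3, 2, 4, 5, 6, 7, 1, 8]
```

The repair loop always terminates because both parents are permutations, so the child is again a valid permutation of the task numbers.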
[0047] The state is designed around population diversity, which is divided into 3 states H=[[0,0.4],[0.4,0.6],[0.6,1]] (each interval, e.g. [0,0.4], is a range of the Hamming distance between individuals in the population). The iteration progress is likewise divided into 3 states R=[[0,0.4],[0.4,0.6],[0.6,1]] (each interval, e.g. [0,0.4], is a range of the ratio of the current generation to the maximum iteration generation).
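A possible way to turn these two quantities into a discrete state is sketched below. It assumes that the Hamming distance is normalized by chromosome length and averaged over all pairs of individuals, and that a value lying exactly on a boundary (0.4 or 0.6) falls into the upper interval; the text does not specify either point. The function names are hypothetical.

```python
from itertools import combinations

BOUNDS = (0.4, 0.6)  # interval boundaries shared by H and R

def mean_hamming(population):
    """Average pairwise Hamming distance, normalized to [0, 1]."""
    pairs = list(combinations(population, 2))
    total = sum(sum(x != y for x, y in zip(a, b)) / len(a) for a, b in pairs)
    return total / len(pairs)

def bin_index(value, bounds=BOUNDS):
    """Return 0, 1 or 2 depending on the interval the value falls into."""
    return sum(value >= b for b in bounds)

def state(population, generation, max_generations):
    """Discrete state as (diversity bin, iteration progress bin)."""
    diversity = mean_hamming(population)
    progress = generation / max_generations
    return bin_index(diversity), bin_index(progress)

population = [[1, 2, 3, 4], [1, 2, 4, 3], [4, 3, 2, 1]]
print(state(population, generation=30, max_generations=100))  # (2, 0)
```

With three bins per dimension this yields nine possible states, which keeps a Q-table small.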
[0048] The action set is designed to include three types of actions: exploration, balancing, and development (i.e., exploitation), each of which applies the population crossover algorithms in different ratios. For the exploration action, the ratio of PMX, LOX, and C1 is set to 0.6, 0.2, and 0.2, respectively. For the balancing action, the ratio of PMX, LOX, and C1 is set to 0.4, 0.3, and 0.3, respectively. For the development action, the ratio of PMX, LOX, and C1 is set to 0.6, 0.3, and 0.1, respectively.
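The action set can be represented as a lookup table from action to crossover usage ratios, from which an operator is drawn for each crossover. The ratios below are the ones stated in this paragraph; the dictionary layout, the function name `pick_crossover`, and the use of a seeded random generator are assumptions for illustration.

```python
import random

# Usage ratios of the crossover operators for each action.
ACTION_RATIOS = {
    "exploration": {"PMX": 0.6, "LOX": 0.2, "C1": 0.2},
    "balancing":   {"PMX": 0.4, "LOX": 0.3, "C1": 0.3},
    "development": {"PMX": 0.6, "LOX": 0.3, "C1": 0.1},
}

def pick_crossover(action, rng):
    """Draw one crossover operator name according to the action's ratios."""
    ratios = ACTION_RATIOS[action]
    return rng.choices(list(ratios), weights=list(ratios.values()), k=1)[0]

rng = random.Random(0)  # seeded only to make the example repeatable
counts = {"PMX": 0, "LOX": 0, "C1": 0}
for _ in range(10000):
    counts[pick_crossover("balancing", rng)] += 1
print(counts)
```

Over many draws the observed frequencies approach the configured ratios, so choosing an action effectively chooses the mix of crossover operators applied in that generation.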
[0049] A reward mechanism is designed based on action selection, population Hamming distance, and population optimal fitness value.
[0050] Referring to Figure 2, the genetic algorithm process involved in this invention first initializes the population, the robot parameters, the population size, the maximum number of generations, the crossover probability, and the mutation probability. Then, hard constraints are applied to the initialized population, and a fitness evaluation function is designed using a multi-objective fusion strategy. Next, selection, crossover, and mutation operations are performed on the population to generate offspring. After the populations are merged, high-quality individuals are selected through fitness evaluation, and crossover and mutation operators are used to simulate gene recombination and mutation. The process iterates until the convergence criterion or the iteration limit is met, at which point the global optimal solution is output.
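The overall loop can be sketched in a few lines. This is a deliberately simplified stand-in, not the procedure of Figure 2: parents are picked uniformly at random, the crossover is a basic one-cut order-preserving operator, hard constraints and the convergence criterion are omitted, and the toy cost function merely sums the gaps between consecutive task numbers. All names and default values are assumptions.

```python
import random

def evolve(fitness, n_genes, pop_size=20, max_generations=40,
           p_cross=0.8, p_mut=0.2, seed=0):
    """Minimal permutation GA: crossover, mutation, merge, keep the best."""
    rng = random.Random(seed)
    population = [rng.sample(range(n_genes), n_genes) for _ in range(pop_size)]
    history = []
    for _ in range(max_generations):
        offspring = []
        while len(offspring) < pop_size:
            mother, father = rng.sample(population, 2)
            child = list(mother)
            if rng.random() < p_cross:
                # Keep a head of one parent, append the rest in the other's order.
                cut = rng.randrange(1, n_genes)
                head = mother[:cut]
                child = head + [g for g in father if g not in head]
            if rng.random() < p_mut:
                # Swap mutation keeps the chromosome a valid permutation.
                i, j = rng.sample(range(n_genes), 2)
                child[i], child[j] = child[j], child[i]
            offspring.append(child)
        # Merge parents and offspring, then keep the fittest individuals.
        population = sorted(population + offspring, key=fitness)[:pop_size]
        history.append(fitness(population[0]))
    return population[0], history

def line_route_cost(order):
    """Toy cost: total gap between consecutive task numbers."""
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

best, history = evolve(line_route_cost, n_genes=8)
print(best, history[-1])
```

Because parents and offspring are merged before truncation, the best cost recorded in `history` never gets worse from one generation to the next.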
[0051] Referring to Figure 3, in the reinforcement learning process involved in this invention, the core of the framework is that the agent uses a deep Q-network to approximate the Q-value function, continuously interacts with the environment, and optimizes the robot scheduling strategy. The Q-value update is calculated as follows: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$. In the formula, $s_t$ and $a_t$ represent the state at time t and the selected action, respectively; $r_{t+1}$ is the reward value obtained after the agent performs the action at time t; $\max_{a} Q(s_{t+1}, a)$ is the maximum Q value corresponding to state $s_{t+1}$ in the Q-table; $\alpha$ is the learning rate; and $\gamma$ is the discount factor. The parameters of the neural network are updated through gradient descent and backpropagation. At each time step, the network selects an action based on the current state and calculates the corresponding reward using the reward function R. The reward function R is built from three terms, namely the action reward, the diversity reward, and the evolution reward, together with a collision penalty. Its inputs are the selected action from the action set (1 represents exploration; 2 represents balance; 3 represents development, i.e., exploitation), the population iteration progress (current generation / maximum number of iterations), the average Hamming distance, the fitness value of the optimal chromosome in the t-th generation of the population, the robot's current movement speed, and the robot's maximum moving speed. Then, based on the error between the actual reward and the predicted value, the gradient is calculated using the loss function, and the parameters are updated using gradient descent. The terms of the loss function are the Q value that the network outputs for the corresponding state-action pair, the reward for the current state, the probability of the target strategy occurring, the learning rate, and the expected value E. The DQN (Deep Q-Network) consists of two neural networks with identical structures but different parameters: the current value network and the target value network. The network parameters are updated periodically. The activation function can be Sigmoid, and the hyperparameters can be optimized with Optuna.
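For orientation, the following sketch computes the loss that is commonly used with a current value network and a target value network, namely the mean squared temporal-difference error. The loss formula of the application is not legible in this text, so this standard DQN form is an assumption. The two "networks" are replaced by plain Python callables that return one Q value per action, and all names and numbers are illustrative.

```python
def td_loss(batch, q_current, q_target, gamma=0.9):
    """Mean squared TD error over a batch of transitions.

    Each transition is (state, action_index, reward, next_state).
    q_current(state) and q_target(state) return a list of Q values,
    one per action; they stand in for the two networks.
    """
    errors = []
    for state, action, reward, next_state in batch:
        # Target uses the target value network on the next state.
        target = reward + gamma * max(q_target(next_state))
        # Prediction uses the current value network on the current state.
        predicted = q_current(state)[action]
        errors.append((target - predicted) ** 2)
    return sum(errors) / len(errors)

# Constant stand-ins for the two networks (three actions each).
def q_current(state):
    return [0.5, 0.0, 0.2]

def q_target(state):
    return [1.0, 2.0, 0.0]

batch = [((0, 0), 0, 1.0, (1, 0)),
         ((1, 0), 2, 0.0, (1, 1))]
loss = td_loss(batch, q_current, q_target, gamma=0.9)
print(loss)
```

In a full implementation the gradient of this loss would update only the current value network, and its parameters would be copied to the target value network at fixed intervals.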
[0052] To verify the superiority and accuracy of the designed improved GA (Genetic Algorithm), the robot's inspection path was optimized using traditional GA, adaptive GA, and improved GA in a parking garage scenario with 20 detection points. The robot path optimization, detection task time, and energy consumption of the three algorithms were compared, and the corresponding comparison results are shown in Table 1 below.
[0053] Table 1 Comparison Results
[0054] The following provides a description of the train undercarriage inspection robot, algorithm design, and collaborative control system in the embodiments of this application.
[0055] The train undercarriage inspection robot mainly includes: the robot body structure, which consists of a tracked chassis, inspection devices (vision system and supplementary lighting system), control system, and wireless communication module; a dual-motor drive system, with each track equipped with an independent DC motor, controlled by two sets of motor drivers; a main control module, which adopts an embedded industrial control motherboard; and a sensing system, including ultrasonic radar, inertial sensors, and laser navigators, used to achieve precise robot positioning, autonomous navigation, and autonomous obstacle avoidance.
[0056] The algorithm design in this application includes the following steps: initializing the parameters of the reinforcement learning genetic algorithm, setting the iteration count to 1, and generating an initial population by combining random sorting and heuristic rule methods; constructing a multi-objective fitness function, introducing a dynamic weight mechanism, and mapping scene features in real time to crossover, mutation, and selection operators; using a greedy strategy to select actions for crossover and mutation operations to generate offspring populations; selecting high-quality individuals through fitness evaluation and simulating gene recombination and mutation using crossover and mutation operators; decoding high-quality individuals to form action sequences, adding them to the experience pool for training; and using the newly added experience to update the Q-value, thereby updating the network and achieving policy optimization.
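The action selection and experience pool steps can be sketched as follows. The text only states that a greedy strategy is used; the epsilon-greedy variant shown here, in which a random action is taken with probability `epsilon`, is an assumption, as are the pool size, the transition layout, and the function name.

```python
import random
from collections import deque

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice: random with probability epsilon, else best."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# Bounded experience pool: the oldest transitions are discarded first.
experience_pool = deque(maxlen=1000)
rng = random.Random(1)

q_values = [0.1, 0.7, 0.3]            # illustrative Q values for 3 actions
action = select_action(q_values, epsilon=0.0, rng=rng)  # purely greedy
# Store (state, action, reward, next_state) for later training.
experience_pool.append(((0, 1), action, 0.5, (1, 1)))
print(action, len(experience_pool))
```

Setting `epsilon` to 0 reduces the rule to the purely greedy strategy, while a small positive value keeps some exploration during training.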
[0057] The collaborative control system architecture can be divided into two parts: the terminal side and the edge side. The control functions on the terminal side are implemented by the robot's control unit. The master control unit is responsible for processing task plans and generating action commands, while subordinate units execute corresponding operations and interact with the external environment based on these commands. The edge side consists of edge servers, responsible for running algorithms to generate optimal job scheduling plans, distributing them to each robot, and receiving response information from the terminal devices. It also completes local data processing and decision feedback loops. Its system resources focus on real-time processing of terminal data and performance optimization of the computing power model. Through the edge computing architecture, it achieves efficient task scheduling and distributed model training, effectively reducing the terminal's computing load and significantly improving system response efficiency and operational stability.
[0058] Accordingly, as shown in Figure 4, this application provides a multi-machine scheduling device for train inspection robots based on genetic algorithms and reinforcement learning, including: a function construction module 11, used to construct the fitness evaluation function of the inspection robot based on the path length of the inspection robot's travel path, the execution time of the inspection task, and the power consumption; a population determination module 12, used to initialize the inspection robot population using a genetic algorithm to obtain an initial population, generate an offspring population based on the initial population, and screen the offspring population based on the fitness evaluation function to obtain a target population; and a multi-machine scheduling module 13, used to construct a reward function based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population using a reinforcement learning framework, to construct a Q-value function using the reward function, the state set corresponding to the target population, and the action set, to generate a scheduling strategy for the inspection robot using the Q-value function, and to perform multi-machine scheduling of the inspection robot using the scheduling strategy. The action set is a set configured with different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and the population diversity of the target population, measured by the Hamming distance.
[0059] In some specific embodiments, the function construction module 11 specifically includes: The length determination unit is used to determine the corresponding path length based on the set of stopping points of the inspection robot on its travel path. The event determination unit is used to determine the stopping time of the train inspection robot at the target inspection point on the target train, and to determine the travel time of the train inspection robot based on the spacing of the target inspection points and the number of stopping points, so as to determine the execution time of the corresponding inspection task based on the stopping time and the travel time. The power consumption determination unit is used to determine the corresponding power consumption based on the travel energy consumption of the inspection robot at a preset unit distance, the spacing, the power of the robot's robotic arm and camera, and the docking time. The function construction unit is used to construct the fitness evaluation function of the inspection robot using the path length, the execution time, the power consumption, and a preset adjustable factor.
[0060] In some specific embodiments, the population determination module 12 specifically includes: The first initialization unit is used to set the number of inspection points at the undercarriage of the target train and the number of dispatchable train inspection robots to complete the first initialization operation. The second initialization unit is used to initialize the population of the inspection robot and initialize the parameters, population size, maximum number of iterations, crossover probability and mutation probability of the inspection robot to complete the second initialization operation. The state determination unit is used to obtain the corresponding initial population based on the first initialization operation and the second initialization operation, determine the current fitness value of the initial population based on the fitness evaluation function, and determine the current Hamming distance of the initial population, so as to determine the current Hamming distance as the initial state of the initial population.
[0061] In some specific embodiments, the population determination module 12 specifically includes: An operation execution unit is used to determine the scene characteristics where the inspection robot is located, and map the scene characteristics into selection operators, crossover operators and mutation operators. Based on a greedy strategy and the selection operators, crossover operators and mutation operators, selection operations, crossover operations and mutation operations are performed on the initial population to obtain the offspring population. The population determination unit is used to screen the offspring population based on the fitness evaluation function to obtain the screened population, and to iteratively perform the selection operation, the crossover operation and the mutation operation on the screened population until the preset convergence criterion or the preset iteration limit is met and the iteration stops, and the corresponding target population is obtained after the iteration stops.
[0062] In some specific embodiments, the multi-machine scheduling module 13 specifically includes: The first reward construction unit is used to construct an action reward based on the iteration progress of the target population and the action set corresponding to the target population. The second reward construction unit is used to construct a diversity reward based on the average Hamming distance of the target population and the iteration progress. The third reward construction unit is used to construct an evolutionary reward based on the fitness value of the target population; The penalty construction unit is used to construct a collision penalty based on the preset maximum moving speed and current moving speed of the inspection robot, as well as a preset fixed value; A function construction unit is used to construct a reward function using the action reward, the diversity reward, the evolution reward, and the collision penalty.
[0063] In some specific embodiments, the multi-machine scheduling module 13 specifically includes: A reward value determination unit is used to determine, based on the reward function, the target reward value corresponding to the target action in the action set; The Q-value determination unit is used to determine the target state in the state set corresponding to the target population, and to determine the maximum Q-value corresponding to the target state based on a preset Q-table. The function construction unit is used to construct a Q-value function based on the state set, the action set, the preset learning rate, the target reward value, the preset discount factor, and the maximum Q-value.
[0064] In some specific implementations, the reinforcement learning framework is a framework that includes a current value network, a target value network, and a loss function for updating the target parameters in the current value network; The current value network is a network that receives the current state corresponding to the target population and outputs the Q value corresponding to the current state. The target value network is a network that receives the next state of the current state and outputs the Q value corresponding to the next state.
[0065] Furthermore, embodiments of this application also disclose an electronic device. Figure 5 is a structural diagram of an electronic device 20 according to an exemplary embodiment; the content of the diagram should not be construed as limiting the scope of this application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input / output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the multi-machine scheduling method for inspection robots based on genetic algorithms and reinforcement learning disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be an electronic computer.
[0066] In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 25 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.
[0067] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk or optical disk, etc. The resources stored thereon can include operating system 221, computer program 222, etc., and the storage method can be temporary storage or permanent storage.
[0068] The operating system 221 is used to manage and control the various hardware devices on the electronic device 20 and the computer program 222, which may be Windows Server, Netware, Unix, Linux, etc. In addition to including a computer program capable of performing the multi-machine scheduling method for inspection robots based on genetic algorithms and reinforcement learning, which is executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs capable of performing other specific tasks.
[0069] Furthermore, this application also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning. Specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.
[0070] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.
[0071] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0072] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.
[0073] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0074] The technical solutions provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning, characterized in that the method includes: The fitness evaluation function of the train inspection robot is constructed based on the path length of the train inspection robot's travel path, the execution time of the inspection task, and the power consumption. The inspection robot population is initialized using a genetic algorithm to obtain an initial population. An offspring population is then generated based on the initial population, and the offspring population is screened based on the fitness evaluation function to obtain the target population. A reward function is constructed using a reinforcement learning framework based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population. A Q-value function is then constructed using the reward function, the state set corresponding to the target population, and the action set. This Q-value function is used to generate a scheduling strategy for the inspection robot, and the scheduling strategy is used to perform multi-machine scheduling of the inspection robot. The action set is a set configured with different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and a set representing the population diversity of the target population using Hamming distance.
2. The multi-machine scheduling method for train inspection robots based on genetic algorithm and reinforcement learning according to claim 1, characterized in that, The fitness evaluation function for the train inspection robot, constructed based on the path length of its travel path, the execution time of its inspection task, and its power consumption, includes: The corresponding path length is determined based on the set of stops of the inspection robot on its travel path; The stopping time of the train inspection robot at the target inspection point on the target train is determined, and the travel time of the train inspection robot is determined based on the spacing of the target inspection points and the number of stopping points, so as to determine the execution time of the corresponding inspection task based on the stopping time and the travel time. The corresponding power consumption is determined based on the travel energy consumption of the inspection robot at a preset unit distance, the spacing, the power of the robot's robotic arm and camera, and the docking time. The fitness evaluation function of the inspection robot is constructed using the path length, execution time, power consumption, and a preset adjustable factor.
3. The multi-machine scheduling method for train inspection robots based on genetic algorithm and reinforcement learning according to claim 1, characterized in that, The process of initializing the population of the inspection robot to obtain an initial population includes: The number of inspection points at the undercarriage of the target train and the number of dispatchable train inspection robots are set to complete the first initialization operation; The second initialization operation is performed by initializing the population of the inspection robot, including its parameters, population size, maximum number of iterations, crossover probability, and mutation probability. Based on the first initialization operation and the second initialization operation, a corresponding initial population is obtained, and the current fitness value of the initial population is determined based on the fitness evaluation function. The current Hamming distance of the initial population is determined so as to determine the current Hamming distance as the initial state of the initial population.
4. The multi-machine scheduling method for train inspection robots based on genetic algorithm and reinforcement learning according to claim 1, characterized in that the process of generating an offspring population based on the initial population and screening the offspring population based on the fitness evaluation function to obtain the target population includes: The scene features where the inspection robot is located are determined, and the scene features are mapped to selection operators, crossover operators, and mutation operators. Based on a greedy strategy and the selection operators, crossover operators, and mutation operators, selection operations, crossover operations, and mutation operations are performed on the initial population to obtain the offspring population. The offspring population is screened based on the fitness evaluation function to obtain a screened population. The selection operation, the crossover operation, and the mutation operation are then iteratively performed on the screened population until the preset convergence criterion or the preset iteration limit is met, at which point the iteration stops and the corresponding target population is obtained.
5. The multi-machine scheduling method for train inspection robots based on genetic algorithm and reinforcement learning according to claim 1, characterized in that, The construction of the reward function based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population includes: Based on the iteration progress of the target population and the action set corresponding to the target population, an action reward is constructed. A diversity reward is constructed based on the average Hamming distance of the target population and the iteration progress. An evolutionary reward is constructed based on the fitness value of the target population; Collision penalty is constructed based on the preset maximum moving speed and current moving speed of the inspection robot, as well as a preset fixed value; A reward function is constructed using the action reward, the diversity reward, the evolution reward, and the collision penalty.
6. The multi-machine scheduling method for train inspection robots based on genetic algorithm and reinforcement learning according to claim 1, characterized in that, The step of constructing the Q-value function using the reward function, the state set corresponding to the target population, and the action set includes: The target reward value corresponding to the target action in the action set is determined based on the reward function. Determine the target state in the state set corresponding to the target population, and determine the maximum Q value corresponding to the target state based on a preset Q table; A Q-value function is constructed based on the state set, the action set, the preset learning rate, the target reward value, the preset discount factor, and the maximum Q-value.
7. The multi-machine scheduling method for train inspection robots based on genetic algorithms and reinforcement learning according to any one of claims 1 to 6, characterized in that, The reinforcement learning framework includes a current value network, a target value network, and a loss function for updating the target parameters in the current value network. The current value network is a network that receives the current state corresponding to the target population and outputs the Q value corresponding to the current state. The target value network is a network that receives the next state of the current state and outputs the Q value corresponding to the next state.
8. A multi-machine scheduling device for train inspection robots based on genetic algorithms and reinforcement learning, characterized in that the device includes: The function construction module is used to construct the fitness evaluation function of the inspection robot based on the path length of the inspection robot's travel path, the execution time of the inspection task, and the power consumption. The population determination module is used to initialize the inspection robot population using a genetic algorithm to obtain an initial population, generate an offspring population based on the initial population, and screen the offspring population based on the fitness evaluation function to obtain a target population. The multi-machine scheduling module is used to construct a reward function based on the iteration progress, Hamming distance, fitness value, and action set corresponding to the target population using a reinforcement learning framework. It then constructs a Q-value function using the reward function, the state set corresponding to the target population, and the action set. The Q-value function is used to generate a scheduling strategy for the inspection robots, and the scheduling strategy is used to perform multi-machine scheduling of the inspection robots. The action set is a set configured with different usage ratios for various population crossover algorithms. The state set includes the iteration progress of the target population and a set that uses the Hamming distance as the population diversity of the target population.
9. An electronic device, characterized in that the device includes: a memory, used to store a computer program; and a processor, configured to execute the computer program to implement the multi-machine scheduling method for inspection robots based on genetic algorithms and reinforcement learning as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the medium is used to store a computer program; wherein, when the computer program is executed by a processor, it implements the multi-machine scheduling method for inspection robots based on genetic algorithms and reinforcement learning as described in any one of claims 1 to 7.