Construction site scheduling method and system based on heterogeneous multi-agent reinforcement learning

The construction site scheduling method based on heterogeneous multi-agent reinforcement learning solves the problem of coordinated scheduling of multiple tower cranes and ground vehicles in high-rise building construction, realizing safe and efficient construction site management and improving construction efficiency and safety.

CN122243142A · Pending · Publication Date: 2026-06-19 · HUAZHONG UNIV OF SCI & TECH +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In high-rise building construction, existing technologies struggle to coordinate the high-density and dynamically changing construction demands of multiple tower cranes and ground-based horizontal transport vehicles, leading to decision-making delays or excessive conservatism. Furthermore, traditional methods are ill-suited for achieving safe and efficient collaboration among heterogeneous equipment.

Method used

A construction site scheduling method based on heterogeneous multi-agent reinforcement learning is adopted. By constructing a site topology network and dynamic task flow, a heterogeneous multi-agent scheduling model is trained. Combined with a phased evolution strategy and a multi-objective reward function, the collaborative operation of tower cranes and horizontal transport vehicles is realized.

Benefits of technology

It enables safe and efficient collaborative scheduling of heterogeneous machinery in complex and dynamic environments, reducing equipment idleness and project delays, and has the ability to proactively avoid conflicts, thereby improving construction efficiency and safety.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of building engineering informatization and intelligent construction technology. It discloses a construction site scheduling method and system based on heterogeneous multi-agent reinforcement learning. The method includes: analyzing the spatial geometric information of the construction site based on building information model data and construction schedule plans to construct a site topology network and dynamic construction task flow; constructing a heterogeneous multi-agent scheduling model; training it using a phased evolution strategy, the training process sequentially including a capability building phase, a safety constraint guidance phase, a strategy internalization phase, and a strategy correction phase; outputting discrete scheduling instructions through the trained scheduling model to drive tower cranes and horizontal transport vehicles to perform collaborative operations; calculating multi-objective reward values based on the action execution feedback of the agents and updating the strategy network parameters. This method effectively solves the problems of difficulty in heterogeneous equipment coordination, lag in dynamic task response, and high collision risk in multi-tower operations, achieving safe and efficient end-to-end scheduling of materials at the construction site.

Description

Technical Field

[0001] This invention relates to the field of information technology and intelligent construction technology in building engineering, specifically to a construction site scheduling method and system based on heterogeneous multi-agent reinforcement learning.

Background Technology

[0002] In high-rise building construction, multiple tower cranes (MTCs) and automated guided vehicles (AGVs) constitute the core logistics system. Currently, construction site scheduling relies mainly on manual command by signalmen. When facing high-density, dynamically changing construction demands, this approach often suffers from decision-making delays caused by excessive cognitive load, or from overly conservative decisions made for safety reasons. Traditional operations research optimization methods struggle to handle unstructured on-site disturbances (such as manually inserted ad-hoc tasks), while conventional multi-agent reinforcement learning methods often fail to converge or produce dangerous actions when dealing with heterogeneous equipment collaboration and complex spatial collision avoidance constraints. Therefore, there is an urgent need for an intelligent scheduling method that can integrate on-site operational norms, adapt to dynamic environments, and coordinate equipment efficiently.

Summary of the Invention

[0003] In response to the above-mentioned deficiencies or improvement needs of existing technologies, this invention provides a construction site scheduling method and system based on heterogeneous multi-agent reinforcement learning, which solves the existing technical problem of difficulty in safely and efficiently coordinating the scheduling of heterogeneous machinery in complex dynamic environments.

[0004] To achieve the above objectives, according to one aspect of the present invention, a construction site scheduling method based on heterogeneous multi-agent reinforcement learning is provided, comprising the following steps:
S10: Based on building information model data and the construction schedule, analyze the spatial geometry information of the construction site, construct the site topology network, and generate a dynamic construction task flow.
S20: Based on the site topology network constructed in step S10 and the generated dynamic construction task flow, construct a heterogeneous multi-agent scheduling model. The scheduling model includes a first policy network for horizontal transport vehicles and a second policy network for tower cranes.
S30: Train the scheduling model obtained in step S20 using a phased evolution strategy.
S40: At the actual decision-making moment, acquire local observation data of the on-site agents in real time, and output discrete scheduling instructions through the scheduling model trained in step S30 to drive the tower cranes and horizontal transport vehicles to perform collaborative operations.
S50: During the collaborative operation of the tower cranes and horizontal transport vehicles in step S40, calculate the multi-objective reward function value based on the action execution feedback of the agents, and update the policy network parameters.

[0005] Preferably, the site topology network in step S10 is a node-edge graph structure obtained by parsing based on the IFC standard; the generation of the dynamic construction task flow integrates push mechanism, pull mechanism and dependency chain mechanism.

[0006] Preferably, in step S20, the first policy network adopts a standard multilayer perceptron architecture, which maps the local observation vector containing the horizontal transport vehicle's body state and local task features into discrete action commands.

[0007] Preferably, the second policy network in step S20 adopts a deep multilayer perceptron architecture, which includes independently configured hidden layer dimensions and layer normalization mechanisms, and is configured with a feature extraction module for extracting spatiotemporal conflict semantic features, wherein the spatiotemporal conflict semantic features include:
The conflict state bit, used to indicate whether the agent is currently in the working area of other peer agents;
The window distance feature, used to indicate the normalized time step from the agent's current task progress to entering or leaving the conflict time window;
The static advantage feature, used to indicate the priority of tasks determined based on static engineering constraints.

[0008] Preferably, the spatiotemporal conflict semantic feature vector is calculated as follows:

[0009] where the three components of the vector are the conflict state bit, the window distance feature, and the static advantage feature.

[0010] Preferably, the conflict state bit is expressed as follows:

[0011] where discrete numerical coding is used to characterize the severity of collisions: 0.0 (Free) indicates no collision; 1.0 (Self-in) indicates that the local machine has entered the cross-operation area but the collaborating machine has not; 2.0 (Peer-in) indicates that the collaborating machine has entered the cross-operation area but the local machine has not; 3.0 (Deadlock) indicates that both machines have a tendency to enter the cross-operation area or have entered simultaneously, with a high risk of collision or deadlock. The window distance feature is expressed as follows:

[0012] where the expression involves the time step from the current task progress of the local machine to the boundary of the conflict window, the time step for the collaborating tower crane to reach the window boundary, and a normalization time constant. The static advantage feature is expressed as follows:

[0013] where the expression involves the base priority scores of the local task and the collaborating task, respectively, and a scaling factor.

[0014] Preferably, the phased evolution strategy in step S30 includes: In the first phase, the curriculum scheduler is activated, and the agent's basic task processing ability is trained in a low-conflict environment by gradually increasing the task density coefficient and the urgency of deadlines. In the second phase, a proactive conflict defense expert mechanism is activated: when the predicted conflict feature is less than the safety threshold, the yielding party is forced to take a pause action based on the preset priority rules, and the policy network is guided by imitation learning. In the third phase, the proactive conflict defense expert mechanism is removed, and the agent internalizes conflict avoidance strategies based solely on the reward feedback of reinforcement learning. In the fourth phase, a strategy supervisor is activated to identify and prohibit suboptimal actions, such as prolonged idling or ignoring feasible tasks, and to correct the strategy with respect to long-term planning.

[0015] Preferably, step S40 further includes generating a dynamic action mask, specifically by acquiring local observation data of the on-site intelligent agent in real time, and generating a dynamic action mask based on the current material inventory status, equipment load status, and task feasibility; the dynamic action mask is a binary action mask, used to filter out actions that violate physical constraints, including insufficient material inventory, full buffer zone, and tower crane lifting without load.

[0016] Preferably, the multi-objective reward function value in step S50 is calculated by summing four reward layers. Specifically, the first layer is the core task incentive, including the task completion reward and the critical path task multiplier reward; the second layer is the operational efficiency incentive, including the delay penalty, the early completion reward, and the unnecessary suspension penalty; the third layer is the safety and compliance incentive, including the correct avoidance reward based on the "five yields" rule, the violation operation penalty, and the dangerous approach penalty; the fourth layer is the cooperation incentive, including the upstream and downstream supply and demand cooperation reward for horizontal transport vehicles and tower cranes.

[0017] To achieve the above objectives, according to another aspect of the present invention, a field construction scheduling system based on heterogeneous multi-agent reinforcement learning is provided, comprising: The simulation environment construction module is used to collect on-site building information model data and schedule plans, and generate a simulation environment that includes storage yards and lifting points for task generation. The heterogeneous model training module is used to implement the training steps of the method described above and generate a policy network that includes conflict feature extraction. The decision execution module is used to receive real-time on-site status and output dispatch instructions to each construction machinery terminal.

[0018] In summary, the technical solutions conceived by this invention have the following beneficial effects compared with the prior art: This invention breaks through the limitations of pursuing efficiency in a single machine. By establishing a heterogeneous scheduling model that includes AGVs and TCs and a multi-objective reward function, it can guide equipment to pay attention to the supply and demand coordination of upstream and downstream equipment while completing its own tasks. Thanks to the effective mechanism of eliminating cross-agent state dependencies in this system, it effectively reduces problems such as equipment idleness and project delays caused by material shortages or response delays.

[0019] This invention innovatively incorporates spatiotemporal conflict semantic features, including conflict state bits, window distance, and static priority, providing a precise quantification standard for collision risk in high-density multi-tower operations. Based on these features, the system introduces a safety constraint mechanism based on engineering prior rules during training, which can accurately identify and forcibly correct high-risk scheduling actions, guiding the agent to quickly internalize safe operating procedures. This enables the scheduling system to possess proactive conflict avoidance capabilities when facing dense, overlapping operations, achieving a leap from passive collision avoidance to proactive safe collaboration.

[0020] This invention designs a four-stage evolutionary training mechanism that addresses the low efficiency of multi-agent joint exploration in large-scale dynamic construction scenarios. Specifically, the capability building stage alleviates blind initial exploration by dynamically adjusting task density; the safety constraint guidance stage uses the aforementioned constraint mechanism to correct high-risk actions; the strategy internalization stage relies on environmental reward feedback to achieve autonomous optimization of the model; and the strategy correction stage masks inefficient actions so that agents do not fall into an overly conservative, suboptimal standby state merely to avoid collision penalties.

Description of the Attached Figures

[0021] Figure 1 is an overall flowchart of the MOHARL method in one embodiment of the present invention.

[0022] Figure 2 is a schematic diagram of dynamic environment construction and task generation in one embodiment of the present invention.

[0023] Figure 3 is a schematic diagram of a custom training architecture in one embodiment of the present invention.

[0024] Figure 4 is a schematic diagram of a heterogeneous policy network and feature extraction architecture in one embodiment of the present invention.

[0025] Figure 5 is a logic diagram of expert arbitration and supervisor intervention in one embodiment of the present invention.

[0026] Figure 6 is a schematic diagram of a phased training strategy in one embodiment of the present invention.

[0027] Figure 7 is a schematic diagram of a multi-timescale loop structure in one embodiment of the present invention.

Detailed Implementation

[0028] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0029] Referring to Figures 1-7, the construction site scheduling method and system based on heterogeneous multi-agent reinforcement learning provided by this invention builds a 4D BIM-based simulation environment and solves the multi-objective optimization problem of heterogeneous equipment under spatiotemporal constraints through an improved soft Actor-Critic algorithm. As shown in Figure 1, the method includes the following steps: S10: Based on Building Information Modeling (BIM) data and the construction schedule, analyze the spatial geometry of the construction site, match it with the construction schedule, and construct a site topology network. The site topology network is a node-edge graph structure obtained by parsing according to the IFC standard and matching against the schedule, and it includes storage yard, lifting point, and working face nodes. A dynamic construction task flow is then generated, which integrates the push mechanism, the pull mechanism, and the dependency chain mechanism.

[0030] The construction scheduling problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP), defined as the tuple $\langle S, A, P, R, \Omega, O, \gamma \rangle$, where: $S$ is the global state space, which includes the state of all tasks, the positions of all agents, and the buffer inventory levels; $A$ is the joint action space; $P$ is the state transition probability function; $R$ is the multi-objective reward function; $\Omega$ is the observation space; $O$ is the observation probability function, which defines the probability of an agent obtaining a given local observation in a specific state; and $\gamma$ is the discount factor.

[0031] As shown in part A of Figure 2, the system constructs a site topology network containing three types of material supply and demand nodes plus tower crane nodes (the arrow direction represents only the topological relationship, not a fixed logistics direction):
Yard node: the storage location of materials, which is also the pick-up point for AGVs.
Lifting point node: the intermediate handover point between AGV unloading and tower crane pick-up, equivalent to a logistics transfer buffer.
Working face node: the actual construction location.
Tower crane node: the location of the tower crane base.
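
As an illustrative sketch only, the node-edge structure described in this paragraph could be held in a small Python container such as the one below. The class names `NodeType` and `SiteTopology`, the node identifiers, and the choice of undirected edges are assumptions made for illustration; the patent text does not specify a data structure.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    YARD = "yard"                # material storage, AGV pick-up point
    LIFTING_POINT = "lifting"    # AGV unloading / tower crane pick-up buffer
    WORK_FACE = "work_face"      # actual construction location
    TOWER_CRANE = "tower_crane"  # tower crane base location


@dataclass
class SiteTopology:
    nodes: dict = field(default_factory=dict)  # node id -> NodeType
    edges: dict = field(default_factory=dict)  # node id -> set of neighbour ids

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type
        self.edges.setdefault(node_id, set())

    def connect(self, a, b):
        # Assumption: edges are undirected, since the arrows express
        # topology rather than a fixed logistics direction.
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added before connecting")
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbours(self, node_id, node_type=None):
        # Sorted neighbour ids, optionally filtered by node type.
        return sorted(n for n in self.edges[node_id]
                      if node_type is None or self.nodes[n] is node_type)


# Hypothetical four-node site.
site = SiteTopology()
site.add_node("yard_1", NodeType.YARD)
site.add_node("lift_1", NodeType.LIFTING_POINT)
site.add_node("face_3F", NodeType.WORK_FACE)
site.add_node("tc_1", NodeType.TOWER_CRANE)
site.connect("yard_1", "lift_1")   # AGV route from yard to lifting point
site.connect("lift_1", "tc_1")     # crane can reach the lifting point
site.connect("tc_1", "face_3F")    # crane can reach the working face
```

A query such as `site.neighbours("tc_1", NodeType.WORK_FACE)` then lists the working faces a crane can serve, which is the kind of lookup a task generator would need.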

[0032] To simulate the non-stationarity of real construction, this embodiment designs three task generation mechanisms (as shown in part B of Figure 2):
Schedule-driven push mechanism: generates hoisting tasks that follow a bimodal time distribution derived from the construction plan.
Inventory-driven pull mechanism: monitors the buffer zone in real time. When the buffer inventory falls below the preset reorder point or below the daily demand ratio, an AGV (horizontal transport) replenishment task is triggered automatically.
Dependency chain mechanism: based on the construction process sequence (such as "reinforcement-formwork-concrete"), constructs the predecessor dependencies of material requirements on the critical path.
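
The pull and dependency chain mechanisms can be sketched as follows. This is a minimal illustration under stated assumptions: the `Buffer` fields, the `demand_ratio` default of 0.5, the task dictionary layout, and the function names are all hypothetical, and the push mechanism's bimodal distribution is omitted because its parameters are not given in the text.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    material: str
    stock: float
    reorder_point: float
    daily_demand: float
    demand_ratio: float = 0.5  # hypothetical share of daily demand


def needs_replenishment(buf):
    # Pull rule: stock below the reorder point, or below the configured
    # share of daily demand.
    return (buf.stock < buf.reorder_point
            or buf.stock < buf.demand_ratio * buf.daily_demand)


def pull_tasks(buffers, pending):
    # One AGV replenishment task per low buffer that has none pending.
    tasks = []
    for buf in buffers:
        if needs_replenishment(buf) and buf.material not in pending:
            tasks.append({"type": "agv_replenish", "material": buf.material})
    return tasks


# Dependency chain: a process is released only after all its predecessors.
PROCESS_CHAIN = ["reinforcement", "formwork", "concrete"]


def released_processes(completed):
    released = []
    for i, proc in enumerate(PROCESS_CHAIN):
        if proc in completed:
            continue
        if all(p in completed for p in PROCESS_CHAIN[:i]):
            released.append(proc)
    return released


buffers = [Buffer("rebar", 2.0, 5.0, 8.0), Buffer("formwork", 10.0, 5.0, 8.0)]
```

With these example buffers, only the rebar buffer is below its reorder point, so a single replenishment task is created, and it is suppressed if one is already pending.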

[0033] S20: Based on the site topology network constructed in step S10 and the generated dynamic construction task flow, construct a heterogeneous multi-agent scheduling model. Heterogeneous policy networks are constructed separately for tower cranes (TCs) and horizontal transport vehicles (AGVs). As shown in Figure 4, the scheduling model includes a first policy network for the AGVs and a second policy network for the TCs. The first policy network is a perception network for local task scheduling. It adopts a standard multilayer perceptron (MLP) architecture with two hidden layers of 128 nodes each. The ReLU activation function is used between layers, and layer normalization is introduced to suppress internal covariate shift and accelerate network convergence. The network directly maps the local observation vector, which contains the AGV body state and local task features, to discrete action commands. Since the AGV is mainly constrained by ground path planning and inventory thresholds, its observation space does not contain complex aerial collision avoidance semantics.

[0034] The second policy network is a deep perceptron network designed for high-dimensional complex scheduling. It employs a deep multilayer perceptron architecture with independently configured hidden layer dimensions (node counts [256, 256, 128]), layer normalization, and a dropout mechanism, which improve generalization to complex states and prevent overfitting when handling high-density conflict states. Furthermore, the second policy network is equipped with a feature extraction module for extracting spatiotemporal conflict semantic features. These features include: the conflict state bit, indicating whether the agent is currently within the work area of other peer agents; the window distance feature, indicating the normalized time step from the agent's current task progress to entering or leaving the conflict time window; and the static advantage feature, indicating the job priority determined based on static engineering constraints.
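
To make the two layer configurations concrete, the following dependency-free Python sketch builds both networks and runs a forward pass. It is illustrative only: the observation sizes, the number of actions, the weight initialization, and the ordering linear, then layer normalization, then ReLU are assumptions, and dropout is left out because it is inactive at inference time. A real implementation would normally use a deep learning framework.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learned scale).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def make_mlp(sizes, rng):
    # One (weights, biases) pair per pair of consecutive layer sizes.
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-bound, bound) for _ in range(n_in)]
             for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def forward(layers, x):
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if i < len(layers) - 1:  # hidden layers: LayerNorm then ReLU
            x = [max(0.0, v) for v in layer_norm(x)]
    return x  # unnormalized action logits


rng = random.Random(0)
OBS_AGV, OBS_TC, N_ACTIONS = 12, 20, 6  # hypothetical dimensions
agv_policy = make_mlp([OBS_AGV, 128, 128, N_ACTIONS], rng)
tc_policy = make_mlp([OBS_TC, 256, 256, 128, N_ACTIONS], rng)
```

The only structural difference between the two policies here is the list of layer sizes, which mirrors the idea of independently configured hidden dimensions for the two agent types.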

[0035] That is, for the state space, a local observation vector is defined for each agent at each time step. For the tower crane agent, the observation vector includes its own motion state, its task state, and the key spatiotemporal conflict semantic features (the Conflict Triad). To quantify collision risk, this embodiment defines a conflict feature vector as follows:

[0036] where the three components of the vector are the conflict state bit, the window distance feature, and the static advantage feature.

[0037] Conflict state bit: a discrete state encoding that indicates whether the agent is currently in a collision zone.

[0038]

[0039] where discrete numerical coding is used to characterize the severity of collisions: 0.0 (Free) indicates no collision; 1.0 (Self-in) indicates that the local machine has entered the cross-operation area but the collaborating machine has not; 2.0 (Peer-in) indicates that the collaborating machine has entered the cross-operation area but the local machine has not; 3.0 (Deadlock) indicates that both machines have a tendency to enter the cross-operation area or have entered simultaneously, with a high risk of collision or deadlock. Window distance feature: indicates the normalized distance to entering or leaving the conflict time window. It is computed with truncated normalization, and its expression is:

[0040] where the expression involves the time step from the current task progress of the local machine to the boundary of the conflict window, the time step for the collaborating tower crane to reach the window boundary, and a normalization time constant whose value can be preset according to the on-site working conditions.

[0041] Static advantage feature: helps the neural network quickly identify priorities based on static rules. Its expression is:

[0042] where the expression involves the base priority scores of the local task and the collaborating task, respectively, and a scaling factor.
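
The Conflict Triad can be illustrated with the Python sketch below. Only the four state codes (0.0, 1.0, 2.0, 3.0) come from the text. The exact expressions for the window distance and static advantage features are not recoverable from this record, so the clipped time difference and the scaled priority difference used here are assumptions, as are all function and parameter names.

```python
def conflict_state(self_in, peer_in, self_entering=False, peer_entering=False):
    # Encode the conflict state bit using the four codes given in the text.
    self_side = self_in or self_entering
    peer_side = peer_in or peer_entering
    if self_side and peer_side:
        return 3.0  # Deadlock: both inside or both tending to enter
    if self_in:
        return 1.0  # Self-in
    if peer_in:
        return 2.0  # Peer-in
    return 0.0      # Free


def window_distance(t_self, t_peer, t_norm):
    # Assumed form: difference of the two times-to-boundary, divided by
    # the normalization constant and truncated to [-1, 1].
    raw = (t_peer - t_self) / t_norm
    return max(-1.0, min(1.0, raw))


def static_advantage(p_self, p_peer, scale):
    # Assumed form: scaled difference of the base priority scores.
    return scale * (p_self - p_peer)


def conflict_triad(self_in, peer_in, t_self, t_peer, t_norm,
                   p_self, p_peer, scale):
    # Assemble the three-component conflict feature vector.
    return [conflict_state(self_in, peer_in),
            window_distance(t_self, t_peer, t_norm),
            static_advantage(p_self, p_peer, scale)]
```

Truncation keeps the feature bounded when one crane is far from the conflict window, which helps keep network inputs on a comparable scale.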

[0043] S30: The scheduling model obtained in step S20 is trained using a phased evolution strategy. The entire training process relies on the custom training architecture shown in Figure 3 and comprises four phases: capability building, safety constraint guidance, strategy internalization, and strategy correction. The capability building phase mitigates blind initial exploration by dynamically adjusting task density. The safety constraint guidance phase corrects high-risk actions using the aforementioned constraint mechanism. The strategy internalization phase relies on environmental reward feedback to optimize the model autonomously. The strategy correction phase masks inefficient actions so that the agent does not settle into an overly conservative, suboptimal standby state merely to avoid collision penalties. Specifically:
First phase (capability building): the curriculum scheduler is activated. Early in training, the task density coefficient and the urgency of deadlines are increased gradually, so that the agent's basic task processing ability is trained in a low-conflict environment and the agent can focus on learning basic pick-up, unloading, and navigation capabilities.
Second phase (safety constraint guidance): a proactive conflict defense expert mechanism is activated, as shown in part A of Figure 5. This mechanism predicts potential collisions at the decision-making level. When the predicted conflict feature is less than the safety threshold, it calculates the priorities of both parties based on the "five yields" rules (such as the lower tower yielding to the higher tower, and the later arrival yielding to the earlier one), forces the lower-priority party to take a pause action, and guides the policy network through imitation learning so that it learns the safety rules quickly.
Third phase (strategy internalization): the proactive conflict defense expert mechanism is removed. The agent relies solely on the reward feedback of reinforcement learning to internalize the external rules into its own decision parameters, and thereby internalizes conflict avoidance strategies.
Fourth phase (strategy correction): a strategy supervisor is activated, as shown in part B of Figure 5. Suboptimal actions such as prolonged idling or ignoring feasible high-value tasks are identified and prohibited, and the strategy is corrected with respect to long-term planning. Specifically, to counter the passive, conservative strategies that may arise in the later stages of training (that is, remaining idle for the sake of absolute safety), the supervisor monitors the dwell time of the task queue. When a feasible task is detected to have been shelved for longer than a threshold, the "standby" option is filtered out (disabled) by dynamic action masking, forcing the agent to explore a better scheduling sequence.
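
A minimal sketch of how the four phases might toggle the curriculum scheduler, the expert mechanism, and the supervisor is given below, together with the supervisor's standby masking rule. The phase boundaries (25%, 50%, 75% of training), the assumption that the curriculum flag is switched off after the first phase, the index 0 for the standby action, and all names are hypothetical; the text does not state when each phase begins or ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    name: str
    curriculum: bool  # curriculum scheduler ramps task density and urgency
    expert: bool      # proactive conflict defense expert may override actions
    supervisor: bool  # strategy supervisor may mask the standby action


# Hypothetical boundaries expressed as fractions of total training steps.
PHASES = [
    (0.25, PhaseConfig("capability_building", True, False, False)),
    (0.50, PhaseConfig("safety_constraint_guidance", False, True, False)),
    (0.75, PhaseConfig("strategy_internalization", False, False, False)),
    (1.00, PhaseConfig("strategy_correction", False, False, True)),
]


def phase_for(step, total_steps):
    progress = step / total_steps
    for upper, cfg in PHASES:
        if progress < upper:
            return cfg
    return PHASES[-1][1]


def supervisor_mask(mask, oldest_feasible_wait, wait_threshold, standby_index=0):
    # Disable standby when a feasible task has waited too long, but only
    # if at least one other action remains available.
    has_alternative = any(m for i, m in enumerate(mask) if i != standby_index)
    if oldest_feasible_wait > wait_threshold and has_alternative:
        return [0 if i == standby_index else m for i, m in enumerate(mask)]
    return list(mask)
```

Keeping standby available when it is the only legal action avoids producing an all-zero mask, which would leave the agent with nothing to choose.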

[0044] In this embodiment, an entropy-regularized multi-agent reinforcement learning strategy with staged evolutionary training is employed, preferably implemented using a soft Actor-Critic structure. To address the problem that multi-agent systems in dense conflict environments easily become trapped in local optima and explore inefficiently, this invention designs the four-stage evolutionary training strategy shown in Figure 6.

[0045] Specifically, a curriculum factor is introduced to dynamically adjust task difficulty during training. The task generation density function obeys:

[0046] During the safety constraint guidance phase, a proactive conflict defense expert mechanism is introduced. Given the expert policy and the agent's own policy, expert intervention is activated when a potential collision is detected (that is, when the predicted conflict features fall below the safety threshold), and a hybrid policy is employed, expressed as:

[0047] In the formula, the probability of expert intervention is determined by the curriculum progress. The expert arbitration logic is based on the "five yields" principles for safe tower crane operation in the construction field (that is, the lower tower yields to the higher tower, the later-entering tower yields to the earlier-entering tower, the lightly loaded tower yields to the heavily loaded tower, the unloaded tower yields to the loaded tower, and the moving tower yields to the stationary tower). The outcome is decided by comparing the priority scores of the conflicting parties:

[0048] In the formula, the weighting coefficients correspond to the height, the load weight, and the time difference of entering the overlapping area, respectively; the variables are the tower crane height, the load weight, the entry time, and an indicator of the load status. The party with the lower score is judged to be the one that must give way and is forced to perform a "pause" action.
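
The arbitration step can be sketched in Python as follows. The text states only that the score weights height, load weight, and entry time, and that the lower score yields; the linear form, the default weights, the use of time already spent in the overlap area as the entry-time term, and every name below are assumptions for illustration. The stochastic override with probability `beta` is likewise one possible reading of the hybrid policy.

```python
import random
from dataclasses import dataclass


@dataclass
class CraneState:
    name: str
    height: float       # tower crane height
    load_weight: float  # weight of the current load
    entry_time: float   # time step of entering the overlapping area
    loaded: bool        # load status indicator


def priority_score(c, now, w_h=1.0, w_l=1.0, w_t=0.1):
    # Assumed linear score: higher tower, heavier load, and earlier entry
    # all raise the priority; an unloaded crane contributes no load term.
    load_term = c.load_weight if c.loaded else 0.0
    time_in_area = now - c.entry_time
    return w_h * c.height + w_l * load_term + w_t * time_in_area


def arbitrate(a, b, now):
    # Return (winner, yielder); the lower score must pause.
    if priority_score(a, now) >= priority_score(b, now):
        return a, b
    return b, a


def apply_expert(agent_action, must_yield, beta, rng, pause_action=0):
    # Hybrid policy: with probability beta the expert forces a pause on
    # the yielding party; otherwise the agent's own action is kept.
    if must_yield and rng.random() < beta:
        return pause_action
    return agent_action


tc_high = CraneState("TC1", height=80.0, load_weight=2.0, entry_time=10.0, loaded=True)
tc_low = CraneState("TC2", height=60.0, load_weight=0.0, entry_time=5.0, loaded=False)
winner, yielder = arbitrate(tc_high, tc_low, now=20.0)
```

In this example the higher, loaded crane wins even though the other crane entered earlier, because the assumed weights make height dominate; different weights would change the outcome.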

[0049] For model parameter optimization during the phased evolution, this embodiment employs the Adaptive Moment Estimation (Adam) optimizer and configures a linear learning rate decay strategy based on the number of training steps. For the critic (evaluation) network, the mean squared error is used to calculate the temporal-difference loss of the twin soft Q-networks. For the policy network, to internalize the prior rules effectively, the loss function combines the policy gradient loss of maximum entropy reinforcement learning with a behavior cloning cross-entropy loss, weighted by an expert guidance coefficient. The behavior cloning term minimizes the cross-entropy between the policy distribution output by the network and the actions mandated by the expert. The expert guidance coefficient dynamically balances the agent's autonomous exploration of the environment against the cloning of hard-constrained behaviors based on prior knowledge.
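
The combined policy loss and the linear learning rate decay can be illustrated for a single state with a discrete action set. This sketch assumes the common discrete soft Actor-Critic policy objective, the sum over actions of pi(a) * (alpha * log pi(a) - Q(a)), and an additive behavior cloning term scaled by a coefficient named `lam` here; these forms and all names are assumptions, since the original formula is not reproduced in this record.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bc_cross_entropy(logits, expert_action):
    # Cross-entropy between the policy distribution and the expert action.
    return -math.log(softmax(logits)[expert_action])


def discrete_sac_policy_loss(logits, q_values, alpha):
    # Expected value under the policy of alpha * log pi(a) - Q(a).
    probs = softmax(logits)
    return sum(p * (alpha * math.log(p) - q) for p, q in zip(probs, q_values))


def combined_policy_loss(logits, q_values, alpha, lam, expert_action=None):
    # Add the behavior cloning term only when the expert intervened.
    loss = discrete_sac_policy_loss(logits, q_values, alpha)
    if expert_action is not None:
        loss += lam * bc_cross_entropy(logits, expert_action)
    return loss


def linear_lr(step, total_steps, lr_start, lr_end=0.0):
    # Linear decay from lr_start to lr_end over the training run.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)
```

Setting `lam` to zero recovers the pure maximum entropy objective, which corresponds to the strategy internalization phase in which the expert is removed.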

[0050] S40: At the actual decision-making moment, acquire local observation data of the on-site agents in real time, generate dynamic action masks, and output discrete scheduling instructions through the scheduling model trained in step S30 to drive the tower cranes (TCs) and horizontal transport vehicles (AGVs) to perform collaborative operations. Regarding action space modeling, unlike traditional continuous motion control, the action space of this invention is modeled as a set of high-level discrete scheduling instructions to accommodate long-horizon construction tasks.

[0051]

[0052] The pause action indicates pause / idle. This action is selected when the agent is at a conflict disadvantage or has no tasks to perform.

[0053] A task-selection action indicates that the corresponding task in the candidate list is selected and executed.

[0054] This method employs a multi-timescale architecture that separates decision-making from control, as shown in Figure 7. At each decision point (e.g., every minute), the agent outputs a discrete scheduling action (e.g., selecting a specific task or pausing) based on local observations. During the subsequent physics simulation cycle (e.g., 60 seconds), the system continuously executes this action and performs real-time collision detection and state updates.
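
The separation of the slow decision loop from the fast simulation loop can be sketched with a toy environment. `ToySim`, its single work counter, and the trivial policy are hypothetical stand-ins; a real simulator would perform collision detection and state updates inside `tick`.

```python
class ToySim:
    # Toy physics layer: one job with a fixed amount of remaining work.
    def __init__(self, remaining=90):
        self.clock = 0
        self.remaining = remaining

    def observe(self):
        return {"remaining": self.remaining}

    def tick(self, action):
        # One fast step (e.g. one second); collision checks would go here.
        self.clock += 1
        if action == 1 and self.remaining > 0:
            self.remaining -= 1


def work_if_needed(obs):
    # Action 1 = execute the task, action 0 = pause / idle.
    return 1 if obs["remaining"] > 0 else 0


def run(sim, policy, n_decisions, ticks_per_decision=60):
    log = []
    for _ in range(n_decisions):
        action = policy(sim.observe())       # slow loop: one decision
        for _ in range(ticks_per_decision):  # fast loop: hold the action
            sim.tick(action)
        log.append(action)
    return log


sim = ToySim()
actions = run(sim, work_if_needed, n_decisions=3)
```

The policy is queried three times while the simulator advances 180 ticks, so decision cost scales with the number of decision points rather than with the physics step rate.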

[0055] Local observation data of the on-site agents are acquired in real time, and a dynamic action mask is generated based on the current material inventory status, equipment load status, and task feasibility. The dynamic action mask is a binary action mask used to filter out actions that violate physical constraints, including insufficient material inventory, a full buffer zone, and a tower crane lifting without a load.
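
A binary action mask of this kind might be built as in the Python sketch below. The task dictionary layout, the convention that index 0 is the pause action, and the treatment of nodes without a listed buffer capacity as unconstrained are assumptions. The "lifting without a load" case is modeled here as a lifting task whose source node holds no stock.

```python
def build_action_mask(candidates, stock, buffer_free):
    # Index 0 is pause / idle and is always allowed in this sketch.
    mask = [1]
    for task in candidates:
        # Insufficient inventory at the source (also covers a crane asked
        # to lift from an empty lifting point).
        enough_stock = stock.get(task["source"], 0) >= task["quantity"]
        # Full buffer at the target; unlisted targets are unconstrained.
        has_room = buffer_free.get(task["target"], float("inf")) >= task["quantity"]
        mask.append(1 if enough_stock and has_room else 0)
    return mask


def masked_argmax(scores, mask):
    # Pick the highest-scoring action among those the mask allows.
    best = None
    for i, (s, m) in enumerate(zip(scores, mask)):
        if m and (best is None or s > scores[best]):
            best = i
    return best


candidates = [
    {"source": "yard_1", "target": "lift_1", "quantity": 2},   # feasible
    {"source": "lift_1", "target": "face_3F", "quantity": 1},  # nothing to lift
    {"source": "yard_2", "target": "lift_1", "quantity": 5},   # buffer too full
]
stock = {"yard_1": 10, "lift_1": 0, "yard_2": 10}
buffer_free = {"lift_1": 3}
mask = build_action_mask(candidates, stock, buffer_free)
```

Applying the mask before action selection means infeasible actions are never sampled, so the policy does not need to learn these physical constraints from penalties.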

[0056] S50: In step S40, during the collaborative operation of the tower crane and the horizontal transport vehicle, the multi-objective reward function value is calculated based on the action execution feedback of the agent, and the policy network parameters are updated.

[0057] After scheduling actions are executed at the physical simulation layer, the system calculates the multi-objective reward function value based on environmental feedback. The reward function is the sum of four components. First layer: core task incentives. This layer primarily drives progress completion and upstream / downstream collaboration, including task completion rewards and critical path task multiplier rewards.

[0058] The terms of this layer are: the basic reward for completing a task; an indicator of whether the task is on the critical path; the critical task multiplier factor; and a collaborative reward, which is triggered when an AGV's delivery successfully unlocks a material-waiting task of the downstream TC.

[0059] Second layer: Operational efficiency incentives

[0060] This layer focuses on the timeliness of tasks and the rationality of actions, including penalties for delays, rewards for early completion, and penalties for unnecessary suspensions.

[0061]

[0062] where the terms are: the delay penalty coefficient; the actual time step at which the task is completed; the deadline specified for the task; the base reward for early completion; the base penalty for unnecessary idleness; and two Boolean indicator functions indicating, respectively, whether the task is completed ahead of schedule and whether the agent is in an unnecessary idle state.

[0063] Third layer: Safety and compliance incentives. This layer provides dense feedback signals to smooth training, including rewards for correct avoidance based on the "five yields" rule, penalties for violations, and penalties for dangerous approach.

[0064] where the terms are: a small positive incentive for continuing to work; a small penalty for standby; two Boolean indicator functions; and a penalty for abusing the pause action in non-conflict zones.

[0065] Provide small positive incentives for each step of the work process, and make negative adjustments for unnecessary pauses.

[0066] Fourth layer: Collaboration incentives. This layer is the core of tower crane operations and aims to strengthen safety standards, including incentives for upstream and downstream supply and demand coordination between horizontal transport vehicles and tower cranes.

[0067]

[0068] By providing high positive rewards for proactive avoidance behavior, instead of traditional sparse collision penalties, agents are guided to shift from "passive collision avoidance" to "proactive collaboration".
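Because the reward formulas did not survive in this text, the following is only a sketch of how a four-layer sum could be organized. Every coefficient value, field name, and functional form below is an assumption. The layer assignment follows claim 9, which places the upstream/downstream cooperation reward in the fourth layer.

```python
from dataclasses import dataclass


@dataclass
class StepFeedback:
    """Per-step environment feedback (all field names are hypothetical)."""
    task_done: bool = False
    on_critical_path: bool = False
    unlocked_downstream: bool = False  # AGV delivery unlocked a waiting TC task
    finish_step: int = 0
    deadline_step: int = 0
    unnecessary_idle: bool = False
    correct_yield: bool = False        # yielded per the "five yields" rule
    violation: bool = False
    dangerous_approach: bool = False


# ASSUMED coefficients; the source gives no numeric values.
R_BASE, K_CRIT, R_COLLAB = 10.0, 2.0, 5.0
C_DELAY, R_EARLY, P_IDLE = 0.5, 2.0, 1.0
R_YIELD, P_VIOLATION, P_APPROACH = 3.0, 5.0, 1.0


def core_task(fb):
    # Layer 1: completion reward, multiplied on the critical path.
    if not fb.task_done:
        return 0.0
    return R_BASE * (K_CRIT if fb.on_critical_path else 1.0)


def efficiency(fb):
    # Layer 2: delay penalty, early-completion bonus, idle penalty.
    r = 0.0
    if fb.task_done:
        r -= C_DELAY * max(0, fb.finish_step - fb.deadline_step)
        if fb.finish_step < fb.deadline_step:
            r += R_EARLY
    if fb.unnecessary_idle:
        r -= P_IDLE
    return r


def safety(fb):
    # Layer 3: reward correct yielding, penalize violations and near misses.
    return (R_YIELD * fb.correct_yield
            - P_VIOLATION * fb.violation
            - P_APPROACH * fb.dangerous_approach)


def cooperation(fb):
    # Layer 4: AGV-to-crane supply and demand coordination.
    return R_COLLAB if fb.unlocked_downstream else 0.0


def total_reward(fb):
    return core_task(fb) + efficiency(fb) + safety(fb) + cooperation(fb)


good = StepFeedback(task_done=True, on_critical_path=True, unlocked_downstream=True,
                    finish_step=8, deadline_step=10, correct_yield=True)
late = StepFeedback(task_done=True, finish_step=14, deadline_step=10, violation=True)
```

Keeping each layer as its own function makes it possible to log the four terms separately during training, which helps when checking that no single layer dominates the sum.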

[0069] The on-site construction scheduling system based on heterogeneous multi-agent reinforcement learning provided in this embodiment includes: The simulation environment construction module is used to collect on-site building information model (BIM) data and schedule plans, and generate a simulation environment that includes storage yards and lifting points for task generation. The heterogeneous model training module is used to implement the training steps of the method described above and generate a policy network that includes conflict feature extraction. The decision execution module is used to receive real-time on-site status and output dispatch instructions to each construction machinery terminal.

[0070] A computer-readable storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described above.

[0071] To fully verify the effectiveness and practicality of the method of the present invention, this embodiment conducted multi-benchmark comparison experiments in a simulation environment based on real data, and performed dynamic simulation verification based on real engineering data in a large-scale actual engineering project in Wuhan.

[0072] In the phased evolution training process, the core training parameters configuration and selection basis of this embodiment are as follows: (1) The discount factor is set to 0.99: Given that construction scheduling is a long-term continuous decision-making process, a high discount factor helps the agent focus on long-term progress goals and avoids overfitting short-term local rewards in policy updates. (2) The experience replay buffer capacity is set to 100,000 steps: Using a buffer of moderate capacity can effectively prevent the historical state transition samples under simple working conditions in the early stage from interfering with policy learning under complex and high-conflict working conditions in the later stage. (3) Target network soft update and automatic entropy adjustment: The entropy coefficient of the policy distribution is dynamically adjusted by an automatic temperature adjustment mechanism to take into account both the sufficient state space exploration in the early stage of training and the stable policy utilization in the later stage of training.
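The settings in this paragraph can be collected into a small configuration sketch. Soft target updates and automatic entropy adjustment are commonly associated with soft actor-critic style training, but the text does not name the algorithm, so that association is an assumption here. The values of tau, target_entropy, and alpha_lr are likewise assumed; only the discount factor and buffer capacity come from the text.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TrainConfig:
    gamma: float = 0.99             # discount factor, from the text
    buffer_capacity: int = 100_000  # replay buffer size in steps, from the text
    tau: float = 0.005              # ASSUMED soft-update rate
    target_entropy: float = -1.0    # ASSUMED target for entropy tuning
    alpha_lr: float = 3e-4          # ASSUMED learning rate for the temperature


def soft_update(target, online, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


def update_log_alpha(log_alpha, mean_log_prob, cfg):
    """One gradient step on the temperature loss
    J(alpha) = -alpha * (mean_log_prob + target_entropy),
    differentiated with respect to log_alpha."""
    grad = -math.exp(log_alpha) * (mean_log_prob + cfg.target_entropy)
    return log_alpha - cfg.alpha_lr * grad


cfg = TrainConfig()
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

When the policy's entropy falls below the target, the temperature rises and exploration is encouraged; when entropy is above the target, the temperature falls and the policy becomes more exploitative. This matches the early-exploration and late-exploitation behavior the paragraph describes.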

[0073] The following is an explanation through specific experiments. 1. Experiment 1: Simulation comparison experiment based on real data. A simulation environment was constructed using BIM topology data from a high-rise residential project, with 5 tower cranes and 4 automated guided vehicles (AGVs). High-density task flows from the project's historical construction logs were used as input (task conflict probability set to 0.85, representing a high-pressure working condition). The proposed method (MOHARL) was compared with three traditional scheduling strategies: first-in-first-out, i.e., first-come-first-served (FIFO), highest priority first (HPF), and shortest processing time first (SPT).

[0074] The experiment statistically analyzed the multidimensional indicators of each strategy over continuous operation cycles. The results showed that, thanks to the effective incentives for collaborative supply and demand among upstream and downstream equipment in the four-layer reward function, the method of this invention reduces the total operating cost by approximately 1.6% compared to the FIFO strategy while ensuring timely task completion. In the radar chart evaluation, the method of this invention achieves the best overall balance of cost, safety, and response speed, avoiding the shortcoming of single-rule strategies such as SPT, which is fast but has extremely poor safety.
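For reference, the three baseline dispatch rules can be sketched as simple selection functions over a task queue. The Task fields, the convention that a larger priority value is more important, and the tie-breaking by arrival time are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    arrival_step: int    # when the request entered the queue
    priority: int        # ASSUMED: larger means more important
    duration_steps: int  # estimated processing time


def fifo(queue):
    # First-in-first-out: earliest arrival wins.
    return min(queue, key=lambda t: t.arrival_step)


def hpf(queue):
    # Highest priority first; ties broken by earlier arrival (assumed).
    return min(queue, key=lambda t: (-t.priority, t.arrival_step))


def spt(queue):
    # Shortest processing time first; ties broken by earlier arrival (assumed).
    return min(queue, key=lambda t: (t.duration_steps, t.arrival_step))


queue = [Task("a", 0, 1, 9), Task("b", 1, 5, 4), Task("c", 2, 3, 2)]
picks = (fifo(queue).task_id, hpf(queue).task_id, spt(queue).task_id)
```

Each rule looks at a single attribute of the queue and ignores spatial conflicts between cranes, which is the limitation the learned policy is meant to address.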

[0075] Furthermore, to verify the practical effect of the proposed technical solution and the necessity of each core module, this embodiment conducted an ablation comparison experiment on the four-stage collaborative evolution mechanism. Experimental data and evolution curve analysis show that: if the "curriculum guidance" mechanism is removed, the agent is prone to policy convergence failure when facing initial high-density concurrent tasks; if the "safety constraint guidance" stage is removed, the tower crane agent's spatial collision rate increases sharply in the early training stage, making it difficult to learn safe avoidance actions independently; if the "strategic supervision" stage is removed, the agent is prone to developing conservative suboptimal strategies that remain in standby mode for extended periods in the later training stages. These ablation experiments clarify the specific role of each training stage in suppressing collisions and improving efficiency, and the comparative data support the necessity of the four-stage evolution scheme of this invention and its technical effect in actual engineering scheduling.
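The four training stages, as set out in claim 7, can be sketched as a phase schedule with an expert override that is active only in the second stage. The phase boundaries, the linear density ramp, and all names below are assumptions; the source does not state how long each stage lasts.

```python
from enum import Enum


class Phase(Enum):
    CURRICULUM = 1     # curriculum scheduler, low-conflict environment
    EXPERT_GUIDED = 2  # conflict-defense expert forces pauses
    AUTONOMOUS = 3     # expert removed, reward feedback only
    SUPERVISED = 4     # strategic monitor blocks prolonged idling


# ASSUMED boundaries, as fractions of total training steps.
BOUNDARIES = [(0.25, Phase.CURRICULUM), (0.50, Phase.EXPERT_GUIDED),
              (0.75, Phase.AUTONOMOUS), (1.00, Phase.SUPERVISED)]
PAUSE = 0  # ASSUMED index of the pause action


def phase_at(progress):
    """Map training progress in [0, 1] to the active phase."""
    for upper, phase in BOUNDARIES:
        if progress < upper:
            return phase
    return Phase.SUPERVISED


def task_density(progress, start=0.2, end=1.0):
    """ASSUMED linear ramp of the task density coefficient during phase 1."""
    frac = min(1.0, progress / BOUNDARIES[0][0])
    return start + (end - start) * frac


def apply_expert(phase, action, conflict_feature, threshold, must_yield):
    """In phase 2 only, force PAUSE when the predicted conflict feature is
    below the safety threshold and this agent is the yielding party."""
    if phase is Phase.EXPERT_GUIDED and must_yield and conflict_feature < threshold:
        return PAUSE
    return action


forced = apply_expert(phase_at(0.3), action=3, conflict_feature=0.1,
                      threshold=0.3, must_yield=True)
free = apply_expert(phase_at(0.6), action=3, conflict_feature=0.1,
                    threshold=0.3, must_yield=True)
```

In the third stage the same situation no longer triggers an override, so any pause must come from the policy itself, which is the behavior the ablation study evaluates.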

[0076] 2. Experiment 2: Dynamic Deduction and Verification of Engineering Examples (Case Study) The verification target is a large residential development project in Wuhan, which includes 12 individual buildings and 5 tower cranes, with complex overlapping areas for tower crane operations (overlap rate of approximately 60%). On-site workers send unstructured hoisting requests in real time via handheld terminals (e.g., "5 tons of steel bars are needed in Area A"), and the system needs to generate scheduling instructions in real time.

[0077] Using the BIM model of the project and a full set of actual work data for a specific day (including various task types such as heavy component hoisting and auxiliary material transfer between work surfaces), decision-making simulations were conducted, focusing on examining the system's safety and robustness in the face of real, dynamic, and non-stationary human demands.

[0078] Throughout the day's operation, the system detected multiple risks of spatial conflicts involving tower cranes. Verification results showed that all tower cranes involved in the conflicts autonomously issued a "pause" command via the policy network before entering the overlapping area, achieving a 100% conflict-aware pause rate, and no conflict escalated into dangerous concurrent operations. Despite removing the mandatory intervention of the expert mechanism during the simulation phase, the agent still exhibited avoidance behavior conforming to the "five yields" principle (e.g., lower towers yielding to higher towers), demonstrating that the phased training strategy successfully internalized safety rules into the agent's autonomous decision-making capability.

[0079] In summary, comparative experiments have demonstrated the advantages of this invention in reducing costs and increasing efficiency, while engineering examples have confirmed its extremely high safety and robustness under real and complex working conditions.

[0080] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A construction site scheduling method based on heterogeneous multi-agent reinforcement learning, characterized in that it includes the following steps: S10: Based on building information model data and the construction schedule, analyze the spatial geometry information of the construction site, construct the site topology network, and generate a dynamic construction task flow; S20: Based on the site topology network constructed in step S10 and the generated dynamic construction task flow, construct a heterogeneous multi-agent scheduling model, wherein the scheduling model includes a first policy network for horizontal transport vehicles and a second policy network for tower cranes; S30: Train the scheduling model obtained in step S20 using a phased evolution strategy; S40: At the actual decision-making moment, acquire local observation data of the on-site agents in real time, and output discrete scheduling instructions through the scheduling model trained in step S30 to drive the tower cranes and horizontal transport vehicles to perform collaborative operations; S50: During the collaborative operation of the tower cranes and horizontal transport vehicles in step S40, calculate the multi-objective reward function value based on the action execution feedback of the agents, and update the policy network parameters.

2. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 1, characterized in that, In step S10, the on-site topology network is a node-edge graph structure obtained by parsing based on the IFC standard; the generation of the dynamic construction task flow integrates push mechanism, pull mechanism and dependency chain mechanism.

3. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 1, characterized in that, In step S20, the first policy network adopts a standard multilayer perceptron architecture, which maps the local observation vector containing the horizontal transport vehicle's body state and local task features into discrete action commands.

4. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 1, characterized in that, in step S20, the second policy network adopts a deep multilayer perceptron architecture, which includes independently configured hidden layer dimensions and layer normalization mechanisms, and is configured with a feature extraction module for extracting spatiotemporal conflict semantic features. The spatiotemporal conflict semantic features include: a conflict status bit, used to indicate whether the agent is currently in the working area of other peer agents; a window distance feature, used to indicate the normalized time step of the agent's current task progress from entering or leaving the conflict time window; and a static advantage feature, used to indicate the priority of tasks determined based on static engineering constraints.

5. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 4, characterized in that the spatiotemporal conflict semantic feature vector is calculated by a formula whose terms are the conflict status bit, the window distance feature, and the static advantage feature.

6. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 5, characterized in that: the conflict status bit uses discrete numerical coding to characterize the severity of the conflict, where 0.0 (Free) indicates no conflict; 1.0 (Self-in) indicates that the local machine has entered the cross-operation area but the collaborating machine has not; 2.0 (Peer-in) indicates that the collaborating machine has entered the cross-operation area but the local machine has not; and 3.0 (Deadlock) indicates that both machines tend to enter the cross-operation area or have entered it simultaneously, with a high risk of collision or deadlock; the window distance feature is expressed in terms of the time step from the local machine's current task progress to the boundary of the conflict window, the time step for the collaborating tower crane to reach the window boundary, and a time normalization constant; and the static advantage feature is expressed in terms of the base priority scores of the local task and the collaborating task, respectively, and a scaling factor.

7. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 1, characterized in that the phased evolution strategy in step S30 includes: in the first stage, a curriculum scheduler is activated, and the agent's basic task processing ability is trained in a low-conflict environment by gradually increasing the task density coefficient and the urgency of deadlines; in the second stage, an active conflict defense expert mechanism is activated, and when the predicted conflict feature is less than the safety threshold, the yielding party is forced to take a pause action based on preset priority rules, with the policy network guided by imitation learning; in the third stage, the active conflict defense expert mechanism is removed, and the agent internalizes conflict avoidance strategies relying solely on the reward feedback of reinforcement learning; in the fourth stage, a strategic monitor is activated to identify and prohibit suboptimal actions such as remaining idle or ignoring tasks for a long time, and to carry out long-term planning and correction of the policy.

8. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 1, characterized in that, Step S40 also includes generating a dynamic action mask, specifically by acquiring local observation data of the on-site intelligent agent in real time and generating a dynamic action mask based on the current material inventory status, equipment load status, and task feasibility; the dynamic action mask is a binary action mask used to filter out actions that violate physical constraints, including insufficient material inventory, full buffer zone, and tower crane lifting without load.

9. The construction site scheduling method based on heterogeneous multi-agent reinforcement learning as described in claim 1, characterized in that, In step S50, the multi-objective reward function value is calculated by adding the four-layer reward system. Specifically, the first layer is the core task incentive, including task completion reward and critical path task multiplier reward; the second layer is the operational efficiency incentive, including delay penalty, early completion reward and unnecessary suspension penalty; the third layer is the safety and compliance incentive, including correct avoidance reward based on the "five yields" rule, violation operation penalty and dangerous approach penalty; and the fourth layer is the cooperation incentive, including upstream and downstream supply and demand cooperation reward for horizontal transport vehicles and tower cranes.

10. A construction site scheduling system based on heterogeneous multi-agent reinforcement learning, characterized in that it includes: a simulation environment construction module, used to collect on-site building information model data and schedule plans, and generate a simulation environment that includes storage yards and lifting points for task generation; a heterogeneous model training module, used to implement the training steps of the method as described in any one of claims 1 to 9, and generate a policy network including conflict feature extraction; and a decision execution module, used to receive real-time on-site status and output dispatch instructions to each construction machinery terminal.
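The spatiotemporal conflict features recited in claims 4 to 6 can be illustrated with the sketch below. Only the discrete coding of the conflict status bit comes from the text. The expressions for the window distance and static advantage features were lost, so the normalized differences used here are assumed forms, and all function and parameter names are hypothetical.

```python
def conflict_status(self_in, peer_in):
    """Discrete coding from claim 6: 0 Free, 1 Self-in, 2 Peer-in, 3 Deadlock.
    Simplification: each flag means 'in, or about to enter, the shared area'."""
    if self_in and peer_in:
        return 3.0
    if peer_in:
        return 2.0
    if self_in:
        return 1.0
    return 0.0


def window_distance(t_self, t_peer, t_norm):
    # ASSUMED form: normalized difference of the two times-to-boundary,
    # clipped to [-1, 1].
    return max(-1.0, min(1.0, (t_peer - t_self) / t_norm))


def static_advantage(p_self, p_peer, scale):
    # ASSUMED form: scaled difference of the base priority scores.
    return (p_self - p_peer) / scale


def conflict_features(self_in, peer_in, t_self, t_peer, t_norm,
                      p_self, p_peer, scale):
    """Three-element feature vector for the tower crane policy network."""
    return [conflict_status(self_in, peer_in),
            window_distance(t_self, t_peer, t_norm),
            static_advantage(p_self, p_peer, scale)]


features = conflict_features(True, False, t_self=5, t_peer=15, t_norm=20.0,
                             p_self=8, p_peer=4, scale=10.0)
```

In this example the local crane is already inside the shared area, reaches the window boundary earlier than its peer, and carries the higher-priority task.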