A heterogeneous robot collaborative scheduling method for space station multi-cabin sections

By adopting a hierarchical distributed collaborative architecture and a task commitment mechanism in the multi-module environment of the space station, the problems of state dimension growth and communication overhead in multi-agent scheduling methods are solved, achieving high efficiency and stability in task allocation, and improving the matching accuracy of multi-skill tasks and the targeting of resource allocation.

CN122287697A · Pending · Publication Date: 2026-06-26 · NANJING UNIV OF INFORMATION SCI & TECH


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: NANJING UNIV OF INFORMATION SCI & TECH
Filing Date: 2026-05-27
Publication Date: 2026-06-26

Patent Timeline

27 May 2026: Application
26 Jun 2026: Publication (CN122287697A)

IPC: G06N3/008; G06N7/01; G06N3/0499; G06F18/213; G06F18/22; B25J9/16; G06F123/02

AI Tagging

Technology Topics

Coscheduling; Distributed computing




AI Technical Summary

Technical Problem

Existing multi-agent scheduling methods for the multi-module environment of a space station suffer from rapid growth of the state dimension, high computational load, and difficulty in balancing the supply of local modules against global task demand. The coordination process also tends to introduce communication overhead and synchronization waiting. As a result, task allocations can violate physical rules and skill matching can be inaccurate.

Method used

A hierarchical distributed collaborative architecture based on a semi-Markov decision process is adopted. A global task overview vector, an agent overview vector, and a skill gap tension vector are constructed and concatenated to generate regional control parameters and module priority scores. The task graph is reduced using a three-dimensional localization mask, and matching and updating are performed through a shared virtual state space and a task commitment mechanism. This reduces the synchronization negotiation overhead under communication constraints and improves the feasibility and stability of task allocation.

Benefits of technology

It reduces the state space complexity in multi-module and multi-task concurrent scenarios, reduces ineffective exploration, improves the feasibility of task allocation and the stability of collaborative scheduling, enhances the targeting of skill gap identification and resource allocation, and ensures the efficiency of task allocation and collaborative effect.


Abstract

This invention discloses a heterogeneous robot collaborative scheduling method for multiple modules of a space station. It employs a hierarchical distributed collaborative architecture based on a semi-Markov decision process. Physical constraints are transformed into filtering conditions through a pre-defined physical constraint sub-mask that includes area access, resource capacity, deadline reachability, and skill matching. A task residual update mechanism mitigates the waiting and overhead issues caused by explicit synchronization negotiation under communication constraints. A task commitment mechanism is introduced, dynamically updating the task residual based on the confidence level of the preceding robot's commitment to the target task. Subsequent robots then fill in, replace, or re-match based on the updated task residual. When local matching enters an oscillating state or local matching convergence stalls, a skill gap tension vector reflecting the gap status of each skill dimension is generated and fed back to the upper layer. The upper layer determines the cause based on the gap type and degree of different skill dimensions and takes appropriate measures for collaborative scheduling.


Description

Technical Field

[0001] This invention relates to the field of artificial intelligence and embodied intelligence multi-robot collaboration technology, and in particular to a method for collaborative scheduling of heterogeneous robots across multiple modules of a space station.

Background Technology

[0002] With the development of embodied intelligence and multi-robot system technologies, heterogeneous robot swarms have been gradually applied to scenarios such as automated warehousing, deep space exploration, and space station operation and maintenance. The interior of a space station typically contains multiple modules with physical isolation and access restrictions, and operation and maintenance tasks often require several skills at once, so that different types of robots, such as sensing, operation, and inspection robots, must work together.

[0003] In such complex and constrained environments, existing multi-agent scheduling methods still have shortcomings. First, they mostly adopt a single-layer flat decision architecture that directly concatenates the global task state with the agent state and feeds the result into the network. In concurrent scenarios with multiple modules and multiple tasks, the state dimension increases rapidly, the computational load is high, and it is difficult to balance the supply relationship of local modules with the demand relationship of global tasks. Second, they rely heavily on reward functions for ex-post penalties, making it difficult to pre-process hard constraints such as module access, the upper limit on the number of workers, and deadlines. This can easily lead to invalid exploration that does not meet physical rules and to collaborative stagnation. Third, the collaborative process relies heavily on a global synchronization clock and explicit communication. Inside a space station, where occlusion limits communication and robot execution times vary greatly, this easily introduces additional communication overhead and synchronization waiting.

[0004] Furthermore, existing sequential decision-making methods typically perform deterministic deductions of task requirements immediately after the preceding agent selects a task. Subsequent agents struggle to determine the reliability of the preceding commitments, and when the preceding agent fails to arrive as expected, they may misjudge that the task requirements have been met, affecting replacement and collaborative grouping. Existing matching methods often calculate a single matching score by combining the agent's skill vector with the task requirement vector as a whole, making it difficult to distinguish the satisfaction of different skill gaps. This may lead to duplicate investment of some skills while key skills remain unmet. In hierarchical scheduling, when a lower-level local matching fails, the upper level can usually only receive a simple failure signal, making it difficult to identify the specific missing skills, the tasks involved, and their urgency. This can easily lead to redundant coordination or abandonment of existing local matching results.

Summary of the Invention

[0005] Purpose of the invention: The present invention aims to provide a heterogeneous robot collaborative scheduling method for multi-module space stations that is suitable for complex and constrained environments and that improves the targeting of task allocation and the stability of the execution process.

[0006] Technical solution: The heterogeneous robot collaborative scheduling method for multiple modules of a space station described in this invention includes the following steps:

A hierarchical distributed collaborative architecture is constructed based on a semi-Markov decision process. In the upper layer, the global task overview vector, the global agent overview vector, and the skill gap tension vector are concatenated and input into a multilayer perceptron to obtain the regional control parameters and the priority scores of the modules. Modules are selected based on the priority scores, and the global task graph is reduced to a local active subgraph through the three-dimensional localization mask of the selected modules. The local active subgraph, regional control parameters, and three-dimensional localization mask are sent to the lower layer.

The lower layer receives the local active subgraph, regional control parameters, and three-dimensional localization mask sent by the upper layer. Within the scope defined by the local active subgraph, it reads the task residual features of candidate tasks from the shared virtual state space and obtains the capability features of the awakened robot. It maps the task residual features and the capability features independently to generate a task feature matrix and an agent feature matrix. Based on the three-dimensional localization mask, it determines the current set of open skill dimensions, and it calculates the comprehensive matching score from the current set of open skill dimensions, the task feature matrix, and the agent feature matrix. Based on the candidate task attributes within the local active subgraph and the real-time state of the awakened robot, it determines the physical constraint sub-mask of each candidate task. The comprehensive restriction mask is determined from the physical constraint sub-mask and the effective slice of the three-dimensional localization mask. The comprehensive restriction mask is superimposed on the comprehensive matching score and normalized to obtain the safe action probability distribution, from which the final execution intention is determined.

The robot and task are matched according to the final execution intention, and the robot's execution intention for the target task is recorded as a commitment that is configured with a commitment confidence level. Based on the robot's capability tensor and the commitment confidence level, a confidence-weighted deduction is applied to the current task residual of the target task to obtain the updated task residual. The updated task residual and final execution intention are written into the shared virtual state space, enabling subsequent robots to fill in, replace, or re-match based on the updated task residual.

When it is detected that local matching has entered an oscillating state or that local matching convergence has stalled, the current skill gap tension vector is calculated based on the task set in the current local active subgraph and the current task residual of each task. The skill gap tension vector is then sent to the upper layer, so that the upper layer updates the local active subgraph, regional control parameters, and three-dimensional localization mask based on the skill gap tension vector.

[0007] Furthermore, the heterogeneous robot collaborative scheduling method for multiple modules of a space station also includes setting a global sparse reward after the lower-level task-level node matcher completes a scheduling cycle and obtains the task completion result; and setting a stage-level discounted reward at the upper level according to a semi-Markov decision process.

[0008] Furthermore, the aforementioned three-dimensional localization mask is $M_{m,j,t}$, where m is the task index, j is the skill dimension index, and t is the time-series slot index. The specific assignment rules for the three dimensions are as follows:

Spatial dimension constraints: based on the physical attributes of the module to which the task belongs, task nodes in inactive areas are shielded to limit the operating space of the heterogeneous robots.

Skill dimension sequential control: based on the requirements of each task and the current supply and demand tension of resources, the exposure priority of each skill dimension is determined dynamically, so that multi-skill task requirements are opened sequentially from high priority to low priority.

Time-series constraints: based on the task's deadline margin and the current time slot, it is determined whether the task is open in the corresponding time slot, so as to shield tasks that have not yet entered the current decision cycle and prioritize the exposure of urgent tasks.

The default rule for determining the skill exposure priority: for each non-zero skill dimension of the requirement vector $d_m$ of task m, the exposure priority is determined according to the supply and demand relationship of the corresponding skill dimension within the current active subgraph.
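The three gating rules above can be sketched as a small mask builder. This is a minimal illustration, not the patent's implementation: the `Task` fields `release_slot` and `deadline_slot`, the binary mask values, and the helper name `build_mask` are all assumptions introduced here, and the time rule is simplified to a fixed open window per task.

```python
from dataclasses import dataclass


@dataclass
class Task:
    module: str            # module the task belongs to
    demand: list[float]    # per-skill requirement vector
    release_slot: int      # assumed: first slot in which the task may open
    deadline_slot: int     # assumed: last slot in which the task may open


def build_mask(tasks, active_modules, open_skills, num_slots):
    """Return mask[m][j][t] with 1 = open and 0 = shielded (assumed binary)."""
    mask = []
    for task in tasks:
        in_active = task.module in active_modules          # spatial gate
        per_skill = []
        for j, need in enumerate(task.demand):
            # skill gate: only currently exposed skills with non-zero demand
            skill_open = in_active and j in open_skills and need > 0
            row = [
                1 if skill_open and task.release_slot <= t <= task.deadline_slot else 0
                for t in range(num_slots)                   # temporal gate
            ]
            per_skill.append(row)
        mask.append(per_skill)
    return mask
```

A task in an inactive module ends up with an all-zero slice, which is what removes it from the local active subgraph.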

[0009] Furthermore, the aforementioned comprehensive matching score $S_{i,m}$ is:

$$S_{i,m} = \sum_{j \in \mathcal{J}_{\mathrm{open}}} w_j \, e_j$$

$$e_j = v_j^{\top} \tanh\left( W_j^{\mathrm{task}} x_{m,j} + W_j^{\mathrm{agent}} y_{i,j} \right)$$

where $\mathcal{J}_{\mathrm{open}}$ is the current set of open skill dimensions, determined by the open status that the three-dimensional localization mask $M_{m,j,t}$ assigns to each skill dimension for candidate task m in the current time slot t; $w_j$ is the weight coefficient for the j-th skill dimension (set according to common practice in the field); $e_j$ is the attention matching score for the j-th skill dimension; $W_j^{\mathrm{task}}$ and $W_j^{\mathrm{agent}}$ are the independent linear transformation matrices for the j-th skill dimension channel; $v_j$ is the independent learnable parameter vector for the j-th skill dimension channel; $\tanh$ is the hyperbolic tangent activation function; $x_{m,j}$ is the requirement feature of the corresponding task in the j-th skill dimension; and $y_{i,j}$ is the ability feature of the corresponding agent in the j-th skill dimension.
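As a rough illustration of the per-skill matching described above, the sketch below scores each open skill channel with its own parameters and sums the weighted channel scores. The additive form inside `tanh`, the parameter layout `(v, W_task, W_agent)`, and all function names are assumptions made for this example.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def channel_score(v, W_task, W_agent, x_task, y_agent):
    """Attention-style score for one skill channel: v . tanh(W_task x + W_agent y)."""
    hidden = [
        math.tanh(p + q)
        for p, q in zip(matvec(W_task, x_task), matvec(W_agent, y_agent))
    ]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def matching_score(open_skills, weights, params, task_feats, agent_feats):
    """Weighted sum of channel scores over the currently open skill dimensions."""
    return sum(
        weights[j] * channel_score(*params[j], task_feats[j], agent_feats[j])
        for j in sorted(open_skills)
    )
```

Because each skill has its own channel, a closed or already satisfied skill contributes nothing to the score, which is the property the text relies on to avoid duplicate investment in one skill.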

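[0010] Furthermore, the comprehensive restriction mask $M^{\mathrm{total}}$ is obtained by combining the effective slice $M^{\mathrm{slice}}$ of the three-dimensional localization mask $M_{m,j,t}$, taken at the current moment and over the current set of open skill dimensions, with the physical constraint sub-mask of the candidate task. The physical constraint sub-mask includes the region admission mask $M^{\mathrm{zone}}$, the resource capacity mask $M^{\mathrm{cap}}$, the deadline reachability mask $M^{\mathrm{ddl}}$, and the skill matching mask $M^{\mathrm{skill}}$.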

[0011] Furthermore, the task residual update includes residual pre-deduction, residual rebound, residual solidification, and residual recovery.

Residual pre-deduction: when robot j is matched to target task m according to its final execution intention, the current requirement of the target task is deducted based on the robot's capability tensor and initial commitment confidence, yielding the task residual $\delta_m(t)$ at time step t:

$$\delta_m(t) = \mathrm{Trunc}\Big( d_m - \sum_{k \in \mathcal{C}_m} c_k(t)\, a_k \Big)$$

where $d_m$ is the tensor representing the current requirements of the target task; $\mathcal{C}_m$ is the set of all robots currently holding valid commitments to task m, a valid commitment being an execution intention record that has not been revoked or replaced and still takes part in the task residual deduction; k is any robot in this set; $a_k$ is the capability tensor of robot k; $c_k(t)$ is the commitment confidence of robot k at time t; and $\mathrm{Trunc}(\cdot)$ denotes a non-linear truncation operation.

Residual rebound: while robot j is en route to the target task m, its commitment confidence decays over time and the confidence deduction shrinks accordingly, so the task residual at time t is recomputed with the commitment confidence $c_j^{\mathrm{reb}}(t)$ of the residual rebound phase in place of the initial confidence. The decay of $c_j^{\mathrm{reb}}(t)$ is governed by the commitment decay rate $\lambda$ in the regionalized control parameters, the system clock t, the estimated time $t^{\mathrm{eta}}$ for the robot to reach the task location, and an adjustable hyperparameter $\kappa$.

Residual solidification: when robot j reaches the target task m, the robot's commitment confidence is solidified, and its capability contribution takes part stably in the task residual deduction.

Residual recovery: when robot j times out, becomes unreachable, or is replaced, the robot is removed from the set of valid commitments for the target task m, or its commitment confidence is set to zero, and the task residual of the target task is recalculated.
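The four residual stages can be walked through with a short sketch. Several choices here are assumptions rather than the patent's formulas: the exponential decay shape, truncation as clipping at zero, freezing the confidence at its stored value on arrival, and the names `Commitment`, `decayed_confidence`, and `task_residual`.

```python
import math
from dataclasses import dataclass


@dataclass
class Commitment:
    capability: list[float]   # robot capability per skill
    confidence: float         # initial commitment confidence
    committed_at: float       # system clock when the commitment was recorded
    arrived: bool = False     # True once the robot reaches the task


def decayed_confidence(c, now, decay_rate):
    """Confidence used for deduction (assumed exponential decay while en route)."""
    if c.arrived:
        return c.confidence   # solidification: stop decaying
    return c.confidence * math.exp(-decay_rate * max(0.0, now - c.committed_at))


def task_residual(demand, commitments, now, decay_rate):
    """Demand minus confidence-weighted capabilities, clipped at zero (assumed truncation)."""
    residual = list(demand)
    for c in commitments.values():
        w = decayed_confidence(c, now, decay_rate)
        residual = [r - w * a for r, a in zip(residual, c.capability)]
    return [max(0.0, r) for r in residual]
```

Pre-deduction corresponds to adding an entry to `commitments`, rebound to calling `task_residual` at a later `now`, solidification to setting `arrived = True`, and recovery to deleting the entry and recomputing.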

[0012] Furthermore, after residual pre-deduction, a revocable soft lock is set for the target task. The revocable soft lock records the comprehensive matching score of the preceding robot i for the target task m as the soft lock strength, which serves as the benchmark for subsequent robots to trigger commitment overwriting. When a soft lock corresponding to the preceding robot i already exists on the target task m, and the comprehensive matching score of the subsequent robot i' for the target task m exceeds the soft lock strength by more than the local overwrite threshold in the regional control parameters, commitment overwriting is triggered. Commitment overwriting includes invalidating the commitment confidence of the preceding robot i for the target task m, removing the confidence deduction corresponding to the preceding robot, and recalculating the task residual of the target task m according to the residual recovery. After the residual recovery is completed, residual pre-deduction is performed on the target task m based on the capability tensor and commitment confidence of the subsequent robot i', and the soft lock strength is updated. The number of overwrites of a single task within the same matching cycle does not exceed a preset threshold, or the local overwrite threshold for the k-th overwrite is set to k times the initial local overwrite threshold.
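The overwrite rule can be sketched as follows. The source offers the count cap and the k-times escalating threshold as alternatives; this example applies both at once for illustration. `SoftLock`, `try_commit`, and `max_overwrites` are hypothetical names, and the residual recovery and pre-deduction that accompany a successful overwrite are left to the caller.

```python
from dataclasses import dataclass


@dataclass
class SoftLock:
    robot: str            # current lock holder
    strength: float       # holder's comprehensive matching score
    overwrites: int = 0   # overwrites so far in this matching cycle


def try_commit(locks, task, robot, score, base_threshold, max_overwrites=3):
    """Return True if `robot` acquires or overwrites the soft lock on `task`."""
    lock = locks.get(task)
    if lock is None:
        locks[task] = SoftLock(robot, score)
        return True
    k = lock.overwrites + 1                      # this would be the k-th overwrite
    if k > max_overwrites:
        return False                             # count cap reached
    if score - lock.strength > k * base_threshold:
        locks[task] = SoftLock(robot, score, k)  # escalating threshold passed
        return True
    return False
```

Raising the bar with each overwrite damps the back-and-forth between two robots with nearly equal scores, which is one source of the oscillation the method tries to avoid.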

[0013] Furthermore, the skill gap tension vector $\tau$ is:

$$\tau = \sum_{m \in \mathcal{T}_{\mathrm{active}}} \delta_m \odot \omega_m$$

where $\mathcal{T}_{\mathrm{active}}$ is the set of tasks within the currently active subgraph; $\delta_m$ is the current task residual vector of task m; $\omega_m$ is the urgency weight vector based on the deadline; and $\odot$ denotes element-wise multiplication. The value of each component of $\omega_m$ is inversely proportional to the remaining deadline margin of task m.

[0014] Furthermore, the global sparse reward R is:

$$R = -\left( T_{\max} + \alpha N_{\mathrm{fail}} + \beta N_{\mathrm{rev}} \right)$$

where $T_{\max}$ is the latest completion time for all heterogeneous robots to complete their tasks and return to the docking node; $N_{\mathrm{fail}}$ is the total number of task nodes that failed due to timeouts or constraint conflicts; $N_{\mathrm{rev}}$ is the weighted cumulative number of commitment revocation events within a single round, calculated using preset weights; $\alpha$ is the failure penalty weighting coefficient; and $\beta$ is the revocation penalty weighting coefficient (both set according to common practice in the field).

[0015] Furthermore, the stage-level discounted reward $G$ is:

$$G = \sum_{t} \gamma^{t} r_t$$

where $\gamma$ is the discount factor and $r_t$ is the stage reward fed back by the lower-level environment after the system completes the execution of the t-th selected module.
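The two reward signals can be illustrated with a few lines of arithmetic. The sign convention (all three terms of the sparse reward counted as penalties), the stage index starting at zero, and both function names are assumptions for this sketch.

```python
def global_sparse_reward(makespan, failed_tasks, revocation_weights, alpha, beta):
    """Negative cost: makespan plus weighted failure and revocation penalties."""
    weighted_revocations = sum(revocation_weights)   # one preset weight per event
    return -(makespan + alpha * failed_tasks + beta * weighted_revocations)


def stage_discounted_return(stage_rewards, gamma):
    """Sum of gamma**t * r_t with t starting at 0, evaluated right to left."""
    total = 0.0
    for reward in reversed(stage_rewards):
        total = reward + gamma * total
    return total
```

For example, three stage rewards of 1 with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.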

[0016] Beneficial Effects: Compared with existing technologies, the significant advantages of this invention are: 1. By constructing a hierarchical distributed collaborative architecture based on a semi-Markov decision process, this invention reduces the state space complexity in multi-segment, multi-task concurrent scenarios and improves scheduling decision efficiency; 2. By using physical constraint sub-masks, three-dimensional localization masks, and comprehensive constraint masks, this invention reduces invalid explorations that do not meet the requirements of regional access, resource capacity, deadline reachability, and skill matching, thereby improving the feasibility of task allocation; 3. By using a shared virtual state space, task commitment mechanism, and task residual update mechanism, this invention reduces the waiting and overhead caused by explicit synchronous negotiation under communication constraints, and improves the timeliness of subsequent robot replacement, substitution, or re-matching; 4. By using skill dimension decoupling matching and skill gap tension vector feedback, this invention improves the targeting of skill gap identification and resource allocation in multi-skill tasks, thereby improving the stability of task allocation and collaborative scheduling effect in complex and constrained environments.

Attached Figure Description

[0017] Figure 1 is a schematic diagram of the hierarchical distributed collaborative architecture; Figure 2 is a flowchart of the upper layer of the hierarchical distributed collaborative architecture; Figure 3 is a flowchart of the lower layer of the hierarchical distributed collaborative architecture.

Detailed Implementation

[0018] The heterogeneous robot collaborative scheduling method for multiple modules of a space station as described in this invention includes the following steps:

A hierarchical distributed collaborative architecture is constructed based on a semi-Markov decision process. In the upper layer, the global task overview vector, the global agent overview vector, and the skill gap tension vector are concatenated and input into a multilayer perceptron to obtain the regional control parameters and the priority scores of the modules. Modules are selected based on the priority scores, and the global task graph is reduced to a local active subgraph through the three-dimensional localization mask of the selected modules. The local active subgraph, regional control parameters, and three-dimensional localization mask are sent to the lower layer.

The lower layer receives the local active subgraph, regional control parameters, and three-dimensional localization mask sent by the upper layer. Within the scope defined by the local active subgraph, it reads the task residual features of candidate tasks from the shared virtual state space and obtains the capability features of the awakened robot. It maps the task residual features and the capability features independently to generate a task feature matrix and an agent feature matrix. Based on the three-dimensional localization mask, it determines the current set of open skill dimensions, and it calculates the comprehensive matching score from the current set of open skill dimensions, the task feature matrix, and the agent feature matrix. Based on the candidate task attributes within the local active subgraph and the real-time state of the awakened robot, it determines the physical constraint sub-mask of each candidate task. The comprehensive restriction mask is determined from the physical constraint sub-mask and the effective slice of the three-dimensional localization mask. The comprehensive restriction mask is superimposed on the comprehensive matching score and normalized to obtain the safe action probability distribution, from which the final execution intention is determined.

The robot and task are matched according to the final execution intention, and the robot's execution intention for the target task is recorded as a commitment that is configured with a commitment confidence level. Based on the robot's capability tensor and the commitment confidence level, a confidence-weighted deduction is applied to the current task residual of the target task to obtain the updated task residual. The updated task residual and final execution intention are written into the shared virtual state space, enabling subsequent robots to fill in, replace, or re-match based on the updated task residual.

When it is detected that local matching has entered an oscillating state or that local matching convergence has stalled, the current skill gap tension vector is calculated based on the task set in the current local active subgraph and the current task residual of each task. The skill gap tension vector is then sent to the upper layer, so that the upper layer updates the local active subgraph, regional control parameters, and three-dimensional localization mask based on the skill gap tension vector.

[0019] As shown in Figure 1, the hierarchical distributed collaborative architecture includes: Shared Virtual State Space: Serving as the state storage area for the entire system, it eliminates the need for explicit communication between nodes. It allows modules at all levels to read and modify data to maintain and synchronize the state of the entire space station. The space stores the following: the current position and idle state of the heterogeneous robots, the task distribution of regionalized control parameters, the current task residuals of each task node, the commitment record quadruple, and the soft lock information triple.
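One way to picture the shared virtual state space is as a plain record store that every module reads and writes. The source does not list the elements of the commitment quadruple or the soft lock triple, so every field name below (`robot`, `task`, `confidence`, `timestamp`, `holder`, `strength`) is a hypothetical placeholder, as are the class and method names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommitmentRecord:      # hypothetical quadruple layout
    robot: str
    task: str
    confidence: float
    timestamp: float


@dataclass(frozen=True)
class SoftLockInfo:          # hypothetical triple layout
    task: str
    holder: str
    strength: float


@dataclass
class SharedVirtualStateSpace:
    robot_states: dict = field(default_factory=dict)     # robot -> (position, idle)
    control_params: dict = field(default_factory=dict)   # module -> parameters
    residuals: dict = field(default_factory=dict)        # task -> residual vector
    commitments: list = field(default_factory=list)
    soft_locks: dict = field(default_factory=dict)       # task -> SoftLockInfo

    def write_commitment(self, record, new_residual):
        """Record a commitment together with the residual it produced."""
        self.commitments.append(record)
        self.residuals[record.task] = list(new_residual)
```

Writing the commitment and the updated residual in one call mirrors the idea that later robots coordinate by reading state rather than by exchanging messages.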

[0020] Phase-level resource coordination module: Deployed at the central computing node or edge server of the space station. This module obtains a global view from the shared virtual state space and receives the skill gap tension vector from the lower layer (initially a zero vector). Its core functions include: prioritizing each module using a multilayer perceptron (MLP); generating a three-dimensional localization mask composed of the spatial, skill, and temporal dimensions; generating regionalized control parameters, which include the local overwrite threshold, the commitment decay rate, and the maximum alliance size limit; and providing differentiated responses such as cross-regional allocation, boundary expansion, or local abandonment based on the tension vector. It passes the local active subgraph, regionalized control parameters, and 3D localization mask to the lower layer to initiate the inner-layer matching loop.

[0021] Task-level node matching module: Deployed in the local or distributed control domain of heterogeneous robots, this module reads the task residual features of candidate tasks from the shared virtual state space within the scope of the locally active subgraph and obtains the capability features of the awakened robot; it independently maps the task residual features and capability features to generate a task feature matrix and an agent feature matrix, and calculates a comprehensive matching score based on the current open skill dimension set.

[0022] Multi-level physical constraint filtering module: Determines the region admission mask, the resource capacity mask, the deadline reachability mask, and the skill matching mask based on the candidate task attributes within the local active subgraph and the real-time state of the awakened robot. These four sub-masks are aggregated with the effective slice of the 3D localization mask to generate a comprehensive constraint mask, which is superimposed on the comprehensive matching score. Softmax normalization is then performed to obtain the safe action probability distribution.
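The filtering step can be sketched as a masked softmax. The source does not state how the sub-masks are aggregated, so this example assumes binary masks combined with a logical AND; `combine_masks` and `safe_action_distribution` are hypothetical names.

```python
import math


def combine_masks(*masks):
    """Element-wise AND of binary masks (assumed aggregation rule)."""
    return [int(all(bits)) for bits in zip(*masks)]


def safe_action_distribution(scores, mask):
    """Softmax over allowed candidates only; masked candidates get probability 0."""
    allowed = [s for s, ok in zip(scores, mask) if ok]
    if not allowed:
        return [0.0] * len(scores)          # no feasible candidate task
    peak = max(allowed)                     # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

With this arrangement a task that fails any hard constraint can never be sampled, regardless of how high its matching score is.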

[0023] Task residual update management module: Manages four stages, namely residual pre-deduction, residual rebound, residual solidification, and residual recovery, and maintains revocable soft lock information and soft-lock-based commitment overwriting. After a decision is made, the robot's execution intention for the target task is recorded as a commitment, which is configured with a commitment confidence level. Based on the robot's capability tensor and the commitment confidence level, the module applies a confidence-weighted deduction to the current task residual of the target task and updates it dynamically, enabling subsequent robots to fill in, replace, or re-match based on the updated task residual.

[0024] Skill Gap Tension Backflow and Mask Reconstruction Module: Real-time monitoring of local matching status. When local matching is detected to be entering an oscillating state or local matching convergence stalls, the current skill gap tension vector is calculated based on the task set within the current local active subgraph and the current task residual of each task. This skill gap tension vector is then sent to the upper layer, enabling the upper layer to update the local active subgraph, regional control parameters, and 3D localization mask based on this skill gap tension vector.

[0025] As shown in Figure 2, the phase-level resource coordination module selects modules based on their priority scores and reduces the global task graph to a local active subgraph using a 3D localization mask. The specific workflow is as follows: H1: Obtain the global task state and global agent state from the shared virtual state space, and generate a global task overview vector and a global agent overview vector; also obtain the skill gap tension vector from the lower layer. When the system is first started, the skill gap tension vector is a zero vector.

[0026] The skill gap tension vector is used to characterize the gap situation of each skill dimension within the local active subgraph, and each component represents the combined weighted value of the current task residual and task urgency on the corresponding skill dimension.

[0027] $$\tau = \sum_{m \in \mathcal{T}_{\mathrm{active}}} \delta_m \odot \omega_m$$

where $\tau$ is the skill gap tension vector; $\mathcal{T}_{\mathrm{active}}$ is the set of tasks within the currently active subgraph; $\delta_m$ is the current task residual vector of task m; $\omega_m$ is the urgency weight vector based on the deadline; and $\odot$ denotes element-wise multiplication. The value of each component of the urgency weight vector $\omega_m$ is inversely proportional to the remaining deadline margin of task m; the more urgent the deadline, the greater the weight.

[0028] When the system starts up for the first time, the skill gap tension vector is initialized as a zero vector; in subsequent iterations, this vector is dynamically updated based on the skill satisfaction level feedback from the lower layer, serving as the basis for the upper-layer coordinator to identify and schedule tasks.
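The tension vector described above reduces to a weighted sum of residuals. In this sketch the urgency weight is simplified to one scalar per task, `1 / (margin + eps)`, applied to every skill component; that simplification, the `eps` guard, and the function name are assumptions.

```python
def skill_gap_tension(residuals, margins, eps=1e-6):
    """Sum each task's residual vector scaled by its deadline urgency.

    residuals: task -> per-skill residual vector
    margins:   task -> remaining deadline margin (smaller means more urgent)
    """
    if not residuals:
        return []
    num_skills = len(next(iter(residuals.values())))
    tension = [0.0] * num_skills
    for task, residual in residuals.items():
        weight = 1.0 / (margins[task] + eps)   # inversely proportional to the margin
        for j, value in enumerate(residual):
            tension[j] += weight * value
    return tension
```

A large component tells the upper layer which skill is both unmet and urgent, which is more informative than a bare failure flag.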

[0029] H2: The global task overview vector $g^{\mathrm{task}}$, the global agent overview vector $g^{\mathrm{agent}}$, and the skill gap tension vector $\tau$ are concatenated and input into a multilayer perceptron (MLP). Through the nonlinear mapping of its internal network parameters, the MLP outputs the priority score P(zone|s) of each module and the regionalized control parameters. The priority score is used to determine the active modules that need to be prioritized for resource allocation. The regionalized control parameter package includes the local overwrite threshold $\theta$, the commitment decay rate $\lambda$, and the maximum alliance size limit N:

$$\big( P(\mathrm{zone} \mid s),\ \theta,\ \lambda,\ N \big) = \mathrm{MLP}\big( [\, g^{\mathrm{task}} ;\ g^{\mathrm{agent}} ;\ \tau \,] \big)$$

where P(zone|s) is the score or probability of each module being selected as a priority processing object under the current system state s; $\theta$ is the local overwrite threshold, used to control whether a subsequent robot can overwrite an existing commitment of a preceding robot; and $\lambda$ is the decay rate of commitment confidence, used to control how quickly commitment confidence decays over time. $\lambda$ is determined from the estimated time $t^{\mathrm{eta}}$ for the robot to reach the task location and an adjustable hyperparameter $\kappa$.

[0030] N represents the maximum number of robots allowed to collaborate on a single task. Together, these parameters constitute the regional control parameters of each module and express the regional control intent for that module, which is later combined with the spatial mask to uniformly constrain the behavior of the lower layer.
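Step H2 can be pictured as one forward pass through a small network. The sketch below assumes a single hidden layer, a softmax head for the module priorities, softplus heads to keep the threshold and decay rate positive, and one shared set of control parameters rather than one per module. The parameter dictionary keys and function names are hypothetical.

```python
import math


def linear(W, b, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(v):
    return math.log1p(math.exp(v))


def upper_layer_forward(task_vec, agent_vec, tension_vec, params, num_modules):
    """Concatenate the three vectors and map them to priorities and control parameters."""
    x = task_vec + agent_vec + tension_vec                # feature concatenation
    hidden = [math.tanh(v) for v in linear(params["W1"], params["b1"], x)]
    out = linear(params["W2"], params["b2"], hidden)
    priority = softmax(out[:num_modules])                 # P(zone | s)
    overwrite_threshold = softplus(out[num_modules])      # assumed positive head
    decay_rate = softplus(out[num_modules + 1])           # assumed positive head
    max_team = 1 + int(round(softplus(out[num_modules + 2])))
    return priority, overwrite_threshold, decay_rate, max_team
```

On the first scheduling round `tension_vec` would simply be all zeros, matching the initialization described in the text.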

[0031] H3: Based on the priority score output by the multilayer perceptron, one or more priority-compliant modules are selected as active modules to be executed in parallel, using a dynamic threshold or Top-K sampling method, in order to support concurrent processing of multi-module tasks.
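The selection in step H3 can be sketched with a threshold filter followed by a top-K cut. This version is deterministic rather than sampled, which is a simplification of the Top-K sampling mentioned in the text; the function name and the example module names in the usage note are hypothetical.

```python
def select_active_modules(priority, k=None, threshold=None):
    """Return module names ordered by priority, filtered by threshold and/or top-K.

    priority: module name -> priority score
    """
    ranked = sorted(priority, key=priority.get, reverse=True)
    if threshold is not None:
        ranked = [name for name in ranked if priority[name] >= threshold]
    if k is not None:
        ranked = ranked[:k]
    return ranked
```

Calling it with scores of 0.5, 0.3, and 0.2 for three modules and `k=2` keeps the two highest-scoring modules, which would then run their lower-layer matching loops in parallel.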

[0032] H4: Based on the selected module, generate a three-dimensional localization mask consisting of spatial, skill, and temporal dimensions, which is used to dynamically reduce the fully connected global task graph into locally active subgraphs.

[0033] The three-dimensional localization mask M3D is indexed by (m, j, t), where m is the task index, used to identify the specific task to be executed within the space station; j is the index of the skill dimension, corresponding to the heterogeneous skill types possessed by the robots, such as inspection, operation, and perception; and t is the time-slot index, used to schedule when tasks are opened on the timeline. The dynamic reduction is achieved through joint constraints across the following three dimensions:

[0034] (1) Spatial Dimension Constraints: Based on the active modules output in step H3, the system checks the physical location attribute of each task m in the global task set. If the module where task m is located is not selected as an active module, the mask values of the task across all skill dimensions and time slots are directly set to the closed state (i.e., physical shielding is implemented); if task m belongs to a currently selected active module, task node m is retained to participate in the subsequent calculations over dimensions j and t. This confines the working space of the heterogeneous robots to the active subgraph.

[0035] (2) Skill Dimension Sequential Control: For a task m retained by the spatial dimension constraint, the system does not open all of its skill requirements at once, but instead calculates the exposure priority of each skill dimension. This determines the mask value of each skill channel j, so that multi-skill task requirements are opened sequentially from high priority to low priority.

[0036] The exposure priority is determined by a formula over the following quantities: the exposure priority score of the j-th skill dimension (the larger the value, the scarcer the skill resource, and the higher the exposure priority of the corresponding dimension); the set of tasks within the currently active subgraph; the demand of task m in the j-th skill dimension; the set of idle heterogeneous robots within the currently active subgraph; the ability value of robot i in skill dimension j; and a small constant used to ensure the stability of the calculation.

[0037] Through this mechanism, for task m, the system compares the exposure priority values of the non-zero dimensions of its requirement vector. Only the skill dimension with the highest exposure priority (i.e., the scarcest skill relative to supply) has its mask value set to open; the mask values of the remaining skill dimensions j of task m, which have not reached the highest priority, are set to closed (temporarily closed).
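The skill-dimension rule above can be sketched as follows. The original formula is not reproduced in this text, so the demand-to-supply ratio used here is an assumption inferred from the listed quantities (total demand over tasks, total ability over idle robots, and a stabilizing constant). The names `exposure_priority`, `open_skill_for_task`, and `EPS` are hypothetical.

```python
from typing import List, Optional

EPS = 1e-6  # small stabilizing constant; the value is an assumption


def exposure_priority(demands: List[List[float]],
                      capabilities: List[List[float]]) -> List[float]:
    """Assumed score per skill j: total demand / (total idle supply + EPS).

    demands[m][j]      demand of task m in skill dimension j
    capabilities[i][j] ability of idle robot i in skill dimension j
    """
    n_skills = len(demands[0])
    scores = []
    for j in range(n_skills):
        total_demand = sum(task[j] for task in demands)
        total_supply = sum(robot[j] for robot in capabilities)
        scores.append(total_demand / (total_supply + EPS))
    return scores


def open_skill_for_task(task_demand: List[float],
                        scores: List[float]) -> Optional[int]:
    """Index of the single skill dimension opened for a task.

    Only dimensions with non-zero demand compete; the scarcest one wins.
    Returns None when the task has no remaining demand.
    """
    candidates = [j for j, need in enumerate(task_demand) if need > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda j: scores[j])


# Two tasks and two idle robots over three skill dimensions (made-up data).
demands = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0]]
capabilities = [[3.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
scores = exposure_priority(demands, capabilities)
opened = [open_skill_for_task(d, scores) for d in demands]
```

In this example the second skill dimension is the scarcest overall, but the first task has no demand in it, so that task opens its own scarcest non-zero dimension instead.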

[0038] (3) Timing Constraints: Delayed exposure control is implemented based on the deadline margin of task m, to determine whether the task is visible to the robots in the current system clock slot t. For an urgent task m with a tight deadline, the delay time is set so that the task enters the decision-making cycle immediately, and its timing mask is set to 0 (open) for the current and subsequent decision cycles t. For a non-urgent task m, a delay time is set; while the current system time is earlier than the delayed exposure time, the mask is forced closed (delayed hiding), and only when time progresses to the delayed exposure time is its mask value restored to 0.

[0039] The generated 3D localization mask M3D is used to reduce the global task graph to a local active subgraph, and is sent to the lower-level task-level node matching module along with the local active subgraph and regional control parameters.

[0040] H5: The regionalized control parameters generated in step H2 and the three-dimensional localization mask M3D generated in step H4 are sent synchronously to the lower-level task-level node matching module. The lower layer uses M3D as a constraint and uses the control parameters to quantitatively adjust its collaborative strategies.

[0041] As shown in Figure 3, after receiving the local active subgraph, regionalization control parameters, and 3D localization mask M3D from the upper layer, the inner matching loop of the lower-level task-level node matcher starts, and its specific execution process is as follows:

[0042] L1: Identify the robot to be awakened.

[0043] The system determines the robot to be woken up based on its operating status. In one implementation, a timestamp-driven event queue Q can be constructed to manage task completion events, node arrival events, return to docking point events, and re-matching events after commitment overwriting, and accordingly wake up the corresponding robot to participate in the lower-level task-level node matching.
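A timestamp-driven event queue of the kind described for step L1 could look like the following sketch, built on Python's standard `heapq` module. The class name `EventQueue`, the event type strings, and the robot identifiers are hypothetical; the text only states that the queue Q manages the four event kinds and wakes the corresponding robot.

```python
import heapq
import itertools
from typing import List, Tuple

# Hypothetical labels for the four event kinds named in the text.
EVENT_TYPES = {"task_done", "node_arrival", "dock_return", "rematch"}


class EventQueue:
    """Min-heap of events ordered by timestamp, then by insertion order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str, str]] = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def push(self, timestamp: float, event_type: str, robot_id: str) -> None:
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        entry = (timestamp, next(self._counter), event_type, robot_id)
        heapq.heappush(self._heap, entry)

    def pop_next(self) -> Tuple[float, str, str]:
        """Remove and return the earliest event as (time, type, robot)."""
        timestamp, _, event_type, robot_id = heapq.heappop(self._heap)
        return timestamp, event_type, robot_id

    def __len__(self) -> int:
        return len(self._heap)


q = EventQueue()
q.push(5.0, "task_done", "r2")
q.push(2.0, "node_arrival", "r1")
q.push(2.0, "rematch", "r3")
# Robots are woken in timestamp order; ties keep insertion order.
woken = [q.pop_next()[2] for _ in range(len(q))]
```

The insertion counter avoids comparing event payloads when two events share a timestamp, which keeps the wake-up order deterministic.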

[0044] L2: Reading the residual features of candidate tasks and acquiring robot capability features.

[0045] Based on the robot awakened in step L1, and combined with the local active subgraph, regional control parameters, and three-dimensional localized mask M3D issued in step H5, within the scope defined by the local active subgraph, the task residual features of the candidate tasks are read from the shared virtual state space, and the capability features of the awakened robot are obtained.

[0046] The awakened robot only reads the task residual features of each candidate task within the local active subgraph; the task residual features are dynamically updated based on the commitment confidence.

[0047] At the same time, the reading behavior is constrained by the local active subgraph and the 3D localization mask M3D. The system masks inactive regions and task information in unopened skill dimensions and outputs a task residual feature set. This task residual feature set is then used as the input vector for independent mapping processing to generate the task feature matrix.

[0048] Through this step, the lower-level robot can make subsequent matching judgments based on the updated task residuals, and avoids invalid feature extraction of inactive areas and unopened skill slots.

[0049] L3: Using a dual-stream encoder, the task residual features and the capability features of the awakened robot are independently mapped to generate a task feature matrix and an agent feature matrix.

[0050] The independent mapping is used to project the task residual features and agent capability features into a hidden embedding space of the same dimension, so as to provide semantically consistent input for subsequent decoupling attention alignment.

[0051] In a specific implementation, the mapping operator of the dual-stream encoder can adopt a linear projection layer or a multilayer perceptron (MLP) as used in existing technologies. For the task residual features and the capability features, the system sets independent network parameters with non-shared weights, to extract deep features in their respective dimensions.

[0052] The mapping is expressed as follows. For the task residual features, the task feature matrix is generated by applying the task-side weight matrix and bias term to the task residual vector and then applying the nonlinear activation function.

[0053] For the agent's capability features, the agent feature matrix is generated in the same way, using the agent-side weight matrix and bias term applied to the capability feature vector.

[0054] In these expressions: the task residual vector is the current residual of the target task m at the current time t, whose value dynamically rebounds as the confidence of the preceding robot decays; the capability feature vector is the initial capability feature vector of the awakened idle heterogeneous robot i, consisting of its skill levels such as inspection, operation, and perception according to its type; the two weight matrices are the learnable mapping weight matrices for the task residual features and the capability features, respectively, used to perform linear transformations; the two bias term vectors are the corresponding biases, used to enhance the model's offset representation capability; and the nonlinear activation function (such as ReLU or GeLU) introduces a nonlinear transformation on top of the linear mapping, thereby improving the expressive power of feature extraction.

[0055] Subsequently, the task feature matrix and the agent feature matrix are split along the skill dimension into sub-feature matrices, one per skill dimension, where the number of sub-feature matrices equals the preset total number of heterogeneous skill dimensions. The task side is split into sub-features that each represent the requirement features of the corresponding task in the j-th skill dimension; the agent side is split into sub-features that each represent the ability features of the corresponding agent in the j-th skill dimension. This split enables the subsequent attention matching score to be calculated independently in each skill dimension.

[0056] L4: Relying on the skill-dimension decoupled attention mechanism, the attention matching score is calculated independently for each skill dimension j.

[0057] The calculation involves the following quantities: the independent linear transformation matrix for the j-th skill dimension channel; the independent learnable parameter vector for the j-th skill dimension channel; the hyperbolic tangent activation function; the requirement features of the corresponding task in the j-th skill dimension; and the ability features of the corresponding agent in the j-th skill dimension. The linear transformation matrix and the learnable parameter vector are network weight parameters of the attention mechanism that are conventional in this field.

[0058] The open state of each skill dimension is controlled by the three-dimensional localization mask M3D. Only when the mask value of the j-th skill dimension is 0 is the skill dimension considered open, and only then does the j-th skill dimension channel participate in the calculation; a skill channel whose mask value is closed is not included in the current matching. The matching scores of all open channels are weighted and summed to generate a comprehensive matching score, computed over the set of currently open skill dimensions with a weighting coefficient for each skill dimension. The comprehensive matching score characterizes the overall matching relationship between the awakened heterogeneous robot and the candidate task.
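The per-channel scoring and the weighted sum over open channels can be sketched as below. The exact attention formula is not reproduced in this text; the additive form `v . tanh(W_task h_task + W_agent h_agent)` is an assumption chosen because the description mentions a transformation matrix, a learnable parameter vector, and a hyperbolic tangent. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in w]


def channel_score(h_task: Vector, h_agent: Vector,
                  w_task: Matrix, w_agent: Matrix, v: Vector) -> float:
    """Assumed additive attention score for one skill channel."""
    a = matvec(w_task, h_task)
    b = matvec(w_agent, h_agent)
    return sum(v_k * math.tanh(a_k + b_k) for v_k, a_k, b_k in zip(v, a, b))


def composite_score(channel_scores: Sequence[float],
                    open_skills: Sequence[int],
                    weights: Sequence[float]) -> float:
    """Weighted sum over open skill channels only; closed ones are skipped."""
    return sum(weights[j] * channel_scores[j] for j in open_skills)


identity = [[1.0, 0.0], [0.0, 1.0]]
s0 = channel_score([0.5, 0.0], [0.5, 0.0], identity, identity, [1.0, 1.0])
# Channels 1 and 2 are closed by the mask, so their scores are ignored.
total = composite_score([s0, 2.0, -1.0], open_skills=[0],
                        weights=[0.5, 0.3, 0.2])
```

Keeping separate parameters per skill channel is what makes the attention "decoupled": a closed channel contributes nothing, regardless of how well the robot would match it.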

[0059] Furthermore, as a preferred embodiment of the present invention, the specific steps of constructing a physical constraint sub-mask that includes regional access, resource capacity, deadline reachability, and skill matching include:

[0060] L5: Physical constraint submask calculation

[0061] Execution begins based on the candidate task attributes within the locally active subgraph and the real-time state of the awakened robot.

[0062] The system computes four physical constraint sub-masks in parallel for each candidate task (each is set to 0 if verification passes and to −∞ if it fails): (1) Region access mask: verify whether the type of the awakened robot has physical permission to enter the module where the candidate task is located; (2) Resource capacity mask: verify whether the number of currently active participants in the task (including all commitments that are in transit or locked) is below the actual capacity limit; (3) Deadline reachability mask: calculate the estimated completion time (current time + movement time + execution time) and check whether it is less than or equal to the absolute deadline of the task; (4) Skill matching mask: gated by the 3D localization mask M3D, verify only whether the robot's capabilities in the currently open skill dimensions intersect with the current task residual after its dynamic rebound.

[0063] This step outputs four sub-mask matrices for each candidate task and passes them to step L6 to generate the final comprehensive restriction mask.

[0064] The specific steps for aggregating and filtering the physical constraint sub-mask and the 3D localization mask include:

[0065] L6: Determine the comprehensive restriction mask from the physical constraint sub-masks and the effective slice of the 3D localization mask M3D.

[0066] The comprehensive restriction mask is the element-wise sum of the four physical constraint sub-masks and the effective two-dimensional slice of the three-dimensional localization mask M3D over the currently open skill dimensions. Valid mask values are set to 0, and invalid mask values are set to −∞. Because the masks are added element-wise, whenever any sub-mask item is −∞, the corresponding value of the comprehensive restriction mask is also −∞, which achieves joint filtering of the various constraints.

[0067] L7: The comprehensive restriction mask is superimposed on the comprehensive matching score obtained from the aforementioned skill-dimension decoupled attention calculation, and Softmax normalization is then performed to output the safe action probability distribution.

[0068] That is, the safe action probability distribution is the Softmax of the sum of the comprehensive matching score and the comprehensive restriction mask.

[0069] This step outputs the probability distribution of safe actions taken by the current decision-making entity for the locally active subgraph.

[0070] The robot then selects the most probable legitimate task from the distribution as the final decision to execute, and passes it to step L8 as the basis for matching the robot with the task and updating the task residual.
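Steps L6 and L7 can be sketched with additive masks whose entries are 0 for feasible tasks and negative infinity for infeasible ones, which is the convention assumed here. The function names and the example scores are hypothetical; the greedy pick at the end mirrors the selection of the most probable legitimate task.

```python
import math
from typing import List

NEG_INF = float("-inf")


def combine_masks(*masks: List[float]) -> List[float]:
    """Element-wise sum; any -inf entry makes the combined entry -inf."""
    return [sum(values) for values in zip(*masks)]


def masked_softmax(scores: List[float], mask: List[float]) -> List[float]:
    """Softmax over scores + mask; masked tasks get probability 0."""
    logits = [s + m for s, m in zip(scores, mask)]
    peak = max(logits)
    if peak == NEG_INF:
        raise ValueError("no feasible candidate task")
    exps = [math.exp(value - peak) for value in logits]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Three candidate tasks; the second one fails the capacity check.
access = [0.0, 0.0, 0.0]
capacity = [0.0, NEG_INF, 0.0]
deadline = [0.0, 0.0, 0.0]
skill = [0.0, 0.0, 0.0]
m3d_slice = [0.0, 0.0, 0.0]

mask = combine_masks(access, capacity, deadline, skill, m3d_slice)
probs = masked_softmax([1.0, 3.0, 2.0], mask)
choice = max(range(len(probs)), key=lambda m: probs[m])
```

Although the second task has the highest raw matching score, the mask removes it before normalization, so the robot selects the third task.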

[0071] L8: The task residual update includes residual pre-deduction, residual rebound, residual solidification, and residual recovery. The execution logic of each phase is as follows. Phase 1, residual pre-deduction: when robot i is matched to target task m at time t0 according to its final execution intention, the current requirement of the target task is deducted based on the robot's capability tensor and initial commitment confidence, yielding the task residual at time t0. The task residual maintained in the shared virtual state space is the cumulative effect after deducting all valid commitments: it equals the current requirement tensor of the target task minus the sum, over every robot in the set of robots currently holding a valid commitment to task m, of that robot's capability tensor weighted by its commitment confidence at the current moment, with a nonlinear truncation operation applied to prevent negative values from appearing in the task residual. For the specific robot that makes the initial commitment at this moment, the initial commitment confidence takes its maximum value of 1.

[0072] A quadruple containing this commitment information is then written to the shared virtual state space, and the robot is marked as in transit.

[0073] Phase 2, residual rebound: while robot i is en route to task m, as the system clock advances, the commitment confidence decays over time and the confidence-weighted deduction decreases accordingly, causing the task residual of the target task m to rebound. The decay is governed by the decay rate parameter specified by the upper-level regional control parameters, the system's current real-time clock, and the initial moment at which robot i established the commitment.

[0074] The decay rate is associated with the robot's expected arrival time and is defined in terms of the estimated time for the robot to reach the mission location and an adjustable hyperparameter. The specific value of the hyperparameter can be set by those skilled in the art based on conventional experience and the degree of uncertainty in the actual environment of the space station. When there is high uncertainty in accessibility in the corresponding area of the space station (such as high population density or many obstacles, leading to a high probability of delay), the hyperparameter can be set to a larger value (e.g., 1.5~2.0) to accelerate commitment decay, thereby prompting subsequent robots to detect the residual rebound more quickly and intervene to compensate as early as possible; in a more stable environment, it can be set to a smaller value (e.g., 0.5~1.0) to maintain the stability of the commitment and avoid unnecessary overwrite oscillations. In practical implementation, the optimal value can also be determined through grid search or based on system convergence during the joint pre-training phase of reinforcement learning.

[0075] The task residual of task m in the shared virtual state space is dynamically updated as the confidence of each robot's commitment decays. The quantities involved are: the current task residual of task m at time t; the current demand tensor for task m; C(m), the set of all robots currently holding valid commitments to task m; j, the index of any robot in the set C(m); the capability feature vector of robot j; and the commitment confidence of robot j at time t, which decays over time during the residual rebound phase, so that the calculated task residual increases correspondingly and produces a "rebound" effect.
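Read together, the definitions above suggest a residual of the following form. This is a reconstruction under stated assumptions, not the formula as filed: C(m) and β_j(t) follow the text, while R_m(t) for the residual, D_m for the demand tensor, and c_j for the capability vector are symbols chosen here, and the element-wise max with zero stands in for the "nonlinear truncation operation".

```latex
R_m(t) \;=\; \max\!\Bigl(0,\; D_m \;-\; \sum_{j \in C(m)} \beta_j(t)\, c_j \Bigr)
```

With this form, a decaying β_j(t) shrinks the subtracted term, so R_m(t) grows back toward D_m, which is the rebound effect described in the text.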

[0076] Phase 3, residual solidification:

[0077] When robot i physically reaches the target task m, residual solidification is triggered. The execution logic of this phase consists of two steps: a mathematical update and a system action. (1) Mathematical update: the robot's commitment confidence is fixed at 1 and no longer decays over time.

[0078] At the same time, during the residual solidification phase, the task residual is recalculated so that the robot's capability contribution stably participates in the task residual deduction. C(m) and the other quantities have the same meanings as in the aforementioned residual pre-deduction and residual rebound phases; in the residual solidification phase, βj(t) for a robot j that has reached the target task m is solidified to 1.

[0079] (2) System action: The system records robot i as having reached the target task m and keeps its commitment valid. When the current task residual of task m in the current open skill dimension is 0, the three-dimensional localization mask M3D is updated, the mask of the corresponding skill dimension is closed, and the mask of the next priority skill dimension is opened, so as to promote the sequential opening of multi-skill task requirements from high priority to low priority.

[0080] Phase 4 Residual Recovery: This phase is used to handle commitment failures caused by robot i timeouts, unreachability, or replacement.

[0081] The system forces the residual recovery process to be executed when either of the following two independent triggering conditions is met. Condition (a), timeout or unreachable: when the time for which the commitment has been maintained exceeds the maximum permissible duration, or the system predicts that continuing execution will inevitably result in a delay, the system determines that the commitment has naturally expired. Condition (b), commitment overwriting: when the matching score difference calculated for a subsequent, better-matched robot exceeds the local overwrite threshold, commitment overwriting is triggered, and the commitment confidence of the preceding robot i for the target task m becomes invalid.

[0082] Once either of the above conditions is triggered, the system performs the following sequence of operations: (1) Commitment confidence invalidation: the commitment confidence of robot i is forced to zero; (2) Residual recovery: the committed contribution of robot i is completely removed from the cumulative deduction of task m, so that the task residual recovers accordingly. C(m) and the other quantities have the same meanings as in the aforementioned residual pre-deduction, residual rebound, and residual solidification phases.

[0083] (3) System action: Release robot i back to the idle state; if the commitment is overridden due to condition (b), then make robot i re-participate in task matching in subsequent scheduling cycles.
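The four phases above can be walked through with a small sketch. The decay law is not reproduced in this text, so the exponential form `exp(-rate * elapsed)` is an assumption, as are the names `Commitment` and `task_residual` and all numbers. Recovery is modeled simply by dropping the commitment from the valid set.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Commitment:
    capability: List[float]  # robot capability per skill dimension
    t_start: float           # moment the commitment was established
    decay_rate: float        # taken from the regional control parameters
    arrived: bool = False    # True once the robot has reached the task

    def confidence(self, now: float) -> float:
        if self.arrived:
            return 1.0  # Phase 3: solidified, no further decay
        # ASSUMPTION: exponential decay starting from 1 at t_start.
        return math.exp(-self.decay_rate * max(0.0, now - self.t_start))


def task_residual(demand: List[float], commitments: List[Commitment],
                  now: float) -> List[float]:
    """Demand minus confidence-weighted capabilities, clamped at zero."""
    residual = []
    for j, need in enumerate(demand):
        covered = sum(c.confidence(now) * c.capability[j] for c in commitments)
        residual.append(max(0.0, need - covered))
    return residual


demand = [2.0, 1.0]
c = Commitment(capability=[1.0, 1.0], t_start=0.0, decay_rate=0.5)

r_start = task_residual(demand, [c], now=0.0)    # Phase 1: pre-deduction
r_later = task_residual(demand, [c], now=2.0)    # Phase 2: rebound
c.arrived = True
r_arrived = task_residual(demand, [c], now=9.0)  # Phase 3: solidification
r_recovered = task_residual(demand, [], now=9.0) # Phase 4: recovery
```

Because later robots only read the residual, a rebounding value is enough to tell them that a task may need backup, without any direct message from the robot in transit.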

[0084] L8a: In the residual pre-deduction phase of step L8, the system places a revocable soft lock on the commitment. The strength of the soft lock is defined as the robot's comprehensive matching score on the target task. The soft lock information is recorded as a triple and written into the shared virtual state space for subsequent robots to read during matching.

L8b: When a subsequently awakened robot i' calculates the comprehensive matching score for task m in step L4, and a soft lock placed by the preceding robot i already exists on task m, the system performs a commitment overwrite determination. The overwrite is triggered when the difference between the matching score of the subsequent robot and the strength of the existing soft lock exceeds the local overwrite threshold. Within the same matching cycle, the number of overwrites for a single task does not exceed a preset threshold. After each overwrite, the matching score difference required for a subsequent overwrite increases, with the threshold for the k-th overwrite raised accordingly, in order to suppress repeated overwrite oscillations.

L8c: If the commitment overwrite condition in step L8b is met, the system performs the following sequence of operations. First, the commitment of the preceding robot i is forced into phase 4 residual recovery: robot i's committed contribution is removed from the task residual calculation, that is, its commitment confidence is set to zero, and the system recalculates the residual of task m accordingly. Robot i is then released into the idle state, allowing it to re-participate in task matching in subsequent scheduling cycles. After residual recovery and clearing of the old commitment are completed, the subsequent robot i' establishes a new commitment to task m and enters phase 1, executing the residual pre-deduction at the moment the new commitment is established. Finally, the soft lock information is updated for robot i'.

L9: When a subsequent robot is awakened, it reads the task residuals of each candidate task from the shared virtual state space. The task residuals contain the commitment confidence information corresponding to the valid commitments of the preceding robots. The subsequent robot enters steps L2 to L7 based on the task residuals to perform matching and determine its final execution intention, thereby achieving backup, replacement, or re-matching without explicit communication.

L10: The running status of the inner matching loop is monitored in real time. The current inner matching loop ends when the task residuals of all tasks within the locally active subgraph are zero. When local matching is detected to be oscillating or its convergence is detected to be stagnant, step L11 is triggered to generate a skill gap tension vector. Oscillation of local matching includes the situation where the number of commitment overwrites exceeds a preset threshold within the same matching cycle and the task residuals within the locally active subgraph do not substantially decrease; stagnant convergence of local matching includes the situation where the decrease in the task residuals of the locally active subgraph is less than a preset threshold for multiple consecutive scheduling cycles.
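The commitment overwrite determination of step L8b can be sketched as follows. The growth rule for the k-th overwrite threshold is not reproduced in this text; the linear growth used in `overwrite_threshold` is purely an assumption, and the names `SoftLock`, `try_overwrite`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SoftLock:
    robot_id: str
    task_id: str
    strength: float  # comprehensive matching score of the lock holder


def overwrite_threshold(base: float, k: int, growth: float = 0.5) -> float:
    """Score gap required for the k-th overwrite (k = 1, 2, ...).

    ASSUMPTION: the threshold grows linearly with each overwrite.
    """
    return base * (1.0 + growth * (k - 1))


def try_overwrite(lock: SoftLock, challenger_id: str, challenger_score: float,
                  base_threshold: float, overwrites_so_far: int,
                  max_overwrites: int) -> Tuple[SoftLock, int]:
    """Return the (possibly replaced) lock and the updated overwrite count."""
    if overwrites_so_far >= max_overwrites:
        return lock, overwrites_so_far  # cap reached for this matching cycle
    needed = overwrite_threshold(base_threshold, overwrites_so_far + 1)
    if challenger_score - lock.strength > needed:
        new_lock = SoftLock(challenger_id, lock.task_id, challenger_score)
        return new_lock, overwrites_so_far + 1
    return lock, overwrites_so_far


lock = SoftLock("r1", "m7", strength=0.50)
# First challenger: gap 0.30 exceeds the base threshold 0.2, so it wins.
lock2, count2 = try_overwrite(lock, "r2", 0.80, 0.2, 0, max_overwrites=3)
# Second challenger: gap 0.25 is below the raised threshold 0.3, so it loses.
lock3, count3 = try_overwrite(lock2, "r3", 1.05, 0.2, count2, max_overwrites=3)
```

Raising the bar after each overwrite means a task can change hands only when the improvement is increasingly clear, which damps back-and-forth replacement.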
L11: When the system detects that local matching has entered an oscillating state or that its convergence has stalled, it generates a skill gap tension vector based on the task set within the current locally active subgraph and the current task residual of each task. The calculation involves the following quantities: the set of tasks within the currently active subgraph; the current task residual vector of task m; a weight vector based on the urgency of the deadline; and element-wise multiplication. The values of the components of the urgency weight vector are inversely proportional to the remaining deadline margin of task m. The generated skill gap tension vector encodes the severity of the skill gap per skill dimension and is sent to the upper-level stage resource coordinator, so that the upper level can update the local active subgraph, regional control parameters, and three-dimensional localization mask based on the skill gap tension vector.
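The tension vector of step L11 can be sketched as a deadline-weighted sum of task residuals. For brevity this sketch uses one scalar urgency weight per task, `1 / (margin + eps)`, broadcast over all skill dimensions; the text describes a weight vector, so this simplification, the function name, and the numbers are assumptions.

```python
from typing import List


def skill_gap_tension(residuals: List[List[float]],
                      deadline_margins: List[float],
                      eps: float = 1e-6) -> List[float]:
    """Sum over tasks of urgency-weighted residuals, per skill dimension.

    residuals[m][j]     current residual of task m in skill dimension j
    deadline_margins[m] remaining time before the deadline of task m
    """
    n_skills = len(residuals[0])
    tension = [0.0] * n_skills
    for residual, margin in zip(residuals, deadline_margins):
        # Urgency is inversely proportional to the remaining margin.
        urgency = 1.0 / (max(margin, 0.0) + eps)
        for j in range(n_skills):
            tension[j] += urgency * residual[j]
    return tension


# Two unfinished tasks over two skill dimensions (made-up data).
residuals = [[1.0, 0.0], [0.5, 2.0]]
margins = [10.0, 2.0]
tension = skill_gap_tension(residuals, margins)
```

Here the second skill dimension carries the larger tension because its unmet demand belongs to the task with the tighter deadline.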

[0085] Furthermore, after the lower-level task-level node matcher completes a scheduling cycle and obtains the task completion result, a global sparse reward is set; at the upper level, a stage-level discounted reward is set according to a semi-Markov decision process.

[0086] Constructing a hierarchical reward function: for the global collaborative goal of the space station, the global sparse reward of the lower-level task-level node matcher at the end of a round is defined over the following quantities: the latest completion time at which all heterogeneous robots have completed their tasks and returned to the docking node; the total number of task nodes that failed due to timeouts or constraint conflicts; the weighted cumulative count of all commitment revocation events within a single round, calculated using preset weights; the failure penalty weighting coefficient; and the revocation penalty weighting coefficient.

[0087] The aforementioned penalty weighting coefficients, as adjustable hyperparameters of the reinforcement learning reward function, can be set by those skilled in the art based on conventional experience with the fault tolerance of space station scheduling. Typically, the failure penalty weighting coefficient is set larger than the revocation penalty weighting coefficient. This clearly signals to the network that 'task timeout failure' is a fatal error that must be avoided at all costs, while 'being overwritten and undone by a better robot' is merely an acceptable cost of collaborative trial and error. In the specific implementation of the model pre-training phase, the optimal values of these hyperparameters can also be automatically optimized and calibrated through grid search or hyperparameter optimization algorithms, based on the model's training convergence curve and final task success rate in the simulation environment.

[0088] For the upper-level stage resource coordinator, the stage-level discounted reward based on the semi-Markov decision process is defined over the discount factor and the stage reward fed back by the lower-level environment after the system completes the execution of the t-th selected module; the accumulated discounted rewards of all stages form the training reward of the upper-level coordinator.
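The two reward levels can be illustrated with a short sketch. Neither formula is reproduced in this text, so both forms are assumptions: the sparse reward is taken as the negative of makespan plus weighted penalties, and the stage-level return discounts once per stage (a semi-Markov formulation may instead discount by elapsed stage duration). All names and numbers are hypothetical.

```python
from typing import Sequence


def global_sparse_reward(makespan: float, n_failed: int,
                         weighted_revocations: float,
                         w_fail: float, w_revoke: float) -> float:
    """ASSUMED form: -(makespan + w_fail * failures + w_revoke * revocations)."""
    return -(makespan + w_fail * n_failed + w_revoke * weighted_revocations)


def discounted_return(stage_rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over stages t = 0, 1, 2, ... (assumed form)."""
    total = 0.0
    for reward in reversed(stage_rewards):
        total = reward + gamma * total
    return total


# Failure is weighted far more heavily than revocation in this example.
round_reward = global_sparse_reward(makespan=100.0, n_failed=2,
                                    weighted_revocations=3.0,
                                    w_fail=50.0, w_revoke=1.0)
upper_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With these example weights, one failed task costs as much as fifty revocations, which matches the stated intent that timeouts are to be avoided while overwrites are tolerated.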

Claims

1. A method for collaborative scheduling of heterogeneous robots across multiple modules of a space station, characterized in that it includes the following steps: A hierarchical distributed collaborative architecture is constructed based on a semi-Markov decision process. In the upper layer, the global task overview vector, the global agent overview vector, and the skill gap tension vector are concatenated and input into a multilayer perceptron to obtain regional control parameters and priority scores of the modules. Based on the priority scores, modules are selected, and the global task graph is reduced to a locally active subgraph through the three-dimensional localization mask of the selected modules. The locally active subgraph, regional control parameters, and three-dimensional localization mask are sent to the lower layer. The lower layer receives the locally active subgraph, regional control parameters, and three-dimensional localization mask sent by the upper layer. Within the scope defined by the locally active subgraph, it reads the task residual features of the candidate tasks from the shared virtual state space and obtains the capability features of the awakened robot. It then independently maps the task residual features and capability features to generate the task feature matrix and the agent feature matrix. The current set of open skill dimensions is determined based on the three-dimensional localization mask. A comprehensive matching score is calculated based on the current set of open skill dimensions, the task feature matrix, and the agent feature matrix. The physical constraint sub-masks of each candidate task are determined based on the candidate task attributes and the real-time state of the awakened robot within the locally active subgraph. A comprehensive restriction mask is determined based on the physical constraint sub-masks and the effective slice of the three-dimensional localization mask.
The comprehensive restriction mask is superimposed on the comprehensive matching score and then normalized to obtain the safe action probability distribution. The final execution intention is determined based on the safe action probability distribution. The robot and task are matched according to the final execution intention, and the robot's execution intention for the target task is recorded as a commitment, with a commitment confidence level configured. Based on the robot's capability tensor and commitment confidence level, a confidence-weighted deduction is applied to the current task residual of the target task to obtain the updated task residual. The updated task residual and final execution intention are written into the shared virtual state space, enabling subsequent robots to fill in, replace, or rematch based on the updated task residual. When it is detected that local matching has entered an oscillating state or that its convergence has stalled, the current skill gap tension vector is calculated based on the task set in the current locally active subgraph and the current task residual of each task. The skill gap tension vector is then sent to the upper layer, so that the upper layer updates the locally active subgraph, regional control parameters, and three-dimensional localization mask based on the skill gap tension vector.

2. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 1, characterized in that it further includes setting a global sparse reward after the lower-level task-level node matcher completes a scheduling cycle and obtains the task completion result, and setting a stage-level discounted reward at the upper level according to a semi-Markov decision process.

3. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 1, characterized in that the three-dimensional localization mask is indexed by (m, j, t), where m is the task index, j is the skill dimension index, and t is the time-slot index; the specific assignment rules are as follows: Spatial dimension constraints: based on the physical attributes of the module to which the task belongs, task nodes in inactive areas are shielded to limit the operating space of the heterogeneous robots; Skill dimension sequential control: based on the requirements of each task and the current supply and demand tension of resources, the exposure priority of each skill dimension is dynamically determined, and multi-skill task requirements are opened sequentially from high priority to low priority; Time-series constraints: based on the task's deadline margin and the current time slot, it is determined whether the task is open in the corresponding time slot, so as to shield tasks that have not yet entered the current decision cycle and prioritize the exposure of urgent tasks. The default rule for determining the exposure priority of skill dimensions is as follows: for each non-zero skill dimension in the demand vector of task m, the exposure priority is determined according to the supply and demand relationship of the corresponding skill dimension in the current active subgraph.

4. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 1, characterized in that the comprehensive matching score is the weighted sum of the attention matching scores over the current set of open skill dimensions, where: the current set of open skill dimensions is determined by the open status, in the three-dimensional localization mask, of each skill dimension for candidate task m at the current time slot t; each skill dimension j has a weight coefficient; and the attention matching score of the j-th skill dimension is computed from the independent linear transformation matrix of the j-th skill dimension channel, the independent learnable parameter vector of the j-th skill dimension channel, the hyperbolic tangent activation function, the requirement features of the corresponding task in the j-th skill dimension, and the ability features of the corresponding agent in the j-th skill dimension.

5. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 1, characterized in that the comprehensive restriction mask is defined in terms of the effective slice of the three-dimensional localization mask at the current time slot and over the current set of open skill dimensions, together with the physical constraint sub-masks of the candidate task, which include the region admission mask, the resource capacity mask, the deadline reachability mask, and the skill matching mask.
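One plausible reading of claim 5 is that a candidate task remains selectable only if the localization slice and every physical sub-mask allow it. The sketch below assumes that logical-AND combination, and assumes a task is visible in the slice when at least one open skill dimension is open at the current slot. All function names are hypothetical.

```python
def localization_slice(mask3d, open_dims, t):
    """1 for each task with at least one open skill dimension open at slot t."""
    return [int(any(task_mask[k][t] for k in open_dims)) for task_mask in mask3d]

def comprehensive_mask(local_slice, region, capacity, deadline, skill):
    """Element-wise AND over all per-task binary masks (assumed combination)."""
    return [int(all(bits))
            for bits in zip(local_slice, region, capacity, deadline, skill)]

def apply_mask(scores, mask):
    """Masked-out candidates get -inf so they can never be selected."""
    return [s if m else float("-inf") for s, m in zip(scores, mask)]
```

Setting masked scores to negative infinity is a common way to exclude candidates before an argmax or softmax over the remaining tasks.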

6. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 1, characterized in that the task residual update includes residual pre-deduction, residual rebound, residual solidification, and residual recovery. The residual pre-deduction is as follows: when robot j is matched to target task m according to its final execution intention, the current requirement of the target task is deducted based on the robot's capability tensor and initial commitment confidence, yielding the task residual at the corresponding time step; the quantities involved are: the tensor representing the current requirement of the target task; the set of all robots that currently hold valid commitments to task m, a valid commitment being a record of an execution intention that has not been revoked or replaced and still participates in the task residual deduction; for any robot in this set, its capability tensor and its commitment confidence at the given moment; and a non-linear truncation operation. The residual rebound is as follows: while robot j is en route to target task m, its commitment confidence decays over time and the confidence-weighted deduction decreases accordingly, yielding the updated task residual; the quantities involved are: the commitment confidence during the residual rebound phase; the commitment decay rate in the regionalized control parameters; the system clock; the estimated time for the robot to reach the task location; and an adjustable hyperparameter. The residual solidification is as follows: when robot j reaches target task m, the robot's commitment confidence is solidified, and its capability contribution participates stably in the task residual deduction.
The residual recovery is as follows: when robot j times out, becomes unreachable, or is replaced, the robot is removed from the set of valid commitments for target task m, or its commitment confidence is set to zero, and the task residual of the target task is recalculated.
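The four residual operations of claim 6 can be sketched as updates to a table of valid commitments followed by a recomputation of the residual. The sketch assumes the residual is the demand minus the confidence-weighted sum of committed capabilities, truncated at zero. The decay rule is also an assumption, since the claimed formula is not recoverable here: confidence is held until the estimated arrival time and then decays exponentially, with `tau` standing in for the adjustable hyperparameter.

```python
import math

def task_residual(demand, commitments):
    """demand: per-skill list; commitments: robot_id -> (capability, confidence)."""
    residual = []
    for k, d in enumerate(demand):
        covered = sum(conf * cap[k] for cap, conf in commitments.values())
        residual.append(max(0.0, d - covered))  # truncation at zero (assumed form)
    return residual

def decayed_confidence(c0, decay_rate, now, eta, tau=1.0):
    """Assumed rebound rule: exponential decay once the clock passes the ETA."""
    overdue = max(0.0, now - eta)
    return c0 * math.exp(-decay_rate * overdue / tau)

def solidify(commitments, robot_id):
    """Robot arrived: fix its confidence at 1.0 (assumed solidified value)."""
    cap, _ = commitments[robot_id]
    commitments[robot_id] = (cap, 1.0)

def recover(commitments, robot_id):
    """Timeout, unreachable, or replaced: drop the commitment entirely."""
    commitments.pop(robot_id, None)
```

Because the residual is always recomputed from the current commitment table, rebound, solidification, and recovery all reduce to changing one robot's confidence or membership and calling `task_residual` again.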

7. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 6, characterized in that, after residual pre-deduction, a revocable soft lock is set on the target task; the revocable soft lock records, as the soft-lock strength, the comprehensive matching score of the preceding robot i for target task m. Commitment overwriting is triggered when a soft lock corresponding to the preceding robot i already exists on target task m and the comprehensive matching score of a subsequent robot i' for target task m exceeds the soft-lock strength by more than the local overwriting threshold in the regionalized control parameters. The commitment overwriting includes invalidating the commitment confidence of the preceding robot i for target task m, removing the confidence deduction corresponding to the preceding robot, and recalculating the task residual of target task m according to the residual recovery. After the residual recovery is completed, residual pre-deduction is performed on target task m based on the capability tensor and commitment confidence of the subsequent robot i', and the soft-lock strength is updated; wherein the number of overwrites of a single task within the same matching cycle does not exceed a preset threshold, or the local overwriting threshold for the k-th overwrite is set to k times the initial local overwriting threshold.
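The overwrite rule of claim 7 can be sketched as a small per-task lock table. `SoftLock` and `try_commit` are hypothetical names. The sketch covers only the lock bookkeeping and both anti-thrashing options (a cap on overwrites per cycle, or a threshold that grows as k times the initial value); the residual recovery and the new pre-deduction that accompany a successful overwrite are left to the caller.

```python
from dataclasses import dataclass

@dataclass
class SoftLock:
    robot: str
    strength: float      # matching score of the robot currently holding the lock
    overwrites: int = 0  # overwrites so far in the current matching cycle

def try_commit(locks, task, robot, score, base_threshold,
               max_overwrites=None, escalate=False):
    """Return True if `robot` obtains (or takes over) the soft lock on `task`."""
    lock = locks.get(task)
    if lock is None:
        locks[task] = SoftLock(robot, score)
        return True
    k = lock.overwrites + 1  # this would be the k-th overwrite
    if max_overwrites is not None and k > max_overwrites:
        return False         # option 1: cap on overwrites per cycle
    threshold = base_threshold * k if escalate else base_threshold
    if score - lock.strength > threshold:
        locks[task] = SoftLock(robot, score, k)  # strength updated to new score
        return True
    return False
```

With `escalate=True`, each successive takeover must beat the current holder by a larger margin, which damps repeated lock swapping between robots with similar scores.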

8. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 1, characterized in that the skill gap tension vector is defined in terms of the following quantities: the set of tasks within the currently active subgraph; the current residual vector of task m; the deadline-based urgency weight vector; and element-wise multiplication; wherein each component of the deadline-based urgency weight vector is inversely proportional to the remaining deadline margin of task m.
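A minimal sketch of the tension vector of claim 8, assuming the per-task terms are aggregated by summation over the active subgraph and that every component of a task's urgency weight equals the reciprocal of its remaining deadline margin. Both choices, and the `eps` guard against a zero margin, are assumptions.

```python
def urgency_weights(margin, num_skills, eps=1e-6):
    """Per-skill weights inversely proportional to the remaining deadline margin."""
    w = 1.0 / max(margin, eps)
    return [w] * num_skills

def skill_gap_tension(residuals, margins):
    """Sum over active tasks of (residual element-wise times urgency weights)."""
    num_skills = len(residuals[0])
    tension = [0.0] * num_skills
    for residual, margin in zip(residuals, margins):
        weights = urgency_weights(margin, num_skills)
        for k in range(num_skills):
            tension[k] += residual[k] * weights[k]
    return tension
```

For example, two tasks with residuals `[2, 0]` and `[1, 4]` and margins 2 and 4 give a tension of `[1.25, 1.0]`: the first skill is tenser because its unmet demand sits on the more urgent task.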

9. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 2, characterized in that the global sparse reward R is defined in terms of the following quantities: the latest completion time at which all heterogeneous robots have completed their tasks and returned to the docking node; the total number of task nodes that failed due to timeouts or constraint conflicts; the weighted cumulative count of all commitment revocation events within a single round, calculated using preset weights; the failure penalty weight coefficient; and the revocation penalty weight coefficient.
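The terms of claim 9 suggest a reward that penalizes the makespan, failed tasks, and commitment revocations. The sketch below assumes a simple negative weighted sum. The signs and the linear combination are assumptions, and the parameter names are hypothetical.

```python
def global_sparse_reward(makespan, num_failed, revocation_weights,
                         fail_coef, revoke_coef):
    """Assumed form: R = -(makespan + fail_coef * failures + revoke_coef * revocations)."""
    weighted_revocations = sum(revocation_weights)  # one preset weight per event
    return -(makespan + fail_coef * num_failed + revoke_coef * weighted_revocations)
```

The reward is sparse in the sense that it is evaluated once at the end of a round, when the makespan and the final failure and revocation counts are known.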

10. The heterogeneous robot cooperative scheduling method for multiple modules of a space station according to claim 2, characterized in that the stage-level discounted reward is defined in terms of the discount factor and the stage reward fed back by the lower-level environment after the system completes execution of the t-th selected module.
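A minimal sketch of the stage-level discounted reward of claim 10, assuming the standard discounted sum in which the reward of the t-th selected module is scaled by the discount factor raised to the power t, with t starting at 0. The starting index is an assumption.

```python
def stage_discounted_reward(stage_rewards, gamma):
    """Sum of gamma**t * r_t over the sequence of selected modules."""
    return sum((gamma ** t) * r for t, r in enumerate(stage_rewards))
```

With stage rewards `[1.0, 2.0, 3.0]` and a discount factor of 0.5, the result is 1.0 + 1.0 + 0.75 = 2.75, so modules selected later contribute less to the upper-level return.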