Task chain intelligent optimization method and system based on dynamic causal graph and reinforcement learning

By combining a dynamic causal graph model with reinforcement learning, this method evaluates the causal influence and causal reachability probability of each action in the action space, prunes and priority-ranks the actions, and constructs a comprehensive reward signal from a causal success rate model. This addresses the excessively large action space and sparse rewards that traditional reinforcement learning faces in multi-stage task chains, and yields optimized task chains with causal interpretability and behavioral traceability.

CN122242702A · Pending · Publication Date: 2026-06-19 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2026-03-19
Publication Date
2026-06-19


Abstract

This invention relates to a method and system for intelligent task chain optimization based on dynamic causal graphs and reinforcement learning, comprising: S1. Constructing a dynamic causal graph; S2. Using the dynamic causal graph to evaluate the causal influence and causal reachability of actions on task nodes, performing dynamic pruning and priority ranking to form a constrained action space; S3. Constructing a comprehensive reward signal based on the dynamic causal graph; S4. Integrating the constrained action space and comprehensive reward signal into the iterative training process of reinforcement learning; S5. Real-time sensing of environmental changes, updating the dynamic causal graph, adjusting the constrained action space and comprehensive reward signal to achieve policy network optimization; S6. Outputting the optimal task chain sequence; By constructing a dynamic causal graph to perform causal pruning and priority ranking of the action space, combined with the comprehensive reward signal, non-causal correlation interference is effectively eliminated, improving the convergence speed, success rate, and resource utilization efficiency of task chain planning in dynamic environments.

Description

Technical Field

[0001] This invention belongs to the field of intelligent decision-making and emergency command and control, and particularly relates to a task chain intelligent optimization method and system based on dynamic causal graphs and reinforcement learning.

Background Technology

[0002] A task chain typically consists of multiple stages, including detection / sensing, identification, tracking / location, decision-making, response, and assessment. It describes the dependencies between various functional components required to complete a specific objective during complex operations such as emergency command, maritime search and rescue, and disaster relief. In real-world applications, sensor states, platform capabilities, communication link stability, and the behavioral characteristics of targets or events are all highly dynamic and uncertain, posing significant challenges to the construction, execution, and optimization of task chains.

[0003] Reinforcement learning (RL) has been widely applied in intelligent decision-making scenarios such as multi-agent collaboration, path planning, and task allocation because it can automatically learn optimal policies through interaction with the environment. However, traditional reinforcement learning often faces problems such as excessively large action space, sparse rewards, and uninterpretable policies in multi-stage, long-sequence tasks.

[0004] Causal graph models have been introduced into task node analysis to describe causal dependencies between nodes, and can be used to assess the reachability of critical tasks, path contribution, and potential causes of failure. Dynamic causal graphs can further update node relationships and causal strength based on real-time environmental states, thus more accurately characterizing the decision-making logic in dynamic systems. However, existing research often uses causal reasoning and reinforcement learning independently: causal graphs focus on explanation and inference, while reinforcement learning focuses on policy search. The deep integration of the two in complex task chains is still in the exploratory stage.

[0005] Existing research on task chain / process optimization mainly follows two technical approaches: (1) Task chain optimization methods based on heuristic algorithms: heuristic search (such as particle swarm optimization or genetic algorithms) is used to screen and sort candidate actions / steps to generate executable task sequences, but such methods adapt poorly to dynamic environments and struggle to handle complex causal dependencies across multiple stages.

[0006] (2) Task chain optimization method based on reinforcement learning: Reinforcement learning is used to find the optimal action sequence, but there are problems such as large action space, lack of structured constraints, sparse rewards and uninterpretable policies.

[0007] In summary, existing technologies cannot simultaneously solve the three core problems of action space optimization, reward sparsity, and causal interpretability, nor have they developed a task chain optimization method that deeply integrates dynamic causal reasoning with reinforcement learning.

[0008] Based on this, the present invention is proposed.

Summary of the Invention

[0009] The purpose of this invention is to provide a task chain intelligent optimization method and system based on dynamic causal graphs and reinforcement learning to solve the problems existing in the background technology.

[0010] To achieve the above objectives, the technical solution of the present invention is as follows: A task chain intelligent optimization method based on dynamic causal graphs and reinforcement learning includes the following steps:
S1. Construct a scene dynamic causal graph model;
S2. Based on the scene dynamic causal graph model, evaluate the causal influence and causal reachability probability of each original action in the original action space on the task node. According to the evaluation results and preset rules, dynamically prune and priority-rank the original actions to form a constrained action space;
S3. Establish success rate nodes for each stage of the task chain, infer the success probability of each success rate node based on the scene dynamic causal graph model to calculate the chain success rate, and construct the counterfactual gain as an additional reward shaping term which, together with the chain success rate, constitutes a comprehensive reward signal;
S4. Input the constrained action space formed in step S2 and the comprehensive reward signal obtained in step S3 into the reinforcement learning policy network, and iteratively train the policy network with a policy gradient algorithm;
S5. Perceive environmental changes in real time, dynamically update the structure or parameters of the scene dynamic causal graph model, and synchronously adjust the constrained action space and comprehensive reward signal to achieve adaptive optimization of the policy network;
S6. The reinforcement learning policy network outputs the optimal task chain sequence.

[0011] Furthermore, step S1 includes the following processes: S11. Collect multi-dimensional scene element data; S12. Define the set of causal nodes based on the collected data; S13. Establish causal edges between causal nodes through rule bases, mechanistic models, or expert knowledge, and quantify the causal strength; S14. Construct a scenario-based dynamic causal graph model that supports updating node states and causal strength parameters over time steps.

[0012] Furthermore, the multi-dimensional scene element data includes at least platform capability parameters, sensor performance, target behavior, communication link status, and environmental conditions. The sources of element data include, but are not limited to, historical records, simulations, or real-time observations.

[0013] Furthermore, the causal node set includes task nodes, platform capability nodes, environment state nodes, and target behavior nodes.

[0014] Furthermore, step S2 includes the following processes:
S21. Generate the original action space containing specific operation instructions;
S22. Calculate the causal influence of each original action in the original action space on the task node;
S23. Use backdoor adjustment rules to identify backdoor paths between actions and task nodes, construct a backdoor variable set to eliminate non-causal correlations, and obtain the causal reachability probability of actions to task nodes;
S24. According to the preset rules, remove from the original action space any action whose causal influence or causal reachability probability is below its threshold or that is not connected to the task node, to obtain the constrained action space;
S25. Based on causal influence, causal reachability probability, topological depth, and node dependency, construct a priority scoring function, score the actions in the constrained action space, and sort them from high to low priority; this order serves as the action selection order for the reinforcement learning policy network.

[0015] Furthermore, step S3 includes the following processes:
S31. Establish success rate nodes for each stage of the task chain;
S32. Based on the scene dynamic causal graph model, use Bayesian updating combined with the do operator to infer the success probability of each success rate node under the current action and environment;
S33. Based on the topology of the task chain: if it is a sequential structure, calculate the chain success rate by multiplying the stage success probabilities; if it contains branches or parallel structures, calculate the chain success rate as the probability that at least one path succeeds;
S34. When a task fails at a certain stage, estimate the potential success rate of alternative actions through the scene dynamic causal graph model, and construct the counterfactual gain as an additional reward shaping term which, together with the chain success rate, constitutes the comprehensive reward signal.

[0016] Furthermore, step S4 includes the following processes: S41. Employ the proximal policy optimization (PPO) algorithm, parameterizing the policy network as an Actor network and the value network as a Critic network; S42. Input the constrained action space formed in step S2 into the policy network; S43. Using the comprehensive reward signal, update the policy network parameters according to the clipped loss function of the proximal policy optimization algorithm.

[0017] Furthermore, in step S42, a causal mask is applied to mask actions that are causally unreachable or irrelevant to the task logic in the current state, retaining only the actionable actions and weighting them.

[0018] Furthermore, in step S43, when using the comprehensive reward signal, a task completion reward and / or resource consumption penalty are added.

[0019] A task chain intelligent optimization system based on dynamic causal graphs and reinforcement learning, using the above method, includes: The causal graph model building module is used to build dynamic causal graph models for a scene. The action space constraint module is used to evaluate the causal influence and causal reachability of each original action in the original action space on the task node based on the scene dynamic causal graph model. According to the evaluation results and preset rules, the original actions are dynamically pruned and prioritized to form a constrained action space. The reward signal construction module is used to establish success rate nodes for each stage of the task chain, and calculate the chain success rate by inferring the success probability of each success rate node based on the scenario dynamic causal graph model. It constructs counterfactual gain as an additional reward shaping item, and together with the chain success rate, it constitutes a comprehensive reward signal. The policy training module is used to input the constrained action space and the comprehensive reward signal into the reinforcement learning policy network, and to perform iterative training of the policy network using the policy gradient algorithm. An adaptive update module is used to perceive environmental changes in real time, dynamically update the structure or parameters of the scene dynamic causal graph model, and synchronously adjust the constrained action space and comprehensive reward signal to achieve adaptive optimization of the policy network. The task sequence generation module is used to output the optimal task chain sequence using the trained policy network.

[0020] The advantages of this invention are: 1. This invention constructs a dynamic causal graph model based on task nodes, platform capabilities, environmental states, and target behaviors to infer the causal reachability and effectiveness contribution of candidate actions. By calculating the causal influence degree, backdoor path shielding relationship, and causal reachability probability corresponding to each action, the original action space of reinforcement learning is pruned in real time, eliminating actions that are irrelevant to the current task objective or have low causal contribution. Simultaneously, the topological structure of the causal chain is used to generate an action priority ranking sequence, achieving structured constraints on the action space and enhancing task relevance, significantly reducing the decision search space and improving policy learning efficiency.

[0021] 2. This invention constructs a chain-like success rate model based on the causal dependencies between stages of a task chain. It infers the success probability of each node in the causal graph and calculates the overall task chain success rate based on the structure of the causal path, using this as the reward signal for reinforcement learning. Simultaneously, a counterfactual inference mechanism is introduced: when a step fails, the potential success rate difference is estimated based on the causal contribution of alternative actions or nodes, thus forming a causally sensitive reward shaping strategy. This reward mechanism effectively solves the reward sparsity problem in complex task chains in traditional reinforcement learning, making the learning process more stable and possessing stronger task interpretability.

[0022] 3. The method of this invention uses the dynamically pruned action space as the policy search domain and the chain success rate as the core reward signal, and learns the optimal policy online through Proximal Policy Optimization (PPO). When changes in the environmental state lead to updates of the causal structure, the system can adjust the causal graph in real time and synchronously update the action space and reward function, achieving adaptive policy optimization in dynamic scenarios. The final output task chain sequence not only has optimal execution performance but also possesses causal interpretability and behavioral traceability, improving the intelligence and reliability of task chain construction. The invention thereby addresses the three core problems of action space optimization, reward sparsity, and causal interpretability.

Attached Figure Description

[0023] Figure 1 is a flowchart of the task chain intelligent optimization method based on dynamic causal graphs and reinforcement learning in this embodiment.
Figure 2a is a schematic diagram comparing the rewards of the causal PPO algorithm and the standard PPO algorithm in an example application case.
Figure 2b is a schematic diagram comparing the success rates of the causal PPO algorithm and the standard PPO algorithm in an example application case.
Figure 2c is a schematic diagram of the causal guidance usage of the causal PPO algorithm and the standard PPO algorithm in an example application case.
Figure 2d is a schematic diagram comparing the performance of the causal PPO algorithm and the standard PPO algorithm over the last 100 rounds of an example application case.

Detailed Implementation

[0024] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention can be practiced in many other ways different from those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the present invention. Therefore, the present invention is not limited to the specific embodiments disclosed below. Technical features in the various embodiments of the present invention can be combined accordingly without mutual conflict.

[0025] This embodiment proposes a task chain intelligent optimization method based on dynamic causal graphs and reinforcement learning. As shown in Figure 1, it includes the following steps:
S1. Dynamic causal graph construction: construct a scene dynamic causal graph model;
S2. Action reachability assessment and action space construction: based on the scene dynamic causal graph model, assess the causal influence and causal reachability probability of each original action in the original action space on the task node. According to the assessment results and preset rules, dynamically prune and priority-rank the original actions to form a constrained action space;
S3. Task chain success rate modeling: establish success rate nodes for each stage of the task chain, infer the success probability of each success rate node based on the scene dynamic causal graph model to calculate the chain success rate, and construct the counterfactual gain as an additional reward shaping term which, together with the chain success rate, constitutes a comprehensive reward signal;
S4. Reinforcement learning policy training (PPO): input the constrained action space formed in step S2 and the comprehensive reward signal obtained in step S3 into the reinforcement learning policy network, and iteratively train the policy network with a policy gradient algorithm;
S5. Dynamic causal graph update and policy adaptation: perceive environmental changes in real time, dynamically update the structure or parameters of the scene dynamic causal graph model, and synchronously adjust the constrained action space and comprehensive reward signal to achieve adaptive optimization of the policy network;
S6. Output of the optimal task chain: the reinforcement learning policy network outputs the optimal task chain sequence.

[0026] In this embodiment, step S1 specifically includes the following process: S11. Collect multi-dimensional scene element data, including but not limited to: platform capability parameters (such as rescue boats, unmanned boats, helicopters, etc.), sensor performance (such as radar / sonar / photoelectric), target behavior (such as distressed ships / personnel drifting behavior), communication link status, environmental conditions (such as weather, sea conditions), etc. The source of element data can be historical records, simulations or real-time observations.

[0027] S12. Define a set of causal nodes based on the collected data. The set of causal nodes shall include at least task nodes (such as search and detection, location, rescue approach, rescue, transfer, and assessment), platform capability nodes, environmental status nodes, and target behavior nodes.

[0028] S13. Establish causal edges between causal nodes through rule bases, mechanism models, or expert knowledge, and quantify the causal strength using probability or structure functions, such as P(successful search | track and radar status).

[0029] S14. Construct a scene dynamic causal graph model (DCG) that supports updating node states and causal strength parameters over time steps.
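Steps S12 to S14 can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the class name `DynamicCausalGraph`, its methods, the node names, and the strength values are all hypothetical and do not come from the patent text, which does not disclose a concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicCausalGraph:
    """Hypothetical container: nodes, weighted causal edges, and a time step."""
    nodes: set = field(default_factory=set)
    strength: dict = field(default_factory=dict)  # (cause, effect) -> strength in [0, 1]
    t: int = 0

    def add_edge(self, cause, effect, strength):
        """Register a causal edge (S13) together with its quantified strength."""
        self.nodes.update((cause, effect))
        self.strength[(cause, effect)] = strength

    def step(self, updates):
        """Advance one time step and overwrite strengths of existing edges (S14)."""
        for edge, value in updates.items():
            if edge not in self.strength:
                raise KeyError(f"unknown causal edge: {edge}")
            self.strength[edge] = value
        self.t += 1

    def parents(self, node):
        """Return the direct causes of a node in sorted order."""
        return sorted(c for (c, e) in self.strength if e == node)


g = DynamicCausalGraph()
g.add_edge("radar_status", "search_detect", 0.8)   # illustrative values only
g.add_edge("weather", "search_detect", 0.5)
g.step({("weather", "search_detect"): 0.3})        # e.g. a storm weakens this edge
```

A dictionary keyed by edge keeps the per-step parameter update cheap; structural updates (adding or removing edges) would go through `add_edge` or a corresponding removal method.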

[0030] In this embodiment, step S2 specifically includes the following process: S21. Generate the original action space, represented as a set $A$, in which each action $a$ corresponds to a specific operational command, such as switching sonar modes, adjusting course, initiating drone reconnaissance, requesting air support, launching lifeboats, and landing rescue equipment.

[0031] S22. Based on the topology of the scene dynamic causal graph model constructed in step S1, calculate the causal influence (CI) of each action on the task node. CI measures the contribution of an action to the success probability of the task node. For an action $a$ and a task node $T$, CI can be expressed as: $CI(a, T) = P(T = 1 \mid do(a)) - P(T = 1 \mid do(\varnothing))$; in the formula, $P(T = 1 \mid do(a))$ is the probability that task node $T$ succeeds after action $a$ is executed, and $P(T = 1 \mid do(\varnothing))$ is the probability that the task node succeeds when no action is performed.

[0032] S23. Using backdoor adjustment rules, identify the backdoor paths between actions and task nodes, construct a backdoor variable set $z$, and use the following formula to eliminate non-causal correlations and obtain the causal reachability probability of the action to the task node: $P(T = 1 \mid do(a)) = \sum_{z} P(T = 1 \mid a, z)\, P(z)$.
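The backdoor adjustment in S23 and the causal influence in S22 can be illustrated numerically. This is a sketch only: the function names, the single backdoor variable (weather), and every probability below are hypothetical values chosen for illustration, not data from the patent.

```python
def backdoor_adjusted(p_t_given_a_z, p_z):
    """Standard backdoor adjustment: P(T=1 | do(a)) = sum_z P(T=1 | a, z) * P(z)."""
    if abs(sum(p_z.values()) - 1.0) > 1e-9:
        raise ValueError("P(z) must sum to 1")
    return sum(p_t_given_a_z[z] * p_z[z] for z in p_z)


def causal_influence(p_with_action, p_without_action):
    """CI as the success probability with the action minus that without it."""
    return p_with_action - p_without_action


# Hypothetical numbers: z is the weather, a is "activate radar", T is "target detected".
p_z = {"clear": 0.75, "storm": 0.25}
p_detect_given_radar = {"clear": 0.8, "storm": 0.4}
p_detect_no_action = {"clear": 0.4, "storm": 0.2}

reach = backdoor_adjusted(p_detect_given_radar, p_z)      # about 0.70
baseline = backdoor_adjusted(p_detect_no_action, p_z)     # about 0.35
ci = causal_influence(reach, baseline)                    # about 0.35
```

Averaging over the distribution of the backdoor variable, rather than conditioning on the observed weather alone, is what removes the non-causal correlation between action choice and outcome.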

[0033] S24. According to preset rules, remove (prune) from the original action space any action whose causal influence or causal reachability probability is below its threshold or that is not connected to a task node, to obtain the constrained action space. The preset rules are as follows: the constrained action space set is $A_c = \{\, a \in A : CI(a, T) \ge CI_{thres},\ P(T = 1 \mid do(a)) \ge P_{thres} \,\}$, with the additional requirement that action $a$ connects to a critical task node, where $A$ is the original action space, $CI_{thres}$ is the preset threshold for the causal influence, and $P_{thres}$ is the preset threshold for the causal reachability probability.
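The pruning rule in S24 amounts to a filter over the original action space. The sketch below assumes the three conditions are combined conjunctively (an action is kept only if it passes both thresholds and is connected to a critical task node); the function name, the action names, and the threshold values are hypothetical.

```python
def prune_actions(actions, ci, reach, connected, ci_thres, p_thres):
    """Keep actions that pass both thresholds and connect to a critical task node."""
    return [
        a for a in actions
        if ci[a] >= ci_thres and reach[a] >= p_thres and a in connected
    ]


# Illustrative values only.
actions = ["radar", "sonar", "comm_relay"]
ci = {"radar": 0.30, "sonar": 0.05, "comm_relay": 0.20}
reach = {"radar": 0.70, "sonar": 0.60, "comm_relay": 0.10}
connected = {"radar", "sonar"}          # actions with a path to a critical task node

constrained = prune_actions(actions, ci, reach, connected, ci_thres=0.1, p_thres=0.3)
```

Here `sonar` is dropped for low causal influence and `comm_relay` for low reachability and missing connectivity, leaving only `radar`.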

[0034] S25. In the constrained action space, actions need to be scored to guide reinforcement learning in prioritizing actions with higher causal contributions and more critical paths. The priority scoring function is Priority(a) = f(causal influence, causal reachability probability, topology depth, node dependency). The priority of each action is calculated based on this function and arranged from highest to lowest priority, serving as the action selection order for the reinforcement learning policy network.

[0035] The topology depth here refers to the shortest causal path length from the current action node to the task target node in the dynamic causal graph model of the scenario. The path length is determined by the number of causal edges or intermediate nodes contained in the path. The smaller the topology depth, the shorter the causal impact path of the action on the task target, and the higher its score in priority ranking.

[0036] The node dependency is obtained based on the topology of the scene's dynamic cause-effect graph.
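The priority scoring in S25 can be sketched as follows. The source leaves the function f unspecified, so the weighted linear form, the weights, the `1 / (1 + depth)` term, and the treatment of node dependency as a number in [0, 1] are all assumptions made for illustration; only the definition of topological depth as the shortest causal path length comes from the text.

```python
from collections import deque


def topo_depth(edges, source, target):
    """Shortest number of causal edges from source to target (None if unreachable)."""
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


def priority(ci, reach, depth, dependency, w=(0.4, 0.3, 0.2, 0.1)):
    """Assumed linear score; a smaller depth scores higher via 1 / (1 + depth)."""
    return w[0] * ci + w[1] * reach + w[2] / (1 + depth) + w[3] * dependency


# Hypothetical causal graph (adjacency lists) and per-action features.
edges = {"radar": ["detect"], "relay": ["comms"], "comms": ["detect"], "detect": ["locate"]}
features = {"radar": (0.3, 0.7, 0.5), "relay": (0.2, 0.5, 0.5)}  # (CI, reach, dependency)

scores = {}
for action, (ci, reach, dep) in features.items():
    depth = topo_depth(edges, action, "locate")  # both actions reach the target here
    scores[action] = priority(ci, reach, depth, dep)

ranked = sorted(scores, key=scores.get, reverse=True)  # highest priority first
```

Breadth-first search is used because every causal edge counts as one unit of path length, so the first time the target is reached gives the shortest path.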

[0037] In this embodiment, step S3 specifically includes the following process: S31. Establish success rate nodes for each stage of the task chain, such as P_detect, P_locate, P_approach, P_rescue, P_transfer, and P_assess. The probability of these nodes is the probability of the task succeeding under the current state and action at that stage.

[0038] S32. Based on the scene dynamic causal graph model, Bayesian update combined with the do operator is used to infer the success probability of each success rate node in the current action and environment.

[0039] The Bayesian update is performed first, where $Pa(i)$ denotes the set of parent nodes of success rate node $i$, i.e., the task or platform states that affect that success rate node; the do-operator intervention is then applied. (The two formulas for these operations are not legible in the source text.)

[0040] S33. Calculate the chain success rate $P_{chain}$: if the task chain is sequential, it is calculated as $P_{chain} = \prod_{i=1}^{n} P_i$, where $P_i$ is the success probability of the success rate node of stage $i$ and $n$ is the number of stages; if the task chain has branches or parallel paths, the chain success rate is the probability that at least one path succeeds: $P_{chain} = 1 - \prod_{k=1}^{K} \left(1 - P_{path,k}\right)$; $P_{path,k} = \prod_{i \in path_k} P_i$.
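Both cases of S33 reduce to a few lines of arithmetic. The sketch below assumes that stage outcomes and parallel paths are independent, which is what the product form implies; the function names and the probabilities are hypothetical.

```python
from math import prod


def sequential_success(stage_probs):
    """Sequential chain: every stage must succeed, so probabilities multiply."""
    return prod(stage_probs)


def branching_success(paths):
    """Branching or parallel chain: succeeds if at least one path succeeds."""
    return 1.0 - prod(1.0 - sequential_success(p) for p in paths)


# Illustrative stage probabilities, e.g. detect, locate, rescue.
p_seq = sequential_success([0.9, 0.8, 0.5])          # about 0.36

# Two alternative single-stage paths, each with success probability 0.5.
p_branch = branching_success([[0.5], [0.5]])         # 0.75
```

The sequential product shrinks quickly with chain length, which is why a dense per-stage success signal is more informative than a single terminal reward.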

[0041] S34. When a task fails at a certain stage, the scene dynamic causal graph model is used to estimate the potential success rate $P(T = 1 \mid do(a'))$ of an alternative action $a'$, and the counterfactual gain $\Delta_{cf} = P(T = 1 \mid do(a')) - P(T = 1 \mid do(a))$ is constructed as an additional reward shaping term; together with the chain success rate, it constitutes the comprehensive reward signal, giving reinforcement learning causal sensitivity.
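The counterfactual term in S34 can be sketched as a success rate difference. The source does not state how this term is combined with the chain success rate, so the additive form and the signed weight `lam` below are assumptions; both function names and all numbers are hypothetical.

```python
def counterfactual_gain(p_alternative, p_taken):
    """Success rate the alternative action would have had, minus that of the taken one."""
    return p_alternative - p_taken


def composite_reward(p_chain, gain, lam=0.5):
    """Assumed additive shaping: chain success rate plus a weighted counterfactual term.

    The sign and size of lam are design choices not given in the source; a negative
    lam penalizes the agent when a better alternative existed.
    """
    return p_chain + lam * gain


gain = counterfactual_gain(p_alternative=0.5, p_taken=0.25)
reward = composite_reward(p_chain=0.25, gain=gain, lam=-0.5)
```

Because the term is non-zero even when the episode ends in failure, it gives the agent a learning signal at exactly the steps where a plain success reward would be silent.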

[0042] In this embodiment, step S4 specifically includes the following process: S41. Because the action space is large, the proximal policy optimization (PPO) algorithm, a policy gradient method with good convergence behavior, is used. The policy network is parameterized as an Actor network, denoted $\pi_\theta(a \mid s)$, and the value network is parameterized as a Critic network, denoted $V_\phi(s)$.

[0043] S42. Input the constrained action space (i.e., the action space set after pruning and priority sorting) formed in step S2 into the policy network as the set of optional actions for the policy output, and apply a causal mask. Causal masking introduces causal logic constraints into reinforcement learning policy networks.

[0044] In reinforcement learning, policy networks typically output a probability distribution over the entire action set. In task chains, however, not all actions are causally sound: some cannot causally reach the goal in the current state, and others are irrelevant to the task logic. If the policy network selects such actions, the result is ineffective attempts, sparse rewards, or low training efficiency. Therefore, a causal mask is introduced to filter out ineffective actions and to retain and weight the effective ones.
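One common way to realize such a mask is to renormalize the policy output over the allowed actions only. The sketch below is an assumption about how the masking and weighting could be implemented, since the corresponding formula is not recoverable from the source; the function name and the multiplicative use of `weights` are hypothetical.

```python
import math


def masked_policy(logits, mask, weights=None):
    """Weighted softmax over allowed actions; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must remain unmasked")
    weights = weights or [1.0] * len(logits)
    top = max(l for l, m in zip(logits, mask) if m)   # subtract max for stability
    scores = [
        w * math.exp(l - top) if m else 0.0
        for l, m, w in zip(logits, mask, weights)
    ]
    total = sum(scores)
    return [s / total for s in scores]


# The third action has the largest logit but is causally unreachable, so it is masked.
probs = masked_policy([1.0, 1.0, 5.0], [True, True, False])

# Causal weights (hypothetical) tilt the distribution toward the first action.
tilted = masked_policy([1.0, 1.0, 5.0], [True, True, False], weights=[3.0, 1.0, 1.0])
```

Masking before normalization guarantees that the agent never samples a pruned action, while the weights let the priority ranking bias exploration without forbidding lower-ranked actions.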

[0045] S43. Using the comprehensive reward signal, update the policy network parameters according to the clipped loss function of the proximal policy optimization algorithm, as shown in the following formulas, where $\hat{A}_t$ is the advantage estimate.

[0046] $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$; $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$; This step can incorporate task completion rewards or resource consumption penalties, aiming to combine the logical correctness of causal reasoning with the optimality of the reinforcement learning result. This guides the agent to seek the lowest-cost path while ensuring task success, moving from a "feasible solution" to an "efficient optimal solution." The calculation formulas for "task completion rewards" and "resource consumption penalties" are common knowledge in reinforcement learning (RL) and operations research, meaning they can be calculated using existing formulas.
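The PPO loss used in S43 can be checked on a couple of hand-computed samples. This is a plain-Python sketch of the standard PPO clipped surrogate objective, not the patent's implementation; the function name is hypothetical, and the default `eps=0.2` follows the clipping factor reported in the experiment settings of this document.

```python
def ppo_clip_loss(ratios, advantages, eps=0.2):
    """Negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * adv, clipped * adv))
    return -sum(terms) / len(terms)


# Sample 1: ratio 1.5 with positive advantage is clipped to 1.2, giving 1.2.
# Sample 2: ratio 0.5 with negative advantage is clipped to 0.8, giving -0.8.
loss = ppo_clip_loss(ratios=[1.5, 0.5], advantages=[1.0, -1.0])   # about -0.2
```

Taking the minimum of the clipped and unclipped terms makes the objective pessimistic, so a single update cannot move the policy far from the one that collected the data.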

[0047] In this embodiment, step S5 specifically includes the following process: S51. Perceive environmental changes in real time, such as target movement, platform damage, sea state changes, and communication anomalies; S52. Dynamically update the structure or parameters of the scene dynamic causal graph model; for example, a storm lowers the success rate of search and detection, or a platform failure reduces the available actions; S53. Recalculate the action space pruning results and priority ranking (i.e., update the constrained action space); S54. Synchronously update the comprehensive reward signal (the chain success rate model changes with the causal structure).

[0048] S55. The policy network performs rapid online updates and uses importance sampling for rapid adaptation.

[0049] In step S6, the output of the optimal task chain sequence includes: the optimal action sequence (the specific actions of each stage of the task chain), the corresponding causal path and contribution description (interpretability), the overall success rate prediction of the task chain, and the traceable decision basis for strategy execution.

[0050] This embodiment also proposes a system based on the above method, including: The causal graph model building module is used to build dynamic causal graph models for a scene. The action space constraint module is used to evaluate the causal influence and causal reachability of each original action in the original action space on the task node based on the scene dynamic causal graph model. According to the evaluation results and preset rules, the original actions are dynamically pruned and prioritized to form a constrained action space. The reward signal construction module is used to establish success rate nodes for each stage of the task chain, and calculate the chain success rate by inferring the success probability of each success rate node based on the scenario dynamic causal graph model. It constructs counterfactual gain as an additional reward shaping item, and together with the chain success rate, it constitutes a comprehensive reward signal. The policy training module is used to input the constrained action space and the comprehensive reward signal into the reinforcement learning policy network, and to perform iterative training of the policy network using the policy gradient algorithm. An adaptive update module is used to perceive environmental changes in real time, dynamically update the structure or parameters of the scene dynamic causal graph model, and synchronously adjust the constrained action space and comprehensive reward signal to achieve adaptive optimization of the policy network. The task sequence generation module is used to output the optimal task chain sequence using the trained policy network.

[0051] The effectiveness of the method and system in this embodiment will be verified by using application examples below.

[0052] This case study constructs a simulated maritime search and rescue (SAR) mission environment. This environment includes dynamic weather (sunny / foggy / rainy / stormy), sea states (calm / moderate / large / giant waves), and a day / night cycle system. It also introduces communication stability and GPS accuracy perturbations, as well as distressed targets exhibiting health decay and location drift characteristics. The platform is equipped with various sensors and rescue resources, and the mission requires completing the entire sequence of "search and discovery → precise location → establishing contact → implementing rescue → transfer → completion" within 40 steps.

[0053] The state space comprises 23 dimensions: weather normalization, sea state normalization, day / night indicator, communication stability, GPS accuracy, sensor overall efficiency, total resource normalization, distance to target, search efficiency, positioning accuracy, fuel ratio, crew fatigue, target discovery indicator, target location indicator, target contact indicator, rescue readiness, communication progress, percentage of steps taken within a phase, health value normalization, health status indicator, radar usage bias, electro-optical usage bias, and time pressure.

[0054] The action space consists of 13 actions: observation, activation of radar, activation of sonar, activation of electro-optical, requesting air support, switching to high-resolution mode, initiating tracking, locating the target, preparing rescue equipment, deploying lifeboats, deploying helicopter rescue, activating communication relay, and changing platform position.

[0055] The core of this embodiment is the causal PPO algorithm, which integrates domain knowledge into the reinforcement learning process through a causal graph mechanism. In this case, the algorithm constructs a 13 (action) × 6 (task phase) causal strength matrix to quantify the causal impact of actions on task progress. The causal strength is dynamically adjusted according to the real-time environment: for example, foggy weather reduces the causal strength of photoelectric sensors, while nighttime increases the causal strength of radar. This mechanism enables the intelligent agent to understand the causal relationship between actions and task phase transitions.
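The environment-dependent causal strength matrix can be sketched as a 13 × 6 table with multiplicative adjustments. The action identifiers, the uniform base strength of 0.5, and the adjustment factors below are hypothetical; the source states only the matrix shape and the direction of the fog and night effects.

```python
ACTIONS = [
    "observe", "radar", "sonar", "electro_optical", "air_support",
    "high_res_mode", "tracking", "locate", "prepare_rescue", "lifeboat",
    "helicopter", "comm_relay", "move_platform",
]
N_PHASES = 6


def base_matrix(value=0.5):
    """13 x 6 matrix of causal strengths (uniform placeholder values)."""
    return [[value] * N_PHASES for _ in ACTIONS]


def adjust(matrix, weather, is_night, fog_factor=0.5, night_factor=1.2):
    """Return a copy with fog weakening electro-optical and night strengthening radar."""
    out = [row[:] for row in matrix]
    if weather == "foggy":
        i = ACTIONS.index("electro_optical")
        out[i] = [min(1.0, v * fog_factor) for v in out[i]]
    if is_night:
        i = ACTIONS.index("radar")
        out[i] = [min(1.0, v * night_factor) for v in out[i]]
    return out


strengths = adjust(base_matrix(), weather="foggy", is_night=True)
```

Returning a copy keeps the base matrix unchanged, so each time step can recompute the adjusted strengths from the current weather and time of day.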

[0056] The PPO algorithm employs the classic shared Actor-Critic architecture. The policy network takes a 23-dimensional state vector as input, which is processed through three fully connected layers (128-128-64) for feature extraction. The network then branches into two heads: the Actor head outputs a 13-dimensional action probability distribution, and the Critic head outputs a state value estimate. The PPO hyperparameters are set as follows: learning rate of 3e-4, discount factor of 0.95, PPO clipping factor of 0.2, batch size of 256, 5 updates per batch, entropy regularization coefficient of 0.01, value loss weight of 0.5, and gradient clipping threshold of 1.0.
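As an illustration of how the listed hyperparameters enter the objective, the following sketch computes the standard PPO clipped loss on precomputed per-sample quantities. It is not the implementation of this embodiment: the function names and the dictionary layout of `batch` are assumptions, and the network itself is omitted.

```python
import math

CLIP_EPS = 0.2       # PPO clipping factor from the text
VALUE_COEF = 0.5     # value loss weight from the text
ENTROPY_COEF = 0.01  # entropy regularization coefficient from the text


def clipped_surrogate(new_logp, old_logp, advantage, eps=CLIP_EPS):
    """Standard PPO clipped surrogate objective for a single sample."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def ppo_loss(batch):
    """Total loss over a batch of samples.

    Each sample is a dict (hypothetical layout) with keys:
    new_logp, old_logp, advantage, value, ret, entropy.
    """
    n = len(batch)
    policy_term = sum(
        clipped_surrogate(s["new_logp"], s["old_logp"], s["advantage"])
        for s in batch
    ) / n
    value_term = sum((s["value"] - s["ret"]) ** 2 for s in batch) / n
    entropy_term = sum(s["entropy"] for s in batch) / n
    # Maximize the surrogate and entropy, minimize the value error.
    return -policy_term + VALUE_COEF * value_term - ENTROPY_COEF * entropy_term
```

With a clipping factor of 0.2, a probability ratio of 1.5 on a positive advantage is capped at 1.2, which limits how far a single update can move the policy.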

[0057] The experimental results are shown in Figures 2a to 2d. After 2000 training rounds, the causal PPO algorithm achieves a significant performance improvement over the standard PPO algorithm. As shown in Figure 2a, the average reward of the causal PPO algorithm converges to 132.23, while that of the standard PPO algorithm converges to only 66.68, an improvement of 98.3%. Figure 2b shows that the difference in success rates is even more pronounced: the success rate of the causal PPO algorithm stabilizes at 74.0%, while that of the standard PPO algorithm is only 15.0%, a relative improvement of 393.3%. The convergence curve shows that the causal PPO algorithm reaches a 50% success rate early in training (within the first 500 rounds), demonstrating rapid learning, while the curve for the standard PPO algorithm hovers at a low level, indicating that purely random exploration is inefficient. Figure 2c shows the adaptive adjustment of the causal guidance weight, which gradually decreases from an initial 0.8 to 0.3 in step with the improvement in algorithm performance, reflecting a reasonable transition from strong guidance to autonomous learning. Figure 2d is a bar chart comparing the final performance of the two methods in terms of reward and success rate.
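The two improvement percentages quoted above are relative gains over the standard PPO baseline. The short check below reproduces them from the reported figures; the helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values reported in the text for causal PPO versus standard PPO.
reward_gain = relative_improvement(132.23, 66.68)
success_gain = relative_improvement(74.0, 15.0)
print(f"reward: {reward_gain:.1f}%, success rate: {success_gain:.1f}%")
# prints "reward: 98.3%, success rate: 393.3%"
```

Note that the success rate figure is a relative gain; the absolute difference is 59.0 percentage points.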

[0058] The causal graph-guided approach filtered out 73 instances of unsuitable electro-optical sensor use under nighttime conditions during the search and discovery phase, and 8 instances of dangerous lifeboat deployment under stormy conditions during the rescue phase, effectively preventing ineffective exploration and risky decisions. The decision-making demonstration shows that the algorithm learned to select the most effective action strategy under different environmental conditions: prioritizing the use of electro-optical sensors under clear nighttime conditions, immediately switching to high-resolution mode after target discovery, and actively preparing rescue equipment during the contact phase. This demonstrates that the method of this embodiment, by incorporating domain knowledge into reinforcement learning in the form of a causal graph, enables the algorithm to understand the inherent logical relationships between actions and task phase transitions, achieving faster, more stable, and safer strategy optimization in complex, multi-step chained tasks.
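The filtering behavior described above can be illustrated with a simple action mask applied to the policy output. This is a sketch under assumptions: the two rules are derived only from the two filtered cases mentioned in the text, the state keys and action identifiers are hypothetical, and a real system would derive the mask from the causal graph rather than from hard-coded conditions.

```python
def blocked_actions(state):
    """Return the set of actions masked out in the given state.

    The two rules mirror the filtered cases in the text; the state keys
    (is_night, weather, phase) are hypothetical.
    """
    blocked = set()
    if state["is_night"] and state["phase"] == "search":
        blocked.add("activate_electro_optical")
    if state["weather"] == "stormy" and state["phase"] == "rescue":
        blocked.add("deploy_lifeboat")
    return blocked


def masked_policy(probs, state):
    """Drop masked actions and renormalize the remaining probabilities."""
    blocked = blocked_actions(state)
    kept = {a: p for a, p in probs.items() if a not in blocked}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("all actions were masked or have zero probability")
    return {a: p / total for a, p in kept.items()}


probs = {"activate_radar": 0.5, "activate_electro_optical": 0.25, "observe": 0.25}
state = {"is_night": True, "weather": "sunny", "phase": "search"}
policy = masked_policy(probs, state)
```

Renormalizing after masking keeps the output a valid probability distribution, so the agent never samples a filtered action.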

[0059] The above embodiments are only used to explain the concept of the present invention, and are not intended to limit the protection of the present invention. Any non-substantial modifications made to the present invention using this concept should fall within the protection scope of the present invention.

Claims

1. A task chain intelligent optimization method based on dynamic causal graphs and reinforcement learning, characterized in that it includes the following steps: S1. Construct a scene dynamic causal graph model; S2. Based on the scene dynamic causal graph model, evaluate the causal influence degree and causal reachability probability of each original action in the original action space on the task node, and, according to the evaluation results and preset rules, perform dynamic pruning and priority sorting on the original actions to form a constrained action space; S3. Establish success rate nodes for each stage of the task chain, infer the success probability of each success rate node based on the scene dynamic causal graph model to calculate the chain success rate, and construct a counterfactual gain as an additional reward shaping term that, together with the chain success rate, constitutes a comprehensive reward signal; S4. Input the constrained action space formed in step S2 and the comprehensive reward signal obtained in step S3 into the reinforcement learning policy network, and use the policy gradient algorithm to iteratively train the policy network; S5. Perceive environmental changes in real time, dynamically update the structure or parameters of the scene dynamic causal graph model, and synchronously adjust the constrained action space and the comprehensive reward signal to achieve adaptive optimization of the policy network; S6. Output the optimal task chain sequence from the reinforcement learning policy network.

2. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 1, characterized in that step S1 includes the following processes: S11. Collect multi-dimensional scene element data; S12. Define the set of causal nodes based on the collected data; S13. Establish causal edges between causal nodes through rule bases, mechanistic models, or expert knowledge, and quantify the causal strength; S14. Construct a scene dynamic causal graph model that supports updating node states and causal strength parameters over time steps.

3. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 2, characterized in that the multi-dimensional scene element data includes at least platform capability parameters, sensor performance, target behavior, communication link status, and environmental conditions, and the sources of the element data include, but are not limited to, historical records, simulations, or real-time observations.

4. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 2, characterized in that the set of causal nodes includes task nodes, platform capability nodes, environmental state nodes, and target behavior nodes.

5. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 1, characterized in that step S2 includes the following processes: S21. Generate the original action space containing specific operation instructions; S22. Calculate the causal influence degree of each original action in the original action space on the task node; S23. Use the backdoor adjustment criterion to identify backdoor paths between actions and task nodes, construct a backdoor variable set to eliminate non-causal correlations, and obtain the causal reachability probability of actions to task nodes; S24. According to the preset rules, remove from the original action space the actions whose causal influence degree or causal reachability probability is below the threshold and the actions that are not connected to the task nodes, to obtain the constrained action space; S25. Based on causal influence degree, causal reachability probability, topological depth, and node dependency, construct a priority scoring function to score the actions in the constrained action space and sort them from high to low priority, which serves as the action selection order for the reinforcement learning policy network.

6. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 1, characterized in that step S3 includes the following processes: S31. Establish success rate nodes for each stage of the task chain; S32. Based on the scene dynamic causal graph model, use Bayesian updating combined with the do operator to infer the success probability of each success rate node under the current action and environment; S33. Based on the topology of the task chain, if it is a sequential structure, calculate the chain success rate by multiplication; if there is a branch or parallel structure, calculate the chain success rate as the probability that at least one path succeeds; S34. When a task fails at a certain stage, estimate the potential success rate of alternative actions through the scene dynamic causal graph model, and construct a counterfactual gain as an additional reward shaping term that, together with the chain success rate, constitutes a comprehensive reward signal.

7. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 1, characterized in that step S4 includes the following processes: S41. Employ the proximal policy optimization (PPO) algorithm, parameterizing the policy network as an Actor network and the value network as a Critic network; S42. Input the constrained action space formed in step S2 into the policy network; S43. Using the comprehensive reward signal, update the policy network parameters according to the clipped loss function of the proximal policy optimization algorithm.

8. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 1, characterized in that, in step S42, a causal mask is applied to mask actions that are causally unreachable or irrelevant to the task logic in the current state, retaining only the executable actions and weighting them.

9. The intelligent task chain optimization method based on dynamic causal graphs and reinforcement learning as described in claim 7, characterized in that, in step S43, task completion rewards and / or resource consumption penalties are added when using the comprehensive reward signal.

10. A task chain intelligent optimization system based on dynamic causal graphs and reinforcement learning, used to execute the method described in any one of claims 1 to 9, characterized in that it includes: a causal graph model building module, used to construct the scene dynamic causal graph model; an action space constraint module, used to evaluate, based on the scene dynamic causal graph model, the causal influence degree and causal reachability probability of each original action in the original action space on the task node, and, according to the evaluation results and preset rules, to dynamically prune and prioritize the original actions to form a constrained action space; a reward signal construction module, used to establish success rate nodes for each stage of the task chain, to infer the success probability of each success rate node based on the scene dynamic causal graph model to calculate the chain success rate, and to construct a counterfactual gain as an additional reward shaping term that, together with the chain success rate, constitutes a comprehensive reward signal; a policy training module, used to input the constrained action space and the comprehensive reward signal into the reinforcement learning policy network and to iteratively train the policy network using the policy gradient algorithm; an adaptive update module, used to perceive environmental changes in real time, dynamically update the structure or parameters of the scene dynamic causal graph model, and synchronously adjust the constrained action space and the comprehensive reward signal to achieve adaptive optimization of the policy network; and a task sequence generation module, used to output the optimal task chain sequence using the trained policy network.