A method and system for generating a kill web based on offline reinforcement learning

By adopting a kill net generation method based on offline reinforcement learning, the problems of poor scene adaptability and low automation in existing technologies are solved. This method enables efficient kill net generation that can quickly respond to complex battlefield environments and has cross-stage collaborative decision-making capabilities.

CN122287331A · Pending · Publication Date: 2026-06-26 · INST OF SOFTWARE - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF SOFTWARE - CHINESE ACAD OF SCI
Filing Date
2026-03-27
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing kill net generation technologies are inadequate in terms of scene adaptability, automation, generation speed, and end-to-end coordination. They are unable to cope with complex and dynamic battlefield environments and sudden threats, and rely too much on human intervention.

Method used

By employing an offline reinforcement learning approach, a generative trajectory decision network is constructed through scenario modeling of state space, action space, and reward function, combined with proximal policy optimization and meta-reinforcement learning, to achieve efficient and automated decision-making on battlefield situations.

Benefits of technology

It enables rapid-response, reliable kill net generation in complex battlefield environments, improves the system's adaptability and generation efficiency, provides cross-stage collaborative decision-making, and reduces the need for human intervention.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a kill net generation method and system based on offline reinforcement learning, relating to the fields of simulation and reinforcement learning. To address the problems of slow kill net generation speed, poor adaptability, reliance on manual intervention, and lack of end-to-end coordination in existing technologies, this invention performs scene modeling based on the tasks, resources, and dynamic factors involved in the simulation scenario, including state space, action space, reward function, and state transition rules. Based on the scene modeling, reinforcement learning interactive simulation is performed to obtain offline data containing simulation trajectories. The offline data is divided into multi-task segments to obtain a multi-task offline dataset, and a meta-reinforcement learning training is performed on the generative trajectory decision network. The trained generative trajectory decision network is then used to generate the target kill net. This invention can efficiently generate highly adaptable and automated kill nets, quickly respond to changing battlefield environments, reduce manual intervention, and improve generation efficiency.

Description

Technical Field

[0001] This invention relates to the fields of simulation and reinforcement learning, specifically to a kill net generation method and system based on offline reinforcement learning.

Background Technology

[0002] Kill chain generation uses artificial intelligence and information networks to dynamically construct a closed-loop task flow from target discovery to strike assessment, achieving multi-node collaboration and adaptive adjustment. Existing technologies include information network communication, heuristic algorithms, distributed computing, and multi-objective optimization. Kill web generation, by contrast, connects reconnaissance, command, control, strike, and assessment nodes through big data analysis, artificial intelligence, and automation to build an efficient and flexible kill network that supports rapid detection, precise strikes, and sustained suppression. Existing technologies involve kill web element modeling and operational process modeling methods. Task scheduling is the core of kill web execution; commonly used strategies include priority scheduling, time-slice round-robin, and resource preemption, for example Earliest Deadline First (EDF), Round Robin, and Max-Weight Matching. Allocating resources to operational tasks is equally important; commonly used techniques include auction mechanisms, game-theoretic allocation, and greedy strategies. Offline reinforcement learning trains policies on historical data without real-time interaction with the environment, enabling agents to make decisions in unknown environments. Common techniques include conservative Q-learning, bootstrapping error reduction algorithms, RAD (Representational Awareness-based DQN algorithm), and model-based offline policy optimization algorithms.

[0003] Existing kill net generation technologies suffer from the following drawbacks. First, they lack scene adaptability: most existing methods construct kill nets based on fixed mission scenarios or preset rules, making them ill-suited to complex and dynamically changing battlefield environments and unable to handle unexpected situations such as sudden threats and interference attacks. Second, they involve excessive human intervention: most systems rely on manual configuration of node relationships, priority settings, and intervention in the generation process, which limits the automation and real-time nature of decision-making and fails to meet the requirement in actual combat to complete a kill within seconds. Third, they lack efficient and rapid generation mechanisms: existing algorithms generally employ iterative optimization or heuristic search, resulting in long generation cycles, high computational costs, and an inability to respond dynamically. Fourth, the algorithms are fragmented and one-sided: existing algorithms focus on only a single aspect of kill net generation (such as target allocation, path planning, or node coordination) and lack unified modeling of the entire "perception-decision-attack-evaluation" chain, leading to poor overall coordination and low system efficiency. Furthermore, existing methods generalize poorly to unknown scenarios and cannot perceive uncertainty, resulting in unreliable generation results.

Summary of the Invention

[0004] The purpose of this invention is to propose a kill net generation method and system based on offline reinforcement learning to solve the problems of slow kill net generation speed, poor adaptability, reliance on manual intervention, and lack of end-to-end collaboration in the existing technology. It can efficiently generate kill nets with strong adaptability and high automation, and can quickly respond to changing battlefield environments, reduce manual intervention, and improve generation efficiency.

[0005] To achieve the above objectives, the present invention adopts the following technical solution.

[0006] A kill net generation method based on offline reinforcement learning includes the following steps: based on the tasks, resources, and dynamic factors involved in the simulation scenario, performing scenario modeling of the state space, action space, reward function, and state transition rules; performing reinforcement learning interactive simulation based on the scenario modeling to obtain offline data containing simulation trajectories; dividing the offline data into multiple tasks to obtain a multi-task offline dataset; training the generative trajectory decision network with meta-reinforcement learning on the multi-task offline dataset to obtain the trained generative trajectory decision network; and inputting the initial situational data into the trained generative trajectory decision network to generate a target kill net.

[0007] Furthermore, performing scenario modeling of the state space, action space, reward function, and state transition rules based on the tasks, resources, and dynamic factors involved in the simulation scenario includes: constructing, based on the scenario description, a state space model covering equipment availability status, target type, and target status; defining, based on the scenario description and task type, the task assignment for the equipment entities and constructing an action space model; constructing, based on the preset goals and resource values in the scenario description, a reward function model that includes damage rewards, consumption penalties, and progress rewards; and, based on resource changes and situation extraction after actions are executed, identifying and updating the environmental state to complete state transition modeling.

[0008] Furthermore, performing reinforcement learning interactive simulation based on the scenario modeling to obtain offline data containing simulation trajectories includes: constructing the input data structure of the proximal policy optimization algorithm from the action space, state space, and reward function formed by the scenario modeling; using the input data structure to interact with the simulation scenario in real time to obtain situational data, which is converted into state space data and input into the proximal policy optimization algorithm; selecting actions in the action space based on the state space data and feeding them back to the simulation scenario, repeating the interactive training until the algorithm converges; and collecting the interaction trajectories after the algorithm converges and extracting offline data including state, action, post-execution state, reward, and return.

[0009] Furthermore, dividing the offline data into multiple tasks to obtain a multi-task offline dataset includes: establishing an index based on the number of targets and equipment types in the offline data, and mapping the index to multiple mutually exclusive task subsets using a stratified sampling algorithm; statistically analyzing the state vector distribution of each task subset and calculating the overlap of the state distribution histograms between task subsets; and adjusting the sampling weights according to the overlap until the preset difference index is met, thereby generating the multi-task offline dataset.

[0010] Furthermore, the generative trajectory decision network includes: an embedding layer, used to map the input state vector and action label into feature vectors of fixed dimensions; a position encoding module, used to superimpose the position vector onto the feature vector to introduce time step information; a multi-head self-attention layer, used to calculate the correlation weights of features at different times and focus on key situational information; a feedforward neural network layer, used to perform point-by-point nonlinear transformations on the features output by the attention layer; a normalization and residual connection structure, which applies residual connections and layer normalization to the output of each layer; and an output projection layer, which maps the final features to the action probability distribution in the action space.

[0011] Furthermore, training the generative trajectory decision network with meta-reinforcement learning on the aforementioned multi-task offline dataset includes: generating and dividing meta-tasks based on different simulation scenarios to obtain a task condition vector composed of multiple meta-task scenarios; using the multi-task offline dataset to perform inner-layer meta-task training for each meta-task scenario defined by the task condition vector, generating simulation trajectory data for each meta-task; and summarizing the multi-task training results corresponding to the simulation trajectory data and updating the outer-layer meta-knowledge.

[0012] Furthermore, generating and dividing meta-tasks based on different simulation scenarios to obtain a task condition vector composed of multiple meta-task scenarios includes: extracting enemy configuration parameters, friendly resource parameters, and environmental constraint parameters to construct the task condition vector; associating the task condition vector with a subset of features of the state space model to define multiple meta-task scenarios; and embedding and fusing the task condition vector with the initial situation vector as auxiliary inputs to the generative trajectory decision network.

[0013] Furthermore, using the aforementioned multi-task offline dataset to perform inner-layer meta-task training for each meta-task scenario defined by the task condition vector, and generating simulation trajectory data for each meta-task, includes: sampling meta-task scenarios from the experience replay pool and initializing the policy network parameters; in the sampled scenario, using the proximal policy optimization algorithm to adjust the firepower allocation and target selection actions so as to maximize the cumulative reward; and recording the trajectory data generated during inner-layer training and evaluating the policy performance based on the task completion rate.

[0014] Furthermore, summarizing the multi-task training results corresponding to the simulation trajectory data and updating the outer-layer meta-knowledge includes: calculating the weighted difference between the cumulative reward and the policy entropy from the policy distribution after inner-layer training to obtain the single-task inner-layer loss; dynamically allocating weights according to the difficulty of each task and aggregating the single-task inner-layer losses of the multiple meta-tasks to construct the outer-layer total objective function; and calculating the gradient of the outer-layer total objective function with respect to the meta-parameters using the chain rule, and updating the initialization parameters of the generative trajectory decision network by gradient descent.

[0015] Further, the initial situational data is input into the trained generative trajectory decision network to generate a target kill network, including: The first frame data of each trajectory in the initial situation data is extracted as the initial situation information and mapped to the initial feature vector; Based on the initial feature vector, inference is performed, and the first action and the predicted post-execution state are output; The generated actions are combined with the predicted post-execution states and fed back to the model input. The generation process is executed in an autoregressive manner until the target achievement conditions are met, and the kill net action sequence is output to generate the target kill net.
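The autoregressive generation loop described in [0015] can be sketched roughly as follows. This is only an illustration under stated assumptions: `generate_kill_net`, `toy_model`, the `goal_reached` callback, the dictionary layout of states and actions, and the `max_steps` safety limit are all hypothetical names and choices that do not come from the patent, and `toy_model` is a hand-written stand-in for the trained generative trajectory decision network.

```python
def generate_kill_net(model, initial_state, goal_reached, max_steps=50):
    """Roll a model forward autoregressively.

    model(history, state) -> (action, predicted_next_state), where history is
    the list of (state, action) pairs generated so far.
    max_steps is an assumed safety limit, not part of the described method.
    """
    state = initial_state
    history = []
    actions = []
    for _ in range(max_steps):
        if goal_reached(state):          # stop once the goal condition holds
            break
        action, next_state = model(history, state)
        history.append((state, action))  # feed the output back as input
        actions.append(action)
        state = next_state
    return actions, state


def toy_model(history, state):
    """Hypothetical stand-in for the trained network: strike the
    lowest-numbered remaining target and predict that it is removed."""
    target = min(state["targets"])
    next_state = {"targets": [t for t in state["targets"] if t != target]}
    return {"task": "strike", "target": target}, next_state


actions, final_state = generate_kill_net(
    toy_model, {"targets": [3, 1, 2]}, lambda s: not s["targets"])
```

In this sketch the returned `actions` list plays the role of the kill net action sequence; a real system would replace `toy_model` with the trained network and `goal_reached` with the scenario's own target achievement conditions.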

[0016] A kill net generation system based on offline reinforcement learning comprises: a scenario modeling module, responsible for modeling the state space, action space, reward function, and state transition rules based on the tasks, resources, and dynamic factors involved in the simulation scenario; an offline data acquisition module, responsible for performing reinforcement learning interactive simulation based on the scenario model and acquiring offline data containing simulation trajectories; an offline data partitioning module, responsible for partitioning the offline data into multiple tasks to obtain a multi-task offline dataset; a meta-training module, which uses the multi-task offline dataset to perform meta-reinforcement learning training on the generative trajectory decision network to obtain the trained generative trajectory decision network; and a kill net generation module, which inputs the initial situational data into the trained generative trajectory decision network to generate the target kill net.

[0017] The present invention has achieved the following beneficial effects.

[0018] 1. This invention is based on the Proximal Policy Optimization (PPO) algorithm, utilizing a simulation environment to generate high-quality, diverse, and comprehensive historical action task datasets (such as reconnaissance-attack-evaluation trajectories under multiple scenarios and adversarial strategies). Through PPO-driven inference simulation, it overcomes the limitations of traditional methods that rely on real battlefield data or manually labeled data, constructing a high-fidelity, highly diverse, and scalable offline training dataset. This provides high-quality raw materials for subsequent offline learning, fundamentally solving the problems of poor adaptability caused by insufficient data and limited scenario diversity.

[0019] 2. This invention employs a meta-reinforcement learning (Meta-RL) framework, so that during training the model learns from different battlefield tasks and can rapidly adapt and generalize to new tasks. By jointly training on multiple typical scenarios (such as air superiority and joint naval strikes), the kill net generation strategy learns to generalize from one scenario to others and can adjust quickly in a new scenario with only a small number of samples, significantly reducing reliance on manual intervention and allowing a single training run to serve multiple deployment scenarios. Using an offline reinforcement learning framework and historical mission data for strategy pre-training allows the system to quickly generate highly adaptable kill nets without real-time interaction, addressing the problems of slow generation and poor adaptability.

[0020] 3. This invention, based on the Transformer model, models kill web generation as a sequential decision-making process. Utilizing the Transformer's global attention mechanism, it achieves joint modeling and dynamic coordination of the entire "perception-decision-attack-assessment" process. Breaking away from traditional modular and serial processing methods, it constructs an end-to-end kill web generation model. Through a self-attention mechanism, it automatically identifies key nodes, predicts path dependencies, and optimizes resource allocation, truly achieving cross-stage collaborative decision-making, significantly reducing the need for manual intervention, and enabling unattended, autonomous generation.

[0021] 4. This invention combines the uncertainty perception mechanism of RAD (the algorithm named in [0002]) with a representation uncertainty estimation mechanism introduced into the Transformer generative model, in order to dynamically identify high-risk, unseen state-action combinations and assign them confidence scores. During generation, it automatically avoids high-risk strategies, giving the system fault tolerance and the ability to question its own outputs, significantly improving robustness in out-of-distribution scenarios and addressing the problems of overfitting to historical data and inability to cope with sudden threats.

Attached Figure Description

[0022] Figure 1 is a flowchart of the main process of a kill net generation method based on offline reinforcement learning in the embodiment; Figure 2 is a flowchart illustrating the environment modeling, offline data acquisition, and segmentation process in this embodiment; Figure 3 is a flowchart of the proximal policy optimization algorithm in the embodiment; Figure 4 is a Transformer structure diagram of the generative trajectory decision network in the embodiment; Figure 5 is a flowchart of the meta-training and meta-testing process in the embodiment; Figure 6 is a block diagram of a kill net generation system based on offline reinforcement learning in one of the embodiments.

Detailed Implementation

[0023] To make the various technical features, advantages, or effects of the present invention more apparent and understandable, detailed descriptions are provided below through embodiments.

[0024] This invention provides a kill web generation method based on offline reinforcement learning. As shown in Figure 1, the specific steps are as follows: Step S1: Based on the tasks, resources, and dynamic factors involved in the simulation scenario, perform scenario modeling of the state space, action space, reward function, and state transition rules.

[0025] Specifically, as shown in Figure 2, abstracting and representing the tasks, resources, and dynamic factors of complex simulation scenarios provides a basic framework for training and evaluating intelligent decision-making algorithms, supporting complex task scenarios such as attacking fixed targets, slow-moving targets, and discrete time-sensitive targets.

[0026] In an optional embodiment of the present invention, step S1 may include the following sub-steps: Step S11: Summarize and categorize the resource types, quantities, location information, status information, and performance involved in the simulation scenario to obtain scenario description data.

[0027] Step S12: Based on the scene description data, extract the environmental entity information of the simulation system at the current moment and construct a state space model covering the equipment availability status, target type and target status.

[0028] Specifically, the state-space model interacts with the simulation model to obtain complete state data of the simulation system at a certain moment, returned by the simulation platform. Taking an aircraft entity as an example, the state-space model covers the equipment's available state, quantity, attack state, target type, target state, and target quantity information.

[0029] Step S13: Based on the scene description data and task type, define the allocation of action tasks for different equipment entities and construct an action space model.

[0030] Specifically, by assigning tasks to different equipment entities in the scenario, the aim is to maximize strike effectiveness and achieve mission objectives. Mission types primarily include strike missions and reconnaissance missions, with different mission types corresponding to different operational objectives and execution strategies.

[0031] Step S14: Based on the preset target and resource value in the scenario description data, construct a reward function model that includes damage reward, consumption penalty and progress reward.

[0032] Specifically, the reward function model follows the principle of mission completion: the number of targets destroyed is positively correlated with the reward, while the cost of equipment consumed is negatively correlated with the reward. The reward function model includes: 1. Damage reward: a positive, incentive-based reward intended to encourage damaging more valuable targets. Its expression (the formula itself is not reproduced in this text) is defined in terms of the damage reward, a discount factor, and a value function representing the damage to the target.

[0033] 2. Consumption penalty: a negative reward intended to encourage completing the task at low cost. Its expression (not reproduced in this text) is defined in terms of the consumption penalty, the cost of a single piece of equipment, and the quantity of equipment consumed.

[0034] 3. Progress reward (plan completion reward): a final reward granted according to the task's progress. Its expression (not reproduced in this text) is defined in terms of the reward for completing the plan, the negative penalty for not completing the progress, the actual progress, the preset progress target, and hyperparameters.
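One possible reading of the three reward terms is sketched below. The original formulas are not reproduced in this text, so every functional form here is an assumption chosen only to match the listed quantities (discount factor and target value; unit cost and quantity; actual versus preset progress with hyperparameters). The names `RewardConfig`, `damage_reward`, `consumption_penalty`, `progress_reward`, and `total_reward`, and all default values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    # All values are illustrative assumptions, not taken from the patent.
    discount: float = 0.99          # discount factor applied to target value
    progress_bonus: float = 10.0    # reward when the progress target is met
    progress_penalty: float = 5.0   # penalty scale when progress falls short


def damage_reward(target_values, cfg):
    """Positive reward: discounted sum of the values of damaged targets."""
    return cfg.discount * sum(target_values)


def consumption_penalty(unit_cost, quantity):
    """Negative reward: cost of one piece of equipment times quantity used."""
    return -unit_cost * quantity


def progress_reward(actual, target, cfg):
    """Final reward: a bonus if the preset progress target is reached,
    otherwise a penalty proportional to the shortfall."""
    if actual >= target:
        return cfg.progress_bonus
    return -cfg.progress_penalty * (target - actual)


def total_reward(target_values, unit_cost, quantity, actual, target, cfg=None):
    """Sum of the three terms (assumed to be combined additively)."""
    cfg = cfg or RewardConfig()
    return (damage_reward(target_values, cfg)
            + consumption_penalty(unit_cost, quantity)
            + progress_reward(actual, target, cfg))
```

With this shape, destroying more valuable targets raises the reward and consuming more equipment lowers it, which matches the correlations stated in [0032]; whether the actual method combines the terms additively is not stated in the text.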

[0035] Step S15: Based on the resource changes and situation information extraction after the execution of the action, identify and update the environmental state, and complete the state transition modeling.

[0036] Specifically, state transition modeling includes: modifying state data based on received actions, calculating the consumption of friendly ammunition after the actions are executed; and obtaining updated environmental state information after each time step by identifying and extracting information from the simulation environment situation.

[0037] Step S2: Perform reinforcement learning interactive inference based on scene modeling to obtain offline data containing the inference trajectory.

[0038] Specifically, as shown in Figure 2 and Figure 3, offline data is collected from a reinforcement learning training process in which the Proximal Policy Optimization (PPO) algorithm is trained online. A new scenario must provide a game-like adversarial environment for reinforcement learning training; it can be either manually constructed or generated by an algorithm.

[0039] In an optional embodiment of the present invention, step S2 may include the following sub-steps: Step S21: Based on the action space, state space, and reward function formed by scene modeling, construct the input data structure of the proximal policy optimization algorithm.

[0040] Specifically, the scenario is first modeled in step S1 to form the action space, state space, and reward function needed for reinforcement learning training of the proximal policy optimization algorithm, and then reinforcement learning training is started.

[0041] Step S22: Use the input data structure to interact with the simulation scenario in real time to obtain situational data, and convert it into state space data to input into the proximal policy optimization algorithm.

[0042] Specifically, situational data is acquired in real time from the simulation scenario and transformed, using the state space modeling method, into the input data structure required by the proximal policy optimization algorithm. The algorithm's input is therefore the state space data converted from the situational data that the simulation scenario generates in real time.

[0043] Step S23: Select an action in the action space based on the state space data and feed it back to the simulation scene. Repeatedly execute interactive training until the algorithm converges.

[0044] Specifically, the proximal policy optimization algorithm selects actions in the action space based on the input state and outputs them to the simulation scene. The simulation scene changes its state according to the actions taken, thus starting the next training cycle. When the proximal policy optimization algorithm for the strike or patrol scenario gradually converges, the interactive training process is complete.

[0045] Step S24: Collect the interaction trajectories after the algorithm converges, and extract offline data including state, action, post-execution state, reward, and return.

[0046] Specifically, the interaction data between the algorithm and the simulation model during operation is collected, i.e., the state transition trajectory of the algorithm in the simulation scenario. This data is processed and stored in an offline dataset. Each piece of structured data in the offline dataset includes: 1. State (state0): refers to the state at each decision, which is the state data obtained by the reinforcement learning model after receiving situational data and processing it.

[0047] 2. Action: Refers to the action data selected by the reinforcement learning algorithm in the given state.

[0048] 3. Post-execution state (state1): refers to the state data fed back by the simulation model after the action is executed.

[0049] 4. Reward: refers to the real-time reward value obtained by the reinforcement learning algorithm after selecting the action.

[0050] 5. Return: refers to the cumulative reward value after all actions are performed in a single simulation.

[0051] State, action, and post-execution state data are in JSON format, while reward and return data are in float format.
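A minimal sketch of one such offline record is shown below, assuming a Python dataclass and a one-record-per-line JSON layout. The class name `Transition`, the helper names, the field contents (`ammo`, `task`, `target`), and the storage layout are all hypothetical; only the five fields and their JSON or float types come from the description above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Transition:
    state0: dict      # state at decision time (JSON-compatible)
    action: dict      # action chosen in that state (JSON-compatible)
    state1: dict      # state after the action is executed (JSON-compatible)
    reward: float     # immediate reward for this step
    ret: float = 0.0  # cumulative reward of the whole simulation run


def finalize_episode(steps):
    """Fill in the episode return (sum of all step rewards) on every record."""
    total = float(sum(t.reward for t in steps))
    for t in steps:
        t.ret = total
    return steps


def to_json_lines(steps):
    """Serialize one record per line; `ret` is written under the key `return`."""
    lines = []
    for t in steps:
        row = asdict(t)
        row["return"] = row.pop("ret")
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)


episode = finalize_episode([
    Transition({"ammo": 2}, {"task": "strike", "target": 1}, {"ammo": 1}, 1.0),
    Transition({"ammo": 1}, {"task": "strike", "target": 2}, {"ammo": 0}, 2.5),
])
```

Because the return is defined as the cumulative reward over a single simulation, it can only be filled in once the episode has ended, which is why `finalize_episode` runs after all steps are collected.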

[0052] Step S3: Divide the offline data into multiple tasks to obtain a multi-task offline dataset.

[0053] Specifically, as shown in Figure 2, the offline dataset should contain multiple task scenarios, each representing a different simulation scenario, including variations in the number of targets, differences in equipment types, and differences in target characteristics. By appropriately partitioning the dataset, the model can learn general decision-making strategies across different task scenarios. The partitioning process must consider task diversity, ensuring that the offline dataset covers various simulation scenarios, equipment configurations, and differences in target characteristics, which helps improve the model's generalization ability. The offline dataset should also include comprehensive task trajectory information that reflects the possible decisions and behaviors in each task scenario, so that the model can extract features common to different tasks.

[0054] In an optional embodiment of the present invention, step S3 may include the following sub-steps: Step S31: Based on the number of targets and equipment types in the offline data, an index is established, and the index is mapped to multiple mutually exclusive task subsets using a stratified sampling algorithm.

[0055] Specifically, data is stratified based on key state variables in the scenario. For example, the number of targets is divided into three levels: low density (1-5), medium density (6-20), and high density (more than 20); equipment type combinations are divided into "single type" and "mixed type" categories. By controlling the proportion of data at each level in the total dataset, for example, ensuring that the sample number error at each level does not exceed ±10%, the model can learn decision features at different scales. The data partitioning process first establishes an index based on metadata fields in the offline dataset, such as the number of targets and equipment types, and then uses a stratified sampling algorithm to map the index into several mutually exclusive task subsets.
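The stratification in [0055] can be sketched as a simple bucketing step. The tier boundaries and the "single"/"mixed" categories come from the text; the metadata field names (`num_targets`, `equipment_types`), the function names, and the interpretation of the ±10% tolerance as a deviation from the mean subset size are assumptions made for illustration.

```python
from collections import defaultdict


def density_tier(num_targets):
    """Tiers from the text: low (1-5), medium (6-20), high (more than 20)."""
    if num_targets <= 5:
        return "low"
    if num_targets <= 20:
        return "medium"
    return "high"


def equipment_category(equipment_types):
    """'single' if exactly one distinct equipment type is present, else 'mixed'."""
    return "single" if len(set(equipment_types)) == 1 else "mixed"


def partition(trajectories):
    """Map each trajectory to exactly one (tier, category) task subset,
    so the resulting subsets are mutually exclusive."""
    subsets = defaultdict(list)
    for traj in trajectories:
        key = (density_tier(traj["num_targets"]),
               equipment_category(traj["equipment_types"]))
        subsets[key].append(traj)
    return dict(subsets)


def balanced(subsets, tolerance=0.10):
    """Assumed balance check: every subset size lies within the given
    relative tolerance of the mean subset size."""
    sizes = [len(v) for v in subsets.values()]
    mean = sum(sizes) / len(sizes)
    return all(abs(s - mean) <= tolerance * mean for s in sizes)
```

If `balanced` returns False, a sampler could down-weight the larger subsets; the text does not specify how the proportions are enforced, so that step is left out here.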

[0056] Step S32: Statistically analyze the state vector distribution of each task subset and calculate the overlap of the state distribution histograms between each task subset.

[0057] Specifically, data coverage is measured by trajectory length distribution and state space coverage. During processing, the full trajectory data of the complete simulation process is retained, while short segments with abnormal interruptions are removed. Simultaneously, the distribution of state vectors across each feature dimension is statistically analyzed to ensure that each possible combination of state features, such as whether the enemy has been detected or not, or whether our ammunition is sufficient or insufficient, has at least a certain number of samples in the dataset, thus guaranteeing generalization capability during the meta-training phase.

[0058] Step S33: Adjust the sampling weights according to the degree of overlap until the preset difference index is met, and generate a multi-task offline dataset.

[0059] Specifically, the overlap of the state distribution histograms of each subset is verified. If the overlap is too high, the sampling weights are readjusted until the preset scene difference index is met. Through the above partitioning principles, the offline dataset can capture the diversity and key features of task scenarios, providing data support for the training of meta-reinforcement learning models and improving the model's adaptability in complex simulation environments.
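The overlap check in steps S32 and S33 could be computed along the following lines. The text does not name an overlap measure or a threshold, so this sketch assumes histogram intersection on a single scalar state feature, with hypothetical names (`histogram`, `overlap`, `too_similar`) and an assumed threshold `max_overlap`.

```python
def histogram(values, bins, lo, hi):
    """Normalized histogram of scalar values over `bins` equal-width bins.
    Assumes `values` is non-empty; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v >= hi into last bin
        counts[max(idx, 0)] += 1                    # clamp v < lo into first bin
    total = sum(counts)
    return [c / total for c in counts]


def overlap(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical distributions,
    0.0 for distributions with no shared bins."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def too_similar(values_a, values_b, max_overlap=0.5, bins=4, lo=0.0, hi=1.0):
    """True if two task subsets overlap more than the assumed threshold,
    which would trigger another round of sampling-weight adjustment."""
    return overlap(histogram(values_a, bins, lo, hi),
                   histogram(values_b, bins, lo, hi)) > max_overlap
```

For full state vectors, the same check would be repeated per feature dimension, consistent with the per-dimension statistics mentioned in [0057]; how the sampling weights are then adjusted is not specified in the text.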

[0060] Step S4: Use a multi-task offline dataset to train the generative trajectory decision network using meta-reinforcement learning.

[0061] Specifically, the generative trajectory decision network adopts the trajectory Transformer model, which uses an encoder-decoder architecture and consists of several cascaded Transformer blocks at its core. Under the meta-reinforcement learning framework, the generative trajectory decision network acquires general meta-knowledge through training in multi-task environments, thereby quickly adapting to new task scenarios.

[0062] In an optional embodiment of the present invention, the generative trajectory decision network, as shown in Figure 4, includes: Embedding layer: used to receive input state vectors and action labels and map them to fixed-dimensional feature vectors through a linear transformation.

[0063] Position encoding module: used to add the learnable position vector to the embedding vector to distinguish the state information at different times and solve the problem of sequence order loss.

[0064] Multi-head self-attention layer: used to calculate the correlation weights of current moment features with features from all other historical moments, thereby enabling the focus on key battlefield situation information.

[0065] Feedforward neural network layer: used to perform point-by-point nonlinear transformation on the features output by the attention layer.

[0066] Normalization and residual connection structure: used to introduce residual connections and perform layer normalization in each substructure to ensure stable gradient propagation in deep networks.

[0067] Output projection layer: After processing by the final Transformer block, the final features are mapped to the action probability distribution in the action space through a linear layer and the Softmax function, thus completing the mapping of the action space.
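The layer stack listed above can be illustrated with a small pure-Python forward pass. This is a sketch under simplifying assumptions, not the network of the invention: it uses a single attention head instead of multiple heads, embeds only state vectors (no action labels), uses random untrained weights, and the class name `TinyTrajectoryBlock` is hypothetical.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyTrajectoryBlock:
    """One Transformer-style block followed by an action projection."""

    def __init__(self, state_dim, model_dim, n_actions, max_len, seed=0):
        rng = random.Random(seed)
        self.d = model_dim
        self.W_e = rand_matrix(model_dim, state_dim, rng)  # embedding layer
        self.pos = rand_matrix(max_len, model_dim, rng)    # learnable position vectors
        self.W_q = rand_matrix(model_dim, model_dim, rng)
        self.W_k = rand_matrix(model_dim, model_dim, rng)
        self.W_v = rand_matrix(model_dim, model_dim, rng)
        self.W_ff = rand_matrix(model_dim, model_dim, rng)  # feedforward layer
        self.W_o = rand_matrix(n_actions, model_dim, rng)   # output projection

    def forward(self, states):
        """Return the action distribution for the last time step."""
        # Embedding plus position encoding for every time step.
        h = [[e + p for e, p in zip(matvec(self.W_e, s), self.pos[t])]
             for t, s in enumerate(states)]
        # Attention of the current step over all steps seen so far.
        q = matvec(self.W_q, h[-1])
        keys = [matvec(self.W_k, x) for x in h]
        vals = [matvec(self.W_v, x) for x in h]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(self.d) for k in keys]
        attn = softmax(scores)
        ctx = [sum(a * v[j] for a, v in zip(attn, vals)) for j in range(self.d)]
        # Residual connection and layer normalization.
        x = layer_norm([a + b for a, b in zip(h[-1], ctx)])
        # Pointwise feedforward (ReLU), then residual and normalization again.
        ff = [max(0.0, v) for v in matvec(self.W_ff, x)]
        x = layer_norm([a + b for a, b in zip(x, ff)])
        # Linear projection and Softmax over the action space.
        return softmax(matvec(self.W_o, x))
```

Calling `TinyTrajectoryBlock(3, 4, 5, max_len=8).forward(states)` with a short list of 3-dimensional state vectors yields a probability vector over 5 hypothetical actions.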

[0068] In an optional embodiment of the present invention, the training process in step S4, as shown in Figure 5, may include the following sub-steps: Step S41: Generate and divide meta-tasks based on different simulation scenarios to obtain a task condition vector composed of multiple meta-task scenarios.

[0069] Specifically, by diversifying the action tasks, the model learns to adapt to different simulation scenarios. Based on battlefield situation and task information in the offline dataset, the scenario is divided into several meta-tasks. The input of each meta-task scenario is the initial scenario situation, and the output is an optimized kill net. When generating tasks, task complexity must be balanced to ensure that the meta-tasks include both high-complexity scenarios and simple tasks, avoiding unbalanced training.

[0070] Step S411: Extract enemy configuration parameters, friendly resource parameters, and environmental constraint parameters to construct a task condition vector.

[0071] Specifically, the task condition vector is used to represent different meta-task scenarios.

[0072] Enemy configuration parameters include the total number of enemy units, the number of high-value targets, and the enemy formation density index.

[0073] Our resource parameters include the number of available strike units, the current ammunition total, and the battery life threshold.

[0074] Environmental constraint parameters include the simulation time limit and the region complexity coefficient.

[0075] The above condition parameters together form the task condition vector τ. For the i-th meta-task, its task representation is defined in tuple form:

$$\tau_i = \left( N_{\text{enemy}},\; N_{\text{hv}},\; N_{\text{strike}},\; A_{\text{total}},\; T_{\text{limit}} \right)$$

where $\tau_i$ represents the task condition vector for the i-th meta-task, $N_{\text{enemy}}$ indicates the total number of enemy units, $N_{\text{hv}}$ indicates the number of high-value targets, $N_{\text{strike}}$ indicates the number of strike units available to our side, $A_{\text{total}}$ indicates the current total amount of ammunition on our side, and $T_{\text{limit}}$ indicates the time limit.

[0076] Step S412: Associate the task condition vector with a subset of features of the state space model to define multiple meta-task scenarios.

[0077] Specifically, each element in the task condition vector corresponds to a specific subset of features in the state space modeling in step S1. By establishing association logic, the state constraints under different meta-tasks are clarified.

[0078] Step S413: Embed and fuse the task condition vector and the initial situation vector as auxiliary input for the generative trajectory decision network.

[0079] Specifically, in the model input stage, the task condition vector and the initial situation state vector are concatenated or embedded and fused as auxiliary condition inputs to the trajectory Transformer model to guide the model in generating a kill net that adapts to the current task constraints.
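Steps S411 to S413 can be sketched as a small data structure plus a concatenation. The field names, types, and ordering below are hypothetical; the text only lists the parameters and states that the two vectors are "concatenated or embedded and fused", of which this sketch shows plain concatenation.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class TaskCondition:
    """Hypothetical container for the task condition parameters."""
    enemy_units: int             # total number of enemy units
    high_value_targets: int      # number of high-value targets
    enemy_density: float         # enemy formation density index
    strike_units: int            # available strike units
    ammunition: int              # current ammunition total
    endurance_threshold: float   # battery life threshold
    time_limit: float            # simulation time limit
    region_complexity: float     # region complexity coefficient

    def as_vector(self):
        return [float(x) for x in astuple(self)]


def build_model_input(condition, initial_situation):
    """Concatenate the task condition vector with the initial situation vector."""
    return condition.as_vector() + list(initial_situation)
```

In practice the concatenated vector would pass through the embedding layer before reaching the Transformer blocks; a learned fusion layer is an equally valid choice that this sketch does not cover.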

[0080] Step S42: Using the multi-task offline dataset, perform inner-layer meta-task training for each meta-task scenario defined by the task condition vector to generate the inferred trajectory data of each meta-task.

[0081] Specifically, the goal of inner-layer meta-task training is to optimize the model's policy within a single scenario, enabling it to fulfill the requirements of the current task. After dividing the scenario into meta-task scenarios, reinforcement learning algorithms are used to train the model to generate and collect data.

[0082] In an optional embodiment of the present invention, step S42 may include the following sub-steps: Step S421: Sample the meta-task scenario from the experience replay pool and initialize the policy network parameters.

[0083] Specifically, a specific meta-task scenario is sampled from the experience replay pool, and the environment state and policy network parameters of the current task are initialized. The policy network parameters are derived from the policy network within the proximal policy optimization algorithm.

[0084] Step S422: In the sampled scenario, adjust the firepower allocation and target selection actions through the proximal policy optimization algorithm to maximize the cumulative reward.

[0085] Specifically, in the inner-layer meta-task training, the model optimizes the policy network with the proximal policy optimization algorithm, taking the current task as input. The model explores the optimal policy in the task scenario and maximizes the kill net benefit by gradually adjusting the firepower allocation and target selection actions.
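For reference, the clipped surrogate objective commonly used in proximal policy optimization can be written in a few lines. This is the standard textbook form and is shown only as a sketch; the text does not state which PPO variant or clipping range is used, so `clip_eps=0.2` and the function name are assumptions.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes the incentive to move the policy too far.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping keeps each policy update close to the policy that generated the data, which is why this family of methods is called "proximal".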

[0086] Step S423: Record the trajectory data generated by the inner training layer and evaluate the policy performance based on the task completion rate.

[0087] Specifically, the trajectory generated during each inner-layer meta-task training is recorded. This trajectory includes state, action, reward, and outcome, providing a reference for subsequent outer-layer optimization stages. At the end of the inner-layer meta-task training, the model evaluates the current task scenario, recording the execution results and performance metrics of the model strategy. These performance metrics include task completion rate and resource consumption minimization, and are used as data feedback to determine whether further model optimization is needed.

[0088] Step S43: Summarize the multi-task training results corresponding to the inferred trajectory data and perform outer layer meta-knowledge update.

[0089] Specifically, the core purpose of outer-layer meta-knowledge updates is to extract general policy characteristics and knowledge from the training results of multiple tasks, providing support for the global optimization of the model. Through outer-layer updates, the model can compare trajectory data under different task scenarios, extract implicit task patterns and adaptive strategies, ensuring that the model can quickly adjust its decisions when facing new tasks, and improving its generalization ability and adaptability in dynamic and ever-changing simulation scenarios.

[0090] In an optional embodiment of the present invention, step S43 may include the following sub-steps: Step S431: Calculate the weighted difference between the cumulative reward and the policy entropy based on the policy distribution after inner layer training to obtain the single-task inner layer loss.

[0091] Specifically, after the inner-layer training is completed, the performance loss of the policy network is calculated for the i-th meta-task. This loss function combines the cumulative reward with a policy entropy regularization term used to encourage exploration, and its expression is:

$$\mathcal{L}_i(\theta_i') = -\,\mathbb{E}_{\xi \sim \pi_{\theta_i'}}\!\left[ R(\xi) \right] - \beta \, H\!\left(\pi_{\theta_i'}\right)$$

where $\mathcal{L}_i$ represents the single-task inner-layer loss of the i-th meta-task, $\theta_i'$ represents the policy network parameters after inner-layer adaptation for this task, $\theta$ represents the meta-parameters from which $\theta_i'$ is adapted, $\pi_{\theta_i'}$ represents the policy distribution under this task, $R(\xi)$ represents the cumulative reward of the task trajectory $\xi$, $H$ represents the policy entropy, and $\beta$ represents the balance coefficient.
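One way to compute a loss of this kind from sampled data is sketched below. The sign convention (negative mean return minus an entropy bonus), the averaging over trajectories and steps, and the default `beta` are assumptions made for illustration; the text only states that the loss is a weighted difference between cumulative reward and policy entropy.

```python
import math


def policy_entropy(probs):
    """Shannon entropy of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def single_task_loss(trajectory_returns, step_policies, beta=0.01):
    """Negative mean cumulative reward minus beta times mean policy entropy."""
    mean_return = sum(trajectory_returns) / len(trajectory_returns)
    mean_entropy = sum(policy_entropy(p) for p in step_policies) / len(step_policies)
    return -mean_return - beta * mean_entropy
```

Under this convention a higher return or a more exploratory (higher-entropy) policy both lower the loss.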

[0092] Step S432: Dynamically allocate weights according to the difficulty of each task, summarize the single-task inner-layer losses of multiple meta-tasks, and construct the outer-layer total objective function.

[0093] Specifically, the task losses obtained from the inner-layer training are weighted and summed to form the overall objective function of the outer layer, i.e., the meta-loss, which guides the update of the model's initial parameters. Its expression is:

$$\mathcal{L}_{\text{meta}}(\theta) = \sum_{i=1}^{M} w_i \, \mathcal{L}_i(\theta_i')$$

where $\mathcal{L}_{\text{meta}}$ represents the outer total objective function, $M$ represents the number of meta-tasks participating in this update, $\mathcal{L}_i(\theta_i')$ represents the single-task inner-layer loss of the i-th meta-task, and $w_i$ represents the weight of the i-th task, which is preferably dynamically adjusted based on the task difficulty or sample importance.

[0094] Step S433: Calculate the gradient of the outer total objective function with respect to the meta-parameters based on the chain rule, and update the initialization parameters of the generative trajectory decision network using the gradient descent algorithm.

[0095] Specifically, the outer training layer updates the initial parameters of the generative trajectory decision network using the gradient descent algorithm, enabling it to adapt to new task scenarios more quickly. The gradient of the total loss with respect to the meta-parameters is calculated based on the chain rule, and a gradient descent step with step size $\eta$ is executed. The optimized update is expressed as follows:

$$\theta^{*} = \theta - \eta \, \nabla_{\theta} \mathcal{L}_{\text{meta}}(\theta)$$

where $\theta^{*}$ represents the optimized and updated meta-parameters, $\theta$ represents the current meta-parameters, $\mathcal{L}_{\text{meta}}$ represents the outer total objective function, and $\eta$ represents the step size. Through the above iterative process, the optimized meta-parameters contain general decision-making characteristics across scenarios, representing a common strategy shared across multiple task scenarios, so that the network can converge quickly with only a small number of samples when facing new tasks.
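The inner adaptation, weighted outer objective, and chain-rule gradient step described above follow the general pattern of gradient-based meta-learning. The toy sketch below illustrates that pattern with a single scalar parameter and quadratic per-task losses standing in for the reward-and-entropy loss; the quadratic losses, the single inner gradient step, and the learning rates are assumptions, not details taken from the text.

```python
def task_loss(theta, target):
    """Toy stand-in for a single-task loss: squared distance to a task optimum."""
    return (theta - target) ** 2


def task_grad(theta, target):
    return 2.0 * (theta - target)


def meta_step(theta, targets, weights, inner_lr=0.1, outer_lr=0.05):
    """One outer update of the meta-parameter theta.

    Returns the updated theta and the weighted meta-loss before the update.
    """
    meta_grad = 0.0
    meta_loss = 0.0
    for target, w in zip(targets, weights):
        # Inner layer: adapt the shared initialization to this task.
        adapted = theta - inner_lr * task_grad(theta, target)
        meta_loss += w * task_loss(adapted, target)
        # Chain rule: dL/dtheta = dL/dadapted * dadapted/dtheta,
        # and for the quadratic toy loss dadapted/dtheta = 1 - 2 * inner_lr.
        meta_grad += w * task_grad(adapted, target) * (1.0 - 2.0 * inner_lr)
    return theta - outer_lr * meta_grad, meta_loss
```

Repeating `meta_step` moves `theta` toward an initialization from which every task can be reached in one inner step; with equal weights and targets 1 and 3 it settles near 2.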

[0096] Step S5: Input the initial situational data into the trained generative trajectory decision network to generate the target kill net.

[0097] Specifically, as shown in Figure 5, after receiving the initial situational data and a target list, the trained generative trajectory decision network generates subsequent action trajectories step by step in an autoregressive manner. These action trajectories are combined to generate a target kill net for the target list. The generation process enables dynamic adjustment and real-time updates of the actions within the target kill net, ensuring close coordination between the strike mission and changes in the enemy's situation.

[0098] In an optional embodiment of the present invention, step S5 may include the following sub-steps: Step S51: Extract the first frame data of each trajectory in the initial situation data as the initial situation information and map it into an initial feature vector.

[0099] Specifically, the initial situation information is the first data point of each trajectory in the offline data, obtained by abstracting and representing the simulation environment data based on the state space modeling rules. The initial situation information, in vector form $s_0$, is mapped to the initial feature vector $h_0$ by the embedding layer:

$$h_0 = W_e s_0 + p_0$$

where $h_0$ represents the initial feature vector, $W_e$ represents the linear embedding matrix, $s_0$ represents the initial situation information vector, and $p_0$ represents the position encoding.

[0100] Step S52: Perform inference based on the initial feature vector and output the first action and the predicted post-execution state.

[0101] Specifically, based on the initial situational information, the generative trajectory decision network first infers our first action, which falls within the task type categories defined in the action space modeling of step S1. While the first action is determined, the situation prediction head within the generative trajectory decision network outputs the predicted post-execution state based on the current situation and action information. The decoder uses the initial feature vector $h_0$ and the generated action history to predict the first action $a_0$. The calculation process is as follows:

$$Q = W_Q h_0, \quad K = W_K h_0, \quad V = W_V h_0$$

$$z_0 = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$$

The probability distribution is output through a linear transformation and the normalized exponential (Softmax) function, and the first action is selected from it:

$$P(a_0 \mid s_0) = \mathrm{softmax}\!\left( W_o z_0 + b_o \right)$$

where $Q$, $K$, and $V$ represent the query vector, key vector, and value vector, respectively; $W_Q$, $W_K$, and $W_V$ represent the corresponding weight matrices; $d_k$ represents the vector dimension; $z_0$ represents the feature vector of the intermediate layer; $W_o$ and $b_o$ represent the parameters for generating the action; and $P(a_0 \mid s_0)$ represents the probability distribution of selecting the first action in the initial situation $s_0$.

[0102] Step S53: Combine the generated action with the predicted post-execution state and feed it back to the model input.

[0103] Specifically, once the first action is generated, the model combines the predicted post-action state with the previous action, feeding this new situational information back to the input. This process is implemented in the decoder, which gradually expands the generated action sequence through an autoregressive approach.

[0104] Step S54: The generation process is executed cyclically in an autoregressive manner until the target achievement conditions are met, and the kill net action sequence is output to generate the target kill net.

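[0105] Specifically, the model iteratively executes the above reasoning and feedback process until all high-priority targets are successfully attacked or suppressed. For each step t, the update formulas are as follows:

State update: $$h_t = h_{t-1} + W_a a_{t-1}$$

Multi-head attention calculation: $$z_t = \mathrm{MultiHead}(Q_t, K_t, V_t)$$

Action generation: $$P(a_t \mid s_0, a_{<t}) = \mathrm{softmax}\!\left( W_o z_t + b_o \right)$$

where $h_t$ represents the updated feature vector at step t, $h_{t-1}$ represents the feature vector from the previous time step, $W_a$ represents the action embedding weight, $a_{t-1}$ represents the action performed at the previous moment, $Q_t$, $K_t$, and $V_t$ are the query, key, and value vectors computed from $h_t$, $z_t$ represents the feature vector of the intermediate layer, $W_o$ and $b_o$ represent the parameters for generating the action, and $P(a_t \mid s_0, a_{<t})$ represents the probability distribution of actions at the current moment. Through the above iterative process, the final kill net action sequence is output.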
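The autoregressive loop of steps S52 to S54 can be sketched independently of the network itself. In the sketch below the trained network is replaced by three hypothetical callables (`policy`, `predict_next`, `goal_reached`), and the toy versions that follow model targets as remaining-strength counters purely for illustration.

```python
def generate_kill_net(initial_state, policy, predict_next, goal_reached,
                      max_steps=20):
    """Generate an action sequence by feeding each predicted state back in."""
    state = initial_state
    history = []   # (state, action) pairs already generated
    actions = []
    for _ in range(max_steps):
        if goal_reached(state):
            break
        action = policy(state, history)            # S52: infer the next action
        next_state = predict_next(state, action)   # S52: predicted post-execution state
        history.append((state, action))            # S53: feed back to the input
        actions.append(action)
        state = next_state
    return actions, state


# Toy stand-ins for the trained network (hypothetical).
def toy_policy(state, history):
    """Strike the surviving target with the highest remaining strength."""
    target = max((t for t in state if state[t] > 0), key=lambda t: state[t])
    return ("strike", target)


def toy_predict(state, action):
    _, target = action
    nxt = dict(state)
    nxt[target] = max(0, nxt[target] - 1)
    return nxt


def toy_goal(state):
    return all(v == 0 for v in state.values())
```

The `max_steps` cap is a safety bound added for the sketch; the text itself only specifies termination when the target achievement conditions are met.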

[0106] This invention also provides a kill net generation system based on offline reinforcement learning for performing the above-described method. As shown in Figure 6, the system includes: a scenario modeling module, responsible for modeling the state space, action space, reward function, and state transition rules based on the tasks, resources, and dynamic factors involved in the simulation scenario; an offline data acquisition module, responsible for performing reinforcement learning interactive simulation based on the scenario modeling and acquiring offline data containing the simulation trajectories; an offline data partitioning module, responsible for partitioning the offline data into multiple tasks to obtain a multi-task offline dataset; a meta-training module, which uses the multi-task offline dataset to perform meta-reinforcement learning training on the generative trajectory decision network, thereby obtaining the trained generative trajectory decision network; and a kill net generation module, which inputs the initial situational data into the trained generative trajectory decision network to generate the target kill net.

[0107] Verification of the effectiveness of the method of this invention: To verify the effectiveness of the method of the present invention, this embodiment constructs a simulation test environment for performance evaluation. The specific test process and results are described below: 1. Test environment and configuration parameters: This embodiment uses the EADSIM simulation platform as the test base. The test scenarios cover 500 randomly generated initial situation samples ranging from simple single-mission scenarios to complex multi-mission scenarios. The simple scenario is represented by a single-target strike mission, while the complex scenario is represented by a joint firepower anti-carrier battle group mission.

[0108] The following indicators are selected as evaluation criteria in this embodiment: Generation success rate: refers to the proportion of times the model successfully generates a kill net action sequence that conforms to the task logic under different initial situations of complexity.

[0109] Generation speed: refers to the computation time required for the model to go from receiving the initial situational input to outputting a complete sequence of kill net actions.

[0110] Resource scheduling rationality: refers to the degree to which the resource allocation results in the action plan generated by the model conform to the preset cost constraint principle.
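The three indicators can be aggregated from per-scenario test records roughly as follows. The record fields `valid`, `seconds`, and `cost`, and the use of a single `cost_budget` threshold as the cost constraint, are hypothetical simplifications; the text does not define how conformity to the cost constraint principle is scored.

```python
def summarize_runs(runs, cost_budget):
    """Aggregate per-scenario results into the three evaluation indicators."""
    n = len(runs)
    return {
        # Share of scenarios with a task-consistent kill net action sequence.
        "success_rate": sum(1 for r in runs if r["valid"]) / n,
        # Mean time from initial situation input to complete action sequence.
        "mean_seconds": sum(r["seconds"] for r in runs) / n,
        # Share of generated plans whose resource cost stays within budget.
        "within_budget_rate": sum(1 for r in runs if r["cost"] <= cost_budget) / n,
    }
```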

[0111] 2. Test Result Analysis: This embodiment compares the operational data of manual decision-making, traditional rule-based algorithms, and the method of this invention under different complexity scenarios, and the statistical analysis results are as follows: (1) Success rate analysis: When facing complex simulation scenarios with high dynamics and strong adversarial elements, the kill net generated by the method of this invention has a higher success rate than traditional rule-based algorithms. Experimental data shows that the success rate of traditional rule-based algorithms fluctuates when dealing with unexpected and sudden situations; while the method of this invention, utilizing general decision-making knowledge acquired through meta-reinforcement learning, maintains a high generation success rate under situations of varying complexity, verifying the technical reliability of this invention in terms of scenario generalization ability.

[0112] (2) Generation speed analysis: The method of this invention uses the generative trajectory decision network for inference, and the time required to generate a kill net is less than that required for manual decision-making, which can meet the real-time decision-making needs of complex battlefield situations.

[0113] (3) Analysis of the rationality of resource scheduling: By analyzing the action execution records in the simulation log, the action plan formulated by the method of this invention conforms to the preset cost constraint principle in terms of resource consumption. Under the premise of meeting the mission objective, this method can effectively control equipment damage and ammunition consumption. The test results verify the rationality of the weighted design of positive damage reward and negative consumption penalty in the reward function of this invention, proving that this method has the ability to optimize resource scheduling under multiple constraints.

[0114] Although the present invention has been disclosed above with reference to embodiments, it is not intended to limit the present invention. Appropriate modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention should be covered within the protection scope of the present invention, which is defined by the claims.

Claims

1. A kill net generation method based on offline reinforcement learning, characterized in that it comprises the following steps: performing scenario modeling based on the tasks, resources, and dynamic factors involved in the simulation scenario, the scenario modeling including a state space, an action space, a reward function, and state transition rules; performing reinforcement learning interactive simulation based on the scenario modeling to obtain offline data containing simulation trajectories; dividing the offline data into multiple tasks to obtain a multi-task offline dataset; training a generative trajectory decision network by meta-reinforcement learning using the multi-task offline dataset to obtain a trained generative trajectory decision network; and inputting initial situational data into the trained generative trajectory decision network to generate a target kill net.

2. The method as described in claim 1, characterized in that performing scenario modeling, including the state space, action space, reward function, and state transition rules, based on the tasks, resources, and dynamic factors involved in the simulation scenario includes: constructing, based on the scenario description, a state space model covering equipment availability status, target type, and target status; defining task assignments for the equipment entities based on the scenario description and task types, and constructing an action space model; constructing, based on the preset goals and resource values in the scenario description, a reward function model that includes damage rewards, consumption penalties, and progress rewards; and identifying and updating the environmental state based on resource changes and situation extraction after actions are executed, to complete the state transition modeling.

3. The method as described in claim 2, characterized in that performing reinforcement learning interactive simulation based on the scenario modeling to obtain offline data containing simulation trajectories includes: constructing the input data structure of the proximal policy optimization algorithm based on the action space, state space, and reward function formed by the scenario modeling; using the input data structure to interact with the simulation scenario in real time to obtain situational data, converting the situational data into state space data, and inputting it into the proximal policy optimization algorithm; selecting actions in the action space based on the state space data, feeding them back to the simulation scenario, and repeating the interactive training until the algorithm converges; and collecting the interaction trajectories after the algorithm converges and extracting offline data including state, action, post-execution state, and reward.

4. The method as described in claim 1 or 3, characterized in that dividing the offline data into multiple tasks to obtain a multi-task offline dataset includes: establishing an index based on the number of targets and the equipment types in the offline data, and mapping the index to multiple mutually exclusive task subsets using a stratified sampling algorithm; statistically analyzing the state vector distribution of each task subset and calculating the overlap of the state distribution histograms between task subsets; and adjusting the sampling weights according to the overlap until the preset difference index is met, thereby generating the multi-task offline dataset.

5. The method as described in claim 1, characterized in that the generative trajectory decision network includes: an embedding layer, used to map the input state vectors and action labels into feature vectors of fixed dimension; a position encoding module, used to superimpose the position vector onto the feature vector to introduce time step information; a multi-head self-attention layer, used to calculate the correlation weights of features at different times and focus on key situational information; a feedforward neural network layer, used to perform point-by-point nonlinear transformations on the features output by the attention layer; a normalization and residual connection structure, used to perform residual connection and layer normalization on the output of each layer; and an output projection layer, used to map the final features to the action probability distribution in the action space.

6. The method as described in claim 1 or 5, characterized in that training the generative trajectory decision network by meta-reinforcement learning using the multi-task offline dataset includes: generating and dividing meta-tasks based on different simulation scenarios to obtain a task condition vector composed of multiple meta-task scenarios; using the multi-task offline dataset, performing inner-layer meta-task training for each meta-task scenario defined by the task condition vector to generate inferred trajectory data for each meta-task; and summarizing the multi-task training results corresponding to the inferred trajectory data and performing the outer-layer meta-knowledge update.

7. The method as described in claim 6, characterized in that using the multi-task offline dataset to perform inner-layer meta-task training for each meta-task scenario defined by the task condition vector to generate inferred trajectory data for each meta-task includes: sampling a meta-task scenario from the experience replay pool and initializing the policy network parameters; in the sampled scenario, adjusting the firepower allocation and target selection actions through the proximal policy optimization algorithm to maximize the cumulative reward; and recording the trajectory data generated during inner-layer training and evaluating the policy performance based on the task completion rate.

8. The method as described in claim 7, characterized in that summarizing the multi-task training results corresponding to the inferred trajectory data and performing the outer-layer meta-knowledge update includes: calculating the weighted difference between the cumulative reward and the policy entropy based on the policy distribution after inner-layer training to obtain the single-task inner-layer loss; dynamically allocating weights according to the difficulty of each task, summarizing the single-task inner-layer losses of the multiple meta-tasks, and constructing the outer total objective function; and calculating the gradient of the outer total objective function with respect to the meta-parameters based on the chain rule, and updating the initialization parameters of the generative trajectory decision network using the gradient descent algorithm.

9. The method as described in claim 1, characterized in that inputting the initial situational data into the trained generative trajectory decision network to generate the target kill net includes: extracting the first frame data of each trajectory in the initial situational data as the initial situation information and mapping it to the initial feature vector; performing inference based on the initial feature vector and outputting the first action and the predicted post-execution state; combining the generated action with the predicted post-execution state and feeding them back to the model input; and executing the generation process cyclically in an autoregressive manner until the target achievement conditions are met, and outputting the kill net action sequence to generate the target kill net.

10. A kill net generation system based on offline reinforcement learning, characterized in that it includes: a scenario modeling module, responsible for modeling the state space, action space, reward function, and state transition rules based on the tasks, resources, and dynamic factors involved in the simulation scenario; an offline data acquisition module, responsible for performing reinforcement learning interactive simulation based on the scenario modeling and acquiring offline data containing the simulation trajectories; an offline data partitioning module, responsible for partitioning the offline data into multiple tasks to obtain a multi-task offline dataset; a meta-training module, which uses the multi-task offline dataset to perform meta-reinforcement learning training on the generative trajectory decision network to obtain the trained generative trajectory decision network; and a kill net generation module, which inputs the initial situational data into the trained generative trajectory decision network to generate the target kill net.