Methods, systems and computer program products for controlling a combat system

A machine learning-based combat system efficiently manages air defense assets to address the limitations of manual systems, achieving high success rates in neutralizing large numbers of fast, small targets by using null-op commands to optimize engagement strategies.

DE102024136182A1 · Pending · Publication Date: 2026-06-11 · HELSING GMBH

Patent Information

Authority / Receiving Office: DE · DE
Patent Type: Applications
Current Assignee / Owner: HELSING GMBH
Filing Date: 2024-12-04
Publication Date: 2026-06-11

Application Information

IPC: F41H11/02

CPC: F41H11/02; G05B13/0265; F41G7/007; F41G3/04

AI Tagging

Application Domain

Defence devices; Direction controllers

Similar Technology Patents

Multi-target association identification method, system, device and medium for drone swarm
CN121916729B · Defence devices; Artificial life
Methods, systems, and computer program products for controlling an engagement system
WO2026120084A1 · Defence devices; Direction controllers
Detection device, detection system, detection method, and computer program
WO2026133992A1 · Defence devices; Image analysis; Feature extraction; Radiology
Driving lock
DE102024137021A1 · Defence devices; Traffic restrictions
An anti-unmanned aerial vehicle system target detection method based on multi-source heterogeneous data fusion
CN115932834B · Defence devices; Image analysis

AI Technical Summary

Technical Problem

Existing air defense systems are manually operated, leading to slow human threat assessment and decision-making processes that limit the ability to neutralize large numbers of fast, small, and poorly observable targets, such as unmanned aerial vehicles (UAVs), despite having sufficient weapons.

Method used

A combat system utilizing a machine learning algorithm, particularly reinforcement learning, to centrally manage combat assets by determining action commands, including null-op commands to instruct assets not to engage targets during specified intervals, enabling rapid and effective responses to saturation attacks.

Benefits of technology

The system achieves an operational success rate of 88% in simulation tests, compared to 7% for rule-based methods and 0% for human operations, allowing scalable and efficient management of combat resources and coordination against multiple targets.

✦ Generated by Eureka AI based on patent content.

Abstract

A computer-implemented method for controlling a combat system, the method comprising: determining one or more combat assets included in the combat system, each combat asset being configurable to engage one or more target objects; receiving sensor data indicating a position, and preferably movement, of one or more target objects relative to the one or more combat assets; simultaneously determining, by an agent component and based on the received sensor data, an action command for each of the one or more combat assets and a null-op command to instruct at least one, preferably a subset, of the combat assets not to engage any target object during a specified time interval; and transmitting the respective action command to the one or more combat assets, wherein the agent component preferably comprises a machine learning algorithm, more preferably a machine learning algorithm based on reinforcement learning.

Description

Technical field

[0001] This disclosure relates to systems, methods, and computer program products for controlling a combat system. The disclosure is applicable in the field of threat assessment and weapon allocation, particularly with regard to applications in air and space defense, especially air defense.

Background

[0002] A common problem in air defense is the arrival of a large number of fast, small targets with poor observability. For example, an attack with a large number of unmanned aerial vehicles (UAVs) requires a rapid decision-making process.

[0003] Known air defense systems are manually operated. Human threat assessment and decision-making processes can be so slow that, despite the availability of a sufficient number of weapons, only a limited number of targets can be destroyed. This limits the ability to neutralize or reduce the effectiveness of an attack.

[0004] There is a need for systems and methods that overcome these shortcomings.

Brief description

[0005] Systems, methods and devices for controlling a combat system are disclosed and claimed herein.

[0006] A first aspect of the present disclosure relates to a computer-implemented method for controlling a combat system. The method comprises the following steps:
• Determining one or more combat assets included in the combat system, each of which is configurable to engage one or more target objects;
• Receiving input data, in particular sensor data, indicating the position and preferably the movement of one or more target objects relative to the one or more combat assets;
• Simultaneously determining, by an agent component and based on the received input data, an action command for each of the one or more combat assets and a null-op command to instruct at least one, preferably a subset, of the combat assets not to engage any target object during a specified time interval; and
• Transmitting the respective action command to the one or more combat assets.

[0007] The agent component preferably includes a machine learning algorithm, and even more preferably a machine learning algorithm based on reinforcement learning.

[0008] The combat resources can thus be managed centrally by a single operating unit. This means that one agent component can control the entire decision-making process instead of multiple agent components. This enables the determination of action commands that together form a coordinated response to the targets. This response has been found to allow a rapid and effective reaction to a large number of targets (e.g., targets conducting a saturation attack). In simulation-based tests, the method has demonstrated an operational success rate of 88%, compared to a success rate of 7% for a conventional rule-based method and a success rate of 0% for a human operations manager in the same simulation scenario. The advantage of this approach is that the specific way in which the action space is constructed (i.e., null-op commands and target selection) enables scalable and efficient training and an effective way to deal with saturation attacks and the coordination of combat resources.

[0009] The combat system can include a land-based, sea-based, or air-based air defense system. Action orders can direct combat assets to destroy target objects. A combat action can preferably involve firing a combat asset, which may include any type of weapon system, at a target object. For example, a combat action can involve firing a guided missile and the subsequent flight of the guided missile to the target object. A successful combat action can include destroying, disrupting, and / or neutralizing the target object. Target objects can include aircraft, guided missiles, UAVs, or any other object, especially moving objects.

[0010] The combat system can thus comprise a system capable of controlling multiple combat assets acting collectively. Each combat asset is preferably capable of engaging up to one target at a time. Each combat asset can comprise one or more weapon systems and / or vehicles or aircraft, such as unmanned vehicles, in particular unmanned aerial vehicles (UAVs). The combat system can include an air and space defense system. The combat system can comprise any system capable of performing a threat evaluation and weapon assignment (TEWA) task. The combat asset can comprise any means capable of achieving a lethal or non-lethal effect on the target. The effect can include disrupting, neutralizing, or destroying one or more of the targets. The combat asset preferably comprises a weapon such as a cannon, e.g., anti-aircraft artillery, or a guided missile or rocket launcher. However, a combat asset can also comprise any other type of delivery system, such as any electronic attack device.

[0011] The input data can be received from a sensor that is included in the combat system or is communicatively connected to it.

[0012] The agent component preferably includes a machine learning algorithm, and even more preferably a machine learning algorithm based on reinforcement learning.

[0013] A machine learning algorithm is well-suited for controlling the combat system. This is because, given sufficient data, a machine learning algorithm can be trained to calculate action commands in a near-optimal manner from observations that encompass information about targets and combat resources. The machine learning algorithm thus provides a mapping from observations to actions. A reinforcement learning algorithm is particularly suitable because it allows training based on data generated through experience gained from interactions with an environment. It improves its own performance over the course of the entire training process. The agent component can be trained through reinforcement learning, whereby it works with observations, and a history of observations, from an environment encompassing target objects and combat assets, and instructs the combat assets to engage the target objects.

[0014] The agent component preferably comprises an output layer containing one or more heads. Each head is configured to determine and / or output an action command and / or a null-op command. In a preferred embodiment, the agent component comprises a plurality of output heads, including at least one combat asset output head configured to determine an action command, which provides instructions for an action by a combat asset, and / or a local null-op command, which instructs a combat asset not to engage the target. The combat asset output head can be directly coupled to a combat asset to transmit the command to the combat asset. However, it is preferred to post-process the output of each head using a post-processing algorithm.

[0015] The agent component preferably includes an additional output head, which can be referred to as a delay head, configured to issue a global null-op command instructing all combat assets not to engage any target. In this case, the post-processing algorithm can change any output from any of the combat asset output heads to a null-op command. This has the advantage that a strategy in which no combat asset engages any target does not necessarily require local null-op commands for each combat asset. This allows the agent component to determine the optimal time to engage multiple approaching targets at a lower frequency than the frequency at which it currently acts, and thus to make better long-term decisions. Furthermore, any class imbalance caused by a large number of null-op steps is limited to the instructions determined by the delay head.
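
The post-processing step described above can be illustrated with a minimal sketch. The encoding is an assumption made for illustration only and is not taken from the patent: each combat asset output head yields either a target index or `None` (a local null-op), and the delay head yields a boolean flag. The names `post_process` and `NULL_OP` are hypothetical.

```python
from typing import List, Optional

# Hypothetical encoding: None stands for "do not engage" (null-op).
NULL_OP = None


def post_process(head_outputs: List[Optional[int]],
                 global_null_op: bool) -> List[Optional[int]]:
    """Return one command per combat asset.

    If the delay head issued a global null-op, every per-asset output is
    replaced by NULL_OP; otherwise the per-asset outputs pass through.
    """
    if global_null_op:
        return [NULL_OP] * len(head_outputs)
    return list(head_outputs)


# Per-asset heads propose targets 2 and 0 and one local null-op,
# but the delay head holds all assets back for this step.
commands = post_process([2, None, 0], global_null_op=True)
```

In this sketch the per-asset heads still produce outputs while a global null-op is active; they are simply discarded, which matches the description of the delay head taking precedence.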

[0016] The null-op command can generally instruct one or more combat assets not to engage any target. The null-op command can be part of the action command or a separate instruction. In other words, the action performed by the combat asset can include engaging one or more targets or not engaging any target. The null-op command can be transmitted to the combat asset or cause the agent component to refrain from transmitting any commands during the specified time interval. In the latter case, the combat asset is preferably configured not to engage any target unless it receives an instruction to do so. The null-op command can include a specification of the time interval, e.g., as a number of seconds or a predetermined unit of time, or it may not specify a time interval at all. In the latter case, the combat asset is preferably instructed not to engage any target during a predetermined time step.

[0017] The null-op command has the advantage that the agent component is not only configured to select a target for each combat asset, but it can also choose not to engage any target at all. Therefore, the agent component can develop a strategy in which the combat system temporarily holds back from an identified approaching target until it is close enough to significantly increase the probability of a successful engagement. If the agent component includes a reinforcement learning-based machine learning algorithm, the probability of a successful engagement does not need to be explicitly specified. It can be increased, preferably maximized, by the training algorithm.

[0018] The null-op command can be defined as a sequence of null-op commands. In other words, defining a time interval during which no combat is initiated can cause the agent component to transmit a null-op command to the respective combat asset at each time step. The null-op command can include the omission of any command whatsoever.
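
As a rough illustration of how a null-op with a time interval could be expressed as a sequence of per-step null-op commands, the following sketch expands an interval into one entry per time step. The function name, the string token, and the rounding rule (round up to whole steps) are assumptions for illustration, not details from the patent.

```python
import math
from typing import List


def expand_null_op(interval_s: float, step_s: float) -> List[str]:
    """Expand a hold interval into one null-op command per time step.

    The interval is rounded up to a whole number of steps (an assumption).
    """
    if interval_s < 0 or step_s <= 0:
        raise ValueError("interval must be >= 0 and step must be > 0")
    n_steps = math.ceil(interval_s / step_s)
    return ["NULL_OP"] * n_steps


# A 1-second hold at a 0.25-second decision step becomes four null-ops.
sequence = expand_null_op(1.0, 0.25)
```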

[0019] In one embodiment, the respective action order to a combat asset includes an order for the combat asset to conduct combat against one or more of the one or more target objects. In this embodiment, the method further comprises determining a plurality of respective action orders to conduct combat against a target object among the one or more target objects simultaneously by each of the plurality of combat assets.

[0020] This means that the agent component can control multiple combat assets to engage multiple targets. One advantage of this is increased flexibility: battles against targets can be conducted with one or more combat assets, depending on their availability and suitability for the target, and according to the target's relevance. If multiple combat assets are engaged against the same target simultaneously, it is preferred that each asset receives an action order independently.

[0021] In another embodiment, the method further comprises the agent component determining a further action command that causes the combat asset to refrain from engaging any target until a reactivation command is received. This allows the agent component to further delay the combat asset's engagement.

[0022] In a further embodiment, the method further comprises:
• not engaging the target object with the combat asset during the specified time interval; and / or
• engaging a target object with the combat asset after the end of the specified time interval; and / or
• not engaging a target object with the combat asset after the end of the specified time interval.

[0023] If the combat system does not engage the target, particularly during the specified time interval, the combat system may employ a strategy determined by the agent component, whereby the target is engaged only when an action command from the agent component instructs the combat system to do so. After the end of the specified time interval, the combat system may engage the target, particularly according to an instruction from the agent component. In embodiments, the combat system may not engage the target after the end of the specified time interval, for example, in response to a second null-op command. In examples, the first null-op command can specify a time interval during which none of a plurality of combat assets can engage the target, and a second null-op command can specify a time interval during which only one of the combat assets cannot engage a target.

[0024] In yet another embodiment, the method can further include the transmission, by the agent component, of the null-op command to the combat asset, which is instructed not to engage any target. This means that the combat asset is explicitly instructed not to engage any target. The null-op command can be transmitted at a predetermined rate, preferably identical to the rate at which the agent component is configured to determine an action. This rate is preferably a rate at which the agent component executes a sequence of computational steps to determine the action and / or null-op commands. This allows the null-op commands to be used as active state signals for the combat asset.

[0025] The method can further include not transmitting any action commands within the specified time interval. This reduces the bandwidth used on each data link between the agent component and the combat assets.

[0026] In one embodiment, the method further comprises the continuous transmission, by the agent component, of a plurality of identical and / or different action commands.

[0027] In another embodiment, the null-op command instructs all combat assets not to engage any target during the specified time interval.

[0028] In other words, the null-op command can apply to all combat assets. Such a null-op command can be considered a global null-op command. This allows the agent component to develop a strategy in which no combat occurs during the specified time interval. If the agent component includes a delay head configured to determine a global null-op command, then only the delay head needs to determine the global null-op command to ensure that no target is engaged. This reduces the class imbalance that arises from a strategy in which all combat assets remain inactive for a certain interval. In other words, the agent component can generate engagement instructions against a target through the combat asset output heads, which are then discarded as long as a global null-op command is issued. In this case, no target is engaged. This reduces the class imbalance in a scenario where no target is engaged.

[0029] The respective action command can include multiple null-op commands, with each null-op command instructing a specific combat asset to refrain from engaging any target during a specified time interval. This can be implemented through individual asset output heads of the agent component, which issue each null-op command. Alternatively, it can be implemented through a post-processing algorithm that determines the null-op commands for each of the respective combat assets in response to receiving a global null-op command from the delay head. This allows instructions for the respective combat assets to be determined based on a global command.

[0030] In another embodiment, the null-op command can include a specification of a delay and / or the specified time interval. In particular, the null-op command can define one or more individual delays to be applied individually to one or more combat assets. This allows the agent component to determine a preferred strategy that stipulates no operation during a predetermined time step or specified time interval. The agent component is then flexible enough to be trained, for example, through reinforcement learning, to arrive at a more appropriate strategy. Even in a scenario where there is a risk of a class imbalance between action and non-action commands, for example because most action commands are null-op commands, determining the respective null-op commands for all combat assets further reduces this class imbalance for each combat asset.

[0031] In one embodiment, action commands are transmitted only in response to receiving an acknowledging user input within a predetermined time interval prior to determining the action command. Preferably, a null-op command is transmitted if no acknowledging user input is received during the acknowledgment time interval. This can be described as a human-in-the-loop solution. This allows for human control of the device to increase safety, while simultaneously benefiting from the automated handling of complexity and recommendation of combat decisions for which the agent component has been trained.
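
The human-in-the-loop gating described above can be sketched as a simple check. This is an illustration under stated assumptions: timestamps are plain seconds, commands are opaque strings, and the names `gate_commands` and `ack_window_s` are hypothetical rather than taken from the patent.

```python
from typing import List, Optional


def gate_commands(proposed: List[str],
                  ack_time_s: Optional[float],
                  decision_time_s: float,
                  ack_window_s: float) -> List[str]:
    """Pass proposed commands through only if a user acknowledgement
    arrived within `ack_window_s` seconds before the decision time;
    otherwise replace every command with a null-op."""
    acknowledged = (
        ack_time_s is not None
        and 0.0 <= decision_time_s - ack_time_s <= ack_window_s
    )
    if acknowledged:
        return list(proposed)
    return ["NULL_OP"] * len(proposed)


# Acknowledged 1 s before the decision, within a 2 s window: commands pass.
released = gate_commands(["cmd_a", "cmd_b"], 9.0, 10.0, 2.0)
# No acknowledgement at all: everything falls back to null-op.
held = gate_commands(["cmd_a", "cmd_b"], None, 10.0, 2.0)
```

Defaulting to null-op when no acknowledgement is present means that a missing or late confirmation never results in an engagement.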

[0032] A second aspect of the present disclosure relates to a computer-implemented method for training an agent component, in particular the agent component of the first aspect, of a combat system comprising one or more combat assets configurable to engage one or more target objects. The method comprises the following steps:
• Receiving training data comprising observations of an environment that indicate the position and preferably the movement of one or more target objects relative to the one or more combat assets;
• Determining, by the agent component and based on the training data, an action command for each of the combat assets, wherein the action command includes an instruction to engage one or more of the target objects, or a null-op command not to engage any target object during a specified time interval;
• Determining a reward based on the action; and
• Updating the agent component based on the reward.

[0033] The advantage is that this type of training, particularly reinforcement learning, provides an agent component trained to implement a long-term strategy, over the duration of a scenario, for the overall purpose defined by the reward. A long-term strategy might include a policy of maximizing rewards. In implementations, the training data can be based on a scenario lasting from a few minutes to several hours. A typical time step in the decision-making process, specifically the duration of a loop for determining observations, determining action commands, determining the reward, and updating the agent, can be on the order of milliseconds. This learning method has been shown to produce a successful policy over millions of time steps. Training data can be generated through simulations (which is preferred), but also using real-world data, e.g., to enable learning during field operations.

[0034] The training procedure may further include determining the active state, position, and / or movement of one or more protected assets, particularly high-value assets. The protected assets may, but need not, be part of the combat system. The active state, position, and / or movement of the protected assets may be used to determine a reward for them. The reward for protected assets may include a negative reward if a target has successfully engaged the protected asset.

[0035] In one embodiment, the reward includes one or more of the following:
• a positive intermediate reward in response to a determination that the predicted action will lead to a successful engagement against a target, wherein the intermediate reward is preferably based on a predetermined importance rating regarding the target;
• a negative intermediate reward in response to a determination that a combat asset has consumed part of a finite resource, in particular ammunition and / or energy; and / or
• a negative intermediate reward in response to a determination that a predetermined asset, in particular a sensor and / or a combat asset included in the combat system, is no longer operational, in particular because it has been successfully engaged by one or more of the target objects.
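
The three intermediate reward terms listed above can be combined as in the following sketch for a simulated time step. All names (`StepEvents`, `intermediate_reward`) and all numeric weights are hypothetical placeholders; the patent does not state concrete values or a specific formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepEvents:
    """Events observed in one simulated time step (hypothetical structure)."""
    neutralized_importance: List[float]  # importance rating per successful engagement
    resource_units_used: int             # units of a finite resource consumed
    assets_lost: int                     # sensors or combat assets no longer operational


def intermediate_reward(events: StepEvents,
                        resource_cost: float = 0.1,
                        asset_loss_penalty: float = 5.0) -> float:
    """Sum the positive and negative intermediate reward terms."""
    reward = sum(events.neutralized_importance)            # positive term
    reward -= resource_cost * events.resource_units_used   # resource penalty
    reward -= asset_loss_penalty * events.assets_lost      # asset-loss penalty
    return reward


# Two successful engagements (importance 1.0 and 2.0) using two resource units.
r = intermediate_reward(StepEvents([1.0, 2.0], resource_units_used=2, assets_lost=0))
```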

[0036] Applying a reward may involve updating a control parameter of the agent component depending on the reward. The importance rating may be predetermined, for example, set by a user to assign a fixed, higher importance rating to a higher-value target object—that is, a target object considered more important. Alternatively, the importance rating may be determined by an algorithm, such as a predetermined formula that calculates the importance rating based on the target object's type, location, and / or velocity. However, the importance rating may also be a fixed value, and in some cases, it may be the same for all target objects. Even in these cases, the training procedure has been shown to lead to a successful policy.

[0037] The negative intermediate reward preferably takes into account any engagement against the combat system and / or any protected asset by any of the targets. In other words, the targets can include threats configured to engage any asset that the combat system is configured to protect. Specifically, the targets can attack any of the combat assets and / or sensors. In this case, the training can enable the agent component to engage the targets in such a way that the engagement against them is successful before they are able to engage the combat system successfully, for example, in a counterattack. The combat system can be configured to protect any asset, such as infrastructure, by defining, for example through user input, a negative reward in the event of a successful engagement against the asset by the target objects. In an illustrative example, the combat system might include long-range combat assets, and the target objects might include a shorter-range means of engaging other objects, such as some of the combat assets and / or one or more protected assets. In this case, training has been shown to induce the agent to develop a policy that favors engaging the target objects with long-range combat assets before the target objects can engage the combat assets and / or protected assets.

[0038] The negative intermediate reward can depend on the scarcity of the resource. For example, during the scenario, the magnitude of the negative reward for consuming the resource can be increased, perhaps through an analytical function, as the resource becomes scarcer. In other words, higher negative rewards can be applied when a scarce, e.g., expensive, resource such as a guided missile is consumed, especially as more and more ammunition is used and the limited supply decreases. This allows the agent component to be trained to develop a policy for rationing ammunition use, e.g., by using less scarce and / or cheaper ammunition types in an initial engagement and only using more expensive and / or scarce ammunition types if the initial engagement is unsuccessful. In one implementation, however, the reward values are independent of the amount of ammunition consumed. In one embodiment, the training includes applying a final reward, in particular a positive final reward if the engagement against all target objects was successful, and / or a negative final reward if the engagement by a target object against a protected asset was successful and / or the engagement against all combat assets was successful, so that the agent component can no longer defend the protected assets.
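
One possible shape for a scarcity-dependent penalty is a linear function of the depleted fraction of the supply. The patent only mentions "an analytical function" without specifying it, so the linear form, the function name `scarcity_penalty`, and the parameters `base_cost` and `k` below are purely illustrative assumptions.

```python
def scarcity_penalty(remaining: int, initial: int,
                     base_cost: float = 1.0, k: float = 2.0) -> float:
    """Negative reward for consuming one unit of a finite resource.

    The penalty grows linearly in magnitude as the supply is depleted:
    -base_cost when the supply is full, -base_cost * (1 + k) when empty.
    """
    if initial <= 0 or not 0 <= remaining <= initial:
        raise ValueError("require initial > 0 and 0 <= remaining <= initial")
    depleted_fraction = 1.0 - remaining / initial
    return -base_cost * (1.0 + k * depleted_fraction)


# Penalty with a full supply versus a half-depleted supply.
full = scarcity_penalty(remaining=10, initial=10)
half = scarcity_penalty(remaining=5, initial=10)
```

Setting `k = 0` recovers the implementation variant in which the reward is independent of the amount of ammunition consumed.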

[0039] The final reward can be determined / applied at the end of an episode, particularly at a time when the battle against all targets has been successful and / or when all sensors and / or combat equipment are no longer operational and / or when a high-value asset has been destroyed by one or more of the targets.

[0040] In one embodiment, the method further comprises the following:
• Simulating the environment to determine the training data;
• Iteratively updating the simulated environment by performing a simulated time step based on the action and determining a further action based on the updated environment; and
• preferably terminating the iteration in response to a termination condition or after a predetermined time.
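
The iterate-until-termination structure above can be sketched as a generic episode loop. `CountdownEnv` is a toy stand-in that only counts steps, not a combat simulation, and the `reset`/`step` interface and the name `run_episode` are assumptions for illustration. The agent update itself is omitted.

```python
from typing import Any, Callable, Tuple


class CountdownEnv:
    """Toy stand-in for a simulator: terminates after `n` steps."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: Any) -> Tuple[int, float, bool]:
        self.t += 1
        # observation, reward, termination flag
        return self.t, 1.0, self.t >= self.n


def run_episode(env: CountdownEnv,
                policy: Callable[[Any], Any],
                max_steps: int) -> Tuple[float, int]:
    """Step the simulated environment until it signals termination or
    the predetermined step budget is exhausted."""
    obs = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(obs)                  # determine a further action
        obs, reward, done = env.step(action)  # perform one simulated time step
        total_reward += reward
        if done:                              # termination condition reached
            return total_reward, t + 1
    return total_reward, max_steps            # stopped after the step budget


result = run_episode(CountdownEnv(3), lambda obs: "NULL_OP", max_steps=10)
```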

[0041] Here, updating the simulated environment preferably includes determining a counterattack against one or more of the sensors and / or one or more of the combat assets by one or more of the target objects.

[0042] Here, the termination condition preferably includes a stipulation that the combat operations against all target objects were successful and / or a stipulation that all sensors and / or all combat assets suffered a successful counterattack.

[0043] The specific commands that can be determined by the agent component, as described above for the case of inference, can also be determined during the training process, including in particular:
• Determining a further action command to instruct the combat asset to refrain from engaging any target of the one or more targets until a reactivation command is received;
• Transmitting, by the agent component, the null-op command to the combat asset with the instruction not to engage any target;
• Not transmitting any action commands within the specified time interval;
• Continuously transmitting, by the agent component, a plurality of identical and / or different action commands;
• Determining a null-op command instructing all combat assets not to engage any target during the specified time interval; and / or
• Determining an action command that includes a plurality of null-op commands, each null-op command instructing a respective combat asset not to engage any target during a respective specified time interval.

[0044] The null-op command can include a specification of a delay and / or the specified time interval, even during training.

[0045] Similarly, the actions of the combat assets can be carried out in a simulated or a real-world environment, including:
• not engaging any target among the one or more targets with the combat asset during the specified time interval; and / or
• engaging a target object among the one or more target objects with the combat asset after the end of the specified time interval; and / or
• not engaging a target object among the one or more target objects with the combat asset after the end of the specified time interval.

[0046] This makes it possible to carry out all the steps taken during the inference process during the training process as well, and to determine the reward function based on realistic behavior of the agent component.

[0047] Another embodiment relates to the method of the first aspect, wherein the agent component was trained by the method of the second aspect.

[0048] In one embodiment, the respective action command includes a null-op command that instructs at least one, preferably a subset of, combat assets not to engage any target during a predetermined time interval. This allows the agent to decide to wait with its combat operations, e.g., for better combat geometries and an increased probability of successful combat.

[0049] In one embodiment, the agent component comprises a machine learning algorithm, preferably a neural network, more preferably a neural network comprising the following:
• a backbone layer comprising an input of the neural network configured to receive the input data and / or the observation, wherein the backbone layer is configured to generate a representation of the observation; and
• a recurrent neural network configured to generate an action based on the representation and to manage a memory of past observations in order to act optimally; and
• an output layer configured to output the action command.

[0050] Because a backbone layer receives input from the observation of multiple target objects, the neural network processes this information, including all its parts. It therefore takes full situational awareness into account and thus makes central control efficient.

[0051] In one embodiment, the backbone layer comprises a transformer-like neural network.

[0052] In one embodiment, the output layer comprises a plurality of output heads, wherein at least one output head from the plurality of output heads is configured to generate an action command that can be operated to control a combat asset of the combat assets.

[0053] The output of the output head can be viewed as a vector in an action space. Because each individual output head is separate from the other output heads, the action spaces are distinct. The output, i.e., the action command, can assign the combat asset to a target object by specifying that the target object is to be engaged. However, the action command can also cause the combat asset to remain inactive for a time step (null operation).

[0054] In one embodiment, the output heads include a delay output head configured to generate a delay command as its output.

[0055] This makes training efficient. In an illustrative example, the delay output head can be implemented as multiple MLP layers, stacked one after the other, which take the embedding produced by the recurrent neural network as input and output logits that parameterize a categorical distribution with 20 options. Each option corresponds to a different number of consecutive null operations to be applied in the environment, during which the combat assets initiate no engagement.
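The step from logits to a sampled option can be illustrated with a minimal sketch. It assumes the logits are already available as a plain list (the MLP stack itself is not modeled), and the names `softmax`, `sample_delay_option`, and `example_logits` are hypothetical. How an option index maps to a concrete number of null operations is not fixed here, because the description leaves that mapping open.

```python
import math
import random

NUM_DELAY_OPTIONS = 20  # from the text: categorical distribution with 20 options


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_delay_option(logits, rng):
    """Sample one option index from the categorical distribution."""
    if len(logits) != NUM_DELAY_OPTIONS:
        raise ValueError("expected one logit per delay option")
    probs = softmax(logits)
    return rng.choices(range(NUM_DELAY_OPTIONS), weights=probs, k=1)[0]


# Hypothetical logits standing in for the output of the MLP stack.
example_logits = [0.0] * NUM_DELAY_OPTIONS
example_logits[3] = 5.0  # the head strongly prefers option 3
option = sample_delay_option(example_logits, random.Random(0))
```

During training, sampling from the distribution provides exploration; at inference time one could instead pick the most probable option, which is a design choice the description does not prescribe.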

[0056] In one embodiment, the recurrent neural network includes a long short-term memory (LSTM). The advantage of the LSTM is that it allows training over long sequences of experience and efficient recall of past observations for optimal action, while mitigating the vanishing gradient problem that commonly occurs in other types of recurrent neural networks.

[0057] In one embodiment, the output head of a weapon can include a null-op command that instructs the weapon not to engage any of the targets for a step, even if the delay head has not instructed the weapon to remain inactive. This has the advantage of allowing the agent component to further select a subset of weapons to engage one or more of the targets in the current step of the episode, while simultaneously keeping other weapons inactive and conserving their ammunition.

[0058] A third aspect of the present disclosure relates to a system comprising one or more processors and one or more storage devices, wherein the system is configured to carry out the computer-implemented method according to one of the preceding aspects.

[0059] In one embodiment, the system further comprises one or more of the combat assets and preferably one or more sensors that can be operated to generate the input data.

[0060] A fourth aspect of the present disclosure relates to a computer program product that is to be loaded into the working memory of a computer. The computer program product comprises instructions which, when executed by a processor of the computer, cause the computer to execute a computer-implemented method according to the first and / or second aspect. In one embodiment, a non-transitory, computer-readable storage medium is provided that stores instructions executable by one or more processors. The instructions comprise any of the steps of a method according to the first and / or second aspect of the present disclosure.

[0061] A fifth aspect of the present disclosure relates to a computer program product comprising a trained machine learning module that can be obtained by the computer-implemented method according to the first and / or second aspect. In one embodiment, a non-transitory, computer-readable storage medium is provided that stores instructions executable by one or more processors. The instructions comprise any of the steps of a method according to the first and / or second aspect of the present disclosure.

Brief description of the drawings

[0062] The features, functions and advantages of the present disclosure become more apparent from the detailed description given below when considered in conjunction with the drawings, in which the same reference numbers refer to similar elements.
• Fig. 1 is a flowchart of a procedure for controlling a combat system;
• Fig. 2 is a flowchart of a procedure for deploying an agent component of a combat system;
• Fig. 3 is a block diagram showing the structure of an agent component and corresponding data types; and
• Fig. 4 is a schematic drawing of a scenario.

Detailed description of preferred embodiments

[0063] Fig. 1 is a flowchart of a procedure 100 for controlling a combat system, such as the combat system 410 shown in Fig. 4. The steps of procedure 100 can be carried out by an agent component, in particular an agent component obtainable by the method 200 shown in Fig. 2, an agent component comprising the structures shown in Fig. 3, and / or the agent component 422 of Fig. 4. Procedure 100 enables the operation of the combat system for engaging one or more target objects, preferably for destroying them.

[0064] The process is preferably performed in a loop, so that the steps are repeated. The steps can be repeated at a predetermined frequency, e.g., once every 500 milliseconds. Alternatively, the steps can be repeated at varying time intervals determined by the processing time. Fast processing enables a rapid response to any change observed by the sensors.

[0065] Method 100 begins by determining 102 one or more combat assets. The combat assets are part of the combat system. Each of the combat assets is configurable for engaging the target objects. In embodiments, each of the combat assets can be configured to engage one target object at a time. Step 102 preferably includes determining, for each of the available combat assets, an active state indicating whether the combat asset is operational and an ammunition state indicating the remaining quantity of ammunition.

[0066] In step 104, input data is received. This input data can include sensor data. Sensor data can be received from any sensor capable of observing a target object. The sensor data indicates the position and preferably also the movement of the target objects relative to the combat assets.

[0067] Once steps 102 and 104 are complete, the agent component is provided with an observation of the environment upon which it is to act. The action is then generated by steps 106-120.

[0068] In step 106, the agent component simultaneously determines an action command for each of the one or more combat assets. This is based on the received sensor data. This preferably includes processing the sensor data by a machine learning algorithm, and more preferably, processing the sensor data by an artificial neural network. More preferably still, the artificial neural network implements a reinforcement learning algorithm. According to the reinforcement learning paradigm, the agent component can process both the input data received in step 104 and information stored in the agent component based on past input data.

[0069] The agent component is configured to issue a delay command for all combat assets. This command instructs all combat assets to refrain from engaging any target during a specified time interval. Preferably, the combat assets are configured to ignore any other command received during that interval.

[0070] It is preferred if the agent component is configured to determine a null-op command for a combat asset that can be applied even if no delay command has been predicted for all combat assets.

[0071] In step 118, the action commands are transmitted to the one or more combat assets. Transmitting the action commands may include an evaluation step, in which it is first determined whether the action commands include a delay command for all combat assets. If so, no action commands are transmitted. It is further preferred to receive a confirming user input in step 120. This can be implemented as a human-in-the-loop solution, in which the action commands are output to a human interface device to request confirmation from the user that the action commands can be transmitted, and in which the action commands are transmitted only if a confirming user input is received. However, this can also be implemented as a human-on-the-loop solution, in which action commands are transmitted to the combat system if a user input activating the transmission has been received within a predetermined time interval, which is preferably much longer than the execution time of steps 102 to 118. This is preferable because the user input can be received asynchronously, meaning the user can authorize autonomous operation of the combat system in advance for an interval of, for example, 15 minutes.
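The two checks that precede transmission in step 118 (the delay evaluation and the asynchronous user authorization) can be sketched as a single gating function. This is only an illustration under stated assumptions: `ActionCommand`, `commands_to_transmit`, and the representation of time as seconds are hypothetical, and the 15-minute window is the example value from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

AUTHORIZATION_WINDOW_S = 15 * 60  # example interval from the text: 15 minutes


@dataclass
class ActionCommand:
    asset_id: int
    target_id: Optional[int]  # None encodes a null-op for this asset (assumption)


def commands_to_transmit(
    commands: List[ActionCommand],
    delay_all: bool,
    now_s: float,
    last_authorization_s: Optional[float],
    window_s: float = AUTHORIZATION_WINDOW_S,
) -> List[ActionCommand]:
    """Return the commands that may be sent in step 118 (possibly none)."""
    # Evaluation step: a delay command for all assets suppresses transmission.
    if delay_all:
        return []
    # Without any user authorization, nothing is transmitted.
    if last_authorization_s is None:
        return []
    # The authorization must still be inside its validity window.
    if not (0.0 <= now_s - last_authorization_s <= window_s):
        return []
    return list(commands)
```

Keeping the gate outside the agent component means that an expired authorization blocks transmission regardless of what the learned policy outputs.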

[0072] Procedure 100 assesses and processes the threat posed by each individual target and controls the available combat resources. This includes increasing kill probabilities, reducing the risk of loss of combat resources, sensors, and / or high-value assets, and reducing ammunition requirements.

[0073] Method 100 is preferably applicable for defense against a saturation attack, where a large number of target objects appear at approximately the same time.

[0074] Fig. 2 is a flowchart of a procedure 200 for providing an agent component of a combat system.

[0075] Method 200 can be considered a reinforcement learning method. According to the concept of reinforcement learning and within the specific circumstances explained in this description, an agent component is generally configured to observe a real or simulated environment and to act within that environment by determining actions. Specifically, observations, that is, data determined based on the environment and / or providing information about the environment, can be supplied to an input of the agent component. It is preferable to process observations through preprocessing or other machine learning algorithms, such as feedforward neural networks, to generate a representation that is then fed into the agent component. The agent component is configured to process the observations and / or representations and to determine an action as an output. The action can influence the environment, for example, by being fed into a controlled device. Method 200 is a method for training an agent component. The training may require large amounts of training data to generate a robust model that can be used for inference on test data. During the training, intermediate and / or final rewards can be applied, and the agent component can be trained to increase, preferably maximize, the rewards, as explained below.

[0076] The procedure 200 begins with the initialization 202 of the machine learning algorithm. Preferably, this includes configuring the agent component to generate a random action in response to an observation. Subsequent training steps then determine the learned behavior of the agent component.

[0077] In this example, the agent component is trained by interacting with a simulated environment. Preferably, the simulation includes one or more target objects, such as attacking UAVs, and the combat system's assets and sensors. The simulation can include a large number of target objects arriving simultaneously, the destruction of sensors and / or assets due to combat conducted by the target objects, and other changes determined by the simulation settings. The simulation also includes the actions taken by the agent component, particularly in response to the outputs generated by the agent component, i.e., decisions made. The simulation preferably runs in a headless mode during training, with no graphics being generated except those necessary for generating observational data for input into the agent component. The interaction between the simulation and the processing of observations by the agent component is carried out as follows: Steps 204-210 are repeated for each time step, which preferably has a predetermined duration. During each time step, the simulation environment processes the simulation to reflect both the changes specified by the simulation settings, such as the appearance of new target objects, and the changes caused by the actions determined by the agent component, such as successful engagements against threats. Furthermore, during each time step, observations that can be captured by the agent component are determined and processed to generate an action in response to the ongoing simulated situation, which is then incorporated into the simulation in the next time step.

[0078] In step 204, the observations are delivered to the agent component. The observations can be determined by a simulation. It is preferred to use a simulation environment that captures a scenario on which the agent component is to be trained. To enable the agent component to generalize, it is further preferred to perform a plurality of simulations and deliver the observations generated by the simulations as a batch of inputs to the agent component. The simulations evolve over time to determine a new simulation state, and observations are determined from this state by extracting, from the simulation, the threat states of target objects observable using the simulation's sensors, and further by extracting the corresponding combat states, effector states, and sensor activity states. This preferably includes updating the active state of sensors and combat equipment; that is, the active state is set to "False" when the simulation has determined the corresponding sensor or combat equipment to be destroyed during the current time step. This preferably includes updating the ammunition status of the combat equipment to reflect ammunition consumption during the last combat step.

[0079] In step 206, the agent component determines an action to be performed in response to the observation. This preferably includes processing the observation by a neural network as shown in Fig. 3, in which preferably a backbone layer generates a representation of the observation, and a recurrent neural network then produces outputs that include an action command for each combat asset. Each such action command can include a null-op command, which instructs the corresponding combat asset not to engage any target in the current time step. The action command can also include an engagement command, which specifies which target is to be engaged. Because the same agent component determines the action commands for all combat assets, the behavior of the combat assets is inherently coordinated. This has the advantage that the agent component can be trained to determine an action for the entire combat system, which can include attacking a high-value target with multiple combat resources at the same time and attacking a low-value target with only one or no combat resources at all.

[0080] In step 207, the action is applied to the environment. This preferably includes issuing action commands and / or null-op commands to simulated combat assets via the agent component, so that the agent component acts on the environment in this step.

[0081] In step 208, an intermediate reward is determined, which depends on the outcome of the action command generated during the last time step and / or preceding steps. More specifically, the intermediate reward can depend on the scenario's development history. If the simulation determines that an engagement against a target resulted in its destruction, a positive reward is assigned. Determining rewards in this way allows the agent component to be trained to increase, or preferably maximize, the kill probabilities achieved through the combat system. The size of the reward preferably depends on the importance of the target. In particular, the simulation settings can include a user-defined value for each target. This value is then used to determine the reward during the time step. Furthermore, it is preferred to apply a negative reward for any use of ammunition. The size of the negative reward depends on the type and scarcity of the ammunition. It is preferred to set predetermined parameters for determining the ammunition type as part of the simulation settings. The scarcity of ammunition can be determined as a function of the remaining quantity.
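The bookkeeping behind such an intermediate reward can be sketched as follows. The description only states that the positive part depends on a user-defined target value and that the negative part depends on the type and scarcity of the ammunition; the linear scarcity weight, the function names, and all numeric values below are assumptions chosen purely for illustration.

```python
def scarcity_factor(remaining: int, capacity: int) -> float:
    """Hypothetical scarcity weight: 1.0 when full, rising to 2.0 when empty."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 2.0 - remaining / capacity


def intermediate_reward(destroyed_values, shots, base_cost, remaining, capacity):
    """Reward for one time step.

    destroyed_values: user-defined values of targets destroyed in this step.
    shots: rounds of one ammunition type used in this step.
    base_cost: per-round penalty for that ammunition type (simulation setting).
    remaining, capacity: remaining and initial quantity, used for scarcity.
    """
    gain = sum(destroyed_values)
    penalty = shots * base_cost * scarcity_factor(remaining, capacity)
    return gain - penalty


# One target of value 10 destroyed with 2 rounds while the magazine is half full.
example = intermediate_reward([10.0], shots=2, base_cost=1.0, remaining=50, capacity=100)
```

With these illustrative numbers, the same two rounds cost more as the magazine empties, which is one way to express that scarce ammunition should be spent more reluctantly.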

[0082] Step 210 determines whether a termination condition has been met. A termination condition can include the determination that no target objects remain (successful end result) or that all combat equipment and / or all sensors and / or all protected assets have been destroyed. If no termination condition is met, the simulation time is advanced by one time step in step 212. This preferably involves increasing the current time value by a predetermined value, preferably around 500 milliseconds. This value allows the agent component to learn a coordination strategy with fine temporal resolution that includes rapid responses to any changes in the situation. After the time step, the next iteration of steps 204-210 is performed.

[0083] When a termination condition is met, a final reward is applied. This includes a large positive reward if no target objects remain and a large negative reward if a high-value asset has been destroyed. The final reward is otherwise zero.
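The three cases of the final reward can be written as a small function. The magnitudes `win` and `loss` are hypothetical placeholders for "large positive" and "large negative", and the precedence applied when both conditions hold at once (the loss case is checked first) is an assumption, since the description does not address that case.

```python
def final_reward(targets_remaining: int,
                 high_value_asset_destroyed: bool,
                 win: float = 100.0,
                 loss: float = -100.0) -> float:
    """Final reward applied once a termination condition is met."""
    # Assumption: losing a high-value asset dominates any other outcome.
    if high_value_asset_destroyed:
        return loss
    # All target objects neutralized: successful end result.
    if targets_remaining == 0:
        return win
    # Any other termination (e.g. all combat equipment lost): zero.
    return 0.0
```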

[0084] Through procedure 200, the agent component has learned a policy that maps observations to actions. In other words, setting the weights in the agent component enables the trained agent component to determine an action for any test observations that could be delivered to an input of the agent component. The action determination is based on the training process and generally causes the agent component to generate actions that lead to a high reward. It should be noted that the agent component itself can be considered a model-free agent because the model of the situation is contained within the simulation, which can be a separate algorithm that communicates with the agent component only by determining the states and observations and by determining the reward. The training of the agent component based on intermediate rewards and the final reward can be accomplished using Proximal Policy Optimization (PPO), as explained, for example, in the following reference: Schulman et al.: Proximal Policy Optimization Algorithms. arXiv:1707.06347v2 [cs.LG] (2017).
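For reference, the clipped surrogate objective maximized in the cited PPO paper can be written as shown here, where $r_t(\theta)$ is the probability ratio between the updated and the previous policy, $\hat{A}_t$ is an advantage estimate at time step $t$, and $\epsilon$ is the clipping hyperparameter. How the intermediate and final rewards enter $\hat{A}_t$ in the present training process (for example, which discounting or advantage estimator is used) is not specified in this description.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```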

[0085] In an exemplary training process, the agent component was trained using 80 instances of a simulation environment in headless mode. For each training iteration, 80 different runs of the scenario were used as a batch. Each episode was up to three minutes long, and the agent component could act in discrete steps for up to 300 steps.

[0086] Fig. 3 is a block diagram showing the structure of an agent component and its corresponding data types. Data types are represented as rectangles with rounded corners, and components of the agent component, i.e., the backbone layer (326), the recurrent neural network (330), and the heads (332), are represented as normal rectangles.

[0087] Observation 300 is a data type that encompasses the situational awareness of the agent component. It can be determined by identifying a state of the simulation, for example, by advancing the simulation by one or more time steps, and by extracting the data from the simulation that is observable by the sensors that are part of the simulation. Observation 300 can thus be determined for training purposes and thereby represent the training data on which the agent component is trained through reinforcement learning. Observation 300 can also be determined in this way for testing purposes, to see how a trained agent component reacts to a given simulation. However, Observation 300 can also be determined from real-world data. For example, target objects can be detected and areas (preferably positional data) can be determined by a radar device. The effector states 302 can be determined from state updates of the devices. This enables the agent component to act on observations in a real-world situation.

[0088] The effector states 302 encompass any state of the combat equipment. In particular, for each piece of combat equipment, an active state 304 indicates whether or not the combat equipment can be used to engage a target. In a real-world application, the combat equipment can send continuous state updates to the agent component, and the active state 304 can be derived from these state updates. The active state 304 is preferably encoded as a Boolean value, set to "True" when the combat equipment is operational and to "False" otherwise. The ammunition value 306 includes information about the remaining quantity of ammunition for the combat equipment. The ammunition value 306 can be an integer count of the available ammunition (e.g., the number of artillery shells or guided missiles) or a floating-point value indicating a charge of an energy storage device, e.g., in the event that the combat equipment includes a radiation weapon. The ammunition value 306 preferably includes a vector that comprises one dimension for each ammunition type. The sensor activity state 308 indicates for each sensor whether the sensor is functioning or not. It can include a Boolean value for each sensor.

[0089] The combat states 310 provide information about ongoing engagements. Here, each combat state denotes an action in which one of the combat assets has initiated an engagement against a specific target. Each combat state 310 can include an identification 312 of the superior combat asset that initiated the engagement against the target, a start time indication 314, which specifies when the action command for the engagement was generated by the agent component, and a launch time indication 316, which specifies the time at which a projectile (e.g., shell, guided missile, or energy pulse) was released from the combat asset.

[0090] Each of the target states 318 relates to one of the target objects and includes a position 320 and a velocity 322 of the target object and an indication of engagements 324 that the combat system has conducted against the target object. The target state may also include further information about the target object, such as a target object type (not shown). However, it may be impossible to determine the target object type from the input data, particularly if the input data is generated by a sensor that is not sensitive to the target object type. For example, if the sensor is a radar device, only position and velocity may be obtainable. This does not preclude the application of the present disclosure because the agent component does not necessarily process a target object type.
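The data types that make up observation 300 can be summarized as plain records. The following sketch mirrors the reference signs 302 to 324; the field names, the use of three-component tuples for position and velocity, the encoding of the engagement indication 324 as a simple count, and the `None` value for a launch that has not yet happened are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class EffectorState:                 # 302
    active: bool                     # 304: True while the asset is operational
    ammunition: List[float]          # 306: one entry per ammunition type


@dataclass
class CombatState:                   # 310
    asset_id: int                    # 312: asset that initiated the engagement
    start_time_s: float              # 314: when the action command was generated
    launch_time_s: Optional[float]   # 316: None until a projectile is released


@dataclass
class TargetState:                   # 318
    position: Vec3                   # 320
    velocity: Vec3                   # 322
    engagement_count: int = 0        # 324: assumed encoding of the indication


@dataclass
class Observation:                   # 300, valid for one time step
    effectors: List[EffectorState]
    sensors_active: List[bool]       # 308: one Boolean per sensor
    engagements: List[CombatState]
    targets: List[TargetState]


obs = Observation(
    effectors=[EffectorState(active=True, ammunition=[12.0, 4.0])],
    sensors_active=[True],
    engagements=[CombatState(asset_id=0, start_time_s=1.5, launch_time_s=None)],
    targets=[TargetState(position=(0.0, 1.0, 2.0), velocity=(3.0, 0.0, 0.0), engagement_count=1)],
)
```

Note that no target type field is included, which matches the remark that the agent component does not necessarily process a target object type.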

[0091] Observation 300 relates to a given time step. Observation 300 can be supplied to a backbone layer 326 of the agent component. The backbone layer 326 is configured to receive observation 300 at an input and to generate a representation 328 of observation 300. The backbone layer 326 preferably comprises a transformer-like neural network. Using a transformer-like neural network has the advantage that it can process its input data in parallel. Furthermore, the representations generated thanks to the attention mechanism of a transformer are better suited for the present application. Representation 328 can then be supplied to an input of the recurrent neural network 330, which may include a long short-term memory (LSTM). The recurrent neural network 330 preferably stores information about past states, in particular data based on input data received in previous time steps. The recurrent neural network 330 comprises a plurality of output heads 332. The output heads 332 include effector heads 336, 338, 340, 342, 344, and 346, each of which is configured to generate one of the action commands 352, 354, 356, 358, 360, and 362. Each of the action commands can be sent to a combat asset. Because the corresponding action commands are generated by separate effector heads, the action is factored into a separate action space for each combat asset. This means that the assignment of a target object to a combat asset remains tractable, while the central control, i.e., the generation of a coordinated strategy, is done by the agent component. In particular, the agent component can be trained to develop a coordinated policy that applies to all combat resources.

[0092] The action command can include an instruction to engage a target object by means of a weapon that is associated with the effector head, in particular communicatively coupled to it. The action command can also include a null-op command instructing the corresponding weapon not to engage any target object during a specified delay or for the duration of a time step. In the present embodiment, each effector head is configured to issue an action command instructing the corresponding weapon to engage a target object or to engage no target object at all. The agent component can generate a time interval in which one or more, or a subset, of the weapons do not engage any target object by determining single or repeated local null-op commands from the corresponding effector heads. This allows for the development of appropriate strategies, including longer periods in which no combat is prescribed.
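One possible encoding of a single effector head's discrete action space is sketched below. The convention that index 0 means null-op and that index k refers to the k-th currently observed target is an assumption for illustration, as are the names `NULL_OP` and `decode_head_output`; the description does not fix a particular encoding.

```python
from typing import List, Optional

NULL_OP = 0  # assumed index reserved for "engage no target object"


def decode_head_output(action_index: int,
                       observed_target_ids: List[int]) -> Optional[int]:
    """Map one effector head's discrete output to a target id, or None for null-op."""
    if action_index == NULL_OP:
        return None
    slot = action_index - 1
    # Assumption: an index that points past the observed targets is a null-op.
    if not 0 <= slot < len(observed_target_ids):
        return None
    return observed_target_ids[slot]
```

Because each head has its own action space of this form, the size of the joint decision grows with the number of heads rather than with the number of possible asset-target combinations.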

[0093] The recurrent neural network 330 further includes a null-op head 334 configured to issue a general null-op command 350, which causes all combat resources to ignore the corresponding action command and refrain from any operation. The general null-op command 350 can include a delay command; that is, the null-op command can include a numerical value specifying a time during which no operations should be performed. This has the advantage that the agent component can be efficiently trained at very small time steps, e.g., 500 milliseconds, while deciding to refrain from acting for longer periods if no targets are in sight, no target poses an immediate risk, or no target is optimally positioned to maximize the kill probability. In realistic scenarios, action commands that initiate an engagement against a target occur much less frequently than once per time step. This means that the actions the agent component must generate to maximize reward are, in most cases, null-op commands. This leads to a class imbalance that makes training the machine learning algorithm inefficient or even impossible. However, if the machine learning algorithm is configured to output a time interval during which no action command should initiate an engagement, the class imbalance is much more limited.

[0094] In other words, the delay command 350 reduces, by design, the weight of cases in which no engagement should be initiated. Therefore, efficient training is possible despite the class imbalance. More precisely, the null-op command can be post-processed by a post-processing algorithm (not shown) that sets all action commands C1-C6 to null-op commands, such that during the specified delay, the action output from the agent component consists of null-op commands. In other words, the agent component can be configured to issue and transmit null-op commands at each time step. Alternatively, the null-op command can cause the agent component to refrain from sending any action commands during the specified time interval.
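The post-processing step described above is not shown in the figures; a minimal sketch of one way it could work is given here. The function name `apply_general_null_op`, the use of `None` to represent a null-op, and the explicit countdown of remaining delay steps are assumptions.

```python
from typing import List, Optional, Tuple


def apply_general_null_op(
    commands: List[Optional[int]],
    delay_steps_remaining: int,
) -> Tuple[List[Optional[int]], int]:
    """Override per-asset commands while a general delay is active.

    commands: one entry per effector head (target id, or None for a null-op).
    delay_steps_remaining: time steps for which the general null-op still applies.
    Returns the commands to output in this step and the updated counter.
    """
    if delay_steps_remaining > 0:
        # During the delay, every command is replaced by a null-op.
        return [None] * len(commands), delay_steps_remaining - 1
    # No delay active: the per-head commands pass through unchanged.
    return list(commands), 0
```

For example, with two delay steps remaining, six per-head commands are all replaced by null-ops and the counter drops to one.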

[0095] The numerical value specified in the null-op command 350 can be specified in time units such as milliseconds or seconds. However, it is preferred to specify the null-op command as an integer number of time steps during which no combat is to be undertaken. The time steps generally define the rate at which the system can generate any output commands. In some embodiments, the time step is 500 milliseconds, and the null-op head can be configured to output integer values between 0 and 20. This allows for defining intervals up to 10 seconds into the future during which no combat is to be undertaken. However, higher time step values can also be chosen.
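The arithmetic behind the 10-second figure is simply the number of time steps multiplied by the step duration. A short sketch, using the example values from this paragraph and a hypothetical helper name:

```python
TIME_STEP_MS = 500     # example step duration from the text
MAX_DELAY_STEPS = 20   # example upper bound of the null-op head output


def delay_steps_to_seconds(steps: int, time_step_ms: int = TIME_STEP_MS) -> float:
    """Convert a delay given in time steps into seconds."""
    if not 0 <= steps <= MAX_DELAY_STEPS:
        raise ValueError("delay must be between 0 and MAX_DELAY_STEPS")
    return steps * time_step_ms / 1000.0
```

With these values, `delay_steps_to_seconds(20)` gives 10.0, the longest interval mentioned above.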

[0096] Fig. 4 is a schematic diagram of a scenario. The scenario includes multiple target objects 402. In the context of air defense, the target objects can include aircraft, such as unmanned aerial vehicles (UAVs), manned aircraft, or guided missiles. The method is particularly advantageous when the target objects are fast-moving, such as UAVs, and when there is a large number of target objects. The target objects 402 in this example include low-value target objects 404 and one high-value target object 406. The combat system 410 in this example is an air defense system. The combat system 410 is generally operable to destroy one or more of the target objects 402 while they are in flight, or otherwise to neutralize or reduce the effectiveness of the attack by the target objects 402.

[0097] The combat system 410 includes a sensor 416, which in this example is a radar device. However, other sensor devices are also possible, including passive sensors such as cameras or other active sensors such as sonar or lidar devices. Sensor 416 is configured to detect at least some of the target objects 402. Although Fig. 4 only shows target objects that have been detected, there may be other undetected target objects (not shown). However, for the purpose of a reinforcement learning problem with partial observation, only the target objects detected by sensor 416 are part of observation 300.

[0098] The combat system 410 further comprises combat assets 412 and 414. In this example, combat assets 412 and 414 comprise anti-aircraft artillery 412 and guided missile launchers 414. In other examples, the combat system 410 may include other combat assets such as radiation weapons or other electronic attack devices. The combat assets 412 and 414 are controlled by the agent component 422 and are operable for engaging target objects 402 and thus for acting in the environment. The combat assets can be land-based, sea-based, air-based, space-based, or underwater-based. The combat assets 412 and 414 are preferably configured to execute any action command issued by the agent component. More precisely, each combat asset is preferably communicatively coupled to an effector head of the agent component and is configured to execute any action commands from the corresponding effector head. Each combat asset is preferably configured not to engage any target unless it has received an action command. This makes it possible to implement a strategy in which no target is engaged during a given time interval, without having to regularly send null-op commands to the combat assets.

[0099] Both the sensor 416 and the combat equipment 412, 414 are part of the environment 400 in the sense that they can be rendered inoperable, i.e., damaged or destroyed by other influences, in particular as a result of being engaged in combat by target objects 402. This is indicated by the active state 304 for the combat equipment and the active state 308 of the sensor.

[0100] Sensor 416 and combat equipment 412 and 414 are also parts of combat system 410, which further includes a command center 418. Command center 418 is the central part of combat system 410 and is configured to control sensor 416 and combat equipment 412 and 414. It includes an agent component 422, which is operable to execute procedure 100 during an operation. It also includes a preprocessor 420, which is configured to preprocess the signals detected by sensor 416, which in this case are radio frequency signals, and to output a sensor activity state, which can be a Boolean value set to "True" if sensor 416 continues to transmit information, and target object states such as those described with reference to Fig. 3. The output is sent to agent component 422, which receives it as part of the observations. Agent component 422 is in bidirectional communication with combat assets 412, 414. Agent component 422 thereby receives information about the status of each of the combat assets 412, 414, in particular whether it is still operational, and about the available quantity of ammunition. Agent component 422 can determine an action by executing procedure 100 and send the action commands to the combat assets 412, 414. Agent component 422 preferably comprises a machine learning algorithm, more preferably the neural network shown in Fig. 3. The environment 400 can further comprise a protected asset 426, which in this embodiment is not part of the combat system 410. However, a successful attack on the protected asset by the target objects can result in a negative final reward from the reward function 424. This allows the agent component to be trained to develop a policy that protects the asset 426, which can be a high-value asset, such as civilian or military infrastructure.

[0101] The agent component 422 can also be trained during a field operation within the combat system 410. For this purpose, a reward function 424 is required, which also accepts sensor data as input, determines after each time step and at the end of the training whether the action was successful, and inputs a corresponding reward into the agent component 422. The reward function can be an algorithm containing general functions for determining the rewards. However, it is preferable to use simulations for training. In this case, the environment 400 is part of a simulation, and the target object states are preferably fed directly into the agent component 422, thus bypassing the detection step by the sensor 416 and the preprocessing. The sensor 416 is nevertheless part of the simulated environment, and loss of the sensor results in a negative reward determined by the reward function 424. After completing training, the agent component 422 can be deployed in the field in a real-world system.

List of reference signs

100 Method for controlling a combat system
102-120 Steps of method 100
200 Method for generating an agent component of a combat system
202-214 Steps of method 200
300 Observations
304 Active state
306 Ammunition
308 Sensor active state
310 Combat conditions
312 Superior combat equipment
314 Start time information
316 Firing time indication
318 Target object states
320 Position
322 Speed
324 Battles
326 Backbone layer
328 Representation
330 Recurrent neural network
332 Output layer
334 Null-op head
336 First effector head
338 Second effector head
340 Third effector head
342 Fourth effector head
344 Fifth effector head
346 Sixth effector head
348 Action
350 Null-op command
352 First action command
354 Second action command
356 Third action command
358 Fourth action command
360 Fifth action command
362 Sixth action command
400 Environment
402 Target objects
404 Low-value target object
406 High-value target object
408 Battles
410 Combat system
412, 414 Combat equipment
416 Sensor
418 Command center
420 Preprocessor
422 Agent component
424 Reward function
426 Protected asset

REFERENCES CITED IN THE DESCRIPTION

[0000] This list of documents cited by the applicant was automatically generated and is included solely for the reader's convenience. The list is not part of the German patent or utility model application. The DPMA accepts no liability for any errors or omissions.

Cited non-patent literature

[0000] Proximal Policy Optimization (PPO) can be performed as explained, for example, in the following reference: Schulman et al.: Proximal Policy Optimization Algorithms. arXiv:1707.06347v2 [cs.LG] (2017) [0084]

Claims

[1] Computer-implemented method (100) for controlling a combat system (410), wherein the method comprises: determining (102) one or more combat means (412, 414) included in the combat system (410), each of the combat means (412, 414) being configurable for combat against one or more target objects (402); receiving (104) input data, in particular sensor data, indicating a position and preferably a movement of one or more target objects (402) relative to the one or more combat means (412, 414); simultaneously determining (106), by an agent component (422) and based on the received input data, an action command for each of the one or more combat means (412, 414) and a null-op command instructing at least one, preferably a subset, of the combat means (412, 414) not to engage any target object during a specified time interval; and transmitting (118) the respective action command to the one or more combat means (412, 414), wherein the agent component (422) preferably comprises a machine learning algorithm, more preferably a machine learning algorithm based on reinforcement learning.
[2] Method according to claim 1, wherein the respective action command to a combat means (412, 414) includes a command to conduct combat by the combat means (412, 414) against one or more of the one or more target objects (402); wherein the method preferably further comprises determining a plurality of respective action commands to conduct combat against a target object among the one or more target objects (402) simultaneously by each of a plurality of combat means (412, 414).
[3] Method according to any of the preceding claims, further comprising determining, by the agent component (422), a further action command that causes the combat means (412, 414) to refrain from conducting combat against any target object among the one or more target objects (402) until a reactivation command is received.
[4] Method according to any of the preceding claims, further comprising: not conducting combat against a target object among the one or more target objects (402) by the combat means (412, 414) during the specified time interval; and/or conducting combat against a target object among the one or more target objects (402) by the combat means (412, 414) after the end of the specified time interval; and/or not conducting combat against a target object among the one or more target objects (402) by the combat means (412, 414) after the end of the specified time interval.
[5] Method according to any of the preceding claims, further comprising: transmitting, by the agent component (422), the null-op command to the combat means (412, 414) with the instruction not to engage any target object; not transmitting any action commands within the specified time interval.
[6] Method according to any of the preceding claims, further comprising continuously transmitting, by the agent component (422), a plurality of identical and/or different action commands.
[7] Method according to any of the preceding claims, wherein the null-op command instructs all combat means (412, 414) not to engage any target object during the specified time interval; and/or wherein the respective action command includes a plurality of null-op commands, each null-op command instructing a respective combat means (412, 414) not to engage any target object during a respective specified time interval.
[8] Method according to any of the preceding claims, wherein the null-op command includes a specification of a delay and/or of the specified time interval.
[9] Method according to any of the preceding claims, wherein the action commands are transmitted only in response to receiving an affirmative user input within a predetermined affirmation time interval prior to determining the action command, and wherein preferably a null-op command is transmitted if no affirmative user input is received during the affirmation time interval.
[10] Computer-implemented method for training an agent component (422), in particular the agent component (422) according to one of the preceding claims, of a combat system (410) comprising one or more combat means (412, 414) that can be configured to conduct combat against one or more target objects (402), wherein the method comprises: determining training data comprising observations (300) of an environment (400) that indicate a position and preferably a movement of one or more target objects (402) relative to one or more combat means (412, 414); determining, by the agent component (422) and on the basis of the training data, an action command for each of the combat means (412, 414), wherein the action command includes an instruction to engage one or more of the target objects (402) or a null-op command not to engage any target object during a specified time interval; determining a reward based on the action; and updating the agent component (422) based on the reward.
[11] Method according to claim 10, wherein the reward comprises one or more of the following rewards: a positive intermediate reward in response to a determination that the predicted action will lead to a successful engagement against a target object, wherein the intermediate reward is preferably based on a predetermined importance rating of the target object; a negative intermediate reward in response to a determination that a combat means (412, 414) has consumed part of a finite resource, in particular ammunition and/or energy; and/or a negative intermediate reward in response to a determination that a predetermined item, in particular a sensor (416) and/or a combat means (412, 414) included in the combat system (410), is no longer operational, in particular an item that has been successfully engaged by one or more of the one or more target objects (402).
[12] Method according to claim 10 or 11, wherein the training comprises applying a final reward, in particular a positive final reward if the engagement was successful against all target objects (402), and/or a negative final reward if a target object has successfully engaged a protected asset (426), in particular any part of the combat system (410).
[13] Method according to any one of claims 10 to 12, further comprising: simulating the environment (400) to determine the training data; iteratively updating the simulated environment (400) by performing a simulated time step based on the action and determining a further action based on the updated environment (400); and preferably ending the iteration in response to a termination condition or after a predetermined time, wherein updating the simulated environment (400) preferably includes determining a counterattack against one or more of the sensors (416) and/or one or more of the combat means (412, 414) by one or more of the target objects (402), and/or wherein the termination condition preferably includes a determination that the engagement of all target objects (402) was successful and/or a determination that all sensors (416) and/or all combat means (412, 414) have suffered a successful counterattack.
[14] Method according to any one of claims 1 to 9, wherein the agent component (422) has been trained by the method according to any one of claims 10 to 13.
[15] Method according to any of the preceding claims, wherein the respective action command includes the or a null-op command which instructs at least one, preferably a subset, of the combat means (412, 414) not to engage any target object during a predetermined time interval.
[16] Method according to any of the preceding claims, wherein the agent component (422) comprises a machine learning algorithm, preferably a neural network, more preferably a neural network comprising: a backbone layer (326), preferably a transformer-type neural network, comprising an input of the neural network configured to receive the input data and/or the observation (300), wherein the backbone layer (326) is configured to generate a representation (328) of the observation (300); and a recurrent neural network (330), in particular a recurrent neural network configured to maintain a memory of one or more past observations (300), especially a long short-term memory; and/or an output layer (332) configured to output the action command.
[17] Method according to claim 16, wherein the output layer (332) comprises a plurality of output heads (334, 336, 338, 340, 342, 344, 346), wherein at least one output head from the plurality of output heads is configured to generate an action command (350, 352, 354, 356, 358, 360, 362) that is operable to control a combat means (412, 414) among the combat means (412, 414), wherein the output heads preferably comprise a null-op output head (334) configured to generate the or a null-op command (350).
[18] System comprising one or more processors and one or more storage devices, wherein the system is configured to carry out the computer-implemented method according to one of claims 1 to 17, and wherein the system preferably further comprises one or more of the combat means (412, 414) and/or one or more sensors (416) that are operable to generate the input data.
[19] Computer program product for loading into the main memory of a computer, comprising: instructions which, when executed by a processor of the computer, cause the computer to execute a computer-implemented method according to any one of claims 1 to 17; and/or a trained machine learning module that can be obtained by the computer-implemented method according to one of claims 10 to 17.