Methods, systems, and computer program products for controlling an engagement system
A reinforcement learning-based engagement system optimizes air defense by centrally managing multiple weapons systems, achieving high success rates against fast, low-observability targets through coordinated action and no-operation commands.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HELSING GMBH
- Filing Date
- 2025-12-04
- Publication Date
- 2026-06-11
AI Technical Summary
Existing air defense systems struggle with fast, low-observability targets such as unmanned aerial vehicles (UAVs) due to slow human decision-making, limiting the effectiveness of weapon deployment despite sufficient resources.
A computer-implemented engagement system uses a reinforcement learning-based machine-learning algorithm to centrally manage multiple engagement means, determining coordinated action commands, including no-operation commands, to optimize target engagement and conserve resources.
The system achieves an 88% mission success rate in simulations, effectively handling saturation attacks by quickly and efficiently coordinating weapon systems to maximize engagement success while conserving ammunition and minimizing asset loss.
Figure EP2025085518_11062026_PF_FP_ABST
Description
Methods, Systems, and Computer Program Products for Controlling an Engagement System
Technical Field
[0001] The present disclosure relates to systems, methods, and computer program products for controlling an engagement system. The disclosure is applicable in the field of threat evaluation and weapons assignment, in particular for applications in aerospace defence, especially air defence.
Background
[0002] A frequent problem in air defence is the arrival of a large number of fast, small targets with low observability. For example, an attack with a large number of unmanned aerial vehicles (UAVs) requires fast decision making.
[0003] Known air defence systems are manually operated. Human threat assessment and decision-making can be so slow that only a limited number of targets can be destroyed despite a sufficient number of weapons being available. This limits the ability to nullify or reduce the effectiveness of the attack.
[0004] There is a need for systems and methods that overcome these shortcomings.
Summary
[0005] Disclosed and claimed herein are systems, methods, and devices for controlling an engagement system.
[0006] A first aspect of the present disclosure relates to a computer-implemented method of controlling an engagement system. The method comprises the following steps:
• determining one or more engagement means comprised in the engagement system, wherein each of the engagement means is configurable to engage one or more targets;
• receiving input data, in particular sensor data, indicative of a position, and preferably movement, of the one or more targets relative to the one or more engagement means;
• simultaneously determining, by an agent component and based on the received input data, an action command for each of the one or more engagement means and a no-operation command instructing at least one, preferably a subset of, the engagement means to not engage any target during a specified time interval; and
• transmitting the respective action command to the one or more engagement means.
[0007] The agent component preferably comprises a machine-learning algorithm, more preferably a reinforcement learning based machine learning algorithm.
[0008] The engagement means may thus be centrally managed by one entity. That is, one agent component can control the entire decision-making instead of a plurality of agent components. This allows determining action commands that contain a coordinated response to the targets. The response has been found to allow quickly and effectively reacting to a large number of targets (e.g. when the targets are performing a saturation attack). In simulation-based tests, the method has shown an 88 % mission success rate, compared to a 7 % success rate for a conventional rule-based method and a 0 % success rate for a human operator in the same simulation scenario. The advantage of the approach is that the specific construction of the action space (i.e. no-operation commands and selection of targets) allows for scalable and efficient training, and an effective way of dealing with saturation attacks and coordination of engagement means.
[0009] The engagement system may comprise a ground-based, naval-based, or air-based air defence system. The action commands may direct the engagement means to destroy the targets. An engagement may preferably include firing by an engagement means, which may include any kind of weapon system, onto a target. For example, an engagement may comprise launching a missile and the subsequent travel of the missile to the target. A successful engagement may include destroying, disrupting, and / or neutralizing the target. Targets may include aircraft, missiles, UAVs, or any other object, in particular moving object.
[0010] The engagement system may thus comprise a system operable to control a plurality of engagement means that act collectively. Each engagement means is preferably operable to engage up to one target at the same time. Each engagement means may comprise one or more weapons systems and / or vehicles or aircraft, such as unmanned vehicles, in particular unmanned aerial vehicles (UAVs). The engagement system may comprise an aerospace defence system. The engagement system may comprise any system operable to fulfil a task of Threat Evaluation and Weapon Assignment (TEWA). The engagement means may comprise any means operable to achieve a lethal or non-lethal effect on the target. The effect may include degrading, neutralizing, or destroying one or more of the targets. The engagement means preferably comprises a weapon, such as a gun, e.g. anti-aircraft artillery, or a missile or rocket launcher. However, an engagement means may also comprise any other kind of effector, such as any electronic attack device.
[0011] The input data may be received from a sensor comprised in or communicatively connected to the engagement system.
[0012] The agent component preferably comprises a machine-learning algorithm, more preferably a reinforcement learning based machine-learning algorithm.
[0013] A machine learning algorithm is well suited to control the engagement system. This is because a machine learning algorithm can be trained to compute the action commands based on an observation comprising information on the targets and the engagement means in a near-optimal way given enough data. The machine learning algorithm thereby provides a mapping from the observations to the actions. A reinforcement learning algorithm is particularly suitable since it allows training based on data generated through experience generated via interactions with an environment. It improves, by nature, its own weights long-term over the complete duration of the training. It may be trained by reinforcement learning, wherein the agent component acts upon observations and history of observations from an environment comprising targets and engagement systems, and acts to direct the engagement systems to engage the targets.
[0014] The agent component preferably comprises an output layer comprising one or more heads. Each head is configured to determine and / or output an action and / or no-operation command. In a preferred embodiment, the agent component comprises a plurality of output heads including at least one engagement means output head configured to determine an action command instructing an action by an engagement means and / or a local no-operation command instructing the engagement means not to engage any target. The engagement means output head may be communicatively directly coupled to an engagement means to transmit the command to the engagement means. However, it is preferred to post-process the output of each engagement head by a post-processing algorithm.
[0015] The agent component preferably comprises a further output head, which may be referred to as a delay head, configured to issue a global no-operation command instructing all engagement means not to engage any target. The post-processing algorithm may amend any output of any of the engagement means output heads to a no-operation command in this case. This has the advantage that a strategy comprising not engaging any target by any engagement means does not necessarily comprise local no-operation commands for each engagement means. Thereby, the agent component can reason about the optimal moment to engage a plurality of incoming targets at a lower temporal frequency than that at which it acts, and make better long-term decisions. Moreover, any class imbalance caused by a large number of no-operation steps is limited to the commands determined by the delay head.
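The interplay between the delay head and the engagement means output heads may be illustrated by a minimal sketch. The names `NOOP` and `post_process`, and the encoding of commands as integer target indices, are assumptions made for illustration only and are not taken from the disclosure.

```python
from typing import List

NOOP = -1  # hypothetical encoding of a no-operation command


def post_process(head_outputs: List[int], global_noop: bool) -> List[int]:
    """Return one command per engagement means.

    head_outputs[i] is the target index proposed by output head i, or NOOP.
    If the delay head issued a global no-operation command, every proposal
    is replaced by NOOP; otherwise the proposals pass through unchanged.
    """
    if global_noop:
        return [NOOP] * len(head_outputs)
    return list(head_outputs)
```

In this sketch the individual heads never need to output `NOOP` themselves for the whole system to stay idle, which mirrors the statement that the class imbalance is limited to the delay head.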
[0016] The no-operation command may generally instruct one or more of the engagement means not to engage any target. The no-operation command may be part of the action command or a separate instruction. Put differently, the action effected by the engagement means may comprise engaging one or more targets or not engaging any target. The no-operation command may be transmitted to the engagement means or cause the agent component not to transmit any commands during the specified time interval. In the latter case, the engagement means is preferably configured to not engage, i.e. refrain from engaging, any target unless receiving an instruction to engage any target. The no-operation command may comprise an indication of the time interval, e.g. a number of seconds or a number of predetermined time units, or it may not contain any indication of the time interval. In the latter case, the engagement means is preferably instructed to not engage any target during a predetermined time step.
[0017] The no-operation command has the advantage that the agent component is not only configured to select a target for each engagement means, but can also select not to engage any target. Therefore, the agent component can develop a strategy in which the engagement system temporarily refrains from engaging an identified approaching target until it is so close that the probability of a successful engagement is substantially increased. As far as the agent component comprises a reinforcement learning based machine learning algorithm, the probability of a successful engagement need not be specified explicitly. It may be increased, preferably maximized, by the training algorithm.
[0018] The no-operation command may be determined as a series of no-operation commands. In other words, determination of a time interval during which no engagement is effected may cause the agent component to transmit at every time step a no-operation command to the respective engagement means. The no-operation command may comprise refraining from transmitting any command.
[0019] In an embodiment, the respective action command to an engagement means includes a command to engage, by the engagement means, one or more of the one or more targets. In this embodiment, the method preferably further comprises determining a plurality of respective action commands to engage a target of the one or more targets by each of a plurality of the engagement means simultaneously.
[0020] This means that the agent component may control the plurality of engagement means to deal with a plurality of targets. An advantage is that of increased flexibility: Targets can be engaged according to their relevance, with one or more engagement means according to their availability and suitability to the target. If multiple engagement means engage the same target simultaneously, it is preferred that all of the multiple engagement means receive an action command independently.
[0021] In another embodiment, the method further comprises determining, by the agent component, a further action command to cause the engagement means to refrain from engaging any target until receipt of a reactivation command. Thereby, the agent component can further delay the engagement by the engagement means.
[0022] In a further embodiment, the method further comprises
• not engaging, by the engagement means, the target during the specified time interval; and / or
• engaging, by the engagement means, a target after the end of the specified time interval; and / or
• not engaging, by the engagement means, a target after the end of the specified time interval.
[0023] If the engagement means is not engaging the target, in particular during the specified time interval, the engagement means can apply a strategy determined by the agent component in which the target is only engaged at a time when an action command by the agent component instructs the engagement means to engage the target. After the end of the specified time interval, the engagement means may engage the target, in particular according to an instruction by the agent component. In embodiments, the engagement means may not engage the target after the end of the specified time interval, e.g. in response to a second no-operation command. In examples, the first no-operation command may specify a time interval during which none of a plurality of engagement means may engage the target, and a second no-operation command may specify a time interval during which only one of the engagement means may not engage a target.
[0024] In yet a further embodiment, the method may further comprise transmitting, by the agent component, the no-operation command to the engagement means instructed to not engage any target. This means that the engagement means is explicitly instructed not to engage any target. The no-operation command may be transmitted at a predetermined rate, preferably identical to the rate at which the agent component is configured to determine an action. The rate is preferably a rate at which the agent component conducts a sequence of computing steps to determine the action and / or no-operation commands. This allows using the no-operation commands as alive status signals to the engagement means.
[0025] The method may further comprise not transmitting any action commands within the specified time interval. This economizes bandwidth on any data link between the agent component and the engagement means.
[0026] In an embodiment, the method further comprises continuously transmitting, by the agent component, a plurality of identical and / or different action commands.
[0027] In another embodiment, the no-operation command instructs all of the engagement means to not engage any target during the specified time interval.
[0028] In other words, the no-operation command may be for all engagement means. Such a no-operation command may be referred to as a global no-operation command. This allows the agent component to develop a strategy in which no engagement is done during the time interval. If the agent component comprises a delay head configured to determine a global no-operation command, then only the delay head has to determine the global no-operation command to ensure no target is engaged. This reduces the class imbalance resulting from a strategy wherein all engagement means stay idle for a certain interval. In other words, the agent component may generate, by the engagement means output heads, instructions to engage a target which are subsequently discarded as long as a global no-operation command is issued. In this case no target is engaged. This reduces the class imbalance in a case where no target is engaged.
[0029] The respective action command may include a plurality of no-operation commands, wherein each no-operation command instructs a respective engagement means not to engage any target during a respective specified time interval. This may be implemented by individual engagement means output heads of the agent component issuing each of the no-operation commands. It may be implemented by a post-processing algorithm that determines the no-operation commands for each of the respective engagement means in response to receiving a global no-operation command from the delay head. This allows determination of instructions for the respective delay heads based on one global command.
[0030] In a further embodiment, the no-operation command may comprise an indication of a delay and / or the specified time interval. In particular, the no-operation command may define one or more individual delays to be applied to one or more engagement means individually. This allows determining, by the agent component, a preferable strategy that determines that no operation is conducted during a predetermined time step or specified time interval. The agent component is then flexible enough to be trained, e.g. by reinforcement learning, to come up with a more appropriate strategy. Moreover, even in a scenario where there is a risk of a class imbalance between action and non-action commands, e.g. because most action commands are no-operation commands, determining respective no-operation commands for all engagement means reduces this class imbalance for each engagement means.
[0031] In an embodiment, the action commands are transmitted only in response to receipt of a confirmation user input within a predetermined time interval prior to determining the action command. Preferably, a no-operation command is transmitted if no confirmation user input is received during the confirmation time interval. This may be referred to as a human-on-the-loop solution. This allows human control of the device to increase safety, while at the same time benefiting from the automated handling of the complexity and recommendation of engagement decisions that the agent component was trained for.
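The confirmation gating may be sketched as follows. This is only an illustration under stated assumptions: the function name `gate_commands`, the `NOOP` encoding, and the use of floating-point timestamps in seconds are hypothetical and not specified in the disclosure.

```python
from typing import List, Optional

NOOP = -1  # hypothetical encoding of a no-operation command


def gate_commands(
    commands: List[int],
    confirmation_time: Optional[float],
    decision_time: float,
    window: float,
) -> List[int]:
    """Pass commands through only if a user confirmation is recent enough.

    A confirmation counts if it was received at most `window` seconds
    before `decision_time`. Without a valid confirmation, every command
    is replaced by NOOP, so that nothing is engaged by default.
    """
    confirmed = (
        confirmation_time is not None
        and 0.0 <= decision_time - confirmation_time <= window
    )
    if confirmed:
        return list(commands)
    return [NOOP] * len(commands)
```

The design choice shown here is fail-safe: an absent or stale confirmation yields no-operation commands rather than leaving earlier commands in force.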
[0032] A second aspect of the present disclosure relates to a computer-implemented method of training an agent component, in particular the agent component of the first aspect, of an engagement system comprising one or more engagement means configurable to engage one or more targets. The method comprises the following steps:
• determining training data comprising observations of an environment indicative of a position, and preferably movement, of the one or more targets relative to the one or more engagement means;
• determining, by the agent component and based on the training data, an action command for each of the engagement means, the action command comprising an instruction to engage one or more of the targets or a no-operation command to not engage any target during a specified time interval;
• determining a reward based on the action; and
• updating the agent component based on the reward.
[0033] The advantage is that this kind of training, in particular reinforcement learning, allows providing an agent component that is trained to implement a long-term strategy over a duration of a scenario, for the overall mission goal defined by the reward. A long-term strategy may comprise a policy to achieve high rewards. In embodiments, the training data may pertain to a scenario of a duration of several minutes, up to a few hours. A typical timestep in decision making, in particular a duration of a loop of determining observations, determining action commands, determining the reward, and updating the agent, may be in the order of milliseconds. This method of learning has been shown to lead to a successful policy for millions of time steps. Training data may be generated by a simulation (which is preferred), but also using real-world data, e.g. to allow learning during deployment.
[0034] The method of training may further comprise determining an alive status, a position, and / or a movement, of one or more protected assets, in particular high-value assets. The protected assets may or may not be part of the engagement system. The alive status, position, and / or movement of the protected assets may be processed to determine a protected asset reward. The protected asset reward may comprise a negative reward if the protected asset is successfully engaged by a target.
[0035] In an embodiment, the reward comprises one or more of:
• a positive intermediate reward in response to a determination that the predicted action leads to successful engagement of a target, wherein the intermediate reward is preferably based on a predetermined importance indicator related to the target;
• a negative intermediate reward in response to a determination that an engagement means has expended part of a finite resource, in particular ammunition and / or energy; and / or
• a negative intermediate reward in response to a determination that a predetermined asset, in particular a sensor and / or an engagement means comprised in the engagement system, is no longer operational, in particular has been successfully engaged by one or more of the targets.
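The three intermediate reward terms listed above may be combined as in the following sketch. The function name, the additive combination, and the numeric weights `resource_cost` and `asset_loss_cost` are assumptions chosen for illustration; the disclosure does not specify values or a particular formula.

```python
from typing import Sequence


def intermediate_reward(
    hit_importances: Sequence[float],
    resource_units_expended: int,
    assets_lost: int,
    resource_cost: float = 0.1,      # hypothetical weight
    asset_loss_cost: float = 5.0,    # hypothetical weight
) -> float:
    """Sum the intermediate reward terms for one time step.

    hit_importances holds the importance indicator of every target that
    was successfully engaged in this step (positive term). Expended
    resource units and lost assets contribute negative terms.
    """
    reward = float(sum(hit_importances))
    reward -= resource_cost * resource_units_expended
    reward -= asset_loss_cost * assets_lost
    return reward
```

Setting all importance indicators to the same value corresponds to the case in which the importance indicator is fixed and equal for all targets.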
[0036] Applying a reward may comprise updating a control parameter of the agent component depending on the reward. The importance indicator may be predetermined, e.g. set by a user such that a fixed higher importance indicator is set for a target of higher value, i.e. a target that is considered to be more important. The importance indicator may alternatively be determined by an algorithm, e.g. a predetermined formula, that calculates the importance indicator based on type, location, and / or velocity of the target. However, the importance indicator may be a fixed value and it may be equal for all targets in some cases. Even in these cases, the training method has been found to lead to a successful policy.
[0037] The negative intermediate reward preferably takes into account any engagement of the engagement system and / or any protected asset by any of the targets. Put differently, the targets may comprise threats that are configured to engage any asset that the engagement system may be configured to protect. In particular, the targets may attack any of the engagement means and / or sensor. In this case, the training may enable the agent component to engage the targets such that they are successfully engaged before they are able to successfully engage, e.g. counterattack, the engagement system. The engagement system may be configured to protect any asset, e.g. infrastructure, by defining, e.g. by a user input, a negative reward in case of successful engagement of the asset by the targets. In an illustrative example, the engagement system may comprise long-range engagement means, and the targets may comprise a shorter-range means to engage other targets, such as a part of the engagement means, and / or one or more protected assets. In this case, training has been found to cause the agent to develop a policy that favours engaging the targets by the long-range engagement means before the targets can engage the engagement means and / or protected assets.
[0038] The negative intermediate reward may depend on scarcity of the resource. For example, during the scenario, the negative reward for expending the resource may be increased, e.g. by an analytic function, as the resource becomes scarcer. Put differently, higher negative rewards may be applied if a scarce, e.g. expensive, resource, such as a missile, is expended, in particular as more and more ammunition is expended and the limited amount of ammunition decreases. This allows training the agent component to develop a policy of economizing ammunition, e.g. by using less scarce and / or cheaper ammunition types in a first engagement and only using more expensive and / or scarce ammunition types if the first engagement was unsuccessful. In an exemplary embodiment, however, the reward values are independent of the amount of ammunition that has been expended. In an embodiment, the training comprises applying a terminal reward, in particular a positive terminal reward if all targets have been successfully engaged, and / or a negative terminal reward if a target has successfully engaged a protected asset, and / or has successfully engaged all engagement means so that the agent component can no longer defend the protected assets.
[0039] The terminal reward may be determined / applied at an end of episode, in particular at a time when all targets have been successfully engaged, and / or when all sensors and / or engagement means are no longer operational, and / or when a high-value asset has been destroyed by one or more of the targets.
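A scarcity-dependent resource penalty and a terminal reward of the kind described in the two preceding paragraphs may be sketched as follows. The linear scaling, the function names, and all default values are assumptions for illustration; the disclosure only states that an analytic function may be used and does not give one.

```python
def resource_penalty(
    remaining: int,
    initial: int,
    base_cost: float = 1.0,         # hypothetical value
    scarcity_weight: float = 2.0,   # hypothetical value
) -> float:
    """Negative reward for expending one unit of a finite resource.

    The penalty grows linearly as the stock depletes (an assumed choice
    of analytic function).
    """
    if initial <= 0 or not 0 <= remaining <= initial:
        raise ValueError("expected initial > 0 and 0 <= remaining <= initial")
    depletion = 1.0 - remaining / initial  # 0.0 when full, 1.0 when empty
    return -base_cost * (1.0 + scarcity_weight * depletion)


def terminal_reward(
    all_targets_engaged: bool,
    protected_asset_lost: bool,
    all_means_lost: bool,
    win_bonus: float = 10.0,      # hypothetical value
    loss_penalty: float = 10.0,   # hypothetical value
) -> float:
    """Reward applied once at the end of an episode."""
    reward = 0.0
    if all_targets_engaged:
        reward += win_bonus
    if protected_asset_lost or all_means_lost:
        reward -= loss_penalty
    return reward
```

With `scarcity_weight=0.0` the penalty is constant, which corresponds to the exemplary embodiment in which the reward values are independent of the amount of ammunition expended.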
[0040] In an embodiment, the method further comprises:
• simulating the environment to determine the training data;
• iteratively updating the simulated environment by conducting a simulated time step based on the action, and determining a further action based on the updated environment; and
• preferably terminating the iteration in response to an abort condition or after a predetermined time.
[0041] Here, updating the simulated environment preferably comprises determining a counter-engagement of one or more of the sensors and / or one or more of the engagement means by one or more of the targets.
[0042] Here, the abort condition preferably comprises a determination that all targets have been successfully engaged and / or a determination that all sensors and / or all engagement means have been successfully counter-engaged.
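The iteration with its abort condition may be sketched as a generic episode loop. The state representation (a dictionary with the keys `targets_remaining` and `means_operational`), the function names, and the callables `policy` and `simulate_step` are hypothetical placeholders for the agent component and the simulator; sensors are omitted for brevity.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, int]


def abort_condition(state: State) -> bool:
    """True if all targets are engaged or no engagement means is left."""
    return state["targets_remaining"] == 0 or state["means_operational"] == 0


def run_episode(
    initial_state: State,
    policy: Callable[[State], int],
    simulate_step: Callable[[State, int], State],
    max_steps: int,
) -> Tuple[State, int]:
    """Iterate action selection and simulated time steps.

    The loop stops when the abort condition holds or after max_steps
    steps, whichever comes first. Returns the final state and the
    number of steps conducted.
    """
    state = dict(initial_state)
    steps = 0
    while steps < max_steps and not abort_condition(state):
        action = policy(state)                # determine a further action
        state = simulate_step(state, action)  # conduct a simulated time step
        steps += 1
    return state, steps
```

In a training setup, the reward determination and the update of the agent component would be hooked into the same loop; they are left out here to keep the control flow visible.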
[0043] The specific commands that can be determined by the agent component as described above for the case of inference may be determined also during training, including in particular:
• determining a further action command to cause the engagement means to refrain from engaging any target of the one or more targets until receipt of a reactivation command;
• transmitting, by the agent component, the no-operation command to the engagement means instructed to not engage any target;
• not transmitting any action commands within the specified time interval;
• continuously transmitting, by the agent component, a plurality of identical and / or different action commands;
• determining a no-operation command that instructs all of the engagement means to not engage any target during the specified time interval; and / or
• determining an action command that includes a plurality of no-operation commands, wherein each no-operation command instructs a respective engagement means not to engage any target during a respective specified time interval.
[0044] The no-operation command may comprise, also during training, an indication of a delay and / or the specified time interval.
[0045] Likewise, the actions of the engagement means may be conducted by a simulated or real-world environment, including
• not engaging, by the engagement means, a target of the one or more targets during the specified time interval; and / or
• engaging, by the engagement means, a target of the one or more targets after the end of the specified time interval; and / or
• not engaging, by the engagement means, a target of the one or more targets after the end of the specified time interval.
[0046] This allows conducting all the steps done during inference also during training, and determining the reward function based on a realistic behaviour of the agent component.
[0047] A further embodiment relates to the method of the first aspect, wherein the agent component has been trained by the method of the second aspect.
[0048] In an embodiment, the respective action command includes the or a no-operation command instructing at least one, preferably a subset of, the engagement means to not engage any target during a predetermined time interval. This allows the agent to decide to wait with its engagement, e.g. for better engagement geometries and increased probability of successful engagement.
[0049] In an embodiment, the agent component comprises a machine-learning algorithm, preferably a neural network, more preferably a neural network comprising:
• a backbone layer comprising an input of the neural network configured to receive the input data and / or the observation, the backbone layer configured to generate a representation of the observation; and
• a recurrent neural network configured to generate the or an action based on the representation and maintain a memory of the past observations to act optimally; and
• an output layer configured to output the action command.
[0050] Since one backbone layer receives one input of the observation of multiple targets, the neural network processes this information including all its parts. It therefore takes into account the full situational awareness and therefore makes central control efficient.
[0051] In an embodiment, the backbone layer comprises a transformer-type neural network.
[0052] In an embodiment, the output layer comprises a plurality of output heads, wherein at least one output head of the plurality of the output heads is configured to generate an action command operable to control an engagement means of the engagement means.
[0053] The output of the output head can be seen as a vector in an action space. Since each individual output head is separate from the other output heads, said action spaces are distinct. The output, i.e. the action command may connect the effector to a target by indicating that the target is to be engaged. However, the output command may also cause the engagement means to stay idle for a time step (no-operation).
[0054] In an embodiment, the output heads comprise a delay output head configured to generate the or a delay command.
[0055] This makes training efficient. It may be implemented, in an illustrative example, as multiple MLP layers stacked one after the other, which take as input the embedding produced by the recurrent neural network and output logits that parameterise a categorical distribution with 20 options. Each option corresponds to a multiple of the number of no-ops that need to be applied in the environment to prevent the engagement means from engaging.
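How logits may parameterise a categorical distribution over 20 delay options can be sketched in plain Python. The names `softmax`, `sample_delay`, and `noops_per_option`, as well as the convention that option k maps to k times a base number of no-ops (so that option 0 means no delay), are assumptions for illustration; the MLP layers producing the logits are not shown.

```python
import math
import random
from typing import List

NUM_OPTIONS = 20  # number of categories stated in the illustrative example


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_delay(
    logits: List[float], noops_per_option: int, rng: random.Random
) -> int:
    """Sample an option and return the number of no-ops to apply.

    Assumed mapping: option k corresponds to k * noops_per_option no-ops.
    """
    if len(logits) != NUM_OPTIONS:
        raise ValueError(f"expected {NUM_OPTIONS} logits, got {len(logits)}")
    probs = softmax(logits)
    option = rng.choices(range(NUM_OPTIONS), weights=probs, k=1)[0]
    return option * noops_per_option
```

During training, an option would typically be sampled as shown; at inference time one might instead pick the most probable option, although the disclosure does not state which is used.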
[0056] In an embodiment, the recurrent neural network comprises a long short-term memory (LSTM). The advantage of the LSTM is that it allows for training over long sequences of experience and efficiently remembering past observations in order to act optimally, while avoiding the vanishing gradient problem common to other types of recurrent neural networks.
[0057] In an embodiment, the output head of an engagement means may comprise a no-operation command that instructs the engagement means to not engage any of the targets for one step, even if the delay head has not instructed the engagement means to remain idle. This has the advantage of allowing the agent component to further choose a subset of the engagement means to engage one or more of the targets at the current step of the episode, while keeping other engagement means idle and sparing their ammunition.
[0058] A third aspect of the present disclosure relates to a system comprising one or more processors and one or more storage devices, wherein the system is configured to perform the computer-implemented method of any of the preceding aspects.
[0059] In an embodiment, the system further comprises one or more of the engagement means and preferably one or more sensors operable to generate the input data.
[0060] A fourth aspect of the present disclosure relates to a computer program product for loading into a memory of a computer. The computer program product comprises instructions that, when executed by a processor of the computer, cause the computer to execute a computer-implemented method of the first and / or second aspect. In an embodiment, a non-transitory computer-readable storage medium is provided that stores instructions executable by one or more processors. The instructions comprise any of the steps of a method of the first and / or second aspect of the present disclosure.
[0061] A fifth aspect of the present disclosure relates to a computer program product comprising a trained machine learning module obtainable by the computer-implemented method of the first and / or second aspect. In an embodiment, a non-transitory computer-readable storage medium is provided that stores instructions executable by one or more processors. The instructions comprise any of the steps of a method of the first and / or second aspect of the present disclosure.
Brief description of the drawings
[0062] The features, objects, and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference numerals refer to similar elements.• Fig. l is a flow chart of a method of controlling an engagement system;• Fig. 2 is a flow chart of a method of providing an agent component of an engagement system;• Fig. 3 is a block diagram showing a structure of an agent component and corresponding data types; and• Fig. 4 is a schematic drawing of a scenario.Detailed description of the preferred embodiments
[0063] Fig. 1 is a flow chart of a method 100 of controlling an engagement system, such as engagement system 410 shown in Fig. 4. The steps of method 100 may be executed by an agent component, in particular an agent component obtainable by method 200 shown in Fig. 2, comprising structures as shown in Fig. 3, and / or the agent component 422 of Fig. 4. Method 100 allows operating the engagement system to engage, preferably destroy, one or more targets.
[0064] The method is preferably conducted in a loop, so that the steps are repeated. The steps may be repeated at a predetermined frequency, e.g. once every 500 milliseconds. The steps may be repeated during varying time intervals determined by processing time. Fast processing allows quickly reacting to any change observed by the sensors.
[0065] Method 100 begins by determining, 102, one or more engagement means. The engagement means are comprised in an engagement system. Each of the engagement means is configurable to engage the targets. In embodiments, each of the engagement means may be configured to engage with one target at a time. Step 102 preferably includes determining for each of the available engagement means an alive status that indicates if the engagement means is functioning, and an ammunition status that indicates the amount of ammunition left.
[0066] At 104, input data are received. The input data may comprise sensor data. Sensor data may be received from any sensor that can observe a target. The sensor data are indicative of a position, and preferably also movement, of the targets relative to the engagement means.
[0067] Once steps 102 and 104 are completed, the agent component has been supplied with an observation of the environment on which it is to act. The action is then generated by steps 106-120.
[0068] At 106, the agent component simultaneously determines an action command for each of the one or more engagement means. This is based on the received sensor data. This preferably includes processing the sensor data by a machine learning algorithm, more preferably processing, 108, the sensor data by an artificial neural network. More preferably, the artificial neural network includes a reinforcement learning algorithm. According to the reinforcement learning paradigm, the agent component may process both input data received at step 104 and information stored in the agent component based on past input data.
[0069] The agent component is configured to determine, 110, a delay command for all engagement means. Said command instructs all engagement means not to engage any target during a specified time interval. Preferably, the engagement means are configured to ignore any other command received during the interval.
[0070] It is preferred if the agent component is configured to determine, 112, a no-op command for one engagement means that can be applied even if no delay command has been predicted for all engagement means.
[0071] At 118, the action command is transmitted to the one or more engagement means. Transmission of the action commands may include an evaluation step, wherein it is first determined if a delay command for all engagement means is comprised in the action command. In this case, no action commands are transmitted. It is further preferable to obtain, 120, a confirmation user input. This may be realized as a human-in-the-loop solution, where the action command is output to a human interface device to seek confirmation from the user that the action commands can be transmitted, and wherein the action commands are transmitted only if an affirmative user input is received. This may, however, also be realized as a human-on-the-loop solution, wherein the action commands are transmitted to the engagement means if a user input enabling transmission has been received within a predetermined time interval, which is preferably much longer than the duration of execution of steps 102 to 118. This is preferable since the user input can be received asynchronously, i.e. the user may authorize autonomous operation of the engagement system in advance for an interval of, e.g., 15 minutes.
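The two confirmation modes described in this paragraph can be illustrated by a minimal sketch. The function name `may_transmit`, the mode strings, and the parameter names are hypothetical and not part of the disclosure; the sketch only shows the gating logic, assuming that times are given in seconds on a common clock.

```python
from typing import Optional


def may_transmit(mode: str,
                 now: float,
                 operator_confirmed: bool = False,
                 authorized_until: Optional[float] = None) -> bool:
    """Return True if the action commands may be transmitted (step 118)."""
    if mode == "in_the_loop":
        # Human-in-the-loop: every set of action commands needs an
        # affirmative user input before it is transmitted.
        return operator_confirmed
    if mode == "on_the_loop":
        # Human-on-the-loop: an authorization received in advance covers
        # a time window (e.g. 15 minutes) of autonomous operation.
        return authorized_until is not None and now <= authorized_until
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, a missing or expired authorization simply blocks transmission, which matches the default of not engaging any target.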
[0072] By the method 100, the threat posed by each individual target is assessed, processed, and the engagement means are controlled. This includes increasing kill probabilities and reducing the risk of loss of engagement means and / or sensors and / or high-value assets, and reducing the requirements for ammunition.
[0073] Method 100 is preferably applicable to defend against a saturation attack where a large number of targets appear at about the same time.
[0074] Fig. 2 is a flow chart of a method 200 of providing an agent component of an engagement system.
[0075] Method 200 may be seen as a reinforcement learning method. According to the concept of reinforcement learning and subject to the specifics explained in this description, an agent component is generally configured to observe a real or simulated environment and to act in the environment by determining actions. In particular, observations, i.e. data determined based on and / or indicative of the environment, may be supplied to an input of the agent component. It is preferable to process observations by preprocessing or by other machine learning algorithms, such as feedforward neural networks, to generate a representation that is then input into the agent component. The agent component is configured to process the observations and / or representations and to determine an action as an output. The action may influence the environment, e.g. by being input into a controlled device. Method 200 is a method of training an agent component. Training may require large amounts of training data to generate a robust model that can be used for inference on testing data. During training, intermediate and / or final rewards may be applied, and the agent component may be trained to increase, preferably maximize the rewards as detailed below.
[0076] The method 200 begins by initializing, 202, the machine learning algorithm. This preferably includes configuring the agent component such that it generates a random action in response to an observation. Then, the subsequent training steps determine the agent component's learned behaviour.
[0077] The agent component is trained in this example by interacting with a simulated environment. The simulation preferably includes one or more targets, such as attacking UAVs, and the engagement means and sensors of the engagement system. The simulation may comprise a large number of targets arriving at the same time, destruction of sensors and / or engagement means due to engagements conducted by the targets, and other changes determined by the simulation setup. The simulation also comprises the actions taken by the agent component, in particular in reaction to the outputs generated, i.e. decisions made, by the agent component. The simulation preferably runs in a headless mode during training, where no graphics are rendered as far as they are not necessary to generate observation data for input into the agent component. The interaction between the simulation and the processing of the observations by the agent component is conducted as follows: The steps 204 - 210 are repeated for each time step, which preferably has a predetermined duration. During each time step, the simulation environment processes the simulation to reflect both the changes imposed by the simulation setup, e.g. the appearance of new targets, and the changes caused by the actions determined by the agent component, such as successful engagements of threats. Moreover, during each time step, observations that can be made by the agent component are determined and processed to generate an action in response to the ongoing simulated situation, which is then fed into the simulation on the next time step.
[0078] At 204, the observations are supplied to the agent component. The observations may be determined by a simulation. It is preferred to use a simulation environment that captures a scenario on which the agent component is to be trained. In order to allow the agent component to generalize, it is further preferred to conduct a plurality of simulations and to supply the observations generated by the simulations as a batch of inputs to the agent component. The simulations progress for some time to determine a new simulation state, and observations are determined from the state by extracting, from the simulation, the threat states of targets that are observable using the sensors of the simulation, and further extracting the corresponding engagement states, effector states, and sensor alive status information. This preferably includes updating of the alive status of sensors and engagement means, i.e. the alive status is set to False if the simulation has determined the corresponding sensor or engagement means to be destroyed during the present time step. This preferably includes updating the ammunition status of the engagement means to reflect ammunition expended during the last engagement step.
[0079] At 206, the agent component determines an action to be conducted in response to the observation. This preferably includes processing the observation by a neural network as shown in Fig. 3. Preferably, a backbone layer generates a representation of the observation, and a recurrent neural network then generates outputs that include an action command for each engagement means. Each such action command may include a no-operation command that causes the corresponding engagement means not to engage the target at the present timestep. The action command may include an engage command comprising an indication of which target to engage. Since the same agent component determines the action commands for all engagement means, the behaviour of the engagement means is inherently coordinated. This has the advantage that the agent component can be trained to determine an action by the entire engagement system, which may include attacking a high-value target with a plurality of engagement means at the same time, and attacking a low-value target only with one engagement means or none at all.
[0080] At 207, the action is applied to the environment. This preferably includes outputting, by the agent component, the action commands and / or no-operation commands to simulated engagement means, so that the agent component acts on the environment at this step.
[0081] At 208, an intermediate reward is determined that depends on the result of the action command generated during the last time step and / or preceding steps. More specifically, the intermediate reward may depend on the history of the evolution of the scenario. If the simulation has determined that an engagement of a target has led to the target being destroyed, a positive reward is attributed. Determining the rewards in this way allows training the agent component to increase, preferably maximize, the kill probabilities reached by the engagement system. The magnitude of the reward preferably depends on the importance of the target. In particular, the simulation setup may comprise a user-determined indication of the importance of each target. The indication is then used to determine the reward during the time step. Moreover, it is preferred to apply a negative reward for any ammunition used. The magnitude of the negative reward depends on the type and scarcity of the ammunition. It is preferred to set predetermined parameters to determine the type of the ammunition as part of the simulation setup. The scarcity of the ammunition may be determined as a function of the amount of ammunition left.
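The intermediate reward described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the reward function of the disclosure: the linear scarcity function, the per-type cost table, and all names (`scarcity_weight`, `intermediate_reward`, `type_cost`) are hypothetical.

```python
from typing import Dict, List


def scarcity_weight(remaining: int, capacity: int) -> float:
    """Assumed scarcity function: 1.0 when full, rising to 2.0 when empty."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 2.0 - remaining / capacity


def intermediate_reward(destroyed_importance: List[float],
                        rounds_used: Dict[str, int],
                        remaining: Dict[str, int],
                        capacity: Dict[str, int],
                        type_cost: Dict[str, float]) -> float:
    """Reward for one time step.

    destroyed_importance: user-determined importance of each target that
        the simulation reports as destroyed during this step.
    rounds_used / remaining / capacity: per ammunition type.
    type_cost: assumed base penalty per round for each ammunition type.
    """
    # Positive reward, scaled by the importance of each destroyed target.
    reward = sum(destroyed_importance)
    # Negative reward for ammunition, scaled by type and scarcity.
    for ammo_type, used in rounds_used.items():
        weight = scarcity_weight(remaining[ammo_type], capacity[ammo_type])
        reward -= used * type_cost[ammo_type] * weight
    return reward
```

With this choice, the same expenditure is penalized more heavily as the stock of that ammunition type runs low, which is one way to express scarcity as a function of the amount left.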
[0082] At 210, it is determined if an abort condition has been reached. An abort condition may include the determination that there are no more targets left (successful outcome) or that all engagement means and / or all sensors and / or all protected assets have been destroyed. If no abort condition is fulfilled, a time step is determined at 212. This preferably comprises incrementing a current time value by a predetermined value, which is preferably about 500 milliseconds. This value allows training the agent component to develop a temporally fine-grained coordination strategy that comprises reacting quickly to any changes in the situation. After the time step, the next iteration of steps 204-210 is conducted.
[0083] If an abort condition is reached, a terminal reward is applied, 214. This includes a large positive reward if there are no more targets left, and a large negative reward if a high-value asset is destroyed. The terminal reward is zero otherwise.
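The abort check at 210 and the terminal reward at 214 can be sketched as below. The reward magnitudes, the function names, and the choice to check the loss of a high-value asset before the "no targets left" case are assumptions made for illustration; the description does not specify them.

```python
def is_abort(targets_left: int,
             effectors_alive: int,
             sensors_alive: int,
             assets_alive: int) -> bool:
    """Abort if no targets remain, or if all engagement means, all
    sensors, or all protected assets have been destroyed."""
    return (targets_left == 0
            or effectors_alive == 0
            or sensors_alive == 0
            or assets_alive == 0)


def terminal_reward(targets_left: int,
                    high_value_asset_destroyed: bool,
                    bonus: float = 100.0,
                    penalty: float = 100.0) -> float:
    """Terminal reward applied once an abort condition is reached."""
    if high_value_asset_destroyed:
        # Assumed precedence: asset loss dominates the outcome.
        return -penalty
    if targets_left == 0:
        return bonus
    return 0.0
```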
[0084] By method 200, the agent component has learned a policy, which is a mapping from observations to actions. Put differently, the setting of the weights in the agent component enables the trained agent component to determine an action for any test observations that could be supplied to an input of the agent component. The determination of the action is based on the training process and generally causes it to generate actions that lead to a high reward. It should be noted that the agent component itself can be seen as a model-free agent, since the model of the situation is comprised in the simulation, which can be a separate algorithm that communicates with the agent component only by determining the states and observations, and by determining the reward. The training of the agent component based on the intermediate rewards and the terminal reward can be done using Proximal Policy Optimization (PPO) as explained, e.g., in: Schulman et al.: Proximal Policy Optimization Algorithms. arXiv: 1707.06347v2 [cs.LG] (2017).
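For reference, the clipped surrogate objective introduced in the cited PPO paper is reproduced below. The description does not state which PPO variant or which hyperparameters were used, so the clipping parameter and the advantage estimator are left generic here.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here $\pi_\theta$ is the policy represented by the agent component, $\hat{A}_t$ is an estimate of the advantage at time step $t$ computed from the intermediate and terminal rewards, and $\epsilon$ is the clipping parameter.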
[0085] In an exemplary training process, the agent component has been trained using 80 instances of a simulation environment in a headless mode. For each iteration of training, 80 different runs of the scenario were used as a batch. Each episode was up to three minutes long, and the agent component could act in discrete steps for at most 300 steps.
[0086] Fig. 3 is a block diagram showing a structure of an agent component and corresponding data types. Data types are depicted as rectangles with rounded corners, and components of the agent component, i.e. the backbone layer 326, the recurrent neural network 330, and the heads 332, as normal rectangles.
[0087] The observation 300 is a data type that comprises the situational awareness of the agent component. It may be determined by determining a state of the simulation, e.g. by having the simulation progress by one or more time steps, and extracting the data from the simulation that is observable by the sensors that form part of the simulation. The observation 300 may be determined in this way for training and thereby constitute the training data on which the agent component is trained by reinforcement learning. The observation 300 may also be determined in this way for test purposes to see how a trained agent component reacts to a given simulation. However, the observation 300 may also be determined from real-world data. For example, targets may be detected and ranges (preferably position data) may be determined by a radar device. The effector states 302 may be determined from status updates of the devices. This allows the agent component to act upon observations in a real situation.
[0088] The effector states 302 comprise any state of the engagement means. In particular, for each engagement means, an alive status 304 is indicative of whether the engagement means can be used to engage a target or not. In a real-world application, the engagement means may send continuous status updates to the agent component, and the alive status 304 can be derived from the status updates. The alive status 304 is preferably encoded as a Boolean value set to True if the engagement means is operable and False otherwise. The ammunition indication 306 comprises information on the amount of ammunition left for the engagement means. The ammunition indication 306 may comprise an integer number of ammunition pieces (e.g. number of artillery shells or missiles) available, or a float value indicative of a charging of an energy storage, e.g. in case the engagement means comprises a radiation weapon. The ammunition indication 306 preferably comprises a vector comprising one dimension for each type of ammunition. The sensor alive status 308 indicates, for each sensor, whether the sensor is functioning or not. It may comprise a Boolean value for each sensor.
[0089] The engagement states 310 are indicative of ongoing engagements. Herein, each engagement denotes an action whereby one of the engagement means has engaged a particular target. Each engagement state 310 may comprise an identifier 312 of a parent engagement means that has engaged the target, a start time indicator 314 indicative of when the action command for the engagement was generated by the agent component, and a launch time indicator 316 indicative of a time when a projectile (e.g. shell, missile, or energy pulse) has left the engagement means.
[0090] Each of the target states 318 relates to one of the targets and comprises a position 320 and a velocity 322 of the target, and an indication of engagements 324 that the engagement system has conducted against the target. The target state may also include further information on the target, such as a target type indicator (not shown). However, it may be impossible to determine target type from the input data, in particular if the input data is generated by a sensor that is insensitive to a target type. If the sensor is, for example, a radar device, only position and velocity may be obtainable. This does not prevent application of the present disclosure since the agent component does not necessarily process a target type indicator.
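The data types of the observation 300 described above can be summarized in a minimal sketch. The class and field names are hypothetical; the comments map each field to the corresponding reference sign. The use of three-component tuples for position and velocity, of an optional launch time, and of indices to refer to engagements are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EffectorState:                # one entry of effector states 302
    alive: bool                     # alive status 304
    ammunition: List[float]         # ammunition indication 306, one entry per type


@dataclass
class EngagementState:              # one entry of engagement states 310
    parent: int                     # identifier 312 of the parent engagement means
    start_time: float               # start time indicator 314
    launch_time: Optional[float]    # launch time indicator 316 (None: not yet launched)


@dataclass
class TargetState:                  # one entry of target states 318
    position: Tuple[float, float, float]   # position 320
    velocity: Tuple[float, float, float]   # velocity 322
    engagements: List[int] = field(default_factory=list)  # engagements 324 (indices)


@dataclass
class Observation:                  # observation 300 for one time step
    effectors: List[EffectorState]
    sensor_alive: List[bool]        # sensor alive status 308
    engagements: List[EngagementState]
    targets: List[TargetState]
```

A target type indicator is deliberately omitted, in line with the remark that it may not be obtainable from a radar device.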
[0091] The observation 300 relates to a given time step. The observation 300 can be supplied into a backbone layer 326 of the agent component. The backbone layer 326 is configured to receive the observation 300 at an input and generate a representation 328 of the observation 300. The backbone layer 326 preferably comprises a transformer-type neural network. Using a transformer-type neural network has the advantage that it can process its input data in parallel. Moreover, the representations generated thanks to the attention mechanism of a transformer are better suited to the present application. The representation 328 may then be supplied to an input of the recurrent neural network 330, which may comprise a long short-term memory (LSTM). The recurrent neural network 330 preferably stores information on past states, in particular data based on input data received in previous time steps. The recurrent neural network 330 comprises a plurality of output heads 332. The output heads 332 comprise effector heads 336, 338, 340, 342, 344, and 346, each of which is configured to generate one of the action commands 352, 354, 356, 358, 360, 362. Each of the action commands may be sent to one engagement means. The generation of the corresponding action commands by the effector heads means that the action is separated into a distinct action space for each engagement means. This means that the assignment of a target to an effector can be done in a tractable way, whereas the central control, i.e. the generation of a coordinated strategy, is done by the agent component. In particular, the agent component can be trained to devise a coordinated policy that applies to all engagement means.
[0092] The action command may comprise an instruction to engage a target by an engagement means related, in particular communicatively coupled, to the effector head. The action command may also include a no-operation command instructing the corresponding engagement means not to engage with any target during a specified delay or during one time step. In the present embodiment, each effector head is configured to output an action command either instructing the corresponding engagement means to engage one target, or to engage no target at all. The agent component may generate a time interval in which one or more, or a subset, of the engagement means do not engage any target by determining single or repeated local no-operation commands by the corresponding effector heads. This allows development of appropriate strategies including longer phases in which no engagement is instructed.
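The tractability gained by separating the action into one action space per engagement means can be illustrated with a short calculation. The sketch assumes, as in the present embodiment, that each effector head chooses between engaging exactly one of N targets and engaging no target. The function names and the value of ten targets are hypothetical; six effector heads correspond to Fig. 3.

```python
def joint_action_count(num_effectors: int, num_targets: int) -> int:
    """Number of distinct joint actions if a single output had to
    enumerate every combination of per-effector choices."""
    return (num_targets + 1) ** num_effectors


def factored_output_count(num_effectors: int, num_targets: int) -> int:
    """Number of output values when each effector head scores its own
    (num_targets + 1) options independently."""
    return num_effectors * (num_targets + 1)


# Six effector heads and ten observed targets (assumed).
print(joint_action_count(6, 10))     # 1771561
print(factored_output_count(6, 10))  # 66
```

The factored heads still share the backbone layer and the recurrent neural network, so the per-effector choices remain coordinated even though they are not enumerated jointly.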
[0093] The recurrent neural network 330 further comprises a no-operation head 334 configured to issue a general no-operation command 350 that causes all engagement means to ignore the corresponding action command and not to conduct any operation. The general no-operation command 350 may comprise a delay command, i.e. the no-operation command may comprise a numerical value indicative of a time during which no operations are to be conducted. This has the advantage that the agent component can be trained efficiently at very small time steps, e.g. 500 milliseconds, while deciding not to act for longer periods if no targets are in sight, no target is posing an immediate risk, or no target is optimally positioned to maximise the kill probability. In realistic scenarios, action commands that cause engagement of a target occur at a much lower frequency than once per time step. This means that the actions that the agent component has to generate to maximize the reward are no-operation commands in most cases. This leads to a class imbalance that makes training of the machine learning algorithm inefficient or even impossible. If the machine learning algorithm is, however, configured to output a time interval during which no action command is to cause an engagement, the class imbalance is much more limited.
[0094] Put differently, the delay command 350 has the effect of downweighting, by the architecture, the case where no engagement should be initiated. Therefore, efficient training is possible despite the class imbalance. More specifically, the no-operation command may be postprocessed by a post-processing algorithm (not shown) that sets all action commands C1 - C6 to no-operation commands, such that the action output by the agent component includes no-operation commands during the specified delay. Put differently, the agent component may be configured to issue and transmit no-operation commands with every time step. The no-operation command may also cause the agent component to refrain from sending any action commands during the specified time interval.
[0095] The numerical value indicated in the no-operation command 350 may be specified in time units, such as milliseconds or seconds. It is, however, preferred to indicate the no-operation command as an integer number of time steps during which no engagement is to be effected. The time steps generally define the rate at which the system can generate any output commands. In embodiments, the time step is 500 milliseconds, and the no-operation head may be configured to issue integer values between 0 and 20. This allows defining intervals up to 10 seconds into the future in which no engagement is done. However, higher numbers of time steps may also be chosen.
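The post-processing of the general no-operation command, expressed as an integer number of time steps, can be sketched as follows. This is a minimal illustration, not the post-processing algorithm of the disclosure. The names `DelayPostprocessor`, `NO_OP`, and `delay_seconds` are hypothetical, a no-operation command is encoded as `None`, a delay of k steps is assumed to include the current step, and outputs generated while a delay is still running are assumed to be ignored.

```python
from typing import List, Optional

NO_OP = None          # hypothetical encoding of a no-operation command
STEP_SECONDS = 0.5    # 500 ms time step of the embodiment


def delay_seconds(delay_steps: int) -> float:
    """Convert a delay given in time steps into seconds."""
    return delay_steps * STEP_SECONDS


class DelayPostprocessor:
    """Overrides the per-effector action commands while a delay is active."""

    def __init__(self) -> None:
        self.remaining = 0  # delay steps still to be served

    def apply(self, delay_steps: int,
              commands: List[Optional[int]]) -> List[Optional[int]]:
        """commands[i] is a target index for engagement means i, or NO_OP."""
        if self.remaining > 0:
            # Inside a previously issued delay: ignore the new outputs.
            self.remaining -= 1
            return [NO_OP] * len(commands)
        if delay_steps > 0:
            # Start a new delay; the current step counts as the first one.
            self.remaining = delay_steps - 1
            return [NO_OP] * len(commands)
        # No delay: pass the effector head outputs through unchanged.
        return list(commands)
```

With integer values between 0 and 20, `delay_seconds(20)` gives the maximum interval of 10 seconds mentioned above.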
[0096] Fig. 4 is a schematic drawing of a scenario. The scenario comprises a plurality of targets 402. In the context of air defence, the targets may comprise aerial vehicles, e.g. unmanned aerial vehicles (UAV), manned aerial vehicles, or missiles. The method is particularly advantageous in case the targets are fast moving targets, such as UAVs, and if there is a large number of targets. The targets 402 in this example comprise low value targets 404 and a high value target 406. The engagement system 410 is an air defence system in this example. The engagement system 410 is generally operable to destroy one or more of the targets 402 as long as they are airborne or to otherwise nullify or reduce the effectiveness of the attack by the targets 402.
[0097] The engagement system 410 comprises a sensor 416, which is a radar device in this example. However, other sensor devices are also possible, including passive sensors such as cameras, or other active sensors, such as sonar or lidar devices. The sensor 416 is configured to detect at least a part of the targets 402. Although Fig. 4 shows only targets that are detected, there may be further undetected targets (not shown). For the purpose of a partial observation reinforcement learning problem, however, only the targets that are detected by the sensor 416 are part of the observation 300.
[0098] The engagement system 410 further comprises engagement means 412, 414. The engagement means 412, 414 in this example comprise anti-aircraft artillery 412 and missile launchers 414. In other examples, engagement system 410 may comprise other engagement means such as radiation weapons or other electronic attack devices. The engagement means 412, 414 are controlled by the agent component 422 and operable to engage targets 402 and thereby act on the environment. The engagement means may be ground-based, sea-based, air-based, space-based, or underwater-based. The engagement means 412, 414 are preferably configured to execute any action command by the agent component. More specifically, each engagement means is preferably communicatively coupled to a delay head of the agent component and configured to execute any action commands from the corresponding delay head. Each engagement means is preferably configured not to engage any target if it does not receive any action command. This allows implementing a strategy where no targets are engaged during a time interval without the need of sending no-operation commands to the engagement means regularly.
[0099] Both sensor 416 and engagement means 412, 414, are part of the environment 400 in that they can be rendered inoperable, i.e. damaged or destroyed, by other influences, in particular as a result of being engaged by targets 402. This is indicated by alive status 304 for the engagement means, and alive status 308 of the sensor.
[0100] The sensor 416 and engagement means 412, 414, are also parts of the engagement system 410, which further comprises a command centre 418. The command centre 418 is the central part of the engagement system 410 and configured to control the sensor 416 and engagement means 412, 414. It comprises an agent component 422 operable to execute method 100 during operation. It comprises a preprocessor 420 configured to preprocess the signals detected from the sensor 416, which are radio frequency signals in this case, and to output a sensor alive status, which may be a Boolean value set to True if the sensor 416 continues to send information, and target states as described with respect to Fig. 3. The output is sent to agent component 422, which receives it as part of the observations. The agent component 422 is in bidirectional communication with the engagement means 412, 414. The agent component 422 thereby obtains information on the status of each of the engagement means 412, 414, in particular on whether it is still operating and the amount of ammunition available. The agent component 422 can, by executing method 100, determine an action and send the action commands to the engagement means 412, 414. The agent component 422 preferably comprises a machine learning algorithm, more preferably the neural network shown in Fig. 3. The environment 400 may further comprise a protected asset 426 which is, in this embodiment, not part of the engagement system 410. A successful engagement of the protected asset by the targets may, however, result in negative terminal rewards by reward function 424. This allows training the agent component to develop a policy that protects asset 426, which may be a high-value asset, e.g. civilian or military infrastructure.
[0101] The agent component 422 may also be trained while deployed in the engagement system 410. To do so, it is necessary to apply a reward function 424, which also takes the sensor data as an input, determines, after each time step and at the end of the training, whether the action was successful, and inputs a corresponding reward into the agent component 422. The reward function may be an algorithm that contains general functions to determine the rewards. It is, however, preferred to use simulations for training. In this case, the environment 400 is part of a simulation, and the target states are preferably entered into the agent component 422 directly, thus forgoing the step of detection by the sensor 416 and preprocessing. However, the sensor 416 is part of the simulated environment, and loss of the sensor results in a negative reward determined by the reward function 424. Once trained, the agent component 422 can be deployed into a real-world system.
Reference signs
100 Method of controlling an engagement system
102-120 Steps of method 100
200 Method of generating an agent component of an engagement system
202-214 Steps of method 200
300 Observation
304 Alive status
306 Ammunition
308 Sensor alive status
310 Engagement states
312 Parent
314 Start time indicator
316 Launch time indicator
318 Target states
320 Position
322 Velocity
324 Engagements
326 Backbone layer
328 Representation
330 Recurrent neural network
332 Output layer
334 No-operation head
336 First effector head
338 Second effector head
340 Third effector head
342 Fourth effector head
344 Fifth effector head
346 Sixth effector head
348 Action
350 No-operation command
352 First action command
354 Second action command
356 Third action command
358 Fourth action command
360 Fifth action command
362 Sixth action command
400 Environment
402 Targets
404 Low-value target
406 High-value target
Engagements
410 Engagement system
412, 414 Engagement means
416 Sensor
418 Command centre
420 Preprocessor
422 Agent component
424 Reward function
426 Protected asset
Claims
1. A computer-implemented method (100) of controlling an engagement system (410), the method comprising: determining (102) one or more engagement means (412, 414) comprised in the engagement system (410), wherein each of the engagement means (412, 414) is configurable to engage one or more targets (402); receiving (104) input data, in particular sensor data, indicative of a position, and preferably movement, of the one or more targets (402) relative to the one or more engagement means (412, 414); simultaneously determining (106), by an agent component (422) and based on the received input data, an action command for each of the one or more engagement means (412, 414) and a no-operation command instructing at least one, preferably a subset of, the engagement means (412, 414) to not engage any target during a specified time interval; and transmitting (118) the respective action command to the one or more engagement means (412, 414), wherein the agent component (422) preferably comprises a machine-learning algorithm, more preferably a reinforcement learning based machine-learning algorithm.
2. The method of claim 1, wherein the respective action command to an engagement means (412, 414) includes a command to engage, by the engagement means (412, 414), one or more of the one or more targets (402); the method preferably further comprising determining a plurality of respective action commands to engage a target of the one or more targets (402) by each of a plurality of the engagement means (412, 414) simultaneously.
3. The method of any of the preceding claims, further comprising determining, by the agent component (422), a further action command to cause the engagement means (412, 414) to refrain from engaging any target of the one or more targets (402) until receipt of a reactivation command.
4. The method of any of the preceding claims, further comprising: not engaging, by the engagement means (412, 414), a target of the one or more targets (402) during the specified time interval; and / or engaging, by the engagement means (412, 414), a target of the one or more targets (402) after the end of the specified time interval; and / or not engaging, by the engagement means (412, 414), a target of the one or more targets (402) after the end of the specified time interval.
5. The method of any of the preceding claims, further comprising: transmitting, by the agent component (422), the no-operation command to the engagement means (412, 414) instructed to not engage any target; and / or not transmitting any action commands within the specified time interval.
6. The method of any of the preceding claims, further comprising continuously transmitting, by the agent component (422), a plurality of identical and / or different action commands.
7. The method of any of the preceding claims, wherein the no-operation command instructs all of the engagement means (412, 414) to not engage any target during the specified time interval; and / or wherein the respective action command includes a plurality of no-operation commands, wherein each no-operation command instructs a respective engagement means (412, 414) not to engage any target during a respective specified time interval.
8. The method of any of the preceding claims, wherein the no-operation command comprises an indication of a delay and / or the specified time interval.
9. The method of any of the preceding claims, wherein the action commands are transmitted only in response to receipt of a confirmation user input within a predetermined confirmation time interval prior to determining the action command, and wherein a no-operation command is preferably transmitted if no confirmation user input is received during the confirmation time interval.
10. A computer-implemented method of training an agent component (422), in particular the agent component (422) of any of the preceding claims, of an engagement system (410) comprising one or more engagement means (412, 414) configurable to engage one or more targets (402), the method comprising: determining training data comprising observations (300) of an environment (400) indicative of a position, and preferably movement, of the one or more targets (402) relative to the one or more engagement means (412, 414); determining, by the agent component (422) and based on the training data, an action command for each of the engagement means (412, 414), the action command comprising an instruction to engage one or more of the targets (402) or a no-operation command to not engage any target during a specified time interval; determining a reward based on the action; and updating the agent component (422) based on the reward.
11. The method of claim 10, wherein the reward comprises one or more of: a positive intermediate reward in response to a determination that the predicted action leads to successful engagement of a target, wherein the intermediate reward is preferably based on a predetermined importance indicator related to the target; a negative intermediate reward in response to a determination that an engagement means (412, 414) has expended part of a finite resource, in particular ammunition and / or energy; and / or a negative intermediate reward in response to a determination that a predetermined asset, in particular a sensor (416) and / or an engagement means (412, 414) comprised in the engagement system (410), is no longer operational, in particular has been successfully engaged by one or more of the one or more targets (402).
12. The method of claim 10 or 11, wherein the training comprises applying a terminal reward, in particular a positive terminal reward if all targets (402) have been successfully engaged, and / or a negative terminal reward if a target has successfully engaged a protected asset (426), in particular any part of the engagement system (410).
13. The method of any of claims 10-12, further comprising: simulating the environment (400) to determine the training data; iteratively updating the simulated environment (400) by conducting a simulated time step based on the action, and determining a further action based on the updated environment (400); and preferably terminating the iteration in response to an abort condition or after a predetermined time, wherein updating the simulated environment (400) preferably comprises determining a counter-engagement of one or more of the sensors (416) and / or one or more of the engagement means (412, 414) by one or more of the targets (402), and / or wherein the abort condition preferably comprises a determination that all targets (402) have been successfully engaged and / or a determination that all sensors (416) and / or all engagement means (412, 414) have been successfully counter-engaged.
14. The method of any of claims 1-9, wherein the agent component (422) has been trained by the method of any of claims 10-13.
15. The method of any of the preceding claims, wherein the respective action command includes the or a no-operation command instructing at least one, preferably a subset of, the engagement means (412, 414) to not engage any target during a predetermined time interval.
16. The method of any of the preceding claims, wherein the agent component (422) comprises a machine-learning algorithm, preferably a neural network, more preferably a neural network comprising: a backbone layer (326), preferably a transformer-type neural network, comprising an input of the neural network configured to receive the input data and / or the observation (300), the backbone layer (326) configured to generate a representation (328) of the observation (300); and a recurrent neural network (330), in particular a recurrent neural network configured to maintain a memory of one or more earlier observations (300), more particularly a long short-term memory; and / or an output layer (332) configured to output the action command.
17. The method of claim 16, wherein the output layer (332) comprises a plurality of output heads (334, 336, 338, 340, 342, 344, 346), wherein at least one output head of the plurality of the output heads is configured to generate an action command (350, 352, 354, 356, 358, 360, 362) operable to control an engagement means (412, 414) of the engagement means (412, 414), wherein the output heads preferably comprise a no-operation output head (334) configured to generate the or a no-operation command (350).
18. A system comprising one or more processors and one or more storage devices, wherein the system is configured to perform the computer-implemented method of any one of claims 1-17, and wherein the system preferably further comprises one or more of the engagement means and / or one or more sensors (416) operable to generate the input data.
19. A computer program product for loading into a memory of a computer, comprising: instructions, that, when executed by a processor of the computer, cause the computer to execute a computer-implemented method of any of claims 1-17; and / or a trained machine learning module obtainable by the computer-implemented method of any of claims 10-17.
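The iterative simulation recited in claims 10 and 13 (conduct a simulated time step based on the action, determine a further action, and terminate on an abort condition or after a predetermined time) can be illustrated with the following minimal sketch. The `ToySimulation` class, the `NO_OP` encoding, the placeholder reward, and the fixed policy are all hypothetical; counter-engagement by targets and the learning update of the agent component are deliberately omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

NO_OP = None  # hypothetical encoding of a no-operation command


@dataclass
class ToySimulation:
    """Deliberately trivial stand-in for the simulated environment (400)."""
    targets_remaining: int
    effectors_alive: int
    sensors_alive: int = 1

    def step(self, actions: List[Optional[int]]) -> int:
        """Advance one time step and return the number of targets engaged.

        Assumption: every command other than NO_OP engages exactly one target.
        """
        shots = sum(1 for action in actions if action is not NO_OP)
        engaged = min(shots, self.targets_remaining)
        self.targets_remaining -= engaged
        return engaged

    def aborted(self) -> bool:
        """Abort when all targets are engaged or no sensor or effector is left."""
        all_engaged = self.targets_remaining == 0
        all_lost = self.effectors_alive == 0 or self.sensors_alive == 0
        return all_engaged or all_lost


Policy = Callable[[ToySimulation], List[Optional[int]]]


def run_episode(sim: ToySimulation, policy: Policy, max_steps: int) -> List[float]:
    """Roll out one episode and return the per-step rewards."""
    rewards: List[float] = []
    for _ in range(max_steps):           # predetermined time limit
        if sim.aborted():                # abort condition
            break
        actions = policy(sim)            # one command per engagement means
        engaged = sim.step(actions)      # simulated time step based on the action
        rewards.append(float(engaged))   # placeholder reward: targets engaged
    return rewards


def single_shooter_policy(sim: ToySimulation) -> List[Optional[int]]:
    """Hypothetical fixed policy: the first effector engages, the rest hold."""
    return [0] + [NO_OP] * (sim.effectors_alive - 1)
```

In a training setup, the rewards collected by `run_episode` would be used to update the agent component; here the policy is fixed so that only the control flow of the iteration is shown.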