A fire resource allocation method and device based on reinforcement learning and a storage medium
By employing a reinforcement learning-based fire resource allocation method and utilizing the Q-learning algorithm to optimize the allocation path, the problem of rapid and effective weapon fire resource allocation is solved. This enables rapid allocation when similar targets appear, reducing time complexity and improving the level of intelligence.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-07-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies make it difficult to quickly and effectively allocate weapon firepower resources, especially when similar targets appear, requiring reallocation, which leads to large computational loads and long decision-making times.
A fire resource allocation method based on reinforcement learning is adopted. By defining fire resource allocation elements, constructing an objective function under a constraint model, and using the Q-learning algorithm to iteratively optimize the allocation path, the allocation path of typical targets is established, and the allocation of similar targets is quickly realized.
The method finds the optimal strike path against enemy targets when friendly firepower resources are sufficient, reduces computation and decision time, and raises the level of intelligence. Because it takes actual constraints into account, it also has good applicability and scalability.
Figure CN116894559B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence, and specifically to a firepower resource allocation method, apparatus, and storage medium based on reinforcement learning.
Background Technology
[0002] The Weapon-Target Assignment (WTA) problem, also known as target allocation or fire allocation, studies how to allocate weapon units against enemy targets so as to achieve the best strike effect and optimize the fire strike system. It is one of the key aspects of command and control in modern warfare. The WTA problem is essentially a nonlinear combinatorial optimization decision problem whose solution space grows exponentially with the number of weapons and targets; it is an NP-complete problem. A reasonable fire allocation scheme optimizes the use of resources so that the maximum battlefield benefit is obtained at the minimum cost. The fire allocation problem has long been a research hotspot in the military field. Existing literature has established minimum-loss optimization models and solved them with modern intelligent optimization algorithms; the methods differ slightly in their constraints and optimization algorithms. Current technologies can only produce an allocation for one specific target; when a similar target appears, the allocation must be computed again. Therefore, an urgent problem is how to develop a reinforcement learning-based fire resource allocation model that not only solves the allocation for typical targets but also learns an allocation path, so that similar targets can be allocated quickly.
Summary of the Invention
[0003] In view of this, the present invention provides a fire resource allocation method based on reinforcement learning, comprising:
[0004] Define the elements of fire resource allocation, including: the number, type, location, and damage status of attack sub-targets; the number of platforms, types, and quantities of available fire resources; and the list of combat missions;
[0005] Based on the elements of firepower resource allocation, construct an objective function for the cost-effectiveness ratio of firepower resources under the firepower resource allocation constraint model;
[0006] Analyze the historical data of the firepower resource allocation scheme, and establish a solution space for the firepower resource allocation scheme under the constraints of the firepower resource allocation constraint model;
[0007] Using the allocation schemes in the solution space as states, the transitions between different states as paths, and constructing a reward function based on the objective function of the cost-effectiveness ratio and the damage requirements to the target, the optimal firepower resource allocation scheme path is obtained through continuous iteration using the Q-learning algorithm.
[0008] When the same type of target appears, starting from the path of the optimal fire resource allocation scheme, a new target allocation scheme is given in the opposite direction of the optimal search path.
[0009] Specifically, the firepower resource allocation constraint model includes the definition of multiple sets of constraints:
[0010] The constraints include the range and altitude at which firepower resources can be attacked, the matching constraints between firepower resources and sub-target types, the matching constraints between firepower resources and the battlefield environment, the constraints on the actual firepower resources deployed in each mission allocation, and the operational effectiveness constraints based on target damage requirements and key sub-target damage requirements.
[0011] Specifically, the constraints on the attack range and altitude of firepower resources are as follows: let the minimum and maximum effective ranges of the i-th resource on the k-th resource platform be $d_{ki}^{\min}$ and $d_{ki}^{\max}$, and let its maximum effective height be $h_{ki}^{\max}$; let the distance between the i-th resource of the k-th resource platform and the j-th sub-target be $d_{kij}$, and let the height of the j-th sub-target attacked by that resource be $h_{kij}$. Then
[0012] $d_{ki}^{\min} \le d_{kij} \le d_{ki}^{\max}$ (1)
[0013] $h_{kij} \le h_{ki}^{\max}$ (2)
[0014] The constraints for matching firepower resources with sub-target types include: selecting suitable resources based on a resource-target type matching table, in which ... indicates the matching relationship between a firepower resource and target t_j, satisfying
[0015]
[0016] The constraints for matching firepower resources with the battlefield environment include: establishing a matching table between firepower resources and the battlefield environment based on the influence of the environment on firepower resources; with ... indicating the matching relationship between a firepower resource and battlefield environment j, the constraint is:
[0017]
[0018] The constraints on the actual firepower resources deployed in each mission allocation include: the actual amount of firepower resources deployed in each mission allocation cannot exceed the planned amount of firepower resources deployed, i.e.
[0019]
[0020] where ... represents the planned deployment quantity of the i-th firepower resource on the k-th resource platform.
[0021] The operational effectiveness constraints based on target damage requirements and key sub-target damage requirements include: determining the overall operational effectiveness threshold and the effectiveness threshold of key sub-targets based on the target damage requirements and key sub-target damage requirements. Let the probability of the i-th resource detecting the j-th sub-target be ..., the probability of penetration be ..., and the effectiveness threshold of the combat mission be e; then
[0022]
[0023] where ... is the effectiveness of the i-th resource of the k-th resource platform in attacking the j-th sub-target.
[0024] Specifically, constructing the objective function for the cost-effectiveness ratio of fire resources under the fire resource allocation constraint model includes: based on the multiple sets of constraints, if the cost of launching one unit of the i-th type of resource on the k-th resource platform is ..., the cost-effectiveness ratio is expressed as:
[0025]
[0026] Under the given set of constraints, the objective function to maximize the cost-effectiveness ratio of the firepower resources is:
[0027]
[0028] Constrained by conditions
[0029] Specifically, constructing the reward function based on the objective function of the cost-effectiveness ratio and the damage requirement to the target includes:
[0030] The reward function for transitioning from state S to state S' is:
[0031]
[0032] Where f(S) is the cost-effectiveness ratio of state S, des(S) is the degree of damage of state S, and des(require) is the damage requirement for the target; for the part that does not meet the above constraints, its effectiveness is set to zero.
[0033] Specifically, a table Q records the value of each action in each state. All entries are initialized to 0 and then updated at every step. The update rule, derived from the Bellman equation, is:
[0034] Q(S,a) = Q(S,a) + α[r + γ max_{a'} Q(S',a') − Q(S,a)]
[0035] where:
[0036] S represents the current state, and a represents the action taken in the current state;
[0037] S' represents the next state caused by this action, and a' is the action to be taken in this new state;
[0038] r represents the reward for taking this action, γ is the discount factor, which determines how important future rewards are, and α is the learning rate.
[0039] Specifically, for targets t_1, t_2, ..., t_N in the target list, allocation schemes are selected sequentially according to priority. For any target that has no typical target of the same type, table Q is built with the Q-learning algorithm described above to obtain a typical target allocation path. The optimal solution in the final stable state of table Q is the optimal target allocation scheme.
[0040] Specifically, giving a new target allocation scheme in the reverse direction of the optimal search path, starting from the path of the optimal fire resource allocation scheme when a target of the same type appears, includes: for a target in the target list that already has a trained typical target allocation scheme of the same type, the search starts from the optimal solution of the typical scheme and proceeds in the reverse direction of the optimal path. When two consecutive steps yield no better solution, the local optimal solution found so far is set as the fire allocation scheme for the target.
[0041] The present invention also discloses a fire resource allocation device based on reinforcement learning, comprising:
[0042] The element definition module for fire resource allocation is used to define the elements of fire resource allocation. These elements include: the number, type, location, and damage status of attack sub-targets; the number of platforms, types, and quantities of available fire resources; and the list of combat missions.
[0043] The objective function construction module is used to construct an objective function for the cost-effectiveness ratio of fire resources under the fire resource allocation constraint model based on the elements of fire resource allocation.
[0044] The solution space establishment module is used to analyze the historical data of the fire resource allocation scheme and establish the solution space of the fire resource allocation scheme under the constraints of the fire resource allocation constraint model.
[0045] The path module for the optimal firepower resource allocation scheme is used to take the allocation scheme in the solution space as the state, take the transition between different states as the path, construct a reward function with the objective function of the cost-effectiveness ratio and the damage requirement to the target, and continuously iterate through the Q-learning reinforcement learning algorithm to obtain the path of the optimal firepower resource allocation scheme.
[0046] The new target allocation scheme determination module is used to, when the same type of target appears, start from the path of the optimal fire resource allocation scheme and give a new target allocation scheme in the opposite direction of the optimal search path.
[0047] The present invention also discloses a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the fire resource allocation method based on reinforcement learning as described above.
[0048] Beneficial effects:
[0049] 1. Through this invention, reinforcement learning algorithms are used to train the allocation paths of typical targets using fire resource allocation rules and historical allocation schemes. Then, the optimal allocation scheme and allocation path of typical targets are used to quickly obtain the allocation scheme of each target. By combining the fire resource allocation schemes of each target, a strike allocation scheme for this mission is obtained.
[0050] 2. This invention enables the determination of the optimal path for striking enemy targets when friendly firepower resources are sufficient. When the quantity of a certain type of ammunition is insufficient to support the optimal solution, a typical table Q of that type of target is reconstructed to allocate firepower to enemy targets. When subsequent targets are of the same type as previous targets, path allocation is performed based on the optimal path obtained for previous targets through reinforcement learning, thereby reducing computational load, shortening decision-making time, and improving intelligence.
[0051] 3. This invention establishes a model based on the optimal cost-effectiveness ratio as the optimization principle, while also considering several important constraints in real-world fire resource allocation, including operational distance and altitude constraints, target-platform matching, environmental constraints, fire resource constraints, and effectiveness constraints, making it more realistic. The cost-effectiveness model incorporates the probability of target detection and the penetration probability of the resource, making it more practically valuable than other fire resource optimization objective functions. Furthermore, the model can be appropriately supplemented with constraints based on actual conditions, exhibiting scalability and broader applicability.
[0052] 4. Through this invention, the optimal allocation scheme for a typical target and the path from each scheme to the optimal allocation scheme are first learned. Then, the allocation of targets of the same type is searched along the reverse path of the allocation path, and the local optimal solution of the previous n steps is set as the allocation scheme. Although the local optimal solution is used instead of the global optimal solution, some allocation rules have been learned in the initial training, such as which firepower resources have a high cost-effectiveness ratio for this type of target. Therefore, this solution is the global optimal solution most of the time.
[0053] 5. Through this invention, an NP problem with time complexity M^M is reduced to complexity n, where M is the number of types of firepower resources and n is the number of steps in the search. Furthermore, the amount of data that needs to be maintained is reduced. Traditional methods require maintaining the effectiveness of each type of firepower resource against each target, whereas now it is only necessary to look up the effectiveness of the firepower resources that are used against the target in the schemes already allocated.
Attached Figure Description
[0054] Figure 1 is a flowchart of the fire resource allocation method based on reinforcement learning proposed in this invention;
[0055] Figure 2 is a schematic diagram of a state transition rule in the solution set space proposed in this invention;
[0056] Figure 3 is a schematic diagram of the stack-based scheme transition rule proposed in this invention;
[0057] Figure 4 is a schematic diagram of the fire resource allocation device based on reinforcement learning proposed in this invention.
Detailed Implementation
[0058] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0059] As shown in Figure 1, this invention provides a firepower resource allocation method based on reinforcement learning, which includes:
[0060] Step 1: Define the elements of firepower resource allocation, which include: the number, type, location, and damage status of attack sub-targets; the number of platforms, types, and quantities of available firepower resources; and the list of combat missions.
[0061] Firepower resource allocation refers to assigning corresponding firepower resources to the target list. The allocation process must consider operational distance and altitude constraints, target-platform compatibility, firepower resource constraints, and effectiveness constraints, aiming to maximize cost-effectiveness while meeting these constraints.
[0062] (1) Target System
[0063] Suppose there are N sub-targets in the fire strike target system, denoted as t_1, t_2, ..., t_N. Each target has the following characteristics: number, target type (affecting missile-target matching), location (longitude, latitude, altitude), damage status S (value between 0 and 1), and value coefficient (value between 0 and 1).
[0064] (2) Firepower Resources
[0065] Suppose there are M resource platforms in total and S types of resources. The resources of the k-th resource platform are denoted ..., and their quantities are respectively ...
[0066] (3) Task List
[0067] A mission list refers to the combat mission assigned by superiors and the degree of damage to be achieved. A mission list includes the following: the planned targets, the degree of damage to be achieved, the planned firepower resources deployed, the start time of the firepower strike, the end time of the firepower strike, and the battlefield environment.
[0068] (4) Multi-platform firepower resource allocation
[0069] Each target may be attacked with multiple resources. Once the preset level of damage is reached, no further attack missions are assigned to it. The optimal allocation strategy maximizes the cost-effectiveness ratio under the various constraints. A decision variable indicates whether the i-th resource of the k-th resource platform is used to attack the j-th sub-target; it takes the value 0 or 1.
[0070] Step 2: Construct an objective function for the cost-effectiveness ratio of fire resources under the fire resource allocation constraint model based on the elements of fire resource allocation;
[0071] (1) Distance and height constraints
[0072] Let the minimum and maximum effective distances of the i-th resource on the k-th resource platform be $d_{ki}^{\min}$ and $d_{ki}^{\max}$, and let its maximum effective height be $h_{ki}^{\max}$. Let the distance between the i-th resource of the k-th resource platform and the j-th sub-target be $d_{kij}$, and let the height of the j-th sub-target attacked by that resource be $h_{kij}$. Then
[0073] $d_{ki}^{\min} \le d_{kij} \le d_{ki}^{\max}$ (1)
[0074] $h_{kij} \le h_{ki}^{\max}$ (2)
[0075] (2) Resource matching constraints
[0076] Based on the resource-target type matching table, select the appropriate resource. With ... representing the matching relationship between a resource and target t_j, the following must be satisfied:
[0077]
[0078] (3) Environmental constraints
[0079] Based on the environmental impact on firepower resources, a matching table between firepower resources and the battlefield environment is established. With ... representing the matching relationship between a resource and battlefield environment j, the constraint is:
[0080]
[0081] (4) Firepower resource constraints
[0082] The actual amount of firepower resources deployed in each mission allocation cannot exceed the planned amount of firepower resources deployed, that is:
[0083]
[0084] where ... represents the planned deployment quantity of the i-th firepower resource on the k-th resource platform.
[0085] (5) Combat effectiveness constraints
[0086] Based on the target damage requirements and the damage requirements of key sub-targets, determine the overall effectiveness threshold of the combat mission and the effectiveness threshold of key sub-targets. Let the probability of the i-th resource detecting the j-th sub-target be ..., the probability of penetration be ..., and the effectiveness threshold of the combat mission be e; then:
[0087]
[0088] If the cost of launching one unit of the i-th type of resource on the k-th resource platform is ..., the cost-effectiveness ratio is expressed as:
[0089]
[0090] where ... is the effectiveness of the i-th resource of the k-th resource platform in attacking the j-th sub-target. Under the constraints, the goal is to maximize the cost-effectiveness ratio:
[0091] max f (8)
[0092] The final model is
[0093]
[0094] Constrained by conditions
[0095] Step 3: Analyze the historical data of the firepower resource allocation scheme, and establish a solution space for the firepower resource allocation scheme under the constraints of the firepower resource allocation constraint model;
[0096] Traditional firepower resource allocation algorithms often require explicit rules and pre-defined strategies, as well as a sufficient understanding of each target and the support of historical damage plans, which increases complexity and instability. This patent proposes a reinforcement learning approach based on the Q-learning algorithm. By modeling historical allocation plans and military rules, it establishes a solution space whose cost-effectiveness can be quantified. Each allocation plan is treated as a state, and transition rules (that is, action rules) are established between allocation plans so that the optimal state transition path can be obtained for each state. The Q-learning algorithm builds search paths for typical targets from existing data. When a target of the same type appears, an allocation plan for the new target is obtained quickly by starting from the optimal solution of the typical target of that type and moving in the reverse direction of the optimal search path.
[0097] To find the optimal solution for a given target, its solution space must first be defined. In the extreme case where all resources can be freely combined to form a scheme, the number of allocation schemes in the solution space grows exponentially with the number of resources, and the advantage of the allocation model over manual methods is not apparent. This method therefore analyzes historical data to introduce constraints that reduce the number of allocation schemes in the solution space. To keep the solution space general, each constraint must hold for at least one whole type of target rather than for only a few individual targets. For example, suppose there are two resources, A and B, each with two units, and prior knowledge says that any target can be destroyed with at most two units. The solution space then contains six allocation schemes: [A], [B], [AA], [AB], [BA], and [BB].
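The six-scheme example can be reproduced with a short enumeration. This is only an illustrative sketch: the function name `enumerate_schemes`, the dictionary input format, and the treatment of a scheme as an ordered string of resource letters are assumptions made here, not details given in the patent.

```python
from collections import Counter
from itertools import product


def enumerate_schemes(stock, max_units):
    """List ordered allocation schemes that respect stock and length limits.

    stock: mapping of resource name -> available units (assumed input format).
    max_units: prior-knowledge cap on the units spent on one target.
    """
    schemes = []
    for length in range(1, max_units + 1):
        for combo in product(sorted(stock), repeat=length):
            used = Counter(combo)
            # Discard schemes that need more units of a resource than are in stock.
            if all(used[name] <= stock[name] for name in used):
                schemes.append("".join(combo))
    return schemes


schemes = enumerate_schemes({"A": 2, "B": 2}, max_units=2)
print(schemes)  # ['A', 'B', 'AA', 'AB', 'BA', 'BB']
```

Without the cap of two units per target, the number of ordered schemes would grow with every extra unit, which is the exponential growth that the constraints are meant to limit.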
[0098] Step 4: Using the allocation schemes in the solution space as states, the transitions between different states as paths, and constructing a reward function based on the objective function of the cost-effectiveness ratio and the damage requirements to the target, the optimal firepower resource allocation scheme path is obtained through continuous iteration using the Q-learning algorithm.
[0099] The state transition rules between allocation schemes correspond to the actions in Q-learning. The simplest state transition rule links all states into a chain in some fixed order, so that each state can only transition to its neighbors in the chain. For the solution space above, one such rule is shown in Figure 2, where the arrows indicate the directions in which an allocation scheme can be transferred.
[0100] The above state transition rules are simple, but the transition path between states is limited. For example, it takes 4 steps to transition between relatively similar allocation schemes B and BB, which is not conducive to the transition between similar states.
[0101] Therefore, based on resource allocation principles, a stack-based scheme transition rule is proposed: each state can only add one resource at the end of the allocation scheme or delete the last resource, as shown in Figure 3.
[0102] Compared with the simple chain rule, the stack-based rule needs fewer steps to move between similar allocation schemes and has a clearer structure. For example, the transition between BA and BB takes only two steps. There are also more complex rules, such as allowing a transition between any two states or allowing any resource to be replaced by another; they are not described in this patent. A good rule has two core properties: (1) the number of steps between similar states should be small, which mainly helps fine-tune the allocation scheme for targets of the same type; (2) each state should have few transition actions, that is, few edges connecting states, which helps keep learning stable.
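The step counts quoted for the two rules can be checked with a breadth-first search. The sketch below rests on two assumptions that the text does not state, since Figures 2 and 3 are not reproduced here: the chain links the schemes in the order A, B, AA, AB, BA, BB, and the stack rule includes an empty scheme as a root so that every scheme is reachable.

```python
from collections import deque

SCHEMES = ["A", "B", "AA", "AB", "BA", "BB"]  # assumed chain order
STACK_STATES = set(SCHEMES) | {""}            # "" is an assumed empty root scheme


def chain_neighbors(state):
    """Chain rule: a scheme links only to its neighbours in the fixed order."""
    i = SCHEMES.index(state)
    return [SCHEMES[j] for j in (i - 1, i + 1) if 0 <= j < len(SCHEMES)]


def stack_neighbors(state):
    """Stack rule: append one resource at the end, or delete the last one."""
    pushed = [state + r for r in "AB" if state + r in STACK_STATES]
    popped = [state[:-1]] if state else []
    return pushed + popped


def steps_between(start, goal, neighbors):
    """Breadth-first search for the fewest transitions from start to goal."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable under this rule


print(steps_between("B", "BB", chain_neighbors))   # 4
print(steps_between("BA", "BB", stack_neighbors))  # 2
```

Under these assumptions the chain needs four steps between B and BB, while the stack rule moves between BA and BB in two steps (delete A, then add B).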
[0103] Once the solution set space and the allocation scheme transition rules are fixed, the states and actions of Q-learning are determined: each allocation scheme in the solution set space is a state, and the agent's actions are the state transition rules. Here the stack transition rules are selected as actions. In addition to adding and deleting a resource, a stay action that leaves the state unchanged is also defined. Table Q is established as shown in Table 1.
[0104]
[0105]
[0106] 1) Reward function
[0107] To find the optimal cost-effectiveness destruction plan and obtain the optimal search path for other plans, and to facilitate the rapid search for an approximate optimal plan along the inverse path from the optimal solution of a typical target when similar targets appear, the reward function for the transition from state S to S' is:
[0108]
[0109] Where f(S) is the cost-effectiveness ratio of state S, calculated using the cost-effectiveness ratio model described above. It is worth noting that the effectiveness of the part that does not meet the constraints is set to zero. des(S) is the degree of damage in state S, and des(require) is the damage requirement for the target.
[0110] 2) State update
[0111] When learning the optimal path for each state, Q-learning needs an explicit reward for each action taken in order to guide the update. A table Q records the value of each action in each state. All entries are initialized to 0 and then updated at every step with the rule derived from the Bellman equation:
[0112] Q(S,a) = Q(S,a) + α[r + γ max_{a'} Q(S',a') − Q(S,a)] (11)
[0113] where:
[0114] S represents the current state, and a represents the action taken in the current state;
[0115] S' represents the next state caused by this action, and a' is the action to be taken in this new state;
[0116] r represents the reward for taking this action, γ is the discount factor, which determines how important the future reward is, and α is the learning rate.
[0117] As the formula shows, each update moves Q(S,a) toward the immediate reward r plus the discounted maximum future value γ max Q(S',a'), at a rate set by the learning rate α.
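A single application of update rule (11) can be written as a few lines of Python. This is a minimal sketch: the nested-dictionary layout of table Q, the action names, and the numeric values of α, γ, and the reward are illustrative assumptions, not values from the patent.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[next_state].values())  # max over a' of Q(S', a')
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Table Q initialised to 0 for two states with three actions each.
actions = ("add", "delete", "stay")
q = {s: {a: 0.0 for a in actions} for s in ("B", "BB")}
q["BB"]["stay"] = 1.0  # pretend an earlier step already learned this value

new_value = q_update(q, "B", "add", reward=0.2, next_state="BB")
print(new_value)
```

With these numbers the entry for state B and action add moves from 0 halfway toward 0.2 + 0.9 × 1.0, that is, to 0.5 × 1.1 = 0.55.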
[0118] By incorporating military models into the Q-Learning algorithm and using the cost-effectiveness ratio as a reward, a continuous iterative effect can be achieved.
[0119] 3) Allocation scheme selection
[0120] For targets t_1, t_2, ..., t_N in the target list, allocation schemes are selected sequentially according to priority. For any target t_i, if there is no typical target of the same type as t_i, a stable table Q is built using the Q-learning method described above, which yields a typical target allocation path. In the final stable state of table Q, the allocation scheme whose best action points to itself is the optimal allocation.
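Reading the optimum off a stable table Q can be illustrated on the two-resource example. The sketch below makes several assumptions that are not from the patent: the per-scheme scores in `SCORE` are invented, the reward is simply the score of the scheme entered (the patent's own reward formula is not reproduced here), an empty scheme is added as a root, and the table is filled by deterministic sweeps over all state-action pairs instead of exploratory episodes.

```python
SCHEMES = ["", "A", "B", "AA", "AB", "BA", "BB"]  # "" is an assumed empty root
ACTIONS = ["add_A", "add_B", "delete", "stay"]

# Hypothetical cost-effectiveness score per scheme (not from the patent).
SCORE = {"": 0.0, "A": 0.3, "B": 0.5, "AA": 0.4, "AB": 0.6, "BA": 0.9, "BB": 0.7}


def step(state, action):
    """Stack transition rule; an illegal move leaves the scheme unchanged."""
    if action == "stay":
        return state
    nxt = state[:-1] if action == "delete" else state + action[-1]
    return nxt if nxt in SCHEMES else state


def train(sweeps=500, alpha=0.5, gamma=0.9):
    """Fill table Q by repeatedly sweeping every state-action pair."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in SCHEMES}
    for _ in range(sweeps):
        for s in SCHEMES:
            for a in ACTIONS:
                s2 = step(s, a)
                r = SCORE[s2]  # assumed reward: score of the scheme entered
                q[s][a] += alpha * (r + gamma * max(q[s2].values()) - q[s][a])
    return q


def greedy_next(q, state):
    """Scheme reached by the highest-valued action in `state`."""
    return step(state, max(q[state], key=q[state].get))


def greedy_path(q, start, limit=10):
    """Follow the best actions from `start` until a scheme points to itself."""
    path = [start]
    while len(path) <= limit:
        nxt = greedy_next(q, path[-1])
        if nxt == path[-1]:
            break
        path.append(nxt)
    return path


q = train()
optimal = [s for s in SCHEMES if greedy_next(q, s) == s]
print(optimal)               # ['BA']
print(greedy_path(q, "AA"))  # ['AA', 'A', '', 'B', 'BA']
```

With these invented scores, BA is the only scheme whose best action leads back to itself, and every other scheme has a greedy path that ends there, which corresponds to the allocation path of a typical target.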
[0121] Step 5: When the same type of target appears, start from the path of the optimal fire resource allocation scheme and quickly give the allocation scheme of the new target in the opposite direction of the optimal search path.
[0122] If t_i has a trained typical target allocation scheme of the same type, start from the optimal solution of the typical scheme and search in reverse along the optimal path. When two consecutive steps yield no better solution, the local optimal solution found so far is set as the fire allocation scheme for the target. In the special case where the resources required by that fire allocation scheme are exhausted, the allocation path should be re-established.
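The reverse search with a two-step stopping rule might look like the following. This is a sketch under stated assumptions: the learned path is taken to be a single list of schemes ending at the typical optimum, and both the path and the scores of the new target are invented for illustration; the name `reverse_search` and the `patience` parameter are hypothetical.

```python
def reverse_search(path_to_optimum, score, patience=2):
    """Walk backwards from the typical optimum and keep the best scheme seen.

    path_to_optimum: schemes on the learned path, with the optimum last.
    score: callable rating a scheme for the new target (hypothetical).
    patience: stop after this many consecutive steps without improvement.
    """
    reversed_path = list(reversed(path_to_optimum))
    best = reversed_path[0]
    best_score = score(best)
    misses = 0
    for scheme in reversed_path[1:]:
        value = score(scheme)
        if value > best_score:
            best, best_score, misses = scheme, value, 0
        else:
            misses += 1
            if misses >= patience:
                break  # two steps in a row brought no better solution
    return best


# Hypothetical learned path for the typical target; "" is an empty scheme.
typical_path = ["AA", "A", "", "B", "BA"]
# Hypothetical scores of the same schemes against a new target of the same type.
new_target_score = {"AA": 0.9, "A": 0.2, "": 0.0, "B": 0.8, "BA": 0.6}.get

print(reverse_search(typical_path, new_target_score))  # B
```

In this example the search improves from BA to B, then stops after two steps without improvement and never evaluates AA, so the result is a local rather than a global optimum, which is the trade-off the method accepts in exchange for a short search.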
[0123] The present invention also discloses a fire resource allocation device based on reinforcement learning, comprising:
[0124] The element definition module for fire resource allocation is used to define the elements of fire resource allocation. These elements include: the number, type, location, and damage status of attack sub-targets, the number of platforms with available fire resources, the types and quantities of fire resources, and the list of combat missions.
[0125] Firepower resource allocation refers to assigning corresponding firepower resources to the target list. The allocation process must consider operational distance and altitude constraints, target-platform compatibility, firepower resource constraints, and effectiveness constraints, aiming to maximize cost-effectiveness while meeting these constraints.
[0126] (1) Target System
[0127] Suppose there are N sub-targets in the fire strike target system, denoted as t_1, t_2, ..., t_N. Each target has the following characteristics: number, target type (affecting missile-target matching), location (longitude, latitude, altitude), damage status S (value between 0 and 1), and value coefficient (value between 0 and 1).
[0128] (2) Firepower Resources
[0129] Suppose there are M resource platforms in total and S types of resources. The resources of the k-th resource platform are denoted ..., and their quantities are respectively ...
[0130] (3) Task List
[0131] A mission list refers to the combat mission assigned by superiors and the degree of damage to be achieved. A mission list includes the following: the planned targets, the degree of damage to be achieved, the planned firepower resources deployed, the start time of the firepower strike, the end time of the firepower strike, and the battlefield environment.
[0132] (4) Multi-platform firepower resource allocation
[0133] Each target may be attacked with multiple resources. Once the preset level of damage is reached, no further attack missions are assigned to it. The optimal allocation strategy maximizes the cost-effectiveness ratio under the various constraints. A decision variable indicates whether the i-th resource of the k-th resource platform is used to attack the j-th sub-target; it takes the value 0 or 1.
[0134] The objective function construction module is used to construct an objective function for the cost-effectiveness ratio of fire resources under the fire resource allocation constraint model based on the elements of fire resource allocation.
[0135] (1) Distance and height constraints
[0136] Let the minimum and maximum effective distances of the i-th resource on the k-th resource platform be $d_{ki}^{\min}$ and $d_{ki}^{\max}$, and let its maximum effective height be $h_{ki}^{\max}$. Let the distance between the i-th resource of the k-th resource platform and the j-th sub-target be $d_{kij}$, and let the height of the j-th sub-target attacked by that resource be $h_{kij}$. Then
[0137] $d_{ki}^{\min} \le d_{kij} \le d_{ki}^{\max}$ (1)
[0138] $h_{kij} \le h_{ki}^{\max}$ (2)
[0139] (2) Resource matching constraints
[0140] Based on the resource-target type matching table, select the appropriate resource. With ... representing the matching relationship between a resource and target t_j, the following must be satisfied:
[0141]
[0142] (3) Environmental constraints
[0143] Based on the environmental impact on firepower resources, a matching table between firepower resources and the battlefield environment is established. With ... representing the matching relationship between a resource and battlefield environment j, the constraint is:
[0144]
[0145] (4) Firepower resource constraints
[0146] The actual amount of firepower resources deployed in each mission allocation cannot exceed the planned amount of firepower resources deployed, that is...
[0147]
[0148] Here the upper bound is the planned deployment quantity of the i-th firepower resource on the k-th resource platform.
[0149] (5) Combat effectiveness constraints
[0150] Based on the target damage requirements and the damage requirements of key sub-targets, determine the overall effectiveness threshold of the combat mission and the effectiveness thresholds of key sub-targets. Given the probability that the i-th resource detects the j-th sub-target, the probability of penetration, and the effectiveness threshold e of the combat mission, we have
[0151]
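Constraints (1) to (4) above can be read as a feasibility test on a single resource-to-sub-target assignment. The following is a minimal sketch under stated assumptions, not the patented model: the `Resource` class, its field names, and the boolean matching flags are hypothetical, and the matching constraints are assumed to mean "the table entry allows this pairing". Constraint (5) is left out because its formula is not recoverable from the text.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical record for one resource type on one platform."""
    d_min: float   # minimum effective distance
    d_max: float   # maximum effective distance
    h_max: float   # maximum effective height
    planned: int   # planned deployment quantity


def is_feasible(res, distance, height, type_match, env_match, used):
    """Check constraints (1)-(4) for one assignment.

    type_match / env_match are assumed to be lookups into the
    resource-target and resource-environment matching tables.
    """
    in_range = res.d_min <= distance <= res.d_max   # (1) distance
    below_ceiling = height <= res.h_max             # (1) height
    within_plan = used <= res.planned               # (4) quantity
    return in_range and below_ceiling and type_match and env_match and within_plan


example = Resource(d_min=5.0, d_max=50.0, h_max=10.0, planned=2)
print(is_feasible(example, distance=20.0, height=3.0,
                  type_match=True, env_match=True, used=1))
```

Assignments that fail this test would simply be excluded from the solution space before any learning takes place.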
[0152] Given the cost of launching the i-th type of resource on the k-th resource platform, the cost-effectiveness ratio is expressed as:
[0153]
[0154] Under constraints, the goal is to maximize the cost-effectiveness ratio.
[0155] $$\max f \quad (8)$$
[0156] The final model is
[0157]
[0158] Constraints:
[0159] The solution space establishment module is used to analyze the historical data of the fire resource allocation scheme and establish the solution space of the fire resource allocation scheme under the constraints of the fire resource allocation constraint model.
[0160] Traditional firepower resource allocation algorithms often require explicit rules and pre-defined strategies, as well as sufficient understanding of each target and support from historical damage plans. This leads to increased complexity and instability. This patent proposes a Q-learning algorithm based on reinforcement learning. By modeling historical allocation plans and military rules, it establishes a solution space with quantifiable cost-effectiveness. Each allocation plan is treated as a state, and transition rules, i.e., action rules, are established between allocation plans to obtain the optimal state transition path for each state. The Q-learning algorithm is used to establish search paths for typical targets based on existing data. When the same type of target appears, it quickly provides an allocation plan for the new target by starting from the optimal solution for that type of typical target and moving in the reverse direction of the optimal search path.
[0161] To find the optimal solution for a given target, we first need to define its solution space. In the extreme case where all resources can be freely combined to form a solution, the number of allocation schemes in the solution space grows exponentially with the number of resources, and the advantage of the allocation model over manual methods is not apparent. This method therefore introduces constraints, derived from historical data, to reduce the number of allocation schemes in the solution space. To keep the solution space general, each constraint must hold for at least one whole class of targets, not just for a few individual targets. For example, if there are two resources, A and B, each with two units, and prior knowledge says that each target can be destroyed with at most two units, then the solution space has six possible allocation schemes: [A], [B], [AA], [AB], [BA], and [BB].
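The six-scheme example can be reproduced by enumerating ordered sequences of resources up to the unit limit. This is a small sketch for illustration; the function name `solution_space` and the `stock` dictionary are assumptions, not part of the original text.

```python
from itertools import product


def solution_space(stock, max_units):
    """Enumerate ordered allocation schemes of 1..max_units resources.

    stock maps a resource name to the number of units available, so a
    scheme may not use a resource more often than its stock allows.
    """
    schemes = []
    for length in range(1, max_units + 1):
        for combo in product(sorted(stock), repeat=length):
            if all(combo.count(r) <= stock[r] for r in stock):
                schemes.append("".join(combo))
    return schemes


print(solution_space({"A": 2, "B": 2}, max_units=2))
# ['A', 'B', 'AA', 'AB', 'BA', 'BB']
```

Without the two-unit limit the same stock would also admit schemes of length three and four, which is the growth the constraint is meant to cut off.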
[0162] The path module for the optimal firepower resource allocation scheme is used to take the allocation scheme in the solution space as the state, take the transition between different states as the path, construct a reward function with the objective function of the cost-effectiveness ratio and the damage requirement to the target, and continuously iterate through the Q-learning algorithm to obtain the path of the optimal firepower resource allocation scheme.
[0163] The state transition rules between allocation schemes correspond to the actions in Q-learning. The simplest state transition rule chains all states together in a fixed order, so that each state can only transition to its neighbours in the chain. Taking the solution space above as an example, one such rule is shown in Figure 2, where the arrows represent the directions in which the allocation scheme can transition.
[0164] The above state transition rules are simple, but the transition path between states is limited. For example, it takes 4 steps to transition between relatively similar allocation schemes B and BB, which is not conducive to the transition between similar states.
[0165] Therefore, based on resource allocation principles, a stack-based scheme transition rule is proposed: each state can only add one resource at the end of the allocation scheme or delete the last resource, as shown in Figure 3.
[0166] Compared with the simple chain rule, the stack-based rule needs fewer steps to move between similar allocation schemes and has a clearer structure. For example, the transition between BA and BB requires only two steps. There are other, more complex rules, such as allowing a transition between any two states or allowing any resource to be replaced by another resource; they are not described here. A good rule has two core properties: (1) the number of steps between similar states should be small, which helps fine-tune the allocation scheme when attacking targets of the same type; (2) the number of transition actions for each state, and hence the number of edges connecting states, should be small, which helps keep learning stable.
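The step counts quoted for the two rules can be checked with a breadth-first search over each transition graph. The sketch below assumes the chain follows the order [A], [B], [AA], [AB], [BA], [BB] and can be walked in both directions (the figure is not available, so this is a guess), and it represents a scheme as a string such as "BA". All function names are illustrative.

```python
from collections import deque

CHAIN_ORDER = ["A", "B", "AA", "AB", "BA", "BB"]


def chain_neighbors(state):
    """Neighbours under the simple chain rule (assumed bidirectional)."""
    i = CHAIN_ORDER.index(state)
    return [CHAIN_ORDER[j] for j in (i - 1, i + 1) if 0 <= j < len(CHAIN_ORDER)]


def stack_neighbors(state, resources="AB", max_len=2):
    """Neighbours under the stack rule: push one resource or pop the last."""
    out = [state + r for r in resources if len(state) < max_len]
    if state:
        out.append(state[:-1])
    return out


def steps(start, goal, neighbors):
    """Minimum number of transitions from start to goal, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(steps("B", "BB", chain_neighbors))   # 4
print(steps("BA", "BB", stack_neighbors))  # 2
print(steps("B", "BB", stack_neighbors))   # 1
```

Under the stack rule each state has at most three neighbours here, which matches the second design goal of keeping the number of edges per state small.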
[0167] Once the solution space of allocation schemes and the transition rules between schemes are fixed, the states and actions of Q-learning are determined: each allocation scheme in the solution space is a state, and the agent's actions are the state transition rules. Here the stack transition rules are selected as actions. In addition to adding (+) and removing (-) a resource, a stay action (=) that leaves the state unchanged is also defined. The Q-table is established as shown in Table 1.
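[0168] Table 1

| State | A+ | A- | A= | B+ | B- | B= |
|---|---|---|---|---|---|---|
| [] | [A] | - | [] | [B] | - | [] |
| [A] | [AA] | [] | [A] | [AB] | - | [A] |
| [B] | [BA] | - | [B] | [BB] | [] | [B] |
| [AA] | - | [A] | [AA] | - | - | [AA] |
| [AB] | - | - | [AB] | - | [A] | [AB] |
| [BA] | - | [B] | [BA] | - | - | [BA] |
| [BB] | - | - | [BB] | - | [B] | [BB] |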
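The layout of Table 1 follows mechanically from the stack rule, so it can be generated rather than written by hand. The sketch below is one possible encoding, with schemes as strings and `None` standing for a "-" cell; the names `next_state`, `ACTIONS`, and `STATES` are illustrative. Under the append-at-end rule, applying A+ to [B] yields [BA].

```python
ACTIONS = ["A+", "A-", "A=", "B+", "B-", "B="]
STATES = ["", "A", "B", "AA", "AB", "BA", "BB"]


def next_state(state, action, max_len=2):
    """Return the state reached by an action, or None if it is not allowed."""
    resource, op = action[0], action[1]
    if op == "=":                      # stay in place
        return state
    if op == "+":                      # push the resource onto the end
        return state + resource if len(state) < max_len else None
    if op == "-":                      # pop only if it is the last resource
        return state[:-1] if state.endswith(resource) else None
    raise ValueError(f"unknown action: {action}")


table = {s: {a: next_state(s, a) for a in ACTIONS} for s in STATES}

for s in STATES:
    row = ["[" + table[s][a] + "]" if table[s][a] is not None else "-" for a in ACTIONS]
    print("[" + s + "]", *row)
```

A Q-table with the same rows and columns then only needs entries for the cells that are not `None`.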
[0169] 1) Reward function
[0170] To find the optimal cost-effectiveness destruction plan and obtain the optimal search path for other plans, and to facilitate the rapid search for an approximate optimal plan along the inverse path from the optimal solution of a typical target when similar targets appear, the reward function for the transition from state S to S' is:
[0171]
[0172] Where f(S) is the cost-effectiveness ratio of state S, calculated using the cost-effectiveness ratio model described above, des(S) is the degree of damage in state S, and des(require) is the damage requirement for the target. It is worth noting that the effectiveness of the part that does not meet the constraints is set to zero.
[0173] 2) State update
[0174] To learn the optimal path from each state, Q-learning needs an explicit record of the return obtained after taking a specific action, which guides the state updates. For this purpose, a Q-table records the value of each state-action pair. All entries are initialized to 0 and then updated at every step using the Bellman equation:
[0175] $$Q(S,a) = Q(S,a) + \alpha\left[r + \gamma \max_{a'} Q(S',a') - Q(S,a)\right] \quad (11)$$
[0176] where:
[0177] S represents the current state, and a represents the action taken in the current state;
[0178] S' represents the next state caused by this action, and a' is the action to be taken in this new state;
[0179] r represents the reward for taking this action, γ is the discount factor, which determines how important the future reward is, and α is the learning rate.
[0180] As the formula shows, each update moves the Q value of (S, a) toward the immediate reward plus the discounted best future value, at a rate set by α.
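A single application of update (11) can be written in a few lines. This is a minimal sketch: the dictionary-based Q-table keyed by (state, action), the function name `q_update`, and the numeric values are assumptions chosen for illustration, and the reward is passed in directly rather than computed from the cost-effectiveness model.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Apply one step of update (11) to the entry Q[(s, a)] in place."""
    # Best value reachable from the next state; 0.0 if it has no actions.
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)          # every entry starts at 0
Q[("A", "A=")] = 1.0            # pretend one entry has already been learned
updated = q_update(Q, "", "A+", r=0.5, s_next="A",
                   next_actions=["A+", "A-", "A="])
print(round(updated, 4))  # 0.14
```

Here the new value is 0.1 × (0.5 + 0.9 × 1.0 − 0) = 0.14: the entry moves one tenth of the way from 0 toward the target of 1.4.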
[0181] By incorporating the military model into the Q-learning algorithm and using the cost-effectiveness ratio as the reward, the allocation scheme is improved through continuous iteration.
[0182] 3) Allocation scheme selection
[0183] For the targets t_1, t_2, ..., t_N in the target list, allocation schemes are selected sequentially according to priority. For any target t_i, if there is no typical target of the same type as t_i, a stable Q-table is built using the Q-learning method described above, which yields a typical target allocation path. The optimal solution in the final stable state of the Q-table, that is, the allocation scheme whose best action points to itself, is the optimal allocation.
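Reading the optimal allocation out of a stable Q-table amounts to following the highest-valued action from a start state until the best action points back to the same state. The sketch below illustrates this with a hand-made three-state example; the `transitions` table, the Q values, and the function name `greedy_path` are all hypothetical.

```python
def greedy_path(Q, transitions, start, max_steps=50):
    """Follow the best-valued action until it maps a state to itself."""
    path, state = [start], start
    for _ in range(max_steps):
        # Only actions that lead somewhere are candidates.
        valid = [a for a, nxt in transitions[state].items() if nxt is not None]
        best = max(valid, key=lambda a: Q.get((state, a), 0.0))
        nxt = transitions[state][best]
        if nxt == state:               # best action points to itself
            break
        state = nxt
        path.append(state)
    return path


# Hypothetical single-resource example with made-up Q values.
transitions = {
    "":   {"A+": "A",  "A-": None, "A=": ""},
    "A":  {"A+": "AA", "A-": "",   "A=": "A"},
    "AA": {"A+": None, "A-": "A",  "A=": "AA"},
}
Q = {("", "A+"): 1.0, ("A", "A+"): 0.8, ("AA", "A="): 0.5}

print(greedy_path(Q, transitions, ""))  # ['', 'A', 'AA']
```

The last element of the returned path is the allocation scheme taken as optimal, and the path itself is what a later search for a similar target would walk in reverse.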
[0184] The new target allocation scheme determination module is used to quickly provide a new target allocation scheme when the same type of target appears, starting from the path of the optimal fire resource allocation scheme and moving in the opposite direction of the optimal search path.
[0185] If a trained typical target allocation scheme of the same type as t_i already exists, start from the optimal solution of the typical scheme and search in reverse along the optimal path. When no better solution has been found in the previous two steps, the local optimum at that point is taken as the fire allocation scheme for the target. In the special case where the resources required by this fire allocation scheme are exhausted, the allocation path should be re-established.
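The reverse search with its two-step stopping rule can be sketched as a walk backwards over the stored optimal path. This is an illustration only: the path, the scores for the new target, and the names `reverse_search` and `patience` are assumptions, and the resource-exhaustion special case is not handled.

```python
def reverse_search(optimal_path, score, patience=2):
    """Walk the optimal path backwards and keep the best-scoring scheme.

    Stops once `patience` consecutive steps bring no improvement.
    """
    candidates = list(reversed(optimal_path))
    best, best_score = candidates[0], score(candidates[0])
    stale = 0
    for state in candidates[1:]:
        current = score(state)
        if current > best_score:
            best, best_score, stale = state, current, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


# Hypothetical path learned for a typical target, ending at its optimum.
typical_path = ["", "A", "AA", "AAB", "AABB"]
# Made-up cost-effectiveness scores of each scheme against the new target.
new_target_scores = {"": 0.0, "A": 0.3, "AA": 0.6, "AAB": 0.7, "AABB": 0.4}

print(reverse_search(typical_path, new_target_scores.get))  # AAB
```

In this example the search evaluates four schemes instead of retraining a Q-table, which is where the claimed saving in decision time comes from.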
[0186] The present invention also discloses a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the fire resource allocation method based on reinforcement learning described above.
[0187] In summary, the above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
[0188] It will be apparent to those skilled in the art that the embodiments of the present invention are not limited to the details of the exemplary embodiments described above, and that the embodiments of the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the embodiments of the present invention. Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of the embodiments of the present invention is defined by the appended claims rather than the foregoing description. Therefore, all variations falling within the meaning and scope of equivalents of the claims are intended to be encompassed within the embodiments of the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units, modules, or devices recited in the system, apparatus, or terminal claims may also be implemented by the same unit, module, or device through software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any particular order.
[0189] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention and are not intended to limit them. Although the embodiments of the present invention have been described in detail with reference to the above preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions to the technical solutions of the embodiments of the present invention should not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A fire resource allocation method based on reinforcement learning, characterized in that it comprises:
defining the elements of fire resource allocation, including: the number, type, location, and damage status of attack sub-targets; and the number of platforms, types, quantities, and operational mission lists of available fire resources;
constructing, based on the elements of firepower resource allocation, an objective function for the cost-effectiveness ratio of firepower resources under the firepower resource allocation constraint model;
analyzing the historical data of the firepower resource allocation scheme, and establishing a solution space for the firepower resource allocation scheme under the constraints of the firepower resource allocation constraint model;
taking the allocation schemes in the solution space as states and the transitions between different states as paths, constructing a reward function from the cost-effectiveness objective function and the damage requirement for the target, and obtaining the optimal firepower resource allocation scheme path through continuous iteration of the Q-learning reinforcement learning algorithm;
when a target of the same type appears, starting from the path of the optimal fire resource allocation scheme and giving a new target allocation scheme in the reverse direction of the optimal search path.
The firepower resource allocation constraint model includes the definition of multiple sets of constraints: constraints on the attack range and altitude of firepower resources, constraints on the matching between firepower resources and sub-target types, constraints on the matching between firepower resources and the battlefield environment, constraints on the actual firepower resources deployed in each mission allocation, and operational effectiveness constraints based on target damage requirements and key sub-target damage requirements.
The constraint on the attack range and height of the firepower resources is as follows: for the i-th resource on the k-th resource platform, given its minimum and maximum effective ranges, its maximum effective height, the distance between that resource and the j-th sub-target, and the height of the j-th sub-target it attacks, the distance must lie between the minimum and maximum effective ranges and the height must not exceed the maximum effective height.
The constraints for matching firepower resources with sub-target types include: selecting suitable resources based on a resource-target type matching table, in which a matching indicator expresses the matching relationship between a firepower resource and a target.
The constraints for matching firepower resources with the battlefield environment include: establishing a matching table between firepower resources and the battlefield environment based on the influence of the environment on firepower resources, in which a matching indicator expresses the matching relationship between a firepower resource and the battlefield environment.
The constraints on the actual firepower resources deployed in each mission allocation include: the actual amount of firepower resources deployed in each mission allocation cannot exceed the planned deployment quantity of the i-th firepower resource on the k-th resource platform.
The operational effectiveness constraints based on target damage requirements and key sub-target damage requirements include: determining the overall effectiveness threshold of the operational mission and the effectiveness thresholds of key sub-targets based on the target damage requirements and key sub-target damage requirements. The constraint is expressed in terms of the probability that the i-th resource detects the j-th sub-target, the probability of penetration, the effectiveness threshold of the combat mission, and the effectiveness of the i-th resource of the k-th resource platform in striking the j-th sub-target.
2. The fire resource allocation method based on reinforcement learning as described in claim 1, characterized in that constructing the objective function for the cost-effectiveness ratio of fire resources under the fire resource allocation constraint model includes: based on the multiple sets of constraints, expressing the cost-effectiveness ratio in terms of the cost of launching the i-th type of resource from the k-th resource platform; under the given set of constraints, the objective function maximizes the cost-effectiveness ratio of the firepower resources, subject to the constraint conditions.
3. The fire resource allocation method based on reinforcement learning as described in claim 1, characterized in that constructing the reward function based on the objective function of the cost-effectiveness ratio and the damage requirement for the target includes: defining the reward function for the transition from state S to state S' in terms of f(S), the cost-effectiveness ratio of state S; des(S), the degree of damage in state S; and des(require), the damage requirement for the target; for any part that does not meet the constraints, its effectiveness is set to zero.
4. The fire resource allocation method based on reinforcement learning as described in claim 1, characterized in that a Q-table is used to record the value of each state for the corresponding action; it is initialized to 0 and then updated at each step using the Bellman equation $Q(S,a) = Q(S,a) + \alpha\left[r + \gamma \max_{a'} Q(S',a') - Q(S,a)\right]$, where: S represents the current state and a the action taken in the current state; S' represents the next state resulting from this action and a' the action taken in this new state; r represents the reward received for taking this action; γ is the discount factor, which determines how important future rewards are; and α is the learning rate.
5. The fire resource allocation method based on reinforcement learning as described in claim 4, characterized in that, for the targets t_1, t_2, ..., t_N in the target list, allocation schemes are selected sequentially according to priority; for any target, if there is no typical target of the same type, the Q-table is built using the Q-learning reinforcement learning algorithm described above to obtain a typical target allocation path; the optimal solution in the final stable state of the Q-table is the optimal target allocation scheme.
6. The fire resource allocation method based on reinforcement learning as described in claim 5, characterized in that, When the same type of target appears, starting from the path of the optimal fire resource allocation scheme, a new target allocation scheme is given in the reverse direction of the optimal search path, including: for targets in the target list, if there are already trained typical target allocation schemes of the same type, then starting from the optimal solution, the optimal solution is searched in the reverse direction of the optimal path. When there is no better solution in the previous two steps, the local optimal solution at this time is set as the target fire allocation scheme.
7. A fire resource allocation device based on reinforcement learning, characterized in that it comprises:
an element definition module for fire resource allocation, used to define the elements of fire resource allocation, including: the number, type, location, and damage status of attack sub-targets; the number of platforms, types, and quantities of available fire resources; and the list of combat missions;
an objective function construction module, used to construct an objective function for the cost-effectiveness ratio of fire resources under the fire resource allocation constraint model based on the elements of fire resource allocation;
a solution space establishment module, used to analyze the historical data of the fire resource allocation scheme and establish the solution space of the fire resource allocation scheme under the constraints of the fire resource allocation constraint model;
a path module for the optimal firepower resource allocation scheme, used to take the allocation schemes in the solution space as states and the transitions between different states as paths, construct a reward function from the cost-effectiveness objective function and the damage requirement for the target, and obtain the path of the optimal firepower resource allocation scheme through continuous iteration of the Q-learning reinforcement learning algorithm;
a new target allocation scheme determination module, used to, when a target of the same type appears, start from the path of the optimal fire resource allocation scheme and give a new target allocation scheme in the reverse direction of the optimal search path.
The firepower resource allocation constraint model includes the definition of multiple sets of constraints: constraints on the attack range and altitude of firepower resources, constraints on the matching between firepower resources and sub-target types, constraints on the matching between firepower resources and the battlefield environment, constraints on the actual firepower resources deployed in each mission allocation, and operational effectiveness constraints based on target damage requirements and key sub-target damage requirements.
The constraint on the attack range and height of the firepower resources is as follows: for the i-th resource on the k-th resource platform, given its minimum and maximum effective ranges, its maximum effective height, the distance between that resource and the j-th sub-target, and the height of the j-th sub-target it attacks, the distance must lie between the minimum and maximum effective ranges and the height must not exceed the maximum effective height.
The constraints for matching firepower resources with sub-target types include: selecting suitable resources based on a resource-target type matching table, in which a matching indicator expresses the matching relationship between a firepower resource and a target.
The constraints for matching firepower resources with the battlefield environment include: establishing a matching table between firepower resources and the battlefield environment based on the influence of the environment on firepower resources, in which a matching indicator expresses the matching relationship between a firepower resource and the battlefield environment.
The constraints on the actual firepower resources deployed in each mission allocation include: the actual amount of firepower resources deployed in each mission allocation cannot exceed the planned deployment quantity of the i-th firepower resource on the k-th resource platform.
The operational effectiveness constraints based on target damage requirements and key sub-target damage requirements include: determining the overall effectiveness threshold of the operational mission and the effectiveness thresholds of key sub-targets based on the target damage requirements and key sub-target damage requirements. The constraint is expressed in terms of the probability that the i-th resource detects the j-th sub-target, the probability of penetration, the effectiveness threshold of the combat mission, and the effectiveness of the i-th resource of the k-th resource platform in striking the j-th sub-target.
8. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the fire resource allocation method based on reinforcement learning as described in any one of claims 1 to 6.