An Intelligent Decision-Making Method Based on a Large Reinforcement Learning Model
By introducing a decision certificate generation mechanism and a hierarchical QMIX network, collaborative optimization and verifiable control of complex multi-sub-decision coupling problems are achieved, solving the problems of difficult verification and constraint control of decision results in existing technologies, and improving the interpretability and adaptive security of decisions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents (China)
- Current Assignee / Owner
- JIANGSU DINGFENG CLOUD COMPUTING CO LTD
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies cannot uniformly model the dependencies, constraint propagation relationships, and assertion triggering relationships among multiple sub-decisions in complex, multi-dimensional coupled decision-making scenarios. As a result, decision results are difficult to verify and constrain, and without an adaptive update mechanism, global optimization and secure control are difficult to achieve.
The method introduces a decision certificate generation mechanism, a certificate-conditional hierarchical QMIX joint value modeling method, and a counterexample-driven dynamic constraint shield update mechanism. By generating structured decision plans and decision certificates and performing constraint-aware reconstruction and consistency verification, it forms a closed-loop optimization process of decision, verification, repair, and constraint evolution.
The method improves the interpretability and verifiability of the decision-making process, enhances the system's adaptive security control capabilities, and improves global optimization capability and decision stability in complex, multi-dimensional coupled decision-making scenarios.
Figure CN121998025B_ABST
Description
Technical Field
[0001] This invention relates to the field of reinforcement learning technology, and in particular to an intelligent decision-making method based on a large reinforcement learning model.
Background Technology
[0002] With the development of artificial intelligence technology, reinforcement learning-based intelligent decision-making methods have been widely applied in scenarios such as task scheduling, resource allocation, intelligent control, automated operation and maintenance, and risk decision-making. Existing technologies typically construct state, action, and reward mechanisms to enable decision-making models to continuously optimize strategies through ongoing interaction with the environment, thereby achieving automated decision-making for complex tasks. With the development of deep learning technology, utilizing neural networks to represent high-dimensional state information and combining them with reinforcement learning algorithms to improve decision-making performance has become an important development direction in the field of intelligent decision-making. In some complex applications, technologies are also attempting to introduce large models to semantically understand environmental states or generate preliminary decision schemes to enhance the decision-making system's ability to process unstructured data and complex contexts.
[0003] However, existing technologies still have significant shortcomings when dealing with complex, multi-dimensional, coupled decision-making scenarios. On the one hand, traditional reinforcement learning methods typically model the overall decision as a single action or simply break down multiple sub-tasks, lacking a unified modeling capability for the dependencies, constraint propagation relationships, and assertion triggering relationships between multiple sub-decisions. This results in limited joint optimization effects and makes it difficult to obtain globally optimal decision results. On the other hand, the decision generation process in existing technologies is mostly a black-box output. Even when deep neural networks or large models are introduced, they usually only generate action or policy results, lacking preconditions, invariants, and post-assertions bound to sub-decision steps. This makes it difficult to verify the decision results and prevents consistency verification and safety constraint control of the decision process before execution.
[0004] Existing constraint handling methods mostly employ static rule filtering or fixed penalty term design, making it difficult to adaptively update based on violations and failure modes exposed during actual execution. When the decision result violates the constraints, existing methods typically only provide a failure result or a re-search action, lacking precise location of the violating sub-decision unit, local repair, and constraint incremental generation mechanisms based on counterexample information. This prevents the system from forming a closed-loop optimization process of decision-making, verification, repair, and constraint evolution. Existing technologies struggle to simultaneously achieve policy optimality, constraint controllability, and decision interpretability, failing to meet the application requirements for verifiable collaborative optimization and adaptive safety control in complex multi-sub-decision coupling scenarios.
[0005] Therefore, how to provide an intelligent decision-making method based on a large reinforcement learning model is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0006] One objective of this invention is to propose an intelligent decision-making method based on a large reinforcement learning model. This invention introduces a decision certificate generation mechanism, a certificate-conditional hierarchical QMIX joint value modeling method, and a counterexample-driven dynamic constraint shield update mechanism to achieve collaborative optimization and verifiable control of complex multi-sub-decision coupling problems. It constructs a complete closed-loop decision-making process, from decision generation and joint evaluation through consistency verification to adaptive constraint update, and offers strong decision interpretability, high constraint control capability, good joint optimization, and strong system security and stability.
[0007] An intelligent decision-making method based on a large reinforcement learning model according to an embodiment of the present invention includes:
[0008] Acquire environmental state data, process the environmental state data, and obtain a unified semantic state representation;
[0009] The semantic state representation is input into a large reinforcement learning model, and a structured decision plan and corresponding decision certificate are output in parallel through a plan generation head and a certificate generation head.
[0010] Based on the mapping relationship in the structured decision plan and decision certificate, constraint-aware reconstruction of each sub-decision step is carried out to form sub-decision units. According to the action type, parameter range and associated constraints of each sub-decision unit, an action candidate set is generated, a dependency graph between sub-decision units is established, and a joint action space is formed.
[0011] The candidate action combinations in the joint action space are used as evaluation objects. Based on the semantic state representation, the structured decision plan, the decision certificate and the sub-decision unit dependency graph, the joint value and predicted repair cost of each candidate action combination are calculated through a certificate-conditional hierarchical QMIX network, and the optimal action combination is determined.
[0012] The optimal action combination and decision certificate are input into the constraint shield for consistency verification. If the verification passes, the optimal action combination is executed. If the verification fails, the optimal action combination is minimally repaired according to the verification result to obtain the repair action and generate the corresponding counterexample information.
[0013] Execute the optimal action combination or repair action, obtain environmental feedback data, update the large reinforcement learning model and the hierarchical QMIX network based on the environmental feedback data, and generate constraint increments based on counterexample information to update the constraint shield.
[0014] Optionally, the environmental state data includes structured data and unstructured data. The structured data includes system operating parameters, resource status information, task attribute information, and time series data, while the unstructured data includes text data, log data, and event description data.
[0015] Optionally, the processing of environmental state data to obtain a unified semantic state representation includes:
[0016] Structured data is numerically normalized and vectorized, while unstructured data is text-encoded. The processed structured and unstructured features are then fused and represented uniformly through an encoding network to obtain a semantic state representation.
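The preprocessing in [0016] can be illustrated with a minimal sketch. It is not the patented implementation: the field names (`cpu_load`, `queue_len`), the hashed bag-of-words text encoder, and the use of plain concatenation in place of the encoding network are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def normalize(values: Dict[str, float],
              bounds: Dict[str, Tuple[float, float]]) -> List[float]:
    """Min-max normalize each structured field into [0, 1], in sorted key order."""
    out = []
    for key in sorted(values):
        lo, hi = bounds[key]
        span = hi - lo
        x = (values[key] - lo) / span if span > 0 else 0.0
        out.append(min(1.0, max(0.0, x)))
    return out


def encode_text(text: str, dim: int = 8) -> List[float]:
    """Hashed bag-of-words vector; a stand-in for a learned text encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def fuse(structured: List[float], text_vec: List[float]) -> List[float]:
    """Concatenation stands in for the encoding network that fuses both feature sets."""
    return structured + text_vec


# Hypothetical environment snapshot: two operating parameters and one log line.
state = fuse(
    normalize({"cpu_load": 75.0, "queue_len": 20.0},
              {"cpu_load": (0.0, 100.0), "queue_len": (0.0, 40.0)}),
    encode_text("disk latency alarm on node 3"),
)
```

In a real system the fused vector would be passed through a trained encoder; the sketch only shows how heterogeneous inputs end up in one fixed-length representation.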
[0017] Optionally, the step of outputting the structured decision plan and corresponding decision certificate in parallel through the plan generation head and certificate generation head includes:
[0018] The semantic state representation is input into the backbone network of the large reinforcement learning model for feature extraction to obtain the state feature representation used for decision generation. The backbone network of the large model adopts the Transformer architecture.
[0019] The state feature representation is input into the plan generation head connected to the backbone network of the large model to generate a structured decision plan. The structured decision plan includes multiple sequentially arranged sub-decision steps, the action type corresponding to each sub-decision step, the parameter range, and the dependencies between the steps.
[0020] The state feature representation is input into the certificate generation head connected to the backbone network of the large model to generate a decision certificate corresponding to the structured decision plan. The decision certificate includes preconditions, invariants, post-assertions and constraint references corresponding to each sub-decision step.
[0021] Based on the structured decision plan and the decision certificate, establish a mapping relationship between each sub-decision step and its corresponding preconditions, invariants and post-assertions, and store the mapping relationship in association with the structured decision plan and the decision certificate;
[0022] Based on environmental feedback data, the parameters of the large reinforcement learning model are updated through a reinforcement learning update unit connected to the backbone network of the large model. The reinforcement learning update unit uses an actor-critic approach, and the parameters are updated by low-rank adaptation.
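The source names low-rank adaptation and an actor-critic update but gives no formulas. As a hedged illustration, the standard low-rank adaptation parameterization and a common advantage-weighted policy gradient are shown below. The symbols $W_0$, $A$, $B$, $r$, $\alpha$, $\pi_\theta$, and $\hat{\delta}_t$ (the critic's advantage estimate) are assumptions introduced here, not notation from the source.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

\nabla_{\theta} J(\theta) \approx
\mathbb{E}_t\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{\delta}_t \right],
\qquad \theta = \{A, B\}
```

Under this reading, the pretrained backbone weights $W_0$ stay frozen and only the low-rank factors $A$ and $B$ receive gradients, which keeps the number of trainable parameters small.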
[0023] Optionally, the generation of the action candidate set, the establishment of a dependency graph between sub-decision units, and the formation of a joint action space include:
[0024] Read the sub-decision steps in the structured decision plan and the mapping relationship in the decision certificate, and extract the action type, parameter range, preconditions, invariants, post-assertions and constraint references corresponding to each sub-decision step;
[0025] For each sub-decision step, a constraint description fragment is generated based on the corresponding preconditions, invariants, and constraint references. The constraint description fragment is then bound to the action type, parameter range, and execution order of the corresponding sub-decision step to form a step-level constraint semantic unit.
[0026] Constraint-aware reconstruction of constraint semantic units at each step level, including:
[0027] According to the constraint type, constraint strength and constraint object in the constraint description fragment, the action elements, parameter elements and execution conditions in the original sub-decision steps are split. Action elements and parameter elements with the same constraint object and satisfying compatibility conditions are recombined. Action elements and parameter elements with constraint conflicts are isolated. The recombined action elements, parameter elements and execution conditions are used to generate corresponding sub-decision units.
[0028] For each sub-decision unit, a set of candidate actions is generated based on the reconstructed action elements, parameter elements, and execution conditions. The candidate action set is then constrained and filtered according to the corresponding preconditions and invariants, and candidate actions that do not meet the preconditions or violate the invariants are deleted.
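The candidate generation and filtering in [0028] can be sketched as follows. This is an illustrative assumption, not the source's procedure: the parameter range is discretized on an even grid, preconditions and invariants are modeled as plain Python predicates, and the `scale_out` action with its bounds is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    action_type: str
    value: float


Predicate = Callable[[Candidate], bool]


def generate_candidates(action_type: str, lo: float, hi: float,
                        steps: int) -> List[Candidate]:
    """Discretize the parameter range [lo, hi] into evenly spaced candidates."""
    if steps < 2:
        return [Candidate(action_type, lo)]
    width = (hi - lo) / (steps - 1)
    return [Candidate(action_type, lo + i * width) for i in range(steps)]


def filter_candidates(candidates: List[Candidate],
                      preconditions: List[Predicate],
                      invariants: List[Predicate]) -> List[Candidate]:
    """Drop candidates that fail any precondition or violate any invariant."""
    return [
        c for c in candidates
        if all(p(c) for p in preconditions) and all(inv(c) for inv in invariants)
    ]


# Hypothetical sub-decision unit: scale a service to between 0 and 8 replicas.
raw = generate_candidates("scale_out", 0.0, 8.0, steps=5)
kept = filter_candidates(
    raw,
    preconditions=[lambda c: c.value >= 2.0],  # assumed minimum replica count
    invariants=[lambda c: c.value <= 6.0],     # assumed resource budget
)
```

Here `raw` holds the values 0, 2, 4, 6, and 8, and `kept` retains only 2, 4, and 6.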
[0029] Based on the step sequence relationships in the structured decision plan, the constraint propagation relationships between step-level constraint semantic units, and the constraint conflict relationships between sub-decision units, a dependency graph between sub-decision units is established, specifically as follows:
[0030] Extract the input constraints, output constraints, and assertion triggering conditions corresponding to each sub-decision unit;
[0031] Match the output constraints of one sub-decision unit with the input constraints of another sub-decision unit to establish constraint propagation edges;
[0032] Match the post-assertions of one sub-decision unit with the pre-conditions of another sub-decision unit to establish an assertion trigger edge;
[0033] By superimposing the constraint propagation edges and assertion trigger edges, a dependency graph between sub-decision units is formed that contains sequential dependencies, constraint dependencies, and assertion dependencies. A joint action space is generated based on the action candidate set of each sub-decision unit and the dependency graph.
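The edge-matching steps in [0030] to [0033] can be sketched as below. Several simplifications are assumptions: constraints and assertions are matched by exact string equality, sequential edges simply follow plan order, and the joint action space is a plain Cartesian product without graph-based pruning. The unit names and constraint labels are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, List, Set, Tuple


@dataclass
class Unit:
    name: str
    input_constraints: FrozenSet[str]
    output_constraints: FrozenSet[str]
    preconditions: FrozenSet[str]
    post_assertions: FrozenSet[str]
    candidates: List[str]


Edge = Tuple[str, str, str]  # (source unit, target unit, edge kind)


def build_dependency_graph(units: List[Unit]) -> Set[Edge]:
    edges: Set[Edge] = set()
    # Sequential dependencies follow the order of the plan.
    for a, b in zip(units, units[1:]):
        edges.add((a.name, b.name, "sequence"))
    for src in units:
        for dst in units:
            if src.name == dst.name:
                continue
            # Output constraints of one unit feed input constraints of another.
            if src.output_constraints & dst.input_constraints:
                edges.add((src.name, dst.name, "constraint_propagation"))
            # Post-assertions of one unit satisfy preconditions of another.
            if src.post_assertions & dst.preconditions:
                edges.add((src.name, dst.name, "assertion_trigger"))
    return edges


def joint_action_space(units: List[Unit]) -> List[Tuple[str, ...]]:
    """Cartesian product of the per-unit candidate sets (no pruning in this sketch)."""
    return list(product(*(u.candidates for u in units)))


units = [
    Unit("allocate", frozenset(), frozenset({"mem_reserved"}),
         frozenset(), frozenset({"vm_ready"}), ["small", "large"]),
    Unit("deploy", frozenset({"mem_reserved"}), frozenset(),
         frozenset({"vm_ready"}), frozenset(), ["v1", "v2", "v3"]),
]
edges = build_dependency_graph(units)
space = joint_action_space(units)
```

The three edge kinds are kept as labels on a single edge set, which mirrors the superposition described in [0033].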
[0034] Optionally, the step of calculating the joint value and predicted repair cost of each candidate action combination and determining the optimal action combination through a certificate-conditional hierarchical QMIX network includes:
[0035] The candidate action combinations in the joint action space are used as evaluation objects. The local observation information of the sub-decision unit corresponding to each candidate action combination, the semantic state representation, the structured decision plan, the decision certificate, and the adjacency relationship information in the sub-decision unit dependency graph are extracted. The hierarchical QMIX network includes a certificate-conditional local value network, an intra-group mixing layer, a global mixing layer, and a repair cost prediction branch.
[0036] By using the certificate-conditional local value network, the local observation information of each sub-decision unit, the semantic state representation, the preconditions and invariants bound to the corresponding sub-decision unit in the decision certificate, and the adjacency relationship information are fused and processed to obtain the conditional local value of each sub-decision unit under the corresponding candidate action combination.
[0037] Through the intra-group mixing layer, sub-decision units with strong dependencies are grouped based on the dependency strength, constraint propagation relationships, and assertion triggering relationships in the sub-decision unit dependency graph, and the conditional local values of the sub-decision units within each group undergo first-layer mixing to obtain the group-level value corresponding to each group.
[0038] Through the global mixing layer, the group-level value corresponding to each group, the semantic state representation, the plan encoding result of the structured decision plan, and the certificate encoding result of the decision certificate are mixed in the second layer to obtain the joint value corresponding to each candidate action combination. Through the repair cost prediction branch, the repair cost prediction value corresponding to each candidate action combination is output according to the matching result between each candidate action combination and the preconditions, invariants, post-assertions and constraint references in the decision certificate.
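The two-level mixing in [0037] and [0038] can be sketched in plain Python. In QMIX the mixing weights are produced by hypernetworks and constrained to be non-negative, so the mixed value is monotonic in each local value. This sketch keeps only that monotonicity property: the weights are fixed numbers made non-negative with `abs`, no nonlinearity is applied, and the conditioning on state, plan, and certificate encodings is omitted. The grouping and all numeric values are hypothetical.

```python
from typing import Dict, List


def monotonic_mix(values: List[float], raw_weights: List[float],
                  bias: float) -> float:
    """Weighted sum with non-negative weights, so the output never
    decreases when any single input value increases."""
    return sum(abs(w) * v for w, v in zip(raw_weights, values)) + bias


def hierarchical_joint_value(local_q: Dict[str, float],
                             groups: Dict[str, List[str]],
                             group_weights: Dict[str, List[float]],
                             global_weights: List[float],
                             global_bias: float = 0.0) -> float:
    # First layer: mix the local values inside each dependency group.
    group_values = []
    for group_name in sorted(groups):
        member_q = [local_q[m] for m in groups[group_name]]
        group_values.append(
            monotonic_mix(member_q, group_weights[group_name], 0.0))
    # Second layer: mix the group-level values into one joint value.
    return monotonic_mix(group_values, global_weights, global_bias)


groups = {"g0": ["allocate", "deploy"], "g1": ["notify"]}
group_weights = {"g0": [0.5, -0.5], "g1": [1.0]}
global_weights = [1.0, 2.0]

joint = hierarchical_joint_value(
    {"allocate": 1.0, "deploy": 2.0, "notify": 0.5},
    groups, group_weights, global_weights, global_bias=0.1)

# Raising one local value can only raise (or keep) the joint value.
joint_higher = hierarchical_joint_value(
    {"allocate": 1.0, "deploy": 3.0, "notify": 0.5},
    groups, group_weights, global_weights, global_bias=0.1)
```

Monotonic mixing is what allows each sub-decision unit to pick its locally best action while remaining consistent with the jointly best combination.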
[0039] The candidate action combinations are ranked according to their joint value and predicted repair cost, and the combination with the highest joint value and the lowest predicted repair cost is determined as the optimal action combination.
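The source does not state how joint value and predicted repair cost are traded off when no single combination is best on both. One possible reading, shown here purely as an assumption, is a scalar score that subtracts a weighted repair cost from the joint value. The `cost_weight` parameter and the sample scores are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Scored(NamedTuple):
    combo: Tuple[str, ...]
    joint_value: float
    repair_cost: float


def select_optimal(scored: List[Scored], cost_weight: float = 1.0) -> Scored:
    """Pick the combination maximizing joint_value - cost_weight * repair_cost."""
    if not scored:
        raise ValueError("no candidate action combinations to rank")
    return max(scored, key=lambda s: s.joint_value - cost_weight * s.repair_cost)


scored = [
    Scored(("small", "v1"), 2.0, 0.1),
    Scored(("large", "v2"), 2.4, 0.9),
    Scored(("large", "v1"), 2.2, 0.2),
]
best = select_optimal(scored)                     # repair cost weighted equally
greedy = select_optimal(scored, cost_weight=0.0)  # repair cost ignored
```

With equal weighting the combination `("large", "v1")` wins; ignoring the repair cost selects `("large", "v2")`, which has the highest raw joint value.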
[0040] Optionally, the step of inputting the optimal action combination and decision certificate into the constraint shield for consistency verification includes:
[0041] A constraint shield is constructed based on the decision certificate and the current constraint set, specifically as follows:
[0042] Extract the preconditions, invariants, post-assertions, and constraint references from the decision certificate. Based on the mapping relationship between each sub-decision step and the corresponding constraints, combine the preconditions, invariants, and post-assertions associated with the same sub-decision unit into corresponding local constraint fragments.
[0043] Based on the constraint propagation relationship and assertion triggering relationship between each sub-decision unit in the sub-decision unit dependency graph, each local constraint fragment is chained and encapsulated in layers to form a local shield unit corresponding to each sub-decision unit and a global shield unit used to coordinate each local shield unit.
[0044] The optimal action combination is input into the constraint shield. Each local shield unit performs constraint matching verification on the action content, parameter values and execution order of the corresponding sub-decision unit. Each local shield unit performs pre-execution verification according to the corresponding preconditions, execution process verification according to the corresponding invariants, and execution result verification according to the corresponding post-assertions. The verification results of each local shield unit are then passed to the global shield unit.
[0045] The global shield unit performs consistency summary verification on the optimal action combination based on the verification results of each local shield unit, the constraint propagation relationship between each sub-decision unit, the assertion triggering relationship, and the constraint conflict relationship. It identifies the sub-decision units with constraint conflicts, the action content that triggers the conflict, the parameter values, the execution order, and the corresponding constraint references, and generates counterexample information.
[0046] When the optimal action combination fails the consistency summary check, the constraint shield performs minimal repair on the optimal action combination according to the counterexample information. After each repair, the corresponding local shield unit and global shield unit are called again for verification until a repair action that satisfies the current constraint set is generated.
[0047] When the optimal action combination passes the consistency summary check, the optimal action combination is output as the action to be executed; when the optimal action combination fails the consistency summary check, the repair action is output as the action to be executed.
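The verify-then-repair loop in [0044] to [0047] can be sketched as follows. This is a simplified illustration under stated assumptions: local and global checks are merged into one list of predicate constraints, the separate precondition, invariant, and post-assertion phases are not distinguished, and "minimal repair" is approximated by a greedy search that changes one violating unit at a time, preferring values closest to the original. The constraint references and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Combo = Dict[str, float]  # sub-decision unit name -> chosen parameter value


@dataclass(frozen=True)
class Counterexample:
    unit: str
    constraint_ref: str
    value: float


@dataclass
class Constraint:
    ref: str
    units: List[str]                 # units the constraint involves
    check: Callable[[Combo], bool]


def verify(combo: Combo, constraints: List[Constraint]) -> List[Counterexample]:
    """Return one counterexample per (violated constraint, involved unit)."""
    found: List[Counterexample] = []
    for c in constraints:
        if not c.check(combo):
            found.extend(Counterexample(u, c.ref, combo[u]) for u in c.units)
    return found


def minimal_repair(combo: Combo, constraints: List[Constraint],
                   candidates: Dict[str, List[float]],
                   max_rounds: int = 10) -> Optional[Combo]:
    repaired = dict(combo)
    for _ in range(max_rounds):
        violations = verify(repaired, constraints)
        if not violations:
            return repaired
        unit = violations[0].unit
        # Try values for the violating unit, closest to the original first.
        options = sorted(candidates[unit], key=lambda v: abs(v - combo[unit]))
        for value in options:
            trial = {**repaired, unit: value}
            if len(verify(trial, constraints)) < len(violations):
                repaired = trial
                break
        else:
            return None  # no single-unit change reduces the violations
    return repaired if not verify(repaired, constraints) else None


constraints = [
    Constraint("C-mem-cap", ["allocate"], lambda c: c["allocate"] <= 4.0),
    Constraint("C-total", ["allocate", "deploy"],
               lambda c: c["allocate"] + c["deploy"] <= 8.0),
]
combo = {"allocate": 8.0, "deploy": 6.0}
counterexamples = verify(combo, constraints)
repaired = minimal_repair(
    combo, constraints,
    candidates={"allocate": [2.0, 4.0, 8.0], "deploy": [2.0, 4.0, 6.0]})
```

Each repair step is re-verified against the full constraint list, matching the repeated verification described in [0046]. The greedy search is a heuristic and does not guarantee a globally minimal change.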
[0048] Optionally, the step of generating constraint increments based on counterexample information to update the constraint shield includes:
[0049] Execute the optimal action combination or repair action, and collect environmental feedback data after the action is executed. The environmental feedback data includes execution result data, state change data, reward feedback data, and constraint trigger data.
[0050] An update sample is constructed based on environmental feedback data. The update sample includes the semantic state representation corresponding to the current decision-making process, the structured decision plan, the decision certificate, the optimal action combination or repair action, the environmental feedback data, and the counterexample information.
[0051] The update samples are input into the reinforcement learning update unit to update the parameters of the large reinforcement learning model, and the backbone network, plan generation head and certificate generation head of the large model are updated collaboratively.
[0052] The update samples are input into the hierarchical QMIX network to update the parameters of the certificate-conditional local value network, intra-group mixing layer, global mixing layer, and repair cost prediction branch.
[0053] Based on the counterexample information, extract the violation sub-decision unit, conflict constraint reference, triggering condition, violation action content, violation parameter value, and violation execution order, generate constraint increments, and write them into the current constraint set to update the constraint shield.
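The constraint increment step in [0053] can be sketched as below. The source lists the fields extracted from a counterexample but not the form of the resulting constraint, so this sketch assumes the simplest one: each increment forbids the exact violating action, value, and execution order for a unit under the recorded triggering condition. All field values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Counterexample:
    unit: str
    constraint_ref: str
    action: str
    value: float
    order: int
    trigger: str


@dataclass(frozen=True)
class ConstraintIncrement:
    unit: str
    constraint_ref: str
    trigger: str
    forbidden_action: str
    forbidden_value: float
    forbidden_order: int


def to_increment(ce: Counterexample) -> ConstraintIncrement:
    """Turn one counterexample into a constraint that forbids repeating it."""
    return ConstraintIncrement(ce.unit, ce.constraint_ref, ce.trigger,
                               ce.action, ce.value, ce.order)


def update_constraint_set(constraint_set: Set[ConstraintIncrement],
                          counterexamples: List[Counterexample]) -> int:
    """Write increments into the current set; return how many were new."""
    before = len(constraint_set)
    constraint_set.update(to_increment(ce) for ce in counterexamples)
    return len(constraint_set) - before


def is_blocked(constraint_set: Set[ConstraintIncrement], unit: str,
               action: str, value: float, trigger: str) -> bool:
    return any(
        inc.unit == unit and inc.forbidden_action == action
        and inc.forbidden_value == value and inc.trigger == trigger
        for inc in constraint_set
    )


current: Set[ConstraintIncrement] = set()
ce_a = Counterexample("allocate", "C-mem-cap", "reserve_memory", 8.0, 0,
                      "node_memory_low")
ce_b = Counterexample("deploy", "C-total", "start_service", 6.0, 1,
                      "node_memory_low")
added = update_constraint_set(current, [ce_a, ce_a, ce_b])  # duplicate ignored
```

Storing increments in a set deduplicates repeated counterexamples, so the constraint set grows only when a new violation pattern is observed.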
[0054] The beneficial effects of this invention are:
[0055] This invention combines a structured decision plan generated by a large reinforcement learning model with a decision certificate, transforming the decision result from a black-box action output into a structured result containing sub-decision steps, constraints, execution assertions, and corresponding mapping relationships. Consequently, the system can perform constraint-aware reconstruction of each sub-decision step before execution and conduct consistency verification based on the decision certificate, significantly improving the interpretability, verifiability, and traceability of the decision process. This solves the problems of difficulty in verifying, explaining, and constraining complex decision results in existing technologies.
[0056] This invention constructs a certificate-conditional hierarchical QMIX network, integrating semantic state representation, structured decision plans, decision certificates, and sub-decision dependency graphs into the joint value evaluation process. This eliminates the need for simple splitting or independent optimization among multiple sub-decision units, enabling collaborative modeling and joint optimization through the combined effects of dependencies, constraint propagation relationships, and assertion triggering relationships. This invention effectively improves global optimization capabilities in complex, multi-dimensional coupled decision-making scenarios, ensuring that the obtained optimal action combinations better align with overall task objectives and local constraints, thus addressing the shortcomings of insufficient joint optimization capabilities and poor global optimality in existing technologies.
[0057] This invention uses a constraint shield to perform pre-execution consistency verification of the optimal action combination and generates constraint increments based on counterexample information, continuously updating the constraint shield. This allows the system to continuously adjust the constraint set based on violation patterns and conflict relationships exposed during actual execution, forming a closed-loop processing mechanism of decision generation, joint evaluation, consistency verification, minimal repair, and constraint evolution. This invention not only enables local repair when decisions fail or constraints are violated, reducing the overhead of overall recalculation, but also enhances the system's adaptive safety control capabilities in dynamic environments, improving decision stability, execution reliability, and long-term operational performance.
Attached Figure Description
[0058] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0059] Figure 1 is a flowchart of the intelligent decision-making method based on a large reinforcement learning model proposed in this invention;
[0060] Figure 2 is a schematic diagram of the structure of the certificate-conditional hierarchical QMIX network used in the intelligent decision-making method based on a large reinforcement learning model proposed in this invention.
Detailed Implementation
[0061] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0062] Referring to Figure 1 and Figure 2, an intelligent decision-making method based on a large reinforcement learning model includes:
[0063] Acquire environmental state data, process the environmental state data, and obtain a unified semantic state representation;
[0064] The semantic state representation is input into a large reinforcement learning model, and a structured decision plan and corresponding decision certificate are output in parallel through a plan generation head and a certificate generation head.
[0065] Based on the mapping relationship in the structured decision plan and decision certificate, constraint-aware reconstruction of each sub-decision step is carried out to form sub-decision units. According to the action type, parameter range and associated constraints of each sub-decision unit, an action candidate set is generated, a dependency graph between sub-decision units is established, and a joint action space is formed.
[0066] The candidate action combinations in the joint action space are used as evaluation objects. Based on the semantic state representation, the structured decision plan, the decision certificate and the sub-decision unit dependency graph, the joint value and predicted repair cost of each candidate action combination are calculated through a certificate-conditional hierarchical QMIX network, and the optimal action combination is determined.
[0067] The optimal action combination and decision certificate are input into the constraint shield for consistency verification. If the verification passes, the optimal action combination is executed. If the verification fails, the optimal action combination is minimally repaired according to the verification result to obtain the repair action and generate the corresponding counterexample information.
[0068] Execute the optimal action combination or repair action, obtain environmental feedback data, update the large reinforcement learning model and the hierarchical QMIX network based on the environmental feedback data, and generate constraint increments based on counterexample information to update the constraint shield.
[0069] In this embodiment, the environmental state data includes structured data and unstructured data. The structured data includes system operating parameters, resource status information, task attribute information, and time series data, while the unstructured data includes text data, log data, and event description data.
[0070] In this embodiment, the processing of environmental state data to obtain a unified semantic state representation includes:
[0071] Structured data is numerically normalized and vectorized, while unstructured data is text-encoded. The processed structured and unstructured features are then fused and represented uniformly through an encoding network to obtain a semantic state representation.
[0072] In this embodiment, the step of outputting a structured decision plan and a corresponding decision certificate in parallel through a plan generation head and a certificate generation head includes:
[0073] The semantic state representation is input into the backbone network of the large reinforcement learning model for feature extraction to obtain the state feature representation used for decision generation. The backbone network of the large model adopts the Transformer architecture.
[0074] The state feature representation is input into the plan generation head connected to the backbone network of the large model to generate a structured decision plan. The structured decision plan includes multiple sequentially arranged sub-decision steps, the action type corresponding to each sub-decision step, the parameter range, and the dependencies between the steps, wherein:
[0075] The plan generation head refers to the structured decision generation module connected to the backbone network of the large model. It is used to receive state feature representations, perform task semantic parsing on the state feature representations, extract step-level feature information related to the decision, generate multiple sub-decision steps based on the step-level feature information, identify the action type and determine the parameter range of each sub-decision step, arrange the sub-decision steps in sequence and perform correlation analysis in combination with the state features, encapsulate the sub-decision steps, corresponding action types, parameter ranges and dependencies between steps in a structured manner, and output a structured decision plan that meets the preset structure format.
[0076] The state feature representation is input into the certificate generation head connected to the backbone network of the large model to generate a decision certificate corresponding to the structured decision plan. The decision certificate includes preconditions, invariants, post-assertions and constraint references corresponding to each sub-decision step.
[0077] The certificate generation head performs the following operations:
[0078] Constraint semantics are extracted based on state feature representation and structured decision plan, and the pre-execution conditions, execution process constraints and execution result requirements of each sub-decision step in the current environment are identified, forming constraint semantic information associated with each sub-decision step;
[0079] The constraint semantic information is classified and bound to generate preconditions, invariants and post-assertions corresponding to each sub-decision step, and the corresponding constraint references are determined according to the source of the constraint. At the same time, the mapping relationship between each sub-decision step and the corresponding constraint is established.
[0080] The preconditions, invariants, post-assertions and constraint references corresponding to each sub-decision step are encapsulated in a structured manner to generate a decision certificate that corresponds one-to-one with the structured decision plan.
[0081] Based on the structured decision plan and the decision certificate, establish a mapping relationship between each sub-decision step and its corresponding preconditions, invariants, and post-assertions, and associate and store the mapping relationship with the structured decision plan and the decision certificate, wherein the mapping relationship is as follows:
[0082] Each sub-decision step is bound to its corresponding preconditions, invariants, and post-assertions to form a constraint association set indexed by the sub-decision step. The constraint association set records the preconditions that the sub-decision step must satisfy before execution, the invariants that must be maintained during execution, and the post-assertions that must be satisfied after execution.
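The constraint association set described in [0082] can be sketched with plain data classes. The class and field names are hypothetical, and constraints are represented as strings for brevity. The sketch also checks that every plan step has a certificate entry, in line with the one-to-one correspondence between plan and certificate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SubDecisionStep:
    step_id: str
    action_type: str
    param_range: Tuple[float, float]
    depends_on: List[str] = field(default_factory=list)


@dataclass
class CertificateEntry:
    step_id: str
    preconditions: List[str]
    invariants: List[str]
    post_assertions: List[str]
    constraint_refs: List[str]


def build_association(plan: List[SubDecisionStep],
                      certificate: List[CertificateEntry]
                      ) -> Dict[str, CertificateEntry]:
    """Index certificate entries by sub-decision step, in plan order."""
    by_step = {entry.step_id: entry for entry in certificate}
    missing = [s.step_id for s in plan if s.step_id not in by_step]
    if missing:
        raise ValueError(f"steps without certificate entries: {missing}")
    return {s.step_id: by_step[s.step_id] for s in plan}


plan = [
    SubDecisionStep("s1", "allocate", (0.0, 8.0)),
    SubDecisionStep("s2", "deploy", (1.0, 3.0), depends_on=["s1"]),
]
certificate = [
    CertificateEntry("s1", ["quota_available"], ["mem <= cap"],
                     ["vm_ready"], ["C-mem-cap"]),
    CertificateEntry("s2", ["vm_ready"], ["latency <= slo"],
                     ["service_up"], ["C-slo"]),
]
association = build_association(plan, certificate)
```

Looking up `association["s2"].preconditions` then returns the conditions that step `s2` must satisfy before execution.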
[0083] Based on environmental feedback data, the parameters of the large reinforcement learning model are updated through a reinforcement learning update unit connected to the backbone network of the large model. The reinforcement learning update unit uses an actor-critic approach, and the parameters are updated by low-rank adaptation.
[0084] In this embodiment, generating a candidate action set, establishing a dependency graph between sub-decision units, and forming a joint action space includes:
[0085] Read the sub-decision steps in the structured decision plan and the mapping relationship in the decision certificate, and extract the action type, parameter range, preconditions, invariants, post-assertions and constraint references corresponding to each sub-decision step;
[0086] For each sub-decision step, a constraint description fragment is generated based on the corresponding preconditions, invariants, and constraint references. This constraint description fragment is then bound to the action type, parameter range, and execution order of the corresponding sub-decision step to form a step-level constraint semantic unit. Specifically, the generation of the constraint description fragment involves:
[0087] Based on the preconditions, invariants, and constraint references corresponding to each sub-decision step, extract the constraint element information related to the sub-decision step;
[0088] The constraint element information is classified and processed to form pre-execution constraint information, execution process constraint information, and execution result constraint information.
[0089] The pre-execution constraint information, execution process constraint information, and execution result constraint information are combined according to a preset semantic structure to generate constraint description fragments for the corresponding sub-decision steps;
[0090] Constraint-aware reconstruction is performed on each step-level constraint semantic unit, including:
[0091] According to the constraint type, constraint strength and constraint object in the constraint description fragment, the action elements, parameter elements and execution conditions in the original sub-decision steps are split. Action elements and parameter elements with the same constraint object and satisfying compatibility conditions are recombined. Action elements and parameter elements with constraint conflicts are isolated. The recombined action elements, parameter elements and execution conditions are used to generate corresponding sub-decision units.
[0092] For each sub-decision unit, a set of candidate actions is generated based on the reconstructed action elements, parameter elements, and execution conditions. The candidate action set is then constrained and filtered according to the corresponding preconditions and invariants, and candidate actions that do not meet the preconditions or violate the invariants are deleted.
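The candidate generation and filtering in this paragraph can be sketched as follows, assuming discrete parameter ranges and constraints written as predicates over the current state and a candidate action. The function names, parameter names, and resource figures are hypothetical.

```python
from itertools import product


def generate_candidates(action_types, param_ranges):
    """Enumerate every action type with every combination of parameter values."""
    names = list(param_ranges)
    for action in action_types:
        for values in product(*(param_ranges[n] for n in names)):
            yield {"action": action, **dict(zip(names, values))}


def filter_candidates(candidates, preconditions, invariants, state):
    """Delete candidates that fail a precondition or would violate an invariant."""
    kept = []
    for cand in candidates:
        pre_ok = all(p(state, cand) for p in preconditions)
        inv_ok = all(i(state, cand) for i in invariants)
        if pre_ok and inv_ok:
            kept.append(cand)
    return kept


# Hypothetical sub-decision unit: assign a task with a CPU and memory request.
state = {"free_cpu": 8, "mem_used_gb": 40, "mem_total_gb": 64}
cands = generate_candidates(["assign"], {"cpu": [2, 4, 16], "mem_gb": [8, 32]})
pre = [lambda s, c: c["cpu"] <= s["free_cpu"]]
inv = [lambda s, c: s["mem_used_gb"] + c["mem_gb"] <= s["mem_total_gb"]]
kept = filter_candidates(cands, pre, inv, state)
print(kept)
```

In this example the 16-CPU requests fail the precondition and the 32 GB requests would break the memory invariant, so only two of the six candidates remain.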
[0093] Based on the step sequence relationships in the structured decision plan, the constraint propagation relationships between step-level constraint semantic units, and the constraint conflict relationships between sub-decision units, a dependency graph between sub-decision units is established, specifically as follows:
[0094] Extract the input constraints, output constraints, and assertion triggering conditions corresponding to each sub-decision unit;
[0095] Match the output constraints of one sub-decision unit with the input constraints of another sub-decision unit to establish constraint propagation edges;
[0096] Match the post-assertions of one sub-decision unit with the pre-conditions of another sub-decision unit to establish an assertion trigger edge;
[0097] By superimposing the constraint propagation edges and the assertion trigger edges, a dependency graph between sub-decision units is formed that contains sequential dependencies, constraint dependencies, and assertion dependencies. A joint action space is generated based on the candidate action set of each sub-decision unit and the dependency graph.
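The edge-matching steps above can be sketched as set intersections. This is a minimal sketch, assuming constraints and assertions are identified by string labels; the `Unit` class and the example labels are hypothetical. The joint action space is built here as a plain Cartesian product of the per-unit candidate sets, and any pruning that uses the dependency graph is left out.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Unit:
    name: str
    inputs: set = field(default_factory=set)           # input constraints
    outputs: set = field(default_factory=set)          # output constraints
    preconditions: set = field(default_factory=set)
    post_assertions: set = field(default_factory=set)
    candidates: list = field(default_factory=list)     # filtered candidate actions


def build_dependency_graph(units):
    """Return labelled edges (source, target, kind) between sub-decision units."""
    edges = set()
    for prev, nxt in zip(units, units[1:]):
        edges.add((prev.name, nxt.name, "sequence"))
    for src in units:
        for dst in units:
            if src is dst:
                continue
            if src.outputs & dst.inputs:
                edges.add((src.name, dst.name, "constraint_propagation"))
            if src.post_assertions & dst.preconditions:
                edges.add((src.name, dst.name, "assertion_trigger"))
    return edges


def joint_action_space(units):
    """Cartesian product of each unit's candidate set, keyed by unit name."""
    names = [u.name for u in units]
    return [dict(zip(names, combo)) for combo in product(*(u.candidates for u in units))]


select = Unit("select", outputs={"task_chosen"}, post_assertions={"task_ready"},
              candidates=["t1", "t2"])
allocate = Unit("allocate", inputs={"task_chosen"}, preconditions={"task_ready"},
                candidates=["n1", "n2", "n3"])

edges = build_dependency_graph([select, allocate])
space = joint_action_space([select, allocate])
print(sorted(edges))
print(len(space))
```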
[0098] In this embodiment, the step of calculating the joint value and predicted repair cost of each candidate action combination and determining the optimal action combination through a certificate-conditional hierarchical QMIX network includes:
[0099] The candidate action combinations in the joint action space are used as evaluation objects. For each candidate action combination, the local observation information of the corresponding sub-decision units, the semantic state representation, the structured decision plan, the decision certificate, and the adjacency relationship information in the sub-decision unit dependency graph are extracted. The hierarchical QMIX network includes a certificate-conditional local value network, an intra-group mixing layer, a global mixing layer, and a repair cost prediction branch.
[0100] Through the certificate-conditional local value network, the local observation information of each sub-decision unit, the semantic state representation, the preconditions and invariants bound to the corresponding sub-decision unit in the decision certificate, and the adjacency relationship information are fused to obtain the conditional local value of each sub-decision unit under the corresponding candidate action combination. Specifically, the generation of the conditional local value is as follows:
[0101] The local observation information and semantic state representation of each sub-decision unit are fused to obtain a basic feature representation that reflects the current environmental state and local decision information.
[0102] The basic feature representation is fused with the corresponding preconditions and invariants in the decision certificate to form a constraint feature. The constraint feature is then combined with the adjacency information of the sub-decision unit in the dependency graph to constrain the basic feature representation, yielding a conditional feature representation.
[0103] The conditional feature representation is input into the local value calculation module, and the conditional local value of each sub-decision unit under the corresponding candidate action combination is output.
[0104] Through the intra-group mixing layer, sub-decision units with strong dependencies are grouped based on the dependency strength, constraint propagation relationships, and assertion triggering relationships in the sub-decision unit dependency graph, and the conditional local values of the sub-decision units within each group undergo first-layer mixing to obtain the group-level value corresponding to each group.
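The grouping and first-layer mixing can be sketched as follows. Two assumptions are made that the embodiment does not state: groups are formed as connected components over edges whose dependency strength reaches a hypothetical threshold, and mixing follows the standard QMIX convention of non-negative mixing weights, so that the mixed value never decreases when a local value increases. In QMIX the weights come from hypernetworks conditioned on the state; fixed numbers stand in for those outputs here.

```python
def group_units(units, strength, threshold):
    """Union-find grouping: units linked by an edge with strength >= threshold share a group."""
    parent = {u: u for u in units}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (a, b), s in strength.items():
        if s >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for u in units:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())


def monotonic_mix(values, raw_weights, bias=0.0):
    """QMIX-style mixing: abs() keeps every weight non-negative."""
    return sum(abs(w) * v for w, v in zip(raw_weights, values)) + bias


# Hypothetical units, dependency strengths, local values, and raw weights.
units = ["select", "allocate", "prioritize"]
strength = {("select", "allocate"): 0.9, ("allocate", "prioritize"): 0.2}
local_q = {"select": 1.0, "allocate": 2.0, "prioritize": 0.5}
raw_w = {"select": -0.5, "allocate": 1.0, "prioritize": 2.0}

groups = group_units(units, strength, threshold=0.5)
group_values = [
    monotonic_mix([local_q[u] for u in g], [raw_w[u] for u in g]) for g in groups
]
print(groups, group_values)
```

The same `monotonic_mix` function could be applied a second time to the group-level values to model the global mixing layer, with weights conditioned on the plan and certificate encodings.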
[0105] Through the global mixing layer, the group-level value corresponding to each group, the semantic state representation, the plan encoding result of the structured decision plan, and the certificate encoding result of the decision certificate are mixed in a second layer to obtain the joint value corresponding to each candidate action combination. Through the repair cost prediction branch, based on the matching results between each candidate action combination and the preconditions, invariants, post-assertions, and constraint references in the decision certificate, the repair cost prediction value corresponding to each candidate action combination is output. The generation of the repair cost prediction value is specifically as follows:
[0106] Based on the action content, parameter values and execution order of each sub-decision unit in each candidate action combination, the action feature information corresponding to the candidate action combination is extracted and matched with the preconditions, invariants, post-assertions and constraint references in the decision certificate to obtain the constraint matching result.
[0107] Based on the constraint matching results, identify the sub-decision units that may violate constraints in each candidate action combination and the corresponding violation type, and quantify the degree and scope of the violation to generate constraint conflict feature information;
[0108] Input the constraint conflict feature information into the repair cost prediction branch, and output the degree of adjustment required for the corresponding candidate action combination under the constraint conditions, as the repair cost prediction value of the candidate action combination.
[0109] The candidate action combinations are ranked according to their joint value and predicted repair cost, and the combination with the highest joint value and the lowest predicted repair cost is determined as the optimal action combination.
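The ranking step leaves open how to choose when no single combination has both the highest joint value and the lowest predicted repair cost. The sketch below assumes one possible trade-off rule, a score equal to the joint value minus a weighted repair cost; the `cost_weight` parameter and the example figures are hypothetical and are not part of the described method.

```python
def select_optimal(candidates, cost_weight=1.0):
    """Pick the combination maximising joint_value - cost_weight * repair_cost."""
    return max(
        candidates,
        key=lambda c: c["joint_value"] - cost_weight * c["repair_cost"],
    )


candidates = [
    {"name": "A", "joint_value": 10.0, "repair_cost": 4.0},
    {"name": "B", "joint_value": 9.0, "repair_cost": 1.0},
    {"name": "C", "joint_value": 7.0, "repair_cost": 0.0},
]

best = select_optimal(candidates, cost_weight=1.0)
best_value_only = select_optimal(candidates, cost_weight=0.0)
print(best["name"], best_value_only["name"])
```

With equal weighting, combination B wins because its small repair cost outweighs A's slightly higher joint value; ignoring repair cost altogether selects A.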
[0110] In this embodiment, the step of inputting the optimal action combination and the decision certificate into the constraint shield for consistency verification includes:
[0111] A constraint shield is constructed based on the decision certificate and the current constraint set, specifically as follows:
[0112] Extract the preconditions, invariants, post-assertions, and constraint references from the decision certificate. Based on the mapping relationship between each sub-decision step and the corresponding constraints, combine the preconditions, invariants, and post-assertions associated with the same sub-decision unit into corresponding local constraint fragments.
[0113] Based on the constraint propagation relationship and assertion triggering relationship between each sub-decision unit in the sub-decision unit dependency graph, each local constraint fragment is chained and encapsulated in layers to form a local shield unit corresponding to each sub-decision unit and a global shield unit used to coordinate each local shield unit.
[0114] The optimal action combination is input into the constraint shield. Each local shield unit performs constraint matching verification on the action content, parameter values, and execution order of its corresponding sub-decision unit. Specifically, each local shield unit performs pre-execution verification according to its corresponding preconditions, execution process verification according to its corresponding invariants, and execution result verification according to its corresponding post-assertions. The verification results of each local shield unit are then passed to the global shield unit.
[0115] Each local shield unit performs pre-execution verification according to the corresponding preconditions. Specifically, based on the preconditions bound to the corresponding sub-decision unit in the decision certificate, the action content, parameter values and execution environment status of the sub-decision unit are matched and judged. It is checked whether the current environment status meets the resource constraints, state constraints and triggering conditions limited by the preconditions. If it does not meet the preconditions, it is marked as a pre-execution violation.
[0116] The execution process is verified according to the corresponding invariants. Specifically, during the execution of the sub-decision unit, the state changes during the execution process are continuously monitored according to the invariants. It is determined whether each state parameter remains within the range of values defined by the invariants during the execution process. If the range is exceeded or the constraints are violated, it is marked as a process violation.
[0117] The execution result is verified according to the corresponding post-assertions. Specifically, after the sub-decision unit is executed, the consistency of the execution result is judged according to the post-assertions. The execution result is checked to see if it meets the expected target state, result conditions and constraint requirements. If it does not meet the requirements, it is marked as a result violation.
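The three checks performed by a local shield unit can be sketched as one function that returns the violation labels named above. This is a minimal sketch, assuming the states seen during execution and the resulting state are available as dictionaries, whether observed or predicted; the function signature and the example constraints are hypothetical.

```python
def verify_unit(action, state_before, trajectory, state_after, constraints):
    """Run pre-execution, process, and result checks for one sub-decision unit."""
    violations = []
    # Pre-execution verification against the preconditions.
    if not all(p(state_before, action) for p in constraints["pre"]):
        violations.append("pre-execution violation")
    # Process verification: every intermediate state must satisfy every invariant.
    if not all(inv(s) for s in trajectory for inv in constraints["inv"]):
        violations.append("process violation")
    # Result verification against the post-assertions.
    if not all(post(state_after) for post in constraints["post"]):
        violations.append("result violation")
    return violations


constraints = {
    "pre": [lambda s, a: a["cpu"] <= s["free_cpu"]],
    "inv": [lambda s: s["mem_ratio"] <= 0.9],
    "post": [lambda s: s["task_done"]],
}

violations = verify_unit(
    action={"cpu": 4},
    state_before={"free_cpu": 8},
    trajectory=[{"mem_ratio": 0.7}, {"mem_ratio": 0.95}],
    state_after={"task_done": True},
    constraints=constraints,
)
print(violations)
```

Here the second intermediate state exceeds the memory bound, so only a process violation is reported.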
[0118] The global shield unit performs consistency summary verification on the optimal action combination based on the verification results of each local shield unit and on the constraint propagation relationships, assertion triggering relationships, and constraint conflict relationships between the sub-decision units. It identifies the sub-decision units with constraint conflicts, the action content, parameter values, and execution order that trigger the conflict, and the corresponding constraint references, and generates counterexample information.
[0119] When the optimal action combination fails the consistency summary check, the constraint shield performs minimal repair on the optimal action combination based on the counterexample information. After each repair, the corresponding local shield unit and the global shield unit are called again for verification until a repair action that satisfies the current constraint set is generated. The minimal repair includes:
[0120] The actions of sub-decision units that have passed verification are kept unchanged, and only the sub-decision units that trigger conflicts are repaired locally. Repairs are attempted in the order of parameters first, then actions, then execution order: the parameter values are contracted to satisfy the constraints, the action content is replaced, and the execution order is locally rearranged.
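The repair loop can be sketched as follows, assuming the action combination is a dictionary keyed by sub-decision unit, with insertion order standing for execution order, and a `verify` function that returns the units still in conflict. The CPU limit, the `defer` replacement action, and the three repair functions are hypothetical stand-ins for parameter contraction, action replacement, and local reordering.

```python
CPU_LIMIT = 8  # hypothetical resource constraint


def verify(combo):
    """Return the names of the units whose action violates the constraint."""
    return [unit for unit, act in combo.items() if act["cpu"] > CPU_LIMIT]


def contract_parameters(combo, unit):
    fixed = dict(combo)
    fixed[unit] = {**combo[unit], "cpu": min(combo[unit]["cpu"], CPU_LIMIT)}
    return fixed


def replace_action(combo, unit):
    fixed = dict(combo)
    fixed[unit] = {"action": "defer", "cpu": 0}
    return fixed


def move_to_end(combo, unit):
    fixed = {k: v for k, v in combo.items() if k != unit}
    fixed[unit] = combo[unit]
    return fixed


def minimal_repair(combo, verify, repairers, max_rounds=10):
    """Repair only conflicting units, trying repairers in order and re-verifying each time."""
    current = dict(combo)
    for _ in range(max_rounds):
        conflicts = verify(current)
        if not conflicts:
            return current
        unit = conflicts[0]
        for repair in repairers:
            trial = repair(current, unit)
            if unit not in verify(trial):
                current = trial
                break
        else:
            return None  # none of the ordered repairs resolved this unit
    return current if not verify(current) else None


combo = {
    "select": {"action": "pick", "cpu": 1},
    "allocate": {"action": "assign", "cpu": 16},
}
repaired = minimal_repair(combo, verify, [contract_parameters, replace_action, move_to_end])
print(repaired)
```

Because parameter contraction is tried first and succeeds, the action type of the conflicting unit and the whole of the passing unit are left untouched.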
[0121] When the optimal action combination passes the consistency summary check, the optimal action combination is output as the action to be executed; when the optimal action combination fails the consistency summary check, the repair action is output as the action to be executed.
[0122] In this embodiment, the step of generating constraint increments based on counterexample information to update the constraint shield includes:
[0123] Execute the optimal action combination or repair action, and collect environmental feedback data after the action is executed. The environmental feedback data includes execution result data, state change data, reward feedback data, and constraint trigger data.
[0124] An update sample is constructed based on the environmental feedback data. The update sample includes the semantic state representation corresponding to the current decision-making process, the structured decision plan, the decision certificate, the optimal action combination or repair action, the environmental feedback data, and the counterexample information.
[0125] The update samples are input into the reinforcement learning update unit to update the parameters of the large reinforcement learning model, so that the backbone network, plan generation head, and certificate generation head of the large model are updated collaboratively.
[0126] The update samples are input into the hierarchical QMIX network to update the parameters of the certificate-conditional local value network, intra-group mixing layer, global mixing layer, and repair cost prediction branch.
[0127] Based on the counterexample information, the violating sub-decision unit, the conflicting constraint reference, the triggering condition, and the violating action content, parameter values, and execution order are extracted; constraint increments are generated from them and written into the current constraint set to update the constraint shield.
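The counterexample-driven update can be sketched as turning each counterexample into a record that is appended to the constraint set and consulted in later checks. This is a minimal sketch, assuming a counterexample is a dictionary carrying the fields listed above; the field names, the `ConstraintShield` class, and the exact-match blocking rule are hypothetical simplifications.

```python
def constraint_increment(counterexample):
    """Build a constraint increment from the fields of a counterexample."""
    return {
        "unit": counterexample["unit"],
        "constraint_ref": counterexample["constraint_ref"],
        "trigger": counterexample["trigger"],
        "forbidden": (
            counterexample["action"],
            tuple(sorted(counterexample["params"].items())),
            counterexample["order"],
        ),
    }


class ConstraintShield:
    def __init__(self):
        self.constraint_set = []

    def update(self, counterexample):
        """Write the increment into the current constraint set, skipping duplicates."""
        increment = constraint_increment(counterexample)
        if increment not in self.constraint_set:
            self.constraint_set.append(increment)
        return increment

    def blocks(self, unit, action, params, order):
        """True if a stored increment forbids exactly this action for this unit."""
        key = (action, tuple(sorted(params.items())), order)
        return any(
            c["unit"] == unit and c["forbidden"] == key for c in self.constraint_set
        )


shield = ConstraintShield()
counterexample = {
    "unit": "allocate",
    "constraint_ref": "policy:resource-quota",
    "trigger": "free_cpu < requested cpu",
    "action": "assign",
    "params": {"cpu": 16},
    "order": 2,
}
shield.update(counterexample)
shield.update(counterexample)  # a repeated counterexample adds nothing new
print(len(shield.constraint_set))
```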
[0128] Example 1: To verify the feasibility of this invention in practice, it was applied to a cloud data center. This data center is responsible for the unified deployment and operation management of business systems across multiple departments, and includes approximately 800 computing nodes, over 200 business service modules, and multiple concurrent processing task queues. In actual operation, the system needs to make collaborative decisions among multiple sub-decisions, such as task allocation, resource scheduling, priority control, and execution order adjustment. Due to resource constraints, execution dependencies, and security policy limitations among these sub-decisions, traditional rule-based or single reinforcement learning model-based decision-making methods struggle to achieve global optimality and are prone to resource conflicts, task queuing delays, and execution failures under high load. Existing methods lack a verifiable mechanism for decision results and cannot determine whether a decision violates constraints before execution.
[0129] In this scenario, the method of this invention is applied to construct an intelligent decision-making system. The system first collects environmental state data, including CPU utilization, memory usage, network bandwidth consumption, task queue length, and system log information, and performs unified encoding processing to obtain a semantic state representation. Subsequently, this semantic state representation is input into a large-scale reinforcement learning model, where the backbone network extracts features and outputs a structured decision plan through a plan generation head and a decision certificate through a certificate generation head. The structured decision plan breaks down the overall scheduling task into multiple sub-decision steps, such as task selection, node allocation, priority setting, and execution order adjustment; the decision certificate generates corresponding preconditions, invariants, and post-assertions for each sub-decision step and establishes a mapping relationship between steps and constraints.
[0130] Based on the structured decision plan and decision certificate, the system performs constraint-aware reconstruction on each sub-decision step, recombining actions and parameters that share the same constraint semantics to form sub-decision units, and generating a candidate action set through constraint filtering. Simultaneously, a dependency graph of sub-decision units is constructed based on the step sequence and the constraint propagation relationships, forming a joint action space. Subsequently, a certificate-conditional hierarchical QMIX network is used to evaluate the candidate action combinations in the joint action space. The local value network integrates the semantic state, local observation information, and certificate constraints; the intra-group mixing layer handles strongly dependent sub-decisions; the global mixing layer combines plan and certificate information to obtain the joint value; and the repair cost prediction branch outputs the predicted cost of repairing potential constraint conflicts, from which the optimal action combination is determined.
[0131] Before execution, the system verifies the consistency of the optimal action combination using a constraint shield. The constraint shield progressively verifies each sub-decision unit based on the preconditions, invariants, and post-assertions in the decision certificate. When a sub-decision unit is detected violating resource constraints, the system generates a counterexample and performs local repairs on that sub-decision unit, such as adjusting task allocation nodes or reducing resource consumption, to obtain a repair action that satisfies the constraints. During execution, the system continuously collects environmental feedback data and generates new constraint increments based on the counterexample information, updating the constraint shield to enable the system to gradually adapt to the dynamic operating environment.
[0132] Table 1 Comparison of Intelligent Resource Scheduling Effects
[0133]
[0134] As shown in Table 1, the method of this invention outperforms traditional reinforcement learning methods on several key scheduling metrics, indicating stronger overall optimization capability in complex scenarios that require coordination among multiple sub-decisions. The average task completion time decreased from 9.6 seconds to 8.1 seconds, showing that under the same resource conditions the method completes task scheduling and execution more efficiently, which reflects its advantages in joint action optimization and execution path selection. Under high load, the task latency rate decreased from 13.2% to 8.7%, indicating that the method maintains good scheduling stability under high system pressure and effectively alleviates task congestion.
[0135] In terms of resource utilization and conflict control, the method of this invention increases resource utilization from 68.4% to 76.9%, indicating that it makes fuller use of computing resources and reduces idle resources during allocation. At the same time, the number of scheduling conflicts decreases from 74 per day to 39 per day. Together with the increase in local repair success rate from 65.8% to 89.3%, this shows that the constraint shield and the counterexample-driven mechanism not only identify potential conflicts in advance but also repair local conflicts effectively when they occur, preventing problems from escalating and reducing overall disruption to the system.
[0136] Regarding system stability, the method of this invention reduces the number of system rollbacks from 12 per day to 4 per day, while increasing the decision execution success rate from 87.5% to 94.6%, indicating higher reliability in both pre-execution verification and execution. By introducing the decision certificate and constraint verification mechanisms, the system can filter out unreasonable decisions before execution and continuously refine the decision space through dynamic constraint updates, thereby improving overall operational stability and decision quality. This invention thus improves efficiency and stability together while ensuring decision security.
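The figures quoted in the three paragraphs above can be put on a common footing by converting each to a relative change. The short sketch below does only that arithmetic; the metric labels paraphrase the text, and the baseline and proposed values are the ones stated there.

```python
# (baseline value, value with the proposed method), as quoted in the text above.
results = {
    "average task completion time (s)": (9.6, 8.1),
    "task latency rate under high load (%)": (13.2, 8.7),
    "resource utilization (%)": (68.4, 76.9),
    "scheduling conflicts (per day)": (74, 39),
    "local repair success rate (%)": (65.8, 89.3),
    "system rollbacks (per day)": (12, 4),
    "decision execution success rate (%)": (87.5, 94.6),
}


def relative_change(before, after):
    """Signed relative change in percent, measured against the baseline."""
    return (after - before) / before * 100.0


for name, (before, after) in results.items():
    print(f"{name}: {before} -> {after} ({relative_change(before, after):+.1f}%)")
```

Relative change is measured against the baseline, so a negative sign marks a reduction, which is the desired direction for completion time, latency rate, conflicts, and rollbacks.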
[0137] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. An intelligent decision-making method based on a large reinforcement learning model, characterized in that it comprises: acquiring environmental state data and processing the environmental state data to obtain a unified semantic state representation, wherein the environmental state data includes structured data and unstructured data, the structured data includes system operating parameters, resource status information, task attribute information, and time series data, and the unstructured data includes text data, log data, and event description data; inputting the semantic state representation into a large reinforcement learning model, and outputting a structured decision plan and a corresponding decision certificate in parallel through a plan generation head and a certificate generation head; performing constraint-aware reconstruction on each sub-decision step based on the structured decision plan and the mapping relationship in the decision certificate to form sub-decision units; generating a candidate action set according to the action type, parameter range, and associated constraints of each sub-decision unit, establishing a dependency graph between the sub-decision units, and forming a joint action space; taking the candidate action combinations in the joint action space as evaluation objects, calculating the joint value and the predicted repair cost of each candidate action combination through a certificate-conditional hierarchical QMIX network based on the semantic state representation, the structured decision plan, the decision certificate, and the sub-decision unit dependency graph, and determining the optimal action combination; inputting the optimal action combination and the decision certificate into a constraint shield for consistency verification; if the verification passes, executing the optimal action combination; if the verification fails, performing minimal repair on the optimal action combination according to the verification result to obtain a repair action and generating corresponding counterexample information; and executing the optimal action combination or the repair action, obtaining environmental feedback data, updating the large reinforcement learning model and the hierarchical QMIX network based on the environmental feedback data, and generating constraint increments based on the counterexample information to update the constraint shield.
2. The intelligent decision-making method based on a large reinforcement learning model according to claim 1, characterized in that processing the environmental state data to obtain a unified semantic state representation includes: the structured data is numerically normalized and vectorized, and the unstructured data is text-encoded; the processed structured and unstructured features are then fused through an encoding network into a unified representation to obtain the semantic state representation.
3. The intelligent decision-making method based on a large reinforcement learning model according to claim 1, characterized in that outputting the structured decision plan and the corresponding decision certificate in parallel through the plan generation head and the certificate generation head includes: The semantic state representation is input into the backbone network of the large reinforcement learning model for feature extraction to obtain the state feature representation used for decision generation. The backbone network of the large model adopts the Transformer architecture. The state feature representation is input into the plan generation head connected to the backbone network of the large model to generate a structured decision plan. The structured decision plan includes multiple sequentially arranged sub-decision steps, the action type and parameter range corresponding to each sub-decision step, and the dependencies between the steps. The state feature representation is input into the certificate generation head connected to the backbone network of the large model to generate a decision certificate corresponding to the structured decision plan. The decision certificate includes the preconditions, invariants, post-assertions, and constraint references corresponding to each sub-decision step. Based on the structured decision plan and the decision certificate, a mapping relationship between each sub-decision step and its corresponding preconditions, invariants, and post-assertions is established and stored in association with the structured decision plan and the decision certificate. Based on environmental feedback data, the parameters of the large reinforcement learning model are updated through a reinforcement learning update unit connected to the backbone network of the large model. The reinforcement learning update unit uses an actor-critic approach, and the parameters are updated by low-rank adaptation.
4. The intelligent decision-making method based on a large reinforcement learning model according to claim 1, characterized in that generating the candidate action set, establishing the dependency graph between sub-decision units, and forming the joint action space includes: Read the sub-decision steps in the structured decision plan and the mapping relationship in the decision certificate, and extract the action type, parameter range, preconditions, invariants, post-assertions and constraint references corresponding to each sub-decision step; For each sub-decision step, a constraint description fragment is generated based on the corresponding preconditions, invariants, and constraint references. The constraint description fragment is then bound to the action type, parameter range, and execution order of the corresponding sub-decision step to form a step-level constraint semantic unit. Constraint-aware reconstruction is performed on each step-level constraint semantic unit, including: According to the constraint type, constraint strength and constraint object in the constraint description fragment, the action elements, parameter elements and execution conditions in the original sub-decision steps are split. Action elements and parameter elements with the same constraint object and satisfying compatibility conditions are recombined. Action elements and parameter elements with constraint conflicts are isolated. The recombined action elements, parameter elements and execution conditions are used to generate corresponding sub-decision units. For each sub-decision unit, a candidate action set is generated based on the reconstructed action elements, parameter elements, and execution conditions. The candidate action set is then filtered according to the corresponding preconditions and invariants, and candidate actions that do not meet the preconditions or violate the invariants are deleted.
Based on the step sequence relationships in the structured decision plan, the constraint propagation relationships between step-level constraint semantic units, and the constraint conflict relationships between sub-decision units, a dependency graph between sub-decision units is established, specifically as follows: Extract the input constraints, output constraints, and assertion triggering conditions corresponding to each sub-decision unit; Match the output constraints of one sub-decision unit with the input constraints of another sub-decision unit to establish constraint propagation edges; Match the post-assertions of one sub-decision unit with the preconditions of another sub-decision unit to establish assertion trigger edges; By superimposing the constraint propagation edges and the assertion trigger edges, a dependency graph between sub-decision units is formed that contains sequential dependencies, constraint dependencies, and assertion dependencies. A joint action space is generated based on the candidate action set of each sub-decision unit and the dependency graph.
5. The intelligent decision-making method based on a large reinforcement learning model according to claim 1, characterized in that calculating the joint value and the predicted repair cost of each candidate action combination and determining the optimal action combination through the certificate-conditional hierarchical QMIX network includes: The candidate action combinations in the joint action space are used as evaluation objects. For each candidate action combination, the local observation information of the corresponding sub-decision units, the semantic state representation, the structured decision plan, the decision certificate, and the adjacency relationship information in the sub-decision unit dependency graph are extracted. The hierarchical QMIX network includes a certificate-conditional local value network, an intra-group mixing layer, a global mixing layer, and a repair cost prediction branch. Through the certificate-conditional local value network, the local observation information of each sub-decision unit, the semantic state representation, the preconditions and invariants bound to the corresponding sub-decision unit in the decision certificate, and the adjacency relationship information are fused to obtain the conditional local value of each sub-decision unit under the corresponding candidate action combination. Through the intra-group mixing layer, sub-decision units with strong dependencies are grouped based on the dependency strength, constraint propagation relationships, and assertion triggering relationships in the sub-decision unit dependency graph, and the conditional local values of the sub-decision units within each group undergo first-layer mixing to obtain the group-level value corresponding to each group.
Through the global mixing layer, the group-level value corresponding to each group, the semantic state representation, the plan encoding result of the structured decision plan, and the certificate encoding result of the decision certificate are mixed in the second layer to obtain the joint value corresponding to each candidate action combination. Through the repair cost prediction branch, the repair cost prediction value corresponding to each candidate action combination is output according to the matching result between each candidate action combination and the preconditions, invariants, post-assertions and constraint references in the decision certificate. The candidate action combinations are ranked according to their joint value and predicted repair cost, and the combination with the highest joint value and the lowest predicted repair cost is determined as the optimal action combination.
6. The intelligent decision-making method based on a large reinforcement learning model according to claim 1, characterized in that, The process of inputting the optimal action combination and decision certificate into the constraint shield for consistency verification includes: A constraint shield is constructed based on the decision certificate and the current constraint set, specifically as follows: Extract the preconditions, invariants, post-assertions, and constraint references from the decision certificate. Based on the mapping relationship between each sub-decision step and the corresponding constraints, combine the preconditions, invariants, and post-assertions associated with the same sub-decision unit into corresponding local constraint fragments. Based on the constraint propagation relationship and assertion triggering relationship between each sub-decision unit in the sub-decision unit dependency graph, each local constraint fragment is chained and encapsulated in layers to form a local shield unit corresponding to each sub-decision unit and a global shield unit used to coordinate each local shield unit. The optimal action combination is input into the constraint shield. Each local shield unit performs constraint matching verification on the action content, parameter values and execution order of the corresponding sub-decision unit. Each local shield unit performs pre-execution verification according to the corresponding preconditions, execution process verification according to the corresponding invariants, and execution result verification according to the corresponding post-assertions. The verification results of each local shield unit are then passed to the global shield unit. 
The global shield unit performs consistency summary verification on the optimal action combination based on the verification results of each local shield unit and on the constraint propagation relationships, assertion triggering relationships, and constraint conflict relationships between the sub-decision units. It identifies the sub-decision units with constraint conflicts, the action content, parameter values, and execution order that trigger each conflict, and the corresponding constraint references, and generates counterexample information. When the optimal action combination fails the consistency summary verification, the constraint shield performs minimal repair on the optimal action combination according to the counterexample information. After each repair, the corresponding local shield units and the global shield unit are called again for verification until a repair action that satisfies the current constraint set is generated. When the optimal action combination passes the consistency summary verification, the optimal action combination is output as the action to be executed; when it fails the consistency summary verification, the repair action is output as the action to be executed.
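The verify-and-repair loop of this claim can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the claimed implementation: constraints are modelled as plain predicates, all three verification stages are evaluated on the proposed action rather than before, during, and after execution, and "minimal repair" is delegated to a caller-supplied function. The names `LocalShield`, `GlobalShield`, `shield_action`, `clamp_heater`, the heater/fan scenario, and the constraint references `C1` and `C2` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Action = Dict[str, float]   # hypothetical: parameter values of one sub-decision unit
Combo = Dict[str, Action]   # sub-decision unit name -> its action
Check = Tuple[str, Callable[[Action], bool]]  # (constraint reference, predicate)


@dataclass
class LocalShield:
    unit: str
    preconditions: List[Check] = field(default_factory=list)
    invariants: List[Check] = field(default_factory=list)
    post_assertions: List[Check] = field(default_factory=list)

    def verify(self, action: Action) -> List[dict]:
        """Check the three stages in order and report every violated constraint."""
        stages = (("pre", self.preconditions),
                  ("invariant", self.invariants),
                  ("post", self.post_assertions))
        return [{"unit": self.unit, "stage": stage, "constraint_ref": ref, "action": dict(action)}
                for stage, checks in stages
                for ref, ok in checks if not ok(action)]


@dataclass
class GlobalShield:
    local_shields: List[LocalShield]
    cross_checks: List[Tuple[str, Callable[[Combo], bool]]] = field(default_factory=list)

    def verify(self, combo: Combo) -> List[dict]:
        """Collect local results, then check constraints that span several units."""
        counterexamples: List[dict] = []
        for shield in self.local_shields:
            counterexamples.extend(shield.verify(combo[shield.unit]))
        for ref, ok in self.cross_checks:
            if not ok(combo):
                counterexamples.append(
                    {"unit": None, "stage": "global", "constraint_ref": ref, "action": None})
        return counterexamples


def shield_action(shield: GlobalShield, combo: Combo,
                  repair: Callable[[Combo, dict], Combo], max_rounds: int = 10):
    """Return (action to execute, counterexamples found on the original combination)."""
    first = shield.verify(combo)
    pending, current = first, combo
    for _ in range(max_rounds):
        if not pending:
            return current, first
        current = repair(current, pending[0])  # repair one violation, then re-verify everything
        pending = shield.verify(current)
    if pending:
        raise RuntimeError("no repair satisfying the current constraint set was found")
    return current, first


def clamp_heater(combo: Combo, counterexample: dict) -> Combo:
    """Hypothetical repair: lower heater power just enough to satisfy the violated constraint."""
    fixed = {unit: dict(action) for unit, action in combo.items()}
    if counterexample["constraint_ref"] == "C1":
        fixed["heater"]["power"] = 10.0
    elif counterexample["constraint_ref"] == "C2":
        fixed["heater"]["power"] = 10.0 - fixed["fan"]["speed"]
    return fixed


shield = GlobalShield(
    [LocalShield("heater", preconditions=[("C1", lambda a: a["power"] <= 10.0)]),
     LocalShield("fan")],
    cross_checks=[("C2", lambda c: c["heater"]["power"] + c["fan"]["speed"] <= 10.0)],
)
proposed = {"heater": {"power": 12.0}, "fan": {"speed": 1.0}}
to_execute, counterexamples = shield_action(shield, proposed, clamp_heater)
```

In this toy run the proposed combination violates both the local constraint `C1` and the cross-unit constraint `C2`; each repair round fixes one violation and the whole shield is re-run, mirroring the "repair, then verify again" wording of the claim. The `max_rounds` bound is an addition of the sketch so that the loop always terminates.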
7. The intelligent decision-making method based on a large reinforcement learning model according to claim 1, characterized in that the step of generating constraint increments based on counterexample information to update the constraint shield includes: The optimal action combination or the repair action is executed, and environmental feedback data is collected after the action is executed. The environmental feedback data includes execution result data, state change data, reward feedback data, and constraint trigger data. An update sample is constructed based on the environmental feedback data. The update sample includes the semantic state representation corresponding to the current decision-making process, the structured decision plan, the decision certificate, the optimal action combination or repair action, the environmental feedback data, and the counterexample information. The update sample is input into the reinforcement learning update unit to update the parameters of the large reinforcement learning model, so that the backbone network, plan generation head, and certificate generation head of the large model are updated collaboratively. The update sample is also input into the hierarchical QMIX network to update the parameters of the certificate-conditional local value network, the intra-group mixing layer, the global mixing layer, and the repair cost prediction branch. Based on the counterexample information, the violating sub-decision unit, conflicting constraint reference, triggering condition, violating action content, violating parameter values, and violating execution order are extracted, constraint increments are generated, and the constraint increments are written into the current constraint set to update the constraint shield.
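The final step of this claim, turning counterexample information into constraint increments, can be sketched as below. The record layout, the dictionary keys of the counterexample, and the names `ConstraintIncrement`, `make_increment`, `update_constraint_set`, and `is_forbidden` are assumptions made for illustration. De-duplicating increments is likewise a choice of the sketch rather than something the claim states, and the parameter updates of the large model and the hierarchical QMIX network are not modelled.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class ConstraintIncrement:
    """One forbidden pattern extracted from a counterexample (hypothetical layout)."""
    unit: str                             # violating sub-decision unit
    constraint_ref: str                   # conflicting constraint reference
    trigger_condition: str                # condition under which the violation occurred
    action: str                           # violating action content
    params: Tuple[Tuple[str, Any], ...]   # violating parameter values, as sorted pairs
    order: Optional[int]                  # violating execution order, if recorded


def make_increment(ce: Dict[str, Any]) -> ConstraintIncrement:
    """Extract the fields named in the claim from one counterexample record."""
    return ConstraintIncrement(
        unit=ce["unit"],
        constraint_ref=ce["constraint_ref"],
        trigger_condition=ce["trigger_condition"],
        action=ce["action"],
        params=tuple(sorted(ce["params"].items())),
        order=ce.get("order"),
    )


def update_constraint_set(constraint_set: Set[ConstraintIncrement],
                          counterexamples: Iterable[Dict[str, Any]]) -> int:
    """Write new increments into the constraint set; return how many were added."""
    added = 0
    for ce in counterexamples:
        increment = make_increment(ce)
        if increment not in constraint_set:  # skip patterns the shield already knows
            constraint_set.add(increment)
            added += 1
    return added


def is_forbidden(constraint_set: Set[ConstraintIncrement], unit: str, action: str,
                 params: Dict[str, Any], order: Optional[int] = None) -> bool:
    """Check whether a proposed sub-decision exactly matches a recorded violation."""
    key = tuple(sorted(params.items()))
    return any(i.unit == unit and i.action == action and i.params == key and i.order == order
               for i in constraint_set)


# Illustrative counterexample; the values are not taken from the patent.
constraint_set: Set[ConstraintIncrement] = set()
counterexample = {
    "unit": "heater", "constraint_ref": "C1", "trigger_condition": "power > 10",
    "action": "set_power", "params": {"power": 12.0}, "order": 0,
}
first_added = update_constraint_set(constraint_set, [counterexample])
second_added = update_constraint_set(constraint_set, [counterexample])
```

Storing increments as hashable records keeps the constraint set free of duplicates when the same violation recurs. Exact matching in `is_forbidden` is the simplest possible policy; a real shield would more likely generalise each counterexample into a predicate over parameter ranges.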