A reinforcement learning driven dynamic constraint optimization decision system and method thereof
The reinforcement learning-driven dynamic constraint optimization decision system combines 1D convolutional neural networks, multi-head self-attention mechanisms, and model predictive control to solve the problems of decision lag and insufficient security in dynamic constraint processing in complex industrial scenarios, achieving a unified optimization of real-time performance and security.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTHEAST UNIV
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from decision lag, insufficient security, and high cross-scenario adaptation costs when dealing with dynamic constraints in complex industrial scenarios, and cannot achieve a unified optimization of real-time performance and security.
The dynamic constraint optimization decision system driven by reinforcement learning includes a data acquisition layer, a dynamic constraint perception module, a multi-agent reinforcement learning decision module, a safe reinforcement learning optimization module, and a transfer learning adaptation module. It achieves accurate perception, safe verification, and cross-scene transfer of dynamic constraints through 1D convolutional neural networks, multi-head self-attention mechanism, CAC framework, and MPC module.
It achieves precise handling of dynamic constraints, improves the security and real-time performance of decision-making, reduces the cost of cross-scenario adaptation, and enhances the efficiency and security of collaborative decision-making in the system.
Figure CN122242633A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence, reinforcement learning, and operations research optimization, specifically to a reinforcement learning-driven dynamic constraint optimization decision-making system and method.
Background Technology
[0002] With the deepening development of the Industrial Internet and intelligent manufacturing, systems face challenges in complex industrial production and energy scheduling processes, including multi-device collaboration, multi-objective optimization, and massive data processing. Traditional scheduling decision-making methods based on operations research or heuristic rules often suffer from high computational complexity and poor real-time performance when dealing with highly nonlinear and dynamically changing environments. In recent years, deep reinforcement learning, with its powerful high-dimensional state space representation capabilities and end-to-end decision-making advantages, has gradually become an important means of solving collaborative optimization problems in complex distributed systems. In practical fields such as industrial microgrid scheduling or flexible production line control, agents not only need to maximize long-term cumulative rewards but also must strictly adhere to various hard or soft constraints, such as equipment physical limits, energy consumption limits, and production cycle time.
[0003] To improve the efficiency of collaborative learning in large-scale distributed systems and protect the privacy of underlying data, some solutions integrating reinforcement learning and distributed architecture have been proposed in existing technologies. For example, Chinese invention patent application CN112465151A discloses a multi-agent federated collaboration method based on deep reinforcement learning. This existing technology mainly constructs a multi-agent deep reinforcement learning framework and introduces a federated learning mechanism, enabling each agent to train its policy locally and then only upload the gradient or parameter information of the policy network to the central server for aggregation, without transmitting the underlying raw runtime state data. This method, to some extent, solves the communication bandwidth pressure and sensitive data leakage risk problems in the multi-agent collaboration process, achieving better distributed decision-making results.
[0004] However, this existing technology still reveals significant limitations when dealing with real and extremely complex dynamic industrial scenarios. Firstly, regarding constraint handling mechanisms, existing multi-agent federated reinforcement learning frameworks typically rely on static Lagrange multiplier methods, which transform environmental constraints into fixed penalty terms directly integrated into the reward function, or use dual methods to optimize constraints separately. However, in actual manufacturing production lines, constraints are often highly dynamic and prone to abrupt changes. For example, sudden equipment failures can cause instantaneous changes in equipment load limits, or the insertion of urgent orders can lead to drastic fluctuations in production cycle requirements. Traditional static penalty mechanisms are extremely sensitive to multiplier parameters and cannot use deep neural networks such as one-dimensional convolutional neural networks or multi-head self-attention mechanisms to perceive the trend and rate of change of constraints in real time under sequential conditions. Furthermore, they lack the ability to dynamically allocate priority weights for constraints of different dimensions, resulting in severe lag in decision-making when facing sudden constraint changes, making it difficult to ensure that key hard constraints are always satisfied.
[0005] Secondly, pure deep reinforcement learning algorithms are essentially trial-and-error learning. In the process of exploring unknown states or iteratively updating federated policies, they inevitably generate exploratory actions that violate safety boundaries. The aforementioned existing technologies only focus on the federated aggregation of policy parameters, neglecting safety verification and physical trajectory correction before the execution of decision-making actions. In industrial control, once issued power adjustment or batch control commands violate the hard constraints of the physical load of heavy equipment or the total energy consumption limit of the power grid, it will directly lead to system instability, equipment damage, or even serious safety accidents. In other words, existing technologies lack a mechanism that can immediately trigger and perform forward-looking time-domain adaptive trajectory correction based on the current physical state and dynamic change frequency when there is a high risk of violation in local decisions, failing to completely isolate the risk of trial and error before the physical equipment executes the decisions.
[0006] Furthermore, modern industrial production exhibits typical characteristics of flexible manufacturing with small batches and diverse product varieties, and production scenarios change extremely frequently. When faced with entirely new order structures or equipment topology changes, the aforementioned existing technologies typically require restarting the multi-agent strategy from scratch or undergoing a lengthy fine-tuning and convergence process. Due to the lack of experience storage and transfer adaptation mechanisms based on meta-reinforcement learning, existing systems cannot quickly recall historical experience models by calculating the similarity between old and new scenarios, nor can they achieve incremental learning using a small number of interaction samples. This results in significant time and computational costs, severely limiting the rapid deployment and generalization capabilities of algorithms in industrial settings.
[0007] In summary, a complete dynamic constraint optimization decision-making system and method is needed. This system should not only inherit the advantages of multi-agent federated reinforcement learning in distributed collaboration and privacy protection, but also uniformly address core pain points such as dynamic and accurate perception of constraints, safe and forward-looking correction of execution actions, and rapid experience transfer across scenarios. This would enable dynamic collaborative decision-making with global optimization while ensuring industrial-grade real-time performance and absolute security.
Summary of the Invention
[0008] The purpose of this invention is to provide a reinforcement learning-driven dynamic constraint optimization decision system and method to solve the problem mentioned in the background art of lacking a complete decision framework that can uniformly handle dynamic constraints, multi-objectives, distributed collaboration, and take into account both real-time performance and security.
[0009] To achieve the above objectives, the present invention provides the following technical solution: a reinforcement learning-driven dynamic constraint optimization decision-making system, comprising the following components.
The data acquisition layer collects dynamic environment data and constraint data in the decision-making scenario in real time. The constraint data includes hard constraint data and soft constraint data: the hard constraint data is an inviolable constraint threshold, and the soft constraint data is a constraint parameter that can be adjusted within a preset range.
The dynamic constraint perception module assigns weights to the collected constraint data based on an attention mechanism, focuses on key constraint dimensions, and updates the constraint priority matrix in real time.
The multi-agent reinforcement learning decision module adopts a distributed deep reinforcement learning architecture that includes a master agent and several sub-agents. The sub-agents are responsible for the local optimization of their sub-decision scenarios, and the master agent performs global optimization based on the local decision results of the sub-agents and the global constraints.
The safety reinforcement learning optimization module integrates the Constrained Actor-Critic (CAC) framework with model predictive control (MPC). It monitors constraint satisfaction in real time during the decision-making process and triggers MPC to correct the local trajectory when a constraint violation is predicted.
The transfer learning adaptation module, based on the Meta-RL framework, stores decision-making experience models from historical scenarios. When the decision-making scenario changes, it quickly adapts to the constraint optimization requirements of the new scenario using a small number of interaction samples.
The decision output layer receives the optimized decision instructions, converts them into executable control signals, and sends them to the execution terminal.
Through the synergy of these core modules, the system handles dynamic constraints more precisely and efficiently and improves the efficiency of collaborative decision-making across multiple subsystems. The safety reinforcement learning optimization module significantly enhances decision security: it addresses the unreliability of traditional reinforcement learning, which handles constraints through rewards and penalties alone, and thus meets the needs of security-sensitive scenarios. The cost of cross-scenario adaptation is also significantly reduced. Deployment and adaptation time in new scenarios is shortened, and the time and data costs of cross-scenario application are lowered, which makes decisions more feasible to implement and increases the system's practical value.
Preferably, the dynamic constraint perception module includes: a constraint feature extraction unit, which uses a one-dimensional convolutional neural network (1D CNN) to extract features from temporal constraint data and outputs a constraint feature vector; an attention weight calculation unit, which, based on the multi-head self-attention mechanism, takes the constraint feature vector and the environment state vector as input and calculates the attention weight of each constraint dimension; and a constraint priority update unit, which dynamically adjusts the constraint priority matrix based on the attention weights and the constraint violation history, with the base weights of hard constraints never falling below a preset threshold. The 1D CNN processes the temporal constraint data so as to capture its dynamic temporal patterns.
The convolutional kernels of the 1D CNN slide along the time dimension and extract local temporal features of the constraint data (such as a sudden drop in a hard constraint threshold or the gradual change period of a soft constraint parameter). The output constraint feature vector includes not only the current constraint value but also historical change patterns (such as the constraint fluctuation amplitude over the past 10 sampling periods), providing a more comprehensive feature basis for the subsequent weight calculation. Compared with a fully connected network, the 1D CNN reduces model complexity through parameter sharing. Even with dozens or hundreds of constraint dimensions (such as the load, energy consumption, and cycle time constraints of multiple devices in a smart factory), it can still extract the temporal features of each dimension efficiently, avoiding the loss of computational efficiency caused by the curse of dimensionality. The attention weight calculation unit uses multi-head self-attention to associate constraints dynamically with the environment and to focus on key constraints in parallel across multiple dimensions. The constraint priority update unit combines the attention weights with the historical record of constraint violations, so that priorities are adjusted on the basis of past experience while a minimum weight is rigidly guaranteed for hard constraints. The three units form a closed loop: the constraint feature extraction unit provides temporal features for the weight calculation, the attention weight calculation unit provides the dynamic association on which priority updates are based, and the output of the constraint priority update unit can in turn feed back into subsequent data collection.
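The priority update described above can be illustrated with a minimal Python sketch. The specification does not give the update formula, so the multiplicative violation boost, the names `update_priorities`, `history_gain`, and `hard_floor`, and the default values are all assumptions made for illustration only.

```python
def update_priorities(attention, violation_counts, is_hard,
                      hard_floor=0.2, history_gain=0.1):
    """Blend attention weights with violation history (assumed formula).

    attention        -- attention weight per constraint dimension
    violation_counts -- historical violation count per dimension
    is_hard          -- True where the dimension is a hard constraint
    """
    # Assumed rule: boost a weight in proportion to its past violations.
    raw = [a * (1.0 + history_gain * v)
           for a, v in zip(attention, violation_counts)]
    total = sum(raw)
    if total <= 0:
        # Degenerate input: fall back to uniform weights.
        weights = [1.0 / len(raw)] * len(raw)
    else:
        weights = [r / total for r in raw]
    # Hard constraints never drop below the preset floor. The result is
    # intentionally not renormalized, so the floor is preserved exactly.
    return [max(w, hard_floor) if hard else w
            for w, hard in zip(weights, is_hard)]


# Hypothetical example: one hard constraint with low attention, and a
# soft constraint that has been violated twice before.
weights = update_priorities([0.05, 0.6, 0.35], [0, 2, 0],
                            [True, False, False])
```

In this example the hard constraint is lifted to the floor value even though its attention weight is small, while the previously violated soft constraint gains weight relative to the other soft constraint.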
[0010] Preferably, in the multi-agent reinforcement learning decision-making module, the sub-agents adopt the Deep Deterministic Policy Gradient (DDPG) algorithm, and the master agent adopts the centralized training with decentralized execution (CTDE) architecture and optimizes the global reward function through the Advantage Actor-Critic (A2C) algorithm. The global reward function includes the weighted sum of the sub-agents' local rewards and the global constraint satisfaction reward. This hierarchical design, in which the sub-agents make precise local decisions and the master agent performs global collaborative optimization, together with the targeted choice of algorithms and architecture, addresses the core problems of traditional multi-subsystem decision-making: conflict between local and global goals, difficulty in handling continuous actions, and low collaborative efficiency.
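One possible reading of the global reward function is sketched below in Python. The specification states only that the reward includes a weighted sum of local rewards and a global constraint satisfaction reward; the function name `global_reward`, the per-agent weights, and the separate `constraint_weight` are assumptions.

```python
def global_reward(local_rewards, local_weights,
                  constraint_reward, constraint_weight=1.0):
    """Weighted sum of sub-agent rewards plus a global constraint term."""
    if len(local_rewards) != len(local_weights):
        raise ValueError("one weight is required per sub-agent reward")
    local_part = sum(w * r for w, r in zip(local_weights, local_rewards))
    return local_part + constraint_weight * constraint_reward


# Hypothetical values: two sub-agents and a penalty for an unmet
# global constraint (negative constraint reward).
r = global_reward([1.0, 2.0], [0.5, 0.25], -1.0, constraint_weight=2.0)
```

A larger `constraint_weight` makes the master agent trade local reward for global constraint satisfaction more readily.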
[0011] Preferably, in the safety reinforcement learning optimization module, the CAC framework's Critic network includes dual Critic branches, which output the value estimate of the decision action and the constraint violation risk estimate, respectively. When the constraint violation risk estimate exceeds a preset threshold, the MPC module is triggered to generate a short-term optimization trajectory based on the current environment state and constraints, correcting the agent's decision action. Through the collaborative mechanism of dual Critic risk prediction and MPC real-time correction, the core pain points of traditional reinforcement learning constraint processing—such as delayed risk warning and weak real-time correction capabilities—are addressed. The dedicated constraint violation risk estimation branch can quantify the risk probability based on the current action, environment state, and dynamic constraints, allowing for early intervention. When the risk estimate exceeds the threshold, the MPC module is triggered to generate a short-term optimization trajectory, enabling rapid correction of the agent's decision action. Compared to relying solely on the CAC framework to avoid constraints through policy updates, which requires multiple rounds of interaction and involves latency, MPC can directly solve for the local optimal trajectory within the short-term prediction time domain (e.g., 5-10 decision steps in the future) based on the current state and constraints, instantly adjusting actions to meet constraints (e.g., urgently reducing equipment load or adjusting vehicle travel paths), significantly improving constraint satisfaction capabilities in highly dynamic and high-risk scenarios.
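The risk-gated correction can be sketched as follows. This is a toy illustration, not the MPC solver of the invention: it assumes a scalar load state, additive dynamics (next load = load + action), a single upper limit, a constant action held over the horizon, and a coarse candidate search in place of a real optimizer. The names `mpc_correct`, `safe_action`, and `risk_critic` are hypothetical.

```python
def mpc_correct(state, proposed, limit, horizon=5):
    """Return the candidate action closest to `proposed` whose rollout
    over `horizon` steps never exceeds `limit` (toy dynamics)."""
    # Candidates scale the proposed action from +100% down to -100%.
    candidates = [proposed * k / 10 for k in range(10, -11, -1)]
    feasible = []
    for u in candidates:
        x = state
        ok = True
        for _ in range(horizon):
            x = x + u  # assumed dynamics: load changes by u each step
            if x > limit:
                ok = False
                break
        if ok:
            feasible.append(u)
    if not feasible:
        # No candidate is safe: fall back to the most conservative one.
        return min(candidates)
    return min(feasible, key=lambda u: abs(u - proposed))


def safe_action(state, proposed, risk_critic, limit, risk_threshold=0.5):
    """Pass the action through unless the risk branch flags it."""
    risk = risk_critic(state, proposed)  # stands in for the risk Critic
    if risk > risk_threshold:
        return mpc_correct(state, proposed, limit)
    return proposed


# Hypothetical example: load 90, limit 100, agent proposes +5 per step.
corrected = safe_action(90.0, 5.0, lambda s, a: 0.9, 100.0)
```

With a horizon of 5 steps, the largest load increase that keeps the rollout at or below the limit is 2 per step, so the proposed action of 5 is reduced to 2 when the risk estimate is above the threshold.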
[0012] Preferably, the transfer learning adaptation module includes: a meta-model storage unit, which stores the parameters of the reinforcement learning models trained under different historical scenarios, indexed by scene feature vector; a scene similarity calculation unit, which calculates the similarity between the feature vector of the new scene and the feature vectors of the historical scenes using cosine similarity; and a rapid adaptation unit, which, when the similarity exceeds a preset threshold, loads the corresponding historical model as the initial parameters and completes model adaptation after 1-5 rounds of training on interaction samples using the MAML (Model-Agnostic Meta-Learning) algorithm of meta-reinforcement learning. By storing historical experience, quantifying scene matching objectively, and adapting rapidly to new scenes, the module addresses the core pain points of traditional reinforcement learning models: the high cost of cross-scene migration, long adaptation cycles, and the difficulty of reusing historical experience. It thereby achieves efficient storage and accurate retrieval of historical models, objective quantitative evaluation of scene similarity, and rapid, low-cost migration to new scenes.
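The retrieval step of the transfer learning adaptation module can be sketched in Python as below. Only the cosine similarity lookup and the threshold test are shown; the MAML adaptation itself is not implemented. The store layout (a list of vector and parameter pairs), the function names, and the threshold of 0.8 are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for a zero vector
    return dot / (norm_a * norm_b)


def retrieve_model(new_scene, store, threshold=0.8):
    """Return (params, similarity) of the best-matching historical model,
    or (None, similarity) when no scene exceeds the threshold."""
    best_params, best_sim = None, -1.0
    for scene_vec, params in store:
        sim = cosine_similarity(new_scene, scene_vec)
        if sim > best_sim:
            best_params, best_sim = params, sim
    if best_sim > threshold:
        return best_params, best_sim
    return None, best_sim


# Hypothetical store with two historical scenes and placeholder params.
store = [([1.0, 0.0], "model_A"), ([0.0, 1.0], "model_B")]
params, sim = retrieve_model([0.9, 0.1], store)
```

A `None` result signals that no historical scene is similar enough, which is the case in which a different adaptation path would be needed.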
[0013] This invention provides a reinforcement learning-driven dynamic constraint optimization decision-making method, the specific steps of which are as follows:
S1. Data Acquisition: Environmental status data and constraint data of the decision-making scenario are acquired in real time through sensors and database interfaces, followed by data cleaning and standardization.
S2. Dynamic Constraint Awareness: Based on the attention mechanism, the weights of each constraint dimension are calculated to generate a constraint priority matrix, where the weights of key constraint dimensions are dynamically adjusted through historical violation data and real-time environmental features.
S3. Multi-Agent Distributed Decision-Making: Sub-agents make local optimization decisions based on local environmental data and sub-constraints, and feed the decision results back to the master agent. The master agent integrates global constraints and the decision results of each sub-agent, and generates a global optimization decision through a centralized reinforcement learning algorithm.
S4. Safety Constraint Verification and Optimization: The constraint satisfaction status of decision actions is predicted through the CAC framework. If there is a risk of constraint violation, the MPC module is triggered to generate a correction trajectory and adjust the decision actions to meet the constraint requirements.
S5. Scene Adaptation and Model Update: When the scene changes, the transfer learning adaptation module calls the historical experience model and combines a small number of new scene samples to quickly update the model parameters, ensuring that the decision adapts to the constraints of the new scene.
S6. Decision Output: The optimized decision instructions are transformed into signals that can be recognized by the execution terminal. At the same time, the decision process data and model parameters are stored in the historical database to provide data support for subsequent transfer learning.
Through the closed-loop design of the whole chain of data input, dynamic perception, collaborative decision-making, security verification, scenario adaptation, and output feedback, the functions of the core modules are deeply coordinated.
[0014] Preferably, in step S2, the update cycle of the constraint priority matrix is synchronized with the data acquisition cycle. When a sudden change in constraint conditions is detected (the rate of change exceeds a preset threshold), the priority matrix is updated instantly without waiting for the acquisition cycle to end. The update mechanism thus adopts a dual-mode design of regular synchronous updates and instant updates on sudden changes. The synchronous update ensures that the constraint priority matrix tracks the latest acquired constraint data, while the instant update enhances the emergency response capability in sudden scenarios. Because the instant update is triggered only when a constraint changes abruptly, the computing resources that continuous high-frequency updating would waste are saved.
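The dual-mode trigger can be expressed as a small decision function. This Python sketch assumes a single scalar constraint value and a finite-difference estimate of its rate of change; the function name `update_mode` and all parameter names are hypothetical.

```python
def update_mode(now, last_update, period,
                prev_value, value, dt, rate_threshold):
    """Decide whether the priority matrix should be updated.

    Returns "instant" on an abrupt constraint change, "periodic" when
    the acquisition period has elapsed, and None otherwise.
    """
    # Finite-difference rate of change of the constraint value.
    rate = abs(value - prev_value) / dt
    if rate > rate_threshold:
        return "instant"
    if now - last_update >= period:
        return "periodic"
    return None


# Hypothetical case: a load limit drops from 100 to 60 within 0.4 s,
# well before the 1 s acquisition period ends.
mode = update_mode(now=0.4, last_update=0.0, period=1.0,
                   prev_value=100.0, value=60.0, dt=0.4,
                   rate_threshold=50.0)
```

Checking the rate first means an abrupt change always takes the instant path, even if the periodic update happens to be due at the same moment.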
[0015] As a preferred option, in step S3, a federated learning communication protocol is used between the sub-agent and the master agent. The sub-agent only uploads decision gradient information and does not transmit raw data, ensuring data privacy and security. By incorporating the federated learning protocol into the communication between the sub-agent and the master agent, and taking the transmission of gradients as the core instead of transmitting raw data, the sub-agent only uploads decision gradient information (such as gradient values for updating model parameters) and does not transmit sensitive raw data such as device load, user load, and production orders, fundamentally avoiding the risk of data being stolen, tampered with, or misused during transmission.
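The gradient-only exchange can be illustrated with a minimal aggregation routine on the master agent side. The specification does not state how uploaded gradients are combined, so the sample-count-weighted average used here (in the style of federated averaging) is an assumption, as are the names `aggregate_gradients` and `sample_counts`.

```python
def aggregate_gradients(gradients, sample_counts=None):
    """Average per-agent gradient vectors, optionally weighted by the
    number of local samples each sub-agent trained on (assumed rule)."""
    if not gradients:
        raise ValueError("at least one gradient vector is required")
    if sample_counts is None:
        sample_counts = [1] * len(gradients)
    total = sum(sample_counts)
    aggregated = [0.0] * len(gradients[0])
    for grad, count in zip(gradients, sample_counts):
        for i, g in enumerate(grad):
            aggregated[i] += g * count / total
    return aggregated


# Hypothetical upload from two sub-agents: only gradient vectors reach
# the master agent, never the underlying load or order data.
avg = aggregate_gradients([[1.0, 2.0], [3.0, 4.0]])
```

Passing `sample_counts=[1, 3]` would weight the second sub-agent's gradient three times as heavily as the first.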
[0016] Preferably, in step S4, the prediction time domain length of the MPC module is adaptively adjusted according to the dynamic change frequency of the constraints. The higher the constraint change frequency, the shorter the prediction time domain length, so as to ensure the real-time performance of trajectory correction. The adaptive adjustment mechanism of the prediction time domain of the MPC module, through the design of dynamic matching of time domain length driven by constraint change frequency, achieves real-time performance guarantee in high dynamic constraint scenarios, improves optimization accuracy in low dynamic constraint scenarios, and enables on-demand allocation and efficient utilization of computing resources.
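A simple way to realize the inverse relationship between constraint change frequency and prediction time domain length is sketched below. The inverse-proportional rule, the reference frequency, and the bounds of 5 and 10 steps are assumptions chosen for illustration; the specification states only that a higher change frequency gives a shorter prediction time domain.

```python
def adaptive_horizon(change_freq_hz, base_horizon=10,
                     min_horizon=5, ref_freq_hz=0.1):
    """Shrink the MPC prediction horizon as constraints change faster."""
    if change_freq_hz <= ref_freq_hz:
        # Slowly changing constraints: use the full horizon.
        return base_horizon
    # Assumed rule: horizon inversely proportional to change frequency.
    scaled = int(base_horizon * ref_freq_hz / change_freq_hz)
    return max(min_horizon, min(base_horizon, scaled))


# Hypothetical frequencies, from nearly static to rapidly changing.
horizons = [adaptive_horizon(f) for f in (0.05, 0.125, 1.0)]
```

The lower bound keeps the correction meaningful when constraints change very quickly, and the upper bound caps the computation spent when they barely change.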
[0017] As a preferred option, in step S5, when the similarity between the new scene and the historical scene is lower than the preset threshold, the incremental training mode is started, and incremental learning is performed based on the historical model parameters and the new scene data. This avoids the decision delay caused by retraining the model from scratch and shortens the adaptation time for low-similarity scenes. While the features of the new scene are being learned, the general decision-making ability already acquired by the historical model is retained, which avoids the complete forgetting of historical knowledge, improves the utilization efficiency of the new scene data, and reduces data dependence.
[0018] Compared with the prior art, the beneficial effects of the present invention are: Through a layered modular architecture and a closed-loop process design, the system deeply integrates core technologies such as dynamic constraint perception, multi-agent collaboration, security optimization, and cross-scenario migration. The data acquisition layer clearly distinguishes between hard and soft constraints. Through data cleaning and standardization, it provides high-quality input for subsequent modules, avoiding garbage data that leads to garbage decisions. The dynamic constraint perception module further extracts temporal constraint features through 1DCNN, associates constraints with the environment through a multi-head self-attention mechanism, and updates the priority matrix by combining historical violation data. This enables accurate capture, dynamic weighting, and intelligent priority adjustment of high-dimensional dynamic constraints, solving the problem of indiscriminate handling of constraints and delayed response in traditional systems. It ensures that decisions always focus on key constraints, providing a clear direction for subsequent optimization. The multi-agent reinforcement learning decision-making module adopts a master-sub-agent CTDE architecture. Sub-agents achieve accurate local decisions through DDPG, while the master agent optimizes local and global weighted reward functions to balance local real-time performance and global optimality. At the same time, the sub-agents and master agents adopt a federated learning protocol, which transmits only decision gradients rather than raw data. This significantly reduces communication volume and alleviates bandwidth pressure while protecting the privacy of core data such as factory production parameters and power grid load. 
It solves the triple pain points of high latency in traditional centralized decision-making, global imbalance in purely distributed decision-making, and privacy leakage in collaborative processes, thereby improving the collaborative decision-making efficiency of large-scale multi-subsystems. The safety reinforcement learning optimization module integrates the CAC framework and MPC. It evaluates the value of actions and the risk of constraint violation through dual Critic branches and, when the risk exceeds the preset threshold, triggers MPC to generate a correction trajectory. Moreover, the MPC prediction time domain is adaptively adjusted according to the frequency of constraint changes. The cost of cross-scenario adaptation is also greatly reduced. The closed-loop process of the whole chain realizes continuous optimization of decision-making, takes into account both universality and practicality, and broadens the scope of industry applications.
Attached Figure Description
[0019] Figure 1 is a flowchart of the modules of the present invention; Figure 2 is a flowchart of steps S1 and S2 of the present invention; Figure 3 is a flowchart of step S3 of the present invention; Figure 4 is a flowchart of step S4 of the present invention; Figure 5 is a flowchart of step S5 of the present invention.
Detailed Implementation
[0020] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
[0021] Referring to Figure 1, this invention provides a reinforcement learning-driven dynamic constraint optimization decision-making system. Through the collaborative operation of its internally configured functional modules, the system executes a closed-loop dynamic constraint optimization decision-making process. The specific execution steps include: S1. Data Acquisition: Environmental status data and constraint data of the decision-making scenario are acquired in real time through sensors and database interfaces, followed by data cleaning and standardization.
[0022] In this application, the reinforcement learning-driven dynamic constraint optimization decision system includes: a data acquisition layer, used to collect dynamic environmental data and constraint data in the decision-making scenario in real time. The constraint data includes hard constraint data and soft constraint data. The hard constraint data is a set physical or system limit threshold that cannot be violated, and the soft constraint data is a constraint parameter that can be flexibly adjusted within a preset range.
[0023] To avoid overly broad protection that would lead to insufficient disclosure in the specification, the following embodiments in this application are all described using a smart factory production optimization scenario as the specific application scenario. The decision-making scenario can be understood as a real-time scheduling environment composed of factory equipment, production orders, energy systems, process cycle constraints, and a control network, which at least includes a stamping unit, welding unit, assembly unit, energy consumption monitoring unit, and manufacturing execution system; environmental status data can be understood as at least including equipment operation data, energy consumption data, and order data; constraint data can be understood as at least including equipment load limit data, production cycle range data, and total energy consumption limit data; the data preprocessing unit can be understood as an industrial data cleaning and standardization module set in the data acquisition layer, used to perform outlier removal, missing value filling, timestamp alignment, dimensional unification, and normalization processing; the constraint data includes hard constraint data and soft constraint data, wherein equipment load limit data and total energy consumption limit data are hard constraint data, and production cycle range data are soft constraint data.
[0024] Environmental status data and constraint data for the decision-making scenario are collected in real time through sensor and database interfaces. Equipment operation data is the data collected by industrial sensors in the decision-making scenario, including but not limited to load, temperature, and production cycle time. Its sampling frequency can be selected according to actual needs; for example, to balance the accuracy of capturing equipment status changes against the consumption of edge computing resources, it can be set to once per second. Energy consumption data is the instantaneous power, cumulative power consumption, and unit cycle energy consumption of each production unit, as output by smart meters, power distribution monitoring terminals, or energy consumption acquisition gateways. Its sampling frequency can likewise be selected according to actual needs; for example, to balance the accuracy of energy consumption trend monitoring against the consumption of communication bandwidth, it can be set to once every 10 seconds. Order data is the order number, product model, batch size, delivery deadline, and priority output by the enterprise resource planning system or the manufacturing execution system, and can be synchronized in real time through the enterprise resource planning system interface.
[0025] For the collected environmental status data and constraint data, the data preprocessing unit embedded in the data acquisition layer uses outlier detection, missing value imputation, time synchronization, dimensional normalization, and sliding window smoothing operations to clean and standardize multi-source heterogeneous data, thereby forming environmental status data and constraint data with unified dimensions and standardization.
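Part of this cleaning pipeline can be sketched for a single sensor channel in Python. The specification names the operations but not the methods, so forward filling, clipping by median absolute deviation, min-max scaling, and a trailing moving average are assumed choices; timestamp alignment and unit unification are omitted. All function names are hypothetical.

```python
from statistics import median


def fill_missing(xs):
    """Forward-fill None; leading gaps take the first valid sample."""
    first = next((x for x in xs if x is not None), None)
    if first is None:
        raise ValueError("no valid samples")
    out, last = [], first
    for x in xs:
        if x is not None:
            last = x
        out.append(last)
    return out


def clip_outliers(xs, k=3.0):
    """Clip samples farther than k median absolute deviations."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    if mad == 0:
        return list(xs)
    lo, hi = med - k * mad, med + k * mad
    return [min(max(x, lo), hi) for x in xs]


def min_max(xs):
    """Scale samples to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def smooth(xs, window=3):
    """Trailing moving average over up to `window` samples."""
    out = []
    for i in range(len(xs)):
        seg = xs[max(0, i - window + 1): i + 1]
        out.append(sum(seg) / len(seg))
    return out


def preprocess(xs, window=3):
    return smooth(min_max(clip_outliers(fill_missing(xs))), window)


# Hypothetical load readings with one gap and one spike.
cleaned = preprocess([10.0, None, 11.0, 9.0, 10.0, 500.0])
```

Clipping before scaling keeps a single spike from compressing the rest of the channel into a narrow band near zero.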
[0026] S2. Dynamic Constraint Awareness: Based on the attention mechanism, the weights of each constraint dimension are calculated to generate a constraint priority matrix. The weights of key constraint dimensions are dynamically adjusted through historical violation data and real-time environmental features.
[0027] In this application, the reinforcement learning-driven dynamic constraint optimization decision system includes: a dynamic constraint perception module, which is used to extract constraint temporal features based on data, combine attention mechanism and historical violation records, generate and update constraint priority matrix in real time (updated immediately when constraint changes), and clarify the direction of key constraints for decision-making.
[0028] Among them, constraint temporal features can be understood as a set of features characterizing the changing trends, fluctuation amplitudes, rates of change, and periodicity of various constraints over multiple consecutive sampling periods; attention weights can be understood as normalized coefficients used to quantify the importance of each constraint dimension under the current environmental state; constraint priority matrix can be understood as a two-dimensional or multi-dimensional weight matrix constructed according to constraint category, constraint dimension, and current priority level; historical violation data can be understood as statistical data formed by recording the number of constraint violations, duration of violations, and magnitude of violations during historical operation; constraint feature extraction unit can be understood as a convolutional network module used to encode temporal constraint data into feature vectors that can be processed by neural networks; attention weight calculation unit can be understood as an attention calculation module used to establish the correlation between environmental state and constraint dimensions; and constraint priority update unit can be understood as a rule update module used to fuse real-time weights and historical risk information and output a priority matrix.
[0029] For the execution logic that calculates the weights of each constraint dimension and generates the constraint priority matrix based on the attention mechanism, for the received dynamic environment data and constraint data after preprocessing in step S1, the dynamic constraint perception module, including the constraint feature extraction unit, first uses a 1D convolutional neural network (CNN) to extract features from the temporal constraint data and outputs constraint feature vectors.
[0030] For time-series constraint data, such as receiving dimensionally unified and standardized constraint data formed by S1, where the time-series constraint data is constraint data that changes over time, in this application, the time-series constraint data specifically includes continuously collected equipment load limits, real-time equipment load, production cycle limits, total factory energy consumption thresholds, and corresponding constraint change rate sequences. The tensor dimension of the time-series constraint data input to the neural network can be selected according to actual needs. For example, considering both short-term dynamic response and local pattern extraction feature representation, the time sliding window and feature dimension of the input time-series data tensor can be set to 10×N, where 10 represents 10 consecutive sampling times, and N represents the number of constraint feature dimensions. The structure of the 1D convolutional neural network can be selected according to actual needs. For example, considering the extraction of local dynamic features, the 1D convolutional neural network can be set to two convolutional layers, with kernel sizes of 3 and 3 respectively. The 1D convolutional neural network extracts local dynamic features along the time dimension and outputs a constraint feature vector that integrates time-series patterns.
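As a rough illustration of the example configuration (a 10×N input window and two convolutional layers with kernel size 3), the following sketch runs a plain-Python 1D convolution along the time axis. The channel counts, the ReLU activation, the final average pooling, and the random weights standing in for trained parameters are all assumptions made for the example.

```python
import random

def conv1d(x, kernels, bias):
    """Valid 1D convolution along time with ReLU.
    x: T rows of C_in channels; kernels: [C_out][K][C_in]; returns T-K+1 rows of C_out."""
    k = len(kernels[0])
    c_in = len(x[0])
    out = []
    for t in range(len(x) - k + 1):
        row = []
        for w, b in zip(kernels, bias):
            s = b + sum(w[j][c] * x[t + j][c] for j in range(k) for c in range(c_in))
            row.append(max(0.0, s))
        out.append(row)
    return out

def init_kernels(rng, c_out, k, c_in):
    """Random placeholder weights; a real system would load trained parameters."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(c_in)] for _ in range(k)]
            for _ in range(c_out)]

def extract_constraint_features(window, rng, hidden=8, out_dim=16):
    n = len(window[0])
    h1 = conv1d(window, init_kernels(rng, hidden, 3, n), [0.0] * hidden)      # 10 -> 8 steps
    h2 = conv1d(h1, init_kernels(rng, out_dim, 3, hidden), [0.0] * out_dim)   # 8 -> 6 steps
    # Average over the remaining time steps to get one fixed-length feature vector.
    return [sum(row[c] for row in h2) / len(h2) for c in range(out_dim)]

rng = random.Random(0)
window = [[rng.random() for _ in range(5)] for _ in range(10)]  # 10 samples, N = 5
features = extract_constraint_features(window, rng)
```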
[0031] For the extracted constraint feature vector, the dynamic constraint perception module includes an attention weight calculation unit. The attention weight calculation unit is based on a multi-head self-attention mechanism and uses the constraint feature vector and the environment state vector as input to calculate the attention weight of each constraint dimension.
[0032] For the mapping between constraint feature vectors and environment state vectors in the multi-head self-attention mechanism, linear projections can be used to map the environment state vector to a query matrix Q and the constraint feature vector to a key matrix K and a value matrix V. The number of attention heads can be selected according to actual needs. For example, to improve the ability to capture multi-dimensional features in parallel, four attention heads can be set. The correlation between the two is calculated with scaled dot-product attention, which yields the attention weight of each constraint dimension. The higher the correlation, the greater the weight of the corresponding constraint, thereby realizing dynamic matching between constraints and the environment.
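A minimal sketch of this weight calculation follows, assuming one feature vector per constraint dimension and a single query derived from the environment state. The projection matrices `w_q` and `w_k`, the model width of 8, and the averaging of the four heads are assumptions for illustration; the value projection is left out because only the weights are needed here.

```python
import math
import random

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention_weights(env_state, constraint_feats, w_q, w_k, heads=4):
    """Return one weight per constraint dimension, averaged over the heads."""
    q = matvec(w_q, env_state)                          # query from environment state
    keys = [matvec(w_k, f) for f in constraint_feats]   # one key per constraint
    d_head = len(q) // heads
    avg = [0.0] * len(constraint_feats)
    for h in range(heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # Scaled dot-product score between the query and each key for this head.
        scores = [sum(a * b for a, b in zip(q[sl], k[sl])) / math.sqrt(d_head)
                  for k in keys]
        for i, w in enumerate(softmax(scores)):
            avg[i] += w / heads
    return avg

rng = random.Random(0)
d_model, d_env, d_feat = 8, 6, 16
w_q = [[rng.uniform(-1, 1) for _ in range(d_env)] for _ in range(d_model)]
w_k = [[rng.uniform(-1, 1) for _ in range(d_feat)] for _ in range(d_model)]
env_state = [rng.random() for _ in range(d_env)]
constraint_feats = [[rng.random() for _ in range(d_feat)] for _ in range(3)]
weights = attention_weights(env_state, constraint_feats, w_q, w_k)
```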
[0033] For the attention weights of each constraint dimension calculated, the dynamic constraint perception module includes a constraint priority update unit, which dynamically adjusts the constraint priority matrix based on the attention weights and the constraint violation history.
[0034] Historical constraint violation records also serve as a basis for the weight calculation. The violation history is quantified by counting, for each constraint dimension within a preset time window, the number of violations, the average violation magnitude, and the cumulative violation duration, and normalizing these statistics into a historical risk coefficient. The attention weights are then fused with the historical risk coefficients, for example by weighted summation followed by Softmax normalization. In addition, the base weights of hard constraints are guaranteed not to fall below a preset threshold; this is enforced in the implementation by a lower-bound truncation (clamping) operation. Finally, a priority matrix focusing on key constraints is output and sent to the multi-agent decision-making module in step S3, providing key constraint directions for agent decision-making.
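The fusion step can be sketched as follows. The normalization caps, the mixing factor `lam`, and the floor value of 0.2 are hypothetical values chosen for the example, and the sketch leaves the clamped weights unnormalized so that the floor holds exactly.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def risk_coefficient(count, avg_magnitude, duration_s, caps=(10.0, 1.0, 60.0)):
    """Scale each violation statistic to [0, 1] and average them (caps are assumed)."""
    stats = (count, avg_magnitude, duration_s)
    return sum(min(s / c, 1.0) for s, c in zip(stats, caps)) / len(stats)

def fuse_priorities(attn, risk, is_hard, lam=0.6, floor=0.2):
    """Weighted sum, Softmax, then a lower bound on hard-constraint weights."""
    fused = softmax([lam * a + (1.0 - lam) * r for a, r in zip(attn, risk)])
    return [max(w, floor) if hard else w for w, hard in zip(fused, is_hard)]

# Four constraint dimensions; the first and last are treated as hard constraints.
attn = [0.1, 0.5, 0.3, 0.1]
risk = [
    risk_coefficient(0, 0.0, 0.0),
    risk_coefficient(9, 0.9, 54.0),
    risk_coefficient(2, 0.2, 12.0),
    risk_coefficient(1, 0.1, 6.0),
]
is_hard = [True, False, False, True]
priority = fuse_priorities(attn, risk, is_hard)
```

In this example the first hard constraint would fall slightly below the floor after Softmax and is raised to it, while the second dimension, which has both the highest attention weight and the worst violation history, receives the largest priority.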
[0035] Regarding the update triggering mechanism of the constraint priority matrix, in step S2 the update cycle of the matrix is synchronized with the data acquisition cycle. In normal periodic operation, the data acquisition cycle serves as the fixed update cycle. For example, if data is acquired once per second, the matrix is also updated once per second. Within each cycle, the weights of each constraint dimension are calculated by combining real-time environmental features and historical violation records through the attention mechanism, and the constraint priority matrix is generated and updated. For real-time detection of abrupt constraint changes, the rate of change of the constraint conditions is monitored continuously and compared with a preset threshold. To avoid interference from high-frequency noise in the industrial environment, the rate of change is computed by combining moving average filtering with first-order differencing. For immediate updates, if the constraint change rate exceeds the threshold, for example when a sudden failure occurs and the equipment load change rate exceeds 15% per second, the normal cycle wait is interrupted at once and an immediate update of the priority matrix is triggered. The weight calculation and matrix update are then re-run without waiting for the end of the acquisition cycle, producing the immediately updated constraint priority matrix.
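The abrupt-change detection can be illustrated with a small detector that smooths the signal with a moving average and then takes a first-order difference. The class name, the window of 3 samples, and the use of a relative change rate are assumptions for this sketch; only the 15% per second threshold comes from the example above.

```python
from collections import deque

class MutationDetector:
    """Flags an immediate update when the smoothed relative change rate is too high."""

    def __init__(self, window=3, threshold=0.15, period_s=1.0):
        self.buf = deque(maxlen=window)
        self.prev_avg = None
        self.threshold = threshold
        self.period_s = period_s

    def update(self, value):
        self.buf.append(value)
        avg = sum(self.buf) / len(self.buf)      # moving average filter
        triggered = False
        if self.prev_avg:                        # skip first sample and zero baseline
            # First-order difference of the smoothed signal, as a fraction per second.
            rate = abs(avg - self.prev_avg) / abs(self.prev_avg) / self.period_s
            triggered = rate > self.threshold
        self.prev_avg = avg
        return triggered

detector = MutationDetector()
loads = [60.0, 61.0, 60.0, 62.0, 95.0, 96.0]     # hypothetical load samples, 1 s apart
flags = [detector.update(v) for v in loads]
```

With these samples the small fluctuations in the first four readings stay below the threshold, while the jump to 95 pushes the smoothed change rate above 15% and would interrupt the normal one-second cycle.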
[0036] S3. Multi-agent distributed decision-making: Sub-agents make local optimization decisions based on local environmental data and sub-constraints, and feed the decision results back to the master agent; the master agent integrates global constraints and the decision results of each sub-agent, and generates a global optimization decision through a centralized reinforcement learning algorithm.
[0037] In this application, the reinforcement learning-driven dynamic constraint optimization decision system includes a multi-agent reinforcement learning decision module, which adopts a distributed deep reinforcement learning architecture, specifically comprising a master agent and several sub-agents. For the underlying control topology of this multi-agent system, the master agent employs a centralized training and distributed execution architecture.
[0038] In this context, a sub-agent can be understood as a local optimization decision-making unit deployed according to the physical boundaries of a subsystem, such as a production line or substation. In the smart factory production optimization scenario, it is divided into 6 sub-agents according to the production line, specifically covering 3 stamping sub-agents, 2 welding sub-agents, and 1 assembly sub-agent. The master agent can be understood as the central hub that coordinates the overall goal, and it is usually deployed on the edge computing nodes of the factory. The local state input vector can be understood as a feature vector formed by splicing the equipment load, equipment temperature, current order quantity, production cycle, sub-constraint priority weights, and local energy consumption level of the corresponding production line. The global state vector can be understood as a fused feature vector composed of the local feature summaries uploaded by each sub-agent, the total energy consumption of the factory, the global order delivery pressure, the global constraint priority matrix, and the equipment health status.
[0039] In the local decision-making phase of the sub-agent, the sub-agent completes local optimization based on local data and the sub-constraint priority matrix output from step S2. The sub-agent employs the DDPG algorithm. For the input of local decision-making, local data is collected and combined with the sub-constraint priority matrix to generate the local state input vector of the algorithm. The construction process of this input vector involves tensor concatenation using feature splicing and normalization encoding operations to construct a state input vector that meets the network dimension requirements. Its state space specifically includes local environmental features such as production line equipment load, current order quantity, and production cycle time. Based on the above state vector, the sub-agent generates and executes local actions using the DDPG algorithm. Its action space includes equipment operating power adjustment and production batch adjustment. For the boundary constraints of this continuous action output space, a tanh activation function is used to map the network output to the actual physical adjustment range of the equipment. After the action is executed, the sub-agent obtains a local reward and calculates the decision gradient of the model parameters based on the local reward feedback and the algorithm loss function. For the calculation logic of this local decision gradient, the mean square Bellman error formula is used as the loss function of the Critic network, and the policy gradient of the Actor network is calculated based on the deterministic policy gradient theorem, thereby transforming the business trial and error process into an executable gradient differentiation process.
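Two pieces of this paragraph lend themselves to a short sketch: mapping the Actor output into the physical adjustment range with tanh, and the mean squared Bellman error used as the Critic loss. The function names, the discount factor, and the toy callables used in the demo are assumptions; a real implementation would use neural networks and automatic differentiation.

```python
import math

def scale_action(raw, low, high):
    """Squash an unbounded Actor output with tanh and map it to [low, high]."""
    return low + (math.tanh(raw) + 1.0) * 0.5 * (high - low)

def critic_loss(batch, q, q_target, actor_target, gamma=0.99):
    """Mean squared Bellman error over transitions (s, a, r, s_next)."""
    errors = []
    for s, a, r, s_next in batch:
        # Bootstrapped target from the target Critic and target Actor.
        y = r + gamma * q_target(s_next, actor_target(s_next))
        errors.append((q(s, a) - y) ** 2)
    return sum(errors) / len(errors)

# Hypothetical power adjustment range of 20 to 80 kW.
power_kw = scale_action(0.3, low=20.0, high=80.0)

# Toy transition and placeholder networks that always output zero.
batch = [((0.5,), (0.1,), 1.0, (0.6,))]
loss = critic_loss(
    batch,
    q=lambda s, a: 0.0,
    q_target=lambda s, a: 0.0,
    actor_target=lambda s: (0.0,),
)
```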
[0040] In the sub-agent local decision-making and gradient uploading stages, to address the triple challenges of high computational latency in traditional centralized decision-making, global imbalance in purely distributed decision-making, and privacy leaks during collaborative processes, a federated learning communication protocol is adopted between the sub-agent and the master agent. Specifically, the FedProx algorithm is used for parameter aggregation. The sub-agent only uploads the calculated decision gradient information to the master agent, without transmitting sensitive raw data such as equipment load and process parameters, thus reducing communication bandwidth requirements while ensuring data privacy and security. Regarding the gradient uploading frequency, the sub-agent is set to upload the gradient once every 5 decision steps.
[0041] During the centralized training and global optimization phase, the master agent integrates plant-wide global constraints with information from each sub-agent to generate a global optimization decision. The master agent receives gradients from all sub-agents and combines them with global constraints and global environment data to construct a global state vector. The global state vector is constructed by feature-level concatenation of the multi-source heterogeneous global features, followed by compression through a fully connected embedding layer. The master agent then optimizes the global policy using the Advantage Actor-Critic (A2C) algorithm. Its optimization objective is a global reward function comprising a weighted sum of the local rewards from the sub-agents and the reward for satisfying global constraints. The specific formula is set as: Global reward function = 0.5 × sum of production efficiency rewards from each sub-agent + 0.5 × total energy consumption constraint satisfaction reward. The total energy consumption constraint satisfaction reward is quantified by a piecewise penalty function that continuously maps deviations above or below the energy consumption threshold into reward scores.
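The global reward formula can be sketched directly. The 0.5 weights come from the formula above; the shape of the piecewise function (a reward that grows with slack below the threshold and a steeper penalty above it, meeting at zero on the threshold) and the `margin` and `penalty_slope` values are assumptions for illustration.

```python
def energy_reward(energy, threshold, margin=0.2, penalty_slope=5.0):
    """Piecewise, continuous mapping from energy deviation to a score in [-1, 1]."""
    dev = (energy - threshold) / threshold
    if dev <= 0.0:
        # Below the threshold: reward grows with slack and saturates at 1.
        return min(1.0, -dev / margin)
    # Above the threshold: penalty grows with the excess and saturates at -1.
    return max(-1.0, -penalty_slope * dev)

def global_reward(efficiency_rewards, energy, threshold):
    """0.5 x sum of sub-agent efficiency rewards + 0.5 x energy constraint reward."""
    return 0.5 * sum(efficiency_rewards) + 0.5 * energy_reward(energy, threshold)

# Hypothetical values: three sub-agent rewards, 450 units consumed against a limit of 500.
reward = global_reward([0.8, 0.6, 0.7], energy=450.0, threshold=500.0)
```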
[0042] Regarding the parameter update and distribution mechanism for the global policy, the master agent is configured to update the global policy parameters every 10 decision steps. In each global update, the parameters of the Actor and Critic networks are iterated by calculating the advantage function A(s,a), the policy gradient term, and the value function loss. After the update, the master agent distributes the corresponding local parameter subsets to the sub-agents, guiding each sub-agent to update its local policy network. Upon receiving the updated parameters, the sub-agents adjust their local policies and repeat the perception, decision-making, and gradient uploading process. Over multiple iterations the global reward converges, forming a collaborative closed loop of local decision-making, gradient uploading, global optimization, and parameter updates that balances local real-time performance against global optimization.
[0043] S4. Safety Constraint Verification and Optimization: The constraint satisfaction status of decision actions is predicted through the CAC framework. If there is a risk of constraint violation, the MPC module is triggered to generate a correction trajectory and adjust the decision actions to meet the constraint requirements.
[0044] In this application, the reinforcement learning-driven dynamic constraint optimization decision system further includes: a safe reinforcement learning optimization module, which integrates the CAC framework and the MPC module, to detect the constraint satisfaction status in real time during the decision-making process, and to trigger model predictive control to perform local trajectory correction when the predicted constraint may be violated, thereby realizing a closed-loop process and a safety guarantee.
[0045] In this context, the CAC framework can be understood as a constraint Actor-Critic safe learning framework that simultaneously considers maximizing cumulative rewards and minimizing constraint costs during reinforcement learning training and execution; the Double Critic branch can be understood as a value assessment branch and a risk assessment branch set up in parallel within the same Critic network; constraint violation risk estimation can be understood as a numerical representation of the probability of triggering hard constraints or severe soft constraints violations within a few decision steps in the future for the current action; the MPC module can be understood as a model predictive control module that solves the control quantity correction sequence in real time based on the current state, prediction model, and rolling optimization objective; the system dynamics model can be understood as a physical transmission model that includes at least the relationship between equipment energy consumption and load; the prediction time domain can be understood as the discrete decision step length predicted by MPC to the future in rolling optimization; and the short-term optimization trajectory can be understood as the continuous control action correction sequence generated within this prediction time domain.
[0046] During safety constraint verification, the CAC framework receives the initial decision action generated by the master agent in step S3 and analyzes it together with the current environmental state and the constraints, including both hard and soft constraints. Through the Double Critic branch within the Critic network of this framework, the value estimate of the decision action and the constraint violation risk estimate are calculated and output in parallel. The structural parameters of this Critic network can be set according to the state dimension of the actual industrial scenario. For example, to balance the capacity to represent nonlinear relationships against online inference efficiency, the hidden layers of the Critic network can be set to 128 and 64 neurons respectively. The value estimate of the decision action is computed with the state-action value function Q(s,a), which gives the expected value of the state-action pair. For the quantification and training of the constraint violation risk estimate, a risk cost function is built from a constraint cost function, transforming the physical constraint boundary into a differentiable risk probability or expected penalty output.
[0047] Regarding the risk threshold judgment logic, the system compares the constraint violation risk estimate output by the Double Critic branch with a preset threshold in real time; for example, the constraint violation risk threshold may be set to 80% or 85%. If the risk does not exceed the threshold, the initial decision action is determined to lie within the safe physical boundary and passes directly to the subsequent S5 or S6 output stage. If the risk exceeds the threshold, for example when, in the smart factory production optimization scenario, the equipment load of a welding sub-agent is predicted to exceed its upper limit and the risk estimate reaches 88%, the system uses this as the trigger condition to immediately wake up the MPC module and, at the same time, collect the dynamic change frequency of the constraint conditions.
[0048] The triggered MPC module, based on the current environmental state, dynamic constraints, and pre-defined system dynamics models such as the relationship between equipment energy consumption and load, initiates rolling optimization to generate a short-term optimization trajectory. The core mathematical equation for this module, used for state derivation, is the system dynamics state-space equation x(k+1)=Ax(k)+Bu(k)+w(k), where x(k) represents the system state at time k, u(k) represents the control input, and w(k) represents the disturbance term. Its rolling optimization objective function, controlled by hard constraint boundaries, is J=Σ[α·energy consumption deviation²+β·cycle deviation²+γ·control increment²], satisfying constraints on equipment load, total energy consumption, and cycle boundary. During this correction process, to balance computation time and optimization accuracy, the prediction time domain length of this module is not a fixed constant but is adaptively adjusted according to the dynamic change frequency of the constraints. The execution rules for adaptive time-domain adjustment are as follows: the higher the frequency of constraint changes, the shorter the prediction time domain length, in order to reduce computation time and ensure the real-time performance of trajectory correction; the lower the frequency of change, the longer the time domain is to improve optimization accuracy. Specifically, if the system determines that the device is in a normal state, the prediction time domain is set to 8 steps, corresponding to an 8-second time period; when the device is in an abnormal state or the constraint changes frequently, the prediction time domain is adaptively shortened to 4 steps, corresponding to a 4-second time period.
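The adaptive horizon and the rolling objective can be sketched with a toy two-state model in which x holds an energy deviation and a cycle deviation. Everything numeric here is an assumption: the matrices A and B, the weights, the frequency limit, and the coarse grid search over a constant control, which stands in for a real constrained optimizer. The disturbance w(k) is taken as zero, and the hard constraints are represented only by the control bounds.

```python
def prediction_horizon(abnormal, change_freq_hz, freq_limit_hz=0.5):
    """4 steps when the device is abnormal or constraints change often, else 8."""
    return 4 if abnormal or change_freq_hz > freq_limit_hz else 8

def step(x, u, A, B):
    """One step of x(k+1) = A x(k) + B u(k), with the disturbance assumed zero."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + B[i] * u for i in range(n)]

def cost(x0, u_prev, u_seq, A, B, ref, alpha=1.0, beta=1.0, gamma=0.1):
    """J = sum(alpha*energy_dev^2 + beta*cycle_dev^2 + gamma*control_increment^2)."""
    x, total = x0, 0.0
    for u in u_seq:
        x = step(x, u, A, B)
        total += alpha * (x[0] - ref[0]) ** 2 + beta * (x[1] - ref[1]) ** 2
        total += gamma * (u - u_prev) ** 2
        u_prev = u
    return total

def correct(x0, u_prev, A, B, ref, n_steps, u_min, u_max, grid=21):
    """Pick the constant control within [u_min, u_max] that minimizes J."""
    candidates = [u_min + i * (u_max - u_min) / (grid - 1) for i in range(grid)]
    best = min(candidates, key=lambda u: cost(x0, u_prev, [u] * n_steps, A, B, ref))
    return [best] * n_steps

A = [[0.9, 0.0], [0.0, 0.9]]
B = [0.5, 0.2]
x0, ref = [1.2, 0.6], [1.0, 1.0]
n = prediction_horizon(abnormal=True, change_freq_hz=0.1)
trajectory = correct(x0, 0.0, A, B, ref, n, u_min=-1.0, u_max=1.0)
```

Shortening the horizon from 8 to 4 steps halves the number of model evaluations per candidate in this sketch, which mirrors the trade-off between computation time and optimization accuracy described above.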
[0049] Based on the adjusted prediction time domain and the current state, the MPC module uses the system dynamics model to extrapolate forward and generate a short-term optimized trajectory, such as an action adjustment scheme for the next 4 or 5-10 decision steps. The system gives priority to this trajectory, which overrides and replaces the high-risk original decision action output by the master agent. For example, the generated corrected trajectory instruction is to reduce the welding current by 10% and adjust the production cycle to 1.8 pieces / minute. According to the internal model extrapolation, this corrective scheme reduces the risk of an equipment load constraint violation from 88% to 9%. This dynamics-model-based trajectory correction ensures, to the greatest extent possible, that the corrected decision actions satisfy the physical constraints of the industrial site. The corrected actions are then passed to the output stages of steps S5 and S6, or the correction error is fed back to the master agent to optimize its policy, thereby providing a safety guarantee for the reinforcement learning trial-and-error mechanism before execution on physical equipment.
[0050] S5. Scene Adaptation and Model Update: When the scene changes, the transfer learning adaptation module calls the historical experience model and combines a small number of new scene samples to quickly update the model parameters, effectively ensuring that the decision-making ability adapts to the constraints of the new scene.
[0051] In this application, the reinforcement learning-driven dynamic constraint optimization decision system includes: a transfer learning adaptation module, which is used to quickly update the model parameters by calling historical experience models and combining a small number of new scenario samples when the scenario changes, so that the decision model adapts to the new scenario constraints.
[0052] Here, a change in the scenario refers to a change in the external operating conditions faced by the system, and is the precondition for adaptation. A change in the decision-making scenario can be understood as a significant change in at least one of the following: order structure, equipment health status, production cycle requirements, or energy supply conditions. Key indicators are the basis on which the system perceives changes in the environment, and can be understood as order batch size, product changeover frequency, number of faulty devices, average equipment load rate, energy consumption per unit output, and order delivery urgency. To enable the underlying algorithm to quantify changes in these key indicators, a scenario feature vector is introduced. The scenario feature vector can be understood as a set of key indicators representing the operating conditions of the current new scenario, which at least includes order batch data and faulty device data. The historical experience model can be understood as the parameters of a reinforcement learning model, pre-stored in the meta-model storage unit, that converged during training in a particular historical scenario; a small number of new scene samples can be understood as task support set data collected through interaction with the real environment or a digital twin environment for model fine-tuning. As for the state features, decision actions, and feedback rewards contained in this support set data, the data can be understood as a set of quadruplets, each consisting of a state vector s, an action vector a, a reward value r, and the next state s'; the scene similarity calculation unit can be understood as a matching module that calculates the angular similarity between the feature vectors of new and old scenes; the fast adaptation unit can be understood as a meta-learning module that calls similar historical models and performs fast fine-tuning with a small number of samples; and the incremental training unit can be understood as an online training module that continuously receives new data streams and updates model parameters in low-similarity scenes.
[0053] The transfer learning adaptation module extracts key metrics of the current environment to generate new scene feature vectors. For scenarios where the scene changes, such as when a small-batch, multi-variety order scenario is initiated, the extracted key metrics must at least cover an order batch size of 50 pieces / variety and one faulty device. To transform these discrete business key metrics into continuous scene feature vectors, numerical normalization, one-hot encoding by category, and feature concatenation operations are performed for feature encoding and tensor construction, thereby generating scene feature vectors that conform to the underlying algorithm dimensions.
[0054] The scene similarity calculation unit embedded in the transfer learning adaptation module uses the cosine similarity algorithm to calculate the similarity between the new scene feature vector and the feature vectors of the historical scenes pre-stored in the experience base. Regarding the storage and retrieval of historical experience models, the historical scene experience base pre-archives at least four types of experience models: large-volume order scenarios, small-volume multi-variety order scenarios, normal equipment scenarios, and equipment failure scenarios. Before the cosine similarity is calculated, Z-score standardization is applied so that differences in the scales of the key indicators do not distort the feature matching accuracy.
[0055] A rapid adaptation mechanism for highly similar scenes is triggered by a rapid adaptation unit. The trigger condition for this mechanism is to compare the calculated similarity value with a preset threshold, such as 75%. If the similarity is higher than or equal to this threshold, the rapid adaptation mechanism is triggered. For example, if the calculated similarity between the current new scene vector and historical small-batch multi-variety scenes reaches 82%, the system directly calls this historical experience model as the initial parameters of the multi-agent decision network. For the execution logic of rapidly updating model parameters using a small number of new scene samples, the MAML algorithm in meta-reinforcement learning is used for fine-tuning the underlying parameters. For the acquisition and training application of a small number of new scene samples, for example, 80 sets of interactive sample data are collected through environmental interaction as a task support set, and 1 to 5 rounds of interactive sample training are performed within a short physical time period, such as 3 minutes, to complete the rapid iteration of model parameters. Regarding the definition of the underlying constraint penalty loss function for the inner loop update of this MAML algorithm and the setting parameters of the network learning rate, gradient descent update operation with constraint penalty terms is performed to promote rapid network convergence.
[0056] The incremental training mode is triggered by the incremental training unit to steadily update the underlying network parameters. For handling low-similarity scenarios, if the computational similarity between the new scenario and all historical scenarios is below the preset threshold of 75%, the system switches to incremental training mode via the incremental training unit. The specific execution logic of this incremental learning is based on the parameters of the historical experience model with the highest relative similarity and the continuously generated data stream of the new scenario, ensuring the steady construction of decision generalization ability in the new scenario. The specific data construction mechanism for this continuously generated data stream includes a method of continuously writing state-action-reward-next state quadruple cache queues in chronological order to meet the continuous input requirements of incremental learning. Regarding the network layer weight freezing and underlying computing power scheduling mechanism in this incremental training mode, the underlying network iteration control is specifically achieved by freezing the front-end feature extraction layer and only updating the parameters of the policy output layer and value evaluation layer.
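The scene matching and mode selection described in the preceding paragraphs can be sketched as follows. The feature layout (batch size, changeover frequency, faulty devices, load rate), the scenario names, and all numeric values are hypothetical; only the Z-score standardization, the cosine similarity, and the 75% threshold are taken from the text.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each feature column so that scale differences do not dominate."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_mode(new_vec, history, threshold=0.75):
    """Return the best-matching historical scene, its similarity, and the update mode."""
    names = list(history)
    z = zscore_columns([history[n] for n in names] + [new_vec])
    sims = {n: cosine(z[i], z[-1]) for i, n in enumerate(names)}
    best = max(sims, key=sims.get)
    mode = "fast_adaptation" if sims[best] >= threshold else "incremental_training"
    return best, sims[best], mode

# Features: [batch size, changeover frequency, faulty devices, average load rate]
history = {
    "large_batch": [500.0, 1.0, 0.0, 0.85],
    "small_batch_multi_variety": [60.0, 8.0, 1.0, 0.70],
    "normal_equipment": [200.0, 3.0, 0.0, 0.80],
    "equipment_failure": [200.0, 3.0, 2.0, 0.55],
}
new_scene = [50.0, 9.0, 1.0, 0.65]
best, similarity, mode = select_mode(new_scene, history)
```

In either mode the sketch returns the most similar historical scene, since the incremental training path also starts from the parameters of the historical model with the highest relative similarity.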
[0057] S6. Decision Output: The optimized decision instructions are converted into signals that can be recognized by the execution terminal. At the same time, the decision process data and model parameters are stored in the historical database to provide data support for subsequent transfer learning.
[0058] In this application, the reinforcement learning-driven dynamic constraint optimization decision system includes: a decision output layer, which is used to convert the optimized decision instructions into signals that can be recognized by the execution terminal, and store decision process data and model parameters in a historical database to provide data support for subsequent model evolution and transfer learning.
[0059] Among them, the decision command is the optimal control action obtained from the algorithm model, and can be understood as dimensionless tensor data, such as power adjustments and batch adjustments, output after the aforementioned multi-agent global optimization or MPC module correction; the execution terminal is the physical work entity at the bottom layer of the smart factory, and can be understood as the various production and processing equipment such as stamping machines, welding robots, and assembly lines; the signal is the low-level communication command that drives the operation of the above hardware, and can be understood as an electrical control signal that can be directly recognized and executed by an industrial programmable logic controller (PLC); the decision process data is a complete slice record of the system's operation within a single time step, and can be understood as a multi-dimensional dataset containing a complete loop of environmental state characteristics, specific execution actions, and environmental feedback reward values; model parameters are the core elements recording the current learning experience of the neural network, and can be understood as the set of node weights for each layer of the deep reinforcement learning network that has converged during training; the historical database is the persistent storage hub of the system, and can be understood as the enterprise's manufacturing execution system and the dedicated meta-model storage unit; the protocol conversion unit can be understood as an industrial protocol adaptation module used to convert algorithm action instructions into PLC or industrial bus control messages; and the data persistence module can be understood as a storage module used to serialize the operation logs, sample data, and model parameters and write them into the historical database.
[0060] The optimized decision instructions are converted into signals recognizable by the execution terminals through the protocol conversion unit of the decision output layer. To map decision instructions from the algorithm layer to physical quantities on the lower-level machines, the system maps the floating-point action tensors output by reinforcement learning to the actual physical setpoints of each execution terminal. For tensor decoding and dimensional recovery, the action feature vectors output by the network are inverse-normalized and clipped to their boundaries, generating safe and compliant physical setpoints. To send these physical setpoints to the lower-level controllers, high-frequency deterministic communication connections are established using industrial communication protocols such as Modbus TCP, Profinet, or EtherCAT, thereby directly driving the production line equipment to perform the corresponding physical actions such as power adjustment or batch switching.
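The decoding step before transmission can be sketched as a per-terminal mapping from normalized actions to setpoints. The terminal names, ranges, and the dictionary layout are hypothetical, the action range of [-1, 1] is an assumption, and the actual protocol write (Modbus TCP, Profinet, or EtherCAT) is deliberately left out.

```python
def to_setpoint(action, low, high, safe_low=None, safe_high=None):
    """Inverse-normalize an action in [-1, 1] to [low, high], then clip to the safe range."""
    a = max(-1.0, min(1.0, action))
    value = low + (a + 1.0) * 0.5 * (high - low)
    lo = low if safe_low is None else safe_low
    hi = high if safe_high is None else safe_high
    return max(lo, min(hi, value))

# Hypothetical terminals: (physical low, physical high, safe high).
terminals = {
    "stamping_power_kw": (0.0, 100.0, 90.0),
    "welding_current_a": (50.0, 250.0, 220.0),
}
actions = {"stamping_power_kw": 0.95, "welding_current_a": -0.2}

setpoints = {
    name: to_setpoint(actions[name], low, high, safe_high=safe)
    for name, (low, high, safe) in terminals.items()
}
```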
[0061] The data persistence module embedded in the decision output layer synchronously stores decision process data and model parameters in the historical database. While the control signals drive the equipment, the system writes the current time step's decision process data and the currently converged model parameters to the relevant enterprise systems and a dedicated historical scenario database. To store this massive heterogeneous operational data and the high-dimensional network weights efficiently, the data is serialized, packaged, and compressed according to data structure standards such as JSON or Protocol Buffers, ensuring the integrity of system operational data and efficient high-concurrency access.
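As an illustration of the serialize, package, and compress step, the following sketch uses JSON with zlib compression from the Python standard library. The record layout and the function names `pack_record` and `unpack_record` are assumptions made for this example; the text does not specify a schema or a compression codec.

```python
import json
import zlib


def pack_record(step, state, action, reward, weights):
    """Serialize one time-step record to compact JSON, then compress it."""
    record = {
        "step": step,        # time-step index
        "state": state,      # environment state features
        "action": action,    # executed action
        "reward": reward,    # environment feedback
        "weights": weights,  # model parameters as nested lists (hypothetical layout)
    }
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload)


def unpack_record(blob):
    """Inverse of pack_record: decompress and parse the JSON payload."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


blob = pack_record(3, [0.95, 480.0], [-0.05], 1.25, [[0.1, 0.2], [0.3, 0.4]])
restored = unpack_record(blob)
```

For large weight tensors a binary format such as Protocol Buffers would normally be more compact than JSON; JSON is used here only because it needs no external dependency.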
[0062] The aforementioned closed-loop storage operation provides direct data support for subsequent transfer learning. The feature data and network weights stored in the historical database serve directly as the historical experience base and training sample source for the aforementioned transfer learning adaptation module, enabling scene feature extraction, scene similarity calculation, and rapid fine-tuning through meta-reinforcement learning. To periodically clean the large amount of redundant data that accumulates in this historical experience database over time while screening for high-value samples, the system eliminates samples based on time decay and retains high-value samples based on constraint violation sensitivity, thereby adaptively maintaining database storage space and preserving high-value core experience. At the system architecture level, this closes the entire evolutionary loop of dynamic perception, collaborative decision-making, safety correction, execution output, and experience accumulation.
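One possible way to combine time-decay elimination with violation-sensitivity retention is a single retention score per sample. The scoring formula below (exponential half-life decay plus a weighted violation term), the half-life, the gain, and the sample fields `t` and `violation` are all assumptions for illustration; the text does not give a formula.

```python
def retention_score(age_s, violation_sensitivity, half_life_s=86400.0, violation_gain=1.0):
    """Hypothetical score: recency decays with a half-life, violation relevance adds value."""
    decay = 0.5 ** (age_s / half_life_s)
    return decay + violation_gain * violation_sensitivity


def prune_samples(samples, now_s, capacity):
    """Keep the `capacity` samples with the highest retention score."""
    ranked = sorted(
        samples,
        key=lambda s: retention_score(now_s - s["t"], s["violation"]),
        reverse=True,
    )
    return ranked[:capacity]


samples = [
    {"id": "A", "t": 0.0, "violation": 0.9},       # old, but close to a constraint violation
    {"id": "B", "t": 172800.0, "violation": 0.0},  # fresh, uneventful
    {"id": "C", "t": 0.0, "violation": 0.0},       # old and uneventful
]
kept = prune_samples(samples, now_s=172800.0, capacity=2)
print([s["id"] for s in kept])
```

With these values the old, uneventful sample is the one eliminated, while the old sample near a constraint violation is retained.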
[0063] Take a smart factory production optimization scenario as an example. An automotive parts factory comprises three stamping production lines, two welding production lines, and one assembly production line. Production scheduling decisions need to be optimized subject to the following constraints: equipment load constraints (hard constraints, such as stamping machine load ≤ 100 t), production cycle time constraints (soft constraints, such as welding cycle time ≥ 2 pieces/minute), and energy consumption constraints (hard constraints, total factory energy consumption ≤ 500 kW/h). These constraints change dynamically with order volume and equipment status, and the goal is to improve production efficiency and energy utilization.
[0064] Data acquisition layer: Industrial sensors collect equipment operation data (load, temperature, and production cycle time, sampled once per second); smart meters collect energy consumption data (sampled once every 10 seconds); and order data is synchronized in real time through the ERP system interface. Constraint data is obtained from the MES system, including the equipment load limit, the production cycle time range, and the total energy consumption limit.
[0065] Dynamic constraint perception module: The 1D-CNN has 2 convolutional layers (kernel sizes 3 and 3), the multi-head self-attention mechanism has 4 attention heads, and the constraint priority is updated once per second. When the equipment load change rate exceeds 15% per second (for example, on a sudden failure), an immediate update is triggered.
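The update policy in this paragraph (a periodic update once per second plus an immediate update on a sudden load change) can be sketched as follows. The sketch assumes the 15% per second threshold refers to the load change relative to the previous reading; the function names and return labels are hypothetical.

```python
def load_change_rate(prev_load, curr_load, dt_s):
    """Relative load change per second (assumed definition of the 15%/s threshold)."""
    if prev_load <= 0 or dt_s <= 0:
        raise ValueError("prev_load and dt_s must be positive")
    return abs(curr_load - prev_load) / prev_load / dt_s


def update_trigger(now_s, last_update_s, prev_load, curr_load, dt_s,
                   period_s=1.0, rate_threshold=0.15):
    """Return 'instant', 'periodic', or None for the constraint priority update."""
    if load_change_rate(prev_load, curr_load, dt_s) > rate_threshold:
        return "instant"   # sudden change, e.g. equipment failure
    if now_s - last_update_s >= period_s:
        return "periodic"  # regular once-per-second refresh
    return None


print(update_trigger(0.5, 0.0, 80.0, 95.0, 0.5))  # load jumps from 80 t to 95 t in 0.5 s
```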
[0066] Multi-agent reinforcement learning decision module: Sub-agents: the production lines are assigned 6 sub-agents (3 stamping, 2 welding, and 1 assembly), each using the DDPG algorithm. The state space includes the production line equipment load, current order quantity, and production cycle time; the action space includes the equipment operating power adjustment and the production batch adjustment. Master agent: deployed on the factory's edge computing node and using the A2C algorithm. The global reward function is 0.5 × the sum of the production efficiency rewards of the sub-agents + 0.5 × the total energy consumption constraint satisfaction reward. Communication protocol: the FedProx algorithm is used for federated learning; sub-agents upload gradients every 5 decision steps, and the master agent updates the global policy every 10 decision steps.
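The global reward and the communication schedule described above can be written out directly. This is a minimal sketch: the function names are hypothetical, the reward values in the example are made up, and the FedProx aggregation itself is not shown.

```python
def global_reward(sub_efficiency_rewards, energy_reward, w_eff=0.5, w_energy=0.5):
    """0.5 x sum of sub-agent efficiency rewards + 0.5 x energy constraint reward."""
    return w_eff * sum(sub_efficiency_rewards) + w_energy * energy_reward


def sync_events(step, upload_every=5, update_every=10):
    """Communication events that fall due at a given decision step (step >= 1)."""
    events = []
    if step > 0 and step % upload_every == 0:
        events.append("upload_gradients")      # sub-agents -> master agent
    if step > 0 and step % update_every == 0:
        events.append("update_global_policy")  # master agent refreshes the policy
    return events


# Six sub-agents (3 stamping, 2 welding, 1 assembly) with illustrative rewards.
print(global_reward([1.0] * 6, 2.0))
print(sync_events(5), sync_events(10))
```

With this schedule the master agent has received two rounds of gradients from each sub-agent by the time it updates the global policy.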
[0067] Safety reinforcement learning optimization module: The Critic network in the CAC framework has hidden layers of 128 and 64 neurons, and the constraint violation risk threshold is 80%. The MPC prediction horizon (prediction time domain) is adjusted according to equipment status: 8 steps (8 seconds) when the equipment is normal and 4 steps (4 seconds) when the equipment is abnormal. The system dynamics model is the equipment energy consumption-load relationship model.
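The gating between the CAC risk estimate and the MPC correction can be sketched as a small safety filter. This shows only the gating logic and horizon selection, not an MPC solver: `mpc_solver` is a hypothetical callable supplied by the caller, and treating the threshold as a strict "greater than" comparison is an assumption.

```python
RISK_THRESHOLD = 0.80   # constraint violation risk threshold from the embodiment
HORIZON_NORMAL = 8      # prediction steps (1 s each) when equipment is normal
HORIZON_ABNORMAL = 4    # prediction steps when equipment is abnormal


def select_horizon(equipment_normal):
    """Pick the MPC prediction horizon from the equipment status."""
    return HORIZON_NORMAL if equipment_normal else HORIZON_ABNORMAL


def safety_filter(action, risk, equipment_normal, mpc_solver):
    """Pass the action through unless the predicted risk exceeds the threshold.

    Returns (action, horizon); horizon is None when no correction was needed.
    """
    if risk <= RISK_THRESHOLD:
        return list(action), None
    horizon = select_horizon(equipment_normal)
    return mpc_solver(action, horizon), horizon


# Stand-in solver for demonstration only: scales the action down by 10%.
def demo_solver(action, horizon):
    return [a * 0.9 for a in action]


print(safety_filter([1.0], 0.88, False, demo_solver))
```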
[0068] Transfer learning adaptation module: The historical scene experience library stores models for four types of scenes (large-batch orders, small-batch multi-variety orders, normal equipment, and equipment failure). The scene feature vector includes the order batch size and the number of equipment failures. The similarity threshold is 75%, and MAML adaptation runs for 5 training rounds.
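Scene matching against the experience library can be sketched with cosine similarity, the measure named in claim 5, and the 75% threshold from this embodiment. The feature vectors below are toy values, the function and mode names are hypothetical, and any feature scaling applied before the comparison is not specified in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_adaptation(new_vec, library, threshold=0.75):
    """Find the most similar historical scene and choose the adaptation mode."""
    best_name, best_sim = None, -1.0
    for name, vec in library.items():
        sim = cosine_similarity(new_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    # Above the threshold: reuse the historical model and fine-tune with MAML.
    # Otherwise: fall back to incremental training.
    mode = "maml_fast_adaptation" if best_sim > threshold else "incremental_training"
    return best_name, best_sim, mode


library = {"scene_a": [1.0, 0.0], "scene_b": [0.0, 1.0]}  # toy feature vectors
print(select_adaptation([1.0, 0.0], library))
```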
[0069] Implementation process and results. Data Acquisition (S1): Sensors collect press load data (once per second), and smart meters collect energy consumption data (once every 10 seconds). Outliers are then removed and the data is standardized.
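The cleaning step in S1 can be sketched as outlier removal followed by z-score standardization. The text does not name an outlier detection method; the median absolute deviation (MAD) rule with a cutoff of 3.5 used here is an assumption, as are the function names and sample readings.

```python
import statistics


def remove_outliers(values, cutoff=3.5):
    """Drop points whose modified z-score (MAD-based) exceeds the cutoff."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return values  # no spread to judge against; keep everything
    return [v for v in values if 0.6745 * abs(v - med) / mad <= cutoff]


def standardize(values):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


loads = [90.0, 91.0, 92.0, 93.0, 94.0, 400.0]  # last reading is a sensor glitch
cleaned = remove_outliers(loads)
z = standardize(cleaned)
```

A MAD-based rule is used instead of a plain 3-sigma rule because a single extreme reading inflates the standard deviation and can mask itself in short windows.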
[0070] Dynamic Constraint Awareness (S2): The 1D-CNN extracts features of equipment load fluctuation and energy consumption change rate, and the attention mechanism calculates the constraint weights. In the large-volume order scenario, the stamping machine load constraint has the highest weight (0.7), followed by the production cycle constraint (0.2) and the energy consumption constraint (0.1); these weights form the priority matrix.
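Claim 2 requires that the base weight of hard constraints not fall below a preset threshold. One way to enforce such a floor on the attention weights is sketched here; the floor value of 0.15, the rescaling of soft-constraint weights so that the total stays at 1, and the function name are assumptions, since the text does not describe the mechanism.

```python
def apply_hard_floor(weights, is_hard, floor):
    """Raise hard-constraint weights to at least `floor`, then rescale soft weights.

    Hard weights are never reduced; soft weights absorb the difference so the
    weights sum to 1 whenever at least one soft constraint has nonzero weight.
    """
    if len(weights) != len(is_hard):
        raise ValueError("weights and is_hard must have the same length")
    raised = [max(w, floor) if hard else w for w, hard in zip(weights, is_hard)]
    hard_total = sum(w for w, hard in zip(raised, is_hard) if hard)
    soft_total = sum(w for w, hard in zip(raised, is_hard) if not hard)
    if hard_total > 1.0:
        raise ValueError("floor too high: hard weights alone exceed 1")
    if soft_total == 0.0:
        return raised  # nothing to rescale
    scale = (1.0 - hard_total) / soft_total
    return [w if hard else w * scale for w, hard in zip(raised, is_hard)]


# Load (hard), production cycle (soft), energy (hard), using the weights from S2.
print(apply_hard_floor([0.7, 0.2, 0.1], [True, False, True], floor=0.15))
```

In this example the energy constraint is lifted from 0.1 to the floor, and the soft production cycle weight shrinks to compensate.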
[0071] Multi-agent distributed decision-making (S3): Stamping sub-agent 1 uses the DDPG algorithm to adjust its operating power (reduced by 5%) according to local load data (current load 95 t) and uploads its gradient to the master agent. The master agent integrates the gradients of the 6 sub-agents with the total energy consumption constraint (current energy consumption 480 kW/h), optimizes the global strategy through the A2C algorithm, and outputs a batch adjustment instruction for the assembly line (adjusting the batch from 100 pieces to 80 pieces).
[0072] Safety Constraint Verification and Optimization (S4): The CAC framework predicts that the equipment load of welding sub-agent 2 will exceed the upper limit (risk 88%), which triggers the MPC module to generate a correction trajectory over the next 4 steps (4 seconds) based on the energy consumption-load model: the welding current is reduced by 10% and the production cycle is adjusted to 1.8 pieces/minute. After correction, the risk of violating the load constraint falls to 9%.
[0073] Scene adaptation and model update (S5): When the small-batch multi-variety order scenario starts, the scene feature vector (order batch of 50 pieces per type, 1 faulty device) has a similarity of 82% with the historical "small-batch multi-variety" scenario. The historical model is retrieved and adapted through 5 rounds of MAML training (80 sets of samples); the adaptation takes 3 minutes.
[0074] Decision Output (S6): The decision output layer converts the power adjustment and batch adjustment instructions into PLC control signals that drive the production line equipment; the decision data is stored in the MES system for subsequent production analysis.
[0075] Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A reinforcement learning-driven dynamic constraint optimization decision-making system, characterized in that it comprises: a data acquisition layer, used to collect dynamic environmental data and constraint data in the decision-making scenario in real time, the constraint data including hard constraint data and soft constraint data; a dynamic constraint perception module, which assigns weights to the collected constraint data based on an attention mechanism, focuses on key constraint dimensions, and updates the constraint priority matrix in real time; a multi-agent reinforcement learning decision module, which adopts a distributed deep reinforcement learning architecture including a master agent and several sub-agents, wherein the sub-agents are responsible for local optimization of sub-decision scenarios and the master agent performs global optimization based on the local decision results of the sub-agents and global constraints; a safety reinforcement learning optimization module, which integrates the constrained Actor-Critic (CAC) framework with model predictive control (MPC) to detect the constraint satisfaction status in real time during the decision-making process; a transfer learning adaptation module, which, based on the meta-reinforcement learning framework, stores decision-making experience models from historical scenarios; and a decision output layer, which receives the optimized decision instructions, converts them into control signals, and sends them to the execution terminal.
2. The reinforcement learning-driven dynamic constraint optimization decision-making system according to claim 1, characterized in that the dynamic constraint perception module includes: a constraint feature extraction unit, which uses a 1D convolutional neural network to extract features from temporal constraint data and outputs constraint feature vectors; an attention weight calculation unit, which, based on the multi-head self-attention mechanism, takes the constraint feature vector and the environment state vector as input and calculates the attention weight of each constraint dimension; and a constraint priority update unit, which dynamically adjusts the constraint priority matrix based on attention weights and constraint violation history, where the base weight of hard constraints is not lower than a preset threshold.
3. The reinforcement learning-driven dynamic constraint optimization decision-making system according to claim 1, characterized in that, in the multi-agent reinforcement learning decision module, the sub-agents adopt a deep policy gradient algorithm, the master agent adopts a centralized training and distributed execution architecture, and the global reward function is optimized through the Advantage Actor-Critic algorithm; the global reward function includes the local rewards of the sub-agents and the global constraint satisfaction reward.
4. The reinforcement learning-driven dynamic constraint optimization decision-making system according to claim 1, characterized in that, in the safety reinforcement learning optimization module, the Critic network of the CAC framework includes two Critic branches, which respectively output the value estimate of decision actions and the constraint violation risk estimate.
5. The reinforcement learning-driven dynamic constraint optimization decision-making system according to claim 1, characterized in that the transfer learning adaptation module includes: a meta-model storage unit, which stores the parameters of the reinforcement learning models trained under different historical scenarios, indexed by scene feature vector; a scene similarity calculation unit, which calculates the similarity between the feature vector of the new scene and the feature vectors of the historical scenes using the cosine similarity algorithm; and a fast adaptation unit, which, when the similarity is higher than the preset threshold, calls the corresponding historical model as the initial parameters and completes the model adaptation through 1-5 rounds of interactive sample training using the MAML algorithm of meta-reinforcement learning.
6. A reinforcement learning-driven dynamic constraint optimization decision-making method based on the system according to claim 1, characterized in that the specific steps are as follows: S1. Data Acquisition: Environmental status data and constraint data of the decision-making scenario are acquired in real time through sensors and database interfaces, followed by data cleaning and standardization. S2. Dynamic Constraint Awareness: Based on the attention mechanism, the weights of each constraint dimension are calculated to generate a constraint priority matrix, where the weights of key constraint dimensions are dynamically adjusted through historical violation data and real-time environmental features. S3. Multi-agent distributed decision-making: Sub-agents make local optimization decisions based on local environmental data and sub-constraints, and feed the decision results back to the master agent. The master agent integrates global constraints and the decision results of each sub-agent, and generates a global optimization decision through a centralized reinforcement learning algorithm. S4. Safety Constraint Verification and Optimization: The constraint satisfaction status of decision actions is predicted through the CAC framework. If there is a risk of constraint violation, the MPC module is triggered to generate a correction trajectory and adjust the decision actions. S5. Scene adaptation and model update: When the scene changes, the transfer learning adaptation module calls the historical experience model and combines it with a small number of new scene samples to quickly update the model parameters. S6. Decision Output: The optimized decision instructions are converted into signals that can be recognized by the execution terminal, and the decision process data and model parameters are stored in the historical database.
7. The reinforcement learning-driven dynamic constraint optimization decision-making method according to claim 6, characterized in that, in step S2, the update cycle of the constraint priority matrix is synchronized with the data acquisition cycle, and when a sudden change in constraint conditions is detected, the priority matrix is updated immediately.
8. The reinforcement learning-driven dynamic constraint optimization decision-making method according to claim 6, characterized in that, in step S3, the sub-agents and the master agent use a federated learning communication protocol, and the sub-agents upload only decision gradient information.
9. The reinforcement learning-driven dynamic constraint optimization decision-making method according to claim 6, characterized in that, in step S4, the prediction time domain length of the MPC module is adaptively adjusted according to the dynamic change frequency of the constraints, and the constraint change frequency and the prediction time domain length change in opposite directions.
10. The reinforcement learning-driven dynamic constraint optimization decision-making method according to claim 6, characterized in that, in step S5, when the similarity between the new scene and the historical scene is lower than a preset threshold, the incremental training mode is started, and incremental learning is performed based on the historical model parameters and the new scene data.