Multi-agent optimization control method and system based on mean field game reinforcement learning
By constructing difference and mean field data matrices and updating the feedback gain matrix using iterative equations, the stability problem of traditional methods under unknown system matrices is solved, and efficient distributed decision-making and control of multi-agent systems is realized.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SHANDONG UNIV
- Filing Date: 2026-05-20
- Publication Date: 2026-06-16
AI Technical Summary
Traditional mean-field game theory methods struggle to guarantee stable convergence when the system matrix is unknown and mean-field coupling exists, leading to limitations in practical engineering applications.
By adopting a data-driven approach, a difference data matrix and a mean field data matrix are constructed, and the feedback gain matrices are iteratively updated using the deviation model-free iterative equation and the mean model-free iterative equation. This yields the optimal control input, eliminates the influence of the mean field coupling term, and realizes model-free reinforcement learning.
In scenarios with unknown system dynamics and external disturbances, it achieves good convergence and engineering feasibility, and improves the coordinated control performance of large-scale multi-agent systems.
Figure CN122219618A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of multi-agent control technology, and specifically relates to a multi-agent optimization control method and system based on mean-field game reinforcement learning.
Background Technology
[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.
[0003] Mean-field models are important modeling tools for describing and analyzing large-scale multi-agent systems, and have been widely applied in fields such as UAV formation control, smart grid operation scheduling, and urban intelligent traffic management. Mean-field game theory, as a core theoretical framework for studying distributed decision-making problems in such systems, is characterized by the fact that when the number of agents approaches infinity, the influence of individuals on the overall system becomes negligible, while the group effect becomes significant. Therefore, when making decisions, each agent does not need to consider the specific interactions of all other individuals, but rather abstracts the interactions between individuals into a mean-field term. This transforms the high-dimensional and complex game problem into solving a set of coupled forward and backward partial differential equations, significantly simplifying the modeling and solution complexity of the system.
[0004] Traditional mean-field game solving methods typically rely on the premise that the system model is precisely known. However, in real-world engineering scenarios, it is often difficult to obtain a completely accurate system model. In recent years, reinforcement learning, as a policy optimization method that does not depend on an accurate model, has been gradually introduced into the mean-field game solving framework to address the challenge of model uncertainty.
[0005] However, because both the system dynamics and the cost function in mean-field games contain state-coupled mean-field terms, the equivalence between traditional model-based and model-free methods no longer holds for such problems. Traditional model-free reinforcement learning methods therefore have significant limitations when applied directly to mean-field games. In particular, when the system matrices are unknown and mean-field coupling is present, existing methods struggle to guarantee stable convergence, which is the main technical deficiency in current practical applications in this field.
Summary of the Invention
[0006] To address the aforementioned issues, this invention proposes a multi-agent optimization control method and system based on mean-field game reinforcement learning. This invention learns and approximates the optimal distributed control strategy through a data-driven approach without relying on an accurate system model, exhibiting good convergence and engineering feasibility.
[0007] According to some embodiments, the first aspect of the present invention provides a multi-agent optimization control method based on mean-field game reinforcement learning, employing the following technical solution. The method includes:
Constructing a difference data matrix and a mean field data matrix using the state information and control inputs of any two agents;
Based on the difference data matrix and the initial individual deviation feedback gain matrix, iteratively updating the individual deviation feedback gain matrix using the pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix;
Based on the mean field data matrix and the initial mean field feedback gain matrix, iteratively updating the mean field feedback gain matrix using the pre-constructed mean model-free iterative equation until the difference between the mean field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean field feedback gain matrix;
Obtaining the optimal control input based on the agent's state information, the optimal individual deviation feedback gain matrix, and the optimal mean field feedback gain matrix.
[0008] Furthermore, constructing the difference data matrix and the mean field data matrix using the state information and control inputs of any two agents includes: generating the state deviation and control deviation between any two agents based on their state information and control inputs, and obtaining the state mean and control mean of any two agents; and constructing the difference data matrix and the mean field data matrix based on the state deviation and control deviation together with the state mean and control mean.
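The deviation and mean quantities described in [0008] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's formulas: the array layout, the function name `deviations_and_means`, and the choice to take the means over all agents (the "group average") are hypothetical.

```python
import numpy as np

def deviations_and_means(X, U, i=0, j=1):
    """X: (N, T+1, n) state samples, U: (N, T, m) input samples for N agents.

    Returns the state/control deviation between agents i and j and the
    population-average state/control (assumed here to be the mean-field terms).
    """
    dx = X[i] - X[j]          # state deviation between the two agents
    du = U[i] - U[j]          # control deviation between the two agents
    x_bar = X.mean(axis=0)    # average state over all agents
    u_bar = U.mean(axis=0)    # average control over all agents
    return dx, du, x_bar, u_bar

# Synthetic data only, to show the shapes involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 11, 2))
U = rng.normal(size=(5, 10, 1))
dx, du, x_bar, u_bar = deviations_and_means(X, U)
```

Taking the difference between two agents is useful because any term that is identical for both agents, such as a shared mean-field term, cancels in the difference.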
[0009] Further, iteratively updating the individual deviation feedback gain matrix until convergence to obtain the optimal individual deviation feedback gain matrix includes: setting the iteration step; substituting the difference data matrix and the current individual deviation feedback gain matrix into the pre-constructed deviation model-free iterative equation to obtain the updated individual deviation feedback gain matrix; if the difference between the updated individual deviation feedback gain matrix and the individual deviation feedback gain matrix before the update is less than the convergence criterion, outputting the updated matrix as the optimal individual deviation feedback gain matrix; if not, advancing the iteration step and repeating the update until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, and then outputting the updated matrix as the optimal individual deviation feedback gain matrix.
[0010] Further, iteratively updating the mean field feedback gain matrix until convergence to obtain the optimal mean field feedback gain matrix includes: setting the iteration step; substituting the mean field data matrix and the current mean field feedback gain matrix into the pre-constructed mean model-free iterative equation to obtain the updated mean field feedback gain matrix; if the difference between the updated mean field feedback gain matrix and the mean field feedback gain matrix before the update is less than the convergence criterion, outputting the updated matrix as the optimal mean field feedback gain matrix; if not, advancing the iteration step and repeating the update until the difference between the mean field feedback gain matrices of two successive iterations is less than the convergence criterion, and then outputting the updated matrix as the optimal mean field feedback gain matrix.
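The stopping logic shared by the two gain iterations can be sketched as a generic loop. The function name `iterate_gain`, the use of the Frobenius norm as the "difference", and the toy update rule are assumptions made for illustration; the actual update would be the model-free iterative equation.

```python
import numpy as np

def iterate_gain(update, K0, eps=1e-6, max_iter=100):
    """Repeat K <- update(K) until successive gains differ by less than eps."""
    K = np.asarray(K0, dtype=float)
    for step in range(1, max_iter + 1):
        K_next = update(K)
        # Frobenius norm of the change between two successive iterations.
        if np.linalg.norm(K_next - K) < eps:
            return K_next, step
        K = K_next
    raise RuntimeError("gain iteration did not converge")

# Toy contraction standing in for the model-free update (hypothetical).
target = np.array([[1.0, 2.0]])
K_star, steps = iterate_gain(lambda K: 0.5 * (K + target), np.zeros((1, 2)))
```

The same loop can be run twice, once with the deviation update and once with the mean field update, since the two iterations do not depend on each other.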
[0011] Furthermore, based on the agent's state information, the optimal individual deviation feedback gain matrix, and the optimal mean field feedback gain matrix, the optimal control input is obtained as follows:
[0012] where the quantities appearing in the formula are: the optimal control input of the agent, the optimal individual deviation feedback gain matrix, the optimal mean-field feedback gain matrix, the state information of the agent, and the mean field term.
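One common way to combine these five quantities is deviation feedback plus mean-field feedback. The sketch below assumes that form; the sign convention, the decomposition, and the names `control_input`, `K_dev`, and `K_mf` are hypothetical and may differ from the patent's formula.

```python
import numpy as np

def control_input(x_i, x_bar, K_dev, K_mf):
    """Assumed form: u_i = -K_dev (x_i - x_bar) - K_mf x_bar."""
    return -K_dev @ (x_i - x_bar) - K_mf @ x_bar

K_dev = np.array([[1.0, 0.0]])   # illustrative individual deviation gain
K_mf = np.array([[0.0, 2.0]])    # illustrative mean-field gain
x_bar = np.array([1.0, 1.0])     # mean-field term (average state)

u = control_input(np.array([2.0, 1.0]), x_bar, K_dev, K_mf)
u_at_mean = control_input(x_bar, x_bar, K_dev, K_mf)
```

Under this assumed form, each agent needs only its own state and the mean-field term, which matches the distributed decision-making described in the text.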
[0013] Furthermore, the deviation model-free iterative equation and the mean model-free iterative equation are constructed as follows:
With the interactions between agents represented as the mean field coupling term, the dynamic equation of the multi-agent system is constructed, and a distributed state feedback controller is designed from it;
Using the system transformation method, the mean field coupling term in the dynamic equation of the multi-agent system is eliminated, giving the dynamic equations of the deviation subsystem and the mean subsystem;
The dynamic equation of the deviation subsystem is optimized with respect to the individual deviation function, giving its data-driven iterative expression;
The dynamic equation of the mean subsystem is optimized with respect to the mean field function, giving its data-driven iterative expression;
Using the properties of vectorization and the Kronecker product, the two data-driven iterative expressions are transformed into linear regression form, yielding the deviation model-free iterative equation and the mean model-free iterative equation.
[0014] According to some embodiments, the second aspect of the present invention provides a multi-agent optimization control system based on mean-field game reinforcement learning, employing the following technical solution: Multi-agent optimization control systems based on mean-field game reinforcement learning include: The data acquisition module is used to construct a difference data matrix and a mean field data matrix using the state information and control input of any two intelligent agents; The individual deviation feedback gain module is used to iteratively update the initial individual deviation feedback gain matrix based on the difference data matrix and the current initial individual deviation feedback gain matrix using a pre-built deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices between two iterations is less than the convergence criterion, thus obtaining the optimal individual deviation feedback gain matrix. The mean field feedback gain module is used to iteratively update the initial mean field feedback gain matrix based on the mean field data matrix and the current initial mean field feedback gain matrix using a pre-built mean model-free iterative equation until the difference between the mean field feedback gain matrices between two iterations is less than the convergence criterion, thus obtaining the optimal mean field feedback gain matrix. The optimal control calculation module is used to obtain the optimal control input based on the agent's state information, the optimal individual deviation feedback gain matrix, and the optimal mean field feedback gain matrix.
[0015] According to some embodiments, a third aspect of the present invention provides a computer-readable storage medium.
[0016] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the multi-agent optimization control method based on mean-field game reinforcement learning as described in the first aspect above.
[0017] According to some embodiments, a fourth aspect of the present invention provides a computer device.
[0018] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of the multi-agent optimization control method based on mean-field game reinforcement learning as described in the first aspect above.
[0019] According to some embodiments, a fifth aspect of the present invention provides a computer program product or computer program.
[0020] A computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of the multi-agent optimization control method based on mean-field game reinforcement learning as described in the first aspect above.
[0021] Compared with the prior art, the beneficial effects of the present invention are as follows: The data-driven reinforcement learning method based on mean-field game theory proposed in this invention is designed for large-scale multi-agent systems in scenarios where system dynamics are unknown and external disturbances and uncertainties exist. It characterizes the interactions between multiple agents through mean-field coupling terms, which is closer to the actual large-scale multi-agent systems. At the same time, based on system transformation and state difference structure, it eliminates the influence of mean-field coupling terms on the equivalence of model-based and model-free algorithms, and constructs model-free data-driven iterative relationships and linear regression forms. This enables iterative solution of the feedback gain matrix without the need for system matrix parameters, and further obtains the distributed control law, reducing modeling complexity.
[0022] In practical applications, during the training phase the controller collects the states of each agent at discrete time points and calculates the group average state. It collects state and input data samples, satisfying the persistent excitation condition, of any two agents coupled through the mean field term. The distributed control law is then obtained using the mean-field game data-driven reinforcement learning algorithm of this invention. Subsequently, control inputs are calculated for each agent from the control law and output to the actuators. Under this process, each agent needs only its own state and the mean field term to make distributed decisions, thereby achieving asymptotic Nash equilibrium control and improving the system's coordinated control performance and engineering feasibility.
Attached Figure Description
[0023] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.
[0024] Figure 1 is a flowchart of the multi-agent optimization control method based on mean-field game reinforcement learning in an embodiment of the present invention;
Figure 2 shows the state data samples of the first and second agents and the mean field term collected in the embodiment of the present invention;
Figure 3 shows the control data samples of the first and second agents and the mean field term collected in the embodiment of the present invention;
Figure 4 is the iteration curve of a convergent sequence in the embodiment of the present invention;
Figure 5 is the iteration curve of a convergent sequence in the embodiment of the present invention;
Figure 6 is the iteration curve of a convergent sequence in the embodiment of the present invention.
Detailed Implementation
[0025] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0026] It should be noted that the following detailed description is illustrative and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0027] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0028] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.
[0029] Embodiment 1
As shown in Figure 1, this embodiment provides a multi-agent optimization control method based on mean-field game reinforcement learning. The embodiment is illustrated by applying the method to a server. The method can also be applied to terminals, or to systems that include terminals, servers, and other components, where it is implemented through interaction between the terminal and the server. The server can be an independent physical server, a server cluster composed of multiple physical servers, or a distributed system. It can also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network servers, cloud communication, middleware services, domain name services, CDN security services, and big data and artificial intelligence platforms. The terminal can be, but is not limited to, a smartphone, tablet, laptop, desktop computer, smart speaker, or smartwatch. The terminal and server can be connected directly or indirectly via wired or wireless communication, which is not limited herein. In this embodiment, the method includes the following steps:
Construct a difference data matrix and a mean field data matrix using the state information and control inputs of any two agents;
Based on the difference data matrix and the initial individual deviation feedback gain matrix, iteratively update the individual deviation feedback gain matrix using the pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix;
Based on the mean field data matrix and the initial mean field feedback gain matrix, iteratively update the mean field feedback gain matrix using the pre-constructed mean model-free iterative equation until the difference between the mean field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean field feedback gain matrix;
Obtain the optimal control input based on the agent's state information, the optimal individual deviation feedback gain matrix, and the optimal mean field feedback gain matrix.
[0030] Specifically, the method of this embodiment proceeds as follows:
Step S1: Construct a difference data matrix and a mean field data matrix using the state information and control inputs of any two agents.
Note that the training phase requires control inputs to be generated specifically for training. In practical application, by contrast, the difference data matrix and the mean field data matrix can be constructed directly from the agents' current state information and control inputs, without generating the control inputs again. During the training phase, an exploration excitation signal is first superimposed on the state feedback given by the agent's state information and the initial individual deviation feedback gain matrix, generating the training-phase control input as follows:
[0031] where the quantities are: the control input of the agent at the given iteration, the state information of the agent at that iteration, the exploration excitation signal, and the initial individual deviation feedback gain matrix, which renders the closed-loop system stable. A number of training iterations is specified, and during training the state information of each agent is computed at each iteration from the discrete-time stochastic difference equation.
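A training-phase input of this kind can be sketched as stabilizing feedback plus a probing signal. The sum-of-sinusoids form of the exploration signal, the frequency range, the amplitude, and the names `exploration_signal` and `training_input` are assumptions chosen for illustration.

```python
import numpy as np

def exploration_signal(k, freqs, amplitude=0.1):
    """Sum of sinusoids evaluated at discrete time k (assumed probing signal)."""
    return amplitude * np.sum(np.sin(freqs * k))

def training_input(x, K0, k, freqs, amplitude=0.1):
    """State feedback with the initial gain K0 plus the exploration term."""
    return -K0 @ x + exploration_signal(k, freqs, amplitude)

rng = np.random.default_rng(1)
freqs = rng.uniform(-10.0, 10.0, size=20)   # hypothetical frequency range
K0 = np.array([[0.5, 0.2]])                 # hypothetical stabilizing gain
x = np.array([1.0, -1.0])

u = training_input(x, K0, k=3, freqs=freqs)
u_no_probe = training_input(x, K0, k=3, freqs=freqs, amplitude=0.0)
```

The probing term exists only to make the collected data rich enough for the later regression; it is dropped once the gains have been learned.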
[0032] Step S1.1: Generate the state deviation and control deviation between any two agents based on their state information and control inputs, and obtain the state mean and control mean of any two agents. During training, the same construction is applied to the control inputs and state information generated in the training phase: the state deviation and control deviation between each pair of agents are constructed, and the state mean and control mean are obtained, as follows:
[0033]
[0034] where the quantities are, respectively: the states and control inputs of the two agents in the pair; the state deviation and control deviation between the two agents; and the state mean and control mean of the agents.
[0035] Step S1.2: Construct the difference data matrix and the mean field data matrix based on the state deviation, control deviation, state mean, and control mean. Taking the training phase as an example, the collected samples are combined to obtain the following difference data matrix:
[0036] At the same time, the following mean field data matrix is obtained:
[0037] where the blocks of both matrices are assembled from the sampled state deviations, control deviations, state means, and control means.
[0038] Step S2: Based on the difference data matrix and the initial individual deviation feedback gain matrix, iteratively update the individual deviation feedback gain matrix using the pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix.
Step S2.1: Set the iteration step. Substitute the difference data matrix and the current individual deviation feedback gain matrix into the pre-constructed deviation model-free iterative equation to obtain the updated individual deviation feedback gain matrix. Specifically, the deviation model-free iterative equation is as follows:
[0039] in,
[0040]
[0041]
[0042]
[0043] Step S2.2: Determine whether the convergence condition holds, that is, whether the difference between the currently updated individual deviation feedback gain matrix and the individual deviation feedback gain matrix before the current update is less than the convergence criterion. If it holds, output the currently updated individual deviation feedback gain matrix as the optimal individual deviation feedback gain matrix and proceed to step S4. If not, advance the iteration step and continue to compute the updated individual deviation feedback gain matrix until the condition holds, then output the current individual deviation feedback gain matrix as the optimal individual deviation feedback gain matrix and proceed to step S4.
[0044] Step S3: Based on the mean field data matrix and the initial mean field feedback gain matrix, iteratively update the mean field feedback gain matrix using the pre-constructed mean model-free iterative equation until the difference between the mean field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean field feedback gain matrix.
Step S3.1: Set the iteration step. Substitute the mean field data matrix and the current mean field feedback gain matrix into the pre-constructed mean model-free iterative equation to obtain the updated mean field feedback gain matrix. Specifically, the mean model-free iterative equation is as follows:
[0045] in,
[0046]
[0047]
[0048]
[0049]
[0050] Step S3.2: Determine whether the convergence condition holds, that is, whether the difference between the currently updated mean field feedback gain matrix and the mean field feedback gain matrix before the current update is less than the convergence criterion. If it holds, output the currently updated mean field feedback gain matrix as the optimal mean field feedback gain matrix and proceed to step S4. If not, advance the iteration step and continue to compute the updated mean field feedback gain matrix until the condition holds, then output the updated mean field feedback gain matrix as the optimal mean field feedback gain matrix and proceed to step S4.
[0051] Step S4: Based on the agent's state information, the optimal individual deviation feedback gain matrix, and the optimal mean field feedback gain matrix, the optimal control input is obtained as follows:
[0052] where the quantities are the optimal control input of the agent at the current time, the optimal individual deviation feedback gain matrix, and the optimal mean-field feedback gain matrix. The mean-field term satisfies the following equation:
[0053] where the coefficient matrices are constant matrices of appropriate but unknown dimensions, and the remaining quantities are the mean-field terms at the respective iterations of the policy update.
[0054] Specifically, the deviation model-free iterative equation and the mean model-free iterative equation are constructed as follows. First, with the interactions between agents represented as the mean field coupling term, the dynamic equations of the multi-agent system are constructed, and a distributed state feedback controller is designed from them:
(1) The dynamic equations of the multi-agent system are constructed as follows. Consider a large-scale multi-agent system consisting of 10 agents. Each agent in the agent set satisfies the following discrete-time stochastic difference equation:
[0055] where the quantities are: the state and control input of the agent at the given iteration of the policy update, together with the state dimension and the control dimension; the mean-field coupling term, defined as the average state of the population and computed from the states of the individual agents at that iteration; the coefficient matrices, which are constant matrices of appropriate but unknown dimensions; and the noise acting on the agent at that iteration, which is independent and identically distributed Gaussian white noise with zero mean and a given variance.
[0056] The initial states are mutually independent and share the same mathematical expectation. In addition, the set of initial states and the set of Gaussian white noise sequences are independent of each other.
[0057] (2) A distributed state feedback controller is designed from the dynamic equations of the multi-agent system as follows. Denote collectively the control inputs of all agents other than a given agent; the cost function of that agent is then defined as:
[0058] where the weighting matrices are known constant matrices of appropriate dimensions and are symmetric, and the superscript operation denotes the matrix transpose.
[0059] The distributed admissible control set of each agent is defined as:
[0060] where the elements of the set are the control sequences of the agent.
[0061] Given the control inputs of the other agents, the best-response control sequence that minimizes the cost function is sought within the distributed admissible control set. From the optimality condition of the best-response control sequence under the dynamic equations of the multi-agent system, the corresponding forward-backward stochastic difference equations are established, which lead to a system of coupled algebraic Riccati equations. To ensure that the required equivalent transformation holds and that symmetry is preserved, a symmetric matrix satisfying the corresponding condition is assumed to exist. The coupled algebraic Riccati equations are then obtained as follows:
[0062] where the unknowns form the matrix-form solution of the system of coupled algebraic Riccati equations.
[0063] The matrix form of the coupled algebraic Riccati equations is transformed into Lyapunov equations, and the distributed state feedback controller is obtained from their solutions. Specifically, a model-based policy iteration algorithm is given: at each iteration of the policy update, the solutions of the following Lyapunov equations are computed:
[0064] The individual deviation feedback gain matrix and the mean field feedback gain matrix are then iteratively updated as follows:
[0065] The individual deviation feedback gain matrix and the mean field feedback gain matrix together constitute the distributed state feedback controller.
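For orientation, the textbook model-based policy iteration for a single discrete-time linear-quadratic problem alternates a Lyapunov solve (policy evaluation) with a gain update (policy improvement). The sketch below shows that standard scheme on a generic pair of system matrices; it is not the patent's coupled equations, and the matrices `A`, `B`, `Q`, `R` and all function names are illustrative assumptions.

```python
import numpy as np

def lyapunov_dt(Ac, Qc):
    """Solve Ac.T @ P @ Ac - P + Qc = 0 using vec(AXB) = (B.T kron A) vec(X)."""
    n = Ac.shape[0]
    lhs = np.eye(n * n) - np.kron(Ac.T, Ac.T)
    p = np.linalg.solve(lhs, Qc.reshape(-1, order="F"))
    P = p.reshape(n, n, order="F")
    return 0.5 * (P + P.T)  # symmetrize against round-off

def policy_iteration(A, B, Q, R, K0, eps=1e-9, max_iter=50):
    """Standard LQR policy iteration; K0 must stabilize A - B @ K0."""
    K = K0
    for _ in range(max_iter):
        Ac = A - B @ K
        P = lyapunov_dt(Ac, Q + K.T @ R @ K)                       # evaluation
        K_next = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)    # improvement
        if np.linalg.norm(K_next - K) < eps:
            return K_next, P
        K = K_next
    raise RuntimeError("policy iteration did not converge")

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable, so K0 = 0 is admissible
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K_opt, P_opt = policy_iteration(A, B, Q, R, K0=np.zeros((1, 2)))
```

This version needs `A` and `B` explicitly, which is exactly the dependence on the system matrices that the data-driven formulation is meant to remove.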
[0066] Secondly, based on the system transformation method, the mean field coupling term in the dynamic equation of the multi-agent system is eliminated, yielding the dynamic equation of the deviation subsystem and the dynamic equation of the mean subsystem, as follows:
[0067]
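To illustrate why a transformation of this kind removes the coupling, assume, purely for illustration, linear dynamics with a mean-field term. The symbols A, B, G, w_i and the form of the dynamics are assumptions, not the patent's notation. Averaging over the N agents gives the mean subsystem, and subtracting it from the individual dynamics gives a deviation subsystem in which the coupling term no longer appears:

```latex
\begin{align}
x_i(k+1) &= A x_i(k) + B u_i(k) + G \bar{x}(k) + w_i(k),
  \qquad \bar{x}(k) = \frac{1}{N}\sum_{j=1}^{N} x_j(k) \\
\bar{x}(k+1) &= (A + G)\,\bar{x}(k) + B\,\bar{u}(k) + \bar{w}(k) \\
x_i(k+1) - \bar{x}(k+1) &= A\bigl(x_i(k) - \bar{x}(k)\bigr)
  + B\bigl(u_i(k) - \bar{u}(k)\bigr) + w_i(k) - \bar{w}(k) \\
x_i(k+1) - x_j(k+1) &= A\bigl(x_i(k) - x_j(k)\bigr)
  + B\bigl(u_i(k) - u_j(k)\bigr) + w_i(k) - w_j(k)
\end{align}
```

The last line shows the same cancellation for the difference between two agents, which is consistent with the use of pairwise differences when building the difference data matrix.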
[0068] For the deviation subsystem, the following quadratic function is defined as the individual deviation function:
[0069] Based on the optimization of the individual deviation function, the dynamic equation of the deviation subsystem can be further written as:
[0070] By introducing auxiliary matrices to replace the terms in which the system matrices appear explicitly, the data-driven iterative expression of the deviation subsystem dynamic equation, free of explicit system matrix terms, is obtained as follows:
[0071]
[0072] For the mean subsystem, the following quadratic function is defined as the mean field function:
[0073] Optimizing with respect to the mean field function, the dynamic equation of the mean subsystem can be further written as:
[0074] By introducing auxiliary matrices to replace the terms in which the system matrices appear explicitly, the data-driven iterative expression of the mean subsystem dynamic equation, free of explicit system matrix terms, is obtained as follows:
[0075] By combining the data-driven iterative expressions of the dynamic equations of the deviation subsystem and the mean subsystem, the data-driven iterative expressions of the dynamic equations of the multi-agent system are obtained.
[0076] Finally, based on the properties of vectorization and Kronecker product, the data-driven iterative expression of the dynamic equation of the multi-agent system is transformed into a linear regression form, resulting in the bias model-free iterative equation and the mean model-free iterative equation. Based on vectorization and the Kronecker product identity, the data-driven iterative expression of the dynamic equation of the biased subsystem is equivalently rewritten into a linear regression form, resulting in the biased model-free iterative equation, as follows:
[0077] Similarly, using vectorization and the Kronecker product identity, the data-driven iterative expression of the dynamic equation of the mean subsystem is rewritten in linear regression form, yielding the mean model-free iterative equation, as follows:
[0078] in, , , , .
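The role of vectorization and the Kronecker product can be illustrated with the standard identity x^T P x = (x ⊗ x)^T vec(P), which makes a quadratic form linear in the unknown entries of P. The following is a minimal sketch under stated assumptions: the 2x2 matrix `P_true`, the sample vectors, and the helper names are hypothetical, and the regression shown is a generic example of the technique rather than the iterative equations of this embodiment.

```python
def kron(x, y):
    """Kronecker product of two vectors given as flat lists."""
    return [xi * yj for xi in x for yj in y]


def vec(P):
    """Column-major vectorization of a matrix given as a list of rows."""
    rows, cols = len(P), len(P[0])
    return [P[i][j] for j in range(cols) for i in range(rows)]


def quad(x, P):
    """Direct evaluation of the quadratic form x^T P x."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))


def solve(A, b):
    """Solve a small square system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


P_true = [[2.0, 0.5], [0.5, 1.0]]                 # hypothetical symmetric matrix
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical data points

# Identity check: x^T P x equals (x kron x)^T vec(P) for every sample.
identity_gap = max(
    abs(quad(x, P_true) - sum(k * v for k, v in zip(kron(x, x), vec(P_true))))
    for x in samples
)

# Linear regression: for symmetric P the unknowns are [P11, P12, P22] and the
# regressor of each sample is [x1^2, 2*x1*x2, x2^2].
Phi = [[x[0] ** 2, 2.0 * x[0] * x[1], x[1] ** 2] for x in samples]
y = [quad(x, P_true) for x in samples]
p11, p12, p22 = solve(Phi, y)
print(identity_gap, p11, p12, p22)
```

The regression has a unique solution only when the stacked regressors are linearly independent, which is the kind of requirement a rank condition on the collected data expresses.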
[0079] In the theoretical derivation of the deviation and mean model-free iterative equations above, a single iteration index is used throughout. At the algorithm implementation level (steps S2 and S3), the iterative processes of the deviation subsystem and the mean subsystem are independent, so each is given its own iteration index to distinguish them clearly. The two processes are fully consistent in convergence and update logic; the separate indices are merely a notational distinction and do not affect the substance of the algorithm.
[0080] Consider a multi-agent system whose system dynamics have the following coefficient matrices:
[0081] where the initial states follow a uniform distribution on a given interval, and the coefficients of the cost function are:
[0082] In this experiment, to implement the algorithm, the control input is designed as follows:
[0083] where the frequencies are randomly selected from a given interval, and the convergence criterion is set to a fixed threshold. The control input is applied to the system and, after 50 iterations with the rank condition satisfied, the mean-field term and the data sets of agents 1 and 2 are collected. Figures 2 and 3 show the mean-field term and the state and control trajectories of agents 1 and 2 under this control input.
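An exploration input with randomly selected frequencies can be sketched as follows. This is a minimal sketch under stated assumptions: the sum-of-sinusoids form, the number of frequencies, the interval [-10, 10], the sampling period, and the function names are all hypothetical, chosen only to illustrate how such a signal is typically generated for data collection.

```python
import math
import random


def draw_frequencies(count, low, high, seed=0):
    """Draw `count` frequencies uniformly from [low, high] (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def exploration_signal(t, frequencies, amplitude=1.0):
    """Sum-of-sinusoids probing signal evaluated at time t."""
    return amplitude * sum(math.sin(w * t) for w in frequencies)


# Hypothetical settings: 10 frequencies from [-10, 10], sampled every 0.01 s.
freqs = draw_frequencies(count=10, low=-10.0, high=10.0)
samples = [exploration_signal(0.01 * k, freqs) for k in range(500)]
print(min(samples), max(samples))
```

Mixing several distinct frequencies is a common way to make the collected data rich enough for the regression to be well posed; whether a given data set actually satisfies the rank condition still has to be checked on the data itself.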
[0084] The convergent sequences are shown in Figure 4, Figure 5, and Figure 6, respectively. The simulation results show that, under the chosen convergence criterion, the algorithm converges after three iterations.
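The stopping rule used for the gain sequences (iterate until two successive gains differ by less than the convergence criterion) can be sketched as a generic loop. The following is a minimal sketch under stated assumptions: the update step shown is a model-based scalar policy iteration used purely as a stand-in, since the model-free iterative equations are not reproduced here, and the system and cost coefficients, the initial gain, and all function names are hypothetical.

```python
import math


def iterate_gain(update, k0, eps=1e-8, max_iter=100):
    """Repeat K <- update(K) until |K_new - K| < eps; return (K, iterations used)."""
    k = k0
    for it in range(1, max_iter + 1):
        k_new = update(k)
        if abs(k_new - k) < eps:
            return k_new, it
        k = k_new
    raise RuntimeError("gain iteration did not converge")


def make_scalar_policy_update(a, b, q, r):
    """Model-based stand-in for one policy-iteration step on dx/dt = a*x + b*u
    with cost integrand q*x^2 + r*u^2 and control law u = -K*x."""
    def update(k):
        closed_loop = a - b * k
        if closed_loop >= 0.0:
            raise ValueError("current gain is not stabilizing")
        # Policy evaluation: solve 2*(a - b*K)*P + q + r*K^2 = 0 for P.
        p = (q + r * k * k) / (-2.0 * closed_loop)
        # Policy improvement: K_next = b*P / r.
        return b * p / r
    return update


# Hypothetical scalar example with a stabilizing initial gain.
update = make_scalar_policy_update(a=1.0, b=1.0, q=1.0, r=1.0)
k_opt, n_iter = iterate_gain(update, k0=2.0)
print(k_opt, n_iter)
```

For these coefficients the scalar Riccati equation has the closed-form solution 1 + sqrt(2), so the limit of the loop can be compared against it. A data-driven update would replace `update` while leaving the loop and the stopping rule unchanged.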
[0085] Example 2 This embodiment provides a multi-agent optimization control system based on mean-field game reinforcement learning, including: a data acquisition module, used to construct a difference data matrix and a mean-field data matrix using the state information and control inputs of any two agents; an individual deviation feedback gain module, used to iteratively update the initial individual deviation feedback gain matrix, based on the difference data matrix and the current individual deviation feedback gain matrix, using a pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix; a mean-field feedback gain module, used to iteratively update the initial mean-field feedback gain matrix, based on the mean-field data matrix and the current mean-field feedback gain matrix, using a pre-constructed mean model-free iterative equation until the difference between the mean-field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean-field feedback gain matrix; and an optimal control calculation module, used to obtain the optimal control input based on the state information of the agent, the optimal individual deviation feedback gain matrix, and the optimal mean-field feedback gain matrix.
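The optimal control calculation module can be sketched as follows. This is a minimal sketch under a loudly stated assumption: the control law is taken to have the form u_i = -K_dev (x_i - x_bar) - K_mean x_bar, a form commonly used in linear-quadratic mean-field control, and it is not confirmed by the text of this section. The gains, states, sign convention, and function names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (flat list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def optimal_control(x_i, x_bar, k_dev, k_mean):
    """Assumed control law: u_i = -K_dev (x_i - x_bar) - K_mean x_bar."""
    deviation = [xi - xb for xi, xb in zip(x_i, x_bar)]
    dev_part = mat_vec(k_dev, deviation)    # feedback on the individual deviation
    mean_part = mat_vec(k_mean, x_bar)      # feedback on the mean-field term
    return [-(d + m) for d, m in zip(dev_part, mean_part)]


# Hypothetical gains and states for a two-state, single-input agent.
u = optimal_control(x_i=[3.0, 1.0], x_bar=[1.0, 1.0],
                    k_dev=[[1.0, 2.0]], k_mean=[[0.5, 0.0]])
print(u)  # [-2.5]
```

Keeping the two gains separate mirrors the module split above: each gain is learned by its own iteration and the two are only combined when the control input is computed.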
[0086] The above modules correspond to the steps of Embodiment 1 and share the same examples and application scenarios, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that the above modules, as part of the system, can be executed in a computer system, for example as a set of computer-executable instructions.
[0087] The descriptions of the above embodiments each have their own focus. For parts not described in detail in one embodiment, refer to the relevant descriptions in the other embodiments.
[0088] The proposed system can be implemented in other ways. For example, the system embodiments described above are merely illustrative, and the division of modules described above is only a logical functional division. In actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.
[0089] Example 3 This embodiment provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in the multi-agent optimization control method based on mean-field game reinforcement learning as described in Embodiment 1 above.
[0090] Example 4 This embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps in the multi-agent optimization control method based on mean-field game reinforcement learning as described in Embodiment 1 above.
[0091] Example 5 This embodiment provides a computer program product or computer program, including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the steps in the multi-agent optimization control method based on mean-field game reinforcement learning described in Embodiment 1 above.
[0092] Those skilled in the art will understand that embodiments of the present invention can provide methods, systems, or computer program products. Therefore, the present invention can take the form of hardware embodiments, software embodiments, or embodiments combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
[0093] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, as well as combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0094] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0095] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0096] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0097] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A multi-agent optimization control method based on mean-field game reinforcement learning, characterized in that it comprises: constructing a difference data matrix and a mean-field data matrix using the state information and control inputs of any two agents; based on the difference data matrix and the current individual deviation feedback gain matrix, iteratively updating the initial individual deviation feedback gain matrix using a pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix; based on the mean-field data matrix and the current mean-field feedback gain matrix, iteratively updating the initial mean-field feedback gain matrix using a pre-constructed mean model-free iterative equation until the difference between the mean-field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean-field feedback gain matrix; and obtaining the optimal control input based on the state information of the agent, the optimal individual deviation feedback gain matrix, and the optimal mean-field feedback gain matrix.
2. The multi-agent optimization control method based on mean-field game reinforcement learning as described in claim 1, characterized in that constructing the difference data matrix and the mean-field data matrix using the state information and control inputs of any two agents includes: generating the state deviation and control deviation between the two agents based on their state information and control inputs, and obtaining the state mean and control mean of the two agents; and constructing the difference data matrix and the mean-field data matrix based on the state deviation and control deviation as well as the state mean and control mean.
3. The multi-agent optimization control method based on mean-field game reinforcement learning as described in claim 1, characterized in that iteratively updating the initial individual deviation feedback gain matrix, based on the difference data matrix and the current individual deviation feedback gain matrix, using the pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix, includes: setting the iteration step to its initial value; substituting the difference data matrix and the current individual deviation feedback gain matrix into the pre-constructed deviation model-free iterative equation to obtain the updated individual deviation feedback gain matrix; if the difference between the updated individual deviation feedback gain matrix and the individual deviation feedback gain matrix before the update is less than the convergence criterion, outputting the updated individual deviation feedback gain matrix as the optimal individual deviation feedback gain matrix; otherwise, incrementing the iteration step and repeating the calculation with the updated individual deviation feedback gain matrix until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, and then outputting the updated individual deviation feedback gain matrix as the optimal individual deviation feedback gain matrix.
4. The multi-agent optimization control method based on mean-field game reinforcement learning as described in claim 1, characterized in that iteratively updating the initial mean-field feedback gain matrix, based on the mean-field data matrix and the current mean-field feedback gain matrix, using the pre-constructed mean model-free iterative equation until the difference between the mean-field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean-field feedback gain matrix, includes: setting the iteration step to its initial value; substituting the mean-field data matrix and the current mean-field feedback gain matrix into the pre-constructed mean model-free iterative equation to obtain the updated mean-field feedback gain matrix; if the difference between the updated mean-field feedback gain matrix and the mean-field feedback gain matrix before the update is less than the convergence criterion, outputting the updated mean-field feedback gain matrix as the optimal mean-field feedback gain matrix; otherwise, incrementing the iteration step and repeating the calculation with the updated mean-field feedback gain matrix until the difference between the mean-field feedback gain matrices of two successive iterations is less than the convergence criterion, and then outputting the updated mean-field feedback gain matrix as the optimal mean-field feedback gain matrix.
5. The multi-agent optimization control method based on mean-field game reinforcement learning as described in claim 1, characterized in that the optimal control input is obtained, based on the state information of the agent, the optimal individual deviation feedback gain matrix, and the optimal mean-field feedback gain matrix, according to a formula whose symbols denote, respectively, the optimal control input of the agent, the optimal individual deviation feedback gain matrix, the optimal mean-field feedback gain matrix, the state information of the agent, and the mean-field term.
6. The multi-agent optimization control method based on mean-field game reinforcement learning as described in claim 1, characterized in that the construction process of the deviation model-free iterative equation and the mean model-free iterative equation includes: constructing the dynamic equation of the multi-agent system with the interaction among the agents treated as the mean-field coupling term, and designing a distributed state feedback controller according to the dynamic equation of the multi-agent system; eliminating the mean-field coupling term in the dynamic equation of the multi-agent system based on the system transformation method to obtain the dynamic equations of the deviation subsystem and the mean subsystem; optimizing the dynamic equation of the deviation subsystem based on the individual deviation function to obtain the data-driven iterative expression of the dynamic equation of the deviation subsystem; optimizing the dynamic equation of the mean subsystem based on the mean-field function to obtain the data-driven iterative expression of the dynamic equation of the mean subsystem; and, based on the properties of vectorization and the Kronecker product, transforming the data-driven iterative expressions of the dynamic equations of the deviation subsystem and the mean subsystem into linear regression form to obtain the deviation model-free iterative equation and the mean model-free iterative equation.
7. A multi-agent optimization control system based on mean-field game reinforcement learning, characterized in that it comprises: a data acquisition module, used to construct a difference data matrix and a mean-field data matrix using the state information and control inputs of any two agents; an individual deviation feedback gain module, used to iteratively update the initial individual deviation feedback gain matrix, based on the difference data matrix and the current individual deviation feedback gain matrix, using a pre-constructed deviation model-free iterative equation until the difference between the individual deviation feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal individual deviation feedback gain matrix; a mean-field feedback gain module, used to iteratively update the initial mean-field feedback gain matrix, based on the mean-field data matrix and the current mean-field feedback gain matrix, using a pre-constructed mean model-free iterative equation until the difference between the mean-field feedback gain matrices of two successive iterations is less than the convergence criterion, thereby obtaining the optimal mean-field feedback gain matrix; and an optimal control calculation module, used to obtain the optimal control input based on the state information of the agent, the optimal individual deviation feedback gain matrix, and the optimal mean-field feedback gain matrix.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps in the multi-agent optimization control method based on mean-field game reinforcement learning as described in any one of claims 1-6.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps in the multi-agent optimization control method based on mean-field game reinforcement learning as described in any one of claims 1-6.
10. A computer program product, characterized in that the computer program product includes a computer program that, when executed by a processor, implements the steps in the multi-agent optimization control method based on mean-field game reinforcement learning as described in any one of claims 1-6.