A multimodal language model closed-loop simulation method and device
By constructing a closed-loop simulation method based on a multimodal language model, the problems of policy iteration and autonomous coordination in existing intelligent decision-making systems in multi-agent environments are solved. This enables dynamic feedback of agent behavior policies and a comprehensive understanding of environmental states, thereby improving the system's policy optimization capabilities.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: WUHAN UNIV OF TECH
- Filing Date: 2026-03-10
- Publication Date: 2026-06-16
AI Technical Summary
Existing intelligent decision-making systems based on large language models struggle to achieve strategy iteration, autonomous collaboration, and behavioral evolution modeling in dynamic environments involving strategy conflicts, resource competition, and games among multiple agents, resulting in insufficient adaptability in highly complex and multivariate decision-making scenarios.
A closed-loop simulation method for multimodal language models is constructed. Through a closed-loop simulation system with multi-agent auto-evolution, environmental change data and behavioral trajectory data are collected. Natural language information is combined to generate an updated environmental semantic representation. Gaussian perturbations are superimposed in a closed sandbox environment to calculate the environmental state increment, thereby realizing dynamic feedback and policy iteration of agent behavior strategies.
It realizes dynamic feedback in the interactive game process between multiple agents, improves the system's game coordination and strategy optimization capabilities, enhances the language model's comprehensive understanding of environmental state changes, and ensures the adaptability and robustness of strategy generation.
Figure CN121809703B_ABST
Description
Technical Field
[0001] This invention relates to the technical field of language models, and specifically to a closed-loop simulation method and apparatus for multimodal language models.
Background Technology
[0002] With the in-depth application of large language models in the field of natural language processing, they have demonstrated powerful cross-modal expression and context awareness capabilities in complex tasks such as semantic understanding, logical reasoning, and policy modeling. They can perform high-dimensional semantic abstraction, causal structure construction, and dynamic decision generation in multi-source heterogeneous information environments, providing core support for building intelligent agent systems with adaptive reasoning and complex behavior modeling capabilities. In particular, they have demonstrated significant generalization ability and robustness in highly complex tasks such as multi-agent collaboration, game optimization, and scene perception.
[0003] However, existing intelligent decision-making systems based on large language models still mainly focus on static data analysis and language question-answering tasks, lacking the ability to model time evolution characteristics and interactive feedback mechanisms. Especially in complex environments involving policy conflicts, resource competition and game dynamics among multiple agents, it is difficult to achieve policy iteration, autonomous collaboration and behavioral evolution modeling. As a result, they exhibit insufficient adaptability and limited reasoning depth when facing multivariable and highly uncertain decision-making scenarios, which seriously restricts their application expansion in real complex systems.
[0004] Therefore, a method is needed to simulate the interactive behavior between multiple agents, thereby reflecting the dynamic feedback process of behavioral decisions on the evolution of the overall environmental state.
Summary of the Invention
[0005] This invention provides a closed-loop simulation method and apparatus for multimodal language models, which can simulate the interactive behavior between multiple agents, thereby reflecting the dynamic feedback process of behavioral decisions on the evolution of the overall environmental state.
[0006] A first aspect of the present invention provides a closed-loop simulation method for a multimodal language model, the method comprising:
[0007] Construct a closed-loop simulation system for multi-agent autonomous evolution, wherein the closed-loop simulation system collects environmental change data caused by the behavior of the agents through strategic interactions and game conflicts between the agents, and continuously records behavioral trajectory data during the evolution process;
[0008] The intelligent agent is deployed in the closed-loop simulation system. The intelligent agent performs feature extraction based on the environmental change data and natural language information, generates an updated environmental semantic representation by combining the behavioral trajectory data and interaction information, and outputs a behavioral strategy within a predefined action space.
[0009] Based on the aforementioned behavioral strategy, the virtual environment undergoes dynamic evolution. Gaussian perturbations are superimposed on the closed sandbox environment to calculate the environmental state increment, and the global environmental state variables are updated to drive the evolution platform into the next iteration.
[0010] Based on the above technical solutions, preferably, the intelligent agent extracts features based on the environmental change data and natural language information, combines the behavioral trajectory data and interaction information to generate an updated environmental semantic representation, and outputs a behavioral strategy within a predefined action space, specifically further including:
[0011] A multi-agent semantic interaction module is constructed, wherein the multi-agent semantic interaction module supports each agent in exchanging its own perception results, behavioral decisions, and external analysis information with other agents at the natural language level;
[0012] During the interaction of multiple intelligent agents, the semantic interaction module combines the knowledge graph and the behavior trajectory data to perform information identification and consistency verification on the interaction content, and feeds back the semantically fused behavior strategy information to each intelligent agent to achieve collaborative optimization of group strategies.
[0013] Based on the above technical solutions, preferably, before deploying the intelligent agent in the closed-loop simulation system, the method includes:
[0014] The environmental change data, the behavioral trajectory data, and the agent's historical decision parameters are converted into natural language descriptions and visual graphical representations in parallel to generate unstructured information.
[0015] Based on the above technical solutions, preferably, the intelligent agent includes an environment analysis module, a behavior generation module, and a reflection and optimization module. Deploying the intelligent agent in the closed-loop simulation system specifically includes:
[0016] The agent is fed structured and unstructured information. The environment analysis module performs semantic compression and feature extraction on the structured and unstructured information, and generates an updated environmental semantic representation based on the interaction information. The behavior generation module generates the behavior strategy based on the environmental semantic representation, the current environmental state, historical behavior trajectories, and the reflection mechanism. The structured information includes the environmental change data, the behavior trajectory data, and the historical decision parameters.
[0017] After the behavior generation module generates the behavior strategy based on the environmental semantic representation, the reflection and optimization module performs strategy review and self-optimization, scores the behavior trajectory data and behavior strategy, selects the behavior strategy with the best score through comparative analysis, and updates the model by combining the strategy distillation method.
[0018] Based on the above technical solutions, preferably, the step of executing the dynamic evolution of the virtual environment based on the behavioral strategy, and calculating the environment state increment by superimposing Gaussian perturbations in the closed sandbox environment, specifically includes:
[0019] The behavioral strategy is used as the execution input for the current simulation cycle and input into the closed sandbox environment. In the closed sandbox environment, an interaction intensity-driven state evolution function model is constructed to quantify the numerical perturbation effect of each agent on the environmental state increment.
[0020] Define the intensity weight of the action of each agent on a specific environmental state variable in the current cycle;
[0021] The behavioral intensity of multiple agents is weighted and superimposed on the same state dimension according to the behavioral intensity weight to form the total intensity of the behavioral effect of the specific environmental state variable.
[0022] A Gaussian perturbation term is constructed to simulate background noise perturbation caused by non-intelligent agent behavior, wherein the Gaussian perturbation term satisfies the conditions of zero mean and a set variance.
[0023] Using the total intensity of the behavioral effect and the Gaussian perturbation term as input variables, the environmental state variables for the current period are solved by substituting them into the state evolution function;
[0024] The environmental state variables are superimposed on the global environmental state variables of the previous simulation cycle to form the updated environmental state increment.
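The steps in [0019] to [0024] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the linear form of the evolution function, the `gain` parameter, and the names `total_intensity` and `step` are hypothetical. Only the weighted superposition per state dimension, the zero-mean Gaussian term with a set variance, and the accumulation onto the previous cycle's global state are taken from the text.

```python
import random
from typing import Dict, List

State = Dict[str, float]  # maps a state variable name to a number


def total_intensity(actions: List[State], weights: List[State], var: str) -> float:
    """Weighted superposition of all agents' action intensities on one state variable."""
    return sum(w.get(var, 0.0) * a.get(var, 0.0) for a, w in zip(actions, weights))


def step(prev_state: State, actions: List[State], weights: List[State],
         sigma: float, rng: random.Random, gain: float = 1.0) -> State:
    """Advance the sandbox by one cycle and return the updated global state.

    Assumption: the state evolution function is linear,
    increment = gain * total_intensity + noise.
    """
    new_state: State = {}
    for var, prev_value in prev_state.items():
        # Background noise not caused by any agent: zero mean, std. dev. sigma.
        noise = rng.gauss(0.0, sigma)
        increment = gain * total_intensity(actions, weights, var) + noise
        # Superimpose the increment on the previous cycle's global state.
        new_state[var] = prev_value + increment
    return new_state


rng = random.Random(0)  # seeded so that runs are repeatable
state = {"resource_a": 100.0}
actions = [{"resource_a": -2.0}, {"resource_a": 1.0}]   # one entry per agent
weights = [{"resource_a": 1.0}, {"resource_a": 0.5}]    # per-agent intensity weights
noiseless = step(state, actions, weights, sigma=0.0, rng=rng)
noisy = step(state, actions, weights, sigma=0.3, rng=rng)
```

Setting `sigma=0.0` isolates the behavioral effect, which is convenient when checking the weights before noise is introduced.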
[0025] Based on the above technical solutions, preferably, the step of updating the global state variables of the environment to drive the evolution platform into the next iteration specifically includes:
[0026] The environmental state increment is fed back to the agent as one of the structured information inputs for the next cycle, forming the basic environmental variable data perceived by the agent, which together with the unstructured information constitutes the input of the environmental semantic representation.
[0027] A new set of behavioral strategies is generated in the updated environment and passed to the closed sandbox environment to execute the next round of state evolution.
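The feedback loop of [0026] and [0027] might look like the sketch below. It is illustrative only: agents are reduced to plain callables, all agents are given equal weight, and the names `run_closed_loop`, `Policy`, and `context` are assumptions introduced here. In the described system the policy would be produced by a language-model agent rather than a fixed function.

```python
import random
from typing import Callable, Dict, List

State = Dict[str, float]
# A policy maps (structured state, unstructured text) to per-variable action intensities.
Policy = Callable[[State, str], State]


def run_closed_loop(state: State, policies: List[Policy], cycles: int,
                    sigma: float, seed: int = 0) -> List[State]:
    """Run perceive -> decide -> execute -> feed back for a number of cycles."""
    rng = random.Random(seed)
    history = [dict(state)]
    for cycle in range(cycles):
        # Stand-in for the unstructured input (history and communication text).
        context = f"cycle {cycle}: state was {state}"
        # Every agent decides on the same state view for this cycle.
        actions = [policy(state, context) for policy in policies]
        next_state: State = {}
        for var, prev_value in state.items():
            effect = sum(action.get(var, 0.0) for action in actions)
            next_state[var] = prev_value + effect + rng.gauss(0.0, sigma)
        # The updated state becomes the structured input of the next cycle.
        state = next_state
        history.append(dict(state))
    return history


def consume_half(state: State, context: str) -> State:
    """Toy policy: remove half of whatever is left of variable 'x'."""
    return {"x": -0.5 * state["x"]}


trace = run_closed_loop({"x": 8.0}, [consume_half], cycles=2, sigma=0.0)
```

Because the toy policy reads the state it helped to change, each cycle's decision depends on the outcome of the previous one, which is the coupling the paragraphs above describe.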
[0028] Based on the above technical solutions, preferably, the step of collecting environmental change data caused by the behavior of the agents through strategic interactions and game conflicts between agents in the closed-loop simulation system, and continuously recording behavioral trajectory data during the evolution process, further includes:
[0029] The initial resource state of the closed sandbox environment is set, as well as the number of agents, agent identification, and initial policy parameters of the agents, to establish a closed-loop simulation system with resource competition constraints and zero-sum game rules;
[0030] At the beginning of each simulation cycle, the updated environmental state increment from the previous cycle is broadcast to all agents as structured information input to the agents, while historical behavior records and external communication semantic information are input as unstructured information.
[0031] In a second aspect of the invention, a multimodal language model closed-loop simulation apparatus is provided. The apparatus is used to execute a multimodal language model closed-loop simulation method as described in any of the above embodiments. The apparatus includes an acquisition module, a processing module, and an output module, wherein:
[0032] Construct a closed-loop simulation system for multi-agent autonomous evolution, wherein the closed-loop simulation system collects environmental change data caused by the behavior of the agents through strategic interactions and game conflicts between the agents, and continuously records behavioral trajectory data during the evolution process;
[0033] The intelligent agent is deployed in the closed-loop simulation system. The intelligent agent performs feature extraction based on the environmental change data and natural language information, generates an updated environmental semantic representation by combining the behavioral trajectory data and interaction information, and outputs a behavioral strategy within a predefined action space.
[0034] Based on the aforementioned behavioral strategy, the virtual environment undergoes dynamic evolution. Gaussian perturbations are superimposed on the closed sandbox environment to calculate the environmental state increment, and the global environmental state variables are updated to drive the evolution platform into the next iteration.
[0035] In a third aspect of the invention, an electronic device is provided, including a processor, a memory, a user interface, and a network interface, wherein the memory is used to store instructions, the user interface and the network interface are both used to communicate with other devices, and the processor is used to execute the instructions stored in the memory to cause the electronic device to perform the method as described in any of the preceding embodiments.
[0036] In a fourth aspect, the present invention provides a computer-readable storage medium storing instructions that, when executed, perform the method as described in any of the preceding claims.
[0037] In summary, one or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
[0038] 1. This invention constructs a multi-agent closed-loop simulation system based on a zero-sum game mechanism. By utilizing a structured behavioral strategy-driven state evolution function model, it realizes the continuous influence of agent behavior on environmental variables. After deploying agents with semantic understanding and strategy generation capabilities, the system supports agents in forming environmental semantic representations and outputting behavioral strategies based on environmental change data, behavioral trajectory data, and natural language interaction information. These behavioral strategies, as inputs, induce state changes with Gaussian perturbations in the sandbox environment. The updated environmental state serves as the basis for agent decision-making in the next cycle, thereby completing the closed-loop coupling of perception, reasoning, execution, and feedback in each cycle. This enables highly dynamic linkage between agent behavior selection and environmental state evolution, thus fully simulating the interactive game process between multiple agents and their substantial feedback on the system's evolution path.
[0039] 2. By constructing a multi-agent semantic interaction module, policy sharing and cognitive collaboration among agents at the natural language level are realized, enabling each agent to dynamically adjust its own strategy based on the perception and behavioral information of other agents, thereby improving the overall system's game coordination and the speed of group strategy convergence, and enhancing the strategy optimization capability in complex environments.
[0040] 3. By converting environmental change data, behavioral trajectory data, and historical decision parameters into natural language descriptions and visual graphical representations in parallel, unstructured information is generated, enabling the language model to simultaneously perceive multimodal inputs at both the linguistic and visual levels. This effectively compensates for the limitations of structured numerical information in semantic modeling and significantly enhances the language model's comprehensive understanding of environmental state change trends and historical behavioral logic.
[0041] 4. By refining the agent into an environment analysis module, a behavior generation module, and a reflection and optimization module, a closed-loop flow of structured and unstructured information is achieved in each stage of perception, decision-making, and self-optimization. This ensures that the agent has contextual association, historical dependence, and behavioral feedback capabilities during the policy generation process, thereby improving the adaptability, robustness, and evolution of policy generation.
[0042] 5. By establishing an interaction intensity-driven state evolution function model and introducing Gaussian perturbations to simulate external non-behavioral factors, behavioral policies directly drive the environmental state variables numerically. This forms a high-fidelity behavior-state mapping mechanism, ensures that the simulation environment has realistic feedback and dynamic evolution characteristics, and provides a measurable platform for continuous policy testing and feedback learning.
[0043] 6. By feeding back the updated environmental state increment to the agent and using it to generate the next round of behavioral policies, a dynamic feedback loop between policy output and state perception is established, ensuring that the evolution platform has the ability to continuously update itself and link behaviors, thereby enabling the agent's decision-making and environmental evolution to maintain temporal consistency and logical traceability.
Attached Figure Description
[0044] Figure 1 is a flowchart illustrating a closed-loop simulation method for a multimodal language model disclosed in an embodiment of the present invention;
[0045] Figure 2 is a schematic diagram of the modules of a multimodal language model closed-loop simulation device disclosed in an embodiment of the present invention;
[0046] Figure 3 is a schematic diagram of the structure of an electronic device disclosed in an embodiment of the present invention.
[0047] Explanation of reference numerals in the attached drawings: 201, acquisition module; 202, processing module; 203, output module; 301, processor; 302, communication bus; 303, user interface; 304, network interface; 305, memory.
Detailed Implementation
[0048] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.
[0049] In the description of the embodiments of the present invention, words such as "for example" or "for instance" are used to indicate examples, illustrations, or explanations. Any embodiment or design described as "for example" or "for instance" in the embodiments of the present invention should not be construed as being more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "for example" or "for instance" is intended to present the relevant concepts in a specific manner.
[0050] In the description of the embodiments of the present invention, the term "multiple" means two or more. For example, multiple systems means two or more systems, and multiple screen terminals means two or more screen terminals. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof all mean "including but not limited to," unless otherwise specifically emphasized.
[0051] With the continuous enhancement of the capabilities of large language models in the field of natural language processing, they have demonstrated strong generalization and robustness in multimodal semantic modeling and contextual reasoning, and have become the core support for building intelligent agent systems with complex behavior generation and adaptive reasoning capabilities. However, existing intelligent decision-making systems based on large language models are still limited to static task scenarios and lack the ability to model temporal evolution and multi-agent game interaction. They are unable to effectively characterize the dynamic coupling relationship between policy conflict and environmental feedback, which restricts their application in real systems with high complexity and strong uncertainty. Therefore, it is urgent to introduce intelligent agent interaction modeling methods oriented towards behavior evolution and policy game process in order to realize the dynamic feedback expression of the impact of multi-agent behavior on environmental state evolution.
[0052] This embodiment discloses a closed-loop simulation method for multimodal language models which, referring to Figure 1, includes the following steps S110-S130:
[0053] S110, constructing a closed-loop simulation system with multi-agent autonomous evolution.
[0054] This invention discloses a multimodal language model closed-loop simulation method applied to a server. The server includes, but is not limited to, electronic devices such as mobile phones, tablets, wearable devices, and PCs (Personal Computers), and can also be a backend server running a multimodal language model closed-loop simulation method. The server can be implemented using a standalone server or a server cluster composed of multiple servers.
[0055] In one possible implementation, the closed-loop simulation system collects environmental change data caused by agent behavior through strategic interactions and game conflicts between agents, and continuously records behavioral trajectory data during the evolution process. Specifically, it also includes: setting the initial resource state of the closed sandbox environment, as well as the number of agents, agent identities, and initial policy parameters, to establish a closed-loop simulation system with resource competition constraints and zero-sum game rules; at the beginning of each simulation cycle, broadcasting the updated environmental state increment of the previous cycle to all agents as structured information input to the agents, and inputting historical behavior records and external communication semantic information as unstructured information input.
[0056] Specifically, setting the initial resource state, number of agents, identity identifiers, and initial policy parameters for the closed sandbox environment is the core task of the simulation system initialization phase. The purpose is to determine the agent behavior space, policy foundation, and resource distribution structure before evolution begins. The initial resource state refers to the system variables available for allocation and competition in the simulation environment, such as constrained resource units, task allocation weights, or operation permissions; its structure must possess measurability and dynamic response characteristics. The number of agents is the number of individuals participating in the game evolution in parallel within the simulation system, directly affecting the dimension of the policy space. Identity identifiers assign unique symbolic labels to each agent for policy affiliation and behavioral trajectory differentiation. Initial policy parameters define the behavioral model of each agent before entering the evolution process, including action priorities, reward function biases, and initial behavior selection probability distributions. During implementation, these variables can be input into the system initialization module in tensor form through configuration files or parameter interfaces, and a structured initial policy vector can be generated for each agent using policy templates. After this setup, each agent possesses the basic ability to execute behaviors and evolve policies in the game environment.
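As a rough sketch of the initialization described above, the configuration could be held in two small data classes. All class and field names (`AgentInit`, `SandboxInit`, `reward_bias`, and so on) are hypothetical, and the validation rules are assumptions that follow from the text's requirements of unique identifiers and an initial selection probability distribution.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentInit:
    """Initial policy parameters of one agent (field names are illustrative)."""
    agent_id: str                        # unique symbolic label
    action_priorities: Dict[str, float]  # priority per action
    reward_bias: float                   # bias applied to the reward function
    action_probs: Dict[str, float]       # initial selection probability per action

    def __post_init__(self) -> None:
        # A probability distribution has to sum to one.
        if abs(sum(self.action_probs.values()) - 1.0) > 1e-9:
            raise ValueError(f"{self.agent_id}: action probabilities must sum to 1")


@dataclass
class SandboxInit:
    """Initial resource state and agent roster of the closed sandbox."""
    resources: Dict[str, float]
    agents: List[AgentInit]

    def __post_init__(self) -> None:
        ids = [agent.agent_id for agent in self.agents]
        # Identifiers separate policy ownership and trajectories, so they must be unique.
        if len(ids) != len(set(ids)):
            raise ValueError("agent identifiers must be unique")

    @property
    def num_agents(self) -> int:
        return len(self.agents)


first = AgentInit("agent-1", {"harvest": 1.0, "wait": 0.0}, 0.0, {"harvest": 0.5, "wait": 0.5})
second = AgentInit("agent-2", {"harvest": 0.0, "wait": 1.0}, 0.1, {"harvest": 0.25, "wait": 0.75})
config = SandboxInit({"resource_a": 100.0}, [first, second])
```

Rejecting an inconsistent configuration at construction time keeps malformed parameters from entering the first simulation cycle.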
[0057] At the start of each simulation cycle, the system scheduling module synchronously broadcasts the updated environmental state increment from the previous cycle to all agents. This environmental state increment is calculated using a state evolution function after the behavioral policy of the previous cycle is applied to the simulation environment. Its data structure is a multidimensional tensor, containing the numerical representations of various state variables at the current moment, and serves as the basic input for the agent's environment analysis module to perform perception modeling. The broadcasting method can employ a unified memory read / write or message queue mechanism to ensure that each agent receives a consistent state view at the same time step, thereby guaranteeing the temporal synchronization and logical consistency of the decision-making process. After the environmental state increment is input as structured information to the agents, it is used together with unstructured information as the input source for the semantic fusion module for inference processing.
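The message-queue variant of the broadcast could be sketched as below, assuming one inbox per agent. The function name `broadcast_state` and the message layout are assumptions made for illustration. Each agent receives its own copy of the same snapshot, so no agent can alter the state view of another.

```python
import copy
import queue
from typing import Dict, List


def broadcast_state(state: Dict[str, float], cycle: int,
                    inboxes: List[queue.Queue]) -> None:
    """Deliver an identical snapshot of the state to every agent's inbox."""
    # Freeze the state once so that later changes do not leak into the broadcast.
    snapshot = {"cycle": cycle, "state": copy.deepcopy(state)}
    for inbox in inboxes:
        # Each agent gets an independent copy tagged with the same cycle number.
        inbox.put(copy.deepcopy(snapshot))


inboxes = [queue.Queue() for _ in range(3)]   # one inbox per agent
broadcast_state({"resource_a": 98.5}, cycle=4, inboxes=inboxes)
messages = [inbox.get_nowait() for inbox in inboxes]
```

Tagging the snapshot with the cycle number lets an agent detect a stale message if it falls behind the scheduler.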
[0058] Simultaneously, the simulation platform organizes historical behavior records and external communication semantic information into natural language descriptions, which are then input as unstructured information into the semantic perception module of each agent. Historical behavior records include the agent's own structured behavioral policy sequences and their execution results over multiple past periods, which can be generated into semantic descriptions through a structured-to-natural language conversion module. External communication semantic information is provided by the semantic interaction module, reflecting the strategic intentions, game tendencies, or warning feedback issued by other agents in the previous period. This information, input in text form, constitutes the contextual semantic environment received by the current agent. After the structured and unstructured information are fused at the semantic layer, they jointly drive the agent to generate a semantic representation of the current simulation environment, supporting subsequent behavioral policy reasoning, cooperative game modeling, and state prediction analysis.
[0059] S120, deploying the intelligent agent in the closed-loop simulation system.
[0060] In one possible implementation, before deploying the agent in the closed-loop simulation system, the method includes: converting environmental change data, behavioral trajectory data, and the agent's historical decision parameters into natural language descriptions and visual graphical representations in parallel to generate unstructured information.
[0061] Specifically, before deploying agents in the closed-loop simulation system, environmental change data, behavioral trajectory data, and historical decision parameters are extracted from the environmental variable data preloaded during the evolution initialization phase, the initial behavioral trajectory samples of the agents, and the records of past policies. Environmental change data refers to the numerical increments of system state variables after being affected by behavioral policies during the simulation evolution, such as the change vectors of resource distribution, constraint strength, or state feedback indicators; behavioral trajectory data consists of the structured behavioral policy sequences of each agent in the historical simulation cycle and their corresponding environmental responses; historical decision parameters are the key control variables in the agent's policy model, including action selection weights, preference function outputs, and policy update rates.
[0062] The three types of structured information mentioned above are fed into the dual-modal input processing module for parallel transformation. The natural language description channel first performs a semantic template mapping operation on the structured tensor input. This mapping, based on a semantic rule dictionary or language model generation framework, parses the key state variables and change trends in the numerical vector into linguistic units, and reassembles them into grammatically complete and context-consistent text fragments. Taking environmental change data as an example, the variable name, direction of change, and magnitude are combined into an expression such as "Resource unit A decreased by 12% in the last evolution." The mapped statement will serve as semantic input for the agent's language understanding module to process.
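A rule-based version of the semantic template mapping could look like the following sketch. The function name `describe_change` and the sentence template are assumptions modeled on the example sentence above. A language-model-based generator, which the text also allows, is not shown.

```python
def describe_change(name: str, previous: float, current: float) -> str:
    """Turn one numeric state change into a short natural language sentence."""
    if previous == 0:
        # A relative change is undefined when the previous value is zero.
        raise ValueError("previous value must be non-zero")
    if current == previous:
        return f"{name} was unchanged in the last evolution."
    percent = (current - previous) * 100 / abs(previous)
    direction = "increased" if percent > 0 else "decreased"
    return f"{name} {direction} by {abs(percent):.0f}% in the last evolution."


sentence = describe_change("Resource unit A", 100.0, 88.0)
```

Here `sentence` reproduces the example from the paragraph above. Combining the variable name, direction, and magnitude in a fixed template keeps the generated text consistent across cycles.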
[0063] The visualization channel receives the same structured data in parallel and selects an appropriate visualization template for graphical encoding. Environmental change data can be displayed using line charts, heatmaps, or bar charts to show the dynamic changes of variables with each evolutionary cycle; behavioral trajectory data can be represented by decision path diagrams or state transition diagrams to show the mapping between action sequences and execution feedback; historical decision parameters can be expressed using radar charts or matrix heatmaps to show strategy bias and parameter evolution. The visualization results are stored in the form of graphical tensors and serve as the visual interface for the language model, providing graphical information consistent with the semantics of language fragments.
[0064] Natural language descriptions and graphical representations are uniformly encapsulated into an unstructured information input package using a dual-channel structure. During the agent deployment phase, this package serves as the primary input source for the environment analysis module, forming multimodal perception content together with the structured information broadcast at the start of the simulation cycle. This approach enables a bidirectional mapping from structured information to unstructured semantics and graphical representations, thereby endowing the language model with contextual understanding and visual perception capabilities. This ensures the agent possesses a comprehensive understanding of trend structures, policy history, and environmental changes during closed-loop reasoning. The fusion of this unstructured and structured information drives the agent to generate semantic representations, further supporting subsequent behavioral policy generation and reflective optimization processes, forming a continuous information processing chain within the evolutionary system.
[0065] In one possible implementation, the intelligent agent includes an environment analysis module, a behavior generation module, and a reflection and optimization module. Deploying the intelligent agent in a closed-loop simulation system specifically includes: inputting structured and unstructured information into the intelligent agent; wherein the environment analysis module performs semantic compression and feature extraction by combining structured and unstructured information, and generates an updated environmental semantic representation based on interaction information; the behavior generation module generates behavioral strategies based on the environmental semantic representation, combined with the current environmental state, historical behavioral trajectories, and a reflection mechanism; wherein the structured information includes environmental change data, behavioral trajectory data, and historical decision parameters; after the behavior generation module generates behavioral strategies based on the environmental semantic representation, the reflection and optimization module performs strategy review and self-optimization, scores the behavioral trajectory data and behavioral strategies, selects the behavioral strategy with the best score through comparative analysis, and updates the model using a strategy distillation method.
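The scoring and comparative selection performed by the reflection and optimization module can be illustrated with the sketch below. The scoring function is supplied by the caller and is purely hypothetical, since the text does not specify how scores are computed. The strategy distillation update is also outside the scope of this sketch.

```python
from typing import Callable, Dict, List, Tuple

Strategy = Dict[str, float]  # action name mapped to intensity


def select_best(candidates: List[Strategy],
                score: Callable[[Strategy], float]) -> Tuple[Strategy, float]:
    """Score every candidate strategy and return the best one with its score."""
    if not candidates:
        raise ValueError("at least one candidate strategy is required")
    scores = [score(candidate) for candidate in candidates]
    # max() returns the first index holding the highest score, so ties favor earlier candidates.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


candidates = [{"harvest": 0.2}, {"harvest": 0.9}, {"harvest": 0.9}]
best, best_score = select_best(candidates, lambda s: s["harvest"])
```

Returning the score together with the strategy allows a later update step to weight how strongly the selected strategy should influence the model.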
[0066] Specifically, structured and unstructured information are input into the agent through two parallel injection paths. The structured information channel inputs three types of core data in tensor form: environmental change data, behavioral trajectory data, and historical decision parameters. Environmental change data are the increments of the state variables in the closed sandbox environment caused by executing the structured behavioral strategies of the previous simulation cycle, represented by a multidimensional numerical tensor Ins. Behavioral trajectory data are the agent's historical sequence of behavioral strategies together with the corresponding environmental responses. Historical decision parameters are the outputs of the previous round's strategy model, stored in the reflection and optimization module. The unstructured information channel carries external natural language information and interactive semantic information. It originates from the policy intent text G_date shared by the multi-agent semantic interaction module and from the natural language description of the environment, which provides supplementary context. This content is encoded into semantic vectors by a pre-trained language embedding model and input in parallel with the structured data.
[0067] The environment analysis module takes the aforementioned structured and unstructured information as input. First, it uniformly encodes the input data, mapping the numerical tensors to the semantic space via a multi-head attention structure. Simultaneously, it transforms the textual information into context vector representations. This module then calls the inference function Φ of a large language model to perform the semantic fusion process, generating an environmental semantic representation. The formula is defined as follows:
[0068] $$E = \Phi\left(\mathrm{Ins},\ H,\ G_{date},\ \pi\right)$$
[0069] where $\Phi$ is the fusion function of the large language model. Its inputs are the current environment state Ins, the semantic vectors of the historical action record $H$, the semantic interaction information $G_{date}$, and the reflection strategy $\pi$. The output $E$ is the semantic representation of the environment after the current cycle's update, and it serves as the reasoning basis for the behavior generation module.
[0070] The behavior generation module starts from the environmental semantic representation $E$, extracts the key causal structures, and combines them with the current environmental state Ins, the historical action sequence $H$, and the reference strategy $\pi$. It then executes the behavioral decision function to generate a structured behavioral strategy. The calculation formula is as follows:
[0071] $$A = \Phi\left(E,\ \mathrm{Ins},\ H,\ \pi\right)$$
[0072] The behavioral strategy $A$ is in a structured format and represents the agent's optimal action instruction in the current semantic state. This instruction conforms to the action space constraints of the system and serves as the input that drives the evolution of the state variables in the sandbox environment; it is therefore the source of the environmental change data.
[0073] After the simulation cycle ends, the reflection and optimization module scores and evaluates the behavioral strategies and the behavioral trajectory data. The behavioral trajectory data $H$ and the current strategy model $\pi$ are both input into the policy update function $\Phi$, which generates a bidirectional learning signal by comparing the best and worst policy samples. The policy model update employs a distillation mechanism that fuses archived empirical data with the current policy performance to generate a new reference policy. The update formula is as follows:
[0074] $$\pi' = \Phi\left(H,\ \pi\right)$$
[0075] where $H$ is the complete behavior record, $\pi$ is the current policy model, and $\pi'$ is the updated policy, which is fed back to the behavior generation module in the next cycle, completing the policy evolution loop. This mechanism supports continuous policy optimization and knowledge transfer, gives the agent stable adaptability and reflective capability over long-term evolution, and establishes a structured policy optimization path and an internal information circulation channel within the simulation system.
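The reflection step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate strategy is a numeric vector, that the scoring function is supplied by the caller, and that distillation is approximated by moving the reference strategy toward the best-scoring sample and away from the worst-scoring one. The names `reflect` and `lr` and the update rule itself are hypothetical and are not taken from the method described here.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def reflect(candidates: Sequence[Vector],
            score: Callable[[Vector], float],
            reference: Vector,
            lr: float = 0.5) -> Vector:
    """Toy stand-in for the distillation update (assumed rule)."""
    ranked = sorted(candidates, key=score)      # ascending by score
    worst, best = ranked[0], ranked[-1]
    # Bidirectional signal: pull toward the best sample, push away from the worst.
    return [r + lr * ((b - r) - 0.5 * (w - r))
            for r, b, w in zip(reference, best, worst)]

# Hypothetical usage: strategies are scored by closeness to a target action.
target = [1.0, 0.0]
score = lambda a: -sum((x - t) ** 2 for x, t in zip(a, target))
new_ref = reflect([[0.9, 0.1], [0.0, 1.0]], score, reference=[0.5, 0.5])
```

In this toy setting the updated reference scores higher than the old one, which is the property the comparison of best and worst samples is meant to provide.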
[0076] In one possible implementation, the agent extracts features based on environmental change data and natural language information, combines behavioral trajectory data and interaction information to generate an updated environmental semantic representation, and outputs behavioral strategies within a predefined action space. Specifically, this further includes: constructing a multi-agent semantic interaction module, wherein the multi-agent semantic interaction module supports each agent in exchanging its own perception results, behavioral decisions, and external analysis information with other agents at the natural language level; during the interaction of multiple agents, the semantic interaction module combines knowledge graphs and behavioral trajectory data to perform information discrimination and consistency verification on the interaction content, and feeds the semantically fused behavioral strategy information back to each agent to achieve collaborative optimization of the group strategy.
[0077] Specifically, the multi-agent semantic interaction module is implemented in three parts: interaction content generation, language expression construction, and semantic transmission. First, within each simulation cycle, after generating its behavior, each agent produces a structured summary of its perceived environmental state, its executed behavioral strategy, and its reflected policy representation. Using the semantic content construction function as an interface, the perception result Ins, the behavioral trajectory $H_i$, and the strategy parameters $\pi_i$ are fused into a natural language representation. This step calls the natural language generation function $\Phi$, whose structure is as follows:
[0078] $$L_i = \Phi\left(\mathrm{Ins},\ H_i,\ \pi_i\right)$$
[0079] where $L_i$ is the policy expression paragraph generated by the $i$-th agent in the current cycle, Ins is the environmental information, $H_i$ is the behavioral trajectory data, and $\pi_i$ is the reflection policy parameters. The generated natural language segment contains the current state judgment, the target intent, the policy basis, and the expected feedback. It is an explicit expression of the agent's own behavior and policy logic.
[0080] This natural language segment is broadcast to all other agents via the interaction module and is provided as a text stream for semantic parsing. Each agent receives the text segments from the other agents and performs semantic vector encoding on them to generate structured semantic input. At this point, the system introduces a knowledge graph as an auxiliary semantic retrieval structure to strengthen the agent's ability to understand concepts and to fill in contextual semantics in the language content. The knowledge graph is constructed from entity-relation-entity triples that embed a structured network of semantic relations, supporting entity alignment, context disambiguation, and relational reasoning over natural language fragments.
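A small sketch of how entity-relation-entity triples could serve as an auxiliary retrieval structure is given below. The class `TripleStore`, the substring-based entity matching, and the example triples are illustrative assumptions; a real system would use a proper entity linker and a graph store.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

class TripleStore:
    def __init__(self, triples: List[Triple]) -> None:
        self._by_head: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        for head, rel, tail in triples:
            self._by_head[head].append((rel, tail))

    def neighbors(self, entity: str) -> List[Tuple[str, str]]:
        """Return (relation, tail) pairs for an entity; empty list if unknown."""
        return list(self._by_head.get(entity, []))

    def mentioned(self, text: str) -> List[str]:
        """Naive entity linking: head entities whose name appears in the text."""
        lowered = text.lower()
        return sorted(e for e in self._by_head if e.lower() in lowered)

# Hypothetical triples and a hypothetical message from another agent.
kg = TripleStore([
    ("channel load", "affects", "signal strength"),
    ("resource capacity", "limits", "channel load"),
])
hits = kg.mentioned("I will lower channel load next cycle")
context = {e: kg.neighbors(e) for e in hits}
```

The retrieved `context` can then be appended to the received text before encoding, so that the receiving agent sees which other state variables the mentioned entity is related to.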
[0081] After the agents complete the semantic exchange, the interaction module enters the information discrimination and consistency verification phase. Based on the semantic consistency verification function Ψ, this phase matches all semantic input sequences against the local policy history to determine how consistent they are with the agent's own behavioral trajectory. The evaluation metrics include context similarity, policy goal alignment, and a language structure confidence score, and a multi-dimensional consistency score matrix is constructed using the semantic dependency graph. Finally, the verified policy language segments are reintegrated into the local policy context and fused into a single semantic input vector.
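[0082] The semantic fusion function is defined as follows:
[0083] $$Z_i = \sum_{j=1}^{M} \alpha_j \cdot \mathrm{Enc}\left(L_j\right)$$
[0084] where $Z_i$ is the fused semantic input vector of the $i$-th agent, $L_j$ is the $j$-th of the $M$ verified policy language segments, and $\alpha_j$ is the confidence weight of the $j$-th semantic input, learned automatically by the information discrimination mechanism. Enc is the language model encoder, which outputs semantic embedding vectors. Finally, the fused vector $Z_i$ is input, together with the environment state Ins, the behavior record $H_i$, and the reflection strategy $\pi_i$, into the environment analysis module of the $i$-th agent to update its environmental semantic representation $E_i$ for the next cycle:
[0085] $$E_i = \Phi\left(\mathrm{Ins},\ H_i,\ Z_i,\ \pi_i\right)$$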
[0086] The aforementioned mechanism achieves semantic coordination of policy intentions, behavioral choices, and environmental understanding among the group of agents, promoting policy convergence and local behavioral consistency. It is a key subsystem supporting multi-agent semantic games and policy linkage. The information screening and knowledge graph mechanisms ensure the quality of the interactive content, prevent low-value information from contaminating the semantic modeling path, and enhance the coordination and robustness of policy generation.
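The screening-then-fusion idea can be sketched as follows. The text states that the confidence weights are learned by the information discrimination mechanism; as a simplifying assumption, this sketch instead derives them from a softmax over per-message consistency scores and drops messages below a threshold. The function names `softmax` and `fuse` and the `threshold` parameter are hypothetical.

```python
import math
from typing import List, Sequence

def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings: Sequence[Sequence[float]],
         consistency: Sequence[float],
         threshold: float = 0.0) -> List[float]:
    """Drop inputs below the threshold, then take a confidence-weighted sum."""
    kept = [(e, s) for e, s in zip(embeddings, consistency) if s >= threshold]
    if not kept:
        raise ValueError("no semantic input passed consistency verification")
    weights = softmax([s for _, s in kept])
    dim = len(kept[0][0])
    return [sum(w * e[d] for w, (e, _) in zip(weights, kept)) for d in range(dim)]

# Three hypothetical message embeddings; the third fails verification.
fused = fuse([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
             consistency=[0.8, 0.8, -1.0])
```

Because the inconsistent third message is filtered out before weighting, it cannot pull the fused vector away from the two verified messages.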
[0087] S130: Execute the dynamic evolution of the virtual environment based on the behavioral strategies, superimpose Gaussian perturbations in a closed sandbox environment to calculate the environmental state increment, and update the global environmental state variables to drive the evolution platform into the next iteration.
[0088] In one possible implementation, the virtual environment dynamically evolves based on behavioral policies. Gaussian perturbations are superimposed on the environment state increment within a closed sandbox environment. Specifically, this includes: using the behavioral policy as the execution input for the current simulation cycle and inputting it into the closed sandbox environment; constructing an interaction intensity-driven state evolution function model within the closed sandbox environment to quantify the numerical perturbation impact of each agent on the environment state increment; defining the behavioral intensity weights of each agent on specific environment state variables within the current cycle; weighting and superimposing the behavioral intensities of multiple agents according to their behavioral intensity weights on the same state dimension to form the total intensity of behavioral effects on specific environment state variables; constructing a Gaussian perturbation term to simulate background noise perturbations caused by non-agent behaviors, wherein the Gaussian perturbation term satisfies zero mean and a set variance condition; using the total intensity of behavioral effects and the Gaussian perturbation term as input variables, substituting them into the state evolution function to solve for the environment state variables of the current cycle; and superimposing the environment state variables onto the global environment state variables of the previous simulation cycle to form the updated environment state increment.
[0089] Specifically, using the behavioral strategies as the execution input for the current simulation cycle means that the structured behavioral strategies output by the behavior generation module are passed to the environment evolution module. Each structured behavioral strategy is a symbolic multi-dimensional vector containing the action type, action target, and action intensity parameters for multiple environmental state variables. This vector is decoded through the execution interface and injected into the sandbox environment as a control signal. After receiving the structured behavioral strategies from all agents, the sandbox environment aligns them by timestamp and agent identity and organizes them into a unified policy action input matrix, which prepares the input data for the subsequent evolution function.
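As a minimal sketch of the alignment step, the code below groups decoded actions by simulation cycle and agent identity into an agents-by-variables matrix. The record layout `Action`, the function `build_action_matrix`, and the agent and variable names are assumptions made for illustration.

```python
from typing import List, NamedTuple

class Action(NamedTuple):
    timestamp: int      # simulation cycle in which the action was issued
    agent_id: str
    variable: str       # target environmental state variable
    intensity: float    # signed action intensity

def build_action_matrix(actions: List[Action],
                        agents: List[str],
                        variables: List[str],
                        cycle: int) -> List[List[float]]:
    """Rows follow `agents`, columns follow `variables`; other cycles are ignored."""
    row = {a: i for i, a in enumerate(agents)}
    col = {v: j for j, v in enumerate(variables)}
    matrix = [[0.0] * len(variables) for _ in agents]
    for act in actions:
        if act.timestamp == cycle:
            matrix[row[act.agent_id]][col[act.variable]] += act.intensity
    return matrix

acts = [Action(3, "a2", "load", -0.5), Action(3, "a1", "capacity", 2.0),
        Action(2, "a1", "load", 9.0)]   # the last one belongs to an older cycle
m = build_action_matrix(acts, ["a1", "a2"], ["capacity", "load"], cycle=3)
```

Keeping one row per agent preserves the agent index inside each variable dimension, which the later weighting step relies on.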
[0090] The interaction intensity-driven state evolution function model is constructed by establishing the following update formula for each state variable dimension:
[0091] $$S_k^{t+1} = S_k^{t} + \beta_k \cdot I_k^{t} + \varepsilon_k^{t}$$
[0092] where $S_k^{t}$ is the value of the $k$-th state variable at time $t$, $\beta_k$ is its corresponding response coefficient, $I_k^{t}$ is the total intensity of behavioral influence on this variable during the current period, and $\varepsilon_k^{t}$ is the Gaussian perturbation term on this variable. The evolution function model is a causal response model for each state dimension, ensuring that the agent's policy output can be quantified and fed back into the system state.
[0093] To define the intensity weight of each agent's behavior on a specific state variable in the current cycle, each action instruction in the behavior policy vector is assigned to its state dimension by a variable mapping function, and the behavioral intensity is calculated from its action parameters. For example, for the resource allocation variable, if agent $i$ executes the strategy "request resource x with magnitude r" in the current cycle, its behavioral intensity can be expressed as $a_{i,x}^{t} = \kappa \cdot r$, where $\kappa$ is the sensitivity coefficient preset in the policy template. Every intensity value retains its agent index within the variable dimension.
[0094] The total intensity of the actions of multiple agents is formed by a weighted sum over the same state dimension, using the following formula:
[0095] $$I_k^{t} = \sum_{i=1}^{N} w_{i,k} \cdot a_{i,k}^{t}$$
[0096] where $w_{i,k}$ is the behavioral intensity weight of the $i$-th agent on variable $k$, determined by its historical impact or strategy level; $a_{i,k}^{t}$ is the corresponding behavioral intensity; $N$ is the number of agents; and $I_k^{t}$ is the aggregated total behavioral intensity, the core driving force that determines the direction and magnitude of state evolution.
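The weighted aggregation for a single state dimension can be written in a few lines. The helper names `behavior_intensity` and `total_intensity`, the sensitivity value, and the example weights are illustrative assumptions.

```python
from typing import List

def behavior_intensity(magnitude: float, kappa: float) -> float:
    """Intensity of one action: sensitivity coefficient times requested magnitude."""
    return kappa * magnitude

def total_intensity(weights: List[float], intensities: List[float]) -> float:
    """Weighted sum over agents for a single state dimension."""
    if len(weights) != len(intensities):
        raise ValueError("one weight per agent is required")
    return sum(w * a for w, a in zip(weights, intensities))

# Three hypothetical agents requesting 4, 2 and 0 units of the same resource.
a = [behavior_intensity(r, kappa=0.5) for r in (4.0, 2.0, 0.0)]
I_k = total_intensity([0.5, 0.25, 0.25], a)
```

With these numbers the per-agent intensities are 2.0, 1.0 and 0.0, and the aggregated intensity is 0.5·2.0 + 0.25·1.0 + 0.25·0.0 = 1.25.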
[0097] To construct the Gaussian perturbation term that simulates background noise caused by non-agent behavior, a random perturbation source $\varepsilon_k^{t}$ is introduced into the evolution function. It satisfies:
[0098] $$\varepsilon_k^{t} \sim \mathcal{N}\left(0,\ \sigma_k^{2}\right)$$
[0099] where $\sigma_k^{2}$ is the set disturbance variance, used to control the stability and sensitivity of the system. The noise term keeps the system robust to uncontrollable external disturbances while retaining a degree of evolutionary uncertainty that simulates sporadic changes in a real environment.
[0100] Using the total behavioral intensity and the Gaussian perturbation term as input variables, the change in each environmental state variable is solved by substituting them into the state evolution function as follows, with $\beta_k$ the response coefficient of the $k$-th state variable, $I_k^{t}$ its total behavioral intensity, and $\varepsilon_k^{t}$ its perturbation term:
[0101] $$\Delta S_k^{t} = \beta_k \cdot I_k^{t} + \varepsilon_k^{t}$$
[0102] The formula is solved independently for each variable dimension, yielding the changes of all state variables in the current period, which are then aggregated into a complete state update result in matrix form.
[0103] The solved increments of the environmental state variables are added to the global environmental state tensor of the previous simulation cycle. The update is as follows:
[0104] $$S^{t+1} = S^{t} + \Delta S^{t}$$
[0105] where $\Delta S^{t}$ is the state increment vector collected over all dimensions $k$, $S^{t}$ is the global environment state of the previous cycle, and $S^{t+1}$ is the updated state, which is input to all agents as structured information for the next cycle. This process ensures that the environment's response to the agents' policies forms a closed feedback loop and evolves continuously from cycle to cycle. The mechanism is behavior-driven, quantifiable and controllable, and adaptive to perturbation, and it is the core model logic for constructing a dynamic feedback simulation platform.
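One evolution cycle of this kind can be sketched as below, assuming a linear response (response coefficient times total intensity) plus zero-mean Gaussian noise per dimension. The function name `evolve`, the parameter names, and the numeric values are illustrative assumptions; `sigma` is a standard deviation, so the variance is its square.

```python
import random
from typing import List

def evolve(state: List[float], beta: List[float], intensity: List[float],
           sigma: List[float], rng: random.Random) -> List[float]:
    """One cycle: each dimension moves by beta*I plus zero-mean Gaussian noise."""
    increments = [b * i + rng.gauss(0.0, s)
                  for b, i, s in zip(beta, intensity, sigma)]
    # Superimpose the increments onto the previous global state.
    return [x + dx for x, dx in zip(state, increments)]

rng = random.Random(0)          # seeded for reproducibility
s0 = [10.0, 5.0]
# With sigma = 0 the update is deterministic: 10 + 0.5*2 and 5 + 1.0*(-1).
s1 = evolve(s0, beta=[0.5, 1.0], intensity=[2.0, -1.0], sigma=[0.0, 0.0], rng=rng)
noisy = evolve(s0, beta=[0.5, 1.0], intensity=[2.0, -1.0], sigma=[0.1, 0.1], rng=rng)
```

Setting the standard deviation to zero recovers the purely behavior-driven update, which makes it easy to separate the effect of agent actions from background noise when debugging a simulation.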
[0106] In another possible implementation, the virtual environment dynamically evolves based on the behavioral strategy, and Gaussian perturbations are superimposed in a closed sandbox environment to calculate the environmental state increment. Specifically, this includes: inputting the behavioral strategy as the execution input for the current simulation cycle into the closed sandbox environment, wherein the behavioral strategy includes variable types, strategy directions, strategy mode types, and strategy confidence for multiple environmental state variables; in the sandbox environment, finding and binding a set of mechanism response mapping functions for each environmental state variable, wherein the mechanism response mapping function is used to generate a nonlinear behavioral response factor corresponding to the environmental state variable based on the input strategy mode type and strategy confidence; based on the nonlinear behavioral response factor, combined with the weight parameters of the strategy mode, calculating the behavioral effect increment value of the environmental state variable through a combination of dual nonlinear functions; simultaneously selecting a perturbation function matching the variable type from the perturbation function family to construct a perturbation term under the variable dimension; substituting the behavioral effect increment value and the perturbation term as joint inputs into the state update formula to obtain the updated value of the environmental state variable for the current cycle; and superimposing the updated value onto the environmental state tensor of the previous cycle to generate the environmental state increment for the current cycle.
[0107] Specifically, the dynamic evolution of the virtual environment is implemented through a nonlinear response modeling mechanism for structured behavior strategies, combined with a family of perturbation functions. The state update path is constructed from the mapping functions and the perturbation generation, realizing dynamic feedback and temporal evolution of the environmental variables.
[0108] First, the structured behavior strategy output by the behavior generation module within the current simulation cycle is input into the closed sandbox environment. The behavior strategy is encoded in tensor format and includes a four-element structure of variable type, policy direction, policy mode type, and policy confidence for multiple environmental state variables. For example, variable type is used to identify the target object, such as resource capacity or connection stability; policy direction defines the trend of action, such as increase or decrease; policy mode type describes the behavior logic, such as single-point action or collaborative intervention; and policy confidence represents the degree of certainty of the current behavior in the agent's internal inference chain, with a value range of [0,1].
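The four-element structure can be represented as a small validated record. The class and enum names below (`PolicyItem`, `Direction`, `Mode`) are hypothetical; only the four fields and the [0, 1] confidence range come from the description above.

```python
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    INCREASE = 1
    DECREASE = -1

class Mode(Enum):
    SINGLE_POINT = "single_point"
    COLLABORATIVE = "collaborative"

@dataclass(frozen=True)
class PolicyItem:
    variable: str        # target state variable, e.g. "resource_capacity"
    direction: Direction # trend of the action
    mode: Mode           # behavior logic of the action
    confidence: float    # certainty in the agent's inference chain, in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

item = PolicyItem("resource_capacity", Direction.INCREASE, Mode.COLLABORATIVE, 0.8)
```

Validating the confidence at construction time keeps malformed strategies from reaching the sandbox, where they would otherwise distort the response mapping.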
[0109] Secondly, the sandbox environment performs a function retrieval operation on each target state variable in the structured behavior strategy to find a set of mechanism response mapping functions bound to its type. This set is preset by the system during the deployment phase, and each state variable is bound to at least one type of mapping function, used to generate a nonlinear response factor based on the strategy mode type and strategy confidence. This nonlinear response factor is used to simulate the actual perturbation intensity of the strategy behavior on the state variable, and its generation form can be defined as follows:
[0110] $$\delta_k = \lambda_k \cdot \tanh\left(\sum_{i=1}^{N} w_{i,k} \cdot \mathrm{sigmoid}\left(c_{i,k}\right)\right)$$
[0111] where $\delta_k$ is the behavioral effect increment of the $k$-th environmental state variable; $\lambda_k$ is the global response adjustment factor of the $k$-th state variable, which controls the upper limit of the increment; $w_{i,k}$ is the response weight of the $i$-th agent on variable $k$, which depends on its policy mode type; and $c_{i,k}$ is the policy confidence of the $i$-th agent with respect to variable $k$. The sigmoid and tanh functions form a dual nonlinear mapping that simulates the inhibition-saturation effect of the behavioral response.
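The saturation behavior of a dual sigmoid and tanh mapping can be checked numerically. The sketch below assumes one possible composition of the two functions: a sigmoid applied to each agent's confidence, a weighted sum over agents, and a tanh over that sum scaled by the global adjustment factor. This composition and the names `behavior_increment` and `lam` are assumptions for illustration.

```python
import math
from typing import Sequence

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def behavior_increment(lam: float,
                       weights: Sequence[float],
                       confidences: Sequence[float]) -> float:
    """Assumed form: lam * tanh(sum_i w_i * sigmoid(c_i)); bounded by |lam|."""
    drive = sum(w * sigmoid(c) for w, c in zip(weights, confidences))
    return lam * math.tanh(drive)

small = behavior_increment(2.0, [1.0], [0.5])            # a single agent
large = behavior_increment(2.0, [1.0] * 50, [1.0] * 50)  # fifty agents acting together
```

Under this assumed form, adding more agents increases the increment, but it can never exceed the adjustment factor, which matches the stated role of that factor as an upper limit.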
[0112] Furthermore, the system selects a matching perturbation function from the perturbation function family for each variable type, generating perturbation terms for each variable dimension. The predefined perturbation function family includes Gaussian perturbations, Poisson perturbations, and periodic perturbations. If a Gaussian perturbation is selected, its definition is:
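[0113] $$\varepsilon_k^{t} \sim \mathcal{N}\left(0,\ \sigma_k^{2}\right)$$
[0114] where $\varepsilon_k^{t}$ is the disturbance term of the $k$-th state variable during period $t$, and $\sigma_k^{2}$ is the perturbation variance, which represents the sensitivity of the variable to external non-strategy factors.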
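Selecting a perturbation function by variable type can be sketched as a lookup table of callables. The three families are those named above; their parameterization, the centering of the Poisson shocks, the sine form of the periodic term, and the binding of variable names to families in `FAMILY` are illustrative assumptions.

```python
import math
import random
from typing import Callable, Dict

Perturbation = Callable[[int, random.Random], float]  # (cycle t, rng) -> noise

def gaussian(sigma: float) -> Perturbation:
    return lambda t, rng: rng.gauss(0.0, sigma)

def poisson(rate: float, jump: float) -> Perturbation:
    """Centered Poisson shocks: (count - rate) * jump, sampled by Knuth's method."""
    def sample(t: int, rng: random.Random) -> float:
        limit, count, prod = math.exp(-rate), 0, rng.random()
        while prod > limit:
            count += 1
            prod *= rng.random()
        return (count - rate) * jump
    return sample

def periodic(amplitude: float, period: int) -> Perturbation:
    return lambda t, rng: amplitude * math.sin(2.0 * math.pi * t / period)

# Hypothetical binding of variable types to perturbation functions.
FAMILY: Dict[str, Perturbation] = {
    "signal_strength": gaussian(0.1),
    "resource_capacity": poisson(rate=2.0, jump=0.5),
    "channel_load": periodic(amplitude=1.0, period=4),
}

rng = random.Random(42)
noise = {name: fn(1, rng) for name, fn in FAMILY.items()}
```

Giving every family the same call signature lets the sandbox swap perturbation types per variable without changing the state update code.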
[0115] Then, the system substitutes the behavioral effect increment $\delta_k$ and the disturbance term $\varepsilon_k^{t}$ into the state update formula as joint inputs and calculates the new value of the state variable for the current period:
[0116] $$S_k^{t+1} = S_k^{t} + \delta_k + \varepsilon_k^{t}$$
[0117] where $S_k^{t+1}$ is the updated value of the $k$-th state variable in period $t+1$; $S_k^{t}$ is the value of the variable in the previous period; $\delta_k$ is the nonlinear increment triggered by the strategy; and $\varepsilon_k^{t}$ is the system fluctuation value output by the disturbance function.
[0118] Finally, the updated values $S_k^{t+1}$ of all variable dimensions are aggregated and superimposed onto the environmental state tensor $S^{t}$ of the previous period to form the environmental state tensor $S^{t+1}$ of the current period, and its increment is calculated:
[0119] $$\Delta S^{t} = S^{t+1} - S^{t}$$
[0120] The above process completes the feedback path mapping from structured behavioral policies to global environmental state variables, constructing a dynamic evolution mechanism based on nonlinear behavioral modeling and perturbation simulation. The policy tensor output by the behavior generation module directly controls the state update amplitude. The sandbox environment extends its feedback mechanism through mechanism mapping and perturbation functions. The tensor after the environmental state update is used as structured information input to the agent's environment analysis module in the next cycle.
[0121] In one possible implementation, updating the global state variables of the environment to drive the evolution platform into the next iteration specifically includes: feeding back the incremental environmental state as one of the structured information inputs for the next cycle to the agent, which constitutes the basic environmental variable data perceived by the agent, and together with the unstructured information, constitutes the input of the environmental semantic representation; generating a new round of behavioral policies in the updated environmental state and transmitting them to the closed sandbox environment to execute the next round of state evolution.
[0122] Specifically, feeding the environmental state increment back to the agent as one of the structured information inputs for the next cycle requires the environmental state management module, after the state variables have been updated in the sandbox environment, to broadcast the updated global state tensor to all agents in a unified format. This tensor is a multi-dimensional numerical matrix in which each dimension represents a state variable of the simulation environment, such as resource distribution, channel load, signal strength, or area security factor. The feedback mechanism synchronizes state through shared memory mapping or a distributed message queue, ensuring that every agent receives a consistent view of the environment. The structured information is input to the agent's environment analysis module, where it is combined with the unstructured information, namely the natural language content from the semantic interaction module. The two are uniformly encoded within the model to form the environmental semantic representation of the current period. The environmental semantic representation is a high-dimensional semantic abstraction of the current environmental state that includes both structured state perception and contextual language background, and it supports semantic reasoning in the policy generation module.
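The consistent-view requirement can be illustrated with an in-process stand-in for a message queue. The class `StateBroadcaster` and the message layout are assumptions; a deployment would replace the per-agent `queue.Queue` with shared memory or a distributed broker.

```python
import copy
import queue
from typing import Dict, List

class StateBroadcaster:
    """In-process stand-in for a distributed message queue (one inbox per agent)."""

    def __init__(self, agent_ids: List[str]) -> None:
        self.inboxes: Dict[str, queue.Queue] = {a: queue.Queue() for a in agent_ids}

    def broadcast(self, cycle: int, state: List[float]) -> None:
        for inbox in self.inboxes.values():
            # Deep copy so no agent can mutate another agent's view.
            inbox.put({"cycle": cycle, "state": copy.deepcopy(state)})

bus = StateBroadcaster(["a1", "a2"])
bus.broadcast(cycle=7, state=[11.0, 4.0])
views = {a: q.get_nowait() for a, q in bus.inboxes.items()}
```

Tagging each message with its cycle number lets an agent detect and discard a stale state if messages arrive out of order.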
[0123] To generate a new round of behavioral strategies in the updated environment, the policy inference function Φ is invoked with the environmental semantic representation as its input basis, together with the historical behavior trajectory and the reference strategy of the previous cycle, and the decision generation logic in the behavior generation module is executed to output a structured behavioral strategy. The strategy is a multi-dimensional behavior vector that conforms to the predefined action space specification. Each dimension represents the behavioral decision made for a specific state variable and includes the action category, execution intensity, and target object. After structured encoding, the behavioral strategy is transmitted to the execution module in the closed sandbox environment. In the next simulation cycle it drives the evolution function model again, interacting with the new environmental state variables and triggering a new numerical response. In this way a continuous iterative relationship forms between the behavioral strategy and the environmental state, realizing a closed-loop perception-reasoning-execution-feedback evolution mechanism that supports continuous strategy evolution and state adaptation of the agent during dynamic games.
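The perception-reasoning-execution-feedback cycle reduces to a short loop once the policy and the sandbox step are treated as black boxes. In the sketch below the language-model policy is replaced by a placeholder proportional rule and the sandbox by additive noise; the function `run_closed_loop`, both placeholders, and all numeric values are assumptions for illustration only.

```python
import random
from typing import Callable, List

State = List[float]

def run_closed_loop(state: State,
                    policy: Callable[[State], List[float]],
                    step: Callable[[State, List[float]], State],
                    cycles: int) -> List[State]:
    """Perception -> reasoning -> execution -> feedback, repeated for `cycles`."""
    history = [list(state)]
    for _ in range(cycles):
        action = policy(state)        # reasoning on the perceived state
        state = step(state, action)   # sandbox executes and returns the new state
        history.append(list(state))   # fed back as the next cycle's input
    return history

rng = random.Random(1)
target = [1.0]
policy = lambda s: [0.5 * (t - x) for x, t in zip(s, target)]    # placeholder policy
step = lambda s, a: [x + u + rng.gauss(0.0, 0.01) for x, u in zip(s, a)]
trace = run_closed_loop([0.0], policy, step, cycles=20)
```

Even with this trivial policy the state settles near the target despite the noise, which shows the role of the feedback path: each cycle's action is computed from the state produced by the previous one.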
[0124] This embodiment also discloses a multimodal language model closed-loop simulation device. Referring to Figure 2, the device includes an acquisition module 201, a processing module 202, and an output module 203, and is used to execute any of the multimodal language model closed-loop simulation methods described above, wherein:
[0125] The acquisition module 201 is used to construct a closed-loop simulation system for multi-agent autonomous evolution. In the closed-loop simulation system, environmental change data caused by agent behavior is collected through policy interaction and game conflict between agents, and behavioral trajectory data is continuously recorded during the evolution process.
[0126] The processing module 202 is used to deploy an intelligent agent in the closed-loop simulation system. The intelligent agent extracts features based on environmental change data and natural language information, combines behavioral trajectory data and interaction information to generate an updated environmental semantic representation, and outputs behavioral strategies within a predefined action space.
[0127] Output module 203 is used to perform dynamic evolution of the virtual environment based on behavior strategy. It superimposes Gaussian perturbations in the closed sandbox environment to calculate the environment state increment and updates the global state variables of the environment to drive the evolution platform into the next iteration.
[0128] In one possible implementation, the acquisition module 201 is used to construct a multi-agent semantic interaction module, wherein the multi-agent semantic interaction module supports each agent in exchanging its own perception results, behavioral decisions, and external analysis information with other agents at the natural language level.
[0129] The processing module 202 is used to perform information identification and consistency verification of the interaction content by combining knowledge graph and behavior trajectory data during the interaction of multiple intelligent agents, and to feed back the behavior strategy information after semantic fusion to each intelligent agent to achieve collaborative optimization of group strategy.
[0130] In one possible implementation, the processing module 202 is used to convert environmental change data, behavioral trajectory data, and historical decision parameters of the agent into natural language descriptions and visual graphical representations in parallel, generating unstructured information.
[0131] In one possible implementation, the output module 203 is used to input structured information and unstructured information to the agent. The environment analysis module combines the structured and unstructured information to perform semantic compression and feature extraction, and generates an updated environmental semantic representation based on the interaction information. The behavior generation module generates a behavior strategy based on the environmental semantic representation, combined with the current environmental state, historical behavior trajectory, and reflection mechanism. The structured information includes environmental change data, behavior trajectory data, and historical decision parameters.
[0132] The processing module 202 is used to perform strategy review and self-optimization by the reflection and optimization module after the behavior generation module generates behavior strategies based on the semantic representation of the environment. It scores the behavior trajectory data and behavior strategies, selects the behavior strategy with the best score through comparative analysis, and updates the model by combining the strategy distillation method.
[0133] In one possible implementation, the processing module 202 is used to input the behavioral strategy as the execution input of the current simulation cycle into a closed sandbox environment, and in the closed sandbox environment, to construct an interaction intensity-driven state evolution function model to quantify the numerical perturbation effect of each agent on the environmental state increment.
[0134] Processing module 202 is used to define the intensity weight of the behavior of each agent on a specific environmental state variable in the current period.
[0135] The processing module 202 is used to weight and superimpose the behavioral intensity of multiple agents on the same state dimension according to the behavioral intensity weight to form the total intensity of the behavioral effect of a specific environmental state variable.
[0136] The processing module 202 is used to construct a Gaussian perturbation term to simulate the background noise perturbation caused by the behavior of a non-intelligent agent, wherein the Gaussian perturbation term satisfies the conditions of zero mean and set variance.
[0137] Processing module 202 is used to solve for the environmental state variables of the current period by substituting the total intensity of the behavioral action and the Gaussian perturbation term into the state evolution function.
[0138] The processing module 202 is used to superimpose the environmental state variables onto the global environmental state variables of the previous simulation cycle to form the updated environmental state increment.
[0139] In one possible implementation, the processing module 202 is used to send back the environmental state increment as one of the structured information inputs for the next cycle to the agent, which constitutes the basic environmental variable data perceived by the agent, and together with the unstructured information, constitutes the environmental semantic representation input.
[0140] The processing module 202 is used to generate a new round of behavioral policies in the updated environment state and pass them to the closed sandbox environment to execute the next round of state evolution.
[0141] In one possible implementation, the processing module 202 is used to set the initial resource state of the closed sandbox environment, as well as the number of agents, agent identification, and initial policy parameters, to establish a closed-loop simulation system with resource competition constraints and zero-sum game rules.
[0142] The output module 203 is used to broadcast the updated environmental state increment of the previous cycle to all agents at the beginning of each simulation cycle, as structured information input to the agents, and to input historical behavior records and external communication semantic information as unstructured information input.
[0143] It should be noted that the above embodiments of the apparatus are only illustrated by the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.
[0144] This embodiment also discloses an electronic device. Referring to Figure 3, the electronic device may include: at least one processor 301, at least one communication bus 302, a user interface 303, a network interface 304, and at least one memory 305.
[0145] The communication bus 302 is used to enable communication between these components.
[0146] The user interface 303 may include a display screen and a camera. Optionally, the user interface 303 may also include a standard wired interface and a wireless interface.
[0147] The network interface 304 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).
[0148] The processor 301 may include one or more processing cores. The processor 301 connects to various parts of the server using various interfaces and lines, and performs various server functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in memory 305, and by calling data stored in memory 305. Optionally, the processor 301 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 301 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications. The GPU is responsible for rendering and drawing the content required for display. The modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 301 and may be implemented as a separate chip.
[0149] The memory 305 may include random access memory (RAM) or read-only memory. Optionally, the memory may include a non-transitory computer-readable storage medium. The memory 305 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 305 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch function, sound playback function, image playback function, etc.), instructions for implementing the various method embodiments described above, etc. The data storage area may store data involved in the various method embodiments described above. Optionally, the memory 305 may also be at least one storage device located remotely from the aforementioned processor 301. As a computer storage medium, the memory 305 may include an operating system, a network communication module, a user interface 303 module, and an application program for a multimodal language model closed-loop simulation method.
[0150] In the electronic device shown in Figure 3, the user interface 303 is primarily used to provide an input interface for the user and to acquire user input data. The processor 301 can be used to call an application program stored in the memory 305 that implements a multimodal language model closed-loop simulation method. When the application program is executed by one or more processors 301, the electronic device performs one or more methods as described in the above embodiments.
[0151] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, as some steps can be performed in other orders or simultaneously according to the present invention. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to the present invention.
[0152] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0153] In the several embodiments provided by this invention, it should be understood that the disclosed apparatus can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some service interface, and the indirect coupling or communication connection between apparatuses or units may be electrical or in other forms.
[0154] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0155] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0156] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage device. Based on this understanding, the technical solution of this invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory 305 and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned memory 305 includes various media capable of storing program code, such as a USB flash drive, external hard drive, magnetic disk, or optical disk.
[0157] The present invention also discloses a computer-readable storage medium storing instructions. When executed by one or more processors 301, these instructions cause an electronic device to perform one or more methods as described in the above embodiments.
[0158] The above are merely exemplary embodiments of this disclosure and should not be construed as limiting the scope of this disclosure. Any equivalent changes and modifications made in accordance with the teachings of this disclosure shall still fall within the scope of this disclosure. Those skilled in the art will readily conceive of other embodiments of this disclosure upon considering the specification and practicing the disclosure herein. This invention is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not described in this disclosure. The specification and embodiments are to be considered exemplary only, and the scope and spirit of this disclosure are defined by the claims.
Claims
1. A closed-loop simulation method for multimodal language models, characterized in that the method includes:
constructing a closed-loop simulation system for multi-agent autonomous evolution, wherein, in this closed-loop simulation system, environmental change data caused by the behavior of the agents is collected through policy interactions and game conflicts between the agents, and behavioral trajectory data is continuously recorded during the evolution process; the environmental change data is the numerical increment of the system state variables after being affected by the behavioral policies during the simulation evolution; the behavioral trajectory data is the set of historical behavioral policy sequences of the agents and the corresponding environmental response data; the system state variables include the change vectors of resource distribution, constraint strength, or state feedback indicators;
deploying the intelligent agent in the closed-loop simulation system, wherein the intelligent agent performs feature extraction based on the environmental change data and natural language information, generates an updated environmental semantic representation by combining the behavioral trajectory data and interaction information, and outputs behavioral strategies in a predefined action space; the interaction information is the interaction content supported by the multi-agent semantic interaction module for each intelligent agent to interact with its own perception results, behavioral decisions and external analysis information of other intelligent agents at the natural language level;
dynamically evolving the virtual environment based on the aforementioned behavioral strategy, superimposing Gaussian perturbations in the closed sandbox environment to calculate the environmental state increment, and updating the global environmental state variables to drive the evolution platform into the next iteration;
wherein Gaussian perturbation terms are constructed to simulate background noise perturbations caused by non-agent behavior; historical behavior records and external communication semantic information are organized into natural language descriptions and input as unstructured information into the semantic perception module of each agent; the historical behavior records include the agent's own structured behavior strategy sequences and execution results in the past multiple periods, for which semantic descriptions are generated by a structured-to-natural language conversion module; the external communication semantic information is provided by the semantic interaction module and reflects the strategic intentions, game tendencies, or warning feedback issued by other agents in the previous period; after the structured and unstructured information are fused at the semantic layer, they jointly drive the agent to generate a semantic representation of the current simulation environment; the structured information includes the environmental change data, the behavior trajectory data, and historical decision parameters.
2. The multimodal language model closed-loop simulation method according to claim 1, characterized in that, The intelligent agent extracts features based on the environmental change data and natural language information, combines the behavioral trajectory data and interaction information to generate an updated environmental semantic representation, and outputs a behavioral strategy within a predefined action space. Specifically, it also includes: A multi-agent semantic interaction module is constructed, wherein the multi-agent semantic interaction module supports each agent to interact with its own perception results, behavioral decisions and external analysis information of other agents at the natural language level; During the interaction of multiple intelligent agents, the semantic interaction module combines the knowledge graph and the behavior trajectory data to perform information identification and consistency verification on the interaction content, and feeds back the semantically fused behavior strategy information to each intelligent agent to achieve collaborative optimization of group strategies. The knowledge graph is composed of entity-relation-entity triples and embedded in a structured semantic relation network, supporting entity alignment, context disambiguation and relation reasoning in natural language fragments.
3. The multimodal language model closed-loop simulation method according to claim 1, characterized in that, Before deploying the agent in the closed-loop simulation system, the method includes: The environmental change data, the behavioral trajectory data, and the agent's historical decision parameters are converted into natural language descriptions and visual graphical representations in parallel to generate unstructured information.
4. The multimodal language model closed-loop simulation method according to claim 3, characterized in that, The intelligent agent includes an environment analysis module, a behavior generation module, and a reflection and optimization module. Deploying the intelligent agent in the closed-loop simulation system specifically includes: The agent is fed structured and unstructured information. The environment analysis module performs semantic compression and feature extraction by combining the structured and unstructured information, and generates an updated environmental semantic representation by combining the interaction information. The behavior generation module generates the behavior strategy based on the environmental semantic representation, combined with the current environmental state, historical behavior trajectory and reflection mechanism. After the behavior generation module generates the behavior strategy based on the environmental semantic representation, the reflection and optimization module performs strategy review and self-optimization, scores the behavior trajectory data and behavior strategy, selects the behavior strategy with the best score through comparative analysis, and updates the model by combining the strategy distillation method.
5. The multimodal language model closed-loop simulation method according to claim 1, characterized in that dynamically evolving the virtual environment based on the aforementioned behavioral strategy, and calculating the environmental state increment by superimposing Gaussian perturbations in the closed sandbox environment, specifically includes:
using the behavioral strategy as the execution input for the current simulation cycle and inputting it into the closed sandbox environment;
constructing, in the closed sandbox environment, an interaction intensity-driven state evolution function model to quantify the numerical perturbation effect of each agent on the environmental state increment;
defining the intensity weight of the action of each agent on a specific environmental state variable in the current cycle;
weighting and superimposing the behavioral intensities of multiple agents on the same state dimension according to the behavioral intensity weights to form the total intensity of the behavioral effect on the specific environmental state variable;
constructing a Gaussian perturbation term to simulate background noise perturbation caused by non-agent behavior, wherein the Gaussian perturbation term satisfies the conditions of zero mean and a set variance;
substituting the total intensity of the behavioral effect and the Gaussian perturbation term, as input variables, into the state evolution function to solve for the environmental state variables of the current period; and
superimposing the environmental state variables on the global environmental state variables of the previous simulation cycle to form the updated environmental state increment.
6. The multimodal language model closed-loop simulation method according to claim 5, characterized in that, The process of updating the global state variables of the environment to drive the evolution platform into the next iteration specifically includes: The environmental state increment is fed back to the agent as one of the structured information inputs for the next cycle, forming the basic environmental variable data perceived by the agent, which together with the unstructured information constitutes the input of the environmental semantic representation. A new set of behavioral strategies is generated in the updated environment and passed to the closed sandbox environment to execute the next round of state evolution.
7. The multimodal language model closed-loop simulation method according to claim 1, characterized in that, The closed-loop simulation system collects environmental change data caused by the actions of the agents through policy interactions and game conflicts between agents, and continuously records behavioral trajectory data during the evolution process. Specifically, it also includes: The initial resource state of the closed sandbox environment is set, as well as the number of agents, agent identification, and initial policy parameters of the agents, to establish a closed-loop simulation system with resource competition constraints and zero-sum game rules; At the beginning of each simulation cycle, the updated environmental state increment from the previous cycle is broadcast to all agents as structured information input to the agents, while historical behavior records and external communication semantic information are input as unstructured information.
8. A closed-loop simulation device for multimodal language models, characterized in that, The device is used to execute a multimodal language model closed-loop simulation method as described in any one of claims 1-7, the device comprising an acquisition module, a processing module, and an output module, wherein: The acquisition module is used to construct a closed-loop simulation system for multi-agent autonomous evolution. In the closed-loop simulation system, environmental change data caused by the behavior of the agents are collected through strategic interactions and game conflicts between the agents, and behavioral trajectory data are continuously recorded during the evolution process. The processing module is used to deploy the intelligent agent in the closed-loop simulation system, wherein the intelligent agent performs feature extraction based on the environmental change data and natural language information, combines the behavior trajectory data and interaction information to generate an updated environmental semantic representation, and outputs a behavior strategy within a predefined action space; The output module is used to perform dynamic evolution of the virtual environment based on the behavior strategy, calculate the environment state increment by superimposing Gaussian perturbation in the closed sandbox environment, and update the global state variables of the environment to drive the evolution platform into the next iteration.
9. An electronic device, characterized in that, The device includes a processor, a communication bus, a user interface, a network interface, and a memory. The memory is used to store instructions. The user interface and the network interface are both used to communicate with other devices. The communication bus is used to enable communication between the components within the electronic device. The processor is used to execute the instructions stored in the memory to cause the electronic device to perform the method as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores instructions that, when executed, perform the method as described in any one of claims 1-7.