Ecological environment evolution simulation and planning scheme evaluation system based on reinforcement learning

A reinforcement learning-based system for simulating ecological environment evolution and evaluating planning schemes addresses the tension between simulation accuracy and decision-making agility that traditional methods face in complex ecological environment systems. The system enables dynamic, closed-loop evaluation and adaptive management of ecological environment systems, improving the applicability and interpretability of the evaluation.

CN122287286A · Pending · Publication Date: 2026-06-26 · NANJING ACAD OF ENVIRONMENTAL PROTECTION SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING ACAD OF ENVIRONMENTAL PROTECTION SCI
Filing Date
2026-02-04
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies face a tension between simulation accuracy and decision-making agility in ecological environment simulation and planning scheme evaluation. Traditional methods are difficult to adapt to complex and dynamic ecological environment systems, resulting in evaluation results that fail to reflect the complex process by which decision-makers continuously adjust their actions based on environmental feedback.

Method used

The system constructs a multi-dimensional state-action space and embeds a reinforcement learning agent. A simulation coupling module, a state projection module, and a policy update module realize the closed-loop interaction between ecological environment evolution simulation and planning scheme evaluation. Sub-models of multiple process mechanisms, combined with a prioritized experience replay mechanism, support dynamic modeling and policy optimization.

Benefits of technology

It realizes a dynamic, closed-loop process for ecological environment evolution simulation and planning scheme evaluation, enhances the system's robust operation and adaptive management under complex disturbances, and improves its applicability and interpretability in multi-scale planning scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of intelligent decision-making technology for ecological environment, specifically disclosing a reinforcement learning-based system for simulating ecological environment evolution and evaluating planning schemes. This invention constructs a closed-loop technical architecture integrating mechanism-driven simulation, data-driven perception, and intelligent decision optimization. This architecture models planning schemes as intelligent agents capable of interacting with the environment and continuously learning, enabling the system to perform sequential decision optimization and dynamic performance evaluation of complex adaptive management strategies while maintaining the physical consistency of ecological processes and environmental quality evolution. The system tightly couples ecological dynamics models, environmental pollution diffusion models, and reinforcement learning intelligent agents, driving strategy optimization through a multi-objective reward function that includes both ecosystem service enhancement and environmental quality improvement, achieving dynamic intervention and adaptive evaluation within a unified spatiotemporal grid. This application supports interpretable and robust decision-making for complex ecological environment planning schemes.

Description

Technical Field

[0001] This invention belongs to the field of intelligent decision-making technology for ecological environment, and relates to an ecological environment evolution simulation and planning scheme evaluation system based on reinforcement learning.

Background Technology

[0002] Ecological and environmental evolution simulation and planning scheme evaluation, as core technologies supporting sustainable development strategies, have demonstrated increasingly important application value in recent years in fields such as territorial spatial planning, ecological restoration projects, climate change response, and biodiversity conservation. With the rapid development of remote sensing, geographic information systems, and big data technologies, the ability to acquire ecological and environmental data has significantly improved, providing a foundation for high-precision simulations. However, how to effectively integrate multi-source heterogeneous data, characterize complex nonlinear ecological processes, and achieve robust extrapolation of highly uncertain future scenarios remains a key bottleneck facing the current technological system.

[0003] Traditional ecological environment simulation methods mainly rely on process models based on physical mechanisms or statistical regression models, such as SWAT and InVEST. The former aims to reconstruct the operational logic of the ecological environment system from a mechanistic perspective by establishing a set of differential equations for sub-processes such as hydrological cycle, nutrient migration, and vegetation growth; the latter uses historical observation data to train regression relationships to predict ecological responses under specific input variables. These methods have achieved good results at specific scales and under steady-state assumptions, especially in closed or semi-closed systems with complete data, clear boundaries, and controllable disturbances, demonstrating strong explanatory power. Correspondingly, the evaluation of planning schemes usually adopts static or quasi-static methods such as multi-index weighted scoring, cost-benefit analysis, or scenario comparison, using simulation results as input and supplemented by expert experience for decision support. Within this framework, the model structure is relatively fixed, and the parameters largely depend on prior knowledge or historical fitting; its adaptability is mainly reflected in its ability to reproduce known patterns.

[0004] However, with the intensification of global change and the increase in the intensity of human activities, the ecological environment system is exhibiting unprecedented dynamic, non-stationary, and multi-scale coupling characteristics. Against this backdrop, on the one hand, the aforementioned traditional technical approaches reveal deep-seated structural limitations at the principle level. While mechanism-based models possess physical interpretability, their characterization of complex feedback loops often relies on numerous simplifying assumptions, and once the model structure is solidified, it struggles to adaptively respond to emerging system behavior patterns. On the other hand, statistical models are constrained by the linear extrapolation logic of history determining the future, making them highly susceptible to failure in the face of abrupt disturbances. Existing assessment systems generally separate simulation and assessment into two independent stages, lacking modeling of the dynamic adjustment capabilities of planning strategies themselves. This static assessment paradigm fails to capture the adaptive management essence of real-time learning and processing, resulting in assessment results that fail to reflect the complex process by which decision-makers continuously adjust their actions based on environmental feedback in the real world.

[0005] The existing technological system has created an irreconcilable tension between simulation accuracy and decision-making agility, turning high-precision models into tools for post-hoc analysis, while rapid assessment methods lose their ecological authenticity due to oversimplification. This contradiction is particularly prominent in cutting-edge scenarios such as climate change adaptation planning, dynamic adjustment of ecological red lines, and collaborative watershed governance. In these cases, the system state is constantly evolving, the intervention effect has a significant time lag, and multiple stakeholders have conflicting interests. Traditional static, open-loop assessment paradigms are no longer sufficient to provide operational scientific support.

[0006] Therefore, how to construct a technical framework that can deeply integrate ecological process mechanisms and intelligent decision-making mechanisms, so that the simulation system can not only reproduce the dynamic evolution of the ecological environment, but also autonomously explore, test and optimize planning strategies in the form of embedded intelligent agents, thereby achieving dynamic and closed-loop evaluation of complex adaptive management schemes while maintaining ecological authenticity, has become a key challenge and an urgent technical problem to be solved by those skilled in the art.

Summary of the Invention

[0007] In view of this, in order to solve the problems mentioned in the background technology, a reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system is proposed, which aims to resolve the structural contradiction between ecological process simulation and dynamic evaluation of planning strategies in the existing technology.

[0008] The objective of this invention can be achieved through the following technical solution: This invention provides a system for simulating and evaluating ecological environment evolution and planning schemes based on reinforcement learning, comprising:

[0009] The spatial construction module constructs a multi-dimensional state-action space that includes ecological environment state variables, human intervention action space, and environmental disturbance factors. The ecological environment state variables include vegetation coverage, biodiversity index, water environment quality parameters, air environment quality index, and soil pollution load.

[0010] The parameter setting module initializes the initial state parameters of the ecological environment evolution simulation engine based on historical remote sensing images, ground observation data, and socio-economic statistics, and sets multiple preset planning schemes as candidate action sequences.

[0011] The simulation coupling module selects a planned intervention action based on the current ecological environment status in each decision cycle and inputs it into the ecological environment evolution simulation engine.

[0012] The state projection module, based on the planned intervention actions and external environmental disturbance factors, projects the state of the ecological environment system at the next time step, generating ecological service value indicators, environmental quality improvement indicators, and risk indicators.

[0013] The strategy update module calculates real-time reward signals based on the aforementioned ecosystem service value indicators, environmental quality improvement indicators, and risk indicators, and updates the strategy network parameters using an experience replay mechanism.

[0014] The results output module outputs the cumulative reward value of each planning scheme and the corresponding ecological environment system state evolution trajectory throughout the entire cycle when the preset simulation termination conditions are met.
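For illustration only, the reward calculation and experience replay described in [0013] could be sketched in Python as follows. The weights, the sign convention (risk is subtracted), and the names `reward` and `ReplayBuffer` are assumptions made for this sketch; the source does not specify them. Sampling here is uniform; a prioritized variant would weight the draw by each transition's priority.

```python
import random
from collections import deque

# Hypothetical weights: the source gives no numeric values.
W_SERVICE, W_QUALITY, W_RISK = 0.4, 0.4, 0.2


def reward(service_value, quality_improvement, risk):
    """Combine the three indicators into one scalar reward.

    Assumption: benefits count positively and risk is penalised.
    """
    return (W_SERVICE * service_value
            + W_QUALITY * quality_improvement
            - W_RISK * risk)


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity=10_000):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, r, next_state, done):
        self.buffer.append((state, action, r, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)
```

A policy update step would draw a batch with `sample` and use it to adjust the strategy network parameters; that step depends on the chosen learning algorithm and is not shown.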

[0015] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. The present invention integrates ecological environment evolution simulation and planning scheme evaluation into a closed-loop interactive process by constructing a multi-dimensional state-action space and embedding a reinforcement learning agent. This enables the system to dynamically select intervention actions based on the current ecological environment state in each decision cycle and update the strategy in real time based on the deduction results. This breaks through the limitations of the traditional static and open-loop evaluation paradigm and realizes dynamic modeling of the adaptive management process.

[0016] 2. This invention integrates multiple process-mechanism-based sub-models into the state projection module. Through a unified grid, consistent time steps, and standardized data interfaces, it achieves ordered coupling and anomaly tolerance among the sub-models, preserving the ecological authenticity of the mechanistic model while enhancing the system's robust operation under complex disturbances. Simultaneously, the projection results are mapped into multi-dimensional reward signals, providing clear guidance for strategy optimization.

[0017] 3. This invention introduces a prioritized experience replay mechanism and a proximal policy optimization algorithm into the policy update module, combined with a multi-objective reward function design, enabling the agent to explore efficient planning paths while avoiding high-risk ecological degradation scenarios. Furthermore, a hierarchical policy architecture supports collaborative decision-making between macro-level objectives and micro-level actions, enhancing the system's applicability and interpretability in multi-scale planning scenarios.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a schematic diagram of the system module connections of the present invention.

[0020] Figure 2 is a flowchart of the workflow of the simulation coupling module.

[0021] Figure 3 is a flowchart of the sub-model integration and index calculation process of the state projection module.

Detailed Implementation

[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0023] As shown in Figure 1, this invention provides an ecological environment evolution simulation and planning scheme evaluation system based on reinforcement learning, including a spatial construction module, a parameter setting module, a simulation coupling module, a state projection module, a policy update module, and a result output module. The spatial construction module is connected to the parameter setting module, the parameter setting module is connected to the simulation coupling module, the simulation coupling module is connected to both the state projection module and the result output module, the state projection module is connected to the policy update module, and the policy update module is connected to the simulation coupling module.

[0024] The spatial construction module constructs a multi-dimensional state-action space that includes ecological environment state variables, human intervention action space, and environmental disturbance factors. The ecological environment state variables include vegetation coverage, biodiversity index, water environment quality parameters, air environment quality index, and soil pollution load.

[0025] It should be explained that, in order to comprehensively reflect the overall characteristics of the ecological environment, the ecological environment state variables are divided into two dimensions: ecosystem health and environmental media quality. The ecosystem health dimension includes vegetation cover and biodiversity index. Vegetation cover reflects the vegetation growth status and substrate stability of the target area, while the biodiversity index quantifies species richness and evenness within the target area. The environmental media quality dimension includes water environmental quality parameters, air environmental quality index, and soil pollution load. The water environmental quality parameters cover COD, ammonia nitrogen, dissolved oxygen, etc., used to characterize the degree of water pollution. The air environmental quality index includes PM2.5 and NOx concentrations, used to characterize air cleanliness. The soil pollution load includes heavy metal content and organic pollutant residues, used to characterize soil environmental safety. Through this construction method, the system not only focuses on the succession of biological communities but also on the evolution of the physicochemical properties of the water, air, and soil environments upon which humans depend for survival.
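As a minimal sketch of the two-dimension grouping described in [0025], the state variables could be held in a small Python data structure. The class names, field names, and units below are hypothetical; the source lists the quantities but does not prescribe a representation.

```python
from dataclasses import dataclass, asdict


@dataclass
class EcosystemHealth:
    vegetation_cover: float      # assumed to be a fraction in [0, 1]
    biodiversity_index: float    # species richness / evenness index


@dataclass
class MediaQuality:
    # Water environment quality parameters
    cod_mg_l: float
    ammonia_nitrogen_mg_l: float
    dissolved_oxygen_mg_l: float
    # Air environment quality index components
    pm25_ug_m3: float
    nox_ug_m3: float
    # Soil pollution load
    soil_heavy_metal_mg_kg: float
    soil_organic_residue_mg_kg: float


@dataclass
class EcoState:
    health: EcosystemHealth
    media: MediaQuality

    def to_vector(self):
        """Flatten both dimensions into one list, in field order."""
        return (list(asdict(self.health).values())
                + list(asdict(self.media).values()))
```

Flattening in a fixed field order gives the agent a stable input layout, so each position in the vector always refers to the same variable.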

[0026] Furthermore, when setting water environmental quality parameters, air environmental quality index, and soil pollution load, calibration is required based on actual application scenarios.

[0027] It should be explained that the aforementioned human intervention action space refers to the collection of various planning measures taken by humans to influence the evolution of the ecological environment, while environmental disturbance factors refer to external factors that are not directly controlled by human planning but will significantly affect the evolution of the ecological environment system, which are equivalent to natural and social disturbances in the operation of the system.

[0028] The parameter setting module initializes the initial state parameters of the ecological environment evolution simulation engine based on historical remote sensing images, ground observation data, and socio-economic statistics, and sets multiple preset planning schemes as candidate action sequences.

[0029] It should be explained that the purpose of setting multiple preset planning schemes as candidate action sequences is to build an executable planning measure library for the reinforcement learning agent, transforming various planning strategies for the ecological environment in reality into a list of action combinations that the system can recognize and select, so that the agent can select the optimal intervention method from the list according to the ecological environment status in subsequent simulations.

[0030] It should be noted that the ecological environment evolution simulation engine is constructed as follows. First, historical remote sensing images, ground observation data and socio-economic statistics are integrated to initialize the initial state parameters of the engine, ensuring that the parameters fit the actual ecological environment and human activity background of the target area. Then, typical preset planning schemes such as mine ecological restoration and wetland ecosystem restoration are set and transformed into logically ordered, multi-step candidate action sequences, and a library of planning measures that can be executed by the intelligent agent is built to provide core support for the engine's subsequent dynamic simulation and decision optimization.

[0031] It should be explained that the ecological environment evolution simulation engine receives structured planning intervention instructions from the simulation coupling module, combines multi-source basic data with external environmental disturbance factors, and accurately predicts the state of the ecological environment system at the next time step through a set of built-in sub-models such as hydrological and water quality coupling and pollutant migration and transformation. At the same time, it generates three core indicators: ecological service value, environmental quality improvement, and comprehensive risk, providing feedback for strategy updates to the reinforcement learning agent. Ultimately, it supports the dynamic simulation, effectiveness evaluation, and robustness verification of planning schemes, and assists in the scientific decision-making of complex ecological environment planning.

[0032] It should be noted that each pre-planned scheme is not a single isolated action, but a logical and chronological sequence of actions. This is because ecological and environmental planning often requires multi-step coordinated implementation, rather than a single action yielding results. For example, a mine ecosystem restoration scheme might include: ① Years 1-2: Delineating mine pollution control boundaries and soil sampling monitoring points; ② Years 3-5: Implementing soil heavy metal solidification and stabilization treatment and vegetation reconstruction; ③ Years 6-8: Conducting post-remediation soil quality tracking monitoring and vegetation community optimization. These three actions constitute a complete sequence in chronological order.
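The mine restoration example in [0032] can be written down as an ordered action sequence. The following Python sketch is illustrative only; `PlannedStep`, `MINE_RESTORATION`, and `step_for_year` are hypothetical names, and the year ranges are taken directly from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedStep:
    start_year: int   # inclusive
    end_year: int     # inclusive
    description: str


# Candidate action sequence for the mine ecosystem restoration scheme.
MINE_RESTORATION = (
    PlannedStep(1, 2, "Delineate pollution control boundaries and soil sampling points"),
    PlannedStep(3, 5, "Heavy metal solidification/stabilization and vegetation reconstruction"),
    PlannedStep(6, 8, "Post-remediation soil monitoring and vegetation community optimization"),
)


def step_for_year(sequence, year):
    """Return the step scheduled for the given year, or None if none applies."""
    for step in sequence:
        if step.start_year <= year <= step.end_year:
            return step
    return None
```

For example, `step_for_year(MINE_RESTORATION, 4)` returns the second step, since year 4 falls within years 3-5.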

[0033] The simulation coupling module selects a planned intervention action based on the current ecological environment status in each decision cycle and inputs it into the ecological environment evolution simulation engine.

[0034] In a preferred embodiment of the present invention, as shown in Figure 2, the simulation coupling module is configured to perform the following operations: normalize the current ecological environment state variables to generate a standardized state vector.

[0035] Preferably, the normalization method is Min-Max normalization. Furthermore, if the original value of a certain indicator exceeds a preset minimum or maximum value, truncation will be performed, directly mapping the excess portion to 0 or 1 to avoid interference from outliers on the overall standardization result.
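The truncated Min-Max normalization of [0035] might be sketched in Python as follows. The function names and the example bounds are assumptions; in practice the preset minimum and maximum of each indicator would come from calibration for the target area.

```python
def min_max_normalize(value, lower, upper):
    """Scale value into [0, 1]; values outside [lower, upper] are truncated."""
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    scaled = (value - lower) / (upper - lower)
    # Truncation: anything below the minimum maps to 0, above the maximum to 1.
    return min(1.0, max(0.0, scaled))


def normalize_state(raw, bounds):
    """Normalize every indicator listed in bounds, in the order of bounds.

    raw:    dict of indicator name -> raw value
    bounds: dict of indicator name -> (preset minimum, preset maximum)
    """
    return [min_max_normalize(raw[name], *bounds[name]) for name in bounds]
```

With hypothetical bounds of 0 to 100 for an indicator, a raw reading of 200 is mapped to 1 rather than 2, so a single outlier cannot stretch the scale of the whole state vector.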

[0036] The standardized state vector is input into the policy network of the reinforcement learning agent to obtain the probability distribution for each possible planned intervention action.

[0037] It should be noted that each selectable action is assigned a probability value between 0 and 1, and the probabilities of all actions sum to 1. This distribution reflects the suitability of each action in the current state: the higher the probability, the more strongly the policy network expects the action to bring about a good ecological effect.

[0038] Based on the probability distribution, a random sampling strategy is used to select a planned intervention action.

[0039] It's important to note that the core of using a random sampling strategy to select a planned intervention action is to select the action based on the action probability distribution output by the policy network through weighted random sampling. It doesn't directly select the action with the highest probability, but rather makes high-probability actions more likely to be selected while retaining opportunities to explore low-probability actions, thus balancing the use of known optimal actions with the exploration of potentially better ones. For example, if the probabilities of three actions are 0.7, 0.15, and 0.15, sampling will likely select the action with a 0.7 probability and try the other actions with a smaller probability, avoiding policy rigidity.
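A minimal Python sketch of the weighted random sampling described in [0039], using the 0.7 / 0.15 / 0.15 example. The function name `sample_action` and the action labels are hypothetical.

```python
import random


def sample_action(actions, probabilities, rng=random):
    """Draw one action, with each action chosen in proportion to its probability."""
    if len(actions) != len(probabilities):
        raise ValueError("actions and probabilities must have the same length")
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # random.choices performs weighted sampling with replacement; k=1 draws once.
    return rng.choices(actions, weights=probabilities, k=1)[0]
```

Over many draws the first action is selected roughly 70% of the time, while the other two are still tried occasionally, which is the exploration behaviour the paragraph describes. Always taking the highest-probability action instead would never revisit the alternatives.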

[0040] The feasibility of the selected planning intervention actions is verified, including resource constraint verification, compliance verification, and spatial implementation boundary verification.

[0041] It should be explained that the resource constraint verification checks whether the necessary resources, such as funds, land, and human resources, are sufficient for the execution of the action; the compliance verification confirms whether the action complies with ecological and environmental policies and national spatial planning requirements; and the spatial implementation boundary verification determines whether the area where the action is implemented is within the geographically permitted area for development and restoration. All three types of verifications must be passed simultaneously for the action to be considered ready for implementation. If any verification fails, a new action must be selected from the remaining available options.

[0042] If the verification passes, the planned intervention action will be encoded into a structured action instruction and input into the ecological environment evolution simulation engine.

[0043] If the validation fails, select another action from the remaining options until an action is available.

[0044] If all optional actions fail the validation, the preset empty action instruction will be output or the current simulation cycle will be terminated.

[0045] Preferably, the current ecological environment state variables are normalized to generate standardized state vectors. These standardized state vectors are then input into a policy network, which is a deep neural network structure. The output of this network is a probability distribution for each available planning intervention action. A random sampling strategy is used to select a candidate action from the probability distribution. The selected candidate action is then sent to a feasibility verification unit, which performs three verifications: resource constraint verification, compliance verification, and spatial implementation boundary verification.

[0046] It should be noted that the resource constraint check is used to determine whether the currently available resources are sufficient to support the implementation of the action; the compliance check is used to determine whether the action complies with current laws, regulations, and policies; and the spatial implementation boundary check is used to determine whether the action is located within the permitted geographical area. If all three checks pass, the candidate action is encoded as a structured action instruction. If any check fails, the remaining selectable actions are tried in descending order of probability until an action that satisfies all check conditions is obtained, and this action is encoded as a structured action instruction. The structured action instruction is then transmitted to the state projection module.
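The verification and fallback logic of [0040] to [0046] could be sketched as follows. This is an illustrative Python outline under stated assumptions: the three checks are passed in as plain callables, `NO_OP` stands in for the preset empty action instruction, and all names are hypothetical.

```python
NO_OP = "no_op"  # placeholder for the preset empty action instruction


def select_feasible_action(first_choice, probabilities, checks):
    """Return the first action that passes every check.

    first_choice:  the action drawn by random sampling
    probabilities: dict of action -> policy probability
    checks:        callables (action -> bool), e.g. resource, compliance,
                   and spatial boundary verification
    """
    def feasible(action):
        return all(check(action) for check in checks)

    if feasible(first_choice):
        return first_choice

    # Fall back to the remaining actions in descending order of probability.
    remaining = sorted(
        (a for a in probabilities if a != first_choice),
        key=probabilities.get,
        reverse=True,
    )
    for action in remaining:
        if feasible(action):
            return action

    # Every option failed verification: emit the empty action instruction.
    return NO_OP
```

Terminating the current simulation cycle, the alternative named in [0044], would replace the final `return NO_OP` with whatever stop signal the surrounding loop uses.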

[0047] In a preferred embodiment of the present invention, the reinforcement learning agent adopts a hierarchical policy architecture to support multi-scale planning and decision-making, specifically including: in the high-level policy module, defining macro-intervention target categories for a preset time, including ecological restoration, development control, and infrastructure layout adjustment.

[0048] In the low-level policy module, each macro-intervention target category is refined into several executable micro-actions, such as the coordinates of soil improvement areas for mine ecological restoration, the list of pollution sources to be shut down, and the coordinates of ecological corridor construction.

[0049] The high-level policy network selects the macro-intervention target category for the current decision-making cycle based on long-term cumulative reward signals.

[0050] The low-level policy network receives the category constraint selected by the high-level policy network and selects a specific action to implement from the subset of micro-actions corresponding to that category.

[0051] The two-level policy networks share some neural network layer parameters and are trained end-to-end using a joint loss function, where the gradient signal of the higher-level policy is backpropagated through the action selection probability of the lower-level policy.

[0052] It should be noted that the two-layer network shares the parameters of the underlying feature extraction neural network layer, avoiding redundant learning and ensuring consistent understanding of the ecological environment. An end-to-end training model is adopted, integrating the decision-making and execution errors of the two layers through a joint loss function, and adjusting parameters synchronously. The gradient signal of the high-level policy is backpropagated using the probability of action selection at the lower level, enabling the higher level to optimize the macro-objective based on the implementation effect at the lower level. This design avoids macro-objectives becoming unrealistic and micro-actions deviating from their intended direction, while also reducing training costs and ensuring the consistency and efficient collaboration of multi-scale decision-making.
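The two-stage selection in [0049] and [0050] can be illustrated with a short Python sketch. It covers only the category constraint on the low-level choice; the shared feature layers, joint loss, and gradient flow of [0051] are not modelled. The category names, micro-action labels, and the uniform fallback are assumptions for illustration.

```python
import random

# Hypothetical micro-action library, grouped by macro-intervention category.
MICRO_ACTIONS = {
    "ecological_restoration": ["soil_improvement_zone_A", "ecological_corridor_segment_1"],
    "development_control": ["shut_down_pollution_source_17"],
    "infrastructure_adjustment": ["upgrade_sewage_plant_2"],
}


def hierarchical_select(category_probs, micro_probs, rng=random):
    """Pick a macro category first, then a micro-action within that category."""
    categories = list(category_probs)
    category = rng.choices(
        categories, weights=[category_probs[c] for c in categories], k=1
    )[0]

    # The low-level choice is restricted to the selected category's subset.
    allowed = MICRO_ACTIONS[category]
    weights = [micro_probs.get(a, 0.0) for a in allowed]
    if sum(weights) <= 0:
        # Assumption: fall back to a uniform choice if no weight is available.
        weights = [1.0] * len(allowed)
    action = rng.choices(allowed, weights=weights, k=1)[0]
    return category, action
```

Because the low-level draw only ever sees the chosen category's subset, a micro-action can never contradict the macro-level target selected for that decision cycle.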

[0053] It should be noted that the planned interventions include ecological restoration actions, such as mine ecological restoration and wetland ecosystem restoration, as well as environmental governance actions, such as upgrading and transforming sewage treatment plants, shutting down industrial pollution sources, controlling agricultural non-point source pollution, and restricting air emissions.

[0054] The state projection module, based on the planned intervention actions and external environmental disturbance factors, projects the state of the ecological environment system at the next time step, generating ecological service value indicators, environmental quality improvement indicators, and risk indicators.

[0055] It should be noted that the external environmental disturbance factors include climate change scenario data and socio-economic driving factors.

[0056] It's important to explain that climate change scenario data focuses on uncertainties at the natural environment level, such as future trends in temperature rise and fall, changes in total precipitation and its distribution, and the frequency of extreme weather events. These changes directly affect core ecological and environmental conditions such as the hydrological cycle, vegetation growth, and soil moisture content. Socioeconomic drivers, on the other hand, focus on the indirect impacts of human activities, such as regional population growth, industrial restructuring, urbanization, and resource development intensity. These factors indirectly affect the evolution of the ecological and environmental system by altering land use patterns and pollution emission levels. Together, they constitute uncontrollable but essential external variables in ecological simulations, making the simulation results closer to the complexities of the real world.

[0057] In a preferred embodiment of the present invention, as shown in Figure 3, the state projection module is configured to perform the following operations: parse the received structured action instructions into specific ecological intervention operation parameters.

[0058] It should be noted that the ecological intervention operation parameters include the soil improvement area, pollution control intensity, and land use adjustment scope for mine ecological restoration.

[0059] The ecological intervention operation parameters and the current ecological environment state variables are input into a set of sub-models based on process mechanisms. The set of sub-models includes a hydrological and water quality coupling model, a pollutant migration and transformation model, a vegetation dynamics model, and a species distribution model.

[0060] It should be noted that the hydrological and water quality coupled model is used to simulate water circulation and to simulate the migration and attenuation processes of pollutants such as nitrogen and phosphorus in water bodies using convection-diffusion equations. The pollutant migration and transformation model is used to simulate the diffusion and deposition of atmospheric pollutants and the adsorption and desorption processes of pollutants in soil, predicting changes in the quality of environmental media. The vegetation dynamics model is used to simulate vegetation growth processes. The species distribution model is used to simulate changes in biological habitats.
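For reference, a generic convection-diffusion equation with first-order attenuation, of the kind [0060] refers to for pollutants in water bodies, can be written as below. This is the standard textbook form, not a formula taken from the source; all symbols are assumptions introduced here: $C$ is pollutant concentration, $\mathbf{u}$ the flow velocity field, $D$ the diffusion (dispersion) coefficient, $k$ the first-order attenuation rate, and $S$ a source or sink term.

```latex
\frac{\partial C}{\partial t} + \mathbf{u} \cdot \nabla C
  = \nabla \cdot \left( D \, \nabla C \right) - k\,C + S
```

The second term on the left describes transport by the flow (migration), the first term on the right describes spreading by diffusion, and $-kC$ describes attenuation.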

[0061] The sub-model set is invoked to perform ordered coupling operations to generate predicted values of ecological and environmental state variables for the next time step.

[0062] Carbon sequestration capacity, habitat connectivity, water and soil environmental capacity, and composite pollution index are calculated based on the predicted values.

[0063] It should be noted that carbon sequestration capacity measures the ability of an ecosystem to absorb and store carbon dioxide, primarily calculated based on predicted values such as vegetation cover and vegetation type. A higher value indicates stronger ecological carbon sequestration and climate change mitigation benefits. Habitat connectivity measures biodiversity conservation potential. Soil and water environmental capacity reflects the regional environmental media's ability to absorb and self-purify pollutants; a higher value indicates stronger environmental carrying capacity. The composite pollution index comprehensively weights the pollution concentrations of water, air, and soil; a lower value indicates better environmental quality.

[0064] The four indicators are mapped to preset ecosystem service value functions, environmental quality evaluation functions, and ecological environment risk early warning functions, respectively, and the corresponding ecosystem service value indicators, environmental quality evaluation indicators, and risk indicators are output.

[0065] It should be noted that the ecosystem service value function is used to calculate the comprehensive benefits on the ecological side, mainly by weighted summation of carbon sink capacity and habitat connectivity after normalization; the environmental quality evaluation function is used to assess the quality of environmental media, by coupling the composite pollution index with water and soil environmental capacity, and outputting a quantitative value characterizing the degree of environmental cleanliness. The weight factors corresponding to each parameter can be determined by the analytic hierarchy process or expert scoring method.
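
As a non-limiting illustration, the two evaluation functions described above can be sketched as follows. The function names, the default weights of 0.5, and the particular way the composite pollution index is coupled with the environmental capacity (a weighted sum in which pollution is inverted) are assumptions made for this sketch; the specification fixes only the weighted summation of normalized carbon sink capacity and habitat connectivity.

```python
def normalize(value, lower, upper):
    """Min-max scale a raw indicator into [0, 1], clipping values outside the range."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return min(1.0, max(0.0, (value - lower) / (upper - lower)))


def ecosystem_service_value(carbon_sink, connectivity, w_carbon=0.5, w_conn=0.5):
    """Weighted sum of normalized carbon sink capacity and habitat connectivity.

    Both inputs are expected to be already normalized to [0, 1].
    The weights are hypothetical; in practice they could come from the
    analytic hierarchy process or expert scoring.
    """
    return w_carbon * carbon_sink + w_conn * connectivity


def environmental_quality(capacity, pollution_index, w_cap=0.5, w_poll=0.5):
    """Assumed coupling: higher capacity and lower pollution both raise the score.

    Both inputs are expected to be already normalized to [0, 1].
    """
    return w_cap * capacity + w_poll * (1.0 - pollution_index)
```

With normalized inputs, both functions return a value in [0, 1] as long as each pair of weights sums to 1.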

[0066] The aforementioned ecosystem service value indicators, environmental quality evaluation indicators, and risk indicators are encapsulated into reward signal tuples for use by reinforcement learning agents.

[0067] In a preferred embodiment of the present invention, the specific construction method and operation logic of the sub-model set are as follows: the target area is divided into regular grid units with uniform spatial resolution, and each grid unit has a unique geographic coordinate identifier.

[0068] Set a consistent time step for all sub-models, and call each sub-model sequentially according to a preset execution order within each time step.

[0069] The portion of the output variables of the current sub-model that involves ecological and environmental state variables, after unit conversion and dimension calibration, is used as the input boundary conditions for the next sub-model.

[0070] It should be noted that each sub-model outputs multiple types of data, of which only content directly related to ecological environment state variables is selected, such as vegetation coverage and biomass output by the vegetation dynamics model, and soil moisture content and runoff output by the hydrological and water quality coupling model. Non-state data, such as intermediate process values and log information, are excluded from transmission so that only relevant data are passed between sub-models.

[0071] It should be noted that the ecological and environmental state variable data, after unit conversion and dimensional calibration, serve as the input boundary conditions for the next sub-model. This connection keeps the data logically and physically consistent across sub-models, so that the overall projection of ecological and environmental evolution better matches real conditions.

[0072] During data transfer between sub-models, resampling is performed on variables with mismatched spatial resolutions.
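
As a non-limiting illustration, the ordered coupling of sub-models within one time step can be sketched as follows. The names run_time_step and resample_nearest, the dictionary-based state representation, and the choice of nearest-neighbour resampling are assumptions made for this sketch; unit conversion and dimension calibration are omitted for brevity.

```python
from typing import Callable, Dict, List, Sequence, Set

State = Dict[str, float]
SubModel = Callable[[State], Dict[str, float]]


def run_time_step(state: State, sub_models: Sequence[SubModel], state_keys: Set[str]) -> State:
    """Call each sub-model in the preset order for a single time step.

    Only outputs whose key is an ecological state variable are passed on
    to the next sub-model; intermediate values and logs are dropped.
    """
    current = dict(state)
    for model in sub_models:
        outputs = model(current)
        for key, value in outputs.items():
            if key in state_keys:
                current[key] = value
    return current


def resample_nearest(grid: List[List[float]], new_rows: int, new_cols: int) -> List[List[float]]:
    """Nearest-neighbour resampling of a regular grid to a new resolution."""
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[r * rows // new_rows][c * cols // new_cols] for c in range(new_cols)]
        for r in range(new_rows)
    ]
```

In this sketch each sub-model is any callable that maps the current state to a dictionary of outputs, so a process-based model can be wrapped without changing the calling loop.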

[0073] The strategy update module calculates real-time reward signals based on the aforementioned ecosystem service value indicators, environmental quality improvement indicators, and risk indicators, and updates the strategy network parameters using an experience replay mechanism.

[0074] In a preferred embodiment of the present invention, the specific operation steps of the strategy update module are as follows: define a multi-objective reward function as a weighted combination of the ecosystem service value indicator, the environmental quality improvement indicator, and the risk indicator, using the ecosystem service value weight coefficient, the environmental quality improvement weight coefficient, and the ecological environment risk penalty coefficient, respectively.

[0075] For example, the multi-objective reward function is $R_t = w_1 E_t + w_2 Q_t - w_3 K_t$, where $E_t$ is the ecosystem service value indicator, $Q_t$ is the degree of environmental quality improvement, $K_t$ is the risk indicator, $R_t$ is the immediate reward value for the current decision-making cycle, and $w_1$, $w_2$, and $w_3$ are the preset weighting factors. The weighting factors are set subject to the core constraint that the ecosystem service value weight, the environmental quality improvement weight, and the risk penalty coefficient sum to 1 ($w_1 + w_2 + w_3 = 1$), in combination with the core objectives and actual needs of the regional ecological and environmental planning. If the current plan emphasizes pollution control, the environmental quality improvement weight is increased accordingly; if it emphasizes ecological conservation, the ecosystem service value weight is increased. The degree of environmental quality improvement is calculated from the water and soil environmental capacity and the composite pollution index, and reflects the cleanliness of water, air, and soil.

[0076] Substitute the ecological service value index, environmental quality improvement index, and risk index into the multi-objective reward function to calculate the immediate reward value for the current decision-making cycle.

[0077] It should be noted that the immediate reward value is a comprehensive reward and penalty value calculated in real time by the multi-objective reward function for a single planned intervention action. It uses the ecological service value and environmental quality improvement derived after the action is implemented as positive gains and the comprehensive risk as negative penalties. After weighted calculation, it directly reflects the immediate comprehensive ecological and environmental benefits of a single planned action and is the core data for the agent to determine the merits of the action and store it in the experience replay buffer.
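
As a non-limiting illustration, the immediate reward calculation can be sketched as follows, with ecosystem service value and environmental quality improvement as positive gains and risk as a negative penalty. The function name and the default weights (0.4, 0.4, 0.2) are assumptions made for this sketch; only the sum-to-1 constraint comes from the description.

```python
def immediate_reward(service_value, quality_improvement, risk, weights=(0.4, 0.4, 0.2)):
    """Weighted multi-objective reward for one decision cycle.

    weights = (service weight, quality weight, risk penalty coefficient);
    the three values must sum to 1.
    """
    w_service, w_quality, w_risk = weights
    if abs(w_service + w_quality + w_risk - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_service * service_value + w_quality * quality_improvement - w_risk * risk
```

A plan that emphasizes pollution control could, for instance, pass weights such as (0.3, 0.5, 0.2) to shift the balance toward environmental quality improvement.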

[0078] Store the current state, selected action, immediate reward value, and next state in the experience replay buffer.

[0079] It should be noted that the experience replay buffer is the core data storage module for reinforcement learning. It is used to sequentially store the complete interaction experience of each decision made by the agent, including core data such as the ecological environment state, planned actions, reward values, and subsequent ecological environment states. The stored experience data can be randomly sampled for training the policy network, breaking data correlations, avoiding training oscillations, and improving the model's convergence efficiency and stability. This is crucial for ensuring that the agent efficiently learns and optimizes planning strategies.

[0080] When the number of samples in the experience replay buffer reaches the preset batch threshold, a batch of experience samples is randomly selected from it.

[0081] The policy gradient and value loss are calculated using the experience samples, and the policy network parameters are updated based on the proximal policy optimization (PPO) algorithm.

[0082] In a preferred embodiment of the present invention, the experience replay buffer is managed as follows: each experience sample stored in the experience replay buffer is assigned an initial priority, which is determined based on a preset maximum priority value.

[0083] When extracting experience samples from the experience replay buffer to update the policy network, non-uniform sampling is performed, with each sample drawn with a probability equal to its priority divided by the sum of all priorities.

[0084] After each network update using a given experience sample, its priority is recalculated based on the temporal difference error calculated for that sample.

[0085] The updated priority is written back to the experience replay buffer for subsequent sampling probability adjustment.

[0086] When the experience replay buffer reaches its maximum capacity, high-value samples are retained based on priority, while the lowest-priority samples are discarded.
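
As a non-limiting illustration, the priority-based buffer management described above can be sketched as follows. The class name, the small epsilon added to the absolute temporal difference error, and sampling with replacement are assumptions made for this sketch; a production implementation would typically use a more efficient data structure than plain lists.

```python
import random


class PrioritizedReplayBuffer:
    """Minimal priority-proportional replay buffer (illustrative only)."""

    def __init__(self, capacity, max_priority=1.0, epsilon=1e-6, seed=None):
        self.capacity = capacity
        self.max_priority = max_priority  # preset initial priority for new samples
        self.epsilon = epsilon            # keeps every priority strictly positive
        self.samples = []
        self.priorities = []
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        """Store a transition; evict the lowest-priority sample when full."""
        if len(self.samples) >= self.capacity:
            lowest = min(range(len(self.priorities)), key=self.priorities.__getitem__)
            del self.samples[lowest]
            del self.priorities[lowest]
        self.samples.append((state, action, reward, next_state))
        self.priorities.append(self.max_priority)

    def sample(self, batch_size):
        """Draw indices with probability priority / sum of priorities (with replacement)."""
        indices = self.rng.choices(range(len(self.samples)), weights=self.priorities, k=batch_size)
        return indices, [self.samples[i] for i in indices]

    def update_priorities(self, indices, td_errors):
        """Write back new priorities derived from the temporal difference errors."""
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.epsilon
```

The indices returned by sample are only valid until the next call to add, since an eviction shifts list positions; priorities should therefore be written back before new transitions are stored.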

[0087] Preferably, the strategy update module uses the PPO algorithm, with an experience replay buffer capacity of 10,000 records and a batch threshold of 256. Sample priority is adjusted after each update round. The results output module terminates the simulation after 30 simulated years and generates evaluation reports for three schemes. These reports are written to the InfluxDB distributed time-series database, and the Elasticsearch multidimensional query interface supports planning departments in retrieving the optimal scheme based on criteria such as whether the area of improved mine soil exceeds 50% and whether the evaluation score is greater than 0.8.

[0088] The results output module outputs the cumulative reward value of each planning scheme and the corresponding ecological environment system state evolution trajectory throughout the entire cycle when the preset simulation termination conditions are met.

[0089] In a preferred embodiment of the present invention, the result output module is configured to perform the following operations: after the simulation terminates, extract all action sequences executed by the reinforcement learning agent throughout the entire simulation period to form a complete planned intervention path.

[0090] Based on the time step correspondence, the planned intervention path is associated with the corresponding state evolution trajectory to construct a state-action-reward triplet sequence.

[0091] The cumulative reward values in the triplet sequence are normalized to generate a standardized evaluation score.

[0092] Preferably, the cumulative reward value in the triplet sequence is subjected to Min-Max normalization to generate a standardized evaluation score.
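
As a non-limiting illustration, Min-Max normalization of cumulative rewards can be sketched as follows. The sketch assumes that normalization is performed across the candidate planning schemes being compared, which the description does not state explicitly; the function name and the handling of identical rewards are likewise assumptions.

```python
def min_max_scores(cumulative_rewards):
    """Map each scheme's cumulative reward to a score in [0, 1].

    cumulative_rewards: dict of scheme name -> cumulative reward value.
    """
    lowest = min(cumulative_rewards.values())
    highest = max(cumulative_rewards.values())
    if highest == lowest:
        # Degenerate case: all schemes performed identically.
        return {name: 0.0 for name in cumulative_rewards}
    span = highest - lowest
    return {name: (value - lowest) / span for name, value in cumulative_rewards.items()}
```

Under this assumption the best scheme in the compared set scores 1 and the worst scores 0, so scores are only comparable within the same set of schemes.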

[0093] Based on the changing trends of ecological and environmental state variables in the state evolution trajectory, identify whether there are nodes of irreversible ecological degradation.

[0094] Preferably, the slope of change of each ecological environment state variable within the monitoring time window is compared with a preset threshold to identify whether there are irreversible ecological degradation nodes.

[0095] It should be noted that if the slope of any ecological environment state variable exceeds its preset threshold within the monitoring time window, an irreversible ecological degradation node is identified. The preset threshold is determined from the ecological mechanisms and actual evolutionary patterns of the state variable: long-term monitoring data and degradation cases of similar ecological environment systems in the region are referenced to establish the critical value for maintaining ecological function, and the warning threshold is defined according to the abrupt-change characteristics of the variable's slope. The threshold is thus both ecologically grounded and practical for identifying irreversible ecological degradation nodes.
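
As a non-limiting illustration, the slope-based identification can be sketched as follows. The use of a least-squares slope over a sliding window and the comparison of the slope's absolute value against a single threshold are assumptions made for this sketch; in practice the direction of degradation and the threshold would be set per variable.

```python
def window_slope(values):
    """Least-squares slope of values against time steps 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two points are needed to compute a slope")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    denominator = sum((t - mean_t) ** 2 for t in range(n))
    return numerator / denominator


def find_degradation_nodes(trajectory, window, threshold):
    """Return (variable, window start, slope) for every window whose slope
    magnitude exceeds the threshold.

    trajectory: dict of state variable name -> list of values per time step.
    """
    nodes = []
    for name, series in trajectory.items():
        for start in range(len(series) - window + 1):
            slope = window_slope(series[start:start + window])
            if abs(slope) > threshold:
                nodes.append((name, start, slope))
    return nodes
```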

[0096] The standardized evaluation scores are integrated with the information on irreversible degradation nodes to generate a structured evaluation report.

[0097] The structured evaluation report is linked and stored with the original planning scheme to form a traceable planning scheme evaluation database.

[0098] In a preferred embodiment of the present invention, the formation of a traceable planning scheme evaluation database includes: assigning a unique identifier to each original planning scheme and establishing an index structure based on the unique identifier.

[0099] The standardized assessment scores, irreversible ecological degradation node information, state-action-reward triplet sequences, and simulated termination condition parameters in the structured assessment report are encapsulated into structured data objects.

[0100] The structured data object is bound to a unique identifier corresponding to the original planning scheme and written into a distributed time-series database.

[0101] During the writing process, the version number of the ecological environment evolution simulation engine, the version identification information of the reinforcement learning agent model, and the scenario configuration parameters of the external environmental disturbance factors on which the simulation execution depends are recorded synchronously.

[0102] Establish a multi-dimensional query interface to support joint retrieval of the evaluation database based on the target area, planning time span, type of intervention measure, or evaluation score threshold.
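
As a non-limiting illustration, the structured data object and the joint retrieval can be sketched as follows. An in-memory list stands in for the distributed time-series database, and the class name, field names, and query function are hypothetical; the sketch shows only how the listed criteria can be combined in one query.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvaluationRecord:
    """One evaluation report bound to the identifier of its planning scheme."""
    scheme_id: str
    target_area: str
    span_years: int
    intervention_type: str
    score: float
    degradation_nodes: List[str] = field(default_factory=list)
    engine_version: str = "unknown"   # simulation engine version recorded at write time
    agent_version: str = "unknown"    # agent model version recorded at write time


def query_records(records, target_area: Optional[str] = None,
                  max_span_years: Optional[int] = None,
                  intervention_type: Optional[str] = None,
                  min_score: Optional[float] = None):
    """Return records matching every criterion that is not None."""
    matches = []
    for rec in records:
        if target_area is not None and rec.target_area != target_area:
            continue
        if max_span_years is not None and rec.span_years > max_span_years:
            continue
        if intervention_type is not None and rec.intervention_type != intervention_type:
            continue
        if min_score is not None and rec.score < min_score:
            continue
        matches.append(rec)
    return matches
```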

[0103] In a preferred embodiment of the present invention, the system further includes a robustness assessment of the planning scheme under multiple scenario perturbations, which is performed as follows: loading a variety of preset future scenario combinations, each combination containing a set of climate model output data and socio-economic development path parameters.

[0104] It should be noted that the construction of future scenario combinations is centered on covering multi-dimensional uncertainties. First, the two key disturbance dimensions of climate and socio-economic factors are identified. Climate model output data are selected from different scenarios such as temperature, precipitation, and extreme weather. Socio-economic development path parameters cover variables such as population growth, industrial structure, and urbanization. Then, the two types of data are cross-combined according to reasonable logic to form multiple sets of differentiated scenario combinations. Each set corresponds to a specific combination of climate model output data and socio-economic development path parameters.

[0105] For each scenario combination, the pre-trained reinforcement learning agent policy network is reused to drive the ecological environment evolution simulation engine to run a complete simulation cycle under fixed policy conditions.

[0106] Record the ecological environment system state evolution trajectory and cumulative reward value corresponding to the same planned intervention path under each scenario.

[0107] Calculate the mean, standard deviation, and minimum of the cumulative reward value under each scenario, which serve as indicators of the expected benefits, volatility, and worst-case performance of the planning scheme under uncertain environments.

[0108] Based on the expected benefit, volatility, and worst-case performance indicators, a robust scoring function is constructed to rank all candidate planning schemes, and the set of preferred schemes ranking within a preset top proportion is output.
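
As a non-limiting illustration, the robustness indicators and the ranking can be sketched as follows. The linear form of the scoring function, its two coefficients, and the use of the population standard deviation are assumptions made for this sketch; the description specifies only that the mean, standard deviation, and minimum are used.

```python
import math
from statistics import mean, pstdev


def robustness_indicators(rewards):
    """Expected benefit, volatility, and worst case over all scenarios."""
    return {
        "expected": mean(rewards),
        "volatility": pstdev(rewards),
        "worst_case": min(rewards),
    }


def robust_score(indicators, volatility_penalty=0.5, worst_case_weight=0.5):
    """Hypothetical scoring: reward the mean and worst case, penalize volatility."""
    return (indicators["expected"]
            - volatility_penalty * indicators["volatility"]
            + worst_case_weight * indicators["worst_case"])


def select_preferred(scheme_rewards, top_fraction=0.5):
    """Rank schemes by robust score and keep the preset top proportion.

    scheme_rewards: dict of scheme name -> cumulative rewards, one per scenario.
    """
    scores = {name: robust_score(robustness_indicators(rewards))
              for name, rewards in scheme_rewards.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[:keep]
```

With this scoring form, two schemes with the same mean cumulative reward are separated by how much their performance varies across scenarios and by their worst outcome.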

[0109] All content not described in detail in the specification belongs to the prior art known to those skilled in the art, and the parameter configuration of each model and algorithm is not specifically limited and can be set by conventional means. Software dependency libraries and hardware platforms not mentioned in this technical solution are not shown in the figure because they are general technologies, and will not be described here.

[0110] The above content is merely an example and illustration of the concept of the present invention. Those skilled in the art can make various modifications or additions to the specific embodiments described, or use similar methods to replace them, as long as they do not deviate from the concept of the invention or exceed the scope defined by the present invention, and all such modifications and additions should fall within the protection scope of the present invention.

Claims

1. A system for simulating ecological environment evolution and evaluating planning schemes based on reinforcement learning, characterized in that it comprises: a spatial construction module, which constructs a multi-dimensional state-action space that includes ecological environment state variables, a human intervention action space, and environmental disturbance factors, wherein the ecological environment state variables include vegetation coverage, biodiversity index, water environment quality parameters, air environment quality index, and soil pollution load; a parameter setting module, which initializes the initial state parameters of the ecological environment evolution simulation engine based on historical remote sensing images, ground observation data, and socio-economic statistics, and sets multiple preset planning schemes as candidate action sequences; a simulation coupling module, which selects a planned intervention action based on the current ecological environment state in each decision cycle and inputs it into the ecological environment evolution simulation engine; a state deduction module, which, based on the planned intervention actions and external environmental disturbance factors, deduces the state of the ecological environment system at the next time step and generates ecosystem service value indicators, environmental quality improvement indicators, and risk indicators; a strategy update module, which calculates real-time reward signals based on the ecosystem service value indicators, environmental quality improvement indicators, and risk indicators, and updates the strategy network parameters using an experience replay mechanism; and a results output module, which outputs the cumulative reward value of each planning scheme and the corresponding ecological environment system state evolution trajectory throughout the entire cycle when the preset simulation termination conditions are met.

2. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 1, characterized in that: The simulation coupling module is configured to perform the following operations: The current ecological environment state variables are normalized to generate a standardized state vector; The standardized state vector is input into the policy network of the reinforcement learning agent to obtain the probability distribution over the possible planned intervention actions; Based on the probability distribution, a random sampling strategy is used to select a planned intervention action; Feasibility verification is performed on the selected planned intervention action, including resource constraint verification, compliance verification, and spatial implementation boundary verification; If the verification passes, the planned intervention action is encoded into a structured action instruction and input into the ecological environment evolution simulation engine; If the verification fails, another action is selected from the remaining available actions until an action passes the verification; If all available actions fail the verification, the preset empty action instruction is output or the current simulation cycle is terminated.

3. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 1, characterized in that: The state deduction module is configured to perform the following operations: The received structured action instructions are parsed into specific ecological intervention operation parameters; The ecological intervention operation parameters and the current ecological environment state variables are input together into a set of sub-models based on process mechanisms, the set of sub-models including a hydrological and water quality coupling model, a pollutant migration and transformation model, a vegetation dynamics model, and a species distribution model; The sub-model set is invoked to perform ordered coupling operations to generate predicted values of ecological and environmental state variables for the next time step; Based on the predicted values, carbon sequestration capacity, habitat connectivity, water and soil environmental capacity, and composite pollution index are calculated; The four indicators are mapped to the preset ecosystem service value function, environmental quality evaluation function, and ecological environment risk early warning function respectively, and the corresponding ecosystem service value indicators, environmental quality evaluation indicators, and risk indicators are output; The aforementioned ecosystem service value indicators, environmental quality evaluation indicators, and risk indicators are encapsulated into reward signal tuples for use by reinforcement learning agents.

4. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 3, characterized in that: The specific construction method and operation logic of the sub-model set are as follows: The target area is divided into regular grid cells with uniform spatial resolution, and each grid cell has a unique geographic coordinate identifier; Set a consistent time step for all sub-models, and call each sub-model sequentially according to a preset execution order within each time step; The portion of the output variables of the current sub-model that involves ecological and environmental state variables, after unit conversion and dimension calibration, is used as the input boundary condition for the next sub-model. During data transfer between sub-models, resampling is performed on variables with mismatched spatial resolutions.

5. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 1, characterized in that: The specific operation steps of the strategy update module are as follows: A multi-objective reward function is defined, which is composed of a weighted combination of the ecosystem service value weight coefficient, the environmental quality improvement weight coefficient, and the ecological environment risk penalty coefficient; Substitute the ecosystem service value index, environmental quality improvement index, and risk index into the multi-objective reward function to calculate the immediate reward value for the current decision-making cycle; Store the current state, selected action, immediate reward value, and next state in the experience replay buffer; When the number of samples in the experience replay buffer reaches the preset batch threshold, a batch of experience samples is randomly selected from it; The policy gradient and value loss are calculated using the experience samples, and the policy network parameters are updated based on the proximal policy optimization (PPO) algorithm.

6. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 5, characterized in that: The experience replay buffer is managed as follows: An initial priority is assigned to each experience sample stored in the experience replay buffer, and this priority is determined based on a preset maximum priority value; When extracting experience samples from the experience replay buffer to update the policy network, non-uniform sampling is performed according to the proportion of each sample's priority to the sum of all priorities; After each network update using a given experience sample, its priority is recalculated based on the temporal difference error calculated for that sample; The updated priority is written back to the experience replay buffer for subsequent sampling probability adjustment; When the experience replay buffer reaches its maximum capacity, samples are sorted from high to low based on the absolute value of the calculated temporal difference error, samples with error values in the top preset percentile are retained as high-value samples, and samples with the lowest priority are discarded.

7. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 1, characterized in that: The result output module is configured to perform the following operations: After the simulation ends, extract all action sequences executed by the reinforcement learning agent throughout the entire simulation cycle to form a complete planned intervention path; Based on the time step correspondence, the planned intervention path is associated with the corresponding state evolution trajectory to construct a state-action-reward triplet sequence; The cumulative reward values in the triplet sequence are normalized to generate a standardized evaluation score; Based on the changing trends of ecological and environmental state variables in the state evolution trajectory, identify whether there are irreversible ecological degradation nodes; The standardized evaluation scores are integrated with the information on irreversible degradation nodes to generate a structured evaluation report; The structured evaluation report is linked and stored with the original planning scheme to form a traceable planning scheme evaluation database.

8. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 7, characterized in that: The formation of a traceable planning scheme evaluation database includes: Assign a unique identifier to each original planning scheme and build an index structure based on the unique identifier; The standardized assessment scores, irreversible ecological degradation node information, state-action-reward triplet sequences, and simulation termination condition parameters in the structured assessment report are encapsulated into structured data objects. The structured data object is bound to a unique identifier corresponding to the original planning scheme and written into a distributed time-series database; During the writing process, the version number of the ecological environment evolution simulation engine, the version identification information of the reinforcement learning agent model, and the scenario configuration parameters of the external environmental disturbance factors on which the simulation execution depends are recorded synchronously. Establish a multi-dimensional query interface to support joint retrieval of the evaluation database based on the target area, planning time span, type of intervention measure, or evaluation score threshold.

9. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 1, characterized in that: The system also includes a robustness assessment of planning schemes under multiple perturbations, as follows: Load multiple preset combinations of future scenarios, each combination containing a set of climate model output data and socio-economic development path parameters; For each scenario combination, the pre-trained reinforcement learning agent policy network is reused to drive the ecological environment evolution simulation engine to run a complete simulation cycle under fixed policy conditions. Record the ecological environment system state evolution trajectory and cumulative reward value corresponding to the same planned intervention path under each scenario; Calculate the mean, standard deviation, and minimum of the cumulative reward value under each scenario, and use them as indicators of the expected benefits, volatility, and worst-case performance of the planning scheme under uncertain environments; Based on the expected benefits, volatility, and worst-case performance indicators, a robust scoring function is constructed to rank all candidate planning schemes and output the set of preferred schemes that rank in the top preset proportion in the scoring function ranking.

10. The reinforcement learning-based ecological environment evolution simulation and planning scheme evaluation system according to claim 1, characterized in that: The reinforcement learning agent employs a hierarchical policy architecture to support multi-scale planning and decision-making, specifically including: In the high-level strategy module, define the categories of macro-intervention targets with preset timeframes, including ecological restoration, development control, and infrastructure layout adjustment; In the underlying strategy module, each macro-level intervention target category is broken down into several executable micro-level actions; The high-level strategy network selects the macro-intervention target category for the current decision-making cycle based on long-term cumulative reward signals; The underlying policy network receives the category constraints selected by the higher layer and selects specific actions to implement from the subset of micro-actions corresponding to that category; The two-level policy networks share some neural network layer parameters and are trained end-to-end using a joint loss function, where the gradient signal of the higher-level policy is backpropagated through the action selection probability of the lower-level policy.