An environmental pollution-health effect whole-chain dynamic correlation analysis method and system
By employing isotope tracing, multi-media coupling transmission, and deep learning-driven causal inference methods, a complete causal chain is constructed, solving the problem of incomplete causal chains in environmental health research. This enables accurate pollution source analysis, individualized exposure assessment, and reliable identification of causal relationships for health effects. It also supports efficient fusion and real-time analysis of multi-source heterogeneous data and provides regionally specific and multi-scenario environmental health risk management strategies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUBEI PROVINCIAL ACADEMY OF ECO-ENVIRONMENTAL SCIENCES(PROVINCIAL ECOLOGICAL ENVIRONMENT ENGINEERING ASSESSMENT CENTER)
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies in environmental health research suffer from several problems: pollution sources are difficult to identify, environmental media transport is insufficiently simulated, human exposure assessment is oversimplified, health effect correlation analysis is complex, multi-source heterogeneous data are hard to fuse, and models adapt poorly. These issues leave causal chains incomplete and make it difficult to provide an accurate basis for pollution source control and health risk prevention.
By combining isotope tracing with time-stratified source contribution calculation and a multi-receptor model, a multi-media coupled transport model is established. Human exposure is assessed based on individual spatiotemporal activity trajectories. A deep learning-driven causal inference method is used to construct a full-chain causal chain through a heterogeneous graph neural network, enabling efficient fusion and correlation analysis of multi-source heterogeneous data.
It enables accurate identification and quantification of the complete causal chain from pollution source to health effect, improves the accuracy of pollution source analysis, the precision of environmental media migration simulation, the accuracy of individualized exposure assessment, and the reliability of the causal relationship of health effect. It supports the efficient fusion and real-time analysis of multi-source heterogeneous data and provides regionally specific and multi-scenario environmental health risk management strategies.
Figure CN122245813A_ABST
Description
Technical Field
[0001] This invention relates to the field of environmental health technology, specifically to a method and system for dynamic correlation analysis of the entire chain of environmental pollution and health effects.
Background Technology
[0002] In environmental health research, constructing causal chains that link pollution sources, environmental media, human exposure, and health effects is central to pollution control and health risk prevention. Current research still faces the following technical challenges:
(i) Limited research scope. Traditional research often focuses on a single link or a few links in the chain, such as pollution source apportionment, exposure assessment, or health effect analysis, and lacks a systematic analysis of the entire chain from source emission to final health impact. For example, studies of air pollution and respiratory disease usually examine only the statistical association between air pollutant concentration and disease incidence, ignoring pollutant sources, environmental migration processes, and internal human doses. The resulting causal chain is incomplete and offers little scientific basis for precise pollution source control.
(ii) Challenges in pollution source apportionment. Existing apportionment methods struggle to distinguish the relative contributions of existing pollution sources and historically accumulated pollution. In real environments, pollutants continuously emitted by existing sources coexist with persistent pollutants left by historical sources, with complex superposition, antagonistic, or synergistic effects. Traditional chemical composition methods (such as PMF and CMB) can identify the source type of a pollutant but cannot separate the contributions of the same source type at different times; radionuclide dating methods (such as 210Pb and 137Cs) can establish time scales but are not effectively linked to pollution source characteristics; and existing isotope tracing techniques are mainly used to identify single sources and have not been integrated with time-stratified models. Because substances with similar chemical properties but different emission periods are hard to tell apart, pollution control measures are poorly targeted.
(iii) Deficiencies in environmental media transport simulation. Transport simulation adapts poorly to different spatiotemporal scales. Traditional models are typically designed for a single environmental medium (such as the atmosphere or a water body) with fixed parameters, so they struggle to simulate the dynamic migration of pollutants between media, and their accuracy drops significantly under complex terrain, extreme weather, or seasonal variation. In addition, the migration and transformation behavior of different pollutants varies widely, so a unified simulation method has difficulty accounting for the characteristics of multiple pollutants.
(iv) Inadequate human exposure assessment. Exposure assessment generally uses simplified fixed-parameter models that do not consider individual differences or spatiotemporal dynamics. Existing assessments are mostly based on population averages, ignore individual differences in activity patterns, physiological characteristics, and susceptibility, and mostly use static methods that cannot reflect the temporal variability and spatial heterogeneity of exposure parameters. As a result, assessed exposure deviates significantly from actual conditions.
(v) Challenges in health effect association analysis. The analysis is subject to complex confounding. Many confounding factors (such as lifestyle, genetic factors, and meteorological conditions) lie between environmental exposure and health effects. Existing methods usually adjust for them with traditional statistical models, which struggle with high-dimensional, nonlinear, and time-varying confounding and therefore bias causal identification. Effective quantitative models for the time lag and cumulative dose effects of health outcomes are also lacking.
(vi) Difficulty in fusing multi-source heterogeneous data. Environmental monitoring data, pollution source data, population exposure data, and health effect data typically come from different systems and differ in spatiotemporal scale, accuracy, and format, which makes efficient integration and correlation analysis difficult with traditional methods. Data sparsity, especially in health effect data, is a further obstacle to building reliable correlation models.
(vii) Insufficient model adaptability. Most existing models are general-purpose designs that lack regional adaptability and targeted optimization for specific pollution scenarios. Regions differ significantly in geography and climate, pollutant emission characteristics, and human activity patterns, so general models cannot accurately reflect region-specific environmental health correlations, which limits their accuracy and practical value.
(viii) Unresolved key technical problems. Existing technologies fall short on four key problems. First, integrating isotope tracing with time-stratified models: there is no effective mathematical framework that combines isotope fingerprint information with the time decay law of pollutants. Second, numerical stability of multi-media coupled transport: the physical and chemical properties of environmental media differ greatly, so coupled solutions are prone to numerical oscillation and non-convergence. Third, causal identification under high-dimensional confounding: environmental health data contain many confounding factors with nonlinear relationships that traditional causal inference methods cannot control effectively. Fourth, spatiotemporal alignment in heterogeneous data fusion: environmental monitoring, exposure, and health data from different sources differ significantly in spatiotemporal scale, making effective integration difficult.
[0003] In summary, there is an urgent need to develop a system capable of dynamic correlation analysis across the entire chain of pollution sources, environmental media, human exposure, and health effects. This system should integrate multi-source heterogeneous data and employ advanced artificial intelligence technology to construct accurate, dynamic, and interpretable correlation models, providing a scientific basis for environmental health risk management.
Summary of the Invention
[0004] The purpose of this invention is to provide a method and system for dynamic correlation analysis of the entire chain of environmental pollution and health effects, so as to solve the problems mentioned in the background art.
[0005] To achieve the above objectives, the present invention provides the following technical solution: a method for dynamic correlation analysis of the entire chain of environmental pollution and health effects, including:
Step S1: Multidimensional analysis of pollution sources. Isotope tracing, time-stratified source contribution calculation, and a multivariate receptor model are fused; the relative contributions of existing pollution sources and historically accumulated pollution are distinguished using a multivariate isotope mixing model and a time decay function.
Step S2: Dynamic transport of pollutants in environmental media. A multi-media coupled transport model covering atmosphere, water, soil, and organisms is established, and a multi-scale spatiotemporal adaptive transport algorithm simulates the migration and transformation of pollutants across these media.
Step S3: Accurate assessment of human exposure. Individualized exposure doses are calculated from individual spatiotemporal activity trajectories and a microenvironment exposure model, combined with individualized physiological parameters.
Step S4: Causal analysis of health effects. A deep learning-driven causal inference method establishes the causal relationship between exposure and health effects through a representation learning framework and a multi-task objective function.
Step S5: Full-chain correlation modeling. Heterogeneous graph neural networks construct a complete causal chain network from pollution source to health effect, and a path effect quantification algorithm calculates the contribution of each causal path.
[0006] Furthermore, in step S1, the isotope tracing employs a multi-isotope mixing model.
[0007] The terms of the mixing model are: the vector of isotope ratios in the sample; the contribution ratio of pollution source i, which satisfies the model constraints; the isotopic signature vector of pollution source i; and an error term that includes measurement error and model error. n is the total number of pollution sources. The time-stratified source contribution is calculated with a separate model.
[0008] The terms of the time-stratified model are: the total contribution at time t; the source strength as a function of time; the attenuation constant; and the time interval. The fusion of multiple receptor models comprises two steps: calculating the fusion weights and fusing the model results. The receptor models are run in parallel, and their individual results are then combined by weighted fusion.
[0009] The fusion weight of model m is calculated from the performance score of model m and the consistency score of model m.
[0010] The model fusion formula combines the source contribution result of each model m using the corresponding weighting coefficient; M is the total number of receptor models.
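The three S1 calculations can be illustrated with a minimal Python sketch. The formula images are not available in this text, so every functional form below is an assumption chosen to match the term definitions: a one-isotope, two-source linear mixing model, exponential attenuation of past emissions, and fusion weights proportional to the product of the performance and consistency scores. All function names are hypothetical.

```python
import math

def two_source_fraction(delta_sample, delta_a, delta_b):
    """Fraction of source A in a one-isotope, two-source linear mixing model (assumed form)."""
    f = (delta_sample - delta_b) / (delta_a - delta_b)
    return min(1.0, max(0.0, f))  # clamp to the feasible range [0, 1]

def decayed_contribution(emissions, decay_const, t):
    """Sum past emissions (time, strength), each attenuated by exp(-decay_const * age)."""
    return sum(s * math.exp(-decay_const * (t - ti)) for ti, s in emissions if ti <= t)

def fusion_weights(performance, consistency):
    """Assumed rule: weight proportional to performance * consistency, normalized to sum to 1."""
    raw = [p * c for p, c in zip(performance, consistency)]
    total = sum(raw)
    return [r / total for r in raw]

def fuse(results, weights):
    """Weighted average of the per-model source contribution vectors."""
    n_sources = len(results[0])
    return [sum(w * r[i] for w, r in zip(weights, results)) for i in range(n_sources)]

# Illustrative values only, not measurements from the patent.
half_life = 5.0
lam = math.log(2) / half_life
print(two_source_fraction(-25.0, -30.0, -20.0))
print(decayed_contribution([(0.0, 10.0), (5.0, 10.0)], lam, 10.0))
print(fuse([[0.6, 0.4], [0.2, 0.8]], fusion_weights([0.8, 0.8], [0.5, 0.5])))
```

With more isotopes or more sources, the mixing step becomes a constrained least-squares or Bayesian (MCMC) problem rather than a closed-form ratio.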
[0011] Furthermore, in step S2, the multi-media coupled transport model includes internal transport equations for four environmental media (atmosphere, water, soil, and organisms) and calculates the exchange flux between media.
[0012] The terms of the exchange flux formula are the transport rate constant, the fugacity correction factors of the two media, and the pollutant concentrations in the corresponding media. The multi-media coupled solution employs an operator-splitting algorithm.
[0013] The operators in the splitting scheme are the transport, exchange, and reaction operators, respectively. The multi-scale spatiotemporal adaptive transport algorithm proceeds as follows. Step (1): mesh refinement criterion, which depends on the pollutant concentration, the terrain elevation, and the pollution source density. Step (2): refinement control conditions.
[0014] Step (3): time step adaptation.
[0015] The CFL time step is determined by the ratio of the mesh size to the velocity magnitude; the diffusion time step is related to the square of the mesh size and the diffusion coefficient; and the chemical reaction time step is determined by the reciprocal of the reaction rate constant.
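The time step adaptation can be sketched in Python as follows. The three candidate steps follow the definitions in the text; taking their minimum, the safety factor, and the factor 2 in the diffusion limit are assumptions typical of explicit schemes, not details confirmed by the patent. The default clamp of 1 second to 1 hour matches the adaptive range given later in the algorithm-layer description.

```python
def adaptive_time_step(mesh_size, velocity, diffusivity, rate_const,
                       dt_min=1.0, dt_max=3600.0, safety=0.5):
    """Return a stable time step in seconds (hypothetical helper).

    mesh_size   : local cell size in metres
    velocity    : advective velocity magnitude in m/s
    diffusivity : diffusion coefficient in m^2/s
    rate_const  : first-order reaction rate constant in 1/s
    """
    inf = float("inf")
    dt_cfl = mesh_size / abs(velocity) if velocity else inf               # advection (CFL) limit
    dt_diff = mesh_size ** 2 / (2.0 * diffusivity) if diffusivity else inf  # diffusion limit
    dt_chem = 1.0 / rate_const if rate_const else inf                     # reaction limit
    dt = safety * min(dt_cfl, dt_diff, dt_chem)                           # assumed combination rule
    return min(dt_max, max(dt_min, dt))                                   # clamp to allowed range

# 100 m cells, 5 m/s wind, 10 m^2/s diffusivity, 0.001 1/s reaction rate
print(adaptive_time_step(100.0, 5.0, 10.0, 0.001))
```

Because the step shrinks with the mesh size, refined cells near sources or steep gradients automatically receive shorter steps than the coarse base grid.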
[0016] Furthermore, the microenvironment exposure model in step S3 calculates the total individual exposure over the individual's spatiotemporal trajectory points.
[0017] For the k-th spatiotemporal point, the terms of the formula are: the pollutant concentration at that point; the duration of stay; the respiratory rate corresponding to the activity intensity; and the absorption correction factor of the microenvironment. n is the total number of spatiotemporal trajectory points.
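A minimal Python sketch of this exposure calculation follows. It assumes the total exposure is the sum, over trajectory points, of concentration times duration times respiratory rate times absorption factor, which is consistent with the listed terms but not confirmed by the missing formula. The allometric helper, its exponent of 0.75, and the 70 kg reference body weight are illustrative assumptions, as are all names and values.

```python
def allometric_respiratory_rate(ir_ref, body_weight, bw_ref=70.0,
                                exponent=0.75, activity_factor=1.0):
    """Scale a reference respiratory rate by body weight (assumed power-law form)."""
    return ir_ref * (body_weight / bw_ref) ** exponent * activity_factor

def total_exposure(points):
    """Sum exposure over trajectory points.

    Each point is (concentration, duration, respiratory_rate, absorption_factor),
    e.g. ug/m^3, h, m^3/h, dimensionless, giving a dose in ug.
    """
    return sum(c * t * ir * af for c, t, ir, af in points)

# Illustrative trajectory: 2 h outdoors at rest, then 1 h indoors while active.
trajectory = [
    (10.0, 2.0, 0.5, 1.0),
    (20.0, 1.0, 1.0, 0.5),
]
print(total_exposure(trajectory))
```

In practice the concentration at each point would come from the transport model output sampled along the individual's GPS or time-activity record.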
[0018] Furthermore, the individualized respiratory rate (IR) is calculated based on allometric growth theory.
[0019] The terms of the IR formula are the reference respiratory rate, the individual body weight BW, the reference body weight, and the activity adjustment factor AF.
Further, in step S4, the deep learning-driven causal inference employs a representation learning framework whose multi-task objective function has three components:
[0020] The prediction loss, where Y is the true value of the health effect and the predicted value is obtained from the input by combining the feature mapping Φ with the treatment variable T.
[0021] The balance loss, where MMD is the maximum mean discrepancy and Φ(X_1) and Φ(X_0) are the feature distributions of the treatment group and the control group, respectively, after the mapping Φ.
[0022] The propensity loss, defined on the treatment indicator variable (a value of 1 indicates that the exposure treatment was received, and a value of 0 indicates that it was not) and the predicted value of the propensity score.
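The structure of this three-part objective can be sketched in plain Python. The specific loss choices are assumptions: mean squared error for prediction, a linear-kernel MMD (squared distance between group means of the representation) for balance, and binary cross-entropy for the propensity term. The weights `alpha` and `beta` are hypothetical hyperparameters that do not appear in the source.

```python
import math

def mse(y_true, y_pred):
    """Prediction loss: mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def linear_mmd(rep_treated, rep_control):
    """Balance loss: squared distance between group means (MMD with a linear kernel)."""
    dim = len(rep_treated[0])
    mean_t = [sum(r[j] for r in rep_treated) / len(rep_treated) for j in range(dim)]
    mean_c = [sum(r[j] for r in rep_control) / len(rep_control) for j in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_t, mean_c))

def bce(t, p, eps=1e-12):
    """Propensity loss: binary cross-entropy between treatment flags and predicted scores."""
    total = 0.0
    for ti, pi in zip(t, p):
        pi = min(1.0 - eps, max(eps, pi))
        total -= ti * math.log(pi) + (1 - ti) * math.log(1.0 - pi)
    return total / len(t)

def total_loss(y, y_hat, phi, t, e_hat, alpha=1.0, beta=1.0):
    """Weighted sum of prediction, balance, and propensity losses (assumed combination)."""
    treated = [r for r, ti in zip(phi, t) if ti == 1]
    control = [r for r, ti in zip(phi, t) if ti == 0]
    return mse(y, y_hat) + alpha * linear_mmd(treated, control) + beta * bce(t, e_hat)

print(total_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 0.0]], [1, 0], [0.5, 0.5]))
```

The balance term pushes the learned representations of exposed and unexposed groups toward the same distribution, which is what reduces confounding bias in the outcome prediction.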
[0023] Furthermore, in step S5, the heterogeneous graph neural network uses an attention mechanism for node updates. The representation of node v at layer l+1 is obtained by concatenating the outputs of K attention heads. Each head applies an activation function to a sum over all nodes u in the neighborhood of node v under relation r (the set of neighbors connected to v through relation r); each term of the sum is the attention weight multiplied by the weight matrix of relation r at the k-th attention head and the feature vector of node u at layer l. K is the number of attention heads.
[0024] The attention weight is calculated with a separate formula.
[0025] Its terms include the attention parameter vector of relation r in the k-th attention head and a vector concatenation operation.
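As a rough illustration of how such attention weights are commonly computed, the Python sketch below follows the standard graph attention pattern for one relation and one head: score each neighbor with the attention vector applied to the concatenated transformed features, pass the score through LeakyReLU, and normalize with a softmax. The LeakyReLU and softmax choices are assumptions borrowed from graph attention networks, since the patent formula is not available here, and all names are hypothetical.

```python
import math

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_weights(h_v, neighbors, W, a):
    """Softmax-normalized attention of node v over its neighbors (one relation, one head)."""
    z_v = matvec(W, h_v)
    scores = []
    for h_u in neighbors:
        concat = z_v + matvec(W, h_u)  # vector concatenation [W h_v || W h_u]
        scores.append(leaky_relu(sum(ai * ci for ai, ci in zip(a, concat))))
    top = max(scores)                   # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

W = [[1.0, 0.0], [0.0, 1.0]]   # identity weight matrix, for illustration only
a = [0.0, 0.0, 1.0, 0.0]       # attention vector that looks at the neighbor's first feature
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 0.0]], W, a)
print(weights)
```

In a heterogeneous graph, each relation type (for example source-to-medium or exposure-to-outcome) would carry its own `W` and `a`.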
[0026] Furthermore, in step S5, the causal effect PE of a path p from pollution source s to health endpoint d is quantified as the product of the weights of all edges on that path: $PE(p) = \prod_{e \in p} w_e$.
[0027] Here $w_e$ is the causal weight of edge e, and p is a directed path from node s to node d. The total causal effect across multiple paths is obtained by summing the path effects: $TE(s,d) = \sum_{p \in P(s,d)} PE(p)$.
[0028] P(s,d) is the set of all valid causal paths from pollution source s to health endpoint d.
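The path effect calculation can be sketched in Python as below: enumerate directed paths from the source to the endpoint, multiply the edge weights along each path, and sum the products. Treating every simple (cycle-free) path as a valid causal path is an assumption, and the graph, node names, and weights are made up for illustration.

```python
def path_effect(path, weights):
    """Product of the causal weights of all edges along one path."""
    effect = 1.0
    for u, v in zip(path, path[1:]):
        effect *= weights[(u, v)]
    return effect

def all_paths(weights, source, target):
    """Enumerate simple directed paths from source to target by depth-first search."""
    adjacency = {}
    for u, v in weights:
        adjacency.setdefault(u, []).append(v)
    stack, paths = [(source, [source])], []
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in path:  # skip nodes already on the path to avoid cycles
                stack.append((nxt, path + [nxt]))
    return paths

def total_effect(weights, source, target):
    """Sum of path effects over all enumerated paths."""
    return sum(path_effect(p, weights) for p in all_paths(weights, source, target))

edge_weights = {
    ("source", "air"): 0.8, ("source", "water"): 0.5,
    ("air", "exposure"): 0.5, ("water", "exposure"): 0.2,
    ("exposure", "health"): 0.5,
}
print(total_effect(edge_weights, "source", "health"))
```

Ranking the individual `path_effect` values shows which route, here air or water, contributes most to the endpoint and is therefore the stronger intervention target.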
[0029] Furthermore, the method also includes a deep transfer learning step for data-sparse conditions: domain adaptation is used to achieve cross-regional model transfer, with a domain adversarial loss function.
[0030] The terms of the loss are the source domain data distribution, the target domain data distribution, the domain discriminator, and the feature extractor.
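For orientation, a domain adversarial loss of the usual form can be computed as in the Python sketch below. It assumes the standard formulation in which the discriminator outputs the probability that a feature vector comes from the source domain; the patent's exact loss is not visible in this text, so this is an illustrative stand-in with a hypothetical function name.

```python
import math

def domain_adversarial_loss(d_source, d_target, eps=1e-12):
    """Discriminator loss: -E_source[log D(F(x))] - E_target[log(1 - D(F(x)))].

    d_source : discriminator outputs for source-domain features
    d_target : discriminator outputs for target-domain features
    """
    src = sum(math.log(max(p, eps)) for p in d_source) / len(d_source)
    tgt = sum(math.log(max(1.0 - p, eps)) for p in d_target) / len(d_target)
    return -(src + tgt)

# A discriminator that cannot tell the domains apart outputs 0.5 everywhere.
print(domain_adversarial_loss([0.5, 0.5], [0.5, 0.5]))
```

The discriminator is trained to minimize this loss while the feature extractor is trained to increase it, typically through a gradient reversal layer, so that features become indistinguishable across regions.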
[0031] An environmental pollution-health effect full-chain dynamic correlation analysis system is provided to implement the environmental pollution-health effect full-chain dynamic correlation analysis method.
[0032] The system includes: a pollution source multidimensional analysis module, which identifies and quantifies the contributions of pollution sources by fusing isotope tracing with multivariate receptor models; an environmental media dynamic transport module, which simulates the migration and transformation of pollutants in multiple media based on the multi-media coupled transport model and the multi-scale spatiotemporal adaptive algorithm; a human exposure precision assessment module, which calculates individualized exposure doses from individual spatiotemporal activity trajectories and the microenvironment exposure model; a health effects causal analysis module, which identifies the causal relationship between exposure and health effects through deep learning-driven causal inference; and a whole-chain correlation model module, which constructs and visualizes causal networks from pollution sources to health effects using heterogeneous graph neural networks. Furthermore, the system adopts a three-layer architecture consisting of a data layer, an algorithm layer, and an application layer, which supports the fusion and real-time analysis of multi-source heterogeneous data.
[0033] The data layer includes a pollution source database, an environmental media database, a human exposure database, and a health effect database; the algorithm layer includes a multi-source analysis algorithm library, an environmental transmission algorithm library, an exposure assessment algorithm library, a causal inference algorithm library, and a graph neural network engine; the application layer includes a full-chain correlation analysis module, a dynamic risk prediction module, and a decision support module.
[0034] Compared with the prior art, the beneficial effects of the present invention are: (1) Full-chain causal identification and quantification. Through an innovative framework combining heterogeneous graph neural networks and causal inference, the complete causal chain from pollution source emissions to health impacts was accurately identified and quantified. This overcomes the problem of incomplete causal chains caused by traditional methods that only focus on a single or partial link, and provides a full-chain scientific basis for precise environmental governance.
[0035] (2) Precise analysis of existing and historical pollution sources. By innovatively combining isotope tracing technology and time stratification model, the existing pollution sources and historical accumulated pollution are effectively distinguished and their contributions are quantified through multi-isotope mixing model and time decay function. This solves the technical problem that traditional methods cannot distinguish pollutants from different periods and provides a scientific basis for targeted pollution source control.
[0036] (3) Accurate simulation of dynamic transport in multiple media. A multi-scale spatiotemporal adaptive transport algorithm and a multi-media coupled transport model were developed. Based on the grid refinement criterion and the time step adaptive mechanism, the dynamic and accurate simulation of the migration and transformation process of pollutants in various environmental media was realized. In particular, the simulation accuracy was significantly improved under complex terrain and extreme weather conditions, and the spatial and temporal accuracy was significantly improved.
[0037] (4) Precise individualized exposure assessment. A microenvironment exposure model based on spatiotemporal activity trajectory and an individualized physiological parameter model based on allometric growth theory were constructed, realizing precise exposure assessment that takes into account individual specificity and spatiotemporal dynamics. The accuracy of exposure assessment was greatly improved, providing a scientific basis for the graded management of health risks for different populations.
[0038] (5) Precise control of complex confounding factors. Through a deep learning-driven causal inference engine, using a representation learning framework and a multi-task objective function, the system effectively handles complex relationships involving high dimensions, nonlinearity, and time-varying factors, and achieves reliable inference of the causal relationship between pollution exposure and health effects.
[0039] (6) Efficient fusion of multi-source heterogeneous data. Based on a three-layer architecture design, through the collaboration of the data layer, algorithm layer and application layer, efficient fusion and correlation analysis of multi-source heterogeneous data such as environmental monitoring, pollution sources, human exposure and health effects are realized, supporting real-time analysis of large-scale data.
[0040] (7) Model building in a data-sparse environment. Through deep transfer learning technology, the domain adversarial training method was adopted to realize cross-regional and cross-pollutant knowledge transfer and model adaptation. Even with only a small amount of target area or pollutant data, a reliable model can still be built, which effectively solves the problem of sparsity of environmental health data.
[0041] (8) Adaptive optimization of region-specific models. Based on regional feature analysis and parameter library, the system can automatically adjust model parameters according to the geographical and climatic characteristics, pollutant emission characteristics and human activity patterns of different regions, so as to achieve "one system applicable to multiple places" and significantly improve the practical value of the system in diverse regions.
[0042] (9) Quantification of contribution of the whole chain path. Through the path effect quantification algorithm, the system can quantify the contribution of each causal path from a specific pollution source to a specific health endpoint, providing a precise intervention target for pollution control that maximizes environmental health benefits.
[0043] (10) Multi-scenario dynamic simulation and optimization. Based on the full-chain dynamic response simulation and scenario analysis function, the system can evaluate the full-chain effect of different pollution control measures, support multi-objective optimization decision analysis, and provide decision-makers with the most cost-effective combination of environmental management strategies.
[0044] (11) Regarding the accuracy of pollution source apportionment, the accuracy of a traditional single receptor model is limited by model assumptions and data quality, whereas this invention fuses isotope tracing with multiple receptor models; the complementarity of multi-source information and the optimization of fusion weights significantly improve the accuracy and reliability of the analysis. Regarding the accuracy of exposure assessment, traditional fixed-parameter models agree poorly with actual biomonitoring data because they neglect spatial heterogeneity and individual differences, whereas this invention adopts an individualized exposure assessment model based on spatiotemporal activity trajectories, which significantly improves the consistency between assessment results and biomonitoring data. Regarding the control of bias in causal identification, traditional statistical adjustment methods struggle with high-dimensional nonlinear confounding, whereas this invention adopts a deep learning-driven causal inference method with representation learning balance constraints, which significantly strengthens control of confounding bias. Regarding full-chain path identification, traditional methods can identify only the relationship within a single link, whereas this invention uses a heterogeneous graph neural network to identify and quantify multiple complete causal paths from pollution source to health endpoint simultaneously, significantly improving the completeness of the causal chain and the ability to quantify paths.
Attached Figure Description
[0045] Figure 1 is a diagram of the overall architecture of the system of the present invention; Figure 2 is a structural diagram of the precise human exposure assessment module; Figure 3 is the algorithm flowchart of the health effects causal analysis module; Figure 4 is a diagram of the construction of the full-chain association model based on graph neural networks.
Detailed Implementation
[0046] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0047] One embodiment of the present invention is shown in Figures 1 to 4.
[0048] 1. System Overall Architecture
[0049] The system comprises five core modules (S1-S5), each capable of operating independently yet also collaborating to form a complete environmental health correlation analysis workflow. As shown in Figure 1, the overall workflow of the system is as follows: first, the pollution source multidimensional analysis module (S1) identifies and quantifies existing and historical pollution sources in the environment; then, the environmental media dynamic transport module (S2) simulates the migration and transformation of pollutants among multiple environmental media; next, the human exposure precision assessment module (S3) calculates individualized exposure doses for the different exposure pathways; then, the health effects causal analysis module (S4) assesses the causal relationship between exposure and health effects; finally, the full-chain correlation model module (S5) integrates the preceding results to construct a complete causal chain network of pollution source, environmental media, human exposure, and health effects. Figures 2 to 4 show the internal structure and workflow of each module.
[0050] The following section provides further explanation of the system setup process in conjunction with the above methods and steps: The system architecture adopts a three-tier design, including a data layer, an algorithm layer, and an application layer. These layers interact through standardized interfaces, supporting flexible component replacement and horizontal scaling. The system employs a distributed microservice architecture, with each functional module decomposed into independent microservices that communicate via RESTful APIs and message queues. System deployment supports containerization and orchestration management, dynamically scaling up and down based on business load. A typical deployment size is an 8-16 node cluster, supporting the daily processing of terabytes of environmental monitoring and health data.
[0051] 1.1 Data Layer Implementation: The data layer provides data storage and management services for the system, including the following components: Pollution source database: Implemented using a hybrid distributed time-series database and relational database, configured as a multi-node cluster to provide data redundancy and high availability. Stored content includes information such as pollution source location, emission intensity, emission composition, and time series data.
[0052] Environmental Media Database: This database integrates spatial and time-series databases and is deployed as a multi-node cluster. The spatial database, implemented using PostgreSQL / PostGIS, stores environmental monitoring point, grid, and regional data, with the spatial index employing the R-tree algorithm. The time-series database stores time-series data on pollutant concentrations.
[0053] The human exposure database comprises two parts: a population activity database and a biomonitoring database. The population activity database, implemented using a graph database, stores data on population activity patterns, time allocation, and behavioral characteristics. The biomonitoring database stores information such as pollutant concentrations and biomarker levels in biological samples.
[0054] Health Effects Database: Implemented using a professional medical data warehouse architecture, including both structured and unstructured data storage. The structured data is designed using a star schema, including fact tables (disease incidence, mortality rate, biomarker levels) and dimension tables (time, location, population characteristics, disease classification).
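A star schema of this kind can be illustrated with Python's built-in `sqlite3` module. The table names, column names, and rows below are invented for the example and cover only a subset of the fact and dimension tables listed above; the patent does not specify a database engine or schema details for this component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_disease  (disease_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one row per time, location, and disease, with measures as columns.
CREATE TABLE fact_health (
    time_id        INTEGER REFERENCES dim_time(time_id),
    location_id    INTEGER REFERENCES dim_location(location_id),
    disease_id     INTEGER REFERENCES dim_disease(disease_id),
    incident_cases INTEGER,
    deaths         INTEGER
);
""")
# Made-up rows, for illustration only.
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [(1, 2025, 1), (2, 2025, 2)])
conn.execute("INSERT INTO dim_location VALUES (1, 'Region A')")
conn.execute("INSERT INTO dim_disease VALUES (1, 'asthma')")
conn.executemany("INSERT INTO fact_health VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 10, 0), (2, 1, 1, 20, 1)])

# Aggregate incident cases by region and disease through the dimension tables.
rows = conn.execute("""
    SELECT l.region, d.name, SUM(f.incident_cases)
    FROM fact_health AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    JOIN dim_disease  AS d ON d.disease_id  = f.disease_id
    GROUP BY l.region, d.name
""").fetchall()
print(rows)
```

Keeping measures in one narrow fact table and descriptive attributes in dimension tables lets the same incidence data be sliced by time, place, or disease class without duplicating it.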
[0055] Auxiliary databases include meteorological databases, geographic information databases, demographic databases, and socioeconomic databases. The meteorological database uses a professional meteorological data format (NetCDF) and supports multi-scale meteorological element storage and spatiotemporal interpolation.
[0056] 1.2 Algorithm Layer Implementation: The algorithm layer provides core algorithm support for the system and includes the following components: Multi-source apportionment algorithm library: Implements several pollution source apportionment algorithms, including isotope tracing, multivariate receptor models, and source fingerprinting. The isotope tracing algorithm includes a multivariate isotope mixing model (supporting isotope ratio analysis of carbon, nitrogen, sulfur, lead, etc.), with the following configuration: 10,000 MCMC iterations, a burn-in period of 1,000 iterations, 5 chains, and a convergence criterion of Gelman-Rubin statistic < 1.1. The multivariate receptor model includes parallel implementations of the PMF, CMB, and PCA-MLR algorithms. The PMF algorithm is configured with a maximum of 300 iterations, a convergence threshold of 1e-6, automatic selection of the number of factors (range 3-15), and uncertainty assessment using the bootstrap method (200 repetitions). The source fingerprinting algorithm employs a multi-feature fusion method that supports combined matching of chemical composition, isotope features, and spatiotemporal patterns, with similarity calculated using weighted cosine distance.
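The convergence criterion named above can be checked with a short Python sketch of the Gelman-Rubin statistic for a single scalar parameter. This is the classic between-chain versus within-chain variance form; whether the system uses this or a split-chain or rank-normalized variant is not stated, so treat it as an assumption. The chains below are toy values, not MCMC output from the patent.

```python
import math
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    m = len(chains)        # number of chains
    n = len(chains[0])     # draws per chain, after burn-in
    chain_means = [mean(c) for c in chains]
    grand_mean = mean(chain_means)
    between = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    within = mean([variance(c) for c in chains])  # mean of sample variances
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

mixed = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]       # chains explore the same region
stuck = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]   # chains disagree
print(gelman_rubin(mixed) < 1.1, gelman_rubin(stuck) < 1.1)
```

With the configuration described in the text, the statistic would be computed over 5 chains after discarding the first 1,000 of 10,000 iterations, for each source contribution ratio.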
[0057] The environmental transport algorithm library implements multi-scale spatiotemporal adaptive transport algorithms and multi-media coupled transport models. The multi-scale spatiotemporal algorithm is based on adaptive grid technology, with core parameters including: a basic grid resolution of 1 km, a minimum grid resolution of 10 m, a grid adaptation threshold of 0.2 (based on concentration gradient), and an adaptive time step range of 1 second to 1 hour (based on CFL conditions). Numerical solutions employ a high-order non-oscillatory scheme. The multi-media coupled models include: an atmospheric module (based on the CMAQ framework for optimized computational efficiency), a water module (based on the WASP model to enhance sediment-water exchange processes), a soil module (a multi-layer vertical migration model), and a biological module (a bioaccumulation model based on the FGETS framework). The coupling between models employs an explicit relaxation coupling strategy, with the coupling time step adaptively set according to media characteristics (atmosphere-water: 1 hour, water-sediment: 1 day).
[0058] Exposure assessment algorithm library: This library integrates spatiotemporal activity trajectories with microenvironmental exposure models and multi-pathway cumulative exposure models. The microenvironmental exposure model includes a pollutant distribution characteristic library for 30 typical microenvironments, with each microenvironment containing 10-20 key parameters (such as indoor / outdoor ratio, ventilation rate, and pollution source intensity), supporting exposure calculations under different population activity patterns. The individualized physiological parameter model is based on allometric growth theory and includes a basic physiological parameter library for 25 population types (stratified by age, sex, and weight), supporting parameter adjustments for various health conditions. The multi-pathway cumulative exposure model integrates three major exposure pathways: respiratory, dietary, and dermal contact, and includes absorption rate and bioavailability parameters for various pollutants, supporting internal dose calculations using a physiologically based toxicokinetics (PBTK) model.
[0059] Causal Inference Algorithm Library: This library implements a deep learning-driven causal inference engine and an intelligent identification and control algorithm for confounding factors. The causal inference engine is based on a hybrid architecture of structural causal models and deep learning. Its core components include: a representation learning network (3-layer MLP, hidden layer size [512, 256, 128]), a bias network (2-layer MLP, hidden layer size [128, 64]), and an outcome prediction network (3-layer MLP, hidden layer size [128, 64, 32]). Training parameters are set as follows: batch size 128, learning rate 1e-4, Adam optimizer, 200 training epochs, and an early stopping strategy (training halts when the validation loss shows no improvement for 10 consecutive epochs). The confounding factor control algorithm supports automatic identification of the minimum adjustment set that satisfies the backdoor criterion. It is particularly optimized for handling common confounding factors in the environmental health field, including: meteorological conditions (temperature, humidity, air pressure, etc.), temporal trends (seasonal, long-term trends), spatial correlations (distance decay, regional clustering), and demographic characteristics (age structure, socioeconomic status).
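The early stopping rule in the training parameters can be illustrated with a framework-independent sketch. It assumes that "no improvement" means the validation loss did not fall below the best value seen so far; the function name and the loss sequences are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=200):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    stale = 0                      # epochs since the last improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch       # patience exhausted
    return min(len(val_losses), max_epochs)

# Hypothetical curves: one plateaus after epoch 3, one keeps improving.
plateau = [1.0, 0.9, 0.8] + [0.85] * 20
improving = [1.0 / (i + 1) for i in range(250)]

stop_plateau = early_stopping_epoch(plateau)
stop_improving = early_stopping_epoch(improving)
```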
[0060] Graph Neural Network Engine: Enables full-chain association modeling based on heterogeneous graph neural networks. The core architecture is a multi-layer heterogeneous graph attention network with configuration parameters including: hidden layer dimension [256, 128, 64], number of attention heads [8, 8, 8], skip connections and layer normalization, and the activation function is LeakyReLU. The graph representation includes four types of nodes (pollution sources, environmental locations, populations, and health endpoints) and multiple types of edges (emissions, migration, exposure, and effects), with a node feature dimension of 128 and an edge feature dimension of 64. Training employs a multi-task learning strategy, with tasks including: link prediction (predicting potential associations between nodes), node classification (predicting node attributes), and path importance evaluation (evaluating the strength of causal paths). The optimization algorithm is Adam, with a learning rate of 1e-3, weight decay of 1e-5, batch size of 256, and 500 training epochs.
[0061] Deep transfer learning engine: Enables knowledge transfer and model adaptation across regions and pollutants. Core technologies include: domain adversarial training (to reduce the distributional differences between the source and target domains), feature alignment (achieved by minimizing the maximum mean discrepancy, MMD), and meta-learning (achieving rapid adaptation through a model-agnostic feature extractor). The model architecture is a two-stream network, including a feature extractor (shared parameters) and a domain classifier (adversarial training). Configuration parameters include: domain adversarial loss weight 0.5, feature alignment loss weight 0.3, and task loss weight 1.0. Adaptation strategies include: full fine-tuning (when target data is abundant), hierarchical freezing (when target data is limited), and feature augmentation (when target data is extremely scarce).
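How the three loss weights combine can be sketched as below. The feature alignment term is shown with a linear-kernel discrepancy (the squared distance between feature means), which is only one simple choice and is an assumption here; the text does not state which kernel is used. All function names and numbers are hypothetical.

```python
def linear_mmd2(source_features, target_features):
    """Squared distance between mean feature vectors (linear-kernel discrepancy)."""
    dim = len(source_features[0])
    mu_s = [sum(row[d] for row in source_features) / len(source_features) for d in range(dim)]
    mu_t = [sum(row[d] for row in target_features) / len(target_features) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

def total_loss(task, adversarial, alignment, w_task=1.0, w_adv=0.5, w_align=0.3):
    """Weighted sum using the loss weights listed in the configuration."""
    return w_task * task + w_adv * adversarial + w_align * alignment

# Hypothetical two-dimensional features from a source and a target region.
alignment = linear_mmd2([[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [3.0, 3.0]])
loss = total_loss(task=1.0, adversarial=0.4, alignment=alignment)
```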
[0062] 1.3. Application Layer Implementation
[0063] The application layer provides functional interfaces and analytics services for different users and scenarios, including the following components: The full-chain correlation analysis module enables the modeling and analysis of the complete chain of correlations from pollution sources to health effects. This module integrates the results of pollution source apportionment, environmental transport, exposure assessment, and health effect analysis to construct a complete causal network and provides visualization capabilities.
[0064] Dynamic Risk Prediction Module: Predicts future risk changes based on historical data and current conditions. This module supports short-term (1-7 days), medium-term (1-3 months), and long-term (1-5 years) risk predictions, and outputs include risk level, scope of impact, and affected population.
[0065] Multi-scenario Simulation Module: Evaluates the environmental and health benefits of different pollution control measures. This module supports the design and simulation of various intervention scenarios, including pollution source control, environmental remediation, and exposure interventions.
[0066] Decision Support Module: Provides optimized pollution control and health protection strategy recommendations. This module is based on a multi-objective optimization algorithm, comprehensively considering environmental benefits, health benefits, and economic costs to generate the optimal strategy combination.
[0067] The system deployment architecture adopts a hybrid cloud solution, with core computing resources deployed on a high-performance computing cluster (multi-core CPU, large-capacity memory, multiple GPUs, and large-capacity storage), and data storage using a distributed storage array (supporting petabyte-level data storage and high-throughput IO).
[0068] 2. Multidimensional Analysis Module for Pollution Sources
[0069] The multidimensional pollution source analysis module is the core component of the system for accurately identifying and quantifying existing and historical pollution sources. This module comprises four key sub-modules: an isotope tracing source analysis sub-module, a multivariate receptor model fusion sub-module, a source fingerprint feature library management sub-module, and a time-stratified source contribution calculation sub-module. These sub-modules work collaboratively to form a complete processing chain from raw environmental monitoring data to precise pollution source analysis results.
[0070] 2.1 Isotope Tracing Source Analysis Submodule
[0071] The isotope tracing source apportionment submodule aims to solve the technical challenge of traditional chemical composition analysis failing to distinguish pollution sources with similar chemical properties but originating from different periods. This submodule includes a multivariate isotope ratio determination unit, an isotope mixing model construction unit, a Bayesian parameter estimation unit, and a time-stamp identification unit.
[0072] 2.1.1 Multivariate Isotope Ratio Determination Unit: This unit employs high-precision mass spectrometry (MS / MS) for isotope ratio determination. The system is equipped with a dual analytical platform including MC-ICP-MS (multi-collector inductively coupled plasma mass spectrometry) and IRMS (isotope ratio mass spectrometry). The system supports the determination of ratios of various key isotopes, including carbon isotopes (δ13C), nitrogen isotopes (δ15N), sulfur isotopes (δ34S), lead isotopes (206Pb / 207Pb, 208Pb / 206Pb), and strontium isotopes (87Sr / 86Sr), with measurement accuracy reaching the thousandths level. Quality control utilizes international standard reference materials, including NIST SRM 981 (Pb), NIST SRM 987 (Sr), V-CDT (C), AIR (N), and V-SMOW (S), with appropriate proportions of quality control samples and parallel samples prepared for each batch of samples.
[0073] 2.1.2 Isotope Mixing Model Construction Unit: This unit establishes a mathematical model for multivariate isotope tracing to achieve quantitative analysis of pollution sources in complex environments. The model is built on a system of multivariate linear mixing equations, with the core equation being $R_{sample} = \sum_{i=1}^{n} f_i R_i + \varepsilon$, where $R_{sample}$ is the vector of isotope ratios of the sample (with dimension k, where k is the number of isotope species), $f_i$ is the contribution ratio of pollution source i, $R_i$ is the isotopic signature vector of pollution source i, n is the number of pollution sources, and $\varepsilon$ is the measurement error and model error. The system supports simultaneous processing of multiple isotopic indicators and multiple pollution sources. Model constraints include $\sum_{i=1}^{n} f_i = 1$ (conservation of mass) and $f_i \ge 0$ (non-negativity constraint). Model uncertainty quantification uses Monte Carlo simulation, considering three sources: isotope measurement uncertainty, source end-member uncertainty, and model structure uncertainty. Model diagnostic indicators include goodness of fit R², residual distribution test, and the model selection criterion AIC.
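For the smallest fully determined case, two isotope systems and three sources, the linear mixing equations together with the mass conservation constraint form a 3 × 3 system that can be solved directly. The sketch below ignores the error term and uses hypothetical signature values; the function names are illustrative and the described system may solve the model differently (for example by Bayesian sampling).

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve_three_source_mixing(sources, sample):
    """sources: three (iso_a, iso_b) signatures; sample: (iso_a, iso_b)."""
    # Rows: isotope A balance, isotope B balance, sum of fractions = 1.
    a = [[s[0] for s in sources], [s[1] for s in sources], [1.0, 1.0, 1.0]]
    b = [sample[0], sample[1], 1.0]
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("source signatures are not linearly independent")
    fractions = []
    for col in range(3):                      # Cramer's rule, one column at a time
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        fractions.append(det3(m) / d)
    if min(fractions) < -1e-9:                # non-negativity constraint
        raise ValueError("sample lies outside the source mixing region")
    return fractions

# Hypothetical (delta13C, delta15N) signatures and a sample mixed as 50/30/20 %.
sources = [(-25.0, 5.0), (-12.0, 10.0), (-30.0, 15.0)]
fractions = solve_three_source_mixing(sources, sample=(-22.1, 8.5))
```

With more sources than isotope systems the problem is underdetermined, which is one motivation for the probabilistic estimation used elsewhere in the module.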
[0074] 2.1.3. Bayesian Parameter Estimation Unit: This unit uses Bayesian inference to estimate pollution source contribution parameters, effectively handling parameter uncertainty and prior information fusion. The posterior distribution of the Bayesian estimation is calculated as: $P(f \mid D) \propto P(D \mid f)\,P(f)$
[0075] where P(f|D) is the posterior distribution of parameter f given the observed data D, P(D|f) is the likelihood function, and P(f) is the prior distribution. The Markov Chain Monte Carlo (MCMC) method is used for posterior sampling to obtain point estimates and credible intervals for the parameters.
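Posterior sampling by MCMC can be sketched with a random-walk Metropolis sampler for the simplest case: one isotope value and two sources. The uniform prior, Gaussian likelihood, proposal step, and all numeric values are assumptions made for illustration; only the iteration count, burn-in length, and number of chains follow the configuration listed for the algorithm layer.

```python
import math
import random

def log_posterior(f, observed, delta_a, delta_b, sigma):
    """Uniform prior on [0, 1] plus a Gaussian likelihood for one isotope value."""
    if not 0.0 <= f <= 1.0:
        return -math.inf
    predicted = f * delta_a + (1.0 - f) * delta_b
    return -0.5 * ((observed - predicted) / sigma) ** 2

def metropolis_chain(observed, delta_a, delta_b, sigma,
                     n_iter=10_000, burn_in=1_000, step=0.1, seed=0):
    rng = random.Random(seed)
    f = rng.random()
    lp = log_posterior(f, observed, delta_a, delta_b, sigma)
    kept = []
    for i in range(n_iter):
        candidate = f + rng.gauss(0.0, step)
        lp_candidate = log_posterior(candidate, observed, delta_a, delta_b, sigma)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_candidate - lp)):
            f, lp = candidate, lp_candidate
        if i >= burn_in:
            kept.append(f)
    return kept

# Hypothetical data: sources at -10 and -30 per mil, sample at -18 per mil.
samples = sorted(
    s for seed in range(5)                      # 5 independent chains
    for s in metropolis_chain(-18.0, -10.0, -30.0, 1.0, seed=seed)
)
posterior_mean = sum(samples) / len(samples)
interval_95 = (samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))])
```

The sample of -18 lies 60 % of the way from the second source to the first, so the posterior for the first source's fraction concentrates near 0.6.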
[0076] 2.1.4 Time-Marked Identification Unit: This unit combines sediment / soil profile chronology analysis to achieve time-dimensional marking of pollution sources. Dating is performed using a combined 210Pb and 137Cs dating method. The dating model combines the CRS (Constant Rate of Supply) model and the CIC (Constant Initial Concentration) model, with the model selected according to the 210Pb activity-depth distribution characteristics. The sedimentation rate is calculated from the following quantities: λ, the decay constant of 210Pb (0.03114 yr⁻¹); I, the 210Pb input flux; ρ, the sediment density; and A0, the surface 210Pb activity. The 137Cs time markers include characteristic years such as 1963 (nuclear test peak) and 1986 (Chernobyl accident), which are used to verify and correct the 210Pb dating results.
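As a worked illustration of the CIC model, the age of a layer follows from the radioactive decay law as t = ln(A0 / A) / λ, where A is the unsupported 210Pb activity of the layer. This is the textbook CIC relation, not a formula quoted from the text, and the activity values below are hypothetical.

```python
import math

LAMBDA_PB210 = 0.03114  # yr^-1, decay constant stated in the text

def cic_age(surface_activity, layer_activity):
    """Layer age in years under the Constant Initial Concentration assumption."""
    if surface_activity <= 0 or layer_activity <= 0:
        raise ValueError("activities must be positive")
    return math.log(surface_activity / layer_activity) / LAMBDA_PB210

# Hypothetical profile: a layer holding half the surface activity is one
# half-life old, i.e. roughly 22.3 years.
half_life_age = cic_age(100.0, 50.0)
```

A dated depth can then be compared with the 1963 and 1986 markers from 137Cs to check the chronology.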
[0077] 2.2 Multi-receptor model fusion submodule
[0078] The multi-receptor model fusion submodule aims to address the uncertainties and applicability limitations of single receptor models, improving the accuracy and reliability of source resolution through multi-model fusion. This submodule includes a parallel model computation unit, a model performance evaluation unit, a consistency weight calculation unit, and a fusion result optimization unit.
[0079] 2.2.1 Parallel Model Computation Unit: This unit runs multiple receptor models simultaneously, fully utilizing the advantages of different models. The system integrates several mainstream receptor models: PMF (Positive Matrix Factorization), CMB (Chemical Mass Balance), PCA-MLR (Principal Component Analysis-Multiple Linear Regression), NMF (Non-negative Matrix Factorization), UNMIX, and ME-2 (Multilinear Engine 2). Each model runs within a parallel computing framework to improve computational efficiency.
[0080] 2.2.2 Model Performance Evaluation Unit: This unit establishes a multi-dimensional model performance evaluation system to provide a basis for subsequent fusion weight calculation. The evaluation index system includes four dimensions: goodness-of-fit evaluation uses three indicators: coefficient of determination R², root mean square error (RMSE), and mean absolute error (MAE); physical rationality evaluation checks the physical meaning of source spectral features and the rationality of source contributions; statistical significance evaluation is based on bootstrap resampling and confidence interval analysis; and cross-validation evaluation uses K-fold cross-validation to evaluate the model's generalization ability.
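The three goodness-of-fit indicators can be computed as in the following sketch, which compares observed concentrations with those reconstructed by a receptor model. The data values are hypothetical.

```python
import math

def fit_metrics(observed, predicted):
    """Return (R^2, RMSE, MAE) for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)                 # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae

# Hypothetical observed vs. model-reconstructed concentrations.
r2, rmse, mae = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```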
[0081] 2.2.3. Consistency Weight Calculation Unit: This unit calculates the optimal fusion weight based on inter-model consistency and individual performance. Consistency evaluation employs multiple methods. Source quantity consistency is evaluated by statistically analyzing the distribution of the number of sources identified by each model and calculating a source count consistency index from std(source_counts), the standard deviation of the number of sources identified by each model, and mean(source_counts), the average number of sources identified by each model.
[0082] Source spectrum similarity is calculated using cosine similarity and the Pearson correlation coefficient. Weight calculation employs a combination of the Analytic Hierarchy Process (AHP) and the information entropy method, giving the final weights $W = \alpha W_{AHP} + (1 - \alpha) W_{entropy}$, where W is the final fusion weight, $W_{AHP}$ is the weight calculated by the analytic hierarchy process, $W_{entropy}$ is the weight calculated by the information entropy method, and α is the weight coefficient.
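The consistency index and the combined weight can be sketched under two explicit assumptions, since the exact expressions are not recoverable from the text: the source count consistency index is taken as one minus the coefficient of variation, and the final weight is taken as a convex combination of the AHP and entropy weights with coefficient α. The value α = 0.5 and all inputs are hypothetical.

```python
from statistics import mean, pstdev

def source_count_consistency(source_counts):
    """Assumed form: 1 - std / mean of the source counts found by each model."""
    return 1.0 - pstdev(source_counts) / mean(source_counts)

def combine_weights(w_ahp, w_entropy, alpha=0.5):
    """Assumed form: alpha * W_AHP + (1 - alpha) * W_entropy, renormalised to sum to 1."""
    raw = [alpha * a + (1.0 - alpha) * e for a, e in zip(w_ahp, w_entropy)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical inputs for three receptor models.
consistency = source_count_consistency([5, 5, 5])
weights = combine_weights([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```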
[0083] 2.2.4. Fusion Result Optimization Unit: This unit optimizes and fuses the results of multiple models based on the calculated weights to generate the final source apportionment result. Three fusion strategies are employed: weighted average, Bayesian model averaging, and voting ensemble. Weighted average fusion covers both source spectra and source contributions. Source spectrum fusion is $F_{fused} = \sum_{m=1}^{M} w_m F_m$, where $F_{fused}$ is the fused source spectrum matrix, $w_m$ is the weight coefficient of model m, $F_m$ is the source spectrum result of model m, and M is the total number of models. Contribution fusion is $G_{fused} = \sum_{m=1}^{M} w_m G_m$, where $G_{fused}$ is the fused source contribution matrix and $G_m$ is the source contribution result of model m. Bayesian model averaging treats each model as a hypothesis and performs model averaging within a Bayesian framework. The voting ensemble, given a fixed number of sources, employs a voting mechanism to select the most consistent result. Uncertainty quantification utilizes bootstrap resampling and the Monte Carlo method.
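Weighted average fusion of the per-model matrices can be sketched as below. The sketch assumes that the factors of all models have already been matched to the same sources and ordered identically, which is a prerequisite the text does not spell out; the matrices and weights are hypothetical.

```python
def fuse_matrices(matrices, weights):
    """Element-wise weighted average of equally shaped matrices (nested lists)."""
    total = sum(weights)                       # normalise in case weights do not sum to 1
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, matrices)) / total
         for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical source spectra from two models (rows: sources, columns: species).
model_a = [[1.0, 0.0], [0.0, 1.0]]
model_b = [[0.0, 1.0], [1.0, 0.0]]
fused = fuse_matrices([model_a, model_b], weights=[0.75, 0.25])
```

The same function applies to the source contribution matrices.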
[0084] 2.3 Source Fingerprint Feature Database Management Submodule
[0085] The source fingerprint feature database management submodule aims to build and maintain a comprehensive pollution source feature database, providing accurate source member information for source analysis. This submodule includes a multi-dimensional feature extraction unit, a feature database construction unit, a similarity matching unit, and a dynamic update unit.
[0086] 2.3.1 Multidimensional Feature Extraction Unit: This unit extracts multidimensional features from pollution source samples, including chemical components, isotope ratios, molecular markers, and morphological characteristics. The system supports feature extraction from various pollution source types, including coal-fired power plants, industrial boilers, vehicle exhaust, biomass combustion, dust, sea salt, sulfate, and nitrate. Chemical component feature extraction uses ICP-MS and IC coupled analysis to detect multiple elements and ions. Isotope feature extraction covers multiple isotope systems such as carbon, nitrogen, sulfur, lead, and strontium. Molecular marker analysis uses GC-MS and LC-MS coupled techniques. Morphological feature analysis uses scanning electron microscopy (SEM) and transmission electron microscopy (TEM) to analyze the particle size distribution, morphological characteristics, and microstructure of particulate matter.
[0087] 2.3.2 Feature Database Construction Unit: This unit constructs a structured pollution source feature database to achieve efficient storage and retrieval of feature data. The database adopts a hybrid architecture of relational and graph databases, supporting efficient indexing and querying of multi-dimensional features.
[0088] 2.3.3 Similarity Matching Unit: This unit intelligently matches environmental samples with the source feature library to identify possible pollution source types. The matching algorithm employs multi-level similarity calculation, comprehensively considering multi-dimensional features such as chemical, isotopic, molecular, and morphological characteristics. The similarity calculation uses a hierarchical weighted strategy: the first layer is the feature type weight, the second layer is the component weight, and the third layer is the distance metric. The similarity score uses an integration of multiple algorithms: cosine similarity, Pearson correlation coefficient, reciprocal Euclidean distance, and Jaccard similarity. Uncertainty handling employs fuzzy matching technology, and the TOPSIS method is used for ranking multiple candidate sources.
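The integration of the four similarity measures can be sketched as follows. Equal weights, the use of 1 / (1 + distance) as the reciprocal Euclidean score, and Jaccard similarity over the sets of detected components are all assumptions; the profiles are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def reciprocal_euclidean(a, b):
    return 1.0 / (1.0 + math.dist(a, b))

def jaccard(a, b, detection_limit=0.0):
    """Overlap of the sets of components present above the detection limit."""
    present_a = {i for i, x in enumerate(a) if x > detection_limit}
    present_b = {i for i, x in enumerate(b) if x > detection_limit}
    union = present_a | present_b
    return len(present_a & present_b) / len(union) if union else 1.0

def combined_similarity(a, b, weights=(0.25, 0.25, 0.25, 0.25)):
    scores = (cosine(a, b), pearson(a, b), reciprocal_euclidean(a, b), jaccard(a, b))
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical mass-fraction profiles over four chemical components.
sample = [0.30, 0.10, 0.05, 0.00]
coal = [0.28, 0.12, 0.04, 0.00]
sea_salt = [0.00, 0.02, 0.10, 0.40]
score_coal = combined_similarity(sample, coal)
score_sea_salt = combined_similarity(sample, sea_salt)
```

Candidate sources ranked by such scores could then be passed to a multi-criteria ranking step such as TOPSIS.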
[0089] 2.3.4 Dynamic Update Unit: This unit enables dynamic maintenance and continuous optimization of the feature library, ensuring its timeliness and accuracy. Update strategies include periodic updates and triggered updates, supporting the automatic identification and addition of new pollution source types to the database.
[0090] 2.4 Time-stratified source contribution calculation submodule
[0091] The time-stratified source contribution calculation submodule aims to achieve accurate time-stratified analysis of existing pollution sources and historically accumulated pollution, solving the technical challenge of traditional methods failing to distinguish the relative contributions of pollution sources from different periods. This submodule includes a chronological analysis unit, a time-stratification strategy unit, a contribution weight calculation unit, and a historical reconstruction unit.
[0092] 2.4.1 Chronological Analysis Unit: This unit employs various dating techniques to establish a high-precision chronological scale. The main dating methods include 210Pb dating (suitable for periods within 150 years), 137Cs dating (marking specific time points), 14C dating (suitable for scales spanning thousands of years), and optically stimulated luminescence dating (suitable for special environments). 210Pb dating uses the CRS (Constant Rate of Supply) model, and the calculation formula is: [Formula omitted for brevity], where λ is the 210Pb decay constant, A∞ is the supported 210Pb activity, and A is the excess 210Pb accumulation.
[0093] 2.4.2 Time Stratification Strategy Unit: This unit establishes a scientific time stratification scheme based on the characteristics of pollution history and chronological accuracy. The stratification strategy considers three dimensions: historical periods (pre-industrialization, early industrialization, rapid industrialization, and environmental governance period), pollution events (major pollution accidents, policy turning points, and technological change points), and data accuracy (chronological uncertainty, sample interval, and analytical precision). The standard stratification scheme is as follows: the first layer (0-30 years) represents the period of influence of existing pollution sources, with a resolution of 2-5 years; the second layer (30-80 years) represents the period of rapid industrial development, with a resolution of 5-10 years; the third layer (80-150 years) represents the early industrialization period, with a resolution of 10-20 years; and the fourth layer (>150 years) represents the background period, with a resolution of 20-50 years. A special event layer marks important time nodes, such as major pollution accidents or policy implementations in specific years. The uncertainty of the stratification boundaries is represented by a probability distribution, and the boundary ambiguity considers chronological errors and the continuity of sedimentary processes.
[0094] 2.4.3 Contribution Weight Calculation Unit: This unit calculates the relative contribution weight of pollution sources at each time layer, considering factors such as source intensity variations, environmental processes, and retention efficiency. The weight calculation is based on a modified mass balance model in which the weight $w_{i,t}$ of pollution source i at time t is computed from its concentration contribution $C_{i,t}$, the retention efficiency correction factor $\eta_{i,t}$, and the priority weight $p_i$. The retention efficiency correction considers the degradation, migration, and transformation processes of pollutants in the environment. The correction factor is calculated as $\eta_{i,t} = e^{-k_i \Delta t}$, where $k_i$ is the overall attenuation coefficient of pollutant i and $\Delta t$ is the time interval. Priority weights are determined based on the reliability of isotope tracing and chemical fingerprinting, with higher-reliability indicators assigned higher weights. Weight normalization ensures that $\sum_i w_{i,t} = 1$ for each time layer. Uncertainty propagation is quantified using the Monte Carlo method.
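One way the weight calculation could look is sketched below. Two forms are assumed because the original expressions are not recoverable: the retention correction is taken as first-order decay, exp(-k·Δt), and the raw weight is taken as the product of concentration contribution, retention factor, and priority weight, normalised so the weights in a layer sum to 1. All names and numbers are hypothetical.

```python
import math

def retention_factor(attenuation_coeff, dt_years):
    """Assumed first-order retention: exp(-k * dt)."""
    return math.exp(-attenuation_coeff * dt_years)

def layer_weights(concentrations, attenuation_coeffs, priorities, dt_years):
    """Assumed form: normalised product of contribution, retention, and priority."""
    raw = [
        c * retention_factor(k, dt_years) * p
        for c, k, p in zip(concentrations, attenuation_coeffs, priorities)
    ]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical layer with two sources; the second pollutant attenuates faster.
equal = layer_weights([2.0, 2.0], [0.0, 0.0], [1.0, 1.0], dt_years=10.0)
skewed = layer_weights([2.0, 2.0], [0.01, 0.05], [1.0, 1.0], dt_years=10.0)
```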
[0095] 2.4.4 Historical Reconstruction Unit: This unit reconstructs the complete historical evolution of pollution based on time-stratified analysis results, providing a scientific understanding of the spatiotemporal evolution of pollution sources. Historical reconstruction employs time series analysis and trend decomposition methods to analyze the long-term trends, periodic changes, and abrupt changes in the contributions of each pollution source. Trend analysis uses STL (Seasonal and Trend Decomposition using Loess) to decompose the time series into trend, periodic, and random components. Abrupt change detection uses the PELT (Pruned Exact Linear Time) algorithm to identify significant changes in pollution source contributions. Key turning points are identified through rate of change analysis; the rate of change is calculated as $\frac{dw}{dt} \approx \frac{w_{t+1} - w_{t-1}}{2\Delta t}$, where $dw/dt$ is the rate of change of the weight, $w_{t+1}$ and $w_{t-1}$ are the weight values at times t+1 and t-1, respectively, and Δt is the time interval.
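The rate of change analysis can be sketched with a central difference over the weights at times t+1 and t-1. The division by 2Δt is the usual central-difference form and is assumed here; the weight series is hypothetical.

```python
def weight_change_rates(weights, dt):
    """Central-difference rate of change at each interior time point."""
    return [
        (weights[t + 1] - weights[t - 1]) / (2.0 * dt)
        for t in range(1, len(weights) - 1)
    ]

# Hypothetical contribution weights of one source in four layers, 5 years apart.
rates = weight_change_rates([0.1, 0.2, 0.4, 0.4], dt=5.0)
fastest_change_index = 1 + max(range(len(rates)), key=lambda i: abs(rates[i]))
```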
[0096] Historical scenario reconstruction combines quantitative analysis results with qualitative historical data to construct a multi-dimensional historical picture of pollution. The uncertainty interval is estimated using bootstrap resampling with a confidence level set at 95%.
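A percentile bootstrap for the 95% uncertainty interval can be sketched as follows. The number of resamples, the percentile method, and the data are assumptions for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of the given values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical contribution estimates of one source within one time layer.
contributions = [0.31, 0.28, 0.35, 0.30, 0.33, 0.27, 0.36, 0.29]
sample_mean = mean(contributions)
lower, upper = bootstrap_ci(contributions)
```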
[0097] 3. Environmental Media Dynamic Transport Module
[0098] The environmental media dynamic transport module is the core component of the system for achieving dynamic and accurate simulation of the migration and transformation process of pollutants among multiple environmental media. This module comprises four key sub-modules: a multi-scale spatiotemporal grid management sub-module, a multi-media coupled transport sub-module, an extreme condition adaptability sub-module, and a model accuracy optimization sub-module. These sub-modules work together to form a complete simulation chain from pollutant emission to the distribution of pollutants across multiple environmental media.
[0099] 3.1 Multi-scale Spatiotemporal Grid Management Submodule
[0100] The multi-scale spatiotemporal grid management submodule aims to address the issues of insufficient accuracy and low computational efficiency of traditional fixed grid models in complex environments. This submodule includes adaptive grid generation units, spatiotemporal scale matching units, grid quality control units, and dynamic grid adjustment units.
[0101] 3.1.1 Adaptive Grid Generation Unit: This unit automatically generates multi-scale computational grids based on factors such as terrain complexity, pollution source density, and concentration gradient. The grid generation uses a quadtree structure, supporting up to 8 levels of grid refinement. The basic grid resolution is set to 1km × 1km, with a maximum refinement resolution of 10m × 10m and a refinement ratio of 1:100. Grid refinement criteria include three aspects: terrain complexity refinement, pollution source density refinement, and concentration gradient refinement. Terrain complexity is calculated based on the terrain roughness index TRI using a digital elevation model; pollution source density is calculated based on the number and intensity of sources within the grid; and the concentration gradient is determined based on pre-simulation results. A comprehensive criterion combines these three aspects into a single value for each grid cell.
[0102] The comprehensive criterion yields the refinement criterion value for grid i from three quantities, each paired with its maximum value: the pollutant concentration and the maximum concentration, the terrain elevation and the maximum elevation, and the pollution source density and the maximum pollution source density.
[0103] The mesh smoothing algorithm uses Laplacian smoothing technology to ensure that the size ratio of adjacent meshes is controlled within 1:2.
[0104] 3.1.2 Spatiotemporal Scale Matching Unit: This unit determines the optimal spatiotemporal discretization scale based on the physicochemical properties of the pollutant and environmental conditions, balancing computational accuracy and efficiency. Time scale matching is based on the CFL stability condition and the diffusion number constraint; the time step is calculated adaptively as $\Delta t = \min(\Delta t_{conv}, \Delta t_{diff}, \Delta t_{chem})$, where $\Delta t$ is the adaptive time step, $\Delta t_{conv}$ is the convection-limited time step, $\Delta t_{diff}$ is the diffusion-limited time step, and $\Delta t_{chem}$ is the time step constrained by chemical reactions. Spatial scale adaptation considers the relationship between the pollutant characteristic scale and the grid resolution, requiring the mesh size $\Delta x$ to be small enough relative to the pollutant characteristic scale $L_c$ to ensure sufficient resolution. For spatiotemporal coupling optimization, a multi-time-step technique is employed, using different time steps for different physical processes.
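The adaptive time step can be sketched as the minimum of the three limits, clamped to the 1 second to 1 hour range given for the transport algorithm library. The CFL number, diffusion number, and reaction fraction used below are hypothetical safety factors, and the function name is illustrative.

```python
import math

def adaptive_time_step(dx, velocity, diffusivity, reaction_rate,
                       cfl=0.8, diffusion_number=0.4, chem_fraction=0.1,
                       dt_min=1.0, dt_max=3600.0):
    """Time step in seconds limited by convection, diffusion, and chemistry."""
    dt_conv = cfl * dx / abs(velocity) if velocity else math.inf
    dt_diff = diffusion_number * dx ** 2 / diffusivity if diffusivity > 0 else math.inf
    dt_chem = chem_fraction / reaction_rate if reaction_rate > 0 else math.inf
    dt = min(dt_conv, dt_diff, dt_chem)
    # Clamping up to dt_min can exceed the stability limit on very fine cells;
    # a real solver would sub-cycle or refine the limits in that case.
    return max(dt_min, min(dt_max, dt))

# Hypothetical cells: a 1 km cell in light wind and a 10 m cell in strong wind.
dt_coarse = adaptive_time_step(dx=1000.0, velocity=5.0, diffusivity=50.0, reaction_rate=1e-4)
dt_fine = adaptive_time_step(dx=10.0, velocity=20.0, diffusivity=50.0, reaction_rate=1e-4)
dt_still = adaptive_time_step(dx=1000.0, velocity=0.0, diffusivity=0.0, reaction_rate=0.0)
```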
[0105] 3.1.3 Mesh Quality Control Unit: This unit monitors mesh quality and performs automatic optimization to ensure the stability and accuracy of numerical computation. Quality control indicators include three aspects: mesh geometric quality, numerical accuracy, and computational efficiency. Geometric quality control is based on indicators such as mesh aspect ratio, interior corner quality, mesh distortion, and the size ratio of adjacent meshes. Numerical accuracy control is achieved through error estimation and mesh convergence analysis; local truncation error estimation uses the Richardson extrapolation method. Computational efficiency is evaluated by monitoring CPU time and memory usage; the mesh optimization algorithm includes three operations: local refinement, coarsening, and reconstruction.
[0106] 3.1.4 Dynamic Mesh Adjustment Unit: This unit adjusts the mesh structure in real time according to changes in pollutant distribution during the simulation, achieving dynamic optimization of computational resource allocation. The adjustment strategy is based on pollutant concentration distribution, gradient changes, and computational load balancing. The adjustment criteria are consistent with the mesh refinement criteria: when the refinement criterion value exceeds the threshold, refinement is applied; when the value is below one-quarter of the threshold, coarsening is applied.
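The refine / coarsen decision can be sketched as a small rule. The threshold of 0.2 and the limit of 8 refinement levels come from the parameters stated for the grid; treating level 0 as the base grid that cannot be coarsened further is an assumption, as is the function name.

```python
def grid_action(criterion_value, level, threshold=0.2, max_level=8):
    """Decide whether a grid cell is refined, coarsened, or left unchanged."""
    if criterion_value > threshold and level < max_level:
        return "refine"
    if criterion_value < threshold / 4 and level > 0:
        return "coarsen"
    return "keep"

# Hypothetical cells: (criterion value, current refinement level).
actions = [grid_action(v, lvl) for v, lvl in [(0.3, 2), (0.3, 8), (0.04, 3), (0.1, 3), (0.04, 0)]]
```

The gap between the refinement threshold and the coarsening threshold provides hysteresis, so a cell whose value fluctuates around one threshold is not repeatedly refined and coarsened.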
[0107] The adjustment trigger mechanism is based on indicators such as the concentration gradient change rate, the grid load imbalance index, and changes in accuracy requirements. The dynamic refinement algorithm uses a hybrid criterion based on gradient and physical quantity thresholds, while the dynamic coarsening algorithm identifies grid regions that can be merged. The load balancing algorithm employs dynamic partitioning technology, and the data migration strategy ensures data consistency during grid adjustment.
[0108] 3.2 Multi-media Coupled Transport Submodule
[0109] The multi-media coupled transport submodule aims to simulate the coupled transport of pollutants among multiple environmental media, including the atmosphere, water, soil, and organisms. This submodule includes an intra-media transport unit, an inter-media exchange unit, a phase equilibrium calculation unit, and a coupled solution unit.
[0110] 3.2.1 Intra-medium Transport Unit: This unit simulates the transport process of pollutants within various environmental media, employing targeted numerical methods and physical models. Atmospheric transport simulation uses a three-dimensional Eulerian grid model, with the convection-diffusion reaction equations as the fundamental governing equations, and a splitting algorithm for numerical solution. Water transport simulation is based on a three-dimensional hydrodynamic-water quality coupled model, with the hydrodynamic module using the finite volume method to solve the shallow water equations. Soil transport simulation is based on the Richards equations and the convection-diffusion equations, considering water flow and solute transport in the unsaturated zone. Biophase transport employs a dynamic bioaccumulation model, considering the absorption, distribution, and metabolism of pollutants within plants and animals.
[0111] 3.2.2 Inter-media Exchange Unit: This unit calculates the pollutant exchange flux between media such as atmosphere-water, atmosphere-soil, and water-sediment, providing the key connection for multi-media coupling. The exchange flux between two media is calculated from the transfer rate constant, the fugacity correction factors of the two media, and the pollutant concentrations in the two media. Atmosphere-water exchange employs a two-film theoretical model, with the overall mass transfer coefficient accounting for both gas-phase and liquid-phase resistances. Atmosphere-soil exchange considers volatilization and wet / dry deposition processes, with the volatilization flux modeled using a resistance model. Water-sediment exchange includes three processes: diffusion, bioturbation, and resuspension. A graded time-step coupling strategy is used, with synchronization points for inter-media flux exchange set every hour.
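For atmosphere-water exchange, the classical two-film model adds the liquid-phase and gas-phase resistances in series. The sketch below uses that textbook form with a dimensionless Henry's law constant (air concentration over water concentration at equilibrium); it is not the flux formula of the text, whose exact expression is not recoverable, and all parameter values are hypothetical.

```python
def overall_transfer_velocity(k_water, k_air, henry_dimensionless):
    """Overall water-side transfer velocity (m/s) from two resistances in series."""
    return 1.0 / (1.0 / k_water + 1.0 / (k_air * henry_dimensionless))

def air_water_flux(c_water, c_air, k_water, k_air, henry_dimensionless):
    """Net flux from water to air; positive means volatilisation."""
    k_overall = overall_transfer_velocity(k_water, k_air, henry_dimensionless)
    return k_overall * (c_water - c_air / henry_dimensionless)

# Hypothetical film velocities (m/s) and Henry's constant.
k_overall = overall_transfer_velocity(1e-5, 1e-2, 0.1)
flux_out = air_water_flux(2.0, 0.1, 1e-5, 1e-2, 0.1)      # water is supersaturated
flux_equilibrium = air_water_flux(1.0, 0.1, 1e-5, 1e-2, 0.1)
```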
[0112] 3.2.3 Phase Equilibrium Calculation Unit: This unit calculates the equilibrium distribution of pollutants in a multiphase system, considering processes such as dissolution, adsorption, and volatilization. Gas-liquid equilibrium is calculated using a modified Raoult's law, and activity coefficients are calculated using the UNIFAC method. Solid-liquid equilibrium considers both linear and nonlinear adsorption; linear adsorption uses the partition coefficient, while nonlinear adsorption uses the Freundlich or Langmuir equations. Multi-component competitive adsorption is performed using the extended Langmuir equation or the ideal adsorption solution theory. Temperature effects are corrected for the equilibrium constant using the van't Hoff equation, and pH effects are considered using the species distribution coefficient.
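The adsorption isotherms and the temperature correction can be written out in their standard forms, as in the sketch below. Parameter names and values are hypothetical, and the Freundlich exponent is written as 1/n by convention.

```python
import math

GAS_CONSTANT = 8.314  # J mol^-1 K^-1

def freundlich(c, k_f, n):
    """Sorbed amount q = K_F * C^(1/n)."""
    return k_f * c ** (1.0 / n)

def langmuir(c, q_max, k_l):
    """Sorbed amount q = q_max * K_L * C / (1 + K_L * C)."""
    return q_max * k_l * c / (1.0 + k_l * c)

def vant_hoff(k_ref, delta_h, t_ref, t):
    """Equilibrium constant at temperature t (K) from its value at t_ref (K)."""
    return k_ref * math.exp(-delta_h / GAS_CONSTANT * (1.0 / t - 1.0 / t_ref))

# Hypothetical parameters.
q_freundlich = freundlich(4.0, k_f=3.0, n=2.0)
q_langmuir = langmuir(0.5, q_max=10.0, k_l=2.0)            # C = 1 / K_L gives q_max / 2
k_warm = vant_hoff(1.5, delta_h=-20000.0, t_ref=298.15, t=308.15)
```

With a negative enthalpy change (exothermic sorption) the equilibrium constant decreases as temperature rises, which the last line illustrates.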
[0113] 3.2.4 Coupled Solution Unit: This unit implements the coupled solution of the multi-media transport equations, handling strong interactions and feedback effects between media. The solution strategy employs an operator-splitting algorithm and an iterative coupling method. The governing equation is $\frac{\partial C}{\partial t} = \mathcal{T}(C) + \mathcal{E}(C) + \mathcal{R}(C)$, where C represents the pollutant concentration, and $\mathcal{T}$, $\mathcal{E}$, and $\mathcal{R}$ are the transport operator, the exchange operator, and the reaction operator, respectively.
[0114] The splitting algorithm decomposes complex coupled systems into relatively simple subproblems, and the operator splitting scheme employs Strang splitting. An iterative coupling method handles strongly coupled problems, with an outer iteration enforcing the consistency of fluxes between media. Numerical stability is guaranteed through techniques such as time step constraints, artificial viscosity, positivity preservation, and mass conservation constraints. Computational efficiency is optimized using parallel computing and preconditioning techniques.
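Strang splitting can be illustrated on a toy problem with two processes that each have an exact sub-step: first-order loss (standing in for transport or exchange out of a cell) and a second-order reaction. Each step applies half a step of the first process, a full step of the second, and another half step of the first. The test problem, rates, and function names are hypothetical; the closed-form solution is included only to show the second-order accuracy.

```python
import math

def decay_flow(c, rate, dt):
    """Exact sub-step for dC/dt = -rate * C."""
    return c * math.exp(-rate * dt)

def second_order_flow(c, k, dt):
    """Exact sub-step for dC/dt = -k * C^2."""
    return c / (1.0 + k * c * dt)

def strang_integrate(c0, rate, k, t_end, n_steps):
    dt = t_end / n_steps
    c = c0
    for _ in range(n_steps):
        c = decay_flow(c, rate, dt / 2.0)      # half step, process A
        c = second_order_flow(c, k, dt)        # full step, process B
        c = decay_flow(c, rate, dt / 2.0)      # half step, process A
    return c

def exact_solution(c0, rate, k, t):
    """Closed form for dC/dt = -rate * C - k * C^2."""
    decay = math.exp(-rate * t)
    return rate * c0 * decay / (rate + k * c0 * (1.0 - decay))

reference = exact_solution(1.0, 1.0, 1.0, 1.0)
error_10 = abs(strang_integrate(1.0, 1.0, 1.0, 1.0, 10) - reference)
error_20 = abs(strang_integrate(1.0, 1.0, 1.0, 1.0, 20) - reference)
```

Halving the step size reduces the error by roughly a factor of four, the signature of a second-order scheme.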
[0115] 3.3 Extreme Condition Adaptability Submodule
[0116] The extreme conditions adaptability submodule is designed to handle pollutant transport simulations under complex meteorological and topographical conditions, improving the model's prediction accuracy under special conditions. This submodule includes an extreme weather identification unit, a complex terrain processing unit, a model parameter adjustment unit, and a special process simulation unit.
[0117] 3.3.1 Extreme Weather Identification Unit: This unit automatically identifies extreme weather conditions that may affect pollutant transport, including temperature inversion, strong winds, heavy rain, haze, and dust storms. Temperature inversion identification is based on vertical temperature gradient analysis. Strong wind event identification is based on wind speed thresholds and duration. Heavy rain event identification is based on rainfall intensity. Haze condition identification is based on visibility and humidity. Dust storm identification is based on sudden increases in PM10 concentration combined with wind speed conditions.
[0118] 3.3.2 Complex Terrain Processing Unit: This unit addresses the impact of complex terrain such as mountains, valleys, and coastlines on pollutant transport. Mountain terrain processing employs a terrain-following coordinate system, with the vertical coordinate transformation $\sigma = (z - z_s) / (z_{top} - z_s)$, where $z$ is the height, $z_s$ is the ground height, and $z_{top}$ is the model top height. The valley effect simulation considers cold air convergence and inversion layer formation, employing a modified boundary layer parameterization scheme. Coastal effect handling includes sea-land breeze simulation and turbulence intensity adjustment; local circulation driven by sea-land temperature differences is simulated with nested mesoscale meteorological models. The urban heat island effect is considered through a modified surface energy balance equation, and the impact of urban buildings on the wind field is handled using an urban canopy model.
[0119] 3.3.3 Model Parameter Adjustment Unit: This unit dynamically adjusts the key parameters of the transport model based on the identified extreme conditions and terrain features. The parameter adjustment strategy combines physical mechanism analysis and empirical correction. The adjustment rules under different extreme conditions are as follows. Under temperature inversion conditions, the vertical diffusion coefficient and the mixing layer height decrease. Under strong wind conditions, horizontal diffusion is enhanced and the dry deposition velocity increases. Under heavy rain conditions, the wet deposition coefficient increases and surface runoff erosion is enhanced. Under haze conditions, the photochemical reaction rate decreases and particulate matter aggregation is enhanced. Under dust storm conditions, the particulate matter resuspension rate increases and gravitational settling becomes the dominant deposition mechanism.
[0120] 3.3.4 Special Process Simulation Unit: This unit simulates special physicochemical processes under extreme conditions. The inversion layer simulation employs a multi-layer atmospheric model, considering the inhibitory effect of the inversion layer on vertical mixing. The strong convection simulation uses a convection parameterization scheme to simulate the rapid vertical transport of pollutants by strong convection such as thunderstorms. The enhanced wet deposition simulation employs detailed cloud microphysical processes, considering the removal efficiency of different types of precipitation. The photochemical reaction inhibition simulation adjusts the photolysis rate constant based on solar radiation intensity and cloud cover. The particulate matter agglomeration simulation uses the discrete sectional method, considering particle size growth under high humidity conditions. The wind-blown dust simulation calculates the wind erosion flux and its vertical distribution based on friction velocity and surface roughness.
[0121] 3.4 Model Accuracy Optimization Submodule
[0122] The model accuracy optimization submodule aims to improve the prediction accuracy of the transport model through techniques such as observation data assimilation, machine learning correction, and parameter optimization. This submodule includes an observation data assimilation unit, a machine learning bias correction unit, a parameter sensitivity analysis unit, and a model performance evaluation unit.
[0123] 3.4.1 Observation Data Assimilation Unit: This unit uses data assimilation techniques to fuse multi-source observation data into the transport model, improving prediction accuracy. Assimilation methods include optimal interpolation, Kalman filtering, ensemble Kalman filtering, and four-dimensional variational assimilation. The analysis update formula for the ensemble Kalman filter (EnKF) is $x^a = x^f + K\,(y - H x^f)$, where $x^a$ is the analysis field, $x^f$ is the forecast field, $K$ is the Kalman gain matrix, $y$ is the observed value, and $H$ is the observation operator. The observation error covariance matrix is determined from instrument accuracy and representativeness error, and the background error covariance is obtained from historical forecast error statistics. The cost function for four-dimensional variational assimilation (4D-Var) is $J = J_b + J_o$, where $J_b$ is the background term and $J_o$ is the observation term. Observational data come from multiple sources, including ground monitoring stations, satellite remote sensing, radar detection, and mobile monitoring. Data quality control uses statistical tests and physical consistency tests.
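The analysis update can be sketched as follows. This is a minimal illustration assuming the standard Kalman gain $K = P^f H^T (H P^f H^T + R)^{-1}$, which the text does not spell out, with the forecast covariance optionally estimated from an ensemble; the function names and numbers are hypothetical.

```python
import numpy as np

def ensemble_covariance(ensemble):
    """Sample forecast-error covariance from an ensemble of shape (n_members, n_state)."""
    return np.atleast_2d(np.cov(ensemble, rowvar=False))

def analysis_update(x_f, P_f, y, H, R):
    """Return the analysis state x_a = x_f + K (y - H x_f) and the gain K."""
    S = H @ P_f @ H.T + R                 # innovation covariance
    K = P_f @ H.T @ np.linalg.inv(S)      # Kalman gain (assumed standard form)
    x_a = x_f + K @ (y - H @ x_f)
    return x_a, K

# Two grid cells; only the first one is observed by a monitoring station.
x_f = np.array([10.0, 20.0])              # forecast concentrations
P_f = np.array([[4.0, 2.0], [2.0, 4.0]])  # forecast error covariance
H = np.array([[1.0, 0.0]])                # observation operator
R = np.array([[4.0]])                     # observation error covariance
y = np.array([14.0])                      # observed concentration

x_a, K = analysis_update(x_f, P_f, y, H, R)
print(x_a)
```

Because the two cells have correlated forecast errors, the observation in the first cell also corrects the unobserved second cell.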
[0124] 3.4.2 Machine Learning Bias Correction Unit: This unit uses machine learning techniques to identify and correct systematic biases in the model. Bias correction employs various machine learning algorithms: random forest, support vector machine, neural network, and gradient boosting tree. Feature engineering includes multiple feature variables such as meteorological elements (temperature, humidity, wind speed, boundary layer height), geographical features (latitude and longitude, altitude, land use), temporal features (season, day / night cycle, weekend effect), and pollution source features (distance from source, source intensity, source type). Model training uses cross-validation, and performance evaluation metrics include RMSE, MAE, correlation coefficient, and bias. The deep learning model uses an LSTM network to process time series features.
[0125] 3.4.3 Parameter Sensitivity Analysis Unit: This unit systematically analyzes how sensitive the prediction results are to the key parameters of the transport model, providing a basis for parameter optimization. Sensitivity analysis employs the Morris screening method, the Sobol method, and the extended FAST method. The Morris method is used for preliminary screening of important parameters, employing a radial design. The Sobol method calculates the first-order and total-order sensitivity indices of the parameters, with a sample size of 10,000. Sensitivity indices include local sensitivity and global sensitivity. The local sensitivity index is $S_i^{local} = \partial Y / \partial \theta_i$, where $S_i^{local}$ is the local sensitivity index of parameter $\theta_i$, $Y$ is the model output, and $\partial Y / \partial \theta_i$ is the partial derivative of the output with respect to the parameter.
[0126] The global sensitivity index is $S_i = V_{\theta_i}\left[E_{\theta_{\sim i}}(Y \mid \theta_i)\right] / V(Y)$, where $S_i$ is the global sensitivity index of parameter $\theta_i$, $V$ denotes variance, $E$ denotes expectation, and $\theta_{\sim i}$ denotes all parameters other than $\theta_i$. Parameters are ranked by the magnitude of their sensitivity indices to identify the key parameters with the greatest impact on the model output.
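A first-order global (Sobol) index can be estimated by Monte Carlo sampling. The sketch below uses a pick-and-freeze estimator of the Saltelli type on a hypothetical toy model with independent uniform parameters; the estimator choice, the toy model, and all names are assumptions for illustration, not the document's implementation.

```python
import numpy as np

def first_order_sobol(model, n_params, n_samples=10_000, seed=0):
    """Estimate first-order Sobol indices with a pick-and-freeze estimator.

    Parameters are assumed independent and uniform on [0, 1].
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n_samples, n_params))
    B = rng.random((n_samples, n_params))
    y_A, y_B = model(A), model(B)
    mean = y_A.mean()                        # centring reduces estimator variance
    var_y = np.concatenate([y_A, y_B]).var()
    indices = np.empty(n_params)
    for i in range(n_params):
        AB = A.copy()
        AB[:, i] = B[:, i]                   # A with column i taken from B
        y_AB = model(AB)
        indices[i] = np.mean((y_B - mean) * (y_AB - y_A)) / var_y
    return indices

def toy_model(theta):
    """Hypothetical linear model; analytic indices are 0.2 and 0.8."""
    return theta[:, 0] + 2.0 * theta[:, 1]

S = first_order_sobol(toy_model, n_params=2)
print(S)
```

For this additive toy model the first-order indices sum to one, so the ranking identifies the second parameter as the dominant one.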
[0127] 4. Precise Human Exposure Assessment Module
[0128] As shown in Figure 2, the precise human exposure assessment module is the core component of the system for achieving precise exposure assessment that accounts for individual differences and spatiotemporal dynamics. This module comprises three key sub-modules: the spatiotemporal activity trajectory and microenvironment exposure sub-module, the individual-differentiated physiological parameter sub-module, and the multi-pathway cumulative exposure integration sub-module. These sub-modules work together to form a complete processing chain from individual activity data to precise exposure assessment results.
[0129] 4.1 Spatiotemporal Activity Trajectory and Microenvironment Exposure Submodule
[0130] The Spatiotemporal Activity Trajectory and Microenvironment Exposure submodule aims to construct a microenvironment exposure assessment model based on individual spatiotemporal activity trajectories, addressing the problem that traditional exposure assessments neglect individual activity patterns and spatial heterogeneity. This submodule includes an activity trajectory data acquisition unit, a microenvironment identification and classification unit, an exposure concentration interpolation unit, and a trajectory exposure calculation unit.
[0131] 4.1.1 Activity Trajectory Data Acquisition Unit: This unit employs multi-source data fusion technology to obtain detailed spatiotemporal activity information for individuals. Data acquisition methods include GPS positioning, accelerometer monitoring, activity log recording, and questionnaire surveys. GPS data preprocessing uses a Kalman filter algorithm to remove positioning noise, and the data is saved in the format (timestamp, longitude, latitude, accuracy, activity intensity). Accelerometer data is used to identify activity type and intensity; a machine learning classifier is used to classify the raw signals into various activity types such as sitting, walking, running, and cycling. Activity logs are recorded in real-time using a mobile app, including information such as microenvironment type (indoor / outdoor), activity content, and duration. Questionnaire surveys supplement personal characteristic information, including key fields such as age, gender, occupation, and health status.
[0132] 4.1.2 Microenvironment Identification and Classification Unit: This unit automatically identifies and finely classifies the microenvironment types of individual activities. The microenvironment classification system includes various typical microenvironments: residential environments (bedrooms, living rooms, kitchens, bathrooms), work environments (offices, factory workshops, laboratories, shops), transportation environments (buses, subways, private cars, walking / cycling), and leisure environments (parks, shopping malls, restaurants, gyms), etc. The identification algorithm employs a multi-feature fusion method: geofencing technology (based on a predefined Point of Interest (POI) database), WiFi fingerprint recognition (based on WiFi signal strength patterns), activity pattern recognition (based on dwell time and movement speed characteristics), and user annotation verification (allowing users to confirm or correct the environment classification through an app interface). Microenvironment attributes include key attributes such as ventilation conditions, personnel density, and pollution source characteristics.
[0133] 4.1.3 Exposure Concentration Interpolation Unit: This unit estimates pollutant concentrations at individual activity locations from environmental monitoring network data using spatiotemporal interpolation methods. The interpolation fuses multiple algorithms: Kriging interpolation (for pollutants with good spatial continuity, such as PM2.5 and PM10), inverse distance weighting (for pollutants with significant point-source influence, such as SO2 and NOx), machine learning interpolation (random forests and neural networks, for complex nonlinear relationships), and land use regression (LUR) models (combined with GIS data for refined estimation). Interpolation in the time dimension uses spline functions and ARIMA models to process the monitoring time series. The indoor/outdoor concentration relationship is determined using an I/O ratio model, $C_{in} = \alpha\,C_{out} + \beta\,S_{in}$, where $C_{in}$ and $C_{out}$ are the indoor and outdoor concentrations, $\alpha$ is the permeability coefficient, $\beta$ is the indoor source contribution coefficient, and $S_{in}$ is the indoor source strength. Model parameters are adjusted based on building type, ventilation conditions, and season. Cross-validation is used to assess interpolation accuracy.
[0134] 4.1.4 Trajectory Exposure Calculation Unit: This unit calculates a refined exposure dose based on individual activity trajectories and the exposure concentration distribution. The exposure calculation uses a time-weighted method: $E = \sum_k C_k \cdot t_k \cdot IR_k \cdot AF_k$, where $C_k$ is the pollutant concentration at the k-th spatiotemporal point, $t_k$ is the duration of stay, $IR_k$ is the respiratory rate corresponding to the activity intensity, and $AF_k$ is the absorption correction factor for the microenvironment. The respiratory rate is dynamically adjusted based on activity intensity. The correction factor accounts for scenarios such as wearing a mask, using an indoor air purifier, and whether windows are open. Uncertainty analysis employs the Monte Carlo method, considering concentration estimation errors, time recording errors, and parameter uncertainties.
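As an illustration of the trajectory-based calculation, the sketch below sums concentration × duration × respiratory rate × correction factor over trajectory segments. The `TrajectoryPoint` structure, the units, and the sample values are assumptions made for this example only.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPoint:
    concentration: float    # pollutant concentration, ug/m3 (assumed unit)
    duration_h: float       # duration of stay, hours
    inhalation_rate: float  # respiratory rate for the activity, m3/h (assumed unit)
    correction: float       # microenvironment absorption correction factor

def trajectory_dose(points):
    """Inhaled dose (ug) summed over all trajectory segments."""
    return sum(p.concentration * p.duration_h * p.inhalation_rate * p.correction
               for p in points)

def time_weighted_concentration(points):
    """Time-weighted average concentration (ug/m3) along the trajectory."""
    total_time = sum(p.duration_h for p in points)
    return sum(p.concentration * p.duration_h for p in points) / total_time

# Hypothetical day fragment: sleeping indoors with a purifier, cycling, walking.
day = [
    TrajectoryPoint(35.0, 8.0, 0.5, 0.6),
    TrajectoryPoint(80.0, 1.0, 1.2, 1.0),
    TrajectoryPoint(20.0, 2.0, 0.8, 1.0),
]
dose = trajectory_dose(day)
twa = time_weighted_concentration(day)
print(dose, twa)
```

The short cycling segment contributes more dose than eight hours of sleep, which is the kind of effect a fixed-location average would miss.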
[0135] 4.2 Individually Differentiated Physiological Parameter Submodule
[0136] The individualized physiological parameters submodule aims to determine individualized physiological parameters based on an individual's demographic characteristics, physiological status, and health condition, thereby improving the relevance and accuracy of exposure assessment. This submodule includes a demographic analysis unit, a physiological parameter modeling unit, a health status assessment unit, and a parameter dynamic adjustment unit.
[0137] 4.2.1 Demographic Characteristics Analysis Unit: This unit systematically analyzes and classifies the basic demographic characteristics of individuals. Characteristic parameters include age (grouped by 5-year intervals, covering ages 0-85+), gender (male / female), weight (kg), height (cm), BMI, ethnicity (Asian / European / African / Other), pregnancy status (pregnant / lactating / non-pregnant), and lifestyle habits (smoking / drinking / exercise frequency), among other key indicators. Data sources include questionnaires, physical examination records, and wearable device monitoring. Data standardization uses the Z-score method, and outlier detection uses the 3σ criterion and box plot method. Population classification employs cluster analysis (K-means algorithm) to divide the population into multiple typical subgroups, each with similar combinations of demographic characteristics. Classification accuracy is evaluated using the silhouette coefficient; a coefficient > 0.6 indicates good classification performance. Sensitive population identification covers special populations such as children (0-12 years), the elderly (>65 years), pregnant women, and patients with chronic diseases.
[0138] 4.2.2 Physiological Parameter Modeling Unit: This unit establishes individualized physiological parameter prediction models based on allometric scaling theory and physiological principles. Core parameters include basal metabolic rate (BMR), minute ventilation (Ve), alveolar ventilation (VA), body surface area (BSA), blood flow (Q), and tissue volume (V). BMR is calculated using the revised Harris-Benedict formula: for males, $BMR = 88.362 + 13.397\,W + 4.799\,H - 5.677\,A$; for females, $BMR = 447.593 + 9.247\,W + 3.098\,H - 4.330\,A$, where W is weight (kg), H is height (cm), and A is age (years). Respiratory rate is calculated from the allometric relationship with energy expenditure, as a function of the reference respiratory rate, body weight (BW), and the activity adjustment factor (AF). Body surface area is calculated using the DuBois formula: $BSA = 0.007184 \times W^{0.425} \times H^{0.725}$, with BSA in m². The model was calibrated using 1000 real-world test cases.
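The two closed-form relations in this unit can be evaluated directly. The sketch below implements the revised Harris-Benedict coefficients quoted above and the standard DuBois body surface area formula (weight in kg, height in cm, area in m²); the function names and the example individual are illustrative assumptions.

```python
def bmr_revised_harris_benedict(weight_kg, height_cm, age_yr, sex):
    """Basal metabolic rate (kcal/day) from the revised Harris-Benedict equations."""
    if sex == "male":
        return 88.362 + 13.397 * weight_kg + 4.799 * height_cm - 5.677 * age_yr
    if sex == "female":
        return 447.593 + 9.247 * weight_kg + 3.098 * height_cm - 4.330 * age_yr
    raise ValueError("sex must be 'male' or 'female'")

def bsa_dubois(weight_kg, height_cm):
    """Body surface area (m2) from the DuBois formula."""
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725

# Hypothetical individual: 70 kg, 175 cm, 30 years old.
bmr_male = bmr_revised_harris_benedict(70.0, 175.0, 30.0, "male")
bsa = bsa_dubois(70.0, 175.0)
print(round(bmr_male, 1), round(bsa, 2))
```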
[0139] 4.2.3 Health Status Assessment Unit: This unit assesses the impact of an individual's health status on physiological parameters, enabling health-related adjustments to these parameters. Health status categories include cardiovascular diseases, respiratory diseases, metabolic diseases, abnormal liver and kidney function, and immune system diseases. Each category is further divided into mild, moderate, and severe levels. Parameter adjustments are based on clinical research data and an expert knowledge base: pulmonary function parameters for patients with respiratory diseases are reduced by 10-50%; blood flow parameters for patients with cardiovascular diseases are adjusted by ±20%; and metabolic parameters for individuals with abnormal liver function are reduced by 20-60%. The adjustment uses a multiplicative model, $P_{adjusted} = P_{baseline} \times CF_{health}$, where $CF_{health}$ is the health adjustment factor, ranging from 0.4 to 1.5. The drug effect assessment considers the impact of common drugs on physiological parameters: bronchodilators increase lung ventilation by 10-25%, diuretics affect blood flow distribution, and hormonal drugs affect metabolic rate. Clinical validation of the adjusted parameters was completed by comparing them with measured patient data.
[0140] 4.2.4 Dynamic Parameter Adjustment Unit: This unit enables dynamic adjustment and real-time optimization of physiological parameters, considering how the parameters change with time, environment, and physiological state. Dynamic adjustment factors include circadian rhythm (24-hour cyclical changes in metabolic rate and respiratory pattern), seasonal variations (approximately 10% difference in basal metabolic rate between winter and summer), pregnancy (gradual adjustment of physiological parameters during pregnancy), age (long-term trends in physiological parameters), and training adaptation (the impact of exercise training on cardiopulmonary function). Circadian rhythm adjustment uses a cosine function model, $P(t) = P_0\,[1 + A\cos(2\pi (t - \Phi)/24)]$, where $P_0$ is the baseline parameter value, $t$ is the time of day (hours), A is the amplitude coefficient (0.1-0.3), and Φ is the phase shift (hours). Seasonal adjustments are based on temperature and daylight hours: BMR increases by 5-15% in winter, and heat dissipation-related parameters are enhanced in summer. Real-time monitoring uses wearable device data (heart rate, activity level, sleep quality) to fine-tune the parameters, with adjustments made weekly and limited to ±20%.
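A minimal sketch of the circadian adjustment follows. It assumes the cosine model takes the form baseline × (1 + A·cos(2π(t − Φ)/24)), which is one plausible reading of the description; the default amplitude and phase are hypothetical values chosen within the stated 0.1-0.3 range.

```python
import math

def circadian_adjust(baseline, hour, amplitude=0.2, phase_h=15.0):
    """Scale a baseline parameter by a 24-hour cosine rhythm.

    amplitude: relative amplitude A (the text gives 0.1-0.3).
    phase_h:   hour of day at which the parameter peaks (assumed meaning of the phase shift).
    """
    return baseline * (1.0 + amplitude * math.cos(2.0 * math.pi * (hour - phase_h) / 24.0))

# Hypothetical baseline metabolic rate of 100 units.
peak = circadian_adjust(100.0, 15.0)      # at the phase hour
trough = circadian_adjust(100.0, 3.0)     # twelve hours away from the peak
midpoint = circadian_adjust(100.0, 21.0)  # six hours after the peak
print(peak, trough, midpoint)
```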
[0141] 4.3 Multi-Pathway Cumulative Exposure Integration Submodule
[0142] The multi-pathway cumulative exposure integration submodule aims to integrate multiple exposure pathways, including respiratory, dietary, and skin contact, to calculate the cumulative internal dose of pollutants and achieve comprehensive exposure assessment. This submodule includes a multi-pathway exposure calculation unit, a bioavailability assessment unit, an internal dose conversion unit, and a cumulative effect analysis unit.
[0143] 4.3.1 Multi-route Exposure Calculation Unit: This unit systematically calculates an individual's exposure to pollutants through different routes. Exposure routes are categorized into four main types: respiratory exposure (inhalation of air pollutants), dietary exposure (ingestion of contaminated food and water), skin exposure (contact with contaminated soil, water, and surfaces), and accidental ingestion (hand-to-mouth contact in children). Respiratory exposure is calculated as $E_{inh} = C_{air} \times IR \times t \times f_{abs}$, where $C_{air}$ is the air concentration, IR is the respiration rate, t is the exposure time, and $f_{abs}$ is the lung absorption rate. Dietary exposure is based on dietary structure and food contamination levels: $E_{diet} = \sum_i C_i \times IR_i$, where $C_i$ is the concentration of contaminants in food i and $IR_i$ is the consumption rate of food i. Skin exposure is modeled using skin penetration: $E_{derm} = C \times SA \times K_p \times t$, where C is the concentration in the contact medium, SA is the contact area, $K_p$ is the skin permeability coefficient, and t is the contact time. The database contains exposure parameters for various pollutants in multiple media, covering major categories such as heavy metals, organic pollutants, and particulate matter. The exposure scenario database includes typical exposure scenarios such as occupational exposure, environmental exposure, and indoor exposure.
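The route-specific calculations can be sketched as three small functions. The inhalation form follows the formula given in the text; the dietary sum and the dermal product (concentration × contact area × permeability coefficient × contact time) are assumed forms, and all units and sample values are hypothetical.

```python
def inhalation_dose(c_air, inhalation_rate, hours, f_abs):
    """E_inh = C_air * IR * t * f_abs (mg if C_air is in mg/m3 and IR in m3/h)."""
    return c_air * inhalation_rate * hours * f_abs

def dietary_dose(foods):
    """Sum of concentration * consumption rate over (mg/kg, kg/day) pairs; mg/day."""
    return sum(concentration * consumption for concentration, consumption in foods)

def dermal_dose(c_medium, area_cm2, kp_cm_per_h, hours):
    """Assumed dermal form: C (mg/cm3) * SA (cm2) * Kp (cm/h) * t (h); mg."""
    return c_medium * area_cm2 * kp_cm_per_h * hours

# Hypothetical single-day scenario.
e_inh = inhalation_dose(c_air=0.05, inhalation_rate=0.6, hours=24.0, f_abs=0.5)
e_diet = dietary_dose([(0.1, 0.3), (0.02, 1.5)])
e_derm = dermal_dose(c_medium=0.001, area_cm2=2000.0, kp_cm_per_h=0.001, hours=0.5)
total = e_inh + e_diet + e_derm
print(e_inh, e_diet, e_derm, total)
```

Summing the routes directly, as done here for `total`, is only meaningful once each route has been converted to an absorbed dose, which is the role of the bioavailability unit described next.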
[0144] 4.3.2 Bioavailability Assessment Unit: This unit assesses the bioavailability of pollutants after they enter the human body via different routes, providing key parameters for internal dose conversion. Bioavailability is defined as the proportion of pollutants entering the systemic circulation, influenced by the physicochemical properties of the pollutant, individual physiological state, and exposure conditions. Bioavailability via the respiratory route primarily depends on particle size distribution and solubility. Bioavailability via the digestive route is affected by gastrointestinal pH, food matrix, and individual differences. Bioavailability via the skin route depends on skin integrity and the lipid solubility of the pollutant. The assessment model integrates QSAR predictions, in vitro experimental data, and literature reports.
[0145] 4.3.3 Internal Dose Conversion Unit: This unit converts the external exposure dose into the biologically effective dose in vivo, based on physiologically based pharmacokinetic (PBPK) principles. The conversion uses a simplified PBPK model, $dC/dt = I_{total}/V_b - k_e\,C$, where $C$ is the concentration in blood, $I_{total}$ is the total input through all routes, $k_e$ is the elimination rate constant, and $V_b$ is the blood volume. A multi-compartment model considers the distribution of contaminants across different tissues: the major compartments are liver, kidney, fat, muscle, and bone. Partition coefficients are determined from the tissue/blood ratio. Elimination kinetics include metabolic elimination (primarily via hepatic enzyme systems) and excretory elimination (via kidneys, lungs, and skin). Steady-state analysis calculates the in vivo steady-state concentration under long-term exposure, $C_{ss} = I_{daily} / (k_e\,V_d)$, where $C_{ss}$ is the steady-state concentration, $I_{daily}$ is the daily exposure input, $k_e$ is the elimination rate constant, and $V_d$ is the volume of distribution.
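The simplified kinetics can be illustrated with a one-compartment model. The sketch assumes constant input, first-order elimination, and a single well-mixed volume, so that dC/dt = I/V − k·C has a closed-form solution; the function names and parameter values are hypothetical.

```python
import math

def steady_state_concentration(daily_input, k_elim, volume):
    """C_ss = I / (k_e * V) for constant input and first-order elimination."""
    return daily_input / (k_elim * volume)

def concentration_at(t, daily_input, k_elim, volume, c0=0.0):
    """Analytic solution of dC/dt = I/V - k_e * C with C(0) = c0."""
    c_ss = steady_state_concentration(daily_input, k_elim, volume)
    return c_ss + (c0 - c_ss) * math.exp(-k_elim * t)

# Hypothetical values: 10 mg/day input, k_e = 0.1 per day, 50 L volume.
c_ss = steady_state_concentration(10.0, 0.1, 50.0)
half_life = math.log(2.0) / 0.1
c_half = concentration_at(half_life, 10.0, 0.1, 50.0)
print(c_ss, c_half)
```

Starting from zero, the concentration reaches half of its steady-state value after one elimination half-life.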
[0146] 4.3.4 Cumulative Effect Analysis Unit: This unit analyzes the cumulative effects and interactions of multi-pollutant, multi-pathway exposures. Cumulative assessment methods include dose addition (DA), response addition (RA), and mixed methods. Dose addition is applicable to pollutants with the same mechanism of action: $HI = \sum_i D_i / RfD_i$, where HI is the hazard index, $D_i$ is the exposure dose, and $RfD_i$ is the reference dose of contaminant i.
[0147] Response addition is applicable to pollutants with different mechanisms of action: $P_{mix} = 1 - \prod_i (1 - P_i)$, where $P_{mix}$ is the combined response probability and $P_i$ is the response probability of pollutant i.
[0148] Interaction analysis considers synergistic, antagonistic, and independent modes. Time-cumulative analysis uses the time integral of the internal dose, $D_{cum} = \int_0^T C(t)\,dt$, where $D_{cum}$ is the cumulative dose, $C(t)$ is the concentration in the body at time t, and T is the cumulative time. Spatial accumulation considers the cumulative effect across different exposure locations. Sensitivity analysis identifies the pollutants and pathways that contribute most to the cumulative effect, and variance decomposition is used to quantify the relative importance of each factor.
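The three cumulative measures of this unit can be sketched in a few lines: the hazard index for dose addition, the combined probability 1 − Π(1 − P_i) for response addition (which assumes independent responses), and a trapezoidal approximation of the time integral of internal concentration. Function names and sample values are illustrative assumptions.

```python
def hazard_index(doses, reference_doses):
    """Dose addition: HI = sum(D_i / RfD_i)."""
    return sum(d / rfd for d, rfd in zip(doses, reference_doses))

def response_addition(probabilities):
    """Response addition for independent responses: 1 - prod(1 - P_i)."""
    p_no_response = 1.0
    for p in probabilities:
        p_no_response *= (1.0 - p)
    return 1.0 - p_no_response

def cumulative_dose(times, concentrations):
    """Trapezoidal approximation of the integral of C(t) over the sampled times."""
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        total += 0.5 * (concentrations[i] + concentrations[i - 1]) * dt
    return total

hi = hazard_index([0.002, 0.01], [0.004, 0.05])
p_mix = response_addition([0.1, 0.2])
d_cum = cumulative_dose([0.0, 1.0, 2.0], [0.0, 2.0, 2.0])
print(hi, p_mix, d_cum)
```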
[0149] 5. Health Effect Causal Analysis Module
[0150] As shown in Figure 3, the health effect causal analysis module is the core component of the system for accurately identifying the causal relationship between environmental exposure and health effects. This module comprises three key sub-modules: a deep learning-driven causal inference sub-module, an intelligent identification and control sub-module for confounding factors, and a time lag effect and cumulative effect analysis sub-module. These sub-modules work together to form a complete analytical chain from exposure and health data to reliable causal conclusions.
[0151] 5.1 Deep Learning-Driven Causal Inference Submodule
[0152] The deep learning-driven causal inference submodule aims to integrate deep learning technology with causal inference theory to accurately identify causal relationships in complex environmental health data. This submodule includes a causal graph construction unit, a counterfactual inference unit, a causal effect estimation unit, and a model validation unit.
[0153] 5.1.1 Causal Graph Construction Unit: This unit combines domain expert knowledge and data-driven algorithms to construct an environmental health causal graph. The causal graph contains four types of nodes: exposure variables (environmental pollutant concentration, individual exposure dose), outcome variables (disease incidence, biomarker levels, mortality rate), confounding variables (demographic characteristics, lifestyle, genetic factors), and mediating variables (physiological responses, molecular mechanisms). Directed edges between nodes represent causal relationships, and edge weights represent causal strength. Expert knowledge integration is evaluated using the Delphi method. The data-driven approach employs three causal discovery algorithms: PC, GES, and FCI, identifying causal relationships through independence tests (chi-square test, G-test) and conditional independence tests. Causal graph evaluation metrics include graph density (number of edges / number of nodes), path length distribution, and the number of strongly connected components.
[0154] 5.1.2 Counterfactual Reasoning Unit: This unit implements counterfactual reasoning based on deep learning to estimate potential outcomes under different intervention conditions. Counterfactual reasoning employs a representation learning framework, mapping observed individuals to a balanced representation space to reduce treatment assignment bias. The network architecture consists of three components: a representation network Φ(x), a treatment bias (propensity) network π(x), and an outcome prediction network μ(Φ,t). The representation network is a multi-layer fully connected network that uses the ReLU activation function and Dropout to prevent overfitting. The balance constraint is achieved by minimizing the maximum mean discrepancy (MMD) between the representation distributions of the treatment group and the control group: $L_{bal} = \mathrm{MMD}(\{\Phi(x)\}_{x \in X_1}, \{\Phi(x)\}_{x \in X_0})$, where $L_{bal}$ is the balance loss, MMD is the maximum mean discrepancy, Φ is the representation network, $X_1$ is the treatment group sample set, and $X_0$ is the control group sample set. Propensity score estimation uses logistic regression, $\pi(x) = \sigma(W\,\Phi(x) + b)$, where $\pi(x)$ is the propensity score, $\sigma$ is the logistic sigmoid function, $W$ is the weight matrix, $\Phi(x)$ is the feature representation of input x, and $b$ is the bias term. The training objective function is $L = L_{pred} + \alpha L_{bal} + \beta L_{prop}$, where $L_{pred}$ is the outcome prediction loss, $L_{prop}$ is the treatment bias loss, and $\alpha$ and $\beta$ are the weighting coefficients for the balance loss and the bias loss, respectively.
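The balance constraint can be illustrated with a NumPy sketch of the squared MMD between treated and control representations. It assumes a Gaussian (RBF) kernel and the biased sample estimator, neither of which is specified in the text; the combined objective and all names are likewise hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd_squared(phi_treated, phi_control, gamma=1.0):
    """Biased estimator of squared MMD between two sets of representations."""
    k_tt = rbf_kernel(phi_treated, phi_treated, gamma).mean()
    k_cc = rbf_kernel(phi_control, phi_control, gamma).mean()
    k_tc = rbf_kernel(phi_treated, phi_control, gamma).mean()
    return k_tt + k_cc - 2.0 * k_tc

def total_loss(l_pred, l_bal, l_prop, alpha, beta):
    """Assumed objective: prediction loss plus weighted balance and propensity losses."""
    return l_pred + alpha * l_bal + beta * l_prop

rng = np.random.default_rng(0)
phi_control = rng.normal(size=(200, 3))
phi_shifted = phi_control + 2.0           # an imbalanced treated group

mmd_same = mmd_squared(phi_control, phi_control)
mmd_diff = mmd_squared(phi_shifted, phi_control)
print(mmd_same, mmd_diff)
```

In training, this quantity would be computed on mini-batches of Φ(x) with a differentiable framework so that its gradient can push the two representation distributions together.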
[0155] 5.1.3 Causal Effect Estimation Unit: This unit estimates causal effects at the individual and population levels based on counterfactual reasoning. The individual treatment effect (ITE) is calculated as $ITE_i = \mu_1(x_i) - \mu_0(x_i)$, where $\mu_1$ and $\mu_0$ are the outcome prediction functions for the treatment group and the control group, respectively, and $x_i$ is the feature vector of individual i. The average treatment effect (ATE) is calculated as $ATE = \frac{1}{n}\sum_{i=1}^{n} ITE_i = E[ITE]$, where n is the total number of samples and E[ITE] is the expected value of the individual treatment effect. Heterogeneity of treatment effects is analyzed through the conditional average treatment effect (CATE): $CATE(x) = E[Y(1) - Y(0) \mid X = x]$, where CATE(x) is the conditional average treatment effect for a given feature x, Y(1) and Y(0) are the potential outcomes with and without treatment, respectively, and X is the covariate vector. A gradient boosting tree algorithm is used to identify covariates that significantly moderate the treatment effect. A generalized additive model is used to model the dose-response relationship: $g(E[Y]) = \beta_0 + s_1(dose) + s_2(confounders)$, where g is the link function, E[Y] is the expected value of the outcome, $\beta_0$ is the intercept, s1 and s2 are smooth functions, dose is the dose variable, and confounders are the confounding factors. Statistical inference uses bootstrap resampling to estimate confidence intervals.
[0156] 5.1.4 Model Validation Unit: This unit validates the reliability and robustness of the causal inference results using multiple methods. Validation methods include sensitivity analysis, robustness testing, external validation, and biological plausibility assessment. Sensitivity analysis assesses the potential impact of unobserved confounding factors using the E-value method, $E = RR + \sqrt{RR \times (RR - 1)}$, where RR is the observed relative risk. Robustness testing assesses the stability of the results by changing model assumptions, sample composition, and analytical methods, using bootstrap testing, cross-validation, and comparison of different algorithms. External validation uses independent datasets to verify the reproducibility of the causal effects. Biological plausibility assessment covers dose-response consistency, temporal plausibility, biological mechanism support, and animal experimental evidence. The comprehensive assessment applies multiple aspects of the Bradford Hill criteria for causation.
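The E-value calculation is simple enough to show directly. The sketch below uses the standard formula RR + sqrt(RR × (RR − 1)) for an observed relative risk; inverting protective estimates (RR below 1) before applying the formula is the usual convention and is an addition not stated in the text.

```python
import math

def e_value(relative_risk):
    """E-value for an observed relative risk.

    Protective estimates (RR < 1) are inverted first, following the usual convention.
    """
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    rr = relative_risk if relative_risk >= 1.0 else 1.0 / relative_risk
    return rr + math.sqrt(rr * (rr - 1.0))

print(e_value(2.0))   # about 3.41
print(e_value(1.0))   # 1.0: a null association needs no confounding to explain it
```

An E-value of about 3.41 means an unmeasured confounder would need to be associated with both exposure and outcome by a risk ratio of at least that size to fully explain away an observed RR of 2.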
[0157] 5.2 Intelligent Identification and Control Submodule for Confounding Factors
[0158] The intelligent identification and control submodule for confounding factors aims to automatically identify potential confounding factors affecting the relationship between environmental exposure and health effects, and implement effective control strategies. This submodule includes a confounding factor identification unit, a backdoor path analysis unit, an adjustment set selection unit, and a control effect evaluation unit.
[0159] 5.2.1 Confounding Factor Identification Unit: This unit employs multiple methods to automatically identify potential confounding factors in environmental health research. Confounding factors are categorized into four main types: traditional confounding factors (age, sex, socioeconomic status, lifestyle), environmental confounding factors (meteorological conditions, geographical location, coexisting pollutants), temporal confounding factors (seasonal trends, long-term trends, periodic variations), and spatial confounding factors (spatial autocorrelation, neighborhood effects, geographic clustering). The identification methods combine statistical standards and machine learning techniques: statistical standards are based on three elements of confounding factors (related to exposure, related to outcome, not on a causal path), with a correlation threshold set to |r|>0.1 and p<0.05. Machine learning methods employ feature importance assessment: random forest variable importance ranking, LASSO regression coefficient path analysis, and gradient boosting importance scores. High-dimensional confounding control uses sparse regression methods, supporting the processing of high-dimensional data with p>>n. The confounding factor library contains multiple common confounding variables, categorized by study type (cross-sectional, cohort, case-control) and health outcomes (cardiovascular, respiratory, neurological, reproductive).
[0160] 5.2.2 Backdoor Path Analysis Unit: This unit analyzes the backdoor paths formed by confounding factors based on causal graph theory, providing a theoretical basis for adjustment set selection. A backdoor path is a non-causal path from the treatment (exposure) variable to the outcome variable; such paths connect exposure and outcome through confounding factors and create spurious associations. The path search algorithm combines depth-first search (DFS) and breadth-first search (BFS) to traverse all possible paths in the causal graph. Backdoor criterion: an adjustment set Z satisfies the backdoor criterion if it blocks all backdoor paths from X to Y and contains no descendants of X. Path analysis covers direct backdoor paths (X←Z→Y), indirect backdoor paths (X←Z1←Z2→Y), and complex backdoor paths (multi-node paths). Collider identification: colliders of the form Z→W←Y require special handling, because adjusting for a collider or its descendants opens a path that was otherwise blocked. Path weights are calculated from the strength of association between nodes, using partial correlation coefficients to measure path importance.
[0161] 5.2.3 Adjustment Set Selection Unit: This unit selects the optimal confounding factor adjustment set based on the backdoor criterion and optimization objectives. Adjustment set selection is a combinatorial optimization problem, and the goal is to find the smallest sufficient adjustment set that satisfies the backdoor criterion. The selection criteria include sufficiency (able to block all backdoor paths), minimality (the smallest number of variables), effectiveness (improving the accuracy of causal effect estimation), and feasibility (the variables are observable and the measurements are reliable). The search algorithms adopt three methods: greedy search, genetic algorithm, and dynamic programming. Greedy search selects the variable that can block the most backdoor paths each time until the backdoor criterion is satisfied. The genetic algorithm uses binary encoding to represent the adjustment set, and the fitness function considers the size of the adjustment set and the confounding control effect. Dynamic programming decomposes the problem into subproblems to find the global optimal solution. Adjustment set evaluation metrics: residual confounding (the non-causal components still contained in the adjusted exposure-outcome association), precision loss (the statistical noise introduced by the adjustment variables), sample size requirement (the minimum sample size required for the adjusted analysis).
[0162] 5.2.4 Control Effect Evaluation Unit: This unit evaluates the effectiveness of the confounding factor control strategy to ensure the reliability of the causal inference results. The evaluation methods include four aspects: balance test, residual confounding detection, sensitivity analysis, and robustness verification. The balance test evaluates how well the treatment group and the control group are balanced on confounding factors after adjustment: the standardized mean difference satisfies |d| < 0.1, the variance ratio satisfies 0.8 < VR < 1.25, and the K-S test gives p > 0.05. Residual confounding detection uses the negative control exposure method: select a false exposure that is unrelated to the outcome but related to the confounding factor, and test whether an association remains after adjustment. Sensitivity analysis uses the E-value, which quantifies the minimum association strength that unobserved confounding factors would need to reach to fully explain the observed causal effect. Robustness verification tests the stability of the results by changing the adjustment strategy (different adjustment sets, different control methods). The control methods compared include multiple regression adjustment, stratification analysis, matching, weighting, and doubly robust methods. Effect evaluation metrics: bias reduction rate, precision improvement degree, confidence interval change, p-value stability.
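The balance thresholds and the E-value can be illustrated with a short Python sketch. It assumes the pooled-standard-deviation form of the standardized difference and the standard E-value formula for a risk ratio, RR + sqrt(RR × (RR − 1)). The function names are hypothetical, and the K-S test is omitted because it needs a statistics library.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """d = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    return (mean(treated) - mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of sample variances, treated over control."""
    return variance(treated) / variance(control)


def is_balanced(treated, control, d_max=0.1, vr_low=0.8, vr_high=1.25):
    """Apply the |d| < 0.1 and 0.8 < VR < 1.25 thresholds to one covariate."""
    d = standardized_mean_difference(treated, control)
    vr = variance_ratio(treated, control)
    return abs(d) < d_max and vr_low < vr < vr_high


def e_value(rr):
    """E-value for a risk ratio; protective effects (RR < 1) are inverted first."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = max(rr, 1.0 / rr)
    return rr + math.sqrt(rr * (rr - 1.0))
```

Under this formula, an observed risk ratio of 2.0 gives an E-value of about 3.41: an unmeasured confounder would need associations of at least that strength with both exposure and outcome to fully explain the effect.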
[0163] 5.3 Time Lag Effect and Cumulative Effect Analysis Sub-module
[0164] The time lag effect and cumulative effect analysis sub-module aims to analyze the complex time relationship between environmental exposure and health effects, and identify the lag effect and cumulative effect patterns. This sub-module includes a lag structure identification unit, a cumulative effect modeling unit, a key exposure window identification unit, and a time pattern verification unit.
[0165] 5.3.1 Lag Structure Identification Unit: This unit identifies the optimal time lag pattern of the health effects of environmental exposure. Lag effect analysis covers multiple time scales: acute effects (0-7 days), subacute effects (1-4 weeks), and chronic effects (months to years). Lag models include three types: single-lag models, multi-lag models, and distributed lag models.
[0166] Single-lag model for evaluating the effect of a fixed lag period: $Y_t = \alpha + \beta X_{t-l} + \gamma Z_t + \varepsilon_t$, where $Y_t$ is the health outcome at time $t$, $\alpha$ is the intercept term, $\beta$ is the exposure effect coefficient, $X_{t-l}$ is the exposure level at lag period $l$, $\gamma$ is the covariate coefficient, $Z_t$ denotes the covariates, and $\varepsilon_t$ is the random error term.
[0167] Multi-lag models consider multiple lag periods simultaneously: $Y_t = \alpha + \sum_{i=0}^{L} \beta_i X_{t-i} + \gamma Z_t + \varepsilon_t$, where $L$ is the maximum lag period and $\beta_i$ is the effect coefficient for the i-th lag period.
[0168] The distributed lag nonlinear model (DLNM) allows for a nonlinear relationship between exposure, lag, and response: $g(\mu_t) = \alpha + \sum_{l=0}^{L} s(X_{t-l}, l) + \gamma Z_t$, where $g(\cdot)$ is the link function, $\mu_t$ is the expected value of the response variable, and $s(\cdot)$ is a two-dimensional smoothing function describing the joint effect of exposure and lag. The lag period is selected based on the AIC / BIC criterion, cross-validation, and biological prior knowledge. Constraints include monotonicity of lag coefficients, smoothness, and endpoint constraints.
[0169] 5.3.2 Cumulative Effect Modeling Unit: This unit establishes a mathematical model of the cumulative effect of environmental exposure, quantifying the long-term health impacts of exposure. The cumulative effect is defined as the overall impact of exposure on health over a period of time. Modeling methods include cumulative exposure models, moving average models, and weighted cumulative models. Cumulative exposure model: $CE_t = \sum_{i=0}^{L} X_{t-i}$, where $CE_t$ is the cumulative exposure, $L$ is the maximum lag period, and $X_{t-i}$ is the exposure level at time $t-i$, assuming all exposures contribute equally. Moving average model: $MA_t = \frac{1}{w} \sum_{i=0}^{w-1} X_{t-i}$, where $MA_t$ is the moving average exposure and $w$ is the width of the moving window. Weighted cumulative model: $WCE_t = \sum_{i=0}^{L} w_i X_{t-i}$, where $WCE_t$ is the weighted cumulative exposure and $w_i$ is the weighting coefficient for the i-th lag period. Weighting functions include linear decay, exponential decay, polynomial, and B-spline functions. Weighting function optimization uses grid search, Bayesian optimization, and genetic algorithms. The biological half-life model considers the elimination kinetics of pollutants in vivo: $C(t) = \sum_{i} D_i \, e^{-k (t - t_i)}$, where $C(t)$ is the concentration in the body at time $t$, $D_i$ is the exposure dose at time $i$, $k$ is the elimination rate constant, and $(t - t_i)$ is the time interval since the i-th exposure. Effect window analysis identifies the time period that contributes the most to the health effect.
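The four exposure metrics in this unit can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: exposure is a list indexed by time step, lag 0 is the current step, the exponential-decay weights are normalized to sum to 1, and all function names are hypothetical.

```python
import math


def _check_history(t, lags):
    # Guard against Python's negative indexing silently wrapping around.
    if t - (lags - 1) < 0:
        raise ValueError("not enough exposure history before time t")


def cumulative_exposure(x, t, max_lag):
    """Unweighted sum of x[t], x[t-1], ..., x[t-max_lag]."""
    _check_history(t, max_lag + 1)
    return sum(x[t - i] for i in range(max_lag + 1))


def moving_average_exposure(x, t, width):
    """Mean exposure over the `width` most recent time steps ending at t."""
    _check_history(t, width)
    return sum(x[t - i] for i in range(width)) / width


def weighted_cumulative_exposure(x, t, weights):
    """Sum of weights[i] * x[t-i], one weight per lag."""
    _check_history(t, len(weights))
    return sum(w_i * x[t - i] for i, w_i in enumerate(weights))


def exponential_decay_weights(max_lag, rate):
    """Lag weights proportional to exp(-rate * i), normalized to sum to 1 (an assumption)."""
    raw = [math.exp(-rate * i) for i in range(max_lag + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def body_burden(doses, times, t, k):
    """First-order elimination: sum of D_i * exp(-k * (t - t_i)) over doses given at t_i <= t."""
    return sum(d * math.exp(-k * (t - ti)) for d, ti in zip(doses, times) if ti <= t)
```

With an elimination rate constant k = ln 2 per unit time, a single unit dose decays to 0.5 after one time unit, which matches the half-life interpretation of the model.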
[0170] 5.3.3 Key Exposure Window Identification Unit: This unit identifies the key exposure time windows that have the greatest impact on health effects. Key window identification is based on three criteria: effect strength, statistical significance, and biological plausibility. Effect strength is measured by effect size (Cohen's d) or relative risk (RR). Sliding window analysis uses time windows of different lengths to scan the entire study period to identify the window with the strongest effect. Window length optimization is based on the AIC criterion and biological priors. Multi-window analysis considers multiple non-overlapping key windows. Sensitivity period analysis identifies specific sensitive time windows for specific populations (pregnant women, children, the elderly). Bootstrap resampling assesses the stability of key window identification. FDR correction controls the false positive rate of multiple comparisons.
[0171] 5.3.4 Time Pattern Validation Unit: This unit validates the robustness and reproducibility of the identified time lag and cumulative effect patterns. Validation methods include four levels: internal validation, external validation, sensitivity analysis, and biological validation. Internal validation uses Bootstrap resampling, cross-validation, and the Jackknife method to assess model stability. External validation verifies the reproducibility of the time patterns in independent datasets or different populations. Sensitivity analysis assesses the robustness of the results by changing model assumptions, analysis methods, and data processing: changing the lag range, weight function type, and constraint settings. Heterogeneity analysis assesses the consistency of the time patterns across different subgroups (age, sex, health status). Monte Carlo simulation assesses the model's performance under different data generation mechanisms. The biological validity of the time patterns is evaluated based on literature evidence, animal experimental results, and mechanistic studies. The comprehensive evaluation uses the weighted evidence method to integrate multiple validation results.
[0172] 6. Full-chain correlation model module
[0173] As shown in Figure 4, the full-chain correlation model module is the core component of the system, integrating the analysis results of the aforementioned modules to construct a complete causal chain from pollution source to environmental medium to human exposure to health effects. This module comprises four key sub-modules: a graph neural network-driven full-chain modeling sub-module, a causal path identification and quantification sub-module, a dynamic response simulation and scenario analysis sub-module, and a model validation and optimization sub-module. These sub-modules work collaboratively to form an intelligent analysis chain from multi-source data to a complete causal network.
[0174] 6.1 Graph Neural Network-Driven Full-Chain Modeling Submodule
[0175] This submodule innovatively employs heterogeneous graph neural network technology to construct a full-chain correlation model of environmental pollution and health effects, solving the technical challenge of traditional methods failing to effectively represent and analyze complex correlations. This submodule includes a heterogeneous graph construction unit, a node feature learning unit, an edge relationship modeling unit, and a graph embedding optimization unit.
[0176] 6.1.1 Heterogeneous Graph Construction Unit: This unit establishes the graph representation structure of the entire chain system. Graph nodes are divided into four categories: pollution source nodes (PS, including 25 types such as industrial sources, agricultural sources, mobile sources, residential sources, and natural sources), environment nodes (EN, including air monitoring points, water monitoring points, soil monitoring points, and biological monitoring points), exposure nodes (EX, including populations, microenvironments, and activity sites), and health nodes (HE, including disease endpoints, biomarkers, and physiological indicators). Graph edges are divided into five categories: emission edges (EM, from pollution sources to environment nodes), transport edges (TR, material transport between environment nodes), exposure edges (EX, from environment nodes to exposure nodes), effect edges (EF, from exposure nodes to health nodes), and regulation edges (MD, representing the influence of regulation variables). The adjacency matrix of the heterogeneous graph is defined as a block matrix $A$, in which each submatrix represents the connection relationship between the corresponding node types: $A_{EM}$ corresponds to the emission edges EM, $A_{TR}$ to the transport edges TR, $A_{EX}$ to the exposure edges EX, $A_{EF}$ to the effect edges EF, and $A_{MD}$ to the regulation edges MD. The graph construction algorithm adopts an incremental construction strategy, which supports the dynamic addition of new nodes and edges, and the graph size can reach millions of nodes and tens of millions of edges.
[0177] 6.1.2 Node Feature Learning Unit: This unit learns high-quality feature representations for various types of nodes in the heterogeneous graph. Pollution source node features include feature vectors such as source intensity, emission components, spatiotemporal distribution, and technological level; environmental node features include feature vectors such as pollutant concentration, meteorological conditions, geographical location, and environmental parameters; exposure node features include feature vectors such as population characteristics, activity patterns, exposure levels, and susceptibility; health node features include feature vectors such as disease type, incidence rate, severity, and influencing factors. Feature learning employs a heterogeneous graph convolutional network, using different transformation matrices for different types of nodes. The node representation learning formula is: $h_v^{(l+1)} = \sigma\left(\sum_{u \in N(v)} \alpha_{vu} W_r h_u^{(l)}\right)$, where $h_v^{(l)}$ is the representation of node $v$ at layer $l$, $N(v)$ is the set of neighbors of node $v$, $\alpha_{vu}$ are the attention weights, $W_r$ is the transformation matrix of relation $r$ (the relation type of the edge between $u$ and $v$), and $\sigma$ is the activation function. The feature dimensions are ultimately unified to 128 dimensions after multiple transformations, which facilitates subsequent calculations and analysis.
[0178] 6.1.3 Edge Relationship Modeling Unit: This unit performs accurate modeling and weight learning of edge relationships in heterogeneous graphs. Edge weights represent the strength of the relationship and are calculated using multiple methods: emission edge weights are based on pollutant emission flux and transport efficiency; transport edge weights are based on transport coefficients from the physical transport model; exposure edge weights are based on exposure intensity from the exposure assessment model; and effect edge weights are based on the magnitude of causal effects from the causal inference model. Edge relationship learning employs a graph attention mechanism, and the attention weight calculation formula is as follows: $\alpha_{ij} = \mathrm{softmax}_j\left(\mathrm{LeakyReLU}\left(a^{\top} [W h_i \,\|\, W h_j]\right)\right)$, where $a$ is the attention parameter vector, $W$ is the weight matrix, and $\|$ denotes vector concatenation. The dynamic edge weight update strategy considers the impact of time variations, seasonality, and unforeseen events. Relation type embedding employs the TransE model to learn the semantic representation of relations.
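The attention weight calculation can be sketched in plain Python. The sketch assumes the standard graph attention form, in which a score a·[W h_i ∥ W h_j] is passed through a LeakyReLU and normalized with a softmax over the neighbors of node i. The LeakyReLU slope of 0.2 and the function names are assumptions for illustration, and a real implementation would use a tensor library.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def attention_weights(h_i, neighbor_features, W, a):
    """Softmax-normalized attention of node i over its neighbors.

    h_i: feature vector of the center node
    neighbor_features: list of neighbor feature vectors h_j
    W: shared weight matrix; a: attention parameter vector of length 2 * len(W)
    """
    wh_i = matvec(W, h_i)
    scores = []
    for h_j in neighbor_features:
        concat = wh_i + matvec(W, h_j)  # list concatenation = vector concatenation
        scores.append(leaky_relu(sum(a_k * c for a_k, c in zip(a, concat))))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned weights sum to 1 over the neighborhood, so they can be used directly as the coefficients when aggregating neighbor representations.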
[0179] 6.1.4 Graph Embedding Optimization Unit: This unit optimizes the overall representation of heterogeneous graphs using a multi-task learning framework. The optimization objectives include four tasks: node classification, link prediction, graph structure reconstruction, and causal relationship identification. The multi-task loss function is: $L = \lambda_1 L_{cls} + \lambda_2 L_{link} + \lambda_3 L_{recon} + \lambda_4 L_{causal}$, where $L_{cls}$ is the node classification loss, $L_{link}$ is the link prediction loss, $L_{recon}$ is the reconstruction loss, $L_{causal}$ is the causal relationship identification loss, and $\lambda_1$ to $\lambda_4$ are the weight coefficients. The optimization algorithm uses the Adam optimizer with a learning rate of 0.001, a batch size of 256, and 500 training epochs. The graph embedding dimension is set to 128. Model regularization uses Dropout (p=0.3) and weight decay (…) to prevent overfitting.
[0180] 6.2 Causal Path Identification and Quantification Submodule
[0181] This submodule, based on a pre-trained heterogeneous graph neural network model, identifies the complete causal path from pollution source to health effect and quantifies the contribution of each path. This submodule includes a path search unit, a causal verification unit, a contribution quantification unit, and a path ranking unit.
[0182] 6.2.1 Path Search Unit: This unit systematically searches for all possible paths from pollution source nodes to health effect nodes in the heterogeneous graph. The search algorithm employs a combined strategy of improved depth-first search (DFS) and breadth-first search (BFS), supporting the setting of maximum path length and minimum path weight thresholds. Path validity is judged based on three criteria: connectivity (the path must be connected), directionality (the path direction must conform to causal time sequence), and rationality (the path must conform to physical and biological mechanisms). The search algorithm optimization adopts a pruning strategy, terminating the search of low-weight paths early. The algorithm complexity is O(log n), where V is the number of nodes and k is the maximum path length. The path is represented using a combination of a node sequence and an edge weight sequence: $P = \{(v_1, v_2, \ldots, v_n), (w_1, w_2, \ldots, w_{n-1})\}$, where $(v_1, v_2, \ldots, v_n)$ is the sequence of nodes traversed by the path, and $(w_1, w_2, \ldots, w_{n-1})$ is the sequence of edge weights between adjacent nodes. The path is stored in a compressed format, supporting efficient storage and retrieval.
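A pruned depth-first path search of the kind described here might look like the following Python sketch. The adjacency-list encoding, the function name `find_paths`, and the pruning rule (abandon a partial path once the running product of its edge weights falls below the minimum path weight) are assumptions for illustration. The sketch covers only the DFS part of the combined strategy.

```python
def find_paths(graph, source, targets, max_len, min_weight):
    """Enumerate simple directed paths from `source` to any node in `targets`.

    graph: dict mapping node -> list of (neighbor, edge_weight)
    max_len: maximum number of edges in a path
    min_weight: partial paths whose weight product drops below this are pruned
    Returns a list of (node_sequence, edge_weight_sequence) tuples.
    """
    results = []

    def dfs(node, nodes, weights, product):
        if node in targets and weights:
            results.append((tuple(nodes), tuple(weights)))
            return
        if len(weights) >= max_len:
            return
        for nxt, w in graph.get(node, []):
            if nxt in nodes:
                continue  # keep paths simple (no repeated nodes)
            if product * w < min_weight:
                continue  # pruning: low-weight branch is abandoned early
            dfs(nxt, nodes + [nxt], weights + [w], product * w)

    dfs(source, [source], [], 1.0)
    return results


# Hypothetical toy graph: one pollution source, two media, one exposure node, one endpoint.
toy_graph = {
    "steel": [("air", 0.8), ("water", 0.1)],
    "air": [("children", 0.5)],
    "water": [("children", 0.5)],
    "children": [("blood_lead", 0.9)],
}
toy_paths = find_paths(toy_graph, "steel", {"blood_lead"}, max_len=3, min_weight=0.1)
```

In the toy graph, the branch through "water" is pruned once its weight product reaches 0.05, so only the path through "air" is returned, in the node-sequence plus edge-weight-sequence form used by this unit.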
[0183] 6.2.2 Causal Validation Unit: This unit validates the causal validity of the identified pathway, distinguishing between genuine causal relationships and spurious associations. Causal validation is based on four levels: statistical level (statistical significance of each edge of the pathway), temporal level (correctness of the causal time sequence), mechanistic level (the pathway conforms to known biological mechanisms), and intervention level (the pathway's performance in intervention experiments). Statistical validation employs multiple hypothesis testing correction to control the family-wise error rate. Temporal validation checks the temporal order of each node in the pathway to ensure that the cause occurs before the effect. Mechanistic validation is based on an expert knowledge base and literature database, containing multiple validated environmental health causal mechanisms. Intervention validation uses historical intervention data and natural experimental data to evaluate the pathway's performance in actual interventions. Causal strength scoring employs a multi-evidence fusion method: $S_{causal} = w_1 S_{stat} + w_2 S_{temp} + w_3 S_{mech} + w_4 S_{int}$, where $S_{causal}$ is the overall causal strength score, $S_{stat}$ is the statistical significance score, $S_{temp}$ is the temporal constraint score, $S_{mech}$ is the mechanism rationality score, $S_{int}$ is the intervention validation score, and $w_1$ to $w_4$ are the weight coefficients of each dimension, determined through expert evaluation.
[0184] 6.2.3 Contribution Quantification Unit: This unit quantifies and verifies the relative contribution of valid paths. Contribution quantification uses the path integral method, where the total contribution of a path is the product of the weights of all edges along the path: $C_p(s,d) = \prod_{e \in p} w_e$, where $C_p(s,d)$ is the causal effect of the p-th path from pollution source $s$ to health endpoint $d$, and $w_e$ is the causal weight of edge $e$. The relative contribution is calculated using a normalization method: $R_p = C_p(s,d) / \sum_{q \in P(s,d)} C_q(s,d)$, where $R_p$ is the relative contribution rate, the numerator is the contribution of the current path, and the denominator is the sum of contributions from all paths. The total causal effect across multiple paths is obtained by summing the path effects: $TE(s,d) = \sum_{p \in P(s,d)} C_p(s,d)$, where $TE(s,d)$ is the total causal effect and $P(s,d)$ is the set of all valid causal paths from pollution source $s$ to health endpoint $d$. Contribution decomposition analysis identifies key nodes and edges in the paths, with the criticality index based on the magnitude of change in path contribution after removing the node or edge. Uncertainty quantification employs the Bootstrap resampling method. Contribution stability analysis assesses the robustness of contribution estimates by changing model parameters and datasets. Time-varying analysis tracks the evolution of path contributions over time, identifying seasonal and long-term trends in contributions.
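The contribution arithmetic in this unit is simple enough to show directly. The sketch below computes each path effect as the product of its edge weights, then the total effect and the normalized shares. The function names and the toy edge weights are hypothetical.

```python
import math


def path_effect(edge_weights):
    """Causal effect of one path: the product of its edge weights."""
    return math.prod(edge_weights)


def relative_contributions(paths):
    """Total effect and per-path shares.

    paths: dict mapping a path label -> sequence of edge weights
    Returns (total_effect, dict label -> share of the total).
    """
    effects = {name: path_effect(weights) for name, weights in paths.items()}
    total = sum(effects.values())
    if total == 0:
        raise ValueError("total effect is zero; shares are undefined")
    return total, {name: effect / total for name, effect in effects.items()}


# Hypothetical example: two paths from the same source to the same endpoint.
example_total, example_shares = relative_contributions({
    "via_air": (0.8, 0.5, 0.9),    # effect 0.36
    "via_water": (0.2, 0.5, 0.4),  # effect 0.04
})
```

Here the total effect is 0.40, and the two paths account for 90% and 10% of it respectively. A criticality index for an edge could be obtained in the same way by recomputing the total with that edge removed.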
[0185] 6.2.4 Path Ranking Unit: This unit ranks paths based on a comprehensive assessment of contribution size and causal reliability. The ranking criteria consider four dimensions: path contribution, causal reliability, path stability, and mechanism clarity. The comprehensive scoring formula is as follows: $S_{path} = w_1 S_{contrib} + w_2 S_{cred} + w_3 S_{stab} + w_4 S_{clarity}$, where $S_{path}$ is the overall score of the path, $S_{contrib}$ is the contribution score, $S_{cred}$ is the causal credibility score, $S_{stab}$ is the stability score, $S_{clarity}$ is the mechanism clarity score, and $w_1$ to $w_4$ are the weighting coefficients for each dimension. The main path identification adopts the Pareto principle, selecting paths with higher cumulative contributions as the main paths. Path clustering analysis clusters similar paths to identify typical path patterns. Path visualization uses both Sankey diagrams and network diagrams to intuitively display the path structure and contribution distribution. Dynamic ranking supports adjusting ranking weights based on user concerns (such as specific pollution sources or specific health effects).
[0186] 6.3 Dynamic Response Simulation and Scenario Analysis Submodule
[0187] This submodule uses a full-chain correlation model to perform dynamic response simulation and multi-scenario analysis to evaluate the full-chain effects of different intervention strategies. This submodule includes a scenario design unit, a dynamic simulation unit, an effect evaluation unit, and a strategy optimization unit.
[0188] 6.3.1 Scenario Design Unit: This unit designs and evaluates various intervention scenarios and strategy combinations. Scenario types include four main categories: pollution source control scenarios (e.g., enterprise closure, technology upgrade, emission restrictions), environmental governance scenarios (e.g., pollution remediation, ecological restoration, environmental supervision), exposure intervention scenarios (e.g., population migration, protective measures, behavioral changes), and health protection scenarios (e.g., medical intervention, health screening, risk warning). Scenario parameterization uses a multi-dimensional parameter vector representation: $Scenario = (Type, Intensity, Duration, Coverage, Cost)$, where Type is the scenario type, Intensity is the intensity of the intervention, Duration is its duration, Coverage is the coverage area, and Cost is the implementation cost. Combined scenarios are generated through linear or nonlinear combinations of basic scenarios, supporting the combination of multiple basic scenarios. Scenario constraints include three aspects: technical feasibility, economic affordability, and social acceptability.
[0189] 6.3.2 Dynamic Simulation Unit: This unit simulates the dynamic response of intervention scenarios based on a full-chain correlation model. The simulation framework employs a multi-timescale coupling method: short-term response (day-week), medium-term response (month-quarter), and long-term response (year-decade). The dynamic equations are as follows: $\frac{dX}{dt} = f(X, U, \theta, t)$, where $X$ is the system state vector (including pollution source intensity, environmental concentration, exposure level, and health indicators), $U$ is the intervention control vector, $\theta$ is the model parameter, and $t$ is time. The numerical solution employs the fourth-order Runge-Kutta method with adaptive time step settings. The simulation considers random disturbances and uncertainties, using the Monte Carlo method for stochastic simulation. Simulation results include time series data, spatial distribution data, and probability distribution data. Simulation validation uses backtesting with historical data, with a prediction accuracy requirement of R² > 0.8.
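The numerical scheme can be illustrated with a classical fourth-order Runge-Kutta integrator for dX/dt = f(X, U, θ, t). This is a simplified sketch: it uses a fixed time step rather than the adaptive step described above, it omits the Monte Carlo layer, and the one-compartment model `decay_model` (constant input minus first-order removal) is a hypothetical stand-in for the full-chain dynamics.

```python
def rk4_step(f, x, u, theta, t, dt):
    """Advance the state x by one classical RK4 step of size dt."""
    def shifted(k, scale):
        return [xi + scale * ki for xi, ki in zip(x, k)]

    k1 = f(x, u, theta, t)
    k2 = f(shifted(k1, dt / 2), u, theta, t + dt / 2)
    k3 = f(shifted(k2, dt / 2), u, theta, t + dt / 2)
    k4 = f(shifted(k3, dt), u, theta, t + dt)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(f, x0, u, theta, t0, t1, n_steps):
    """Integrate from t0 to t1 in n_steps equal steps; returns [(t, state), ...]."""
    dt = (t1 - t0) / n_steps
    x, t = list(x0), t0
    trajectory = [(t, x)]
    for step in range(n_steps):
        x = rk4_step(f, x, u, theta, t, dt)
        t = t0 + (step + 1) * dt
        trajectory.append((t, x))
    return trajectory


def decay_model(x, u, theta, t):
    """Hypothetical one-compartment model: dC/dt = emission input - removal rate * C."""
    return [u[0] - theta[0] * x[0]]


# Scenario comparison: baseline emission versus a 50% emission cut.
baseline = simulate(decay_model, [0.0], [2.0], [0.5], 0.0, 50.0, 500)
intervention = simulate(decay_model, [0.0], [1.0], [0.5], 0.0, 50.0, 500)
```

In this toy model the concentration settles at input divided by removal rate, so halving the emission input halves the long-run concentration. The same `simulate` call can be repeated with sampled parameters to approximate the stochastic simulation.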
[0190] 6.3.3 Effectiveness Evaluation Unit: This unit evaluates the multidimensional effects of the intervention strategy. Evaluation dimensions include four aspects: environmental effects (pollutant emission reduction, environmental quality improvement), health effects (number of diseases avoided, life years extended), economic effects (health economic benefits, implementation costs), and social effects (social equity, public satisfaction). Environmental effect indicators include pollutant emission reduction rate, compliance rate improvement, and environmental quality index improvement. Health effect indicators include the number of deaths avoided, the number of morbidities avoided, life years saved, and quality of life improvement. Economic effects are assessed using cost-benefit analysis, calculating indicators such as net present value (NPV), benefit-cost ratio (BCR), and payback period. Social effects are evaluated through questionnaires and social surveys. The overall effectiveness score uses the Multi-Criterion Decision Analysis (MCDA) method, with weights determined using the Analytic Hierarchy Process (AHP).
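The economic indicators named above follow standard definitions, sketched here in Python. The annual cash-flow layout (index 0 is the present year), the discount rate argument, and the function names are assumptions for illustration. The payback period here is the simple undiscounted version.

```python
def npv(cash_flows, rate):
    """Net present value of yearly net cash flows; cash_flows[0] is undiscounted."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def benefit_cost_ratio(benefits, costs, rate):
    """Present value of benefits divided by present value of costs."""
    pv_benefits = sum(b / (1 + rate) ** t for t, b in enumerate(benefits))
    pv_costs = sum(c / (1 + rate) ** t for t, c in enumerate(costs))
    if pv_costs == 0:
        raise ValueError("present value of costs is zero")
    return pv_benefits / pv_costs


def payback_period(cash_flows):
    """First year index at which cumulative (undiscounted) cash flow is non-negative.

    Returns None if the investment is never recovered within the horizon.
    """
    cumulative = 0.0
    for year, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return year
    return None
```

For example, an intervention costing 100 now and returning 60 in each of the next two years has a payback period of 2 years, and its NPV depends on the discount rate chosen.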
[0191] 6.3.4 Strategy Optimization Unit: This unit optimizes the combination of intervention strategies based on the effect evaluation results. The optimization objective adopts a multi-objective optimization framework. The constraints include technical constraints, resource constraints, and time constraints. The optimization algorithm employs a non-dominated sorting genetic algorithm (NSGA-II). Pareto front analysis identifies the set of non-dominated solutions, providing decision-makers with multiple optimization options. Sensitivity analysis assesses the sensitivity of the optimization results to parameter changes. Robustness analysis evaluates the strategy's performance under uncertainty. Strategy recommendations are personalized based on decision-maker preferences, supporting interactive decision support.
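The non-dominated set at the heart of Pareto front analysis can be sketched as follows. This shows only the dominance test and the extraction of the first front, not the full NSGA-II procedure (no crowding distance, selection, crossover, or mutation). All objectives are assumed to be minimized, and the two example objectives (cost, residual health risk) are hypothetical.

```python
def dominates(a, b):
    """True if solution a is no worse than b in every objective and strictly
    better in at least one (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical strategies scored as (implementation cost, residual health risk).
strategies = [(1, 5), (2, 2), (5, 1), (4, 4), (3, 3)]
front = pareto_front(strategies)
```

In this example the strategies (4, 4) and (3, 3) are both dominated by (2, 2), leaving three trade-off options for the decision-maker.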
[0192] 6.4 Model Validation and Optimization Submodule
[0193] This submodule ensures the reliability and robustness of the results of the full-chain correlation analysis. It includes an internal validation unit, an external validation unit, a sensitivity analysis unit, and a model update unit.
[0194] 6.4.1 Internal Validation Unit: This unit uses the training data to perform internal validation of the model. Validation methods include K-fold cross-validation, leave-one-out cross-validation, and time-series cross-validation. Validation metrics include prediction accuracy (R², RMSE, MAE), classification accuracy (Accuracy, Precision, Recall, F1-score), and causal identification accuracy (True Positive Rate, False Positive Rate). Bootstrap resampling validation assesses model stability. Model convergence analysis checks the convergence and stability of the training process. Overfitting detection is performed by comparing the errors on the training and validation sets. Model complexity analysis evaluates the number of parameters and computational complexity of the model.
[0195] 6.4.2 External Validation Unit: This unit uses independent datasets to validate the model's generalization ability. External validation data includes data from other regions, different time periods, and different pollution types. Validation strategies employ three methods: geographic extrapolation, temporal extrapolation, and categorical extrapolation. Geographic extrapolation evaluates the model's applicability in different geographical regions, requiring validation in at least three regions with different climates and geographical conditions. Temporal extrapolation evaluates the model's temporal stability, using data from the next 1-2 years for prospective validation. Categorical extrapolation evaluates the model's predictive ability for new pollutants or new health endpoints. External validation requirements: geographic extrapolation R² > 0.75, temporal extrapolation R² > 0.70. If the validation results do not meet the requirements, model retraining or parameter tuning will be triggered.
[0196] 6.4.3 Sensitivity Analysis Unit: This unit analyzes the model's sensitivity to parameter changes and assumptions. Sensitivity analysis methods include local sensitivity analysis, global sensitivity analysis, and Monte Carlo sensitivity analysis. Local sensitivity analysis assesses the sensitivity of parameters near a baseline value, using a sensitivity index: $S_i = \frac{\partial Y}{\partial \theta_i} \cdot \frac{\theta_i}{Y}$, where $S_i$ is the local sensitivity index of parameter $\theta_i$, $Y$ is the model output, and $\partial Y / \partial \theta_i$ is the partial derivative of the output with respect to the parameter. Global sensitivity analysis uses the Sobol method to calculate the first-order and total-order sensitivity indices of the parameters. Monte Carlo sensitivity analysis assesses the impact of parameter uncertainty through random sampling (10,000 times). Sensitivity thresholds are set as follows: high sensitivity |S|>0.1, medium sensitivity 0.05<|S|≤0.1, low sensitivity |S|≤0.05. Key sensitive parameters are identified for model simplification and uncertainty reduction. Assumption sensitivity analysis evaluates the impact of changes in core assumptions on the results.
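A local sensitivity index can be estimated numerically with a central finite difference, as in the Python sketch below. It assumes the normalized (dimensionless) form S = (∂Y/∂θ)·(θ/Y), which is what makes the fixed thresholds 0.05 and 0.1 comparable across parameters. The function names, the relative step size, and the toy power-law model are hypothetical.

```python
def local_sensitivity(model, params, name, rel_step=1e-4):
    """Normalized local sensitivity of model(params) to params[name].

    model: callable taking a dict of parameters and returning a nonzero float
    The derivative is approximated by a central difference around the baseline.
    """
    theta = params[name]
    base = model(params)
    h = abs(theta) * rel_step or rel_step  # fall back to rel_step if theta == 0
    up = dict(params, **{name: theta + h})
    down = dict(params, **{name: theta - h})
    derivative = (model(up) - model(down)) / (2 * h)
    return derivative * theta / base


def classify_sensitivity(s):
    """Map |S| to the high / medium / low classes used in this unit."""
    magnitude = abs(s)
    if magnitude > 0.1:
        return "high"
    if magnitude > 0.05:
        return "medium"
    return "low"


def toy_model(p):
    """Hypothetical power-law model; the exponents equal the true normalized sensitivities."""
    return p["a"] ** 2 * p["b"] ** 0.08 * p["c"] ** 0.01


baseline_params = {"a": 2.0, "b": 3.0, "c": 4.0}
labels = {name: classify_sensitivity(local_sensitivity(toy_model, baseline_params, name))
          for name in baseline_params}
```

For a power-law model the normalized index equals the exponent, which gives a convenient check: parameter a is classed as high, b as medium, and c as low.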
[0197] 6.4.4 Model Update Unit: This unit continuously updates and optimizes the model based on validation results and new data. The update strategy includes three levels: parameter update, structure update, and data update. Parameter updates employ online learning algorithms, supporting incremental parameter adjustments. Structure updates adjust the model architecture based on validation results, such as adding new node types or modifying the number of network layers. Data updates integrate new monitoring data, research data, and literature data. Update trigger conditions include validation performance degradation exceeding a threshold (>10%), accumulation of new data reaching a certain scale (>1000 records), and the occurrence of major environmental events. The update process employs an A / B testing mechanism, with the new model and the original model running in parallel, and performance comparison determining whether a formal update is necessary. Version control records all model changes and supports version rollback. The update frequency is a regular quarterly update, with emergency updates possible in special circumstances.
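The update trigger conditions can be expressed as a small decision function. In this sketch the degradation is measured relative to a baseline validation score, which is an assumption (the text does not say whether the 10% threshold is relative or absolute). The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStatus:
    baseline_score: float      # validation score when the model was deployed (e.g. R²)
    current_score: float       # most recent validation score
    new_records: int           # records accumulated since the last update
    major_event: bool = False  # a major environmental event has occurred


def update_reasons(status, max_degradation=0.10, min_new_records=1000):
    """Return the list of trigger conditions that currently hold (empty if none)."""
    reasons = []
    degradation = (status.baseline_score - status.current_score) / status.baseline_score
    if degradation > max_degradation:
        reasons.append("performance_degradation")
    if status.new_records > min_new_records:
        reasons.append("new_data")
    if status.major_event:
        reasons.append("major_event")
    return reasons
```

A non-empty result would start the A / B comparison between the candidate model and the current one. The regular quarterly update would run regardless of these triggers.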
[0198] Application Examples
[0199] The following examples demonstrate the application of the environmental pollution-health effect dynamic correlation analysis method and system in a real-world environmental health management scenario, validating the system's practical value and technological advantages.
[0200] Full-chain traceability and health risk assessment of heavy metal pollution
[0201] A steel-producing city faces complex heavy metal pollution problems, involving historical accumulation and current emissions of various heavy metals such as lead, cadmium, and mercury. The city has a population of 2 million and has historically housed numerous heavy industrial enterprises. In recent years, the abnormally high incidence of lead poisoning in children and kidney disease in adults has attracted widespread attention. Traditional investigation methods face four major challenges: first, they cannot accurately distinguish the relative contribution of existing pollution sources to historical pollution; second, they lack the ability to trace the entire process of pollutant emission to human exposure; third, it is difficult to accurately establish the causal relationship between health effects and pollution sources; and fourth, there is a lack of targeted pollution control strategies. The municipal government urgently needs a scientific and accurate pollution source tracing and health risk assessment plan.
[0202] System Application Methodology: The environmental pollution-health effect full-chain dynamic correlation analysis system was applied to the city's comprehensive heavy metal pollution control project. The application scheme comprises five core components: precise pollution source analysis, multi-media transport simulation, population exposure assessment, health effect analysis, and full-chain correlation modeling. The system integrates nearly 30 years of environmental monitoring data, industrial enterprise emission data, population health monitoring data, and geological survey data from the city, constructing a complete foundation for the full-chain analysis of heavy metal pollution. In particular, it collected 15 sediment profiles and 200 soil samples for time-stratified pollution source analysis, providing crucial data support for historical pollution reconstruction.
[0203] Specific implementation process: (a) Multidimensional Source Analysis Stage: The system employs isotope tracing technology combined with a multivariate receptor model for precise source apportionment of heavy metal pollution. First, lead isotope ratios (206Pb / 207Pb, 208Pb / 206Pb), strontium isotope ratios (87Sr / 86Sr), and cadmium isotope ratios are determined from collected environmental samples to establish a multivariate isotope fingerprint database. A high-precision age scale for sediment profiles is established through combined dating of 210Pb and 137Cs. The system identifies seven major pollution sources: iron and steel smelting, non-ferrous metal smelting, coal-fired power plants, traffic emissions, agricultural fertilization, natural background pollution, and historical emissions. Temporal stratification analysis revealed the historical evolution of pollution: from 1970 to 1990, iron and steel smelting was the dominant source (contributing 65%); from 1990 to 2010, multiple sources coexisted (iron and steel 35%, non-ferrous metals 25%, coal combustion 20%); and after 2010, the composition of pollution sources changed significantly with the implementation of environmental protection measures. The uncertainty of the isotopic mixing model was controlled within ±8%, and the source apportionment accuracy exceeded 90%.
[0204] (b) Dynamic Transport Stage in Environmental Media: A coupled transport model of heavy metals among multiple media—atmosphere, water, soil, and organisms—was established. Based on topographic and meteorological data, an adaptive grid covering the entire city was constructed (basic resolution 1 km, refined to 100 m for key areas). Atmospheric transport simulation considered the gaseous-particle distribution, dry and wet deposition, and long-distance transport processes of heavy metals; water transport simulation included river transport, deposition-resuspension, and riverbank exchange; soil transport simulation covered vertical migration, surface runoff, and biosorption. The multi-media coupled model successfully reproduced the spatiotemporal distribution evolution of heavy metals, with a correlation coefficient of 0.89 between model predictions and observed data. In particular, it simulated the accumulation process of heavy metals in the environment during historical high-emission periods (1980-2000), explaining the current spatial distribution pattern of heavy metals in the environment.
[0205] (c) Precise Human Exposure Assessment Phase: Based on individual activity trajectory data and microenvironmental pollution levels, the system achieves precise assessment of heavy metal exposure through multiple pathways. Exposure assessment covers four major pathways: inhalation, diet, skin contact, and accidental ingestion. The system collected spatiotemporal activity data from 1200 residents of different age groups and, combined with high-resolution environmental concentration distribution, calculated individualized exposure doses. Specifically for children, the system considered unique exposure characteristics such as hand-to-mouth contact behavior, differences in respiration rate, and soil ingestion. Based on a physiological toxicokinetic model, the system converted external exposure into internal doses in blood and urine, achieving a correlation of 0.84 between the exposure assessment results and biomonitoring data. Sensitive population identification showed that the lead exposure risk for children aged 0-6 years and pregnant women was 2.3 times and 1.7 times that of the general adult population, respectively.
[0206] (d) Causal Analysis of Health Effects: The system employs a deep learning-driven causal inference method to analyze the causal relationship between heavy metal exposure and health effects. Health endpoints include childhood intellectual development, renal impairment, cardiovascular disease, and neurological disorders. The system integrates health examination data from 150,000 individuals over eight years, controlling for 32 confounding factors such as age, sex, socioeconomic status, and lifestyle. Through counterfactual reasoning and propensity score matching, the system identified a significant causal relationship between lead exposure and decreased IQ in children (IQ decreased by 0.67 points per 1 μg/dL increase in blood lead, 95% CI: 0.42–0.92), and an association between cadmium exposure and the risk of chronic kidney disease (OR = 1.23, 95% CI: 1.08–1.41). Time-lag analysis revealed a significant cumulative effect of heavy metal exposure on health, with the impact of childhood exposure on adult health persisting for 20–30 years.
[0207] (e) Full-chain correlation modeling stage: The system constructed a complete causal chain network from pollution sources to health effects. Based on graph neural network technology, the system established a heterogeneous graph model including pollution sources, environmental locations, exposed populations, and health endpoints. Full-chain path analysis identified 23 major causal paths, among which the path of steel smelting → atmospheric particulate matter → children's respiratory exposure → elevated blood lead → intellectual development impairment had the highest contribution rate (38.5%). The system quantified the relative contributions of different pollution sources to various health effects: steel smelting contributed 42% to children's blood lead levels, non-ferrous metal smelting contributed 35% to kidney function impairment, and historical pollution still contributed 28% to soil heavy metals.
[0208] Application Effects and Value: The system has achieved the following results in the whole-chain treatment of heavy metal pollution. Pollution source identification has become more accurate and successfully distinguishes the relative contributions of existing pollution sources and historically accumulated pollution. Exposure assessment is accurate: the high-exposure population identified by the system overlaps by 89.7% with the high-risk population found in actual health monitoring. Causal relationship identification has eliminated 67% of confounding bias and narrowed the confidence interval of the causal effect estimates by 38%. Precise treatment based on whole-chain analysis has reduced the rate of lead poisoning in children from 15.7% to 3.2%. The pollution source contribution list generated by the system has been incorporated into the urban heavy metal pollution prevention and control plan, improving the scientific soundness and cost-effectiveness of the treatment plan.
[0209] The application cases demonstrate that the dynamic correlation analysis method and system for the entire chain of environmental pollution and health effects can address environmental health problems across different pollution types and environmental conditions, realize complete causal chain analysis from pollution source to health effect, and provide scientific and technological support for precise environmental health management. The method and system therefore have practical value and broad application prospects.
[0210] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for dynamic correlation analysis of the entire chain of environmental pollution and health effects, characterized in that it comprises:
Step S1: Multidimensional analysis of pollution sources. A method fusing isotope tracing, time-stratified source contribution calculation, and a multivariate receptor model is adopted, and the relative contributions of existing pollution sources and historically accumulated pollution are distinguished by a multivariate isotope mixing model and a time decay function;
Step S2: Dynamic transport of pollutants in environmental media. A multi-media coupled transport model covering four media (atmosphere, water, soil, and organisms) is established, and a multi-scale spatiotemporal adaptive transport algorithm is used to simulate the migration and transformation of pollutants among the environmental media;
Step S3: Precise assessment of human exposure. Based on individual spatiotemporal activity trajectories and a microenvironment exposure model, combined with individualized physiological parameters, individualized exposure doses are calculated for multiple pathways such as inhalation, diet, and dermal contact;
Step S4: Causal analysis of health effects. A deep learning-driven causal inference method is used to establish the causal relationship between exposure and health effects through intelligent identification and control of confounding factors, and to analyze time-lag effects and cumulative effects;
Step S5: Full-chain correlation modeling. A heterogeneous graph neural network is used to construct a complete causal chain network from pollution source to health effect, identify and quantify the contribution of each causal path, and conduct dynamic response simulation and scenario analysis;
wherein the data flow relationships between the steps are as follows: the pollution source contribution matrix output in step S1 serves as the source term input for the multi-media coupled transport model in step S2; the environmental concentration field output in step S2 serves as the exposure concentration input for the microenvironment exposure model in step S3; the individual exposure dose output in step S3 serves as the treatment variable for the causal inference model in step S4; and the causal effect weights output in step S4 serve as the edge weights for the heterogeneous graph neural network in step S5. Data fusion between the steps is achieved through a spatiotemporal alignment algorithm, forming a complete analytical loop.
2. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 1, characterized in that, in step S1, the isotope tracing employs a multi-isotope mixing model: $R_{sample} = \sum_{i=1}^{n} f_i R_i + \varepsilon$, where $R_{sample}$ is the vector of isotope ratios in the sample; $f_i$ is the contribution ratio of pollution source $i$, satisfying the constraint $\sum_{i=1}^{n} f_i = 1$; $R_i$ is the isotopic characteristic vector of pollution source $i$; $\varepsilon$ is the error term, including measurement error and model error; and $n$ is the total number of pollution sources;
the time-stratified source contribution calculation model is: $C_{total}(t) = \sum_{i=1}^{n} S_i(t)\, e^{-\lambda_i \Delta t}$, where $i = 1$ to $n$ and $n$ is the total number of pollution sources; $C_{total}(t)$ is the total contribution at time $t$; $S_i(t)$ is the source strength time function; $\lambda_i$ is the attenuation constant; and $\Delta t$ is the time interval;
the fusion of the multi-receptor models includes two steps, fusion weight calculation and model result fusion: after the analytical results of each model are obtained by running multiple receptor models in parallel, a weighted fusion is performed, calculated as follows:
fusion weight calculation: $w_m = \alpha P_m + \beta C_m$, where $P_m$ is the performance score of model $m$; $C_m$ is the consistency score of model $m$; and $\alpha$ and $\beta$ are weight coefficients satisfying $\alpha + \beta = 1$, $\alpha \in [0, 1]$, $\beta \in [0, 1]$;
model fusion formula: $F_{fused} = \sum_{m=1}^{M} w_m F_m$, where $F_m$ is the source contribution analysis result of model $m$; $w_m$ is the corresponding weighting coefficient; and $M$ is the total number of receptor models.
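As a non-limiting illustration, the weighted fusion of receptor model results can be sketched as follows. The sketch assumes the fusion weight has the form w_m = α·P_m + β·C_m with β = 1 − α; the normalization of the weights to sum to 1, the function names, and all numeric inputs are assumptions made for this example and are not taken from the claim.

```python
def fusion_weights(performance, consistency, alpha=0.5):
    """Return normalized weights w_m = alpha*P_m + (1 - alpha)*C_m.

    Normalizing the weights so they sum to 1 is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    raw = [alpha * p + beta * c for p, c in zip(performance, consistency)]
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("scores must give a positive total weight")
    return [w / total for w in raw]


def fuse_contributions(results, weights):
    """Weighted sum of per-model source-contribution vectors."""
    n_sources = len(results[0])
    return [
        sum(w * r[i] for w, r in zip(weights, results))
        for i in range(n_sources)
    ]


# Hypothetical scores and results for three receptor models and three sources.
performance = [0.9, 0.8, 0.7]   # P_m
consistency = [0.8, 0.9, 0.6]   # C_m
results = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
]

weights = fusion_weights(performance, consistency, alpha=0.5)
fused = fuse_contributions(results, weights)
print(weights, fused)
```

Because each model's contribution vector sums to 1 and the weights are normalized, the fused contributions also sum to 1.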
3. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 1, characterized in that, in step S2, the multi-media coupled transport model includes internal transport equations for four environmental media (atmosphere, water, soil, and organisms) and calculates the exchange flux between media as $F_{ij} = k_{ij}\,(f_i C_i - f_j C_j)$, where $k_{ij}$ is the transfer rate constant; $f_i$ and $f_j$ are fugacity correction factors; and $C_i$ and $C_j$ are the pollutant concentrations in the corresponding media;
the multi-media coupled solution employs an operator-splitting algorithm: $C^{n+1} = \mathcal{L}_R\,\mathcal{L}_E\,\mathcal{L}_T\,C^{n}$, where $\mathcal{L}_T$, $\mathcal{L}_E$, and $\mathcal{L}_R$ are the transport, exchange, and reaction operators, respectively;
the multi-scale spatiotemporal adaptive transport algorithm employs the following method:
Step (1), mesh refinement criterion: a refinement indicator $\eta_i$ is computed for each grid cell $i$ from the pollutant concentration $C$, the terrain elevation $h$, and the pollution source density $\rho_s$;
Step (2), refinement control conditions: when $\eta_i$ exceeds the preset refinement threshold $\theta$, grid cell $i$ is refined; when $\eta_i$ falls below the coarsening bound derived from $\theta$, grid cell $i$ is coarsened;
Step (3), time step adaptation: $\Delta t = 0.8 \times \min(\Delta t_{CFL}, \Delta t_{diff}, \Delta t_{chem})$, where $\Delta t_{CFL}$ is the time step constrained by the CFL condition; $\Delta t_{diff}$ is the time step limit of the diffusion process, which depends on the square of the grid size and the diffusion coefficient; $\Delta t_{chem}$ is the time step limit of the chemical reaction, determined by the reciprocal of the reaction rate constant; and the coefficient 0.8 is a safety factor.
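As a non-limiting illustration, the time step adaptation can be sketched as the safety factor 0.8 applied to the smallest of the three limiting time steps. The specific expressions used for each limit (dx/|u| for the CFL condition, dx²/(2D) for diffusion, and 1/k for the reaction) are common textbook choices assumed for this example; the claim only states what each limit depends on.

```python
import math


def adaptive_time_step(dx, velocity, diffusivity, rate_constant, safety=0.8):
    """Return safety * min(dt_cfl, dt_diff, dt_chem).

    dx            grid size (m)
    velocity      advection speed (m/s)
    diffusivity   diffusion coefficient (m^2/s)
    rate_constant first-order reaction rate constant (1/s)
    A limit that does not apply (zero speed, etc.) is treated as infinite.
    """
    dt_cfl = dx / abs(velocity) if velocity != 0 else math.inf
    # Assumed 1-D explicit diffusion stability bound: dx^2 / (2 D).
    dt_diff = dx ** 2 / (2.0 * diffusivity) if diffusivity > 0 else math.inf
    # Reaction limit taken as the reciprocal of the rate constant.
    dt_chem = 1.0 / rate_constant if rate_constant > 0 else math.inf
    return safety * min(dt_cfl, dt_diff, dt_chem)


# Hypothetical values for a refined 100 m grid cell.
dt = adaptive_time_step(dx=100.0, velocity=5.0, diffusivity=50.0,
                        rate_constant=1e-3)
print(dt)  # advection is the binding limit here: 0.8 * (100 / 5)
```

In this example the three limits are 20 s, 100 s, and 1000 s, so the advective (CFL) limit controls the step.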
4. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 1, characterized in that the microenvironment exposure model in step S3 uses the following formula to calculate the total individual exposure: $E_{total} = \sum_{k=1}^{n} C_k \cdot \Delta t_k \cdot IR(a_k) \cdot CF(m_k)$, where $C_k$ is the pollutant concentration at the $k$-th spatiotemporal point; $\Delta t_k$ is the duration of stay; $IR(a_k)$ is the respiratory rate corresponding to activity intensity $a_k$; $CF(m_k)$ is the absorption correction factor of microenvironment $m_k$; and $n$ is the total number of spatiotemporal trajectory points.
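As a non-limiting illustration, the total individual exposure can be sketched as a sum over trajectory points of concentration × duration of stay × respiratory rate × absorption correction factor. The class name, field names, units, and all numeric values below are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryPoint:
    concentration: float      # pollutant concentration at this point (ug/m^3)
    duration_h: float         # duration of stay (h)
    respiratory_rate: float   # rate for the activity intensity (m^3/h)
    absorption_factor: float  # microenvironment absorption correction (-)


def total_exposure(points):
    """Sum C_k * dt_k * IR(a_k) * CF(m_k) over all trajectory points (ug)."""
    return sum(
        p.concentration * p.duration_h * p.respiratory_rate * p.absorption_factor
        for p in points
    )


# Hypothetical one-day trajectory: home, commute, school.
day = [
    TrajectoryPoint(0.05, 14.0, 0.4, 0.5),
    TrajectoryPoint(0.20, 2.0, 0.8, 1.0),
    TrajectoryPoint(0.10, 8.0, 0.6, 0.7),
]

dose = total_exposure(day)
print(round(dose, 3))
```

The three terms contribute 0.14, 0.32, and 0.336 µg, so the short commute accounts for a large share of the daily dose despite lasting only two hours.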
5. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 4, characterized in that, in step S3, the individualized respiratory rate $IR$ is calculated based on allometric growth theory using the following formula: $IR = IR_{ref} \times (BW / BW_{ref})^{b} \times AF$, where $IR_{ref}$ is the reference respiratory rate; $BW$ is the individual body weight; $BW_{ref}$ is the reference body weight; $b$ is the allometric scaling exponent; and $AF$ is the activity adjustment factor.
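As a non-limiting illustration, allometric scaling of the respiratory rate can be sketched as a reference rate multiplied by a power of the body-weight ratio and by an activity adjustment factor. The exponent default of 0.75 is a commonly used allometric value assumed for this example and is not taken from the claim; the reference rate, reference weight, and body weights are likewise hypothetical.

```python
def individualized_respiratory_rate(ir_ref, bw, bw_ref, af=1.0, exponent=0.75):
    """Return IR = ir_ref * (bw / bw_ref) ** exponent * af.

    ir_ref    reference respiratory rate (m^3/day)
    bw        individual body weight (kg)
    bw_ref    reference body weight (kg)
    af        activity adjustment factor (-)
    exponent  allometric scaling exponent (assumed value, see lead-in)
    """
    if bw <= 0 or bw_ref <= 0:
        raise ValueError("body weights must be positive")
    return ir_ref * (bw / bw_ref) ** exponent * af


# Hypothetical reference adult: 16 m^3/day at 70 kg.
adult_ir = individualized_respiratory_rate(16.0, bw=70.0, bw_ref=70.0)
child_ir = individualized_respiratory_rate(16.0, bw=20.0, bw_ref=70.0)

# Rate per kilogram of body weight, relevant for dose comparisons.
adult_per_kg = adult_ir / 70.0
child_per_kg = child_ir / 20.0
print(adult_ir, child_ir, adult_per_kg, child_per_kg)
```

With an exponent below 1, the child breathes less air in absolute terms but more per kilogram of body weight, which is one reason children are treated as a sensitive group in exposure assessment.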
6. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 1, characterized in that, in step S4, the deep learning-driven causal inference employs a representation learning framework whose multi-task objective function is: $L = L_{pred}(Y, \hat{Y}) + \lambda_{balance}\, L_{balance} + \lambda_{propensity}\, L_{propensity}$, with $L_{balance} = \mathrm{MMD}(P_{\Phi}^{T=1}, P_{\Phi}^{T=0})$, where $\lambda_{balance}$ and $\lambda_{propensity}$ are the loss weight coefficients; $L_{pred}$ is the prediction loss; $Y$ is the true value of the health effect; $\hat{Y}$ is the predicted value obtained from the input $X$ through the feature mapping $\Phi$ combined with the treatment variable $T$; $L_{balance}$ is the balance loss; MMD is the maximum mean discrepancy; $P_{\Phi}^{T=1}$ and $P_{\Phi}^{T=0}$ are the feature distributions of the treatment group and the control group after the mapping $\Phi$, respectively; $L_{propensity}$ is the propensity loss between the treatment indicator variable $T$ and the predicted propensity score $\hat{e}(X)$; $T = 1$ indicates that the exposure treatment was received; and $T = 0$ indicates that no exposure treatment was received.
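As a non-limiting illustration, the multi-task objective can be sketched as a prediction loss plus a weighted balance loss and a weighted propensity loss. The concrete choices below are assumptions made for this example: mean squared error for the prediction loss, a linear-kernel maximum mean discrepancy (the squared distance between group means of the representation) for the balance loss, and binary cross-entropy for the propensity loss. All names and data are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Prediction loss: mean squared error (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def linear_mmd(treated, control):
    """Balance loss: squared distance between the mean representations
    of the treatment and control groups (linear-kernel MMD)."""
    dim = len(treated[0])
    mean_t = [sum(r[j] for r in treated) / len(treated) for j in range(dim)]
    mean_c = [sum(r[j] for r in control) / len(control) for j in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_t, mean_c))


def bce(t, e_hat, eps=1e-12):
    """Propensity loss: binary cross-entropy between T and e_hat(X)."""
    return -sum(
        ti * math.log(max(e, eps)) + (1 - ti) * math.log(max(1.0 - e, eps))
        for ti, e in zip(t, e_hat)
    ) / len(t)


def total_loss(y, y_hat, phi, t, e_hat, lam_balance=1.0, lam_propensity=0.5):
    """L = L_pred + lam_balance * L_balance + lam_propensity * L_propensity."""
    treated = [p for p, ti in zip(phi, t) if ti == 1]
    control = [p for p, ti in zip(phi, t) if ti == 0]
    return (mse(y, y_hat)
            + lam_balance * linear_mmd(treated, control)
            + lam_propensity * bce(t, e_hat))


# Hypothetical batch of four individuals; both groups must be non-empty.
t = [1, 1, 0, 0]
y = [1.0, 2.0, 3.0, 4.0]
e_hat = [0.5, 0.5, 0.5, 0.5]
balanced_phi = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
shifted_phi = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

loss_balanced = total_loss(y, y, balanced_phi, t, e_hat)
loss_shifted = total_loss(y, y, shifted_phi, t, e_hat)
print(loss_balanced, loss_shifted)
```

With perfect predictions and identical group means, only the propensity term remains; shifting the treated group's representation away from the control group raises the loss through the balance term.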
7. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 1, characterized in that, in step S5, the heterogeneous graph neural network uses an attention mechanism to update nodes, and the representation of node $v$ in the $(l+1)$-th layer is calculated using the following formula: $h_v^{(l+1)} = \Vert_{k=1}^{K}\, \sigma\!\left( \sum_{r} \sum_{u \in N_r(v)} \alpha_{vu}^{r,k}\, W_r^{k}\, h_u^{(l)} \right)$, where $\Vert_{k=1}^{K}$ denotes the concatenation of the outputs of the $K$ attention heads; $\sum_{u \in N_r(v)}$ denotes summation over all nodes $u$ in the relation-$r$ neighborhood of node $v$; $N_r(v)$ is the set of neighbors connected to node $v$ through relation $r$; $\alpha_{vu}^{r,k}$ is the attention weight; $W_r^{k}$ is the weight matrix of relation $r$ in the $k$-th attention head; $h_u^{(l)}$ is the feature vector of node $u$ at layer $l$; and $K$ is the number of attention heads;
the attention weight is $\alpha_{vu}^{r,k} = \mathrm{softmax}_{u \in N_r(v)}\!\left( \mathrm{LeakyReLU}\!\left( (a_r^{k})^{\top} \left[ W_r^{k} h_v^{(l)} \,\Vert\, W_r^{k} h_u^{(l)} \right] \right) \right)$, where $a_r^{k}$ is the attention parameter vector of relation $r$ in the $k$-th attention head, and $\Vert$ denotes the vector concatenation operation.
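As a non-limiting illustration, the attention-based node update can be sketched for a single relation type with a small number of heads. The sketch assumes a graph-attention style score (LeakyReLU applied to the dot product of an attention vector with the concatenated transformed features, followed by a softmax over neighbors) and omits any outer activation function; these details, the function names, and the toy values are assumptions made for this example.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def attention_aggregate(h_v, neighbors, W, a):
    """One head, one relation: return (alphas, sum_u alpha_vu * W h_u).

    neighbors must be a non-empty list of neighbor feature vectors.
    """
    wh_v = matvec(W, h_v)
    wh_us = [matvec(W, h_u) for h_u in neighbors]
    # Score each neighbor from the concatenation [W h_v || W h_u].
    scores = [
        leaky_relu(sum(ai * zi for ai, zi in zip(a, wh_v + wh_u)))
        for wh_u in wh_us
    ]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    out = [
        sum(al * wh_u[j] for al, wh_u in zip(alphas, wh_us))
        for j in range(len(wh_v))
    ]
    return alphas, out


def multi_head_update(h_v, neighbors, heads):
    """Concatenate the outputs of all heads; heads is a list of (W, a)."""
    out = []
    for W, a in heads:
        out.extend(attention_aggregate(h_v, neighbors, W, a)[1])
    return out


# Toy example: 2-dimensional features, two neighbors, two heads.
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    (identity, [0.0, 0.0, 0.0, 0.0]),  # uniform attention
    (identity, [0.0, 0.0, 1.0, 0.0]),  # favors neighbors with a large first feature
]
h_new = multi_head_update([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], heads)
print(h_new)
```

The first head averages the two neighbors equally, while the second head places more weight on the first neighbor; the updated representation is the concatenation of both head outputs.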
8. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 7, characterized in that, in step S5, the causal effect $PE_p$ of the $p$-th path from pollution source $s$ to health endpoint $d$ is quantified as the product of the weights of all edges on that path: $PE_p = \prod_{(u,v) \in p} w_{uv}$, where $w_{uv}$ is the causal weight of edge $(u, v)$ and $p$ is a directed path from node $s$ to node $d$; and the total causal effect over multiple paths is obtained by summing the path effects: $TE(s, d) = \sum_{p \in P(s,d)} PE_p$, where $P(s, d)$ is the set of all valid causal paths from pollution source $s$ to health endpoint $d$.
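As a non-limiting illustration, the path effect (product of edge weights along a path) and the total effect (sum over all paths) can be sketched with a depth-first enumeration of directed paths. The graph, node names, and edge weights below are hypothetical, and the sketch assumes a small graph in which enumerating every simple path is feasible.

```python
def path_effects(edges, source, target):
    """Enumerate simple directed paths from source to target.

    edges maps a node to a list of (next_node, causal_weight) pairs.
    Returns a list of (path, effect), where effect is the product of
    the edge weights along the path.
    """
    found = []

    def walk(node, product, path):
        if node == target:
            found.append((path, product))
            return
        for nxt, weight in edges.get(node, []):
            if nxt not in path:  # skip nodes already visited (no cycles)
                walk(nxt, product * weight, path + [nxt])

    walk(source, 1.0, [source])
    return found


# Hypothetical causal network with two routes from a smelter to IQ loss.
edges = {
    "smelter": [("air", 0.8), ("soil", 0.5)],
    "air": [("blood_lead", 0.5)],
    "soil": [("blood_lead", 0.2)],
    "blood_lead": [("iq_loss", 0.5)],
}

effects = path_effects(edges, "smelter", "iq_loss")
total_effect = sum(effect for _, effect in effects)
shares = [effect / total_effect for _, effect in effects]
print(effects, total_effect, shares)
```

Dividing each path effect by the total gives a contribution rate per path, which is the kind of quantity reported for the major causal paths in the embodiment.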
9. The method for dynamic correlation analysis of the entire chain of environmental pollution and health effects according to claim 1, characterized in that the method also includes a deep transfer learning step under sparse data conditions: domain adaptation techniques are used to achieve cross-regional model transfer, and the domain adversarial loss function is as follows: $L_{adv} = \mathbb{E}_{x \sim P_S}\left[\log D(F(x))\right] + \mathbb{E}_{x \sim P_T}\left[\log\left(1 - D(F(x))\right)\right]$, where $P_S$ is the source domain data distribution; $P_T$ is the target domain data distribution; $D$ is the domain discriminator; and $F$ is the feature extractor.
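As a non-limiting illustration, a domain adversarial loss can be evaluated from discriminator outputs on source-region and target-region samples. The sketch assumes the standard adversarial form, the mean of log D(F(x)) over source samples plus the mean of log(1 − D(F(x))) over target samples, where D(F(x)) is the discriminator's estimated probability that a sample comes from the source domain. The function name and the probability values are hypothetical.

```python
import math


def domain_adversarial_loss(d_source, d_target, eps=1e-12):
    """Return mean(log D) over source samples + mean(log(1 - D)) over target.

    d_source  discriminator outputs D(F(x)) for source-domain samples
    d_target  discriminator outputs D(F(x)) for target-domain samples
    The discriminator tries to maximize this value; the feature
    extractor is trained to minimize it.
    """
    src = sum(math.log(max(p, eps)) for p in d_source) / len(d_source)
    tgt = sum(math.log(max(1.0 - p, eps)) for p in d_target) / len(d_target)
    return src + tgt


# Hypothetical discriminator outputs.
# Case 1: features are domain-invariant, so the discriminator outputs 0.5.
confused = domain_adversarial_loss([0.5, 0.5], [0.5, 0.5])
# Case 2: the discriminator separates the two regions well.
separated = domain_adversarial_loss([0.95, 0.9], [0.05, 0.1])
print(confused, separated)
```

When the extracted features carry no regional signature, the loss sits at −2·ln 2; a well-separated case moves it toward 0, which signals that the features still encode the region and cross-regional transfer is not yet reliable.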
10. A dynamic correlation analysis system for the entire chain of environmental pollution and health effects, used to implement the method of any one of claims 1 to 9, characterized in that the system adopts a three-layer architecture comprising a data layer, an algorithm layer, and an application layer, and includes:
a pollution source multidimensional analysis module, used to identify and quantify the contributions of pollution sources through the fusion of isotope tracing and multivariate receptor models;
an environmental media dynamic transport module, used to simulate the migration and transformation of pollutants in multiple media based on the multi-media coupled transport model and the multi-scale spatiotemporal adaptive algorithm;
a human exposure precision assessment module, used to calculate individualized exposure doses based on individual spatiotemporal activity trajectories and the microenvironment exposure model;
a health effects causal analysis module, used to identify the causal relationship between exposure and health effects through the deep learning-driven causal inference method;
a whole-chain correlation modeling module, used to construct and visualize causal networks from pollution sources to health effects using the heterogeneous graph neural network;
wherein the data layer comprises a pollution source database, an environmental media database, a human exposure database, and a health effect database, which interact and achieve spatiotemporal alignment through standardized interfaces;
the algorithm layer includes a multi-source analysis algorithm library, an environmental transport algorithm library, an exposure assessment algorithm library, a causal inference algorithm library, and a graph neural network engine, each algorithm library adopting a modular design that supports flexible algorithm invocation and combination;
the application layer includes a full-chain correlation analysis module, a dynamic risk prediction module, and a decision support module, providing users with analysis results and decision suggestions through a visual interface;
and the three layers communicate via RESTful APIs and message queues, supporting distributed deployment and horizontal scaling, the system being implemented using containerization technology that supports dynamic scaling based on business load.