Multi-source heterogeneous data intelligent acquisition and spatio-temporal fusion processing system and method

By using a multi-source data automatic acquisition module, a unified data processing module, and a spatiotemporal graph neural network (ST-GNN) model, the system addresses the problems in environmental health data processing, such as low efficiency in acquiring multi-source data, difficulties in cross-protocol integration, insufficient precision in data quality control, difficulty in balancing medical data privacy protection and availability, low spatiotemporal matching accuracy, and lack of data processing capabilities for emerging pollutants. This enables efficient and real-time data acquisition and fusion processing.

CN122240709A · Pending · Publication Date: 2026-06-19 · HUBEI PROVINCIAL ACADEMY OF ECO-ENVIRONMENTAL SCIENCES (PROVINCIAL ECOLOGICAL ENVIRONMENT ENGINEERING ASSESSMENT CENTER)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUBEI PROVINCIAL ACADEMY OF ECO-ENVIRONMENTAL SCIENCES(PROVINCIAL ECOLOGICAL ENVIRONMENT ENGINEERING ASSESSMENT CENTER)
Filing Date
2026-03-17
Publication Date
2026-06-19


Abstract

This invention provides a system and method for intelligent acquisition and spatiotemporal fusion processing of multi-source heterogeneous data. It achieves fully automated processing through four core modules: an automatic multi-source data acquisition module, a unified data processing module, a spatiotemporal matching and interpolation module, and an application interface and service module. For data acquisition, the system employs a parameter mapping matrix for adaptive API integration and an intelligent crawling technique based on pattern similarity. Quality control is achieved through a multi-level data cleaning strategy that integrates statistical methods, machine learning methods, and domain rules. A tiered privacy protection mechanism is constructed, implementing K-anonymization and differential privacy protection for medical and health data. A spatiotemporal graph neural network model is applied, using a time-aware adjacency matrix and an attention mechanism to achieve accurate spatiotemporal data matching and fusion. For emerging pollutants, stratified detection limit processing and multi-media migration modeling are implemented. This invention solves the technical problems of difficult multi-source data acquisition, low spatiotemporal matching accuracy, and the lack of data processing capability for emerging pollutants in the field of environmental health, providing reliable data support for environmental health risk assessment.

Description

Technical Field

[0001] This invention relates to data processing technology in the field of environmental health monitoring and risk assessment, and specifically to a system and method for intelligent acquisition and spatiotemporal fusion processing of multi-source heterogeneous data.

Background Technology

[0003] Despite the growing importance of environmental health research, existing technologies face numerous challenges and limitations in the acquisition, preprocessing, and fusion of multi-source heterogeneous data: Data acquisition is challenging: Environmental health research requires the integration of multi-source, heterogeneous data from different departments, formats, and temporal and spatial scales, including environmental monitoring data, pollution source data, medical and health data, emerging pollutant data, and socioeconomic data. However, existing technologies face significant obstacles in data acquisition: environmental and health data are scattered across different departments such as environmental protection, health, meteorology, and statistics, lacking a unified data sharing platform and standard interfaces. Traditional data acquisition methods rely primarily on manual requests and downloads, which are inefficient and struggle to guarantee real-time performance. While some systems provide API interfaces, the interface specifications vary significantly across systems, lacking cross-protocol adaptive integration mechanisms, making automated data collection difficult. Some historical data exists only in report or document form, lacking effective automatic extraction technologies.

[0004] Data quality and standardization issues: Data from different sources exhibits quality problems such as outliers, missing values, and inconsistent formats. For example, environmental monitoring data may contain outliers due to equipment malfunctions, while medical records may contain data biases due to data entry errors. Existing data cleaning methods are mostly designed for single data sources, employing single statistical or machine learning methods, which are insufficient to handle the complex quality issues of multi-source heterogeneous data. Furthermore, different systems use different coding standards, unit systems, and spatiotemporal reference frames, making data integration difficult. For instance, China's environmental monitoring uses the HJ / T 212 standard, while medical data may use HL7 or other standards, lacking an effective standard conversion mechanism.

[0005] Insufficient Privacy Protection in Healthcare Data: With increasingly stringent data security and privacy regulations, ensuring the value of data analysis while meeting privacy requirements has become a key challenge. Existing technologies often employ simple anonymization techniques, such as removing direct identifiers, but these cannot prevent identity re-identification attacks through data association. Furthermore, the lack of tiered protection mechanisms for data with different levels of sensitivity makes it difficult to balance data usability and privacy protection. While differential privacy theory provides a rigorous mathematical framework for privacy protection, existing environmental health data systems have not effectively combined differential privacy techniques with tiered protection mechanisms, and also lack utility loss control methods tailored to the characteristics of environmental health data.

[0006] Low accuracy in spatiotemporal data matching and interpolation: Environmental monitoring data and health data often exhibit mismatches in time and space. For example, air quality monitoring stations are sparsely distributed, while health data may be recorded according to administrative divisions or the location of medical institutions. Traditional interpolation methods, such as inverse distance weighted (IDW) and Kriging, have limited accuracy when processing nonlinear and non-stationary environmental data. Although some studies have attempted to apply machine learning to spatiotemporal interpolation, these efforts are mostly focused on single data sources and do not fully consider the time decay effect and the heterogeneity of multi-source data. Existing spatiotemporal interpolation methods based on graph neural networks have improved accuracy, but they are mainly designed for single data sources and still use traditional time-series models for processing the time dimension. They lack time-aware graph structure design and a general framework for processing multi-source heterogeneous data.

[0007] Insufficient processing efficiency and real-time performance: Environmental health monitoring data is characterized by high frequency, large volume, and continuous updates, posing a severe challenge to the efficiency and real-time performance of data processing systems. Existing systems mostly adopt batch processing mode, resulting in significant processing delays and difficulty in supporting real-time monitoring and early warning. Furthermore, data processing workflows are mostly semi-automated, requiring manual intervention and adjustments, which is inefficient and prone to introducing human error.

[0008] Emerging pollutant data processing faces unique challenges: microplastics, perfluorinated compounds (PFAS), endocrine disruptors, and other emerging pollutants are increasingly becoming hot topics in environmental health research, but their data processing faces particular difficulties. On the one hand, detection technologies and standards are not yet fully mature, resulting in inconsistent data quality, large differences in detection limits, and a large amount of data below the limit of detection (LOD). Traditional LOD / 2 substitution methods or direct deletion methods lead to information loss or bias. On the other hand, the migration and transformation of these pollutants in the environment are complex, involving multi-media interactions, and existing data processing methods struggle to characterize their environmental behavior and exposure pathways. Furthermore, the types of emerging pollutants are numerous and constantly increasing, and existing technologies lack flexible and scalable architectures to adapt to the continuous access needs of new pollutant data. Existing environmental health data processing systems generally lack detection limit stratification strategies and multi-media migration modeling capabilities for emerging pollutants.

[0009] Application interfaces and services lack standardization and flexibility: Existing environmental health data systems typically provide limited data access interfaces and lack standardized API services, making it difficult to support diverse upper-layer application needs. Simultaneously, data security management mechanisms are simplistic, hindering refined data access control and increasing the risk of data misuse. Furthermore, insufficient real-time data push capabilities fail to meet the timeliness requirements of environmental health monitoring and early warning.

[0010] The limitations of existing technologies severely restrict the in-depth development of environmental health research, and there is an urgent need for a technical solution that can efficiently acquire, intelligently process, and accurately integrate multi-source heterogeneous data.

Summary of the Invention

[0011] The purpose of this invention is to provide a system for intelligent acquisition and spatiotemporal fusion processing of multi-source heterogeneous data, so as to solve the problems mentioned in the background art.

[0012] The technical problems to be solved by this invention include the following: existing environmental health data processing technologies suffer from low efficiency in acquiring multi-source data, difficulty in cross-protocol interoperability, insufficient precision in data quality control, difficulty in balancing medical data privacy protection and usability, low spatiotemporal matching accuracy, and lack of data processing capabilities for emerging pollutants.

[0013] To achieve the above objectives, the present invention provides the following technical solution: a multi-source heterogeneous data intelligent acquisition and spatiotemporal fusion processing system, comprising:
the multi-source data automatic acquisition module, which, through an adaptive API docking system, a semi-supervised-learning-enhanced intelligent crawler engine, and a historical data parsing system, acquires pollution source data, environmental media data, health data, and emerging pollutant data from environmental protection, health, and meteorological departments and from emerging pollutant monitoring data sources;
the unified data processing module, which includes a general data quality control system, a pollution source data specific processing module, an environmental monitoring data specific processing module, a medical and health data processing module, and an emerging pollutant data processing module, and which performs data cleaning, standardization, and type-specific processing;
the spatiotemporal matching and interpolation module, which uses the spatiotemporal graph neural network (ST-GNN) model to realize the spatiotemporal alignment and fusion of multi-source data, wherein the ST-GNN model constructs a time-aware adjacency matrix by combining the spatial adjacency matrix with a time decay function and uses an attention mechanism to handle the heterogeneity of multi-source data;
the application interface and service module, which provides standardized data access through RESTful API services, real-time data push services, and data visualization interfaces.

[0014] Furthermore, in the multi-source data automatic acquisition module: the adaptive API docking system performs parameter transformation through a parameter mapping matrix:

$$P_{\text{target}} = M \cdot P_{\text{source}}$$

[0015] where $P_{\text{target}}$ is the target system parameter vector, $P_{\text{source}}$ is the source system parameter vector, and $M$ is the parameter mapping matrix. The parameter mapping matrix $M$ is learned by minimizing the weighted sum of the mapping error and a regularization term; it realizes cross-system parameter transformation and supports the RESTful, SOAP, and GraphQL protocols. The intelligent web crawler engine collects unstructured data based on the following pattern similarity formula:

$$S_{\text{pattern}} = \alpha \cdot S_{\text{struct}} + (1 - \alpha) \cdot S_{\text{content}}$$

[0016] where $S_{\text{pattern}}$ is the pattern similarity; $S_{\text{struct}}$ is the structural similarity, calculated from the DOM tree edit distance; $S_{\text{content}}$ is the content similarity, calculated from the cosine similarity of semantic vectors; and $\alpha$ is a balance parameter. The historical data parsing system employs a deep learning model to extract tabular data from documents:

$$R = F_{\text{struct}}(I) \oplus F_{\text{content}}(T)$$

[0017] where $R$ is the recognition result, $F_{\text{struct}}$ is the table structure recognition network, $I$ is the document image, $F_{\text{content}}$ is the content recognition network, $T$ is the text content, and $\oplus$ is the feature fusion operator.
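The pattern similarity used by the crawler engine, a balance between structural and content similarity weighted by α, can be illustrated with a minimal sketch. The sketch makes several assumptions that are not part of the disclosure: a page's DOM tree is flattened to a tag sequence and compared with a plain sequence edit distance (a simplified stand-in for a true DOM tree edit distance), the semantic vectors are supplied by the caller, and all function names and the default `alpha=0.5` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sequence_edit_distance(a, b):
    """Levenshtein distance over tag sequences (stand-in for tree edit distance)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def structural_similarity(tags_a, tags_b):
    """Map the edit distance to a similarity in [0, 1]."""
    longest = max(len(tags_a), len(tags_b))
    if longest == 0:
        return 1.0
    return 1.0 - sequence_edit_distance(tags_a, tags_b) / longest


def pattern_similarity(tags_a, tags_b, vec_a, vec_b, alpha=0.5):
    """alpha * structural similarity + (1 - alpha) * content similarity."""
    return (alpha * structural_similarity(tags_a, tags_b)
            + (1 - alpha) * cosine_similarity(vec_a, vec_b))
```

With `alpha` close to 1 the crawler would favor pages whose layout matches a known template; with `alpha` close to 0 it would favor pages whose content is semantically similar even if the layout has changed.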

[0018] Furthermore, within the unified data processing module: the general data quality control system performs outlier detection with a multi-level outlier scoring model:

$$S_{\text{outlier}} = w_1 S_{\text{stat}} + w_2 S_{\text{ml}} + w_3 S_{\text{rule}}$$

[0019] where $S_{\text{outlier}}$ is the outlier score, $S_{\text{stat}}$ is the score from statistical methods, $S_{\text{ml}}$ is the score from machine learning methods, $S_{\text{rule}}$ is the score from domain rules, and $w_1$, $w_2$, $w_3$ are weighting coefficients that are adaptively adjusted based on the data type. The medical and health data processing module implements tiered privacy protection, medical coding standardization, and health indicator calculation. For tiered privacy protection: basic de-identification is applied to low-sensitivity data; K-anonymization (K≥5) and selective differential privacy protection are applied to medium-sensitivity data, ensuring the indistinguishability of records and protecting sensitive statistics with differential privacy; full differential privacy protection and data aggregation are applied to high-sensitivity data, with a privacy budget ε of 0.1. The differential privacy satisfies the (ε,δ)-differential privacy definition, noise is added using mechanisms such as the Laplace and Gaussian mechanisms, and the total privacy loss from multiple queries is controlled through a budget management mechanism. The medical coding standardization supports the recognition and conversion of disease codes such as ICD-10 and ICD-11, surgical procedure codes, and biomarker codes. The health indicator calculation includes biomarker corrections (urinary creatinine correction, blood lipid correction) and exposure dose estimation. The emerging pollutant data processing module provides detection limit processing strategies for microplastics and perfluorinated compounds, including LOD/2 substitution and maximum likelihood estimation, together with multi-media migration models; the detection limit processing strategy is adaptively selected based on the detection rate.
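The multi-level outlier score, a weighted combination of a statistical score, a machine learning score, and a domain-rule score, might be sketched as follows. Everything concrete here is an assumption made for illustration: the Z-score is scaled so that |z| ≥ 3 saturates at 1, the machine learning score is taken as an input (for example, from an isolation forest run elsewhere), the rule is a simple physical-range check, and the weights `(0.4, 0.4, 0.2)` are placeholders rather than the adaptively adjusted values the text describes.

```python
import statistics


def zscore_score(x, reference):
    """Statistical score in [0, 1]: |z| / 3, saturating at 1 (assumed scaling)."""
    mean = statistics.fmean(reference)
    sd = statistics.pstdev(reference)
    if sd == 0:
        return 0.0
    return min(abs(x - mean) / sd / 3.0, 1.0)


def rule_score(x, lower, upper):
    """Domain-rule score: 1 if the value violates the allowed physical range."""
    return 0.0 if lower <= x <= upper else 1.0


def outlier_score(s_stat, s_ml, s_rule, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three component scores (weights are hypothetical)."""
    w1, w2, w3 = weights
    return w1 * s_stat + w2 * s_ml + w3 * s_rule


# Hypothetical hourly readings used as the reference window.
reference = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0]
suspect = 500.0
score = outlier_score(
    zscore_score(suspect, reference),
    0.9,  # assumed output of an ML detector such as an isolation forest
    rule_score(suspect, lower=0.0, upper=300.0),
)
```

Adapting the weights per data type, as the text describes, would amount to selecting a different `weights` tuple for each type of data source.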

[0020] Furthermore, in the spatiotemporal matching and interpolation module: the ST-GNN model captures spatiotemporal dependencies using the spatiotemporal graph convolution operation

$$H^{(l+1)} = \sigma\left(\tilde{A}_t H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ is the node feature matrix of the $l$-th layer, $H^{(l+1)}$ is the node feature matrix of the $(l+1)$-th layer, $\tilde{A}_t$ is the time-aware adjacency matrix, $W^{(l)}$ is the weight matrix, and $\sigma$ is the activation function. The time-aware adjacency matrix is obtained as the element-wise product of the spatial adjacency matrix $A$ and the time decay function:

$$\tilde{A}_t(i,j) = A(i,j) \cdot f\left(\left|t_i - t_j\right|\right)$$

where $A$ is the spatial adjacency matrix, $f$ is the time decay function, $t_i$ and $t_j$ are the timestamps of nodes $i$ and $j$, $\tilde{A}_t(i,j)$ is the element in the $i$-th row and $j$-th column of the time-aware adjacency matrix, $A(i,j)$ is the element in the $i$-th row and $j$-th column of the spatial adjacency matrix, and $\left|t_i - t_j\right|$ is the absolute value of the difference between the timestamps. The multi-scale spatiotemporal registration engine achieves cross-scale data transformation based on the Variable Resolution Spatial Index (VRSI) and the Hierarchical Temporal Model (HTM). The VRSI is implemented using a quadtree or geohash encoding, and the HTM includes a hierarchical structure of hour-day-week-month-quarter-year. The data fusion quality assessment system calculates the total uncertainty at position x and time t from four component uncertainties:

[0021] the epistemic (cognitive) uncertainty, the aleatoric (random) uncertainty, the uncertainty arising from detection limits, and the uncertainty arising from multi-media conversion.
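The construction of the time-aware adjacency matrix and one graph convolution step can be sketched in a few lines. The text does not state the form of the time decay function, so the exponential decay with time constant `tau` below is an assumption, as are the ReLU activation, the absence of any adjacency normalization, and all function names.

```python
import math


def time_decay(dt, tau=24.0):
    """Assumed decay form: exp(-|dt| / tau); the source does not fix this choice."""
    return math.exp(-abs(dt) / tau)


def time_aware_adjacency(A, timestamps, tau=24.0):
    """Element-wise product of spatial adjacency A and the decay of |t_i - t_j|."""
    n = len(A)
    return [[A[i][j] * time_decay(timestamps[i] - timestamps[j], tau)
             for j in range(n)] for i in range(n)]


def matmul(X, Y):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def relu(X):
    return [[max(0.0, v) for v in row] for row in X]


def st_graph_conv(A_t, H, W):
    """One layer: activation(A_t @ H @ W), with ReLU as the assumed activation."""
    return relu(matmul(matmul(A_t, H), W))


# Two fully connected stations whose observations are 24 hours apart.
A = [[1.0, 1.0], [1.0, 1.0]]
A_t = time_aware_adjacency(A, timestamps=[0.0, 24.0], tau=24.0)
H_next = st_graph_conv(A_t, H=[[1.0], [2.0]], W=[[1.0]])
```

In this toy case the off-diagonal weights shrink from 1 to exp(-1), so each station's updated feature is dominated by its own observation rather than by the neighbor's older or newer one.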

[0022] Furthermore, the attention coefficient is calculated using the following formula:

[0023] In this formula, $h_i$ and $h_j$ are node features, $a$ and $W$ are learnable parameters, $\|$ denotes feature concatenation, and $\Delta t$ is the time difference. Model training adopts a masked reconstruction objective: self-supervised learning is performed by randomly masking 20% of the node values. The loss function includes a reconstruction loss and a smoothness regularization term.
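Since the exact attention formula is not legible in the text, the following is only one plausible reading, assuming a graph-attention-style score in which the transformed features of nodes i and j are concatenated with the time difference Δt, scored by the vector `a`, passed through LeakyReLU, and normalized with a softmax over the neighbors. The random 20% masking used for self-supervised training is sketched alongside. All names, the LeakyReLU slope, and the way Δt enters the score are assumptions.

```python
import math
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def linear(W, h):
    """Apply weight matrix W (list of rows) to feature vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_coefficients(i, neighbors, features, timestamps, W, a):
    """Softmax over neighbors of LeakyReLU(a . [W h_i || W h_j || dt]) (assumed form)."""
    wh_i = linear(W, features[i])
    logits = []
    for j in neighbors:
        wh_j = linear(W, features[j])
        dt = abs(timestamps[i] - timestamps[j])
        z = wh_i + wh_j + [dt]  # list concatenation plays the role of ||
        logits.append(leaky_relu(sum(ak * zk for ak, zk in zip(a, z))))
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_nodes(values, ratio=0.2, seed=0):
    """Randomly hide a fraction of node values for masked reconstruction."""
    rng = random.Random(seed)
    k = max(1, round(ratio * len(values)))
    hidden = set(rng.sample(range(len(values)), k))
    masked = [None if idx in hidden else v for idx, v in enumerate(values)]
    return masked, hidden
```

With a negative weight on the Δt entry of `a`, a neighbor observed long before or after node i receives a smaller coefficient than a neighbor observed at nearly the same time, which is the behavior the time-aware design aims for.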

[0024] Furthermore, in the application interface and service module: The RESTful API service provides dedicated endpoints for emerging pollutants to support multi-dimensional queries by pollutant type, medium, and region. The real-time data push service is based on the WebSocket protocol, adopts the publish-subscribe pattern and message batch processing technology, and supports high-concurrency connections; The data access control system combines role-based access control (RBAC) and attribute-based access control (ABAC) permission models, defining four roles: administrator, researcher, analyst, and browser, as well as data sensitivity and geographic scope attributes.
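A combined role-based and attribute-based access check along the lines described above could look like the sketch below. The four role names and the two attributes (data sensitivity and geographic scope) come from the text; the numeric sensitivity levels, the mapping from roles to the highest level they may read, and the rule that administrators are not restricted by region are hypothetical policy choices made only for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy: highest sensitivity level each role may read
# (0 = public, 1 = low, 2 = medium, 3 = high).
ROLE_MAX_SENSITIVITY = {
    "administrator": 3,
    "researcher": 2,
    "analyst": 1,
    "browser": 0,
}


@dataclass(frozen=True)
class User:
    role: str
    regions: frozenset  # geographic scope attribute


@dataclass(frozen=True)
class Dataset:
    sensitivity: int
    region: str


def can_access(user, dataset):
    """RBAC step (role ceiling) followed by an ABAC step (region attribute)."""
    ceiling = ROLE_MAX_SENSITIVITY.get(user.role)
    if ceiling is None or dataset.sensitivity > ceiling:
        return False
    # Assumed rule: administrators are not restricted by geographic scope.
    return user.role == "administrator" or dataset.region in user.regions


researcher = User(role="researcher", regions=frozenset({"Wuhan"}))
```

Keeping the role ceiling and the attribute check as separate steps makes it possible to change either policy without touching the other.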

[0025] Furthermore, the system adopts a microservice architecture and containerized deployment, supporting elastic scaling; and achieves seamless access to new pollutant data through a modular plug-in architecture, without the need to refactor the system.

[0026] A method for processing multi-source heterogeneous data in the field of environmental health includes:
Step S1: automatically collecting multi-source heterogeneous data through adaptive API integration and intelligent web crawling technology;
Step S2: performing unified data processing on the collected multi-source heterogeneous data, including multi-level outlier detection and processing, adaptive missing value imputation, tiered privacy protection of medical and health data, and detection limit processing and multi-media migration simulation of emerging pollutant data;
Step S3: performing spatiotemporal fusion using the ST-GNN model with the time-aware adjacency matrix and attention mechanism, constructing the spatiotemporal graph structure, and generating gridded prediction results and an uncertainty assessment;
Step S4: outputting the fused results through the standardized API service and real-time push.

[0027] Furthermore, the processing of medical and health data employs a differential privacy mechanism $\mathcal{M}$ that satisfies

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$$

where $D$ and $D'$ are any two datasets that differ by one record and $S$ is any set of outputs. While the privacy requirement is met, the utility loss $U_{\text{loss}}$ is kept within a preset threshold and is calculated as

$$U_{\text{loss}} = \beta_1 D_{\text{stat}}(O, P) + \beta_2 D_{\text{ml}}(O, P)$$

where $D_{\text{stat}}$ is the statistical distribution distance, $D_{\text{ml}}$ is the performance degradation on machine learning tasks, $O$ is the original data, $P$ is the protected data, and $\beta_1$ and $\beta_2$ are weighting coefficients.
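As a minimal sketch of how a differentially private statistic, a privacy budget, and the weighted utility loss might fit together, the example below releases a count with the Laplace mechanism, tracks cumulative ε under basic sequential composition, and combines two distance terms into a utility loss. The class and function names, the equal default weights, and the choice of a count query with sensitivity 1 are assumptions; Laplace noise is drawn as the difference of two exponential variates, which is a standard way to sample it.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: noise scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


class PrivacyBudget:
    """Tracks total epsilon spent across queries (basic sequential composition)."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.remaining -= epsilon


def utility_loss(d_stat, d_ml, beta_stat=0.5, beta_ml=0.5):
    """Weighted sum of distribution distance and ML performance degradation."""
    return beta_stat * d_stat + beta_ml * d_ml


rng = random.Random(42)
budget = PrivacyBudget(total_epsilon=0.1)  # budget value taken from the text
budget.spend(0.05)
noisy_cases = private_count(true_count=120, epsilon=0.05, rng=rng)
```

A release would be accepted only if `utility_loss(...)` for the protected output stays below the preset threshold; otherwise the aggregation level or the ε allocation would need to be revisited.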

[0028] Emerging pollutant data processing includes detection limit replacement strategies and multi-media migration simulation, supporting a complete processing chain for various emerging pollutants.
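The adaptive choice of a detection limit strategy can be sketched as a selector driven by the detection rate. The cut-off values of 0.8 and 0.5 and the mapping of rate bands to strategies are hypothetical, since the text states only that the strategy is chosen according to the detection rate; non-detects are represented here as `None`, and only the LOD/2 substitution branch is implemented.

```python
def detection_rate(values, lod):
    """Fraction of samples at or above the limit of detection (LOD)."""
    detected = sum(1 for v in values if v is not None and v >= lod)
    return detected / len(values)


def choose_strategy(rate, high=0.8, low=0.5):
    """Pick a handling strategy by detection rate (thresholds are hypothetical)."""
    if rate >= high:
        return "lod_half_substitution"
    if rate >= low:
        return "maximum_likelihood"
    return "regression_imputation"


def substitute_lod_half(values, lod):
    """Replace non-detects (None or below LOD) with LOD / 2."""
    return [lod / 2 if v is None or v < lod else v for v in values]


# Hypothetical PFAS concentrations in ng/L; None marks a non-detect.
samples = [0.5, None, 1.2, 0.9, 2.0]
lod = 0.2
rate = detection_rate(samples, lod)
strategy = choose_strategy(rate)
cleaned = substitute_lod_half(samples, lod) if strategy == "lod_half_substitution" else samples
```

Simple substitution is usually considered acceptable only when few values are censored, which is why the sketch reserves it for the highest detection-rate band.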

[0029] Furthermore, the robustness of the ST-GNN model is evaluated through cross-validation, with the cross-validation RMSE calculated as follows:

$$\text{RMSE}_{\text{cv}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_{-i} \right)^2}$$

[0030] where $y_i$ is the actual observed value, $\hat{y}_{-i}$ is the value predicted without using data source $i$, and $N$ is the number of observation points; an uncertainty map is generated from the results to guide the data reliability assessment.
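The leave-one-out cross-validation RMSE can be computed as in the sketch below. The predictor used here, the mean of the remaining observations, is only a stand-in for the ST-GNN model so that the example runs on its own; the function names are hypothetical.

```python
import math


def cross_validation_rmse(observed, predict_without):
    """RMSE between each observation and its prediction made without source i."""
    squared_errors = [(y - predict_without(i)) ** 2 for i, y in enumerate(observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def make_mean_predictor(observed):
    """Stand-in model: predict point i as the mean of all other observations."""
    def predict_without(i):
        others = [y for k, y in enumerate(observed) if k != i]
        return sum(others) / len(others)
    return predict_without


observations = [1.0, 2.0, 3.0]
rmse = cross_validation_rmse(observations, make_mean_predictor(observations))
```

Keeping the per-point squared errors, rather than only their mean, is what would allow them to be plotted by location as an uncertainty map.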

[0031] Compared with the prior art, the beneficial effects of the present invention are:
1. An adaptive API docking method based on a parameter mapping matrix is proposed. By learning the parameter mapping relationship between the source system and the target system, it realizes automatic conversion among multiple protocols such as RESTful, SOAP, and GraphQL, and solves the technical bottleneck of data acquisition from cross-departmental heterogeneous systems.
2. A multi-level outlier scoring model integrating statistical methods, machine learning methods, and domain rules is constructed. Through adaptive weight adjustment, the accuracy and adaptability of anomaly detection are improved.
3. A tiered privacy protection framework is designed, implementing de-identification, K-anonymization, and differential privacy protection according to the sensitivity of medical and health data, maximizing data availability while meeting privacy protection requirements.
4. A time-aware adjacency matrix and a time decay function are introduced into the spatiotemporal graph neural network (ST-GNN). The time-aware adjacency matrix is constructed by element-wise multiplication of the spatial adjacency matrix and the time decay function and, combined with an attention mechanism, handles the heterogeneity of multi-source data. Compared with existing spatiotemporal interpolation methods based on graph neural networks, this invention dynamically adjusts the connection weights between nodes through the time decay function, which better suits the irregular sampling times and large differences in time granularity of multi-source data in the field of environmental health, thus improving spatiotemporal fusion accuracy.
5. A detection limit stratification strategy and a multi-media migration model are developed for emerging pollutants. LOD/2 substitution, maximum likelihood estimation, or regression imputation is adaptively selected according to the detection rate, and the migration and transformation of pollutants between different environmental media are simulated through the multi-media migration model, filling a gap in traditional monitoring systems.
6. A total uncertainty quantification method is designed that integrates cognitive (epistemic) uncertainty, stochastic uncertainty, detection limit source uncertainty, and multi-media transformation uncertainty to provide reliability assessment for decision-making.

Description of the Drawings

[0032] Figure 1 is a diagram of the overall architecture of the system of the present invention;
Figure 2 is a functional structure diagram of the unified data processing module;
Figure 3 is a flowchart of the multi-source data automatic acquisition module;
Figure 4 is a framework diagram of the tiered privacy protection of medical and health data;
Figure 5 is an architecture diagram of the spatiotemporal graph neural network (ST-GNN) model;
Figure 6 is a diagram of multi-scale spatiotemporal registration and fusion quality assessment.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0034] Referring to Figure 1, this invention provides a technical solution: a multi-source heterogeneous data intelligent acquisition and spatiotemporal fusion processing system. The system adopts a modular design and includes four core components: a multi-source data automatic acquisition module, a unified data processing module, a spatiotemporal matching and interpolation module, and an application interface and service module.

[0035] The multi-source heterogeneous data intelligent acquisition and spatiotemporal fusion processing system of the present invention adopts a four-layer architecture design, including a data acquisition layer, a data processing layer, a data fusion layer, and an application interface layer: The data acquisition layer is responsible for automatically collecting environmental and health data from multiple heterogeneous data sources. This layer implements a unified data access framework, supporting various data acquisition methods such as API interface integration, intelligent web crawling, and local file import. The API interface management component supports multiple interface standards such as RESTful, SOAP, and GraphQL, enabling standardized integration with monitoring systems of environmental protection, health, and meteorological departments. The intelligent web crawling engine adopts a distributed architecture, supporting scheduled triggering, incremental collection, and adaptive adjustment, and possesses self-learning capabilities for webpage structure, adapting to changes in the target website's structure. The data acquisition layer also includes a data source registry, a collection task scheduler, and a data quality initial inspection module to ensure the reliability and traceability of data acquisition. In addition, the system is designed with a dedicated adapter to support automated data acquisition from emerging pollutant monitoring data sources such as scientific research monitoring projects, special survey databases, and third-party testing institutions.

[0036] The data processing layer is responsible for cleaning, standardizing, and quality control of the acquired raw data. This layer comprises five sub-modules: a general data quality control system, a pollution source data-specific processing module, an environmental monitoring data-specific processing module, a medical and health data processing module, and an emerging pollutant data processing module. The general data quality control system integrates a multi-level detection strategy based on statistical methods (Z-score, IQR), machine learning methods (isolation forest, autoencoder), and domain rules. Through adaptive weight adjustment, it accurately identifies different types of outliers. The missing value processing engine automatically selects the optimal imputation method based on data characteristics and missing value patterns, supporting time-series-specific imputation algorithms and multivariate correlation imputation algorithms. The standardization and transformation engine maintains a coding mapping table and unit conversion relation library in the environmental health domain, achieving seamless conversion between different standards. Specifically, the emerging pollutant data processing module has developed specialized processing procedures for novel environmental risk substances such as microplastics and perfluorinated compounds, solving special challenges such as detection limit processing, multi-media migration simulation, and exposure pathway analysis. The medical and health data processing module implements a tiered privacy protection mechanism, balancing data availability and privacy security.

[0037] The data fusion layer is the core innovation of the system, responsible for the spatiotemporal alignment and deep fusion of multi-source data. This layer introduces a Spatiotemporal Graph Neural Network (ST-GNN) as its core algorithm framework, enabling spatiotemporal representation learning and joint modeling of heterogeneous multi-source data. The ST-GNN model captures the spatiotemporal dependencies of data through time-aware graph convolution operations and combines an attention mechanism to handle the heterogeneity of multi-source data, improving the accuracy of spatiotemporal interpolation. This layer also includes a spatiotemporal registration engine, a multi-scale representation module, and an uncertainty quantification component, supporting lossless transformation and fusion quality assessment of data at different spatiotemporal granularities. The model design is specifically enhanced to handle sparse, irregularly sampled, and highly variable data, providing support for the spatiotemporal analysis of emerging pollutants.

[0038] The application interface layer provides standardized data service interfaces, supporting upper-layer applications in accessing and utilizing processed data. This layer adopts a microservice architecture, offering various interface formats such as RESTful API, WebSocket real-time push, and batch data export. To meet the needs of different application scenarios, this layer implements a multi-level caching mechanism and on-demand computing strategy to optimize response performance. The application interface layer also includes a fine-grained data access control system, a role- and attribute-based composite permission model, controlling the scope of data access according to data sensitivity levels and user permissions to ensure data security. The system has also specifically developed a real-time data push service to support the timeliness requirements of environmental health monitoring and early warning.

[0039] Each module is explained in detail below.

(I) Multi-source data automatic acquisition module

Emerging pollutant monitoring data sources: the emerging pollutant monitoring data sources described in this invention include, but are not limited to, the following types:
(1) Data from scientific research monitoring projects: data from special research projects on emerging pollutants carried out by universities and research institutes;
(2) Special survey data: data from special surveys on emerging pollutants organized by the Ministry of Ecology and Environment and local environmental protection departments;
(3) Data from third-party testing institutions: monitoring data provided by third-party laboratories with the capability to detect emerging pollutants;
(4) Enterprise self-monitoring data: monitoring data on emerging pollutants collected independently by some enterprises;
(5) Data from international cooperation projects: monitoring data on emerging pollutants obtained through international cooperation;
(6) Literature databases: monitoring data on emerging pollutants extracted from scientific literature.

[0040] The system uses API interfaces, intelligent crawlers, and historical data parsing to automatically acquire and integrate the aforementioned data sources.

[0041] like Figure 3 As shown, the multi-source data automatic acquisition module is the foundational component of the system, corresponding to the data acquisition layer. It is responsible for automatically collecting environmental and health monitoring data from various heterogeneous data sources. This module comprises three key sub-modules: an API interface adaptive interface system, a semi-supervised learning-enhanced intelligent crawler engine, and a historical data and report automatic parsing system. These sub-modules work collaboratively to form a complete acquisition chain from different data sources to standardized raw data. The system operation flow is as follows: First, structured data is acquired from monitoring systems of environmental protection, health, and meteorological departments through the API interface adaptive interface system; simultaneously, the semi-supervised learning-enhanced intelligent crawler engine automatically collects information from unstructured data sources such as websites and portals; furthermore, the historical data and report automatic parsing system processes historical documents in formats such as PDF and Word; finally, after preliminary quality checks and format unification, the data from the three sub-modules outputs standardized raw data for use by subsequent processing modules. The main core functions include: 1.1 The multi-source API interface adaptive docking system realizes adaptive docking with different data interfaces from departments such as environmental protection, health, and meteorology. The core of the API interface adaptive calling is parameter mapping and transformation.

$$\mathbf{p}_{\text{target}} = M\,\mathbf{p}_{\text{source}}$$

[0042] where $\mathbf{p}_{\text{target}}$ is the target system parameter vector, $\mathbf{p}_{\text{source}}$ is the source system parameter vector, and $M$ is the parameter mapping matrix. The methods for constructing the parameter mapping matrix M include: (1) Parameter semantic analysis: the API documentation of the source and target systems is analyzed using natural language processing techniques; parameter names, data types, value ranges, and semantic descriptions are extracted to construct parameter semantic vectors.

[0043] (2) Parameter alignment

[0044] Cosine similarity is used to calculate the similarity between parameter semantic vectors, and a mapping relationship is established for parameter pairs whose similarity exceeds a preset threshold.

[0045] (3) Generation of mapping matrix

[0046] For a source system with n parameters and a target system with m parameters, an m×n mapping matrix M is constructed, where the matrix element $M_{ij}$ represents the mapping weight from source parameter j to target parameter i, with a value range of [0, 1].

[0047] (4) Mapping learning

[0048] M is learned by minimizing the weighted sum of the mapping error and the regularization term:

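$$M^{*} = \arg\min_{M}\; \mathbb{E}\left[\left\lVert \mathbf{p}_{\text{target}} - M\,\mathbf{p}_{\text{source}} \right\rVert_2^2\right] + \lambda \left\lVert M \right\rVert_F^2$$

[0049] where λ is the regularization coefficient, $\lVert\cdot\rVert_2$ denotes the L2 norm, $\lVert\cdot\rVert_F$ denotes the Frobenius norm, and $\mathbb{E}$ denotes the expected value.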

[0050] (5) Optimize the solution method

[0051] The optimization problem for mapping learning can be solved by one of the following methods or a combination of them. When sufficient training data are available, the closed-form (ridge regression) solution can be used:

$$M^{*} = P_T P_S^{\top}\left(P_S P_S^{\top} + \lambda I\right)^{-1}$$

[0052] where $I$ is the identity matrix, $\lambda$ is the regularization parameter, and $P_S$ and $P_T$ denote the matrices whose columns are the paired source and target training parameter vectors. As training data continue to arrive, the mapping matrix is updated online using stochastic gradient descent, with a decaying learning rate.
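The closed-form and online solution methods described above can be illustrated with a minimal NumPy sketch. The column-wise data layout, the function names, and the particular decay schedule are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def fit_mapping_ridge(P_src, P_tgt, lam=1e-2):
    """Closed-form ridge solution for M in P_tgt ~ M @ P_src.

    P_src: (n, N) array, one source parameter vector per column (assumed layout).
    P_tgt: (m, N) array, the paired target parameter vectors.
    Returns M with shape (m, n).
    """
    n = P_src.shape[0]
    gram = P_src @ P_src.T + lam * np.eye(n)
    # M @ gram = P_tgt @ P_src.T; gram is symmetric, so solve the transposed system.
    return np.linalg.solve(gram, P_src @ P_tgt.T).T

def sgd_update(M, p_src, p_tgt, lr, lam=1e-2):
    """One stochastic gradient step on ||p_tgt - M p_src||^2 + lam * ||M||_F^2."""
    residual = M @ p_src - p_tgt
    grad = 2.0 * np.outer(residual, p_src) + 2.0 * lam * M
    return M - lr * grad

def decayed_lr(lr0, step, decay=0.01):
    """Inverse-time learning rate decay (one possible decay strategy)."""
    return lr0 / (1.0 + decay * step)
```

With enough paired request-response samples the closed form is used once; `sgd_update` then refines M as new pairs arrive, with `decayed_lr` supplying the step size.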

[0053] (6) Cold start treatment

[0054] For new data sources that are being connected for the first time, the system addresses the cold start problem using the following strategies: First, it establishes an initial mapping by using semantic matching of parameter names; then, it calibrates the mapping matrix using a small number of validation samples (usually 5-10 request-response pairs); and finally, it automatically optimizes the mapping accuracy through a feedback mechanism during continuous operation.

[0055] It enables cross-system parameter conversion and supports multiple protocols such as RESTful, SOAP, and GraphQL; it can automatically connect to key data sources such as the National Pollution Source Monitoring Data Platform, Environmental Quality Monitoring Platform, Disease Prevention and Control Information System, and emerging pollutant monitoring data sources.

[0056] For example, if an environmental protection API identifies a substance by a pollutant code parameter while a health API uses a chemical substance ID parameter, the mapping matrix establishes the correspondence between the two and enables automatic conversion.

[0057] 1.2 Semi-supervised learning-enhanced intelligent web crawler engine system: An intelligent web crawler for unstructured and semi-structured data sources was developed. The key to the web crawler engine is pattern recognition and verification, which uses structure and content similarity calculation:

$$\text{Sim}_{\text{pattern}} = \alpha \cdot \text{Sim}_{\text{struct}} + (1 - \alpha) \cdot \text{Sim}_{\text{content}}$$

[0058] where $\text{Sim}_{\text{pattern}}$ is the pattern similarity, $\text{Sim}_{\text{struct}}$ is the structural similarity, $\text{Sim}_{\text{content}}$ is the content similarity, and α is the balancing parameter.

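[0059] Structural similarity calculation:

$$\text{Sim}_{\text{struct}} = 1 - \frac{\text{EditDistance}(\text{DOM}_A, \text{DOM}_B)}{\max\left(|\text{DOM}_A|, |\text{DOM}_B|\right)}$$

[0060] where EditDistance is the edit distance between the DOM trees (the minimum number of insertion, deletion, and replacement operations), and |DOM_A| and |DOM_B| are the numbers of nodes in the two DOM trees.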

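[0061] Content similarity calculation: Sentence-BERT is used to generate text semantic vectors (768 dimensions), and cosine similarity is computed:

$$\text{Sim}_{\text{content}} = \frac{\mathbf{v}_A \cdot \mathbf{v}_B}{\lVert \mathbf{v}_A \rVert \, \lVert \mathbf{v}_B \rVert}$$

[0062] where $\mathbf{v}_A$ and $\mathbf{v}_B$ are the 768-dimensional semantic vectors of the two webpage texts, $\cdot$ denotes the vector dot product, and $\lVert\cdot\rVert$ denotes the vector magnitude. The value range is [-1, 1]; the closer the value is to 1, the more similar the texts are.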

[0063] Adaptive adjustment of the balancing parameter α: For table-intensive web pages, α is set to 0.6 to 0.7 (emphasizing structure); for text-intensive web pages, α is set to 0.3 to 0.4 (emphasizing content); the optimal value is determined through validation set search or Bayesian optimization.
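As a rough sketch of how the pattern similarity could be computed, the example below combines a structural score and a content score with the balancing parameter α. Two simplifications are assumptions of this sketch: the DOM tree is flattened into a tag sequence so that ordinary sequence edit distance can stand in for tree edit distance, and the edit distance is normalized by the larger node count. The semantic vectors are taken as precomputed inputs (for example from Sentence-BERT), so no embedding model is called here.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def structural_similarity(tags_a, tags_b):
    """1 - normalized edit distance over flattened DOM tag sequences."""
    denom = max(len(tags_a), len(tags_b))
    if denom == 0:
        return 1.0
    return 1.0 - edit_distance(tags_a, tags_b) / denom

def content_similarity(vec_a, vec_b):
    """Cosine similarity between two precomputed text embeddings."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pattern_similarity(tags_a, tags_b, vec_a, vec_b, alpha=0.65):
    """alpha weights structure; (1 - alpha) weights content."""
    return (alpha * structural_similarity(tags_a, tags_b)
            + (1.0 - alpha) * content_similarity(vec_a, vec_b))
```

For a table-heavy page one would pass an α in the 0.6 to 0.7 range, and for a text-heavy page an α in the 0.3 to 0.4 range, as described above.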

[0064] Semi-supervised learning strategies: (1) Train an initial classifier using labeled samples (accounting for 3%-5%); (2) Use the initial classifier to predict unlabeled samples; (3) Select prediction results with a confidence level higher than 0.9 as pseudo-labels; (4) Add pseudo-labeled samples to the training set for iterative training until convergence or the maximum number of iterations is reached (typically 10-20 times).

[0065] The system can automatically collect unstructured pollution sources and environmental data, such as environmental impact assessment reports, pollutant discharge permit information, annual environmental statistics reports, and research reports on emerging pollutants.

[0066] 1.3 The historical data and report automatic parsing system enables automatic parsing of historical reports, extracting key data from documents such as PDF and Word. Table structure recognition uses a deep learning model:

$$R = F_{\text{struct}}(I) \oplus F_{\text{content}}(T)$$

[0067] where R is the recognition result, $F_{\text{struct}}$ is the table structure recognition network, I is the document image, $F_{\text{content}}$ is the content recognition network, T is the text content, and ⊕ is the feature fusion operator. The feature fusion employs at least one of feature concatenation, weighted summation, or attention-based weighted fusion. The detailed table recognition process is as follows: (1) Table structure recognition: a deep convolutional neural network extracts document image features, and an object detection architecture (such as the R-CNN series, the YOLO series, or other object detection networks) outputs table bounding box coordinates and cell segmentation results, from which the row and column index structure of the table is generated.

[0068] (2) Content recognition: the text region within each cell is located using a text detection algorithm (such as EAST, CTPN, or other text detection methods), and a sequence recognition network (such as CRNN, attention-based OCR, or Transformer-based OCR) then recognizes the text content and outputs the text string for each cell.

[0069] (3) Feature fusion strategy: the fusion operator ⊕ aligns table structure information and text content by cell position to generate structured table data, in the form of key-value pairs that map a (row index, column index) pair to the cell content. The fusion methods can include feature concatenation (joining the structural feature vector and the content feature vector end to end), weighted summation (adding the two types of feature vectors with learnable weights), or attention-based weighted fusion (computing correlation weights between the two types of features through cross-attention and then fusing them by those weights).

[0070] (4) Domain adaptation optimization: In view of the characteristics of documents in the field of environmental health, the system has built-in pollutant names, units of measurement and professional terminology dictionaries, and corrects OCR errors through the post-processing module. For example, it automatically corrects the incorrectly identified chemical formulas to the standard format.

[0071] The system is optimized for professional documents such as pollution discharge survey reports, environmental quality reports, health survey reports, and research reports on emerging pollutants.

[0072] (ii) Unified Data Processing Module

[0073] As shown in Figure 2, the unified data processing module corresponds to the data processing layer and uses a unified framework to process all types of data. It comprises five sub-modules: a general data quality control system, a pollution source data-specific processing module, an environmental monitoring data-specific processing module, a medical and health data processing module, and an emerging pollutant data processing module. The unified data processing module is the core processing component of the system, responsible for cleaning, standardizing, and type-specific processing of multi-source heterogeneous data. The sub-modules work together to form a complete processing chain from raw data to high-quality processed data.

[0074] The system operation flow is as follows: First, a general data quality control system performs unified outlier detection, missing value handling, and format standardization on all types of data. Then, based on data type, specialized processing modules are used for pollution source data, environmental monitoring data, medical and health data, and emerging pollutant data. Finally, the processed data are organized according to a unified data model, outputting standardized, high-quality data for subsequent fusion analysis. The main core components include: 2.1 General Data Quality Control System: The system implements quality control functions applicable to all data types, and outlier detection employs a multi-level scoring model.

$$S_{\text{anomaly}} = w_1 \cdot S_{\text{stat}} + w_2 \cdot S_{\text{ml}} + w_3 \cdot S_{\text{rule}}$$

[0075] where $S_{\text{anomaly}}$ is the anomaly score, $S_{\text{stat}}$ is the score from statistical methods, $S_{\text{ml}}$ is the score from machine learning methods, $S_{\text{rule}}$ is the score from domain rules, and $w_1$, $w_2$, $w_3$ are weighting coefficients. The system automatically selects the optimal algorithm combination based on the data type.

[0076] Detailed methods for outlier detection: (1) Statistical methods: classical statistical methods include the Z-score method (suitable for normally distributed data) and the IQR method (suitable for non-normally distributed data). The Z-score method calculates outlier scores based on the standard deviation, while the IQR method identifies outliers based on the interquartile range.

[0077] (2) Machine learning methods

[0078] This includes unsupervised anomaly detection algorithms such as Isolation Forest and Autoencoder. Isolation Forest constructs a decision tree ensemble by randomly partitioning the feature space, while Autoencoder identifies anomalous patterns through reconstruction errors.

[0079] (3) Domain rule method

[0080] This is a professional knowledge base based on environmental health indicators, containing rules governing the normal ranges of pollutants and health indicators. Examples include expert knowledge in areas such as normal pH ranges, reasonable temperature ranges, and pollutant concentration thresholds.

[0081] (4) Adaptive weight adjustment

[0082] Automatically select the optimal weight combination based on data type: For indicators with clearly defined normal ranges (such as pH value and temperature), increase the weight of domain rules; for indicators without clearly defined ranges but with obvious distribution characteristics (such as pollutant concentration), increase the weight of statistical methods; for complex correlation indicators (such as multi-indicator comprehensive evaluation), increase the weight of machine learning methods.

[0083] Anomaly handling strategies include marking (preserving the original value but adding an anomaly marker), replacement (replacing the anomaly value with a reasonable value), and deletion (for severe anomalies), which are automatically selected based on the severity of the anomaly and the importance of the data.
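The multi-level scoring idea can be sketched as a weighted sum of three per-sample scores. In this hedged example the statistical score is a scaled Z-score, the domain rule score is a simple range check, and the machine learning score is passed in as a precomputed array (for instance from an Isolation Forest), since the source does not fix a specific model. The function names, the scaling of the Z-score, and the default weights are all illustrative assumptions.

```python
import numpy as np

def stat_score(x):
    """Z-score based score, scaled so that |z| >= 3 maps to 1.0."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)
    return np.clip(np.abs(x - x.mean()) / std / 3.0, 0.0, 1.0)

def rule_score(x, low, high):
    """Domain rule score: 1.0 outside the allowed range, else 0.0."""
    x = np.asarray(x, dtype=float)
    return ((x < low) | (x > high)).astype(float)

def anomaly_score(s_stat, s_ml, s_rule, weights=(0.3, 0.2, 0.5)):
    """Weighted combination of the three scores; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[0] * np.asarray(s_stat, dtype=float)
            + w[1] * np.asarray(s_ml, dtype=float)
            + w[2] * np.asarray(s_rule, dtype=float))
```

For an indicator with a well-defined normal range such as pH, the rule weight would be raised, which matches the adaptive weighting described above.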

[0084] Missing value handling strategy: The system implements adaptive missing value imputation based on data characteristics.

[0085] (1) Missing pattern analysis

[0086] Three missing-data patterns are distinguished: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). A missingness map visualization is used for diagnosis.

[0087] (2) Imputation method library

[0088] The library includes the following methods: statistical imputation (mean, median, or mode imputation), suitable for simple scenarios; KNN imputation, based on the K-nearest neighbor algorithm with Euclidean distance or other metrics, suitable for scenarios with multivariate correlation; multiple imputation, which iterates with the chained equations method to generate multiple imputed datasets, suitable for complex missing-data patterns; and time-series-specific algorithms, including Kalman filtering, ARIMA models, and linear interpolation, suitable for time series data.

[0089] (3) Adaptive method selection

[0090] For continuous variables, regression methods are used; for categorical variables, voting methods or conditional probability prediction are used. The quality of imputation is evaluated using cross-validation, and the optimal imputation method is selected using indicators such as RMSE, MAE, and R².

[0091] Standardization conversion: The system maintains a complete coding mapping table and unit conversion library for the environmental health domain.

[0092] (1) Coding standardization

[0093] It covers a variety of common coding systems, including: pollutant coding: CAS number, national standard coding, etc.; disease coding: international disease classification standards such as ICD-10 and ICD-11; biomarker coding: biomedical database coding such as HMDB ID and ChEBI ID.

[0094] (2) Unit conversion

[0095] It supports multiple measurement unit systems and their conversion relationships, including: concentration units: mg / L, μg / L, ppm, ppb, etc.; count units: cells / mL, CFU / 100mL, etc.; dosage units: mg / kg·day, μg / m³, etc.

[0096] (3) Format standardization

[0097] Standardize key data formats, including date and time (standardized to ISO 8601), geographic coordinates (standardized to WGS84 coordinate system), and text encoding (standardized to UTF-8).
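A small sketch of unit and format standardization is given below. The conversion table covers only mass-per-volume concentration units, because converting ppm or ppb depends on the medium and the substance. The table name, the function names, the input timestamp format, and the assumption that naive timestamps are in UTC are all illustrative choices, not details from the source.

```python
from datetime import datetime, timezone

# Conversion factors to the base unit mg/L (mass-per-volume concentrations only).
TO_MG_PER_L = {"g/L": 1e3, "mg/L": 1.0, "ug/L": 1e-3, "ng/L": 1e-6}

def convert_concentration(value, src_unit, dst_unit):
    """Convert a concentration between units listed in TO_MG_PER_L."""
    return value * TO_MG_PER_L[src_unit] / TO_MG_PER_L[dst_unit]

def to_iso8601(text, fmt="%Y/%m/%d %H:%M"):
    """Parse a naive timestamp (assumed to be UTC) and return an ISO 8601 string."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return parsed.isoformat()
```

Unknown units raise a `KeyError`, which lets the caller route the record to manual review instead of silently guessing a factor.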

[0098] 2.2 The pollution source data-specific processing system implements specific processing functions for pollution source data, providing reliable emission data for environmental health risk assessment. This submodule includes four functional units: pollution source classification and identification, emission parameter standardization, emission inventory construction, and spatiotemporal feature processing.

[0099] Pollution source classification and identification: This unit enables automatic identification and classification of pollution source types.

[0100] (1) Classification system

[0101] It adopts a multi-level structure, with the first-level classification including industrial sources, agricultural sources, mobile sources, residential sources, and natural sources, and the second-level classification further subdivided into specific subcategories.

[0102] (2) Classification model

[0103] An ensemble learning approach, including algorithms such as Random Forest and XGBoost, is employed to fuse prediction results through a soft voting mechanism. Model parameters are adaptively determined through cross-validation.

[0104] Emission parameter standardization unit: This unit standardizes the parameter representation of pollution source data from different sources.

[0105] (1) Parameter identification and mapping

[0106] Identify key emission parameters (such as flue gas volume, fuel consumption, product output, etc.) in the source data and map them to a standard parameter system.

[0107] (2) Unit conversion

[0108] Convert emission parameters from different unit systems to a unified standard, such as converting tons / year and kilograms / day into a unified mass flow unit.

[0109] Emissions inventory construction: This unit calculates pollutant emissions based on activity levels and emission factors. The emission calculation uses a general formula:

$$E_i = AL \times EF \times (1 - \eta)$$

[0110] where $E_i$ is the emission amount of pollutant i, AL is the activity level (such as fuel consumption or product output), EF is the emission factor, and η is the emission reduction efficiency.

[0111] (1) Emission factor library

[0112] The system maintains a database of emission factors covering various pollution source types and pollutants, and supports queries by industry, process, fuel type, and other dimensions.

[0113] (2) Estimation method

[0114] A bottom-up (based on detailed activity level data) and top-down (based on macro statistical data) emission estimation method is implemented, and the optimal solution is selected based on data availability.
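The general emission formula (activity level times emission factor times one minus the reduction efficiency) can be sketched as follows, together with a bottom-up aggregation over individual sources. The record field names are hypothetical, and any numeric values passed in are illustrative rather than real emission factors.

```python
def emission(activity_level, emission_factor, removal_efficiency=0.0):
    """E = AL * EF * (1 - eta); removal_efficiency must lie in [0, 1]."""
    if not 0.0 <= removal_efficiency <= 1.0:
        raise ValueError("removal_efficiency must be between 0 and 1")
    return activity_level * emission_factor * (1.0 - removal_efficiency)

def build_inventory(sources):
    """Bottom-up inventory: sum emissions per pollutant over source records.

    Each record is a dict with the hypothetical keys 'pollutant',
    'activity_level', 'emission_factor' and, optionally, 'removal_efficiency'.
    """
    totals = {}
    for src in sources:
        e = emission(src["activity_level"], src["emission_factor"],
                     src.get("removal_efficiency", 0.0))
        totals[src["pollutant"]] = totals.get(src["pollutant"], 0.0) + e
    return totals
```

The units of the result follow from the inputs: an activity level in tonnes of fuel and an emission factor in kg per tonne give emissions in kg.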

[0115] Spatiotemporal Feature Processor: This unit processes the temporal and spatial features of pollution sources.

[0116] (1) Time feature processing

[0117] Extract the temporal patterns of emissions (such as hourly variation coefficients, seasonal variation coefficients, etc.) to support the temporal allocation of continuous and intermittent sources.

[0118] (2) Spatial feature processing

[0119] It processes the spatial representation of point sources, area sources, and line sources, and performs geocoding and spatial index construction.

[0120] The system particularly supports the processing of historical pollution source information, enabling the systematic integration of existing and historical pollution source information.

[0121] 2.3 Environmental Monitoring Data Specific Processing Module: This submodule aims to address the specialized processing needs of monitoring data for various environmental media, providing reliable exposure data for environmental health risk assessment. This submodule includes four functional units: environmental media identification, monitoring equipment calibration, environmental standard adaptation, and environmental quality assessment. Environmental Media Identification: This unit automatically identifies the type of environmental media to which the monitoring data belongs.

[0122] (1) Media classification system

[0123] It includes major environmental media such as air, water bodies (surface water, groundwater, drinking water), soil, sediments, and organisms.

[0124] (2) Identification method

[0125] A multi-feature fusion strategy is adopted, comprehensively considering the characteristics of the monitored project, the monitoring method, and the metadata. The classification model uses random forest or other machine learning algorithms.

[0126] Monitoring equipment calibration unit: This unit handles data differences generated by different monitoring equipment and methods.

[0127] (1) Device type identification

[0128] Differentiate between different types of monitoring equipment, such as national control stations (high-precision standard equipment) and micro stations (low-cost sensors).

[0129] (2) Correction method

[0130] For low-cost sensor data, linear regression and machine learning methods are employed, with data from national control stations used as the calibration reference, to improve data consistency. Environmental factor correction accounts for the influence of environmental parameters such as temperature, humidity, and air pressure on the monitored values, based on physical models and empirical formulas. The calibration model takes the general form $C_{\text{true}} = f(C_{\text{meas}}, \mathbf{P}_{\text{env}}, \mathbf{P}_{\text{inst}})$, where $C_{\text{true}}$ is the corrected true concentration, $C_{\text{meas}}$ is the measured value, $\mathbf{P}_{\text{env}}$ is the environmental parameter vector, and $\mathbf{P}_{\text{inst}}$ is the instrument parameter vector.
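One simple instance of the calibration function is a multiple linear regression that predicts the reference-station concentration from the raw low-cost sensor reading plus temperature and relative humidity. The sketch below fits such a model with ordinary least squares. The choice of a linear form, the two environmental covariates, and the function names are assumptions made for illustration; a real deployment might use a nonlinear or machine learning model instead.

```python
import numpy as np

def _design(raw, temp, rh):
    """Design matrix with intercept, raw reading, temperature, and humidity."""
    raw = np.asarray(raw, dtype=float)
    return np.column_stack([np.ones_like(raw), raw,
                            np.asarray(temp, dtype=float),
                            np.asarray(rh, dtype=float)])

def fit_calibration(raw, temp, rh, reference):
    """Least-squares fit of reference ~ b0 + b1*raw + b2*temp + b3*rh."""
    coef, *_ = np.linalg.lstsq(_design(raw, temp, rh),
                               np.asarray(reference, dtype=float), rcond=None)
    return coef

def apply_calibration(coef, raw, temp, rh):
    """Apply fitted coefficients to new sensor readings."""
    return _design(raw, temp, rh) @ coef
```

The coefficients are fitted on periods when a micro station is co-located with a reference station and then applied to the micro station's later readings.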

[0131] Environmental Standard Adapter: This unit interfaces the monitoring data with the corresponding environmental quality standards.

[0132] (1) Standard Library Management

[0133] Maintain national and local environmental quality standards databases, including air quality standards, water environmental quality standards, soil environmental quality standards, etc.

[0134] (2) Standard matching

[0135] The system automatically matches applicable environmental quality standards based on the monitoring medium, indicator type, and functional zoning.

[0136] Environmental quality assessment: This unit assesses the environmental quality status based on environmental quality standards.

[0137] (1) Evaluation method

[0138] It includes various evaluation methods such as single-factor evaluation and comprehensive index method.

[0139] (2) Air quality assessment

[0140] It supports AQI (Air Quality Index) calculation, primary pollutant identification, and statistics on days meeting air quality standards, covering conventional pollutants such as SO2, NO2, PM10, PM2.5, O3, and CO.

[0141] (3) Water quality assessment

[0142] It supports single-factor evaluation and water quality category determination, covering multiple water quality indicators such as pH, dissolved oxygen, permanganate index, ammonia nitrogen, and total phosphorus.

[0143] 2.4 The Medical and Health Data Processing Module aims to address the needs for professional processing and privacy protection of medical and health data, providing reliable health effect data for environmental health correlation analysis. This sub-module includes functions such as data sensitivity assessment, tiered privacy protection processing, utility loss assessment, medical coding standardization, and health indicator calculation.

[0144] Data sensitivity assessment: Automatically assess the sensitivity of health data to provide a basis for selecting privacy protection strategies.

[0145] (1) Sensitivity grading

[0146] A three-tier system is adopted: Level 1 is low sensitivity (summary statistical data, desensitized survey data), Level 2 is medium sensitivity (de-identified diagnostic data, clinical indicators), and Level 3 is high sensitivity (personal health data containing direct or indirect identifiers).

[0147] (2) Evaluation factors

[0148] This includes identification information (direct identifiers such as name and ID number, and indirect identifiers such as postal code and date of birth), health information sensitivity (general health status, special disease information, mental health, and genetic information), and data granularity (individual level, group level, and population level).

[0149] (3) Evaluation methods

[0150] Decision trees, random forests, or other machine learning algorithms are used to automatically assess data sensitivity levels by considering multiple factors. The assessment results are then combined with the intended use of the data to determine an appropriate level of privacy protection.

[0151] Tiered privacy protection: different levels of privacy protection measures are implemented according to the sensitivity of the data. As shown in Figure 4, the system implements a multi-layered privacy protection framework.

[0152] (1) Level 1 protection for low-sensitivity data (basic de-identification)

[0153] Remove direct identifiers (name, ID number, contact information, etc.) and generalize the data (e.g., replace a specific age with an age range and a specific date with a year and month). Suppress rare attribute values that could lead to re-identification.

[0154] (2) Level 2 protection for moderately sensitive data (K-anonymization and selective differential privacy)

[0155] Ensure that any record is indistinguishable from at least K-1 other records (K≥5), achieving K-anonymity through generalization and suppression techniques. Apply differential privacy protection to sensitive statistics, with a privacy budget ε set to a moderate value (e.g., ε=1.0).

[0156] (3) Level 3 protection for highly sensitive data (fully differential privacy and data aggregation)

[0157] Strict differential privacy protection is applied to all queries, with a privacy budget ε of 0.1. Only highly aggregated statistical results are provided, and no individual or group-level data is output.
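The generalization and suppression steps used for K-anonymity can be sketched in a few lines. In this example, direct identifiers are dropped, age is coarsened into ten-year bands, the postal code is truncated to a prefix, and any group of records sharing the same quasi-identifiers that has fewer than K members is suppressed. The field names, the band width, and the prefix length are hypothetical choices for illustration.

```python
from collections import Counter

QUASI_IDS = ("age_band", "region")

def generalize_age(age, width=10):
    """Map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_record(record):
    """Drop direct identifiers and coarsen quasi-identifiers."""
    return {
        "age_band": generalize_age(record["age"]),
        "region": record["postcode"][:3],
        "diagnosis": record["diagnosis"],
    }

def group_sizes(records, quasi_ids=QUASI_IDS):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_ids) for r in records)

def is_k_anonymous(records, k=5, quasi_ids=QUASI_IDS):
    """True if every quasi-identifier group has at least k records."""
    return all(n >= k for n in group_sizes(records, quasi_ids).values())

def suppress_small_groups(records, k=5, quasi_ids=QUASI_IDS):
    """Remove records whose quasi-identifier group is smaller than k."""
    sizes = group_sizes(records, quasi_ids)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_ids)] >= k]
```

Suppression trades completeness for safety: records in small groups are lost, so coarser generalization is usually tried first when too many records would be removed.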

[0158] Differential privacy implementation methods: The system implements multiple differential privacy mechanisms, satisfying the (ε,δ)-differential privacy definition:

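$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta$$

[0159] where ε is the privacy budget, δ is the relaxation parameter, $\mathcal{M}$ is the privacy protection mechanism, D and D' are any two datasets that differ by one record, and S is any set of possible outputs. A mechanism that satisfies this inequality meets the (ε,δ)-differential privacy protection requirement.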

[0160] (1) Laplace mechanism

[0161] Adds Laplace-distributed noise to numerical query results, with a noise scaling parameter of Δf / ε, where Δf is the global sensitivity of the query function. Suitable for statistical queries such as counting, summing, and averaging.

[0162] (2) Gaussian mechanism

[0163] Noise following a Gaussian distribution is added to the query results, with the noise variance σ² related to the privacy budget ε and the relaxation parameter δ. Better utility than the Laplace mechanism can be achieved when a smaller δ is allowed.

[0164] (3) Exponential mechanism

[0165] It is suitable for non-numerical queries (such as selecting the optimal model or returning the top K results). It selects an output with a probability that grows exponentially with the value of a utility function, so that higher-utility outputs are selected with higher probability.

[0166] (4) Budget Management

[0167] Privacy budget composition theorems are used for cross-query budget allocation. For k queries, the total privacy loss is calculated using the basic composition theorem, the advanced composition theorem, or the moments accountant method, ensuring that the overall privacy guarantee is not compromised after multiple queries. The system maintains a budget ledger and rejects queries when the remaining budget is insufficient.
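The Laplace mechanism and the budget ledger can be combined in a short sketch. Noise is drawn from a Laplace distribution with scale Δf / ε, and each query charges its ε against a total budget under basic composition, where the ε values of successive queries simply add up. The class and function names are hypothetical, and advanced composition or the moments accountant would give tighter accounting than shown here.

```python
import numpy as np

class PrivacyBudgetLedger:
    """Tracks cumulative epsilon under basic (additive) composition."""

    def __init__(self, total_epsilon):
        self.total = float(total_epsilon)
        self.spent = 0.0

    def charge(self, epsilon):
        """Record a query's epsilon, or refuse if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted; query rejected")
        self.spent += epsilon

def laplace_mechanism(true_value, sensitivity, epsilon, ledger, rng):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    ledger.charge(epsilon)
    scale = sensitivity / epsilon
    return float(true_value + rng.laplace(loc=0.0, scale=scale))
```

A counting query has global sensitivity 1, so with ε = 0.1 the noise scale is 10, while with ε = 1.0 it is 1; this is the utility cost of the stricter Level 3 setting.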

[0168] Utility loss assessment: Assessing the impact of privacy protection measures on data availability.

[0169] Formula for calculating utility loss:

$$L_{\text{utility}} = w_1 \cdot D_{\text{stat}}(O, P) + w_2 \cdot D_{\text{ml}}(O, P)$$

[0170] where $L_{\text{utility}}$ is the overall utility loss, $D_{\text{stat}}$ is the statistical distribution distance (KL divergence, JS divergence, Wasserstein distance, etc. can be used), $D_{\text{ml}}$ is the performance degradation in machine learning tasks (such as decreased classification accuracy or decreased regression R²), O is the original data, P is the protected data, and $w_1$ and $w_2$ are weighting coefficients satisfying $w_1 + w_2 = 1$. The system keeps the utility loss within an acceptable range, achieving a balance between privacy protection and data availability.
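A possible way to compute the utility loss is sketched below, using the Jensen-Shannon divergence between histograms of the original and protected data as the statistical distance and the relative drop in a task metric as the machine learning term. Using histograms as input, taking logarithms in base 2 (which bounds the divergence in [0, 1]), and defining the performance drop as a relative decrease are assumptions of this sketch.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def utility_loss(orig_hist, prot_hist, orig_metric, prot_metric,
                 w_stat=0.5, w_ml=0.5):
    """Weighted sum of distribution distance and relative metric degradation."""
    d_stat = js_divergence(orig_hist, prot_hist)
    d_ml = max(0.0, (orig_metric - prot_metric) / orig_metric) if orig_metric > 0 else 0.0
    return w_stat * d_stat + w_ml * d_ml
```

Both terms lie between 0 and 1 under these choices, so the weighted loss is also between 0 and 1 and can be compared against a fixed acceptance threshold.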

[0171] Medical coding standardization: addressing heterogeneous coding systems used by different medical institutions and data sources to achieve unified standardized coding.

[0172] (1) Supported encoding systems

[0173] This includes disease codes (ICD-10, ICD-11, ICD-9-CM, etc.), surgical procedure codes (ICD-9-CM procedure codes, ICD-10-PCS, etc.), drug codes (ATC codes, NDC codes, etc.), and biomarker codes (HMDB ID, ChEBI ID, LOINC codes, etc.).

[0174] (2) Encoding recognition and conversion

[0175] The system automatically identifies the encoding scheme in the source data using pattern matching, rule engines, and machine learning methods. It maintains an encoding mapping table and a transformation rule base to enable conversion between different encoding schemes, including one-to-one mapping, one-to-many mapping, many-to-one mapping, and approximate mapping.

[0176] (3) Coding quality control

[0177] Verify the validity of the codes and detect code conflicts. For records with missing codes but text descriptions, use natural language processing technology to automatically extract the disease name and map it to the standard code.

[0178] Health indicator calculation: Calculate standardized health indicators commonly used in environmental health research to ensure the comparability of data from different sources.

[0179] (1) Biomarker calibration

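[0180] Urinary biomarkers are corrected for urinary creatinine ($C_{\text{corr}} = C_{\text{meas}} / C_{\text{crea}}$), where $C_{\text{corr}}$ is the corrected concentration, $C_{\text{meas}}$ is the measured concentration, and $C_{\text{crea}}$ is the urinary creatinine concentration; lipid-soluble contaminants in blood are corrected for lipid levels ($C_{\text{lipid-corr}} = C_{\text{meas}} / C_{\text{lipid}}$), where $C_{\text{lipid}}$ is the blood lipid concentration. These corrections eliminate the influence of urine dilution and blood lipid levels.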

[0181] (2) Exposure dose estimation

[0182] Environmental exposure doses are inferred from biomonitoring data. Internal exposure doses are estimated by using biomarker concentrations and human physiological parameters, and then external environmental exposure doses are inferred from internal exposure doses using pharmacokinetic models, integrating multiple exposure routes such as inhalation, dietary intake, and skin contact.

[0183] (3) Health effect score

[0184] Calculate comprehensive health indicators such as symptom scores, functional scores (e.g., lung function FEV1%), and disease risk scores to quantify the health impact of environmental pollution.

[0185] All health indicators were calculated in accordance with relevant international and domestic standards to ensure that data from different studies were comparable.

[0186] 2.5 Emerging Pollutant Data Processing Module: This module aims to address the specific treatment needs of emerging environmental risk substances such as microplastics and perfluorinated compounds. This submodule comprises four functional units: emerging pollutant identification, detection limit treatment, multi-media migration modeling, and exposure pathway analysis.

[0187] Emerging pollutant identification: This unit enables the identification, classification, and standardization of emerging pollutant data.

[0188] (1) Pollutant knowledge base

[0189] It contains detailed information on a variety of emerging pollutants, categorized into major groups such as microplastics (classified by polymer type such as PE, PP, PET and size), perfluorinated compounds (PFAS) (classified by carbon chain length and functional groups), endocrine disruptors (classified by mechanism of action), novel flame retardants, antibiotics and drug residues, etc.

[0190] (2) Data identification and matching

[0191] Emerging pollutants are identified using multiple identifiers, including compound name, CAS number, molecular formula, and chemical structure. Fuzzy matching and synonym recognition are supported to handle the different naming conventions of the various data sources. Chemical fingerprints (such as molecular fingerprints and SMILES strings) are used for structural similarity matching to identify novel pollutants not yet in the knowledge base. The specific method for chemical fingerprint matching is as follows: the compound to be identified is converted into a standardized SMILES string or a molecular fingerprint vector (such as a Morgan fingerprint or a MACCS keys fingerprint), and its Tanimoto coefficient with each known compound in the knowledge base is calculated. When the Tanimoto coefficient is greater than a preset similarity threshold (default 0.85, adjustable according to the application scenario), the compound is judged to be the same substance as, or a homologue of, the known compound, and it inherits the physicochemical and toxicological parameters of that compound.

[0192] (3) Parameterization of material properties

[0193] It supports the automatic extraction and standardization of physicochemical properties (such as solubility, octanol-water partition coefficient, vapor pressure), environmental behavior parameters (such as half-life, bioaccumulation coefficient), and toxicological characteristics (such as toxicity threshold, mechanism of action), providing parameter support for subsequent multi-media migration simulation and risk assessment.
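The fingerprint matching step reduces to a Tanimoto comparison between sets of "on" bits. The sketch below assumes that fingerprints are already available as sets of bit indices (for example as produced by a cheminformatics toolkit); generating Morgan or MACCS fingerprints from SMILES is outside its scope. The function names and the knowledge base layout are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A & B| / |A | B| between two sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

def match_known(fp, knowledge_base, threshold=0.85):
    """Return (name, score) of the best match above threshold, else (None, score).

    knowledge_base maps a compound name to its fingerprint (set of bit indices).
    """
    best_name, best_score = None, 0.0
    for name, kb_fp in knowledge_base.items():
        score = tanimoto(fp, kb_fp)
        if score > best_score:
            best_name, best_score = name, score
    if best_score > threshold:
        return best_name, best_score
    return None, best_score
```

A compound that matches above the threshold would inherit the stored parameters of the matched entry; one that does not would be flagged as a candidate novel pollutant.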

[0194] Detection Limit Processor: This unit is specifically designed to handle data below the detection limit (LOD), implementing a multi-strategy detection limit processing framework. Detection Limit Processing Methods: (1) Detection rate analysis Calculate the detection rate:

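$$R_{\text{detect}} = \frac{N_{\text{detect}}}{N_{\text{total}}} \times 100\%$$

[0195] where $R_{\text{detect}}$ is the detection rate, $N_{\text{detect}}$ is the number of samples in which the pollutant is detected, and $N_{\text{total}}$ is the total number of samples.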

[0196] (2) Adaptive strategy selection

[0197] The processing strategy is selected based on the detection rate: For low detection rates (below 50%), the LOD / 2 substitution method is used, setting the concentration of undetected samples at half the detection limit. For medium detection rates (50%-80%), the maximum likelihood estimation method is used. Assuming the data follows a log-normal distribution, a likelihood function with censored data is constructed, and the distribution parameters are iteratively solved using the EM algorithm or other optimization methods to estimate the conditional expectation of undetected samples. For high detection rates (above 80%), regression imputation is used. Auxiliary variables (such as time, spatial location, concentration of related pollutants, environmental conditions, etc.) are used as independent variables to establish multiple linear regression, random forest regression, or other predictive models to estimate the concentration of undetected samples.

[0198] (3) Quantification of uncertainty

[0199] The confidence interval of the processed results is estimated using the Bootstrap resampling method, and the detection-limit uncertainty, U_LOD, is taken as the half-width of that confidence interval or as the standard deviation of the Bootstrap estimates. This uncertainty is used in the subsequent fusion quality assessment.
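The detection limit workflow above (detection rate, strategy selection, LOD/2 substitution, Bootstrap uncertainty) can be sketched as follows. This is a minimal illustration under stated assumptions: non-detects are represented as `None`, the Bootstrap statistic is the sample mean, and all function names are hypothetical. The censored maximum likelihood and regression imputation branches are only named, not implemented.

```python
import random
import statistics


def detection_rate(values, lod):
    """Detection rate in percent; None marks a non-detect."""
    detected = sum(1 for v in values if v is not None and v >= lod)
    return 100.0 * detected / len(values)


def select_strategy(rate_percent):
    """Thresholds follow the text: below 50 %, 50-80 %, above 80 %."""
    if rate_percent < 50.0:
        return "lod_half_substitution"
    if rate_percent <= 80.0:
        return "censored_mle"
    return "regression_imputation"


def substitute_lod_half(values, lod):
    """Replace every non-detect by half the detection limit."""
    return [lod / 2.0 if (v is None or v < lod) else v for v in values]


def bootstrap_u_lod(values, n_boot=2000, level=0.95, seed=0):
    """Percentile Bootstrap interval of the mean; returns (lower, upper, half_width)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * (n_boot - 1))]
    upper = means[int((1.0 - alpha) * (n_boot - 1))]
    return lower, upper, (upper - lower) / 2.0


samples = [None, 0.8, None, None, 1.2]   # illustrative concentrations
lod = 0.5
strategy = select_strategy(detection_rate(samples, lod))
if strategy == "lod_half_substitution":
    filled = substitute_lod_half(samples, lod)
    lower, upper, u_lod = bootstrap_u_lod(filled)
    print(strategy, filled, round(u_lod, 3))
```

The half-width returned by `bootstrap_u_lod` plays the role of U_LOD in the later fusion quality assessment.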

[0200] Multi-media migration model: This unit simulates the migration and transformation processes of emerging pollutants in the environment.

[0201] (1) Classification of environmental media

[0202] The environment is divided into major environmental media such as air, water (surface water and groundwater), soil (top and deep layers), sediments, and organisms.

[0203] (2) Migration process

[0204] This includes interphase transport (adsorption / desorption, volatilization / sedimentation), degradation and transformation (biodegradation, photochemical degradation, hydrolysis), and spatial migration (diffusion, advection).

[0205] (3) Mathematical model

[0206] The concentration changes of pollutants in various environmental media are described using a set of differential equations based on mass conservation:

[0207] $$\frac{dC_i}{dt} = \sum_{j \neq i} k_{ji} C_j - \sum_{j \neq i} k_{ij} C_i + E_i - k_{d,i} C_i$$

where $C_i$ is the concentration of the pollutant in environmental medium i (unit: mg/L or μg/m³), $k_{ji}$ is the transfer rate constant from environmental medium j to environmental medium i (unit: day⁻¹), $k_{ij}$ is the transfer rate constant from environmental medium i to environmental medium j (unit: day⁻¹), $E_i$ is the emission rate into environmental medium i (unit: mg/(L·day)), and $k_{d,i}$ is the degradation rate constant in environmental medium i (unit: day⁻¹). The first term on the right-hand side of the equation represents the flux flowing in from other media, the second term represents the flux flowing out to other media, the third term represents external emissions, and the fourth term represents degradation losses.

[0208] (4) Determination of model parameters

[0209] The parameters in the multi-media migration model are determined as follows. The transfer rate constants $k_{ij}$ are primarily derived from experimental measurements and literature reports in the field of environmental chemistry; the system includes a built-in database of transfer rate parameters for common emerging pollutants under typical environmental conditions. The degradation rate constants $k_{d,i}$ are derived from laboratory degradation kinetics studies or from environmental half-life data through $k_{d,i} = \ln 2 / t_{1/2}$ (where $t_{1/2}$ is the half-life). The emission rates $E_i$ are derived from emission inventories or measured data. For cases with significant parameter uncertainty, the system supports parameter range input and Monte Carlo simulation: it generates parameter combinations through random sampling and runs multiple simulations to obtain the probability distribution and confidence interval of the predicted concentration.

[0210] (5) Model Solving

[0211] Numerical methods (such as the Runge-Kutta method, finite difference method, etc.) are used to solve the differential equation system to obtain the concentration distribution of pollutants in different environmental media over time and space.
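A minimal sketch of the model solving step, assuming a classical fourth-order Runge-Kutta scheme with a fixed step and a small set of well-mixed media. The rate constants in the example are illustrative, not values from the system's parameter database, and the function names are hypothetical. Here `k[i][j]` denotes the transfer rate constant from medium i to medium j.

```python
import math


def degradation_constant(half_life_days):
    """First-order degradation rate constant from a half-life: ln 2 / t_half."""
    return math.log(2.0) / half_life_days


def derivatives(conc, k, emission, k_deg):
    """Right-hand side: inflow - outflow + emission - degradation for each medium."""
    n = len(conc)
    out = []
    for i in range(n):
        inflow = sum(k[j][i] * conc[j] for j in range(n) if j != i)
        outflow = sum(k[i][j] for j in range(n) if j != i) * conc[i]
        out.append(inflow - outflow + emission[i] - k_deg[i] * conc[i])
    return out


def rk4_step(conc, dt, k, emission, k_deg):
    """One classical Runge-Kutta step of size dt."""
    def shifted(slope, factor):
        return [c + factor * s for c, s in zip(conc, slope)]

    k1 = derivatives(conc, k, emission, k_deg)
    k2 = derivatives(shifted(k1, dt / 2.0), k, emission, k_deg)
    k3 = derivatives(shifted(k2, dt / 2.0), k, emission, k_deg)
    k4 = derivatives(shifted(k3, dt), k, emission, k_deg)
    return [c + dt / 6.0 * (a + 2.0 * b + 2.0 * g + d)
            for c, a, b, g, d in zip(conc, k1, k2, k3, k4)]


def simulate(conc0, days, dt, k, emission, k_deg):
    """Integrate from t = 0 to t = days and return the final concentrations."""
    conc = list(conc0)
    for _ in range(int(round(days / dt))):
        conc = rk4_step(conc, dt, k, emission, k_deg)
    return conc


# Two media (0 = water, 1 = sediment) with illustrative constants in day^-1.
k = [[0.0, 0.1], [0.05, 0.0]]
final = simulate([1.0, 0.0], days=10.0, dt=0.1, k=k,
                 emission=[0.0, 0.0], k_deg=[0.0, 0.0])
print([round(c, 4) for c in final])
```

With no emission and no degradation the total concentration stays constant, which gives a quick sanity check of the mass balance. A Monte Carlo run would call `simulate` repeatedly with sampled parameter sets.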

[0212] Exposure Pathway Analyzer: This unit analyzes human exposure to emerging pollutants.

[0213] (1) Identification of exposure pathways

[0214] Identify the main routes of exposure, including inhalation, dietary intake (drinking water, food), and skin contact.

[0215] (2) Exposure estimation

[0216] Based on parameters such as environmental concentration, exposure frequency, exposure time, and intake rate, the exposure dose through different routes is estimated.

[0217] (3) Overall Exposure Assessment

[0218] The system integrates exposures from multiple pathways to assess the overall exposure level and provides foundational data for health risk assessment. The exposure parameter library contains exposure factors for different populations (adults, children, pregnant women, etc.), such as intake rate, inhalation rate, and skin surface area, with reference to the "Handbook of Exposure Parameters for the Chinese Population" and the US EPA Exposure Factors Handbook.

[0219] Exposure is calculated with the multi-pathway formula for the chronic daily exposure dose (mg/(kg·day)):

[0220] $$CDI = \sum_i \frac{C_i \times IR_i \times EF \times ED \times ABS_i \times CF}{BW \times AT}$$

where CDI is the total chronic daily exposure dose (mg/(kg·day)); $C_i$ is the concentration of the pollutant in medium i (mg/L, mg/m³, or mg/kg); $IR_i$ is the intake/inhalation rate of medium i (L/day, m³/day, or mg/day); EF is the exposure frequency (days/year); ED is the exposure duration (years); $ABS_i$ is the absorption fraction of medium i (dimensionless, 0-1); CF is the conversion factor (dimensionless); BW is the body weight (kg); and AT is the averaging time (days), taken as ED×365 for non-carcinogenic effects and 70×365 for carcinogenic effects. The system also supports probabilistic exposure assessment: Monte Carlo simulation assigns probability distributions to the above parameters (such as $C_i$, $IR_i$, EF, ED, and BW), calculates the probability distribution of the exposure dose through repeated random sampling, and generates exposure percentiles (P5, P50, P95), thereby characterizing the variability of population exposure more comprehensively.
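The exposure formula and its probabilistic variant can be sketched as follows. This is a minimal illustration: the uniform distributions and their ranges are hypothetical placeholders, not values from the exposure parameter library, and the function names are assumptions of this sketch.

```python
import random


def chronic_daily_intake(pathways, ef, ed, bw, at, cf=1.0):
    """Chronic daily exposure dose in mg/(kg*day).

    pathways: iterable of (C_i, IR_i, ABS_i) tuples, one per medium.
    """
    total = sum(c * ir * ab for c, ir, ab in pathways)
    return total * ef * ed * cf / (bw * at)


def monte_carlo_cdi(pathways, n=5000, seed=0):
    """Sample EF and BW from hypothetical uniform ranges and return P5/P50/P95."""
    rng = random.Random(seed)
    ed = 30.0                                # years, held fixed in this sketch
    doses = []
    for _ in range(n):
        ef = rng.uniform(300.0, 365.0)       # days/year, illustrative range
        bw = rng.uniform(50.0, 80.0)         # kg, illustrative range
        doses.append(chronic_daily_intake(pathways, ef, ed, bw, at=ed * 365.0))
    doses.sort()
    return {p: doses[int(p / 100 * (n - 1))] for p in (5, 50, 95)}


# Drinking-water pathway only: 0.002 mg/L, 2 L/day, fully absorbed.
pathways = [(0.002, 2.0, 1.0)]
point = chronic_daily_intake(pathways, ef=350.0, ed=30.0, bw=60.0, at=30.0 * 365.0)
print(point)
print(monte_carlo_cdi(pathways))
```

For a non-carcinogenic assessment `at` is ED×365 as above; for a carcinogenic assessment it would be 70×365.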

[0221] (III) Spatiotemporal Matching and Interpolation Module

[0222] As shown in Figures 5 and 6, the spatiotemporal matching and interpolation module is the core fusion component of the system, responsible for resolving the temporal and spatial mismatch between data from different sources. This module comprises three key sub-modules: a spatiotemporal graph neural network (ST-GNN) model, a multi-scale spatiotemporal registration engine, and a data fusion quality assessment system. These sub-modules work together to form a complete fusion chain from heterogeneous data to a unified spatiotemporal representation. The module corresponds to the data fusion layer and achieves spatiotemporal matching and accurate interpolation of data from different sources. It mainly includes the following core algorithms and functions:

3.1 Spatiotemporal Graph Neural Network (ST-GNN) Model

The ST-GNN model is a spatiotemporal data fusion model based on graph neural networks. Its core is the spatiotemporal graph convolution operation:

[0223] $$H^{(l+1)} = \sigma\left(\tilde{A}\, H^{(l)}\, W^{(l)}\right)$$

where $H^{(l)}$ is the node feature matrix of layer l, $H^{(l+1)}$ is the node feature matrix of layer l+1, $\tilde{A}$ is the time-aware adjacency matrix, $W^{(l)}$ is the learnable weight matrix, and $\sigma$ is the activation function (ReLU or LeakyReLU).

Time-aware adjacency matrix construction:

(1) Construction of spatial adjacency matrix

A spatial adjacency matrix A is constructed based on the spatial distance between monitoring stations. When the distance between two monitoring points is less than a preset threshold, the corresponding matrix element is set to 1; otherwise, it is set to 0. The preset threshold is determined adaptively from the monitoring point density.

[0224] (2) Time decay function

[0225] $$f(\Delta t) = e^{-\beta \Delta t}$$

where β is the decay coefficient and Δt is the time difference. The decay coefficient β can be selected on the validation set or treated as a learnable parameter of the model.

[0226] (3) Time-aware adjacency matrix

[0227] $$\tilde{A}(i,j) = A(i,j) \cdot f\left(|t_i - t_j|\right)$$

where $t_i$ and $t_j$ are the timestamps of nodes i and j; $\tilde{A}(i,j)$ is the element in the i-th row and j-th column of the time-aware adjacency matrix; $A(i,j)$ is the element in the i-th row and j-th column of the spatial adjacency matrix; f is the time decay function; and $|t_i - t_j|$ is the absolute value of the difference in timestamps. This operation combines spatial adjacency with the time decay effect, reducing the connection weight between nodes with large time differences.
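The construction of the time-aware adjacency matrix and one graph convolution layer can be sketched with plain Python lists. This is a minimal illustration: the time decay function is taken to be exp(−β·Δt), which is an assumption of this sketch, the distance rule is applied literally (so each node is also adjacent to itself), no degree normalization is applied, and the function names are hypothetical.

```python
import math


def spatial_adjacency(coords, threshold):
    """A[i][j] = 1 when the Euclidean distance is below the threshold, else 0."""
    n = len(coords)
    return [[1.0 if math.dist(coords[i], coords[j]) < threshold else 0.0
             for j in range(n)] for i in range(n)]


def time_aware_adjacency(adj, timestamps, beta):
    """Scale each spatial edge by exp(-beta * |t_i - t_j|)."""
    n = len(adj)
    return [[adj[i][j] * math.exp(-beta * abs(timestamps[i] - timestamps[j]))
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def graph_conv(adj_t, features, weights):
    """One layer: ReLU(adj_t @ features @ weights)."""
    z = matmul(matmul(adj_t, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


# Three stations on a line; the third is too far away to be a neighbour.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
adj = spatial_adjacency(coords, threshold=2.0)
adj_t = time_aware_adjacency(adj, timestamps=[0.0, 2.0, 0.0], beta=0.5)
hidden = graph_conv(adj_t, features=[[1.0], [2.0], [3.0]], weights=[[1.0]])
print(hidden)
```

In the example, station 1 was observed two time units after station 0, so its contribution to station 0 is scaled down by exp(−1), while the distant station 2 contributes nothing.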

[0228] Attention mechanism: The model uses an attention mechanism to handle the differences in importance between different data sources.

[0229] $$\alpha_{ij} = \mathrm{softmax}_j\left(\mathrm{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_j \,\|\, \Phi(\Delta t)\right]\right)\right)$$

where $h_i$ and $h_j$ are the node feature vectors, a and W are learnable parameter matrices, ‖ denotes the feature concatenation operation, Δt is the time difference, Φ is the time-difference embedding function (which can be a sine/cosine positional encoding or a learnable embedding vector), and $\alpha_{ij}$ is the attention coefficient, which represents the importance weight of node j to node i. Furthermore, model training employs a masked reconstruction objective: node values are randomly masked at a predetermined ratio for self-supervised learning, and the loss function comprises a reconstruction loss and a smoothing regularization term.

[0230] Model training:

(1) Self-supervised learning strategy

A masked reconstruction strategy is used for pre-training: the observations of nodes are randomly masked at a preset ratio (typically 10%-30%, default 20%), and the model is trained to predict the values of the masked nodes from the other nodes and from historical information.

[0231] (2) Loss function

[0232] $$L = L_{\mathrm{MAE}} + \lambda L_{\mathrm{smooth}}$$

where L is the total loss function, $L_{\mathrm{MAE}}$ is the mean absolute error on the masked nodes, $L_{\mathrm{smooth}}$ is the graph Laplacian smoothing regularization term, which penalizes excessively large differences in predicted values between adjacent nodes, and λ is the weight coefficient of the smoothing regularization term.

[0233] (3) Model optimization and evaluation

[0234] Gradient descent optimization is employed, with overfitting prevented through learning rate decay and early stopping strategies. Model hyperparameters are determined using cross-validation. Evaluation metrics include MAE, RMSE, and R².

[0235] (4) Technical differences from existing methods

[0236] Compared with traditional interpolation methods (such as inverse distance weighting and Kriging), the ST-GNN model makes full use of spatiotemporal dependencies and the heterogeneity of multi-source data, improving interpolation accuracy, especially when predicting data in sparsely monitored regions. Compared with existing spatiotemporal interpolation methods based on graph neural networks, the ST-GNN model of this invention has the following technical differences. First, it introduces a time decay function to dynamically adjust the connection weights of the adjacency matrix, so that the model adapts the spatial correlation strength to the temporal differences in the data, whereas existing methods typically use static spatial adjacency matrices. Second, it incorporates a time-difference embedding Φ(Δt) into the attention mechanism, so that the model learns spatial and temporal attention simultaneously, whereas existing methods typically handle the temporal and spatial dimensions separately. Third, the model design specifically strengthens the processing of monitoring data for emerging pollutants, making it suitable for monitoring networks with few stations and uneven spatiotemporal distribution.
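The training objective of [0232] (masked mean absolute error plus a graph Laplacian smoothing term) can be sketched as follows. This is a minimal illustration: the smoothing term is taken to be the quadratic form of the graph Laplacian, which is an assumption of this sketch, and the function names and the weight 0.1 are hypothetical.

```python
def masked_mae(pred, target, mask):
    """Mean absolute error over the nodes whose observations were masked."""
    idx = [i for i, masked in enumerate(mask) if masked]
    return sum(abs(pred[i] - target[i]) for i in idx) / len(idx)


def laplacian_smoothness(pred, adj):
    """0.5 * sum_ij adj[i][j] * (pred_i - pred_j)^2.

    For a symmetric adjacency matrix this equals pred^T L pred,
    where L = D - A is the graph Laplacian.
    """
    n = len(pred)
    return 0.5 * sum(adj[i][j] * (pred[i] - pred[j]) ** 2
                     for i in range(n) for j in range(n))


def total_loss(pred, target, mask, adj, lam=0.1):
    """Reconstruction loss plus weighted smoothing regularization."""
    return masked_mae(pred, target, mask) + lam * laplacian_smoothness(pred, adj)


# Chain graph 0 - 1 - 2 with the last two nodes masked.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
pred, target = [1.0, 2.0, 4.0], [1.0, 3.0, 3.0]
mask = [False, True, True]
print(total_loss(pred, target, mask, adj))
```

A larger `lam` pulls the predictions of adjacent stations closer together at the cost of reconstruction accuracy, so it would normally be tuned together with the other hyperparameters.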

[0237] 3.2 Multi-scale Spatiotemporal Registration Engine

[0238] The multi-scale spatiotemporal registration engine enables the system to register and align multi-scale spatiotemporal data.

[0239] Spatial alignment method: Multi-scale spatial alignment is achieved based on Variable Resolution Spatial Index (VRSI).

[0240] (1) Quadtree

[0241] The space is recursively divided into four quadrants, and the partitioning depth is adaptively determined based on the data density, thereby achieving spatial resolution control from coarse to fine.

[0242] (2) GeoHash encoding

[0243] Geographic coordinates are encoded into strings, and the encoding length controls the spatial resolution; the longer the encoding, the higher the accuracy.

[0244] (3) Hilbert Curve

[0245] Mapping a two-dimensional space to a one-dimensional curve preserves spatial locality and facilitates efficient indexing and querying.

[0246] The VRSI method can select the optimal spatial indexing strategy based on the characteristics of data distribution and supports the conversion of different spatial representations such as points, grids, and polygons.
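Of the three indexing strategies, GeoHash encoding is the easiest to show in a few lines. The sketch below follows the standard public Geohash scheme (alternating longitude and latitude bisection, base-32 alphabet). Whether the system uses exactly this variant is an assumption, and the function name is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair; a longer code means a smaller cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True                      # the first bit always splits longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2.0
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2.0
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:              # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, precision=5)
fine = geohash_encode(57.64911, 10.40744, precision=8)
print(coarse, fine)
```

Because a longer code only refines the cell of a shorter one, the coarse code is always a prefix of the fine code, which is what lets the encoding length act as a resolution control.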

[0247] Time alignment method: Multi-scale time alignment based on hierarchical time model (HTM).

[0248] (1) Hierarchical time structure

[0249] HTM includes a hierarchical structure of hour-day-week-month-quarter-year, supporting a unified expression of data at different time granularities.

[0250] (2) Time upsampling

[0251] Upsampling is performed from coarse time granularity (e.g., day) to fine time granularity (e.g., hour) using interpolation methods (e.g., linear interpolation, spline interpolation) or time decomposition models.

[0252] (3) Time downsampling

[0253] From fine time granularity (e.g., hours) to coarse time granularity (e.g., days), use aggregation methods (e.g., average, maximum, cumulative values) and select appropriate aggregation strategies based on the characteristics of the indicators.
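Time downsampling and upsampling can be sketched with the standard library alone. This is a minimal illustration: only the hour-to-day step of the hierarchy and linear interpolation are shown, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

AGGREGATORS = {
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "sum": sum,
}


def downsample_to_daily(records, how="mean"):
    """Aggregate (datetime, value) records into one value per calendar day."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts.date()].append(value)
    aggregate = AGGREGATORS[how]
    return {day: aggregate(values) for day, values in sorted(buckets.items())}


def upsample_linear(v0, v1, steps):
    """Linear interpolation: `steps` equally spaced values from v0 (inclusive)
    towards v1 (exclusive), for example 24 hourly values between two daily ones."""
    return [v0 + (v1 - v0) * k / steps for k in range(steps)]


records = [
    (datetime(2023, 1, 1, 0), 10.0),
    (datetime(2023, 1, 1, 12), 20.0),
    (datetime(2023, 1, 2, 0), 30.0),
]
print(downsample_to_daily(records, how="mean"))
print(upsample_linear(0.0, 24.0, 24)[:4])
```

The aggregation is chosen per indicator: a mean suits concentrations, a maximum suits peak exposure, and a sum suits cumulative quantities such as rainfall.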

[0254] Cross-media alignment: The system has developed a cross-media alignment algorithm to support the unified expression of emerging pollutants in different environmental media.

[0255] (1) Unit conversion

[0256] The system addresses unit differences in different media, such as microplastics being expressed as particle number concentration (particles / L) in water and mass concentration (μg / kg) in sediments. The system converts units using parameters such as media density and particle mass.

[0257] (2) Spatial alignment

[0258] Unify the spatial locations of monitoring points in different media to the same spatial reference system, and support spatial registration of monitoring networks for different media such as water, soil and atmosphere.

[0259] (3) Time synchronization

[0260] Synchronize the timestamps of monitoring data from different media to a unified time axis to handle differences in sampling frequency (such as monthly water quality monitoring, hourly air quality monitoring, etc.).

[0261] 3.3 Data Fusion Quality Assessment System

[0262] As shown in Figure 6, the system performs quality assessment and uncertainty quantification of spatiotemporal data fusion. The total uncertainty is calculated as the square root of the sum of squares of the individual uncertainty components:

[0263] $$U_{\mathrm{total}}(x,t) = \sqrt{U_{\mathrm{epi}}^2 + U_{\mathrm{ale}}^2 + U_{\mathrm{LOD}}^2 + U_{\mathrm{MM}}^2}$$

where $U_{\mathrm{total}}(x,t)$ is the total uncertainty at position x and time t; $U_{\mathrm{epi}}$ is the cognitive (epistemic) uncertainty, stemming from differences in data sources and uncertainties in model structure; $U_{\mathrm{ale}}$ is the random (aleatoric) uncertainty, stemming from measurement errors and natural variability; $U_{\mathrm{LOD}}$ is the detection-limit uncertainty, stemming from the estimation of data below the detection limit; and $U_{\mathrm{MM}}$ is the multi-media conversion uncertainty, stemming from unit conversion and the propagation of distribution coefficients.

[0264] Calculation methods for each uncertainty component:

(1) Cognitive uncertainty

[0265] Cognitive uncertainty is estimated either with a model ensemble, by training multiple models with different structures or parameters and calculating the variance of their predictions, or with Dropout sampling, by keeping Dropout active during the inference phase, performing multiple forward passes, and calculating the variance of the outputs.

[0266] (2) Random uncertainty

[0267] The uncertainty of the predicted value can be estimated by modeling the variance of the predicted distribution, or by using a heteroscedastic neural network to directly output the uncertainty. This uncertainty reflects the random fluctuations and measurement errors of the data itself.

[0268] (3) Uncertainty of detection limit

[0269] This component is the half-width of the confidence interval obtained from Bootstrap resampling in the detection limit processing algorithm; it reflects the uncertainty in estimating undetected data.

[0270] (4) Uncertainty in multi-media conversion

[0271] This component is calculated with the error propagation formula. For a multi-media conversion function $y = f(x_1, x_2, \ldots, x_n)$, covering unit conversions, media density conversions, distribution coefficient calculations, etc., the uncertainty is quantified by the following formula:

[0272] $$u_y = \sqrt{\sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 u_{x_i}^2}$$

where $u_y$ is the uncertainty of the output variable y, $\partial f / \partial x_i$ is the partial derivative of the function f with respect to the input variable $x_i$ (the sensitivity coefficient), and $u_{x_i}$ is the uncertainty of the variable $x_i$. The formula sums over all input variables and, under the assumption that the input variables are independent, propagates their uncertainty to the output through the partial derivatives. It follows the GUM (Guide to the Expression of Uncertainty in Measurement) international standard and is widely used in environmental modeling and risk assessment. The system also supports extended formulas that include correlation terms, for situations where environmental media parameters are significantly correlated.
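The propagation formula can be sketched numerically as follows. This is a minimal illustration assuming independent inputs and central-difference estimates of the partial derivatives. The conversion function and its values (particle count times mean particle mass) are illustrative, and the function name is hypothetical.

```python
import math


def propagate_uncertainty(f, x, u, rel_step=1e-6):
    """First-order propagation for independent inputs.

    f: function of n positional arguments, x: input values,
    u: standard uncertainties of the inputs (same order as x).
    """
    total = 0.0
    for i, (xi, ui) in enumerate(zip(x, u)):
        h = abs(xi) * rel_step or rel_step       # step size, non-zero even at xi = 0
        upper = list(x)
        lower = list(x)
        upper[i] = xi + h
        lower[i] = xi - h
        sensitivity = (f(*upper) - f(*lower)) / (2.0 * h)   # df/dx_i
        total += (sensitivity * ui) ** 2
    return math.sqrt(total)


def mass_concentration(count_per_litre, particle_mass_ug):
    """Illustrative conversion: particle number concentration to ug/L."""
    return count_per_litre * particle_mass_ug


u_y = propagate_uncertainty(mass_concentration, x=[100.0, 2.0], u=[10.0, 0.2])
print(round(u_y, 3))
```

For correlated inputs the sum would gain covariance cross terms, which this sketch leaves out.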

[0273] Fusion robustness assessment: The robustness of the fusion model is assessed by leave-one-out cross-validation.

[0274] Cross-validation RMSE calculation formula:

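[0275] $$\mathrm{RMSE}_{CV} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_{-i}\right)^2}$$

where $\mathrm{RMSE}_{CV}$ is the cross-validation root mean square error, $y_i$ is the observed value at the i-th observation point, $\hat{y}_{-i}$ is the predicted value obtained without using data source i, and N is the number of observation points. This metric reflects the predictive stability of the model when a data source is missing.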

[0276] On this basis, an uncertainty map is generated to guide data reliability assessment.

[0277] Uncertainty map generation: The system generates an uncertainty map, displaying the total uncertainty at each location in the form of a spatial distribution map. The uncertainty map uses varying shades of color to represent the degree of uncertainty. It helps users identify areas with lower data reliability, guiding data collection and decision-making.

[0278] For data on emerging pollutants, the system performs a comprehensive uncertainty assessment that specifically quantifies the uncertainty introduced by detection limit processing ($U_{\mathrm{LOD}}$) and by multi-media conversion ($U_{\mathrm{MM}}$), providing a reliability boundary for risk assessment.

[0279] (IV) Application Interface and Service Module

[0280] The Application Interface and Service module is the system's output component, responsible for providing processed, high-quality data to upper-layer applications in a standardized manner. This module comprises four key sub-modules: RESTful API service, data access control system, real-time data push service, and data visualization interface. These sub-modules work together to form a complete service chain from data to application.

[0281] The system operation process is as follows: First, the user's identity and permissions are verified through the data access control system; then, according to the access method, the RESTful API service provides a data query interface, the real-time data push service provides real-time updates, and the data visualization interface provides intuitive display; finally, the service quality is continuously optimized based on user feedback and usage, providing strong data support for environmental health research and decision-making.

[0282] 4.1 RESTful API Service

[0283] RESTful API services are designed to provide data access interfaces that conform to modern web standards, enabling third-party applications to easily obtain and use system data.

[0284] API endpoint design: This unit manages all API endpoints provided by the system. It is configured as follows:

(1) Resource-oriented architecture

The API adopts a RESTful style, with endpoints designed around resources. The main resources include data sources, data queries, analysis functions, and metadata.

[0285] (2) Endpoint hierarchy

[0286] The main endpoints include: the data source endpoint (/api/v1/sources), which provides a list of available data sources and their details; the data query endpoint (/api/v1/query), which supports complex data retrieval, including multi-dimensional filtering by time, space, pollutant type, etc.; the analysis endpoint (/api/v1/analysis), which provides functions such as time series analysis (/timeseries) and spatial analysis (/spatial); and the metadata endpoint (/api/v1/metadata), which provides metadata such as data models, coding standards, and unit systems.

[0287] (3) Dedicated endpoints for emerging pollutants

[0288] The dedicated endpoint for emerging pollutants (/api/v1/emerging-pollutants) supports multi-dimensional queries by pollutant type (such as microplastics, PFAS), environmental medium (water, soil, air, organisms), and geographical region, meeting the special needs of emerging pollutant research.

[0289] (4) Version control

[0290] Versioning is carried in the URL path (such as /api/v1/, /api/v2/) to ensure backward compatibility as the API evolves.

[0291] Parameter validation and error handling:

(1) Parameter verification

API request parameters are strictly validated, including data type checks, value range checks, and mandatory parameter checks, and invalid requests are rejected.

[0292] (2) Error response

[0293] The API uses standard HTTP status codes and structured error messages, such as 400 (Bad Request, for incorrect request parameters), 401 (Unauthorized), 404 (Not Found), and 500 (Internal Server Error), and provides detailed error descriptions.
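Parameter validation with a structured error response can be sketched as follows. This is a minimal illustration for a query against the data query endpoint. The parameter names (`pollutant`, `start`, `end`) and the shape of the error body are hypothetical, since the text does not define them.

```python
from datetime import date


def validate_query(params):
    """Return (http_status, body) for a dictionary of raw query parameters."""
    errors = []
    if not params.get("pollutant"):
        errors.append("pollutant is required")           # mandatory parameter check
    parsed = {}
    for name in ("start", "end"):
        try:
            parsed[name] = date.fromisoformat(params[name])   # data type check
        except KeyError:
            errors.append(f"{name} is required")
        except (TypeError, ValueError):
            errors.append(f"{name} must be an ISO date (YYYY-MM-DD)")
    if len(parsed) == 2 and parsed["start"] > parsed["end"]:  # value range check
        errors.append("start must not be after end")
    if errors:
        return 400, {"error": "invalid_parameters", "details": errors}
    return 200, {
        "pollutant": params["pollutant"],
        "start": parsed["start"].isoformat(),
        "end": parsed["end"].isoformat(),
    }


print(validate_query({"pollutant": "pm25", "start": "2023-01-01", "end": "2023-01-31"}))
print(validate_query({"pollutant": "pm25", "start": "2023-02-01", "end": "2023-01-01"}))
```

Collecting every problem before responding lets the client fix all of its parameters in one round trip instead of discovering them one at a time.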

[0294] Response format and performance optimization:

(1) Response format

Multiple data formats such as JSON, XML, and CSV are supported, with JSON as the default; the format can be specified through the Accept header or a URL parameter.

[0295] (2) Pagination mechanism

[0296] Implement pagination for large-scale query results, controlled by the page and page_size parameters, to avoid returning too much data at a time.

[0297] (3) Data compression

[0298] It supports gzip compression, reducing the amount of data transmitted over the network.

[0299] (4) Caching strategy

[0300] Cache frequently accessed, stable data and control the cache duration via the Cache-Control header to improve response speed.

[0301] 4.2 Data Access Control System

[0302] Data access control systems are designed to protect data security and privacy, ensuring that users can only access data within their authorized scope.

[0303] Identity verification:

(1) Authentication method

The system supports multiple authentication methods, including API key authentication (suitable for server-to-server calls), OAuth 2.0 (suitable for third-party application integration), and JWT (token-based stateless authentication).

[0304] (2) Token Management

[0305] Implement token expiration and refresh mechanisms to regularly update access tokens and improve security.

[0306] Permission model: The system implements a composite permission model that combines RBAC (role-based access control) and ABAC (attribute-based access control).

[0307] (1) Role definition

[0308] The system defines four basic roles: Administrator (system management privileges), Researcher (data analysis privileges), Analyst (data viewing and export privileges), and Viewer (basic viewing privileges).

[0309] (2) Attribute definition

[0310] Key attributes include data sensitivity (low, medium, and high levels), geographical scope (national, provincial, municipal, and county-level administrative regions), and data type (environmental data, health data, pollution source data, etc.).

[0311] (3) Access control rules

[0312] Fine-grained access control can be implemented by combining roles and attributes. For example, "A researcher with the attribute of province X can access data of medium sensitivity or below in province X".

[0313] (4) Data control of emerging pollutants

[0314] Special access controls are set for data on emerging pollutants, requiring users to have the corresponding research qualifications or be approved before they can access the data.
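The combination of role and attribute checks can be sketched as follows. This is a minimal illustration that implements only the example rule quoted above and the approval requirement for emerging pollutant data. The field names are hypothetical, and roles other than Researcher are simply denied here because the text does not spell out their rules.

```python
SENSITIVITY_RANK = {"low": 0, "medium": 1, "high": 2}


def can_access(user, record):
    """Evaluate one combined role + attribute rule for a single record."""
    # Emerging pollutant data needs an explicit approval flag.
    if record.get("emerging_pollutant") and not user.get("approved_for_emerging"):
        return False
    # Example rule from the text: a researcher scoped to province X may read
    # data of medium sensitivity or below from province X.
    if user["role"] == "researcher":
        return (
            record["province"] == user.get("province")
            and SENSITIVITY_RANK[record["sensitivity"]] <= SENSITIVITY_RANK["medium"]
        )
    # Other roles are left undefined in this sketch and are denied by default.
    return False


researcher = {"role": "researcher", "province": "Hubei"}
print(can_access(researcher, {"province": "Hubei", "sensitivity": "medium"}))
print(can_access(researcher, {"province": "Hubei", "sensitivity": "high"}))
```

Denying by default means that a role or attribute missing from the rule set fails closed rather than open.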

[0315] Data filtering: Data is filtered based on user permissions to ensure that users can only see data within their authorized scope.

[0316] (1) Query-level filtering

[0317] Modify the query conditions and add permission constraints before executing the query to ensure that the query results comply with the user's permissions.

[0318] (2) Result set filtering

[0319] After the query is executed, the results are filtered to remove unauthorized records.

[0320] (3) Field-level filtering

[0321] Control the visibility of sensitive fields, support field desensitization, masking, and hiding to protect sensitive information.

[0322] (4) Geographic and time range filtering

[0323] Filter spatial and temporal data based on the user's geographic and temporal permission ranges.

[0324] Audit Log: The system records all data access behavior for security auditing and problem tracing.

[0325] (1) Log content

[0326] This includes key information such as access time, user ID, accessed resources, operation type, and access result.

[0327] (2) Log storage

[0328] A distributed architecture is used to achieve high-performance log indexing and querying, ensuring the integrity and traceability of logs.

[0329] (3) Anomaly detection

[0330] It analyzes log patterns in real time to identify potential security threats and supports rule-based and machine learning-based detection methods.

[0331] 4.3 Real-time data push service

[0332] The real-time data push service aims to provide clients with real-time updates of environmental health data, supporting monitoring, early warning, and dynamic analysis.

[0333] Connection management:

(1) Communication protocol

The WebSocket protocol is used to establish a two-way communication channel, with WSS (WebSocket Secure) encrypted transmission to ensure data security.

[0334] (2) Connection authentication

[0335] The token-based authentication mechanism, integrated with the API authentication system, ensures that only authorized clients can establish a connection.

[0336] (3) Heartbeat mechanism

[0337] Maintain a persistent connection by periodically sending heartbeat messages to check the connection status.

[0338] (4) Reconnection strategy

[0339] It supports an automatic reconnection mechanism and adopts an exponential backoff strategy to avoid putting pressure on the server due to frequent reconnections in a short period of time.
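The reconnection schedule can be sketched as follows. This is a minimal illustration of capped exponential backoff with optional random jitter. The base delay, the cap, and the jitter option are hypothetical defaults, not values stated in the text.

```python
import random


def backoff_delays(max_retries=6, base=1.0, cap=60.0, jitter=0.0, seed=None):
    """Delay in seconds before each reconnection attempt.

    The delay doubles on every attempt and is capped; `jitter` adds up to that
    fraction of the delay at random so that clients do not reconnect in lockstep.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        delays.append(delay + rng.uniform(0.0, jitter * delay))
    return delays


print(backoff_delays(7))
print(backoff_delays(4, jitter=0.2, seed=1))
```

A client would sleep for the next delay after each failed attempt and reset the sequence once a connection succeeds.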

[0340] Message distribution: The system implements a message distribution framework based on the publish-subscribe (Pub-Sub) pattern.

[0341] (1) Topic Management

[0342] Hierarchical topic structures are supported, such as "environment/air/pm25/beijing" for the PM2.5 data topic of the Beijing area. Clients can subscribe to a specific topic or use wildcards to subscribe to multiple related topics (such as "environment/air/#" to subscribe to all air quality indicators).

[0343] (2) Message quality level

[0344] Supports multiple Message Quality of Service (QoS) levels: QoS 0 (at most once, message may be lost), QoS 1 (at least once, message is guaranteed to arrive but may be duplicated), and QoS 2 (exactly once, message is guaranteed not to be lost or duplicated). Clients can select the appropriate QoS level according to their application requirements.

[0345] (3) Message filtering

[0346] It supports attribute-based message filtering, allowing clients to specify the range of data they are interested in (such as specific pollutants, specific regions, data exceeding a threshold, etc.), and the server will only push messages that meet the criteria.
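Topic subscription matching can be sketched as follows. This is a minimal illustration that assumes MQTT-style wildcard semantics: `#` matches all remaining levels, as in the example above, while the single-level wildcard `+` is an addition of this sketch that the text does not mention. The function name is hypothetical.

```python
def topic_matches(pattern, topic):
    """Return True when a subscription pattern matches a concrete topic."""
    pattern_levels = pattern.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(pattern_levels):
        if level == "#":                 # multi-level wildcard: match the rest
            return True
        if i >= len(topic_levels):       # pattern is longer than the topic
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(pattern_levels) == len(topic_levels)


def subscribers_for(topic, subscriptions):
    """Clients whose pattern matches; subscriptions maps client id -> pattern."""
    return sorted(c for c, p in subscriptions.items() if topic_matches(p, topic))


subscriptions = {
    "dashboard": "environment/air/#",
    "beijing_app": "environment/air/pm25/beijing",
    "water_lab": "environment/water/#",
}
print(subscribers_for("environment/air/pm25/beijing", subscriptions))
```

Attribute-based filtering, such as forwarding only values above a threshold, would be applied to the message payload after this topic match.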

[0347] Early warning triggering:

(1) Early warning rules

Customizable early warning rules are supported, such as PM2.5 exceeding national standards or abnormally high concentrations of emerging pollutants.

[0348] (2) Early warning notification

[0349] When the monitored data triggers an early warning rule, the system immediately pushes an early warning message to clients that have subscribed to the relevant topic, including information such as the early warning level, triggering indicator, current value, and threshold.

[0350] Performance optimization:

(1) Connection pooling

The system employs sharding technology to handle large-scale concurrent connections, distributing connections across multiple server nodes to improve system scalability.

[0351] (2) Message batch processing

[0352] Merging multiple messages sent to the same destination within a short period of time reduces network transmission frequency and improves efficiency.

[0353] (3) Load balancing

[0354] Distribute client connections through a load balancer to ensure balanced server load and avoid single point of overload.

[0355] The system supports high-concurrency connections and can simultaneously serve the real-time data needs of a large number of clients.

[0356] 4.4 Data Visualization Interface

[0357] The data visualization interface is designed to provide intuitive and interactive data display capabilities to help users understand complex environmental health data.

[0358] Visualization type: The system supports various chart types to adapt to different data structures and analysis needs.

[0359] (1) Time series charts

[0360] This includes line graphs, area graphs, and other formats that display the changing trends of pollutant concentrations and health indicators over time.

[0361] (2) Spatial distribution chart

[0362] These include heat maps, contour maps, scatter plots, etc., which show the spatial distribution patterns of pollutants or health effects.

[0363] (3) Statistical charts

[0364] This includes bar charts, pie charts, radar charts, etc., which display the statistical characteristics and comparative relationships of the data.

[0365] (4) Relationship diagram

[0366] This includes network diagrams, Sankey diagrams, and other methods that demonstrate the relationships between various elements.

[0367] (5) Dedicated visualization of emerging pollutants

[0368] It provides dedicated visualization templates for emerging pollutants, such as heat maps of spatial distribution of microplastics, multi-media migration network diagrams of PFAS, and Sankey diagrams of exposure pathways for endocrine disruptors, to meet the special visualization needs of emerging pollutant research.

[0369] Interactive features:

(1) Dynamic filtering

Data can be filtered and displayed dynamically through interactive components such as time sliders, region selectors, and pollutant filters.

[0370] (2) Chart linkage

[0371] Multiple charts can be linked together; for example, if a region is selected on a map, the time series chart will automatically update to reflect the data for that region.

[0372] (3) Data exploration

[0373] Hovering over or clicking a chart element displays detailed data values, confidence intervals, uncertainty, and other information.

[0374] (4) Scaling and translation

[0375] It supports map zooming and panning, timeline zooming, and other operations, making it easy for users to focus on the data range they are interested in.

[0376] Visualization configuration:

(1) Automatic configuration generation

The system automatically recommends appropriate visualization types and configuration parameters based on data type and structure, lowering the barrier to entry for users.

[0377] (2) Custom configuration

[0378] It supports user-defined chart styles, color schemes, axis ranges, etc., to meet personalized needs.

[0379] (3) Configuration saving and sharing

[0380] Users can save the visual configuration and generate a shareable link, facilitating team collaboration and showcasing results.

[0381] Performance optimization:

(1) Data aggregation

Large amounts of data are aggregated automatically, reducing the rendering burden on the front end.

[0382] (2) Loading on demand

[0383] A lazy loading strategy is adopted, loading only the data visible to the current viewport to improve the first screen loading speed.

[0384] (3) Caching mechanism

[0385] Cache static or infrequently updated visualizations to reduce redundant calculations.

[0386] Export function: It supports exporting visualization results as images (PNG, SVG), PDF, or interactive HTML files, making it convenient for users to use in reports and papers.

[0387] Example 1: Application of Environmental Health Data Fusion in a Certain City

[0388] The following is an application example of the system of the present invention in the environmental health data fusion scenario of a certain city.

[0389] A city with an area of 12,000 square kilometers and a population of 3.8 million needs to integrate data from environmental protection, health, and meteorological departments to assess the impact of air pollution on residents' health. Data sources include hourly data from 52 air quality monitoring stations, daily outpatient data from 128 medical institutions, and online monitoring data from 200 key pollution sources.

[0390] During the data acquisition phase, the system connects to the Environmental Protection Bureau's monitoring platform, the National Health Commission's disease control system, and the Meteorological Bureau's data platform via API interfaces to collect data from 2020 to 2023. The parameter mapping matrix M is used to achieve cross-system parameter transformation, such as mapping a parameter of the environmental protection system ... to the corresponding parameter of the health system. A total of 4.5 million records were collected.

[0391] Specifically, the system first extracts parameter features from the API documents of each system through parameter semantic analysis. Then, it calculates parameter alignment relationships using cosine similarity, generating three m×n mapping matrices (corresponding to parameter transformations between environmental protection and health, environmental protection and meteorology, and health and meteorology, respectively). Finally, it solves the mapping matrices in one go using ridge regression closed-form solutions. When the system initially connects to the new version of the National Health Commission's disease control system, a cold start strategy is adopted. Initial mappings are established through semantic matching, requiring only eight verification request-response pairs to complete the mapping calibration. In subsequent continuous operation, the mapping accuracy is further improved through a feedback mechanism.
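The one-shot ridge regression fit described above can be illustrated with a minimal sketch. It assumes the linear form target = M · source, uses synthetic verification pairs, and treats the function name `fit_mapping_matrix`, the regularization strength `lam`, and the 2×3 example matrix as hypothetical choices that do not come from the source.

```python
import numpy as np

def fit_mapping_matrix(X_src, Y_tgt, lam=1e-3):
    """Closed-form ridge solution for M in Y ≈ X @ M.T.

    X_src: (k, n) source-system parameter vectors, one row per verification pair
    Y_tgt: (k, m) target-system parameter vectors
    Returns M with shape (m, n).
    """
    n = X_src.shape[1]
    # Normal equations with an L2 penalty: M^T = (X^T X + lam * I)^-1 X^T Y
    gram = X_src.T @ X_src + lam * np.eye(n)
    return np.linalg.solve(gram, X_src.T @ Y_tgt).T

rng = np.random.default_rng(0)
M_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.5, 0.5]])   # hypothetical 2x3 mapping
X = rng.normal(size=(8, 3))            # eight verification request-response pairs
Y = X @ M_true.T
M_hat = fit_mapping_matrix(X, Y, lam=1e-6)
```

With noise-free pairs and a small penalty, the recovered matrix is close to the one that generated the data; real calibration pairs would be noisy, which is where the penalty term matters.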

[0392] During the data processing phase, the general quality control system employs a multi-level outlier detection model with weighting coefficients w1=0.3, w2=0.4, and w3=0.3, identifying 18,736 outliers, accounting for 0.42% of the records. These weights were determined based on the characteristics of the air pollutant concentration data (distribution characteristics are obvious but there is no strictly fixed range, therefore statistical methods and machine learning methods have higher weights).
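As an illustration of how a weighted multi-level outlier score can be combined, the sketch below assumes that each of the three component scores (statistical, machine learning, domain rule) lies in [0, 1]. The z-score scaling, the 0.5 decision threshold, and all function names are assumptions made for the example only.

```python
def z_score_component(value, mean, std, cap=3.0):
    """Statistical score: |z| scaled to [0, 1], saturating at `cap` standard deviations."""
    z = abs(value - mean) / std
    return min(z / cap, 1.0)

def outlier_score(s_stat, s_ml, s_rule, weights=(0.3, 0.4, 0.3)):
    """Weighted sum of the statistical, machine learning, and domain-rule scores."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w1 * s_stat + w2 * s_ml + w3 * s_rule

# Hypothetical hourly PM2.5 reading far above the recent mean.
s_stat = z_score_component(value=310.0, mean=45.0, std=30.0)
score = outlier_score(s_stat, s_ml=0.9, s_rule=1.0)
is_outlier = score >= 0.5   # threshold chosen for illustration only
```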

[0393] The medical data processing module implements K-anonymization (K=5) and differential privacy protection (ε=0.8, δ=10⁻⁵) on outpatient data with a sensitivity assessment of Level 2. The utility loss was assessed at 12.3%, calculated as follows. For the statistical distribution distance, the KL divergence was used to measure the difference in distribution between the original and protected data, giving 0.08. For the degradation of machine learning task performance, a respiratory disease classification model was trained on the protected data and compared with the model trained on the original data; the classification accuracy decreased by 0.06. Combining the two terms with their weighting coefficients, the overall utility loss is 12.3% (based on normalization), which is lower than the preset threshold.
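The source does not state which noise mechanism realizes the (ε, δ) guarantee. As one common option, the sketch below uses the Gaussian mechanism with its classical calibration, which is valid for ε < 1; the unit sensitivity and the counting query are assumptions.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("classical calibration needs 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def private_count(true_count, epsilon=0.8, delta=1e-5, rng=None):
    """Release a count with Gaussian noise; a counting query has sensitivity 1."""
    rng = rng or random.Random()
    return true_count + rng.gauss(0.0, gaussian_sigma(1.0, epsilon, delta))

sigma = gaussian_sigma(1.0, 0.8, 1e-5)                # roughly 6 visits of noise per count
noisy = private_count(1250, rng=random.Random(42))    # hypothetical daily outpatient count
```

At this noise scale, daily counts in the thousands are perturbed by well under one percent, which is in line with the modest utility loss reported for the embodiment.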

[0394] The pollution source data processing module classifies, identifies, and lists the emission data of 200 key pollution sources, providing data support for source apportionment.

[0395] In the spatiotemporal fusion phase, the system applies the ST-GNN model to generate PM2.5 concentration distribution and community-level health risk assessment using a 500m×500m grid. The model is configured with 3-layer graph convolution, with hidden layer dimensions of 64-128-64 and a time decay coefficient β=0.05. The spatial adjacency matrix A is constructed based on the Euclidean distances between 52 monitoring stations, with a distance threshold set to 30km (adaptively determined according to the average station spacing). A contains 186 non-zero elements. The time-aware adjacency matrix $A_t$ is obtained as the element-wise product of A and the time decay function $f_t(\Delta t) = e^{-\beta \Delta t}$, which reduces the weight between nodes with a time difference of 24 hours to about 0.30 and the weight between nodes with a time difference of 48 hours to about 0.09, thereby automatically reducing the impact of data points that are far apart in time.
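A minimal sketch of the time-aware adjacency construction follows. It assumes a binary, distance-thresholded spatial adjacency without self-loops and an exponential decay exp(-β·Δt) with Δt in hours, which reproduces the weights of about 0.30 and 0.09 quoted for 24 and 48 hours. The station coordinates and the function name are hypothetical.

```python
import numpy as np

def time_aware_adjacency(coords_km, timestamps_h, dist_threshold_km=30.0, beta=0.05):
    """Element-wise product of a thresholded spatial adjacency and exp(-beta * |t_i - t_j|)."""
    coords = np.asarray(coords_km, dtype=float)
    t = np.asarray(timestamps_h, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Binary adjacency without self-loops (an assumption; the source gives no edge weights).
    A = ((dist <= dist_threshold_km) & (dist > 0)).astype(float)
    decay = np.exp(-beta * np.abs(t[:, None] - t[None, :]))
    return A * decay

# Three hypothetical stations 10 km apart, observed at 0 h, 24 h, and 48 h.
A_t = time_aware_adjacency([(0, 0), (10, 0), (20, 0)], [0, 24, 48])
```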

[0396] Self-supervised pre-training with an occlusion (masked) reconstruction objective is employed, with the occlusion ratio set to a default value of p=0.2 (i.e., 20% of node observations are randomly occluded during pre-training). A multi-scale spatiotemporal registration engine aggregates hourly monitoring data into daily averages and interpolates point data into grid data, achieving spatiotemporal alignment with daily outpatient data. A fusion quality assessment system calculates the total uncertainty $U_{total}$ for each grid cell. The cognitive uncertainty $U_e$ is calculated as the prediction variance of five ST-GNN models trained with different initialization parameters, and the random uncertainty $U_a$ is estimated directly through the output layer of the heteroscedastic neural network. The total uncertainty is calculated according to $U_{total} = \sqrt{U_e^2 + U_a^2 + U_d^2 + U_m^2}$ (this embodiment does not involve emerging pollutants, therefore $U_d$ and $U_m$ are zero), and an uncertainty map is generated to guide decision-making.
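The combination of uncertainty components can be sketched as follows, assuming the root-sum-of-squares form stated in claim 4. The ensemble spread is expressed here as a standard deviation so that all components share the units of the prediction; that choice, the prediction values, and the function names are assumptions.

```python
import math
from statistics import pstdev

def epistemic_from_ensemble(predictions):
    """Spread of the ensemble members' predictions for one grid cell."""
    return pstdev(predictions)

def total_uncertainty(u_e, u_a, u_d=0.0, u_m=0.0):
    """Root-sum-of-squares of the four components (U_d and U_m default to zero)."""
    return math.sqrt(u_e ** 2 + u_a ** 2 + u_d ** 2 + u_m ** 2)

# Hypothetical PM2.5 predictions (μg/m³) from five differently initialized models.
u_e = epistemic_from_ensemble([34.0, 36.0, 35.0, 37.0, 33.0])
u_total = total_uncertainty(u_e, u_a=2.0)
```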

[0397] During the service output phase, the system provides a data query interface via a RESTful API, supporting multi-dimensional queries by time, location, and pollutant type. It pushes PM2.5 exceedance warnings and increased health risk notifications to environmental protection and health departments in real time via WebSocket push service. It generates spatiotemporal distribution heatmaps and health risk trend maps through a visualization interface. Regarding data access control, researchers in the environmental protection department can access city-wide environmental monitoring data and health data with moderate to low sensitivity through RBAC role authorization; analysts in the health commission can only access data within their jurisdiction through ABAC attribute constraints.
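The combined role and attribute check described above can be sketched as a single decision function. The role names follow claim 6; the sensitivity ceilings per role, the set of jurisdiction-bound roles, and the district labels are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy tables; the source does not list these values.
ROLE_MAX_SENSITIVITY = {"administrator": 3, "researcher": 2, "analyst": 2, "browser": 1}
JURISDICTION_BOUND_ROLES = {"analyst", "browser"}

@dataclass(frozen=True)
class User:
    role: str
    jurisdiction: str

@dataclass(frozen=True)
class Record:
    sensitivity: int   # 1 = low, 2 = medium, 3 = high
    region: str

def can_access(user: User, record: Record) -> bool:
    # RBAC: the role caps the sensitivity level that may be read.
    max_level = ROLE_MAX_SENSITIVITY.get(user.role)
    if max_level is None or record.sensitivity > max_level:
        return False
    # ABAC: some roles are additionally restricted to their own jurisdiction.
    if user.role in JURISDICTION_BOUND_ROLES and record.region != user.jurisdiction:
        return False
    return True

researcher = User(role="researcher", jurisdiction="district-A")
analyst = User(role="analyst", jurisdiction="district-A")
```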

[0398] The effectiveness evaluation was based on the following experimental design and validation method. A spatial leave-out method was used: 20% of the monitoring points (i.e., 10 of the 52 stations) were randomly selected as the test set, and the remaining 42 stations served as the training set. This random partitioning was repeated 5 times, and the results were averaged. The ST-GNN model had an RMSE of 12.3 μg/m³, while the traditional Kriging method had an RMSE of 21.5 μg/m³, representing a 42.8% reduction in interpolation error. The cross-validation RMSE was calculated as $RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$, where N=10 (the number of test sites), $y_i$ is the actual observation at test site i, and $\hat{y}_i$ is the model's prediction at that site's location.
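The evaluation protocol can be sketched in a few lines: repeated random station hold-outs, the usual root-mean-square error, and the relative error reduction computed from the two reported RMSE values. The function names and the random seed are assumptions.

```python
import math
import random

def spatial_holdout_splits(station_ids, n_test=10, repeats=5, seed=0):
    """Yield (train, test) station lists for repeated random spatial hold-outs."""
    rng = random.Random(seed)
    ids = list(station_ids)
    for _ in range(repeats):
        test = set(rng.sample(ids, n_test))
        train = [s for s in ids if s not in test]
        yield train, sorted(test)

def rmse(observed, predicted):
    """Root-mean-square error over the held-out sites."""
    n = len(observed)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n)

splits = list(spatial_holdout_splits(range(52)))
# Relative reduction implied by the reported values: (21.5 - 12.3) / 21.5
reduction_pct = round((21.5 - 12.3) / 21.5 * 100, 1)
```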

[0399] Data processing time has been reduced from 7 days with manual methods to 2 hours, significantly improving data processing efficiency.

[0400] After differential privacy processing, the statistical distribution distance, the degradation of machine learning task performance, and the overall utility loss all meet the preset threshold requirements.

[0401] Through spatiotemporal fusion analysis, the system identified a significant positive correlation between PM2.5 concentration and the number of outpatient visits for respiratory diseases (with a lag of 1-3 days), providing a scientific basis for precise control. The system has provided data support for the city's environmental health management, enabling applications such as correlation analysis between air pollution and respiratory diseases, precise control of pollution sources, and public health risk early warning, achieving significant social benefits.

[0402] Example 2: Monitoring and Health Risk Assessment of Emerging Pollutants in the Water Environment of a Coastal City

[0403] The following is an application example of the system of the present invention in the monitoring scenario of emerging pollutants in the water environment of a coastal city.

[0404] In recent years, a coastal city has faced serious emerging water pollution problems, especially the increasing pollution from microplastics and perfluorinated compounds (PFAS), which require a systematic assessment and zoned management of the exposure risks to the city's 3.8 million residents.

[0405] Traditional assessment methods face several challenges: multi-source data are difficult to acquire, because water quality monitoring, pollution source investigation, and health monitoring data are dispersed across different departments; detection limit processing for emerging pollutants is complex, so the large amount of data below the detection limit is difficult to use effectively; multi-media migration models are lacking, so the migration and transformation of pollutants between water, sediment, and organisms cannot be assessed accurately; and spatiotemporal data matching accuracy is insufficient, making it difficult to establish the association between pollution exposure and health effects.

[0406] During the multi-source data automatic acquisition phase, the system adaptively connects to the municipal environmental protection bureau's water quality automatic monitoring system, pollution source online monitoring system, and health commission's disease monitoring system via API interfaces, enabling real-time data acquisition. The system constructs parameter mapping matrices for each of the three data sources and automatically establishes parameter alignment relationships through parameter semantic analysis and cosine similarity calculation, achieving automatic conversion of parameters such as water quality index codes, monitoring time formats, and spatial coordinate systems.

[0407] For data sources without open APIs, the system deployed a semi-supervised learning-enhanced intelligent crawler. Requiring only a small number of labeled sample pages (approximately 4% of the total target pages), it successfully collected 183 water quality reports published by the Environmental Protection Bureau and 96 research reports released by third parties. The crawler engine calculates the pattern similarity $S_{pattern} = \alpha S_{struct} + (1-\alpha) S_{content}$ to identify target pages, where α is set to 0.65 (emphasizing structure) for environmental quality report pages that are mainly table-based, and α is set to 0.35 (emphasizing content) for research report pages that are mainly text-based.
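As an illustration of the α-balanced pattern similarity, the sketch below assumes the convex combination α·S_struct + (1-α)·S_content and uses Jaccard overlap of HTML tag sets and of token sets as stand-ins for the two component similarities. The source does not specify how those components are computed, so both measures and all sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def pattern_similarity(page, template, alpha):
    """alpha weights structure; (1 - alpha) weights content."""
    s_struct = jaccard(page["tags"], template["tags"])
    s_content = jaccard(page["tokens"], template["tokens"])
    return alpha * s_struct + (1.0 - alpha) * s_content

template = {"tags": {"table", "tr", "td", "h1"},
            "tokens": {"water", "quality", "report", "pfas"}}
page = {"tags": {"table", "tr", "td", "p"},
        "tokens": {"water", "quality", "bulletin"}}

table_score = pattern_similarity(page, template, alpha=0.65)   # table-based report pages
text_score = pattern_similarity(page, template, alpha=0.35)    # text-based research pages
```

Because this page matches the template better in structure (0.6) than in content (0.4), the structure-weighted setting gives the higher score.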

[0408] For historical data, the historical data parsing system extracted monitoring data from 2005 to 2022 from historical reports in PDF and Excel formats, including routine water quality indicators and emerging pollutant indicators. The parsing system employed the model $R = f_{struct}(I) \oplus f_{content}(T)$ to extract tabular data, with feature fusion performed by attention-weighted fusion. After domain adaptation optimization (using the built-in dictionary of pollutant names and units of measurement for OCR post-processing correction), the overall recognition accuracy reached over 97%.

[0409] The system features a specially developed emerging pollutant data acquisition adapter, successfully connecting to the microplastics and PFAS detection systems of two professional laboratories. Through multi-source automatic acquisition, the system collected conventional water quality data from 45 cross-sections, emerging pollutant data from 72 locations, and wastewater discharge data from 283 enterprises, forming a complete water environment monitoring database.

[0410] During the unified data processing phase, the system standardized and performed quality control on the acquired raw data. The general data quality control system detected and processed 3,876 outliers in the data, accounting for 4.2%. Among them, the outlier detection weights for water quality data were configured as w1=0.35, w2=0.35, and w3=0.30 (water quality indicators have both clear standard limits and complex multi-indicator correlations, so the weights of the three methods are relatively balanced), and the data quality detection accuracy rate reached 96.3%.

[0411] The emerging pollutant data processing module implements a stratified detection limit processing strategy for the specific properties of microplastics and PFAS. For microplastic data, 35% of the samples are below the detection limit (detection rate of 65%, falling within the moderate detection rate range of 50%-80%), and the system uses maximum likelihood estimation for processing. This method assumes that the data follows a log-normal distribution and iteratively solves the distribution parameters using the EM algorithm to estimate the conditional expectation value of undetected samples. For PFAS data, 29% of the samples are below the detection limit (detection rate of 71%, also falling within the moderate detection rate range), and the system uses regression imputation, employing a random forest regression model with time, spatial location, relevant pollutant concentrations, and environmental conditions as independent variables to estimate the concentration of undetected samples. During the detection limit processing, the system estimates the confidence interval of the processing results through Bootstrap resampling (1000 times) and calculates the source uncertainty of the detection limit. Through these processes, data availability significantly improved from 65% to 97%.
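The maximum likelihood treatment of non-detects can be sketched with a small EM loop for left-censored log-normal data, assuming a single detection limit shared by all samples. The concentrations, the detection limit, the fixed iteration count, and the function names are hypothetical.

```python
import math

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def em_censored_lognormal(detected, n_censored, lod, iters=200):
    """Return (mu, sigma) of log-concentration and E[concentration | below LOD]."""
    z = [math.log(v) for v in detected]
    c = math.log(lod)
    n = len(z) + n_censored
    sum_z, sum_z2 = sum(z), sum(v * v for v in z)
    mu = sum_z / len(z)                        # start from the detected samples only
    sigma = math.sqrt(sum_z2 / len(z) - mu * mu)
    for _ in range(iters):
        # E-step: moments of a normal truncated to z < c
        a = (c - mu) / sigma
        lam = _pdf(a) / _cdf(a)
        e1 = mu - sigma * lam
        e2 = sigma ** 2 * (1.0 - a * lam - lam ** 2) + e1 ** 2
        # M-step: update parameters with the expected sufficient statistics
        mu = (sum_z + n_censored * e1) / n
        sigma = math.sqrt((sum_z2 + n_censored * e2) / n - mu * mu)
    a = (c - mu) / sigma
    imputed = math.exp(mu + 0.5 * sigma ** 2) * _cdf(a - sigma) / _cdf(a)
    return mu, sigma, imputed

detected = [0.8, 1.2, 1.5, 2.0, 2.7, 3.1, 4.5]   # hypothetical detected values
mu, sigma, imputed = em_censored_lognormal(detected, n_censored=3, lod=0.5)
```

The fitted mean of the log-concentrations ends up below the mean of the detected samples alone, because the censored observations pull the distribution toward lower values, and the imputed value for a non-detect always lies below the detection limit.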

[0412] The system further applies a multi-media migration model to simulate the migration and transformation processes of PFAS in the environment. The model divides the environment into environmental media such as water, sediment, soil, and organisms, and uses a set of differential equations based on mass conservation to describe the transfer, degradation, and accumulation of pollutants among these media. The model parameters are set as follows: the transfer rate constant $k_{ij}$ mainly references experimental measurements of PFAS at the water-sediment interface from the environmental chemistry literature; the degradation rate constant $D_i$ was obtained by converting the half-life data of PFAS in various media; and the emission rate $E_i$ was derived from PFAS emission inventories collected from 283 enterprises. For the transfer rate constant, which exhibits significant parameter uncertainty, the system employed Monte Carlo simulation, setting the parameters to a log-normal distribution for uncertainty analysis. The model solution utilized the fourth-order Runge-Kutta method to obtain the concentration distribution and temporal evolution of PFAS in different media. Through exposure pathway analysis, the system identified the main PFAS exposure pathways to residents as drinking water intake, seafood consumption, and agricultural product consumption, and quantified the exposure contribution ratio of each pathway. Exposure dose was calculated using a multi-pathway chronic daily exposure dose formula. The calculations and the Monte Carlo probabilistic exposure assessment show that drinking water contributes the most, followed by consumption of aquatic products and agricultural products.

[0413] The goodness-of-fit R² of the multi-media migration model is 0.895, providing reliable basic data for exposure assessment and risk management.

[0414] For highly sensitive raw health records (Level 3) containing personally identifiable information, the medical and health data processing module first performs basic de-identification processing on the sensitive health data, removing direct identifiers such as names and ID numbers, and generalizing data such as age and address. For moderately sensitive outpatient data (Level 2) used for association analysis, the system implements K-anonymization to ensure that each record is indistinguishable from at least four other records. For statistical data to be published (Level 1), the system applies differential privacy protection, adding carefully designed noise to the query results, using (ε,δ)-differential privacy with ε=0.8 and δ=10⁻⁵. Through utility loss assessment, the system verified that the privacy-protected data maintained good performance on statistical distribution and machine learning tasks, with a comprehensive utility loss of only 12.3%, achieving a balance between privacy protection and data availability.
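The K-anonymity condition (each record indistinguishable from at least K-1 others on its quasi-identifiers) can be checked with a short sketch. The choice of quasi-identifiers, the ten-year age bands, and the sample records are assumptions.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def is_k_anonymous(records, quasi_identifiers, k=5):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical outpatient records after generalization.
records = [{"age_band": generalize_age(a), "district": "A", "diagnosis": "asthma"}
           for a in (31, 33, 35, 38, 39)]
ok = is_k_anonymous(records, ["age_band", "district"], k=5)

# One additional record in a different age band forms a group of size 1.
records_bad = records + [{"age_band": generalize_age(52), "district": "A", "diagnosis": "copd"}]
not_ok = is_k_anonymous(records_bad, ["age_band", "district"], k=5)
```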

[0415] In the spatiotemporal matching and fusion modeling phase, the system applied the Spatiotemporal Graph Neural Network (ST-GNN) model to solve the spatiotemporal matching problem for multi-source data. The system constructed a spatiotemporal graph structure covering the entire city, with nodes including water quality monitoring sections, pollution sources, residential areas, and medical institutions. Edges represent water system connectivity and spatial proximity relationships. The spatial adjacency matrix A is constructed based on the spatial distance between nodes, with a distance threshold set to 5km (adaptively determined based on the average spacing of water quality monitoring sections). The time-aware adjacency matrix $A_t$ is constructed as the element-wise product of A and the time decay function $f_t(\Delta t) = e^{-\beta \Delta t}$, with an attenuation coefficient β=0.03 (the temporal variation of water environment data is slower than that of atmospheric data, so the value of β is less than that in Example 1). By introducing a time-aware adjacency matrix and an attention mechanism, the model captured the spatiotemporal dependencies of the data and generated city-wide water environment quality grid data with a resolution of 200m×200m.

[0416] The attention coefficients were calculated with the time difference embedding function Φ implemented as a sine-cosine position encoding. Attention weight analysis revealed that the attention weights for nationally controlled cross-section data were significantly higher than those for enterprise self-testing data, reflecting that the model automatically learned the reliability differences among the data sources.
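A sine-cosine encoding of a time difference can be sketched in the style of Transformer positional encodings. The embedding dimension, the frequency base of 10000, and the function name are assumptions; the source only states that Φ uses a sine-cosine position encoding.

```python
import math

def time_diff_encoding(delta_t, dim=8, base=10000.0):
    """Interleaved sin/cos features of a time difference at geometrically spaced frequencies."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        features.append(math.sin(delta_t * freq))
        features.append(math.cos(delta_t * freq))
    return features

phi_zero = time_diff_encoding(0.0)     # a zero time difference gives [0, 1, 0, 1, ...]
phi_day = time_diff_encoding(24.0)     # hypothetical 24-hour difference
```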

[0417] Using the spatial leave-out method (dividing by cross-section, randomly selecting 9 cross-sections from 45 cross-sections as the test set, repeating the random division 5 times and taking the average result), the model's average interpolation error on the validation set was reduced by 44.7%, especially in river tributary areas with sparse monitoring points, where the prediction accuracy was significantly improved, reaching 61.2%. The multi-scale spatiotemporal registration engine realizes multi-scale data conversion from hourly to quarterly and from point to watershed. VRSI is implemented using a quadtree spatial index, and the division depth is adaptively determined according to the density of water quality monitoring points in each region. HTM adopts a hierarchical structure of hourly-daily-weekly-monthly-quarterly-yearly, which solves the problem of mismatch between quarterly monitoring data of emerging pollutants and daily health monitoring data in terms of time scale, as well as the difference between point monitoring and area risk assessment in terms of spatial scale.
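A quadtree whose depth adapts to local point density can be sketched as follows. The leaf capacity, the maximum depth, the dictionary-based node layout, and the sample coordinates are assumptions; points are expected to lie inside the half-open bounding box.

```python
def build_quadtree(points, bounds, max_points=2, depth=0, max_depth=8):
    """Split a cell into four quadrants until it holds at most `max_points` points."""
    x0, y0, x1, y1 = bounds
    if len(points) <= max_points or depth == max_depth:
        return {"bounds": bounds, "depth": depth, "points": list(points), "children": []}
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    children = []
    for q in quadrants:
        # Half-open cells: a point on a shared edge belongs to the upper/right quadrant.
        inside = [p for p in points if q[0] <= p[0] < q[2] and q[1] <= p[1] < q[3]]
        children.append(build_quadtree(inside, q, max_points, depth + 1, max_depth))
    return {"bounds": bounds, "depth": depth, "points": [], "children": children}

def leaves(node):
    """Yield all leaf cells of the tree."""
    if not node["children"]:
        yield node
    for child in node["children"]:
        yield from leaves(child)

# Hypothetical layout (km): a dense cluster of monitoring points and one isolated point.
pts = [(1, 1), (1.5, 1), (1, 1.5), (1.5, 1.5), (0.5, 0.5), (0.7, 1.2), (12, 12)]
tree = build_quadtree(pts, bounds=(0, 0, 16, 16))
leaf_list = list(leaves(tree))
```

The dense cluster is subdivided five levels deep, while the cell holding the isolated point stops after a single split, which is the adaptive behaviour the paragraph describes.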

[0418] The data fusion quality assessment system comprehensively evaluated the fusion results, calculating the total uncertainty, which includes cognitive uncertainty U_e, random uncertainty U_a, detection limit uncertainty U_d, and multi-media transformation uncertainty U_m. U_e is estimated by the prediction variance of the ensemble of 5 models, U_a is estimated by the heteroscedasticity neural network, U_d is given by the Bootstrap resampling result in the detection limit processing step, and U_m is given by the error propagation formula. The uncertainty map is obtained from the propagation of parameter uncertainty in the multi-media migration model and provides decision-makers with a reference for data reliability.

[0419] In the application interface and service phase, the system has built standardized data service interfaces to support a variety of application scenarios.

[0420] The RESTful API service provides query interfaces for water environmental quality, emerging pollutant distribution, and health risk assessment, supporting parameterized queries by pollutant type, environmental medium, and geographical region. Specifically, the dedicated endpoint for emerging pollutants (/api/v1/emerging-pollutants) supports combined queries by PFAS carbon chain length (C4-C14), microplastic polymer type (PE, PP, PET, etc.), and environmental medium (water, sediment, organisms). The data access control system implements granular permission management based on roles and attributes, ensuring the secure use of sensitive data. Real-time data push services provide environmental protection departments and health commissions with real-time early warnings of pollutant exceedances and increased health risks. The data visualization interface generates intuitive visualization products, including spatial distribution heatmaps of emerging pollutants, multi-media migration network diagrams, and health risk zoning maps, helping decision-makers understand complex environmental health relationships. The system also provides a mobile application interface for the public, allowing residents to query water environmental quality and health risks in their area, obtain personalized protection advice, and improve their awareness of risk prevention.

[0421] This system has achieved remarkable results in the monitoring and risk management of emerging pollutants in urban water environments. Regarding data acquisition efficiency, the system has achieved an automation rate of over 90%, reducing data update latency from days to minutes, significantly shortening the annual data processing workload, and substantially lowering labor costs. In terms of emerging pollutant assessment capabilities, the system has achieved, for the first time, a high-precision assessment of microplastic and PFAS pollution across the entire city, significantly improving data availability through detection limit processing technology. Regarding spatiotemporal prediction accuracy, the system's spatiotemporal graph neural network model overcomes the prediction difficulties of traditional methods in sparse data areas. Compared to traditional Kriging interpolation methods, the system significantly reduces prediction errors in sparsely monitored river tributary areas. In terms of health risk assessment, the system has established a correlation model between emerging pollutant exposure and health effects through multi-source data fusion. Studies have found a significant correlation between PFAS exposure levels and abnormalities in specific endocrine indicators, and a potential correlation between microplastic exposure and immune function indicators. Regarding multi-departmental collaboration, the system has broken down data barriers between environmental protection, water resources, and health departments, establishing a collaborative management mechanism for water environment health risks.

[0422] This application case demonstrates that the system can effectively address key challenges in monitoring emerging pollutants in the water environment and assessing health risks, such as difficulties in data acquisition, poor standardization, and low fusion accuracy. Through intelligent acquisition and spatiotemporal fusion processing of multi-source heterogeneous data, it provides comprehensive data support and decision assistance for environmental health management, demonstrating significant scientific value and social benefits.

[0423] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A multi-source heterogeneous data intelligent acquisition and spatiotemporal fusion processing system, characterized in that it comprises: a multi-source data automatic acquisition module, which, through an API interface adaptive docking system, a semi-supervised learning enhanced intelligent crawler engine, and a historical data parsing system, acquires pollution source, environmental media, health, and emerging pollutant data from environmental protection, health, and meteorological departments and from emerging pollutant monitoring data sources; a unified data processing module, which includes a general data quality control system, a pollution source data specific processing module, an environmental monitoring data specific processing module, a medical and health data processing module, and an emerging pollutant data processing module, and which performs data cleaning, standardization, and type-specific processing; a spatiotemporal matching and interpolation module, which uses the spatiotemporal graph neural network (ST-GNN) model to realize the spatiotemporal alignment and fusion of multi-source data, wherein the ST-GNN model constructs a time-aware adjacency matrix by combining the spatial adjacency matrix with the time decay function and uses an attention mechanism to handle the heterogeneity of multi-source data, the time-aware adjacency matrix being obtained by the element-wise product of the spatial adjacency matrix A and the time decay function $f_t$; and an application interface and service module, which provides standardized data access through RESTful API services, real-time data push services, and data visualization interfaces.

2. The intelligent acquisition and spatiotemporal fusion processing system for multi-source heterogeneous data according to claim 1, characterized in that, in the multi-source data automatic acquisition module: the API interface adaptive docking system converts data through a parameter mapping matrix, $P_{target} = M \cdot P_{source}$, where $P_{target}$ is the target system parameter vector, $P_{source}$ is the source system parameter vector, and M is the parameter mapping matrix, thereby realizing cross-system parameter transformation and supporting the RESTful, SOAP, and GraphQL protocols; the intelligent crawler engine collects unstructured data by calculating the pattern similarity $S_{pattern} = \alpha S_{struct} + (1-\alpha) S_{content}$, where $S_{pattern}$ is the pattern similarity, $S_{struct}$ is the structural similarity, $S_{content}$ is the content similarity, and α is a balancing parameter; the historical data parsing system employs a deep learning model $R = f_{struct}(I) \oplus f_{content}(T)$ to extract tabular data from documents, where R is the recognition result, $f_{struct}$ is the table structure recognition network, I is the document image, $f_{content}$ is the content recognition network, T is the text content, and ⊕ is the feature fusion operator; and the feature fusion adopts at least one of feature concatenation, weighted summation, or attention mechanism weighted fusion.

3. The intelligent acquisition and spatiotemporal fusion processing system for multi-source heterogeneous data according to claim 1, characterized in that, in the unified data processing module: the general data quality control system performs outlier detection with a multi-level outlier scoring model $S = w_1 S_{stat} + w_2 S_{ml} + w_3 S_{rule}$, where S is the outlier score, $S_{stat}$ is the statistical method score, $S_{ml}$ is the machine learning score, $S_{rule}$ is the domain rule score, and $w_1$, $w_2$, and $w_3$ are the weighting coefficients; the medical and health data processing module implements tiered privacy protection: basic de-identification processing is performed for low-sensitivity data; K-anonymization and selective differential privacy protection are performed for medium-sensitivity data, where K≥5, ensuring that any record is indistinguishable from at least K-1 other records; and full differential privacy protection and data aggregation processing are performed for high-sensitivity data, with a privacy budget ε of 0.1, the differential privacy satisfying the (ε,δ)-differential privacy definition; and the emerging pollutant data processing module provides detection limit processing strategies and multi-media migration models for microplastics and perfluorinated compounds, wherein the detection limit processing strategies include LOD/2 substitution and maximum likelihood estimation and are adaptively selected based on the detection rate.

4. The intelligent acquisition and spatiotemporal fusion processing system for multi-source heterogeneous data according to claim 1, characterized in that, in the spatiotemporal matching and interpolation module: the ST-GNN model uses the spatiotemporal graph convolution operation $H^{(l+1)} = \sigma(A_t H^{(l)} W^{(l)})$ to capture spatiotemporal dependencies, where $H^{(l)}$ is the node feature matrix of the l-th layer, $H^{(l+1)}$ is the node feature matrix of the (l+1)-th layer, $A_t$ is the time-aware adjacency matrix, $W^{(l)}$ is the weight matrix, and σ is the activation function; the time-aware adjacency matrix is obtained as the element-wise product of the spatial adjacency matrix A and the time decay function, $A_t(i,j) = A(i,j) \cdot f_t(|t_i - t_j|)$, where A(i,j) is the element in the i-th row and j-th column of the spatial adjacency matrix, $t_i$ and $t_j$ are the timestamps of nodes i and j, respectively, and $|t_i - t_j|$ is the absolute value of the difference between the timestamps; the time decay function uses exponential decay, $f_t(\Delta t) = e^{-\beta \Delta t}$, where β is the attenuation coefficient and $f_t$ decreases as the time difference Δt increases; the multi-scale spatiotemporal registration engine realizes cross-scale data transformation based on the variable resolution spatial index VRSI and the hierarchical temporal model HTM, wherein the VRSI is implemented using quadtree or geohash encoding and the HTM includes a hierarchical structure of hour-day-week-month-quarter-year; and the data fusion quality assessment system calculates the total uncertainty as the square root of the sum of squares of all uncertainty components, $U_{total}(x,t) = \sqrt{U_e^2 + U_a^2 + U_d^2 + U_m^2}$, where $U_{total}(x,t)$ is the total uncertainty at position x and time t, $U_e$ is the cognitive uncertainty, $U_a$ is the random uncertainty, $U_d$ is the detection limit source uncertainty, and $U_m$ is the multi-media transformation uncertainty.

5. The intelligent acquisition and spatiotemporal fusion processing system for multi-source heterogeneous data according to claim 4, characterized in that the attention coefficient is calculated by a formula in which $h_i$ and $h_j$ are node features, a and W are learnable parameters, || denotes feature concatenation, $\Delta t_{ij}$ is the time difference, and Φ is the time difference embedding function; and the model training adopts an occlusion reconstruction objective, in which self-supervised learning is performed by randomly occluding node values at a preset ratio, and the loss function includes a reconstruction loss and a smoothing regularization term.

6. The intelligent acquisition and spatiotemporal fusion processing system for multi-source heterogeneous data according to claim 1, characterized in that, in the application interface and service module: the RESTful API service provides a dedicated endpoint for emerging pollutants to support multi-dimensional queries by pollutant type, medium, and region; the real-time data push service is based on the WebSocket protocol, adopts a publish-subscribe model and message batch processing technology, and supports high-concurrency connections; and the data access control system combines role-based access control (RBAC) and attribute-based access control (ABAC) permission models, defining four roles (administrator, researcher, analyst, and browser) as well as data sensitivity and geographic range attributes.

7. The intelligent acquisition and spatiotemporal fusion processing system for multi-source heterogeneous data according to claim 1, characterized in that, The system adopts a microservice architecture and containerized deployment, supporting elastic scaling; and achieves seamless access to new pollutant data through a modular plug-in architecture, without the need to refactor the system.

8. A method for processing multi-source heterogeneous data in the field of environmental health, based on the system described in any one of claims 1 to 7, characterized in that the method comprises: Step S1: automatically collecting multi-source heterogeneous data through adaptive API interface integration and intelligent web crawling technology; Step S2: performing unified data processing on the collected multi-source heterogeneous data, including: identifying and processing outliers using a multi-level outlier detection model that integrates statistical scoring, machine learning scoring, and domain rule scoring; processing missing values by adaptively selecting imputation methods based on missing-data patterns; implementing graded privacy protection for medical and health data based on sensitivity levels; and, for emerging pollutant data, adaptively selecting detection limit processing strategies based on detection rates and simulating cross-media migration and transformation of pollutants through a multi-media migration model; Step S3: performing spatiotemporal fusion using the ST-GNN model employing a time-aware adjacency matrix and attention mechanism, including: constructing a spatiotemporal graph structure with monitoring stations as nodes and spatial proximity relationships as edges; constructing a time-aware adjacency matrix through element-wise multiplication of the spatial adjacency matrix and the time decay function; calculating the importance weights of nodes from different data sources through the attention mechanism; performing spatiotemporal graph convolution operations to generate gridded prediction results; and calculating the total uncertainty, which includes epistemic uncertainty, aleatoric uncertainty, detection limit uncertainty, and multi-media transformation uncertainty, and generating an uncertainty assessment; Step S4: outputting the fusion results through standardized API services and real-time push.
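
The multi-level outlier detection of Step S2 can be illustrated with a simple weighted score fusion. In this sketch the statistical score is a robust z-score based on the median absolute deviation, the machine learning scores are supplied externally (for example by an anomaly detector), and the domain rule is a physical range check. The weights, the cap on the z-score, the threshold, and the sample readings are assumptions for the example rather than values taken from the claim.

```python
import statistics

def statistical_scores(values, z_cap=3.5):
    """Robust z-score (median / MAD), scaled to [0, 1] by an assumed cap."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad if mad > 0 else 1.0   # 1.4826 makes MAD comparable to a std dev
    return [min(abs(v - med) / scale / z_cap, 1.0) for v in values]

def rule_scores(values, lower, upper):
    """Domain rule: 1.0 if the value is physically implausible, else 0.0."""
    return [0.0 if lower <= v <= upper else 1.0 for v in values]

def fused_outlier_flags(values, ml_scores, lower, upper,
                        weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Weighted sum of the three score levels; returns indices flagged as outliers."""
    w_stat, w_ml, w_rule = weights
    stat = statistical_scores(values)
    rule = rule_scores(values, lower, upper)
    fused = [w_stat * s + w_ml * m + w_rule * r
             for s, m, r in zip(stat, ml_scores, rule)]
    return [i for i, f in enumerate(fused) if f >= threshold]

# Hypothetical hourly concentration readings with one spike and one negative value.
readings = [10.0, 11.0, 9.0, 10.0, 250.0, -5.0]
ml_scores = [0.1, 0.1, 0.1, 0.1, 0.9, 0.4]     # assumed output of an external detector
flagged = fused_outlier_flags(readings, ml_scores, lower=0.0, upper=1000.0)
```

The spike is flagged by the statistical and machine learning levels together, while the negative reading is flagged mainly by the statistical level and the domain rule, which shows why the three levels are combined rather than used alone.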

9. The method according to claim 8, characterized in that: the medical and health data processing employs a differential privacy mechanism $M$ that satisfies $\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta$, where $D$ and $D'$ are any two datasets that differ by one record, $S$ is any set of outputs, $\varepsilon$ is the privacy budget, and $\delta$ is the relaxation parameter; the utility loss $U_{\mathrm{loss}}$ is kept within a preset threshold and is calculated as $U_{\mathrm{loss}} = \alpha \cdot D_{\mathrm{stat}}(O, P) + \beta \cdot \Delta_{\mathrm{ML}}(O, P)$, where $D_{\mathrm{stat}}$ is the statistical distribution distance, $\Delta_{\mathrm{ML}}$ is the performance degradation on machine learning tasks, $O$ is the original data, $P$ is the protected data, and $\alpha$ and $\beta$ are the weighting coefficients; and the emerging pollutant data processing includes detection limit substitution strategies and multi-media migration simulation, supporting a complete processing chain for various emerging pollutants.
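
As one possible instance of an $(\varepsilon, \delta)$ differential privacy mechanism, the sketch below releases a noisy mean using the classical Gaussian mechanism and evaluates a weighted utility loss. The claim does not name a specific mechanism, so the Gaussian mechanism, the clipping bounds, the substitution notion of neighboring datasets, and the names `alpha` and `beta` for the weighting coefficients are assumptions.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism calibration, valid for 0 < epsilon < 1."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("classical calibration requires 0 < epsilon < 1 and 0 < delta < 1")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon

def private_mean(values, lower, upper, epsilon, delta, rng):
    """Release the mean of clipped values with Gaussian noise added."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return sum(clipped) / len(clipped) + rng.gauss(0.0, sigma)

def utility_loss(d_stat, ml_degradation, alpha, beta):
    """Weighted sum of distribution distance and ML performance degradation."""
    return alpha * d_stat + beta * ml_degradation

# Hypothetical example: ages of 1000 patients, clipped to [0, 100].
rng = random.Random(42)
ages = [rng.uniform(20, 80) for _ in range(1000)]
noisy_mean = private_mean(ages, 0.0, 100.0, epsilon=0.5, delta=1e-5, rng=rng)

u_loss = utility_loss(d_stat=0.2, ml_degradation=0.1, alpha=0.5, beta=0.5)
within_budget = u_loss <= 0.2          # assumed preset threshold
```

A smaller privacy budget increases the noise scale and therefore the utility loss, so in practice the budget would be tuned until `u_loss` stays under the preset threshold.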

10. The method according to claim 8, characterized in that the robustness of the ST-GNN model is evaluated through cross-validation, the cross-validation RMSE being calculated as $\mathrm{RMSE}_{\mathrm{cv}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_{-i}\right)^2}$, where $y_i$ is the actual observed value, $\hat{y}_{-i}$ is the predicted value obtained without using data source $i$, and $N$ is the number of observation points; an uncertainty map is thereby generated to guide the data reliability assessment.
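
The leave-one-source-out evaluation of claim 10 can be sketched as follows. The predictor used here, the mean of the remaining observations, is only a stand-in for the ST-GNN model, and the three sample values are invented for the example.

```python
import math

def cross_validation_rmse(observations, predict_without):
    """RMSE between each observation y_i and the prediction made without source i."""
    n = len(observations)
    squared_error = sum((observations[i] - predict_without(i)) ** 2
                        for i in range(n))
    return math.sqrt(squared_error / n)

# Hypothetical observations from three data sources.
y = [10.0, 12.0, 14.0]

def mean_of_others(i):
    """Stand-in predictor: average of all observations except source i."""
    others = y[:i] + y[i + 1:]
    return sum(others) / len(others)

rmse_cv = cross_validation_rmse(y, mean_of_others)
```

The per-point squared errors computed inside `cross_validation_rmse` are also the natural input for an uncertainty map, since they indicate where predictions depend most heavily on a single source.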