Artificial intelligence-based basin perfluorinated compound tracking and tracing method and system
By constructing a multi-source data cube and combining it with deep learning and hydrodynamic models, the problem of accuracy in tracing the sources of perfluorinated compounds at the watershed scale was solved, enabling precise identification and treatment of pollution sources and improving the targeting and efficiency of pollution control.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF GEOSCIENCES (WUHAN)
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies are insufficient for precise source tracing of perfluorinated and polyfluorinated compounds at the watershed scale, resulting in a lack of targeted pollution control, continuous spread of pollutants, and impact on ecosystems and drinking water safety.
By integrating high-resolution mass spectrometry data, hydrological and meteorological data, and spatial distribution data of polluting enterprises within the watershed, a multi-source data cube is constructed. End-to-end analysis is performed using a deep learning model. A migration and transport model is constructed by combining hydrodynamics and environmental behavior equations. The pollution source contribution is calculated by backpropagation using an attention mechanism, and a pollution source contribution heatmap is generated.
It enables precise identification and migration simulation of perfluorinated compound structures, clearly identifies pollution hotspots, provides accurate basis for governance, and improves the targeting and efficiency of watershed pollution control.
Figure CN121905342B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of environmental monitoring and pollution source tracing technology, and in particular to a watershed-scale method and system for tracing perfluorinated compounds based on artificial intelligence.
Background Technology
[0002] In the field of perfluorinated and polyfluorinated compound (PFAS) pollution control, existing technologies mainly include point source monitoring, empirical model extrapolation, non-targeted screening combined with mass spectrometry database matching, and machine learning prediction models. Point source monitoring acquires pollutant concentration and emission information from specific emission sources by deploying monitoring equipment at fixed locations. Empirical models estimate pollutant distribution trends from historical monitoring data and statistical patterns. Non-targeted screening scans the complex components of environmental samples comprehensively and, combined with mass spectrometry database matching, identifies known compounds. Machine learning prediction models predict pollutant concentrations or sources by mining correlations in the data. These technologies are useful in simple scenarios such as local pollution monitoring and identification of known pollutants: they provide basic data support for pollution control, are relatively simple to operate, and yield preliminary results quickly.
[0003] However, existing technologies have significant limitations in tracing the sources of perfluorinated and polyfluorinated compounds at the watershed scale. The core issue is the difficulty of tracing pollutants accurately in complex scenarios, which stems from a combination of factors. First, pollutant sources within the watershed include not only explicit industrial point sources but also non-point source emissions, so existing monitoring systems cannot comprehensively cover all potential emission points. Second, pollutant migration within the watershed is influenced by factors such as hydrological dynamics and land use types; existing models often fail to fully couple these dynamic processes with pollutant diffusion mechanisms and rely solely on statistical correlations or simple empirical formulas, so they do not accurately reflect the true migration paths of pollutants. Third, new compounds are constantly emerging while mass spectrometry databases lag behind, making it difficult for non-targeted screening to identify unknown compounds whose structures are not listed. As a result, tracing results diverge from the actual pollution sources and key emission sources cannot be pinpointed. Pollution control measures therefore lack targeting, and pollutants continue to migrate and spread within the watershed, disrupting the balance of the watershed ecosystem, degrading the habitat of aquatic organisms, and potentially threatening drinking water safety through surface water and groundwater circulation, which increases the difficulty and resource investment of subsequent pollution control. For example, if a river basin is polluted by perfluorinated compounds of unknown origin, existing technology can only determine the extent of the pollution but cannot accurately trace its source. The entire area must then be controlled with generalized measures, which fail to curb the spread of pollution effectively and waste resources.
Summary of the Invention
[0004] To overcome the aforementioned shortcomings of existing technologies, this invention provides an artificial intelligence-based method for tracing and tracking perfluorinated compounds at the watershed scale. This method can clearly identify pollution hotspots and key control targets, providing environmental management departments with precise and quantifiable governance data, and significantly improving the targeting and efficiency of watershed perfluorinated compound pollution control.
[0005] The technical solution adopted to achieve the above-mentioned objectives of this invention is as follows:
[0006] A watershed-scale method for tracing and tracking perfluorinated compounds based on artificial intelligence, the method comprising the following steps:
[0007] S1: Integrate high-resolution mass spectrometry data, hydrological and meteorological data, land use data, and spatial distribution data of polluting enterprises within the watershed, align them according to a unified spatiotemporal granularity, and construct a watershed multi-source data cube; the data structure of the watershed multi-source data cube is a three-dimensional tensor, and the tensor elements are composite attribute records containing chemical feature vectors, hydrological parameter vectors, meteorological index vectors, and land use codes;
[0008] S2: Based on the mass spectrometry feature data in the multi-source data cube of the watershed, the deep learning model trained by virtual spectrum enhancement is used for end-to-end analysis, outputting a list of candidate molecular structures. After screening by molecular formula constraint and retention time verification, the result table of perfluorinated compound structure identification is obtained.
[0009] S3: Construct a hydrological response unit graph structure based on the multi-source data cube of the watershed, use the physicochemical parameters in the perfluorinated compound structure identification result table as node attributes, and construct a migration and transport model by combining hydrodynamics and environmental behavior equations to simulate the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0010] S4: Based on the spatiotemporal concentration distribution matrix of perfluorinated compounds, the contribution weight of each potential emission source to the monitoring point is calculated by backpropagation using the attention mechanism. The contribution is then matched and verified by the enterprise source feature fingerprint database to generate a heat map of pollution source contribution and a list of priority control enterprises.
[0011] Furthermore, the steps for constructing the watershed multi-source data cube in S1 include:
[0012] S1.1: First, collect raw multi-source monitoring data within the watershed and construct a raw multi-source monitoring data set, including raw mass spectrometry acquisition sequences, raw hydrological observation sequences, raw meteorological observation sequences, raw land use image data, and raw sewage discharge enterprise registration data;
[0013] S1.2: Then, the original mass spectrometry acquisition sequence is processed through a mass spectrometry preprocessing procedure to obtain the mass spectrometry feature matrix;
[0014] S1.3: Obtain a unified time-granularity hydro-meteorological matrix by using a time-series resampling process from the original hydrological observation sequence and the original meteorological observation sequence;
[0015] S1.4: Obtain a unified spatial granularity geographic feature matrix by using the original land use image data and the original pollution discharge enterprise registration data through a spatial rasterization process;
[0016] S1.5: Finally, based on the mass spectrometry feature matrix, the unified temporal granularity hydro-meteorological matrix, and the unified spatial granularity geographic feature matrix, a watershed multi-source data cube is constructed through a spatiotemporal fusion process.
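As a minimal illustrative sketch of steps S1.3 to S1.5, the fragment below resamples irregular observations onto a fixed time step and assembles a small cube of composite records. The axis layout (time step, grid row, grid column), the record keys, and the helper names `resample_mean` and `build_cube` are assumptions made for illustration; the text above does not specify them.

```python
from collections import defaultdict
from statistics import mean


def resample_mean(observations, step_seconds):
    """Average (timestamp_seconds, value) pairs into fixed-width time bins."""
    bins = defaultdict(list)
    for t, value in observations:
        bins[int(t // step_seconds)].append(value)
    return {k: mean(v) for k, v in sorted(bins.items())}


def build_cube(n_steps, n_rows, n_cols, flow_by_step, rain_by_step, land_use_grid):
    """Return cube[t][r][c], where each element is a composite attribute record."""
    cube = []
    for t in range(n_steps):
        layer = []
        for r in range(n_rows):
            row = []
            for c in range(n_cols):
                row.append({
                    "chem": [],                        # chemical feature vector, filled later
                    "hydro": [flow_by_step.get(t)],    # hydrological parameter vector
                    "meteo": [rain_by_step.get(t)],    # meteorological index vector
                    "land_use": land_use_grid[r][c],   # land use code
                })
            layer.append(row)
        cube.append(layer)
    return cube


# Hourly resampling of hypothetical flow and rainfall readings.
flow = resample_mean([(0, 1.0), (1800, 3.0), (3600, 2.0)], step_seconds=3600)
rain = resample_mean([(0, 0.0), (3600, 5.0)], step_seconds=3600)
cube = build_cube(2, 1, 2, flow, rain, [[11, 42]])
```

For brevity the sketch assigns the same hydro-meteorological value to every cell of a time step; a real fusion step would interpolate station readings spatially.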
[0017] Furthermore, the mass spectrometry preprocessing steps described in S1.2 include:
[0018] S1.2.1: Perform baseline correction on the original mass spectrometry acquisition sequence, use the asymmetric least squares smoothing method to fit the background baseline curve at each scan time, and subtract the corresponding baseline value from the response intensity value in the original mass spectrometry acquisition sequence to obtain the baseline-corrected mass spectrometry sequence.
[0019] S1.2.2: Perform noise filtering on the baseline-corrected mass spectrometry sequence. Use wavelet transform to decompose the response intensity signal at each scanning time into multiple scales. Set the coefficients of the high-frequency components with amplitudes lower than the set noise threshold to zero and then perform inverse transform reconstruction to obtain the denoised mass spectrometry sequence.
[0020] S1.2.3: Perform peak extraction on the denoised mass spectrometry sequence, use the continuous wavelet transform ridge tracking method to identify the chromatographic peak position and peak boundary in the signal at each scanning time, extract the peak mass-to-charge ratio, peak area and peak width parameters, and construct the original peak list;
[0021] S1.2.4: Perform feature alignment operation on the original peak list. Using mass-to-charge ratio deviation tolerance and retention time deviation tolerance as constraints, use hierarchical clustering method to merge the peaks corresponding to the same compound detected at different scanning times into a unified feature list.
[0022] S1.2.5: Based on the aligned feature list, construct the mass spectrometry feature matrix using the feature number as the row index, the sampling point number as the column index, and the peak area normalized value as the matrix element.
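The matrix layout of S1.2.5 can be sketched as follows, with features as rows and sampling points as columns. The normalization used here (dividing each peak area by the total peak area of its sampling point) is an assumption, since the text only says "peak area normalized value"; the function name `feature_matrix` and the input layout are likewise hypothetical.

```python
def feature_matrix(aligned, sample_ids):
    """Build a feature-by-sample matrix of normalized peak areas.

    aligned: list of {sample_id: peak_area} dicts, one dict per aligned feature.
    """
    totals = {s: sum(f.get(s, 0.0) for f in aligned) for s in sample_ids}
    return [
        [f.get(s, 0.0) / totals[s] if totals[s] else 0.0 for s in sample_ids]
        for f in aligned
    ]


aligned = [{"S1": 30.0, "S2": 10.0}, {"S1": 10.0}]
matrix = feature_matrix(aligned, ["S1", "S2"])
```

A feature that was not detected at a sampling point is stored as zero, which keeps the matrix rectangular.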
[0023] Furthermore, the perfluorinated compound structure identification step in S2 includes:
[0024] S2.1: Extract the mass spectrometry feature vectors corresponding to each sampling point from the multi-source data cube of the watershed, and construct the mass spectrometry feature set to be analyzed;
[0025] S2.2: Construct a pre-trained molecular structure generation model based on virtual spectrum generation and fragment masking enhancement strategies;
[0026] S2.3: Based on the pre-trained molecular structure generation model, perform end-to-end parsing of each mass spectrometry feature vector in the mass spectrometry feature set to be analyzed, and generate a preliminary list of candidate molecular structures;
[0027] S2.4: Perform molecular formula constraint verification on the preliminary candidate molecular structure list, screen out candidates whose elemental composition does not conform to the characteristics of perfluorinated compounds, and obtain the candidate list after molecular formula verification;
[0028] S2.5: Perform retention time verification on the candidate list after molecular formula verification, and screen out candidates whose theoretical retention time deviates from the measured retention time by more than the limit, to obtain the candidate list after retention time verification;
[0029] S2.6: Based on the candidate list after retention time verification, sort by confidence level and supplement spatiotemporal information to generate a perfluorinated compound structure identification result table.
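The screening chain of S2.4 to S2.6 can be sketched as two filters followed by a sort. The concrete criteria are assumptions: the text does not state what elemental composition counts as perfluorinated, so the sketch uses a hypothetical minimum fluorine-to-carbon ratio, and the retention time limit and the dictionary keys are illustrative as well.

```python
import re


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C8HF15O2' (no brackets)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def looks_fluorinated(formula, min_f_per_c=1.0):
    """Hypothetical formula constraint: carbon present and F/C above a ratio."""
    counts = parse_formula(formula)
    carbons, fluorines = counts.get("C", 0), counts.get("F", 0)
    return carbons > 0 and fluorines / carbons >= min_f_per_c


def screen(candidates, measured_rt, rt_limit=0.5):
    """Apply the formula and retention time checks, then sort by confidence."""
    kept = [
        c for c in candidates
        if looks_fluorinated(c["formula"])
        and abs(c["predicted_rt"] - measured_rt) <= rt_limit
    ]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)


candidates = [
    {"formula": "C8HF15O2", "predicted_rt": 6.1, "confidence": 0.80},
    {"formula": "C8HF17O3S", "predicted_rt": 6.3, "confidence": 0.90},
    {"formula": "C8H16O2", "predicted_rt": 6.2, "confidence": 0.95},  # no fluorine
    {"formula": "C4HF7O2", "predicted_rt": 2.0, "confidence": 0.70},  # RT too far off
]
result = screen(candidates, measured_rt=6.2)
```

The retention times and confidence values are invented for the example; only the formulas correspond to real compounds.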
[0030] Furthermore, the construction steps of the pre-trained molecular structure generation model described in S2.2 include:
[0031] S2.2.1: Retrieve molecular structure representations and experimental mass spectra of known perfluorinated compounds from chemical databases to construct a basic training sample set;
[0032] S2.2.2: Based on quantum chemical calculation methods, the theoretical molecular structure of perfluorinated compounds is fragmented and simulated to predict the mass-to-charge ratio and relative abundance of each fragment ion, generating a set of virtual mass spectra.
[0033] S2.2.3: Perform fragment masking enhancement operation on the basic training sample set and the virtual mass spectrum set. Randomly select a certain proportion of fragment peaks in each mass spectrum for masking. Set the response intensity of the masked peaks to zero, while retaining the original molecular structure representation as a label to generate an enhanced training sample set.
[0034] S2.2.4: Construct a deep learning model with an encoder-decoder architecture as the initial model. The encoder uses a multi-head self-attention mechanism to encode the features of the input mass spectrum peak sequence, and the decoder uses an autoregressive generation mechanism to predict the symbols of the molecular structure representation string one character at a time.
[0035] S2.2.5: Using the mass spectrometry peak sequence in the enhanced training sample set as input and the corresponding molecular structure representation string as output, the cross-entropy loss function is used to measure the difference between the predicted character and the real character, and the adaptive learning rate optimization algorithm is used to iteratively update the model parameters. After training is completed, the model parameters are saved to obtain the pre-trained molecular structure generation model.
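The fragment masking enhancement of S2.2.3 can be illustrated with a short sketch. The masking proportion, the number of augmented copies per spectrum, the function names, and the example peak list are all assumptions chosen for illustration; the label string is left untouched, as the step requires.

```python
import random


def mask_fragments(peaks, mask_fraction, rng):
    """Return a copy of [(mz, intensity), ...] with a random subset zeroed."""
    n_mask = int(len(peaks) * mask_fraction)
    masked = set(rng.sample(range(len(peaks)), n_mask))
    return [
        (mz, 0.0 if i in masked else intensity)
        for i, (mz, intensity) in enumerate(peaks)
    ]


def augment(samples, copies=3, mask_fraction=0.4, seed=0):
    """Create masked copies of each (peaks, label) pair; labels are kept as-is."""
    rng = random.Random(seed)
    return [
        (mask_fragments(peaks, mask_fraction, rng), label)
        for peaks, label in samples
        for _ in range(copies)
    ]


# Illustrative spectrum with five fragment peaks and a SMILES label.
spectrum = [(412.966, 100.0), (368.976, 60.0), (218.986, 25.0),
            (168.989, 40.0), (118.992, 15.0)]
label = "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F"
enhanced = augment([(spectrum, label)])
```

Masking forces the model to infer a structure from incomplete fragment evidence, which mirrors field spectra where low-abundance fragments are often missing.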
[0036] Furthermore, the steps in S3 to construct a migration and transport model and simulate concentration distribution include:
[0037] S3.1: Construct a hydrological response unit graph structure based on hydrological data in the multi-source data cube of the watershed;
[0038] S3.2: Extract the physicochemical parameters of each compound from the perfluorinated compound structure identification result table, and use them as the compound attribute vectors of the nodes in the hydrological response unit graph structure;
[0039] S3.3: Construct a graph neural network that integrates physical mechanisms as the core architecture of the migration and transport model;
[0040] S3.4: Perform time-series simulation based on the migration and transport model to generate the spatiotemporal concentration distribution matrix of perfluorinated compounds;
[0041] S3.5: Perform observation assimilation correction on the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0042] Furthermore, the graph neural network construction steps that integrate physical mechanisms as described in S3.3 include:
[0043] S3.3.1: Define the node state vector, including compound concentration value, hydrological response unit area, land use type code and compound physicochemical parameter vector. Extract the compound concentration observation value of each hydrological response unit at the initial time from the watershed multi-source data cube. Set the initial concentration value to zero for units without observation.
[0044] S3.3.2: Define a message passing function, use hydrodynamic equations to calculate the water flow velocity based on edge attributes, and then calculate the advection flux based on the water flow velocity and the concentration value of the upstream node;
[0045] S3.3.3: Define the environmental behavior decay function, use the first-order kinetic degradation equation, calculate the degradation decay coefficient based on the various half-life values and time steps in the compound physicochemical parameter vector, and multiply the node concentration value by the degradation decay coefficient to obtain the concentration value after degradation.
[0046] S3.3.4: Define the rainfall scour response function, extract the precipitation data at each time moment from the watershed multi-source data cube, query the corresponding scour coefficient according to the land use type code of the node, and multiply the precipitation, scour coefficient and node area to obtain the rainfall scour inflow.
[0047] S3.3.5: Define the adsorption-sedimentation function, calculate the adsorption partition coefficient based on the logarithmic value of the octanol-water partition coefficient in the compound's physicochemical parameter vector, and calculate the adsorption-sedimentation loss based on the estimated value of the suspended particulate matter concentration at the node and the sedimentation rate.
[0048] S3.3.6: Combine the message passing function, environmental behavior decay function, rainfall scour response function, and adsorption-sedimentation function into a single-step update module of the graph neural network. The module takes the node state vector and edge attributes at the current time as input and outputs the predicted node concentration at the next time step.
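The single-step update of S3.3.2 to S3.3.6 can be sketched as an explicit mass balance on a small graph. Several pieces are assumptions, since the text does not name them: Manning's formula stands in for the hydrodynamic equation, each node carries a hypothetical `volume`, the partition coefficient is taken as a placeholder `kd_scale` times ten to the power of log Kow, and units are left abstract. The learned components of a graph neural network are omitted; only the physical terms are shown.

```python
import math


def manning_velocity(roughness, hydraulic_radius, slope):
    """Manning's formula in SI units (an assumed choice of hydrodynamic equation)."""
    return hydraulic_radius ** (2.0 / 3.0) * math.sqrt(slope) / roughness


def decay_factor(half_life, dt):
    """First-order degradation over one time step: exp(-ln 2 * dt / half_life)."""
    return math.exp(-math.log(2.0) * dt / half_life)


def step(nodes, edges, rain, scour_coeff, dt):
    """Advance node concentrations by one time step.

    nodes: {node_id: state dict}; edges: [(upstream_id, downstream_id, attrs)].
    """
    net_flux = {n: 0.0 for n in nodes}
    for up, down, e in edges:
        velocity = manning_velocity(e["roughness"], e["radius"], e["slope"])
        flux = velocity * e["cross_section"] * nodes[up]["conc"]  # advection flux
        net_flux[up] -= flux
        net_flux[down] += flux
    nxt = {}
    for n, s in nodes.items():
        conc = s["conc"] * decay_factor(s["half_life"], dt)       # degradation
        scour = rain * scour_coeff[s["land_use"]] * s["area"]     # rainfall scour inflow
        kd = s["kd_scale"] * 10.0 ** s["log_kow"]                 # partition coefficient
        particulate = kd * s["spm"] / (1.0 + kd * s["spm"])       # sorbed fraction
        settling = particulate * s["settling_rate"] * conc        # sedimentation loss
        conc += dt * ((net_flux[n] + scour) / s["volume"] - settling)
        nxt[n] = max(conc, 0.0)
    return nxt


base = {"half_life": 1e12, "area": 1.0, "land_use": 1, "kd_scale": 1.0,
        "log_kow": 4.0, "spm": 0.0, "settling_rate": 0.0, "volume": 1000.0}
nodes = {"A": dict(base, conc=10.0), "B": dict(base, conc=0.0)}
edges = [("A", "B", {"roughness": 0.03, "radius": 1.0, "slope": 0.0009,
                     "cross_section": 2.0})]
nxt = step(nodes, edges, rain=0.0, scour_coeff={1: 0.1}, dt=10.0)
```

In the example, degradation, scour, and settling are effectively switched off, so the step only moves mass from the upstream node A to the downstream node B.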
[0049] Furthermore, the steps in S4 for generating the pollution source contribution heat map and the list of priority control enterprises include:
[0050] S4.1: Construct a set of potential emission source nodes and extract the compound composition characteristics of the source nodes from the perfluorinated compound structure identification result table;
[0051] S4.2: Construct a source contribution backpropagation network based on the spatiotemporal concentration distribution matrix of perfluorinated compounds and the hydrological response unit graph structure;
[0052] S4.3: Perform backpropagation calculation of source contributions to generate source contribution weight matrix;
[0053] S4.4: Construct an enterprise source feature fingerprint database and match and verify it with the inversion results;
[0054] S4.5: Combine the source contribution weight matrix and fingerprint matching results to generate a pollution source contribution heat map and a list of priority control enterprises.
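Steps S4.4 and S4.5 can be sketched as a similarity check followed by a ranking. The text does not state how fingerprints are compared or how the two results are combined, so cosine similarity, the similarity cutoff, and the product score used below are assumptions, and all names and numbers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two composition profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def priority_list(weights, observed_profile, fingerprints, min_similarity=0.8):
    """Rank enterprises by contribution weight times fingerprint similarity."""
    rows = []
    for enterprise, weight in weights.items():
        similarity = cosine(observed_profile, fingerprints[enterprise])
        if similarity >= min_similarity:  # fingerprint verification
            rows.append((enterprise, weight * similarity))
    return sorted(rows, key=lambda row: row[1], reverse=True)


weights = {"E1": 0.5, "E2": 0.3, "E3": 0.2}          # source contribution weights
observed = [0.6, 0.3, 0.1]                            # compound profile at the monitor
fingerprints = {"E1": [0.6, 0.3, 0.1], "E2": [0.0, 0.1, 0.9], "E3": [0.5, 0.4, 0.1]}
ranked = priority_list(weights, observed, fingerprints)
```

Here E2 is dropped despite its sizeable weight because its fingerprint does not resemble the observed profile, which is the purpose of the verification step.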
[0055] Furthermore, the construction steps of the source contribution backpropagation network described in S4.2 include:
[0056] S4.2.1: Reverse the direction of the edges in the hydrological response unit graph structure, so that the edges that originally pointed downstream point upstream, resulting in a reverse hydrological connectivity graph.
[0057] S4.2.2: Define a reverse attention layer on the reverse hydrological connectivity graph. The input is the concentration time-series vector and node attribute vector of each node, and the output is the attention weight distribution of the node over its upstream neighboring nodes.
[0058] S4.2.3: Define the attention weight calculation function. For each pairing of a downstream node with one of its upstream neighboring nodes, calculate the query vector and the key vector, compute and scale their dot product, and normalize the scaled scores with the softmax function to obtain the attention weights.
[0059] S4.2.4: Define the source contribution value propagation function. Starting from the downstream monitoring point node on the reverse hydrological connectivity graph, the source contribution signal is propagated along the reverse edges. At each edge the signal is multiplied by the corresponding attention weight, and the attenuation coefficients along the propagation path are accumulated.
[0060] S4.2.5: The reverse attention layer and the source contribution value propagation function are encapsulated into a source contribution backpropagation network, which takes the concentration time-series vector of the monitoring point node as input and outputs the source contribution weight value of each node in the potential emission source node set.
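The reverse propagation of S4.2.1 to S4.2.5 can be sketched on a toy river network. The query and key vectors are supplied directly here, whereas the described network would derive them from concentration time series and node attributes through learned projections; the function names, the attenuation values, and the assumption that the network is acyclic are illustrative.

```python
import math


def reverse_edges(edges):
    """Flip (upstream, downstream) pairs so that edges point upstream."""
    return [(down, up) for up, down in edges]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention of one node over its upstream neighbors."""
    scale = math.sqrt(len(query))
    return softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])


def propagate(monitor, rev_edges, query_of, key_of, attenuation, sources):
    """Push a unit signal upstream from the monitor and accumulate it at sources."""
    upstream = {}
    for down, up in rev_edges:
        upstream.setdefault(down, []).append(up)
    contribution = {s: 0.0 for s in sources}
    stack = [(monitor, 1.0)]
    while stack:
        node, signal = stack.pop()
        if node in contribution:
            contribution[node] += signal
        ups = upstream.get(node, [])
        if not ups:
            continue
        weights = attention_weights(query_of[node], [key_of[u] for u in ups])
        for u, w in zip(ups, weights):
            # Attention weight times the attenuation of the edge u -> node.
            stack.append((u, signal * w * attenuation[(u, node)]))
    return contribution


edges = [("A", "M"), ("B", "M"), ("C", "B")]       # A and B drain to M; C drains to B
query_of = {"M": [1.0, 0.0], "B": [1.0, 0.0]}
key_of = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
attenuation = {("A", "M"): 1.0, ("B", "M"): 0.8, ("C", "B"): 0.5}
contribution = propagate("M", reverse_edges(edges), query_of, key_of,
                         attenuation, sources=["A", "C"])
```

Source C sits two edges upstream of the monitor, so its contribution is the product of the attention weights and attenuation coefficients along the path M to B to C.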
[0061] This invention also provides an artificial intelligence-based watershed-scale perfluorinated compound tracing system for implementing the aforementioned artificial intelligence-based watershed-scale perfluorinated compound tracing method. The system includes:
[0062] Multi-source data cube construction module: used to integrate high-resolution mass spectrometry data, hydrological and meteorological data, land use data and spatial distribution data of polluting enterprises within the watershed, and align them according to a unified spatiotemporal granularity to construct a multi-source data cube for the watershed;
[0063] Perfluorinated compound structure identification module: Based on mass spectrometry feature data in the multi-source data cube of the watershed, it uses a deep learning model trained with virtual spectrum enhancement to perform end-to-end analysis, outputs a list of candidate molecular structures, and obtains a perfluorinated compound structure identification result table after screening by molecular formula constraint and retention time verification.
[0064] Migration and transport model construction and concentration simulation module: used to construct the hydrological response unit graph structure based on the multi-source data cube of the watershed, use the physicochemical parameters in the perfluorinated compound structure identification result table as node attributes, and construct the migration and transport model by combining hydrodynamic and environmental behavior equations to simulate the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0065] Pollution source contribution inversion and control list generation module: used to calculate, based on the spatiotemporal concentration distribution matrix of perfluorinated compounds, the contribution weight of each potential emission source to the monitoring point by attention mechanism backpropagation, to match and verify the result against the enterprise source feature fingerprint database, and to output a pollution source contribution heat map and a list of priority control enterprises.
[0066] Compared to existing technologies, the advantages of this invention are as follows: By integrating multi-source heterogeneous data to construct a watershed multi-source data cube with unified spatiotemporal granularity, this invention effectively solves the joint analysis problem caused by the different formats and spatiotemporal resolution mismatch of traditional monitoring data. It achieves deep fusion and correlation of multi-dimensional data such as mass spectrometry chemical information and hydrological geographic information, providing a structured and standardized data foundation for subsequent source tracing analysis. Utilizing a deep learning model enhanced by virtual spectrum training and a dual verification mechanism of molecular formula and retention time, it overcomes the limitations of traditional spectral library matching in identifying unknown or novel perfluorinated compounds, accurately inferring the molecular structure of compounds not included in the database, and improving the comprehensiveness and reliability of compound identification. Through a graph neural network integrating hydrodynamics and environmental behavior equations, combined with observation assimilation correction, the pollutant migration simulation conforms to actual hydrological laws and chemical degradation mechanisms, avoiding the physical logic bias of purely data-driven models and significantly improving the accuracy of the spatiotemporal concentration distribution simulation of perfluorinated compounds. By combining attention mechanism backpropagation with source feature fingerprint matching, quantitative decoupling of the contributions of each potential emission source in multi-source superimposed pollution scenarios is achieved. The output pollution source contribution heat map and priority control enterprise list clearly identify pollution hotspots and key control targets. 
In addition, the uncertainty range of each contribution weight is attached, providing environmental management departments with an accurate and quantifiable basis for governance and greatly improving the targeting and efficiency of watershed perfluorinated compound pollution control.
Attached Figure Description
[0067] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0068] Figure 1 is a flowchart of the artificial intelligence-based watershed-scale perfluorinated compound tracing method of this invention;
[0069] Figure 2 is a schematic diagram of the hydrological response unit structure in an embodiment of the present invention;
[0070] Figure 3 is a schematic diagram of the graph neural network structure that integrates physical mechanisms in an embodiment of the present invention;
[0071] Figure 4 is a schematic diagram of the attention mechanism backpropagation tracing in an embodiment of the present invention;
[0072] Figure 5 is a schematic diagram of source feature fingerprint matching in an embodiment of the present invention;
[0073] Figure 6 is a schematic diagram of the pollution source contribution heat map in an embodiment of the present invention;
[0074] Figure 7 is a functional block diagram of the artificial intelligence-based watershed-scale perfluorinated compound tracing system of this invention.
Detailed Implementation
[0075] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0076] Example 1
[0077] As shown in Figure 1, this embodiment provides an artificial intelligence-based method for tracing and tracking perfluorinated compounds at the watershed scale, including:
[0078] S1: Integrate high-resolution mass spectrometry data, hydrological and meteorological data, land use data, and spatial distribution data of polluting enterprises within the basin, align them according to a unified spatiotemporal granularity, and construct a multi-source data cube for the basin.
[0079] This step addresses the problem that multi-source heterogeneous data at the watershed scale have inconsistent formats and mismatched spatiotemporal resolutions, which prevents joint modeling. By collecting and fusing mass spectrometry chemical information and hydrogeographic information, a standardized watershed multi-source data cube is constructed. This watershed multi-source data cube, as the final output of S1, is used in S2 to extract mass spectrometry feature data and in S3 to construct the hydrological response unit graph structure.
[0080] Specifically, the process of constructing the multi-source data cube of the watershed includes:
[0081] S1.1: Collect raw multi-source monitoring data within the watershed and construct a raw multi-source monitoring data set.
[0082] In this step, various monitoring devices deployed within the watershed and external information systems serve as data sources. Raw mass spectrometry acquisition sequences, raw hydrological observation sequences, raw meteorological observation sequences, raw land use imagery data, and raw pollution discharge enterprise registration data are obtained through synchronous acquisition. Based on these acquired data, a raw multi-source monitoring dataset is constructed. The raw mass spectrometry acquisition sequences refer to the discrete mass-to-charge ratio and response intensity data sequences output by high-resolution mass spectrometers configured at each water quality monitoring station in non-targeted screening mode. These sequences are periodically read by the mass spectrometer data acquisition module. The data structure of this sequence is a two-dimensional matrix, with row indices representing scan time numbers, column indices representing mass-to-charge ratio channel numbers, and matrix elements representing the ion response intensity values of the corresponding channels. The raw hydrological observation sequences refer to the velocity, direction, and water level reading sequences output by hydrological stations deployed at the main stream, tributaries, and confluence nodes. These sequences are acquired at set time intervals through hydrological telemetry terminals. The data structure of this sequence is a structured table, with each row containing the station number, observation timestamp, velocity value, direction angle, and water level value. The raw meteorological observation sequence refers to the sequence of precipitation, temperature, humidity, and wind speed readings output by the meteorological station network within the basin. This data is acquired through the automatic meteorological station data transmission system. 
The data structure of this sequence is also a structured table, with each row containing the station number, observation timestamp, and numerical fields for each meteorological element. The raw land use image data refers to multispectral remote sensing satellite imagery covering the basin, downloaded through a remote sensing data distribution platform. This data is in raster image file format, with each pixel storing the spectral reflectance value or land use classification code corresponding to the land surface location. The raw pollutant discharge enterprise registration data refers to information such as the enterprise name, geographic coordinates, main pollutant types, and emission limits registered in the pollutant discharge permit information system of the ecological and environmental protection authorities. This data is obtained through a data interface and is in structured table format, with each row corresponding to the registration information of one pollutant discharge enterprise. Organizing the above-mentioned collected data according to data type and collection source yields a raw multi-source monitoring data set with data type identifiers as the classification key, providing a basic data source for subsequent preprocessing and spatiotemporal alignment.
[0083] S1.2: Based on the original mass spectrometry acquisition sequences in the original multi-source monitoring dataset, the mass spectrometry feature matrix is obtained through a mass spectrometry preprocessing procedure.
[0084] Specifically, the mass spectrometry preprocessing procedure includes:
[0085] S1.2.1: Baseline correction is performed on the original mass spectrometry acquisition sequence. An asymmetric least squares smoothing method is used to fit the background baseline curve at each scan time. The corresponding baseline values are subtracted from the response intensity values in the original mass spectrometry acquisition sequence to obtain the baseline-corrected mass spectrometry sequence. The asymmetric least squares smoothing method obtains a smooth baseline curve by iterative optimization with asymmetric weighting factors: data points whose response intensity is lower than the fitted value receive higher weights, and data points whose response intensity is higher than the fitted value receive lower weights.
[0086] In this embodiment, the objective function $S$ of the asymmetric least squares smoothing method is:
[0087] $$S = \sum_{i} w_i \,(y_i - z_i)^2 + \lambda \sum_{i} \left(\Delta^2 z_i\right)^2$$
[0088] where $y_i$ is the original response intensity value, $z_i$ is the fitted baseline value, $\lambda$ is the smoothing parameter, $\Delta^2 z_i$ is the second difference of the baseline, and $w_i$ is the asymmetric weighting factor. When $y_i > z_i$, $w_i$ takes a smaller value (e.g., 0.001); when $y_i < z_i$, $w_i$ takes a larger value (e.g., 1). The weights $w_i$ are updated iteratively until the baseline $z$ converges.
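By way of a non-limiting illustration, the iterative reweighting described above can be sketched in Python as follows. The sketch assumes a short one-dimensional signal, a dense linear solver, and a fixed number of iterations in place of a convergence test; the function names `solve_dense` and `als_baseline` and all default parameter values are hypothetical and are not taken from the embodiment.

```python
def solve_dense(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def als_baseline(y, lam=100.0, w_above=0.001, w_below=1.0, n_iter=10):
    """Fit a smooth baseline under the signal y by asymmetric least squares."""
    n = len(y)
    # Build D^T D, where D is the second-difference operator with rows (1, -2, 1).
    dtd = [[0.0] * n for _ in range(n)]
    for k in range(n - 2):
        for i, ci in zip((k, k + 1, k + 2), (1.0, -2.0, 1.0)):
            for j, cj in zip((k, k + 1, k + 2), (1.0, -2.0, 1.0)):
                dtd[i][j] += ci * cj
    w = [1.0] * n
    z = list(y)
    for _ in range(n_iter):  # assumption: fixed iteration count instead of a convergence test
        # Normal equations of the weighted, penalised problem: (W + lam * D^T D) z = W y
        a = [[lam * dtd[i][j] + (w[i] if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        z = solve_dense(a, [w[i] * y[i] for i in range(n)])
        # Points above the fitted baseline (likely peaks) receive the small weight.
        w = [w_above if y[i] > z[i] else w_below for i in range(n)]
    return z


signal = [1.0] * 21
signal[10] = 11.0  # a single sharp peak on a flat background
baseline = als_baseline(signal)
corrected = [s - b for s, b in zip(signal, baseline)]
```

In this toy case the fitted baseline stays close to the flat background, so the peak survives the subtraction almost unchanged. For full-length scans a sparse or banded solver would be the natural substitute for the dense solver used here.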
[0089] S1.2.2: Noise filtering is performed on the baseline-corrected mass spectrometry sequence. Wavelet transform is used to decompose the response intensity signals at each scan time into multiple scales. Coefficients with amplitudes below a set noise threshold in the high-frequency components are set to zero and then reconstructed using inverse transform to obtain the denoised mass spectrometry sequence. The noise threshold is determined by statistically analyzing the standard deviation of the response intensity values in the baseline-corrected mass spectrometry sequence and multiplying it by a preset multiplier coefficient. This multiplier coefficient is determined during the laboratory calibration phase based on the instrument's noise characteristics.
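By way of a non-limiting illustration, the decompose, threshold, and reconstruct sequence of S1.2.2 can be sketched as follows. The sketch assumes the Haar wavelet and hard thresholding of the detail (high-frequency) coefficients; the embodiment does not name a wavelet family, and the function names and the use of the population standard deviation are hypothetical choices.

```python
import math
import statistics

SQRT_HALF = 1.0 / math.sqrt(2.0)


def haar_step(x):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    approx = [(x[i] + x[i + 1]) * SQRT_HALF for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * SQRT_HALF for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level, interleaving the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) * SQRT_HALF, (a - d) * SQRT_HALF))
    return out


def wavelet_denoise(signal, levels, threshold):
    """Zero detail coefficients whose magnitude is below threshold, then reconstruct."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append([0.0 if abs(c) < threshold else c for c in detail])
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


def noise_threshold(signal, multiplier):
    """Threshold = calibration multiplier x standard deviation of the intensities."""
    return multiplier * statistics.pstdev(signal)
```

With a threshold of zero the signal is reconstructed unchanged, while a very large threshold reduces each block of `2**levels` samples to its mean, which brackets the behaviour expected for intermediate thresholds.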
[0090] S1.2.3: Peak extraction is performed on the denoised mass spectrometry sequence. A continuous wavelet transform ridge tracing method is used to identify the chromatographic peak positions and boundaries in the signal at each scan time. Peak mass-to-charge ratio, peak area, and peak width parameters are extracted to construct an original peak list. The original peak list is a structured table, with each row corresponding to a detected peak. Fields include scan time number, peak mass-to-charge ratio, peak area, peak initiation mass-to-charge ratio, peak termination mass-to-charge ratio, and peak width.
[0091] S1.2.4: Perform feature alignment on the original peak list. Using the mass-to-charge ratio deviation tolerance and the retention time deviation tolerance as constraints, a hierarchical clustering method is used to group peaks that correspond to the same compound but were detected at different scan times into a single feature, generating an aligned feature list. The mass-to-charge ratio deviation tolerance is determined based on the mass accuracy specification of the mass spectrometer, and the retention time deviation tolerance is determined based on the retention time drift range of the chromatographic column.
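By way of a non-limiting illustration, one simple form of tolerance-constrained hierarchical clustering is single-linkage grouping cut at the tolerance: two peaks are linked when both deviations are within tolerance, and each connected group becomes one feature. The sketch below assumes this single-linkage variant (the embodiment does not specify the linkage), represents each peak as a hypothetical `(mz, retention_time)` pair, and uses illustrative tolerance values.

```python
def align_peaks(peaks, mz_tol, rt_tol):
    """Group peak indices into features by single-linkage clustering within tolerances."""
    parent = list(range(len(peaks)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            close_mz = abs(peaks[i][0] - peaks[j][0]) <= mz_tol
            close_rt = abs(peaks[i][1] - peaks[j][1]) <= rt_tol
            if close_mz and close_rt:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(peaks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical (m/z, retention time in minutes) peaks from different scans.
peaks = [(412.966, 5.01), (412.967, 5.03), (498.930, 7.20)]
features = align_peaks(peaks, mz_tol=0.005, rt_tol=0.1)
```

Here the first two peaks fall within both tolerances and are merged into one feature, while the third remains a separate feature.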
[0092] S1.2.5: Based on the aligned feature list, a mass spectrometry feature matrix is constructed using the feature number as the row index, the sampling point number as the column index, and the peak area normalization value as the matrix element. The peak area normalization value is obtained by dividing the peak area of each sampling point by the sum of the areas of all feature peaks at that sampling point, and is used to eliminate the influence of the difference in total ion current intensity between different sampling points.
[0093] In this embodiment, the peak area normalized value $\tilde{A}_{ij}$ is calculated as:
[0094] $$\tilde{A}_{ij} = \frac{A_{ij}}{\sum_{k=1}^{N} A_{kj}}$$
[0095] where $A_{ij}$ denotes the original peak area of the $i$-th feature at the $j$-th sampling point, and $N$ denotes the total number of features detected at that sampling point.
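By way of a non-limiting illustration, the normalization of S1.2.5 divides every entry of the feature matrix by the sum of its column (one column per sampling point). The sketch below assumes the matrix is held as a list of rows; the function name is hypothetical, and returning zero for a sampling point with no detected signal is an added assumption.

```python
def normalize_peak_areas(areas):
    """areas[i][j] is the raw peak area of feature i at sampling point j."""
    n_features, n_points = len(areas), len(areas[0])
    totals = [sum(areas[i][j] for i in range(n_features)) for j in range(n_points)]
    return [
        [areas[i][j] / totals[j] if totals[j] > 0 else 0.0 for j in range(n_points)]
        for i in range(n_features)
    ]


raw = [[2.0, 1.0],
       [6.0, 3.0]]
normalized = normalize_peak_areas(raw)
```

Although the second sampling point has half the total signal of the first, both columns normalize to the same relative composition, which is the effect the normalization is intended to achieve.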
[0096] The mass spectrometry feature matrix is used as input to S2 for molecular structure analysis in deep learning models.
[0097] S1.3: Based on the original hydrological and meteorological observation sequences in the original multi-source monitoring data set, a unified time-granularity hydro-meteorological matrix is obtained through a time-series resampling process.
[0098] Specifically, the time-series resampling process includes:
[0099] S1.3.1: Determine a unified time axis. Take the start time of the watershed monitoring cycle as the time origin and divide it evenly according to hourly time intervals to obtain a unified set of time points. This unified set of time points serves as the time reference benchmark for aligning all time series data.
[0100] S1.3.2: For each observation station in the original hydrological observation sequence, using the observation timestamp as the index key, a linear interpolation method is employed to map the flow velocity, flow direction, and water level values to various time points on a unified time axis, resulting in a hydrological interpolation sequence. When the interval between adjacent original observation timestamps exceeds the set maximum interpolation span, the interpolation results for the corresponding time period are marked as missing values.
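By way of a non-limiting illustration, the gap-aware linear interpolation of S1.3.2 can be sketched as follows. Times are assumed to be expressed in hours from the time origin, missing values are represented by `None`, and time points outside the observed range are also treated as missing; these representation choices and the function name are hypothetical.

```python
from bisect import bisect_right


def resample_linear(obs_times, obs_values, axis, max_gap):
    """Linearly interpolate observations onto axis; None marks a missing value."""
    result = []
    for t in axis:
        k = bisect_right(obs_times, t)  # obs_times[k - 1] <= t < obs_times[k]
        if k == 0:
            result.append(None)                 # before the first observation
        elif obs_times[k - 1] == t:
            result.append(obs_values[k - 1])    # exact hit, no interpolation needed
        elif k == len(obs_times):
            result.append(None)                 # after the last observation
        else:
            t0, t1 = obs_times[k - 1], obs_times[k]
            if t1 - t0 > max_gap:
                result.append(None)             # gap exceeds the maximum span
            else:
                frac = (t - t0) / (t1 - t0)
                result.append(obs_values[k - 1] + frac * (obs_values[k] - obs_values[k - 1]))
    return result


water_level = resample_linear([0, 2, 10], [0.0, 4.0, 8.0], axis=[0, 1, 2, 3, 10], max_gap=3)
```

In this example the value at hour 3 is marked missing because it lies inside an eight-hour gap between observations, which exceeds the three-hour maximum span.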
[0101] S1.3.3: For each observation station in the original meteorological observation sequence, using the observation timestamp as the index key, the inverse distance weighted spatial interpolation method is used to map the precipitation, temperature, humidity, and wind speed values to various time points on a unified time axis, resulting in a meteorological interpolation sequence. The inverse distance weighted spatial interpolation method refers to a method that uses the reciprocal of the distance between the target spatiotemporal location and surrounding observation stations as the weight to perform a weighted average of the observed values from each observation station.
[0102] In this embodiment, the calculation formula for the inverse distance weighted spatial interpolation method is as follows:
[0103] $$\hat{Z}(x_0) = \frac{\sum_{i=1}^{n} Z_i / d_i^{\,p}}{\sum_{i=1}^{n} 1 / d_i^{\,p}}$$
[0104] where $\hat{Z}(x_0)$ is the interpolation result (e.g., precipitation) at the target location $x_0$, $Z_i$ is the measured value at the $i$-th observation station, $d_i$ is the Euclidean distance between the target location and the $i$-th station, $p$ is the power exponent (usually taken as 2), and $n$ is the number of neighboring stations participating in the interpolation.
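By way of a non-limiting illustration, inverse distance weighting can be sketched as follows. The sketch assumes planar station coordinates and returns the station value directly when the target coincides with a station, a case the embodiment does not address; the function name and tuple layout are hypothetical.

```python
import math


def idw_interpolate(target, stations, power=2.0):
    """stations is a list of (x, y, value); target is (x, y)."""
    numerator = denominator = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # assumption: a co-located station is used as-is
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Two stations at distances 1 and 2 from the target, with weights 1 and 0.25.
precipitation = idw_interpolate((0.0, 0.0), [(1.0, 0.0, 10.0), (2.0, 0.0, 20.0)])
```

The nearer station dominates the result: (1 × 10 + 0.25 × 20) / 1.25 = 12.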
[0105] S1.3.4: The hydrological and meteorological interpolation sequences are horizontally concatenated according to station numbers and a unified time axis position to generate a unified time-granularity hydro-meteorological matrix with time as the row index and station and variable combination as the column index. Each element of this matrix corresponds to the value of a specific time, a specific station, and a specific variable.
[0106] S1.4: Based on the original land use image data and original polluting enterprise registration data in the original multi-source monitoring data set, a unified spatial granularity geographic feature matrix is obtained through a spatial rasterization process.
[0107] Specifically, the spatial rasterization process includes:
[0108] S1.4.1: Determine a unified spatial grid, using the watershed boundary vector data as the range constraint, and divide the watershed area into a regular grid array with a grid side length of fifty meters. Each grid cell is identified by the latitude and longitude coordinates of its center point.
[0109] S1.4.2: Perform a resampling operation on the original land use image data to adjust the original pixel resolution to be consistent with the unified spatial grid. Use the mode resampling method to determine the land use type code for each grid cell. The mode resampling method refers to statistically analyzing the land use types of all original pixels falling within the target grid range and taking the type that appears most frequently as the land use type code for the target grid.
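By way of a non-limiting illustration, mode resampling can be sketched as follows for the simple case in which each target grid cell covers an integer block of `factor` by `factor` source pixels. That alignment, the function name, and the tie-breaking rule (the first class encountered wins) are assumptions of the sketch.

```python
from collections import Counter


def mode_resample(raster, factor):
    """Coarsen a 2-D grid of class codes by taking the most frequent code per block."""
    rows, cols = len(raster), len(raster[0])
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [
                raster[i][j]
                for i in range(r, min(r + factor, rows))
                for j in range(c, min(c + factor, cols))
            ]
            coarse_row.append(Counter(block).most_common(1)[0][0])
        coarse.append(coarse_row)
    return coarse


land_use = [[1, 1, 2, 3],
            [1, 2, 3, 3]]
coarse_land_use = mode_resample(land_use, factor=2)
```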
[0110] S1.4.3: For each enterprise record in the original pollution discharge enterprise registration data, determine the unified spatial grid unit to which it belongs based on its geographical coordinates, set the enterprise existence identifier of the grid unit to be valid, and accumulate the enterprise count and emission limit sum of the grid unit.
[0111] S1.4.4: Organize the land use type code, enterprise presence identifier, enterprise number count, and emission limit sum of each grid unit according to the grid number to generate a unified spatial granularity geographic feature matrix with the grid number as the row index and the geographic feature variables as the column index.
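By way of a non-limiting illustration, the per-grid accumulation of S1.4.3 can be sketched as follows. The sketch assumes enterprise coordinates have already been projected to metres relative to the grid origin (the registration data stores geographic coordinates, so a projection step is implied), and the record field names `x`, `y`, and `limit` are hypothetical.

```python
def aggregate_enterprises(enterprises, x_min, y_min, cell_size=50.0):
    """Accumulate enterprise presence, count, and emission-limit sum per grid cell."""
    cells = {}
    for record in enterprises:
        key = (
            int((record["x"] - x_min) // cell_size),
            int((record["y"] - y_min) // cell_size),
        )
        cell = cells.setdefault(key, {"present": False, "count": 0, "limit_sum": 0.0})
        cell["present"] = True
        cell["count"] += 1
        cell["limit_sum"] += record["limit"]
    return cells


registry = [
    {"x": 10.0, "y": 10.0, "limit": 5.0},
    {"x": 40.0, "y": 20.0, "limit": 7.0},
    {"x": 60.0, "y": 10.0, "limit": 1.0},
]
grid_features = aggregate_enterprises(registry, x_min=0.0, y_min=0.0)
```

The first two enterprises fall into the same fifty-metre cell, so that cell records a count of two and a combined emission limit, while the third enterprise occupies the neighbouring cell.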
[0112] S1.5: Based on the mass spectrometry feature matrix, the unified temporal granularity hydro-meteorological matrix, and the unified spatial granularity geographic feature matrix, a watershed multi-source data cube is constructed through a spatiotemporal fusion process.
[0113] Specifically, the spatiotemporal fusion process includes:
[0114] S1.5.1: Establish a spatial mapping table for sampling points, match the geographic coordinates of each sampling point in the mass spectrometry feature matrix with a unified spatial grid, record the grid number to which each sampling point belongs, and form a spatial mapping table for sampling points.
[0115] S1.5.2: Establish a sampling point time mapping table, match the sampling timestamps of each sampling point in the mass spectrometry feature matrix with a unified time axis, record the time point index corresponding to each sampling point, and form a sampling point time mapping table.
[0116] S1.5.3: A three-dimensional data cube framework is constructed using the time point index of a unified time axis as the first dimension, the longitude index of a unified spatial grid as the second dimension, and the latitude index of a unified spatial grid as the third dimension. For each voxel unit in the framework, sampling points falling within the spatiotemporal range of the voxel are retrieved according to the sampling point spatial mapping table and the sampling point temporal mapping table. The corresponding mass spectrometry feature vectors are extracted and filled into the chemical feature attribute slots of the voxel. The hydrological and meteorological parameters of the corresponding time point and neighboring stations are extracted from the unified temporal granularity hydro-meteorological matrix and filled into the hydrological and meteorological attribute slots of the voxel. The land use codes and enterprise distribution information of the corresponding grid units are extracted from the unified spatial granularity geographic feature matrix and filled into the geographic feature attribute slots of the voxel.
[0117] S1.5.4: For voxel units with missing sampling point observations, their chemical feature attribute slots are marked as pending inference, while the hydrological and meteorological attribute slots and geographic feature attribute slots are fully filled, resulting in the final watershed multi-source data cube. The data structure of the watershed multi-source data cube is a three-dimensional tensor. The shape of the tensor is determined by the lengths of the time dimension, longitude dimension, and latitude dimension. Each tensor element is a composite attribute record containing chemical feature vectors, hydrological parameter vectors, meteorological index vectors, and land use codes.
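By way of a non-limiting illustration, the voxel record and the "pending inference" marking can be sketched as follows. The sketch assumes a nested-list cube indexed as `cube[t][lon][lat]`, uses `None` in the chemical slot to mean pending inference, and simplifies the hydro-meteorological slot to one parameter list per time point; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voxel:
    hydro_meteo: list                 # hydrological and meteorological parameters
    geo: dict                         # land use code and enterprise information
    chemical: Optional[list] = None   # None marks "pending inference"


def build_cube(n_t, n_lon, n_lat, samples, hydro_meteo, geo):
    """samples maps (t, lon, lat) index tuples to mass spectrometry feature vectors."""
    cube = [
        [[Voxel(hydro_meteo[t], geo[(i, j)]) for j in range(n_lat)] for i in range(n_lon)]
        for t in range(n_t)
    ]
    for (t, i, j), vector in samples.items():
        cube[t][i][j].chemical = vector
    return cube


geo = {(0, 0): {"land_use": 1}, (0, 1): {"land_use": 3}}
cube = build_cube(
    n_t=2, n_lon=1, n_lat=2,
    samples={(1, 0, 1): [0.25, 0.75]},
    hydro_meteo=[[0.4, 12.0], [0.5, 11.5]],
    geo=geo,
)
pending = sum(v.chemical is None for plane in cube for row in plane for v in row)
```

Only one of the four voxels contains a sampling point, so the remaining three keep their hydro-meteorological and geographic slots filled while their chemical slot awaits inference.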
[0118] Specifically, this step addresses the problem that traditional monitoring data cannot be jointly analyzed because of differences in collection frequency, uneven spatial distribution, and varying data formats, by constructing a watershed multi-source data cube with unified spatiotemporal granularity. In the watershed-scale perfluorinated compound tracing scenario, mass spectrometry data reflects the presence and concentration of chemical substances, hydrological data determines the migration paths of pollutants, meteorological data influences rainfall erosion and atmospheric deposition processes, and land use data and enterprise distribution data indicate the location of potential pollution sources. Causal relationships among these heterogeneous multi-source data can only be established within a unified spatiotemporal coordinate system. The watershed multi-source data cube expands discrete monitoring point data into a continuous spatiotemporal field representation, providing a structured input interface for the deep learning model in subsequent steps. This avoids the complexity of designing a separate preprocessing module for each model because of inconsistent input formats, and also provides the spatial topological information needed to construct the hydrological response unit graph in S3.
[0119] S2: Based on the mass spectrometry feature data in the multi-source data cube of the watershed, the deep learning model trained by virtual spectrum enhancement is used for end-to-end analysis, outputting a list of candidate molecular structures. After screening by molecular formula constraint and retention time verification, the structure identification result table of perfluorinated compounds is obtained.
[0120] This step addresses the problem that traditional mass spectrometry database matching methods cannot identify unknown structures or novel perfluorinated compounds. It trains a deep learning model using a virtual spectrum enhancement strategy to achieve high-precision structure inference for compounds not included in the database. The perfluorinated compound structure identification result table, as the final output of S2, will be used in S3 to extract physicochemical parameters as graph node attributes, and in S4 to construct a source feature fingerprint database.
[0121] Specifically, the process for identifying the structure of the perfluorinated compound includes:
[0122] S2.1: Extract the mass spectrometry feature vectors corresponding to each sampling point from the multi-source data cube of the watershed, and construct the mass spectrometry feature set to be analyzed.
[0123] In this step, the chemical feature attribute slots in the multi-source data cube of the watershed are traversed to obtain valid voxel cells. The chemical feature vector for each valid voxel is extracted, and this vector contains the normalized peak area values corresponding to each feature in the aligned feature list. All extracted chemical feature vectors are organized according to the voxel's spatiotemporal coordinate index to generate a mass spectrometry feature set to be analyzed. The data structure of the mass spectrometry feature set to be analyzed is a key-value pair mapping, where the key is a voxel spatiotemporal coordinate tuple, and the value is the corresponding mass spectrometry feature vector.
[0124] S2.2: Based on virtual spectrum generation and fragment masking enhancement strategies, a pre-trained molecular structure generation model is constructed.
[0125] Specifically, the construction process of the pre-trained molecular structure generation model includes:
[0126] S2.2.1: Retrieve molecular structure representations and experimental mass spectra of known perfluorinated compounds from chemical databases to construct a basic training sample set. The molecular structure representations are encoded in the Simplified Molecular Input Line Entry System (SMILES) format, which expresses the atomic connectivity and chemical bond types of a molecule as a string. The data structure of the basic training sample set is a structured table, with each row containing a molecular structure representation string and the corresponding list of mass spectrum peaks.
[0127] S2.2.2: Based on quantum chemical calculation methods, fragmentation simulations are performed on the theoretical molecular structures of perfluorinated compounds to predict the mass-to-charge ratio and relative abundance of each fragment ion, generating a virtual mass spectrum set. The fragmentation simulation calculates the bond dissociation energy of each chemical bond in the molecule, simulating the breakage process in ascending order of bond dissociation energy, and accumulating the fragment ions generated along each breakage path and their abundance weights. The virtual mass spectrum set contains simulated mass spectra of over 100,000 theoretical perfluorinated compounds, greatly expanding the scale of the training data.
[0128] S2.2.3: Perform fragment masking enhancement on the basic training sample set and the virtual mass spectrum set. A certain proportion of fragment peaks in each mass spectrum are randomly selected for masking. The response intensity of the masked peaks is set to zero, while the original molecular structure representation is retained as a label. This fragment masking enhancement operation generates an enhanced training sample set, enabling the model to learn the ability to infer complete molecular structures from incomplete mass spectrometry information.
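By way of a non-limiting illustration, fragment masking can be sketched as follows. Each spectrum is assumed to be a list of `(mz, intensity)` pairs, the masking proportion of 0.3 is illustrative only, and the SMILES label shown is merely an example of a structure string; the function name is hypothetical.

```python
import random


def mask_fragments(peaks, mask_ratio, rng):
    """Zero the intensity of a random subset of peaks, keeping every m/z value."""
    n_masked = round(len(peaks) * mask_ratio)
    masked = set(rng.sample(range(len(peaks)), n_masked))
    return [
        (mz, 0.0 if index in masked else intensity)
        for index, (mz, intensity) in enumerate(peaks)
    ]


spectrum = [(50.0 + 10 * k, 1.0 + k) for k in range(10)]
training_sample = {
    "smiles": "OC(=O)C(F)(F)C(F)(F)F",  # label is kept unchanged as the target
    "peaks": mask_fragments(spectrum, mask_ratio=0.3, rng=random.Random(0)),
}
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing training runs.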
[0129] S2.2.4: Construct a deep learning model with an encoder-decoder architecture as the initial model. The encoder uses a multi-head self-attention mechanism to encode the features of the input mass spectrum peak sequence, converting the variable-length mass spectrum peak sequence into a fixed-dimensional hidden layer vector representation. The decoder uses an autoregressive generation mechanism, conditioned on the hidden layer vector representation, to predict each symbol of the molecular structure representation string character by character.
[0130] S2.2.5: Using the mass spectrometry peak sequences in the enhanced training sample set as input and the corresponding molecular structure representation strings as output, the cross-entropy loss function is used to measure the difference between the predicted characters and the true characters, and an adaptive learning rate optimization algorithm is used to iteratively update the model parameters. When the loss value on the validation set no longer decreases for several consecutive rounds, training is stopped and the model parameters are saved, resulting in a pre-trained molecular structure generation model.
[0131] In this embodiment, the cross-entropy loss function $L_{CE}$ is defined as:
[0132] $$L_{CE} = -\sum_{t=1}^{T} \log P\left(y_t \mid y_{<t},\, h\right)$$
[0133] where $T$ is the length of the molecular structure representation string, $y_t$ is the $t$-th true character, $h$ is the mass spectrum hidden layer vector output by the encoder, and $P(y_t \mid y_{<t}, h)$ is the probability with which the model predicts the $t$-th character to be the true character.
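By way of a non-limiting illustration, the character-level cross-entropy of one training string can be computed as follows. The sketch assumes the loss is the un-normalized sum of negative log-probabilities over the characters of the string, and that the decoder output at each step is available as a dictionary from character to probability; both the function name and the toy probabilities are hypothetical.

```python
import math


def sequence_cross_entropy(step_probs, target):
    """Sum of -log P(true character) over the positions of the target string."""
    return -sum(math.log(probs[char]) for probs, char in zip(step_probs, target))


# Toy decoder outputs for a two-character target string "CF".
step_probs = [{"C": 0.5, "F": 0.5}, {"C": 0.25, "F": 0.75}]
loss = sequence_cross_entropy(step_probs, "CF")
```

The loss here equals -(ln 0.5 + ln 0.75), and it falls towards zero as the model assigns probability closer to one to every true character.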
[0134] S2.3: Based on the pre-trained molecular structure generation model, perform end-to-end analysis on each mass spectrometry feature vector in the mass spectrometry feature set to be analyzed, and generate a preliminary list of candidate molecular structures.
[0135] Specifically, the end-to-end resolution process includes:
[0136] S2.3.1: Convert each mass spectrometry feature vector in the mass spectrometry feature set to be analyzed into a mass spectrometry peak sequence format, sort them from low to high mass-to-charge ratio, and each peak record contains two fields: mass-to-charge ratio and normalized intensity.
[0137] S2.3.2: Input the converted mass spectrum peak sequence into the encoder of the pre-trained molecular structure generation model to obtain the hidden layer vector representation.
[0138] S2.3.3: Based on the hidden layer vector representation, the autoregressive generation process of the decoder is initiated. A beam search strategy is employed to retain several candidate sequences with the highest confidence at each generation step until the end symbol is generated or the maximum sequence length limit is reached. The beam search strategy controls the number of candidates retained at each step by setting a beam width parameter, which is determined based on computational resource limitations.
[0139] S2.3.4: Calculate a confidence score for each candidate molecular structure representation string output by the decoder. This score is the geometric mean of the predicted probabilities of each step. Organize all candidate molecular structure representations and their confidence scores corresponding to the same mass spectrometry feature vector into a candidate list, sorted in descending order of confidence score, to generate a preliminary candidate molecular structure list.
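By way of a non-limiting illustration, beam search with geometric-mean confidence scoring can be sketched as follows. The trained decoder is replaced by a hypothetical callable `next_probs(tokens)` that returns a dictionary of next-token probabilities, and `"$"` is an assumed end symbol; neither is specified by the embodiment.

```python
import math


def beam_search(next_probs, beam_width, max_len, end_token="$"):
    """Return (string, confidence) pairs sorted by descending confidence."""
    beams = [((), 0.0)]  # (token tuple, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for token, p in next_probs(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:  # keep only the best hypotheses
            (finished if tokens[-1] == end_token else beams).append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by the maximum length
    # Geometric mean of the step probabilities = exp(mean log-probability).
    scored = [("".join(t), math.exp(lp / len(t))) for t, lp in finished]
    return sorted(scored, key=lambda c: c[1], reverse=True)


def toy_decoder(tokens):
    """Hypothetical stand-in for the trained decoder."""
    if len(tokens) == 0:
        return {"C": 0.9, "F": 0.1}
    if len(tokens) == 1:
        return {"F": 0.6, "$": 0.4}
    return {"$": 1.0}


ranked = beam_search(toy_decoder, beam_width=2, max_len=3)
```

Because the confidence is a geometric mean rather than a product, longer candidates are not penalized merely for having more generation steps.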
[0140] S2.4: Perform molecular formula constraint verification on the preliminary candidate molecular structure list, screen out candidates whose elemental composition does not conform to the characteristics of perfluorinated compounds, and obtain the candidate list after molecular formula verification.
[0141] Specifically, the molecular formula constraint verification process includes:
[0142] S2.4.1: For each candidate molecular structure representation string in the preliminary candidate molecular structure list, call the cheminformatics toolkit to parse its atomic composition, count the number of carbon atoms, fluorine atoms, oxygen atoms, hydrogen atoms and other heteroatoms to obtain the candidate molecular formula.
[0143] S2.4.2: Define a set of molecular formula constraint rules based on the chemical characteristics of perfluorinated compounds. The set of molecular formula constraint rules includes: the ratio of fluorine atoms to carbon atoms must not be lower than a set minimum fluorine-to-carbon ratio threshold; the ratio of hydrogen atoms to carbon atoms must not be higher than a set maximum hydrogen-to-carbon ratio threshold; and the molecular formula must contain at least one fluorine atom. The minimum fluorine-to-carbon ratio threshold and the maximum hydrogen-to-carbon ratio threshold are determined based on the known statistical distribution of molecular formulas of perfluorinated and polyfluorinated compounds.
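[0144] Preferably, the constraint rules can be expressed as the following set of inequalities:
[0145] $$\frac{N_F}{N_C} \ge R_{F/C}^{\min}, \qquad \frac{N_H}{N_C} \le R_{H/C}^{\max}, \qquad N_F \ge 1$$
[0146] where $N_F$, $N_C$, and $N_H$ denote the numbers of fluorine, carbon, and hydrogen atoms, respectively; $R_{F/C}^{\min}$ is the minimum fluorine-to-carbon ratio threshold (e.g., 0.5); and $R_{H/C}^{\max}$ is the maximum hydrogen-to-carbon ratio threshold (e.g., 1.5).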
[0147] S2.4.3: For each candidate molecular formula, examine each rule in the molecular formula constraint rule set. If all rules pass, retain the candidate; otherwise, mark the candidate as removed. Organize the retained candidates into a candidate list after molecular formula verification.
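By way of a non-limiting illustration, the rule check of S2.4.3 can be sketched as follows. The sketch assumes the atom counts have already been obtained from the cheminformatics toolkit and are supplied as a dictionary; rejecting candidates that contain no carbon atoms is an added assumption needed to avoid division by zero, and the function name is hypothetical.

```python
def passes_formula_rules(counts, min_f_to_c=0.5, max_h_to_c=1.5):
    """Return True if the atom counts satisfy all three molecular formula rules."""
    carbon = counts.get("C", 0)
    fluorine = counts.get("F", 0)
    hydrogen = counts.get("H", 0)
    if carbon == 0 or fluorine < 1:
        return False  # assumption: carbon-free candidates are rejected
    return fluorine / carbon >= min_f_to_c and hydrogen / carbon <= max_h_to_c


candidates = {
    "perfluorooctanoic acid": {"C": 8, "H": 1, "F": 15, "O": 2},
    "octanoic acid": {"C": 8, "H": 16, "O": 2},
}
retained = [name for name, counts in candidates.items() if passes_formula_rules(counts)]
```

Perfluorooctanoic acid (C8HF15O2) passes all three rules, whereas its non-fluorinated analogue is removed because it contains no fluorine.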
[0148] S2.5: Perform retention time verification on the candidate list after molecular formula verification, and filter out candidates whose theoretical retention time deviates from the measured retention time by more than the limit, to obtain the candidate list after retention time verification.
[0149] Specifically, the retention time verification process includes:
[0150] S2.5.1: Construct a retention time prediction module. This module takes a molecular structure representation string as input and outputs a theoretical retention time prediction value through molecular fingerprint calculation and multilayer perceptron mapping. The molecular fingerprint is a method of encoding the molecular structure into a fixed-length binary vector, where each bit in the vector represents the presence or absence of a specific substructure fragment. The parameters of the retention time prediction module are obtained by fitting experimental retention time data of known compounds.
[0151] S2.5.2: For each candidate molecular structure representation in the candidate list after molecular formula verification, call the retention time prediction module to calculate the theoretical retention time prediction value.
[0152] S2.5.3: Extract the measured retention time value of the corresponding mass spectrometry feature vector from the metadata record of the multi-source data cube of the watershed, and calculate the absolute deviation between the theoretical retention time prediction value and the measured retention time value.
[0153] S2.5.4: Compare the absolute deviation with the set retention time deviation tolerance. If the absolute deviation is less than or equal to the retention time deviation tolerance, retain the candidate; otherwise, mark the candidate as rejected. The retention time deviation tolerance is determined based on the reproducibility index of the chromatographic system. Organize the retained candidates into a retention time verification candidate list.
[0154] S2.6: Based on the candidate list after retention time verification, sort by confidence level and supplement spatiotemporal information to generate a perfluorinated compound structure identification result table.
[0155] In this step, the candidate options in the retention time verification candidate list are sorted in descending order of confidence score. A unique compound number is assigned to each candidate option, and the corresponding voxel spatiotemporal coordinates are extracted from the key-value pair mapping of the mass spectrometry feature set to be analyzed, converted into the latitude and longitude of the detection location and the timestamp of the detection time. A perfluorinated compound structure identification result table is constructed using the compound number, molecular structure representation string, candidate molecular formula, confidence score, detection location longitude, detection location latitude, and detection timestamp as fields. The data structure of the perfluorinated compound structure identification result table is a structured table, with each row corresponding to one detected perfluorinated compound candidate option.
[0156] Specifically, this step addresses the problem of insufficient generalization ability of deep learning models due to the scarcity of experimental mass spectrometry data by employing a virtual spectrum enhancement strategy and a fragmentation masking pre-training method. In the perfluorinated compound tracing scenario, novel perfluorinated compounds are constantly being synthesized and emitted, while the updates to chemical databases are lagging, making it difficult for traditional spectral library matching methods to identify unlisted compounds. The virtual spectrum generation strategy utilizes quantum chemical calculations to simulate the fragmentation process, covering the theoretically possible structural space of perfluorinated compounds and overcoming the limitations of experimental data scale. The fragmentation masking enhancement strategy forces the model to learn the ability to infer the overall structure from local information, enhancing the model's robustness to noise and missing peaks. Molecular formula constraint verification and retention time verification form a dual filtering mechanism; the former eliminates impossible candidate structures from a chemical composition perspective, while the latter eliminates candidate structures inconsistent with experimental observations from a chromatographic behavior perspective, jointly ensuring the reliability of the structure identification results.
[0157] S3: Construct a hydrological response unit graph structure based on the multi-source data cube of the watershed, use the physicochemical parameters in the perfluorinated compound structure identification result table as node attributes, and construct a migration and transport model by combining hydrodynamic and environmental behavior equations to simulate the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0158] This step addresses the problem of migration paths not conforming to hydrological patterns due to the lack of physical constraints in purely data-driven models. It improves the model's interpretability and spatiotemporal extrapolation capabilities by embedding hydrodynamic equations and environmental chemical degradation mechanisms into a graph neural network architecture. The spatiotemporal concentration distribution matrix of perfluorinated compounds, as the final output of S3, will be used in S4 to back-calculate the contribution weights of each potential emission source.
[0159] Specifically, the process of constructing the migration and transport model and simulating the concentration distribution includes:
[0160] S3.1: Construct a hydrological response unit graph structure based on hydrological data in the multi-source data cube of the watershed.
[0161] Specifically, the construction process of the hydrological response unit graph structure includes:
[0162] S3.1.1: Extract hydrological attribute information of a unified spatial grid from the multi-source data cube of the watershed, including the flow direction angle and runoff accumulation of each grid cell. Determine the water flow destination of each grid cell to adjacent grid cells based on the flow direction angle, and divide the watershed into several sub-watersheds based on the runoff accumulation.
[0163] S3.1.2: Within each sub-basin, grid cells with accumulated runoff exceeding a set runoff threshold are designated as channel grids, and grid cells with accumulated runoff below the runoff threshold are designated as slope grids. Adjacent grids of the same type are merged into hydrological response units, each of which serves as a node in the graph structure, and each node is assigned a unique node number. The runoff threshold is determined based on the basin area and river network density.
[0164] S3.1.3: Determine the edge connections between nodes based on the direction of water flow. If the outflow direction of one hydrological response unit points to another hydrological response unit, a directed edge is established between the two corresponding nodes, with the edge direction being the direction of water flow. For edges between channel nodes, edge attributes include channel length, channel slope, and channel roughness coefficient. For edges between slope nodes and channel nodes, edge attributes include slope length, slope, and surface cover coefficient.
[0165] S3.1.4: Organize all nodes and edges into a hydrological response unit graph structure. This structure uses an adjacency list format to record the incoming and outgoing edge lists of each node, as well as the attribute vector of each edge. A schematic diagram of the hydrological response unit graph structure is shown in Figure 2. As shown in the figure, the watershed is divided into several hydrological response units, where slope units are represented by circular nodes and channel units by square nodes. Slope units are located on the hillsides on both sides of the watershed, while channel units are distributed along the main river and its tributaries. Directed edges in the figure represent the direction of water flow. Slope units are connected to adjacent channel units via confluence edges, and channel units are connected sequentially in the downstream direction to form a river network topology. Each edge carries attribute information: the attributes of slope confluence edges include slope length and land cover coefficient, while the attributes of channel connecting edges include channel length and channel slope. The watershed boundary is marked with a dashed line, defining the spatial extent of the hydrological response unit graph. This graph structure discretizes the continuous hydrological processes of the watershed into message passing between graph nodes, enabling the migration and transport of perfluorinated compounds within the watershed to be modeled with a graph neural network. A slope unit receives non-point source pollutants from rainfall erosion and transports them to a channel unit via a confluence edge; the channel unit then passes the pollutants downstream along the water flow direction. This topology ensures the physical consistency of the migration and transport model and avoids simulation results that violate hydrological laws, such as pollutants flowing upstream.
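By way of a non-limiting illustration, the adjacency list described in S3.1.4 can be sketched as follows. Node identifiers, attribute keys, and the helper `downstream_path` (which follows the first outgoing edge of each node and rejects cycles) are hypothetical and serve only to show how directed edges enforce downstream-only transport.

```python
def build_hru_graph(nodes, edges):
    """nodes maps node id -> unit type; edges is a list of (source, target, attributes)."""
    graph = {node: {"type": kind, "in": [], "out": []} for node, kind in nodes.items()}
    for source, target, attributes in edges:
        graph[source]["out"].append((target, attributes))
        graph[target]["in"].append((source, attributes))
    return graph


def downstream_path(graph, start):
    """Follow the first outgoing edge from each node until the outlet is reached."""
    path, seen = [start], {start}
    while graph[path[-1]]["out"]:
        following = graph[path[-1]]["out"][0][0]
        if following in seen:
            raise ValueError("cycle detected: water cannot flow back upstream")
        path.append(following)
        seen.add(following)
    return path


nodes = {"S1": "slope", "C1": "channel", "C2": "channel"}
edges = [
    ("S1", "C1", {"slope_length": 120.0, "cover_coefficient": 0.3}),
    ("C1", "C2", {"channel_length": 800.0, "channel_slope": 0.002}),
]
graph = build_hru_graph(nodes, edges)
route = downstream_path(graph, "S1")
```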
[0166] S3.2: Extract the physicochemical parameters of each compound from the perfluorinated compound structure identification result table, and use them as the compound attribute vectors of the nodes in the hydrological response unit graph structure.
[0167] Specifically, the physicochemical parameter extraction process includes:
[0168] S3.2.1: For each compound record in the perfluorinated compound structure identification result table, call the cheminformatics toolkit to calculate the logarithm of the octanol water partition coefficient based on the molecular structure representation string. This value reflects the strength of the compound's hydrophobicity.
[0169] S3.2.2: Call the acid-base dissociation constant prediction module to predict the acid dissociation constant of the compound based on the functional group information in the molecular structure representation string. This value determines the distribution of the compound's ionic forms under different acidic and alkaline conditions.
[0170] S3.2.3: Call the environmental persistence assessment module to predict the hydrolysis half-life, photolysis half-life, and biodegradation half-life of the compound based on the molecular structure representation string and environmental condition parameters. These half-life values reflect the decay rate of the compound in the environment.
[0171] S3.2.4: Combine the logarithmic value of the octanol-water partition coefficient, acid dissociation constant, hydrolysis half-life, photolysis half-life, and biodegradation half-life into a compound physicochemical parameter vector. Based on the detection location information in the perfluorinated compound structure identification result table, determine the hydrological response unit node where each compound is detected, and assign the compound physicochemical parameter vector to the corresponding compound attribute vector slot.
[0172] S3.3: Construct a graph neural network that integrates physical mechanisms as the core architecture of the migration and transfer model.
[0173] Specifically, the construction process of the graph neural network that integrates physical mechanisms includes:
[0174] S3.3.1: Define the node state vector, including compound concentration values, hydrological response unit area, land use type code, and compound physicochemical parameter vector. Extract the initial compound concentration observation values of each hydrological response unit from the watershed multi-source data cube, and set the initial concentration value to zero for units without observations.
[0175] S3.3.2: Define a message passing function that calculates the amount of mass transported between adjacent nodes. The message passing function uses hydrodynamic equations to calculate the flow velocity based on the segment length, channel slope, and roughness coefficient in the edge attributes, and calculates the advection flux based on the flow velocity and the concentration value of the upstream node.
[0176] In this embodiment, the water flow velocity $v$ is calculated according to Manning's formula:
[0177] $$v = \frac{1}{n} R^{2/3} S^{1/2}$$
[0178] where $n$ is the roughness coefficient of the river channel, $R$ is the hydraulic radius, and $S$ is the slope of the river channel.
[0179] The advection flux $F_{\mathrm{adv}}$ is calculated as:
[0180] $$F_{\mathrm{adv}} = Q \cdot C_{\mathrm{up}} = v \cdot A \cdot C_{\mathrm{up}}$$
[0181] where $Q$ is the flow rate, $A$ is the cross-sectional area of flow, and $C_{\mathrm{up}}$ is the concentration at the upstream node.
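As a minimal sketch of the message passing function in S3.3.2, Manning's formula and the advection flux can be evaluated as below. The function names and the numeric inputs are illustrative assumptions. SI units are assumed for the hydraulic quantities, and the unit of the flux follows from whatever concentration unit the caller supplies.

```python
def manning_velocity(n, hydraulic_radius, slope):
    """Mean flow velocity (m/s) from Manning's formula in SI units."""
    if n <= 0 or hydraulic_radius < 0 or slope < 0:
        raise ValueError("n must be positive; radius and slope must be non-negative")
    return (1.0 / n) * hydraulic_radius ** (2.0 / 3.0) * slope ** 0.5


def advection_flux(velocity, cross_section_area, upstream_conc):
    """Mass flux carried downstream: flow rate times upstream concentration."""
    flow_rate = velocity * cross_section_area
    return flow_rate * upstream_conc


# Illustrative values only: n = 0.035, R = 1 m, S = 0.0004, A = 10 m^2.
v = manning_velocity(n=0.035, hydraulic_radius=1.0, slope=0.0004)
flux = advection_flux(v, cross_section_area=10.0, upstream_conc=2.0)
```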
[0182] S3.3.3: Define an environmental behavior decay function, which is used to calculate the degradation loss of compounds within a node. The decay function uses a first-order kinetic degradation equation, calculates the degradation decay coefficient based on the various half-lives and time steps in the compound's physicochemical parameter vector, and multiplies the node concentration value by the degradation decay coefficient to obtain the post-degradation concentration value.
[0183] In this embodiment, the concentration update corresponding to the first-order kinetic degradation equation is as follows:
[0184] $$C_{t+\Delta t} = C_t \, e^{-k \Delta t}, \qquad k = \frac{\ln 2}{t_{1/2}}$$
[0185] where $k$ is the overall degradation rate constant, $t_{1/2}$ is the combined half-life (synthesized from the hydrolysis, photolysis, and biodegradation half-lives), and $\Delta t$ is the time step.
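The first-order decay update of S3.3.3 can be sketched as follows. The text does not state how the three half-lives are synthesized into a combined half-life; this sketch assumes the three pathways act in parallel, so that their rate constants add. That combination rule, the function names, and the example values are assumptions.

```python
import math


def overall_rate_constant(half_lives):
    """Sum of first-order rate constants, k_i = ln(2) / t_half_i.

    Assumption: hydrolysis, photolysis, and biodegradation act in parallel.
    """
    if any(t <= 0 for t in half_lives):
        raise ValueError("half-lives must be positive")
    return sum(math.log(2.0) / t for t in half_lives)


def decay_step(conc, k, dt):
    """Concentration after one time step of first-order degradation."""
    return conc * math.exp(-k * dt)


# Illustrative half-lives in days (hydrolysis, photolysis, biodegradation).
k = overall_rate_constant([100.0, 50.0, 100.0])
combined_half_life = math.log(2.0) / k      # 25 days for these inputs
c_next = decay_step(8.0, k, dt=25.0)        # one combined half-life halves 8.0
```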
[0186] S3.3.4: Define a rainfall scour response function, which is used to calculate the non-point source inflow caused by rainfall events. Extract precipitation data for each time moment from the watershed multi-source data cube, look up the corresponding scour coefficient according to the land use type code of the node, and multiply the precipitation, scour coefficient, and node area to obtain the rainfall scour inflow.
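[0187] In this embodiment, the rainfall scour inflow $W_{\mathrm{rain}}$ is calculated as:
[0188] $$W_{\mathrm{rain}} = P \cdot A_{u} \cdot \alpha$$
[0189] where $P$ is the precipitation, $A_{u}$ is the area of the hydrological response unit, and $\alpha$ is the scour coefficient corresponding to the land use type of the unit.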
[0190] S3.3.5: Define the adsorption-sedimentation function, which is used to calculate the amount of compound adsorbed and fixed on particulate matter. The adsorption partition coefficient is calculated logarithmically from the octanol-water partition coefficient in the compound's physicochemical parameter vector, and the adsorption-sedimentation loss is calculated based on the estimated suspended particulate matter concentration at the nodes and the sedimentation rate.
[0191] In this embodiment, the relationship between the adsorption partition coefficient $K_d$ and the octanol-water partition coefficient $K_{ow}$ can be represented as:
[0192] $$K_d = f_{oc} \cdot K_{oc}$$
[0193] where $f_{oc}$ is the organic carbon content and $K_{oc}$ is the organic-carbon-normalized partition coefficient, which is estimated from $K_{ow}$ through a logarithmic relationship. The relationship between the dissolved concentration $C_d$ and the total concentration $C_T$ is:
[0194] $$C_d = \frac{C_T}{1 + K_d \cdot C_{SS}}$$
[0195] where $C_{SS}$ is the concentration of suspended particulate matter.
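The adsorption partitioning of S3.3.5 can be illustrated with the sketch below. The text states only that the partition coefficient is derived logarithmically from the octanol-water partition coefficient; the log-linear form `log Koc = a * log Kow + b` and the coefficients passed in the example are placeholder assumptions, as are the function names. Units are assumed to be L/kg for the partition coefficients and kg/L for suspended particulate matter.

```python
def koc_from_log_kow(log_kow, a, b):
    """Organic-carbon-normalized partition coefficient (L/kg).

    Assumed log-linear regression: log10(Koc) = a * log10(Kow) + b.
    """
    return 10.0 ** (a * log_kow + b)


def partition_coefficient(f_oc, k_oc):
    """Adsorption partition coefficient Kd = f_oc * Koc (L/kg)."""
    return f_oc * k_oc


def dissolved_concentration(total_conc, kd, ss_conc):
    """Dissolved part of the total concentration; ss_conc in kg/L."""
    return total_conc / (1.0 + kd * ss_conc)


# Placeholder inputs: log Kow = 4, a = 1, b = 0, 2 % organic carbon, 50 mg/L solids.
k_oc = koc_from_log_kow(4.0, a=1.0, b=0.0)
kd = partition_coefficient(0.02, k_oc)
c_dissolved = dissolved_concentration(10.0, kd, ss_conc=5e-5)
c_particulate = 10.0 - c_dissolved   # fraction available for sedimentation loss
```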
[0196] S3.3.6: Combine the message passing function, environmental behavior decay function, rainfall scour response function, and adsorption-sedimentation function into a single-step update module of the graph neural network. The module takes the node state vectors and edge attributes at the current time as input and outputs the predicted node concentrations at the next time step.
[0197] Figure 3 is a schematic diagram of the graph neural network structure that integrates physical mechanisms. As shown in Figure 3, the graph neural network uses hydrological response units as computational nodes, and the figure illustrates the information transfer between two upstream units and one downstream computational unit. Each upstream unit stores its current concentration state value and sends an advection message to the downstream computational unit along the flow direction via the message passing function. The message passing function uses hydrodynamic equations to calculate the flow velocity from the river segment length, channel slope, and roughness coefficient, and combines it with the upstream concentration value to obtain the advection flux. The downstream computational unit contains the environmental behavior decay function module, which uses a first-order kinetic degradation equation to calculate the degradation decay coefficient from the hydrolysis, photolysis, and biodegradation half-lives of the perfluorinated compound and attenuates the incoming pollutant accordingly. The vertically downward arrows represent the external input path of the rainfall scour response function: when a rainfall event occurs in the watershed, this function calculates the non-point source pollutant inflow from the precipitation, the scour coefficient of the land use type, and the unit area. The upward arrows represent the loss path of the adsorption-sedimentation function, which calculates the adsorption partition coefficient from the octanol-water partition coefficient of the perfluorinated compound and estimates the adsorption-sedimentation loss in combination with the suspended particulate matter concentration. 
After the joint calculation of the above multiple physical mechanism functions, the downstream computing unit outputs the updated concentration value and passes it to the next level unit. This design, which embeds hydrodynamics and environmental chemical degradation mechanisms into the neural network message passing process, enables the model to produce concentration predictions that conform to physical laws even in data-sparse regions. It avoids unreasonable phenomena such as backflow transport of pollutants or unwarranted increases in concentration that may occur in purely data-driven models, effectively improving the model's spatiotemporal extrapolation capability and interpretability in watershed-scale perfluorinated compound migration simulations.
[0198] S3.4: Perform time-series simulation based on the migration and transport model to generate the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0199] Specifically, the time-series simulation process includes:
[0200] S3.4.1: Set the start and end times of the time series simulation, as well as the simulation time step. The time step is determined based on the hydrological response time scale and must meet the numerical stability condition.
[0201] S3.4.2: The concentration observations or initial concentration values of each node at the initial time are used as the initial conditions for the node state vector, and the meteorological driving data at the corresponding time in the watershed multi-source data cube are used as external inputs.
[0202] S3.4.3: Starting from the initial moment, iteratively call the single-step update module of the graph neural network according to the time step. At each time step, execute the message passing calculation, environmental behavior decay calculation, rainfall scour response calculation, and adsorption-sedimentation calculation in sequence, then take the algebraic sum of the resulting fluxes and update the node concentration values.
[0203] S3.4.4: After each time step, record the concentration prediction value of each node at the current moment, and accumulate them to form concentration prediction time series data.
[0204] S3.4.5: Organize the concentration prediction time series data of all time steps into a matrix format according to the node number and time step index to generate a spatiotemporal concentration distribution matrix of perfluorinated compounds. The data structure of the spatiotemporal concentration distribution matrix of perfluorinated compounds is a two-dimensional matrix, where the row index is the node number of the hydrological response unit, the column index is the time step number, and the matrix element is the concentration prediction value of the corresponding node at the corresponding time.
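The time-series simulation of S3.4.1 to S3.4.5 can be sketched as a plain loop over a single-step update. This is a simplified stand-in for the single-step update module, not the embodiment itself: it assumes equal water volume in every unit (so concentrations can be exchanged directly), an explicit upwind transfer governed by the Courant number `velocity * dt / length`, a constant settling rate, and precomputed rainfall inflow per step. All names and values are hypothetical.

```python
import math


def step(conc, edges, k, dt, rain_in, settle_rate):
    """One explicit update; conc is a list indexed by node number."""
    delta = [0.0] * len(conc)
    for src, dst, length, velocity in edges:
        courant = velocity * dt / length
        if courant > 1.0:
            raise ValueError("time step violates the stability condition")
        moved = courant * conc[src]          # advection, upstream to downstream only
        delta[src] -= moved
        delta[dst] += moved
    for i, c in enumerate(conc):
        delta[i] -= (1.0 - math.exp(-k * dt)) * c   # first-order degradation loss
        delta[i] -= settle_rate * dt * c            # adsorption-sedimentation loss
        delta[i] += rain_in[i]                      # rainfall scour inflow
    return [max(0.0, c + d) for c, d in zip(conc, delta)]


def simulate(conc0, edges, k, dt, rain_series, settle_rate=0.0):
    """Return a matrix with one row per node and one column per time step."""
    matrix = [[c] for c in conc0]
    conc = list(conc0)
    for rain_in in rain_series:
        conc = step(conc, edges, k, dt, rain_in, settle_rate)
        for i, c in enumerate(conc):
            matrix[i].append(c)
    return matrix


# Chain 0 -> 1 -> 2; node 2 is the outlet and has no outgoing edge in this toy case.
edges = [(0, 1, 1000.0, 0.5), (1, 2, 1000.0, 0.5)]
no_rain = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
matrix = simulate([10.0, 0.0, 0.0], edges, k=0.0, dt=1000.0, rain_series=no_rain)
```

With degradation, settling, and rainfall switched off, the column sums stay constant, which is a quick check that the advection term only moves mass downstream and does not create it.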
[0205] S3.5: Perform observation assimilation correction on the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0206] Specifically, the observation assimilation correction process includes:
[0207] S3.5.1: Extract the measured concentration values at each monitoring point location at each time from the multi-source data cube of the watershed, and determine the corresponding hydrological response unit node number based on the monitoring point location.
[0208] S3.5.2: Calculate the deviation between the predicted concentration and the measured concentration at the corresponding position in the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0209] S3.5.3: An ensemble Kalman filter is employed to correct and update all elements in the spatiotemporal concentration distribution matrix of perfluorinated compounds based on the bias. This ensemble Kalman filter constructs an ensemble of predicted states, calculates the cross-covariance matrix between state variables and observed variables, and propagates observed information to unobserved locations based on the Kalman gain, achieving optimal estimation of the overall concentration.
[0210] In this embodiment, the update equation for the ensemble Kalman filter is:
[0211] $$X^{a} = X^{f} + K\left(y - H X^{f}\right)$$
[0212] where $X^{a}$ is the assimilated analysis state ensemble, $X^{f}$ is the model forecast state ensemble, $y$ is the observation vector, and $H$ is the observation operator (mapping the state space to the observation space). The Kalman gain matrix is calculated as $K = P_{xy}\left(P_{yy} + R\right)^{-1}$, where $P_{xy}$ is the cross-covariance between the state and the predicted observations, $P_{yy}$ is the covariance of the predicted observations, and $R$ is the observation error covariance.
[0213] S3.5.4: Overwrite the original matrix with the corrected matrix elements to obtain the assimilated and corrected spatiotemporal concentration distribution matrix of perfluorinated compounds.
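The assimilation steps S3.5.1 to S3.5.4 can be illustrated for the simplest case of a single scalar observation at one node, where the Kalman gain reduces to a vector and no matrix inversion is needed. This restriction, the function name, and the toy ensemble are assumptions. The sketch applies the update without perturbing the observation; practical ensemble Kalman filters usually perturb the observations or use a square-root form to keep the ensemble spread consistent.

```python
def enkf_update(ensemble, obs_index, obs_value, obs_var):
    """Ensemble Kalman update for one scalar observation of state[obs_index].

    ensemble: list of member state vectors (lists of node concentrations).
    """
    m = len(ensemble)
    n = len(ensemble[0])
    mean = [sum(member[i] for member in ensemble) / m for i in range(n)]
    predicted_obs = [member[obs_index] for member in ensemble]   # H applied to each member
    obs_mean = mean[obs_index]
    # Sample covariance of predicted observations and state-observation cross-covariance.
    p_yy = sum((h - obs_mean) ** 2 for h in predicted_obs) / (m - 1)
    p_xy = [
        sum((member[i] - mean[i]) * (h - obs_mean)
            for member, h in zip(ensemble, predicted_obs)) / (m - 1)
        for i in range(n)
    ]
    gain = [p / (p_yy + obs_var) for p in p_xy]
    return [
        [x + g * (obs_value - h) for x, g in zip(member, gain)]
        for member, h in zip(ensemble, predicted_obs)
    ]


# Node 0 is observed; node 1 is unobserved but correlated with node 0 in the ensemble.
forecast = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
analysis = enkf_update(forecast, obs_index=0, obs_value=3.0, obs_var=1.0)
```

In this toy case the analysis mean at the unobserved node rises from 4.0 to 5.0, which shows how the cross-covariance carries observed information to unobserved locations.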
[0214] Specifically, by embedding hydrodynamic equations and environmental chemical degradation mechanisms into the message passing process of a graph neural network, this step addresses the risk that purely data-driven models generate concentration distributions that violate physical laws. In watershed-scale pollutant migration simulations, the flow direction dictates that pollutants can only be transported downstream along the river channel, rainfall events wash non-point source pollutants into the river, and photolysis and hydrolysis cause concentrations to decay over time. These physical and chemical processes have clear causal relationships. If the model ignores these constraints, it may produce unreasonable results, such as pollutants moving upstream or concentrations rising without any source input. The graph neural network that integrates physical mechanisms encodes this domain knowledge as a structural bias of the network, enabling the model to produce predictions that conform to physical laws even in data-sparse regions and thus improving its spatiotemporal extrapolation capability. At the same time, the observation assimilation correction propagates measured data across the entire watershed and calibrates prediction biases while maintaining physical consistency, providing a reliable concentration field for subsequent source tracing and inversion.
[0215] S4: Based on the spatiotemporal concentration distribution matrix of perfluorinated compounds, the contribution weight of each potential emission source to the monitoring point is calculated by backpropagation using the attention mechanism. The contribution weight is then matched and verified by combining the enterprise source feature fingerprint database, and the pollution source contribution heat map and the list of priority control enterprises are output.
[0216] This step addresses the problem that traditional source tracing methods rely on experience-based judgment and cannot quantitatively assess the cumulative contribution of multiple sources. By introducing an attention mechanism into the pollution source inversion process, it achieves quantifiable inverse inference from concentration distribution to source contribution. The pollution source contribution heatmap and the list of priority control enterprises, as the final outputs of S4, directly serve the environmental management department's precise governance decisions.
[0217] Specifically, the process for generating the pollution source contribution inversion and control list includes:
[0218] S4.1: Construct a set of potential emission source nodes and extract the compound composition characteristics of the source nodes from the perfluorinated compound structure identification result table.
[0219] Specifically, the process for constructing the potential emission source node set includes:
[0220] S4.1.1: Extract the geographic feature matrix at unified spatial granularity from the multi-source data cube of the watershed, select the grid units in which enterprises are present as valid units, and extract the geographic coordinates and pollution discharge enterprise registration data of those grid units.
[0221] S4.1.2: Classify and label enterprises based on the industry type field in the enterprise registration data. Label fluorochemical production enterprises, electroplating enterprises, textile waterproofing enterprises, fire-fighting foam production enterprises, electronics manufacturing enterprises, sewage treatment plants, landfills, and airport fire training areas as high-risk sources of perfluorinated compounds.
[0222] S4.1.3: Identify the grid cells corresponding to enterprises or facilities marked as high-risk sources of perfluorinated compounds as potential emission source nodes, and organize the node numbers and geographic coordinates of all potential emission source nodes into a set of potential emission source nodes.
[0223] S4.1.4: Based on the detection location information in the perfluorinated compound structure identification result table, statistically analyze the detection frequency and compound type distribution of compounds within a set buffer radius around each potential emission source node, and construct a compound composition feature vector for the source node. The buffer radius is determined based on the near-field diffusion characteristics of pollutants.
[0224] S4.2: Construct a source contribution backpropagation network based on the spatiotemporal concentration distribution matrix of perfluorinated compounds and the hydrological response unit graph structure.
[0225] Specifically, the construction process of the source contribution backpropagation network includes:
[0226] S4.2.1: Reverse the edge direction in the hydrological response unit graph structure, changing edges that originally pointed downstream to edges that pointed upstream, resulting in a reverse hydrological connectivity graph. This reversal operation transforms the topology of forward pollutant transport into a source-tracing topology.
[0227] S4.2.2: Define a reverse attention layer on the reverse hydrological connectivity graph. The input to the reverse attention layer is the concentration time-series vector and node attribute vector of each node, and the output is the attention weight distribution of each node over its upstream neighboring nodes. The concentration time-series vector is extracted from the spatiotemporal concentration distribution matrix of perfluorinated compounds according to node number.
[0228] S4.2.3: Define the attention weight calculation function. For each pairing of a downstream node and one of its upstream neighbors, calculate a query vector and a key vector. The query vector is obtained by a linear transformation of the concentration time-series vector of the downstream node, and the key vector is obtained by another linear transformation of the concentration time-series vector of the upstream node. Calculate the dot product of the query vector and the key vector, scale it by the square root of the vector dimension, and normalize over the upstream neighbors using the softmax function to obtain the attention weights.
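[0229] In this embodiment, the attention weight $\alpha_{ij}$ is calculated as:
[0230] $$\alpha_{ij} = \operatorname{softmax}_{j}\left(\frac{\left(W_Q h_i\right)^{\top}\left(W_K h_j\right)}{\sqrt{d}}\right)$$
[0231] where $h_i$ is the feature vector of downstream node $i$ (including its concentration time series), $h_j$ is the feature vector of upstream adjacent node $j$, $W_Q$ and $W_K$ are learnable transformation matrices, and $d$ is the vector dimension. The softmax is taken over the upstream adjacent nodes $j$ of node $i$, so $\alpha_{ij}$ is the probability weight with which the pollution at downstream node $i$ is attributed to upstream node $j$.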
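The attention weight calculation of S4.2.3 is a scaled dot-product followed by a softmax over the upstream neighbors. A minimal sketch is given below; the function names are hypothetical, and identity matrices stand in for the learnable transformation matrices, which would be trained in practice.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_weights(h_down, h_ups, w_q, w_k):
    """Weights of one downstream node over its upstream neighbors."""
    query = matvec(w_q, h_down)
    d = len(query)
    scores = []
    for h_up in h_ups:
        key = matvec(w_k, h_up)
        scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
    peak = max(scores)                       # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]          # placeholder for trained matrices
# The first upstream neighbor has the same pattern as the downstream node.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity, identity)
```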
[0232] S4.2.4: Define the source contribution value propagation function. On the reverse hydrological connectivity diagram, the source contribution signal propagates along the reverse edges, starting from the downstream monitoring point node. For each reverse edge encountered, the source contribution signal is multiplied by the attention weight corresponding to that edge, and the attenuation coefficient along the propagation path is accumulated. The attenuation coefficient is calculated based on the edge length and the environmental half-life of the compound.
[0233] In this embodiment, the source contribution value propagation function is expressed as:
[0234] $$s_j^{(l+1)} = \sum_{i \in \mathcal{N}(j)} \alpha_{ij} \, \gamma_{ij} \, s_i^{(l)}$$
[0235] where $s_j^{(l+1)}$ is the source contribution signal received by node $j$ after propagation through layer $l+1$; $s_i^{(l)}$ is the source contribution signal of node $i$ at layer $l$; $\mathcal{N}(j)$ is the set of downstream adjacent nodes of node $j$ in the reverse hydrological connectivity graph; $\alpha_{ij}$ is the attention weight corresponding to edge $(i, j)$; and $\gamma_{ij}$ is the attenuation coefficient of edge $(i, j)$.
[0236] The attenuation coefficient $\gamma_{ij}$ is calculated as:
[0237] $$\gamma_{ij} = \exp\left(-\frac{k \, L_{ij}}{v_{ij}}\right)$$
[0238] where $L_{ij}$ is the river section length corresponding to edge $(i, j)$, $k$ is the overall degradation rate constant of the compound, and $v_{ij}$ is the average flow velocity in river section $(i, j)$.
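The propagation of S4.2.4 can be sketched as a layer-by-layer walk along reversed edges in which each hop multiplies the signal by the attention weight and the attenuation coefficient of the edge. The data layout (`reverse_edges` as a dictionary of tuples), the function names, and the node numbering are hypothetical; the stopping rule follows the description of halting at potential emission source nodes or at a maximum number of layers.

```python
import math


def attenuation(length, k, velocity):
    """Decay over the travel time length / velocity of one river section."""
    return math.exp(-k * length / velocity)


def propagate(start_signal, reverse_edges, source_nodes, max_layers):
    """reverse_edges: node -> list of (upstream_node, attention_weight, attenuation)."""
    signal = dict(start_signal)                 # node -> signal at the current layer
    received = {s: 0.0 for s in source_nodes}   # signal accumulated at each source
    for _ in range(max_layers):
        next_signal = {}
        for node, value in signal.items():
            for upstream, weight, gamma in reverse_edges.get(node, []):
                share = value * weight * gamma
                if upstream in received:
                    received[upstream] += share   # stop at potential emission sources
                else:
                    next_signal[upstream] = next_signal.get(upstream, 0.0) + share
        if not next_signal:
            break
        signal = next_signal
    return received


# Monitoring node 0, transmission nodes 1 and 2, potential sources 3, 4, and 5.
reverse_edges = {
    0: [(1, 0.75, 1.0), (2, 0.25, 1.0)],
    1: [(3, 0.5, 1.0), (4, 0.5, 1.0)],
    2: [(5, 1.0, 0.5)],
}
received = propagate({0: 1.0}, reverse_edges, source_nodes=[3, 4, 5], max_layers=5)
```

Running `propagate` once per monitoring point and stacking the results gives one column of the source contribution weight matrix per monitoring point.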
[0239] S4.2.5: Encapsulate the reverse attention layer and the source contribution value propagation function into a source contribution backpropagation network. This network takes the concentration time-series vector of a monitoring point node as input and outputs the source contribution weight value of each node in the potential emission source node set.
[0240] S4.3: Perform backpropagation calculation of source contributions to generate source contribution weight matrix.
[0241] Specifically, the source contribution backpropagation calculation process includes:
[0242] S4.3.1: Determine the set of monitoring point nodes by marking the nodes in the hydrological response unit graph structure that correspond to the sampling points in the watershed multi-source data cube as monitoring point nodes.
[0243] S4.3.2: For each monitoring node, extract its concentration time-series vector from the spatiotemporal concentration distribution matrix of perfluorinated compounds as the starting signal for backpropagation.
[0244] S4.3.3: Invoke the source contribution backpropagation network, using the concentration time-series vector of the monitoring point node as input, and propagate layer by layer along the reverse hydrological connectivity graph. In each layer of propagation, calculate the attention weight of the current node to its upstream neighboring nodes, and distribute the source contribution signal to each upstream node according to the weight.
[0245] S4.3.4: Repeat the propagation process until a potential emission source node is reached or the set maximum number of propagation layers is reached. Record the source contribution signal value received by each potential emission source node from each monitoring point.
[0246] S4.3.5: Organize the source contribution signal values received by each potential emission source node into a matrix according to the source node number and monitoring point number to generate the source contribution weight matrix. The source contribution weight matrix is a two-dimensional matrix in which the row index is the potential emission source node number, the column index is the monitoring point node number, and each element is the contribution weight of the corresponding source node to the corresponding monitoring point. Figure 4 is a schematic diagram of attention-based backpropagation source tracing. As shown in Figure 4, the diagram illustrates the complete process by which source contribution signals propagate backward from the downstream monitoring node to the upstream potential emission source nodes. The bottom of the figure shows the downstream monitoring node, which stores the concentration time-series vector of perfluorinated compounds and serves as the starting signal for backpropagation. The middle section contains three transmission nodes, corresponding to midstream channel units in the hydrological response unit graph, which relay source contribution signals between downstream and upstream. The top of the figure shows four potential emission source nodes, representing high-risk perfluorinated compound sources: a fluorochemical enterprise, an electroplating enterprise, a wastewater treatment plant, and a textile enterprise. The directed edges point from downstream to upstream, representing the reverse hydrological connectivity graph obtained by reversing the originally downstream-pointing edges. The line width of each edge represents the magnitude of its attention weight; a thicker line indicates a higher source contribution weight along that propagation path. 
It can be observed that the edge propagating from the monitoring node to the left transmission node has the thickest line width, corresponding to a high-weight contribution path. This transmission node then propagates the signal with a higher weight to the two emission source nodes: the fluorochemical enterprise and the electroplating enterprise. The source contribution weighting chart on the right side of the figure visually illustrates the differences between high, medium, and low weighting levels using bar lengths. This attention-based backpropagation method learns the similarity patterns between concentration time-series vectors to automatically identify the strength of causal relationships between upstream and downstream nodes. It then distributes the concentration signals from downstream monitoring points to each potential upstream emission source node according to attention weights, enabling a quantifiable estimation of the contribution ratio of each pollution source in multi-source superposition scenarios. This solves the problem that traditional source tracing methods cannot quantitatively distinguish the contribution of multiple emission sources to the concentration of the same monitoring point.
[0247] S4.4: Construct an enterprise source feature fingerprint database and match and verify it with the inversion results.
[0248] Specifically, the enterprise source feature fingerprint database construction and matching verification process includes:
[0249] S4.4.1: Retrieve the enterprise's chemical usage declaration data and product fluorine content registration information from the pollution discharge permit data system of the ecological and environmental protection authority. The chemical usage declaration data includes the name of the fluorine-containing chemicals used by the enterprise and the annual usage. The product fluorine content registration information includes the type and content of perfluorinated compounds contained in the enterprise's products.
[0250] S4.4.2: Based on chemical usage declaration data and product fluorine content registration information, construct a characteristic compound composition vector for each potential emission source enterprise. Each dimension of the vector corresponds to the probability or relative proportion of the presence of different perfluorinated compound types. Compile the characteristic compound composition vectors of all enterprises into an enterprise source characteristic fingerprint database.
[0251] S4.4.3: Extract the actual detected compound composition patterns around each potential emission source node from the source node compound composition feature vector obtained from S4.1.4.
[0252] S4.4.4: For each potential emission source node, calculate the cosine similarity between its actual detected compound composition pattern and the corresponding enterprise's characteristic compound composition vector in the enterprise source feature fingerprint database. The cosine similarity is the dot product of the two vectors divided by the product of their magnitudes. Because all components of the composition vectors are non-negative, the value ranges from zero to one, and a larger value indicates more similar composition patterns.
[0253] In this embodiment, the cosine similarity $\mathrm{sim}(a, b)$ is calculated as:
[0254] $$\mathrm{sim}(a, b) = \frac{\sum_{k=1}^{n} a_k b_k}{\sqrt{\sum_{k=1}^{n} a_k^{2}} \, \sqrt{\sum_{k=1}^{n} b_k^{2}}}$$
[0255] where $a$ is the actual detected compound composition vector, $b$ is the characteristic compound composition vector declared by the enterprise, and $n$ is the total number of compound types.
[0256] S4.4.5: Compare the cosine similarity with a set matching threshold. If the cosine similarity is greater than or equal to the matching threshold, the potential emission source node is marked as a high-confidence match; otherwise, it is marked as pending verification. The matching threshold is determined based on the uncertainty of compound detection and the completeness of the enterprise's declared data. Figure 5 is a schematic diagram of source feature fingerprint matching. As shown in Figure 5, the left side of the figure is the compound composition vector declared by the enterprise, constructed from the enterprise's chemical usage declaration data and product fluorine content registration information in the pollution discharge permit. Each dimension of the vector corresponds to the presence probability or relative proportion of one perfluorinated compound type. Blue bars show the proportion distribution of five perfluorinated compounds, compound A through compound E. The right side of the figure is the actual detected compound composition vector, obtained by statistically analyzing the detection frequency and compound type distribution within the buffer radius set around the potential emission source node. Red bars show the actual detection proportions of the same five compounds. The middle of the figure is the cosine similarity calculation module, which computes the dot product of the two vectors and divides it by the product of their magnitudes; the similarity value shown is 0.92. The bottom of the figure is the matching result judgment area, where a dashed line marks the matching threshold of 0.85. Since the calculated cosine similarity of 0.92 exceeds the matching threshold, the potential emission source node is judged to be a high-confidence match. 
As can be observed from the figure, the proportion distribution of the enterprise-reported vectors and the actual detected vectors across various compound types shows a highly consistent pattern. Compound A occupies the highest proportion in both vectors, while compound E has the lowest proportion in both. This similarity in compositional patterns provides chemical evidence to support the source tracing conclusions. The source feature fingerprinting method verifies the rationality of the source contribution weights obtained from the backpropagation of the attention mechanism from a chemical composition perspective, avoiding the erroneous attribution of contribution weights to enterprises that do not produce or use the relevant perfluorinated compounds, thus improving the reliability and interpretability of the pollution source tracing conclusions.
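The fingerprint matching of S4.4.4 and S4.4.5 can be sketched as follows. The two five-component vectors are made-up examples, the function names are hypothetical, and the default threshold of 0.85 is simply the value shown in Figure 5 rather than a recommended setting.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same compound types")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_status(similarity, threshold=0.85):
    """Label a potential emission source node by comparing against the threshold."""
    if similarity >= threshold:
        return "high-confidence match"
    return "pending verification"


# Hypothetical proportions for compounds A to E.
declared = [0.40, 0.30, 0.15, 0.10, 0.05]
detected = [0.35, 0.30, 0.20, 0.10, 0.05]
similarity = cosine_similarity(detected, declared)
status = match_status(similarity)
```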
[0257] S4.5: Combine the source contribution weight matrix and fingerprint matching results to generate a pollution source contribution heat map and a list of priority control enterprises.
[0258] Specifically, the process for generating the pollution source contribution heat map and the list of priority control enterprises includes:
[0259] S4.5.1: Sum the source contribution weight matrix row by row to obtain the total contribution weight value of each potential emission source node to all monitoring points.
[0260] S4.5.2: Associate the total contribution weight value of each potential emission source node with its geographic coordinates, and map the contribution weight value to the corresponding grid cell using a unified spatial grid. If multiple potential emission source nodes exist within the same grid cell, the sum of the contribution weight values is taken as the contribution intensity value of that grid cell.
[0261] S4.5.3: Classify the contribution intensity values of all grid cells into three levels (high contribution, medium contribution, and low contribution) using the natural breaks (Jenks) classification method. This method determines the class thresholds by finding natural clustering boundaries in the data distribution, minimizing the variance within each level and maximizing the differences between levels.
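The three-level classification of S4.5.3 can be illustrated with an exhaustive search over both break positions that minimizes the within-class sum of squared deviations. This brute-force search is a stand-in chosen for clarity; production implementations of natural breaks typically use dynamic programming. The function names and sample values are hypothetical.

```python
def within_class_sse(values):
    """Sum of squared deviations from the class mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks_3(values):
    """Return the upper bounds of the low and medium classes."""
    data = sorted(values)
    n = len(data)
    if n < 3:
        raise ValueError("need at least three values for three classes")
    best = None
    for i in range(1, n - 1):            # first class is data[:i]
        for j in range(i + 1, n):        # second class is data[i:j], third is data[j:]
            cost = (within_class_sse(data[:i])
                    + within_class_sse(data[i:j])
                    + within_class_sse(data[j:]))
            if best is None or cost < best[0]:
                best = (cost, data[i - 1], data[j - 1])
    return best[1], best[2]


def classify(value, low_max, medium_max):
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"


# Hypothetical contribution intensity values of eight grid cells.
intensities = [0.10, 0.12, 0.15, 0.50, 0.55, 0.60, 2.00, 2.20]
breaks = natural_breaks_3(intensities)
levels = [classify(v, *breaks) for v in intensities]
```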
[0262] S4.5.4: Render the classification results into a spatial heatmap, using different color shades to represent contribution intensity levels, and overlay a base map of the watershed boundary and river system to generate the pollution source contribution heatmap. The heatmap is stored as a raster layer file compatible with geographic information systems (GIS). Figure 6 is a schematic diagram of the pollution source contribution heatmap. As shown in Figure 6, the heatmap takes the watershed boundary as its spatial extent and divides the watershed into a regular grid array according to the unified spatial grid. Each grid cell is colored according to its contribution intensity value. Dark red areas are high-contribution zones, corresponding to industrial clusters with high source contribution weights. Two distinct high-contribution hotspots are visible on both sides of the river in the middle of the watershed, where multiple polluting enterprise locations are marked, indicating that these enterprises contribute significantly to the perfluorinated compound concentrations at downstream monitoring points. Yellow areas are medium-contribution zones, mainly distributed in the transitional belt around the high-contribution zones, reflecting the spatial gradient of pollutant diffusion from high-contribution sources to their surroundings. Green areas are low-contribution zones, mainly distributed on the mountain slopes at the edge of the watershed, where enterprises are sparse or no fluorine-related enterprises exist, resulting in low contribution weights to monitoring point concentrations. Blue lines show the river system of the watershed, comprising one main stream and three tributaries. 
The direction of the river system shows a clear correlation with the spatial distribution of the high-contribution zone. The high-contribution hotspots are all located near the river, consistent with the physical laws of perfluorinated compound migration and transport through water bodies. The dark blue dots in the map represent the locations of water quality monitoring points, distributed at key nodes in the main stream and tributaries. The legend in the lower right corner of the map indicates the correspondence between contribution levels and colors, as well as symbols for the water system and monitoring points. The lower left corner provides scale information, and the upper right corner provides a north arrow. The pollution source contribution heat map transforms abstract source contribution weights into an intuitive spatial visualization, enabling environmental management departments to quickly identify hotspots of perfluorinated compound (PFOC) pollution within the watershed. This provides a spatial decision-making basis for the delineation of priority control areas and the optimal allocation of on-site verification resources.
[0263] S4.5.5: Sort all potential emission source nodes in descending order of their total contribution weight. For the top-ranked nodes, extract the corresponding enterprise's name, address, industry type, total contribution weight, fingerprint matching status, and matching similarity value.
[0264] S4.5.6: Calculate the uncertainty interval of the contribution weight values of each ranked enterprise. The uncertainty interval is obtained by performing perturbation analysis on the source contribution backpropagation network. A random perturbation conforming to the observation error distribution is applied to the input concentration time series vector, and the backpropagation calculation is repeated several times. The distribution range of the contribution weight values of each enterprise is then used as the uncertainty interval.
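The perturbation analysis in S4.5.6 can be sketched as a small Monte Carlo loop. This is an illustration under stated assumptions, not the patented implementation: the observation error is assumed to be multiplicative Gaussian noise with relative standard deviation `rel_sigma`, the interval defaults to the full range of the sampled weights, and `toy_invert` is a hypothetical stand-in for the source contribution backpropagation network.

```python
import random


def uncertainty_intervals(concentrations, invert, n_runs=200, rel_sigma=0.1,
                          lower_q=0.0, upper_q=1.0, seed=0):
    """Repeat the inversion on perturbed inputs and return (lower, upper) per enterprise.

    invert: callable mapping a concentration series to {enterprise: weight}.
    lower_q / upper_q: quantiles of the sampled weights used as interval bounds;
    the defaults (0.0, 1.0) give the full distribution range.
    """
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_runs):
        # Assumed error model: multiplicative Gaussian noise, clipped at zero.
        perturbed = [max(0.0, c * (1.0 + rng.gauss(0.0, rel_sigma)))
                     for c in concentrations]
        for name, weight in invert(perturbed).items():
            samples.setdefault(name, []).append(weight)

    intervals = {}
    for name, weights in samples.items():
        weights.sort()
        last = len(weights) - 1
        intervals[name] = (weights[int(lower_q * last)],
                           weights[int(upper_q * last)])
    return intervals


def toy_invert(concentrations):
    """Hypothetical stand-in for the inversion network: fixed shares of the mean."""
    mean = sum(concentrations) / len(concentrations)
    return {"enterprise_A": 0.6 * mean, "enterprise_B": 0.4 * mean}
```

Calling `uncertainty_intervals(series, toy_invert)` returns one `(lower, upper)` pair per enterprise. Setting `lower_q=0.025` and `upper_q=0.975` would give a narrower central interval instead of the full range.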
[0265] S4.5.7: The enterprise name, address, industry type, total contribution weight value, lower limit of the uncertainty interval, upper limit of the uncertainty interval, fingerprint matching status, and matching similarity value are used as fields to organize the enterprises into a priority management list in descending order of total contribution weight value. The data structure of the priority management list is a structured table, with each row corresponding to one priority management enterprise.
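The structured table of S4.5.7 can be sketched as one record per enterprise, sorted by total contribution weight. The field names below paraphrase the fields listed above. The class `PriorityRecord`, the helper names, and the CSV export are illustrative assumptions, since the patent does not specify a file format for the list.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PriorityRecord:
    """One row of the priority management list (one enterprise per row)."""
    enterprise_name: str
    address: str
    industry_type: str
    total_contribution_weight: float
    interval_lower: float
    interval_upper: float
    fingerprint_matched: bool
    matching_similarity: float


def build_priority_list(records, top_n=None):
    """Sort records in descending order of total contribution weight."""
    ranked = sorted(records, key=lambda r: r.total_contribution_weight,
                    reverse=True)
    return ranked if top_n is None else ranked[:top_n]


def to_csv(records):
    """Serialise the ranked list as CSV text, one row per enterprise."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf,
                            fieldnames=[f.name for f in fields(PriorityRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()
```

Keeping the interval bounds in the same row as the weight lets a reader of the table judge each ranking against its uncertainty without consulting a second file.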
[0266] Specifically, this step uses attention-mechanism backpropagation and source feature fingerprinting to address the inability of traditional source tracing methods to quantitatively distinguish the contributions of multiple potential emission sources to the concentration at a single monitoring point. In actual watersheds, perfluorinated compounds detected at downstream monitoring points typically originate from the superposition of multiple upstream pollution sources, and traditional source tracing methods based on empirical judgment or simple correlation analysis struggle to decouple the contribution ratio of each source. The attention mechanism learns similarity patterns in the concentration time series to automatically identify the strength of causal relationships between upstream and downstream nodes, and allocates downstream concentration signals back to each upstream source node according to the attention weights, yielding a quantitative estimate of contribution ratios. The source feature fingerprinting method verifies the plausibility of the inversion results from a chemical evidence perspective by comparing the similarity between the compound composition actually detected and the compound composition declared by the enterprise, which avoids erroneously attributing contribution weights to enterprises that do not produce or use the relevant compounds. The pollution source contribution heatmap transforms abstract contribution weight values into an intuitive spatial visualization, helping environmental management departments quickly identify pollution hotspots. The priority control list includes uncertainty interval information, enabling decision-makers to assess the reliability of source tracing conclusions. 
Under limited resources, priority can be given to conducting on-site inspections and emission control for high-contribution and high-confidence enterprises, thereby improving the accuracy and efficiency of perfluorinated compound (PFC) control.
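The back-allocation of a downstream concentration signal by attention weights can be sketched as follows. This is a simplified illustration: in a trained network the query and key vectors would come from learned projections of each node's concentration time series and attributes, whereas here they are passed in directly. The function names and the per-edge `attenuation` mapping are hypothetical.

```python
import math


def attention_weights(query, keys):
    """Scaled dot-product attention of one downstream node over its upstream neighbours.

    query: vector for the downstream node.
    keys: {upstream_node: vector} for each upstream neighbour.
    Returns {upstream_node: weight}, with the weights summing to 1.
    """
    scale = math.sqrt(len(query))
    scores = {node: sum(q * k for q, k in zip(query, key)) / scale
              for node, key in keys.items()}
    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(scores.values())
    exps = {node: math.exp(score - peak) for node, score in scores.items()}
    total = sum(exps.values())
    return {node: e / total for node, e in exps.items()}


def back_allocate(signal, weights, attenuation):
    """Split a downstream signal among upstream nodes along the reversed edges.

    attenuation: {upstream_node: coefficient}; missing nodes default to 1.0.
    """
    return {node: signal * w * attenuation.get(node, 1.0)
            for node, w in weights.items()}
```

Because the softmax weights sum to one, the allocated shares add up to the downstream signal whenever every attenuation coefficient is 1.0. Coefficients below 1.0 reduce the share credited along the corresponding path.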
[0267] In this embodiment, the workflow of the above method is illustrated using a tributary of the Yangtze River as an application scenario. First, high-resolution mass spectrometry non-targeted screening data from 12 water quality monitoring stations, flow velocity and direction observation data from 20 hydrological stations, precipitation time-series data from 3 meteorological stations, 10-meter resolution land use remote sensing images covering the basin, and information on 87 fluorine-related enterprises registered in the sewage discharge permit system are collected. Through S1, the above data are aligned and fused at an hourly temporal granularity and a 50-meter spatial granularity to construct a basin-wide multi-source data cube with a time span of one hydrological year and spatial coverage of the entire basin. Through S2, deep learning analysis is performed on the mass spectrometry data from each monitoring station, identifying 23 perfluorinated compounds, including perfluorooctanoic acid (PFOA), perfluorooctane sulfonic acid (PFOS), and hexafluoropropylene oxide dimer acid. Three of these are novel short-chain perfluorinated carboxylic acids not included in the database. The results are compiled into a perfluorinated compound structure identification result table. Through S3, the watershed is divided into 456 hydrological response units and a graph structure is constructed. Concentration field simulations are performed using hydrodynamic and environmental degradation equations to obtain the spatiotemporal concentration distribution matrix of perfluorinated compounds. Through S4, contribution weight inversion is performed on the 87 potential emission sources, generating a pollution source contribution heatmap that reveals three industrial clusters as high-contribution hotspots. 
The priority control list identifies the top 15 contributing enterprises and their uncertainty intervals, providing a precise basis for controlling perfluorinated compound pollution in the watershed.
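The hourly, 50-meter alignment used in this scenario can be illustrated by mapping one observation to its position in the data cube. The sketch assumes the cube is indexed by (hour, grid row, grid column), with row 0 at the northern edge and coordinates given in meters in a projected coordinate system. This layout and the name `cube_index` are assumptions for illustration and are not taken from the patent.

```python
from datetime import datetime

HOUR_SECONDS = 3600   # hourly temporal granularity
CELL_SIZE_M = 50.0    # 50-meter spatial granularity


def cube_index(timestamp, x_m, y_m, t0, x_min, y_max):
    """Map an observation to (hour index, grid row, grid column).

    t0: start of the simulated period (datetime).
    x_min, y_max: coordinates of the north-west corner of the grid, in meters.
    """
    t_idx = int((timestamp - t0).total_seconds() // HOUR_SECONDS)
    col = int((x_m - x_min) // CELL_SIZE_M)
    row = int((y_max - y_m) // CELL_SIZE_M)
    return t_idx, row, col
```

Observations that fall into the same hour and cell would then need an aggregation rule, such as a mean, before being written into the cube. The choice of rule is left open here.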
[0268] Example 2:
[0269] This embodiment, based on Embodiment 1, provides an artificial intelligence-based watershed-scale perfluorinated compound tracing system. As shown in Figure 7, the system includes:
[0270] Multi-source data cube construction module: used to integrate high-resolution mass spectrometry data, hydrological and meteorological data, land use data and spatial distribution data of polluting enterprises within the watershed, and align them according to a unified spatiotemporal granularity to construct a multi-source data cube for the watershed;
[0271] Perfluorinated compound structure identification module: Based on mass spectrometry feature data in the multi-source data cube of the watershed, it uses a deep learning model trained with virtual spectrum enhancement to perform end-to-end analysis, outputs a list of candidate molecular structures, and obtains a perfluorinated compound structure identification result table after screening by molecular formula constraint and retention time verification.
[0272] Migration and transport model construction and concentration simulation module: It is used to construct the hydrological response unit diagram structure based on the multi-source data cube of the watershed, use the physicochemical parameters in the perfluorinated compound structure identification result table as node attributes, and construct the migration and transport model by combining hydrodynamic and environmental behavior equations to simulate the spatiotemporal concentration distribution matrix of perfluorinated compounds.
[0273] The pollution source contribution inversion and control list generation module is used to calculate the contribution weight of each potential emission source to the monitoring point based on the spatiotemporal concentration distribution matrix of perfluorinated compounds and the attention mechanism backpropagation. It is then matched and verified by combining the enterprise source feature fingerprint database, and outputs a pollution source contribution heat map and a list of priority control enterprises.
Claims
1. A watershed-scale method for tracing and tracking perfluorinated compounds based on artificial intelligence, characterized in that: The method includes the following steps: S1: Integrate high-resolution mass spectrometry data, hydrological and meteorological data, land use data, and spatial distribution data of polluting enterprises within the watershed, align them according to a unified spatiotemporal granularity, and construct a watershed multi-source data cube; the data structure of the watershed multi-source data cube is a three-dimensional tensor, and the tensor elements are composite attribute records containing chemical feature vectors, hydrological parameter vectors, meteorological index vectors, and land use codes; S2: Based on the mass spectrometry feature data in the multi-source data cube of the watershed, the deep learning pre-trained molecular structure generation model with virtual spectrum generation and fragment masking enhancement training is used for end-to-end analysis, outputting a list of candidate molecular structures. After screening by molecular formula constraints and retention time verification, the result table of perfluorinated compound structure identification is obtained. The steps for constructing a pre-trained molecular structure generation model include: Molecular structure representations and experimental mass spectra of known perfluorinated compounds were retrieved from chemical databases to construct a basic training sample set; Based on quantum chemical calculation methods, the theoretical molecular structure of perfluorinated compounds is fragmented and simulated to predict the mass-to-charge ratio and relative abundance of each fragment ion, generating a set of virtual mass spectra. Fragment masking enhancement is performed on the basic training sample set and the virtual mass spectrum set. Fragment peaks in each mass spectrum are randomly selected for masking. 
The response intensity of the masked peaks is set to zero, while the original molecular structure representation is retained as a label to generate an enhanced training sample set. A deep learning model with an encoder-decoder architecture is constructed as the initial model. The encoder uses a multi-head self-attention mechanism to encode the features of the input mass spectrum peak sequence, and the decoder uses an autoregressive generation mechanism to predict each symbol in the molecular structure representation string character by character. Using the mass spectrometry peak sequence in the enhanced training sample set as input and the corresponding molecular structure representation string as output, a cross-entropy loss function is used to measure the difference between the predicted character and the true character, and an adaptive learning rate optimization algorithm is used to iteratively update the model parameters. After training, the model parameters are saved to obtain the pre-trained molecular structure generation model. S3: Construct a hydrological response unit diagram structure based on the multi-source data cube of the watershed, use the physicochemical parameters in the perfluorinated compound structure identification result table as node attributes, and construct a migration and transport model by combining hydrodynamics and environmental behavior equations to simulate the spatiotemporal concentration distribution matrix of perfluorinated compounds. S4: Based on the spatiotemporal concentration distribution matrix of perfluorinated compounds, the contribution weight of each potential emission source to the monitoring point is calculated by backpropagation using the attention mechanism. The contribution is then matched and verified by the enterprise source feature fingerprint database to generate a heat map of pollution source contribution and a list of priority control enterprises.
2. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 1, characterized in that: The steps for constructing a multi-source data cube for a watershed in S1 include: S1.1: First, collect raw multi-source monitoring data within the watershed and construct a raw multi-source monitoring data set, including raw mass spectrometry acquisition sequences, raw hydrological observation sequences, raw meteorological observation sequences, raw land use image data, and raw sewage discharge enterprise registration data; S1.2: Then, the original mass spectrometry acquisition sequence is processed through a mass spectrometry preprocessing procedure to obtain the mass spectrometry feature matrix; S1.3: Obtain a unified time-granularity hydro-meteorological matrix by using a time-series resampling process from the original hydrological observation sequence and the original meteorological observation sequence; S1.4: Obtain a unified spatial granularity geographic feature matrix by using the original land use image data and the original pollution discharge enterprise registration data through a spatial rasterization process; S1.5: Finally, based on the mass spectrometry feature matrix, the unified temporal granularity hydro-meteorological matrix, and the unified spatial granularity geographic feature matrix, a watershed multi-source data cube is constructed through a spatiotemporal fusion process.
3. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 2, characterized in that: The mass spectrometry preprocessing steps described in S1.2 include: S1.2.1: Perform baseline correction on the original mass spectrometry acquisition sequence, use the asymmetric least squares smoothing method to fit the background baseline curve at each scan time, and subtract the corresponding baseline value from the response intensity value in the original mass spectrometry acquisition sequence to obtain the baseline-corrected mass spectrometry sequence. S1.2.2: Perform noise filtering on the baseline-corrected mass spectrometry sequence. Use wavelet transform to decompose the response intensity signal at each scanning time into multiple scales. Set the coefficients of the high-frequency components with amplitudes lower than the set noise threshold to zero and then perform inverse transform reconstruction to obtain the denoised mass spectrometry sequence. S1.2.3: Perform peak extraction on the denoised mass spectrometry sequence, use the continuous wavelet transform ridge tracking method to identify the chromatographic peak position and peak boundary in the signal at each scanning time, extract the peak mass-to-charge ratio, peak area and peak width parameters, and construct the original peak list; S1.2.4: Perform feature alignment operation on the original peak list. Using mass-to-charge ratio deviation tolerance and retention time deviation tolerance as constraints, use hierarchical clustering method to merge the peaks corresponding to the same compound detected at different scanning times into a unified feature list. S1.2.5: Based on the aligned feature list, construct the mass spectrometry feature matrix using the feature number as the row index, the sampling point number as the column index, and the peak area normalized value as the matrix element.
4. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 1, characterized in that: The steps for identifying the structure of perfluorinated compounds described in S2 include: S2.1: Extract the mass spectrometry feature vectors corresponding to each sampling point from the multi-source data cube of the watershed, and construct the mass spectrometry feature set to be analyzed; S2.2: Construct a pre-trained molecular structure generation model based on virtual spectrum generation and fragment masking enhancement strategies; S2.3: Based on the pre-trained molecular structure generation model, perform end-to-end parsing of each mass spectrometry feature vector in the mass spectrometry feature set to be analyzed, and generate a preliminary list of candidate molecular structures; S2.4: Perform molecular formula constraint verification on the preliminary candidate molecular structure list, screen out candidates whose elemental composition does not conform to the characteristics of perfluorinated compounds, and obtain the candidate list after molecular formula verification; S2.5: Perform retention time verification on the candidate list after molecular formula verification, and screen out candidates whose theoretical retention time deviates from the measured retention time by more than the limit, to obtain the candidate list after retention time verification; S2.6: Based on the candidate list after retention time verification, sort by confidence level and supplement spatiotemporal information to generate a perfluorinated compound structure identification result table.
5. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 1, characterized in that: The steps for constructing a migration and transport model and simulating concentration distribution in S3 include: S3.1: Construct a hydrological response unit diagram structure based on hydrological data in the multi-source data cube of the watershed; S3.2: Extract the physicochemical parameters of each compound from the perfluorinated compound structure identification result table, and use them as the compound attribute vectors of the nodes in the hydrological response unit diagram structure; S3.3: Construct a graph neural network that integrates physical mechanisms as the core architecture of the transfer model; S3.4: Perform time-series simulation based on the migration and transport model to generate the spatiotemporal concentration distribution matrix of perfluorinated compounds; S3.5: Perform observation assimilation correction on the spatiotemporal concentration distribution matrix of perfluorinated compounds.
6. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 5, characterized in that: The steps for constructing the graph neural network that integrates physical mechanisms as described in S3.3 include: S3.3.1: Define the node state vector, including compound concentration value, hydrological response unit area, land use type code and compound physicochemical parameter vector. Extract the compound concentration observation value of each hydrological response unit at the initial time from the watershed multi-source data cube. Set the initial concentration value to zero for units without observation. S3.3.2: Define a message passing function, use hydrodynamic equations to calculate the water flow velocity based on edge attributes, and then calculate the advection flux based on the water flow velocity and the concentration value of the upstream node; S3.3.3: Define the environmental behavior decay function, use the first-order kinetic degradation equation, calculate the degradation decay coefficient based on the various half-life values and time steps in the compound physicochemical parameter vector, and multiply the node concentration value by the degradation decay coefficient to obtain the concentration value after degradation. S3.3.4: Define the rainfall scour response function, extract the precipitation data at each time moment from the watershed multi-source data cube, query the corresponding scour coefficient according to the land use type code of the node, and multiply the precipitation, scour coefficient and node area to obtain the rainfall scour inflow. 
S3.3.5: Define the adsorption-sedimentation function, calculate the adsorption partition coefficient based on the logarithmic value of the octanol-water partition coefficient in the compound's physicochemical parameter vector, and calculate the adsorption-sedimentation loss based on the estimated value of the suspended particulate matter concentration at the node and the sedimentation rate. S3.3.6: Combine the message passing function, environmental behavior decay function, rainfall scour response function, and adsorption-sedimentation function into a single-step update module of the graph neural network, which takes the node state vector and edge attributes at the current time as input and outputs the predicted node concentration value at the next time step.
7. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 1, characterized in that: The steps for generating a heat map of pollution source contributions and a list of priority control enterprises in S4 include: S4.1: Construct a set of potential emission source nodes and extract the compound composition characteristics of the source nodes from the perfluorinated compound structure identification result table; S4.2: Construct a source contribution backpropagation network based on the spatiotemporal concentration distribution matrix of perfluorinated compounds and the hydrological response unit diagram structure; S4.3: Perform backpropagation calculation of source contributions to generate source contribution weight matrix; S4.4: Construct an enterprise source feature fingerprint database and match and verify it with the inversion results; S4.5: Combine the source contribution weight matrix and fingerprint matching results to generate a pollution source contribution heat map and a list of priority control enterprises.
8. The watershed-scale perfluorinated compound tracing method based on artificial intelligence according to claim 7, characterized in that: The construction steps of the source contribution backpropagation network described in S4.2 include: S4.2.1: Reverse the direction of the edges in the hydrological response unit diagram structure, so that the edges that originally pointed downstream point upstream, resulting in a reverse hydrological connectivity graph. S4.2.2: Define a reverse attention layer on the reverse hydrological connectivity graph. The input is the concentration time-series vector and node attribute vector of each node, and the output is the attention weight distribution of the node over its upstream neighboring nodes. S4.2.3: Define the attention weight calculation function. For each pairing of a downstream node with one of its upstream neighboring nodes, calculate the query vector and key vector respectively, calculate the dot product of the two and scale it, and normalize the result with the softmax function to obtain the attention weight. S4.2.4: Define the source contribution value propagation function. Starting from the downstream monitoring point node on the reverse hydrological connectivity graph, the source contribution signal is propagated along the reverse edges. The source contribution signal is multiplied by the attention weight corresponding to each edge, and the attenuation coefficient along the propagation path is accumulated. S4.2.5: The reverse attention layer and the source contribution value propagation function are encapsulated into a source contribution backpropagation network, which takes the concentration time-series vector of the monitoring point node as input and outputs the source contribution weight value of each node in the potential emission source node set.
9. An artificial intelligence-based watershed-scale perfluorinated compound tracing system, used to implement the artificial intelligence-based watershed-scale perfluorinated compound tracing method according to any one of claims 1-8, characterized in that: The system includes: Multi-source data cube construction module: used to integrate high-resolution mass spectrometry data, hydrological and meteorological data, land use data and spatial distribution data of polluting enterprises within the watershed, and align them according to a unified spatiotemporal granularity to construct a multi-source data cube for the watershed; Perfluorinated compound structure identification module: Based on mass spectrometry feature data in the multi-source data cube of the watershed, it uses a deep learning model trained with virtual spectrum enhancement to perform end-to-end analysis, outputs a list of candidate molecular structures, and obtains a perfluorinated compound structure identification result table after screening by molecular formula constraint and retention time verification. Migration and transport model construction and concentration simulation module: It is used to construct the hydrological response unit diagram structure based on the multi-source data cube of the watershed, use the physicochemical parameters in the perfluorinated compound structure identification result table as node attributes, and construct the migration and transport model by combining hydrodynamic and environmental behavior equations to simulate the spatiotemporal concentration distribution matrix of perfluorinated compounds. The pollution source contribution inversion and control list generation module is used to calculate the contribution weight of each potential emission source to the monitoring point based on the spatiotemporal concentration distribution matrix of perfluorinated compounds and the attention mechanism backpropagation. 
It is then matched and verified by combining the enterprise source feature fingerprint database, and outputs a pollution source contribution heat map and a list of priority control enterprises.