Machine learning based atmospheric source apportionment method for soil heavy metal pollution

By constructing a time-series data table of heavy metals in soil and a machine learning model, combined with meteorological and spatial data, the problem of difficulty in identifying the source of heavy metal pollution in soil in the time dimension in existing technologies has been solved, and accurate source tracing and multi-source differentiation of pollution sources have been achieved.

CN122245482A · Pending · Publication Date: 2026-06-19 · SICHUAN NUCLEAR GEOLOGICAL SURVEY INST

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SICHUAN NUCLEAR GEOLOGICAL SURVEY INST
Filing Date: 2026-03-16
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing methods for tracing the source of heavy metal pollution in soil mainly rely on static data, which makes it difficult to reflect the changes in pollution deposition over time. This results in high similarity in the data performance of different pollution sources, making it difficult to accurately identify the source.

Method used

Using machine learning methods, a time-series data table of heavy metals in soil was constructed. Combined with meteorological data and the spatial location of candidate pollution sources, a sedimentation structure sequence was constructed. A machine learning model was used to train and identify pollution sources. Source tracing was determined by combining structural deviation and path deviation.

Benefits of technology

It enables accurate identification of sources of heavy metal pollution in soil, reflects the temporal variation characteristics of pollution deposition, and provides comparable distinguishing criteria in multi-source situations, taking into account meteorological and spatial relationships.




Abstract

This invention discloses a machine learning-based method for tracing atmospheric sources of heavy metal pollution in soil, relating to the fields of soil environmental monitoring and pollution source tracing. By organizing soil heavy metal content data acquired from the same sampling point across multiple sampling periods in a temporal sequence, and combining this with meteorological conditions and spatial location data of candidate pollution sources, a soil heavy metal time-series data table is established. This transforms soil sample data from single-detection records into a data structure with continuous temporal relationships, thereby reflecting the changes in pollution deposition at different time stages during the analysis process. By calculating the proportion of heavy metal content in each sampling period, sorting elements, and analyzing the slope difference between adjacent periods, a deposition structure sequence is constructed, enabling a structured expression of the temporal trajectory of different heavy metal elements. This allows for the identification of the changing characteristics of different pollution sources during the deposition process.


Description

Technical Field

[0001] This invention relates to the field of soil environmental monitoring and pollution source tracing technology, specifically a method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning.

Background Technology

[0002] In the fields of environmental geochemistry and environmental data analysis, determining the sources of heavy metals such as Pb, Cd, Hg, and As in soil typically requires a comprehensive analysis combining regional emission information, meteorological conditions, and soil sample testing results. Within this technical system, soil sample testing data, as fundamental information directly reflecting the state of pollution deposition, is widely used in regional pollution source identification and pollution migration pathway research. Therefore, conducting pollution source analysis and tracing based on heavy metal content data in soil samples has become an important research direction in soil environmental monitoring technology.

[0003] In existing methods for tracing the source of heavy metal pollution in soil, soil samples are typically obtained through periodic sampling, recording the concentration levels of various heavy metal elements at a specific sampling time, such as Pb, Cd, and Hg. Most analytical methods rely on statistical analysis of single sampling data or results from a small number of samples, comparing them with regional emission source characteristics or background values to determine the pollution source. However, these methods primarily depend on static data and struggle to reflect the temporal changes in pollution deposition. Because different pollution sources vary in emission rhythms, deposition patterns, and environmental impact conditions, relying solely on concentration data from a single moment often fails to provide a stable basis for source identification.

[0004] In actual environmental monitoring, soil heavy metal deposition is usually affected by a variety of factors, such as emission activity cycles, changes in meteorological conditions, and particulate matter transport pathways. Different pollution sources form regular change trajectories during long-term deposition. When monitoring data is recorded only as discrete single samples, these deposition patterns are difficult to identify directly, so the data from different pollution sources look highly similar.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a machine learning-based method for tracing atmospheric sources of heavy metal pollution in soil, thus solving the problems mentioned in the background section.

[0006] To achieve the above objectives, the present invention provides the following technical solution: a machine learning-based method for tracing the atmospheric sources of heavy metal pollution in soil, comprising the following steps:
S1. Obtain soil heavy metal content data, meteorological data, and spatial location data of candidate pollution sources at the same sampling point in the target area during multiple sampling periods, and establish a soil heavy metal time series data table.
S2. Perform percentage calculation, element sorting, and slope difference calculation on the heavy metal content of each sampling period in the soil heavy metal time series data table to construct the sedimentation structure sequence of the target area.
S3. Construct candidate pollution source sedimentation structure sequence templates for historical samples with known sources in the same manner as step S2, and input them into the machine learning model for training to obtain a trained pollution source identification model.
S4. Input the sedimentation structure sequence of the sample to be judged into the pollution source identification model, and calculate the structural deviation GWs of each candidate pollution source in combination with the candidate pollution source sedimentation structure sequence template. Output the preliminary source discrimination index QG based on the structural deviation GWs.
S5. Based on the preliminary source discrimination index QG, calculate the path deviation Bs and final verification value Fs for each candidate pollution source, and output the atmospheric source results of soil heavy metal pollution in the target area according to the final verification value Fs.

[0007] Preferably, S1 includes S11; S11. Use the data acquisition device to read soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and spatial location data of candidate pollution sources at the same sampling point during multiple sampling periods. The soil heavy metal content data are obtained by heavy metal analysis instruments and include the measured content C(m,t) of the m-th heavy metal in the t-th sampling period, the total number of heavy metal species M, and the total number of sampling periods T. The sampling time interval data ΔTt represents the actual time interval between the t-th sampling period and the (t-1)-th sampling period. The meteorological data are collected by the wind speed sensors, wind direction sensors, rain gauges, and humidity sensors of automatic weather monitoring stations and include, for the t-th sampling period, the cumulative rainfall Rt, the average wind speed Ut, the prevailing wind direction angle PWt, and the average relative humidity Ht. The spatial location data of the candidate pollution sources are acquired by GNSS positioning equipment and include the spatial distance Ls from the s-th candidate pollution source to the target sampling point and the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point.

[0008] The spatial distance Ls from the s-th candidate pollution source to the target sampling point is obtained using the following formula:

$$L_s = \sqrt{(X_s - X_p)^2 + (Y_s - Y_p)^2}$$

In the formula, (Xs, Ys) represents the spatial coordinates of the s-th candidate pollution source, and (Xp, Yp) represents the spatial coordinates of the target sampling point. The formula for obtaining the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point is: [formula not reproduced in the source text]; in the formula, arctan represents the arctangent function.

Preferably, S1 further includes S12; S12. The acquired soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and candidate pollution source spatial location data are checked in time order to identify data items that exceed the detection record boundaries, are duplicated, have missing key items, have inconsistent dimensions, or are out of order. Then, the soil heavy metal content data items, meteorological data items, and candidate pollution source spatial location data items are standardized to the same scale for similar fields and spliced according to the sampling time sequence to establish a soil heavy metal time series data table. The soil heavy metal content data are preprocessed with a low-pass filter to eliminate noise. Anomaly detection is performed on the meteorological data using the three-standard-deviation method; outliers are removed and missing values are filled in. The spatial location data of candidate pollution sources are decomposed into directional components to eliminate the influence of spatial dimension and prevent jumps in angle calculation. The soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and spatial location data of candidate pollution sources are normalized using min-max normalization to remove units and bring all fields to a common scale.
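The S11 and S12 preprocessing steps can be illustrated with a minimal sketch. The function names are hypothetical and not part of the patent. The azimuth convention (counter-clockwise from the +X axis, computed with atan2 rather than a plain arctangent) is an assumption, since the azimuth formula is not available in the text. Missing-value filling and low-pass filtering are left out.

```python
import math
import statistics

def source_geometry(xs, ys, xp, yp):
    """Distance and direction from a candidate source (xs, ys) to the sampling point (xp, yp)."""
    dx, dy = xp - xs, yp - ys
    distance = math.hypot(dx, dy)          # Euclidean distance L_s
    # Assumption: azimuth in radians, counter-clockwise from the +X axis.
    azimuth = math.atan2(dy, dx)
    # Directional components avoid the jump where the angle wraps around.
    components = (math.cos(azimuth), math.sin(azimuth))
    return distance, azimuth, components

def three_sigma_mask(values):
    """True for values within three standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [abs(v - mean) <= 3 * sd for v in values]

def min_max(values):
    """Min-max normalization to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Storing the cosine and sine of the azimuth instead of the raw angle is one common way to realize the "directional components" mentioned in S12, because two nearly identical directions on either side of the wrap-around then stay numerically close.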

[0009] Preferably, S2 includes S21; S21. Read the measured content C(m,t) of the m-th heavy metal in the t-th sampling period of the soil heavy metal time series data table, and perform cumulative calculation on the content of all heavy metals in the same sampling period to obtain the total metal content CΣ(t) of the sampling period. Divide the measured content C(m,t) of the m-th heavy metal in the t-th sampling period by the total metal content CΣ(t) in the sampling period to obtain the proportion P(m,t) of the m-th heavy metal in the t-th sampling period. The result reflects the compositional position of a certain element within the sampling period, rather than its absolute content.

[0010] For example, when the absolute content of a certain heavy metal changes in different sampling periods, it is not convenient to judge whether its role in the pollution input structure has changed by simply comparing the absolute value; however, the composition share of the element in that period can be directly determined by P(m,t). The contents of all heavy metals within the same sampling period are sorted according to their numerical values. The ranking position O(m,t) of the m-th heavy metal element in the t-th sampling period is obtained using the following formula: O(m,t) = rank(C(m,t)); where rank represents the sorting operation.
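As an illustration of S21, the following sketch computes the proportion P(m,t) and the ranking position O(m,t) for one sampling period. The function names are hypothetical. The ranking direction (position 1 for the highest content) and the tie-breaking rule are assumptions, because the text only states that the contents are sorted by value.

```python
def proportions(contents):
    """P(m,t): share of each heavy metal in the period's total content."""
    total = sum(contents)
    return [c / total for c in contents]

def ranks(contents):
    """O(m,t): ranking position of each heavy metal within one period.

    Assumption: position 1 is the highest content; ties are broken by element index.
    """
    order = sorted(range(len(contents)), key=lambda m: (-contents[m], m))
    result = [0] * len(contents)
    for position, m in enumerate(order, start=1):
        result[m] = position
    return result

# One sampling period with three elements, e.g. Pb, Cd, Hg (illustrative values).
period = [30.0, 10.0, 60.0]
print(proportions(period))
print(ranks(period))
```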

[0011] Preferably, S2 also includes S22; S22. Read the content ranking positions O(m,t) and O(m,t-1) of the m-th heavy metal element in the t-th and (t-1)-th sampling periods in order of sampling time; take the absolute value of the difference in ranking position between adjacent periods for each heavy metal and sum over all heavy metals to obtain the ranking change Dt of the t-th sampling period relative to the (t-1)-th sampling period. The formula is as follows:

$$D_t = \sum_{m=1}^{M} \left| O(m,t) - O(m,t-1) \right|$$

In the formula, M represents the total number of heavy metal species, and O(m,t-1) represents the ranking of the m-th heavy metal element in the (t-1)-th sampling period. If Dt is small, the order of importance of the elements in the two adjacent periods is close. If Dt is large, the composition of the pollution has been significantly reordered around this period.

Read the heavy metal contents C(m,t), C(m,t-1), C(m,t-2) and the corresponding time intervals ΔTt, ΔTt-1 for three consecutive sampling periods. Calculate the rate of change over each of the two adjacent intervals and take the difference of the two rates to obtain the slope difference K(m,t) of the m-th heavy metal element in the t-th sampling period. The formula is as follows:

$$K(m,t) = \frac{C(m,t) - C(m,t-1)}{\Delta T_t} - \frac{C(m,t-1) - C(m,t-2)}{\Delta T_{t-1}}$$

In the formula, ΔTt represents the actual time interval between the t-th sampling period and the (t-1)-th sampling period, and ΔTt-1 represents the actual time interval between the (t-1)-th sampling period and the (t-2)-th sampling period. The sedimentation structure sequence of the target area is constructed by combining, in chronological order, the proportion P(m,t) of the m-th heavy metal in the t-th sampling period, the content ranking position O(m,t) of the m-th heavy metal element in the t-th sampling period, the ranking change Dt of the t-th sampling period relative to the (t-1)-th sampling period, and the slope difference K(m,t) of the m-th heavy metal element in the t-th sampling period.
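The two S22 quantities follow directly from the prose: Dt sums the absolute rank differences between adjacent periods, and K(m,t) is the difference between two consecutive rates of change. A minimal sketch, with hypothetical function names and illustrative numbers:

```python
def ranking_change(ranks_t, ranks_prev):
    """D_t: sum over elements of |O(m,t) - O(m,t-1)|."""
    return sum(abs(a - b) for a, b in zip(ranks_t, ranks_prev))

def slope_difference(c_t, c_t1, c_t2, dt_t, dt_t1):
    """K(m,t): rate of change over the latest interval minus the rate over the previous one.

    c_t, c_t1, c_t2 are C(m,t), C(m,t-1), C(m,t-2);
    dt_t and dt_t1 are the intervals ending at periods t and t-1.
    """
    return (c_t - c_t1) / dt_t - (c_t1 - c_t2) / dt_t1

# Two elements swap positions between periods: D_t = 1 + 0 + 1.
print(ranking_change([2, 3, 1], [1, 3, 2]))
# Content rises by 8 over 2 time units, after rising by 2 over 1 unit.
print(slope_difference(20.0, 12.0, 10.0, 2.0, 1.0))
```

Dividing by the actual interval lengths keeps K(m,t) meaningful when sampling periods are unevenly spaced, which is why ΔTt is recorded in S11.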

[0012] Preferably, S3 includes S31; S31. Read historical samples with known sources and collect the soil heavy metal time series samples corresponding to the pollution source categories identified in the historical samples. Using the calculation steps in S2, calculate for each group of historical samples the proportion (s,n)P(m,t) of the m-th heavy metal in the t-th sampling period, the ranking change (s,n)Dt of the t-th sampling period relative to the (t-1)-th sampling period, and the slope difference (s,n)K(m,t) of the m-th heavy metal in the t-th sampling period, where s represents the candidate pollution source category number and n represents the n-th known source sample under that category. After obtaining the sedimentation structure sequence of each group of historical samples, the spatial distance Ls from the s-th candidate pollution source to the target sampling point and the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point are read and written into the template construction process as the spatial location item of that candidate pollution source category. For all historical samples under the same candidate pollution source category, the proportion sequence, the ranking change sequence, and the slope difference sequence are each averaged over the samples to generate the proportion template sP(m,t), the ranking change template sDt, and the slope difference template sK(m,t) of the s-th candidate pollution source. Together with the direction deviation template term sAt and the distance conversion amount sL, they form the candidate pollution source sedimentation structure sequence template.
The direction deviation template term sAt is obtained using the following formula: [formula not reproduced in the source text]; in the formula, π represents the mathematical constant pi, and PWt represents the prevailing wind direction angle during the t-th sampling period. The distance conversion amount sL is obtained using the following formula: [formula not reproduced in the source text]; in the formula, max(r=1,2,...,S)Lr represents the maximum spatial distance among all candidate pollution sources, and S represents the total number of candidate pollution source categories.
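The template construction in S31 can be sketched as an element-wise average over the historical samples of one category. The function names are hypothetical. The distance conversion shown here (each distance divided by the largest distance among the candidates) is only an assumed form suggested by the max term in the description; the patent's actual formula is not available in the text, and the direction deviation term sAt is not attempted.

```python
def average_template(samples):
    """Element-wise mean over the samples of one category.

    Each sample is a list of T rows (sampling periods) with M values (heavy metals).
    """
    n = len(samples)
    periods = len(samples[0])
    metals = len(samples[0][0])
    return [
        [sum(s[t][m] for s in samples) / n for m in range(metals)]
        for t in range(periods)
    ]

def distance_conversion(distances):
    """Assumed form of the distance conversion: L_s divided by the largest L_r."""
    longest = max(distances)
    return [d / longest for d in distances]

# Two historical proportion sequences for one source category (illustrative values).
sample_a = [[0.2, 0.8], [0.4, 0.6]]
sample_b = [[0.4, 0.6], [0.6, 0.4]]
print(average_template([sample_a, sample_b]))
print(distance_conversion([2.0, 4.0, 8.0]))
```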

[0013] Preferably, S3 also includes S32; S32. Assemble the obtained candidate pollution source sedimentation structure sequence templates into a training sample set according to the category number, and label each training sample vector with the corresponding candidate pollution source category identifier. All training sample vectors under the same candidate pollution source category are categorized and aggregated, and the concentrated positions of the category training samples in the feature space are extracted to generate the corresponding category centers, which are used to represent the typical sedimentation structure characteristics of the candidate pollution source category. The degree of difference between each training sample vector and its class center is calculated to obtain the intra-class bias, which characterizes the degree of consistency between the training sample and the typical structure of the class. Perform pairwise difference comparisons on the class centers of different candidate pollution source categories to obtain the inter-class separation measure, which characterizes the separation state of different candidate pollution source categories in the feature space; The training decision metric is constructed based on the correspondence between intra-class bias and inter-class separation for each category. The current training state is evaluated, and the model training process ends when the training decision metric meets the preset training conditions. After the training parameters meet the preset training conditions, the training parameters of the current machine learning model are saved to obtain the trained pollution source identification model. This model receives the sedimentation structure sequence and spatial location data of the area to be identified and outputs the corresponding candidate pollution source categories. 
In the above description, the training sample vector not only contains the temporal features of the sedimentation structure sequence, but also the spatial distance and directional relationship information between the candidate pollution source and the sampling point, so that the trained pollution source identification model can simultaneously learn the pattern of pollution input structure change over time and the spatial relationship features between the pollution source and the sampling point.
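The training criterion in S32 is described only in terms of class centers, intra-class bias, and inter-class separation. The sketch below is one possible reading, not the patent's model: it assumes the class center is the mean vector, differences are Euclidean distances, and the training decision metric is the ratio of the smallest between-class distance to the largest within-class spread. All function names are hypothetical.

```python
import math
from collections import defaultdict

def class_centers(vectors, labels):
    """Mean vector of the training samples of each category."""
    groups = defaultdict(list)
    for v, y in zip(vectors, labels):
        groups[y].append(v)
    return {y: [sum(col) / len(vs) for col in zip(*vs)] for y, vs in groups.items()}

def intra_class_bias(vectors, labels, centers):
    """Average distance of each category's samples to their own center."""
    dists = defaultdict(list)
    for v, y in zip(vectors, labels):
        dists[y].append(math.dist(v, centers[y]))
    return {y: sum(d) / len(d) for y, d in dists.items()}

def inter_class_separation(centers):
    """Pairwise distance between the centers of different categories."""
    keys = sorted(centers)
    return {
        (a, b): math.dist(centers[a], centers[b])
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
    }

def training_metric(intra, inter):
    """Assumed decision metric: smallest separation over largest within-class spread."""
    return min(inter.values()) / max(max(intra.values()), 1e-12)
```

Under this reading, training would stop once `training_metric` exceeds a preset value, meaning that every pair of categories is separated by more than the spread within any single category.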

[0014] Preferably, S4 includes S41; S41. Input the sedimentation structure sequence of the sample to be judged into the pollution source identification model to obtain the set of candidate pollution source matching results corresponding to the sample to be judged. The proportion template sP(m,t), ranking change template sDt, and slope difference template sK(m,t) in the sedimentation structure sequence template of each candidate pollution source are read sequentially. The sedimentation structure sequence of the sample to be judged is compared with the sedimentation structure sequence template of each candidate pollution source period by period and parameter by parameter to obtain the structural deviation GWs of the s-th candidate pollution source. The formula is as follows: [formula not reproduced in the source text]. The structural deviation GWs of the s-th candidate pollution source is compared with the set deviation threshold Tgw to determine the overall degree of deviation. When the structural deviation GWs ≤ deviation threshold Tgw, the sample to be judged is close to the candidate pollution source and the degree of deviation is low. When the structural deviation GWs > deviation threshold Tgw, the degree of deviation is abnormal.

[0015] The deviation threshold Tgw is obtained using the following formula: Tgw = μGWs + ka × σGWs; where μGWs represents the mean of the structural deviation of the s-th candidate pollution source, σGWs represents the standard deviation of the structural deviation of the s-th candidate pollution source, and ka represents the adjustment coefficient, which ranges from 0.2 to 0.5.

Preferably, S4 also includes S42; S42. Sort the structural deviations GWs corresponding to all candidate pollution sources in ascending order, extract the smallest structural deviation GW(1) as the optimal candidate source deviation value, and extract the second smallest structural deviation GW(2) as the second-best candidate source deviation value. The preliminary source discrimination index QG is calculated based on the interval relationship between the optimal candidate source deviation value and the second-best candidate source deviation value. The preliminary source discrimination index QG is obtained using the following formula: [formula not reproduced in the source text]. Based on the preliminary source discrimination index QG, determine whether the current preliminary source tracing result has a clear boundary, as follows: When the preliminary source discrimination index QG ≥ the discrimination threshold Qref, the difference between the optimal candidate pollution source and the other candidate pollution sources has reached the reference discrimination level, and the optimal candidate pollution source is output as the preliminary source tracing result. When the preliminary source discrimination index QG < the discrimination threshold Qref, the difference between the optimal candidate pollution source and the other candidate pollution sources is insufficient, suggesting that multiple candidate pollution sources are close to the sample to be judged, and the result is marked as to be reviewed.
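The threshold Tgw = μGWs + ka × σGWs and the best versus second-best comparison in S42 can be sketched as follows. The function names and source labels are hypothetical. The relative-gap form used for QG, (GW(2) - GW(1)) / GW(2), is an assumption chosen because it lies between 0 and 1; the patent's actual QG formula is not available in the text. Using the population standard deviation for σGWs is likewise an assumption.

```python
import statistics

def deviation_threshold(historical_deviations, ka=0.3):
    """T_gw = mean + ka * standard deviation, with ka in the stated 0.2 to 0.5 range."""
    mu = statistics.fmean(historical_deviations)
    sigma = statistics.pstdev(historical_deviations)  # assumption: population std
    return mu + ka * sigma

def discrimination_index(deviations):
    """Return the best-matching source and an assumed relative-gap form of QG.

    deviations maps a candidate source label to its structural deviation GW_s.
    """
    ordered = sorted(deviations.items(), key=lambda item: item[1])
    (best, gw1), (_, gw2) = ordered[0], ordered[1]
    qg = 0.0 if gw2 == 0 else (gw2 - gw1) / gw2
    return best, qg

best, qg = discrimination_index({"smelter": 0.2, "traffic": 0.8, "coal": 0.5})
print(best, qg)
```

A QG close to 0 means the two closest candidates are almost equally good matches, which is the situation the method routes to the path-based verification in S5.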

[0016] The discrimination threshold Qref is obtained as follows: First, calculate the preliminary source discrimination index for all historical samples with known sources in sequence according to the method in step S4, forming a sample set of discrimination index values. Then, perform statistical distribution analysis on this sample set, sort the discrimination index values of the historical samples, and extract the value in the middle of the sample distribution as the discrimination threshold Qref. In this way, the reference value is derived from the statistical results of historical samples; it represents the typical discrimination level of distinguishable cases in the historical tracing results and serves as the basis for judging whether the preliminary tracing result of a subsequent sample to be judged is clearly distinguishable. The discrimination threshold Qref is between 0.1 and 0.6.

Preferably, S5 includes S51 and S52; S51. When the preliminary source discrimination index QG < the discrimination threshold Qref, the structural differences between multiple candidate pollution sources are not obvious and further verification through atmospheric transport path relationships is needed. At this point, the spatial distance and azimuth in the candidate pollution source spatial location data are read, and the meteorological data for the corresponding sampling period are read, including the prevailing wind direction angle, average wind speed, cumulative rainfall, and average relative humidity. The relationship between the prevailing wind direction and the direction of the candidate pollution source pointing to the sampling point is calculated for each sampling period and corrected in combination with meteorological conditions to obtain the path deviation Bs corresponding to the s-th candidate pollution source. The formula is as follows: [formula not reproduced in the source text]. The consistency of the path is determined from the path deviation Bs as follows: When the path deviation Bs > 0.3, the path status is uncertain and the deviation is abnormal. When the path deviation Bs ≤ 0.3, the wind direction is consistent with the direction of the candidate pollution source, and the pollutants are transported to the sampling area along the path.

S52. The obtained path deviation Bs, structural deviation GWs, and distance conversion amount sL are combined to calculate the final verification value Fs of each candidate pollution source. The final verification values Fs of all candidate pollution sources are sorted in ascending order, and the candidate pollution source with the smallest final verification value is extracted as the atmospheric source of soil heavy metal pollution in the target area. The final verification value Fs is obtained using the following formula: [formula not reproduced in the source text]. The extraction formula is as follows:

$$S^{\#} = \arg\min_{s} F_s$$

In the formula, S# represents the number of the finally determined pollution source, and argmin returns the value of s that minimizes Fs.
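The decision flow around Qref and the final selection can be sketched without committing to the formulas for Bs and Fs, which are not available in the text. In this sketch the final verification values are taken as given inputs, the "middle of the sample distribution" is read as the median, and all function names and source labels are hypothetical.

```python
import statistics

def discrimination_threshold(historical_qg):
    """Q_ref read as the median of the QG values of historical samples with known sources."""
    return statistics.median(historical_qg)

def final_source(best, qg, qref, final_values):
    """Return the preliminary result when it is clearly separated, else the argmin of F_s.

    final_values maps a candidate source label to its final verification value F_s,
    computed elsewhere from B_s, GW_s, and sL.
    """
    if qg >= qref:
        return best
    return min(final_values, key=final_values.get)

qref = discrimination_threshold([0.1, 0.4, 0.3])
print(qref)
print(final_source("smelter", 0.5, qref, {"smelter": 0.9, "coal": 0.2}))
print(final_source("smelter", 0.1, qref, {"smelter": 0.9, "coal": 0.2}))
```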

[0017] This invention provides a machine learning-based method for tracing the atmospheric sources of heavy metal pollution in soil, which has the following beneficial effects: (1) By organizing the soil heavy metal content data obtained from the same sampling point in multiple sampling periods in a time sequence, and combining the meteorological conditions and spatial location data of candidate pollution sources to establish a soil heavy metal time series data table, the soil sample data is transformed from the original single detection record into a data structure with continuous time relationship, so as to reflect the changes in pollution deposition at different time stages during the analysis process. By calculating the proportion of heavy metal content in each sampling period, sorting the elements, and analyzing the slope difference between adjacent periods, a sedimentation structure sequence is constructed. This allows for a structured representation of the trajectories of different heavy metal elements over time, enabling the identification of the changing characteristics of different pollution sources during the sedimentation process. Furthermore, a sedimentation structure sequence template for candidate pollution sources is constructed using historical samples with known sources. A machine learning model is then used to train the sedimentation structure change patterns of different sources, enabling the model to learn the evolution patterns of various pollution sources in the time series. After inputting samples to be judged, the model can perform matching analysis on each candidate pollution source based on the structural deviation.

[0018] (2) By calculating the proportion and sorting the elements of heavy metal content in each sampling period in the soil heavy metal time series data table, the detection data that originally existed in the form of a single concentration value is transformed into a data expression form that can reflect the relative structural relationship between each heavy metal, so that the composition relationship of different elements in the same sampling period can be uniformly described; at the same time, by calculating the sorting changes between adjacent sampling periods and the difference between the rate of change of heavy metal content, the change process of soil heavy metals in the time dimension can be represented in a structured way, thereby constructing a sedimentation structure sequence that reflects the trajectory of pollution sedimentation, so that the monitoring data can not only reflect the pollution state at a certain moment, but also reflect the change characteristics of different pollution sources in the long-term sedimentation process.

[0019] (3) By sorting the structural deviations of all candidate pollution sources, the difference between the best and second-best candidate sources is extracted, and the preliminary source tracing discrimination is calculated. This enables the pollution source identification process to not only provide the closest candidate source, but also to judge the degree of difference between the result and other candidate sources, so that the model output results have comparable discrimination criteria. On this basis, the preliminary source tracing results are judged by setting discrimination judgment conditions. When the difference between the best candidate source and other candidate sources reaches the reference discrimination level, the preliminary source tracing results are output. When the degree of difference is insufficient, it is marked as pending verification. This enables the pollution source identification process to identify the situation where multiple candidate pollution sources are close, and avoids directly outputting results when the source is not clearly distinguishable.

[0020] (4) By calculating the path deviation of different candidate pollution sources, the atmospheric transport path status between each candidate pollution source and the sampling area can be expressed in a unified way, thereby identifying whether the wind direction is consistent with the pollution source direction and reflecting the possible transport path of pollutants under airflow conditions. On this basis, the path deviation, the structural deviation generated by the sedimentation structure characteristics, and the spatial distance relationship between the pollution source and the sampling point are calculated together, so that the pollution source determination process considers the pollution sedimentation structure characteristics, the atmospheric transport path relationship, and the spatial location relationship at the same time, and the role of each type of information is reflected uniformly in the same determination process.

Attached Figure Description

[0021] Figure 1 is a schematic diagram of the steps of the machine learning-based method for tracing atmospheric sources of heavy metal pollution in soil according to the present invention; Figure 2 is a schematic diagram of the final structure acquisition process of the present invention; Figure 3 is a schematic diagram of the process for obtaining the path deviation of candidate pollution sources according to the present invention.

Detailed Implementation

[0022] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0023] Example 1. This invention provides a machine learning-based method for tracing the atmospheric sources of heavy metal pollution in soil. Referring to Figures 1 to 3, the method includes the following steps:
S1. Obtain soil heavy metal content data, meteorological data, and spatial location data of candidate pollution sources at the same sampling point in the target area during multiple sampling periods, and establish a soil heavy metal time series data table.
S2. Perform percentage calculation, element sorting, and slope difference calculation on the heavy metal content of each sampling period in the soil heavy metal time series data table to construct the sedimentation structure sequence of the target area.
S3. Construct candidate pollution source sedimentation structure sequence templates for historical samples with known sources in the same manner as step S2, and input them into the machine learning model for training to obtain a trained pollution source identification model.
S4. Input the sedimentation structure sequence of the sample to be judged into the pollution source identification model, and calculate the structural deviation GWs of each candidate pollution source in combination with the candidate pollution source sedimentation structure sequence template. Output the preliminary source discrimination index QG based on the structural deviation GWs.
S5. Based on the preliminary source discrimination index QG, calculate the path deviation Bs and final verification value Fs for each candidate pollution source, and output the atmospheric source results of soil heavy metal pollution in the target area according to the final verification value Fs.

[0024] In this embodiment, by organizing the soil heavy metal content data obtained from the same sampling point at multiple sampling periods in a time sequence, and combining it with meteorological conditions and spatial location data of candidate pollution sources to establish a soil heavy metal time series data table, the soil sample data is transformed from the original single detection record into a data structure with continuous time relationship, thereby reflecting the changes in pollution deposition at different time stages during the analysis process. By calculating the proportion of heavy metal content in each sampling period, sorting the elements, and analyzing the slope difference between adjacent periods, a sedimentation structure sequence is constructed. This allows for a structured representation of the change trajectories of different heavy metal elements over time, thus enabling the identification of the change characteristics of different pollution sources during the sedimentation process. Furthermore, a sedimentation structure sequence template for candidate pollution sources is constructed using historical samples with known sources. A machine learning model is then used to train the sedimentation structure change patterns of different sources, enabling the model to learn the evolution patterns of various pollution sources in the time series. After inputting samples to be judged, the model can perform matching analysis on each candidate pollution source based on the structural deviation. After obtaining the preliminary source tracing results, the difference between the optimal candidate source and other candidate sources is judged by calculating the preliminary source tracing discrimination. When the discrimination is insufficient, meteorological conditions and spatial path relationships are introduced to calculate the path deviation. 
Then, the structural deviation is combined for comprehensive verification, thus forming a comprehensive judgment mechanism based on sedimentation structure characteristics, transmission path relationships and spatial location relationships. This provides a more complete data basis for the identification of soil heavy metal pollution sources and helps to distinguish and analyze pollution sources when there are multiple potential pollution sources.

[0025] Example 2. Referring to Figure 1, specifically: S1 includes S11. S11. Use the data acquisition device to read soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and spatial location data of candidate pollution sources at the same sampling point during multiple sampling periods. The soil heavy metal content data are obtained by heavy metal analysis instruments and include the measured content C(m,t) of the m-th heavy metal in the t-th sampling period, the total number of heavy metal species M, and the total number of sampling periods T. The sampling time interval data ΔTt represent the actual time interval between the t-th sampling period and the (t-1)-th sampling period. The meteorological data are collected by the wind speed sensors, wind direction sensors, rain gauges, and humidity sensors of automatic weather monitoring stations and include the cumulative rainfall Rt, the average wind speed Ut, the prevailing wind direction angle PWt, and the average relative humidity Ht in the t-th sampling period. The spatial location data of the candidate pollution sources are acquired by GNSS positioning equipment and include the spatial distance Ls from the s-th candidate pollution source to the target sampling point and the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point.

[0026] S1 also includes S12. S12. The acquired soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and candidate pollution source spatial location data are checked period by period to identify data items that fall outside the detection record boundaries, are duplicated, have missing key items, have inconsistent units, or are out of order. The soil heavy metal content data items, meteorological data items, and candidate pollution source spatial location data items are then standardized to the same scale for similar fields and spliced in sampling time order to establish the soil heavy metal time series data table. The soil heavy metal content data are preprocessed with a low-pass filter to remove noise. Anomaly detection is performed on the meteorological data using the three-standard-deviation method; outliers are removed and missing values are filled in. The spatial location data of candidate pollution sources are decomposed into directional components to remove the influence of spatial units and to prevent jumps in angle calculation. The soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and spatial location data of candidate pollution sources are normalized using min-max normalization so that all fields are dimensionless and share a common scale.
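The preprocessing operations named in S12 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names are hypothetical, filling removed values with the mean of the remaining values is an assumed imputation rule (the text does not state one), and the sine/cosine pair is one common way to decompose an angle into directional components so that 359° and 1° end up close together.

```python
import math

def three_sigma_filter(values):
    """Mark values farther than 3 standard deviations from the mean as None."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v if std == 0 or abs(v - mean) <= 3 * std else None for v in values]

def fill_missing(values):
    """Fill None entries with the mean of the remaining values (assumed rule)."""
    kept = [v for v in values if v is not None]
    mean = sum(kept) / len(kept)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def direction_components(angle_deg):
    """Decompose an angle in degrees into (sin, cos) components."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)

# Hypothetical rainfall series with one implausible reading at the end.
rainfall = [1.0] * 19 + [100.0]
cleaned = fill_missing(three_sigma_filter(rainfall))
scaled = min_max([2.0, 4.0, 6.0])
```

Note that the three-standard-deviation rule needs a reasonably long series: with very few points, a single outlier inflates the standard deviation so much that it can never exceed three deviations from the mean.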

[0027] In this embodiment, during the data acquisition stage, the soil heavy metal content data, meteorological data, and spatial location data of candidate pollution sources at multiple sampling periods at the same sampling point are collected in a unified manner by the acquisition device. This enables the data that were originally scattered in different monitoring systems to be integrated and recorded in the same time series framework, thereby transforming the soil sample detection data from a single sampling record into a monitoring data structure with time continuity, which is beneficial to reflecting the changes in soil heavy metal deposition at different time stages. During data processing, the completeness and sequence consistency of the collected data are checked by checking each time period. This allows for the identification and rectification of duplicate entries, missing key fields, and misaligned sequences in the detection records before data modeling, ensuring consistency in the data sources used for subsequent analysis. Simultaneously, a low-pass filter processor is used to preprocess the soil heavy metal content data, and meteorological data is processed using anomaly detection and missing value imputation methods. This corrects abnormal records in the monitoring data caused by environmental disturbances or equipment fluctuations, thereby maintaining a stable trend in the time series data during continuous changes. By performing directional component decomposition on the spatial location data of candidate pollution sources and normalizing all types of data to a uniform scale, the data from different sources maintain a consistent data representation during the time series combination process, avoiding the impact of differences in dimensions or angle calculation jumps. 
This results in a unified soil heavy metal time series data table, providing a continuous and comparable data foundation for subsequent sedimentation structure sequence construction and pollution source analysis.

[0028] Example 3. Referring to Figure 1, specifically: S2 includes S21. S21. Read the measured content C(m,t) of the m-th heavy metal in the t-th sampling period from the soil heavy metal time series data table, and sum the contents of all heavy metals in the same sampling period to obtain the total metal content CΣ(t) of that sampling period. Divide the measured content C(m,t) of the m-th heavy metal in the t-th sampling period by the total metal content CΣ(t) of the sampling period to obtain the proportion P(m,t) of the m-th heavy metal in the t-th sampling period. The contents of all heavy metals within the same sampling period are ranked by numerical value. The ranking position O(m,t) of the m-th heavy metal element in the t-th sampling period is obtained using the following formula: O(m,t) = rank(C(m,t)); where rank represents the ranking operation.
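As an illustration of S21, the following Python sketch computes the proportions P(m,t) and ranking positions O(m,t) for a single sampling period. The function names and the sample values are hypothetical. The text does not say whether rank 1 is the largest or the smallest content, so descending order (rank 1 = largest) is assumed here, with ties broken by order of appearance.

```python
def proportions(contents):
    """P(m,t) = C(m,t) / C_sum(t) for one sampling period."""
    total = sum(contents)
    return [c / total for c in contents]

def ranks(contents):
    """O(m,t): rank 1 is the largest content (assumed convention)."""
    order = sorted(range(len(contents)), key=lambda i: -contents[i])
    rank = [0] * len(contents)
    for position, idx in enumerate(order, start=1):
        rank[idx] = position
    return rank

# Hypothetical contents of three heavy metals in one sampling period.
period = [30.0, 50.0, 20.0]
p = proportions(period)
o = ranks(period)
```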

[0029] S2 also includes S22. S22. Read, in chronological order, the content ranking positions O(m,t) and O(m,t-1) of the m-th heavy metal element in the t-th and (t-1)-th sampling periods; take the absolute value of the difference in ranking position between adjacent periods for each heavy metal and sum over all heavy metals to obtain the ranking change Dt of the t-th sampling period relative to the (t-1)-th sampling period. The formula is as follows: Dt = Σ(m=1,2,...,M) |O(m,t) - O(m,t-1)|; In the formula, M represents the total number of heavy metal species, and O(m,t-1) represents the ranking position of the m-th heavy metal element in the (t-1)-th sampling period. Read the heavy metal contents C(m,t), C(m,t-1), and C(m,t-2) for three consecutive sampling periods and the corresponding time intervals ΔTt and ΔTt-1. Calculate the rate of change over each of the two adjacent intervals and take the difference of the two rates to obtain the slope difference K(m,t) of the m-th heavy metal element in the t-th sampling period. The formula is as follows: K(m,t) = [C(m,t) - C(m,t-1)] / ΔTt - [C(m,t-1) - C(m,t-2)] / ΔTt-1; In the formula, ΔTt represents the actual time interval between the t-th sampling period and the (t-1)-th sampling period, and ΔTt-1 represents the actual time interval between the (t-1)-th sampling period and the (t-2)-th sampling period. The sedimentation structure sequence of the target area is constructed by combining, in chronological order, the proportion P(m,t) of the m-th heavy metal in the t-th sampling period, the content ranking position O(m,t) of the m-th heavy metal element in the t-th sampling period, the ranking change Dt of the t-th sampling period relative to the (t-1)-th sampling period, and the slope difference K(m,t) of the m-th heavy metal element in the t-th sampling period.
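The two quantities in S22 can be sketched in Python as follows. This reads the description literally: the ranking change is the sum over all heavy metals of the absolute difference in ranking position between adjacent periods, and the slope difference is the rate of change over the latest interval minus the rate of change over the preceding interval. The function and argument names are hypothetical.

```python
def rank_change(rank_now, rank_prev):
    """D_t: summed absolute change in ranking position across all metals."""
    return sum(abs(a - b) for a, b in zip(rank_now, rank_prev))

def slope_difference(c_t, c_t1, c_t2, dt_t, dt_t1):
    """K(m,t): latest rate of change minus the previous rate of change.

    c_t, c_t1, c_t2 are the contents in periods t, t-1 and t-2;
    dt_t is the interval between t-1 and t, dt_t1 between t-2 and t-1.
    """
    return (c_t - c_t1) / dt_t - (c_t1 - c_t2) / dt_t1

# Two metals swap places between periods: each moves one position.
d = rank_change([2, 1, 3], [1, 2, 3])
# Content rises by 2 over 1 time unit, then by 8 over 2 time units.
k = slope_difference(20.0, 12.0, 10.0, 2.0, 1.0)
```

A positive K(m,t) means accumulation of that metal is accelerating, and a negative value means it is slowing down; dividing by the actual intervals keeps the comparison meaningful when sampling is irregular.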

[0030] S3 includes S31; S31. Read historical samples with known sources and collect soil heavy metal time series samples corresponding to the pollution source categories identified in the historical samples; using the calculation steps in S2, calculate the proportion (s,n)P(m,t) of the m-th heavy metal in the t-th sampling period, the ranking change (s,n)Dt of the t-th sampling period relative to the (t-1)-th sampling period, and the difference in the slope of the m-th heavy metal change in the t-th sampling period (s,n)K(m,t) for each group of historical samples. Where s represents the candidate pollution source category number, and n represents the nth known source sample under the category; After obtaining the sedimentation structure sequence of each group of historical samples, the spatial distance Ls from the s-th candidate pollution source to the target sampling point and the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point are read and written into the template construction process as the spatial location item of the category candidate pollution source. For all historical samples under the same candidate pollution source category, the percentage sequence, the ranking change sequence, and the change slope difference sequence are respectively processed by averaging the samples to generate the percentage template sP(m,t), the ranking change template sDt, and the slope difference template sK(m,t) of the s-th candidate pollution source. Together with the direction deviation template term sAt and the distance conversion amount sL, they form the candidate pollution source sedimentation structure sequence template. 
The orientation deviation template term sAt is obtained using the following formula: ; In the formula, π represents the mathematical constant pi, and PWt represents the prevailing wind direction angle during the t-th sampling period; The distance conversion value sL is obtained using the following formula: ; In the formula, max(r=1,2,...,S)Lr represents the maximum spatial distance among all candidate pollution sources, and S represents the total number of candidate pollution source categories.

[0031] S3 also includes S32; S32. Assemble the obtained candidate pollution source sedimentation structure sequence templates into a training sample set according to the category number, and label each training sample vector with the corresponding candidate pollution source category identifier. All training sample vectors under the same candidate pollution source category are categorized and aggregated, and the concentrated positions of the category training samples in the feature space are extracted to generate the corresponding category centers, which are used to represent the typical sedimentation structure characteristics of the candidate pollution source category. The degree of difference between each training sample vector and its class center is calculated to obtain the intra-class bias, which characterizes the degree of consistency between the training sample and the typical structure of the class. Perform pairwise difference comparisons on the class centers of different candidate pollution source categories to obtain the inter-class separation measure, which characterizes the separation state of different candidate pollution source categories in the feature space; The training decision metric is constructed based on the correspondence between intra-class bias and inter-class separation for each category. The current training state is evaluated, and the model training process ends when the training decision metric meets the preset training conditions. After the training decision quantity meets the preset training conditions, the training parameters of the current machine learning model are saved to obtain the trained pollution source identification model.
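The quantities described in S32 can be illustrated with a small Python sketch. The text does not specify the distance measure, how the class center is extracted, or the exact form of the training decision metric, so everything concrete here is an assumption: the class center is the arithmetic mean of the class's vectors, the degree of difference is Euclidean distance, and the decision metric is the largest mean intra-class deviation divided by the smallest distance between class centers (smaller means tighter, better separated classes). All names are hypothetical.

```python
import math

def center(vectors):
    """Class center as the element-wise mean of the class's vectors (assumed)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def training_metric(samples_by_class):
    """Return (metric, centers); requires at least two distinct classes."""
    centers = {s: center(vs) for s, vs in samples_by_class.items()}
    # Mean deviation of each class's samples from their own center.
    intra = {
        s: sum(distance(v, centers[s]) for v in vs) / len(vs)
        for s, vs in samples_by_class.items()
    }
    # Smallest pairwise distance between centers of different classes.
    labels = sorted(centers)
    inter = min(
        distance(centers[a], centers[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    )
    return max(intra.values()) / inter, centers

# Hypothetical two-feature training vectors for two source categories.
samples = {1: [[0.0, 0.0], [0.0, 2.0]], 2: [[4.0, 0.0], [4.0, 2.0]]}
metric, centers = training_metric(samples)
```

Under these assumptions, training would stop once the metric falls below a preset value, which corresponds to the "preset training conditions" mentioned in the text.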

[0032] In this embodiment, by calculating the proportion and sorting the heavy metal content of each sampling period in the soil heavy metal time series data table, the detection data, which originally existed in the form of a single concentration value, is transformed into a data expression form that can reflect the relative structural relationship between each heavy metal. This allows the compositional relationship of different elements in the same sampling period to be uniformly described. At the same time, by calculating the sorting changes between adjacent sampling periods and the differences in the rate of change of heavy metal content, the change process of soil heavy metals in the time dimension can be represented in a structured way. This constructs a sedimentation structure sequence that reflects the trajectory of pollution deposition, so that the monitoring data can not only reflect the pollution state at a certain moment, but also reflect the change characteristics of different pollution sources in the long-term deposition process.

[0033] By reading historical samples from known sources and constructing sedimentation structure sequences in the same manner, sequence statistical processing is performed on historical samples under the same pollution source category to form sedimentation structure sequence templates for the corresponding pollution source category. This allows typical sedimentation patterns of various pollution sources during the time-varying process to be extracted and preserved. At the same time, the spatial distance and directional relationship between candidate pollution sources and sampling points are written into the template construction process, so that sedimentation structure features and spatial location relationships of pollution sources are expressed in the same data structure. This ensures that pollution source features not only include time-series change information but also spatial information related to atmospheric transport paths.

[0034] By assembling various candidate pollution source sedimentation structure sequence templates into a training sample set, and using a machine learning model to learn the sequence features of different pollution source categories, the model can identify the change patterns of different pollution sources in the sedimentation structure sequence. During the training process, by comprehensively analyzing the concentration of similar samples and the separation between different categories, the model can form a classification boundary with discriminative ability. Thus, when inputting unknown samples, it can identify and analyze candidate pollution sources based on sedimentation structure sequence features, transforming the process of judging soil heavy metal pollution sources from relying on single concentration comparisons to an analysis method based on time series change features.

[0035] Example 4. Referring to Figure 1, specifically: S4 includes S41. S41. Input the sedimentation structure sequence of the sample to be judged into the pollution source identification model to obtain the set of candidate pollution source matching results corresponding to the sample to be judged. The proportion template sP(m,t), ranking change template sDt, and slope difference template sK(m,t) in the sedimentation structure sequence template of each candidate pollution source are read in turn. The sedimentation structure sequence of the sample to be judged is compared with the sedimentation structure sequence template of each candidate pollution source period by period and parameter by parameter to obtain the structural deviation GWs of the s-th candidate pollution source. The formula is as follows: ; The structural deviation GWs of the s-th candidate pollution source is compared with the preset deviation threshold Tgw to determine the overall degree of deviation. When the structural deviation GWs ≤ the deviation threshold Tgw, the sample to be judged is close to the candidate pollution source and the degree of deviation is low. When the structural deviation GWs > the deviation threshold Tgw, the degree of deviation is abnormal.

[0036] S4 also includes S42. S42. Sort the structural deviations GWs of all candidate pollution sources in ascending order, extract the smallest structural deviation GW(1) as the optimal candidate source deviation value, and extract the second smallest structural deviation GW(2) as the second-best candidate source deviation value. The preliminary source discrimination index QG is calculated from the gap between the optimal candidate source deviation value and the second-best candidate source deviation value. The preliminary source discrimination index QG is obtained using the following formula: ; Based on the preliminary source discrimination index QG, determine whether the current preliminary source tracing result has a clear boundary, as follows: When the preliminary source discrimination index QG ≥ the discrimination threshold Qref, the difference between the optimal candidate pollution source and the other candidate pollution sources has reached the reference discrimination level, and the optimal candidate pollution source is output as the preliminary source tracing result. When the preliminary source discrimination index QG < the discrimination threshold Qref, the difference between the optimal candidate pollution source and the other candidate pollution sources is insufficient, meaning that multiple candidate pollution sources are close to the sample to be judged, and the result is marked as to be reviewed.
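The decision flow in S42 can be sketched in Python as below. The formula for QG is not reproduced in this text, so the sketch substitutes an assumed stand-in: the relative gap (GW(2) - GW(1)) / GW(2). Only the sort, the extraction of the best and second-best deviations, and the comparison against Qref follow the description; the function name, the source labels, and the threshold value are hypothetical.

```python
def preliminary_discrimination(deviations, q_ref):
    """Return (best_source, qg, status) from a {source: GW_s} mapping.

    The QG expression below is an ASSUMED stand-in, not the patent's formula.
    """
    ordered = sorted(deviations.items(), key=lambda kv: kv[1])
    (best, gw1), (_, gw2) = ordered[0], ordered[1]
    qg = (gw2 - gw1) / gw2 if gw2 > 0 else 0.0
    status = "preliminary result" if qg >= q_ref else "to be reviewed"
    return best, qg, status

# One clearly best candidate, then two nearly tied candidates.
clear = preliminary_discrimination({"A": 0.2, "B": 0.8, "C": 0.5}, q_ref=0.3)
tied = preliminary_discrimination({"A": 0.40, "B": 0.42}, q_ref=0.3)
```

In the second call the two deviations are almost equal, so the sample is marked for review rather than attributed to source "A", which is the situation the path verification in S5 is meant to resolve.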

[0037] In this embodiment, after the pollution source identification model training is completed, the sedimentation structure sequence of the sample to be judged and the sedimentation structure sequence template of each candidate pollution source are calculated on a time-by-time and parameter-by-parameter basis. The structural deviation is used to quantify the structural similarity between the sample to be judged and different pollution sources. This allows multi-dimensional features such as the proportion, order changes, and trends in the sedimentation structure sequence to participate in source identification simultaneously. As a result, the pollution source judgment process no longer relies solely on a single concentration comparison, but identifies the differences between different pollution sources through time series structural features, which is beneficial for reflecting the changing patterns of different pollution sources in the long-term sedimentation process.

[0038] By sorting the structural deviations of all candidate pollution sources, the difference relationship between the best and second-best candidate sources is extracted, and a preliminary source tracing discrimination index is calculated. This enables the pollution source identification process to not only provide the closest candidate source but also to judge the degree of difference between this result and other candidate sources, thus providing a comparable basis for the model output. Based on this, the preliminary source tracing results are judged by setting discrimination judgment conditions. When the difference between the best candidate source and other candidate sources reaches the reference discrimination level, the preliminary source tracing result is output. When the degree of difference is insufficient, it is marked as pending verification. This enables the pollution source identification process to identify situations where multiple candidate pollution sources are close together, avoiding direct output of results when the source is unclear.

[0039] By using the above methods, the process of identifying soil heavy metal pollution sources is enhanced by adding a result discrimination judgment step on the basis of sedimentation structure feature matching. This allows the model output to not only reflect the degree of proximity to each candidate pollution source, but also the relative differences between different candidate pollution sources. This provides a clearer basis for the pollution source analysis process and is beneficial for distinguishing pollution sources in regional environments with multiple potential pollution sources.

[0040] Example 5. Referring to Figures 2 and 3, specifically: S5 includes S51 and S52. S51. When the preliminary source discrimination index QG < the discrimination threshold Qref, the structural differences between multiple candidate pollution sources are not obvious, and further verification through atmospheric transport path relationships is needed. In this case, the spatial distance and azimuth angle in the candidate pollution source spatial location data are read, together with the meteorological data for the corresponding sampling periods, including the prevailing wind direction angle, average wind speed, cumulative rainfall, and average relative humidity. For each sampling period, the relationship between the prevailing wind direction and the direction from the candidate pollution source to the sampling point is calculated and corrected according to the meteorological conditions to obtain the path deviation Bs of the s-th candidate pollution source. The formula is as follows: ; The consistency of the path is determined from the path deviation Bs as follows: When the path deviation Bs > 0.3, the path status is uncertain and the deviation is abnormal. When the path deviation Bs ≤ 0.3, the wind direction is consistent with the direction of the candidate pollution source, and the pollutants are transported to the sampling area along the path. S52. The obtained path deviation Bs, structural deviation GWs, and distance conversion amount sL are combined to calculate the final verification value Fs of each candidate pollution source. The final verification values Fs of all candidate pollution sources are sorted in ascending order, and the candidate pollution source with the smallest final verification value is extracted as the atmospheric source of soil heavy metal pollution in the target area. The final verification value Fs is obtained using the following formula: ; The extraction formula is as follows: S# = argmin(s=1,2,...,S) Fs; In the formula, S# represents the number of the finally determined pollution source, and argmin returns the value of s that minimizes Fs.
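Two pieces of S5 can be sketched in Python without knowing the omitted formulas: the wrap-around angle between the prevailing wind direction and a candidate source's azimuth, and the final selection of the source with the smallest Fs. The meteorological correction that turns the angle into Bs and the combination of Bs, GWs, and sL into Fs are not reproduced in this text, so they are deliberately left out. The function names and the Fs values are hypothetical.

```python
def angular_difference(wind_deg, azimuth_deg):
    """Smallest absolute angle between two bearings, in degrees (0 to 180)."""
    diff = abs(wind_deg - azimuth_deg) % 360.0
    return min(diff, 360.0 - diff)

def select_source(final_values):
    """Return the source number with the smallest final verification value."""
    return min(final_values, key=final_values.get)

# 350 degrees and 10 degrees are 20 degrees apart, not 340.
gap = angular_difference(350.0, 10.0)
# Hypothetical final verification values for three candidate sources.
chosen = select_source({1: 0.42, 2: 0.17, 3: 0.35})
```

Taking the smaller of the two arcs avoids the jump at the 0°/360° boundary, which would otherwise make a nearly aligned wind and source bearing look almost opposite.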

[0041] In this embodiment, atmospheric transport path relationships are introduced for further verification based on the preliminary source tracing results. When the structural differences between multiple candidate pollution sources are not obvious, the spatial location data of the candidate pollution sources and the meteorological data of the corresponding sampling period are read to analyze the consistency between the directional relationship of the pollution source pointing to the sampling point and the prevailing wind direction. Combined with meteorological factors such as wind speed, rainfall and relative humidity, the state of the transport path is comprehensively judged. This makes the pollution source identification process not only rely on the sedimentation structure sequence characteristics, but also combine atmospheric transport conditions to analyze the possible transport direction of pollutants, thereby providing further discrimination basis when multiple candidate pollution sources are close to each other.

[0042] During the path verification process, the path deviation of different candidate pollution sources is calculated to express the atmospheric transport path status between each candidate pollution source and the sampling area in a unified manner. This allows for the identification of whether there is a consistent relationship between wind direction and pollution source direction, and to reflect the possible transport paths of pollutants under airflow conditions. On this basis, the path deviation, the structural deviation caused by the deposition structure characteristics, and the spatial distance relationship between the pollution source and the sampling point are comprehensively calculated. This ensures that the pollution source determination process simultaneously considers the pollution deposition structure characteristics, atmospheric transport path relationships, and spatial location relationships, and that the role of different types of information in the same determination process is uniformly reflected.

[0043] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A machine learning-based method for tracing atmospheric sources of heavy metal pollution in soil, characterized in that it includes the following steps: S1. Obtain soil heavy metal content data, meteorological data, and spatial location data of candidate pollution sources at the same sampling point in the target area during multiple sampling periods, and establish a soil heavy metal time series data table. S2. Perform proportion calculation, element ranking, and slope difference calculation on the heavy metal content of each sampling period in the soil heavy metal time series data table to construct the sedimentation structure sequence of the target area. S3. Construct candidate pollution source sedimentation structure sequence templates from historical samples with known sources in the same manner as step S2, and input them into the machine learning model for training to obtain a trained pollution source identification model. S4. Input the sedimentation structure sequence of the sample to be judged into the pollution source identification model, calculate the structural deviation GWs of each candidate pollution source against the candidate pollution source sedimentation structure sequence templates, and output the preliminary source discrimination index QG based on the structural deviation GWs. S5. Based on the preliminary source discrimination index QG, calculate the path deviation Bs and the final verification value Fs for each candidate pollution source, and output the atmospheric source result for soil heavy metal pollution in the target area according to the final verification value Fs.

2. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 1, characterized in that: S1 includes S11; S11. Use the data acquisition device to read soil heavy metal content data, sampling time interval data ΔTt, meteorological data and spatial location data of candidate pollution sources at the same sampling point during multiple sampling periods; Among them, the soil heavy metal content data were obtained by heavy metal analysis instruments: including the measured content C(m,t) of the m-th heavy metal in the t-th sampling period, the total number of heavy metal species M, and the total number of sampling periods T; The sampling interval data ΔTt specifically represents the actual time interval between the t-th sampling period and the (t-1)-th sampling period; Meteorological data is collected through wind speed sensors, wind direction sensors, rain gauges and humidity sensors in automatic weather monitoring stations: including the cumulative rainfall Rt in the t-th sampling period, the average wind speed Ut in the t-th sampling period, the prevailing wind direction angle PWt in the t-th sampling period and the average relative humidity Ht in the t-th sampling period; The spatial location data of the candidate pollution sources are acquired by GNSS positioning equipment, including the spatial distance Ls from the s-th candidate pollution source to the target sampling point and the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point.

3. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 2, characterized in that: S1 also includes S12. S12. The acquired soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and candidate pollution source spatial location data are checked period by period to identify data items that fall outside the detection record boundaries, are duplicated, have missing key items, have inconsistent units, or are out of order. The soil heavy metal content data items, meteorological data items, and candidate pollution source spatial location data items are then standardized to the same scale for similar fields and spliced in sampling time order to establish the soil heavy metal time series data table. The soil heavy metal content data are preprocessed with a low-pass filter to remove noise. Anomaly detection is performed on the meteorological data using the three-standard-deviation method; outliers are removed and missing values are filled in. The spatial location data of candidate pollution sources are decomposed into directional components to remove the influence of spatial units and to prevent jumps in angle calculation. The soil heavy metal content data, sampling time interval data ΔTt, meteorological data, and spatial location data of candidate pollution sources are normalized using min-max normalization so that all fields are dimensionless and share a common scale.

4. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 3, characterized in that: S2 includes S21; S21. Read the measured content C(m,t) of the m-th heavy metal in the t-th sampling period of the soil heavy metal time series data table, and perform cumulative calculation on the content of all heavy metals in the same sampling period to obtain the total metal content CΣ(t) of the sampling period. Divide the measured content C(m,t) of the m-th heavy metal in the t-th sampling period by the total metal content CΣ(t) in the sampling period to obtain the proportion P(m,t) of the m-th heavy metal in the t-th sampling period. The contents of all heavy metals within the same sampling period are sorted according to their numerical values. The ranking position O(m,t) of the m-th heavy metal element in the t-th sampling period is obtained using the following formula: O(m,t) = rank(C(m,t)); where rank represents the sorting operation.

5. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 4, characterized in that: S2 also includes S22; S22. Read, in chronological order, the content ranking positions O(m,t) and O(m,t-1) of the m-th heavy metal element in the t-th and (t-1)-th sampling periods; take the absolute value of the difference in ranking position between adjacent periods for each heavy metal and sum over all heavy metals to obtain the ranking change Dt of the t-th sampling period relative to the (t-1)-th sampling period. The formula is as follows: Dt = Σ(m=1,2,...,M) |O(m,t) - O(m,t-1)|; in the formula, M represents the total number of heavy metal species, and O(m,t-1) represents the ranking position of the m-th heavy metal element in the (t-1)-th sampling period. Read the heavy metal contents C(m,t), C(m,t-1), C(m,t-2) and the corresponding time intervals ΔTt, ΔTt-1 for three consecutive sampling periods. Calculate the rate of change over each of the two adjacent intervals and take the difference of the two rates to obtain the slope difference K(m,t) of the m-th heavy metal element in the t-th sampling period. The formula is as follows: K(m,t) = (C(m,t) - C(m,t-1)) / ΔTt - (C(m,t-1) - C(m,t-2)) / ΔTt-1; in the formula, ΔTt represents the actual time interval between the t-th sampling period and the (t-1)-th sampling period, and ΔTt-1 represents the actual time interval between the (t-1)-th sampling period and the (t-2)-th sampling period. The sedimentation structure sequence of the target area is constructed by combining, in chronological order, the proportion P(m,t) of the m-th heavy metal in the t-th sampling period, the content ranking position O(m,t) of the m-th heavy metal element in the t-th sampling period, the ranking change Dt of the t-th sampling period relative to the (t-1)-th sampling period, and the slope difference K(m,t) of the m-th heavy metal element in the t-th sampling period.
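
The two quantities described in S22 can be sketched in Python as follows. The function names are hypothetical. The sign convention of the slope difference (later rate minus earlier rate) is an assumption based on the wording of the claim.

```python
def ranking_change(ranks_now, ranks_prev):
    """D_t: sum over all metals of |O(m, t) - O(m, t-1)|."""
    return sum(abs(a - b) for a, b in zip(ranks_now, ranks_prev))


def slope_difference(c_t, c_t1, c_t2, dt_t, dt_t1):
    """K(m, t): difference between two successive rates of change.

    c_t, c_t1, c_t2: contents in periods t, t-1 and t-2.
    dt_t: interval between periods t and t-1.
    dt_t1: interval between periods t-1 and t-2.
    Assumption: the later rate minus the earlier rate.
    """
    recent_rate = (c_t - c_t1) / dt_t
    earlier_rate = (c_t1 - c_t2) / dt_t1
    return recent_rate - earlier_rate


d_t = ranking_change([2, 1, 3], [1, 2, 3])
k_mt = slope_difference(16.0, 10.0, 8.0, 2.0, 4.0)
```

Using the actual intervals ΔTt and ΔTt-1 rather than a fixed step keeps the two rates comparable when the sampling is unevenly spaced.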

6. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 5, characterized in that: S3 includes S31; S31. Read historical samples with known sources and collect the soil heavy metal time series samples corresponding to the pollution source categories identified in the historical samples; using the calculation steps in S2, calculate for each group of historical samples the proportion (s,n)P(m,t) of the m-th heavy metal in the t-th sampling period, the ranking change (s,n)Dt of the t-th sampling period relative to the (t-1)-th sampling period, and the slope difference (s,n)K(m,t) of the m-th heavy metal in the t-th sampling period, where s represents the candidate pollution source category number and n represents the n-th known-source sample under that category. After the sedimentation structure sequence of each group of historical samples is obtained, the spatial distance Ls from the s-th candidate pollution source to the target sampling point and the azimuth angle Qs of the s-th candidate pollution source pointing to the target sampling point are read and written into the template construction process as the spatial location items of that candidate pollution source category. For all historical samples under the same candidate pollution source category, the proportion sequences, the ranking change sequences, and the slope difference sequences are each averaged over the samples to generate the proportion template sP(m,t), the ranking change template sDt, and the slope difference template sK(m,t) of the s-th candidate pollution source. Together with the direction deviation template term sAt and the distance conversion value sL, they form the candidate pollution source sedimentation structure sequence template.
The direction deviation template term sAt is obtained using the following formula: (formula not reproduced in the source text); in the formula, π represents the mathematical constant pi, and PWt represents the prevailing wind direction angle during the t-th sampling period. The distance conversion value sL is obtained using the following formula: (formula not reproduced in the source text); in the formula, max(r=1,2,...,S)Lr represents the maximum spatial distance among all candidate pollution sources, and S represents the total number of candidate pollution source categories.
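
The template construction in S31 can be sketched in Python as below. The text does not reproduce the formulas for sAt and sL, so the forms used here are assumptions suggested by the symbol definitions. sAt is taken as the wrapped angular difference between PWt and Qs divided by π, and sL as Ls divided by the maximum distance. All function names are hypothetical.

```python
import math


def average_template(sample_sequences):
    """Element-wise mean over the historical samples of one source category.

    sample_sequences: list of equal-length lists, one per known sample n.
    """
    n = len(sample_sequences)
    return [sum(column) / n for column in zip(*sample_sequences)]


def direction_deviation(wind_angle, source_azimuth):
    """Assumed form of sAt: wrapped angle difference over pi, in [0, 1].

    Angles are in radians. This is not the formula from the patent text.
    """
    diff = abs(wind_angle - source_azimuth) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff) / math.pi


def distance_conversion(distances):
    """Assumed form of sL: each distance L_s divided by the largest L_r."""
    longest = max(distances)
    return [d / longest for d in distances]
```

Under these assumptions both terms lie between 0 and 1, which keeps them on the same scale as the min-max normalized inputs.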

7. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 6, characterized in that: S3 also includes S32; S32. Assemble the obtained candidate pollution source sedimentation structure sequence templates into a training sample set by category number, and label each training sample vector with the corresponding candidate pollution source category identifier. All training sample vectors under the same candidate pollution source category are grouped, and the position at which the training samples of that category concentrate in the feature space is extracted to generate the corresponding category center, which represents the typical sedimentation structure characteristics of that candidate pollution source category. The degree of difference between each training sample vector and its category center is calculated to obtain the intra-class bias, which characterizes how consistent the training sample is with the typical structure of its category. The category centers of different candidate pollution source categories are compared pairwise to obtain the inter-class separation measure, which characterizes how well separated the different candidate pollution source categories are in the feature space. A training decision metric is constructed from the correspondence between the intra-class bias and the inter-class separation of each category and is used to evaluate the current training state; when the training decision metric meets the preset training conditions, the model training process ends and the training parameters of the current machine learning model are saved to obtain the trained pollution source identification model.
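
The quantities named in S32 can be illustrated with a small Python sketch. The claim does not fix the center definition, the distance measure, or the form of the training decision metric. The mean vector, Euclidean distance, and separation-to-bias ratio used here are assumptions, and the function names are hypothetical.

```python
import math


def class_center(vectors):
    """Assumed category center: element-wise mean of the training vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def training_decision(samples_by_class):
    """Return (mean intra-class bias, minimum inter-class separation, ratio).

    samples_by_class: dict mapping category id -> list of feature vectors;
    at least two categories are required.
    Assumption: the decision metric is separation / bias (larger is better).
    """
    centers = {s: class_center(v) for s, v in samples_by_class.items()}
    biases = [euclidean(x, centers[s])
              for s, vectors in samples_by_class.items() for x in vectors]
    intra = sum(biases) / len(biases)
    ids = sorted(centers)
    inter = min(euclidean(centers[a], centers[b])
                for i, a in enumerate(ids) for b in ids[i + 1:])
    ratio = inter / intra if intra > 0 else math.inf
    return intra, inter, ratio


samples = {1: [[0.0, 0.0], [2.0, 0.0]], 2: [[10.0, 0.0], [12.0, 0.0]]}
intra, inter, ratio = training_decision(samples)
```

Training would stop once the ratio passes a preset condition. The value of that condition is not given in the claim.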

8. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 7, characterized in that: S4 includes S41; S41. Input the sedimentation structure sequence of the s-th candidate pollution source into the pollution source identification model to obtain the set of candidate pollution source matching results corresponding to the sample to be judged. The proportion template sP(m,t), ranking change template sDt, and slope difference template sK(m,t) in the sedimentation structure sequence template of each candidate pollution source are read in turn. The sedimentation structure sequence of the sample to be judged is compared with the sedimentation structure sequence template of each candidate pollution source, period by period and parameter by parameter, to obtain the structural deviation GWs of the s-th candidate pollution source. The formula is as follows: (formula not reproduced in the source text); the structural deviation GWs of the s-th candidate pollution source is compared with the preset deviation threshold Tgw to determine the overall degree of deviation. When GWs ≤ Tgw, the sample to be judged is close to the candidate pollution source and the degree of deviation is low; when GWs > Tgw, the degree of deviation is abnormal.
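
The comparison in S41 can be sketched in Python as follows. The text does not reproduce the formula for GWs, so the mean absolute difference used here is an assumption and not the patented formula. The function names and the data layout are hypothetical.

```python
def structural_deviation(sample, template):
    """Assumed form of GWs: mean absolute difference over P, D and K.

    sample, template: dicts with keys "P", "D" and "K", each holding a flat
    list of values in the same period (and metal) order.
    """
    diffs = []
    for key in ("P", "D", "K"):
        diffs.extend(abs(a - b) for a, b in zip(sample[key], template[key]))
    return sum(diffs) / len(diffs)


def deviation_status(gw, threshold):
    """Compare GWs with the deviation threshold Tgw."""
    return "low" if gw <= threshold else "abnormal"


sample = {"P": [0.5, 0.5], "D": [2.0], "K": [1.0]}
template = {"P": [0.25, 0.75], "D": [0.0], "K": [1.5]}
gw = structural_deviation(sample, template)
```

The threshold passed to `deviation_status` stands in for Tgw, whose value is not stated in the claim.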

9. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 8, characterized in that: S4 also includes S42; S42. Sort the structural deviations GWs of all candidate pollution sources in ascending order, extract the smallest structural deviation GW(1) as the optimal candidate source deviation value, and extract the second smallest structural deviation GW(2) as the second-best candidate source deviation value. The preliminary source discrimination index QG is calculated from the interval between the optimal candidate source deviation value and the second-best candidate source deviation value, using the following formula: (formula not reproduced in the source text); the preliminary source discrimination index QG is then used to determine whether the current preliminary source tracing result has a clear boundary, as follows: when QG ≥ the discrimination threshold Qref, the difference between the optimal candidate pollution source and the other candidate pollution sources has reached the reference discrimination level, and the optimal candidate pollution source is output as the preliminary source tracing result; when QG < the discrimination threshold Qref, the difference between the optimal candidate pollution source and the other candidate pollution sources is insufficient, indicating that multiple candidate pollution sources are close to the sample to be judged, and the result is marked as a result to be reviewed.
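
The decision logic of S42 can be sketched in Python as below. The formula for QG is not reproduced in the text. The relative gap (GW(2) - GW(1)) / GW(2) used here is an assumption, as are the function name and the Qref value in the example.

```python
def preliminary_discrimination(deviations, q_ref):
    """Rank candidate sources by GWs and test how clear the best match is.

    deviations: dict mapping source id -> structural deviation GWs
    (at least two sources).
    Assumed form of QG: (GW(2) - GW(1)) / GW(2).
    Returns (best source id, QG, status).
    """
    ranked = sorted(deviations.items(), key=lambda item: item[1])
    (best_id, gw1), (_, gw2) = ranked[0], ranked[1]
    qg = (gw2 - gw1) / gw2 if gw2 > 0 else 0.0
    status = "preliminary result" if qg >= q_ref else "to be reviewed"
    return best_id, qg, status


clear_case = preliminary_discrimination({"A": 0.25, "B": 1.0, "C": 2.0}, 0.5)
close_case = preliminary_discrimination({"A": 0.75, "B": 1.0}, 0.5)
```

In the second call the two smallest deviations are close together, so the result is marked for review and would go on to the path verification step.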

10. The method for tracing atmospheric sources of heavy metal pollution in soil based on machine learning according to claim 9, characterized in that: S5 includes S51 and S52; S51. When the preliminary source discrimination index QG < the discrimination threshold Qref, the structural differences between multiple candidate pollution sources are not obvious and further verification through atmospheric transport path relationships is needed. In this case, the spatial distance and azimuth angle in the candidate pollution source spatial location data are read, together with the meteorological data for the corresponding sampling periods, including the prevailing wind direction angle, average wind speed, cumulative rainfall, and average relative humidity. For each sampling period, the relationship between the prevailing wind direction and the direction from the candidate pollution source to the sampling point is calculated and corrected according to the meteorological conditions to obtain the path deviation Bs of the s-th candidate pollution source. The formula is as follows: (formula not reproduced in the source text); the path consistency status is determined from the path deviation Bs as follows: when Bs > 0.3, the path status is uncertain and the deviation is abnormal; when Bs ≤ 0.3, the wind direction is consistent with the direction of the candidate pollution source, and the pollutants are transported to the sampling area along the path; S52. The obtained path deviation Bs, structural deviation GWs, and distance conversion value sL are combined to calculate the final verification value Fs of each candidate pollution source.
The final verification values Fs of all candidate pollution sources are sorted in ascending order, and the candidate pollution source with the smallest final verification value is extracted as the atmospheric source of soil heavy metal pollution in the target area. The final verification value Fs is obtained using the following formula: (formula not reproduced in the source text); the extraction formula is as follows: S# = argmin(s=1,2,...,S) Fs; in the formula, S# represents the number of the finally determined pollution source, and argmin returns the value of s that minimizes Fs.
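
The final selection in S52 can be illustrated with a short Python sketch. The formula that combines Bs, GWs and sL is not reproduced in the text, so the weighted sum with equal weights used here is an assumption. The function names are hypothetical. Only the 0.3 path threshold and the argmin selection come from the claim.

```python
def path_status(b_s, limit=0.3):
    """Classify the path deviation Bs against the 0.3 threshold of S51."""
    return "consistent" if b_s <= limit else "uncertain"


def final_source(path_dev, struct_dev, dist_conv, weights=(1.0, 1.0, 1.0)):
    """Select the source with the smallest final verification value Fs.

    path_dev, struct_dev, dist_conv: dicts mapping source id -> Bs, GWs, sL.
    Assumed form: Fs = wb * Bs + wg * GWs + wl * sL (weights are hypothetical).
    Returns (S#, dict of Fs values), where S# is the argmin over sources.
    """
    wb, wg, wl = weights
    scores = {s: wb * path_dev[s] + wg * struct_dev[s] + wl * dist_conv[s]
              for s in struct_dev}
    best = min(scores, key=scores.get)
    return best, scores


best, scores = final_source(
    path_dev={1: 0.25, 2: 0.5},
    struct_dev={1: 0.5, 2: 0.25},
    dist_conv={1: 0.5, 2: 1.0},
)
```

In this example source 2 has the smaller structural deviation, but source 1 is selected because its path deviation and distance term are lower.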