Meta-modeling method and system for the global distribution of new pollutants

By integrating machine learning models and environmental factor analysis, the data integration challenge for assessing the global distribution of new pollutants has been solved, enabling high-resolution pollutant distribution prediction and risk assessment, and providing precise environmental management support.

CN122241108A · Pending · Publication Date: 2026-06-19 · HENAN UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
HENAN UNIVERSITY
Filing Date
2026-03-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies cannot support high-resolution, unified modeling and data integration on a global scale. As a result, the distribution characteristics and dominant source mechanisms of new pollutants are difficult to assess, and no systematic technical approach is available.

Method used

An ensemble machine learning model is combined with environmental factors. Through data standardization, pollutant feature coding, and variable importance analysis, the method predicts the spatial distribution of pollutants in global agricultural regions and generates risk distribution maps.

Benefits of technology

It enables high-resolution global-scale distribution prediction and visualization, quantitatively identifies driving factors and source patterns of pollutants, reduces monitoring costs, and provides precise environmental management decision support.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention proposes a meta-modeling method and system for the global distribution of new pollutants. Monitoring data are acquired and passed through a standardized data cleaning and feature encoding process. Multidimensional environmental driving factors are extracted for each sampling point, and an ensemble machine learning model consisting of random forest, XGBoost gradient boosting, and support vector regression is used to establish the response relationship between pollutant concentration and environmental factors, achieving high-resolution prediction of the spatial distribution of pollutants in global agricultural regions. Based on model analysis, the method can quantitatively identify key pollution driving factors and spatially cluster areas dominated by different pollution sources. Finally, risk levels are determined based on the global percentile of the prediction results, automatically identifying and visualizing global pollution hotspots. This method, for the first time, integrates data fusion, global prediction, quantitative source tracing, and risk assessment in a single analysis, significantly reducing costs compared to traditional comprehensive sampling, and providing an efficient and reliable decision support tool for global environmental risk management.

Description

Technical Field

[0001] This invention relates to the field of environmental pollution monitoring and analysis technology, and in particular to a meta-modeling method and system for the global distribution of new pollutants.

Background Technology

[0002] Emerging contaminants (ECs) refer to pollutants that have received widespread attention in recent years but have not yet been fully regulated. They mainly include microplastics, antibiotics, pesticide residues, and surfactants. With the widespread use of agricultural films, wastewater irrigation, organic fertilizers, and pesticides and antibiotics in modern agriculture, a large number of emerging contaminants have entered the agricultural soil system, posing a potential threat to soil health, crop safety, and human health.

[0003] To address this issue, scholars both domestically and internationally have conducted extensive research, primarily focusing on field detection of pollutants, regional pollution source analysis, and ecological impact assessment. Common research methods include sampling and detection, risk assessment model construction, and source tracing analysis of limited areas, yielding certain results. For example, infrared spectroscopy and liquid chromatography have been used to quantitatively identify microplastic particles and pesticide residues in soil; some studies have also explored the relationship between wastewater irrigation and agricultural film use with soil pollution, and verified the adverse effects of new pollutants on soil health and crop growth through simulation experiments.

[0004] However, current research is mainly limited to local areas and suffers from fragmented data, inconsistent methodologies, and limited spatial scale, making it difficult to support the assessment of new pollutant distribution and source identification on a global scale. In particular, systematic technical pathways are lacking for data integration, unified modeling of pollutant types, and analysis of pollution driving mechanisms. Therefore, there is an urgent need for a modeling method that fuses existing literature data with environmental factors and can reveal the distribution characteristics and dominant source mechanisms of new pollutants at the global agricultural soil scale, providing a scientific basis for pollution risk early warning, green agricultural transformation, and environmental policy formulation.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, the purpose of this invention is to provide a meta-modeling method and system for the global distribution of new pollutants.

[0006] To achieve the above objectives, the first aspect of this application proposes a meta-modeling method for the global distribution of new pollutants, comprising the following steps:
S100: acquire soil pollutant sampling data from around the world, including geographic location information and quantitative concentration information;
S200: standardize the acquired sampling data and establish a pollutant feature coding system to generate a standardized database;
S300: extract environmental factors for each sampling point, including land use factors and human activity factors;
S400: train an ensemble machine learning model and use the trained model to predict the spatial distribution of pollutants in global agricultural regions;
S500: based on the variable importance analysis results of the ensemble machine learning model, combined with pollutant characteristics, identify the dominant driving factors affecting pollutant distribution and delineate different pollution spatial patterns;
S600: based on the predicted spatial distribution of pollutants, classify pollution risk levels using a threshold method based on global percentiles, identify pollution hotspots, and generate a risk distribution map.

[0007] According to one embodiment of this application, step s200 includes: unifying the concentration unit of microplastics to particle count / kg dry soil, and unifying the concentration units of antibiotics, pesticide residues and surfactants to μg / kg dry soil; the pollutant characteristic coding system includes polymer type coding, morphological characteristic coding and color coding.

[0008] According to one embodiment of this application, step s300 includes extracting the land use factor based on ESA CCI-LC land cover data and extracting the human activity factor based on LandScan population distribution data.

[0009] According to one embodiment of this application, step S400 includes: S410, performing a base-10 logarithmic transformation on the response variable, pollutant concentration; S420, using 5-fold cross-validation to train three models: Random Forest Regression (RF), XGBoost Gradient Boosting (XGB), and Support Vector Regression (SVR). The optimal hyperparameters are determined using grid search, and each model is retrained using all the data to obtain three trained prediction models: Random Forest Regression (RF_final), XGBoost Gradient Boosting (XGB_final), and Support Vector Regression (SVR_final).

[0010] According to one embodiment of this application, step S400 further includes: S430, computing the weighted average of the prediction results of the three trained prediction models to obtain the final ensemble prediction value Final_Prediction, and then inversely transforming the ensemble prediction to obtain the final predicted value of the actual pollutant concentration; S440, using global farmland distribution data as a mask, calculating the predicted actual pollutant concentration for each 5km × 5km raster cell and generating a global agricultural soil pollutant concentration distribution raster map.

[0011] According to one embodiment of this application, step S500 includes: S510, calculating the feature importance score of each environmental factor using the trained random forest regression model and extracting the pollutant type features corresponding to each sample point from the structured database; S520, sorting all environmental factors from highest to lowest importance score to quantitatively identify the dominant driving factors.

[0012] According to one embodiment of this application, step S500 further includes: S530, determining the source category based on the dominant driving factors and pollutant type characteristics; S540, creating a feature vector for each sample point, which includes predicted pollutant concentrations and dominant driving factor values, and then performing K-means clustering analysis to identify and output global pollution spatial patterns.

[0013] According to one embodiment of this application, step S600 includes: S610, applying the farmland mask to the predicted global pollutant concentration raster map; S620, extracting pollutant concentration values from all agricultural pixels to form a concentration value array, and classifying the predicted concentration values into risk levels according to their percentile in the global distribution using the quantile threshold method. The low-risk area is defined as no greater than the 50th percentile, the medium-risk area as the 50th-90th percentile, the high-risk area as greater than the 90th percentile, and the extremely high-risk area, a subset of the high-risk range, as greater than the 95th percentile.

[0014] According to one embodiment of this application, step S600 further includes: S630, traversing every agricultural cell in the global pollutant concentration prediction raster map, comparing its predicted concentration value with the thresholds calculated in step S620, and assigning it a risk level code accordingly; after all cells have been classified, a risk level raster map is generated.
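The percentile rule of steps S620 and S630 can be illustrated with a short Python sketch. It rests on several assumptions that are not in the source: the integer codes 1 to 4 are hypothetical (the text does not list the actual codes), the extremely high class is carved out of the range above the 90th percentile, and percentiles use the inclusive interpolation of `statistics.quantiles`, which is only one of several possible conventions.

```python
from statistics import quantiles

# Hypothetical risk codes; the source does not specify the actual coding.
LOW, MEDIUM, HIGH, EXTREME = 1, 2, 3, 4

def percentile_thresholds(values):
    """Return the 50th, 90th and 95th percentiles of the farmland values."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[89], cuts[94]

def risk_code(value, thresholds):
    """Map one predicted concentration to a risk code."""
    p50, p90, p95 = thresholds
    if value <= p50:
        return LOW
    if value <= p90:
        return MEDIUM
    if value <= p95:
        return HIGH
    return EXTREME

def classify_raster(raster, thresholds):
    """Classify a 2-D list of predictions; None marks non-farmland cells."""
    return [[None if v is None else risk_code(v, thresholds) for v in row]
            for row in raster]

# Toy raster of predicted concentrations (illustrative values only).
raster = [[1.0, 5.0, None], [40.0, 200.0, 900.0]]
farmland_values = [v for row in raster for v in row if v is not None]
thresholds = percentile_thresholds(farmland_values)
risk_map = classify_raster(raster, thresholds)
```

Computing the thresholds once from the masked array and then classifying cell by cell mirrors the split between S620 and S630.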

[0015] The second aspect of this application proposes a meta-modeling system for the global distribution of new pollutants, including the following modules:
The data acquisition module is used to acquire soil pollutant sampling data from around the world, including geographic location information and quantitative concentration information.
The data standardization module standardizes the acquired sampling data and establishes a pollutant feature coding system to generate a standardized database.
The spatial analysis module extracts environmental factors for each sampling point in the standardized database, including land use factors and human activity factors.
The machine learning modeling and prediction module uses standardized pollutant concentration as the response variable and the environmental factors as the feature variables. It trains an ensemble machine learning model to establish the response relationship between pollutant concentration and environmental factors, and uses the trained model to predict the spatial distribution of pollutants in global agricultural regions.
The source tracing and pattern recognition module, based on the variable importance analysis results of the ensemble machine learning model and combined with pollutant characteristics, identifies the dominant driving factors affecting pollutant distribution and delineates different pollution spatial patterns.
The risk assessment and visualization module classifies pollution risk levels based on the predicted spatial distribution of pollutants using a threshold method based on global percentiles, identifies pollution hotspots, and generates a risk distribution map.

[0016] Compared with the prior art, the beneficial effects of the present invention are: (1) Achieving high-resolution distribution prediction and visualization at the global scale: For the first time, high-precision, rasterized prediction (e.g., 5km resolution) of the spatial distribution of various new pollutants (such as microplastics and antibiotics) at the global agricultural regional scale has been achieved. This method overcomes the shortcomings of traditional research which is limited to local points and generates visualized global distribution heat maps and risk level maps, thereby revealing the global distribution pattern of pollutants and high-risk hotspots in a macroscopic and intuitive way.

[0017] (2) Complete the integrated system analysis from data to causes: This method not only predicts "where the pollution is heavy", but also quantitatively identifies and spatially displays driving factors (such as agricultural film use and sewage irrigation) and dominant pollution source patterns (such as agricultural input type and urban composite type) through variable importance analysis and spatial clustering, realizing the correlation analysis between pollution status and causes.

[0018] (3) Significantly reduce global monitoring and assessment costs: The core of this method is based on the concept of "meta-analysis," which involves systematically integrating and reanalyzing published, scattered monitoring data globally, rather than re-implementing full-coverage field sampling. This method can save monitoring costs, making frequent and rapid pollution screening and assessment on a global scale economically feasible.

[0019] (4) Good adaptability and scalability to different pollutant types: By establishing a standardized data cleaning process and pollutant feature coding system, this method can uniformly handle different categories of pollutants with different physicochemical properties, such as microplastics (by particle number) and antibiotics / pesticides (by mass). This design makes the method have good universality and scalability, and can be applied to the global assessment of new pollutants that will emerge in the future.

[0020] (5) Provide precise decision support tools for environmental management: The final output (including global distribution map, risk classification map, and pollution source type map) is presented in the form of standardized geographic information maps, which can directly serve the formulation of national and global environmental policies, precise prevention and control of agricultural non-point source pollution, and delineation of food safety risk areas, providing a quantitative scientific basis for promoting sustainable agricultural development and environmental protection.

Attached Figure Description

[0021] Figure 1 is a flowchart of the meta-modeling method for the global distribution of new pollutants according to the present invention.

Detailed Implementation

[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0023] As shown in Figure 1, the meta-modeling method for the global distribution of new pollutants according to the present invention includes the following steps:
S100: acquire soil pollutant sampling data from around the world, including geographic location information and quantitative concentration information;
S200: standardize the acquired sampling data, establish a pollutant feature coding system, and generate a standardized database;
S300: extract environmental factors for each sampling point, including land use factors and human activity factors;
S400: use standardized pollutant concentration as the response variable and the environmental factors as the feature variables, train an ensemble machine learning model to establish the response relationship between pollutant concentration and environmental factors, and use the trained model to predict the spatial distribution of pollutants in global agricultural regions;
S500: based on the variable importance analysis results of the ensemble machine learning model, combined with pollutant characteristics, identify the dominant driving factors affecting pollutant distribution and delineate different pollution spatial patterns;
S600: based on the predicted spatial distribution of pollutants, classify pollution risk levels using a threshold method based on global percentiles, identify pollution hotspots, and generate a risk distribution map.

[0024] This invention provides a meta-modeling method for the global distribution of new pollutants. This method integrates globally dispersed monitoring data to establish a standardized pollutant distribution prediction model, enabling the modeling of the global distribution of new pollutants such as microplastics, antibiotics, and pesticide residues in agricultural soils and the analysis of pollution sources.

[0025] Furthermore, new pollutants include microplastics, antibiotics, pesticide residues, and surfactants.

[0026] Furthermore, the environmental factors also include land use factors, soil physicochemical factors, climate factors, human activity factors, and agricultural management factors.

[0027] During the data acquisition phase, this invention employs a dual-path data collection strategy. It conducts systematic searches in major scientific databases such as Web of Science, Scopus, and Google Scholar using specific keyword combinations, while simultaneously integrating relevant records from publicly available environmental monitoring databases such as FAO, UNEP, and USGS. Data screening criteria include the requirement of containing on-site soil sampling data, clear geographic coordinate information, standardized testing methods, and quantitative concentration data.

[0028] The specific steps are as follows:

Path 1: Systematic Literature Search

Constructing a search query: First, identify the target pollutants (such as microplastics, antibiotics) and their synonyms and related terms. Then, design structured keyword combinations. Finally, perform searches in scientific literature databases such as the Web of Science Core Collection, Scopus, and Google Scholar, typically covering the period from the beginning of the database to the present.

[0029] Export all search results (including title, author, abstract, etc.) and use literature management software to remove duplicate records.

[0030] Full-text reading and screening are performed under stricter data extraction standards, and only records that meet all of the following conditions are extracted:

Provide clear geographic coordinate information: latitude and longitude, or a location description that can be accurately located to the county level or above and whose coordinates can be obtained through geocoding.

[0031] Standardized detection methods are employed: The text clearly describes the methods for extracting, purifying, and quantitative / semi-quantitatively analyzing pollutants (such as density separation-micro counting for microplastics, and liquid chromatography-mass spectrometry for antibiotics).

[0032] Report quantitative concentration data: Report concentrations in numerical form (e.g., mean, median, range), rather than only qualitative descriptions (e.g., "detected" or "not detected"). For "not detected" data, record the method detection limit.

[0033] For data that meets the criteria, the following information should be structured and entered into a dedicated database or table: document ID, latitude and longitude of sampling point, pollutant type, specific concentration value and unit, sampling time, soil type (if any), detection method, and document source.
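The screening conditions and the structured record described above can be sketched in Python as follows. This is an illustrative sketch only: the `RawRecord` class, its field names, and the `passes_screening` helper are hypothetical, and treating a "not detected" result with a recorded detection limit as acceptable is an assumption about how paragraph [0032] is applied.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RawRecord:
    """One extracted record; field names are illustrative, not from the source."""
    document_id: str
    latitude: Optional[float]
    longitude: Optional[float]
    pollutant_type: str
    concentration: Optional[float]      # reported numeric value, if any
    unit: Optional[str]
    detection_method: Optional[str]
    detection_limit: Optional[float] = None  # kept for "not detected" results

def passes_screening(rec: RawRecord) -> bool:
    """Apply the three screening conditions to one record."""
    has_coords = (
        rec.latitude is not None and rec.longitude is not None
        and -90.0 <= rec.latitude <= 90.0
        and -180.0 <= rec.longitude <= 180.0
    )
    has_method = bool(rec.detection_method)
    has_value = rec.concentration is not None and rec.unit is not None
    # Assumption: a non-detect is usable when its detection limit is recorded.
    is_documented_nd = rec.concentration is None and rec.detection_limit is not None
    return has_coords and has_method and (has_value or is_documented_nd)

records = [
    RawRecord("doc-001", 34.8, 113.6, "microplastic", 120.0,
              "particles/kg", "density separation and microscopic counting"),
    RawRecord("doc-002", None, None, "antibiotic", 3.2, "ug/kg", "LC-MS"),
    RawRecord("doc-003", 10.0, 20.0, "antibiotic", None, "ug/kg", "LC-MS",
              detection_limit=0.5),
]
kept = [r for r in records if passes_screening(r)]
```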

Path 2: Integration of Public Environmental Monitoring Databases

[0034] In parallel, relevant monitoring records are obtained from the public data platforms of major international organizations and institutions, and datasets related to the target pollutants and soil media are retrieved using similar keywords or classification codes.

[0035] The same core screening criteria as those used for literature data are applied: the database must include field sampling, geographical location, and quantitative concentration information. Because these databases are typically formatted correctly, the screening process is highly efficient.

[0036] The final result of this stage is a raw spatial database containing multiple sampling points around the world, with each record associated with its geographical location, pollutant concentration, and source literature / database citations, providing basic data raw materials for subsequent data standardization and modeling.

[0037] According to one embodiment of this application, step S200 further includes: unifying the concentration unit of microplastics to particle count / kg dry soil, and unifying the concentration units of antibiotics, pesticide residues and surfactants to μg / kg dry soil; the pollutant characteristic coding system includes polymer type coding, morphological characteristic coding and color coding.

[0038] Data standardization is a key technical aspect of this invention. Addressing issues such as inconsistent units and significant differences in detection methods across different studies, a complete standardization process was established. All pollutant concentration data were uniformly converted to standard units, such as microplastics being standardized to "particle count / kg dry soil" and antibiotics to "μg / kg dry soil." Simultaneously, a pollutant classification and coding system was established, including polymer type coding, morphological characteristic coding, and color coding, to ensure data consistency and comparability.

[0039] Specifically:

Step 1: All pollutant concentration data are uniformly converted to standard units, such as microplastics being uniformly converted to "particle count / kg dry soil" and antibiotics being uniformly converted to "μg / kg dry soil", etc.

Step 2: Pollutant Feature Coding and Structured Database Construction

Design a coding system: Establish a feature classification and coding table for each type of pollutant. For example:
Polymer type codes: Polyethylene (PE): 01, Polypropylene (PP): 02, Polyester (PET): 03, ...
Morphological feature codes: Fiber: F, Fragment: G, Film: M, ...
Color codes: Transparent: T, White: W, Blue: BU, ...

Construct a structured database: The processed standardized concentration data, along with the feature codes generated in the second step and information such as geographical location and environmental factors from the original data, are entered into a relational database or structured table.

[0040] Ultimately, each data point contains a series of standardized fields, such as: sampling point ID, longitude, latitude, pollutant type, standardized concentration, polymer code, morphology code, color code, and data source.
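The unit conversion and feature coding can be illustrated with a minimal Python sketch. The code tables reuse only the example codes listed in the text. The unit strings, the conversion tables, and the function names are assumptions, and the sketch presumes the reported values are already on a dry-weight basis.

```python
# Example codes taken from the coding tables in the text (not exhaustive).
POLYMER_CODES = {"PE": "01", "PP": "02", "PET": "03"}
MORPHOLOGY_CODES = {"fiber": "F", "fragment": "G", "film": "M"}
COLOR_CODES = {"transparent": "T", "white": "W", "blue": "BU"}

# Assumed input units and their factors to the standard units.
COUNT_FACTORS = {"particles/kg": 1.0, "particles/g": 1000.0}
MASS_FACTORS = {"ug/kg": 1.0, "ng/g": 1.0, "mg/kg": 1000.0, "ug/g": 1000.0}

def standardize(pollutant_type, value, unit):
    """Convert a reported concentration to the standard unit.

    Microplastics go to particles/kg dry soil; other pollutants go to
    ug/kg dry soil. An unknown unit raises KeyError so that it is
    reviewed by hand rather than silently guessed.
    """
    if pollutant_type == "microplastic":
        return value * COUNT_FACTORS[unit], "particles/kg dry soil"
    return value * MASS_FACTORS[unit], "ug/kg dry soil"

def encode_microplastic(polymer, morphology, color):
    """Look up the three feature codes; unknown features map to None."""
    return {
        "polymer_code": POLYMER_CODES.get(polymer),
        "morphology_code": MORPHOLOGY_CODES.get(morphology),
        "color_code": COLOR_CODES.get(color),
    }

row = {
    "sampling_point_id": "P0001",  # hypothetical identifier
    "pollutant_type": "microplastic",
    "standardized_concentration": standardize("microplastic", 3.0, "particles/g"),
    **encode_microplastic("PE", "fiber", "blue"),
}
```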

[0041] According to one embodiment of this application, step s300 includes: the land use factors are extracted based on ESA CCI-LC land cover data; the soil physicochemical factors are extracted based on the SoilGrids soil database; the climate factors are extracted based on the WorldClim climate database; the human activity factors are extracted based on LandScan population distribution data; and the agricultural management factors include the intensity of plastic film use, the proportion of organic fertilizer application, and the proportion of wastewater irrigation area obtained through spatialization of literature and statistical data.

[0042] Environmental factors were integrated using GIS technology to systematically extract multidimensional environmental variables for each pollution monitoring site. Land use factors were extracted based on ESA CCI-LC 300m resolution data, including the proportion of farmland area and urban construction land within different buffer zones. Soil physicochemical factors were derived from the SoilGrids 250m resolution global soil database, including key parameters such as soil organic carbon content, clay content, pH value, and bulk density. Climate factors were obtained from WorldClim 2.1 global climate data, covering annual average precipitation, temperature, wind speed, and evapotranspiration. Human activity factors were combined with LandScan global population data and nighttime light data to quantify population density and urbanization levels. Agricultural management factors were obtained through literature review and statistical data, including specific indicators such as the intensity of plastic film use, the proportion of organic fertilizer application, and the proportion of wastewater irrigation area.
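One of the factors above, the proportion of a land cover class within a buffer zone around a sampling point, can be sketched in plain Python on a small grid. This is a simplified illustration: a real workflow would read the land cover raster with a GIS library, the buffer is measured in grid cells rather than metres, and the class codes below are placeholders rather than the actual land cover legend.

```python
CROPLAND, URBAN, OTHER = 1, 2, 3  # placeholder class codes

def class_fraction_in_buffer(grid, row, col, radius_cells, target_class):
    """Fraction of cells within a circular buffer that belong to target_class.

    grid is a 2-D list of class codes, (row, col) is the cell containing the
    sampling point, and radius_cells is the buffer radius in cells. Cells
    outside the grid are ignored.
    """
    total = 0
    hits = 0
    for r in range(max(0, row - radius_cells),
                   min(len(grid), row + radius_cells + 1)):
        for c in range(max(0, col - radius_cells),
                       min(len(grid[0]), col + radius_cells + 1)):
            if (r - row) ** 2 + (c - col) ** 2 <= radius_cells ** 2:
                total += 1
                if grid[r][c] == target_class:
                    hits += 1
    return hits / total

land_cover = [
    [CROPLAND, CROPLAND, URBAN],
    [CROPLAND, CROPLAND, URBAN],
    [OTHER,    OTHER,    URBAN],
]
cropland_share = class_fraction_in_buffer(land_cover, 1, 1, 1, CROPLAND)
urban_share = class_fraction_in_buffer(land_cover, 1, 1, 1, URBAN)
```

Repeating the call with several radii gives the "different buffer zones" variant of the same factor.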

[0043] According to one embodiment of this application, step S400 includes: S410, acquiring all sample data that have undergone standardization and spatial environmental factor integration, and performing a base-10 logarithmic transformation on the response variable, pollutant concentration; S420, using 5-fold cross-validation to train three models: Random Forest Regression (RF), XGBoost Gradient Boosting (XGB), and Support Vector Regression (SVR). The optimal hyperparameters are determined using grid search, and each model is retrained using all the data to obtain three trained prediction models: Random Forest Regression (RF_final), XGBoost Gradient Boosting (XGB_final), and Support Vector Regression (SVR_final).

[0044] S430: Input the environmental factors for the location to be predicted and obtain three predicted values: Pred_RF, Pred_XGB, and Pred_SVR. Calculate the weighted average of the prediction results from the three trained prediction models to obtain the final ensemble prediction value: Final_Prediction = 0.40 * Pred_RF + 0.45 * Pred_XGB + 0.15 * Pred_SVR. Apply the inverse transformation 10^Final_Prediction to the ensemble prediction result to obtain the final predicted value of the actual pollutant concentration.
S440: Use SPAM 2010 global farmland distribution data as a mask. Within the farmland area, for each 5km × 5km raster cell, extract the corresponding environmental factors, calculate the predicted value of the actual pollutant concentration for that cell, and aggregate the predicted values of all cells to generate a global agricultural soil pollutant concentration distribution raster map. Record the prediction intervals of the prediction results of the three trained prediction models for each cell, thereby generating a spatial layer characterizing the uncertainty of the model prediction. Reclassify the global agricultural soil pollutant concentration distribution raster map according to the set quantile thresholds to generate low, medium, and high risk level maps.

[0045] Distribution modeling employs an ensemble machine learning approach to improve prediction accuracy and stability. Three algorithms—random forest regression, XGBoost gradient boosting, and support vector regression—are used for ensemble modeling, with weights allocated as follows: random forest 40%, XGBoost 45%, and SVR 15%. Before model training, the response variable is transformed using a log10 transformation to correct for a right-skewed distribution, and 5-fold cross-validation is used to evaluate model performance. SPAM 2010 global farmland distribution data is used as a prediction mask to generate a 5km × 5km resolution global prediction raster, simultaneously outputting prediction confidence intervals and uncertainty assessment layers.

[0046] The specific steps are as follows:

S410, Data Preparation and Preprocessing

Acquire all sample data that have undergone standardization and integration of spatial environmental factors.

Response variable transformation: Pollutant concentration data usually exhibit a right-skewed distribution (a few extremely high values), so modeling the raw values directly performs poorly. Therefore, a base-10 logarithmic transformation is applied to the response variable (Y_transformed = log10(Y_original)) to bring its distribution closer to normal, which suits many machine learning models better and improves model stability.
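The transformation Y_transformed = log10(Y_original) and its inverse can be sketched in a few lines of Python. How zero or "not detected" concentrations are handled is not stated in the text, so rejecting non-positive values here is an assumption, and the sample values are illustrative.

```python
import math

def to_log10(concentration):
    """Forward transform applied to the response variable before training."""
    if concentration <= 0:
        # Assumption: non-positive values need separate handling upstream.
        raise ValueError("log10 is undefined for non-positive concentrations")
    return math.log10(concentration)

def from_log10(transformed):
    """Inverse transform used to recover a concentration from a prediction."""
    return 10.0 ** transformed

# Illustrative right-skewed sample: most values are small, one is very large.
raw = [12.0, 30.0, 45.0, 80.0, 15000.0]
transformed = [to_log10(v) for v in raw]
recovered = [from_log10(t) for t in transformed]
```

On the raw scale the largest value dominates the range, while after the transform all five values fall between about 1 and 4.2.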

[0047] Dataset partitioning: The entire dataset is randomly shuffled to prepare for subsequent cross-validation. Instead of the traditional single training / test set partitioning, a cross-validation strategy is directly adopted to evaluate and optimize the model.

S420, Model Training and Hyperparameter Tuning

[0048] Three models can be trained in parallel or sequentially: Random Forest Regression (RF), XGBoost Regression, and Support Vector Regression (SVR).

[0049] Cross-validation and performance evaluation: For each model, a 5-fold cross-validation method is used. That is, the dataset is divided into 5 equal parts; in each round, 4 parts serve as the training set and the remaining part as the validation set, and this is repeated 5 times so that each part is used for validation once. After each training iteration, evaluation metrics (such as R² and root mean square error, RMSE) are calculated on the validation set. The model's reported performance is the average of the results from the five validation runs. This makes effective use of limited data and reduces the risk of overfitting.
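The 5-fold procedure and the two metrics can be sketched in plain Python. To keep the example self-contained, a one-variable least-squares line stands in for the RF, XGBoost, and SVR models. The stand-in model, the helper names, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, validation) index lists for k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Stand-in model: ordinary least squares on a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

def cross_validate(fit, predict, xs, ys, k=5, seed=0):
    """Return the mean R² and mean RMSE over the k validation folds."""
    scores = []
    for train, val in kfold_indices(len(ys), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in val]
        truth = [ys[i] for i in val]
        scores.append((r2(truth, preds), rmse(truth, preds)))
    return (sum(s[0] for s in scores) / k, sum(s[1] for s in scores) / k)

# Synthetic log10 concentrations with a small deterministic wiggle.
xs = [float(i) for i in range(20)]
ys = [0.1 * x + 1.0 + 0.05 * ((i % 3) - 1) for i, x in enumerate(xs)]
mean_r2, mean_rmse = cross_validate(fit_line, predict_line, xs, ys)
```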

[0050] Hyperparameter optimization: A nested grid search method is used during cross-validation. A set of candidate hyperparameter combinations is defined for each model (e.g., number of trees and maximum depth for Random Forest; learning rate and maximum tree depth for XGBoost; kernel function and penalty coefficient C for SVR, etc.). The grid search iterates through all preset combinations and selects the set of hyperparameters with the best average performance in cross-validation as the final configuration for that model.
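The exhaustive traversal performed by grid search can be sketched as follows. The candidate values and the `toy_cv_score` function are hypothetical. In practice the score passed in would be the mean cross-validated metric of the model being tuned, with higher values meaning better.

```python
from itertools import product

def grid_search(param_grid, cv_score):
    """Try every combination in param_grid and keep the best-scoring one.

    param_grid maps a hyperparameter name to its candidate values.
    cv_score takes a dict of hyperparameters and returns a score where
    higher is better (for example the mean cross-validated R²).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate grid for a random forest.
rf_grid = {"n_trees": [100, 300, 500], "max_depth": [4, 8, 12]}

def toy_cv_score(params):
    """Stand-in for a real cross-validation run; peaks at 300 trees, depth 8."""
    return -((params["n_trees"] - 300) ** 2 / 1e4 + (params["max_depth"] - 8) ** 2)

best_params, best_score = grid_search(rf_grid, toy_cv_score)
```

The cost grows with the product of the candidate list lengths, which is why the candidate sets are defined per model rather than shared.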

[0051] Final model training: Using the optimal hyperparameters determined by grid search, each model is retrained with all the data to obtain three trained prediction models (RF_final, XGB_final, SVR_final).

S430, Ensemble Prediction and Result Generation

[0052] For any location requiring prediction (whether an existing sample point or a new spatial grid point), input its corresponding environmental factor data into the three trained models mentioned above to obtain three predicted values: Pred_RF, Pred_XGB, and Pred_SVR. Note that the input environmental factor data must be processed to the same standard as during training.

[0053] Weighted average ensemble: The prediction results of the three trained models are weighted and averaged according to the weights of Random Forest (40%), XGBoost (45%), and SVR (15%) to obtain the final ensemble prediction value.

[0054] Final_Prediction = 0.40 * Pred_RF + 0.45 * Pred_XGB + 0.15 * Pred_SVR

Inverse transformation of results: Since the models predict log10(concentration), the ensemble prediction must be inversely transformed (10^Final_Prediction) to obtain the final predicted value of the actual pollutant concentration.
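The weighted average and the inverse transformation can be written as a short Python sketch. The weights are those stated in the text. The function names and the sample predictions are illustrative assumptions.

```python
import math

# Ensemble weights stated in the text.
W_RF, W_XGB, W_SVR = 0.40, 0.45, 0.15

def final_prediction(pred_rf, pred_xgb, pred_svr):
    """Weighted average of the three log10-scale predictions."""
    return W_RF * pred_rf + W_XGB * pred_xgb + W_SVR * pred_svr

def predicted_concentration(pred_rf, pred_xgb, pred_svr):
    """Back-transform the ensemble prediction to the concentration scale."""
    return 10.0 ** final_prediction(pred_rf, pred_xgb, pred_svr)

# Illustrative log10 predictions from the three trained models.
log_pred = final_prediction(2.1, 2.3, 1.9)
concentration = predicted_concentration(2.1, 2.3, 1.9)
```

Because the averaging happens on the log10 scale, the back-transformed value is a weighted geometric mean of the three models' concentration estimates, not an arithmetic mean.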

S440, Global Spatial Prediction and Uncertainty Output

[0055] Prepare global environmental factor raster layers: Process all types of environmental factor data (soil, climate, population, agricultural management, etc.) into global raster data with consistent spatial range and uniform resolution.

[0056] Applying farmland masking: Using SPAM 2010 global farmland distribution data as a mask, predictions are made only for agricultural areas worldwide, while non-agricultural areas (such as forests, water bodies, and urban built-up areas) are excluded. This makes the predictions more targeted and saves computational resources.

[0057] Pixel-by-pixel prediction: Within the farmland area, for each 5km × 5km raster pixel, extract its complete set of environmental factor values. Then input these values into the integrated model process described above to calculate the predicted pollutant concentration for that pixel.
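The masked, pixel-by-pixel loop can be sketched as below. This is an illustration under stated assumptions: rasters are plain nested lists, `predict_fn` is a hypothetical stand-in for the integrated model, and pixels with any missing factor value are left as no-data (the description does not say how incomplete pixels are handled).

```python
def predict_masked_raster(factor_grid, farmland_mask, predict_fn, nodata=None):
    """Predict only where the mask marks farmland.

    factor_grid[r][c] is the list of environmental factor values for a pixel;
    farmland_mask[r][c] is True for agricultural pixels.
    """
    output = []
    for factor_row, mask_row in zip(factor_grid, farmland_mask):
        out_row = []
        for features, is_farmland in zip(factor_row, mask_row):
            complete = features is not None and None not in features
            if is_farmland and complete:
                out_row.append(predict_fn(features))
            else:
                # Non-farmland or incomplete pixels are skipped (assumption).
                out_row.append(nodata)
        output.append(out_row)
    return output


# Made-up 2 x 2 raster; the "model" here simply sums the factor values.
factor_grid = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, None], [0.5, 0.5]]]
farmland_mask = [[True, False], [True, True]]
predicted = predict_masked_raster(factor_grid, farmland_mask, lambda f: sum(f))
```

Skipping masked-out pixels before calling the model is what saves computation: the model is never evaluated for forests, water bodies, or built-up areas.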

[0058] Output: Predicted distribution map: The predicted values of all pixels are aggregated to generate a global raster map (heat map) of agricultural soil pollutant concentration distribution.

[0059] Prediction uncertainty layer: During the integration process, the spread of the three trained base models' predictions at each pixel (for example, their standard deviation or a prediction interval) can be recorded to generate a spatial layer that represents the uncertainty of the model prediction.
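One simple way to build such a layer is to take, for each pixel, the standard deviation of the three base-model predictions. The sketch below assumes nested-list rasters on the log10 scale with `None` as no-data, and uses the population standard deviation; these choices and the function name are assumptions, not details given in the description.

```python
import statistics


def uncertainty_layer(pred_rf, pred_xgb, pred_svr):
    """Per-pixel standard deviation of the three base-model predictions."""
    layer = []
    for rows in zip(pred_rf, pred_xgb, pred_svr):
        layer.append(
            [
                None if None in triple else statistics.pstdev(triple)
                for triple in zip(*rows)
            ]
        )
    return layer


# Made-up 2 x 2 prediction rasters (log10 scale); None marks non-farmland.
rf = [[2.0, 1.0], [None, 3.0]]
xgb = [[2.0, 2.0], [None, 3.5]]
svr = [[2.0, 3.0], [None, 2.5]]
uncertainty = uncertainty_layer(rf, xgb, svr)
```

Pixels where the three models agree receive a value of zero, while pixels where they diverge receive larger values and can be flagged for cautious interpretation.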

[0060] Risk level map: The predicted distribution map is reclassified according to the set quantile threshold to generate low, medium and high risk level maps.

[0061] According to one embodiment of this application, step S500 includes: S510, calculating the feature importance score of each environmental factor using the trained random forest regression model, and extracting the pollutant type features corresponding to each sample point from the structured database; S520, sorting all environmental factors from high to low by importance score to quantitatively identify the dominant driving factors; S530, determining the source category based on the dominant driving factors and the pollutant type features; S540, creating a feature vector for each sample point that includes the predicted pollutant concentration and the dominant driving factor values, then performing K-means clustering analysis to identify and output global pollution spatial patterns.

[0062] Source tracing analysis of pollution is performed based on the importance ranking of variables in the model and the characteristics of pollutant types. The dominant driving factors are quantitatively identified using the feature importance scores of random forests. The K-means clustering algorithm is used to identify global pollution spatial patterns, dividing global agricultural regions into different pollution spatial pattern types, such as high-intensity agricultural pollution areas, mixed pollution areas on the urban fringe, and areas dominated by wastewater irrigation.

[0063] Specifically: Step 1: Data preparation and model output extraction. Input data: the random forest regression model trained in the "distribution modeling and prediction" step, and the complete dataset used to train it (i.e., sample data containing pollutant concentrations and all environmental factors).

[0064] Extracting internal model information: Feature importance score: A feature importance score is calculated for each environmental factor (feature variable) from the trained random forest model. Common methods include impurity-based importance (the mean decrease in node impurity, which for a regression forest is the reduction in variance) and permutation importance. The score quantifies the contribution of each environmental factor to predicting variation in pollutant concentration. A higher score indicates a stronger association between the factor and the pollutant distribution, suggesting that it may be a more critical driver.
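A permutation-style importance score, one common way of scoring features, can be sketched without any library support: shuffle one feature column at a time and measure how much the prediction error increases. In the sketch below the trained random forest is replaced by a hypothetical linear `predict_fn`, and the data are made up.

```python
import random
import statistics


def mse(predict_fn, rows, targets):
    """Mean squared error of predict_fn over the given samples."""
    return statistics.fmean((predict_fn(r) - t) ** 2 for r, t in zip(rows, targets))


def permutation_importance(predict_fn, rows, targets, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(predict_fn, rows, targets)
    scores = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(predict_fn, shuffled, targets) - baseline)
        scores.append(statistics.fmean(increases))
    return scores


# Stand-in model: depends on feature 0 only, ignores feature 1.
def predict_fn(row):
    return 3.0 * row[0] + 0.0 * row[1]


rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
targets = [3.0 * r[0] for r in rows]
scores = permutation_importance(predict_fn, rows, targets)
```

Shuffling the feature the model relies on degrades its predictions, so that feature receives a positive score, whereas shuffling the ignored feature changes nothing and scores zero.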

[0065] Pollutant feature data: Extract the pollutant type features corresponding to each sample point from the structured database. For example, for microplastics: polymer type (e.g., polyethylene PE, polypropylene PP) and morphology (fibers, fragments, films).

[0066] For antibiotics / pesticides: compound type, application scenario (veterinary, human, herbicide, insecticide).

[0067] Step 2: Quantitatively identify the dominant driving factors. This step is based entirely on the feature importance scores of the random forest.

[0068] All environmental factors are ranked from highest to lowest according to their importance score.

[0069] Set a threshold (e.g., the top N factors whose cumulative importance reaches 80%, or an inflection point above which the importance scores are markedly higher than those of the remaining factors) to select the set of dominant driving factors. For example, the analysis results may show that "intensity of plastic film use", "proportion of wastewater irrigation", and "population density" are the top three dominant driving factors.
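The cumulative-contribution rule can be illustrated with a short sketch. The importance scores and factor names below are invented for the example, and `select_dominant_factors` is a hypothetical helper; only the 80% cut-off comes from the description.

```python
def select_dominant_factors(importances, cumulative_share=0.80):
    """Rank factors by importance and keep the smallest leading set whose
    normalized importances sum to at least `cumulative_share`."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, score in ranked:
        selected.append(name)
        running += score / total
        if running >= cumulative_share:
            break
    return selected


# Made-up importance scores for illustration only.
importances = {
    "soil_ph": 0.10,
    "plastic_film_use": 0.35,
    "annual_precipitation": 0.05,
    "wastewater_irrigation_share": 0.30,
    "population_density": 0.20,
}
dominant = select_dominant_factors(importances)
```

With these numbers the running total is 0.35, then 0.65, then 0.85, so the first three ranked factors are retained and the remaining two are dropped.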

[0070] Step 3: Infer the source category based on pollutant type characteristics. This step examines whether the main pollutant types detected in regions with high values of the dominant driving factors follow a pattern. For example: if the "plastic film usage intensity" factor is highly important in a certain area, and the microplastics in that area are mainly polymers commonly used in agricultural films (such as low-density polyethylene, LDPE), then it can be inferred that the pollution mainly comes from agricultural film residues.

[0071] If a region has a high proportion of wastewater irrigation and a high population density, and large amounts of microplastic fibers or specific medical antibiotics found in personal care products are detected, it can be inferred that the pollution mainly comes from irrigation with domestic sewage or medical wastewater.

[0072] If the "organic fertilizer application rate" factor is highly important in a certain area and veterinary antibiotics are detected, it can be inferred that the pollution source mainly comes from the application of animal manure organic fertilizer.
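The three inference rules above can be expressed as a small rule table. The sketch is illustrative only: the factor and pollutant identifiers are hypothetical codes chosen for the example, and a real coding system would come from the structured database.

```python
def infer_source(dominant_factors, pollutant_types):
    """Map dominant factors plus detected pollutant types to a source category,
    following the three example rules in the description."""
    factors = set(dominant_factors)
    types = set(pollutant_types)
    if "plastic_film_use" in factors and "LDPE" in types:
        return "agricultural film residues"
    if {"wastewater_irrigation_share", "population_density"} <= factors and (
        "microplastic_fiber" in types or "medical_antibiotic" in types
    ):
        return "domestic sewage or medical wastewater irrigation"
    if "organic_fertilizer_rate" in factors and "veterinary_antibiotic" in types:
        return "animal manure organic fertilizer"
    # No rule matched: leave the source undetermined rather than guess.
    return "undetermined"


film_source = infer_source(["plastic_film_use"], ["LDPE", "PP"])
sewage_source = infer_source(
    ["wastewater_irrigation_share", "population_density"], ["microplastic_fiber"]
)
manure_source = infer_source(["organic_fertilizer_rate"], ["veterinary_antibiotic"])
unknown_source = infer_source(["soil_ph"], ["PP"])
```

Returning an explicit "undetermined" category keeps areas that match none of the patterns from being forced into a source class.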

[0073] Step 4: Spatial clustering to identify pollution pattern types. This step aims to delineate different types of contaminated areas from a macroscopic spatial perspective.

[0074] 1. Construct the clustering feature matrix: Create a feature vector for each predicted raster cell (e.g., 5km × 5km) or each sample point globally. This vector typically includes: Pollution level: The predicted pollutant concentration at this location.

[0075] Dominant driving factor value: The numerical value of several key environmental factors identified in the second step.

[0076] Pollutant characteristic indicators: such as the proportion of a certain dominant polymer type.

[0077] 2. Perform K-means clustering: Perform K-means clustering analysis on the constructed global spatial feature matrix. The number of clusters K needs to be pre-defined.

[0078] The algorithm divides all spatial units into K clusters, such that units within the same cluster are similar in terms of pollution level and combination of driving factors, while there are significant differences between different clusters.
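The clustering step can be sketched with a plain implementation of K-means (Lloyd's algorithm). Standardizing the features first is an assumption added here so that no single factor dominates the distance; the feature values are made up, and K is fixed at 2 for brevity.

```python
import math
import random
import statistics


def standardize(rows):
    """Z-score each column; constant columns are left unscaled."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, n_iter=100, seed=0):
    """Assign each row to the nearest centroid and recompute centroids
    until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c])) for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [statistics.fmean(col) for col in zip(*members)]
    return labels, centroids


# Made-up feature vectors: [pollution level, driving factor value].
features = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
labels, centroids = kmeans(standardize(features), k=2)
```

The cluster numbers themselves are arbitrary; what matters is which spatial units share a label, and the centroid of each cluster is what gets interpreted in the next step.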

[0079] 3. Interpret the clustering results and define the contamination pattern: Analyze the central characteristics of each cluster (i.e., the average level of each feature across all units in that cluster). For example: Cluster A: Characterized by "extremely high pollutant concentrations, with extremely high values for the dominant driving factors 'intensity of plastic film use' and 'amount of organic fertilizer application'." If the pollutants in this area are mainly agricultural film polymers and veterinary antibiotics, then this area can be defined as a "high-intensity agricultural pollution zone."

[0080] Cluster B: Characterized by "high pollutant concentrations, with high dominant driving factors 'population density' and 'wastewater irrigation ratio,' but moderate agricultural factor values." It can be defined as a "mixed pollution zone on the urban fringe" (affected by both urban emissions and agricultural activities).

[0081] Cluster C: Characterized by "moderate pollutant concentrations, with the dominant driving factor 'proportion of wastewater irrigation' being exceptionally prominent, while other factors are not significant." It can be defined as a "wastewater irrigation-dominated area."

[0082] 4. Output: Generate a global map of spatial patterns of pollution in agricultural regions. Different colors on the map represent different spatial patterns of pollution (such as high-intensity agricultural pollution areas, mixed pollution areas on the edge of cities, and areas dominated by wastewater irrigation).

[0083] According to one embodiment of this application, step S600 includes: S610, applying the farmland mask to the global pollutant concentration prediction raster map; S620, extracting the pollutant concentration values of all agricultural pixels from the masked raster map to form a concentration value array, and grading the predicted concentration values by the quantile threshold method according to their percentile in the global distribution, where low-risk areas are no greater than the 50th percentile, medium-risk areas lie between the 50th and 90th percentiles, high-risk areas lie between the 90th and 95th percentiles, and extremely high-risk areas are greater than the 95th percentile; S630, iterating through each agricultural pixel in the global pollutant concentration prediction raster map, comparing its predicted concentration value with the thresholds calculated in step S620, and assigning it a risk level code according to the following rules: if the pixel concentration value <= P50, it is defined as a "low-risk area".

[0084] If P50 < pixel concentration value <= P90, it is defined as a "medium-risk area".

[0085] If P90 < pixel concentration value <= P95, it is defined as a "high-risk area".

[0086] If the pixel concentration value is greater than P95, it is defined as an "extremely high-risk area".

[0087] Areas above the 90th percentile are defined as "pollution hotspots"; after all pixels have been classified, a risk level raster map is generated.

[0088] A standardized risk grading system is established for hotspot identification and risk assessment. The predicted concentration values are graded by the quantile threshold method according to their percentile in the global distribution. Low-risk areas are those no greater than the 50th percentile, medium-risk areas those between the 50th and 90th percentiles, high-risk areas those between the 90th and 95th percentiles, and extremely high-risk areas those greater than the 95th percentile.

[0089] The specific analysis steps are as follows: Step 1: Data preparation: obtaining the predicted concentration raster. Input data: the global pollutant concentration prediction raster map generated in the "Distribution Modeling and Prediction" step. Each pixel in the map (e.g., 5km × 5km) contains a predicted pollutant concentration value.

[0090] Data scope: Apply the farmland mask (such as SPAM 2010 data) so that subsequent analysis covers only global agricultural areas and excludes non-agricultural land such as oceans, forests, and deserts.

[0091] Step 2: Statistical analysis: calculating global percentile thresholds. Extract all cell values: extract the pollutant concentration values of all agricultural cells from the masked global prediction raster to form an array of concentration values.

[0092] Calculate key percentiles: Perform statistical analysis on this array of concentration values to calculate the following three key percentile values: 50th percentile: also known as the median. 50% of agricultural pixels are predicted to have a concentration below or equal to this value.

[0093] 90th percentile: 90% of agricultural pixels are predicted to have a concentration below or equal to this value.

[0094] 95th percentile: 95% of agricultural pixels are predicted to have a concentration below or equal to this value.

[0095] Thresholds: Based on these calculated percentile values, establish the thresholds for classifying risk levels. Low risk: concentration ≤ P50 (50th percentile); medium risk: P50 < concentration ≤ P90; high risk: P90 < concentration ≤ P95; extremely high risk: concentration > P95. Step 3: Spatial classification: assigning a risk level to each cell. This step applies the statistical thresholds to every cell in the raster.

[0096] Reclassification operation: Iterate through every agricultural cell in the global prediction raster, compare its predicted concentration value with the thresholds calculated in the second step, and assign it a risk level code according to the following rules (e.g., 1 = low risk, 2 = medium risk, 3 = high risk, 4 = extremely high risk): If the pixel concentration value is ≤ P50, it is defined as a "low-risk area".

[0097] If P50 < pixel concentration value ≤ P90, it is defined as a "medium-risk area".

[0098] If P90 < pixel concentration value ≤ P95, it is defined as a "high-risk area".

[0099] If the pixel concentration value is greater than P95, it is defined as an "extremely high-risk area".

[0100] Areas above the 90th percentile (P90) are defined as "pollution hotspots". Therefore, "high-risk areas" and "extremely high-risk areas" together constitute the "hotspots" that require priority attention.
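The threshold calculation and reclassification can be sketched together. The sketch assumes a nested-list raster with `None` for masked-out pixels and uses the nearest-rank definition of a percentile; GIS and statistics packages often interpolate instead, so the exact thresholds may differ slightly. All function names are hypothetical.

```python
import math


def nearest_rank_percentile(sorted_values, q):
    """Smallest value such that at least q% of the data are <= it."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def risk_thresholds(raster):
    """P50, P90 and P95 over all agricultural (non-None) pixels."""
    values = sorted(v for row in raster for v in row if v is not None)
    return tuple(nearest_rank_percentile(values, q) for q in (50, 90, 95))


def risk_code(value, p50, p90, p95):
    """1 = low, 2 = medium, 3 = high, 4 = extremely high risk."""
    if value <= p50:
        return 1
    if value <= p90:
        return 2
    if value <= p95:
        return 3
    return 4


def classify_raster(raster):
    """Replace each concentration with its risk code; keep None as no-data."""
    p50, p90, p95 = risk_thresholds(raster)
    return [
        [None if v is None else risk_code(v, p50, p90, p95) for v in row]
        for row in raster
    ]


def is_hotspot(code):
    """Hotspots are pixels above P90, i.e. codes 3 and 4."""
    return code is not None and code >= 3


# Made-up raster: concentrations 1..100 plus one row of non-farmland pixels.
raster = [[float(r * 10 + c + 1) for c in range(10)] for r in range(10)]
raster.append([None] * 10)
thresholds = risk_thresholds(raster)
codes = classify_raster(raster)
```

For this toy raster the thresholds fall at 50, 90 and 95, so 50 pixels are low risk, 40 are medium risk, and the top 10 pixels (5 high and 5 extremely high) form the hotspots.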

[0101] Step 4: Result post-processing and visualization. Generate a risk level raster map: After all cells have been classified, a new raster map is generated in which the value of each cell is no longer a concentration but its corresponding risk level code.

[0102] Mapping and visualization: Using geographic information system software, each risk level code is assigned an intuitive color (e.g., green for low risk, yellow for medium risk, orange for high risk, and red for extremely high risk) to form the final "Global Agricultural Soil Pollutant Risk Distribution Map".

[0103] Spatial statistics: The proportion of the total global agricultural area occupied by each risk level, as well as its distribution across continents or countries, can be further analyzed to provide quantitative data for the report.
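The share of each risk level can be computed directly from the risk level raster, as in the sketch below. It assumes that every pixel covers the same area; on a latitude/longitude grid the counts would need to be weighted by pixel area. The function name and the sample raster are hypothetical.

```python
from collections import Counter


def risk_level_shares(risk_raster, nodata=None):
    """Fraction of agricultural pixels falling in each risk level code."""
    counts = Counter(
        code for row in risk_raster for code in row if code != nodata
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: counts[code] / total for code in sorted(counts)}


# Made-up risk raster: 1 = low, 2 = medium, 4 = extremely high, None = no data.
risk_raster = [[1, 1, None], [2, 4, 1]]
shares = risk_level_shares(risk_raster)
```

Running the same function on the subset of pixels belonging to one continent or country would give the regional breakdown mentioned above.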

[0104] Step 5: Output and interpretation. Core deliverables: Global pollutant risk level distribution map: the spatial pattern of risk presented at a resolution of 5-10 kilometers.

[0105] Risk statistics report: includes the area and percentage of each risk level.

[0106] Hotspot List / Map: Clearly identifies the geographical location and extent of "high-risk" and "extremely high-risk" areas (i.e., hotspots).

[0107] By combining the results of the "pollution source tracing analysis," the causes of the identified hotspots can be explained. For example, it can be pointed out that a certain extremely high-risk area is mainly driven by high-intensity sewage irrigation, while another high-risk area is closely related to mulching and the large-scale application of organic fertilizers.

[0108] According to a second aspect of the present invention, a meta-modeling system for the global distribution of novel pollutants is provided, which corresponds to the meta-modeling method for the global distribution of novel pollutants described above.

[0109] In some alternative embodiments, the meta-modeling system for the global distribution of novel pollutants includes: a data acquisition module, which acquires soil pollutant sampling data from around the world, including geographic location information and quantitative concentration information; a data standardization module, which standardizes the acquired sampling data and establishes a pollutant feature coding system to generate a standardized database; a spatial analysis module, which extracts environmental factors for each sampling point in the standardized database, including land use factors and human activity factors; a machine learning modeling and prediction module, which takes the standardized pollutant concentration as the response variable and the environmental factors as the feature variables, trains an integrated machine learning model to establish the response relationship between pollutant concentration and environmental factors, and uses the trained model to predict the spatial distribution of pollutants in global agricultural regions; a source tracing and pattern recognition module, which, based on the variable importance analysis results of the integrated machine learning model and combined with pollutant characteristics, identifies the dominant driving factors affecting pollutant distribution and classifies different pollution spatial patterns; and a risk assessment and visualization module, which, based on the predicted spatial distribution of pollutants, classifies pollution risk levels using a threshold method based on global percentiles, identifies pollution hotspots, and generates a risk distribution map.

[0110] This invention, through systematic technical implementation, achieves high-precision prediction of the global distribution of new pollutants and quantitative source tracing of pollution, filling the technological gap in global-scale agricultural soil pollution distribution modeling. This method not only has significant scientific value but also provides crucial scientific and technological support for ensuring global food security, promoting sustainable agricultural development, and advancing environmental protection. Compared to traditional comprehensive sampling methods, it saves 60-80% on monitoring costs, demonstrating broad application prospects and significant socio-economic benefits.

[0111] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort. Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0112] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A novel meta-modeling method for the global distribution of pollutants, characterized in that it includes the following steps: S100, acquiring soil pollutant sampling data from around the world, including geographic location information and quantitative concentration information; S200, standardizing the acquired sampling data, establishing a pollutant feature coding system, and generating a standardized database; S300, extracting environmental factors for each sampling point, including land use factors and human activity factors; S400, training an integrated machine learning model and using the trained model to predict the spatial distribution of pollutants in global agricultural regions; S500, based on the variable importance analysis results of the integrated machine learning model and combined with pollutant characteristics, identifying the dominant driving factors affecting pollutant distribution and dividing different pollution spatial patterns; S600, based on the predicted spatial distribution of pollutants, classifying pollution risk levels using a threshold method based on global percentiles, identifying pollution hotspots, and generating a risk distribution map.

2. The method as described in claim 1, characterized in that: Step s200 includes: unifying the concentration unit of microplastics to particle count / kg dry soil, and unifying the concentration unit of antibiotics, pesticide residues and surfactants to μg / kg dry soil; the pollutant characteristic coding system includes polymer type coding, morphological characteristic coding and color coding.

3. The method as described in claim 1, characterized in that: Step s300 includes extracting the land use factor based on ESACCI-LC land cover data and extracting the human activity factor based on LandScan population distribution data.

4. The method as described in claim 1, characterized in that: Step s400 includes: s410 performs a base-10 logarithmic transformation on the response variable, pollutant concentration. S420 uses a 5-fold cross-validation method to train three models: Random Forest Regression (RF), XGBoost Gradient Boosting, and Support Vector Regression (SVR). The optimal hyperparameters are determined using grid search, and each model is retrained using all the data to obtain three trained prediction models: Random Forest Regression (RF_final), XGBoost Gradient Boosting (XGB_final), and Support Vector Regression (SVR_final).

5. The method as described in claim 4, characterized in that: Step s400 includes: S430: The prediction results of the three trained prediction models are weighted and averaged to obtain the final integrated prediction value; the integrated prediction result is then inversely transformed to obtain the final predicted value of the actual pollutant concentration. S440 uses global farmland distribution data as a mask to calculate the predicted actual pollutant concentration for each 5km × 5km raster cell, generating a global agricultural soil pollutant concentration distribution raster map.

6. The method as described in claim 1, characterized in that: Step s500 includes: S510 calculates the feature importance score of each environmental factor using a trained random forest regression model and extracts the pollutant type feature corresponding to each sample point. S520 sorts all environmental factors from highest to lowest importance score to quantitatively identify the dominant driving factors.

7. The method as described in claim 6, characterized in that: Step s500 includes: S530, based on the dominant driving factors and pollutant type characteristics, determines the source category; S540 creates a feature vector for each sample point, which includes the predicted pollutant concentration and the dominant driving factor value; it then performs K-means clustering analysis to output a global pollution spatial pattern.

8. The method as described in claim 1, characterized in that: Step s600 includes: S610, applying farmland masking to the raster map for predicting global pollutant concentrations; S620 extracts pollutant concentration values from all agricultural pixels to form a concentration value array; the predicted concentration values are classified into risk levels according to the percentile of the global distribution using the quantile threshold method. The low-risk area is defined as no greater than the 50th percentile, the medium-risk area is the 50th-90th percentile, the high-risk area is greater than the 90th percentile and no greater than the 95th percentile, and the extremely high-risk area is greater than the 95th percentile.

9. The method as described in claim 8, characterized in that: Step s600 includes: s630: Traverse every agricultural cell in the global pollutant concentration prediction raster map, compare its predicted concentration value with the threshold calculated in step s620, and assign it a risk level code according to the following rules; after completing the classification of all cells, generate a risk level raster map.

10. A novel meta-modeling system for the global distribution of pollutants, including: a data acquisition module, used to acquire soil pollutant sampling data from around the world, including geographic location information and quantitative concentration information; a data standardization module, used to standardize the acquired sampling data, establish a pollutant feature coding system, and generate a standardized database; a spatial analysis module, used to extract environmental factors for each sampling point, including land use factors and human activity factors; a machine learning modeling and prediction module, used to train an integrated machine learning model with standardized pollutant concentration as the response variable and the environmental factors as the feature variables, and then use the trained model to predict the spatial distribution of pollutants in global agricultural areas; a source tracing and pattern recognition module, used to identify the dominant driving factors affecting the distribution of pollutants and to classify different pollution spatial patterns based on the variable importance analysis results of the integrated machine learning model and the characteristics of pollutants; and a risk assessment and visualization module, used to classify pollution risk levels based on the predicted spatial distribution of pollutants using a threshold method based on global percentiles, identify pollution hotspots, and generate risk distribution maps.