PM2.5 concentration estimation method considering interaction features
By constructing a PM2.5 concentration estimation method that takes into account interactive features and combining it with the Light Gradient Boosting Machine (LightGBM), the shortcomings of existing models in describing spatiotemporal features and nonlinear relationships are solved, and continuous mapping of daily average PM2.5 concentration is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN UNIV
- Filing Date
- 2023-07-14
- Publication Date
- 2026-06-23
AI Technical Summary
Existing models lack the ability to describe spatiotemporal features and nonlinear relationships, making it difficult to effectively fit daily average PM2.5 concentrations.
A PM2.5 concentration estimation method that takes interaction features into account is constructed. Ground-based PM2.5 concentration monitoring station data and multi-source remote sensing image data are acquired, spatial, interaction, and temporal features are generated from them, and the model is fitted with the Light Gradient Boosting Machine (LightGBM).
It enables continuous mapping of daily average PM2.5 concentration, improving the model's ability to describe spatiotemporal features and the fitting accuracy of nonlinear relationships.
Figure CN116992230B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the technical field of PM2.5 concentration spatial statistical analysis, and specifically relates to a PM2.5 concentration estimation method that takes interaction features into account.
Background Technology
[0002] PM2.5 refers to fine particulate matter in the air with an aerodynamic diameter ≤ 2.5 μm. Studies have shown that high concentrations of PM2.5 can obstruct vision and affect traffic, and that long-term exposure can damage the human respiratory, cardiovascular, and immune systems.
[0003] Air quality monitoring stations are sparsely distributed, and many areas have none, so the existing stations cannot meet the need for large-scale PM2.5 concentration monitoring. Many researchers have therefore attempted to use information from existing air quality stations to estimate PM2.5 concentration in areas that lack stations. Liu Xiangxiang (Application / Patent No.: CN201910013379.2) used an empirical Bayesian kriging model to estimate surface PM2.5 concentration. This method takes the spatial characteristics of PM2.5 concentration into account and uses the concentration information from existing PM2.5 monitoring stations, but does not consider the spatiotemporal effects of PM2.5 concentration.
[0004] As satellite remote sensing matures, and thanks to its large-scale, seamless coverage, more and more researchers combine remote sensing image data to fit surface PM2.5 concentration. The factors studied as influencing PM2.5 concentration include aerosol optical depth (AOD), surface temperature, precipitation, relative humidity, elevation, population density, transportation network, and land use. Liu Jun et al. (Application / Patent No.: CN201711398781.4) used MODIS AOD data and machine learning methods to establish the relationship between remote sensing imagery and measured PM2.5 concentration. Chen Jiangping et al. (Application / Patent No.: CN201810332580.2) used AOD data from the Himawari-8 satellite, combined with meteorological factors and geographical information, to construct a mixed-effects regression model that estimates near-surface PM2.5 concentration. However, some of these methods do not consider the spatial and spatiotemporal effects of PM2.5 concentration, which limits model accuracy.
[0005] Studies have shown that models that take the spatial and spatiotemporal effects of PM2.5 concentration into account perform better. Sun Xiaomin et al. (Application / Patent No.: CN202110308071.8) used satellite remote sensing data and a geographically weighted regression algorithm to fit PM2.5 concentration. Shi Shuo et al. (Application / Patent No.: CN201811578840.0) established a random forest regression model using regional meteorological dynamics indicators and satellite AOD as explanatory variables, combined it with ordinary kriging interpolation, calculated the fitting residuals of the random forest regression model and the ordinary kriging algorithm, and merged the fitting results of the two models by inverse variance weighting to obtain the regional PM2.5 concentration. Chen Yumin et al. (Application / Patent No.: CN201711479275.8) used the eigenvector spatial filtering (ESF) method to retrieve surface PM2.5 concentration, improving model accuracy by adding spatial influence factors to the regression model; however, that model did not consider spatial heterogeneity. On this basis, Chen Yumin et al. (Application / Patent No.: CN201811644669.9) constructed a spatially varying coefficient PM2.5 concentration estimation model based on the Re-ESF algorithm, which considers the spatial heterogeneity of the PM2.5 concentration distribution, and further took into account the multiscale effects of the independent variables on PM2.5 concentration (Application / Patent No.: 202210578832.6). However, the above studies do not consider spatiotemporal characteristics, do not consider the nonlinear relationship between the relevant independent variables and PM2.5 concentration, and do not take into account the interaction effect between spatial characteristics and independent variables.
[0006] This invention uses multi-source remote sensing data and PM2.5 concentration observations from ground-based air quality stations to propose a PM2.5 concentration estimation model that takes interaction features into account. It addresses the difficulty of fitting daily average PM2.5 concentration that arises because ground-based air monitoring stations are unevenly distributed and too few, existing models lack descriptions of spatiotemporal features, and classical models cannot describe nonlinear relationships.
Summary of the Invention
[0007] The purpose of this invention is to address the shortcomings of the prior art by providing a PM2.5 concentration estimation method that takes interaction features into account. The method constructs spatial, interaction, and temporal features and combines them with a Light Gradient Boosting Machine (LightGBM) to overcome the inability of existing models to describe spatiotemporal features and of classical models to describe nonlinear relationships, and thereby achieves mapping of daily average PM2.5 concentration.
[0008] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0009] A PM2.5 concentration estimation method that takes interaction features into account, comprising the following steps:
[0010] Step 1: Obtain ground-level PM2.5 concentration monitoring station data and preprocess it;
[0011] Step 2: Acquire remote sensing image data of the study area within a certain time range and use it as influencing factors. After preprocessing the remote sensing image data, synthesize daily / monthly average values to obtain the influencing factor data at daily / monthly scale;
[0012] Step 3: Spatially join the monitoring stations from Step 1 with the influencing factors from Step 2;
[0013] Step 4: Extract the spatial eigenvalues and eigenvectors of the stations based on the station coordinates;
[0014] Step 5: Construct time-dimension features based on the observation time of the stations;
[0015] Step 6: Combine the influencing factors with the spatial feature vectors generated in Step 4 to form spatial interaction features;
[0016] Step 7: Perform variable importance analysis on the spatial interaction features obtained in Step 6 against PM2.5 concentration, and filter the spatial interaction features by importance to obtain the filtered interaction features;
[0017] Step 8: Combine the daily average PM2.5 concentration data and influencing factor data from Step 3, the spatial features from Step 4, the temporal features from Step 5, and the interaction features from Step 7 by station and date to form the daily average PM2.5 concentration and feature dataset;
[0018] Step 9: Use the feature dataset from Step 8 as the model input features and the daily average PM2.5 concentration from Step 8 as the dependent variable to construct the ST-LightGBM model;
[0019] Step 10: Interpolate the spatial feature vectors from Step 4 to obtain spatial feature vectors for each location within the coverage area, and regenerate the interaction features. Then input the influencing factors, time-dimension features, spatial feature vectors, and filtered interaction features into the ST-LightGBM model constructed in Step 9 to obtain the continuous PM2.5 concentration map of the study area;
[0020] Step 11: Evaluate the accuracy of the ST-LightGBM model.
[0021] Furthermore, the preprocessing method in step 1 is as follows:
[0022] The original PM2.5 concentration data from the monitoring stations are converted into daily average concentration values by averaging. At the same time, a target coordinate system is determined, and the daily average concentration values of each monitoring station are projected into the target coordinate system.
[0023] Furthermore, the influencing factors in step 2 include, but are not limited to, aerosol optical depth, air temperature, air pressure, relative humidity, elevation, vegetation cover, planetary boundary layer height, land use, and population.
[0024] Furthermore, the method for preprocessing remote sensing image data in step 2 is as follows:
[0025] The remote sensing image data acquired within a certain time range is corrected according to different product types and then projected to the same coordinate system as in step 1, with uniform spatial resolution.
[0026] Furthermore, the method for synthesizing monthly average values of remote sensing images is as follows:
[0027] For each pixel in the raster image, count all values and the total number of values for that pixel in that month, and average them to get the monthly average value for that pixel; if there are no values for that pixel in that month, then the pixel is assigned an empty value.
[0028] Furthermore, the specific method for step 3 is as follows:
[0029] According to the coordinates of the PM2.5 monitoring stations, the value of each influencing factor is extracted from the corresponding remote sensing raster at each station's coordinates, forming a data table that contains the coordinates, the PM2.5 concentration, and the values of the influencing factors at each coordinate point over multiple time periods.
[0030] Furthermore, step 4 includes the following steps:
[0031] Step 4.1: Construct the site spatial weight matrix using a Gaussian inverse distance spatial weight model based on the site spatial coordinates, specifically as follows:
[0032]
[0033] Wherein, the spatial weight matrix W is an n×n matrix; n is the number of stations; $W_{i,j}$ represents the weight between station i and station j; exp represents the exponential function; $d_{i,j}$ represents the distance between station i and station j; r represents the length of the longest edge in the minimum spanning tree formed by the stations, or can be set by the user.
[0034] Step 4.2, centralize the spatial weight matrix, the centralization formula is as follows:
[0035] $W_1 = \left(I - \mathbf{1}\mathbf{1}^{T}/n\right) W \left(I - \mathbf{1}\mathbf{1}^{T}/n\right)$
[0036] Where $W_1$ is the centered spatial weight matrix; W is the spatial weight matrix constructed in step 4.1; n represents the number of stations; I is an n-dimensional identity matrix; 1 represents an n-dimensional column vector with all elements equal to 1;
[0037] Step 4.3: Extract eigenvalues and eigenvectors from the centered spatial weight matrix. The solution formula is as follows:
[0038] $W_1 E = \lambda E$
[0039]
[0040] Where $W_1$ is the centered spatial weight matrix obtained in step 4.2, E represents the eigenvectors, and λ represents the eigenvalues. Solving the equation yields the eigenvalues and eigenvectors. The eigenvectors obtained from the centered spatial weight matrix represent different spatial distribution patterns and are called spatial eigenvectors. The obtained spatial eigenvectors are first preliminarily screened against a threshold, where $\lambda_{max}$ represents the largest eigenvalue, $\lambda_i$ represents the eigenvalue corresponding to each spatial eigenvector, and s is the threshold.
[0041] Furthermore, the screening method in step 7 is as follows:
[0042] Variable importance analysis is performed on the spatial interaction features obtained in step 6 against PM2.5 concentration, and the features are sorted from most to least important. The top-ranked interaction features that together account for 85% of the cumulative importance are selected as the filtered features, as shown in the formula below:
[0043]
[0044] In the formula, $Importance_i$ represents the importance of the i-th interaction feature, and n is the total number of spatial interaction features. The importances of the top k features in the ranking are summed, and k is chosen so that this sum exceeds 85% of the total importance. These top k interaction features are then used as the filtered interaction features.
[0045] Furthermore, the ST-LightGBM model constructed in step 9 is expressed as:
[0046]
[0047] $X = [X_1 \;\ldots\; X_q]$
[0048] $time = [doy \;\; month]$
[0049] $E = [E_1 \;\ldots\; E_p]$
[0050] $E_x = [E_i X_j \;\ldots]$
[0051] In the formula, X is the influencing factors, time is the time-dimension features, E is the spatial features, and $E_x$ is the interaction features;
[0052] The influencing factors X, the time-dimension features time, the spatial features E, and the interaction features $E_x$ are fitted by the ST-LightGBM model to obtain the daily PM2.5 concentration value.
[0053] Furthermore, step 11 uses cross-validation combined with the coefficient of determination R² and the root mean square error (RMSE) to evaluate the accuracy of the ST-LightGBM model.
[0054] Compared with the prior art, the beneficial effects of the present invention are as follows: the PM2.5 concentration estimation model provided by the present invention takes interaction features into account, constructing spatial, interaction, and temporal features and combining them with a Light Gradient Boosting Machine (LightGBM). The interaction features fully consider the influence of spatial effects on the independent variables. The model effectively addresses the lack of spatiotemporal feature description in existing models and the inability of classical models to describe nonlinear relationships, and thus achieves continuous mapping of daily average PM2.5 concentration.
Attached Figure Description
[0055] Figure 1 is a flowchart of the PM2.5 concentration estimation method in an embodiment of the present invention;
[0056] Figure 2 is a flowchart of daily average PM2.5 concentration mapping in an embodiment of the present invention;
[0057] Figure 3 is a schematic diagram of the construction of grid points in an embodiment of the present invention;
[0058] Figure 4 is a schematic diagram of data partitioning in an embodiment of the present invention.
Detailed Implementation
[0059] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.
[0060] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other.
[0061] The present invention will be further described below with reference to specific embodiments, but these are not intended to limit the scope of the invention.
[0062] In the prior art, ground-based air monitoring stations are unevenly distributed and too few, existing models lack descriptions of spatiotemporal features, and classical models cannot describe nonlinear relationships, which makes daily average PM2.5 concentration difficult to fit. To address this, the invention discloses a PM2.5 concentration estimation method that takes interaction features into account. The method uses daily average PM2.5 concentration values from ground-based air monitoring stations and multi-source remote sensing image data to construct spatial, interaction, and temporal features, models them with the Light Gradient Boosting Machine (LightGBM), and finally obtains a continuous distribution map of daily average PM2.5 concentration.
[0063] Referring to Figure 1, the embodiment of the present invention includes a data preprocessing stage (steps 1-3), a feature generation stage (steps 4-8), and a modeling and product generation stage (steps 9-11), specifically comprising the following steps:
[0064] Step 1: Preprocessing of ground PM2.5 concentration monitoring station data. The raw data from ground PM2.5 concentration monitoring stations are acquired; the original records are hourly PM2.5 concentrations in micrograms per cubic meter (μg/m³) and include the latitude and longitude coordinates of the stations. The original records are converted into daily average concentration values by averaging. At the same time, a target coordinate system is determined, and all ground stations are transformed and projected into it. The coordinate system transformation and projection can be performed using software such as ArcGIS.
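The hourly-to-daily averaging in Step 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the record layout (station id, ISO timestamp, value), the function name `daily_means`, and the choice to skip missing hours are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def daily_means(hourly_records):
    """Average hourly PM2.5 readings into one value per (station, date).

    hourly_records: iterable of (station_id, ISO timestamp, value or None).
    Assumption: missing readings (None) are skipped, not counted as zero.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for station_id, timestamp, value in hourly_records:
        if value is None:
            continue
        day = datetime.fromisoformat(timestamp).date().isoformat()
        sums[(station_id, day)] += value
        counts[(station_id, day)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical records for one station, in ug/m3.
records = [
    ("S1", "2023-02-01T00:00:00", 30.0),
    ("S1", "2023-02-01T01:00:00", 50.0),
    ("S1", "2023-02-01T02:00:00", None),  # missing hour, ignored
    ("S1", "2023-02-02T00:00:00", 20.0),
]
result = daily_means(records)
```

Keying the result by station and date keeps it ready for the later join with the influencing factors by station and date.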
[0065] Step 2: Influencing factor preprocessing and daily / monthly averaging. The preprocessing corrects the influencing factors and converts them to the same spatial projection and the same spatial resolution.
[0066] Remote sensing image data of the study area are obtained from relevant databases and used as influencing factor data, including but not limited to: aerosol optical depth, air temperature, air pressure, relative humidity, elevation, vegetation cover, planetary boundary layer height, land use, and population. The remote sensing image data are corrected according to their product types, projected into the same spatial coordinate system as in step 1, and brought to a uniform spatial resolution. All of these operations can be implemented using ArcGIS, Envi, Erdas, or relevant programming languages. Daily / monthly averages are then synthesized from the preprocessed data to obtain daily / monthly influencing factor data. Each monitoring station has a daily average PM2.5 concentration, but some influencing factors have no daily-scale data; for these, the monthly average of the factor is matched to every daily average PM2.5 concentration in that month. Other influencing factors are available at hourly resolution; for these, a daily average is calculated and matched to the daily average PM2.5 concentration of the same day.
[0067] For remote sensing imagery data that may contain data from multiple time periods within the same area, monthly averages need to be synthesized. Considering the potential for missing data within the imagery, the strategy for monthly average synthesis is as follows: For each pixel in the raster image, all values and the total number of values for that pixel in that month are counted, and the average of these is used as the monthly average for that pixel. If a pixel has no values for the entire month, it is assigned a null value. After the monthly averaging operation, if missing values (null values) still exist in the imagery, interpolation or other methods can be used to fill them in. All of the above operations can be implemented using ArcGIS, Envi, Erdas, or relevant programming languages.
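The per-pixel monthly compositing described above can be illustrated with NumPy. This is a sketch under stated assumptions: missing values are encoded as NaN, all scenes of the month are already co-registered into one array, and the function name `monthly_composite` is hypothetical.

```python
import numpy as np


def monthly_composite(stack):
    """Per-pixel mean over the time axis, ignoring NaN (missing) values.

    stack: array of shape (n_scenes, rows, cols) holding one month of scenes.
    Pixels with no valid value in any scene stay NaN (the "null value").
    """
    stack = np.asarray(stack, dtype=float)
    valid = ~np.isnan(stack)
    count = valid.sum(axis=0)                         # valid values per pixel
    total = np.where(valid, stack, 0.0).sum(axis=0)   # sum of valid values
    out = np.full(count.shape, np.nan)
    np.divide(total, count, out=out, where=count > 0)
    return out


# Two hypothetical 2x2 scenes from the same month.
scenes = np.array([
    [[1.0, np.nan], [np.nan, 4.0]],
    [[3.0, np.nan], [2.0, 6.0]],
])
composite = monthly_composite(scenes)
```

Counting valid values explicitly, rather than dividing by the number of scenes, is what makes the average robust to gaps in individual scenes.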
[0068] Step 3: Spatially join the monitoring stations and their PM2.5 concentration data with the influencing factor imagery. Specifically, since the influencing factors are raster images, the value of each influencing factor is extracted from the corresponding raster at each PM2.5 monitoring station's coordinates, forming a data table of coordinates, PM2.5 concentration, and influencing factor values at each coordinate point over multiple time periods. Each row is one record for one station; the columns are the coordinates, the PM2.5 concentration, the influencing factor fields, and the date of the record.
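Extracting raster values at station coordinates, as in Step 3, amounts to converting each coordinate to a row and column index. The sketch below assumes a north-up raster with square cells whose top-left corner and cell size are known, and points in the same projected coordinate system; `sample_raster` and its parameters are illustrative names, not part of the source.

```python
import numpy as np


def sample_raster(raster, x_min, y_max, cell_size, points):
    """Return the raster value under each (x, y) point; NaN if outside the grid.

    Assumes a north-up raster whose top-left corner is (x_min, y_max) and
    whose cells are squares with side cell_size.
    """
    raster = np.asarray(raster, dtype=float)
    n_rows, n_cols = raster.shape
    values = []
    for x, y in points:
        col = int(np.floor((x - x_min) / cell_size))
        row = int(np.floor((y_max - y) / cell_size))
        inside = 0 <= row < n_rows and 0 <= col < n_cols
        values.append(raster[row, col] if inside else np.nan)
    return np.array(values)


# Hypothetical 2x2 AOD raster and three station coordinates.
aod = np.array([[0.1, 0.2],
                [0.3, 0.4]])
stations = [(0.5, 1.5), (1.5, 0.5), (5.0, 5.0)]
extracted = sample_raster(aod, x_min=0.0, y_max=2.0, cell_size=1.0, points=stations)
```

Repeating this for every influencing factor and date yields the columns of the station data table.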
[0069] Step 4: Generate spatial feature vectors; Based on the station coordinates, construct a spatial weight matrix using a Gaussian inverse distance spatial weight model, and then center the spatial weight matrix and extract the eigenvalues and eigenvectors corresponding to the station coordinates; This step specifically includes the following sub-steps:
[0070] Step 4.1: Construct the spatial weight matrix of the ground monitoring stations using a Gaussian inverse distance spatial weight model based on the station coordinates:
[0071]
[0072] Wherein, the spatial weight matrix W is an n×n matrix; n is the number of stations; $W_{i,j}$ represents the weight between station i and station j; exp represents the exponential function; $d_{i,j}$ represents the distance between station i and station j; r represents the length of the longest edge in the minimum spanning tree formed by the stations, or can be set by the user.
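A possible construction of the weight matrix in Step 4.1 is sketched below. The exact kernel is not reproduced in this text, so the Gaussian form $\exp(-(d_{i,j}/r)^2)$ and the zero diagonal used here are assumptions; only the roles of $d_{i,j}$ and r (longest minimum-spanning-tree edge, or user-set) come from the description. The function names are hypothetical.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of projected station coordinates."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def longest_mst_edge(dist):
    """Longest edge of the minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()              # cheapest link from the tree to each node
    longest = 0.0
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        longest = max(longest, float(candidates[j]))
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return longest


def gaussian_weights(coords, r=None):
    """Spatial weight matrix; r defaults to the longest MST edge."""
    dist = pairwise_distances(coords)
    if r is None:
        r = longest_mst_edge(dist)
    w = np.exp(-(dist / r) ** 2)       # ASSUMED Gaussian kernel form
    np.fill_diagonal(w, 0.0)           # ASSUMED: a station has no self-weight
    return w, r


# Three hypothetical stations on a line; MST edges have lengths 1 and 2.
W, r = gaussian_weights([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

Passing `r` explicitly corresponds to the user-set option mentioned in the text.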
[0073] Step 4.2: Centralize the spatial weight matrix, where the centralization formula is as follows:
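[0074] $W_1 = \left(I - \mathbf{1}\mathbf{1}^{T}/n\right) W \left(I - \mathbf{1}\mathbf{1}^{T}/n\right)$ (2)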
[0075] In the formula, $W_1$ is the centered spatial weight matrix; W is the spatial weight matrix constructed in step 4.1; n represents the number of stations; I is an n-dimensional identity matrix; 1 represents an n-dimensional column vector with all elements equal to 1;
[0076] Step 4.3: Extract eigenvalues and eigenvectors. The solution formula is as follows:
[0077] $W_1 E = \lambda E$ (3)
[0078]
[0079] In formula (3), $W_1$ is the centered spatial weight matrix obtained in step 4.2, E represents the eigenvectors, and λ represents the eigenvalues. Solving the equation yields the eigenvalues and eigenvectors. The eigenvectors obtained from the centered spatial weight matrix represent different spatial distribution patterns and can be called spatial eigenvectors. Based on formula (4), the obtained spatial eigenvectors are first preliminarily screened, where $\lambda_{max}$ represents the largest eigenvalue, $\lambda_i$ represents the eigenvalue corresponding to each spatial eigenvector, and s is the threshold. In this embodiment, s is 25%, that is, the spatial eigenvectors whose eigenvalues are in the top 25% are extracted.
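Steps 4.2 and 4.3 can be sketched with NumPy as follows. Two points are assumptions rather than statements from the source: the centering uses the projector $I - \mathbf{1}\mathbf{1}^{T}/n$ on both sides of W, and the screening rule is taken to be $\lambda_i / \lambda_{max} > s$, which is one common reading of a threshold defined from $\lambda_{max}$, $\lambda_i$, and s. The function name is hypothetical.

```python
import numpy as np


def spatial_eigenvectors(w, s=0.25):
    """Center a symmetric weight matrix, eigen-decompose it, and screen by s."""
    n = w.shape[0]
    m = np.eye(n) - np.ones((n, n)) / n      # centering projector I - 11^T/n
    w1 = m @ w @ m
    w1 = (w1 + w1.T) / 2.0                   # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(w1)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # ASSUMED screening rule; it presumes the largest eigenvalue is positive.
    keep = eigvals / eigvals[0] > s
    return eigvals[keep], eigvecs[:, keep]


# Hypothetical weights for four stations on a line (Gaussian kernel, r = 1).
x = np.arange(4, dtype=float)
w = np.exp(-(x[:, None] - x[None, :]) ** 2)
np.fill_diagonal(w, 0.0)
vals, vecs = spatial_eigenvectors(w, s=0.25)
```

Each retained column of `vecs` has one value per station and plays the role of one spatial feature vector in the later steps.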
[0080] Step 5: Construct time-dimension features. Based on the observation time of the stations, day-of-year and month features are constructed. The day of year (doy) counts dates consecutively within a year: January 1st is recorded as 1, and so on, so February 1st is recorded as 32. The month feature keeps only the month: for example, January 1st is recorded as 1 and February 1st as 2. Together, the day of year and the month constitute the time-dimension features used in subsequent modeling.
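The two time-dimension features can be derived directly from a calendar date with the Python standard library. The helper name `time_features` is hypothetical; the example date reproduces the February 1st case from the text.

```python
from datetime import date


def time_features(day):
    """Return (day of year, month) for a datetime.date."""
    return day.timetuple().tm_yday, day.month


doy, month = time_features(date(2023, 2, 1))
```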
[0081] Step 6: Construct interaction features. The influencing factors are combined with the spatial feature vectors generated in Step 4 to construct spatial interaction features. Since the spatial feature vectors E represent different spatial distribution patterns, the interaction features represent the influence of each spatial distribution pattern on each variable:
[0082] $E_x = [E_1X_1 \;\ldots\; E_1X_q \;\ldots\; E_pX_1 \;\ldots\; E_pX_q]$ (5)
[0083] Here $E_x$ denotes the interaction features. Assume that p spatial feature vectors were extracted in step 4, denoted $E_1$ to $E_p$, and that there are q influencing factors from step 2, denoted $X_1$ to $X_q$. The interaction features $E_x$ are then obtained from formula (5): each spatial feature vector is multiplied by each influencing factor in turn, each product forms a column vector, and these columns together constitute the interaction features $E_x$.
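Formula (5) is an element-wise product of every spatial feature vector with every influencing factor. A minimal NumPy sketch is given below; it assumes E and X are arranged with one row per record, and the column naming scheme is illustrative only.

```python
import numpy as np


def interaction_features(E, X):
    """Build E_x with columns ordered E1*X1 .. E1*Xq .. Ep*X1 .. Ep*Xq.

    E: (n, p) spatial feature vectors, X: (n, q) influencing factors.
    """
    E = np.asarray(E, dtype=float)
    X = np.asarray(X, dtype=float)
    n, p = E.shape
    q = X.shape[1]
    ex = (E[:, :, None] * X[:, None, :]).reshape(n, p * q)
    names = [f"E{i + 1}*X{j + 1}" for i in range(p) for j in range(q)]
    return ex, names


# Hypothetical values: n = 2 records, p = 2 spatial vectors, q = 3 factors.
E = [[1.0, 2.0], [3.0, 4.0]]
X = [[10.0, 100.0, 1000.0], [1.0, 2.0, 3.0]]
ex, names = interaction_features(E, X)
```

Keeping a name for each column makes it straightforward to refer to individual interaction features when they are filtered by importance.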
[0084] Step 7: Interaction feature filtering. Variable importance analysis is performed on the interaction features $E_x$ obtained in Step 6 against PM2.5 concentration, and the features are sorted from most to least important. The top-ranked interaction features that together account for 85% of the cumulative importance are selected as the filtered features, as shown in the formula below:
[0085]
[0086] In the formula, $Importance_i$ represents the importance of the i-th interaction feature, and n is the total number of spatial interaction features. The importances of the top k features in the ranking are summed, and k is chosen so that this sum exceeds 85% of the total importance. These top k interaction features are then used as the filtered interaction features.
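The cumulative-importance filter can be sketched in plain Python. The source does not say how the importances are computed or whether the 85% boundary is inclusive, so this example assumes they are given as non-negative numbers with a positive total and selects the smallest k whose cumulative share reaches the threshold; the function name is hypothetical.

```python
def select_by_cumulative_importance(importances, threshold=0.85):
    """Return the names of the top-k features whose share reaches threshold.

    importances: dict mapping feature name to a non-negative importance.
    Assumes the total importance is positive.
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(value for _, value in ranked)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        if running / total >= threshold:   # ASSUMED inclusive boundary
            break
    return selected


# Hypothetical importances: cumulative shares are 0.50, 0.80, 0.95, 1.00.
scores = {"E1*X1": 50, "E1*X2": 30, "E2*X1": 15, "E2*X2": 5}
kept = select_by_cumulative_importance(scores)
```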
[0087] Step 8: Data reorganization. The daily average PM2.5 concentration data and influencing factors from Step 3, the spatial features from Step 4, the temporal features from Step 5, and the interaction features selected in Step 7 are combined by station and date to form the daily average PM2.5 concentration and feature dataset, in which each record holds the PM2.5 concentration and the feature values of one station on one day.
[0088] Step 9: Establish the spatiotemporal Light Gradient Boosting Machine (ST-LightGBM) model. The feature dataset from Step 8 is used as the model input features and the daily average PM2.5 concentration from Step 8 as the dependent variable to construct the ST-LightGBM model. The overall expression of the model is shown in Equation (7):
[0089]
[0090] $X = [X_1 \;\ldots\; X_q]$ (8)
[0091] $time = [doy \;\; month]$ (9)
[0092] $E = [E_1 \;\ldots\; E_p]$ (10)
[0093] $E_x = [E_i X_j \;\ldots]$ (11)
[0094] The input data of the ST-LightGBM model include influencing factors, temporal features, spatial features, and interaction features, expressed as in formulas (8)-(11); the model is fitted to obtain the daily PM2.5 concentration values. LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It is suitable for regression and many other machine learning tasks and can be used from programming languages such as Python and R.
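One way to assemble the four feature groups and fit a LightGBM regressor in Python is sketched below. It is only an illustration: the helper names, the toy shapes, and the hyperparameter values in the commented usage are assumptions, and the source does not state which LightGBM interface or settings were used. `lightgbm` is imported inside the fitting function so that the feature assembly can be run without the package installed.

```python
import numpy as np


def assemble_features(X, time, E, Ex):
    """Stack the four feature groups column-wise: [X | time | E | E_x]."""
    blocks = [np.asarray(block, dtype=float) for block in (X, time, E, Ex)]
    return np.hstack(blocks)


def fit_st_lightgbm(features, pm25, **params):
    """Fit a LightGBM regressor with PM2.5 as the dependent variable."""
    import lightgbm as lgb   # lazy import; requires the lightgbm package
    model = lgb.LGBMRegressor(**params)
    model.fit(features, pm25)
    return model


# Toy shapes only: 5 station-days, q = 2 factors, p = 2 spatial vectors,
# and 3 interaction features kept after filtering.
rng = np.random.default_rng(0)
X = rng.random((5, 2))
time = np.array([[32, 2]] * 5)   # doy, month
E = rng.random((5, 2))
Ex = rng.random((5, 3))
features = assemble_features(X, time, E, Ex)

# Hypothetical usage with illustrative hyperparameters (not from the source):
# model = fit_st_lightgbm(features, pm25, n_estimators=500, learning_rate=0.05)
# predicted = model.predict(features)
```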
[0095] Step 10: Daily average PM2.5 concentration mapping. The spatial feature vectors from step 4 are interpolated to obtain spatial feature vectors for every location in the coverage area, and the interaction features are regenerated. The influencing factors from step 2, the time-dimension features from step 5, the interpolated spatial feature vectors, and the interaction features are then fed into the ST-LightGBM model constructed in step 9 to obtain the continuous PM2.5 concentration map of the study area. The detailed flow is shown in Figure 2. Specifically:
[0096] Step 10.1: Construct grid points:
[0097] The grid coverage area is determined from the study area, and an appropriate spatial resolution is chosen. The spatial resolution is the grid cell size, and the center coordinates of each grid cell are used as the coordinates of that cell. As shown in Figure 3, the grid size, layout, and coordinate system are the same as those of the unified influencing factors in step 2, and this grid serves as the basis for the subsequent processing steps.
[0098] Step 10.2: Spatial feature vector interpolation;
[0099] The spatial feature vectors generated in step 4, together with their station coordinates, are interpolated onto the grid points generated in step 10.1. The values of the spatial feature vectors at each grid point are obtained with interpolation methods such as inverse distance weighting (IDW) and kriging. The interpolation can be carried out in ArcGIS or QGIS, or in programming languages such as Python and R, and yields spatial feature vectors covering the entire study area.
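Of the interpolation methods named above, inverse distance weighting is the simplest to sketch. The example below assumes a power of 2 and uses every station for every grid point; both choices, and the function name, are illustrative rather than taken from the source.

```python
import numpy as np


def idw_interpolate(station_xy, station_values, grid_xy, power=2.0):
    """Inverse distance weighting from stations onto grid points.

    station_values may be 1-D (one vector) or 2-D (one column per vector).
    A grid point that coincides with a station takes that station's value.
    """
    station_xy = np.asarray(station_xy, dtype=float)
    station_values = np.asarray(station_values, dtype=float)
    grid_xy = np.asarray(grid_xy, dtype=float)
    diff = grid_xy[:, None, :] - station_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # shape (n_grid, n_station)
    exact = dist == 0
    with np.errstate(divide="ignore"):
        weights = 1.0 / dist ** power
    weights[exact.any(axis=1)] = 0.0             # rows that hit a station exactly
    weights[exact] = 1.0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ station_values


# Two hypothetical stations and three grid points on a line.
stations = [(0.0, 0.0), (2.0, 0.0)]
values = [0.0, 2.0]
grid = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
interp = idw_interpolate(stations, values, grid)
```

Passing a 2-D `station_values` array interpolates all retained spatial feature vectors in one call.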
[0100] Step 10.3: Generate interaction features;
[0101] Using the spatial feature vector covering the entire study area obtained in step 10.2 and the raster data of the influence factors in step 2 as input data, and taking the selected interaction features in step 7 as the target, the interaction features are constructed using formula (5) in step 6. Since the size, distribution, and coordinate system of the grid generated in step 10.1 are the same as those of the influence factors in step 2, the interaction features can be directly calculated using formula (5).
[0102] Step 10.4: Generate temporal features
[0103] The time-dimension features are constructed as described in step 5. For a given date, such as February 1st, the day-of-year and month features are the same across the whole study area (32 and 2 respectively), so every grid cell takes the values 32 and 2. Time-dimension features are created for the entire study period for use in subsequent steps;
[0104] Step 10.5: Data Reorganization
[0105] The influence factors, spatial feature vectors, interaction features, and temporal features obtained in steps 2, 10.2, 10.3, and 10.4 are reorganized. The data structure is now similar to that in step 8, with each grid point containing influence factors, spatial feature vectors, interaction features, and temporal features at a specific point in time.
[0106] Step 10.6: Model fitting and PM2.5 concentration spatial mapping.
[0107] The reorganized data from step 10.5 are fed into the ST-LightGBM model constructed in step 9 to obtain the predicted PM2.5 concentration at each grid point. The predicted PM2.5 concentrations are mapped onto the grid for the corresponding date, and each resulting grid is the PM2.5 concentration map for that date.
[0108] Step 11: Accuracy Evaluation; Evaluate the accuracy of the constructed ST-LightGBM model; This step specifically includes:
[0109] Step 11.1: Set the evaluation indicators. The accuracy of the ST-LightGBM model is evaluated using the coefficient of determination R² and the root mean square error (RMSE), calculated as follows:
[0110] $$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
[0111] $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
[0112] where $y_i$ is the observed PM2.5 concentration at monitoring station i; $\bar{y}$ is the mean of the observed data; $\hat{y}_i$ is the predicted PM2.5 concentration at monitoring station i; and n is the number of stations. R² generally ranges between 0 and 1, and the larger its value, the better the model fits. The smaller the RMSE, the smaller the model error and the higher the model accuracy.
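The two indicators can be computed with a few lines of code. This sketch assumes that R² is the coefficient of determination built from the observations, their mean, and the predictions, which matches the quantities defined above. The function names are illustrative.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed form of R^2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean square error over n paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)
```

A perfect prediction gives an R² of 1 and an RMSE of 0. With this definition, R² can fall below 0 when the model predicts worse than the observed mean.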
[0113] Step 11.2: Accuracy evaluation using cross-validation.
[0114] Divide the dataset: the data are divided into 10 equal parts using two different methods. A schematic of a single part is shown in Figure 4. Figure 4(a) shows the sample-based random partitioning method, which randomly partitions all station observation records from all time periods as a whole; the boxed portion is a single part. Figure 4(b) shows the site-based random partitioning method, which partitions by site, so that the observation records of the selected sites across all time periods fall into the same part; the boxed portion is a single part. Following both methods, the dataset obtained in step 8 is divided into 10 parts.
[0115] Cross-validation evaluation: each of the two 10-part datasets is cross-validated as follows. Nine parts are used as the modeling set, and the remaining part is used as the test set. The modeling set is input into the ST-LightGBM model constructed in step 9 for parameter optimization, and the PM2.5 concentrations in the test set are predicted. This is repeated until every part has served as the test set, at which point the predictions cover the entire dataset. The results are evaluated with the two indicators from step 11.1. This result represents the spatial prediction ability of the model to a certain extent: the higher the accuracy, the stronger the spatial prediction ability.
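The two partitioning methods of step 11.2 can be sketched as follows. The sketch assumes each observation record is a dictionary with a `"site"` key; that layout, the function names, and the fixed random seed are illustrative assumptions.

```python
import random

def sample_based_folds(records, k=10, seed=0):
    """Shuffle all records from all sites and dates together, then split into k parts."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def site_based_folds(records, k=10, seed=0):
    """Assign whole sites to parts, so a site's records never straddle two parts."""
    sites = sorted({r["site"] for r in records})
    random.Random(seed).shuffle(sites)
    site_fold = {s: i % k for i, s in enumerate(sites)}
    folds = [[] for _ in range(k)]
    for i, r in enumerate(records):
        folds[site_fold[r["site"]]].append(i)
    return folds
```

In each round, one list of indices is the test set and the other nine together form the modeling set. Under site-based partitioning the test sites are never seen during fitting, which is why that variant reflects prediction at unmonitored locations.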
[0116] The above are merely preferred embodiments of the present invention and are not intended to limit the implementation methods and scope of protection of the present invention. Those skilled in the art should recognize that any equivalent substitutions and obvious changes made based on the content of this specification should be included within the scope of protection of the present invention.
Claims
1. A PM2.5 concentration estimation method that takes into account interaction features, characterized in that it includes the following steps:
Step 1: Obtain ground-level PM2.5 concentration monitoring station data and preprocess it;
Step 2: Acquire remote sensing image data of the study area within a certain time range and use the remote sensing image data as influencing factors; after preprocessing the remote sensing image data, synthesize daily/monthly average values to obtain daily/monthly-scale influencing factor data;
Step 3: Spatially join the monitoring stations in step 1 with the influencing factors in step 2;
Step 4: Extract the spatial feature values and feature vectors of the stations based on the station coordinates;
Step 5: Construct time-dimension features based on the observation time of the stations;
Step 6: Combine the influencing factors with the spatial feature vectors generated in step 4 to form spatial interaction features;
Step 7: Perform variable importance analysis on the spatial interaction features obtained in step 6 and the PM2.5 concentration, and filter the spatial interaction features by importance to obtain the filtered interaction features;
Step 8: Combine the daily average PM2.5 concentration data and influencing factor data from step 3, the spatial features from step 4, the temporal features from step 5, and the interaction features from step 7 according to the corresponding stations and dates to form the daily average PM2.5 concentration and feature dataset;
Step 9: Using the feature dataset from step 8 as the model input features and the daily average PM2.5 concentration from step 8 as the dependent variable, construct the ST-LightGBM model, which expresses the daily average PM2.5 concentration as a function of the influencing factors, the time-dimension features, the spatial features, and the interaction features;
Step 10: Interpolate the spatial feature vectors from step 4 to obtain spatial feature vectors for every location within the coverage area, regenerate the interaction features, and input the influencing factors, time-dimension features, spatial features, and interaction features into the fitted ST-LightGBM model to obtain the daily PM2.5 concentration values;
Step 11: Evaluate the accuracy of the ST-LightGBM model.
2. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that the preprocessing method in step 1 is as follows: the original monitoring station PM2.5 concentration data are converted into daily average concentration values by averaging; in addition, a target coordinate system is determined, and the daily average concentration values of each monitoring station are projected into the target coordinate system.
3. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that the influencing factors in step 2 include, but are not limited to, aerosol optical thickness, air temperature, air pressure, relative humidity, elevation, vegetation cover, planetary boundary layer height, land use, and population.
4. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that the method for preprocessing remote sensing image data in step 2 is as follows: the remote sensing image data acquired within a certain time range are corrected according to their product types and then projected to the same coordinate system as in step 1, at a uniform spatial resolution.
5. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that the method for synthesizing monthly average values of remote sensing images is as follows: for each pixel in the raster image, sum all values recorded for that pixel in the month and divide by the number of values to obtain the monthly average value for that pixel; if the pixel has no values in that month, the pixel is assigned a null value.
6. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that step 3 is as follows: according to the coordinates of the PM2.5 monitoring stations, the values of the influencing factors are extracted from the corresponding remote sensing image rasters at each monitoring station's coordinates, forming a data table that contains the coordinates, the PM2.5 concentration, and the values of the influencing factors at each coordinate point over multiple time periods.
7. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that step 4 includes the following steps:
Step 4.1: Construct the station spatial weight matrix from the station spatial coordinates using a Gaussian inverse-distance spatial weight model, wherein the spatial weight matrix W is an n×n matrix; n is the number of stations; each element of W is the weight between a pair of stations, computed with the exponential function exp from the distance between the two stations and a distance threshold; the threshold is usually the length of the longest edge in the minimum spanning tree formed by the stations, or can be set manually;
Step 4.2: Center the spatial weight matrix, wherein the centering formula is defined in terms of the spatial weight matrix W obtained in step 4.1, the number of stations n, the n-dimensional identity matrix, and the n-dimensional column vector whose elements are all 1, and yields the centered spatial weight matrix;
Step 4.3: Extract the eigenvalues and eigenvectors of the centered spatial weight matrix obtained in step 4.2 by solving its eigenvalue equation; the eigenvectors obtained from the centered spatial weight matrix represent different spatial distribution patterns and are called spatial eigenvectors; furthermore, the obtained spatial eigenvectors are initially screened on the basis of the eigenvalue corresponding to each spatial eigenvector, the largest eigenvalue among the spatial eigenvectors, and a threshold value.
8. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that the filtering method in step 7 is as follows: variable importance analysis is performed on the spatial interaction features obtained in step 6 and the PM2.5 concentration, and the features are sorted from most important to least important; the top-ranked interaction features that together account for 85% of the cumulative importance are selected as the filtered features, wherein n is the total number of spatial interaction features and the importance of the i-th interaction feature is taken from the ranking; the importances of the top k features in the ranking are summed, and k is chosen such that this sum exceeds 85% of the total importance of all n features; the top k interaction features are then used as the filtered interaction features.
9. The PM2.5 concentration estimation method that takes into account interaction features according to claim 1, characterized in that step 11 uses cross-validation combined with the coefficient of determination R² and the root mean square error (RMSE) to evaluate the accuracy of the ST-LightGBM model.