An ozone pollution cause tracing and analysis method based on VOCs component characteristics
By combining photochemical mechanisms and machine learning methods, key precursors of ozone pollution are identified, solving the problems of sample imbalance and chemical consumption interference in existing technologies. This enables high-precision source tracing and analysis of ozone pollution causes and provides precise pollution control strategies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUIZHOU RUIEN TESTING TECH CO LTD
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from problems such as sample imbalance, chemical consumption interference, misjudgment of secondary products, and chemical inconsistencies in the screening results of purely statistical methods when identifying key precursors of ozone pollution, which limits the effectiveness of ozone formation analysis.
The method is based on VOC component characteristics and combines photochemical mechanism constraints with machine learning. Synthetic samples are generated with a chemically constrained synthetic minority oversampling technique, and key ozone precursors are first screened by a random forest model. The optimal subset of VOC features is then selected with a simulated annealing-optimized partial least squares regression framework, and chemically unreasonable species are excluded through photochemical age correction and chemical inertness penalty terms.
It significantly improves the accuracy and chemical rationality of ozone prediction, accurately locates pollution sources, reduces the dimensionality of variables, and enhances the focus and reliability of source apportionment. It is applicable to ozone pollution analysis in different regions and seasons.
Figure CN122245528A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of atmospheric environmental monitoring and pollution cause analysis, and specifically to a method for tracing and analyzing the causes of ozone pollution based on VOC component characteristics.
Background Technology
[0002] Ground-level ozone (O3) pollution is one of the core issues of complex atmospheric pollution in key regions of China. Ozone is not a primary pollutant but a secondary pollutant generated from nitrogen oxides (NOx) and volatile organic compounds (VOCs) through complex photochemical reactions under sunlight. Therefore, accurately identifying the key precursors that drive ozone formation among dozens to hundreds of VOC species, and elucidating their sources, is a prerequisite for formulating scientific management strategies for ozone pollution.
[0003] 1. Existing VOCs analysis methods and their limitations:
[0004] Currently, standardized detection of VOCs in ambient air mainly employs a pre-concentration-gas chromatography-mass spectrometry / flame ionization detector (GC-MS / FID) coupled technique. Existing commercial systems already possess the following capabilities:
[0005] (1) Species coverage: The Entech 7100 pre-concentration system can detect more than 180 VOCs in the air with a detection limit as low as 0.1 ppbv.
[0006] (2) Cold trap heating rate: The heating rate of the focused cold trap in commercial thermal desorption systems can reach 100℃ / s (equivalent to 6000℃ / min) or more.
[0007] However, existing technologies still have room for improvement in the following aspects:
[0008] Sparse sample size during periods of high ozone concentration: In actual monitoring, the sample size during periods of heavy pollution with O3 concentrations exceeding 160 μg / m³ is usually much smaller than that during periods of compliance with standards, resulting in a typical class imbalance problem, which limits the training effect of ozone formation models based on statistical learning.
[0009] 2. Existing methods for identifying ozone precursors and their limitations:
[0010] Common methods for identifying key ozone precursors include:
[0011] (1) Ozone formation potential (OFP) ranking: species are ranked by the product of their maximum incremental reactivity (MIR) and their concentration. The method is simple and intuitive but ignores the collinearity between species and the actual atmospheric reaction environment.
[0012] (2) Correlation analysis / principal component analysis (PCA): cannot handle the high collinearity between species caused by common source emissions.
[0013] (3) Positive matrix factorization (PMF) source apportionment: it can identify emission sources but cannot directly quantify the net contribution of each species to ozone formation.
[0014] (4) Machine learning methods: Koul et al. (Global Transitions Proceedings, 2022) combined simulated annealing (SA) with PLS for feature selection of gene expression data. Li et al. (IEEE Access, 2022) used RF combined with SMOTE to process imbalanced medical data.
[0015] When directly applying the above methods to atmospheric VOCs-ozone research, the following field-specific problems arise:
[0016] (1) Photochemical reaction mechanism not considered: Pure statistical methods may mistakenly select chemically inert but statistically relevant species (such as carbon tetrachloride).
[0017] (2) Photochemical consumption was not considered: VOCs at the downwind observation point had undergone significant chemical consumption, and the relationship between the measured concentration and the initial emission concentration was distorted.
[0018] (3) Secondary products are misidentified as precursors: The photochemical oxidation products of isoprene (such as MVK) may be misidentified as primary emission precursors.
[0019] Therefore, there is an urgent need for a method for identifying and analyzing key ozone precursors that couples photochemical mechanism constraints with machine learning and balances statistical accuracy with chemical rationality, so as to overcome the problems of existing technologies: sample imbalance, chemical consumption interference, misjudgment of secondary products, and chemically unreasonable screening results from purely statistical methods.
Summary of the Invention
[0020] The purpose of this invention is to overcome the above-mentioned shortcomings of the prior art and provide a method for tracing and analyzing the causes of ozone pollution based on the characteristics of VOCs components. Specifically, this method is a method for identifying key ozone precursors of volatile organic compounds (VOCs) in ambient air and analyzing pollution sources by coupling photochemical mechanism constraints and machine learning. The photochemical mechanism constraints are domain knowledge constraint functions constructed based on atmospheric photochemical reaction pathways, avoiding the chemically unreasonable screening results produced by existing purely statistical methods.
[0021] To achieve the above objectives, the following technical solution is adopted:
[0022] This invention provides a method for tracing and analyzing the causes of ozone pollution based on VOCs component characteristics, comprising the following steps:
[0023] S1: Acquire concentration data of various volatile organic compounds (VOCs), ozone concentration data, nitrogen oxide concentration data, and meteorological parameter data in ambient air; and amplify samples from periods of high ozone concentration to generate synthetic samples to balance the number of samples at different ozone concentration levels.
[0024] S2: Based on the synthetic samples obtained after amplification, a random forest model is used with ozone concentration as the response variable and the concentration of various VOCs as the independent variable to calculate the feature importance of each VOC species, and select VOC species whose cumulative importance contribution reaches a preset threshold to form an initial feature subset.
[0025] S3: For each VOC species in the initial feature subset, the initial concentration of the species is reconstructed based on its reaction rate constant with hydroxyl radicals and the estimated photochemical age of the air mass, so as to correct the influence of photochemical consumption on the measured concentration, and the VOC species reconstructed from the initial concentration constitute a candidate feature subset.
[0026] S4: Using a simulated annealing-optimized partial least squares regression framework, the candidate feature subset is searched, and the optimal VOCs feature subset is selected with the goal of minimizing the objective function.
[0027] S5: Based on the optimal VOCs feature subset, the standardized regression coefficients of each VOCs are calculated using a partial least squares regression model to determine the key ozone precursors, and the sources of the key ozone precursors are analyzed to generate pollution control strategies.
[0028] Furthermore, in step S1, the samples from periods of high ozone concentration are amplified by a chemically constrained synthetic minority oversampling technique to generate synthetic samples. This technique filters candidate synthetic samples through one or more chemical constraint discriminant functions and retains only the synthetic samples that pass every discriminant function, thereby defining a chemically feasible region.
[0029] Furthermore, the chemical constraint discrimination function includes: NO titration effect constraint: prohibiting the synthesis of samples where the concentrations of nitrogen oxides and ozone are simultaneously higher than their respective judgment thresholds, wherein the judgment thresholds are determined based on the upper quartile of the nitrogen oxide concentration in the training set and the upper limit of the upper prediction interval of the ozone concentration at a given nitrogen oxide concentration; source spectrum consistency constraint: ensuring that the concentration ratio of specific volatile organic compound species pairs in the synthesized sample falls within the interval formed by the lowest and highest percentiles of this ratio in the training set; oxidation product and parent compound relationship constraint: ensuring that the concentration ratio of secondary photochemical oxidation products in the synthesized sample to its parent volatile organic compound does not exceed the highest percentile of this ratio in the training set.
[0030] Furthermore, in step S2, the preset threshold for the cumulative importance contribution is 85%; the random forest model is set to have 500 trees, and the splitting criterion is Gini impurity; the feature importance of each VOC species is calculated from the mean decrease in Gini impurity.
[0031] Furthermore, step S3 specifically includes: estimating the photochemical age of the air mass using the concentration ratio of ethylbenzene to m / p-xylene; dynamically estimating the average concentration of hydroxyl radicals in the atmosphere using a parametric model based on the photolysis rate of nitrogen dioxide, temperature, and relative humidity; and reconstructing the initial concentration of the VOC species according to the reaction rate constant between the VOC species and the hydroxyl radicals, the average concentration of the hydroxyl radicals, and the photochemical age.
[0032] Furthermore, the method for estimating the photochemical age of an air mass using the concentration ratio of ethylbenzene to m / p-xylene is as follows: Using the concentration ratio of ethylbenzene to m / p-xylene, combined with the reaction rate constants of ethylbenzene and m / p-xylene with hydroxyl radicals, and the average concentration of hydroxyl radicals, the following formula is used for estimation:
[0033] $$ t = \frac{1}{(k_X - k_E)\,[\mathrm{OH}]}\left[\ln\left(\frac{[E]}{[X]}\right)_t - \ln\left(\frac{[E]}{[X]}\right)_0\right] $$
[0034] where $t$ is the photochemical age of the air mass; $k_X$ is the reaction rate constant between m / p-xylene and hydroxyl radicals; $k_E$ is the reaction rate constant between ethylbenzene and hydroxyl radicals; $[\mathrm{OH}]$ is the average concentration of hydroxyl radicals in the atmosphere; $([E]/[X])_t$ is the measured concentration ratio of ethylbenzene to m / p-xylene at time $t$; and $([E]/[X])_0$ is the initial concentration ratio of ethylbenzene to m / p-xylene in the fresh emission source spectrum.
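The photochemical age estimate can be illustrated with a minimal Python sketch. It assumes first-order OH loss of both species, so that the ethylbenzene to m / p-xylene ratio grows as exp((k_X - k_E)[OH]t), and it assumes rate constants in cm^3 molecule^-1 s^-1 and OH in molecules cm^-3, which yields the age in seconds. The function name and the clipping of negative ages to zero are illustrative assumptions, not part of the claimed method.

```python
import math


def photochemical_age(ratio_t, ratio_0, k_xylene, k_ethylbenzene, oh_conc):
    """Photochemical age (seconds) from the ethylbenzene / m,p-xylene ratio.

    ratio_t, ratio_0: measured and fresh-emission concentration ratios
    k_xylene, k_ethylbenzene: OH rate constants (cm^3 molecule^-1 s^-1)
    oh_conc: average OH concentration (molecules cm^-3)
    """
    if ratio_t <= 0 or ratio_0 <= 0 or oh_conc <= 0:
        raise ValueError("ratios and OH concentration must be positive")
    if k_xylene <= k_ethylbenzene:
        raise ValueError("xylene is expected to react faster than ethylbenzene")
    age = math.log(ratio_t / ratio_0) / ((k_xylene - k_ethylbenzene) * oh_conc)
    # Assumption: measurement noise can push ratio_t below ratio_0,
    # so a negative age is clipped to zero.
    return max(age, 0.0)
```

In practice the initial ratio would come from a locally calibrated fresh-emission source profile, and the OH concentration from the parameterization described in the following paragraphs.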
[0035] Furthermore, based on the photolysis rate of nitrogen dioxide, temperature, and relative humidity, a parametric model is used to dynamically estimate the average concentration of hydroxyl radicals in the atmosphere, as shown in the following formula:
[0036] ;
[0037] where the calibration coefficients and the exponent parameter of the OH concentration parameterization model are obtained by regression fitting of local observation data; the model inputs are the photolysis rate constant of NO2, the atmospheric temperature T, and the relative humidity RH.
[0038] Furthermore, reconstructing the initial concentration of the VOC species based on the reaction rate constant between the VOC species and hydroxyl radicals, the average concentration of the hydroxyl radicals, and the photochemical age includes:
[0039] For each VOC species in the candidate feature subset, the initial concentration of the VOC species is reconstructed from its measured concentration, the reaction rate constant between the VOC species and hydroxyl radicals, the average concentration of hydroxyl radicals, and the photochemical age of the air mass, using the following formula: $$ [\mathrm{VOC}]_0 = [\mathrm{VOC}]_t \cdot \exp\left(k_{\mathrm{OH}}\,[\mathrm{OH}]\,t\right) $$ where $[\mathrm{VOC}]_t$ is the measured concentration of the VOC species in ambient air; $k_{\mathrm{OH}}$ is the reaction rate constant between the VOC species and hydroxyl radicals; $[\mathrm{OH}]$ is the average concentration of hydroxyl radicals in the atmosphere; and $t$ is the photochemical age of the air mass.
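A minimal Python sketch of this correction follows. It assumes first-order loss to OH only, so the initial concentration is the measured concentration multiplied by exp(k_OH [OH] t). The function names and the dictionary-based interface are hypothetical choices for illustration.

```python
import math


def reconstruct_initial(measured, k_oh, oh_conc, age_s):
    """Initial concentration: measured * exp(k_OH * [OH] * t)."""
    return measured * math.exp(k_oh * oh_conc * age_s)


def reconstruct_all(measured_by_species, k_oh_by_species, oh_conc, age_s):
    """Apply the correction to every species in a {name: concentration} dict."""
    return {
        name: reconstruct_initial(conc, k_oh_by_species[name], oh_conc, age_s)
        for name, conc in measured_by_species.items()
    }
```

The correction factor grows with reactivity: for the same air mass, a fast-reacting species such as isoprene is scaled up far more than a slow-reacting one, and a species with a rate constant of zero is left unchanged.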
[0040] Furthermore, in step S4, the objective function is a normalized objective function, specifically as follows:
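[0041] $$ J(S, \mathrm{LVs}) = \mathrm{NRMSE}_{\mathrm{CV}}(S, \mathrm{LVs}) + \lambda_1 P_{\mathrm{inert}}(S) + \lambda_2 P_{\mathrm{col}}(S) + \lambda_3 P_{\mathrm{sec}}(S) $$
[0042] The objective function comprises a cross-validated normalized root mean square error term, a chemical inertness soft penalty term, a collinearity soft penalty term, and a secondary product penalty term. Here $S$ is the candidate feature subset and $\mathrm{LVs}$ is the number of latent variables in the partial least squares regression; $\mathrm{NRMSE}_{\mathrm{CV}}$ is the cross-validated normalized root mean square error; $P_{\mathrm{inert}}$ is the chemical inertness soft penalty term; $P_{\mathrm{col}}$ is the collinearity soft penalty term; $P_{\mathrm{sec}}$ is the secondary product penalty term; and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the corresponding penalty weights.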
[0043] Furthermore, the chemical inertness soft penalty term is defined as follows: when the reaction rate constant of each volatile organic compound species with hydroxyl radicals in the current candidate feature subset is lower than the chemical inertness threshold, a non-zero penalty is applied to that species, and the penalty intensity increases as the reaction rate constant of that species with hydroxyl radicals decreases. The value of the chemical inertness soft penalty term is the average of the penalty values of all species in the current candidate feature subset. The collinearity soft penalty term is defined as follows: when the average correlation coefficient among all volatile organic compound species in the current candidate feature subset exceeds a preset correlation coefficient threshold, a penalty is applied to the excess portion. The secondary product penalty term is defined as the proportion of the number of species marked as secondary photochemical oxidation products in the current candidate feature subset to the total number of species in the current candidate feature subset.
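The three penalty terms and their combination with the cross-validated error can be sketched in Python as follows. The text fixes only the qualitative behavior of the inertness penalty (non-zero below the threshold, larger for less reactive species), so the linear ramp used here is an assumption, as are all function names, the argument layout, and the choice to pass in a precomputed mean correlation and error value.

```python
def inertness_penalty(k_oh_values, k_threshold):
    """Mean per-species penalty; zero at or above the threshold.

    Assumption: a linear ramp from 0 (k = threshold) to 1 (k = 0).
    """
    penalties = [max(0.0, (k_threshold - k) / k_threshold) for k in k_oh_values]
    return sum(penalties) / len(penalties)


def collinearity_penalty(mean_abs_corr, corr_threshold):
    """Penalize only the part of the mean correlation above the threshold."""
    return max(0.0, mean_abs_corr - corr_threshold)


def secondary_penalty(is_secondary_flags):
    """Fraction of species in the subset flagged as secondary products."""
    return sum(1 for flag in is_secondary_flags if flag) / len(is_secondary_flags)


def objective(nrmse_cv, k_oh_values, mean_abs_corr, is_secondary_flags,
              k_threshold, corr_threshold, weights):
    """Cross-validated NRMSE plus the three weighted penalty terms."""
    w_inert, w_col, w_sec = weights
    return (nrmse_cv
            + w_inert * inertness_penalty(k_oh_values, k_threshold)
            + w_col * collinearity_penalty(mean_abs_corr, corr_threshold)
            + w_sec * secondary_penalty(is_secondary_flags))
```

Because the penalties are soft, a weakly reactive or secondary species can still be retained when it lowers the prediction error by more than its weighted penalty; the weights control that trade-off.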
[0044] Compared with the prior art, the present invention achieves the following beneficial effects:
[0045] 1. Significantly improved prediction accuracy: Ablation experiments on the same dataset show that the ozone prediction determination coefficient R² of the complete scheme of this invention reaches 0.81±0.03, and the root mean square error RMSE is 12.4±1.2 μg / m³, which is significantly better than the standard SA-PLS and OFP ranking methods.
[0046] 2. Significantly Enhanced Chemical Rationality: This invention successfully eliminated chemically inert species such as carbon tetrachloride and dichloromethane, which were mistakenly selected in the standard SA-PLS, through a soft penalty for chemical inertness; and eliminated secondary photochemical products such as methyl vinyl ketone (MVK) from being misclassified as primary precursors through a secondary product penalty. All eight key precursors screened are known ozone-generating active species, demonstrating strong chemical interpretability.
[0047] 3. Improved focus and reliability of source apportionment: The number of variables to be analyzed is reduced from 117 to 8-12, a reduction in variable dimensionality of approximately 90%, which effectively avoids collinearity noise from redundant species in source apportionment. Combined with characteristic species ratios, backward trajectory, and PSCF analysis, the pollution source can be accurately located (e.g., in the example, the southwest petrochemical park was determined to contribute 40%-50%, and local vegetation isoprene contributed 25%-35%), providing a clear target for precise pollution control.
[0048] 4. Excellent Engineering Applicability: This invention can be implemented based on conventional environmental monitoring data (VOCs, O3, NOx, and meteorological parameters analyzed by GC-MS / FID) without relying on additional expensive equipment. The standard workstation configuration used in the embodiment (16-core CPU, 64GB memory) can complete the calculation of monthly data, demonstrating practical deployment feasibility.
[0049] 5. Wide applicability: The core method framework of this invention does not depend on a specific region or season. By recalibrating the initial ratio parameter and OH concentration estimation coefficient in photochemical age correction, it can be extended to ozone pollution cause analysis in different cities and seasons, and has good generalization ability.
[0050] It should be understood that the description in the Summary of the Invention is not intended to limit the key or essential features of the embodiments of the present invention, nor is it intended to restrict the scope of the invention. Other features of the invention will become readily apparent from the following description.
Description of the Drawings
[0051] The above and other features, advantages, and aspects of the various embodiments of the present invention will become more apparent from the accompanying drawings and the following detailed description. The drawings are provided for a better understanding of the invention and are not intended to limit the invention. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:
[0052] Figure 1 is a schematic flowchart of a method for tracing and analyzing the causes of ozone pollution based on VOC component characteristics, provided in an embodiment of the present invention;
[0053] Figure 2 is a schematic diagram of the overall process of the method for tracing and analyzing the causes of ozone pollution based on VOC component characteristics according to an embodiment of the present invention;
[0054] Figure 3 is a schematic diagram of the SA-PLS fine screening process according to an embodiment of the present invention;
[0055] Figure 4 is a schematic diagram of the PSCF potential source contribution and HYSPLIT backward trajectories based on the screened key volatile organic compounds (VOCs) in an embodiment of the present invention;
[0056] Figure 5 is a bar chart comparing the O3 prediction R² and RMSE in the ablation experiments of each module in an embodiment of the present invention.
Detailed Implementation
[0057] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0058] Furthermore, the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0059] Explanation of the technical concept and synergistic effect principle of this invention: This invention does not simply connect known technologies such as CC-SMOTE, RF primary screening, SA-PLS, and photochemical age correction, but rather achieves synergistic effects among the modules through in-depth modification and coupling based on atmospheric chemical mechanisms.
[0060] (1) Coupling of CC-SMOTE and photochemical age correction: CC-SMOTE incorporates chemical constraints such as NO titration effect when generating synthetic samples to ensure the chemical rationality of the synthetic samples; photochemical age correction further corrects the consumption bias of VOCs in aging air masses. Together, they ensure the chemical authenticity of the input data for subsequent SA-PLS modeling, which cannot be achieved by using either technology alone.
[0061] (2) Coupling of mechanistic constraints between SA-PLS and RF primary screening: RF primary screening quickly eliminates redundant species, providing SA-PLS with a moderately sized search space (approximately 30 species); SA-PLS, in turn, applies chemical inertness penalties, collinearity penalties, and secondary product penalties during the search so that the screening results satisfy both statistical optimality and chemical rationality. Compared with using RF or SA-PLS alone, this two-level architecture of "statistical primary screening + chemical fine screening" significantly improves the accuracy of O3 prediction and the chemical interpretability of the screened key species (see the comparative experiments in the examples).
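The fine-screening search can be sketched as a generic simulated annealing loop over feature subsets. This is a simplified illustration rather than the embodiment's implementation: the objective is passed in as a callable (in the method described here it would be the cross-validated PLS error plus the penalty terms), and the single-species flip move, the geometric cooling schedule, the iteration count, and all names are assumptions.

```python
import math
import random


def anneal_subset(candidates, objective, n_iter=2000,
                  t_start=1.0, t_end=1e-3, seed=0):
    """Search for the subset of `candidates` (a list) minimizing `objective`."""
    rng = random.Random(seed)
    current = set(rng.sample(candidates, max(1, len(candidates) // 2)))
    current_score = objective(current)
    best, best_score = set(current), current_score
    for step in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(1, n_iter - 1))
        proposal = set(current)
        species = rng.choice(candidates)
        # Move: flip the membership of one randomly chosen species.
        if species in proposal:
            if len(proposal) == 1:
                continue  # keep at least one feature in the subset
            proposal.remove(species)
        else:
            proposal.add(species)
        score = objective(proposal)
        delta = score - current_score
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_score = proposal, score
            if score < best_score:
                best, best_score = set(proposal), score
    return best, best_score
```

A search space of about 30 species has roughly 2^30 subsets, which is why a stochastic search is used after the random forest pre-screening instead of exhaustive enumeration.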
[0062] The SA-PLS of this invention adds penalty terms specific to atmospheric chemistry, which distinguishes it from a direct transplantation of the general SA-PLS of Koul et al., and it achieves accurate and chemically rational identification of key ozone precursors through the synergy between the above modules.
[0063] Figure 1 is a schematic flowchart of a method for tracing and analyzing the causes of ozone pollution based on VOC component characteristics, provided in an embodiment of the present invention, and Figure 2 is a schematic diagram of the overall process of the method. As shown in Figures 1 and 2, the method for tracing and analyzing the causes of ozone pollution based on VOC component characteristics includes the following steps:
[0064] S1: Acquire concentration data of various volatile organic compounds (VOCs), ozone concentration data, nitrogen oxide concentration data, and meteorological parameter data in ambient air; and amplify samples from periods of high ozone concentration to generate synthetic samples to balance the number of samples at different ozone concentration levels.
[0065] Step S1 is used to collect and preprocess ambient air VOCs data. Specifically, it includes the following steps:
[0066] S1.1 Data Acquisition:
[0067] An ambient air sample analysis was performed using a pre-concentration-gas chromatography-dual detector system, enabling the analysis of 117 VOCs (including 57 PAMS, 65 TO-15, and 13 aldehydes and ketones) in a single injection. Specific instrument configurations can be selected within the existing technical framework based on the application scenario and do not constitute a necessary limitation on the algorithm of this invention.
[0068] S1.2, Data Preprocessing and Chemical Constraint SMOTE (CC-SMOTE):
[0069] Samples from periods of high ozone concentration are amplified by the chemically constrained synthetic minority oversampling technique (CC-SMOTE) to generate synthetic samples. CC-SMOTE filters candidate synthetic samples through one or more chemical constraint discriminant functions and retains only the synthetic samples that pass them, thereby defining a chemically feasible region.
[0070] The concentration data of the 117 VOCs are time-aligned with the O3, NOx, and meteorological parameter data. To address the class imbalance, in which samples from high-ozone periods are far fewer than those from low-ozone periods, this invention uses CC-SMOTE to generate synthetic samples.
[0071] Construction and engineering implementation of constraint rule library:
[0072] Each chemical constraint is expressed as a discriminant function $g_k(x) \in \{0, 1\}$. A synthetic sample $x$ passes the filter when $\prod_k g_k(x) = 1$, that is, when it passes every constraint. When the pass rate of minority class synthetic samples is below 30%, the quantile constraint interval is automatically widened by ±5 percentage points. All modules are implemented in Python. The constraint functions are not required to be statistically independent; their joint decision defines the "empirical chemically feasible region", ensuring that the synthetic samples are reasonable overall from the standpoint of atmospheric chemistry.
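The automatic widening rule can be sketched as below. The text states only that the quantile interval is widened by ±5 percentage points when the pass rate falls under 30%; whether the widening is applied once or repeated until the pass rate recovers is not specified, so the repeated loop, the function name, and the callback interface are assumptions.

```python
def widen_until_acceptable(pass_rate_fn, lower_pct=5.0, upper_pct=95.0,
                           min_pass_rate=0.30, step=5.0):
    """Widen the percentile interval until enough synthetic samples pass.

    pass_rate_fn(lower_pct, upper_pct) must return the fraction of candidate
    synthetic samples that pass all constraints for that interval.
    """
    while pass_rate_fn(lower_pct, upper_pct) < min_pass_rate:
        if lower_pct <= 0.0 and upper_pct >= 100.0:
            break  # the interval cannot be widened any further
        lower_pct = max(0.0, lower_pct - step)
        upper_pct = min(100.0, upper_pct + step)
    return lower_pct, upper_pct
```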
[0073] The specific mathematical definition of constraint rules:
[0074] (a) NO titration effect constraint
[0075] NO titration effect constraint definition: It is prohibited to synthesize samples where both nitrogen oxide (NOx) and ozone (Ozone) concentrations simultaneously exceed their respective judgment thresholds. These judgment thresholds are determined based on the upper quartile of the NOx concentration in the training set and the upper limit of the upper prediction interval for the Ozone concentration at a given NOx concentration. Specifically:
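[0076] For any sample $i$ in the minority class, let its NO concentration be $[\mathrm{NO}]_i$ and its O3 concentration be $[\mathrm{O_3}]_i$. Define the constraint function $g_1$:
[0077] $$ g_1(x_i) = \begin{cases} 0, & [\mathrm{NO}]_i > Q_{75}([\mathrm{NO}]) \ \text{and} \ [\mathrm{O_3}]_i > U_{95}([\mathrm{O_3}] \mid [\mathrm{NO}]_i) \\ 1, & \text{otherwise} \end{cases} $$
[0078] where $Q_{75}([\mathrm{NO}])$ is the upper quartile of the NO concentration over all samples in the training set;
[0079] $U_{95}([\mathrm{O_3}] \mid [\mathrm{NO}]_i)$ is the upper limit of the 95% prediction range of the O3 concentration, calculated from the training-set samples whose NO concentration lies within $[\mathrm{NO}]_i \pm \delta$, where $\delta$ is 10% of the standard deviation of the NO concentration.
[0080] This constraint is based on the chemical titration relationship $\mathrm{NO} + \mathrm{O_3} \rightarrow \mathrm{NO_2} + \mathrm{O_2}$ and prevents the synthesis of chemically unreasonable samples in which high NO and high O3 coexist.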
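A minimal Python sketch of the NO titration check follows. It uses the empirical 95th percentile of O3 among training samples with similar NO as a stand-in for the upper limit of the 95% prediction range, which is an assumption about how that limit is computed. Accepting a sample when no training neighbours exist is also an assumption, as are the function names and the linear-interpolation percentile.

```python
import statistics


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    data = sorted(values)
    pos = (len(data) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def no_titration_ok(no_syn, o3_syn, train_no, train_o3):
    """Return 1 if the sample passes, 0 if high NO and high O3 coexist."""
    no_q75 = percentile(train_no, 75)
    delta = 0.10 * statistics.pstdev(train_no)
    # O3 of training samples whose NO lies within +/- delta of the sample's NO.
    neighbours = [o3 for no, o3 in zip(train_no, train_o3)
                  if abs(no - no_syn) <= delta]
    if not neighbours:
        return 1  # assumption: no local evidence, so do not reject
    o3_upper = percentile(neighbours, 95)
    return 0 if (no_syn > no_q75 and o3_syn > o3_upper) else 1
```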
[0081] (b) Source spectrum consistency constraints
[0082] Source consistency constraint definition: Ensure that the concentration ratio of a specific volatile organic compound species pair in the synthesized sample falls within the interval formed by the lowest percentile (5th percentile) and the highest percentile (95th percentile) of that ratio in the training set. Specifically:
[0083] For a characteristic species ratio $r = C_A / C_B$, define the constraint function $g_2$:
[0084] $$ g_2(x) = \begin{cases} 1, & Q_5(r_{\mathrm{train}}) \le r_{\mathrm{syn}} \le Q_{95}(r_{\mathrm{train}}) \\ 0, & \text{otherwise} \end{cases} $$
[0085] where $Q_5(r_{\mathrm{train}})$ and $Q_{95}(r_{\mathrm{train}})$ are the 5th and 95th percentiles of the ratio in the training set, respectively. This constraint ensures that the species ratios of the synthetic samples fall within the range supported by the measured data.
[0086] Symbols: $C_A$ is the concentration of the VOC species in the numerator of the characteristic ratio; $C_B$ is the concentration of the VOC species in the denominator of the characteristic ratio; $r_{\mathrm{syn}}$ is the characteristic species concentration ratio in the synthetic sample; $r_{\mathrm{train}}$ is the characteristic species concentration ratio in the training set samples; $g_2$ is the source spectrum consistency constraint discriminant function, whose output is 1 when the constraint is passed and 0 when it is not; and $x$ is the feature vector of the synthetic sample to be judged.
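The source spectrum consistency check can be sketched in Python as follows. The species pair is left generic because the text does not name specific pairs; the function names, the rejection of non-positive denominators, and the linear-interpolation percentile are assumptions.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    data = sorted(values)
    pos = (len(data) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def ratio_constraint_ok(c_num, c_den, train_num, train_den,
                        lower_pct=5.0, upper_pct=95.0):
    """Return 1 if the synthetic ratio lies within the training percentiles."""
    if c_den <= 0:
        return 0  # assumption: an undefined ratio is treated as a failure
    ratios = [n / d for n, d in zip(train_num, train_den) if d > 0]
    low = percentile(ratios, lower_pct)
    high = percentile(ratios, upper_pct)
    return 1 if low <= c_num / c_den <= high else 0
```

Exposing the percentile bounds as parameters makes it straightforward to apply the automatic widening of the quantile interval when too few synthetic samples pass.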
[0087] (c) Constraints on the relationship between oxidation products and the parent compound
[0088] The constraint definition for the relationship between oxidation products and the parent compound is: ensuring that the concentration ratio of the secondary photochemical oxidation products in the synthesized sample to the concentration of the parent volatile organic compounds does not exceed the highest percentile (95th percentile) of that ratio in the training set. Specifically:
[0089] For a known secondary product such as MVK (methyl vinyl ketone, a known secondary photochemical oxidation product), the parent species is isoprene, a primary precursor emitted from natural sources and the photochemical precursor of MVK. Define the constraint function $g_3$:
[0090] $$ g_3(x) = \begin{cases} 1, & \dfrac{[\mathrm{MVK}]_{\mathrm{syn}}}{[\mathrm{ISOP}]_{\mathrm{syn}}} \le Q_{95}\left(\dfrac{[\mathrm{MVK}]_{\mathrm{train}}}{[\mathrm{ISOP}]_{\mathrm{train}}}\right) \\ 0, & \text{otherwise} \end{cases} $$
[0091] where $g_3$ is the discriminant function for the relationship between oxidation products and the parent compound, whose output is 1 if the constraint is met and 0 if it is not; $[\mathrm{MVK}]_{\mathrm{syn}}$ is the concentration of MVK in the synthetic sample; $[\mathrm{ISOP}]_{\mathrm{syn}}$ is the concentration of isoprene in the synthetic sample; $[\mathrm{MVK}]_{\mathrm{train}}$ is the concentration of MVK in the training set; $[\mathrm{ISOP}]_{\mathrm{train}}$ is the concentration of isoprene in the training set; and $Q_{95}(\cdot)$ is the 95th percentile of the ratio sequence in the training set.
[0092] This constraint prevents the synthesis of unreasonable samples with an abnormally high proportion of secondary products.
[0093] Synthesis process: for the minority class samples (those whose ozone concentration exceeds the upper quartile of the ozone concentration in the training set), candidate synthetic samples are generated with the standard SMOTE algorithm and then filtered in turn by the constraint functions (a), (b), and (c). Only synthetic samples that pass all three constraints (i.e., the product of the three function values is 1) are retained and added to the training set.
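The generate-then-filter loop can be sketched in plain Python as below. The interpolation step is the standard SMOTE construction (a random point on the segment between a minority sample and one of its nearest minority neighbours). The constraint functions are passed in as callables returning 1 or 0. The function names, the neighbour count, the retry limit, and the returned pass rate are illustrative assumptions.

```python
import random


def nearest_neighbours(sample, pool, k):
    """The k samples in `pool` closest to `sample` (Euclidean), excluding itself."""
    others = [p for p in pool if p is not sample]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(sample, p)))
    return others[:k]


def smote_candidate(sample, neighbour, rng):
    """Random point on the segment between a sample and one neighbour."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(sample, neighbour)]


def cc_smote(minority, constraints, n_target, k=5, max_tries=10000, seed=0):
    """Generate up to n_target synthetic samples that pass every constraint.

    Returns the accepted samples and the observed pass rate.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    accepted, tries = [], 0
    while len(accepted) < n_target and tries < max_tries:
        tries += 1
        base = rng.choice(minority)
        neighbour = rng.choice(nearest_neighbours(base, minority, k))
        candidate = smote_candidate(base, neighbour, rng)
        # Keep the candidate only if the product of all constraints is 1.
        if all(g(candidate) == 1 for g in constraints):
            accepted.append(candidate)
    return accepted, (len(accepted) / tries if tries else 0.0)
```

The returned pass rate is the quantity that would be compared against the 30% threshold to decide whether the quantile intervals need widening.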
[0094] S2: Based on the synthetic samples obtained after amplification, a random forest model is used with ozone concentration as the response variable and the concentration of various VOCs as the independent variable to calculate the feature importance of each VOC species, and select VOC species whose cumulative importance contribution reaches a preset threshold to form an initial feature subset.
[0095] Step S2 is used to perform initial screening of feature importance based on random forest. The specific process is as follows:
[0096] Random Forest (RF) was used to model the CC-SMOTE-processed training set, with hourly O3 concentration as the response variable and the concentrations of 117 VOCs as independent variables. The number of trees was set to 500, and the splitting criterion was Gini impurity (also known as the Gini coefficient in Random Forest). The feature importance of VOC species was calculated based on the average decrease in Gini impurity. Species contributing 85% of the cumulative importance were selected as the initial feature subset (typically about 25-35 species). Gini impurity is an indicator of impurity when splitting decision tree nodes. One standard method for calculating feature importance in Random Forest is based on the mean decrease in Gini impurity: for each feature, the decrease in Gini impurity resulting from all splits across all decision trees is summed, and then divided by the number of trees to obtain the feature's importance score. Therefore, feature importance corresponds to a numerical value for each VOC species, representing the species' importance in predicting ozone concentration. The cumulative importance contribution is calculated by sorting each VOC species by feature importance from high to low, accumulating their importance values sequentially, and dividing by the total sum of importance to obtain the cumulative percentage. When the proportion of the accumulated value to the total sum of importance first reaches a preset threshold (e.g., 85%), all species up to that position are selected as the initial feature subset.
[0097] Specifically, the feature importance is calculated as follows: For each decision tree in the random forest, when using Gini impurity as the node splitting criterion, the reduction in Gini impurity for each feature during a node split (i.e., the Gini impurity of the node before the split minus the weighted average Gini impurity of the child nodes after the split) is added to the feature's score. After accumulating the scores for all node splits across all decision trees, the score is divided by the total number of decision trees to obtain the Gini importance score for each VOC species. All species are sorted from highest to lowest score, and the scores are accumulated sequentially. When the accumulated score first reaches 85% of the total score, these species are selected to form the initial feature subset.
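The cumulative-importance cut-off described above can be sketched in a few lines. This is a minimal illustration, not the implementation of this invention: the function name `select_by_cumulative_importance` and the five demo scores are hypothetical. In practice the scores would come from the fitted random forest (for example, the `feature_importances_` attribute of a scikit-learn forest).

```python
import numpy as np

def select_by_cumulative_importance(importances, names, threshold=0.85):
    """Return the species whose cumulative normalized importance first reaches `threshold`."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]                 # indices sorted from high to low
    cumulative = np.cumsum(importances[order]) / importances.sum()
    # Position of the first cumulative share >= threshold, inclusive of that species
    cutoff = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    return [names[i] for i in order[:cutoff]]

# Hypothetical importance scores for five species (illustrative values only)
demo_names = ["toluene", "isoprene", "propylene", "ethane", "benzene"]
demo_scores = [0.40, 0.30, 0.20, 0.04, 0.06]
selected = select_by_cumulative_importance(demo_scores, demo_names, threshold=0.85)
print(selected)
```

With these demo scores the cumulative shares are 0.40, 0.70, 0.90, so the first three species are retained, because the third one is the first to carry the total past 85%.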
[0098] Selection basis for the 85% threshold: In preliminary experiments, this invention compared the impact of three thresholds (80%, 85%, and 90%) on the final screening results. On the same validation set, the 85% threshold ensured that the candidate subset included all final key species (100% recall) while compressing the search space from 117 dimensions to approximately 30 dimensions, balancing the computational efficiency and screening completeness of SA-PLS. The 80% threshold missed key species such as propylene, while the 90% threshold introduced too many redundant species, leading to slow SA convergence. Therefore, 85% was determined to be the preferred threshold in this invention.
[0099] S3: For each VOC species in the initial feature subset, the initial concentration of the species is reconstructed based on its reaction rate constant with hydroxyl radicals and the estimated photochemical age of the air mass, so as to correct the influence of photochemical consumption on the measured concentration, and the VOC species reconstructed from the initial concentration constitute a candidate feature subset.
[0100] Step S3 is used to reconstruct the concentration for photochemical age correction.
[0101] Step S3 specifically includes: estimating the photochemical age of the air mass using the concentration ratio of ethylbenzene to m / p-xylene; dynamically estimating the average concentration of hydroxyl radicals in the atmosphere using a parametric model based on the photolysis rate of nitrogen dioxide, temperature, and relative humidity; and reconstructing the initial concentration of the VOC species based on the reaction rate constant between the VOC species and hydroxyl radicals, the average concentration of the hydroxyl radicals, and the photochemical age. Specifically, as follows:
[0102] In downwind or aging air masses, VOC concentrations have undergone significant chemical depletion, and directly using measured concentrations can lead to misjudgments of precursor importance. This invention introduces photochemical age correction:
[0103] 1) Estimating the photochemical age of the air mass using the concentration ratio of ethylbenzene to m / p-xylene:
[0104] The specific calculation method is as follows: using the concentration ratio of ethylbenzene to m / p-xylene, combined with the reaction rate constants of ethylbenzene and m / p-xylene with hydroxyl radicals, and the average concentration of hydroxyl radicals, the following formula is used for estimation:
[0105] $$\Delta t = \frac{1}{[\mathrm{OH}]\,(k_{X} - k_{E})}\left[\ln\left(\frac{[\text{Ethylbenzene}]}{[\text{Xylene}]}\right)_{t} - \ln\left(\frac{[\text{Ethylbenzene}]}{[\text{Xylene}]}\right)_{0}\right];$$
[0106] Where Δt: photochemical age of the air mass, in seconds (s); k_X: rate constant of the reaction between m / p-xylene and ·OH radicals, in cm³·molecule⁻¹·s⁻¹; k_E: rate constant of the reaction between ethylbenzene and ·OH radicals, in cm³·molecule⁻¹·s⁻¹; [OH]: average concentration of hydroxyl radicals (·OH) in the atmosphere, in molecules·cm⁻³; [Ethylbenzene]: measured concentration of ethylbenzene in ambient air; [Xylene]: measured concentration of m / p-xylene in ambient air at time t; ([Ethylbenzene]/[Xylene])_t: measured concentration ratio of ethylbenzene to m / p-xylene at time t; ([Ethylbenzene]/[Xylene])_0: initial concentration ratio of ethylbenzene to m / p-xylene in the fresh emission source spectrum, taken as a reference value of 0.35 to 0.45 and determined from the measured local fresh emission source spectrum or from the petrochemical / solvent source reference value in the "Second National Pollution Source Census Production and Discharge Coefficient Manual (VOCs General Source Item)".
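As a short derivation sketch, the age estimate follows from two decay laws. It assumes that both species are lost only by first-order reaction with ·OH at a constant average [OH], and that they share a common source so that dilution cancels in the ratio. The shorthand [E], [X], k_E, and k_X (ethylbenzene and m / p-xylene) is introduced here for illustration only:

```latex
\begin{aligned}
[E]_t &= [E]_0\, e^{-k_E [\mathrm{OH}] \Delta t}, \qquad
[X]_t = [X]_0\, e^{-k_X [\mathrm{OH}] \Delta t} \\
\frac{[E]_t}{[X]_t} &= \frac{[E]_0}{[X]_0}\, e^{(k_X - k_E)\,[\mathrm{OH}]\,\Delta t} \\
\Delta t &= \frac{1}{[\mathrm{OH}]\,(k_X - k_E)}
  \ln\!\left( \frac{[E]_t / [X]_t}{[E]_0 / [X]_0} \right)
\end{aligned}
```

Because m / p-xylene reacts with ·OH faster than ethylbenzene (k_X > k_E), the ratio rises as the air mass ages, so Δt is positive whenever the measured ratio exceeds the fresh-emission ratio.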
[0107] 2) Parameterized estimation of the ·OH concentration [OH]:
[0108] The specific estimation method is as follows: based on the photolysis rate of nitrogen dioxide, temperature, and relative humidity, a parameterized model based on observation constraints is used for dynamic estimation. The parameters of this parameterized model are obtained through regression calibration using local observation data. The specific formula is:
[0109] ;
[0110] Where [OH]: average concentration of hydroxyl radicals (·OH) in the atmosphere, in molecules·cm⁻³; α: calibration coefficient in the OH concentration parameterization model, obtained through regression fitting of local observation data; β: exponential parameter in the OH concentration parameterization model, obtained through regression fitting of local observation data; J(NO2): photolysis rate constant of NO2, in s⁻¹; T: atmospheric temperature, in Kelvin (K); RH: relative humidity, expressed as a decimal or percentage, dimensionless.
[0111] The OH concentration parameterization model is an empirical relationship dominated by J(NO2), with a structure based on the Atkinson photochemical reaction system; the parameters α and β are obtained by regression calibration using the local observation data of this embodiment. Because its parameters are calibrated locally, this model avoids the applicability limitations that come with using fixed constants.
[0112] 3) Concentration reconstruction:
[0113] Based on the reaction rate constant of the VOC species with hydroxyl radicals, the average concentration of hydroxyl radicals, and the photochemical age, the initial concentration of the VOC species is reconstructed. Specifically, for each VOC species in the candidate feature subset, based on its measured concentration, the reaction rate constant of the VOC species with hydroxyl radicals, the average concentration of hydroxyl radicals, and the photochemical age of the air mass, the initial concentration of the VOC species is reconstructed using the following formula:
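[0114] $$[\mathrm{VOC}]_{0} = [\mathrm{VOC}]_{t}\,\exp\left(k_{\mathrm{VOC}}\,[\mathrm{OH}]\,\Delta t\right);$$
[0115] Where [VOC]_0: estimated initial emission concentration of the VOC species after reconstruction; [VOC]_t: measured concentration of the VOC species in ambient air; k_VOC: rate constant of the reaction between this VOC species and ·OH radicals, in cm³·molecule⁻¹·s⁻¹; [OH]: average concentration of hydroxyl radicals in the atmosphere; Δt: photochemical age of the air mass.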
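The age estimate and the concentration reconstruction of step S3 can be illustrated with a round-trip check. This is a sketch under stated assumptions: the rate constants are approximate room-temperature literature values, and the [OH] level, the 6-hour age, and the propylene concentration are illustrative choices, none of them taken from this embodiment. The function names are hypothetical.

```python
import math

def photochemical_age(ratio_t, ratio_0, k_x, k_e, oh):
    """Photochemical age (s) from the ethylbenzene / m,p-xylene concentration ratio."""
    return math.log(ratio_t / ratio_0) / (oh * (k_x - k_e))

def reconstruct_initial(conc_t, k_voc, oh, dt):
    """Initial concentration, undoing first-order loss to OH over dt seconds."""
    return conc_t * math.exp(k_voc * oh * dt)

# Assumed illustrative inputs (not from the source)
K_E = 7.0e-12        # ethylbenzene + OH, cm^3 molecule^-1 s^-1 (approximate)
K_X = 1.9e-11        # m,p-xylene + OH, cm^3 molecule^-1 s^-1 (approximate)
K_PROPENE = 2.6e-11  # propylene + OH, cm^3 molecule^-1 s^-1 (approximate)
OH = 2.0e6           # molecules cm^-3, assumed daytime average
RATIO_0 = 0.40       # fresh-emission ratio, inside the 0.35-0.45 reference range
TRUE_DT = 6 * 3600.0 # assumed "true" age of 6 hours, used to generate test data

# Age the air mass forward, then recover the age from the aged ratio
ratio_t = RATIO_0 * math.exp((K_X - K_E) * OH * TRUE_DT)
dt = photochemical_age(ratio_t, RATIO_0, K_X, K_E, OH)

# Deplete 2.0 ppbv of propylene over the same period, then reconstruct it
propene_t = 2.0 * math.exp(-K_PROPENE * OH * TRUE_DT)
propene_0 = reconstruct_initial(propene_t, K_PROPENE, OH, dt)
print(f"age = {dt / 3600:.2f} h, measured = {propene_t:.3f} ppbv, "
      f"reconstructed = {propene_0:.3f} ppbv")
```

The round trip recovers both the assumed age and the starting concentration, which is a quick way to confirm that the two formulas are mutually consistent before applying them to observations.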
[0116] S4: Using a simulated annealing-optimized partial least squares regression framework, the candidate feature subset is searched, and the optimal VOCs feature subset is selected with the goal of minimizing the objective function.
[0117] Step S4 performs SA-PLS fine screening under photochemical mechanism constraints.
[0118] This invention employs simulated annealing-optimized partial least squares regression (SA-PLS) as the fine-screening framework and adds atmospheric chemistry-specific penalty terms to the objective function to adapt it to the domain. The PLS model is used only for statistical mapping and feature selection and does not directly characterize the physical causal contribution of species to ozone formation. Figure 3 is a schematic diagram of the SA-PLS fine-screening process according to an embodiment of the present invention.
[0119] S4.1 Construction of the objective function (normalization process):
[0120] To ensure that each penalty term is comparable to RMSECV in terms of dimensions, this invention normalizes the objective function:
[0121] $$J(S, \mathrm{LVs}) = \mathrm{RMSECV}_{\mathrm{norm}}(S, \mathrm{LVs}) + \lambda_{1} P_{\mathrm{inert}}(S) + \lambda_{2} P_{\mathrm{coll}}(S) + \lambda_{3} P_{\mathrm{sec}}(S);$$
[0122] The objective function comprises a normalized cross-validation root mean square error term, a chemical inertness soft penalty term, a collinearity soft penalty term, and a secondary product penalty term. S is the candidate feature subset, and LVs is the number of latent variables in the partial least squares regression; RMSECV_norm is the normalized root mean square error of cross-validation; P_inert is the chemical inertness soft penalty term; P_coll is the collinearity soft penalty term; P_sec is the secondary product penalty term; and λ1, λ2, λ3 are the corresponding penalty weights.
[0123] (1) RMSECV_norm: RMSECV is divided by the standard deviation of the O3 concentration in the training set to obtain the dimensionless normalized cross-validation error, with a value range of approximately 0.1 to 1.0.
[0124] (2) P_inert (chemical inertness soft penalty term):
[0125] The chemical inertness soft penalty term is defined as follows: when the reaction rate constant of each volatile organic compound species with hydroxyl radicals in the current candidate feature subset is lower than the chemical inertness threshold, a non-zero penalty is imposed on that species. The penalty intensity increases as the reaction rate constant of that species with hydroxyl radicals decreases. The value of the chemical inertness soft penalty term is the average of the penalty values of all species in the current candidate feature subset. The specific formula is as follows:
[0126]
[0127] Where P_inert(S): chemical inertness penalty term, the average degree to which the reactivity of the species in the candidate feature subset S falls below the threshold; |S|: number of VOC species included in the candidate feature subset S; j: index of the j-th VOC species in the candidate feature subset S; k_OH,j: rate constant of the reaction between the j-th VOC species in the candidate feature subset S and the ·OH radical, in cm³·molecule⁻¹·s⁻¹; k_th: chemical inertness determination threshold, a preset value; max(0,·): the maximum value function, ensuring that the penalty term is non-negative.
[0128] This penalty term imposes a continuous penalty on species whose k_OH is below the threshold, with the penalty intensity increasing as k_OH decreases, thus avoiding the rigidity problem of a hard threshold.
[0129] (3) P_coll (collinearity soft penalty term):
[0130] The collinearity soft penalty term is defined as: a penalty imposed on the excess portion when the average correlation coefficient among all volatile organic compound species within the current candidate feature subset exceeds a preset correlation coefficient threshold. The specific formula is as follows:
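[0131] $$P_{\mathrm{coll}}(S) = \max\left(0,\ \bar{r}(S) - r_{\mathrm{th}}\right)$$
[0132] Where P_coll(S): collinearity soft penalty term, which measures the extent to which the average correlation among VOC species in the candidate feature subset S exceeds the threshold; r̄(S): average Pearson correlation coefficient among all VOC species in the candidate feature subset S; r_th: preset correlation coefficient threshold.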
[0133] A penalty is imposed when the average correlation coefficient exceeds 0.85.
[0134] (4) P_sec (secondary product penalty term):
[0135] The secondary product penalty term is defined as the proportion of species in the current candidate feature subset marked as secondary photochemical oxidation products to the total number of species in the current candidate feature subset. The specific formula is as follows:
[0136] $$P_{\mathrm{sec}}(S) = \frac{N_{\mathrm{sec}}(S)}{|S|}$$
[0137] Where N_sec(S): number of species in the candidate feature subset S labeled as secondary photochemical oxidation products (e.g., MVK, MACR); |S|: total number of species in the candidate feature subset S; the candidate feature subset S is the subset of VOCs features being evaluated during the current SA-PLS search; MVK is methyl vinyl ketone and MACR is methacrolein, both examples of secondary photochemical oxidation products. Note: primary emission precursors such as isoprene and toluene are not included in this category. The secondary product penalty term is therefore the proportion of secondary photochemical products in the candidate feature subset S.
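The normalized objective of S4.1 can be sketched as follows, with several assumptions. The inner form of the inertness penalty (the relative shortfall `1 - k/k_th`) is an assumption, since the source states only that the penalty is an average of non-negative per-species terms that grow as k_OH falls below the threshold. The threshold value `1e-12` used in the demo is hypothetical. The default weights are the approximate values 0.05, 0.10, and 0.08 quoted in this specification. All function names are illustrative.

```python
import numpy as np

def inertness_penalty(k_oh, k_th):
    """Mean relative shortfall of k_OH below k_th (ASSUMED inner form, see lead-in)."""
    k_oh = np.asarray(k_oh, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - k_oh / k_th)))

def collinearity_penalty(X, r_th=0.85):
    """Excess of the mean pairwise Pearson r over r_th; X is (n_samples, n_species)."""
    n = X.shape[1]
    if n < 2:
        return 0.0
    corr = np.corrcoef(X, rowvar=False)
    mean_r = corr[np.triu_indices(n, k=1)].mean()   # off-diagonal pairs only
    return float(max(0.0, mean_r - r_th))

def secondary_penalty(is_secondary):
    """Share of species in the subset flagged as secondary photochemical products."""
    return float(np.mean(np.asarray(is_secondary, dtype=float)))

def objective(rmsecv, sigma_o3, k_oh, k_th, X, is_secondary,
              weights=(0.05, 0.10, 0.08)):
    """Normalized RMSECV plus the three weighted penalty terms."""
    l1, l2, l3 = weights
    return (rmsecv / sigma_o3
            + l1 * inertness_penalty(k_oh, k_th)
            + l2 * collinearity_penalty(X)
            + l3 * secondary_penalty(is_secondary))

# Two-species toy subset: one weakly reactive species, perfectly correlated columns
X_demo = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
value = objective(rmsecv=10.0, sigma_o3=20.0, k_oh=[2e-12, 1e-13], k_th=1e-12,
                  X=X_demo, is_secondary=[True, False])
print(round(value, 4))
```

Keeping each penalty in the range 0 to 1 is what makes the weights comparable with the normalized RMSECV term, which is the stated purpose of the normalization.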
[0138] S4.2, Penalty Weight Optimization Method:
[0139] The optimization employs a method combining nested cross-validation and Bayesian optimization. The specific steps are as follows:
[0140] 1. Divide the training set into 5 equal-sized subsets.
[0141] 2. For each training / validation partition in the outer loop, perform Bayesian optimization over the penalty weights λ1, λ2, and λ3 in the inner loop (using a Gaussian process surrogate model with the Expected Improvement acquisition function), with the goal of minimizing the validation set RMSECV.
[0142] 3. Select the (λ1, λ2, λ3) combination with the best average performance across all outer-layer partitions as the final weights.
[0143] 4. In the experiment of this invention based on observation data from the Beijing-Tianjin-Hebei region, the optimal weights λ1, λ2, and λ3 were obtained through the above optimization process; for ease of description, approximate values of 0.05, 0.10, and 0.08 are used. The optimization process is detailed in step 5 of Example 1.
[0144] S4.3 Simulated Annealing Search Parameters:
[0145] The random seed is fixed at seed=42, and the process is repeated 10 times to obtain the optimal result of the objective function.
[0146] Initial temperature T0: set by the initial acceptance probability method so that the probability of accepting a worse solution in the initial stage is approximately 80%.
[0147] Cooling factor α = 0.92.
[0148] Perturbation mechanism: A VOC is randomly added or removed with a probability of 0.5, and LVs are changed (±1) with a probability of 0.5.
[0149] Termination condition: No improvement after 50 consecutive iterations or .
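The search parameters above can be put together in a small, generic simulated annealing loop. This is a sketch, not the search routine of this invention. The cost function is a toy stand-in for the PLS-based objective, and the perturbation is read here as "toggle one VOC or change LVs by ±1, each with probability 0.5". The temperature floor `t_min` is an assumption that stands in for the second termination condition, which is not legible in the source. A single seeded generator is shared across the 10 restarts so that the restarts differ from one another.

```python
import math
import random

def initial_temperature(mean_worsening, p0=0.80):
    """T0 at which a typical worsening move is accepted with probability p0."""
    return -mean_worsening / math.log(p0)

def simulated_annealing(cost, n_features, max_lv, t0, rng,
                        cooling=0.92, patience=50, t_min=1e-6):
    """Search over (feature subset, LVs); returns (best_cost, best_subset, best_lv)."""
    subset = {i for i in range(n_features) if rng.random() < 0.5} or {0}
    lv = 1
    current = cost(subset, lv)
    best = (current, frozenset(subset), lv)
    temp, stale = t0, 0
    while stale < patience and temp > t_min:      # t_min is an assumed floor
        new_subset, new_lv = set(subset), lv
        if rng.random() < 0.5:
            new_subset ^= {rng.randrange(n_features)}   # add or remove one VOC
            if not new_subset:
                new_subset = set(subset)                # keep the subset non-empty
        else:
            new_lv = min(max_lv, max(1, lv + rng.choice((-1, 1))))
        candidate = cost(new_subset, new_lv)
        # Metropolis rule: always accept improvements, sometimes accept worse moves
        if candidate <= current or rng.random() < math.exp((current - candidate) / temp):
            subset, lv, current = new_subset, new_lv, candidate
        if current < best[0]:
            best, stale = (current, frozenset(subset), lv), 0
        else:
            stale += 1
        temp *= cooling
    return best

# Toy objective (hypothetical): distance from a known target subset and LV count
TARGET = {0, 2, 5}
def toy_cost(subset, lv):
    return len(set(subset) ^ TARGET) + 0.1 * abs(lv - 2)

rng = random.Random(42)
t0 = initial_temperature(mean_worsening=1.0)
runs = [simulated_annealing(toy_cost, 8, 4, t0, rng) for _ in range(10)]
best_cost, best_subset, best_lv = min(runs, key=lambda r: r[0])
print(best_cost, sorted(best_subset), best_lv)
```

In a real run `cost` would refit the PLS model for each candidate subset and return the normalized objective, and `mean_worsening` would be estimated from a batch of random trial moves.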
[0150] S4.4 Identification of key precursors:
[0151] A PLS model was established for the optimal subset and LVs, and the standardized PLS regression coefficients of each VOCs were calculated. The 8-12 species with the largest absolute values of the coefficients were selected as the final confirmed key ozone precursors, and their chemical category labels (primary natural source precursors, primary anthropogenic source precursors, and secondary photochemical products) were output.
[0152] S5: Based on the optimal VOCs feature subset, the standardized regression coefficients of each VOCs are calculated using a partial least squares regression model to determine the key ozone precursors, and the sources of the key ozone precursors are analyzed to generate pollution control strategies.
[0153] Step S5 is used to perform source resolution and policy generation.
[0154] Source apportionment specifically involves analyzing key precursors identified through SA-PLS screening, combining characteristic species diagnostic ratios, backward trajectory (HYSPLIT), and PSCF potential source contribution analysis. By reducing the number of variables from 117 to 8-12, the dimensionality of source apportionment is significantly reduced, simplifying subsequent analysis and reducing collinearity noise interference. Species labeled as "secondary photochemical products" are treated as indicators of their parent species' oxidation pathways in source apportionment and do not directly correspond to primary emission sources. Targeted emission reduction recommendations are then provided.
[0155] Example 1: Identification and Source Tracing of Key Ozone Precursors in a City in the Beijing-Tianjin-Hebei Region During Summer
[0156] 1. Data Collection:
[0157] Online monitoring was conducted using a pre-concentration-gas chromatography-dual detector system (Entech 7100 + Agilent 7890 GC-FID / MSD). The R² values of the standard curves for the 117 VOCs were all ≥0.995, and the method detection limits ranged from 0.005 to 0.1 ppbv. The monitoring period was from July 1 to July 31, 2022 (31 days in total), with a temporal resolution of 1 hour. NOx, CO, J(NO2), and meteorological parameters (temperature T, relative humidity RH) were collected synchronously.
[0158] Sample quality control: Raw data were discarded according to the following criteria:
[0159] (1) Instrument calibration or maintenance period (12 hours in total);
[0160] (2) The period during which the concentration of key species (≥10 species) is below the detection limit (a total of 9 hours);
[0161] (3) Outliers exceeding the mean ± 3 standard deviations in the daily variation continuity test (6 hours in total);
[0162] (4) Periods of heavy rainfall (hourly rainfall ≥ 5 mm, 10 hours in total), during which wet deposition significantly alters the distribution of VOC concentrations.
[0163] After the above quality control, a total of 707 valid samples were obtained (the theoretical maximum number of samples is 31×24=744, giving a valid-data rate of 95.0%).
[0164] Sample proportion explanation: After excluding precipitation periods, there were 68 groups of high-O3 samples (ozone concentration >160 μg / m³) and 639 groups of low-O3 samples, a ratio of approximately 1:9.4. This ratio reflects the actual pollution situation at this site in July: there were 7 precipitation events that month, with a cumulative total of 11 rainy days, which significantly suppressed ozone formation, so the number of high-O3 hours over the whole month was low.
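The sample accounting in the quality-control step can be checked with a few lines of arithmetic. The dictionary keys are illustrative labels for the four discard criteria; the numbers are those stated above.

```python
# Hours in the monitoring period: 31 days at 1-hour resolution
total_hours = 31 * 24

# Hours discarded under each quality-control criterion
discarded = {
    "calibration_or_maintenance": 12,
    "key_species_below_detection_limit": 9,
    "continuity_test_outliers": 6,
    "heavy_rainfall": 10,
}

valid = total_hours - sum(discarded.values())
valid_rate = valid / total_hours

# Class balance between high-O3 and low-O3 samples
high_o3, low_o3 = 68, 639
imbalance = low_o3 / high_o3

print(valid, f"{valid_rate:.1%}", f"1:{imbalance:.1f}")
```

The four discard counts sum to 37 hours, and the two classes add up to the 707 valid samples, so the stated figures are internally consistent.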
[0165] 2. CC-SMOTE processing:
[0166] Determine the constraint threshold based on the training set (data after removing the validation set):
[0167] ;
[0168] Based on sliding window ( )calculate;
[0169] Toluene / Benzene Ratio .
[0170] The basic SMOTE step was implemented with the Python imbalanced-learn library, with a custom constraint filter overlaid. The initial pass rate was approximately 72%, and no relaxation conditions were triggered. The high-O3 samples were expanded from 68 groups to approximately 185 groups, and the ratio of high-O3 to low-O3 samples improved to approximately 1:2.5.
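The constraint filter applied to the SMOTE candidates can be sketched as a boolean mask over candidate samples. This covers only the filtering stage and does not call imbalanced-learn. Several choices are assumptions made for illustration: the 5th and 95th percentiles as the "lowest and highest percentiles", a plain high percentile of O3 as a simplified stand-in for the sliding-window prediction-interval limit, MVK / isoprene as the product / parent pair, and the function names and toy data themselves.

```python
import numpy as np

def fit_thresholds(train, lo=5.0, hi=95.0):
    """Derive constraint thresholds from training data (dict of 1-D arrays)."""
    tb = train["toluene"] / train["benzene"]
    pp = train["mvk"] / train["isoprene"]
    return {
        "nox_max": np.percentile(train["nox"], 75),  # upper quartile of NOx
        "o3_max": np.percentile(train["o3"], hi),    # simplified stand-in (assumption)
        "tb_lo": np.percentile(tb, lo),
        "tb_hi": np.percentile(tb, hi),
        "pp_max": np.percentile(pp, hi),
    }

def chemically_feasible(cand, th):
    """True where a candidate synthetic sample passes all three constraints."""
    # NO titration: NOx and O3 must not both exceed their thresholds
    titration_ok = ~((cand["nox"] > th["nox_max"]) & (cand["o3"] > th["o3_max"]))
    # Source consistency: toluene / benzene ratio stays inside the training range
    tb = cand["toluene"] / cand["benzene"]
    source_ok = (tb >= th["tb_lo"]) & (tb <= th["tb_hi"])
    # Product / parent: the MVK / isoprene ratio must not exceed the training ceiling
    product_ok = (cand["mvk"] / cand["isoprene"]) <= th["pp_max"]
    return titration_ok & source_ok & product_ok

def arr(*values):
    return np.array(values, dtype=float)

# Toy training data and four candidate samples (illustrative values only)
train = {"nox": arr(10, 20, 30, 40, 50), "o3": arr(100, 120, 140, 160, 180),
         "toluene": arr(2, 4, 6, 8, 10), "benzene": arr(1, 1, 1, 1, 1),
         "mvk": arr(0.1, 0.2, 0.3, 0.4, 0.5), "isoprene": arr(1, 1, 1, 1, 1)}
cand = {"nox": arr(45, 45, 20, 20), "o3": arr(178, 150, 178, 150),
        "toluene": arr(5, 5, 12, 5), "benzene": arr(1, 1, 1, 1),
        "mvk": arr(0.2, 0.2, 0.2, 0.6), "isoprene": arr(1, 1, 1, 1)}

mask = chemically_feasible(cand, fit_thresholds(train))
print(mask.tolist(), f"pass rate = {mask.mean():.0%}")
```

Each rejected toy candidate violates exactly one constraint (titration, source ratio, and product / parent ratio, in that order), which makes the mask easy to verify by hand.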
[0171] 3. RF initial screening:
[0172] The Gini importance ranking of the RF model (500 trees, splitting criterion = Gini, random seed = 42) shows that the top 31 VOCs have a cumulative contribution of 85.2%, forming a candidate subset. If an 80% threshold is used, only 24 VOCs are selected (propylene is missed); a 90% threshold selects 41 VOCs (the number of SA convergence iterations increases by approximately...). This verifies the rationality of the 85% threshold.
[0173] 4. Photochemical age correction
[0174] [OH] was estimated dynamically using the parameterized OH concentration model. The parameters α and β were obtained through regression calibration using local historical observation data (same period in 2021). The calculated daily average [OH] was approximately ... . The ethylbenzene / m-xylene ratio during the midday period (11:00-15:00) was selected, and the initial ratio was taken as ... (measured value from local fresh emission sources). The estimated average photochemical age of the air mass was approximately 6–8 hours. For the candidate subset ..., the initial concentrations of the active species were reconstructed.
[0175] 5. SA-PLS fine screening (including mechanism constraints)
[0176] Penalty weight optimization process:
[0177] The optimal penalty weights λ1, λ2, and λ3 were determined using 5-fold nested cross-validation combined with Bayesian optimization. Search space: ... . Bayesian optimization used the scikit-optimize library with EI as the acquisition function and ran for 50 iterations. The average optimal weights from the outer-layer cross-validation were ...; for simplicity, the approximate values 0.05, 0.10, and 0.08 given in S4.2 are taken as the weights in this embodiment. Sensitivity analysis shows that when the weights fluctuate within ±0.02, the overlap rate of the top 8 species in the screening results reaches 87.5%, indicating that the optimal solution is robust.
[0178] SA search: with the random seed fixed at 42, the algorithm was run 10 times, and the result with the minimum objective function value was selected. The optimal solution converged after 243 iterations. The optimal subset contained 8 VOCs, and the optimal LVs = 3. In some cross-validation folds, weakly reactive species (such as ethane) occasionally entered the candidate subset but were removed in the final SA optimization, demonstrating the robust screening capability of the method of this invention. Table 1 shows the screening results (standardized PLS regression coefficients and OFP contribution rate):
[0179] Table 1 Screening Results
[0180]
[0181] Chemical justification: MVK selected by standard SA-PLS (unconstrained) was excluded in this invention due to the penalty for secondary products; carbon tetrachloride and dichloromethane selected by standard SA-PLS were excluded due to the penalty for chemical inertness. Isoprene, as a primary natural source precursor, was correctly retained.
[0182] 6. Source Analysis and Strategy Evaluation
[0183] Characteristic ratio analysis and PSCF potential source area calculations were performed only for the above eight key precursors. Toluene / benzene = 0.42 ± 0.05, indicating a mixed contribution from industrial sources and motor vehicles; isopentane / n-pentane = 1.1 ± 0.2, indicating a limited contribution from LPG. HYSPLIT 48-hour backward trajectory clustering (72% of trajectories originating from the southwest) combined with PSCF analysis identified the main pollution sources as the petrochemical industrial park upwind to the southwest (contributing approximately 40%–50%) and isoprene emissions from local vegetation (contributing approximately 25%–35%). Simulations with a photochemical box model based on MCM v3.3.1 showed that reducing the emissions of the eight key precursors by 30% could reduce the maximum daily 8-hour O3 concentration in the region by approximately 8%–12% (simulation uncertainty approximately ± 2%). Figure 4 is a schematic diagram of the PSCF potential source contributions and the HYSPLIT backward trajectories based on the screened key volatile organic compounds (VOCs).
[0184] Comparative verification experiments:
[0185] To verify the synergistic effect of the modules of this invention, ablation experiments were conducted on the same dataset; the results are shown in Table 2 and Figure 5:
[0186] Table 2 Comparison of O3 prediction performance of different module combinations (validation set, mean ± standard deviation)
[0187]
[0188] Ablation experiments show that the combination of the modules of this invention produces significant synergistic effects—the R² of the complete scheme is about 11% higher than that of the single SA-PLS and about 5% higher than that of the scheme with only a penalty term added, which verifies the synergistic effect of the coupling between modules.
[0189] In summary, this invention provides a method for tracing and analyzing the causes of ozone pollution based on VOCs component characteristics. Its core concept lies in deeply embedding atmospheric photochemical mechanisms into a machine learning feature screening process, constructing a two-tiered architecture of "statistical initial screening + chemical fine screening" to accurately identify key ozone precursors and trace their sources from over a hundred VOCs species. Specifically, addressing the imbalance problem of scarce samples during periods of high ozone concentration, three chemical constraints—NO titration effect, source spectrum consistency, and the relationship between oxidation products and the parent compound—are embedded on top of the standard SMOTE to ensure the atmospheric chemical rationality of the generated synthetic samples. Using ozone concentration as the response variable, the Gini importance of 117 VOCs is calculated, and species with a cumulative contribution of 85% are selected to form an initial feature subset, compressing the search space to approximately 30 dimensions. The photochemical age of the air mass is estimated using the ethylbenzene / xylene ratio, and the average concentration of hydroxyl radicals is estimated using a dynamic parameterization model to reconstruct the initial VOC concentration, correcting the concentration distortion caused by photochemical consumption in the downwind air mass. Employing a simulated annealing-optimized partial least squares regression framework, this invention innovatively incorporates three atmospheric chemistry-specific constraints—chemical inertia soft penalty, collinearity soft penalty, and secondary product penalty—into the objective function. This ensures that the screening results satisfy both statistical optimality and chemical rationality. 
For only the 8–12 key precursors selected, source tracing is performed using characteristic species diagnostic ratios, backward trajectory (HYSPLIT), and potential source contribution analysis (PSCF), resulting in targeted emission reduction strategies. Through the synergistic coupling of these methods, this invention achieves accurate and chemically rational identification of key ozone precursors, providing reliable technical support for the scientific management of ozone pollution.
[0190] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.
[0191] It should also be noted that, in the embodiments of this application, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0192] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined in the embodiments of this application may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown in this application, but is to be accorded the widest scope consistent with the principles and novel features disclosed in the embodiments of this application.
Claims
1. A method for tracing and analyzing the causes of ozone pollution based on the characteristics of VOCs components, characterized in that it includes the following steps: S1: Acquire concentration data of various volatile organic compounds (VOCs), ozone concentration data, nitrogen oxide concentration data, and meteorological parameter data in ambient air; and amplify samples from periods of high ozone concentration to generate synthetic samples to balance the number of samples at different ozone concentration levels. S2: Based on the synthetic samples obtained after amplification, a random forest model is used with ozone concentration as the response variable and the concentrations of the various VOCs as the independent variables to calculate the feature importance of each VOC species, and the VOC species whose cumulative importance contribution reaches a preset threshold are selected to form an initial feature subset. S3: For each VOC species in the initial feature subset, the initial concentration of the species is reconstructed based on its reaction rate constant with hydroxyl radicals and the estimated photochemical age of the air mass, so as to correct the influence of photochemical consumption on the measured concentration, and the VOC species reconstructed from the initial concentration constitute a candidate feature subset. S4: Using a simulated annealing-optimized partial least squares regression framework, the candidate feature subset is searched, and the optimal VOCs feature subset is selected with the goal of minimizing the objective function. S5: Based on the optimal VOCs feature subset, the standardized regression coefficient of each VOC is calculated using a partial least squares regression model to determine the key ozone precursors, and the sources of the key ozone precursors are analyzed to generate pollution control strategies.
2. The method according to claim 1, characterized in that, in step S1, the amplification of the samples from periods of high ozone concentration is specifically performed by using a chemically constrained synthetic minority oversampling technique to generate synthetic samples; the chemically constrained synthetic minority oversampling technique filters candidate synthetic samples by constructing one or more chemical constraint discriminant functions, retaining only the synthetic samples that pass all discriminant functions, so as to construct a chemically feasible region.
3. The method according to claim 2, characterized in that the chemical constraint discriminant functions include: an NO titration effect constraint: synthetic samples in which the concentrations of nitrogen oxides and ozone are simultaneously higher than their respective judgment thresholds are prohibited, the judgment thresholds being determined from the upper quartile of the nitrogen oxide concentration in the training set and the upper limit of the prediction interval of the ozone concentration for a given nitrogen oxide concentration; a source consistency constraint: the concentration ratio of a specific pair of volatile organic compound species in the synthetic sample must fall within the interval formed by the lowest and highest percentiles of that ratio in the training set; and an oxidation product-parent relationship constraint: the ratio of the concentration of a secondary photochemical oxidation product to the concentration of its parent volatile organic compound in the synthetic sample must not exceed the highest percentile of that ratio in the training set.
4. The method according to claim 1, characterized in that, in step S2, the preset threshold for the cumulative importance contribution is 85%; the random forest model is set to have 500 trees, and the splitting criterion is Gini impurity; the feature importance of the VOC species is calculated based on the mean decrease in Gini impurity.
5. The method according to claim 1, characterized in that step S3 specifically includes: estimating the photochemical age of the air mass using the concentration ratio of ethylbenzene to m / p-xylene; dynamically estimating the average concentration of hydroxyl radicals in the atmosphere using a parametric model based on the photolysis rate of nitrogen dioxide, temperature, and relative humidity; and reconstructing the initial concentration of the VOC species based on the reaction rate constant between the VOC species and hydroxyl radicals, the average concentration of the hydroxyl radicals, and the photochemical age.
6. The method according to claim 5, characterized in that the photochemical age of the air mass is estimated from the concentration ratio of ethylbenzene to m / p-xylene, combined with the reaction rate constants of ethylbenzene and m / p-xylene with hydroxyl radicals and the average concentration of hydroxyl radicals, using the following formula: $$\Delta t = \frac{1}{[\mathrm{OH}]\,(k_{X} - k_{E})}\left[\ln\left(\frac{[\text{Ethylbenzene}]}{[\text{Xylene}]}\right)_{t} - \ln\left(\frac{[\text{Ethylbenzene}]}{[\text{Xylene}]}\right)_{0}\right]$$; where Δt is the photochemical age of the air mass; k_X is the reaction rate constant between m / p-xylene and hydroxyl radicals; k_E is the reaction rate constant between ethylbenzene and hydroxyl radicals; [OH] is the average concentration of hydroxyl radicals in the atmosphere; ([Ethylbenzene]/[Xylene])_t is the measured concentration ratio of ethylbenzene to m / p-xylene at time t; and ([Ethylbenzene]/[Xylene])_0 is the initial concentration ratio of ethylbenzene to m / p-xylene in the fresh emission source spectrum.
7. The method according to claim 5, characterized in that the average concentration of hydroxyl radicals in the atmosphere is dynamically estimated using a parametric model based on the photolysis rate of nitrogen dioxide, temperature, and relative humidity, as shown in the following formula: ; where the model parameters are the calibration coefficients and the exponential parameter of the OH concentration parameterization model, which are obtained through regression fitting of local observation data; and the model inputs are the photolysis rate constant of NO2, the atmospheric temperature, and the relative humidity (RH).
8. The method according to claim 5, characterized in that reconstructing the initial concentration of the VOC species includes: for each VOC species in the candidate feature subset, reconstructing its initial concentration from its measured concentration, the reaction rate constant between the VOC species and hydroxyl radicals, the average concentration of hydroxyl radicals, and the photochemical age of the air mass, using the following formula: $[\mathrm{VOC}_i]_0 = [\mathrm{VOC}_i]_t \exp\left(k_i\,\overline{[\mathrm{OH}]}\,\Delta t\right)$; where $[\mathrm{VOC}_i]_t$ is the measured concentration of the VOC species in ambient air; $k_i$ is the reaction rate constant between the VOC species and hydroxyl radicals; $\overline{[\mathrm{OH}]}$ is the average concentration of hydroxyl radicals in the atmosphere; and $\Delta t$ is the photochemical age of the air mass.
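The back-correction in claim 8 can be sketched as follows, assuming first-order loss of each species to OH, so that the initial concentration equals the measured concentration multiplied by the exponential of the rate constant times the mean OH concentration times the photochemical age. The function name and all numeric inputs are assumptions for illustration.

```python
import math


def reconstruct_initial_concentration(measured, k_oh, oh_mean, age_seconds):
    """Undo first-order OH loss: C0 = Ct * exp(k_OH * [OH] * age)."""
    return measured * math.exp(k_oh * oh_mean * age_seconds)


# Illustrative inputs (assumed): a reactive species measured at 1.0 ppbv
# after one hour of aging at a mean OH of 2e6 molecule cm^-3.
K_OH = 1.0e-10    # cm^3 molecule^-1 s^-1, isoprene-like magnitude
OH_MEAN = 2.0e6   # molecule cm^-3
AGE = 3600.0      # seconds

initial = reconstruct_initial_concentration(1.0, K_OH, OH_MEAN, AGE)
print(f"reconstructed initial concentration: {initial:.2f} ppbv")
```

The correction is largest for fast-reacting species and vanishes as the photochemical age approaches zero, which is why it mainly changes the ranking of reactive precursors.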
9. The method according to claim 1, characterized in that, in step S4, the objective function is a normalized objective function comprising a normalized cross-validation root mean square error term, a chemical inertness soft penalty term, a collinearity soft penalty term, and a secondary product penalty term, as follows: $J(S, LVs) = \mathrm{NRMSE}_{CV}(S, LVs) + \lambda_1 P_{\mathrm{inert}}(S) + \lambda_2 P_{\mathrm{col}}(S) + \lambda_3 P_{\mathrm{sec}}(S)$; where $S$ is the candidate feature subset; $LVs$ is the number of latent variables in the partial least squares regression; $\mathrm{NRMSE}_{CV}$ is the normalized root mean square error of cross-validation; $P_{\mathrm{inert}}$ is the chemical inertness soft penalty term; $P_{\mathrm{col}}$ is the collinearity soft penalty term; $P_{\mathrm{sec}}$ is the secondary product penalty term; and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the corresponding penalty weights.
10. The method according to claim 9, characterized in that: the chemical inertness soft penalty term is defined as follows: for each volatile organic compound species in the current candidate feature subset whose reaction rate constant with hydroxyl radicals is lower than the chemical inertness determination threshold, a non-zero penalty is imposed on that species, and the penalty intensity increases as the reaction rate constant of the species with hydroxyl radicals decreases; the value of the chemical inertness soft penalty term is the average of the penalty values of all species in the current candidate feature subset; the collinearity soft penalty term is defined as follows: when the average correlation coefficient among all volatile organic compound species in the current candidate feature subset exceeds a preset correlation coefficient threshold, a penalty is applied to the excess portion; and the secondary product penalty term is defined as the proportion of species marked as secondary photochemical oxidation products in the current candidate feature subset relative to the total number of species in that subset.
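The penalty terms of claims 9 and 10 can be sketched as below. This is a hedged illustration rather than the claimed formulation: the claims state only that the inertness penalty grows as the OH rate constant falls below the threshold, so the linear shape used here is an assumption, as is the weighted-sum form of the objective. All function names, species, thresholds, and weights are hypothetical, and the mean correlation is assumed to be computed beforehand.

```python
def inertness_penalty(k_values, k_threshold):
    """Mean per-species penalty; zero at or above the threshold (linear shape assumed)."""
    penalties = [max(0.0, (k_threshold - k) / k_threshold) for k in k_values]
    return sum(penalties) / len(penalties)


def collinearity_penalty(mean_correlation, correlation_threshold):
    """Penalize only the part of the mean correlation above the threshold."""
    return max(0.0, mean_correlation - correlation_threshold)


def secondary_product_penalty(species, secondary_species):
    """Fraction of the subset flagged as secondary photochemical oxidation products."""
    return sum(1 for name in species if name in secondary_species) / len(species)


def objective(nrmse_cv, p_inert, p_col, p_sec, w_inert, w_col, w_sec):
    """Weighted sum of the cross-validation error and the three penalties (assumed form)."""
    return nrmse_cv + w_inert * p_inert + w_col * p_col + w_sec * p_sec


# Hypothetical candidate subset, for illustration only.
subset = ["isoprene", "ethane", "formaldehyde", "toluene"]
k_oh = [1.0e-10, 2.5e-13, 8.5e-12, 5.6e-12]  # approximate magnitudes

p_inert = inertness_penalty(k_oh, k_threshold=1.0e-12)
p_col = collinearity_penalty(mean_correlation=0.8, correlation_threshold=0.7)
p_sec = secondary_product_penalty(subset, secondary_species={"formaldehyde"})
score = objective(0.2, p_inert, p_col, p_sec, w_inert=0.5, w_col=0.5, w_sec=0.5)
print(p_inert, p_col, p_sec, score)
```

A simulated annealing search would evaluate this score for each proposed subset and latent-variable count and accept or reject the move accordingly; lower scores indicate subsets that predict well while remaining chemically plausible.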