Method for correcting bias of ERA5 surface solar radiation based on cloud amount dependence characteristics

By constructing a random forest residual correction model based on cloud cover dependence characteristics, the method addresses the nonlinear bias of ERA5 surface solar radiation data in areas rich in cloud and water resources, achieving high-precision radiation data correction that is suitable for solar energy resource assessment and photovoltaic power station site selection in complex terrain.

CN122196483A · Pending · Publication Date: 2026-06-12 · HUBEI PROVINCIAL METEOROLOGICAL SERVICE CENT (HUBEI PROVINCIAL PROFESSIONAL METEOROLOGICAL SERVICE DESK)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUBEI PROVINCIAL METEOROLOGICAL SERVICE CENT (HUBEI PROVINCIAL PROFESSIONAL METEOROLOGICAL SERVICE DESK)
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies exhibit significant nonlinear systematic biases when using ERA5 surface solar radiation data in areas rich in cloud and water resources. Traditional methods fail to effectively distinguish the error mechanisms of different cloud cover intervals, resulting in large photovoltaic power prediction errors and a lack of physical interpretability and high correction accuracy.

Method used

A random forest residual correction model based on cloud cover dependence characteristics was constructed. By acquiring ERA5 reanalysis data and ground observation data of the study area, geometric optical factors and radiation-cloud cover nonlinear interaction terms were constructed. An inverse probability weighting mechanism of cloud cover frequency was introduced to optimize the random forest model to achieve radiation bias correction.

Benefits of technology

It eliminates the "bipolar" nonlinear bias of ERA5 radiation, improves the correction accuracy and physical consistency of radiation data, meets the accuracy requirements of photovoltaic power prediction, and is suitable for solar resource assessment and photovoltaic power plant site selection in complex terrain areas.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an ERA5 surface solar radiation deviation correction method based on cloud amount dependence characteristics, belonging to the technical field of meteorological data correction. The method comprises the following steps: data acquisition; data quality control and spatiotemporal matching; physical feature engineering, in which two types of derived features are constructed, namely a geometric optical factor and radiation-cloud amount nonlinear interaction terms, where the geometric optical factor is the cosine of the solar zenith angle and the interaction terms comprise a cloud obstruction interaction term and a clear sky potential index; construction of a random forest residual correction model, in which surface shortwave radiation, total cloud amount, high, medium and low cloud amounts, the derived features and a station identifier are taken as the input feature vector, an inverse probability weighting mechanism is introduced to achieve balanced sampling across cloud amounts, model parameters are set, and training is completed; and verification of the random forest residual correction model and correction of the radiation data. The application adapts to cloud amount dependence characteristics, integrates the physical properties of layered cloud amounts, offers physical interpretability and high correction precision, and addresses the low precision of ERA5 surface solar radiation deviation correction.

Description

Technical Field

[0001] This invention belongs to the field of meteorological data correction technology, and specifically relates to a method for correcting surface solar radiation deviation, namely an ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics.

Background Technology

[0002] Surface solar radiation is the core physical quantity driving the energy balance of the Earth's surface and a core data source for new energy applications such as photovoltaic power generation. ERA5 is the fifth-generation atmospheric reanalysis dataset from the European Centre for Medium-Range Weather Forecasts. With its advantages of high spatiotemporal resolution and long time-series coverage, ERA5 reanalysis data has become a core data source in the fields of meteorology, hydrology, and new energy. However, due to limitations in cloud microphysics and aerosol parameterization schemes, it exhibits significant nonlinear systematic bias in Province A (a province in central China) which is rich in cloud and water resources. Direct use of this data can lead to photovoltaic power prediction errors exceeding 20%, failing to meet the needs of refined applications.

[0003] Clouds are the primary factor causing errors in ERA5 radiation products, and the radiation bias exhibits a significant cloud cover dependence: in the less cloudy range, the radiation is underestimated due to the overestimation of the climatological background aerosol optical thickness, while in the cloudy range, the radiation is overestimated due to insufficient simulation of cloud water paths and cloud optical thickness. This "bipolar nonlinear" characteristic makes it difficult for traditional methods such as linear regression and mean correction to achieve ideal results under all weather conditions.

[0004] Machine learning algorithms, with their powerful nonlinear mapping capabilities, have become the mainstream tool for correcting meteorological biases. Among them, random forests have performed exceptionally well in radiation correction owing to their strong adaptability to high-dimensional features and outstanding resistance to overfitting. However, existing random-forest-based ERA5 radiation correction methods still have significant drawbacks. First, they train on all samples mixed together, ignoring the differences in error mechanisms across cloud cover intervals, which results in poor physical consistency under extreme weather conditions. Second, they use only total cloud cover as a cloud-related feature, failing to distinguish the significant differences in optical properties between low, medium, and high clouds, and thus failing to characterize the physical heterogeneity of strong extinction by low clouds and weak extinction by high clouds. Third, they do not construct physically derived features related to radiative transfer, so the model lacks physical interpretability. Fourth, they do not account for the uneven distribution of samples across cloud cover intervals, which biases the model towards the high-frequency clear-sky and overcast samples at the expense of samples in the cloudy transition zone, where the largest errors occur.

[0005] In addition, existing correction methods are not optimized for subtropical monsoon climate regions with abundant cloud and water resources and complex topography. In this region, local convective clouds occur frequently in summer and large-scale stratiform clouds cover the area in winter. The cloud amount has a significant modulating effect on radiation, and the subgrid cloud effect is prominent. Existing methods are unable to accurately capture the radiation deviation patterns in this region, and the correction accuracy and generalization ability are insufficient.

[0006] Related patent document: CN118426079A discloses a solar radiation correction method and system based on multi-step bias iteration and random forest. The method includes acquiring predicted solar radiation and measured solar radiation, calculating normalized data, and acquiring other predicted meteorological elements and hourly values. A training dataset is constructed using time window rolling modeling. A first random forest regression model is trained with predicted solar radiation, hourly values, and other predicted meteorological elements as input and normalized measured solar radiation data as output. Weather classification is performed using the normalized dataset of predicted solar radiation. A second random forest regression model is trained with the weather classification as input and the deviation between the first predicted normalized solar radiation and the normalized measured solar radiation data as output. Data and corresponding weather classifications for the day to be corrected are acquired to obtain the deviation between the first and second predicted normalized solar radiation, and the corrected solar radiation is calculated. 
CN120408019A discloses a high-resolution solar radiation data correction method and system based on deep learning, including: acquiring meteorological data and digital elevation model data, and preprocessing them; dynamically optimizing the parameter configuration of the WRF model, and simulating the output of high spatiotemporal resolution meteorological data as training data; fusing solar radiation flux values from meteorological reanalysis data and near-real-time product meteorological data as label data for the correction model; inputting the high spatiotemporal resolution meteorological data and label data into a pre-constructed multi-scale feature fusion deep residual encoder-decoder network model for training, and obtaining the optimal parameter model through multiple iterations; and using the optimal parameter model to correct the solar radiation flux data output by the WRF model.

[0007] The above technologies do not solve the problem of how to improve the accuracy of ERA5 surface solar radiation deviation correction.

Summary of the Invention

[0008] The purpose of this invention is to provide an ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics. This method can adapt to cloud cover dependence characteristics, integrate layered cloud cover physical attributes, and has both physical interpretability and high correction accuracy, thereby solving the problem of low correction accuracy of ERA5 surface solar radiation deviation.

[0009] To solve the above-mentioned technical problems, the present invention adopts the following technical solution: an ERA5 surface solar radiation bias correction method based on cloud cover dependence characteristics, comprising the following steps:

S1. Data Acquisition: Acquire ERA5 reanalysis data and ground observation data for the study area. The ERA5 reanalysis data includes variables such as surface shortwave radiation (SSRD), total cloud cover (TCC), high cloud cover (HCC), medium cloud cover (MCC), and low cloud cover (LCC).

[0010] S2. Data quality control and spatiotemporal matching (data preprocessing).

[0011] S3. Construction of physical feature engineering: Based on the physical mechanism of radiative transfer, two types of derived features are constructed: geometric optical factor and radiation-cloud amount nonlinear interaction term. The geometric optical factor is the cosine value of the solar zenith angle, and the radiation-cloud amount nonlinear interaction term includes cloud obstruction interaction term and clear sky potential index.

[0012] This step, based on a simplified form of the radiative transfer equation, constructs derived features with clear physical meaning to enhance the model's ability to capture the interaction between radiation and cloud cover. The core features are two types: one is the geometric optical factor, i.e., the cosine of the solar zenith angle, which reflects the path length of light in the atmosphere and helps the model learn the geometric laws governing diurnal variations in radiation; the other is the nonlinear interaction term between radiation and cloud cover, which transforms the linear percentage of cloud cover into a physical quantity linked to energy levels, making it easier for the model to segment high and low radiation ranges and accurately capture the nonlinear modulation effect of cloud cover on radiation.

[0013] S4. Construct a random forest residual correction model: Set the objective variable of the correction model as the residual between the ground-observed radiance value and the original radiance value of ERA5, and use the above-mentioned surface shortwave radiation, total cloud cover, high cloud cover, medium cloud cover, low cloud cover, the two types of derived features in step S3, and the station identifier as the input feature vector; introduce an inverse probability weighting mechanism based on cloud cover frequency to achieve cloud cover balanced sampling, set the correction model parameters, and complete the training.

[0014] S5. Random Forest Residual Correction Model Validation and Radiation Data Correction: Divide the matching dataset from step S2 into training and validation sets, perform validation and bias correction, and output the corrected surface solar radiation data.

[0015] In the above technical solution, preferably, in step S1 the ERA5 data has a temporal resolution of 1 h and a spatial resolution of 0.25°, and the time range is set according to research needs. The ground observation data comprise hourly surface solar radiation observations and sunshine duration, selected from observation stations covering the plains and hilly mountainous terrain of the study area. The observation instruments conform to the national standard "Surface Meteorological Observation Specification Radiation", with a measurement accuracy of ≤±5%. The study area is a subtropical monsoon climate area rich in cloud and water resources. In step S1, the sunshine duration is converted into the sunshine percentage, which serves as a physical proxy variable for verifying the accuracy of the ERA5 cloud cover product, and the result is restricted to the interval [0,1] (that is, 0% to 100%). Sunshine percentage SP=S / S0×100% (1), where S is the actual sunshine duration and S0 is the daily astronomical sunshine duration.
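
As an illustration of formula (1), the following minimal Python sketch computes the sunshine percentage as a fraction and restricts it to [0, 1]. The function name and the choice to raise an error for a non-positive S0 are assumptions made for this example and are not taken from the patent text.

```python
def sunshine_fraction(actual_hours: float, astronomical_hours: float) -> float:
    """Sunshine percentage SP = S / S0, returned as a fraction clipped to [0, 1].

    actual_hours is S (observed sunshine duration) and astronomical_hours is S0
    (daily astronomical sunshine duration). Multiply the result by 100 for percent.
    """
    if astronomical_hours <= 0:
        # Assumed behaviour: a non-positive S0 is treated as invalid input.
        raise ValueError("astronomical sunshine duration S0 must be positive")
    ratio = actual_hours / astronomical_hours
    # Restrict the result to [0, 1], as required in step S1.
    return min(1.0, max(0.0, ratio))
```

For instance, 6 hours of observed sunshine on a day with 12 astronomical sunshine hours gives 0.5, and an observed value slightly above S0 (for example from instrument error) is capped at 1.0.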

[0016] In the above technical solution, preferably, step S2 applies strict preprocessing to the multi-source data to ensure the physical validity of the model training data. Spatiotemporal matching: the ERA5 grid data are matched to the latitude and longitude of each ground observation station by bilinear interpolation, and the timestamps of the observation data and the ERA5 data are unified to an hourly Beijing time series. Quality control: samples with physical extreme values, night-time or low solar altitude angles, or missing cloud data are removed. The result is a matched dataset, namely the observation dataset and the reanalysis dataset.
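
The spatiotemporal matching in step S2 can be sketched as follows. This is a minimal illustration that assumes the four grid points surrounding a station are already known and that the input timestamps are timezone-aware UTC values; the function names and argument layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # Beijing time is UTC+8


def bilinear(lon, lat, lon0, lon1, lat0, lat1, v00, v10, v01, v11):
    """Bilinear interpolation inside one grid cell.

    v00 is the value at (lon0, lat0), v10 at (lon1, lat0),
    v01 at (lon0, lat1) and v11 at (lon1, lat1).
    """
    tx = (lon - lon0) / (lon1 - lon0)  # fractional position in longitude
    ty = (lat - lat0) / (lat1 - lat0)  # fractional position in latitude
    bottom = v00 * (1 - tx) + v10 * tx
    top = v01 * (1 - tx) + v11 * tx
    return bottom * (1 - ty) + top * ty


def to_beijing_hour(utc_time: datetime) -> datetime:
    """Convert a timezone-aware UTC timestamp to Beijing time, truncated to the hour."""
    local = utc_time.astimezone(BEIJING)
    return local.replace(minute=0, second=0, microsecond=0)
```

A station at the centre of a 0.25° cell receives the mean of the four corner values, and a timestamp of 00:30 UTC maps to the 08:00 Beijing-time hour.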

[0017] In step S2, removing physical extreme values means removing outliers where the observed irradiance is less than 0 W / m² or greater than 1300 W / m², and marking the missing value code -9999.0 as invalid. Let cosθ_z denote the cosine of the solar zenith angle. To exclude night-time and low solar altitude angles, the hourly cosθ_z is calculated from the station's latitude and longitude, and only daytime samples with cosθ_z > 0.01 are retained, which avoids interference from zero values at night and dim light at dawn and dusk during model training. Cloud data integrity check: samples in which the total, high, medium, or low cloud cover is missing are removed. Finally, a high-quality matched dataset is constructed, namely the observation dataset and the reanalysis dataset.
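
The quality-control rules above could be expressed as a single predicate, sketched below. The thresholds (0 and 1300 W / m², cosine of the solar zenith angle above 0.01, missing code -9999.0) come from the text; treating `None` as a second marker for missing data and applying the -9999.0 code to cloud variables are assumptions made for this example.

```python
MISSING = -9999.0  # missing-value code named in the text


def is_valid_sample(obs_irradiance, cos_zenith, cloud_fractions):
    """Return True if a matched hourly sample passes the step S2 quality checks.

    cloud_fractions is an iterable holding total, high, medium and low cloud cover.
    """
    # Missing observation (None is an assumed additional missing marker).
    if obs_irradiance is None or obs_irradiance == MISSING:
        return False
    # Physical extreme values: keep only 0 <= irradiance <= 1300 W/m^2.
    if obs_irradiance < 0.0 or obs_irradiance > 1300.0:
        return False
    # Night-time and low solar altitude: keep only cos(zenith) > 0.01.
    if cos_zenith <= 0.01:
        return False
    # Cloud integrity: every cloud layer must be present.
    return all(c is not None and c != MISSING for c in cloud_fractions)
```

Filtering a list of samples then reduces to keeping those for which the predicate returns `True`.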

[0018] In the above technical solution, preferably, in step S3, let cosθ_z denote the cosine of the solar zenith angle, calculated as: cosθ_z = sinφ·sinδ + cosφ·cosδ·cosω (2), where φ is the latitude of the station, δ is the solar declination, and ω is the hour angle. Cloud obstruction interaction term: X_inter=R_ERA5×C_TCC (3). Clear sky potential index: X_clear=R_ERA5×(1-C_TCC) (4), where R_ERA5 is the original ERA5 surface shortwave radiation value and C_TCC is the ERA5 total cloud cover value.
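
The derived features of step S3 can be illustrated with the short Python sketch below. It uses the standard expression cosθ_z = sinφ·sinδ + cosφ·cosδ·cosω for the solar zenith angle together with formulas (3) and (4). The function names are hypothetical, angles are assumed to be given in degrees, and total cloud cover is assumed to be a fraction in [0, 1].

```python
import math


def cos_solar_zenith(lat_deg, declination_deg, hour_angle_deg):
    """Cosine of the solar zenith angle from latitude, declination and hour angle."""
    phi = math.radians(lat_deg)            # station latitude
    delta = math.radians(declination_deg)  # solar declination
    omega = math.radians(hour_angle_deg)   # hour angle
    return (math.sin(phi) * math.sin(delta)
            + math.cos(phi) * math.cos(delta) * math.cos(omega))


def cloud_features(r_era5, c_tcc):
    """Return (X_inter, X_clear) from ERA5 radiation and total cloud cover."""
    x_inter = r_era5 * c_tcc          # cloud obstruction interaction term, formula (3)
    x_clear = r_era5 * (1.0 - c_tcc)  # clear sky potential index, formula (4)
    return x_inter, x_clear
```

With R_ERA5 = 400 W / m² and C_TCC = 0.25, the two terms are 100 and 300 W / m², which always sum back to R_ERA5.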

[0019] In the above technical solution, preferably, in step S4 the inverse probability weighting mechanism based on cloud cover frequency is implemented as follows: the training data is divided into intervals according to total cloud cover, the number of samples N_bin in each interval is counted, and each sample is assigned a training weight W_i ∝ N_total / (N_bin + ε), where N_total is the total number of training samples, N_bin is the number of samples in the interval containing sample i, and ε is a small constant that prevents a zero denominator. In step S4, the parameters of the correction model are set as follows: the number of decision trees is 500, the minimum number of samples per leaf node is 25, and the maximum number of features is set to square root mode.
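
A minimal sketch of the inverse probability weighting is given below. The number of cloud cover intervals (ten equal-width bins), the value of ε, and the normalization to a mean weight of 1 are assumptions for this example; the text only fixes the proportionality W_i ∝ N_total / (N_bin + ε).

```python
from collections import Counter


def cloud_bin(c_tcc, n_bins=10):
    """Map total cloud cover in [0, 1] to a bin index in 0..n_bins-1."""
    return min(int(c_tcc * n_bins), n_bins - 1)


def inverse_frequency_weights(tcc_values, n_bins=10, eps=1e-6):
    """Per-sample weights proportional to N_total / (N_bin + eps)."""
    if not tcc_values:
        return []
    bins = [cloud_bin(c, n_bins) for c in tcc_values]
    counts = Counter(bins)     # N_bin for every occupied interval
    n_total = len(tcc_values)  # N_total
    raw = [n_total / (counts[b] + eps) for b in bins]
    mean_raw = sum(raw) / len(raw)
    # Normalize so the mean weight is 1 (allowed because only proportionality is specified).
    return [w / mean_raw for w in raw]
```

Samples in sparsely populated intervals receive proportionally larger weights. If a library such as scikit-learn were used (the source does not name one), these weights could be passed as `sample_weight` when fitting, and the stated settings would roughly correspond to `n_estimators=500`, `min_samples_leaf=25` and `max_features="sqrt"` on `RandomForestRegressor`.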

[0020] In step S4, let X1 denote the input feature vector: X1=[R_ERA5,C_TCC,C_LCC,C_MCC,C_HCC,cosθ_z,X_inter,X_clear,ID_station] (5), where C_LCC is low cloud cover, C_MCC is medium cloud cover, C_HCC is high cloud cover, cosθ_z is the cosine of the solar zenith angle, and ID_station is the unique identifier of the station.
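
The ordering in formula (5) could be assembled as in the following sketch. Encoding the station identifier as an integer code in order of first appearance is an assumption for this example; the text does not state how ID_station is represented numerically.

```python
def encode_stations(station_ids):
    """Map station identifiers to integer codes in order of first appearance."""
    codes = {}
    for sid in station_ids:
        codes.setdefault(sid, len(codes))
    return codes


def build_feature_vector(r_era5, c_tcc, c_lcc, c_mcc, c_hcc, cos_zenith, station_code):
    """Assemble X1 in the order given by formula (5)."""
    x_inter = r_era5 * c_tcc          # formula (3)
    x_clear = r_era5 * (1.0 - c_tcc)  # formula (4)
    return [r_era5, c_tcc, c_lcc, c_mcc, c_hcc,
            cos_zenith, x_inter, x_clear, station_code]
```

Each matched hourly sample then yields one nine-element row of the training matrix.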

[0021] Step S4 improves the traditional random forest model by implementing two major improvements: a residual learning strategy and cloud cover equalization sampling. This results in a model suitable for radiation bias correction, specifically including: Target variable and input feature setting: The target variable of the correction model is the residual between the ground observation radiation value and the original ERA5 radiation value Y=R_Obs-R_ERA5 (7). This strategy preserves the reasonable physical trends of ERA5 itself to the greatest extent (such as diurnal variation and seasonal variation). The correction model only focuses on correcting the deviation caused by the parameterization scheme. The input feature vector of the correction model is X1, which integrates the original ERA5 data, layered cloud cover, physical derived features and site topographic labels.
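
The residual learning strategy can be summarized in two small functions: the model is trained on Y = R_Obs - R_ERA5, and at correction time the predicted residual is added back to the raw ERA5 value. The sketch below is illustrative only; clipping the corrected value at zero is an added assumption that the text does not mention.

```python
def residual_target(r_obs, r_era5):
    """Training target: observed radiation minus raw ERA5 radiation."""
    return r_obs - r_era5


def apply_correction(r_era5, predicted_residual):
    """Corrected radiation: raw ERA5 value plus the model's predicted residual."""
    corrected = r_era5 + predicted_residual
    # Assumed safeguard: surface solar radiation cannot be negative.
    return max(0.0, corrected)
```

Because only the residual is learned, a predicted residual of zero leaves the ERA5 value, and hence its diurnal and seasonal behaviour, unchanged.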

[0022] Cloud cover equalization sampling: An inverse probability weighting mechanism based on cloud cover frequency is introduced to address the problem of abundant samples in extreme cloud cover areas and scarce samples in multi-cloud transition areas. The training data is divided into different intervals according to the total cloud cover, and training weights are assigned to each sample. This strategy increases the weight of samples in sparse cloud cover intervals in the loss function, improving the model's correction performance in multi-cloud transition areas.

[0023] Model parameter settings: Optimize model parameters through grid search and cross-validation, and finally set the number of decision trees, minimum number of samples in leaf nodes, and maximum number of features to ensure the stability of model integration, prevent overfitting, and increase the differences between trees.

[0024] In the above technical solution, preferably, in step S5, evaluation metrics are used to verify the correction model from multiple dimensions, including the full sample and individual cloud cover intervals; the trained and verified correction model is then used to correct the bias in the raw ERA5 surface solar radiation data and to output the corrected surface solar radiation data.

[0025] The matched dataset is divided into training and validation sets in an 8:2 ratio. The Pearson correlation coefficient (R), root mean square error (RMSE), mean bias (ME), mean absolute error (MAE), and improvement rate (SS) are used as evaluation metrics. The improvement rate is calculated as: SS=(RMSE1-RMSE2) / RMSE1×100% (6), where RMSE1 is the root mean square error before correction and RMSE2 is the root mean square error after correction.
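
The evaluation metrics can be computed as in the following sketch. The sign convention for the mean bias (estimate minus observation) is an assumption, since the text does not define it, and the function names are hypothetical. The Pearson coefficient is undefined when either series is constant, which this sketch does not guard against.

```python
import math


def evaluate(pred, obs):
    """Return R, RMSE, ME and MAE for paired estimates and observations."""
    n = len(obs)
    errors = [p - o for p, o in zip(pred, obs)]  # assumed convention: estimate minus observation
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_p, mean_o = sum(pred) / n, sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    r = cov / math.sqrt(var_p * var_o)
    return {"R": r, "RMSE": rmse, "ME": me, "MAE": mae}


def improvement_rate(rmse_before, rmse_after):
    """SS = (RMSE1 - RMSE2) / RMSE1 * 100, formula (6)."""
    return (rmse_before - rmse_after) / rmse_before * 100.0
```

Applying `improvement_rate` to an RMSE of 172.2 W / m² before correction and 114.5 W / m² after correction gives approximately 33.5%.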

[0026] The method of this invention first acquires ERA5 reanalysis data and ground observations for the study area and builds a matched dataset through quality control (removal of physical extreme values and low solar altitude angles) and spatiotemporal matching. Physical feature engineering is then carried out, comprising the cosine of the solar zenith angle and the radiation-cloud amount nonlinear interaction terms. Next, with the residual between the observed values and the original ERA5 values as the target variable, a balanced sampling strategy based on inverse probability weighting of cloud amount frequency is introduced, and the total cloud amount and the high, medium and low layered cloud amounts are integrated to construct a random forest residual correction model. Finally, after multi-dimensional verification, the correction model is used to complete the bias correction of the ERA5 surface solar radiation data.

[0027] This invention addresses the shortcomings of existing ERA5 surface solar radiation correction methods, such as neglecting cloud cover dependence, ignoring optical differences in layered cloud cover, lacking physical feature engineering, and low correction accuracy due to uneven sample distribution. It provides a cloud cover-dependent feature-based ERA5 surface solar radiation bias correction method. The technical solution employed in this invention includes core steps such as data acquisition, data preprocessing, physical feature engineering construction, random forest residual correction model training, model validation, and radiation data correction. By systematically diagnosing the cloud cover dependence characteristics of ERA5 radiation bias, a derived feature model incorporating the physical mechanism of radiative transfer is constructed. Layered cloud cover and cloud cover equalization sampling strategies are introduced to optimize the random forest model, achieving targeted correction of radiation bias under different cloud conditions. This eliminates the "bipolar" systematic bias, improves the physical consistency and spatiotemporal adaptability of the corrected data, and provides data support for the development and utilization of solar energy resources in areas rich in cloud and water resources.

[0028] Compared with the prior art, the present invention has the following beneficial effects:

1. The "bipolar" nonlinear systematic bias of ERA5 radiation is eliminated, specifically addressing the underestimation of radiation caused by aerosol overestimation in the less cloudy range and the overestimation of radiation caused by insufficient cloud microphysics simulation in the cloudy range. After correction, the mean bias in each cloud amount range converges to within ±5 W / m², and the radiation bias mechanisms under complex cloud conditions are decoupled.

[0029] 2. The accuracy of radiation data correction has been significantly improved. The constructed model reduced the root mean square error of the full sample validation set from 172.2 W / m² to 114.5 W / m², with an overall improvement rate of 33.5%. The improvement rate in the less cloudy area reached 49.9%, and the improvement rate of each observation station remained stable at over 31%. Even in the cloudy transition area with the largest error, an accuracy improvement of over 15% can be achieved.

[0030] 3. The physical interpretability and generalization ability of the correction model are enhanced. Physical feature engineering is constructed based on the radiative transfer mechanism, and layered cloud cover features are introduced to compensate for the physical defects of a single total cloud cover. The importance of the features in the correction model is consistent with the physical laws of atmospheric radiative transfer. The cloud cover equalization sampling strategy solves the problem of uneven sample distribution, so that the correction model has good correction effect in extreme weather and cloudy transition areas, and is applicable to complex terrains such as plains, hilly areas and mountains.

[0031] 4. It has good spatiotemporal adaptability and high-frequency fluctuation reconstruction capability. The improvement rate of the correction model reaches 35% under the background of large-scale layered clouds in winter and 20% under the background of local convective clouds in summer. It can accurately reconstruct the high-frequency fluctuations of radiation on a daily scale. The peaks and troughs of the corrected data are consistent with the ground observation values, which meets the requirements of photovoltaic power prediction for the temporal characteristics of radiation data.

[0032] 5. Facilitates practical application: The corrected ERA5 surface solar radiation data has higher physical consistency and reliability, and can directly provide core data support for regional refined solar energy resource assessment, photovoltaic power plant site selection, power grid load dispatching and hydro-meteorological model improvement, which meets the development needs of the solar energy industry.

[0033] In summary, this invention decouples the radiation bias mechanism under complex cloud conditions, eliminates the "bipolar" characteristic of ERA5 radiation bias, and reduces the root mean square error of the full sample validation set from 172.2 W / m² to 114.5 W / m², an improvement of 33.5%. The average bias across cloud cover intervals converges to within ±5 W / m², and it can accurately reconstruct high-frequency fluctuations in diurnal radiation. This provides a data foundation for regional refined solar energy resource assessment and photovoltaic power prediction, exhibiting good generalization and spatiotemporal adaptability. This invention adapts to cloud cover dependence characteristics, integrates layered cloud cover physical attributes, and combines physical interpretability with high correction accuracy, solving the problem of low correction accuracy for ERA5 surface solar radiation bias. This invention is applicable to the correction of surface shortwave radiation data in areas with abundant cloud and water resources and complex topography, and can be directly applied to scenarios such as regional refined solar energy resource assessment, photovoltaic power prediction, power grid load scheduling, and hydro-meteorological model improvement.

Attached Figure Description

[0034] Figure 1 is a block diagram of the ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics of the present invention.

[0035] Figure 2 is a scatter density distribution and linear fitting plot of the ERA5 raw radiation against the ground-observed radiation (ERA5 irradiance vs. observed irradiance).

[0036] Figure 3 is a ranking plot of the feature importance of each input variable in the random forest residual correction model.

[0037] Figure 4 is the scatter density distribution of ERA5 radiation against ground-observed radiation on the validation set before correction (ERA5 irradiance vs. observed irradiance).

[0038] Figure 5 is the scatter density distribution of ERA5 radiation against ground-observed radiation on the validation set after correction (ERA5 irradiance vs. observed irradiance).

[0039] Figure 6 is a comparison of the radiation time series before and after correction, together with total cloud cover, during a typical weather process. The horizontal axis represents the date, and the vertical axis represents the radiation value (W / m²).

Detailed Implementation

[0040] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below in conjunction with embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. Based on these embodiments, all other embodiments obtained by those skilled in the art without creative effort are within the scope of this invention.

[0041] Example 1: As shown in Figures 1 to 6, the ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics of the present invention includes the following steps:

S1. Data Acquisition: Acquire ERA5 reanalysis data and ground observation data for the study area. The ERA5 reanalysis data includes variables such as surface shortwave radiation (SSRD), total cloud cover (TCC), high cloud cover (HCC), medium cloud cover (MCC), and low cloud cover (LCC).

[0042] In step S1, the ERA5 data has a temporal resolution of 1 hour and a spatial resolution of 0.25°. The time range is set according to research needs. The ground observation data comprise hourly surface solar radiation observations and sunshine duration, selected from observation stations covering the plains and hilly mountainous terrain of the study area. The observation instruments conform to the national standard "Specifications for Ground Meteorological Observation Radiation", with a measurement accuracy of ≤±5%.

[0043] The study area is a subtropical monsoon climate region rich in cloud and water resources. In step S1, the sunshine duration is converted into sunshine percentage as a physical proxy variable to verify the accuracy of the ERA5 cloud cover product, and the calculation results are restricted to the interval [0,1]. Sunshine percentage SP = S / S0 × 100% (1), where S is the actual sunshine duration and S0 is the daily astronomical sunshine duration.

[0044] S2. Data Quality Control and Spatiotemporal Matching (Data Preprocessing): To ensure the physical validity of the model training data, the multi-source data is strictly preprocessed. Spatiotemporal matching: the ERA5 grid data are matched to the latitude and longitude of each ground observation station by bilinear interpolation, and the timestamps of the observation data and the ERA5 data are unified to an hourly Beijing time series. Quality control: samples with physical extreme values, night-time or low solar altitude angles, or missing cloud data are removed. The result is a matched dataset, namely the observation dataset and the reanalysis dataset.

[0045] In step S2, removing physical extreme values means removing outliers with observed irradiance less than 0 W / m² or greater than 1300 W / m², and marking missing values as invalid. Let cosθ_z denote the cosine of the solar zenith angle. To exclude night-time and low solar altitude angles, the hourly cosθ_z is calculated from the station's latitude and longitude, and only daytime samples with cosθ_z > 0.01 are retained, which avoids interference from zero values at night and dim light at dawn and dusk during model training. Cloud data integrity check: samples in which the total, high, medium, or low cloud cover is missing are removed. Finally, a high-quality matched dataset is constructed, namely the observation dataset and the reanalysis dataset.

[0046] S3. Construction of physical feature engineering: Based on the physical mechanism of radiative transfer, using a simplified form of the radiative transfer equation, two types of derived features with clear physical meaning are constructed to enhance the correction model's ability to capture the radiation-cloud amount interaction. The first type is the geometric optical factor, i.e., the cosine of the solar zenith angle, which reflects the path length of light in the atmosphere and helps the correction model learn the geometric laws governing the diurnal variation of radiation. The second type is the radiation-cloud amount nonlinear interaction term, which comprises the cloud occlusion interaction term and the clear sky potential index; it transforms the linear cloud amount percentage into a physical quantity linked to energy levels, making it easier for the correction model to separate high and low radiation ranges and to capture the nonlinear modulation effect of cloud amount on radiation.

[0047] In step S3, let cosθ_z denote the cosine of the solar zenith angle. It is calculated as cosθ_z = sinϕ·sinδ + cosϕ·cosδ·cosω (2), where ϕ is the latitude of the station, δ is the solar declination, and ω is the hour angle. The cloud occlusion interaction term is X_inter = R_ERA5 × C_TCC (3), and the clear sky potential index is X_clear = R_ERA5 × (1 - C_TCC) (4), where R_ERA5 is the original ERA5 surface shortwave radiation value and C_TCC is the ERA5 total cloud cover value.
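
The derived features of step S3 could be computed along the following lines. This is a sketch under stated assumptions: the declination approximation is the one given in the application case, the hour angle is taken as 15° per hour from local solar noon, and the conversion from clock time to local solar time (longitude and equation-of-time corrections) is deliberately left out. All function names are hypothetical.

```python
import math


def solar_declination(doy):
    """Solar declination in radians, using the application-case approximation."""
    return 0.409 * math.sin(2.0 * math.pi * doy / 365.0 - 1.39)


def hour_angle(solar_hour):
    """Hour angle in radians; assumes the input is local solar time in hours."""
    return math.radians(15.0 * (solar_hour - 12.0))


def cos_zenith(lat_deg, doy, solar_hour):
    """Cosine of the solar zenith angle from latitude, declination, hour angle."""
    phi = math.radians(lat_deg)
    delta = solar_declination(doy)
    omega = hour_angle(solar_hour)
    return (math.sin(phi) * math.sin(delta)
            + math.cos(phi) * math.cos(delta) * math.cos(omega))


def cloud_features(r_era5, c_tcc):
    """Cloud occlusion interaction term and clear sky potential index."""
    x_inter = r_era5 * c_tcc
    x_clear = r_era5 * (1.0 - c_tcc)
    return x_inter, x_clear


# Hypothetical sample: 600 W/m² ERA5 radiation under 25% total cloud cover.
x_inter, x_clear = cloud_features(600.0, 0.25)
noon = cos_zenith(30.5, 172, 12.0)
midnight = cos_zenith(30.5, 172, 0.0)
```

The two interaction terms always sum to R_ERA5, so together they split the ERA5 radiation into a cloud-weighted part and a clear-sky-weighted part.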

[0048] S4. Construct a random forest residual correction model: Set the target variable of the correction model as the residual between the ground-observed radiation value and the original ERA5 radiation value, and use the ERA5 surface shortwave radiation, total cloud cover, high cloud cover, medium cloud cover, and low cloud cover, the two types of derived features from step S3, and the station identifier as the input feature vector; introduce an inverse probability weighting mechanism based on cloud cover frequency to achieve cloud cover balanced sampling, set the correction model parameters, and complete the training.

[0049] In step S4, the implementation method of the inverse probability weighting mechanism based on cloud cover frequency is as follows: the training data is divided into different intervals according to the total cloud cover, the sample frequency N_bin of each interval is calculated, and a training weight W_i is assigned to each sample, where W_i∝N_total / (N_bin+ε), N_total is the total number of training data samples, and ε is a correction coefficient to avoid the denominator being 0.
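
The weighting rule W_i∝N_total / (N_bin+ε) could be implemented roughly as follows. The function name is hypothetical, the bin edges and the default ε = 0.001 are taken from the application case, and the final rescaling to a mean weight of 1 is an assumption of this sketch (the proportionality leaves the overall scale free).

```python
import numpy as np


def inverse_frequency_weights(tcc, bin_edges, eps=0.001):
    """Weight each sample by N_total / (N_bin + eps) of its cloud cover bin."""
    tcc = np.asarray(tcc, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    # Use interior edges only, so that tcc == 1.0 falls into the last bin.
    bin_index = np.digitize(tcc, edges[1:-1])
    counts = np.bincount(bin_index, minlength=len(edges) - 1)
    weights = tcc.size / (counts[bin_index] + eps)
    # Rescale to mean 1 (assumed convention; relative weights are unchanged).
    return weights / weights.mean()


# Hypothetical sample: four clear, one partly cloudy, three overcast hours.
edges = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
tcc = [0.05, 0.05, 0.05, 0.05, 0.40, 0.95, 0.95, 0.95]
w = inverse_frequency_weights(tcc, edges)
```

The single partly cloudy sample receives the largest weight and the four clear-sky samples the smallest, which is the balancing effect the mechanism is intended to produce.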

[0050] In step S4, the parameters of the correction model are set as follows: the number of decision trees is 500, the minimum number of leaf nodes is 25, and the maximum number of features is set to square root mode.

[0051] In step S4, let X1 represent the input feature vector, and the expression for the input feature vector X1 is: X1=[R_ERA5,C_TCC,C_LCC,C_MCC,C_HCC,cosθ_z,X_inter,X_clear,ID_station] (5), where C_LCC is low cloud cover, C_MCC is medium cloud cover, C_HCC is high cloud cover, cosθ_z is the cosine of the solar zenith angle, and ID_station is the unique identifier of the station.

[0052] Step S4 adapts the traditional random forest model to radiation bias correction through two improvements: a residual learning strategy and cloud cover balanced sampling. Target variable and input feature setting: the target variable of the correction model is the residual between the ground-observed radiation value and the original ERA5 radiation value, Y = R_Obs - R_ERA5 (7). This strategy preserves, to the greatest extent, the physically reasonable trends of ERA5 itself (such as diurnal and seasonal variation), so that the correction model focuses only on the deviation caused by the parameterization scheme. The input feature vector of the correction model is X1, which integrates the original ERA5 data, the layered cloud cover, the physically derived features, and the station identifier as a site terrain identifier.

[0053] Cloud cover balanced sampling: An inverse probability weighting mechanism based on cloud cover frequency is introduced to address the imbalance between abundant samples in the extreme cloud cover intervals and scarce samples in the partly cloudy transition intervals. The training data is divided into different intervals according to the total cloud cover, and a training weight is assigned to each sample. This strategy increases the weight of samples from sparse cloud cover intervals in the loss function, improving the model's correction performance in the partly cloudy transition intervals.

[0054] Correction model parameter setting: The parameters of the correction model are optimized through grid search and cross-validation, which determines the number of decision trees, the minimum number of leaf nodes, and the maximum number of features, so as to ensure the ensemble stability of the correction model, prevent overfitting, and increase the diversity among trees.

[0055] S5. Random Forest Residual Correction Model Validation and Radiation Data Correction: Divide the matching dataset from step S2 into training and validation sets, perform validation and bias correction, and output the corrected surface solar radiation data.

[0056] In step S5, validation, bias correction, and output are performed as follows: evaluation indicators are used to validate the correction model on the full sample and on different cloud cover intervals, so that its performance is verified from multiple dimensions; the trained and validated correction model is then used to correct the bias in the raw ERA5 surface solar radiation data, and the corrected surface solar radiation data is output.

[0057] The matched dataset is divided into training set and validation set according to the ratio of 8:2. Pearson correlation coefficient (R), root mean square error (RMSE), mean bias (ME), mean absolute error (MAE) and improvement rate (SS) are used as evaluation indicators. The improvement rate is calculated as follows: SS=(RMSE1-RMSE2) / RMSE1×100% (6), where RMSE1 is the root mean square error before correction and RMSE2 is the root mean square error after correction.
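
The evaluation indicators listed above could be computed as in the sketch below. The function names are hypothetical, and the sign convention for ME (estimate minus observation) is an assumption, since the text does not state it.

```python
import numpy as np


def evaluate(obs, est):
    """Return R, RMSE, ME and MAE of an estimate against observations."""
    obs = np.asarray(obs, dtype=float)
    est = np.asarray(est, dtype=float)
    err = est - obs  # assumed sign convention: positive means overestimation
    return {
        "R": float(np.corrcoef(obs, est)[0, 1]),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "ME": float(np.mean(err)),
        "MAE": float(np.mean(np.abs(err))),
    }


def improvement_rate(rmse_before, rmse_after):
    """Improvement rate SS in percent, following equation (6)."""
    return (rmse_before - rmse_after) / rmse_before * 100.0


# Hypothetical three-sample check.
metrics = evaluate([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
ss = improvement_rate(157.37, 78.92)
```

With the RMSE values reported in the application case for the 0~0.1 cloud cover interval (157.37 W/m² before and 78.92 W/m² after correction), equation (6) gives an improvement rate of about 49.9%.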

[0058] The specific implementation of this invention will be described in detail below using a practical application case from Province A.

[0059] Practical application case: Correction of surface solar radiation deviation based on ERA5 data in Province A. See Figure 2, Figure 3, Figure 4, Figure 5, and Figure 6.

[0060] Figure 2 shows the scatter density distribution and linear fit of the original ERA5 radiation against the ground-observed radiation. It displays the overall bias characteristics of the raw ERA5 radiation: the raw data exhibit a nonlinear bias, namely "overestimation of low values and underestimation of high values", with R=0.788 and RMSE=172.2 W/m², which provides a basis for subsequent bias diagnosis.

[0061] Figure 3 shows the feature importance ranking of the input variables in the random forest residual correction model. It is used to verify the effectiveness of the physical feature engineering and of the layered cloud cover inputs. The cosine of the solar zenith angle, the original ERA5 radiation, and the radiation-cloud cover nonlinear interaction terms are high-importance features. The importance of medium cloud cover and low cloud cover is significantly higher than that of total cloud cover and high cloud cover, which shows that the features learned by the correction model are highly consistent with the physical laws of atmospheric radiative transfer.

[0062] Figure 4 and Figure 5 show the scatter density distributions of ERA5 radiation against ground-observed radiation on the validation set before and after correction. Figure 5 demonstrates the overall effect of the correction model: after correction, the high-density core of the scatter distribution converges markedly and lies close to the 1:1 line (y=x), the number of outliers is greatly reduced, the root mean square error (RMSE) decreases from 170.99 W/m² to 138.68 W/m², and the Pearson correlation coefficient (R) increases from 0.79 to 0.87.

[0063] Figure 6 compares the radiation time series before and after correction with the total cloud cover during typical weather events. The horizontal axis is the date and the vertical axis is the radiation value (W/m²); the plot includes the ground observations, the ERA5 values before correction, the ERA5 values after correction, and the total cloud cover (blue shading). It demonstrates the model's ability to reconstruct high-frequency fluctuations in radiation: the peaks and troughs of the corrected radiation data are highly consistent with the ground observations, and the corrected data capture both the sudden drops in radiation caused by cloud cover and the radiation peaks under clear skies.

[0064] (1) Study area and data acquisition.

[0065] The study area is a region in Province A, with latitudes of 29°01′N–33°06′N and longitudes of 108°21′E–116°07′E. This region has a subtropical monsoon climate, characterized by variable cloud cover, concentrated rainfall during the plum rain season, and a terrain that slopes from west to east, with alternating plains and hilly areas. It is a typical region rich in cloud and water resources and with complex topography. Six ground radiation observation stations in Province A were selected, covering both plains and hilly areas, with elevations ranging from 23.6m to 256.5m.

[0066] Obtain multi-source data from January 1, 2021 to December 31, 2025: ERA5 reanalysis data, including surface shortwave radiation, total cloud cover, and high, medium, and low cloud cover, with a temporal resolution of 1 hour and a spatial resolution of 0.25°.

[0067] Ground observation data includes hourly observations of surface solar radiation and sunshine duration at six stations. The observation instruments conform to the national standard GB/T 35231-2017, with a measurement accuracy of ≤±5%.

[0068] (2) Data quality control and spatiotemporal matching.

[0069] Bilinear interpolation was used to match the ERA5 grid data to the latitude and longitude of the 6 observation stations, and the time of all data was unified to the hourly Beijing time series. Outliers with observed irradiance <0 W/m² or >1300 W/m² were removed, and missing values coded as -9999.0 were marked as invalid. The hourly cosine of the solar zenith angle, cosθ_z, was calculated for each station, and only daytime samples with cosθ_z > 0.01 were retained. Samples with missing total cloud cover or layered cloud cover were removed. Finally, a matching dataset containing approximately 120,000 samples was constructed, namely the observation dataset and the reanalysis dataset.

[0070] (3) Construction of physical feature engineering.

[0071] Calculate the geometric optical factor, i.e., the cosine of the solar zenith angle, cosθ_z = sinϕ·sinδ + cosϕ·cosδ·cosω (2), where the solar declination is δ = 0.409·sin(2π·DOY / 365 - 1.39), DOY is the day of the year, and the hour angle ω is calculated from the observation time. Construct the radiation-cloud cover nonlinear interaction terms: the cloud occlusion interaction term X_inter = R_ERA5 × C_TCC (3) and the clear sky potential index X_clear = R_ERA5 × (1 - C_TCC) (4). Convert the sunshine duration into the sunshine percentage SP = S / S0 × 100% (1), where the daily astronomical sunshine duration is S0 = 24 / π × arccos(-tanϕ·tanδ); the ratio S / S0 is restricted to the interval [0,1], and SP serves as a physical proxy variable for verifying the ERA5 cloud cover product.
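
The sunshine percentage proxy could be evaluated as sketched below, using the declination and astronomical sunshine duration expressions quoted in this paragraph. The function names are hypothetical, and the clamping of the arccos argument and the zero-duration guard are defensive assumptions that do not come into play at the latitudes of the study area.

```python
import math


def solar_declination(doy):
    """Solar declination in radians for a given day of year."""
    return 0.409 * math.sin(2.0 * math.pi * doy / 365.0 - 1.39)


def astronomical_sunshine_hours(lat_deg, doy):
    """Daily astronomical sunshine duration S0 in hours."""
    phi = math.radians(lat_deg)
    delta = solar_declination(doy)
    # Clamp so that arccos stays defined (only matters near the poles).
    x = max(-1.0, min(1.0, -math.tan(phi) * math.tan(delta)))
    return 24.0 / math.pi * math.acos(x)


def sunshine_fraction(s_hours, lat_deg, doy):
    """Ratio S / S0 restricted to the interval [0, 1]."""
    s0 = astronomical_sunshine_hours(lat_deg, doy)
    if s0 == 0.0:
        return 0.0  # no possible sunshine (polar night), assumed convention
    return min(1.0, max(0.0, s_hours / s0))


# Hypothetical station at 30.5°N around the summer solstice (DOY 172).
s0 = astronomical_sunshine_hours(30.5, 172)
half_day = sunshine_fraction(7.0, 30.5, 172)
capped = sunshine_fraction(20.0, 30.5, 172)
```

For this hypothetical station S0 is close to 14 hours, so 7 hours of recorded sunshine corresponds to a sunshine percentage of about 50%, and any recorded duration above S0 is capped at 100%.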

[0072] (4) Construct a random forest residual correction model.

[0073] Set the target variable of the model as the residual Y = R_Obs - R_ERA5 (7), and let X1 represent the input feature vector, X1=[R_ERA5,C_TCC,C_LCC,C_MCC,C_HCC,cosθ_z,X_inter,X_clear,ID_station] (5), where ID_station is the unique identifier of the 6 stations. Cloud cover balanced sampling: The training data is divided into six intervals according to the total cloud cover: 0~0.1, 0.1~0.3, 0.3~0.5, 0.5~0.7, 0.7~0.9, and 0.9~1.0. The sample frequency N_bin of each interval is calculated, and a weight W_i∝N_total / (N_bin+ε) is assigned to each sample, where ε is 0.001 to avoid the denominator being 0. The parameters of the correction model are set as follows: the number of decision trees is 500, the minimum number of leaf nodes is 25, and the maximum number of features is set to square root mode, that is, the number of features considered at each split is the square root of the total number of features. Cross-validation is used to complete the model training.
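
One possible realization of this training step with scikit-learn is sketched below. It is not presented as the applicant's implementation: the data are synthetic stand-ins, the station identifier is encoded as an integer code, and "minimum number of leaf nodes is 25" is interpreted here as `min_samples_leaf=25`, all of which are assumptions. Grid search and cross-validation are omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data, NOT the observations used in the application case.
r_era5 = rng.uniform(50.0, 1000.0, n)
c_tcc = rng.uniform(0.0, 1.0, n)
c_lcc = c_tcc * rng.uniform(0.0, 1.0, n)
c_mcc = c_tcc * rng.uniform(0.0, 1.0, n)
c_hcc = c_tcc * rng.uniform(0.0, 1.0, n)
cos_z = rng.uniform(0.01, 1.0, n)
station = rng.integers(0, 6, n)  # integer station code (assumed encoding)
r_obs = r_era5 * (1.0 - 0.3 * c_tcc) + rng.normal(0.0, 20.0, n)

# Feature vector X1 of equation (5) and residual target Y of equation (7).
X = np.column_stack([r_era5, c_tcc, c_lcc, c_mcc, c_hcc, cos_z,
                     r_era5 * c_tcc, r_era5 * (1.0 - c_tcc), station])
y = r_obs - r_era5

# 8:2 split into training and validation sets.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Inverse-frequency weights over the six total cloud cover intervals,
# computed on the training data only.
edges = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
bin_index = np.digitize(X_tr[:, 1], edges[1:-1])
counts = np.bincount(bin_index, minlength=len(edges) - 1)
w_tr = len(y_tr) / (counts[bin_index] + 0.001)

model = RandomForestRegressor(
    n_estimators=500,      # number of decision trees
    min_samples_leaf=25,   # assumed reading of the leaf-size setting
    max_features="sqrt",   # square root mode
    random_state=0,
)
model.fit(X_tr, y_tr, sample_weight=w_tr)

# Residual learning: the corrected value is ERA5 plus the predicted residual.
era5_va = X_va[:, 0]
corrected = era5_va + model.predict(X_va)
obs_va = era5_va + y_va
rmse_before = float(np.sqrt(np.mean((era5_va - obs_va) ** 2)))
rmse_after = float(np.sqrt(np.mean((corrected - obs_va) ** 2)))
```

Because the model predicts only the residual, the ERA5 value itself is carried through unchanged and the forest output acts as an additive correction.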

[0074] (5) Validation of the random forest residual correction model and correction of radiation data.

[0075] The correction model was validated from multiple dimensions, including the full sample, the cloud cover intervals, and the individual stations. The results are as follows: ① Full sample: The root mean square error (RMSE) of the validation set decreased from 170.99 W/m² to 138.68 W/m², an improvement rate of 33.5%, the Pearson correlation coefficient (R) increased from 0.79 to 0.87, and the mean bias (ME) converged to 1.31 W/m².

[0076] ② Cloud cover intervals: The mean bias (ME) of all cloud cover intervals converged to within ±5 W/m². The root mean square error (RMSE) of the low cloud cover interval (0~0.1) decreased from 157.37 W/m² to 78.92 W/m², an improvement rate of 49.9%.

[0077] ③ Sites: The improvement rate of all 6 sites is ≥31%. The correction effect of the hilly and mountainous sites is better than that of the plain sites. The improvement rate of one site reaches 35.4%.
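
The interval-wise validation in item ② could be organized as in the following sketch, which groups validation samples by total cloud cover and reports the RMSE per interval. The function name and the sample values are hypothetical; the interval edges are those of the six cloud cover intervals used for balanced sampling.

```python
import numpy as np

BIN_EDGES = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]


def rmse_by_cloud_bin(obs, est, tcc, edges=BIN_EDGES):
    """Return {interval label: RMSE} for every non-empty cloud cover interval."""
    obs, est, tcc = (np.asarray(a, dtype=float) for a in (obs, est, tcc))
    # Interior edges only, so that tcc == 1.0 falls into the last interval.
    index = np.digitize(tcc, edges[1:-1])
    result = {}
    for b in range(len(edges) - 1):
        selected = index == b
        if selected.any():
            label = f"{edges[b]}~{edges[b + 1]}"
            error = est[selected] - obs[selected]
            result[label] = float(np.sqrt(np.mean(error ** 2)))
    return result


# Hypothetical validation samples: two clear-sky and two overcast hours.
per_bin = rmse_by_cloud_bin(
    obs=[100.0, 100.0, 500.0, 500.0],
    est=[110.0, 90.0, 530.0, 470.0],
    tcc=[0.05, 0.05, 0.95, 1.0],
)
```

Running the same function on the radiation before and after correction yields the per-interval RMSE pairs from which the per-interval improvement rate can be derived.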

[0078] Radiation correction: Using the trained and validated random forest residual correction model, bias correction is performed on the raw ERA5 surface solar radiation data for all of Province A, and hourly corrected surface solar radiation data with a spatial resolution of 0.25° are output.

[0079] (6) Verification of high-frequency fluctuation reconstruction.

[0080] Typical weather events in Province A, including clear, cloudy, and rainy conditions, were selected. The time series of ERA5 radiation data before and after correction were compared with those of the ground observations. The results show that during cloudy and rainy periods, the depth, rate of decrease, and duration of the radiation troughs after correction are consistent with the observed values, which resolves the overestimation caused by insufficient simulation of cloud cover. Under clear skies, the corrected radiation peaks closely match the observed extreme values, which corrects the overestimation or underestimation associated with aerosols. This demonstrates that the correction model described in this invention can accurately reconstruct high-frequency radiation fluctuations on a daily scale.

[0081] The invention can be implemented by mainstream programming software such as Python and MATLAB to construct and train the correction model. The training process of the correction model does not require complex hardware equipment, is easy to operate, and has strong portability. The output corrected data is standardized grid data, which can be directly connected to industrial application platforms such as solar energy resource assessment systems, photovoltaic power prediction models, and power grid dispatching platforms.

[0082] The correction method of this invention has been practically applied in Province A. The corrected ERA5 surface solar radiation data has reduced the photovoltaic power prediction error by more than 20%, providing core data support for the refined assessment of solar energy resources and the planning of photovoltaic power plants in Province A. It has industrial application value and economic and social benefits. This invention can be widely promoted and applied in industries such as meteorology, new energy, and hydrology. It is applicable to areas with abundant cloud and water resources and complex terrain throughout the country, and has good industrial applicability.

Claims

1. A method for correcting ERA5 surface solar radiation bias based on cloud cover dependence characteristics, characterized in that it includes the following steps: S1. Data acquisition: acquire ERA5 reanalysis data and ground observation data for the study area, the ERA5 reanalysis data including variables such as surface shortwave radiation, total cloud cover, high cloud cover, medium cloud cover, and low cloud cover; S2. Data quality control and spatiotemporal matching; S3. Construction of physical feature engineering: based on the physical mechanism of radiative transfer, two types of derived features are constructed, namely a geometric optical factor and a radiation-cloud amount nonlinear interaction term, the geometric optical factor being the cosine of the solar zenith angle, and the radiation-cloud amount nonlinear interaction term including a cloud occlusion interaction term and a clear sky potential index; S4. Construct a random forest residual correction model: set the target variable of the correction model to the residual between the ground-observed radiation value and the original ERA5 radiation value, and use the above-mentioned ERA5 surface shortwave radiation, total cloud cover, high cloud cover, medium cloud cover, and low cloud cover, the two types of derived features in step S3, and the station identifier as the input feature vector; introduce an inverse probability weighting mechanism based on cloud cover frequency to achieve cloud cover balanced sampling, set the correction model parameters, and complete the training; S5. Random forest residual correction model validation and radiation data correction: divide the matching dataset from step S2 into training and validation sets, perform validation and bias correction, and output the corrected surface solar radiation data.

2. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S1, the time resolution is 1 hour and the spatial resolution is 0.25°.

3. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S1, the ground observation data includes hourly surface solar radiation observations and sunshine duration, and the observation stations cover the plain and hilly terrain of the study area.

4. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 3, characterized in that in step S1, the study area is a subtropical monsoon climate area rich in cloud and water resources, and the sunshine duration is converted into a sunshine percentage as a physical proxy variable to verify the accuracy of the ERA5 cloud cover product: SP = S / S0 × 100% (1), where S is the actual sunshine duration and S0 is the daily astronomical sunshine duration.

5. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that step S2 involves matching the ERA5 grid data to the latitude and longitude of the ground observation stations using bilinear interpolation, unifying the time of the observation data and the ERA5 data to an hourly Beijing time series, removing samples with physical extreme values, nighttime or low solar altitude angle samples, and samples with missing cloud data, and constructing a matching dataset, namely the observation dataset and the reanalysis dataset. The above-mentioned removal of physical extreme values refers to removing outliers with observed irradiance less than 0 W/m² or greater than 1300 W/m², and marking missing values coded as -9999.0 as invalid. Let cosθ_z denote the cosine of the solar zenith angle. To exclude nighttime and low solar altitude angles, the hourly cosθ_z is calculated based on the station's latitude and longitude, and only daytime samples with cosθ_z > 0.01 are retained, so as to avoid interference from zero values at night and low light at dawn and dusk during model training.

6. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S3, let cosθ_z denote the cosine of the solar zenith angle, which is calculated as cosθ_z = sinϕ·sinδ + cosϕ·cosδ·cosω (2), where ϕ represents the geographical latitude of the station, δ represents the solar declination, and ω represents the hour angle. The cloud occlusion interaction term is X_inter = R_ERA5 × C_TCC (3), and the clear sky potential index is X_clear = R_ERA5 × (1 - C_TCC) (4), where R_ERA5 is the original ERA5 surface shortwave radiation value and C_TCC is the ERA5 total cloud cover value.

7. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S4, the implementation method of the inverse probability weighting mechanism based on cloud cover frequency is as follows: the training data is divided into different intervals according to the total cloud cover, the sample frequency N_bin of each interval is calculated, and a training weight W_i is assigned to each sample, where W_i∝N_total / (N_bin+ε), N_total is the total number of training data samples, and ε is a correction coefficient to avoid the denominator being 0.

8. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S4, the parameters of the correction model are set as follows: the number of decision trees is 500, the minimum number of leaf nodes is 25, and the maximum number of features is set to square root mode.

9. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S4, let X1 represent the input feature vector. The expression of the input feature vector X1 is: X1=[R_ERA5,C_TCC,C_LCC,C_MCC,C_HCC,cosθ_z,X_inter,X_clear,ID_station] (5), where C_LCC represents low cloud cover, C_MCC represents medium cloud cover, C_HCC represents high cloud cover, cosθ_z is the cosine of the solar zenith angle, and ID_station is the unique identifier of the station.

10. The ERA5 surface solar radiation deviation correction method based on cloud cover dependence characteristics according to claim 1, characterized in that in step S5, the method for verifying, correcting biases, and outputting corrected surface solar radiation data is to use evaluation indicators to verify the correction model for the full sample and different cloud cover intervals, use the trained and verified correction model to correct biases in the raw ERA5 surface solar radiation data, and output corrected surface solar radiation data. The matching dataset is divided into a training set and a validation set in an 8:2 ratio. The Pearson correlation coefficient, root mean square error, mean bias, mean absolute error, and improvement rate (SS) are used as evaluation indicators. The improvement rate is calculated using the following formula: SS=(RMSE1-RMSE2) / RMSE1×100% (6), where RMSE1 is the root mean square error before correction and RMSE2 is the root mean square error after correction.