An XGBoost model-based precise icing prediction method, system and device for a wind farm

By integrating multi-source data and using the XGBoost model, the problems of data bias and quantitative assessment in wind farm icing prediction in cold regions have been solved, achieving high-precision icing prediction and power generation assessment, and providing reliable tools for wind farm planning and asset risk management.

CN122243216A · Pending · Publication Date: 2026-06-19 · ENERGY CHINA YNPD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ENERGY CHINA YNPD
Filing Date
2026-04-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for predicting icing in wind farms in cold or high-altitude regions suffer from problems such as reliance on a single data source, lack of systematic bias correction, neglect of interannual variations and periodic patterns, and inability to quantitatively support power generation assessment. These issues result in low prediction accuracy, large errors, and an inability to meet engineering accuracy requirements.

Method used

A multi-source data fusion strategy was adopted. By collecting measured data from wind towers and mesoscale meteorological data, data preprocessing and bias correction were performed to construct a multi-dimensional feature set, generate confidence-level pseudo-labels, and train an XGBoost machine learning model to optimize the classification threshold, calculate the icing reduction coefficient, and capture the 'alternating year' variation pattern of icing.

Benefits of technology

It significantly improves the accuracy of icing prediction, outputs a high-precision icing reduction coefficient, reduces assessment errors, realizes the quantification of wind farm power generation assessment and full life cycle risk management, and enhances the robustness and engineering applicability of the model.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a method, system, and equipment for accurate wind farm icing prediction based on the XGBoost model. The method first corrects the systematic deviations in temperature and wind speed between mesoscale meteorological data and measured data from anemometers using a linear regression model. Then, a multi-dimensional feature set, including interactive features, is constructed based on the icing physical mechanism. A high-precision icing prediction model is trained using the XGBoost algorithm and SMOTE oversampling technology, and the decision threshold is optimized using PR curves. Finally, the icing probability time series output by the model is converted into an icing state series, and an icing reduction coefficient, which can be directly used in wind farm power generation assessment models, is obtained through statistical calculation. This invention constructs a complete technical chain from multi-source data fusion to engineering-based quantitative assessment, elevating icing prediction from qualitative classification to quantitative engineering parameter output, and solving the key problems of low accuracy and inability to quantitatively support power generation assessment in traditional icing prediction methods.

Description

Technical Field

[0001] This invention relates to the field of wind power generation technology, and in particular to a method, system, and equipment for accurate prediction of wind farm icing based on multi-source data fusion and an XGBoost machine learning model. This invention is especially applicable to the planning, accurate power generation assessment (P50), and asset risk management of wind farms in cold or high-altitude regions.

Background Technology

[0002] In wind farms operating in cold or high-altitude regions, icing of wind turbine blades is a key factor leading to power generation loss and equipment safety risks. Traditional icing prediction and assessment methods have the following limitations: Reliance on a single data source and simple thresholds: Existing methods mostly rely on measured data from a single meteorological station or simple numerical models, using fixed temperature thresholds (such as temperatures below 0°C) to determine icing. However, icing is a complex physical process involving the coupling of multiple factors such as temperature, humidity, wind speed, and liquid water content. Judging solely based on a single temperature condition is seriously inconsistent with the actual icing situation, resulting in low prediction accuracy and large errors.

[0003] Lack of systematic bias correction mechanisms: Significant systematic biases exist between widely used free or commercial mesoscale meteorological data (such as ERA5, MERRA-2) and measured data from wind farms. For example, temperature biases can reach 2-5℃, and wind speed biases can reach 3-5 m/s. Existing techniques typically only perform simple correlation analyses and fail to establish effective bias correction models driven by physical mechanisms. Directly using mesoscale data for icing prediction introduces huge errors.

[0004] Lack of effective pseudo-label noise reduction mechanism: Existing data-driven methods rely on manual labeling or simple rule labeling, but ice accretion observation data is scarce and noisy. Unweighted pseudo-labels will introduce training bias, resulting in poor model robustness, decreased prediction accuracy under complex weather conditions, and inability to stably output reliable results.

[0005] Ignoring interannual variations and cyclical patterns: Traditional methods fail to effectively consider the cyclical variations in icing phenomena, i.e., the differences in icing severity between different years, when assessing the total power generation of a wind farm over its entire life cycle. This results in power generation assessments that are either based on a single "lucky year" (underestimating risk) or a single "disastrous year" (overly conservative), failing to reflect long-term average risk.

[0006] The fundamental flaw in existing technologies lies in their inability to quantitatively support the power generation assessment process. Their predictive results (such as simple binary classification) cannot be directly and quantitatively applied to wind resource assessment. Wind farm power generation assessment needs a reliable and quantifiable "icing power loss reduction factor". Traditional methods cannot provide such a parameter at the accuracy that engineering requires (error <2%), leaving engineers to either rely on the icing impact of a single measured year (increasing risk) or use overly conservative empirical estimates (reducing project economics). For example, Chinese patent CN110942218A applies a wind speed reduction for winter and spring icing whenever the temperature is below 0°C. This is significantly inconsistent with actual icing conditions: temperatures below 0°C do not necessarily indicate icing, and interannual variations are not considered. Icing alone can result in a power generation deviation of over 3%. For example, for a 100,000 kW wind farm with 2,500 equivalent full-load hours, a 3% deviation amounts to 7.5 million kWh, significantly increasing investment risk under market-based electricity pricing.
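
The 7.5 million kWh figure can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name is hypothetical, and the 3% value is the lower bound of the deviation quoted above.

```python
def energy_deviation_kwh(capacity_kw, full_load_hours, deviation):
    """Energy deviation (kWh) for a relative deviation in annual production."""
    annual_energy_kwh = capacity_kw * full_load_hours
    return annual_energy_kwh * deviation

# 100,000 kW farm, 2,500 equivalent full-load hours, 3% deviation
deviation_kwh = energy_deviation_kwh(100_000, 2_500, 0.03)
print(f"{deviation_kwh:,.0f} kWh")
```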

[0007] Therefore, a new approach is needed to address the shortcomings of existing technologies, particularly to achieve breakthroughs in multi-source data fusion, systematic bias correction, pseudo-label confidence assessment, model accuracy optimization, and practical engineering applications, providing reliable technical tools for wind farm planning, accurate P50 assessment, and asset risk management in cold regions.

Summary of the Invention

[0008] The purpose of this invention is to provide a method, system, and equipment for accurate prediction of wind farm icing based on the fusion of multi-source data (mesoscale meteorological data and measured data from wind towers) and the XGBoost machine learning model. Through innovative data fusion strategies, multi-dimensional feature engineering, confidence-weighted pseudo-labels, and machine learning optimization, the accuracy of icing prediction is significantly improved, the high-precision icing reduction coefficient required for power generation assessment is quantitatively output, and the variation pattern of icing "alternating years" is effectively captured.

[0009] To achieve the above objectives, this invention provides a method for accurate prediction of wind farm icing based on multi-source data fusion and an XGBoost machine learning model, comprising the following steps:

S1. Multi-source data collection: First, collect the measured data of the wind measurement tower in the wind farm during a certain period, and then collect the mesoscale meteorological historical data of the location closest to the wind measurement tower during the same period. The measured data of the wind measurement tower includes wind speed and wind direction at each height, and temperature, humidity and air pressure at 10 m; the mesoscale meteorological data includes temperature, dew point temperature, wind speed and air pressure.

S2. Data Preprocessing and Fusion: Time alignment and missing-value processing (including outlier removal) are performed on the measured data from the wind measurement tower and the mesoscale meteorological data. A linear regression model is then used to correct the systematic bias of the temperature and wind speed in the mesoscale meteorological data, giving the corrected mesoscale temperature and wind speed. The data are then fused to obtain the corrected mesoscale temperature, dew point temperature, corrected mesoscale wind speed, air pressure, relative humidity, wind speed and direction at each tower height, and 10 m temperature, 10 m humidity and 10 m air pressure.

S3. Feature Engineering: Based on the fused data, construct a multi-dimensional feature set related to the ice formation mechanism with at least 6 categories and 18 features. The multi-dimensional feature set includes basic features, interaction features, difference features, temporal features, segmentation features, and combined features.

S4. Rule Tagging and Pseudo-Label Generation: Based on preset icing determination rules, icing / non-icing pseudo-labels are automatically generated for the fused data. The icing determination rules consist of a temperature threshold, a relative humidity threshold, a wind speed range threshold, and a duration threshold. A moment that meets the rules is given the "icing" pseudo-label, and a moment that does not meet the rules is given the "non-icing" pseudo-label.

S5. Confidence rating and sample weighting: Based on the degree to which each judgment condition in the rules is satisfied, or on the strength of the rule combination, the generated pseudo-labels are assigned confidence ratings. The pseudo-labels and their confidence ratings are used as the labels and sample weights of the training samples to reduce the impact of pseudo-label noise and improve the robustness of the model.

S6. Model Training and Optimization: The model is trained on the feature set using the XGBoost algorithm. The minority class samples in the training data are oversampled using SMOTE to handle the class imbalance problem. The classification threshold is optimized on the model test set using the PR curve.

S7. Calculation of icing reduction factor: Input the mesoscale sequence data to be evaluated into the trained model to obtain the hourly icing probability; use the optimal classification threshold to transform the probability sequence into an icing state sequence, and count the total duration and proportion of the icing state in the target time period. This proportion is the icing reduction factor for wind farm power generation assessment.

S8. Future Data Application: Repeat the above steps for mesoscale meteorological forecast data for the next few years to obtain the annual series of icing reduction coefficients. Based on the statistical characteristics of this annual series, quantitatively analyze the interannual variation of icing intensity and apply it to the probability assessment of wind farm power generation throughout its entire life cycle.

[0010] Furthermore, in step S2, a linear regression model is used to correct the systematic biases of temperature and wind speed in the mesoscale meteorological data, as follows: a. Temperature correction: The temperature measured at 10 m on the wind measurement tower, $T_{obs}$, is taken as the reference, and the mesoscale temperature $T$ as the regression variable. A univariate linear regression model $T_{obs} = a_T T + b_T$ is established, and the correction coefficients $a_T$ and $b_T$ are obtained by least-squares fitting. The corrected mesoscale temperature is calculated using the following formula:

$$T_{corr} = a_T T + b_T$$

[0011] where $T_{corr}$ is the corrected mesoscale temperature, $T$ is the raw mesoscale temperature, and $a_T$ and $b_T$ are the correction coefficients; b. Wind speed correction: The wind speed measured at the highest level of the wind measurement tower, $v_{obs}$, is taken as the reference, and the mesoscale wind speed $v$ as the regression variable. A univariate linear regression model $v_{obs} = a_v v + b_v$ is established, and the correction coefficients $a_v$ and $b_v$ are obtained by fitting. The corrected mesoscale wind speed is calculated using the following formula:

$$v_{corr} = a_v v + b_v$$

[0012] where $v_{corr}$ is the corrected mesoscale wind speed, $v$ is the raw mesoscale wind speed, and $a_v$ and $b_v$ are the correction coefficients.
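
A minimal sketch of this kind of linear bias correction is given below in plain Python. It assumes the tower measurement is regressed on the mesoscale value and the fitted line is then applied to the raw mesoscale series; the function names and the sample values are illustrative and do not come from the patent.

```python
from statistics import mean

def fit_linear(x, y):
    """Least-squares fit of y ~ a * x + b; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x needs at least two distinct values")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

def apply_correction(raw, a, b):
    """Apply the fitted line to a raw mesoscale series."""
    return [a * value + b for value in raw]

# Illustrative values only: tower readings constructed as 0.9 * meso - 3.1
meso_temp = [-6.0, -3.0, 0.0, 2.0, 5.0]
tower_temp = [-8.5, -5.8, -3.1, -1.3, 1.4]

a_T, b_T = fit_linear(meso_temp, tower_temp)
corrected_temp = apply_correction(meso_temp, a_T, b_T)
```

The same two functions would serve for wind speed, with the tower's top-level wind speed as the reference series.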

[0013] Furthermore, the multidimensional feature set mentioned in step S3 specifically includes:

Basic features, including corrected mesoscale temperature, corrected mesoscale wind speed, and relative humidity;
Interaction features, including the product of temperature and relative humidity, the product of relative humidity and wind speed, and the product of temperature and wind speed;
Difference features, including the rate of change of temperature, relative humidity, and wind speed between adjacent time points;
Time features, including hour codes, day / night indicators, and season indicators;
Segmentation features, including one-hot encoded features obtained after dividing temperature, wind speed, and relative humidity into intervals;
Combined features, including a binary indicator feature generated from the preset icing determination rule and in-window cumulative / consecutive count features.
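
As a rough illustration of several of these categories, the sketch below derives basic, interaction, difference, time, and segmentation features from a time-ordered list of records. All key names, the night window, and the 0℃ bin edge are assumptions made for the example; the patent's actual 18 features are not listed in this text.

```python
def build_features(rows):
    """Build per-time-step features from time-ordered records.

    rows: list of dicts with hypothetical keys 'hour' (0-23),
    'temp' (deg C), 'rh' (%) and 'wind' (m/s).
    """
    features = []
    prev = None
    for row in rows:
        t, rh, v = row["temp"], row["rh"], row["wind"]
        features.append({
            # basic features
            "temp": t, "rh": rh, "wind": v,
            # interaction features (pairwise products)
            "temp_x_rh": t * rh, "rh_x_wind": rh * v, "temp_x_wind": t * v,
            # difference features (change since the previous time step)
            "d_temp": 0.0 if prev is None else t - prev["temp"],
            "d_rh": 0.0 if prev is None else rh - prev["rh"],
            "d_wind": 0.0 if prev is None else v - prev["wind"],
            # time features (the 18:00-06:00 night window is an assumption)
            "hour": row["hour"],
            "is_night": int(row["hour"] >= 18 or row["hour"] < 6),
            # segmentation feature (the 0 deg C bin edge is an assumption)
            "temp_below_zero": int(t <= 0.0),
        })
        prev = row
    return features

rows = [
    {"hour": 22, "temp": -1.0, "rh": 90.0, "wind": 3.0},
    {"hour": 23, "temp": -2.5, "rh": 95.0, "wind": 2.0},
]
feats = build_features(rows)
```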

[0014] Furthermore, the icing determination rule in step S4 is as follows: (1) the mesoscale temperature satisfies the temperature threshold; (2) the difference between the corrected mesoscale temperature and the dew point temperature satisfies the temperature-difference threshold, or the relative humidity satisfies the humidity threshold; (3) the corrected mesoscale wind speed satisfies the wind speed threshold; (4) the meteorological conditions in (1)-(3) persist for the duration threshold; where the thresholds are parameters set according to site characteristics. If a continuous time window at a certain moment simultaneously meets all four of the above conditions, then the corresponding moment and the window containing it are marked with the pseudo-label "iced"; otherwise, they are marked "non-iced".
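
The rule can be sketched as a per-step condition check followed by a run-length filter. The inequality directions below (temperature at or below a limit, dew-point spread at or below a limit or humidity at or above a limit, wind speed at or above a limit) are assumptions, since the threshold expressions are not legible in this text; the function and parameter names are hypothetical.

```python
def label_icing(temp, spread, rh, wind, *, t_max, spread_max, rh_min,
                wind_min, min_steps):
    """Return 0/1 pseudo-labels; 1 marks steps inside a qualifying run.

    spread is temperature minus dew point. A run qualifies when the
    per-step conditions hold for at least min_steps consecutive steps.
    """
    ok = [
        t <= t_max and (s <= spread_max or h >= rh_min) and w >= wind_min
        for t, s, h, w in zip(temp, spread, rh, wind)
    ]
    labels = [0] * len(ok)
    run_start = None
    for i, flag in enumerate(ok + [False]):  # sentinel closes a trailing run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_steps:
                for j in range(run_start, i):
                    labels[j] = 1
            run_start = None
    return labels

# Hourly toy series; t_max and spread_max are placeholder values
labels = label_icing(
    temp=[-2, -2, -2, -2, 1, -2, -2],
    spread=[0.5] * 7,
    rh=[90] * 7,
    wind=[2] * 7,
    t_max=0.0, spread_max=1.0, rh_min=85, wind_min=1.0, min_steps=3,
)
```

In this toy series the first four hours form a qualifying run, while the two-hour run at the end is too short to be labelled.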

[0015] Furthermore, the confidence level described in step S5 is divided into at least three levels: high confidence, medium confidence, and low confidence, according to the strictness of the conditions for satisfying the icing judgment rule. The sample weights are assigned to the training samples according to the confidence level linearly or nonlinearly, and participate in the loss function calculation as sample weights during XGBoost training.
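
One way to picture the grading is a lookup from confidence level to sample weight. The sketch below grades an icing run by its duration only, using the 3 h / 2 h / 1.5 h cut-offs that appear in the embodiment; the weight values are hypothetical placeholders, because the patent's own weight mapping is not available in this text.

```python
# Hypothetical weights; the source's mapping is not reproduced here.
CONFIDENCE_WEIGHTS = {"high": 1.0, "medium": 0.7, "low": 0.4}

def confidence_level(duration_hours):
    """Grade an icing pseudo-label by how long the conditions persisted."""
    if duration_hours >= 3.0:
        return "high"
    if duration_hours >= 2.0:
        return "medium"
    if duration_hours >= 1.5:
        return "low"
    return None  # too short to be graded as icing

def sample_weights(durations):
    """Map run durations (hours) to training weights, skipping ungraded runs."""
    levels = [confidence_level(d) for d in durations]
    return [CONFIDENCE_WEIGHTS[lvl] for lvl in levels if lvl is not None]

weights = sample_weights([3.5, 2.0, 1.5, 1.0])
```

Weights of this kind would typically be supplied per sample to the training call so that low-confidence labels contribute less to the loss.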

[0016] Furthermore, step S6, which involves optimizing the optimal classification threshold on the model test set using the PR curve, specifically involves: Calculate precision and recall at different classification thresholds on the model test set, and plot the PR curve; Choose the threshold point that maximizes the F1 score, or determine the threshold corresponding to the precision-recall balance point as the optimal classification threshold.
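
The F1-maximising variant of this threshold search can be sketched without any library, by scanning each distinct predicted probability as a candidate threshold. The function name and the toy data are illustrative only.

```python
def best_f1_threshold(y_true, proba):
    """Return (threshold, precision, recall, f1) that maximises F1.

    A sample is predicted positive when its probability >= threshold.
    """
    best = None
    for thr in sorted(set(proba)):
        pred = [1 if p >= thr else 0 for p in proba]
        tp = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat == 1)
        fp = sum(1 for y, yhat in zip(y_true, pred) if y == 0 and yhat == 1)
        fn = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat == 0)
        if tp == 0:
            continue  # precision and F1 are undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if best is None or f1 > best[3]:
            best = (thr, precision, recall, f1)
    return best

y_true = [0, 0, 1, 0, 1, 1]
proba = [0.1, 0.3, 0.35, 0.4, 0.6, 0.9]
thr, precision, recall, f1 = best_f1_threshold(y_true, proba)
```

On this toy data the search settles on 0.35 rather than 0.5, which mirrors the idea of moving the threshold away from the default when the positive class is rare.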

[0017] Furthermore, the meteorological data to be evaluated in step S7 is mesoscale meteorological forecast data for the future operation period of the wind farm.

[0018] To achieve the above objectives, the present invention also provides a system for implementing the above-mentioned method for accurate prediction of wind farm icing based on multi-source data fusion and an XGBoost machine learning model, comprising:

The data acquisition and storage module, used to acquire and store measured data from the wind measurement tower and mesoscale meteorological data;
The data preprocessing module, used to perform data cleaning, time alignment, systematic bias correction, and data fusion;
The rule labeling module, used to automatically generate icing / non-icing pseudo-labels for the fused data based on preset icing judgment rules, to assign confidence levels to the pseudo-labels according to the degree of rule satisfaction, and to output weighted training sets;
The feature engineering module, used to automatically extract and construct the multidimensional feature set from the fused data;
The machine learning modeling module, with a built-in XGBoost algorithm and SMOTE oversampling procedure, used to train, evaluate and optimize the icing prediction model and determine the optimal classification threshold;
The icing analysis and application module, used to load pre-trained models for batch prediction, generate icing probability time series, calculate icing reduction coefficients, and output visual analysis reports.

[0019] In addition, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described method for accurate prediction of wind farm icing.

[0020] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method for accurate prediction of wind farm icing.

[0021] The beneficial effects of this invention are as follows: This invention constructs a complete technical chain from multi-source data fusion to engineering quantitative assessment, upgrading icing prediction from qualitative classification to quantitative engineering parameter output. The accuracy, robustness, and practicality of this invention have reached the leading level in the industry, significantly improving the accuracy and practicality of icing prediction for wind farms in cold regions. It solves the key problems of low accuracy and inability to quantitatively support power generation assessment in traditional icing prediction methods, and provides reliable data support and technical tools for wind farm early planning, accurate power generation assessment (P50), and asset risk management.

[0022] (1) At the data layer, this invention addresses the core pain point of significant systematic deviation between mesoscale meteorological data and field wind tower data in the prior art, which leads to huge prediction errors. For the first time, it introduces a linear regression model to perform physical mechanism-driven deviation correction for temperature and wind speed. This not only eliminates the systematic error introduced by inaccurate data in traditional methods, but more importantly, it provides an engineering-replicable technical paradigm for the fusion of multi-source heterogeneous data. It solves the long-standing problem of the availability of free mesoscale data in the industry, lays a high-quality data foundation for subsequent model training, and improves prediction accuracy from the source.

[0023] (2) At the algorithm level, this invention addresses the industry challenges of scarce icing observation data, high manual annotation costs, and poor model robustness due to large pseudo-label noise. It innovatively proposes a rule-driven automatic labeling and confidence-based weighted grading mechanism. Pseudo-labels are automatically generated through preset multi-dimensional judgment rules for temperature, humidity, wind speed, and duration. Based on the degree of rule fulfillment, pseudo-labels are innovatively divided into high, medium, and low confidence levels, which are used as sample weights in the loss function calculation during XGBoost training. This mechanism effectively reduces the interference of low-quality labels on the model, significantly improving the prediction accuracy of the model under complex meteorological conditions. Meanwhile, to address the class imbalance issue caused by the low proportion of icing event samples (only 15.6%), this invention employs SMOTE oversampling technology to synthesize high-quality minority class samples and dynamically optimizes the classification threshold to 0.405 (instead of the default 0.5) using the PR curve. On the test set, it achieves excellent performance with a recall rate of 86%, precision of 82%, and F1 score of 83.78%, which is 7.7% higher than the traditional threshold selection method. This ensures the conservative evaluation principle of "better to falsely report than to miss" icing events and significantly enhances the engineering applicability of the model.

[0024] (3) At the application layer, this invention addresses the fundamental deficiency of traditional methods that only output binary classification results and cannot quantitatively support the assessment of wind farm P50 power generation. It innovatively converts the hourly icing probability sequence predicted by the model into the proportion of icing duration, directly outputting the icing reduction factor, an engineering indicator. This value can be directly embedded into theoretical power generation calculations, controlling the assessment error to within 2%, completely solving the industry's long-standing reliance on empirical estimations (error > 3%) or single measured year data (deviation up to 7.5 million kWh). More importantly, by repeatedly calculating the multi-year reduction factor sequence, this invention achieves, for the first time, a quantitative analysis of the cyclical pattern of icing "alternating years," providing a climatological basis for wind farm life-cycle risk management, avoiding assessment bias caused by ignoring interannual variations, and significantly improving the scientific nature of project investment decisions.

[0025] (4) At the system level, the present invention constructs a complete system covering six major modules: data acquisition, deviation correction, rule labeling, feature engineering, model training, and reduction coefficient calculation. It forms a deployable integrated hardware and software solution, realizing a fully automated process from data input to visualization report generation. This enables complex machine learning methods to be conveniently applied to the front line of engineering projects such as design institutes and wind power companies, completely changing the fragmented and experience-based application pattern of traditional methods, and providing standardized and intelligent technical support for wind power development in cold regions.

Attached Figure Description

[0026] Figure 1 is a flowchart of the wind farm icing prediction method in the embodiment;
Figure 2 is a scatter plot of corrected temperature and wind speed versus icing probability in the embodiment;
Figure 3 is a ranking diagram of the importance of icing-related features in the embodiment;
Figure 4 is the ROC curve of the icing prediction model in the embodiment;
Figure 5 is the precision-recall (PR) curve of the icing prediction model in the embodiment.

Detailed Implementation

[0027] To make the technical problems and solutions of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely for explaining the present invention and are not intended to limit it.

Example 1

[0028] This embodiment takes a 100 MW wind farm in a high-altitude and cold region in southwest China as the research object, and an 80-meter-high wind measurement tower is installed at the site.

[0029] As shown in Figure 1, the specific steps of the method for accurate prediction of wind farm icing based on the fusion of multi-source data (mesoscale meteorological data and wind tower measured data) and the XGBoost machine learning model described in this invention are as follows:

S1. Data Collection and Preparation:

[0030] (1) Measured data from the wind measurement tower: Measured data were collected from an 80 m anemometer tower at the wind farm site between October 2, 2024 and April 18, 2025, with a time resolution of 5 minutes. The data contains 57,301 valid samples, and the feature dimensions include: wind speed and direction at heights of 10 m, 30 m, 50 m, and 80 m, and temperature, relative humidity, and air pressure at a height of 10 m.

[0031] (2) Mesoscale meteorological historical data: Historical data were collected from the mesoscale meteorological grid point closest to the anemometer tower location over the same period, with a time resolution of 1 hour. The data contains 4,776 samples, and the feature dimensions include: temperature, dew point temperature, wind speed, and air pressure.

S2. Data Preprocessing and Fusion:

[0032] (1) Data cleaning and alignment: The timestamps of the two collected data sets were unified to UTC. Linear interpolation was used to fill the small number of missing values, and obvious outliers were removed based on the 3σ criterion. Finally, 4,750 valid samples with fully aligned time points were obtained (for model training).
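
The two cleaning operations named here can be sketched in plain Python as follows. This is an illustration under simple assumptions (a single numeric series, missing values stored as None, population standard deviation for the 3σ test); the function names are hypothetical, and the order in which the patent applies the two operations is not stated.

```python
from statistics import mean, pstdev

def drop_outliers_3sigma(values):
    """Replace values more than 3 standard deviations from the mean with None."""
    valid = [v for v in values if v is not None]
    mu, sigma = mean(valid), pstdev(valid)
    return [
        None if v is not None and abs(v - mu) > 3 * sigma else v
        for v in values
    ]

def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for lo, hi in zip(known, known[1:]):
        for i in range(lo + 1, hi):
            frac = (i - lo) / (hi - lo)
            out[i] = out[lo] + frac * (out[hi] - out[lo])
    return out

filled = interpolate_linear([1.0, None, None, 4.0])
cleaned = drop_outliers_3sigma([0.0] * 20 + [100.0])
```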

[0033] (2) Calculate relative humidity: Using the temperature ($T$) and dew point temperature ($T_d$) from the mesoscale data, the relative humidity (RH) at the mesoscale location is calculated using the following formula:

[equation not reproduced]

[0034] where $T$ is the temperature at the mesoscale location and $T_d$ is the dew point temperature.
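
The patent's own formula is not visible in this text. As a stand-in, the sketch below uses the widely used Magnus approximation for saturation vapor pressure, with the coefficients 17.625 and 243.04 ℃; both the choice of approximation and the function name are assumptions, not the patent's stated method.

```python
import math

def relative_humidity(temp_c, dewpoint_c, a=17.625, b=243.04):
    """Relative humidity (%) from temperature and dew point in deg C.

    Uses the Magnus approximation: RH = 100 * e_s(Td) / e_s(T), where
    e_s(x) is proportional to exp(a * x / (b + x)).
    """
    def saturation_term(x):
        return math.exp(a * x / (b + x))

    return 100.0 * saturation_term(dewpoint_c) / saturation_term(temp_c)

rh_saturated = relative_humidity(5.0, 5.0)   # dew point equals temperature
rh_cold = relative_humidity(0.0, -2.0)       # 2 deg C dew-point spread
```

A small dew-point spread near freezing already corresponds to a relative humidity in the mid-80s, which is why the rules treat the spread and the humidity as alternative conditions.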

[0035] (3) Systematic bias correction: a. Temperature correction: The temperature measured at 10 m on the wind measurement tower, $T_{obs}$, is taken as the reference, and the mesoscale temperature $T$ as the regression variable. A univariate linear regression model $T_{obs} = a_T T + b_T$ is established, and the correction coefficients $a_T$ and $b_T$ are obtained by least-squares fitting. The corrected mesoscale temperature is calculated using the following formula:

$$T_{corr} = a_T T + b_T$$

[0036] where $T_{corr}$ is the corrected mesoscale temperature, $T$ is the raw mesoscale temperature, and $a_T$ and $b_T$ are the correction coefficients.

[0037] b. Wind speed correction: The wind speed measured at a height of 80 m on the wind measurement tower, $v_{obs}$, is taken as the reference, and the mesoscale wind speed $v$ as the regression variable. A univariate linear regression model $v_{obs} = a_v v + b_v$ is established, and the correction coefficients $a_v$ and $b_v$ are obtained by fitting. The corrected mesoscale wind speed is calculated using the following formula:

$$v_{corr} = a_v v + b_v$$

[0038] where $v_{corr}$ is the corrected mesoscale wind speed, $v$ is the raw mesoscale wind speed, and $a_v$ and $b_v$ are the correction coefficients.

[0039] c. Fusion output: Obtain the hourly series of variables such as corrected mesoscale temperature, dew point temperature, relative humidity, corrected mesoscale wind speed, air pressure, wind speed and direction at each meteorological tower height, 10 m temperature, 10 m humidity, and 10 m air pressure.

S3. Feature Engineering

[0040] Based on the physical mechanisms of icing formation (typically occurring within a range of low temperature, high humidity, and specific wind speeds), 18 features across 6 categories were constructed from the fused data, forming the feature set for the model input. The features and their importance scores in the model are shown in Figure 3.

[0041] Feature importance analysis shows that interaction features have the highest importance in this model, indicating that considering the synergistic effects of multiple meteorological conditions simultaneously is crucial for accurate icing prediction.

S4. Rule Labeling:

[0042] (1) Example of preset judgment rules (can be adjusted according to site):

Condition A: the temperature satisfies the temperature threshold (example value not reproduced);

[0043] Condition B: the difference between the corrected mesoscale temperature and the dew point temperature satisfies its threshold (example value not reproduced), or the relative humidity satisfies its threshold (example: 85%);
Condition C: the corrected mesoscale wind speed satisfies its threshold (example: 1 m/s);
Condition D: the above conditions persist for the duration threshold (example: 3 hours).

If the continuous time window at time t satisfies all four conditions above, then time t and the window containing it are given the "iced" pseudo-label; otherwise, they are given the "non-iced" pseudo-label.

Figure 2 is a scatter plot of corrected mesoscale temperature and wind speed against icing probability, in which the color intensity of each point indicates the icing probability output by the model. The plot shows that icing is most prevalent under meteorological conditions characterized by low temperature (<5℃), high humidity, and wind speed <4 m/s, supporting the effectiveness of the interaction features in feature engineering.

[0044] (2) Example of confidence level grading rules:

High confidence: satisfies the strict threshold combination (threshold values not reproduced) and lasts ≥3 hours;
Medium confidence: meets the basic threshold (threshold values not reproduced) and lasts ≥2 hours;
Low confidence: satisfies the loose threshold (threshold values not reproduced) and lasts ≥1.5 hours.

Example of corresponding sample weight mapping: [not reproduced]

S5. Construction and Training of Icing Prediction Model

[0045] (1) Training set construction: Training samples are constructed from the feature data obtained by feature engineering. Each sample contains a feature vector, a pseudo-label, and a sample weight.

[0046] (2) Data set partitioning: The 4750 samples after preprocessing and feature engineering were divided into a training set (3800 samples) and a test set (950 samples) in 8:2 ratio according to time order.
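
A time-ordered 8:2 split simply cuts the sequence at the 80% mark without shuffling, so that the test set lies entirely after the training set. The helper below is a minimal illustration with a hypothetical name.

```python
def chronological_split(samples, train_fraction=0.8):
    """Split a time-ordered sequence into (train, test) without shuffling."""
    cut = int(round(len(samples) * train_fraction))
    return samples[:cut], samples[cut:]

# 4750 time-ordered sample indices, as in this embodiment
train, test = chronological_split(list(range(4750)))
```

Keeping the original order avoids leaking future weather into the training data, which a random split of a time series would allow.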

[0047] (3) Handling Class Imbalance: Since icing events are low-probability events, there are far more "no icing" samples than "icing" samples in the original data. In this embodiment, after the dataset is divided, there are approximately 3,200 "no icing" samples and approximately 600 "icing" samples in the training set, a negative-to-positive sample ratio of approximately 5.3:1, indicating a significant class imbalance. To address this issue, this invention employs SMOTE (Synthetic Minority Over-sampling Technique) to oversample the "icing" samples in the training set. By linearly interpolating between existing icing samples in the feature space, artificially synthesized icing samples are generated, balancing the number of samples in both classes. This forces the model to learn the boundary features of icing events, thereby improving its ability to identify icing events.
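
The interpolation idea behind SMOTE can be shown with a short, simplified sketch: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. This is an illustration of the principle only, with hypothetical names and a brute-force neighbour search; a maintained implementation such as the one in the imbalanced-learn package would normally be used in practice.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating minority samples.

    minority: list of feature vectors (lists of floats), at least two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
```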

[0048] (4) Model training: The XGBoost (eXtreme Gradient Boosting) algorithm was used to train the model on the oversampled training set. The main hyperparameters were determined through grid search and cross-validation, including: n_estimators=300, learning_rate=0.1, max_depth=6, subsample=0.8, colsample_bytree=0.7, early_stopping_rounds=50, etc.
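
For reference, the listed hyperparameters can be collected in a plain dictionary. The sketch assumes they would be passed to XGBoost's scikit-learn style classifier (for example `XGBClassifier(**xgb_params)`), and that the installed version accepts `early_stopping_rounds` as a constructor argument; both points are assumptions about the implementation, which the patent does not spell out.

```python
# Hyperparameters as listed in the embodiment. Passing them to
# xgboost.XGBClassifier(**xgb_params) is an assumed usage.
xgb_params = {
    "n_estimators": 300,          # maximum number of boosting rounds
    "learning_rate": 0.1,         # shrinkage applied to each tree
    "max_depth": 6,               # maximum depth of each tree
    "subsample": 0.8,             # row sampling ratio per tree
    "colsample_bytree": 0.7,      # column sampling ratio per tree
    "early_stopping_rounds": 50,  # stop if validation metric stalls
}
```

Early stopping needs a validation set at fit time, so 300 acts as an upper bound on the number of trees rather than a fixed count.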

[0049] (5) Model evaluation and threshold optimization: a. Model performance is evaluated on the test set. As shown in Figure 4, the area under the ROC curve (AUC) reached 0.9819, far above the random-classification baseline (AUC = 0.5, the diagonal line in Figure 4). This result indicates that the model has a very strong ability to distinguish between icing and non-icing events, and its classification performance is excellent.

[0050] b. Since icing events are a minority class, precision and recall are of greater concern. The PR curve is therefore used to determine the optimal classification threshold, balancing precision against recall. As shown in Figure 5, the horizontal axis of the PR curve represents recall, and the vertical axis represents precision. The marked point on the curve corresponds to the point where the F1 score is maximized (threshold = 0.405, F1 = 0.84). This figure explains why 0.405 was chosen instead of the default 0.5 as the optimal classification threshold. This threshold balances high precision (82%) with a high recall rate (86%) for identifying icing events. At this threshold, the model achieves an accuracy of 94.21% and an F1 score of 83.78% on the test set, achieving the best balance between accurate identification and false negative control.

S6. Icing Prediction and Reduction Factor Calculation (Technical Effect Demonstration)

[0051] (1) Icing prediction: The complete 57,301 hours of meteorological tower data, after correction and feature engineering, were input into the trained XGBoost model to obtain the icing probability for each hour. Comparing the probability values with the decision threshold of 0.405, icing risk was predicted for 459 hours (8.01% of the total time period). The prediction agrees closely with the on-site observation records in both temporal distribution and severity.

[0052] (2) Calculation of icing reduction factor: a. Count the total number of periods within the target time period (such as one year or the entire assessment period) that are determined to be in the "icing" state.

[0053] b. Calculate this count as a percentage of the total number of periods in the target time period.

[0054] c. This percentage is the power generation reduction factor caused by icing.
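Steps a to c can be sketched in a few lines of Python. The function names, the use of "greater than or equal to" at the threshold, and the 10-hour probability sequence are illustrative assumptions, not values from the embodiment.

```python
def icing_reduction_factor(probabilities, threshold=0.405):
    """Return (icing_hours, reduction_factor) for hourly icing probabilities."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    # Step a: count the hours classified as icing
    icing_hours = sum(1 for p in probabilities if p >= threshold)
    # Steps b and c: the share of icing hours is the reduction factor
    return icing_hours, icing_hours / len(probabilities)

def derated_energy(theoretical_energy, reduction_factor):
    """Apply the icing reduction factor to a theoretical energy yield."""
    return theoretical_energy * (1.0 - reduction_factor)

# Hypothetical 10-hour probability sequence
hourly_proba = [0.02, 0.10, 0.41, 0.77, 0.39, 0.05, 0.01, 0.00, 0.03, 0.08]
hours, factor = icing_reduction_factor(hourly_proba)
energy = derated_energy(1000.0, factor)   # e.g. a theoretical yield in MWh
```

In this toy sequence two of the ten hours exceed the threshold, so the factor is 0.2 and a theoretical yield of 1000 is derated to 800.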

[0055] In this embodiment, by predicting and statistically analyzing 25 years of mesoscale data, the long-term average icing reduction factor for this site is obtained as 8.01% (i.e., 459 / 57301 = 8.01%). This means that the theoretical power generation of this wind farm must be multiplied by (1 - 8.01%) = 91.99% to reflect the power loss caused by icing.

Example 2

[0056] This embodiment takes a 100 MW wind farm in a high-altitude, cold region of Southwest China as an example and describes in detail the engineering deployment of the system that implements the accurate wind farm icing prediction method based on multi-source data fusion and the XGBoost machine learning model. It specifies the hardware configuration, software implementation, interface definitions, and collaborative workflow of each module, and provides pseudocode for the core algorithm.

1. System modular design and hardware deployment

[0057] This system adopts a distributed microservice architecture, where each module can be deployed independently on a dedicated server or virtualized container, communicating through message queues and an API gateway. The specific hardware configuration and connectivity are as follows:

[0058] All servers are connected to a unified management platform (such as Dell OpenManage) via a redundant gigabit management network and are equipped with UPS power supplies to ensure power continuity.

2. Transparency in software implementation and interface definition

[0059] 2.1 Data Acquisition and Storage Module

[0060] Software stack: Python 3.9 + InfluxDB 2.0 (time-series database) + PostgreSQL 14 (metadata management) + RabbitMQ 3.9 (message queue)

Data collection:

Wind measurement tower data: A Python script (using the pyserial library) reads data from the NRG collector every 10 minutes, parses it, and sends it to the storage queue via RabbitMQ.

[0061] Mesoscale meteorological data: ERA5 reanalysis data are downloaded daily by calling the ECMWF CDS API (cdsapi library), or future forecast data are obtained through APIs provided by commercial data service providers (such as Meteoblue).

[0062] Storage interface: Time-series data is written to a measurement in InfluxDB, with tags including site_id and data_source, and fields including temperature and wind_speed.

[0063] Metadata (such as data version and collection timestamp) is written to the data_metadata table in PostgreSQL.

[0064] 2.2 Data Preprocessing Module

Software stack: Apache Airflow 2.5 (task scheduling) + Python 3.9 (Pandas 1.5, NumPy 1.23, Scikit-learn 1.2)

Task flow:

DAG definition: data_preprocessing_dag is triggered every day at midnight and executes the following tasks in sequence:

fetch_raw_data: Reads the previous day's raw data from InfluxDB and loads it into a Pandas DataFrame.

[0065] clean_data: Uses the 3σ criterion to remove outliers and linear interpolation (DataFrame.interpolate) to fill missing values.

[0066] time_align: Aligns mesoscale data with meteorological tower data by hour, and uses mean aggregation for resampling.

[0067] bias_correction: Performs linear regression correction for temperature and wind speed (see pseudocode in Section 2.5). The model coefficients are stored in the correction_coeff table in PostgreSQL.

[0068] calculate_rh: Calculates relative humidity from the mesoscale temperature T and dew point temperature Td.

[0069] publish_processed: Writes the merged data to the processed_data measurement in InfluxDB and sends a message to the processed_data_queue.
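The cleaning, alignment, bias-correction, and humidity tasks listed above can be sketched with Pandas and NumPy as follows. The helper names, the synthetic series, and the use of the Magnus approximation for saturation vapour pressure are assumptions; the source does not specify the relative humidity formula.

```python
import numpy as np
import pandas as pd

def clean_series(s: pd.Series) -> pd.Series:
    """Mask values outside mean ± 3σ, then fill gaps by linear interpolation."""
    mu, sigma = s.mean(), s.std()
    cleaned = s.where((s - mu).abs() <= 3 * sigma)   # outliers and NaN stay NaN
    return cleaned.interpolate(method="linear", limit_direction="both")

def fit_bias_correction(meso: pd.Series, tower: pd.Series):
    """Least-squares fit of tower ≈ a * meso + b; returns (a, b)."""
    valid = meso.notna() & tower.notna()
    a, b = np.polyfit(meso[valid], tower[valid], deg=1)
    return float(a), float(b)

def relative_humidity(t_c, td_c):
    """RH (%) from temperature and dew point (°C), Magnus approximation (assumption)."""
    def es(t):
        return 6.112 * np.exp(17.62 * t / (243.12 + t))
    return np.clip(100.0 * es(td_c) / es(t_c), 0.0, 100.0)

# Synthetic 10-minute tower temperature covering 48 hours
idx10 = pd.date_range("2024-01-01", periods=6 * 48, freq="10min")
tower_t = pd.Series(-2.0 + 3.0 * np.sin(np.arange(len(idx10)) / 20.0), index=idx10)
tower_t.iloc[50] = 60.0           # injected spike, removed by the 3σ rule
tower_t.iloc[100:103] = np.nan    # injected gap, filled by interpolation

# time_align: aggregate the tower data to hourly means
tower_hourly = clean_series(tower_t).resample("1h").mean()

# Synthetic mesoscale temperature with a known bias, so the true fit is a=0.9, b=1.5
meso_t = (tower_hourly - 1.5) / 0.9
a, b = fit_bias_correction(meso_t, tower_hourly)
meso_t_corr = a * meso_t + b

# calculate_rh: dew point assumed 1 °C below the corrected temperature
rh = relative_humidity(meso_t_corr, meso_t_corr - 1.0)
```

Storing a and b (for example in the correction_coeff table mentioned above) lets the same correction be applied later to forecast data for which no tower measurements exist.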

[0070] Interface definition: Provides a REST API, /api/v1/processed_data?start=...&end=..., which returns the fused data in JSON format for downstream modules to call.

2.3 Rule Tagging Module

[0071] Software stack: Python 3.9 + custom rule engine (an expert system based on the pyknow library, or pure Python rule functions)

Function: Generate pseudo-labels and confidence levels for the fused data according to the icing rules preset in the engineering standard (IEC 61400-1).

[0072] Rule definition:

RULES = [
    {"name": "standard_icing",
     "condition": lambda row: (row['temp'] <= 2.0) and (row['rh'] >= 80) and (2.0 <= row['ws'] <= 25.0),
     "weight": 0.8},  # Base weight
    {"name": "severe_icing",
     "condition": lambda row: (row['temp'] <= 0.0) and (row['rh'] >= 90) and (2.5 <= row['ws'] <= 15.0),
     "weight": 1.0},
    {"name": "frost_icing",
     "condition": lambda row: (row['temp'] <= -5.0) and (row['rh'] >= 70),
     "weight": 0.7},
    {"name": "supercooled_fog",
     "condition": lambda row: (-3.0 <= row['temp'] <= 1.0) and (row['rh'] >= 95) and (row['ws'] <= 8.0),
     "weight": 0.9},
]

Confidence calculation: If a data point satisfies one or more rules, the label is 1 (icing) and the highest weight among the satisfied rules is taken as the confidence score; if no rule is satisfied, the label is 0 (no icing) and the confidence score is 1.0 (high certainty). The output is a weighted training set in CSV format, containing feature columns, a label column (0 / 1), and a weight column (float).
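A minimal sketch of how these rules could be applied row by row is given below. The rule list repeats the definition above; the function name generate_weighted_labels follows the interface described for this module, while the loop-based implementation and the four sample rows are assumptions for illustration.

```python
import pandas as pd

RULES = [
    {"name": "standard_icing",
     "condition": lambda row: (row["temp"] <= 2.0) and (row["rh"] >= 80) and (2.0 <= row["ws"] <= 25.0),
     "weight": 0.8},
    {"name": "severe_icing",
     "condition": lambda row: (row["temp"] <= 0.0) and (row["rh"] >= 90) and (2.5 <= row["ws"] <= 15.0),
     "weight": 1.0},
    {"name": "frost_icing",
     "condition": lambda row: (row["temp"] <= -5.0) and (row["rh"] >= 70),
     "weight": 0.7},
    {"name": "supercooled_fog",
     "condition": lambda row: (-3.0 <= row["temp"] <= 1.0) and (row["rh"] >= 95) and (row["ws"] <= 8.0),
     "weight": 0.9},
]

def generate_weighted_labels(processed_data_df: pd.DataFrame) -> pd.DataFrame:
    """Attach a pseudo-label (0/1) and a confidence weight to every row."""
    labels, weights = [], []
    for _, row in processed_data_df.iterrows():
        matched = [r["weight"] for r in RULES if r["condition"](row)]
        labels.append(1 if matched else 0)
        # Highest matching weight, or 1.0 for a confident "no icing" label
        weights.append(max(matched) if matched else 1.0)
    out = processed_data_df.copy()
    out["label"] = labels
    out["weight"] = weights
    return out

# Hypothetical sample rows: temperature (°C), relative humidity (%), wind speed (m/s)
sample = pd.DataFrame({
    "temp": [1.0, -1.0, -6.0, 5.0],
    "rh":   [85.0, 96.0, 72.0, 50.0],
    "ws":   [5.0, 5.0, 1.0, 6.0],
})
weighted = generate_weighted_labels(sample)
```

The second sample row satisfies three rules at once (standard_icing, severe_icing, supercooled_fog) and therefore receives the highest of their weights, 1.0.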

[0073] Interface: Provides the Python function generate_weighted_labels(processed_data_df), which returns a weighted DataFrame.

2.4 Feature Engineering Module

[0074] Software stack: Python 3.9 + Featuretools 1.0 (automatic feature generation) + Pandas

Feature generation:

Basic features: Selected directly.

[0075] Interaction features: Product terms are calculated manually.

[0076] Difference features: First-order differences are calculated using shift().

[0077] Time features: The hour and month are extracted, and a nighttime flag is set (hour < 6 or hour > 18).

[0078] Segmentation features: Temperature and wind speed are binned using pd.cut and then one-hot encoded.

[0079] Combined features: Binary indicator variables are generated from the rules (such as the four conditions defined in the rules).
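The six feature categories can be sketched with Pandas as shown below. The column names, the bin edges passed to pd.cut, and the choice of the severe_icing condition as the combined feature are illustrative assumptions, not values given in the source.

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly DataFrame indexed by timestamp with columns temp, rh, ws."""
    f = df[["temp", "rh", "ws"]].copy()              # basic features
    f["temp_x_rh"] = f["temp"] * f["rh"]             # interaction features
    f["rh_x_ws"] = f["rh"] * f["ws"]
    f["temp_x_ws"] = f["temp"] * f["ws"]
    for col in ("temp", "rh", "ws"):                 # first-order differences
        f[f"d_{col}"] = f[col] - f[col].shift(1)
    f["hour"] = f.index.hour                         # time features
    f["month"] = f.index.month
    f["is_night"] = ((f["hour"] < 6) | (f["hour"] > 18)).astype(int)
    # Segmentation features: bin edges are assumptions for illustration
    temp_bin = pd.cut(f["temp"], bins=[-np.inf, -5, 0, 2, np.inf],
                      labels=["le_m5", "m5_0", "0_2", "gt_2"])
    ws_bin = pd.cut(f["ws"], bins=[-np.inf, 2, 8, 15, np.inf],
                    labels=["le_2", "2_8", "8_15", "gt_15"])
    f = pd.concat([f,
                   pd.get_dummies(temp_bin, prefix="temp").astype(int),
                   pd.get_dummies(ws_bin, prefix="ws").astype(int)], axis=1)
    # Combined feature: binary flag for one rule (the severe_icing condition)
    f["rule_severe"] = ((f["temp"] <= 0.0) & (f["rh"] >= 90)
                        & f["ws"].between(2.5, 15.0)).astype(int)
    return f

# Hypothetical four-hour sample starting at 04:00
idx = pd.date_range("2024-01-15 04:00", periods=4, freq="1h")
raw = pd.DataFrame({"temp": [-1.0, -0.5, 1.0, 3.0],
                    "rh": [95.0, 92.0, 85.0, 60.0],
                    "ws": [5.0, 6.0, 7.0, 9.0]}, index=idx)
feats = build_features(raw)
```

The first row of each difference feature is NaN because there is no preceding hour; XGBoost handles missing values natively, so these rows do not need to be dropped before training.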

[0080] Feature storage: Feature data is stored in InfluxDB’s feature_store measurement, or exported as a Parquet file and stored in HDFS.

[0081] Interface: Provides a REST API, /api/v1/features, which receives a time period parameter and returns a feature matrix.

2.5 Machine Learning Modeling Module

[0082] Software stack: Python 3.9 + XGBoost 1.7 + Scikit-learn 1.2 + Optuna 3.1 + imbalanced-learn 0.10

Training process:

Load the data set from the feature library and split it into training and test sets (train_test_split).

[0083] Apply SMOTE (SMOTE(random_state=42)) to oversample the minority class of the training set.

[0084] Optimize the hyperparameters using Optuna, with the cross-validated F1 score as the objective function.

[0085] Train the optimal XGBoost model.

[0086] Plot the PR curve on the test set and calculate the threshold that maximizes F1.
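The threshold search in the last step can be sketched in plain NumPy: every distinct predicted probability is tried as a candidate threshold and the one with the highest F1 score is kept. The function name best_f1_threshold and the ten-sample toy data are assumptions for illustration.

```python
import numpy as np

def best_f1_threshold(y_true, y_score):
    """Scan every distinct score as a candidate threshold and return
    (threshold, precision, recall, f1) at the maximum F1."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    best = (0.5, 0.0, 0.0, -1.0)
    for thr in np.unique(y_score):
        pred = y_score >= thr
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        if tp == 0:
            continue                      # F1 is undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[3]:
            best = (float(thr), float(precision), float(recall), float(f1))
    return best

# Hypothetical labels and predicted icing probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
thr, precision, recall, f1 = best_f1_threshold(y_true, y_score)
```

On this toy data the best threshold is 0.35 rather than 0.5, because lowering the threshold recovers all four icing samples at the cost of only two false alarms.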

[0087] Model storage: The model is saved as xgboost_model.json, the threshold is saved as optimal_threshold.txt, and uploaded to MinIO object storage.

[0088] Interface: Provides a REST API, /api/v1/predict, which receives a JSON array of features and returns a list of icing probabilities.

2.6 Icing Analysis and Application Module

[0089] Software stack: Python 3.9 + Flask 2.2 + Plotly 5.10 + WeasyPrint (PDF generation)

Function:

Scheduled task: Call the model API daily to predict on the latest mesoscale data and generate an icing probability sequence.

[0090] Reduction factor calculation: The percentage of icing duration is calculated by year and by month.

[0091] Visualization: Generate time series plots of icing probability, feature importance plots, ROC curves, etc., and display them in Plotly interactive charts.

[0092] Report Export: Generates an icing analysis report in PDF format.
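The per-year and per-month reduction factor calculation can be sketched with a Pandas groupby. The function name reduction_by_period, the layout of the returned dictionary, and the 48-hour synthetic probability series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def reduction_by_period(proba: pd.Series, threshold: float = 0.405) -> dict:
    """proba: hourly icing probabilities indexed by timestamp.
    Returns the share of icing hours per year and per month."""
    icing = (proba >= threshold).astype(float)
    yearly = icing.groupby(proba.index.year).mean()
    monthly = icing.groupby([proba.index.year, proba.index.month]).mean()
    return {
        "yearly": {int(y): float(v) for y, v in yearly.items()},
        "monthly": {f"{y}-{m:02d}": float(v) for (y, m), v in monthly.items()},
    }

# Synthetic series: 24 hours on 2023-12-31 followed by 24 hours on 2024-01-01
idx = pd.date_range("2023-12-31 00:00", periods=48, freq="1h")
values = np.r_[np.repeat([0.9, 0.1], [6, 18]), np.repeat([0.5, 0.0], [12, 12])]
result = reduction_by_period(pd.Series(values, index=idx))
```

The resulting dictionary contains only plain Python types, so it can be serialized directly to JSON for a report endpoint.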

[0093] Interface: Provides a web interface (Grafana dashboard integration) and a REST API, /api/v1/reports/icing_loss, which returns the reduction coefficient in JSON.

3. System collaborative working mechanism

[0094] The system adopts an event-driven architecture, and the core collaboration process is as follows:

(1) Data acquisition: Wind tower data is pushed to the RabbitMQ raw_data_queue every 10 minutes; mesoscale data is downloaded by Airflow at 0:00 every day and pushed to the same queue.

[0095] (2) Data preprocessing: The data_preprocessing service listens to the queue. When new data arrives, processing is triggered. After processing, the data is stored in InfluxDB and published to the processed_data_queue.

[0096] (3) Rule labeling: The rule_labeling service listens to the processed_data_queue, applies rules to new data to generate weighted labels, and stores them in the training_labels table of PostgreSQL.

[0097] (4) Feature engineering: A daily scheduled task reads the most recent 30 days of data from InfluxDB, performs feature generation, and stores the result in the feature library.

[0098] (5) Model training: A retraining task is triggered every Sunday morning to load all historical data from the feature library and train a new model. If the new model outperforms the old model, it replaces the old one.

[0099] (6) Icing prediction: The application module calls the model API every hour to obtain the latest hourly icing probability and stores it in the icing_probability measurement of InfluxDB.

[0100] (7) Report generation: The icing analysis report for the previous month is automatically generated on the 1st of each month and emailed to the operations staff.

[0101] All modules are decoupled through Kafka (or RabbitMQ), service discovery uses Consul, and configuration management uses etcd, ensuring high availability and scalability of the system.

4. Specific configuration of electronic equipment

[0102] The present invention also provides an electronic device for performing the steps of the above-described method for accurate prediction of wind farm icing. Its specific configuration is as follows:

Processor model: Intel Xeon Silver 4210 (10 cores, 20 threads, 2.2GHz) or equivalent or higher.

[0103] Memory capacity: no less than 16GB DDR4.

[0104] Storage configuration: 500GB NVMe SSD (for operating system and applications) + 2TB HDD (for data caching).

[0105] Operating system: Ubuntu 20.04 LTS (64-bit) or Windows Server 2019.

[0106] Software environment: Docker 20.10 + Python 3.9 runtime environment, pre-installed with dependencies such as XGBoost and Scikit-learn.

[0107] Deployment methods: All modules are packaged as Docker images and deployed via Docker Compose or Kubernetes orchestration; alternatively, they can be run directly in a Python virtual environment and managed using systemd.

5. Specific implementation of computer-readable storage media

[0108] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method. The physical form of the storage medium includes, but is not limited to:

Physical media: USB 3.0 flash drive (capacity ≥ 64GB), SD memory card, portable solid-state drive (such as Samsung T7), CD-R / DVD-R disc.

[0109] Cloud storage services: Object storage (such as Alibaba Cloud OSS, Amazon S3, Tencent Cloud COS), with the storage path /icing-prediction/model/, containing model files, configuration files, and startup scripts.

[0110] File system format: ext4, NTFS or FAT32 compatible format, ensuring readability on different operating systems.

[0111] The program is stored as source code (Python files) or compiled bytecode (.pyc), accompanied by a requirements.txt file listing the versions of the dependent libraries. Deployment is simple: mount the media on the target device and run python main.py to start the prediction service.

[0112] Through the above-mentioned engineering deployment scheme, this invention realizes full-process automation from data collection to business application. Each module has a clear responsibility and well-defined interfaces, and features high availability, easy expansion, and replicability, providing stable and accurate technical support for wind farm icing prediction.

[0113] The present invention has been described in detail above through specific and preferred embodiments. However, those skilled in the art should understand that the present invention is not limited to the embodiments described above. Any modifications, equivalent substitutions, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for precise icing prediction of a wind farm based on an XGBoost model, characterized in that it includes the following steps:

S1. Multi-source data collection: First, collect the measured data of the wind measurement tower in the wind farm during a certain period, and then collect the mesoscale meteorological historical data of the location closest to the wind measurement tower during the same period. The measured data of the wind measurement tower include wind speed and wind direction at each height, and 10m temperature, 10m humidity and 10m air pressure; the mesoscale meteorological data include temperature, dew point temperature, wind speed and air pressure.

S2. Data preprocessing and fusion: Time alignment and missing value processing (outlier removal) are performed on the measured data from the wind tower and the mesoscale meteorological data. A linear regression model is then used to correct the systematic bias of the temperature and wind speed in the mesoscale meteorological data to obtain the corrected mesoscale temperature and wind speed. The data are further fused to obtain the fused corrected mesoscale temperature, dew point temperature, corrected mesoscale wind speed, air pressure, relative humidity, wind speed and direction at each wind tower height, 10m temperature, 10m humidity and 10m air pressure data.

S3. Feature engineering: Based on the fused data, construct a multi-dimensional feature set related to the icing mechanism with at least 6 categories and 18 features. The multi-dimensional feature set includes basic features, interaction features, difference features, time features, segmentation features, and combined features.

S4. Rule tagging and pseudo-label generation: Based on preset icing determination rules, icing / non-icing pseudo-labels are automatically generated for the fused data. The icing determination rules consist of a temperature threshold, a relative humidity threshold, a wind speed range threshold, and a duration threshold. A moment that meets the rules is given the pseudo-label "icing", and a moment that does not meet the rules is given the pseudo-label "non-icing".

S5. Confidence rating and sample weighting: Based on the degree to which each judgment condition in the rules is satisfied, or on the strength of the rule combination, the generated pseudo-labels are assigned confidence values. The pseudo-labels and their confidence values are used as the labels and sample weights of the training samples to reduce the impact of pseudo-label noise and improve the robustness of the model.

S6. Model training and optimization: A model is trained on the feature set using the XGBoost algorithm. The minority class samples in the training data are oversampled using SMOTE to handle the class imbalance problem. The optimal classification threshold is determined on the model test set using the PR curve.

S7. Calculation of icing reduction factor: Input the mesoscale sequence data to be evaluated into the trained model to obtain the hourly icing probability; use the above-mentioned optimal classification threshold to convert the probability sequence into an icing state sequence, and count the total duration and proportion of the icing state in the target time period. This proportion is the icing reduction factor for wind farm power generation assessment.

S8. Future data application: Repeat the above steps for mesoscale meteorological forecast data for the next few years to obtain the annual series of icing reduction coefficients for those years. Based on the statistical characteristics of this annual series, quantitatively analyze the interannual variation of icing intensity and apply it to the probabilistic assessment of wind farm power generation over the entire life cycle.

2. The method according to claim 1, wherein, in step S2, a linear regression model is used to correct systematic biases in the temperature and wind speed data in the mesoscale meteorological data, as follows:

Temperature correction: Taking the measured 10m temperature of the wind tower as the reference and the mesoscale temperature (T) data as the independent variable, a univariate linear regression model is established. The two correction coefficients (slope and intercept) are obtained through least squares fitting, and the corrected mesoscale temperature is calculated as the slope multiplied by the raw mesoscale temperature plus the intercept.

Wind speed correction: Taking the measured wind speed at the highest level of the meteorological tower as the reference and the mesoscale wind speed data as the independent variable, a univariate linear regression model is established. The two correction coefficients (slope and intercept) are obtained through fitting, and the corrected mesoscale wind speed is calculated as the slope multiplied by the raw mesoscale wind speed plus the intercept.

3. The method for accurate prediction of wind farm icing based on the XGBoost model according to claim 1, characterized in that the multidimensional feature set mentioned in step S3 specifically includes:

Basic features, including corrected mesoscale temperature, corrected mesoscale wind speed, and relative humidity;

Interaction features, including the product of temperature and relative humidity, the product of relative humidity and wind speed, and the product of temperature and wind speed;

Difference features, including the rate of change of temperature, relative humidity, and wind speed between adjacent time points;

Time features, including hour codes, day / night indicators, and season indicators;

Segmentation features, including one-hot encoded features obtained after dividing temperature, wind speed, and relative humidity into intervals;

Combined features, including a binary indicator feature generated from the preset icing determination rule and in-window cumulative / consecutive count features.

4. The method for accurate prediction of wind farm icing based on the XGBoost model according to claim 1, characterized in that the icing determination rule in step S4 is as follows:

(1) The mesoscale temperature is at or below a temperature threshold;

(2) The difference between the corrected mesoscale temperature and the dew point temperature is at or below a threshold, or the relative humidity is at or above a threshold;

(3) The corrected mesoscale wind speed lies within a wind speed range;

(4) The meteorological conditions in (1)-(3) above persist for at least a duration threshold;

where the thresholds are parameters set according to site characteristics.

If a continuous time window at a certain moment simultaneously meets all four of the above conditions, the corresponding moment and the moments within that window are given the pseudo-label "icing"; otherwise, they are given the pseudo-label "non-icing".

5. The method for accurate prediction of wind farm icing based on the XGBoost model according to claim 4, characterized in that the confidence level described in step S5 is divided into at least three levels (high confidence, medium confidence and low confidence) according to how strictly the conditions of the icing determination rule are met. Sample weights are assigned to the training samples through a linear or non-linear mapping of the confidence level and participate in the loss function calculation as sample weights during XGBoost training.

6. The method for accurate prediction of wind farm icing based on the XGBoost model according to claim 1, characterized in that determining the optimal classification threshold using the PR curve on the model test set in step S6 specifically includes: calculating precision and recall at different classification thresholds on the model test set and plotting the PR curve; and choosing the threshold that maximizes the F1 score, or the threshold corresponding to the precision-recall balance point, as the optimal classification threshold.

7. The method for accurate prediction of wind farm icing based on the XGBoost model according to claim 1, characterized in that, The meteorological data to be evaluated in step S7 is mesoscale meteorological forecast data for the future operation period of the wind farm.

8. A system for implementing the accurate prediction method for wind farm icing based on the XGBoost model as described in any one of claims 1-6, characterized in that it comprises:

A data acquisition and storage module, used to acquire and store measured data from the wind measurement tower and mesoscale meteorological data;

A data preprocessing module, used to perform data cleaning, time alignment, systematic bias correction, and data fusion;

A rule labeling module, used to automatically generate icing / non-icing pseudo-labels for the fused data based on preset icing determination rules, to assign confidence levels to the pseudo-labels according to the degree of rule satisfaction, and to output weighted training sets;

A feature engineering module, used to automatically extract and construct the multidimensional feature set from the fused data;

A machine learning modeling module, with a built-in XGBoost algorithm and SMOTE oversampling procedure, used to train, evaluate and optimize the icing prediction model and determine the optimal classification threshold;

An icing analysis and application module, used to load pre-trained models for batch prediction, generate icing probability time series, calculate icing reduction coefficients, and output visual analysis reports.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the method as described in any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by the processor, it implements the steps of the method as described in any one of claims 1 to 7.