Photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost
By combining wavelet analysis and adaptive Lasso feature selection with the IKOA algorithm to optimize the XGBoost model, the problem of fault diagnosis under high-dimensional data of photovoltaic systems is solved, and efficient and accurate fault identification and intelligent operation and maintenance of photovoltaic systems are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA YANGTZE POWER
- Filing Date
- 2025-01-03
- Publication Date
- 2026-06-26
AI Technical Summary
Existing machine learning methods are not very adaptable to the high-dimensional and dynamically changing data of photovoltaic systems, making it difficult to effectively identify faults. Furthermore, traditional algorithms suffer from high training complexity, overfitting risk, and improper handling of redundant features in photovoltaic systems.
Wavelet analysis was used to extract the time-frequency feature components of the waveform, and the adaptive Lasso method was used to screen key features. The parameters of the XGBoost model were optimized by the IKOA algorithm that integrates Logistic, Sine and Tent chaotic mappings. The ALasso-IKOA-XGBoost fault diagnosis model was constructed to improve the accuracy and robustness of photovoltaic system fault diagnosis.
It significantly improves the accuracy and efficiency of photovoltaic system fault diagnosis, reduces computational complexity, enhances the generalization ability of the model, reduces the cost of manual parameter tuning, and achieves efficient and intelligent operation and maintenance.
Description
Technical Field
[0001] This invention belongs to the field of photovoltaic system fault diagnosis, and specifically relates to a photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost.
Background Technology
[0002] With the transformation and upgrading of the global energy structure, renewable energy, as a clean and environmentally friendly form of green energy, is developing rapidly. Photovoltaic (PV) systems, as an important component of renewable energy, have become a key solution for addressing energy challenges and climate change. With the continuous advancement of PV technology, the installed capacity of PV systems is constantly increasing, making a significant contribution to sustainable energy supply. However, PV systems face problems such as unstable power generation efficiency and difficulties in operation and maintenance during actual operation, resulting in significant challenges to their reliability and performance. Among these, the PV array and inverter, as core components of the system, are particularly vulnerable; their failures directly lead to a decrease in system power generation efficiency, affecting the utilization of renewable energy. Therefore, establishing an automated fault diagnosis and intelligent operation and maintenance system for PV arrays and inverters is of great significance for ensuring the efficient and stable operation of PV systems.
[0003] A review of existing literature revealed that current traditional fault diagnosis methods for photovoltaic power generation systems mainly focus on signal monitoring, mathematical modeling, signal analysis, statistical analysis, and manual experience.
[0004] Reference [1]: "Intelligent Detection Method for Photovoltaic Array Faults Based on Support Vector Machine Algorithm" (Zhou Yunfeng, Liu Guangyu, Li Huajun, et al. Intelligent Detection Method for Photovoltaic Array Faults Based on Support Vector Machine Algorithm [J]. Manufacturing Automation, 2021, 43(06): 45-8.) proposes a photovoltaic array fault detection method based on SVM. By combining the external environment with the internal electrical characteristics of the photovoltaic system, this method achieves a good identification rate for each fault state. However, training an SVM requires solving a quadratic programming problem, which makes it difficult to handle the massive data generated by a photovoltaic system.
[0005] Reference [2]: "Research on Diagnosis and Performance Loss Assessment Model of IGBT Open Circuit Fault in Photovoltaic Grid-Connected Inverter" (Zhao Zilinglong. Research on Diagnosis and Performance Loss Assessment Model of IGBT Open Circuit Fault in Photovoltaic Grid-Connected Inverter [D]. Hefei University of Technology, 2021.) constructs a photovoltaic inverter fault diagnosis model based on SVM. This model uses the classification capability of SVM to diagnose and classify the feature quantities extracted from 22 types of IGBT open circuit faults. The diagnosis model is the same as in Reference [1]; only the diagnostic object differs. It therefore shares the same shortcoming: when dealing with 22 types of IGBT open circuit faults, the training complexity of SVM increases significantly.
[0006] Reference [3]: "Fault Diagnosis of Photovoltaic Systems Based on Principal Component Optimization Neural Network" (Han Yuchen. Fault Diagnosis of Photovoltaic Systems Based on Principal Component Optimization Neural Network [D]. Shanghai Dianji University, 2021.) constructs a GA-BP neural network model, which can effectively diagnose different open-circuit faults of NPC inverters. However, the BP neural network relies on gradient descent for parameter updates and is easily trapped in local optima. In addition, the BP neural network is a static model and has difficulty capturing temporal dynamics.
[0007] Overall, existing machine learning methods are poorly adapted to the high-dimensional and dynamically changing data of photovoltaic systems, and their recognition rate is affected by the degree of sample imbalance. Therefore, how to establish an intelligent diagnostic system capable of processing high-dimensional imbalanced photovoltaic data and performing automated fault identification is one of the key challenges in current research.
[0008] XGBoost integrates CART as the base classifier using gradient boosting, and has excellent classification prediction performance. Reference [4]: "Research on XGBoost-based photovoltaic array fault diagnosis method" (Liu Xingxing, Pazilai Mahmuti, Cheng Zhijiang et al. Research on XGBoost-based photovoltaic array fault diagnosis method [J]. Electronic Measurement Technology, 2023, 46(12): 8-14.) constructs a fault diagnosis model based on XGBoost. This model improves the fault classification accuracy while ensuring generalization performance by extracting fault features under different fault states of the photovoltaic array. Although XGBoost designs a feature importance evaluation mechanism, it does not specifically screen or reduce the dimensionality of redundant features. This reference relies on manual parameter tuning, which is time-consuming and may not be able to find the globally optimal parameter combination.
[0009] Reference [5]: "Anomaly Detection of Photovoltaic Power Generation Based on Improved VMD-XGBoost-BiLSTM Combined Model" (Zhao Bochao, Ma Jiajun, Cui Lei, et al. Anomaly Detection of Photovoltaic Power Generation Based on Improved VMD-XGBoost-BiLSTM Combined Model [J]. Computer Engineering, 2024, 50(03): 306-16.) shows that the XGBoost ensemble learning model can also effectively diagnose internal faults in photovoltaic power generation units. However, such models still face the risk of overfitting on small datasets, and the accuracy of fault diagnosis is affected by the model parameters. The high-dimensional monitoring data of photovoltaic systems contain redundant and noisy features, and the Lasso feature selection method can improve the diagnostic effect.
[0010] Reference [6]: "Variable Selection Method Based on Spatiotemporal Group Lasso and Hierarchical Bayesian Spatiotemporal Model" (Wang Ling, Kang Zihao. Variable Selection Method Based on Spatiotemporal Group Lasso and Hierarchical Bayesian Spatiotemporal Model [J]. Journal of Geoinformation Science, 2023, 25(07): 1312-24.) constructs a hierarchical Bayesian spatiotemporal group Lasso variable selection model. This model fully considers spatiotemporal correlation, selects variables through spatiotemporal group Lasso, and then uses the hierarchical Bayesian spatiotemporal model to verify the effect of variable selection. It accurately selects the subset of variables that has the greatest impact on the dependent variable, thereby improving the prediction effect. However, in a photovoltaic fault diagnosis system the independent variables are highly linearly related or correlated, which makes the coefficient estimates of the model unstable or difficult to interpret. Moreover, when Lasso faces highly correlated features, it usually selects only one of them and ignores the other related features.
Summary of the Invention
[0011] To address the shortcomings of the existing technical literature, this invention proposes a photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost. First, to fully explore fault characteristics and solve the multicollinearity problem, wavelet analysis is used to extract the time-frequency feature components of the waveform, and ALasso is employed to extract multi-class feature components from the fault data. Next, to address the slow convergence, poor robustness, and susceptibility to local optima of KOA, an IKOA algorithm integrating Logistic, Sine, and Tent chaotic mappings is proposed to generate a better initial solution distribution, better parameter settings, and better position update methods. Then, combined with XGBoost, an IKOA-XGBoost fault diagnosis model is built to improve the learning representation and classification accuracy of the photovoltaic fault diagnosis model under complex high-dimensional data conditions. This method effectively improves the diagnostic efficiency, accuracy, and intelligent operation and maintenance level of photovoltaic systems, ensuring reliable, efficient, and sustainable system operation.
[0012] The technical solution adopted in this invention is as follows:
[0013] A photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost includes the following steps:
[0014] Step 1: Extract the time-frequency feature components of the waveform through wavelet analysis, and use the adaptive Lasso method to extract multiple feature components of the fault data;
[0015] Step 2: Propose an IKOA algorithm that integrates Logistic, Sine, and Tent chaotic mappings to generate better initial solution distribution, parameter settings, and position update methods;
[0016] Step 3: Combine the IKOA algorithm from Step 2 with XGBoost to build an IKOA-XGBoost fault diagnosis model for diagnosing photovoltaic system faults.
[0017] In step 1, firstly, wavelet analysis is used to replace the infinitely long trigonometric function basis in the Fourier transform with finite-length, decaying wavelet basis functions. This extracts different frequency features of the signal, retains important information, removes noise, and improves signal quality and readability. Wavelet analysis extracts the time-frequency features of the three-phase inverter output current, which can be expressed as:
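[0018] $$WT_f(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt$$
[0019] Where: $WT_f(a,b)$ represents the wavelet transform result of signal $f(t)$; $f(t)$ is the current signal to be analyzed; $\psi^{*}(\cdot)$ is the complex conjugate of $\psi(\cdot)$; $a$ and $b$ are the scale and translation parameters of the wavelet basis function $\psi(t)$, respectively.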
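As a rough illustration of wavelet-based feature extraction, the sketch below applies a multi-level discrete Haar decomposition to a sampled waveform and uses the energy in each band as a feature. The Haar basis, the discrete (rather than continuous) transform, the test waveform, and the names `haar_step` and `wavelet_energy_features` are all assumptions made for brevity; the text does not specify the wavelet basis or the decomposition depth.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail

def wavelet_energy_features(signal, levels):
    """Energy of each detail band, followed by the final approximation band."""
    features = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        features.append(sum(d * d for d in detail))
    features.append(sum(a * a for a in approx))
    return features

# Hypothetical waveform: one sine cycle sampled at 64 points.
wave = [math.sin(2 * math.pi * n / 64) for n in range(64)]
feats = wavelet_energy_features(wave, levels=3)
```

Because the Haar transform used here is orthonormal, the band energies sum to the energy of the input signal, which gives a simple sanity check on the decomposition.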
[0020] Next, an adaptive Lasso method is proposed to address the irrationality caused by Lasso penalizing all coefficients the same way. Its principle is based on the Lasso method, assigning different weights to different penalty terms. The expression is shown below:
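[0021] $$\hat{\beta} = \arg\min_{\beta} \left\| Y - \sum_{j=1}^{p} X_j \beta_j \right\|^2 + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j|$$
[0022] In the formula: $\hat{\beta}$ represents the estimated regression coefficients; $\lambda$ is the adaptive Lasso penalty parameter; $\lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j|$ is the penalty term on the model parameters $\beta$, which causes the regression coefficients of the model to shrink toward zero;
[0023] $$\hat{w}_j = \frac{1}{|\hat{\beta}_j^{init}|^{\gamma}}, \quad \hat{\beta}^{init} = \hat{\beta}^{ols}, \quad j = 1, 2, \ldots, p$$
[0024] In the formula: $\hat{w}_j$ represents the adaptive weight, used to adjust the importance of the j-th regression coefficient in the penalty term; $j$ represents the feature index; $p$ represents the number of features; $\hat{\beta}^{init}$ indicates the initial estimate; $\hat{\beta}^{ols}$ represents the coefficient estimate obtained by the least squares method; $X_j = (X_{1j}, X_{2j}, \ldots, X_{nj})^T$ represents the j-th predictor variable, where $j = 1, 2, \ldots, p$.
[0025] $Y = (Y_1, Y_2, \ldots, Y_n)^T$ is the response variable; $Y_1, Y_2, \ldots, Y_n$ represent the true output values of the 1st to n-th samples, respectively; $T$ represents the transpose of the vector.
[0026] Its weight expression is as follows:
[0027] $$\hat{w} = (\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_p)^T = \frac{1}{|\hat{\beta}^{ols}|^{\gamma}}$$
[0028] In the formula: $\hat{w}$ represents the adaptive weight vector; $\hat{w}_1, \ldots, \hat{w}_p$ represent the weight estimates for the 1st to p-th features, respectively; $\gamma$ is the adaptive Lasso penalty parameter, $\gamma \geq 0$.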
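The effect of adaptive weights can be sketched in a special case. Assuming an orthonormal design matrix (X^T X = I, an assumption made here only so that a closed form exists), the adaptive Lasso reduces to soft-thresholding each least-squares coefficient with its own threshold. The function names and the numerical values below are hypothetical.

```python
def adaptive_weights(beta_init, gamma=1.0, eps=1e-12):
    """w_j = 1 / |beta_init_j|^gamma; eps guards against division by zero."""
    return [1.0 / (abs(b) + eps) ** gamma for b in beta_init]

def alasso_orthonormal(beta_ols, lam, gamma=1.0):
    """Closed-form adaptive Lasso solution when X^T X = I.

    Minimises sum_j (beta_j - beta_ols_j)^2 + lam * w_j * |beta_j|,
    which is soft-thresholding with threshold lam * w_j / 2.
    """
    weights = adaptive_weights(beta_ols, gamma)
    result = []
    for b, w in zip(beta_ols, weights):
        shrunk = max(abs(b) - lam * w / 2.0, 0.0)
        result.append(shrunk if b >= 0 else -shrunk)
    return result

# Hypothetical least-squares estimates for three features.
beta_ols = [2.0, -0.1, 0.5]
beta = alasso_orthonormal(beta_ols, lam=0.4, gamma=1.0)
```

In this example the large coefficient is barely shrunk, while the small one is set to zero: features with weak initial estimates receive a heavier penalty, which is the behavior that distinguishes the adaptive method from the plain Lasso.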
[0029] In step 1, the adaptive Lasso method is used to extract multi-class feature components from the fault data; specifically as follows:
[0030] The main types of photovoltaic (PV) module failures are module short circuit, module open circuit, partial shading, and panel aging and damage. The I-V curve characteristics of module failures are shown in Figure 7. For a short circuit, the branch voltage U, the maximum power point current Im and voltage Um, and the open-circuit voltage Uoc are selected as characteristic components. For an open circuit, the branch current I, the short-circuit current Isc, and the maximum power point current Im are selected as characteristic components. For aging, the maximum power point current Im and voltage Um are selected as characteristic quantities. For partial shading, the maximum power point current Im and voltage Um, the open-circuit voltage Uoc, and the number of local maximum power points are selected as characteristic quantities.
[0031] The open-circuit fault output characteristics of different types of inverters are shown in Figure 8. When a single transistor is open-circuited, the corresponding phase current loses half a cycle of its waveform, while harmonics are generated in the other phase currents, causing waveform distortion and a low output voltage. When two transistors in the same phase fail, that phase current is lost; when two transistors in different phases fail, two phases lose half a cycle of their waveforms and the waveforms of the other phases are distorted. Therefore, the three-phase output current is chosen as the characteristic quantity.
[0032] In step 2, the chaotic strategy mainly utilizes the properties of chaotic systems to generate a better search strategy and avoid getting trapped in local optima.
[0033] 1) Introduce a Logistic mapping to initialize the population, replacing completely random initialization while retaining the randomness of the initial value, so that the initial solutions are more evenly distributed;
[0034] $$L_{i,j} = \mu L_{i-1,j}\,(1 - L_{i-1,j}), \qquad X_i^j = X_{i,low}^j + L_{i,j}\,(X_{i,up}^j - X_{i,low}^j), \qquad i = 1, \ldots, N;\ j = 1, \ldots, d$$
[0035] In the formula: $L_{i,j}$ represents the Logistic mapping value of the j-th variable in the i-th iteration; $\mu$ represents the parameter of the Logistic mapping, with a value range of (0,4]; $L_{i-1,j}$ represents the Logistic mapping value of the j-th variable in the (i-1)-th iteration; $X_i^j$ represents the updated value of the j-th decision variable for the i-th planet; $X_{i,low}^j$ represents the lower bound of the j-th variable on the i-th planet; $L_{i,j}$, the random sequence value generated by the Logistic mapping, is used to map the variable between its upper and lower bounds; $X_{i,up}^j$ represents the upper bound of the j-th variable on the i-th planet; $i$ represents the index of the planet; $j$ represents the index of the decision variable; $N$ represents the number of candidate solutions in the search space; $d$ represents the dimension of the problem to be optimized.
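A minimal sketch of chaotic initialization follows, assuming the Logistic sequence simply replaces the uniform random number used to place each decision variable between its bounds. The function name, the fixed seed value, and the choice mu = 4 are illustrative assumptions; the bounds are the hyperparameter intervals stated later in the description.

```python
def logistic_population(n_planets, lower, upper, mu=4.0, seed=0.37):
    """Initialise candidate solutions with a Logistic chaotic sequence.

    seed must lie in (0, 1); for mu = 4 avoid values such as 0.25, 0.5
    and 0.75, which collapse onto a fixed point of the map.
    """
    population = []
    value = seed
    for _ in range(n_planets):
        planet = []
        for lo, hi in zip(lower, upper):
            value = mu * value * (1.0 - value)     # Logistic map step
            planet.append(lo + value * (hi - lo))  # map into [lo, hi]
        population.append(planet)
    return population

# Bounds: max depth, learning rate, number of trees, sampling ratio.
lower = [3, 0.001, 50, 0.5]
upper = [100, 0.1, 1000, 1.0]
population = logistic_population(5, lower, upper)
```

In practice the seed would itself be drawn at random, which keeps the initial value random while the rest of the sequence is generated deterministically.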
[0036] 2) Introduce a control factor based on the Logistic mapping into the gravitational force F to finely adjust the magnitude of planetary position changes, in order to balance exploration and exploitation;
[0037]
[0038] In the formula: r1 takes the value of the Logistic mapping sequence $L_t$; $L_t$ and $L_{t-1}$ represent the results of the Logistic mapping at the t-th and (t-1)-th iterations, respectively; $F_{gi}(t)$ represents the gravitational force exerted on the i-th planet (solution) during the t-th iteration; $e_i$ represents the control factor; $\mu(t)$ represents the gravitational constant, which usually decreases gradually as the iteration number t increases; the remaining quantities are the normalized mass of the Sun, the normalized mass of the i-th planet, and the normalized distance between the planet and the Sun; $\varepsilon$ represents a very small positive number.
[0039] 3) Introduce the Sine mapping to replace the random values in the orbital eccentricity $e_i$ and in the mass, while retaining the randomness of the initial value $S_1$. This increases the randomness and oscillation range of the parameter values, which helps to enhance population diversity and the stochasticity of the algorithm.
[0040]
[0041] In the formula: $S_i$ represents the Sine mapping value at the i-th iteration; $a$ represents a constant used to adjust the magnitude of the Sine mapping value; $S_{i-1}$ represents the Sine mapping value at the (i-1)-th iteration; $e_i$ indicates the orbital eccentricity;
[0042]
[0043] In the formula: r2 is equivalent to $S_t$, that is, the value generated by the Sine mapping; $S_t$ represents the random sequence value at the t-th iteration; $S_{t-1}$ represents the value of the Sine mapping in the previous iteration; $M_s$ represents the weighting factor related to the Sine mapping value $S_t$; $fit_s(t)$ represents the fitness value of the Sun, i.e., the current best solution; $fit_k(t)$ represents the fitness value of the k-th individual; $Worst(t)$ represents the fitness value of the worst individual in the current population; $N$ represents the size of the population.
[0044] 4) Introduce a perturbation term based on Tent mapping to fully shuffle the sequence as a global perturbation operator to enhance the algorithm's global search capability.
[0045]
[0046] In the formula: $T_t$ represents the output vector of the Tent mapping, i.e., the random sequence vector generated at the t-th iteration; $T_{t-1}$ represents the output of the Tent mapping at the (t-1)-th iteration; $\beta$ represents the piecewise parameter of the Tent mapping; $V_i(t)$ represents the velocity of planet i at time t; two system parameters adjust the degree of influence of the corresponding terms on the velocity; r4 represents a random factor; the position vectors involved are the current position of the i-th candidate solution, the current position of the optimal solution in the population, and the average position in the population; $R_{i-norm}(t)$ represents the normalized value of the i-th solution; an additional perturbation term increases the diversity of solutions and prevents the algorithm from becoming trapped in local optima; the upper and lower bounds of the i-th solution limit the search range; a control variable strengthens the influence of the search on the solution and reflects the global search capability; U2 represents the perturbation term used to adjust the scope of the global search; r3 represents a random factor used to adjust the weight.
[0047] The IKOA algorithm, which integrates Logistic, Sine, and Tent chaotic mappings, is as follows:
[0048] This invention employs three chaotic strategies to improve the KOA algorithm, thereby enhancing its global search capability and better balancing the relationship between exploration and exploitation.
[0049] The IKOA algorithm, which integrates Logistic, Sine and Tent chaotic mappings, includes equations (1) to (5);
[0050] First, the population is initialized using the Logistic mapping introduced by formula (1) instead of completely random initialization. At the same time, a control factor based on the Logistic mapping is introduced into the gravity F using formula (2) to finely adjust the magnitude of planetary position changes.
[0051] Secondly, Sine mapping is introduced using equations (3) and (4) to replace the random values in orbital eccentricity and mass, respectively;
[0052] Finally, a perturbation term based on the Tent mapping is introduced using equation (5) and used as a global perturbation operator to fully shuffle the sequence, thereby enhancing the algorithm's global search capability.
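For reference, the Sine and Tent maps can be sketched as below. The forms used, S_i = a·sin(π·S_{i-1}) and the two-branch Tent map with breakpoint β, are the commonly used textbook definitions and are assumptions here; how the resulting values enter the eccentricity, mass, and velocity updates is not reproduced. All function names and starting values are hypothetical.

```python
import math

def sine_map(s_prev, a=1.0):
    """Sine map: S_i = a * sin(pi * S_{i-1}), with 0 < a <= 1."""
    return a * math.sin(math.pi * s_prev)

def tent_map(t_prev, beta=0.7):
    """Tent map with breakpoint beta in (0, 1)."""
    if t_prev < beta:
        return t_prev / beta
    return (1.0 - t_prev) / (1.0 - beta)

def chaotic_sequence(step, start, length):
    """Iterate a one-dimensional map and collect the generated values."""
    seq, value = [], start
    for _ in range(length):
        value = step(value)
        seq.append(value)
    return seq

# Values that could stand in for uniform random numbers in [0, 1].
sine_values = chaotic_sequence(sine_map, 0.3, 5)
tent_values = chaotic_sequence(tent_map, 0.3, 5)
```

Both maps keep their output inside [0, 1] for inputs in [0, 1], so the sequences can be substituted wherever the baseline algorithm draws a uniform random number.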
[0053] In step 3, a photovoltaic system fault diagnosis model based on XGBoost is constructed:
[0054] First, the XGBoost proposed in this invention is based on CART, combining multiple weak CART learners through gradient boosting to construct a powerful prediction and classification model. During iterative training, each learner corrects the error from the previous round using gradient descent to minimize the loss function. Furthermore, the loss function is optimized using a second-order Taylor series to improve computational accuracy, and L1 and L2 regularization terms are introduced to limit the tree depth and number of nodes, incorporating the complexity of the tree model into the regularization term and penalizing tree structures with multiple leaf nodes. Additionally, sample shrinkage and feature subsampling are combined to avoid overfitting, simplify the model, and thus improve its generalization ability.
[0055] 1) Regularization objective function:
[0056] For a dataset $D = \{(x_i, y_i)\}$ with n samples and m features ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$): $x_i$ represents the input feature vector of the i-th sample; $y_i$ represents the true value of the i-th sample; $n$ represents the number of samples; $\mathbb{R}^m$ represents the m-dimensional feature space; $\mathbb{R}$ represents the range of values of the target $y_i$. The final prediction output of K CART trees is defined as:
[0057] $$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
[0058] In the formula: $\hat{y}_i$ represents the predicted value of the i-th sample; $\phi(x_i)$ represents the model's prediction function; $f_k(x_i)$ represents the predicted value of the k-th decision tree for sample $x_i$; $K$ represents the total number of CART decision trees used in the model;
[0059] $\mathcal{F} = \{f(x) = w_{q(x)}\}$ represents the space of CART regression trees; $f(x)$ represents the prediction function of a decision tree; $q$ represents the structure vector; $w$ represents the leaf weights; $\mathbb{R}^T$ represents the vector space consisting of T real numbers, with $w \in \mathbb{R}^T$. Each function $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$; $q$ maps a sample to the corresponding leaf index, and each leaf node of each tree carries a continuous score, i.e., a weight; the score of the i-th leaf node is $w_i$; $T$ represents the number of leaf nodes; $w_{q(x)}$, the score for sample $x$, is the model's predicted value.
[0060] For each sample, each CART classifies it into a leaf node according to different classification rules, and the final prediction result is obtained by accumulating the scores w of the corresponding leaf nodes.
[0061] The objective function $L(\phi)$ consists of a loss function $l$ and a regularization term $\Omega$:
[0062] The loss function measures the error between a prediction and the actual result, and training proceeds by minimizing this error. The loss function is specifically represented as...
[0063] The regularization term evaluates the complexity of the XGBoost model to avoid overfitting. It is shown in the following equation:
[0064] $$L(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$$
[0065] In the formula: $L(\phi)$ represents the objective function of the model; $l(y_i, \hat{y}_i)$ represents the loss function of a single sample; $\gamma$ represents the regularization parameter that controls the complexity of the tree structure; $\lambda$ represents the regularization parameter that controls the weights of the leaf nodes; $w$ represents the weight vector of the leaf nodes of the tree.
[0066] The additional regularization term $\Omega(f)$ is used to penalize the complexity of the XGBoost model, smooth the final learned weights, and avoid overfitting.
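[0067] 2) Gradient tree boosting:
[0068] The XGBoost model is trained additively. Let $\hat{y}_i^{(t)}$ represent the prediction for the i-th sample in the t-th iteration; $f_t$ is added to minimize the following objective:
[0069] $$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
[0070] In the formula: $L^{(t)}$ represents the objective function at the t-th iteration; $\hat{y}_i^{(t-1)}$ represents the predicted value of sample i in the (t-1)-th iteration; $f_t(x_i)$ represents the prediction value of the t-th tree for sample i; $\Omega(f_t)$ represents the regularization term for the t-th tree; $n$ represents the total number of samples.
[0071] To quickly optimize the objective function, a second-order Taylor expansion is performed on it, giving:
[0072] $$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
[0073] In the formula: $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first-order and second-order gradient statistics of the loss function; $l(y_i, \hat{y}_i^{(t-1)})$ represents the current loss value for sample i.
[0074] To simplify the objective of the t-th training iteration, the constant term is removed:
[0075] $$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
[0076] Define $I_j = \{i \mid q(x_i) = j\}$ as the sample set of leaf j, where $q(x_i)$ is the structure function of the decision tree that assigns the input sample $x_i$ to leaf node j, and expand $\Omega$:
[0077] $$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
[0078] In the formula: $w_j$ represents the prediction weight of the j-th leaf node; $I_j$ represents the set of samples at leaf node j; $i$ represents the index of the sample.
[0079] For a fixed structure $q(x)$, the optimal weight $w_j^{*}$ of leaf j is defined as follows:
[0080] $$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
[0081] The corresponding optimal value is used as a scoring function to measure the quality of the tree structure q, and is defined as follows:
[0082] $$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
[0083] In the formula: $\tilde{L}^{(t)}(q)$ is the score function representing the quality of the tree structure $q(x)$ in the t-th iteration.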
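The leaf weight and structure score can be checked numerically with a short sketch. It assumes the standard XGBoost closed forms: the optimal weight of a leaf is minus the sum of first-order gradients divided by the sum of second-order gradients plus λ, and the structure score is minus one half of the summed squared-gradient ratios plus γ times the number of leaves. The function names and the gradient values are hypothetical.

```python
def leaf_weight(grads, hess, lam):
    """Optimal leaf weight: -(sum of g_i) / (sum of h_i + lambda)."""
    return -sum(grads) / (sum(hess) + lam)

def structure_score(leaves, lam, gamma):
    """Score of a fixed tree structure; lower is better.

    leaves is a list of (grads, hess) pairs, one pair per leaf.
    """
    total = 0.0
    for grads, hess in leaves:
        g_sum, h_sum = sum(grads), sum(hess)
        total += g_sum * g_sum / (h_sum + lam)
    return -0.5 * total + gamma * len(leaves)

# Hypothetical statistics, e.g. squared-error loss where every h_i = 1.
left = ([-2.0, -1.0], [1.0, 1.0])
right = ([3.0], [1.0])
w_left = leaf_weight(left[0], left[1], lam=1.0)
score = structure_score([left, right], lam=1.0, gamma=0.5)
```

Comparing the score of a split tree with the score of the unsplit leaf gives the gain used to decide whether a split is worthwhile.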
[0084] 3) XGBoost employs an approximate greedy algorithm to find the optimal split point, balancing computational accuracy against model complexity. First, candidate split points are determined based on the percentiles of the feature distribution. Then, continuous features are mapped to the buckets formed by these candidate points, and the gradient statistics are aggregated within each bucket.
[0085] Typically, percentiles of a feature are used so that candidates are evenly distributed over the data. Let the multiset $D_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \ldots, (x_{nk}, h_n)\}$ represent the k-th feature values and second-order gradient statistics of each training sample, where $x_{ik}$ represents the value of the i-th sample on feature k and $h_i$ represents the second-order gradient of the i-th sample.
[0086] Define a rank function $r_k$ as:
[0087] $$r_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k,\ x < z} h$$
[0088] In the formula: $r_k(z)$ represents the weighted proportion of samples whose feature k values are less than z; $x$ represents the feature value of the sample; $h$ represents the second-order gradient of the sample; $z$ represents the value of the current candidate split point; $(x, h) \in D_k$ indicates that sample $(x, h)$ belongs to the sample set corresponding to the k-th feature.
[0089] $r_k(z)$ indicates the proportion of instances whose value of feature k is less than z. The goal is to find candidate split points $\{s_{k1}, s_{k2}, \ldots, s_{kl}\}$,
[0090] where $s_{k1}, s_{k2}, \ldots, s_{kl}$ represent the first to l-th candidate split points of feature k, respectively,
[0091] such that:
[0092] $$\left| r_k(s_{k,j}) - r_k(s_{k,j+1}) \right| < \varepsilon, \qquad s_{k1} = \min_i x_{ik}, \qquad s_{kl} = \max_i x_{ik}$$
[0093] In the formula: $r_k(s_{k,j})$ indicates the sample proportion to the left of candidate split point $s_{k,j}$ of feature k; $r_k(s_{k,j+1})$ indicates the sample proportion to the left of candidate split point $s_{k,j+1}$ of feature k; $s_{k1}$ represents the first candidate split point of feature k; $s_{kl}$ represents the l-th candidate split point of feature k; $x_{ik}$ represents the value of the i-th sample on the k-th feature.
[0094] $\varepsilon$ represents an approximation factor. Intuitively, the number of candidate split points is inversely proportional to $\varepsilon$, meaning there are approximately $1/\varepsilon$ candidate points. Here, each data point is weighted by $h_i$. The reason is that the formula
[0095] $$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
[0096] can be rewritten as a weighted squared loss with labels $-g_i / h_i$ and weights $h_i$:
[0097] $$\tilde{L}^{(t)} = \sum_{i=1}^{n} \frac{1}{2} h_i \left( f_t(x_i) + \frac{g_i}{h_i} \right)^2 + \Omega(f_t) + constant$$
[0098] In the formula: $\tilde{L}^{(t)}$ represents the objective function in the t-th iteration; $f_t(x_i)$ represents the predicted value of the t-th tree for sample $x_i$; $\Omega(f_t)$ represents the regularization term for the t-th tree; $constant$ represents the constant terms that do not depend on $f_t(x_i)$.
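The weighted rank function and the ε condition on candidate split points can be sketched as follows. The greedy selection rule in `candidate_splits` is an illustrative assumption and not the quantile sketch data structure used by XGBoost itself; it keeps a value as a candidate whenever skipping it would open a rank gap of ε or more. Sample values and function names are hypothetical.

```python
def rank_fraction(samples, z):
    """Share of the total second-order gradient weight held by samples with x < z."""
    total = sum(h for _, h in samples)
    return sum(h for x, h in samples if x < z) / total

def candidate_splits(samples, eps):
    """Pick split candidates from (x, h) pairs for one feature.

    Adjacent candidates differ in rank by less than eps unless a single
    feature value alone carries a weight share of eps or more.
    """
    values = sorted({x for x, _ in samples})
    candidates = [values[0]]
    for i in range(1, len(values)):
        if i == len(values) - 1:
            candidates.append(values[i])  # always keep the maximum
        elif rank_fraction(samples, values[i + 1]) - rank_fraction(samples, candidates[-1]) >= eps:
            candidates.append(values[i])
    return candidates

# Eight samples with equal second-order gradients (hypothetical data).
samples = [(float(x), 1.0) for x in range(1, 9)]
splits = candidate_splits(samples, eps=0.3)
```

With equal weights this reduces to ordinary quantiles; with unequal weights, regions where the second-order gradients are large receive more candidate points.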
[0099] Step 3 involves optimizing key parameters in the XGBoost-based photovoltaic system fault diagnosis model using the IKOA algorithm, including the following steps:
[0100] 1) Collect fault data of photovoltaic modules and inverters, and perform preprocessing such as cleaning and screening; then, use wavelet transform to extract time-frequency domain features, and use the adaptive Lasso method to reduce dimensions, select key features, and divide them into test set and training set.
[0101] 2) Initialize the hyperparameters and optimization range of the XGBoost model, set the number of iterations for the IKOA algorithm, and generate a population of a given size using the IKOA algorithm.
[0102] This invention uses the IKOA algorithm to optimize seven hyperparameters of the XGBoost model, including the maximum tree depth, the learning rate, the number of weak learners (trees), and the sampling ratio used when training each tree. The optimization interval is [3, 100] for the maximum tree depth, [0.001, 0.1] for the learning rate, [50, 1000] for the number of weak learners (trees), and [0.5, 1] for the sampling ratio used when training each tree. The number of iterations is set to 200 generations. The mathematical model for generating the population is given by equation (1):
[0103]
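One way to connect a planet position to a model configuration is sketched below. It assumes each planet is stored as a vector in [0, 1]^4 and decoded into the four intervals stated above; that encoding, the `decode` helper, and the mapping of the four quantities onto the parameter names `max_depth`, `learning_rate`, `n_estimators`, and `subsample` are assumptions made for illustration.

```python
# Search intervals taken from the description above.
BOUNDS = {
    "max_depth": (3, 100),
    "learning_rate": (0.001, 0.1),
    "n_estimators": (50, 1000),
    "subsample": (0.5, 1.0),
}
INTEGER_PARAMS = {"max_depth", "n_estimators"}

def decode(position):
    """Map a position in [0, 1]^4 to a dictionary of hyperparameters."""
    params = {}
    for u, (name, (lo, hi)) in zip(position, BOUNDS.items()):
        u = min(max(u, 0.0), 1.0)  # clip to the unit interval
        value = lo + u * (hi - lo)
        params[name] = int(round(value)) if name in INTEGER_PARAMS else value
    return params

params = decode([0.25, 0.0, 1.0, 0.62])
```

Rounding is applied only to the integer-valued hyperparameters, so the optimizer can still move continuously in its own search space.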
[0104] 3) During each iteration, the fitness of each candidate model is obtained by 5-fold cross-validation on the dataset; then, the elitist strategy is executed to find and record the Sun, i.e., the optimal solution; the details are as follows:
[0105] The elitist strategy ensures that the planets and the Sun retain the best positions found so far. This is illustrated in the following formula:
[0106]
[0107] In the formula: the quantities involved are the position of the i-th planet in the (t+1)-th iteration after the update, the candidate position of the i-th planet in the (t+1)-th iteration, and the current position of the i-th planet, together with the fitness value corresponding to a position; t represents the number of iterations.
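The fitness evaluation and elitist selection can be sketched as below. The sketch assumes that fitness is a mean validation error over 5 folds and that lower values are better; `evaluate_fold` is a hypothetical callback that would train a model with the hyperparameters encoded in a position and return its validation error.

```python
def kfold_indices(n_samples, k=5):
    """Split sample indices into k interleaved validation folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]

def cv_fitness(position, evaluate_fold, n_samples, k=5):
    """Mean validation error over k folds (lower is better)."""
    errors = []
    for valid_idx in kfold_indices(n_samples, k):
        held_out = set(valid_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        errors.append(evaluate_fold(position, train_idx, valid_idx))
    return sum(errors) / k

def elitist_update(current, candidate, fitness):
    """Accept the candidate position only if it is at least as fit."""
    return candidate if fitness(candidate) <= fitness(current) else current

def find_sun(population, fitness):
    """The Sun is the planet with the best (lowest) fitness."""
    return min(population, key=fitness)

# Toy fitness used only to exercise the selection logic.
toy_fitness = lambda p: sum(x * x for x in p)
kept = elitist_update([1.0, 1.0], [2.0, 2.0], toy_fitness)
sun = find_sun([[1.0, 1.0], [0.5, 0.5], [2.0, 0.0]], toy_fitness)
```

If the fitness were an accuracy rather than an error, the comparison in `elitist_update` and the `min` in `find_sun` would be reversed.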
[0108] 4) Calculate the planet-related parameters in the IKOA algorithm and update the planet's position and distance from the sun; repeat step 3) until the maximum number of iterations is reached;
[0109] Update the planet's position using the following formula:
[0110]
[0111] In the formula: the quantities involved are the scaling factor; the velocity of the i-th planet in the t-th iteration; $F_{gi}(t)$, the gravitational force acting on the i-th planet; $|r|$, a random number; a unit vector; and the central location of the system.
[0112] Update the distance to the sun using the following formula:
[0113]
[0114] In the formula: the quantities involved are the positions of the a-th and b-th reference planets and the adaptive factor controlling the distance between the Sun and the current planet at time t; $\eta = (a_2 - 1) \times r_4 + 1$ represents a linearly decreasing factor from 1 to -2; $a_2$ represents a cyclic control parameter that gradually decreases from -1 to -2 over T iterations throughout the optimization process.
[0115] 5) Use the parameters of the Sun obtained in this way, i.e., the best solution found, as the optimal parameters for the XGBoost model;
[0116] The optimal parameters are as follows: the maximum tree depth is 50, the learning rate is 0.0959, the number of weak learners (trees) is 174, and the sampling ratio when training each tree is 0.81.
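[0117] The diagnostic performance is evaluated on the test set using the following metrics:
[0118] The metrics are accuracy, recall, precision, and F1 score, expressed by the following formulas:
[0119] $$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},\qquad \mathrm{Precision}=\frac{TP}{TP+FP},\qquad F1=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
[0120] In the formula: TP and TN denote the numbers of correctly predicted positive and negative samples, respectively, and FP and FN denote the numbers of samples incorrectly predicted as positive and as negative, respectively.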
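The four metrics can be computed from the confusion counts as in the sketch below. The function name and the example counts are hypothetical and are not results from the patent; for a multi-class fault problem the counts would typically be taken per class in a one-vs-rest fashion.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, precision and F1 from confusion counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=85, fp=10, fn=15)
```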
[0121] This invention provides a fault diagnosis method for photovoltaic systems based on ALasso-IKOA-XGBoost, with the following technical advantages:
[0122] 1) The main advantages of step 1 of the present invention are summarized as follows:
[0123] a. Precise feature selection: The ALasso method adaptively selects the most critical fault features, avoiding interference from irrelevant features.
[0124] b. Improve diagnostic efficiency: The reduction in feature dimensions significantly reduces the computational complexity of subsequent diagnostic models.
[0125] c. Enhanced diagnostic accuracy and robustness: By focusing on fault-related features, the model's generalization ability is improved.
Furthermore, compared with the traditional Lasso method, ALasso offers the following innovations in feature selection:
[0126] a. ALasso introduces an adaptive weight for each feature, dynamically adjusting the strength of regularization based on the importance of the feature, thus avoiding the limitation of traditional Lasso imposing the same constraints on all features.
[0127] b. By adjusting the weights, ALasso can avoid missing key features while reducing interference from irrelevant features.
[0128] c. Photovoltaic system data often contain multiple failure modes, and adaptive weights can better address the differences in feature importance under different failure modes.
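The adaptive weighting idea in the three points above can be illustrated with a short sketch. It assumes the standard adaptive Lasso weight form, in which each weight is the inverse of the absolute initial coefficient estimate raised to a power gamma; the function names, the small `eps` guard against division by zero, and the example coefficients are hypothetical.

```python
from typing import List, Sequence


def adaptive_weights(beta_init: Sequence[float], gamma: float = 1.0, eps: float = 1e-8) -> List[float]:
    """Weight each feature by 1 / |initial estimate|^gamma.

    Features with large initial coefficients get small weights (weak penalty);
    features with near-zero coefficients get large weights (strong penalty).
    """
    return [1.0 / (abs(b) + eps) ** gamma for b in beta_init]


def adaptive_penalty(beta: Sequence[float], weights: Sequence[float], lam: float) -> float:
    """Weighted L1 penalty: lam * sum_j w_j * |beta_j|."""
    return lam * sum(w * abs(b) for w, b in zip(weights, beta))


# Hypothetical initial estimates, e.g. from a least-squares fit.
beta_init = [2.0, 0.5, 0.01]
weights = adaptive_weights(beta_init, gamma=1.0)
penalty = adaptive_penalty(beta_init, weights, lam=1.0)
```

In contrast, a plain Lasso corresponds to all weights being equal to 1, so every coefficient is shrunk with the same strength.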
[0129] 2) In step 2 of this invention, IKOA improves upon the traditional Kepler optimization algorithm, and its advantages are summarized as follows:
[0130] a. Enhance randomness: By introducing chaotic optimization strategies such as Logistic mapping and Tent mapping, the algorithm becomes more random in the initial population generation and iterative update process, thus avoiding premature convergence.
[0131] b. Enhanced exploration diversity: The improved randomization mechanism increases the diversity of candidate solutions, ensuring that the algorithm can fully cover the search space.
[0132] c. High stability: When faced with noise or data anomalies that may exist in photovoltaic systems, IKOA's chaos mechanism can improve the robustness of parameter optimization.
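For orientation, the three chaotic maps named in this method can be written in their common textbook forms as below. The parameter values (`mu`, `alpha`, `a`), the seed value 0.37, and the helper names are assumptions; this section does not state the exact parameterization used in IKOA.

```python
import math
from typing import Callable, List


def logistic_map(x: float, mu: float = 4.0) -> float:
    """Logistic map x -> mu * x * (1 - x)."""
    return mu * x * (1.0 - x)


def tent_map(x: float, alpha: float = 0.7) -> float:
    """Tent map with breakpoint alpha (0.5 is avoided: it degenerates in floating point)."""
    return x / alpha if x < alpha else (1.0 - x) / (1.0 - alpha)


def sine_map(x: float, a: float = 1.0) -> float:
    """Sine map x -> a * sin(pi * x)."""
    return a * math.sin(math.pi * x)


def chaotic_sequence(step: Callable[[float], float], x0: float, n: int) -> List[float]:
    """Iterate a one-dimensional map n times starting from x0."""
    values, x = [], x0
    for _ in range(n):
        x = step(x)
        values.append(x)
    return values


def scale_to_bounds(u: float, low: float, high: float) -> float:
    """Map a chaotic value u in [0, 1] onto the search interval [low, high]."""
    return low + u * (high - low)


# Example: chaotic initial values for a maximum-tree-depth variable in [3, 100].
depth_seeds = [scale_to_bounds(u, 3, 100) for u in chaotic_sequence(logistic_map, 0.37, 10)]
```

Such sequences are deterministic yet non-repeating over short horizons, which is why they are often used in place of uniform random numbers for population initialization and perturbation.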
[0133] 3) In step 3 of this invention, the XGBoost algorithm is first used for photovoltaic system fault diagnosis:
a. Regularization (L1 and L2) effectively suppresses overfitting, so that the model performs more consistently across the training and test sets.
b. XGBoost uses an approximate greedy algorithm to search for split points efficiently; combined with the optimization in step 2, this further improves model training speed.
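How L2 regularization restrains a tree can be seen from the closed-form leaf weight used in gradient-boosted trees, w* = -G / (H + λ), where G and H are the sums of first- and second-order gradients in a leaf. The sketch below shows only this L2 effect (the L1 term is not modeled); the function names and the example gradients, which correspond to a squared-error loss with residuals 1, 2, and 3, are illustrative.

```python
from typing import Sequence


def optimal_leaf_weight(grads: Sequence[float], hessians: Sequence[float], lam: float = 1.0) -> float:
    """Closed-form leaf weight w* = -G / (H + lam)."""
    g_sum, h_sum = sum(grads), sum(hessians)
    return -g_sum / (h_sum + lam)


def leaf_score(grads: Sequence[float], hessians: Sequence[float], lam: float = 1.0) -> float:
    """Contribution of one leaf to the structure score: -0.5 * G^2 / (H + lam)."""
    g_sum, h_sum = sum(grads), sum(hessians)
    return -0.5 * g_sum * g_sum / (h_sum + lam)


# Squared-error loss 0.5 * (pred - y)^2 gives gradient (pred - y) and hessian 1.
grads = [-1.0, -2.0, -3.0]
hessians = [1.0, 1.0, 1.0]

w_unregularized = optimal_leaf_weight(grads, hessians, lam=0.0)  # plain mean residual
w_regularized = optimal_leaf_weight(grads, hessians, lam=1.0)    # shrunk toward zero
```

A larger λ pulls every leaf weight toward zero, so a single tree cannot fit a small group of noisy samples too aggressively.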
[0134] Secondly, optimizing the hyperparameters of the XGBoost model with IKOA significantly improves model performance. The optimized model has the following advantages:
a. Precise parameter optimization: Traditional parameter search methods (such as grid search and random search) are often inefficient, while IKOA can quickly converge to the optimal parameter combination.
b. Improved diagnostic accuracy: By optimizing the key parameters of XGBoost, the model's ability to learn photovoltaic system fault characteristics is enhanced, thereby improving the accuracy of diagnostic results.
c. Reduced hyperparameter tuning costs: The automated optimization process greatly reduces the manual cost of parameter tuning and avoids tedious manual adjustment.
Finally, the innovation of step 3 lies in its synergy with the first two steps. Step 1 selects the important features and step 2 optimizes the parameters, ensuring that both the input data and the hyperparameters of the XGBoost model are in optimal condition. Building on the first two steps, a complete and efficient photovoltaic system fault diagnosis process is constructed, significantly improving the accuracy and efficiency of diagnosis.
Attached Figure Description
[0135] The present invention is further described below with reference to the accompanying drawings and examples:
[0136] Figure 1 is a flowchart of the overall photovoltaic system fault diagnosis process based on ALasso-IKOA-XGBoost proposed in this invention.
[0137] Figure 2 is a characteristic diagram of the fault I-V curves of the photovoltaic module proposed in this invention.
[0138] Figure 3 is a diagram of the open-circuit fault output characteristics of the inverter proposed in this invention.
[0139] Figure 4 shows the fault diagnosis results for the photovoltaic module proposed in this invention.
[0140] Figure 5 shows the fault diagnosis results for the inverter proposed in this invention.
[0141] Figure 6 is a schematic diagram of the overall evaluation metrics of the different diagnostic models proposed in this invention.
[0142] Figure 7 is a schematic diagram of the I-V curve characteristics under module faults.
[0143] Figure 8 illustrates the output characteristics of different types of inverters under open-circuit faults.
Detailed Implementation
[0144] The present invention will be further described in detail below with reference to the embodiments and accompanying drawings, but the embodiments of the present invention are not limited thereto.
[0145] Figure 1 is a flowchart of the overall photovoltaic system fault diagnosis process based on ALasso-IKOA-XGBoost proposed in this invention. First, to fully exploit the fault characteristics and address multicollinearity, wavelet analysis is used to extract the time-frequency feature components of the waveform, and ALasso is used to extract multi-class feature components of the fault data. Next, to address the slow convergence, poor robustness, and susceptibility to local optima of KOA, an IKOA algorithm integrating Logistic, Sine, and Tent chaotic mappings is proposed to generate a better initial solution distribution, parameter settings, and position update method. Then, the IKOA algorithm is combined with XGBoost to build an IKOA-XGBoost fault diagnosis model, improving the representation learning and classification accuracy of the photovoltaic fault diagnosis model under complex, high-dimensional data conditions.
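The wavelet feature-extraction step can be illustrated with a deliberately simple stand-in. The sketch below uses a discrete Haar wavelet decomposition and takes the energy of each band as a feature; the Haar basis, the three decomposition levels, the 64-sample test signals, and the function names are all assumptions chosen for brevity, and the method itself may use a different wavelet basis or a continuous transform.

```python
import math
from typing import List, Sequence, Tuple


def haar_dwt(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def band_energies(signal: Sequence[float], levels: int) -> List[float]:
    """Energy of each detail band (fine to coarse) plus the final approximation."""
    energies: List[float] = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in approx))
    return energies


n = 64
# One cycle of a healthy phase current, and the same cycle with the negative
# half-wave missing (a crude picture of a single-transistor open circuit).
healthy = [math.sin(2.0 * math.pi * k / n) for k in range(n)]
faulty = [max(v, 0.0) for v in healthy]

healthy_features = band_energies(healthy, levels=3)
faulty_features = band_energies(faulty, levels=3)
```

Because the Haar transform is orthonormal, the band energies sum to the signal energy, so the two feature vectors show how the fault redistributes and reduces energy across bands.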
[0146] Figure 2 shows the fault I-V curve characteristics of the photovoltaic module. Photovoltaic module faults mainly include four types: module short circuit, module open circuit, partial shading, and panel aging or damage. It can be observed that during a short circuit, the faulty branch current remains positive, the branch voltage decreases, and the volt-ampere characteristic curve shows multiple local maximum power points. After a short circuit, Isc hardly changes, while Uoc and the maximum power point change more significantly. When the faulty branch is open-circuited, its output current is 0, the parallel branch current decreases, and the total output current decreases. During aging, the volt-ampere characteristics change: the more severe the aging, the greater the decrease in the maximum power point, while Uoc and Isc change little. Shading may lead to multi-peak, multi-knee behavior, i.e., multiple local maximum power points, and the global maximum power point decreases as the shading intensity increases. Under shading, the open-circuit voltage also changes slightly, while Isc changes little.
[0147] Figure 3 shows the output characteristics of the inverter under open-circuit faults. It can be observed that when a single transistor is open-circuited, the affected phase current loses half a cycle of its waveform, while the other phase currents suffer harmonic distortion, resulting in a lower output voltage. When two transistors in the same phase fail, that phase current is lost; when two transistors in different phases fail, both phases lose half a cycle of their waveforms, and the waveforms of the other phases are distorted.
[0148] Figure 4 shows the fault diagnosis results for the photovoltaic module. As can be seen from Figure 4, under the aging and shading conditions set in this invention, some aging samples are misdiagnosed as normal or shading, some shading samples are misdiagnosed as aging or normal, and some short-circuit and open-circuit samples are also misdiagnosed as shading.
[0149] Figure 5 shows the fault diagnosis results for the inverter. As can be seen from Figure 5, there are no misdiagnoses between major fault categories, but misdiagnoses occur among minor fault categories. This is because the waveforms of the major fault categories differ significantly, whereas the extracted frequency-domain signals may not differ significantly among the minor fault categories.
[0150] Figure 6 shows the overall evaluation metrics of the different diagnostic models. As can be seen from Figure 6, the precision, recall, and F1-score of the model proposed in this invention reach 0.97, 0.96, and 0.96, respectively, remaining the best in the algorithm comparison. Specifically, the proposed method has the highest average accuracy, 96.15%, indicating that it performs best in overall prediction accuracy. Recall is the proportion of all positive samples that the model predicts correctly; precision is the proportion of samples predicted as positive that are truly positive. The proposed method also performs best on these metrics, meaning it is the most effective at detecting true positive samples.
[0151] Table 1. Comparison and Analysis of Fault Diagnosis Accuracy of Different Diagnostic Models
[0152]
[0153]
[0154] Table 1 compares the fault diagnosis accuracy of the different diagnostic models. As shown in Table 1, the ALasso-IKOA-XGBoost model proposed in this invention has the highest fault diagnosis accuracy, reaching 96.25% and 96.13%, respectively. Compared with KOA, it is 4.12% and 3.45% higher, respectively, which verifies the effectiveness of the multi-chaotic strategy optimization of KOA in this invention. Furthermore, compared with the commonly used IGWO [14], it is also 2.99% and 0.89% higher, respectively, indicating that the proposed method effectively improves the global search capability of KOA, thereby significantly improving the accuracy of fault diagnosis.
Claims
1. A photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost, characterized in that it includes the following steps:
Step 1: Extract the time-frequency feature components of the waveform through wavelet analysis, and use the adaptive Lasso method to extract multi-class feature components of the fault data;
Step 2: Propose an IKOA algorithm that integrates Logistic, Sine, and Tent chaotic mappings to generate a better initial solution distribution, parameter settings, and position update method;
Step 3: Combine the IKOA algorithm from Step 2 with XGBoost to build the IKOA-XGBoost fault diagnosis model;
In step 1, wavelet analysis is first used to extract the time-frequency characteristics of the three-phase inverter output current, which can be expressed as:
$$W_f(a,b)=\frac{1}{\sqrt{a}}\int_{-\infty}^{+\infty}f(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right)\mathrm{d}t$$;
In the formula: $W_f(a,b)$ denotes the wavelet transform of the signal $f(t)$; $f(t)$ is the current signal to be analyzed; $\psi^{*}$ is the conjugate of $\psi$; $a$ and $b$ are the scaling and translation parameters of the wavelet basis function;
Next, an adaptive Lasso method is proposed. It builds on the Lasso method by assigning different weights to different penalty terms, as expressed below:
$$\hat{\beta}=\arg\min_{\beta}\ \lVert y-X\beta\rVert_2^2+\lambda\sum_{j=1}^{p}\hat{w}_j\,\lvert\beta_j\rvert$$;
In the formula: $\hat{\beta}$ denotes the estimated regression coefficients; $\lambda$ is the adaptive Lasso penalty weight; the penalty term on the model parameters $\beta$ causes the regression coefficients of the model to shrink toward zero; [formula];
In the formula: $\hat{w}_j$ denotes the adaptive weight, used to adjust the importance of the j-th regression coefficient in the penalty term; $j$ is the feature index; $p$ is the number of features; the initial estimate is the coefficient estimate obtained by the least squares method; $X$ denotes the predictor variables; $y=(y_1,\dots,y_n)^{T}$ denotes the response variable, where $y_1$ to $y_n$ are the actual output values of the 1st to n-th samples and $T$ denotes vector transposition;
The weight expression is as follows: [formula];
In the formula: the terms denote the adaptive weight matrix; the weight estimates assigned to the 1st to p-th features; and the adaptive Lasso penalty parameter, which is greater than 0.
2. The photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost according to claim 1, characterized in that: An adaptive Lasso method is used to extract multi-class feature components from fault data; the details are as follows: There are four types of photovoltaic module failures: module short circuit, module open circuit, partial shading, and panel aging and damage. During a short circuit, the characteristic components are branch voltage U, maximum power point current Im, voltage Um, and open circuit voltage Uoc. During an open circuit, the characteristic components are branch current I, short circuit current Isc, and maximum power point current Im. During aging, the characteristic components are maximum power point current Im and voltage Um. During partial shading, the characteristic components are maximum power point current Im, voltage Um, open circuit voltage Uoc, and the number of local maximum power points. In the open-circuit fault output characteristics of different types of inverters, when a single transistor is open-circuited, the phase current will lose half a cycle of waveform, while the currents of other phases will generate harmonics, resulting in waveform distortion and a low output voltage. When two transistors in the same phase are faulty, it will lead to the loss of phase current. When two transistors in different phases are faulty, two phases will lose half a cycle of waveform, and the waveforms of other phases will be distorted. Therefore, the three-phase output current is selected as the characteristic quantity.
3. The photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost according to claim 1, characterized in that, in step 2:
1) A Logistic mapping is introduced to initialize the population, replacing fully random initialization while retaining the randomness of the initial values, so that the initial solutions are more evenly distributed; [formula] (1);
In the formula, the terms denote: the Logistic mapping value of each variable at the n-th iteration, used to map the variable between its upper and lower bounds; the parameter of the Logistic mapping; the Logistic mapping value of the variable at the next iteration; the updated value of each decision variable of each planet; the lower bound and the upper bound of each decision variable of each planet; the planet index; the decision-variable index; the population size; and the dimension of the problem to be optimized;
2) A control factor based on the Logistic mapping is introduced into the gravitational force F to fine-tune the magnitude of planetary position changes and thereby balance exploration and exploitation; [formula] (2);
In the formula, the terms denote: the values of the Logistic mapping sequence; the Logistic mapping results at iterations t and t-1; the gravitational force experienced by the i-th planet solution at the t-th iteration; the control factor; the gravitational constant, which gradually decreases as the iteration number t increases; the normalized mass of the Sun; the normalized mass of the i-th planet; the normalized distance between the planet and the Sun; and a positive number;
3) A Sine mapping is introduced to replace the random values in the orbital eccentricity and the mass, retaining the randomness of the initial values; [formula] (3);
In the formula, the terms denote: the Sine mapping value at the i-th iteration; a constant used to adjust the magnitude of the Sine mapping value; the Sine mapping value at iteration i-1; and the orbital eccentricity; [formula] (4);
In the formula, the terms denote: the value generated by the Sine mapping; the random sequence value at the t-th iteration; the value of the Sine mapping in the previous iteration; the weighting factor associated with the Sine mapping value; the fitness value of the t-th individual; the fitness value of the worst individual in the current population; and the population size;
4) A perturbation term based on the Tent mapping is introduced as a global perturbation operator to fully shuffle the sequence; [formula] (5);
In the formula, the terms denote: the output vector of the Tent mapping; the random sequence vector generated by the Tent mapping at the t-th iteration; the output vector of the Tent mapping at the t-th iteration; the output value of the Tent mapping at the t-th iteration; the segmentation parameter of the Tent mapping; the velocity of the planet at the given time; the system parameters used to adjust how strongly the individual terms affect the velocity; a random factor; the current position vector of the i-th candidate solution; the current position vector of the optimal solution in the group; the average position vector of the group; the normalized value of the i-th solution; an additional perturbation term used to increase the diversity of solutions and prevent the algorithm from becoming trapped in local optima; the upper and lower bounds of the i-th solution, used to limit the search range; a control quantity used to strengthen the influence on the solution, reflecting the global search capability; a perturbation term used to adjust the scope of the global search; and a random factor that adjusts the weight.
4. The photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost according to claim 3, characterized in that: The IKOA algorithm, which integrates Logistic, Sine, and Tent chaotic mappings, includes equations (1) to (5), as follows: First, the population is initialized using the Logistic mapping introduced by equation (1) instead of fully random initialization, and a control factor based on the Logistic mapping is introduced into the gravitational force F using equation (2) to fine-tune the magnitude of planetary position changes; Secondly, the Sine mapping is introduced using equations (3) and (4) to replace the random values in the orbital eccentricity and the mass, respectively; Finally, a perturbation term based on the Tent mapping is introduced using equation (5) and used as a global perturbation operator to fully shuffle the sequence, thereby enhancing the algorithm's global search capability.
5. The photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost according to claim 3, characterized in that: In step 3, a photovoltaic system fault diagnosis model based on XGBoost is constructed, including:
1) Regularized objective function: For a dataset $D=\{(x_i,y_i)\}$ with $n$ samples and $m$ features ($|D|=n$, $x_i\in\mathbb{R}^{m}$, $y_i\in\mathbb{R}$), $x_i$ denotes the input feature vector of the i-th sample; $y_i$ denotes the true value of the i-th sample; $n$ denotes the total number of samples; $m$ denotes the dimension of the feature space; $\mathbb{R}$ denotes the range of the target value; the final prediction output of the CART tree ensemble is defined as follows:
$$\hat{y}_i=\phi(x_i)=\sum_{k=1}^{K}f_k(x_i),\quad f_k\in\mathcal{F},\qquad \mathcal{F}=\{f(x)=w_{q(x)}\}\ \ (q:\mathbb{R}^{m}\to T,\ w\in\mathbb{R}^{T})$$;
In the formula: $\hat{y}_i$ denotes the predicted value of the i-th sample; $\phi$ denotes the model's prediction function; $f_k(x_i)$ denotes the predicted value of the k-th decision tree for sample $x_i$; $K$ denotes the total number of CART decision trees used in the model; $\mathcal{F}$ denotes the space of CART regression trees; $f$ is the prediction function of a decision tree; $q$ denotes the structure vector; $w$ denotes the leaf weights; $\mathbb{R}^{T}$ denotes the vector space of $T$ real numbers; each function $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$; each sample is mapped to a corresponding leaf index, and each leaf node of each CART tree carries a continuous score, i.e., a weight; the score of the i-th node is $w_i$; $T$ is the number of leaf nodes; $\hat{y}_i$ is the score of sample $x_i$, i.e., the model's predicted value; for each sample, each CART assigns it to a leaf node according to its own decision rules, and the scores of the corresponding leaf nodes are summed to obtain the final prediction;
The objective function $\mathcal{L}(\phi)$ consists of the loss function $l$ and the regularization term $\Omega$: the loss function measures the error between the predicted and actual results and is minimized during training, and it is written as $\sum_{i}l(\hat{y}_i,y_i)$; the regularization term evaluates the complexity of the XGBoost model to avoid overfitting or underfitting, as shown in the following equation:
$$\mathcal{L}(\phi)=\sum_{i}l(\hat{y}_i,y_i)+\sum_{k}\Omega(f_k),\qquad \Omega(f)=\gamma T+\frac{1}{2}\lambda\lVert w\rVert^{2}$$;
In the formula: $\mathcal{L}(\phi)$ denotes the objective function of the model; $l$ denotes the loss function of a single sample; $\gamma$ denotes the regularization parameter that controls the complexity of the tree structure; $\lambda$ denotes the regularization parameter that controls the leaf weights; $w$ denotes the weight vector of the leaf nodes of the tree;
2) Gradient tree boosting: The XGBoost model is trained additively; let $\hat{y}_i^{(t)}$ denote the prediction for the i-th sample at the t-th iteration, and add $f_t$ to minimize the following objective:
$$\mathcal{L}^{(t)}=\sum_{i=1}^{n}l\big(y_i,\ \hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t)$$;
In the formula: $\mathcal{L}^{(t)}$ denotes the objective function at the t-th iteration; $\hat{y}_i^{(t-1)}$ denotes the predicted value of sample i at the (t-1)-th iteration; $f_t(x_i)$ denotes the predicted value of the t-th tree for sample i; $\Omega(f_t)$ denotes the regularization term of the t-th tree; $n$ denotes the total number of samples;
To optimize the objective function quickly, a second-order Taylor expansion is applied to it, giving:
$$\mathcal{L}^{(t)}\simeq\sum_{i=1}^{n}\Big[l\big(y_i,\hat{y}_i^{(t-1)}\big)+g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\Big]+\Omega(f_t)$$;
In the formula: $g_i=\partial_{\hat{y}^{(t-1)}}l\big(y_i,\hat{y}^{(t-1)}\big)$ and $h_i=\partial^{2}_{\hat{y}^{(t-1)}}l\big(y_i,\hat{y}^{(t-1)}\big)$ are the first- and second-order gradient statistics of the loss function; $l\big(y_i,\hat{y}_i^{(t-1)}\big)$ denotes the current loss value of sample i;
To simplify the objective of the t-th training iteration, the constant term is removed:
$$\tilde{\mathcal{L}}^{(t)}=\sum_{i=1}^{n}\Big[g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\Big]+\Omega(f_t)$$;
Define $I_j=\{\,i\mid q(x_i)=j\,\}$ as the sample set of leaf $j$, where $q$ is the structure function of the decision tree that assigns the input sample $x_i$ to leaf node $j$, and expand $\Omega$:
$$\tilde{\mathcal{L}}^{(t)}=\sum_{j=1}^{T}\Big[\Big(\sum_{i\in I_j}g_i\Big)w_j+\frac{1}{2}\Big(\sum_{i\in I_j}h_i+\lambda\Big)w_j^{2}\Big]+\gamma T$$;
In the formula: $w_j$ denotes the prediction weight of the j-th leaf node; $I_j$ is the sample set on leaf node $j$; $i$ denotes the sample index;
For a fixed structure $q(x)$, the optimal weight $w_j^{*}$ of leaf $j$ is defined as follows:
$$w_j^{*}=-\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}$$;
The corresponding optimal value is used as a scoring function that measures the quality of the tree structure $q$, defined as follows:
$$\tilde{\mathcal{L}}^{(t)}(q)=-\frac{1}{2}\sum_{j=1}^{T}\frac{\big(\sum_{i\in I_j}g_i\big)^{2}}{\sum_{i\in I_j}h_i+\lambda}+\gamma T$$;
In the formula: $\tilde{\mathcal{L}}^{(t)}(q)$ denotes the quality scoring function of the tree structure $q$ at the t-th iteration;
3) XGBoost employs an approximate greedy algorithm to balance computational accuracy and model complexity when finding the optimal split point. First, candidate split points are determined from the percentiles of the feature distribution. Then, continuous features are mapped to the buckets formed by these candidate points, and the statistics are aggregated per bucket. The percentiles of the features are used so that the candidates are distributed evenly over the data;
Let the multiset $D_k=\{(x_{1k},h_1),(x_{2k},h_2),\dots,(x_{nk},h_n)\}$ represent the k-th feature values and second-order gradient statistics of the training samples, where $x_{ik}$ denotes the value of the i-th sample on feature $k$ and $h_i$ denotes the second-order gradient of the i-th sample; define a rank function $r_k$ as:
$$r_k(z)=\frac{1}{\sum_{(x,h)\in D_k}h}\sum_{(x,h)\in D_k,\ x < z}h$$;
In the formula: $r_k(z)$ denotes the weighted proportion of samples whose value of feature $k$ is less than $z$; $x$ denotes the feature value of a sample; $h$ denotes the second-order gradient of the sample; $z$ denotes the value of the current candidate split point; $(x,h)\in D_k$ indicates that the sample belongs to the sample set of the k-th feature; the goal is to find candidate split points $\{s_{k1},s_{k2},\dots,s_{kl}\}$, the 1st to l-th candidate split points of feature $k$, such that:
$$\big|r_k(s_{k,j})-r_k(s_{k,j+1})\big| < \epsilon,\qquad s_{k1}=\min_{i}x_{ik},\qquad s_{kl}=\max_{i}x_{ik}$$;
In the formula: $r_k(s_{k,j})$ and $r_k(s_{k,j+1})$ denote the weighted sample proportions to the left of two adjacent candidate split points of feature $k$; $s_{k1}$ denotes the first candidate split point of feature $k$; $s_{kl}$ denotes the l-th candidate split point of feature $k$; $x_{ik}$ denotes the value of the i-th sample on the k-th feature; $\epsilon$ denotes an approximation factor; intuitively, the number of candidate split points is inversely proportional to $\epsilon$, which means there are roughly $1/\epsilon$ candidate points; each data point is weighted by $h_i$; therefore, the objective can be rewritten as a weighted squared loss with labels $-g_i/h_i$ and weights $h_i$:
$$\tilde{\mathcal{L}}^{(t)}=\sum_{i=1}^{n}\frac{1}{2}h_i\Big(f_t(x_i)+\frac{g_i}{h_i}\Big)^{2}+\Omega(f_t)+\mathrm{constant}$$;
In the formula: $\tilde{\mathcal{L}}^{(t)}$ denotes the optimization objective function at the t-th iteration; $f_t(x_i)$ denotes the predicted value of the t-th tree for sample $x_i$; $\Omega(f_t)$ denotes the regularization term of the t-th tree; constant denotes the terms that do not depend on $f_t$.
6. The photovoltaic system fault diagnosis method based on ALasso-IKOA-XGBoost according to claim 3, characterized in that: Step 3 involves optimizing key parameters in the XGBoost-based photovoltaic system fault diagnosis model using the IKOA algorithm, including the following steps:
1) Collect fault data of photovoltaic modules and inverters, and preprocess it by cleaning and screening; then use the wavelet transform to extract time-frequency domain features, use the adaptive Lasso method to reduce dimensionality and select key features, and divide the data into a training set and a test set;
2) Initialize the hyperparameters and optimization ranges of the XGBoost model, set the number of iterations of the IKOA algorithm, and generate a population of the given size using the IKOA algorithm; the IKOA algorithm is used to optimize seven hyperparameters of the XGBoost model, including the maximum tree depth, the learning rate, the number of weak learners (trees), and the sampling ratio used when training each tree; the optimization intervals are [3, 100] for the maximum tree depth, [0.001, 0.1] for the learning rate, [50, 1000] for the number of weak learners (trees), and [0.5, 1] for the sampling ratio used when training each tree; the number of iterations is set to 200; the mathematical model for generating the population is the same as equation (1);
3) In each iteration, the model's fitness is obtained by 5-fold cross-validation on the dataset; the elitist strategy is then executed to find and record the sun, i.e., the optimal solution; specifically, the elitist strategy ensures that the planets and the sun keep their best positions, as shown in the following formula:
$$\vec{X}_{i,\mathrm{new}}(t+1)=\begin{cases}\vec{X}_{i}(t+1), & \text{if } f\big(\vec{X}_{i}(t+1)\big) \text{ is better than } f\big(\vec{X}_{i}(t)\big)\\ \vec{X}_{i}(t), & \text{otherwise}\end{cases}$$;
In the formula: $\vec{X}_{i,\mathrm{new}}(t+1)$ denotes the updated position of the i-th planet in the (t+1)-th iteration; $\vec{X}_{i}(t+1)$ denotes the candidate position of the i-th planet in the (t+1)-th iteration; $\vec{X}_{i}(t)$ denotes the current position of the i-th planet before the update in the (t+1)-th iteration; $f(\cdot)$ denotes the fitness value of the corresponding position; $t$ denotes the iteration number;
4) Calculate the planet-related parameters in the IKOA algorithm and update the planet's position and its distance from the sun; repeat step 3) until the maximum number of iterations is reached; the planet's position is updated using the following formula:
$$\vec{X}_{i}(t+1)=\vec{X}_{i}(t)+\mathcal{F}\times\vec{V}_{i}(t)+\big(F_{g_i}(t)+|r|\big)\times\vec{U}\times\big(\vec{X}_{S}(t)-\vec{X}_{i}(t)\big)$$;
In the formula: $\mathcal{F}$ denotes the scaling factor; $\vec{V}_{i}(t)$ denotes the velocity of the i-th planet in the t-th iteration; $F_{g_i}(t)$ denotes the gravitational force experienced by the i-th planet solution in the t-th iteration; $|r|$ denotes a random number; $\vec{U}$ denotes a unit vector; $\vec{X}_{S}(t)$ denotes the central location of the system;
The distance to the sun is updated using the following formula: [formula];
In the formula: $\vec{X}_{a}(t)$ denotes the position of the a-th reference planet; $\vec{X}_{b}(t)$ denotes the position of the b-th reference planet; $h$ denotes the adaptive factor that controls the distance between the Sun and the current planet at time t; $\eta$ denotes a factor that decreases linearly from 1 to -2; $a_2$ denotes a cyclic control parameter that gradually decreases from -1 to -2 over the T iterations of the optimization process;
5) The obtained solar parameters are used as the optimal parameters of the XGBoost model, and the diagnostic performance is evaluated on the test set using accuracy, recall, precision, and F1-score, expressed by the following formulas:
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},\qquad \mathrm{Precision}=\frac{TP}{TP+FP},\qquad F1=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$;
In the formula: TP and TN denote the numbers of correctly predicted positive and negative samples, respectively, and FP and FN denote the numbers of samples incorrectly predicted as positive and as negative, respectively.