Photovoltaic power data anomaly detection method and system based on unsupervised deep learning
By applying unsupervised deep learning, calculating data errors from the historical data of photovoltaic systems, and iteratively optimizing a dynamic threshold, the method addresses the low accuracy of photovoltaic power data anomaly detection and achieves high-precision, stable anomaly detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID HEBEI ELECTRIC POWER CO LTD
- Filing Date
- 2025-12-01
- Publication Date
- 2026-06-12
AI Technical Summary
In existing photovoltaic power data anomaly detection technologies, fixed threshold detection methods cannot adapt to environmental changes, resulting in low detection accuracy and a high risk of missed or false detections.
The invention adopts an unsupervised deep learning-based approach. It acquires historical operating data of photovoltaic systems, calculates data errors, and iteratively optimizes the dynamic threshold against an objective function that maximizes the F1 score. K-means clustering and an autoencoder are combined to extract key features, achieving adaptive adjustment of the dynamic threshold and anomaly detection.
It improves the accuracy of photovoltaic power data anomaly detection, reduces false positives and false negatives, and ensures the stability and accuracy of detection under multiple operating conditions.
Figure: abstract drawing of CN122196802A
Description
Technical Field
[0001] This invention relates to the field of photovoltaic power data processing technology, and in particular to a method and system for detecting anomalies in photovoltaic power data based on unsupervised deep learning.
Background Technology
[0002] As a core component of sustainable energy, photovoltaic (PV) systems are experiencing continuous expansion in global application and play a crucial role in the energy transition process. However, PV systems are susceptible to various types of faults, including physical faults (such as panel cracks and diode failures), environmental faults (such as module shading and surface dirt accumulation), and electrical faults (such as open circuits and inverter malfunctions). These faults can not only cause irreversible damage to PV equipment but also directly lead to a decrease in power generation efficiency, resulting in significant economic losses. Therefore, real-time monitoring of the PV system's operating status and timely identification of power data anomalies have become core requirements for ensuring its stable and efficient operation.
[0003] Current photovoltaic power data anomaly detection technologies are mainly divided into model-based methods and data-driven methods, and fixed threshold detection is a simplified implementation used by some schemes of both types. Fixed threshold detection methods are usually based on the statistical results of photovoltaic power data under one specific historical environmental condition. Photovoltaic power output, however, is strongly coupled with environmental factors: when the environment changes, the numerical range and fluctuation pattern of normal photovoltaic power change significantly, while the fixed threshold cannot adjust with them. The resulting mismatch between the threshold and the data distribution makes missed and false detections likely, so the anomaly detection results are inaccurate.
Summary of the Invention
[0004] This invention provides a photovoltaic power data anomaly detection method and system based on unsupervised deep learning, which solves the problem of low accuracy in photovoltaic power data anomaly detection using fixed thresholds in the prior art.
[0005] In a first aspect, embodiments of the present invention provide a photovoltaic power data anomaly detection method based on unsupervised deep learning, comprising: acquiring historical active power time series data of a photovoltaic system; calculating data error based on the active power time series data and normal mode data; constructing an objective function based on maximizing the F1 score, iteratively optimizing the calculation parameters of the initial dynamic threshold to obtain optimized calculation parameters; calculating an updated dynamic threshold based on the data error and the optimized calculation parameters; and performing anomaly detection on the real-time detected active power time series data based on the updated dynamic threshold to obtain anomaly detection results.
[0006] In one possible implementation, anomaly detection is performed on the real-time detected active power time series data based on an updated dynamic threshold to obtain anomaly detection results. This includes: extracting key features of the normal mode of the real-time detected active power time series data based on the real-time detected active power time series data and the autoencoder; obtaining fitting data of the normal mode based on the key features; calculating the real-time data error based on the real-time detected active power time series data and the fitting data of the normal mode; and if the real-time data error is higher than the updated dynamic threshold, then the real-time detected active power time series data is considered abnormal.
[0007] In one possible implementation, an updated dynamic threshold is calculated based on data error and optimized calculation parameters, including: filtering data errors corresponding to normal samples based on data error; calculating the mean and standard deviation of data errors corresponding to normal samples; and calculating the updated dynamic threshold based on the mean, standard deviation, and optimized calculation parameters, wherein the normal data retention rate of the updated dynamic threshold is higher than the preset retention rate threshold.
[0008] In one possible implementation, after constructing the objective function based on the maximum F1 score and iteratively optimizing the calculation parameters of the initial dynamic threshold to obtain the optimized calculation parameters, the anomaly detection method further includes: calculating the matching deviation based on the mean of the data error corresponding to the normal sample and the standard deviation of the data error corresponding to the normal sample; if the matching deviation exceeds the deviation threshold, it triggers the re-optimization of the calculation parameters, and the adjustment range of the calculation parameters is calculated by the learning rate and the F1 score.
[0009] In one possible implementation, the data errors corresponding to normal samples are screened based on the data errors, including: clustering the data errors corresponding to all samples using K-means clustering with the number of clusters set to 2, to obtain two sample error clusters; calculating the central error value of each of the two sample error clusters based on the data errors in that cluster, and taking the sample error cluster with the smaller central error value as the data error corresponding to the normal samples.
[0010] In one possible implementation, before extracting key features of normal patterns from the real-time detected active power time-series data based on real-time detected active power time-series data and an autoencoder, the anomaly detection method further includes: extracting multiple anomalous samples based on the historical active power time-series data of the photovoltaic system; calculating the synthetic value score of each anomalous sample based on its local density factor and global density factor, where the local density factor is the proportion of normal samples among a preset number of nearest neighbors of the anomalous sample, and the global density factor is the relative density of the anomalous sample among all anomalous samples; determining the generation weight of each anomalous sample based on its synthetic value score; determining the number of new anomalous samples corresponding to the anomalous sample based on the generation weight; calculating the new anomalous sample corresponding to each anomalous sample based on each anomalous sample, the number of nearest neighbors of each anomalous sample, and the number of new anomalous samples corresponding to each anomalous sample; and obtaining enhanced active power time-series data based on the active power time-series data and the new anomalous samples.
[0011] In one possible implementation, before calculating the new abnormal samples corresponding to each abnormal sample based on each abnormal sample, the nearest neighbor samples corresponding to each abnormal sample, and the number of new abnormal samples corresponding to each abnormal sample, the anomaly detection method further includes: if the global density factor of an abnormal sample is higher than a preset density threshold, determining the nearest neighbor samples corresponding to the abnormal sample according to a preset distance range and the number of new abnormal samples corresponding to the abnormal sample; if the global density factor of an abnormal sample is less than or equal to the preset density threshold, determining the new abnormal sample according to a preset number of nearest neighbor samples of the abnormal sample.
[0012] In one possible implementation, after performing anomaly detection on the real-time active power time series data based on the updated dynamic threshold and obtaining the anomaly detection result, the detection method further includes: if the active power time series data is abnormal, dividing the active power time series data and the normal mode data into blocks according to a preset time interval to obtain multiple sequence data blocks and normal mode data blocks; calculating the MSM distance of each sequence data block based on each sequence data block and the corresponding normal mode data block; obtaining the abnormal sequence block and the time corresponding to the abnormal sequence block based on the MSM distance of each sequence data block and a preset anomaly threshold of a preset quantile; and obtaining the location of the abnormal data according to the time corresponding to the abnormal sequence block.
[0013] In a second aspect, embodiments of the present invention provide a photovoltaic power data anomaly detection system based on unsupervised deep learning, comprising: a communication module for acquiring historical active power time series data of a photovoltaic system; and a processing module for calculating data error based on the active power time series data and normal mode data; constructing an objective function based on maximizing the F1 score, iteratively optimizing the calculation parameters of the initial dynamic threshold to obtain optimized calculation parameters; calculating an updated dynamic threshold based on the data error and the optimized calculation parameters; and performing anomaly detection on the real-time detected active power time series data based on the updated dynamic threshold to obtain anomaly detection results.
[0014] In a third aspect, embodiments of the present invention provide an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method described in the first aspect or any possible implementation thereof.
[0015] In this embodiment of the invention, the standard for judging anomalies in active power time series data is quantified by calculating data errors, thus eliminating subjective errors. The calculation parameters are iteratively optimized using F1 scores, which balance precision and recall, reducing false positives for normal data and false negatives for abnormal data, thereby improving detection accuracy. Based on the data errors and calculation parameters, an updated dynamic threshold is calculated to accurately define the range of normal and abnormal errors, improving detection accuracy and ensuring stable detection accuracy under multiple operating conditions. Through the updated dynamic threshold, accurate detection of photovoltaic power data anomalies is ultimately achieved.
Attached Figure Description
[0016] Figure 1 is a flowchart illustrating the implementation of the photovoltaic power data anomaly detection method based on unsupervised deep learning provided in an embodiment of the present invention;
Figure 2a is a schematic diagram of the original data labeling sample provided in an embodiment of the present invention;
Figure 2b is a schematic diagram of anomaly samples generated by AEB-SMOTE according to an embodiment of the present invention;
Figure 3 is a schematic diagram illustrating the location of an abnormal segment during a certain time period provided in an embodiment of the present invention;
Figure 4 is a comparative analysis chart of anomaly detection performance using multiple thresholds, evaluated by classification indicators, provided in an embodiment of the present invention;
Figure 5 is a graph showing the test results of the balanced set and the unbalanced set provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of the structure of the photovoltaic power data anomaly detection system based on unsupervised deep learning provided in an embodiment of the present invention;
Figure 7 is a schematic diagram of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0017] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0018] Figure 1 shows a flowchart of the photovoltaic power data anomaly detection method based on unsupervised deep learning provided in this embodiment of the invention, detailed below:
Step 101: Obtain the historical active power time series data of the photovoltaic system.
[0019] In some embodiments, a photovoltaic system refers to a complete hardware system that converts solar energy into electrical energy.
[0020] For example, a photovoltaic system includes photovoltaic panels (such as polycrystalline silicon / monocrystalline silicon modules), inverters (DC to AC), combiner boxes, power sensors, and monitoring units, which are the physical carriers that generate active power data.
[0021] In some embodiments, active power is the electrical power that the photovoltaic system actually outputs to the outside world and can directly drive loads (such as electrical appliances and motors). It is a core indicator reflecting the system's power generation efficiency and operating status.
[0022] In some embodiments, the historical active power time series data is active power data collected continuously in chronological order, in the format of timestamp and power value.
[0023] Step 102: Calculate the data error based on the active power time series data and the normal mode data.
[0024] In some embodiments, the normal mode data is power data generated by an autoencoder after learning historical active power time series data. This data conforms to the normal operating patterns of a photovoltaic system and serves as a benchmark template reflecting the normal power generation state of the photovoltaic system. Its shape must match the normal characteristics in the historical data, such as a bell-shaped curve on sunny days and a smooth, low-power curve on cloudy days. The normal mode data is the standard template for error calculation. During autoencoder training, the normal mode is learned by minimizing the mean square error (MSE) between the input and output data. The final output data is the normal mode data. The core function of the normal mode data is to provide a quantitative standard for normal conditions. If the original historical active power data conforms to normal power generation patterns (such as a power curve on sunny days), the difference from the normal mode data is small, and the calculated MSE (data error) is also small. If the original historical data contains anomalies (such as a sudden drop in power caused by an inverter failure), the difference from the normal mode data is large, and the MSE will increase significantly.
[0025] In some embodiments, data error refers to the difference between active power time series data and normal pattern data. Data error (MSE) is the core bridge connecting raw data and anomaly detection, serving two purposes: providing a statistical basis for dynamic threshold calculation and providing a basis for anomaly sample screening. Data error is generated through point-by-point calculation of raw and reconstructed data. The specific steps are: data alignment, aligning the active power time series data (raw data) and normal pattern data (reconstructed data) point-by-point according to timestamps (ensuring a one-to-one correspondence between each minute-level data point); error calculation, calculating the difference between the two samples according to the MSE formula (one MSE value for each historical sample); and data screening, removing invalid error values generated during the calculation process due to data alignment failures (such as missing timestamps), retaining valid MSEs as data error.
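The alignment, error calculation, and screening steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the dictionary representation (timestamp mapped to power value), and the rule that any timestamp mismatch counts as an alignment failure are all assumptions made for the example.

```python
import math

def sample_mse(raw, reconstructed):
    """MSE for one sample; both arguments map timestamp -> power value.

    Assumption: a sample whose timestamps do not match exactly is treated
    as an alignment failure and yields None.
    """
    if not raw or set(raw) != set(reconstructed):
        return None
    return sum((raw[t] - reconstructed[t]) ** 2 for t in raw) / len(raw)

def data_errors(raw_samples, reconstructed_samples):
    """One MSE per historical sample, with invalid values screened out."""
    errors = []
    for raw, rec in zip(raw_samples, reconstructed_samples):
        mse = sample_mse(raw, rec)
        if mse is not None and math.isfinite(mse):
            errors.append(mse)
    return errors

# Hypothetical data: the second sample is missing a timestamp in the
# reconstruction, so it is dropped during screening.
raw_samples = [{"09:00": 1.0, "09:01": 2.0}, {"09:00": 1.0, "09:01": 2.0}]
reconstructed_samples = [{"09:00": 1.0, "09:01": 4.0}, {"09:00": 1.0}]
errors = data_errors(raw_samples, reconstructed_samples)
```

With this toy input only the first sample survives screening, and its error is the mean of the squared differences 0 and 4.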
[0026] As one possible implementation, this embodiment of the invention calculates the difference between real historical active power time series data and the baseline (normal mode data) of the normal mode learned by the model, providing a quantifiable core basis for subsequent dynamic threshold optimization and anomaly judgment.
[0027] Step 103: Construct an objective function based on maximizing the F1 score, and iteratively optimize the calculation parameters of the initial dynamic threshold to obtain the optimized calculation parameters.
[0028] In some embodiments, the F1 score is a comprehensive metric that measures the balance between precision and recall in an anomaly detection model. Precision is the proportion of samples judged as anomalous that are actually anomalous (a high precision means few false positives). Recall is the proportion of actually anomalous samples that are successfully detected (a high recall means few false negatives). The F1 score is the harmonic mean of precision and recall and takes values in [0, 1], with values closer to 1 indicating better detection performance. The F1 score is the quantitative evaluation benchmark for detection performance. In anomaly detection, precision and recall trade off against each other like a seesaw: a threshold set too low flags more samples, which raises recall but lowers precision, while a threshold set too high has the opposite effect. The F1 score is needed to balance the two. If only precision is considered, a large number of latent anomalies (such as slight power fluctuations) may be missed, leading to untimely maintenance. If only recall is considered, a large amount of normal data may be falsely flagged (e.g., normal low power on cloudy days), resulting in invalid alarms. As a comprehensive metric, the F1 score measures the balance between false positives and false negatives and avoids the performance bias caused by relying on a single metric.
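As a worked illustration of these definitions, the sketch below computes precision, recall, and the F1 score from reference labels and detector outputs. The function name and the toy label vectors are assumptions made for the example; where the reference labels come from is outside the scope of this sketch.

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1); 1 marks an anomaly, 0 a normal sample."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels: 3 true anomalies, of which 2 are detected,
# plus 1 normal sample that is falsely flagged.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
```

Here there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.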
[0029] In some embodiments, the objective function serves as a directional guide for iterative optimization. Maximizing the F1 score provides a clear direction for parameter adjustment. After each parameter adjustment, the current F1 score is calculated using the objective function to determine if the adjustment was effective. If the F1 score increases, the adjustment is retained; otherwise, it is rolled back. Without an objective function, parameter optimization would fall into blind trial and error, failing to guarantee that the final parameters correspond to optimal detection performance.
[0030] In some embodiments, the calculation parameters of the initial dynamic threshold are the parameters required to calculate the initial dynamic threshold. The formula for calculating the initial dynamic threshold is $T = \alpha \mu + \beta \sigma$, where $\mu$ and $\sigma$ are the mean and the standard deviation of the data error, $\alpha$ is the weighting coefficient of the mean, and $\beta$ is the weighting coefficient of the standard deviation. The calculation parameters refer to $\alpha$ and $\beta$. The initial value refers to the preset value before optimization, which needs to be set based on the statistical characteristics of normal data.
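The keep-or-roll-back optimization described above can be sketched as a simple coordinate search. This is an illustration under stated assumptions, not the patented procedure: the threshold is assumed to be a weighted mean plus a weighted standard deviation of the normal-sample errors, and the names `alpha`, `beta`, `step`, and `rounds`, the fixed step with halving, and the use of the population standard deviation are all choices made for the example.

```python
from statistics import mean, pstdev

def f1_at(threshold, errors, labels):
    """F1 score when every error above the threshold is flagged as abnormal."""
    tp = sum(1 for e, y in zip(errors, labels) if y and e > threshold)
    fp = sum(1 for e, y in zip(errors, labels) if not y and e > threshold)
    fn = sum(1 for e, y in zip(errors, labels) if y and e <= threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def optimize_weights(errors, labels, alpha=1.0, beta=1.0, step=0.5, rounds=50):
    """Adjust alpha and beta, keeping a change only if the F1 score rises."""
    normal = [e for e, y in zip(errors, labels) if not y]
    mu, sigma = mean(normal), pstdev(normal)
    best = f1_at(alpha * mu + beta * sigma, errors, labels)
    for _ in range(rounds):
        improved = False
        for d_alpha, d_beta in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand_a, cand_b = alpha + d_alpha, beta + d_beta
            score = f1_at(cand_a * mu + cand_b * sigma, errors, labels)
            if score > best:
                # The adjustment improved F1, so it is retained.
                alpha, beta, best = cand_a, cand_b, score
                improved = True
            # Otherwise the adjustment is rolled back (alpha, beta unchanged).
        if not improved:
            step /= 2  # refine the search around the current parameters
    return alpha, beta, best

# Hypothetical errors: five normal samples and two abnormal ones.
errors = [0.10, 0.12, 0.11, 0.13, 0.09, 0.90, 1.10]
labels = [0, 0, 0, 0, 0, 1, 1]
alpha, beta, best_f1 = optimize_weights(errors, labels)
```

On this toy data the initial parameters flag one normal sample (F1 of 0.8); raising the mean weight once removes that false positive and the search settles at an F1 of 1.0.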
[0031] As one possible implementation, this embodiment of the invention uses scientific parameter optimization logic to find the optimal calculation parameters for the dynamic threshold, balancing the contradiction between reducing false detections and avoiding missed detections, and ensuring that subsequent anomaly detection not only meets the operation and maintenance needs of photovoltaic power plants but also has high reliability.
[0032] Step 104: Calculate the updated dynamic threshold based on data errors and optimized calculation parameters.
[0033] In some embodiments, the updated dynamic threshold refers to the boundary value, calculated from the statistical characteristics of the normal-sample MSE together with the optimized weighting coefficients of the mean and the standard deviation, that is used to distinguish between normal and abnormal data. The update is reflected in the fact that, unlike the initial threshold, the updated dynamic threshold is the final result after parameter optimization, and it can be recalculated as the data distribution changes: for example, after a seasonal change, the inputs to the threshold are updated and the dynamic threshold is then recalculated.
[0034] As one possible implementation, embodiments of the present invention generate accurate, usable, and adaptive anomaly detection boundaries, which not only transform abstract statistical features and optimization parameters into concrete detection standards, but also ensure that the standards conform to data patterns and meet performance requirements, while supporting dynamic updates to adapt to environmental changes.
[0035] As one possible implementation, step 104 can be specifically implemented as steps A11-A13.
[0036] A11: Based on data error, filter out the data error corresponding to normal samples.
[0037] As one possible implementation, step A11 can be specifically implemented as steps B11-B12.
[0038] B11: Based on the data errors corresponding to all samples, K-means clustering with the number of clusters set to 2 is used to cluster the data errors, resulting in two sample error clusters.
[0039] B12: Based on the data errors of two sample error clusters, the central error value of the two sample error clusters is calculated, and the sample error cluster with the smaller central error value is taken as the data error corresponding to the normal sample.
[0040] In some embodiments, K-means clustering is an unsupervised machine learning clustering algorithm. Its core logic is to preset the number of clusters k, iteratively calculate cluster centers, and assign samples to the nearest cluster, ultimately resulting in high similarity among samples within a cluster and low similarity among samples between clusters. K-means clustering is a core tool for automatic error classification. Historical data from photovoltaic systems typically lack manually labeled normal / abnormal data. Traditional screening methods rely on empirical thresholds, which may fail due to changes in data distribution (such as seasonal changes leading to increased normal errors). K-means clustering, based on the natural distribution of error values, unsupervisedly divides samples into two classes without requiring manually set empirical thresholds. The algorithm automatically groups samples with similar errors into one cluster by calculating the distance between samples (such as Euclidean distance). Given the characteristics of photovoltaic data—many normal samples, few anomalous samples, but small normal errors and large anomalous errors—K-means (k=2) can reliably distinguish between small and large error clusters, avoiding the subjectivity of empirical screening.
[0041] In some embodiments, a sample error cluster refers to two sets of error samples formed after K-means clustering. One cluster contains samples with smaller errors (corresponding to normal operating conditions), and the other cluster contains samples with larger errors (corresponding to abnormal operating conditions). The error samples within each cluster have similar numerical ranges and distribution characteristics. The two sample error clusters formed after clustering serve as a transition from a mixed set of data errors to a categorical set. Their function is to transform abstract error values into a set with clear group attributes, making it easier to determine their normal or abnormal attributes later through cluster characteristics (such as the central error value).
[0042] In some embodiments, the central error value refers to the mean of all error samples within each error cluster, reflecting the average level of the cluster's error. Since the error during normal operation of a photovoltaic system is necessarily smaller than the error during abnormal operation (normal data has a small deviation from the normal pattern, while abnormal data has a large deviation due to deviation from the normal pattern), the central error values of the two clusters will inevitably have a significant difference. Its function is to provide a quantitative standard for judging cluster attributes. The cluster with the smaller central error value must be the normal sample error cluster; the cluster with the larger central error value must be the abnormal sample error cluster.
[0043] In some embodiments, the data error corresponding to normal samples is the MSE value of those samples, selected from the data errors, that correspond to the normal operating state of the photovoltaic system; that is, the error between the normal historical data as reconstructed by the autoencoder and the original data. This selection eliminates the interference of abnormal sample errors and provides clean statistical data for threshold calculation: only a mean and standard deviation computed from the selected errors truly reflect the error pattern under normal operating conditions.
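Steps B11 and B12 can be sketched with a small one-dimensional K-means (k = 2) over the error values. This is a minimal standard-library illustration, not the patented implementation: the function name, the choice of the minimum and maximum error as initial centres, and the iteration cap are assumptions made for the example.

```python
def split_errors(errors, iterations=100):
    """Cluster 1-D errors into two groups; return (normal, abnormal)."""
    c0, c1 = min(errors), max(errors)  # initial centres (assumption)
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for e in errors:
            # Assign each error to the nearest centre.
            clusters[0 if abs(e - c0) <= abs(e - c1) else 1].append(e)
        if not clusters[1]:
            break  # all errors identical: nothing to separate
        n0 = sum(clusters[0]) / len(clusters[0])
        n1 = sum(clusters[1]) / len(clusters[1])
        if (n0, n1) == (c0, c1):
            break  # centres stopped moving
        c0, c1 = n0, n1
    # B12: the cluster with the smaller central error value is "normal".
    if c0 <= c1:
        return clusters[0], clusters[1]
    return clusters[1], clusters[0]

# Hypothetical errors: five small values and two large ones.
errors = [0.10, 0.12, 0.11, 0.13, 0.09, 0.90, 1.10]
normal_errors, abnormal_errors = split_errors(errors)
```

On this toy input the two clusters settle at centres of about 0.11 and 1.0, and the five small errors are returned as the normal-sample errors.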
[0044] As one possible implementation, embodiments of the present invention eliminate manual dependence through unsupervised clustering, eliminate abnormal interference through cluster screening, and adapt to different power plant scenarios through adaptive features.
[0045] A12: Calculate the mean of the data error corresponding to the normal sample and the standard deviation of the data error corresponding to the normal sample.
[0046] A13: Based on the mean, standard deviation, and optimized calculation parameters, the updated dynamic threshold is calculated. The normal data retention rate of the updated dynamic threshold is higher than the preset retention rate threshold.
[0047] In some embodiments, the mean of the data error corresponding to the normal sample refers to the arithmetic mean of the data errors corresponding to the normal sample, reflecting the average level of the normal sample error. The standard deviation of the data error corresponding to the normal sample refers to the dispersion index of the data error corresponding to the normal sample, reflecting the range of fluctuation of the normal sample error around the mean.
[0048] In some embodiments, the normal data retention rate refers to the proportion of samples in the data error corresponding to normal samples whose MSE is ≤ the updated dynamic threshold, reflecting the threshold's inclusiveness towards normal data. The normal data retention rate quantifies the risk of misjudging normal data by the threshold.
[0049] In some embodiments, the preset retention rate threshold is a pre-set minimum retention rate standard to ensure that normal data is not overly misjudged, and serves as a constraint to verify the reasonableness of the threshold. The preset retention rate threshold sets the minimum acceptable standard. If the retention rate is lower than the preset retention rate, it indicates that the threshold is too tight, and the parameters need to be re-optimized to improve the retention rate, thereby avoiding invalid alarms and increasing operation and maintenance costs.
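Steps A12 and A13, together with the retention rate check, can be sketched as below. This is an illustration under assumptions: the threshold is taken to be a weighted mean plus a weighted standard deviation of the normal-sample errors, and the names `alpha`, `beta`, and `PRESET_RETENTION`, the value 0.95, and the use of the population standard deviation are hypothetical choices for the example.

```python
from statistics import mean, pstdev

PRESET_RETENTION = 0.95  # hypothetical preset retention rate threshold

def dynamic_threshold(normal_errors, alpha, beta):
    """Weighted mean plus weighted standard deviation of normal-sample errors."""
    return alpha * mean(normal_errors) + beta * pstdev(normal_errors)

def retention_rate(normal_errors, threshold):
    """Share of normal-sample errors that do not exceed the threshold."""
    kept = sum(1 for e in normal_errors if e <= threshold)
    return kept / len(normal_errors)

normal_errors = [0.10, 0.12, 0.11, 0.13, 0.09]

# A tighter threshold misjudges one normal sample and fails the check.
tight = dynamic_threshold(normal_errors, alpha=1.0, beta=1.0)
tight_rate = retention_rate(normal_errors, tight)

# A looser threshold retains every normal sample.
loose = dynamic_threshold(normal_errors, alpha=1.0, beta=2.0)
loose_rate = retention_rate(normal_errors, loose)

needs_reoptimization = tight_rate <= PRESET_RETENTION
```

With these values the tighter threshold retains four of the five normal samples, which falls below the assumed preset rate and would trigger re-optimization, while the looser threshold retains all five.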
[0050] As one possible implementation, embodiments of the present invention ensure the reliability of the statistical basis by screening clean, normal sample errors, ensure that the threshold conforms to the data pattern by standardizing the fusion logic of statistical features and optimizing parameters, and ensure that the threshold meets the operational and maintenance requirements by constraining the retention rate.
[0051] Step 105: Based on the updated dynamic threshold, perform anomaly detection on the real-time active power time series data to obtain anomaly detection results.
[0052] In some embodiments, the anomaly detection results include anomaly markers (normal / abnormal), the time interval corresponding to the abnormal sample (e.g., 1 day, 1 hour), and real-time data errors, indicating anomalies in a certain time period.
[0053] As one possible implementation, step 105 can be specifically implemented as steps A21-A24.
[0054] A21: Based on the real-time detected active power time series data and the autoencoder, extract the key features of the normal pattern from the real-time detected active power time series data.
[0055] A22: Based on key features, obtain fitting data for the normal pattern.
[0056] A23: The real-time data error is calculated based on the real-time detected active power time series data and the fitted data of the normal mode.
[0057] A24: If the real-time data error is higher than the updated dynamic threshold, then the real-time detected active power time series data is abnormal.
[0058] In some embodiments, the real-time detected active power time series data refers to the active power data collected on a minute-by-minute basis during the current operating period of the photovoltaic system, in the format of a real-time timestamp and a real-time power value.
[0059] In some embodiments, the autoencoder is a trained 13-layer fully connected deep autoencoder, which includes an encoder (which compresses the input data into low-dimensional latent features) and a decoder (which reconstructs the original dimensional data from the latent features). Its core function is to learn the temporal patterns of normal photovoltaic power patterns.
[0060] In some embodiments, the key feature of the normal mode refers to the low-dimensional latent feature vector output by the "encoder" part of the autoencoder. This vector condenses the core information in the real-time power data that conforms to the normal operating rules (such as daily periodicity, power peak periods, and smooth fluctuation trends), eliminating noise and abnormal interference.
[0061] In some embodiments, the fitted data of the normal mode refers to the power data generated by the autoencoder "decoder" part based on the key features of the normal mode, which is consistent with the dimension of the real-time data. It is a concrete expression of the key features of the normal mode and represents the power curve that the current real-time data should present if it is running normally.
[0062] In some embodiments, real-time data error refers to the difference between the real-time detected active power time series data and the fitted data of the normal mode.
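The flow of steps A21 to A24 can be sketched as follows. The `encode` and `decode` functions here are deliberately simple stand-ins (a single scale factor applied to a fixed bell-like shape), not the trained 13-layer autoencoder; they, together with `NORMAL_SHAPE`, the threshold value, and the result format, are assumptions made only so the example is self-contained.

```python
NORMAL_SHAPE = [0.0, 0.5, 1.0, 0.5, 0.0]  # hypothetical normal daily profile

def encode(window):
    """Stand-in encoder: one latent feature, the best-fit scale of the profile."""
    num = sum(x * s for x, s in zip(window, NORMAL_SHAPE))
    den = sum(s * s for s in NORMAL_SHAPE)
    return [num / den]

def decode(features):
    """Stand-in decoder: rebuild a full-length curve from the latent feature."""
    return [features[0] * s for s in NORMAL_SHAPE]

def detect(window, threshold):
    """Apply steps A21 to A24 to one window of real-time power values."""
    features = encode(window)   # A21: key features of the normal pattern
    fitted = decode(features)   # A22: fitted data of the normal pattern
    error = sum((x - y) ** 2 for x, y in zip(window, fitted)) / len(window)  # A23
    return {"abnormal": error > threshold, "error": error}  # A24

threshold = 0.1  # hypothetical updated dynamic threshold
normal_result = detect([0.0, 2.0, 4.0, 2.0, 0.0], threshold)
faulty_result = detect([0.0, 2.0, 0.0, 2.0, 0.0], threshold)
```

The first window is a scaled copy of the assumed normal profile, so it is reconstructed exactly and passes; the second window has a midday power drop that the stand-in decoder cannot reproduce, so its error exceeds the threshold and it is flagged.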
[0063] As one possible implementation, this embodiment of the invention transforms the abstract anomaly detection logic into a standardized, quantifiable, and implementable real-time judgment process. This provides a concrete reference for real-time data, avoids subjective bias by quantifying errors, and ensures accurate and consistent judgment criteria by connecting with previous optimization results.
[0064] As one possible implementation, the embodiments of the present invention not only achieve real-time identification of anomalies, solving the core pain points of photovoltaic power plants, but also output structured results to support operation and maintenance decisions, improve fault handling efficiency, and verify the effectiveness of preceding technologies through result feedback, forming a closed-loop optimization.
[0065] In this embodiment of the invention, the criterion for judging anomalies in active power time series data is quantified by calculating data errors, which removes subjective judgment. The calculation parameters are iteratively optimized using the F1 score, which balances precision and recall, reducing false positives on normal data and false negatives on abnormal data. Based on the data errors and the optimized calculation parameters, an updated dynamic threshold is calculated that delimits the normal and abnormal error ranges and keeps detection accuracy stable under multiple operating conditions. With the updated dynamic threshold, anomalies in photovoltaic power data are ultimately detected accurately.
[0066] Optionally, before step A21, the anomaly detection method may further include steps A31-A36.
[0067] A31: Extract multiple abnormal samples based on the historical active power time series data of the photovoltaic system.
[0068] A32: Calculate the synthetic value score of each anomalous sample based on its local density factor and global density factor. The local density factor is the proportion of normal samples among a preset number of nearest neighbors of the anomalous sample, and the global density factor is the relative density of the anomalous sample among all anomalous samples.
[0069] A33: Determine the generation weight of each anomalous sample based on its synthetic value score.
[0070] A34: Based on the generation weight, determine the number of new abnormal samples corresponding to the abnormal sample.
[0071] A35: Based on each anomalous sample, the number of its nearest neighbor samples, and the number of new anomalous samples corresponding to each anomalous sample, the new anomalous sample corresponding to each anomalous sample is calculated.
[0072] A36: Obtain enhanced active power time series data based on the active power time series data and the new anomalous samples.
[0073] In some embodiments, anomaly samples refer to samples identified from historical active power data that do not conform to normal operating patterns. These samples are characterized by abrupt changes, sudden drops, or abnormal fluctuations in the power curve and serve as seed samples for the subsequent synthesis of new anomaly samples.
[0074] In some embodiments, the local density factor is an indicator that quantifies the local danger level of a single anomalous sample, referring to the proportion of normal samples among a predetermined number of its nearest neighbors. This helps identify dangerous samples that are easily overlooked by the model. If an anomalous sample is surrounded by normal samples, the model may fail to learn its features due to majority class suppression; therefore, it is necessary to synthesize and supplement such samples to avoid missed detections.
[0075] In some embodiments, the global density factor is an indicator that quantifies the global rarity of a single outlier sample, referring to its relative density among all outlier samples. This avoids missing samples in sparse regions. If an outlier sample is extremely sparse in the global distribution, it indicates that there are few samples of that type of outlier (such as sudden inverter failure). Synthesizing new samples can supplement this type of anomaly and improve the model's ability to identify rare anomalies. The global density factor and local density factor work together to evaluate sample value from two dimensions: local risk and global rarity, avoiding bias from a single dimension.
[0076] In some embodiments, the synthetic value score is a comprehensive metric that measures the priority of an anomalous sample for synthesis, calculated from its local and global density factors.
[0077] In some embodiments, the generation weight is the normalized synthetic value score and represents the proportion of new anomalous samples that should be generated from a single anomalous sample. To avoid the waste of resources that would result from treating all samples equally, samples with high synthetic value scores (e.g., locally dangerous and globally sparse) are given high weights and generate more new samples, while samples with low synthetic value scores (e.g., locally safe and globally dense) are given low weights and generate fewer new samples. This concentrates the synthesis budget on high-value samples and improves data augmentation efficiency.
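The allocation described in steps A33 and A34 can be illustrated with a minimal Python sketch. It assumes the synthetic value scores have already been computed; the function name `allocate_new_samples`, the `total_new` budget, and the largest-remainder rounding rule are illustrative assumptions, not details taken from the invention.

```python
def allocate_new_samples(svs_scores, total_new):
    """Split a budget of synthetic samples across seed anomalies by SVS.

    Hypothetical helper: the rounding rule (largest remainder) is an
    assumption made so that the counts add up exactly to `total_new`.
    """
    total = sum(svs_scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    # Generation weight = normalized synthetic value score.
    weights = [score / total for score in svs_scores]
    # Start from the floor of each share of the budget.
    counts = [int(w * total_new) for w in weights]
    # Hand out what flooring left over, largest fractional part first.
    by_remainder = sorted(
        range(len(weights)),
        key=lambda i: weights[i] * total_new - counts[i],
        reverse=True,
    )
    for i in by_remainder[: total_new - sum(counts)]:
        counts[i] += 1
    return weights, counts


# Three seed anomalies; the first has the highest synthetic value score.
weights, counts = allocate_new_samples([2.0, 1.0, 1.0], total_new=8)
print(weights)  # [0.5, 0.25, 0.25]
print(counts)   # [4, 2, 2]
```

Seeds with higher scores receive proportionally more of the budget, which matches the intent of concentrating synthesis on high-value samples.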
[0078] In some embodiments, the new anomalous samples are anomalous data that conforms to the time series pattern of photovoltaic power, synthesized from seed anomalous samples and their nearest neighbors. By using new anomalous samples to improve data imbalance, the model can learn normal and anomalous features equally.
[0079] In some embodiments, the enhanced active power time series data refers to the set of normal samples, real abnormal samples, and new abnormal samples built from the original historical data. In this set the proportion of abnormal samples is increased and the data balance is significantly improved, so it can be used directly for training the autoencoder.
[0080] As one possible implementation, the embodiments of the present invention not only solve the model bias problem caused by data imbalance, but also ensure the validity of the synthetic samples through rule-based constraints, ultimately enabling the model to fully learn normal and abnormal features and achieve high-precision, highly robust anomaly detection.
[0081] Optionally, before step A35, the anomaly detection method may further include steps A41-A42.
[0082] A41: If the global density factor of an abnormal sample is higher than the preset density threshold, the nearest neighbor sample corresponding to the abnormal sample is determined according to the preset distance range and the number of new abnormal samples corresponding to the abnormal sample.
[0083] A42: If the global density factor of an abnormal sample is less than or equal to a preset density threshold, a new abnormal sample is determined based on a preset number of neighboring samples of that abnormal sample.
[0084] In some embodiments, the preset density threshold is a pre-defined critical value used to distinguish between globally dense and globally sparse anomalous samples. Samples with a global density factor greater than the preset density threshold are considered dense samples, while those with a global density factor less than or equal to the preset density threshold are considered sparse samples.
[0085] In some embodiments, the preset distance range is a distance boundary set for globally dense anomaly samples to limit the search for nearest neighbor samples, ensuring that the feature similarity between the nearest neighbor samples and the target anomaly sample is high and avoiding the introduction of irrelevant samples.
[0086] In some embodiments, the preset number of nearest neighbor samples is a fixed number of nearest neighbor samples set for global sparse anomalous samples, ensuring that even if the sample distribution is sparse (few surrounding anomalous samples), a sufficient number of nearest neighbor samples can be obtained to synthesize new anomalous samples, avoiding synthesis failure due to insufficient nearest neighbors.
[0087] In some embodiments, a nearest neighbor sample refers to a sample selected from all anomalous samples that has similar characteristics to the target anomalous sample (determined by distance or quantity rules), and serves as a reference sample for synthesizing a new anomalous sample. The morphology of the new anomalous sample is jointly determined by the target anomalous sample and the nearest neighbor sample.
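The differentiated neighbor selection of steps A41 and A42 can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is used as the similarity measure, and the names `select_neighbors`, `density_threshold`, `max_distance`, and `k` are hypothetical. Step A41 also mentions the number of new samples, which this sketch does not model.

```python
import math


def select_neighbors(seed_index, anomalies, global_density,
                     density_threshold, max_distance, k):
    """Return indices of the neighbor samples used to synthesize from a seed.

    Dense seeds (A41) keep only neighbors inside a preset distance range;
    sparse seeds (A42) always take a preset number of nearest neighbors.
    """
    seed = anomalies[seed_index]
    # Distance from the seed to every other anomalous sample, nearest first.
    ranked = sorted(
        (math.dist(seed, other), j)
        for j, other in enumerate(anomalies)
        if j != seed_index
    )
    if global_density[seed_index] > density_threshold:
        # Globally dense: restrict the search to the preset distance range.
        return [j for d, j in ranked if d <= max_distance]
    # Globally sparse: a fixed neighbor count guarantees synthesis can proceed.
    return [j for _, j in ranked[:k]]


anomalies = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (9.0, 9.0)]
density = [0.9, 0.9, 0.9, 0.1, 0.1]  # assumed precomputed global density factors

dense_neighbors = select_neighbors(0, anomalies, density, 0.5, 0.15, 2)
sparse_neighbors = select_neighbors(3, anomalies, density, 0.5, 0.15, 2)
print(dense_neighbors)   # [1]
print(sparse_neighbors)  # [4, 2]
```

The dense seed keeps only its single close neighbor, while the sparse seed still obtains two neighbors even though they are far away.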
[0088] As one possible implementation, this invention uses a differentiated nearest neighbor selection rule to remove redundancy from dense samples and increase the number of sparse samples, ensuring that the generated new abnormal samples not only conform to the photovoltaic power time series pattern, but also comprehensively cover different types and forms of anomalies. Ultimately, this provides high-quality, comprehensive training data for the autoencoder, which is a key guarantee for achieving the technical goals of high robustness and a low false negative rate in anomaly detection.
[0089] Optionally, after step 103, the anomaly detection method may further include steps A51-A52.
[0090] A51: The matching bias is calculated based on the mean of the data error corresponding to the normal sample and the standard deviation of the data error corresponding to the normal sample.
[0091] A52: If the matching deviation exceeds the deviation threshold, the calculation parameters are re-optimized. The adjustment range of the calculation parameters is calculated by the learning rate and the F1 score.
[0092] In some embodiments, the matching bias refers to the quantified value of the deviation between the mean / standard deviation of the current normal sample error and the mean / standard deviation of the normal sample error during historical training.
[0093] In some embodiments, the deviation threshold is a pre-set critical value used to determine whether the matching deviation exceeds the acceptable range. If the matching deviation is greater than the deviation threshold, it indicates that the normal error distribution has changed too much and the original calculation parameters are no longer suitable; if it is less than or equal to the threshold, no adjustment is required.
[0094] In some embodiments, the learning rate is a proportional coefficient used to control the adjustment range of the calculation parameters. Its function is to prevent the threshold from changing abruptly due to excessive parameter adjustment and to ensure a smooth transition of the detection results.
[0095] As one possible implementation, this invention resolves the contradiction between parameter fixation and dynamic distribution through three steps: identifying distribution changes, re-optimizing parameters, and smoothing adjustments. This ensures that the dynamic threshold always maintains optimal performance with low false detections and low false negatives, regardless of changes in the environment or equipment status. This is a key guarantee for achieving high stability and long-term applicability.
[0096] Optionally, after step 105, the anomaly detection method may further include steps A61-A64.
[0097] A61: If the active power time series data is abnormal, divide the active power time series data and the normal mode data into blocks according to preset time intervals to obtain multiple sequence data blocks and normal mode data blocks.
[0098] A62: Based on each sequence data block and the corresponding normal mode data block, calculate the MSM distance of each sequence data block.
[0099] A63: Based on the MSM distance of each sequence data block and the preset anomaly threshold of the preset quantile, obtain the abnormal sequence block and the time corresponding to the abnormal sequence block.
[0100] A64: Based on the time corresponding to the abnormal sequence block, obtain the location of the abnormal data.
[0101] In some embodiments, MSM refers to the multi-scale sliding window matching algorithm. The core logic is: to construct a sliding window based on multiple preset time scales, calculate the matching degree between the real-time data error and the abnormal pattern in each window, and filter out the abnormal window with the highest matching degree and the finest time granularity through multi-scale comparison, so as to achieve accurate positioning of abnormal periods.
[0102] In some embodiments, the preset time interval is a pre-defined time granularity used to split long-series power data into short-series data blocks, which can be adjusted according to operation and maintenance needs.
[0103] In some embodiments, a sequence data block refers to a short time series segment obtained by splitting abnormal active power time series data into preset time intervals. Each data block contains a timestamp and the power value for the corresponding time period, and is the smallest analytical unit for anomaly localization.
[0104] In some embodiments, a normal mode data block refers to a short time-series segment obtained by splitting the normal mode data according to the same preset time interval and timestamp alignment principle as the sequence data block. Each normal data block and the corresponding sequence data block (within the same time period) form a "reference and judgment" pair.
[0105] In some embodiments, MSM distance refers to multi-scale matching distance, which is an indicator that quantifies the similarity between a sequence data block and a normal pattern data block. The smaller the distance, the closer the two are in shape (more likely to be normal); the larger the distance, the greater the difference in shape (more likely to be abnormal), adapting to the time offset and scale change characteristics of time series data.
[0106] In some embodiments, the preset anomaly threshold of the preset quantile refers to the anomaly determination threshold set based on the MSM distance statistics of historical normal data blocks. Data blocks in a sequence that exceed this value are determined to be abnormal.
[0107] In some embodiments, an abnormal sequence block refers to a sequence data block whose MSM distance is greater than a preset abnormal threshold. The time period corresponding to this data block is a fine-grained abnormal time period, which is a direct result of abnormal location localization.
[0108] In some embodiments, the location of the abnormal data refers to the specific start and end times of the abnormality determined based on the timestamp of the abnormal sequence block, directly pointing to the precise time point when the abnormality occurred.
[0109] As one possible implementation, this embodiment of the invention achieves a leap from coarse-scale to practical anomaly localization through data segmentation and similarity quantification, providing maintenance personnel with direct time-based information for troubleshooting.
[0110] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0111] The above embodiments discuss each step in detail on the basis of the method shown in Figure 1. To facilitate understanding of the complete execution process, the overall method flow is described below with reference to an embodiment.
[0112] Step 1: Data Preprocessing. Collect univariate active power time series data of the photovoltaic system, and fill in missing values through linear interpolation.
[0113] The data is normalized. The raw data, comprising daily photovoltaic power samples, is normalized using the min-max normalization formula, which scales the values of each feature to between 0 and 1 so that they can be compared and analyzed on a uniform scale. The normalization formula is as follows: $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of each feature of the photovoltaic power input, and $x'$ is the normalized value of each element.
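The two preprocessing operations of Step 1 can be sketched in plain Python as follows. This is an illustration only: missing readings are assumed to be represented by `None`, gaps at the very start or end of the series are filled with the nearest known value (an assumption, since the text does not say how edge gaps are handled), and the function names are hypothetical.

```python
def fill_missing_linear(values):
    """Fill None gaps by linear interpolation between the nearest known points."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # gap at the start: copy the first known value
            filled[i] = values[right]
        elif right is None:     # gap at the end: copy the last known value
            filled[i] = values[left]
        else:                   # interior gap: interpolate on the time index
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


def min_max_normalize(values):
    """Scale values to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


power = [0.0, None, 40.0, 80.0, None, None, 20.0]  # made-up readings
clean = fill_missing_linear(power)
scaled = min_max_normalize(clean)
```

In a deployment built on the libraries named in this document, the same operations would typically be delegated to the NumPy or Pandas equivalents.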
[0114] Step 2: Data Augmentation. Photovoltaic power data exhibits a clear diurnal periodicity and weather dependence. Normal data typically displays a smooth bell-shaped curve (midday peak), while anomalous data may show sudden changes, drops, or fluctuations. Existing data augmentation methods (such as SMOTE) have the following limitations when processing photovoltaic power time series: (1) They ignore temporal correlations: traditional methods interpolate in feature space and do not consider the temporal dependence and morphological continuity of photovoltaic power, so synthesized samples may disrupt the diurnal periodicity. (2) They are insensitive to local morphological changes: photovoltaic anomalies often manifest as local shape variations (such as a midday power collapse), but existing methods struggle to generate anomalous samples with temporal consistency. (3) They aggravate the data imbalance problem: photovoltaic data contains very few anomalous samples, with diverse morphologies, so general augmentation methods easily generate redundant or invalid samples, reducing model robustness.
[0115] The AEB-SMOTE algorithm proposed in this invention is optimized for the temporal and morphological characteristics of photovoltaic power data: (1) Temporal consistency: temporally adjacent points are prioritized when nearest neighbors are dynamically selected, ensuring continuity of the synthesized samples along the time axis. (2) Morphology-aware synthesis: based on typical photovoltaic power curves (e.g., a unimodal distribution), the generation range is constrained during sample synthesis to avoid generating unreasonable power values (e.g., high power at night). (3) Performance verification: as shown in Figures 2a and 2b, the anomalous samples generated by AEB-SMOTE are morphologically closer to real anomalies (such as a sudden drop at midday), significantly improving the model's F1 score under imbalanced data (see implementation results).
[0116] The AEB-SMOTE algorithm of this invention, that is, the ADASYN (Adaptive Synthetic Sampling)-Enhanced Borderline-SMOTE (Synthetic Minority Over-sampling Technique) algorithm, is specifically designed for the characteristics of photovoltaic power data. Its core idea is to combine the "local focusing" advantage of Borderline-SMOTE with the "adaptive" advantage of ADASYN, and to adjust the synthesis dynamically according to the global distribution density of the samples, thereby achieving more intelligent and safer sample synthesis. The principle of ADASYN is to assign different weights to minority class samples in different regions: difficult samples (those surrounded by many majority class samples) receive higher weights, so more new samples are generated from them.
[0117] Calculate the Synthetic Value Score (SVS) of each minority class sample from the following quantities:
[0118] LD (local density factor): the proportion of majority class samples among the nearest neighbors of the sample. The higher this proportion, that is, the more majority class samples there are in the area where the sample is located, the larger the LD value.
[0119] GD (global density factor): the relative density of the sample within the entire minority class dataset. The sparser the area in which the sample lies, the higher its value.
[0120] Trade-off parameter: a hyperparameter between 0 and 1 used to adjust the relative importance of local danger (LD) and global rarity (GD).
[0121] For each minority class sample, a weight is calculated by normalizing its SVS score. A higher weight means that more new samples need to be generated from that sample. For each seed sample used to generate new samples, the neighbor source is selected dynamically. If the GD value of the seed sample is very low (it lies in a globally sparse region), neighbors are selected from its nearest neighbors while the search scope is broadened to include more distant, but still similarly distributed, minority class neighbors to avoid overfitting. This helps to fill in sparsely distributed regions and make them more complete. If the GD value of the seed sample is very high (it lies in a globally dense region), neighbors are selected strictly from its nearest neighbors to avoid generating overly similar and redundant samples.
[0122] The synthesis uses the standard SMOTE formula, which generates a new instance on the line segment connecting the original instance with a randomly selected neighbor, as shown in the formula below: $$x_{new} = x_i + \delta \cdot (x_{zi} - x_i)$$
[0123] where $x_{new}$ is the synthetic photovoltaic power instance generated by AEB-SMOTE, $x_i$ is the minority class instance selected from the dataset, and $x_{zi}$ is one of the nearest neighbors of $x_i$ among the minority class instances. The position of the synthetic instance between the original instance and its nearest neighbor is controlled by $\delta$, drawn from the uniform distribution $U(0,1)$. This process continues until the desired inter-class balance is achieved.
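The standard SMOTE interpolation step referred to above can be sketched in a few lines of Python. Each sample is a list of power values for one day; the function names and the toy three-point curves are illustrative assumptions, and the AEB-SMOTE neighbor selection and range constraints are not modeled here.

```python
import random


def synthesize(seed, neighbor, rng=random):
    """Create one synthetic sample on the segment between seed and neighbor."""
    delta = rng.random()  # delta ~ U(0, 1), shared by all time points
    return [s + delta * (n - s) for s, n in zip(seed, neighbor)]


def synthesize_many(seed, neighbors, count, rng=random):
    """Create `count` synthetic samples, picking a random neighbor each time."""
    return [synthesize(seed, rng.choice(neighbors), rng) for _ in range(count)]


rng = random.Random(0)                 # fixed seed for reproducibility
seed_sample = [0.0, 0.5, 0.2]          # toy anomalous curve
neighbor_samples = [[0.0, 0.9, 0.4], [0.0, 0.7, 0.1]]
new_samples = synthesize_many(seed_sample, neighbor_samples, 5, rng)
```

Using a single `delta` for the whole curve keeps every synthetic sample on the line segment between two real curves, so values that are zero in both curves (for example at night) stay zero.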
[0124] Based on the characteristics of photovoltaic power data, the key parameters of AEB-SMOTE (the number of nearest neighbors and the trade-off parameter) are set as follows. Number of nearest neighbors: it is set according to the temporal continuity of photovoltaic power data so as to capture local morphological features. If abrupt changes occur in the data (such as a sudden drop in power due to cloud cover), the algorithm adjusts this value dynamically: it is increased automatically when an increase in local variance is detected, to enhance sample diversity, and reduced when the data is stationary, to maintain local consistency.
[0125] Trade-off parameter: it is used to balance the local density factor (LD) and the global density factor (GD). In photovoltaic scenarios, anomalies are often concentrated in specific time periods (such as midday), so the parameter is set to prioritize locally hazardous areas. If the overall data distribution changes (e.g., seasonal changes alter the shape of the power curve), GD and LD can be recalculated over a sliding window and updated dynamically.
[0126] Parameter adaptive principle: when photovoltaic power data experiences sudden changes or discrepancies, the system monitors the distribution of data errors in real time. If the error distribution deviates from the historical pattern, parameter re-optimization is triggered: the optimal parameters are searched for again within a preset range to ensure that data augmentation always adapts to the current data characteristics.
[0127] Step 3: Training the Deep Autoencoder. An autoencoder is an unsupervised machine learning algorithm that learns a compact, lower-dimensional representation of the input features. This latent representation captures fundamental patterns or features in the data, aiding tasks such as data compression, denoising, or anomaly detection. Through iterative encoding and decoding, the autoencoder learns to compress the input data into the lower-dimensional space and to reconstruct it back to its original form. An autoencoder consists of two parts: an encoder function $h = f(x)$, which learns the latent feature representation $h$ from the input $x$, and a decoder function $r = g(h)$, which reconstructs the input from the latent representation $h$. Training minimizes the following loss:
[0128] $$\min_{f,\,g} \; L\big(x,\, g(f(x))\big)$$
[0129] where $L$ is the loss function that measures the difference between the input photovoltaic sample ($x$) and the data generated by the autoencoder ($g(f(x))$).
[0130] Step 4: Anomaly Detection. Under normal conditions, photovoltaic power data is affected by factors such as solar irradiance and weather stability, and its daily power curve statistically approximates a normal distribution. Analysis of historical data from photovoltaic power plants shows, via the empirical cumulative distribution function (ECDF) of the data error (MSE) of normal daily power curves, that 82% of the data points have an MSE of less than 1 and that the error distribution is close to normal (Shapiro-Wilk test p-value > 0.05). Therefore, a threshold can be set based on the normal distribution.
[0131] The proposed anomaly detection mechanism has two stages. First, the autoencoder reconstructs the input, and a fault threshold is selected by a statistical method according to the distribution of data errors during the training phase. Second, the proposed data-driven model compares the preprocessed photovoltaic power input data with the data reconstructed by the trained autoencoder; if the error exceeds the defined threshold, the sample is marked as an outlier. Since the evaluation metrics for normal photovoltaic power data (such as the data error and its standard deviation) follow a normal distribution, the threshold can be set based on the statistical properties of this distribution.
[0132] To enhance the performance of the anomaly detection mechanism, a threshold is defined from the mean $\mu$ and the standard deviation $\sigma$ of the data error together with two calculation parameters. Based on the statistical properties of a normal distribution for normal data, a threshold placed two standard deviations from the mean covers approximately 95.4% of the normal data, one placed three standard deviations from the mean covers approximately 99.7%, and a still wider setting covers almost all normal data. The two parameters are therefore chosen, within their constraints, so that the resulting threshold maximizes the F1 score.
[0133] The F1 score to be maximized is given by the following formula: $$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
[0134] The two parameters are optimized by the coordinate descent method to maximize the F1 score, using the following steps: (1) Initial value setting: based on the characteristics of the normal distribution, the first parameter is initialized to 2.0 (covering 95.4% of normal data) and the second parameter is initialized to 0.
[0135] (2) Coordinate descent optimization: using the F1 score as the objective function, the optimal solution is searched for iteratively within the constraint range. In each iteration, the True Negative Rate (TNR) and the F1 score at the current threshold are calculated. If the TNR is below 95%, the threshold is raised by increasing one of the two parameters; if the F1 score decreases, the step size is reduced for fine-tuning.
[0136] (3) Convergence condition: the process stops when the change in the F1 score is less than 0.01 or the number of iterations exceeds 100, which yields the optimal values of the two parameters.
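The search in steps (1) to (3) can be sketched in Python as below. Several points are assumptions made only for illustration: the threshold is taken to have the form `mu + k * sigma + b`, with `k` starting at 2.0 and `b` at 0; the parameter bounds, the starting step, and the minimum step are hypothetical; and the TNR rule of step (2) is omitted to keep the sketch short. A sample is flagged as abnormal when its data error exceeds the threshold.

```python
from statistics import mean, pstdev


def f1_score(errors, labels, threshold):
    """F1 of the rule 'error > threshold means anomaly' (label 1 = anomaly)."""
    tp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 1)
    fp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 0)
    fn = sum(1 for e, y in zip(errors, labels) if e <= threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def optimize_threshold(errors, labels, bounds=((1.0, 4.0), (-1.0, 1.0)),
                       step=0.5, tol=0.01, max_iter=100, min_step=1e-3):
    """Coordinate search over (k, b) for the assumed form mu + k*sigma + b."""
    normal = [e for e, y in zip(errors, labels) if y == 0]
    mu, sigma = mean(normal), pstdev(normal)

    def score(p):
        return f1_score(errors, labels, mu + p[0] * sigma + p[1])

    params = [2.0, 0.0]                      # initial values from step (1)
    best = score(params)
    for _ in range(max_iter):                # at most 100 sweeps
        previous = best
        for axis in (0, 1):                  # one coordinate at a time
            for delta in (step, -step):
                trial = list(params)
                low, high = bounds[axis]
                trial[axis] = min(max(trial[axis] + delta, low), high)
                trial_score = score(trial)
                if trial_score > best:
                    params, best = trial, trial_score
        if best - previous < tol:            # gain below 0.01: refine the step
            step /= 2
            if step < min_step:
                break
    return mu + params[0] * sigma + params[1], params, best


# Made-up data errors: nine normal samples (label 0) and three anomalies (label 1).
errors = [0.10, 0.11, 0.09, 0.12, 0.10, 0.11, 0.13, 0.08, 0.16, 0.40, 0.55, 0.30]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
threshold, params, best_f1 = optimize_threshold(errors, labels)
```

On this toy data the initial threshold misclassifies the normal sample with error 0.16, and a single move along the first coordinate raises the threshold enough to remove that false positive.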
[0137] Meanwhile, the environment, equipment status, and seasons of photovoltaic power plants are constantly changing, and the distribution of newly generated data may gradually deviate from the distribution of the initial training data. At this point, the initial "normal distribution" assumption may become biased. Therefore, it is necessary to match the threshold relationship based on two parameters with the normal distribution law.
[0138] Ideally, the data error of normal data should follow a normal distribution exactly. However, the actual distribution may deviate from it, so a matching bias $D$ is defined to quantify the deviation.
[0139] The matching bias $D$ is computed from the actual errors and the corresponding quantiles of the standard normal distribution. The smaller the bias $D$, the closer the actual distribution is to a normal distribution.
[0140] The two threshold parameters are adjusted dynamically according to the deviation $D$.
[0141] In this adjustment, the learning rate is set to 0.01, and the partial derivatives of the F1 score with respect to the two parameters are estimated using the numerical difference method.
[0142] (1) Initial adjustment: Adjust the parameters with a large step size to quickly approach the optimal value.
[0143] (2) Fine-tuning: When D < 0.1, switch to fine-tuning with small steps until the F1 score converges.
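One possible way to realize the deviation check and the parameter update is sketched below. Both functions rest on assumptions: the bias is taken here as the mean absolute difference between the sorted standardized errors and the standard normal quantiles at the same ranks, and the update is a plain gradient-ascent step with central differences. The function names and the `eps` value are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def matching_bias(errors):
    """Mean absolute gap between standardized errors and normal quantiles.

    Assumed definition for illustration; smaller values mean the errors
    look more like a normal distribution.
    """
    mu, sigma = mean(errors), pstdev(errors)
    if sigma == 0:
        raise ValueError("errors must not be constant")
    standardized = sorted((e - mu) / sigma for e in errors)
    n = len(standardized)
    reference = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return mean([abs(z - q) for z, q in zip(standardized, reference)])


def gradient_step(objective, params, learning_rate=0.01, eps=1e-4):
    """Move each parameter uphill using a central-difference slope estimate."""
    updated = []
    for axis in range(len(params)):
        up, down = list(params), list(params)
        up[axis] += eps
        down[axis] -= eps
        slope = (objective(up) - objective(down)) / (2 * eps)
        updated.append(params[axis] + learning_rate * slope)
    return updated
```

Because the F1 score changes only when the threshold crosses a sample's error, its numerical derivative is zero almost everywhere for a small `eps`; a practical implementation would likely need a difference step large enough to cross neighboring error values.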
[0144] Anomaly markers are created by comparing the input values of a data-driven model with the output values of its autoencoder. If the error exceeds a defined threshold, such samples are marked as anomalies.
[0145] Step 5: Locating and classifying faults during abnormal periods. The Move-Split-Merge (MSM) metric is an elastic distance metric based on edit operations. Its core principle is to compute the distance between two time series as the minimum cumulative cost of the three edit operations (move, split, and merge) needed to transform one series into the other. MSM preserves temporal order and structure, enabling accurate comparisons even when sequences are misaligned. When the resolution is set to greater than 1 minute and the shape of the anomaly is taken into account, the proposed algorithm identifies significant deviations from normal patterns by calculating the MSM error.
[0146] The MSM error is calculated between the abnormal sample and the reconstructed values corresponding to that sequence.
[0147] Outlier locations are identified from the MSM distances between the blocks of the input data and the blocks of its normal-pattern fitted version, as follows. First, the required precision k is given, and the data of length n is divided into blocks according to the quotient of n divided by k. The MSM distance between the abnormal sample and the fitted data of the normal pattern is then calculated block by block, and the 95th percentile of the block MSM distances is used as the dynamic threshold. Time periods whose MSM distance exceeds this threshold are marked as abnormal, and the abnormality of the photovoltaic system can be classified based on the abnormal state.
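The block-wise localization can be sketched in Python as follows. The `msm_distance` function follows the commonly published dynamic-programming formulation of the Move-Split-Merge metric; the split/merge cost `c`, the nearest-rank percentile, the block length of 60 minutes, and the synthetic curves are assumptions chosen for illustration.

```python
import math


def _edit_cost(new, a, b, c):
    """Cost of a split or merge that introduces `new` next to a and b."""
    if a <= new <= b or b <= new <= a:
        return c
    return c + min(abs(new - a), abs(new - b))


def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance between two sequences (dynamic programming)."""
    m, n = len(x), len(y)
    cost = [[0.0] * n for _ in range(m)]
    cost[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        cost[i][0] = cost[i - 1][0] + _edit_cost(x[i], x[i - 1], y[0], c)
    for j in range(1, n):
        cost[0][j] = cost[0][j - 1] + _edit_cost(y[j], x[0], y[j - 1], c)
    for i in range(1, m):
        for j in range(1, n):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(x[i] - y[j]),                     # move
                cost[i - 1][j] + _edit_cost(x[i], x[i - 1], y[j], c),      # split/merge on x
                cost[i][j - 1] + _edit_cost(y[j], x[i], y[j - 1], c),      # split/merge on y
            )
    return cost[-1][-1]


def locate_abnormal_blocks(sample, fitted, block_len, quantile=0.95, c=0.1):
    """Return (start, end) index ranges whose MSM distance exceeds the quantile."""
    distances = [
        msm_distance(sample[s:s + block_len], fitted[s:s + block_len], c)
        for s in range(0, len(sample), block_len)
    ]
    ranked = sorted(distances)
    # Nearest-rank percentile of the block distances (assumed convention).
    threshold = ranked[min(len(ranked) - 1, math.ceil(quantile * len(ranked)) - 1)]
    blocks = [
        (i * block_len, min((i + 1) * block_len, len(sample)))
        for i, d in enumerate(distances) if d > threshold
    ]
    return blocks, threshold


# Synthetic day at 1-minute resolution: bell-shaped normal curve, drop at midday.
fitted = [max(0.0, math.sin(math.pi * (t - 360) / 720)) for t in range(1440)]
sample = list(fitted)
for t in range(720, 750):
    sample[t] = 0.0
abnormal, threshold = locate_abnormal_blocks(sample, fitted, block_len=60)
print(abnormal)  # [(720, 780)]
```

With a strict comparison against a percentile of the same distances, at least 20 blocks are needed for any block to exceed the 95th percentile; the threshold could instead be derived from distances of historical normal blocks.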
[0148] System implementation of the present invention: 1. Hardware: Can run on standard computing devices, including CPUs (such as Intel i7-8700K) and GPUs (such as Nvidia GeForce RTX 2080), and supports cloud platform expansion.
[0149] 2. Software: The software functions are implemented using Python, and the dependent libraries include NumPy, Pandas (data processing), Keras, TensorFlow (model training), Matplotlib, and Seaborn (visualization).
[0150] 3. Monitoring Implementation: Applicable to photovoltaic system data centers, integrated into SCADA systems to achieve minute-level anomaly alarms.
[0151] The present invention has the following beneficial effects: 1. This invention is label-free and data-independent: it only requires single-variable photovoltaic power data, without the need for labeled data or additional sensor information, reducing the cost and complexity of data acquisition, and is suitable for scenarios with limited historical data or privacy-sensitive scenarios.
[0152] 2. This invention has high detection accuracy: the F1 score reaches 0.9968 (training set) and 0.9535 (test set) with a single power variable as input. 3. This invention exhibits strong robustness: in a data imbalance scenario (10% anomaly rate), the F1 score is 0.9202. 4. This invention provides precise positioning: the MSM local adaptive mechanism takes abnormal shape changes into account, achieving minute-level positioning and improving fault diagnosis efficiency. 5. This invention has low cost: no labeled data or additional sensors are required, greatly reducing hardware costs. 6. This invention is flexible to deploy: it can be applied to both new and existing photovoltaic power plants and is easy to integrate into existing monitoring systems or cloud platforms. 7. This invention has wide applicability: it can be extended to fields such as photovoltaic system fault diagnosis, data quality assurance, energy forecasting, and security monitoring.
[0153] Case background: Implementation of an anomaly detection system for a 72kW photovoltaic power station.
[0154] Data preparation and preprocessing: the solar power station uses polycrystalline silicon modules and has a peak power output of 96kW. It consists of 400 modules, each with a power output of 240W.
[0155] (1) Data acquisition: Collect minute-level photovoltaic active power time series data.
[0156] (2) At the same time, exclude samples with too many missing values and fill the missing values by interpolation.
[0157] These data points are used as feature vectors in machine learning models used in anomaly detection.
[0158] (3) Data normalization: The matrix containing daily photovoltaic power samples will be normalized. Minimum-maximum normalization is performed using the formula to normalize each feature. The values are scaled to between 0 and 1 to achieve uniformity for effective comparison and analysis.
[0159] (4) AEB-SMOTE Data Generation: Due to the high imbalance of the real dataset, outlier samples are very rare compared to normal samples. To alleviate this problem and ensure fair comparison with supervised methods, additional synthetic data was generated using the AEB-SMOTE technique. This approach expands the dataset, providing the model with more representative minority class samples, promoting more robust training, and enabling more accurate evaluation of model performance under imbalanced conditions.
[0160] In this embodiment, the AEB-SMOTE parameters are preset. When the power data variance is detected to continuously exceed twice the historical mean, the system automatically initiates a parameter adjustment process so that the synthesized samples effectively cover the current abnormal patterns. AEB-SMOTE is applied to daily samples, each represented by a time series containing timestamp features. Figures 2a and 2b depict the original labeled samples and the anomalous samples generated by AEB-SMOTE.
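The following sketch illustrates the general idea of density-weighted oversampling of anomalous samples. It is not the AEB-SMOTE procedure itself, whose formulas are not reproduced in this text. The local density factor follows the definition given in the claims (share of normal samples among the nearest neighbors). Everything else is an assumption made for illustration: the function name, the inverse-distance estimate of the global density factor, the way the two factors are combined into a score, and the interpolation between an anomalous sample and one of its anomalous neighbors.

```python
import numpy as np

def density_weighted_oversample(normal, anomalous, n_new, k=3, seed=0):
    """Generate n_new synthetic anomalous samples, favouring high-score seeds."""
    rng = np.random.default_rng(seed)
    everything = np.vstack([normal, anomalous])
    is_normal = np.r_[np.ones(len(normal), bool), np.zeros(len(anomalous), bool)]

    local = np.empty(len(anomalous))       # share of normal samples among k neighbours
    mean_dist = np.empty(len(anomalous))   # mean distance to k anomalous neighbours
    neighbours = []
    for i, a in enumerate(anomalous):
        d_all = np.linalg.norm(everything - a, axis=1)
        d_all[len(normal) + i] = np.inf    # exclude the sample itself
        local[i] = is_normal[np.argsort(d_all)[:k]].mean()
        d_anom = np.linalg.norm(anomalous - a, axis=1)
        d_anom[i] = np.inf
        nn = np.argsort(d_anom)[:k]
        neighbours.append(nn)
        mean_dist[i] = d_anom[nn].mean()

    density = 1.0 / (mean_dist + 1e-12)
    global_density = density / density.max()         # ASSUMED relative-density estimate
    score = local + (1.0 - global_density) + 1e-6    # ASSUMED score: borderline and sparse seeds score higher
    weights = score / score.sum()

    # Turn weights into integer counts that add up to n_new.
    counts = np.floor(weights * n_new).astype(int)
    remainder = n_new - counts.sum()
    order = np.argsort(-(weights * n_new - counts))
    counts[order[:remainder]] += 1

    new_samples = []
    for i, count in enumerate(counts):
        for _ in range(count):
            j = rng.choice(neighbours[i])
            lam = rng.random()
            new_samples.append(anomalous[i] + lam * (anomalous[j] - anomalous[i]))
    return np.array(new_samples), counts
```

Because each synthetic sample is a convex combination of two existing anomalous samples, it stays inside the range already covered by the anomalous class.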
[0161] Model building and training. (1) Autoencoder architecture: input layer (1440 neurons), encoding layer (720 neurons, optimized), latent layer (360 neurons), decoding layer (720 neurons), and output layer (1440 neurons), with ReLU as the activation function. The layer structure is a 13-layer fully connected neural network (input 1440→1024→512→256→128→64→32→16→8→16→32→64→128→256→512→1024→output 1440).
[0162] The autoencoder is trained according to a formula, and the training process involves minimizing data error by iteratively reconstructing input samples. This is achieved through an optimization function that guides the autoencoder to effectively represent and reconstruct the input data.
[0163] (2) Optimization parameters: The hyperparameters were determined by Bayesian optimization: Dropout rate of 0.25, batch size of 128, learning rate of 0.0005, and Adam optimizer (β1=0.9, β2=0.999).
[0164] (3) Training: TensorFlow was used on Nvidia GPU with a batch size of 64, 100 epochs, and loss function MSE. The MSE of the training set eventually converged to 0.0021.
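To make the data flow concrete, the sketch below pushes a batch of daily samples through the fully connected layer chain listed above and computes the per-sample MSE that training minimizes. It uses NumPy with random weights and does no training. The He-style initialization, the use of ReLU on the output layer, and the function names are assumptions for illustration only; the embodiment itself was trained with TensorFlow.

```python
import numpy as np

# Layer widths as listed in the text, from the 1440-minute input to the 1440-minute output.
LAYER_SIZES = [1440, 1024, 512, 256, 128, 64, 32, 16, 8,
               16, 32, 64, 128, 256, 512, 1024, 1440]

def init_params(sizes, seed=0):
    """Random (weight, bias) pairs for each dense transition (assumed He-style init)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def reconstruct(x, params):
    """Forward pass: dense layer followed by ReLU at every step.

    Applying ReLU on the output as well is an assumption; it is harmless here
    because the inputs are min-max scaled to [0, 1].
    """
    h = x
    for weight, bias in params:
        h = np.maximum(h @ weight + bias, 0.0)
    return h

def reconstruction_mse(x, x_hat):
    """Mean squared reconstruction error, one value per sample."""
    return np.mean((x - x_hat) ** 2, axis=1)
```

After training, the per-sample value returned by `reconstruction_mse` plays the role of the data error used in the detection step.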
[0165] Anomaly detection. (1) Data error calculation: compare each input sample with its reconstruction by the autoencoder.
[0166] (2) Dynamic threshold calculation: compute the data error distribution on the validation set (20% of the normal data plus all synthetic outliers), and select the values of α and β in the threshold formula that maximize the F1 score.
[0167] The optimization problem is to find the α and β values that maximize the F1 score subject to the constraints, such as keeping the normal data retention rate above the preset retention rate threshold.
[0168] In this example, the coordinate descent method yields α = 1.5 and β = 2.0, at which the normal data retention rate reaches 96.5% and the F1 score reaches its maximum of 0.9968.
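The threshold search can be sketched as follows. The threshold formula is not reproduced in this text, so the form `alpha * mu + beta * sigma` used below is an assumption, as are the search grid, the `min_retention` value, and all function names. The two-cluster split of the errors mirrors the K-means step in the claims (two clusters, the one with the smaller center is treated as normal), implemented here as a small 1-D routine to keep the example self-contained.

```python
import numpy as np

def two_means_1d(errors, iters=50):
    """1-D K-means with 2 clusters; returns a mask of the lower-centre cluster."""
    centres = np.array([errors.min(), errors.max()], dtype=float)
    low = np.ones(errors.shape, dtype=bool)
    for _ in range(iters):
        low = np.abs(errors - centres[0]) <= np.abs(errors - centres[1])
        if low.all():  # all errors identical, nothing to split
            break
        updated = np.array([errors[low].mean(), errors[~low].mean()])
        if np.allclose(updated, centres):
            break
        centres = updated
    return low

def f1_score(y_true, y_pred):
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def optimise_threshold(errors, is_anomaly, grid=np.arange(0.5, 3.01, 0.25),
                       min_retention=0.95, sweeps=5):
    """Coordinate descent over (alpha, beta), maximising F1 under a retention constraint."""
    normal_mask = two_means_1d(errors)
    mu, sigma = errors[normal_mask].mean(), errors[normal_mask].std()

    def objective(alpha, beta):
        pred = errors > alpha * mu + beta * sigma   # ASSUMED threshold form
        retention = np.mean(~pred[~is_anomaly])     # share of normal samples kept
        return f1_score(is_anomaly, pred) if retention >= min_retention else -1.0

    alpha, beta = 1.0, 1.0
    for _ in range(sweeps):
        alpha = max(grid, key=lambda a: objective(a, beta))   # optimise alpha, beta fixed
        beta = max(grid, key=lambda b: objective(alpha, b))   # optimise beta, alpha fixed
    return alpha, beta, alpha * mu + beta * sigma
```

Coordinate descent only changes one parameter at a time, so it can stop at a local optimum; a full grid search over both parameters is a simple alternative when the grid is small.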
[0169] Anomaly location. (1) Set the positioning precision (hourly segments).
[0170] (2) Calculate the MSM distance block by block according to the formula, using the Euclidean distance.
[0171] (3) Dynamic threshold: each block takes its anomaly threshold at a preset quantile.
[0172] (4) Output the location of the anomaly. For example, Figure 3 shows the location of an abnormal segment during a certain period, characterized by a sudden drop in power in the middle of the period accompanied by abnormal fluctuations. The sudden midday drop may be due to temporary shading by clouds or obstructing objects, and the subsequent fluctuations may be caused by rapidly changing weather conditions.
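The localization steps can be illustrated with the sketch below, which assumes that MSM refers to the Move-Split-Merge time series distance and uses its standard dynamic programming recurrence with the absolute difference as the move cost. The split/merge cost `c`, the 60-minute block length, the 0.95 quantile, and the use of a single quantile computed over all blocks of the day are assumptions for illustration; the source does not give these values.

```python
import numpy as np

def _split_merge_cost(new, a, b, c):
    """Cost of inserting `new` next to neighbours a and b (Move-Split-Merge rule)."""
    if a <= new <= b or b <= new <= a:
        return c
    return c + min(abs(new - a), abs(new - b))

def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance between two 1-D sequences."""
    n, m = len(x), len(y)
    d = np.zeros((n, m))
    d[0, 0] = abs(x[0] - y[0])
    for i in range(1, n):
        d[i, 0] = d[i - 1, 0] + _split_merge_cost(x[i], x[i - 1], y[0], c)
    for j in range(1, m):
        d[0, j] = d[0, j - 1] + _split_merge_cost(y[j], x[0], y[j - 1], c)
    for i in range(1, n):
        for j in range(1, m):
            d[i, j] = min(
                d[i - 1, j - 1] + abs(x[i] - y[j]),                            # move
                d[i - 1, j] + _split_merge_cost(x[i], x[i - 1], y[j], c),      # split/merge on x
                d[i, j - 1] + _split_merge_cost(y[j], x[i], y[j - 1], c),      # split/merge on y
            )
    return d[-1, -1]

def locate_anomalous_blocks(actual, expected, block_len=60, quantile=0.95, c=0.1):
    """Return indices of blocks whose MSM distance exceeds the quantile threshold."""
    n_blocks = len(actual) // block_len
    dists = np.array([
        msm_distance(actual[k * block_len:(k + 1) * block_len],
                     expected[k * block_len:(k + 1) * block_len], c)
        for k in range(n_blocks)
    ])
    threshold = np.quantile(dists, quantile)
    return np.flatnonzero(dists > threshold), dists
```

With `block_len=60` and minute-level data, block index `k` corresponds to the hour starting at minute `60 * k`, which gives the time of the abnormal segment.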
[0173] Performance evaluation. The performance of this invention was evaluated through a comparative analysis of anomaly detection performance using precision, recall, accuracy, and F1 score.
[0174] Figure 4 presents a comparative analysis of anomaly detection performance for multiple thresholds, evaluated with classification metrics. The threshold proposed in this embodiment achieves the highest precision, 0.9936, indicating an extremely low false positive rate. It also achieves the highest accuracy, 0.9937, highlighting its ability to correctly classify the samples in the dataset. In addition, it achieves the highest F1 score, 0.9968, demonstrating balanced performance in detecting anomalies while minimizing false positives. This evaluation underscores the effectiveness of the proposed threshold for anomaly detection.
[0175] As shown in Figure 5, the balanced test set (160 normal / 160 abnormal samples) had an F1 score of 96.00%, and the imbalanced test set (10% abnormal) had an F1 score of 92.02%.
[0176] In the photovoltaic power plant test, the model identified almost all anomalies in the test dataset, achieving a recall of 96.89% and an F1 score of 96.00% on the balanced test set, and a recall of 93.41% and an F1 score of 92.02% on the imbalanced test set. Anomalies in photovoltaic systems can be caused by a variety of issues, including persistent power output deviations due to MPPT or converter failures, unstable power generation due to shading or panel degradation, erroneous power readings due to sensor or data logging errors, and fluctuations due to severe weather conditions. In testing, this example achieved minute-level positioning accuracy, supporting real-time maintenance decisions.
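For reference, the four metrics used in this evaluation follow directly from the confusion counts. The helper below is a minimal sketch with a hypothetical function name, and the counts in the usage example are toy values, not results from the embodiment.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, accuracy and F1 from confusion counts (anomaly = positive class)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Toy counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

Because F1 is the harmonic mean of precision and recall, it can never exceed the larger of the two, which is a quick consistency check when reading reported figures.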
[0177] The following are system embodiments of the present invention. For details not described in detail, please refer to the corresponding method embodiments described above.
[0178] Figure 6 illustrates the structure of a photovoltaic power data anomaly detection system based on unsupervised deep learning provided in an embodiment of the present invention. For ease of explanation, only the parts relevant to the embodiment are shown, as detailed below. As shown in Figure 6, the photovoltaic power data anomaly detection system 6 based on unsupervised deep learning includes: a communication module 61, used to acquire the historical active power time series data of the photovoltaic system; and a processing module 62, used to calculate the data error based on the active power time series data and the normal mode data; construct an objective function based on maximizing the F1 score and iteratively optimize the calculation parameters of the initial dynamic threshold to obtain optimized calculation parameters; calculate the updated dynamic threshold based on the data error and the optimized calculation parameters; and perform anomaly detection on the real-time detected active power time series data based on the updated dynamic threshold to obtain anomaly detection results.
[0179] Figure 7 is a schematic diagram of an electronic device provided in an embodiment of the present invention. As shown in Figure 7, the electronic device 7 of this embodiment includes a processor 70 and a memory 71. The memory 71 stores a computer program 72. When the processor 70 executes the computer program 72, it implements the steps in the various method embodiments described above. Alternatively, when the processor 70 executes the computer program 72, it implements the functions of each module / unit in the various device embodiments described above.
[0180] For example, computer program 72 may be divided into one or more modules / units, which are stored in memory 71 and executed by processor 70 to complete the present invention. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of computer program 72 in electronic device 7.
[0181] Electronic device 7 may include, but is not limited to, processor 70 and memory 71. Those skilled in the art will understand that Figure 7 is merely an example of electronic device 7 and does not constitute a limitation on it. Electronic device 7 may include more or fewer components than shown, combine certain components, or use different components. For example, it may also include input / output devices, network access devices, buses, etc.
[0182] For the sake of simplicity and clarity, only the above-described functional modules / units are used as examples. In practical applications, the functions described above can be assigned to different functional modules / units as needed. These modules / units can be implemented in hardware, software, or a combination of both.
[0183] In the above embodiments, the descriptions of each embodiment have their own emphasis. Parts not detailed or described in a particular embodiment can be referred to in the relevant descriptions of other embodiments. Unless otherwise specified or in conflict with logic, the terminology and / or descriptions between different embodiments are consistent and can be referenced interchangeably. Technical features in different embodiments can be combined to form new embodiments based on their inherent logical relationships.
[0184] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A photovoltaic power data anomaly detection method based on unsupervised deep learning, characterized in that it comprises: obtaining historical active power time series data of the photovoltaic system; calculating the data error based on the active power time series data and the normal mode data; constructing an objective function based on maximizing the F1 score, and iteratively optimizing the calculation parameters of the initial dynamic threshold to obtain optimized calculation parameters; calculating the updated dynamic threshold based on the data error and the optimized calculation parameters; and performing anomaly detection on the real-time active power time series data based on the updated dynamic threshold to obtain anomaly detection results.
2. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 1, characterized in that performing anomaly detection on the real-time active power time series data based on the updated dynamic threshold to obtain anomaly detection results includes: extracting key features of the normal pattern of the real-time detected active power time series data based on the real-time detected active power time series data and the autoencoder; obtaining fitting data for the normal pattern based on the key features; calculating the real-time data error based on the real-time detected active power time series data and the fitting data of the normal pattern; and, if the real-time data error is higher than the updated dynamic threshold, determining that the real-time detected active power time series data is abnormal.
3. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 1, characterized in that, The process of calculating the updated dynamic threshold based on the data error and the optimized calculation parameters includes: Based on the data errors, filter the data errors corresponding to normal samples; Calculate the mean of the data error corresponding to the normal sample and the standard deviation of the data error corresponding to the normal sample; Based on the mean, the standard deviation, and the optimized calculation parameters, the updated dynamic threshold is calculated, and the normal data retention rate of the updated dynamic threshold is higher than the preset retention rate threshold.
4. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 3, characterized in that, After constructing the objective function based on maximizing the F1 score, iteratively optimizing the calculation parameters of the initial dynamic threshold to obtain the optimized calculation parameters, the anomaly detection method further includes: The matching deviation is calculated based on the mean of the data error corresponding to the normal sample and the standard deviation of the data error corresponding to the normal sample. If the matching deviation exceeds the deviation threshold, the calculation parameters are re-optimized, and the adjustment range of the calculation parameters is calculated by the learning rate and the F1 score.
5. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 3, characterized in that, The step of filtering out the data errors corresponding to normal samples based on the data errors includes: Based on the data errors corresponding to all samples, K-means clustering is used to cluster the data errors to obtain two sample error clusters, with the number of clusters being 2. Based on the data errors of the two sample error clusters, the central error value of the two sample error clusters is calculated, and the sample error cluster with the smaller central error value is taken as the data error corresponding to the normal sample.
6. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 2, characterized in that, before extracting key features of the normal pattern of the real-time detected active power time series data based on the real-time detected active power time series data and the autoencoder, the anomaly detection method further includes: extracting multiple abnormal samples based on the historical active power time series data of the photovoltaic system; calculating the synthetic value score of each abnormal sample based on its local density factor and global density factor, where the local density factor is the proportion of normal samples among a preset number of nearest neighbors of the abnormal sample, and the global density factor is the relative density of the abnormal sample among all abnormal samples; determining the generation weight of each abnormal sample based on its synthetic value score; determining the number of new abnormal samples corresponding to each abnormal sample based on the generation weight; generating the new abnormal samples corresponding to each abnormal sample based on that abnormal sample, the nearest neighbor samples corresponding to it, and the number of new abnormal samples corresponding to it; and obtaining enhanced active power time series data based on the active power time series data and the new abnormal samples.
7. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 6, characterized in that, before generating the new abnormal samples corresponding to each abnormal sample based on that abnormal sample, the nearest neighbor samples corresponding to it, and the number of new abnormal samples corresponding to it, the anomaly detection method further includes: if the global density factor of an abnormal sample is higher than a preset density threshold, determining the nearest neighbor samples corresponding to the abnormal sample according to the preset distance range and the number of new abnormal samples corresponding to the abnormal sample; and, if the global density factor of an abnormal sample is less than or equal to the preset density threshold, determining the new abnormal sample based on a preset number of neighboring samples of the abnormal sample.
8. The photovoltaic power data anomaly detection method based on unsupervised deep learning according to claim 1, characterized in that, After performing anomaly detection on the real-time detected active power time series data based on the updated dynamic threshold and obtaining the anomaly detection result, the detection method further includes: If the active power time series data is abnormal, the active power time series data and the normal mode data are divided into blocks according to a preset time interval to obtain multiple sequence data blocks and normal mode data blocks. Based on each sequence data block and the corresponding normal mode data block, the MSM distance of each sequence data block is calculated; Based on the MSM distance of each sequence data block and the preset anomaly threshold of the preset quantile, the abnormal sequence blocks and the time corresponding to the abnormal sequence blocks are obtained. The location of the abnormal data is obtained based on the time corresponding to the abnormal sequence block.
9. A photovoltaic power data anomaly detection system based on unsupervised deep learning, characterized in that it comprises: a communication module, used to acquire the historical active power time series data of the photovoltaic system; and a processing module, used to calculate the data error based on the active power time series data and the normal mode data; construct an objective function based on maximizing the F1 score and iteratively optimize the calculation parameters of the initial dynamic threshold to obtain optimized calculation parameters; calculate the updated dynamic threshold based on the data error and the optimized calculation parameters; and perform anomaly detection on the real-time active power time series data based on the updated dynamic threshold to obtain anomaly detection results.
10. An electronic device, characterized in that, It includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method as described in any one of claims 1 to 8.