A method and apparatus for predicting power generation
By constructing a prediction model based on contrastive learning and Transformer, the diurnal differences in photovoltaic power generation are learned and parameters are fine-tuned, thus solving the problem of accuracy in photovoltaic power generation prediction and improving the stability and accuracy of grid dispatch.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUADIAN TRADING INTERNATIONAL (BEIJING) CO LTD
- Filing Date
- 2025-12-19
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies cannot effectively solve the intermittency and volatility of photovoltaic power generation, leading to increased grid dispatch complexity and power system stability issues, and the accuracy of photovoltaic power generation prediction is low.
By acquiring historical and meteorological data from photovoltaic power plants, a prediction model based on contrastive learning and Transformer is constructed. This model learns the diurnal differences in photovoltaic power generation and incorporates a learnable adapter for parameter fine-tuning, thereby enabling short-term prediction of photovoltaic power generation.
It improves the accuracy of photovoltaic power generation forecasting, reduces the complexity of grid dispatching, and enhances the stability of the power system.
Figure CN121906402B_ABST (abstract drawing)
Description
Technical Field
[0001] This application belongs to the field of data prediction technology, and in particular relates to a method and apparatus for predicting power generation.
Background Technology
[0002] The operational efficiency of photovoltaic (PV) power generation systems primarily relies on renewable solar energy resources. This type of energy has attracted widespread attention due to its environmental friendliness, renewability, and wide geographical distribution. During power generation, this energy produces almost no pollutant emissions, effectively reducing the ecological and environmental burden. However, PV power generation is characterized by intermittency, randomness, and volatility. The volatility primarily stems from the uncertainty and difficulty in predicting natural energy supply. Specifically, PV power generation systems are subject to the combined effects of multiple environmental parameters, including solar irradiance, cloud dynamics, diurnal cycles, and seasonal cycles, exhibiting significant energy output instability. Furthermore, the technological limitations of the power generation equipment itself (e.g., the nonlinear characteristics of energy conversion efficiency and performance degradation due to operational losses) further exacerbate system power fluctuations.
[0003] The intermittent nature of photovoltaic (PV) power generation increases the complexity of grid dispatching, requiring the dispatching system to construct a flexible response mechanism to maintain real-time power supply and demand balance by dynamically adjusting the output of conventional power sources. Power fluctuations can cause deviations in grid voltage amplitude and system frequency, which not only degrades power transmission quality but may also affect the performance of end-user equipment. When PV penetration reaches a critical threshold, severe power fluctuations may trigger power system stability issues, potentially jeopardizing the overall operational safety of the power grid.
[0004] There is currently no effective solution for accurately and efficiently predicting photovoltaic power generation.
Summary of the Invention
[0005] The purpose of this application is to provide a method and apparatus for predicting power generation, which can accurately and efficiently predict photovoltaic power generation.
[0006] This application provides a method and apparatus for predicting power generation, which is implemented as follows:
[0007] A method for predicting power generation, the method comprising:
[0008] Historical power generation data of the target photovoltaic power plant and multi-category meteorological forecast data for the corresponding time period are obtained as the original dataset;
[0009] A sampled dataset is formed by extracting data in preset proportions from the daytime data and the nighttime data in the original dataset;
[0010] For each sample in the original dataset, a sample is randomly selected from the sampled dataset to form a sample pair, which serves as the input data for the pre-training stage;
[0011] The original prediction model is pre-trained by contrastive learning using the input data, so as to learn the diurnal differences in photovoltaic power generation;
[0012] The photovoltaic day and night power generation patterns are retained by the model encoder through parameter sharing, and the parameters are fine-tuned with a learnable adapter to obtain the prediction model;
[0013] The prediction model is used to make short-term predictions of photovoltaic power generation for the target photovoltaic power plant.
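The sampling and pairing steps described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `is_day` field, the function name `build_sample_pairs`, the default ratios, and the rule that a pair is labelled 1 when both samples fall in the same day or night regime are all assumptions made for the example.

```python
import random

def build_sample_pairs(samples, day_ratio=0.5, night_ratio=0.5, seed=0):
    """Form a sampled dataset and sample pairs (hypothetical helper).

    samples: list of dicts, each with a boolean 'is_day' field (assumed layout).
    Returns (sampled_dataset, pairs), where each pair is (anchor, partner, label).
    """
    rng = random.Random(seed)
    day = [s for s in samples if s["is_day"]]
    night = [s for s in samples if not s["is_day"]]

    # Extract the preset proportion from the daytime and nighttime data separately.
    sampled = []
    for group, ratio in ((day, day_ratio), (night, night_ratio)):
        if group:
            k = max(1, round(len(group) * ratio))
            sampled.extend(rng.sample(group, k))

    # For every sample in the original dataset, draw one partner at random.
    pairs = []
    for anchor in samples:
        partner = rng.choice(sampled)
        # ASSUMPTION: label 1 = same day/night regime, 0 = different regime.
        label = 1 if anchor["is_day"] == partner["is_day"] else 0
        pairs.append((anchor, partner, label))
    return sampled, pairs
```

Drawing the two groups separately keeps nighttime samples (where output is near zero) from being crowded out by daytime samples, or the reverse, when the pairs are formed.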
[0014] In one implementation, the short-term prediction of photovoltaic power generation of the target photovoltaic power plant using the prediction model includes:
[0015] Input the original dataset into the prediction model;
[0016] The prediction model performs model preprocessing, encoder feature extraction, and decoding to obtain the power output, so as to output photovoltaic power predictions at intervals of a preset number of minutes over a future period of predetermined length.
[0017] In the prediction model, the decoder initially takes a zero-value tensor as input and uses its own predictions as subsequent inputs to gradually generate an output sequence during the prediction phase.
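The step-by-step generation described above, in which the decoder starts from a zero-valued input and feeds its own predictions back in, might look like the following sketch. The callable `decoder_step` and the name `memory` for the encoder features are hypothetical; they stand in for the trained decoder and the encoder output.

```python
def autoregressive_forecast(decoder_step, memory, horizon):
    """Generate `horizon` predictions one step at a time.

    decoder_step(memory, inputs) -> next value. This is a hypothetical callable
    standing in for the trained decoder; `memory` is the encoder output.
    """
    inputs = [0.0]            # the decoder initially receives a zero-valued input
    outputs = []
    for _ in range(horizon):
        y_next = decoder_step(memory, inputs)
        outputs.append(y_next)
        inputs.append(y_next)  # the decoder's own prediction becomes the next input
    return outputs
```

For instance, with a toy step function that adds the memory value to the last input, `autoregressive_forecast(lambda m, x: x[-1] + m, 1.0, 3)` produces three successive values, each built on the previous prediction.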
[0018] In one implementation, pre-training the original prediction model by contrastive learning using the input data to learn the diurnal differences in photovoltaic power generation includes:
[0019] The time string in the input data is converted into a time object, and the time features of the object are extracted. The time features include year, month, day, and hour. The hour is cyclically encoded to preserve its periodicity;
[0020] Spatial averaging is performed on the meteorological elements in the input data, and the meteorological elements are time-aligned with the power data;
[0021] The encoder extracts features from the input feature matrix containing time and meteorological features. The encoder is composed of N identical encoder blocks stacked together. Each encoder block contains multiple sub-layers: multi-head self-attention, a feedforward network, residual connections, and layer normalization, where N is a positive integer;
[0022] Feature enhancement is performed on the output of the encoder;
[0023] The output results after feature enhancement are pre-trained using contrastive learning.
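The preprocessing in the first two steps above can be illustrated with a short sketch. The source does not state how the hour is encoded, so the sine and cosine encoding used here is an assumption, as are the timestamp format, the function names, and the dictionary-based alignment.

```python
import math
from datetime import datetime

def time_features(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a time string and extract year, month, day, hour.

    ASSUMPTION: the hour is encoded as a (sin, cos) pair on a 24-hour circle,
    so that 23:00 and 01:00 end up close to each other.
    """
    t = datetime.strptime(ts, fmt)
    angle = 2.0 * math.pi * t.hour / 24.0
    return {"year": t.year, "month": t.month, "day": t.day, "hour": t.hour,
            "hour_sin": math.sin(angle), "hour_cos": math.cos(angle)}

def spatial_mean(grid):
    """Average one meteorological element over all grid points (2-D list)."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)

def align(power_by_time, weather_by_time):
    """Keep only the timestamps present in both series.

    Keys are assumed to be ISO-style strings, so sorting them lexically
    also sorts them chronologically.
    """
    common = sorted(set(power_by_time) & set(weather_by_time))
    return [(ts, power_by_time[ts], weather_by_time[ts]) for ts in common]
```

A plain integer hour would place 23 and 0 at opposite ends of the scale; a circular encoding avoids that discontinuity, which is the property the text refers to as preserving periodicity.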
[0024] In one implementation, feature extraction is performed on an input feature matrix containing time and meteorological features using an encoder, including:
[0025] For each encoder block in the encoder, the calculation is performed as follows:
[0026] Calculate the projection of the query, key, and value:
[0027] $$Q^{l} = Z^{l-1} W_{Q}^{l}, \qquad K^{l} = Z^{l-1} W_{K}^{l}, \qquad V^{l} = Z^{l-1} W_{V}^{l}$$
[0028] Where $Q^{l}$ represents the query matrix of the l-th layer, the superscript l indicating the layer number; $Z^{l-1}$ represents the output of layer l-1 and the input of layer l; $K^{l}$ represents the key matrix of the l-th layer; $V^{l}$ represents the value matrix of the l-th layer; $W_{Q}^{l}$, $W_{K}^{l}$, and $W_{V}^{l}$ represent the weight matrices of the query, key, and value of the l-th layer, each with dimension $d_{model} \times d_{n}$, where $d_{model}$ represents the dimension of the encoder and $d_{n}$ is the feature dimension of the queries, keys, and values;
[0029] Calculate attention score:
[0030] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{n}}}\right) V$$
[0031] $$\mathrm{head}_{i}^{l} = \mathrm{Attention}\left(Z^{l-1} W_{Q,i}^{l},\; Z^{l-1} W_{K,i}^{l},\; Z^{l-1} W_{V,i}^{l}\right)$$
[0032] Where K is the key matrix for calculating the attention score, and $K^{T}$ indicates that K is transposed to satisfy the requirements of matrix multiplication; $\mathrm{head}_{i}^{l}$ represents the attention result of the i-th single head in layer l; $W_{Q,i}^{l}$ represents the weight matrix of the query matrix for the i-th head in the l-th layer; $W_{K,i}^{l}$ represents the weight matrix of the key matrix for the i-th head in the l-th layer; $W_{V,i}^{l}$ represents the weight matrix of the value matrix for the i-th head in the l-th layer, where i denotes the i-th head;
[0033] After obtaining single-head attention, we concatenate them to obtain multi-head attention:
[0034] $$\mathrm{MultiHead}^{l} = \mathrm{Concat}\left(\mathrm{head}_{1}^{l}, \ldots, \mathrm{head}_{he}^{l}\right) W_{O}^{l}$$
[0035] Wherein, Concat represents the concatenation operation of matrices or vectors; $W_{O}^{l}$ represents the output weight matrix of the multi-head attention of the l-th layer encoder, where the subscript O indicates the output, the superscript l indicates the layer number, and he indicates the number of heads;
[0036] Residual connections and layer normalization are performed on the output of the multi-head self-attention sublayer:
[0037] $$Z'^{l} = \mathrm{LayerNorm}\left(Z^{l-1} + \mathrm{MultiHead}^{l}\right)$$
[0038] Where LayerNorm represents layer normalization, and $Z'^{l}$ represents the output of the multi-head self-attention sublayer in the l-th encoder, the prime (′) distinguishing it from the output $Z^{l}$ of the l-th encoder;
[0039] In the feedforward network sublayer, perform the feedforward transformation:
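[0040] $$\mathrm{FFN}^{l}(x) = \mathrm{ReLU}\left(x W_{1}^{l}\right) W_{2}^{l}$$
[0041] Where $\mathrm{FFN}^{l}$ represents the feedforward neural network in the l-th layer encoder, and x represents the input of the neural network; $W_{1}^{l} \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_{2}^{l} \in \mathbb{R}^{d_{ff} \times d_{model}}$ represent the two weight matrices used for the feedforward transformation in the l-th layer, and $d_{ff}$ represents the dimension of the hidden layer in the feedforward neural network;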
[0042] The output of the feedforward transform is then subjected to residual connection and layer normalization:
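[0043] $$Z^{l} = \mathrm{LayerNorm}\left(Z'^{l} + \mathrm{FFN}^{l}\left(Z'^{l}\right)\right)$$ where $Z'^{l}$ is the output of the multi-head self-attention sublayer and $Z^{l}$ is the output of the l-th encoder block;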
[0044] After passing through N identical encoder blocks, the output of the encoder is obtained:
[0045] $$Z^{N} = \mathrm{Encoder}\bigl(\mathrm{Encoder}\bigl(\cdots \mathrm{Encoder}(X) \cdots\bigr)\bigr)$$
[0046] Where Encoder is applied N times; $Z^{N}$ represents the output of the N-th layer encoder, which is also the final output of the encoder; N is the number of encoder blocks; X represents the input feature matrix containing time and meteorological features; and Encoder represents all operations of a single-layer encoder.
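The attention sublayer of one encoder block can be sketched in plain Python as below. To keep the example short it uses a single head with $d_{n} = d_{model}$ and omits the output projection; the scaling by $\sqrt{d_{n}}$ follows the standard Transformer formulation and is an assumption here, as are all function names.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(z, w_q, w_k, w_v):
    """Single-head attention over the rows (time steps) of z."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_n = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    # ASSUMPTION: scores are scaled by sqrt(d_n), as in the standard Transformer.
    scores = [[s / math.sqrt(d_n) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learnable scale)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def attention_sublayer(z, w_q, w_k, w_v):
    """Residual connection followed by layer normalization, row by row."""
    out, _ = attention(z, w_q, w_k, w_v)
    return [layer_norm([a + b for a, b in zip(z_row, o_row)])
            for z_row, o_row in zip(z, out)]
```

Each row of the attention weights sums to 1, so every time step's output is a weighted average of the value vectors of all time steps.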
[0047] In one implementation, feature enhancement is performed on the output of the encoder, including:
[0048] Construct feature vectors based on the encoder's output;
[0049] The global features of the entire sequence are obtained by using mean pooling, resulting in the pooled feature vector:
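[0050] $$h = \frac{1}{T} \sum_{t=1}^{T} Z_{t}^{N}$$
[0051] Where t is the time step index in the sequence, $Z_{t}^{N}$ is the feature vector of the encoder output at time step t, h is the pooled feature vector, and T represents the length of the entire sequence;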
[0052] Layer normalization is applied across the dimensions of the feature vector of a single sample: the mean and variance over all dimensions of that feature vector are calculated, and the vector is adjusted to a mean of 0 and a variance of 1, to eliminate scale differences between feature dimensions within the same sample;
[0053] Batch normalization is applied to the feature vectors of all samples within the same batch: the mean and variance of all samples in the current batch are calculated for each dimension, and the features are adjusted to a stable distribution, to eliminate feature distribution shifts between samples, accelerate model training convergence, and avoid contrastive loss calculation errors caused by differences in batch data distribution.
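The three feature-enhancement operations differ mainly in the axis they work along, which the following sketch makes explicit. It is a simplified illustration: the learnable scale and shift parameters that normalization layers usually carry are omitted, and the function names are chosen for the example.

```python
import math

def mean_pool(sequence):
    """Average the T per-time-step vectors of one sample into a single vector."""
    T = len(sequence)
    return [sum(step[j] for step in sequence) / T for j in range(len(sequence[0]))]

def layer_norm(vec, eps=1e-5):
    """Normalize across the dimensions of ONE sample's feature vector."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def batch_norm(batch, eps=1e-5):
    """Normalize each dimension across ALL samples of one batch."""
    n, dims = len(batch), len(batch[0])
    means = [sum(v[j] for v in batch) / n for j in range(dims)]
    variances = [sum((v[j] - means[j]) ** 2 for v in batch) / n for j in range(dims)]
    return [[(v[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for v in batch]
```

Layer normalization needs only one sample, whereas batch normalization depends on the composition of the batch, which is why the text treats them as separate steps.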
[0054] In one implementation, the feature-enhanced output is subjected to contrastive learning pre-training, including:
[0055] The feature-enhanced output is projected using the following formula:
[0056] $$z = W_{q}^{(2)}\, \phi\left(W_{q}^{(1)} \hat{h} + b_{q}^{(1)}\right) + b_{q}^{(2)}$$
[0057] Where z represents the result of the projection transformation; $\hat{h}$ is the feature-enhanced feature vector; $W_{q}^{(2)}$ represents the second weight matrix of the projection transformation, where the subscript q indicates the projection transformation; $\phi$ represents the activation function; $W_{q}^{(1)}$ represents the first weight matrix of the projection transformation; $b_{q}^{(1)}$ represents the first bias vector of the projection transformation; $b_{q}^{(2)}$ represents the second bias vector of the projection transformation;
[0058] The similarity between two samples in a sample pair is calculated using the projection transformation results:
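[0059] $$s = \frac{\mathrm{sim}(z_{1}, z_{2})}{\tau}$$
[0060] Where s represents the similarity between the two samples in a sample pair, $z_{1}$ and $z_{2}$ are the projection transformation results of the two samples, $\mathrm{sim}(\cdot,\cdot)$ is the similarity measure, and τ represents the temperature parameter;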
[0061] Calculate the similarity probability value of the sample pair based on the similarity between the two samples in the sample pair:
[0062] $$p = \sigma(s)$$
[0063] Where p is the similarity probability value of the sample pair, s is the similarity between the two samples in the pair, and $\sigma$ represents the sigmoid function;
[0064] Based on the similarity probability values of the sample pairs, contrastive learning pre-training is performed using the following contrastive learning loss function:
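[0065] $$\mathcal{L}_{CL} = -\frac{1}{N_{p}} \sum_{i=1}^{N_{p}} \Bigl[ y_{i} \log\left(p_{i} + \epsilon\right) + \left(1 - y_{i}\right) \log\left(1 - p_{i} + \epsilon\right) \Bigr]$$
[0066] Where $\epsilon$ is a very small positive constant used to prevent numerical overflow when the input of the logarithmic function is 0; $\mathcal{L}_{CL}$ represents the contrastive learning loss function; $N_{p}$ represents the total number of sample pairs; $y_{i}$ represents the true label of the i-th sample pair; log represents the logarithmic function; and $p_{i}$ represents the similarity probability value of the i-th sample pair.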
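A small numerical sketch of the similarity, probability, and loss steps is given below. The source does not say which similarity measure is used, so cosine similarity is assumed here; the loss is written as a binary cross-entropy over the pair labels, which matches the quantities the text defines (true label, logarithm, similarity probability, small constant). Function names and default values are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors (ASSUMED similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pair_probability(z1, z2, tau=0.5):
    """Temperature-scaled similarity passed through a sigmoid."""
    s = cosine(z1, z2) / tau
    return 1.0 / (1.0 + math.exp(-s))

def contrastive_loss(pairs, labels, tau=0.5, eps=1e-8):
    """Mean binary cross-entropy over sample pairs.

    pairs: list of (z1, z2) projected vectors; labels: 1 or 0 per pair.
    eps keeps the logarithm finite when a probability reaches 0 or 1.
    """
    total = 0.0
    for (z1, z2), y in zip(pairs, labels):
        p = pair_probability(z1, z2, tau)
        total += y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return -total / len(pairs)
```

A smaller temperature τ sharpens the probabilities, so pairs that are only slightly similar are pushed further from 0.5; the loss is low when similar pairs carry label 1 and dissimilar pairs carry label 0.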
[0067] In one embodiment, the adapter of the model encoder includes: a lower projection layer, a hidden layer, an upper projection layer, a residual connection, and a normalized output, wherein:
[0068] The lower projection layer reduces the dimensionality of the input:
[0069] $$c_{down} = W_{down}\, c + b_{down}$$
[0070] Where c is the input feature vector, $W_{down}$ is the down-projection weight matrix, $b_{down}$ is the down-projection bias vector, and $c_{down}$ is the feature vector after dimensionality reduction, whose dimension is the encoder dimension $d_{model}$ reduced by a multiple of 2;
[0071] The hidden layer is calculated according to the following formula:
[0072] $$c_{hid} = \mathrm{LAct}\left(c_{down}\right)$$
[0073] Where $c_{hid}$ is the output of the hidden layer; LAct represents the activation function of the hidden layer, which is built on the hyperbolic tangent function tanh and a learnable scalar parameter; the scalar parameter takes a preset value at initialization and can be adjusted during training to tune the activation function of the hidden layer;
[0074] The upper projection layer is calculated according to the following formula:
[0075] $$c_{up} = W_{up}\, c_{hid} + b_{up}$$
[0076] Where $c_{hid}$ is the output of the hidden layer, $W_{up}$ is the up-projection weight matrix, $b_{up}$ is the up-projection bias vector, and $c_{up}$ is the feature vector after the dimension is restored;
[0077] The adapter output is obtained after residual connection and layer normalization:
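[0078] $$\mathrm{Adapter}(c) = \mathrm{LayerNorm}\left(c + c_{up}\right)$$
[0079] Here, Adapter represents all operations of the adapter, LayerNorm represents the layer normalization operation, c represents the input of the adapter, and $c_{up}$ is the output of the upper projection layer.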
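The adapter's bottleneck structure can be sketched as follows. The exact form of the hidden-layer activation is not recoverable from the text, so the example assumes `alpha * tanh(x)` with a scalar `alpha`; that choice, the function names, and the parameter layout are hypothetical.

```python
import math

def matvec(w, x, b):
    """Compute w @ x + b for a matrix given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def adapter(c, w_down, b_down, w_up, b_up, alpha=1.0):
    """Down-project, activate, up-project, then add the input and normalize."""
    c_down = matvec(w_down, c, b_down)            # reduce the dimension
    # ASSUMPTION: the hidden activation is alpha * tanh(x); the source only says
    # it combines tanh with a learnable scalar parameter.
    c_hid = [alpha * math.tanh(x) for x in c_down]
    c_up = matvec(w_up, c_hid, b_up)              # restore the dimension
    return layer_norm([ci + ui for ci, ui in zip(c, c_up)])  # residual + LayerNorm
```

Because of the residual connection, an adapter whose up-projection is all zeros leaves the encoder features unchanged apart from normalization, so fine-tuning can start from the pre-trained behaviour and adjust only the small adapter weights.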
[0080] In one implementation, the decoder obtains a power output, including:
[0081] After obtaining the feature representation through the encoder, the feature representation is input into the decoder to obtain the power output;
[0082] The decoder is composed of M stacked decoder blocks. Each decoder block includes: a masked multi-head self-attention layer, an encoder-decoder attention layer, a feedforward network layer, a residual connection, and a layer normalization, where M is a positive integer.
[0083] In one implementation, inputting the feature representation into a decoder to obtain a power output includes:
[0084] Embed the target sequence:
[0085] $$Y_{embed} = Y W_{embed} + b_{embed}$$
[0086] Where $Y_{embed}$ represents the embedded target sequence; $W_{embed}$ represents the weight matrix for calculating the embedded target sequence; Y represents the encoder output during the power prediction stage; and $b_{embed}$ represents the bias vector for calculating the embedded target sequence;
[0087] The target sequence is then fused with the positional encoding:
[0088] $$D_{0} = Y_{embed} + PE_{tgt}$$
[0089] Where $D_{0}$ represents the initial input of the decoder, $Y_{embed}$ is the embedded target sequence, and $PE_{tgt}$ indicates the positional encoding of the target;
[0090] The query, key, and value of the self-attention are calculated through the masked multi-head self-attention layer;
[0091] Masked attention is calculated based on the query, key, and value;
[0092] The masked attention heads are concatenated to obtain the masked attention concatenation result;
[0093] The power output is obtained by performing residual connection and layer normalization on the masked attention concatenation result.
[0094] A power generation prediction device, comprising:
[0095] The acquisition module is used to acquire historical power generation data of the target photovoltaic power plant and multi-category meteorological forecast data for the corresponding time period as the original dataset;
[0096] The extraction module is used to extract data in preset proportions from the daytime and nighttime data in the original dataset to form a sampled dataset;
[0097] The selection module is used to randomly select a sample from the sampled dataset for each sample in the original dataset to form a sample pair, which serves as input data for the pre-training stage.
[0098] The pre-training module is used to perform contrastive learning pre-training on the original prediction model using the input data, so as to learn the diurnal differences in photovoltaic power generation;
[0099] The adjustment module is used to retain the photovoltaic day and night power generation patterns in the model encoder through parameter sharing, and to fine-tune the parameters with the learnable adapter to obtain the prediction model;
[0100] The prediction module is used to make short-term predictions of photovoltaic power generation of the target photovoltaic power plant using the prediction model.
[0101] An electronic device includes a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method described above.
[0102] A computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implement the steps of the above-described method.
[0103] A computer program product includes a computer program / instructions that, when executed by a processor, implement the steps of the above-described method.
[0104] The power generation prediction method provided in this application obtains historical power generation data of the target photovoltaic power plant and multi-category meteorological forecast data for the corresponding time period as the original dataset. It then extracts data in preset proportions from the daytime and nighttime data in the original dataset to form a sampled dataset. The original prediction model is pre-trained by contrastive learning using the input data to learn the diurnal differences in photovoltaic power generation. The model encoder retains the day and night photovoltaic power generation patterns through parameter sharing, and the parameters are fine-tuned with a learnable adapter to obtain the prediction model, thereby enabling short-term prediction of photovoltaic power generation at the target photovoltaic power plant. By introducing diurnal differences into photovoltaic power generation prediction, this method addresses the technical problem of low accuracy in existing photovoltaic power generation predictions and improves prediction accuracy.
Description of the Drawings
[0105] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0106] Figure 1 is a flowchart of one embodiment of the power generation prediction method provided in this application;
[0107] Figure 2 is a framework diagram of the short-term photovoltaic power generation prediction method based on contrastive learning and Transformer provided in this application;
[0108] Figure 3 is a schematic diagram of the encoder provided in this application;
[0109] Figure 4 is a schematic diagram of the encoder architecture with a learnable adaptation layer provided in this application;
[0110] Figure 5 is a schematic diagram of the decoder provided in this application;
[0111] Figure 6 is a hardware structure block diagram of an electronic device for a method of predicting power generation provided in this application;
[0112] Figure 7 is a schematic diagram of the module structure of one embodiment of the power generation prediction device provided in this application.
Detailed Description
[0113] To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this application.
[0114] It should be noted that the information and data related to users involved in the embodiments of this specification are all information and data authorized by the user or fully authorized by the relevant parties. Furthermore, the collection, storage, use, processing, transmission, provision, disclosure, and application of the relevant data all comply with relevant laws, regulations, and standards, and necessary confidentiality measures have been taken. They do not violate public order and good morals, and corresponding operation entry points are provided for users or relevant parties to choose to authorize or refuse.
[0115] It should also be noted that in the embodiments of this specification, certain software, components, models and other existing solutions in the industry may be mentioned. These should be regarded as exemplary and are only intended to illustrate the feasibility of implementing the technical solution of this application. However, it does not mean that the applicant has used or necessarily used the solution.
[0116] Figure 1 is a flowchart of one embodiment of the power generation prediction method provided in this application. Although this application provides method operation steps or apparatus structures as shown in the following embodiments or figures, more or fewer operation steps or module units may be included in the method or apparatus based on conventional or non-inventive effort. In steps or structures where there is no logically necessary causal relationship, the execution order of these steps or the module structure of the apparatus is not limited to the execution order or module structure described in the embodiments and figures of this application. When the method or module structure is applied in actual devices or terminal products, it can be executed sequentially or in parallel according to the method or module structure shown in the embodiments or figures (e.g., in a parallel processor or multi-threaded processing environment, or even a distributed processing environment).
[0117] Specifically, as shown in Figure 1, the above-mentioned method for predicting power generation may include the following steps:
[0118] Step 101: Obtain historical power generation data of the target photovoltaic power plant and multi-category meteorological forecast data for the corresponding time period as the original dataset;
[0119] Step 102: Extract data in preset proportions from the daytime and nighttime data in the original dataset to form a sampled dataset;
[0120] Step 103: For each sample in the original dataset, randomly select a sample from the sampled dataset to form a sample pair, which will serve as input data for the pre-training stage;
[0121] Step 104: Perform contrastive learning pre-training on the original prediction model using the input data to learn the diurnal differences in photovoltaic power generation;
[0122] Step 105: Save the photovoltaic day and night power generation pattern through the model encoder in the form of parameter sharing, and fine-tune the parameters in combination with the learnable adapter to obtain the prediction model;
[0123] Step 106: Make a short-term prediction of photovoltaic power generation for the target photovoltaic power plant using the prediction model.
[0124] Specifically, making a short-term prediction of photovoltaic power generation for the target photovoltaic power plant using the prediction model may include: inputting the original dataset into the prediction model; and performing, by the prediction model, model preprocessing, encoder feature extraction, and decoding to obtain the power output, so as to output photovoltaic power predictions at intervals of a preset number of minutes over a future period of predetermined length. The decoder in the prediction model initially takes a zero-valued tensor as input and, during the prediction stage, uses its own predictions as subsequent inputs to generate the output sequence step by step.
[0125] In implementation, the decoder obtains power output, which may include: after obtaining the feature representation through the encoder, inputting the feature representation into the decoder to obtain power output; wherein, the decoder may be composed of M decoder blocks stacked together, each decoder block containing: a masked multi-head self-attention layer, an encoder-decoder attention layer, a feedforward network layer, a residual connection and layer normalization, where M is a positive integer.
[0126] Accordingly, inputting the feature representation into the decoder to obtain the power output may include:
[0127] S1: Embed the target sequence:
[0128] $$Y_{embed} = Y W_{embed} + b_{embed}$$
[0129] Where $Y_{embed}$ represents the embedded target sequence; $W_{embed}$ represents the weight matrix for calculating the embedded target sequence; Y represents the encoder output during the power prediction stage; and $b_{embed}$ represents the bias vector for calculating the embedded target sequence;
[0130] S2: Fuse the target sequence with the positional encoding:
[0131] $$D_{0} = Y_{embed} + PE_{tgt}$$
[0132] Where $D_{0}$ represents the initial input of the decoder, $Y_{embed}$ is the embedded target sequence, and $PE_{tgt}$ indicates the positional encoding of the target;
[0133] S3: Calculate the query, key, and value of the self-attention through the masked multi-head self-attention layer;
[0134] S4: Calculate masked attention based on the query, key, and value;
[0135] S5: Concatenate the masked attention heads to obtain the masked attention concatenation result;
[0136] S6: Perform residual connection and layer normalization on the masked attention concatenation result to obtain the power output.
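The masking used in the decoder's self-attention can be illustrated with a short sketch. The source does not spell out the mask, so the usual causal (lower-triangular) mask is assumed here, in which each output position may attend only to itself and earlier positions; the function names are illustrative.

```python
import math

def causal_mask(length):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions receive zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Disallowed scores become -inf, so exp() maps them to exactly 0.
        kept = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(kept)                     # finite: the diagonal is always allowed
        exps = [math.exp(s - m) for s in kept]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With this mask, the prediction for a given future step cannot depend on later steps, which is what allows the decoder to generate the output sequence one step at a time.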
[0137] In the example above, historical power generation data of the target photovoltaic power plant and multi-category meteorological forecast data for the corresponding time period are obtained as the original dataset. Then, data in preset proportions are extracted from the daytime and nighttime data in the original dataset to form a sampled dataset. The original prediction model is pre-trained by contrastive learning using the input data to learn the diurnal differences in photovoltaic power generation. The photovoltaic day and night power generation patterns are retained by the model encoder through parameter sharing, and the parameters are fine-tuned with a learnable adapter to obtain the prediction model, thereby enabling short-term prediction of photovoltaic power generation at the target photovoltaic power plant. By introducing diurnal differences into photovoltaic power generation prediction, the technical problem of low accuracy in existing photovoltaic power generation predictions can be solved, achieving the technical effect of improving the accuracy of photovoltaic power generation prediction.
[0138] To learn the diurnal differences in photovoltaic power generation, performing contrastive learning pre-training on the original prediction model using the input data may include the following steps:
[0139] S1: Convert the time string in the input data into a time object and extract the time features of the object, wherein the time features include year, month, day, and hour, and cyclically encode the hour to preserve its periodicity;
[0140] S2: Perform spatial averaging on the meteorological elements in the input data, and align the meteorological elements with the power data in time;
[0141] S3: Extract features from the input feature matrix containing time and meteorological features using the encoder. The encoder consists of N identical encoder blocks stacked together. Each encoder block contains multiple sub-layers, including multi-head self-attention, a feedforward network, residual connections, and layer normalization, where N is a positive integer.
[0142] Extracting features from the input feature matrix containing time and meteorological features using the encoder may include:
[0143] For each encoder block in the encoder, the calculation is performed as follows:
[0144] S31: Calculate the projections of the query (Q), key (K), and value (V):
[0145] $$Q^{l} = Z^{l-1} W_{Q}^{l}, \qquad K^{l} = Z^{l-1} W_{K}^{l}, \qquad V^{l} = Z^{l-1} W_{V}^{l}$$
[0146] Where $Q^{l}$ represents the query matrix of the l-th layer, the superscript l indicating the layer number; $Z^{l-1}$ represents the output of layer l-1 and the input of layer l; $K^{l}$ represents the key matrix of the l-th layer; $V^{l}$ represents the value matrix of the l-th layer; $W_{Q}^{l}$, $W_{K}^{l}$, and $W_{V}^{l}$ represent the weight matrices of the query, key, and value of the l-th layer, each with dimension $d_{model} \times d_{n}$, where $d_{model}$ represents the dimension of the encoder and $d_{n}$ is the feature dimension of the queries, keys, and values;
[0147] S32: Calculate the attention score:
[0148] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{n}}}\right) V$$
[0149] $$\mathrm{head}_{i}^{l} = \mathrm{Attention}\left(Z^{l-1} W_{Q,i}^{l},\; Z^{l-1} W_{K,i}^{l},\; Z^{l-1} W_{V,i}^{l}\right)$$
[0150] Where K is the key matrix for calculating the attention score, and $K^{T}$ indicates that K is transposed to satisfy the requirements of matrix multiplication; $\mathrm{head}_{i}^{l}$ represents the attention result of the i-th single head in layer l; $W_{Q,i}^{l}$ represents the weight matrix of the query matrix for the i-th head in the l-th layer; $W_{K,i}^{l}$ represents the weight matrix of the key matrix for the i-th head in the l-th layer; $W_{V,i}^{l}$ represents the weight matrix of the value matrix for the i-th head in the l-th layer, where i denotes the i-th head;
[0151] S33: After obtaining single-head attention, concatenate them to obtain multi-head attention:
[0152] $$\mathrm{MultiHead}^{l} = \mathrm{Concat}\left(\mathrm{head}_{1}^{l}, \ldots, \mathrm{head}_{he}^{l}\right) W_{O}^{l}$$
[0153] Wherein, Concat represents the concatenation operation of matrices or vectors; $W_{O}^{l}$ represents the output weight matrix of the multi-head attention of the l-th layer encoder, where the subscript O indicates the output, the superscript l indicates the layer number, and he indicates the number of heads;
[0154] S34: Perform residual connection and layer normalization on the output of the multi-head self-attention sublayer:
[0155] $$Z'^{l} = \mathrm{LayerNorm}\left(Z^{l-1} + \mathrm{MultiHead}^{l}\right)$$
[0156] Where LayerNorm represents layer normalization, and $Z'^{l}$ represents the output of the multi-head self-attention sublayer in the l-th encoder, the prime (′) distinguishing it from the output $Z^{l}$ of the l-th encoder;
[0157] S35: Perform feedforward transformation in the feedforward network sub-layer:
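[0158] $$\mathrm{FFN}^{l}(x) = \mathrm{ReLU}\left(x W_{1}^{l}\right) W_{2}^{l}$$
[0159] Where $\mathrm{FFN}^{l}$ represents the feedforward neural network in the l-th layer encoder, and x represents the input of this neural network; $W_{1}^{l} \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_{2}^{l} \in \mathbb{R}^{d_{ff} \times d_{model}}$ represent the two weight matrices for the feedforward transformation in the l-th layer, and $d_{ff}$ represents the dimension of the hidden layer in the feedforward neural network;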
[0160] The output of the feedforward transform is then subjected to residual connection and layer normalization:
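[0161] $$Z^{l} = \mathrm{LayerNorm}\left(Z'^{l} + \mathrm{FFN}^{l}\left(Z'^{l}\right)\right)$$ where $Z'^{l}$ is the output of the multi-head self-attention sublayer and $Z^{l}$ is the output of the l-th encoder block;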
[0162] S36: After passing through N identical encoder blocks, the output of the encoder is obtained:
[0163] $$Z^{N} = \mathrm{Encoder}\bigl(\mathrm{Encoder}\bigl(\cdots \mathrm{Encoder}(X) \cdots\bigr)\bigr)$$
[0164] Where Encoder is applied N times; $Z^{N}$ represents the output of the N-th layer encoder, which is also the final output of the encoder; N is the number of encoder blocks; X represents the input feature matrix containing time and meteorological features; and Encoder represents all operations of a single-layer encoder.
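The feedforward sublayer and the stacking of blocks in S35 and S36 can be sketched for a single time step as follows. The ReLU activation and the absence of bias terms are assumptions, since the text names only two weight matrices; the helper names and the representation of a block as a Python callable are likewise illustrative.

```python
import math

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def feed_forward(x, w1, w2):
    """Position-wise feedforward transform for one time step.

    w1 has shape d_model x d_ff and w2 has shape d_ff x d_model (lists of rows).
    ASSUMPTION: ReLU activation, no bias terms.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def ffn_sublayer(x, w1, w2):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, feed_forward(x, w1, w2))])

def run_encoder(x, blocks):
    """Apply the blocks in order: the output of block l-1 is the input of block l."""
    z = x
    for block in blocks:
        z = block(z)
    return z
```

Passing a list of N callables to `run_encoder` mirrors the nested application of Encoder: the blocks share one structure, and the final value of `z` corresponds to the encoder output after the last block.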
[0165] S4: Perform feature enhancement on the output of the encoder;
[0166] Specifically, feature enhancement of the encoder's output may include:
[0167] S41: Construct feature vectors based on the encoder output;
[0168] S42: Use mean pooling to obtain the global features of the entire sequence, resulting in the pooled feature vector:
[0169] $h = \frac{1}{T}\sum_{t=1}^{T} Z_t^{N}$
[0170] Where t is the time step index in the sequence, $Z_t^{N}$ is the encoder output at time step t, h is the pooled feature vector, and T is the length of the entire sequence;
[0171] S43: Apply layer normalization across the internal dimensions of each sample's feature vector: calculate the mean and variance over all dimensions of the vector and rescale it to a mean of 0 and a variance of 1, which eliminates scale differences between feature dimensions within the same sample;
[0172] S44: Apply batch normalization to the feature vectors of all samples in the same batch: calculate the mean and variance of each dimension over all samples and adjust them to a stable distribution, which eliminates feature distribution shifts between samples, accelerates training convergence, and avoids contrastive loss calculation errors caused by differences in batch data distribution.
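Steps S42 to S44 can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, the learnable scale and shift parameters that normalization layers usually carry are omitted, and batch statistics are computed directly from the given batch.

```python
import math

def mean_pool(sequence):
    """Average a T x d sequence of feature vectors over the time axis."""
    T = len(sequence)
    d = len(sequence[0])
    return [sum(step[j] for step in sequence) / T for j in range(d)]

def layer_norm(vec, eps=1e-5):
    """Normalize one sample across its own feature dimensions."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature dimension across all samples in the batch."""
    n = len(batch)
    d = len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out

def enhance(batch_of_sequences):
    """Mean pooling, then layer normalization, then batch normalization."""
    pooled = [mean_pool(seq) for seq in batch_of_sequences]
    normed = [layer_norm(h) for h in pooled]
    return batch_norm(normed)
```

Layer normalization works per sample, while batch normalization works per feature dimension across samples, which is why the two are applied in sequence rather than interchangeably.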
[0173] S5: Perform contrastive learning pre-training on the output results after feature enhancement.
[0174] When performing contrastive learning pre-training on the output results after feature enhancement, the following can be included:
[0175] S51: Perform a projection transformation on the feature-enhanced output according to the following formula:
[0176] $z = W_q^{(2)}\,\phi\left(W_q^{(1)}\tilde{h} + b_q^{(1)}\right) + b_q^{(2)}$
[0177] Where z denotes the result of the projection transformation; $\tilde{h}$ denotes the feature-enhanced output; $W_q^{(2)}$ denotes the second weight matrix of the projection transformation, the subscript q indicating the projection transformation; $\phi(\cdot)$ denotes the activation function; $W_q^{(1)}$ denotes the first weight matrix of the projection transformation; $b_q^{(1)}$ denotes the first bias vector of the projection transformation; and $b_q^{(2)}$ denotes the second bias vector of the projection transformation;
[0178] S52: Calculate the similarity between two samples in a sample pair using the projection transformation results:
[0179]
[0180] Where scaled_sim denotes the similarity between the two samples in a sample pair, and τ is a temperature parameter that controls the shape of the similarity function curve. The smaller the value of τ, the larger the differences in similarity between samples and the steeper the curve. For learning the day and night patterns of photovoltaic power generation, a smaller τ can help the model better distinguish daytime patterns from nighttime patterns.
[0181] S53: Calculate the similarity probability value of the sample pair based on the similarity between the two samples in the sample pair:
[0182] $p = \sigma(\mathrm{scaled\_sim})$
[0183] Where p is the similarity probability value of the sample pair, and $\sigma(\cdot)$ denotes the sigmoid function;
[0184] S54: Based on the similarity probability values of the sample pairs, perform contrastive learning pre-training using the following contrastive learning loss function:
[0185] $\mathcal{L}_{CL} = -\frac{1}{A}\sum_{i=1}^{A}\left[y_i\log\left(p_i + \epsilon\right) + \left(1 - y_i\right)\log\left(1 - p_i + \epsilon\right)\right]$
[0186] Where $\epsilon$ is a very small positive constant used to prevent numerical errors when the input of the logarithmic function is 0; $\mathcal{L}_{CL}$ denotes the contrastive learning loss; A denotes the total number of sample pairs; $y_i$ denotes the true label of the i-th sample pair; log denotes the logarithmic function; and $p_i$ denotes the similarity probability value of the i-th sample pair.
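Steps S52 to S54 can be illustrated with the short sketch below. It rests on assumptions that the text does not confirm: cosine similarity is used as the similarity measure, the loss is a binary cross-entropy averaged over the sample pairs, and the function names, the default τ, and the default ε are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Assumed similarity measure; the text only speaks of 'similarity'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_probability(z_a, z_b, tau=0.5):
    """Sigmoid of the temperature-scaled similarity of one sample pair."""
    scaled_sim = cosine_similarity(z_a, z_b) / tau
    return 1.0 / (1.0 + math.exp(-scaled_sim))

def contrastive_loss(pairs, labels, tau=0.5, eps=1e-8):
    """Binary cross-entropy over sample pairs; eps guards against log(0)."""
    total = 0.0
    for (z_a, z_b), y in zip(pairs, labels):
        p = pair_probability(z_a, z_b, tau)
        total += y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return -total / len(pairs)
```

With this form, a pair of aligned projections labeled 1 yields a small loss, while the same pair labeled 0 yields a large one; lowering tau pushes the probability for aligned pairs closer to 1.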
[0187] In the example above, the adapter for the model encoder can include: a lower projection layer, a hidden layer, an upper projection layer, a residual connection, and a normalized output, wherein:
[0188] 1) The down-projection layer reduces the dimensionality of the input:
[0189] $c_{down} = W_{down}\,c + b_{down}$
[0190] Where c is the input feature vector, $W_{down}$ is the down-projection weight matrix, $b_{down}$ is the down-projection bias vector, and $c_{down}$ is the feature vector after dimensionality reduction, whose dimension is the encoder dimension d divided by a reduction factor that is a multiple of 2;
[0191] 2) The hidden layer is calculated using the following formula:
[0192] $c_{hid} = \mathrm{LAct}\left(c_{down}\right)$
[0193] Where the coefficients of LAct are learnable scalar parameters. They are set to fixed values at initialization, under which LAct is equivalent to a fixed activation function, and are adjusted during training so that the activation function of the hidden layer adapts flexibly; tanh denotes the hyperbolic tangent function, and LAct denotes the activation function of this layer;
[0194] 3) The up-projection layer is calculated using the following formula:
[0195] $c_{up} = W_{up}\,c_{hid} + b_{up}$
[0196] Where $c_{hid}$ is the output of the hidden layer, $W_{up}$ is the up-projection weight matrix, $b_{up}$ is the up-projection bias vector, and $c_{up}$ is the feature vector after the dimension is restored;
[0197] 4) The adapter output is obtained after the residual connection and layer normalization:
[0198] $\mathrm{Adapter}(c) = \mathrm{LayerNorm}\left(c + c_{up}\right)$
[0199] Here, Adapter denotes all operations of the adapter, LayerNorm denotes the layer normalization operation, and c denotes the input of the adapter.
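The adapter's four parts can be sketched as below. The exact form of the learnable activation is not recoverable from this text, so the sketch assumes a hypothetical form LAct(x) = alpha * tanh(beta * x) with alpha = beta = 1 at initialization; the function names, the parameter names alpha and beta, and the omission of learnable layer normalization parameters are all assumptions.

```python
import math

def matvec(W, x, b):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def lact(x, alpha=1.0, beta=1.0):
    """Hypothetical learnable activation; reduces to tanh when alpha = beta = 1."""
    return alpha * math.tanh(beta * x)

def adapter(c, W_down, b_down, W_up, b_up, alpha=1.0, beta=1.0):
    c_down = matvec(W_down, c, b_down)                # down-projection
    c_hid = [lact(v, alpha, beta) for v in c_down]    # hidden layer
    c_up = matvec(W_up, c_hid, b_up)                  # up-projection
    residual = [ci + ui for ci, ui in zip(c, c_up)]   # residual connection
    return layer_norm(residual)                       # normalized output
```

Because of the residual connection, an adapter whose up-projection is all zeros simply returns the layer-normalized input, so the pre-trained encoder features pass through unchanged in content.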
[0200] The above method will be described below with reference to a specific embodiment. However, it should be noted that this specific embodiment is only for better illustration of this application and does not constitute an improper limitation of this application.
[0201] Photovoltaic power prediction methods fall into two categories: physical models and data-driven models. Physical models rely on numerical simulation based on physical mechanisms and require high-performance computing clusters with heterogeneous hardware architectures; even so, their inherent computational bottlenecks remain difficult to overcome. Data-driven models include autoregressive moving average models, exponential smoothing, and machine learning. As an extension of machine learning, deep learning can be applied to photovoltaic power prediction. Because photovoltaic power prediction is a time-series analysis problem, prediction methods based on deep learning often employ recurrent neural networks such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). However, power prediction with recurrent neural networks has the following problems. First, modeling power output with a single time-series model ignores the change in photovoltaic power generation patterns between day and night, so the model may keep predicting power output during nighttime periods, which easily leads to prediction errors. Second, such models cannot fully capture the characteristics of photovoltaic power generation. For example, a sudden drop in photovoltaic output caused by a change in cloud cover requires the model to respond quickly, but the sliding time-window mechanism of a recurrent neural network means that the sudden signal needs 5-7 time steps to fully propagate, which degrades the model's prediction accuracy and adaptability.
[0202] Based on this, this example presents a short-term photovoltaic power generation prediction method based on contrastive learning and Transformer. First, a Transformer encoder module based on a self-attention mechanism is constructed to capture abrupt feature changes within a single step. Then, a contrastive learning task on daytime and nighttime photovoltaic power generation is established to enhance the model's ability to recognize power generation patterns at different times. Finally, an efficient parameter fine-tuning method is introduced that integrates knowledge from the pre-training and fine-tuning stages to improve the model's overall prediction performance. As shown in Figure 2, the process includes three stages. Stage 1 (data sampling and relationship modeling): a sampling strategy is designed for the dataset, and labeled data for supervised contrastive learning is constructed. Stage 2 (pre-training of photovoltaic day and night power generation patterns): the model is pre-trained with contrastive learning to learn the differences between daytime and nighttime photovoltaic power generation. Stage 3 (efficient parameter fine-tuning for photovoltaic power prediction): the model encoder module retains the day and night power generation patterns through parameter sharing and is combined with a learnable adapter for efficient parameter fine-tuning, achieving short-term prediction of photovoltaic power generation.
[0203] Specifically, it can include:
[0204] 1) Data sampling and relationship modeling stage:
[0205] Sample pair labels are created to guide contrastive learning pre-training, where these labels are constructed based on the contrastive relationships between day and night. Ensuring the quality of training sample pairs through appropriate sampling strategies can significantly improve the model's contrastive learning performance and effectively reduce introduced biases.
[0206] Specifically, there are three main types of relationships between samples: "DN" (day and night), "DD" (day and day), and "NN" (night and night). Through contrastive learning models, it is possible to reveal the commonalities between day and day, night and night, as well as the differences between day and night.
[0207] Based on this, sampling can be performed in the following manner in this example:
[0208] First, extract m% of the total number of samples from both daytime and nighttime data in the original dataset to construct a sampling dataset. Then, for each sample in the original dataset, randomly select one sample from the sampling dataset to form a sample pair, which will be used as input for the pre-training stage. Finally, construct sample pair labels to guide the training of the contrastive model; the construction rules are expressed as follows:
[0209] $y = \begin{cases} 1, & \text{DD or NN} \\ 0, & \text{DN} \end{cases}$
[0210] That is, DD (day-day): positive sample pair, label=1; NN (night-night): positive sample pair, label=1; DN (day-night): negative sample pair, label=0. This label is used for supervised contrastive learning, guiding the model to learn the similarity of samples from the same time period and the differences between samples from different time periods.
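The sampling and labeling procedure can be sketched as follows. The sketch assumes that m% is drawn separately from the daytime samples and from the nighttime samples, that both groups are non-empty, and that each sample is a hypothetical (sample_id, is_day) tuple; none of these details is fixed by the text.

```python
import random

def build_pairs(dataset, m_percent, seed=0):
    """Pair every sample with a random partner from the sampling dataset.

    dataset: list of (sample_id, is_day) tuples (hypothetical layout).
    Returns a list of (sample_id, partner_id, label) triples, where
    label is 1 for DD / NN pairs and 0 for DN pairs.
    """
    rng = random.Random(seed)
    day = [s for s in dataset if s[1]]
    night = [s for s in dataset if not s[1]]
    # Draw m% from each group to form the sampling dataset.
    k_day = max(1, round(len(day) * m_percent / 100))
    k_night = max(1, round(len(night) * m_percent / 100))
    pool = rng.sample(day, k_day) + rng.sample(night, k_night)
    pairs = []
    for sample_id, is_day in dataset:
        partner_id, partner_is_day = rng.choice(pool)
        label = 1 if is_day == partner_is_day else 0
        pairs.append((sample_id, partner_id, label))
    return pairs
```

Fixing the seed makes the pairing reproducible, which helps when comparing pre-training runs with different sampling ratios.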
[0211] 2) Pre-training phase of photovoltaic day and night power generation mode:
[0212] A. Feature engineering processing:
[0213] First, the original time string is converted into a time object, and time features such as year, month, day, and hour are extracted. Then, the hour is periodically encoded (sin / cos conversion) to preserve time periodicity. Finally, the features are standardized. In addition, meteorological elements are spatially averaged and time-aligned with the power data.
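The time feature extraction and standardization can be sketched as below. The timestamp format string, the dictionary keys, and the use of zero-mean, unit-variance scaling for "standardized" are assumptions made for illustration.

```python
import math
from datetime import datetime

def time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a time string and derive calendar and cyclic hour features."""
    t = datetime.strptime(timestamp, fmt)
    angle = 2.0 * math.pi * t.hour / 24.0
    return {
        "year": t.year,
        "month": t.month,
        "day": t.day,
        "hour": t.hour,
        # sin / cos encoding keeps 23:00 and 00:00 close to each other.
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
    }

def standardize(values):
    """Scale a feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

The cyclic encoding matters because a raw hour value would place 23 and 0 at opposite ends of the scale even though they are adjacent in time.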
[0214] B. Feature extraction calculation:
[0215] In this example, features are extracted using an encoder whose structure is shown in Figure 3. The encoder consists of N identical encoder blocks stacked together. Each encoder block contains several sub-layers: multi-head self-attention, a feedforward network, and residual connections with layer normalization (Add & Norm).
[0216] First, position encoding is performed on the input embedding X:
[0217] $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
[0218] $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
[0219] Where pos is the position index in the sequence, i is the dimension index, and $d_{model}$ is the dimension of the model's hidden layers.
[0220] Then, the input embedding X is added to its position code:
[0221] $Z^{0} = X + PE(pos)$
[0222] Where X is the input feature matrix (containing time features and meteorological features), PE(pos) is the position encoding matrix, and $Z^{0}$ is the resulting input to the first encoder block.
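The position encoding step can be illustrated with the sketch below, which assumes the standard sinusoidal scheme of the original Transformer (sine on even dimensions, cosine on odd dimensions, base 10000). The function names are illustrative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model sinusoidal position encoding matrix."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # 'even' plays the role of 2i in the usual formula.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

def add_position(X):
    """Add the position encoding element-wise to the input embedding X."""
    pe = positional_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(X, pe)]
```

For a 96-step input with a hidden dimension of 256, `positional_encoding(96, 256)` gives one distinct encoding row per 15-minute step.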
[0223] For each encoder block, its multi-head self-attention sublayer is calculated as follows:
[0224] First, calculate the projections of the query (Q), key (K), and value (V):
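[0225] $Q_i = Z^{l-1}W_i^{Q,l},\quad K_i = Z^{l-1}W_i^{K,l},\quad V_i = Z^{l-1}W_i^{V,l}$
[0226] Where $Z^{l-1}$ is the input of the l-th encoder block, and $W_i^{Q,l}$, $W_i^{K,l}$, and $W_i^{V,l}$ are the query, key, and value weight matrices of the i-th head in the l-th layer.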
[0227] Then, the attention score is calculated:
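[0228] $\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right)V_i$, where $d_k$ is the dimension of the key vectors
[0229] $\mathrm{head}_i^{l} = \mathrm{Attention}(Q_i, K_i, V_i)$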
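[0230] After the single-head attention results are obtained, they are concatenated:
[0231] $\mathrm{MultiHead}^{l} = \mathrm{Concat}\left(\mathrm{head}_1^{l}, \ldots, \mathrm{head}_{he}^{l}\right) W_O^{l}$
[0232] Then, a residual connection and layer normalization are applied to the output of the multi-head self-attention sub-layer:
[0233] $Z'^{l} = \mathrm{LayerNorm}\left(Z^{l-1} + \mathrm{MultiHead}^{l}\right)$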
[0234] In the feedforward network sub-layer, the feedforward transformation is performed first:
[0235]
[0236] in, .
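[0237] The output of the feedforward transformation is then processed through a residual connection and layer normalization:
[0238] $Z^{l} = \mathrm{LayerNorm}\left(Z'^{l} + \mathrm{FFN}^{l}\left(Z'^{l}\right)\right)$
[0239] After passing through N encoder layers, the output is obtained:
[0240] $Z^{N} = \mathrm{Encoder}\left(\cdots\mathrm{Encoder}\left(\mathrm{Encoder}(X)\right)\cdots\right)$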
[0241] C. Feature enhancement:
[0242] The model constructs feature vectors based on the encoder's output and uses mean pooling to obtain global features for the entire sequence:
[0243] $h = \frac{1}{T}\sum_{t=1}^{T} Z_t^{N}$
[0244] Where t is the time step index in the sequence, T is the sequence length, $Z_t^{N}$ is the encoder output at time step t, and h is the feature vector after pooling.
[0245] Then, in this example, two normalization operations (layer normalization followed by batch normalization) are applied sequentially to the feature vectors:
[0246] $\tilde{h} = \mathrm{BatchNorm}\left(\mathrm{LayerNorm}(h)\right)$
[0247] Specifically, layer normalization normalizes the internal dimensions of the feature vector of a single sample: it calculates the mean and variance over all dimensions of the vector and adjusts them to a mean of 0 and a variance of 1, thereby eliminating scale differences between feature dimensions within the same sample. For example, by normalizing away the difference in value range between "temperature (°C)" and "irradiance (W / m²)" in photovoltaic data, it prevents an excessively large feature dimension (e.g., irradiance) from dominating subsequent calculations and ensures a balanced contribution of each feature dimension to contrastive learning. Batch normalization normalizes the feature vectors of all samples within the same batch: it calculates the mean and variance of each dimension over all samples, adjusts them to a stable distribution, eliminates feature distribution shifts between samples, accelerates training convergence, and avoids contrastive loss calculation errors caused by differences in batch data distribution. For example, if a batch contains too many nighttime samples, the feature mean may be too low, which affects the distinction between daytime and nighttime samples.
[0248] D. Contrastive learning pre-training:
[0249] To address the nighttime prediction error problem and enhance the distinction between day and night photovoltaic power generation patterns, the training loss function is designed in this example as follows:
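[0250] To prevent the contrastive learning task from interfering with the general features learned by the encoder, the feature enhancement vector $\tilde{h}$ is first subjected to a projection transformation with weight matrices $W_q^{(1)}$ and $W_q^{(2)}$, bias vectors $b_q^{(1)}$ and $b_q^{(2)}$, and activation function $\phi(\cdot)$:
[0251] $z = W_q^{(2)}\,\phi\left(W_q^{(1)}\tilde{h} + b_q^{(1)}\right) + b_q^{(2)}$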
[0252] The designed contrastive learning loss function is as follows:
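[0253] $\mathcal{L}_{CL} = -\frac{1}{A}\sum_{i=1}^{A}\left[y_i\log\left(p_i + \epsilon\right) + \left(1 - y_i\right)\log\left(1 - p_i + \epsilon\right)\right]$
[0254] Where $\epsilon$ is a very small positive constant used to prevent numerical errors when the input of the logarithmic function is 0; $\mathcal{L}_{CL}$ denotes the contrastive learning loss; A denotes the total number of sample pairs; $y_i$ denotes the true label of the i-th sample pair; log denotes the logarithmic function; and $p_i$ denotes the similarity probability value of the i-th sample pair.
[0255] The similarity probability value $p_i$ is calculated using the following formula:
[0256] $p_i = \sigma\left(\mathrm{scaled\_sim}_i\right)$, where $\sigma(\cdot)$ is the sigmoid function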
[0257] Where scaled_sim represents the similarity between two samples in a sample pair:
[0258]
[0259] Wherein, τ∈(0.05,0.5) is a temperature parameter used to control the sharpness of the similarity distribution. When the value of τ is small, the distribution will be sharper, and the similarity difference will be amplified; conversely, when the value of τ is large, the distribution will be smoother, and the similarity difference will be mitigated. During the pre-training phase, the contrastive loss is calculated according to the above formula and backpropagated to update the encoder parameters.
[0260] In this example, similar power generation patterns are mapped to nearby locations in the feature space, while different power generation patterns are mapped to more distant locations in the feature space. This provides a better feature representation basis for subsequent power prediction tasks, thereby reducing nighttime prediction errors.
[0261] The pre-training phase is based on the following formula:
[0262]
[0263] The model parameters are updated using gradient descent to fully learn the diurnal power generation characteristics of the pre-training dataset consisting of sample pairs. After pre-training, the learned diurnal power generation pattern knowledge is saved as parameters for use in the subsequent main prediction task.
[0264] 3) High-efficiency parameter fine-tuning stage for photovoltaic power prediction:
[0265] The efficient parameter fine-tuning stage can include three parts: data preprocessing, the encoder, and the decoder, as shown in Figure 2.
[0266] A. Feature engineering processing:
[0267] Similar to the pre-training phase, the data undergoes preprocessing to consider temporal and meteorological characteristics.
[0268] B. Feature extraction:
[0269] This example presents an encoder architecture with a learnable adapter layer, the structure of which is shown in Figure 4. The calculation process is as follows:
[0270] Calculate the multi-head self-attention output:
[0271]
[0272] Calculate the adapter layer output:
[0273]
[0274] Calculate the output of the feedforward network layer:
[0275]
[0276] Except for the adapter layer, the calculation method for the other layers of the encoder is the same as in the pre-training phase. During initialization, this encoder shares parameters with the encoder trained in the pre-training phase, indicating that it has learned the differences between the day and night modes of photovoltaic power generation.
[0277] As shown in Figure 4, the adapter consists of a down-projection layer, a hidden layer, and an up-projection layer, and its output is produced after a residual connection and normalization. The down-projection layer reduces the dimensionality of the input:
[0278] $c_{down} = W_{down}\,c + b_{down}$
[0279] Where c is the input feature vector, $W_{down}$ is the down-projection weight matrix, $b_{down}$ is the down-projection bias vector, and $c_{down}$ is the feature vector after dimensionality reduction.
[0280] Here, the dimension of $c_{down}$ is the encoder dimension d divided by a reduction factor that is a multiple of 2.
[0281] The hidden layer is calculated as follows:
[0282] $c_{hid} = \mathrm{LAct}\left(c_{down}\right)$
[0283] Where the coefficients of LAct are learnable scalar parameters. They are set to fixed values at initialization, under which LAct is equivalent to a fixed activation function, and are adjusted during training so that the activation function of the hidden layer adapts flexibly.
[0284] The up-projection layer is calculated as follows:
[0285] $c_{up} = W_{up}\,c_{hid} + b_{up}$
[0286] Where $c_{hid}$ is the output of the hidden layer, $W_{up}$ is the up-projection weight matrix, $b_{up}$ is the up-projection bias vector, and $c_{up}$ is the feature vector after the dimension is restored.
[0287] Finally, the adapter output is obtained through a residual connection and layer normalization:
[0288] $\mathrm{Adapter}(c) = \mathrm{LayerNorm}\left(c + c_{up}\right)$
[0289] The adaptive layer activation function LAct(∙) proposed in this example overcomes the shortcomings of ordinary activation functions such as ReLU and GELU, which rely on manual design, have fixed forms, and are not trainable. Its parameters are learnable and can automatically discover the activation shape most suitable for the current task, thus achieving task adaptability. At the same time, LAct(∙) can exhibit different function shapes under different parameter settings, has strong expressive power, and can approximate any nonlinear function.
[0290] C. Calculate power output:
[0291] The training samples are processed by the encoder to obtain feature representations, which are then input into the decoder module to obtain the power output. As shown in Figure 5, the decoder consists of M stacked decoder blocks, each of which adds masked multi-head self-attention and encoder-decoder attention to the encoder block structure.
[0292] First, embed the target sequence:
[0293]
[0294] Then, it is fused with the position encoding:
[0295]
[0296] Then, the input is processed by the decoder. Each decoder block contains a "masked multi-head self-attention layer", an "encoder-decoder attention layer", a "feedforward network layer", and "residual connections and layer normalization".
[0297] In the masked multi-head self-attention layer, the query, key, and value matrices Q, K, and V are first calculated:
[0298]
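[0299] Then, calculate the masked attention:
[0300] $\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + \mathrm{Mask}\right)V$
[0301] Where $d_k$ is the dimension of the key vectors and Mask is the mask matrix, calculated as follows:
[0302] $\mathrm{Mask}_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases}$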
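The effect of the mask matrix can be illustrated with the sketch below. It assumes the common additive causal mask, in which positions after the current time step receive negative infinity before the softmax so that their attention weights become zero; the function names are illustrative.

```python
import math

def causal_mask(size):
    """0 where position j is visible from position i (j <= i), -inf otherwise."""
    return [[0.0 if j <= i else float("-inf") for j in range(size)]
            for i in range(size)]

def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask; masked positions get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores, the first position attends only to itself, the second splits its weight over the first two positions, and so on, so no time step can draw on future power values.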
[0303] Concatenate the masked attention heads:
[0304]
[0305] Finally, a residual connection and layer normalization are applied to the masked attention output:
[0306]
[0307] In the encoder-decoder attention layer, the encoder's output is used as the key and value, and the output of the preceding decoder sub-layer is used as the query to calculate the attention scores. The resulting heads are then concatenated into multi-head attention in the same way as for encoder self-attention, so the details are not repeated here. Finally, a residual connection and layer normalization are applied to obtain the output.
[0308]
[0309] Here, Memory refers to the output of the last layer of the encoder.
[0310] The feedforward transformation is performed in the feedforward network layer as follows:
[0311]
[0312] After passing through the M-layer decoder, we obtain:
[0313]
[0314] D. High-efficiency parameter fine-tuning training:
[0315] Based on the decoder output, the prediction result is obtained through projection:
[0316]
[0317] In this stage, the model is trained using mean squared error as the loss function, that is:
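[0318] $\mathcal{L}_{MSE} = \frac{1}{B \cdot L}\sum_{b=1}^{B}\sum_{t=1}^{L}\left(\hat{y}_{b,t} - y_{b,t}\right)^{2}$
[0319] Where B is the batch size, L is the sequence length, $\hat{y}_{b,t}$ is the predicted power of sample b at time step t, and $y_{b,t}$ is the corresponding true power.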
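The mean squared error over a batch can be computed as in the sketch below, which assumes predictions and targets are given as B lists of L values each; the function name is illustrative.

```python
def mse_loss(predictions, targets):
    """Mean squared error over a batch of B sequences of length L."""
    B = len(predictions)
    L = len(predictions[0])
    total = sum(
        (p - y) ** 2
        for pred_row, target_row in zip(predictions, targets)
        for p, y in zip(pred_row, target_row)
    )
    return total / (B * L)
```

For instance, a batch of two length-2 sequences with a single error of 2 kW contributes a squared error of 4 spread over 4 values.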
[0320] At the start of training, the encoder weights are transferred directly from the pre-training stage and frozen during training; the adapter parameters, decoder parameters, and output projection parameters are randomly initialized and iteratively updated by gradient descent guided by the loss function. Because the encoder's core parameters are frozen and only its adapter parameters are fine-tuned, the knowledge about the differences between daytime and nighttime photovoltaic power generation patterns is preserved. At the same time, since a large number of encoder parameters do not need to be retrained, computational cost is significantly reduced and training efficiency is improved.
[0321] 4) Short-term power prediction (model inference):
[0322] The system takes historical power data and corresponding meteorological data as input. After model preprocessing, encoder feature extraction, and decoder sequence generation, it outputs photovoltaic power prediction results for a specified length, presented in 15-minute increments. The decoder operates in an autoregressive manner during the prediction phase, using its own predictions as subsequent inputs to progressively generate the output sequence. Initially, the decoder input is a zero-value tensor.
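The autoregressive inference loop can be sketched as follows. The callable `step_fn` is a hypothetical stand-in for one forward pass of the trained decoder (it receives the encoder output and the decoder inputs so far and returns the next power value); the scalar zero used as the initial input stands in for the zero-value tensor.

```python
def autoregressive_forecast(step_fn, memory, horizon):
    """Generate `horizon` predictions, feeding each one back as input.

    step_fn(memory, decoder_inputs) -> next value (hypothetical decoder call)
    memory: encoder output for the input window
    """
    decoder_inputs = [0.0]  # initial zero-value decoder input
    predictions = []
    for _ in range(horizon):
        next_value = step_fn(memory, decoder_inputs)
        predictions.append(next_value)
        decoder_inputs.append(next_value)  # prediction becomes the next input
    return predictions
```

With a prediction length of 96 and 15-minute steps, `horizon=96` covers one full day.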
[0323] In the example above, a photovoltaic power generation day-night pattern feature extractor based on contrastive learning was proposed. A sample distance metric was constructed using the encoder's mean pooling strategy, effectively aggregating encoder and decoder information to build a robust sample representation and strengthening the model's understanding of different power generation patterns during the day and night. Furthermore, a training adapter for short-term photovoltaic power prediction was proposed. This adapter employs a learnable activation function, overcoming the shortcomings of common activation functions such as ReLU and GELU, which rely on manual design, have fixed forms, and are not trainable. It can select different optimal activation modes for samples from different time periods, effectively improving the model's prediction accuracy. Moreover, a "pre-training + efficient parameter fine-tuning" training paradigm for photovoltaic power prediction was proposed. Pre-training learns the differences in photovoltaic power generation day-night patterns, while efficient parameter fine-tuning establishes an accurate mapping between meteorological elements and photovoltaic power while retaining pre-training knowledge, significantly reducing computational consumption and improving training efficiency.
[0324] The following example illustrates the short-term photovoltaic power generation prediction method based on contrastive learning and Transformer. Contrastive learning enhances the model's understanding of the diurnal differences in photovoltaic power generation patterns, and efficient parameter fine-tuning techniques improve prediction accuracy and training efficiency. Specifically, this may include:
[0325] 1) Implementation preparation:
[0326] A. Data source:
[0327] In this example, for a photovoltaic power generation task, the power generation of the photovoltaic power station is predicted at 15-minute intervals for the 24 hours starting at midnight of the next day, based on historical power generation data and multi-category meteorological forecast data for the corresponding time period. The experimental dataset can contain 10 variables: time, zonal wind at 100 meters altitude, meridional wind at 100 meters altitude, temperature, total precipitation, total cloud cover, surface air pressure, photovoltaic panel irradiance, total horizontal irradiance, and total horizontal irradiance, with power as the target variable.
[0328] The training set was based on data from January 1st to November 14th, 2024; the validation set was based on data from November 15th to December 1st, 2024; and the test set was based on data from December 2nd to December 31st, 2024. Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) were used as experimental metrics; smaller values indicate higher prediction accuracy and better model performance.
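The two evaluation metrics can be computed as in the following sketch; the function names are illustrative and the inputs are assumed to be equal-length lists of predicted and measured power values.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and measured power."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, actual):
    """Mean absolute error between predicted and measured power."""
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a view of typical error as well as sensitivity to outliers such as sudden cloud-induced drops.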
[0329] B. Implementation environment and model parameter settings:
[0330] The model was implemented using PyTorch in a Python 3.8.11 environment and trained on a computer equipped with an NVIDIA GeForce RTX 4090 GPU and CUDA version 11.5.
[0331] The encoder and decoder stacks are both set to 4 layers, the number of attention heads is set to 8, the hidden layer dimension is set to 256, the adapter hidden size is 64, and the dimensionality reduction ratio is 16 to ensure a balance between model performance and computational efficiency.
[0332] The training batch size is 32, the pre-training rounds are 30, the fine-tuning rounds are 100, the learning rate is 10⁻⁴, the sequence length is 96, and the prediction length is 96.
[0333] The temperature parameter for contrastive learning was set to 0.5, and the sampling ratio was 50%.
[0334] 2) Data sampling and relationship modeling:
[0335] This stage is used to construct sample pairs and their labels for contrastive learning pre-training, in order to clearly distinguish between daytime and nighttime power generation patterns.
[0336] A. Sampling strategy:
[0337] Daytime is defined as 7:00 AM to 6:00 PM (inclusive), and nighttime is defined as 7:00 PM to 6:00 AM the following day (inclusive). From the original dataset, 50% of the total data from both the daytime and nighttime datasets are randomly selected to form the sample datasets. To construct positive sample pairs, two samples are randomly selected from the daytime dataset to form a DD pair, and two samples are randomly selected from the nighttime dataset to form an NN pair. To construct negative sample pairs, one sample is randomly selected from each of the daytime and nighttime datasets to form a DN pair and an ND pair, respectively.
[0338] B. Label construction:
[0339] Each sample pair is labeled according to the relationship between its two samples: if both samples are daytime or both are nighttime, the label is 1; if one is daytime and the other is nighttime, the label is 0.
[0340] This label is used for supervised contrastive learning, guiding the model to learn the similarity of samples within the same time period and the differences between samples from different time periods.
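The day/night rule and the resulting labels can be sketched as below. The sketch assumes samples are classified by the hour field of their timestamp, so that hours 7 through 18 count as daytime and hours 19 through 23 and 0 through 6 count as nighttime; the function names are illustrative.

```python
def is_daytime(hour):
    """Daytime covers hours 7..18 inclusive; all other hours are nighttime."""
    return 7 <= hour <= 18

def pair_type(hour_a, hour_b):
    """Return 'DD', 'NN', or 'DN' for a pair of sample hours."""
    day_a, day_b = is_daytime(hour_a), is_daytime(hour_b)
    if day_a and day_b:
        return "DD"
    if not day_a and not day_b:
        return "NN"
    return "DN"

def pair_label(hour_a, hour_b):
    """Label 1 for same-period pairs (DD, NN), 0 for mixed pairs (DN)."""
    return 1 if is_daytime(hour_a) == is_daytime(hour_b) else 0
```

Here an ND pair is treated the same as a DN pair, since the label depends only on whether the two samples share a period.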
[0341] 3) Pre-training of photovoltaic day and night power generation mode:
[0342] A. Feature engineering processing:
[0343] The time strings in the original data are converted into time objects, and time features such as year, month, day, and hour are extracted. The "hour" feature is then encoded using sin / cos periodicity to preserve the periodicity of the time dimension. All features (including time and meteorological features) are standardized to eliminate dimensional differences and avoid excessive influence of a single feature on model training. For meteorological data, spatial averaging is performed along the latitude and longitude dimensions, and hourly data is interpolated to 15-minute intervals.
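The resampling of hourly meteorological data to 15-minute intervals can be sketched as follows. The text does not state the interpolation method, so linear interpolation is an assumption here, as is the function name.

```python
def interpolate_to_quarter_hour(hourly):
    """Linearly interpolate hourly values to 15-minute resolution.

    Each pair of consecutive hourly values yields four points
    (at 0, 15, 30, and 45 minutes); the final hourly value is appended.
    """
    out = []
    for left, right in zip(hourly, hourly[1:]):
        for k in range(4):
            out.append(left + (right - left) * k / 4)
    out.append(hourly[-1])
    return out
```

An input of n hourly values produces 4 * (n - 1) + 1 values, so 25 hourly points (midnight to midnight) give the 97 boundary-inclusive points that span 96 quarter-hour steps.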
[0344] B. Feature Extraction:
[0345] In this example, a Transformer encoder is used to extract features. The encoder consists of four identical encoder blocks stacked together. Each encoder block contains multiple sub-layers, including multi-head self-attention, feedforward network, residual connections, and layer normalization.
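[0346] First, position encoding is performed on the input embedding X:
[0347] $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
[0348] $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
[0349] Where pos is the position index in the sequence, i is the dimension index, and $d_{model}$ is the dimension of the model's hidden layers.
[0350] Then, the input embedding X is added to its position encoding:
[0351] $Z^{0} = X + PE(pos)$
[0352] Where X is the input feature matrix (including time features and meteorological features), PE(pos) is the position encoding matrix, and $Z^{0}$ is the resulting input to the first encoder block.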
[0353] For each encoder block, the multi-head self-attention sublayer is calculated as follows:
[0354] First, calculate the projections of the query (Q), key (K), and value (V):
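[0355] $Q_i = Z^{l-1}W_i^{Q,l},\quad K_i = Z^{l-1}W_i^{K,l},\quad V_i = Z^{l-1}W_i^{V,l}$
[0356] Where $Z^{l-1}$ is the input of the l-th encoder block, and $W_i^{Q,l}$, $W_i^{K,l}$, and $W_i^{V,l}$ are the query, key, and value weight matrices of the i-th head in the l-th layer.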
[0357] Then, the attention score is calculated:
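[0358] $\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right)V_i$, where $d_k$ is the dimension of the key vectors
[0359] $\mathrm{head}_i^{l} = \mathrm{Attention}(Q_i, K_i, V_i)$
[0360] After the individual attention heads are obtained, all 8 heads are concatenated:
[0361] $\mathrm{MultiHead}^{l} = \mathrm{Concat}\left(\mathrm{head}_1^{l}, \ldots, \mathrm{head}_8^{l}\right) W_O^{l}$
[0362] Then, a residual connection and layer normalization are applied to the output of the multi-head self-attention sub-layer:
[0363] $Z'^{l} = \mathrm{LayerNorm}\left(Z^{l-1} + \mathrm{MultiHead}^{l}\right)$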
[0364] In the feedforward network sub-layer, the feedforward transformation is performed first:
[0365]
[0366] in, .
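[0367] The output of the feedforward transformation is then processed through a residual connection and layer normalization:
[0368] $Z^{l} = \mathrm{LayerNorm}\left(Z'^{l} + \mathrm{FFN}^{l}\left(Z'^{l}\right)\right)$
[0369] After passing through N encoder layers, the output is obtained:
[0370] $Z^{N} = \mathrm{Encoder}\left(\cdots\mathrm{Encoder}\left(\mathrm{Encoder}(X)\right)\cdots\right)$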
[0371] C. Feature enhancement:
[0372] Furthermore, the model constructs feature vectors based on the encoder output and uses mean pooling to obtain global features for the entire sequence:
[0373] $h = \frac{1}{T}\sum_{t=1}^{T} Z_t^{N}$
[0374] Where t is the time step index in the sequence, T is the sequence length, $Z_t^{N}$ is the encoder output at time step t, and h is the feature vector after pooling.
[0375] Then, this application applies two normalization operations (layer normalization followed by batch normalization) to process the feature vectors:
[0376] $\tilde{h} = \mathrm{BatchNorm}\left(\mathrm{LayerNorm}(h)\right)$
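[0377] D. Contrastive learning pre-training:
[0378] To prevent the contrastive learning task from interfering with the general features learned by the encoder, the feature enhancement vector $\tilde{h}$ is first transformed by a projection with weight matrices $W_q^{(1)}$ and $W_q^{(2)}$, bias vectors $b_q^{(1)}$ and $b_q^{(2)}$, and activation function $\phi(\cdot)$, as follows:
[0379] $z = W_q^{(2)}\,\phi\left(W_q^{(1)}\tilde{h} + b_q^{(1)}\right) + b_q^{(2)}$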
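[0380] The contrastive learning loss function is designed as follows:
[0381] $\mathcal{L}_{CL} = -\frac{1}{A}\sum_{i=1}^{A}\left[y_i\log\left(p_i + \epsilon\right) + \left(1 - y_i\right)\log\left(1 - p_i + \epsilon\right)\right]$
[0382] Where $\epsilon$ is a very small positive constant used to prevent numerical errors when the input of the logarithmic function is 0; A is the total number of sample pairs; $y_i$ is the true label of the i-th sample pair; and $p_i$ is the similarity probability value of the i-th sample pair, calculated as follows:
[0383] $p_i = \sigma\left(\mathrm{scaled\_sim}_i\right)$, where $\sigma(\cdot)$ is the sigmoid function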
[0384] Where scaled_sim is the similarity between the two samples in a sample pair:
[0385]
[0386] Where τ is a temperature parameter used to control the sharpness of the similarity distribution.
[0387] During the pre-training phase, the contrastive loss is calculated and backpropagated to update the encoder parameters.
[0388] 4) Parameter fine-tuning for photovoltaic power prediction may include:
[0389] S1: Data Preprocessing
[0390] Similar to the pre-training phase, the data undergoes preprocessing to consider temporal and meteorological characteristics.
[0391] S2: Feature extraction calculation:
[0392] The computation process employs a 4-layer encoder and an 8-head attention structure, as follows:
[0393] Calculate the multi-head self-attention output:
[0394]
[0395] Calculate the adapter layer output:
[0396]
[0397] Calculate the output of the feedforward network layer:
[0398]
[0399] The adapter consists of a down-projection layer, a hidden layer, and an up-projection layer, and produces its output through a residual connection and layer normalization. The down-projection layer reduces the dimensionality of the input using the following formula:
[0400] $c_{\mathrm{down}} = W_{\mathrm{down}}\,c + b_{\mathrm{down}}$
[0401] where $c$ is the input feature vector, $W_{\mathrm{down}}$ is the down-projection weight matrix, $b_{\mathrm{down}}$ is the down-projection bias vector, and $c_{\mathrm{down}}$ is the feature vector after dimensionality reduction, whose dimension is $d_{\mathrm{model}}/k$, where $k$ is a multiple of 2.
[0402] The hidden layer is calculated using the following formula:
[0403] $c_{\mathrm{hid}} = \mathrm{LAct}_{\alpha}(c_{\mathrm{down}})$
[0404] where $\alpha$ is a learnable scalar parameter. At initialization, $\mathrm{LAct}_{\alpha}$ is equivalent to the hyperbolic tangent function $\tanh$; during training, $\alpha$ is updated so that the activation function of the hidden layer can be adjusted flexibly.
[0405] The up-projection layer is calculated using the following formula:
[0406] $c_{\mathrm{up}} = W_{\mathrm{up}}\,c_{\mathrm{hid}} + b_{\mathrm{up}}$
[0407] where $c_{\mathrm{hid}}$ is the output of the hidden layer, $W_{\mathrm{up}}$ is the up-projection weight matrix, $b_{\mathrm{up}}$ is the up-projection bias vector, and $c_{\mathrm{up}}$ is the feature vector after the dimension is restored.
[0408] Finally, the adapter output is obtained through a residual connection and layer normalization:
[0409] $\mathrm{Adapter}(c) = \mathrm{LayerNorm}(c + c_{\mathrm{up}})$
[0410] Power output calculation:
[0411] The training samples are processed by the encoder to obtain feature representations, which are then input into the decoder module for further processing. The decoder consists of four stacked decoder blocks; compared with an encoder block, each decoder block additionally introduces masked multi-head self-attention and encoder-decoder attention.
[0412] First, the target sequence $Y$ is embedded:
[0413] $Y_{\mathrm{embed}} = Y\,W_{\mathrm{embed}} + b_{\mathrm{embed}}$
[0414] Then the embedding is fused with the positional encoding $PE_{\mathrm{tgt}}$ of the target sequence to form the initial decoder input $D^{0}$:
[0415] $D^{0} = Y_{\mathrm{embed}} + PE_{\mathrm{tgt}}$
[0416] Then, the input is processed by the decoder. Each decoder block contains a masked multi-head self-attention layer, an encoder-decoder attention layer, a feedforward network layer, and residual connections with layer normalization.
[0417] In the masked multi-head self-attention layer, the query, key, and value matrices Q, K, and V are first calculated:
[0418]
[0419] Then, calculate the masked attention:
[0420]
[0421] where the mask matrix is calculated as follows:
[0422]
[0423] The masked attention heads are then concatenated:
[0424]
[0425] Finally, a residual connection and layer normalization are applied to the concatenated masked attention:
[0426]
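The masking step can be illustrated with a minimal sketch. It assumes the common additive convention for a causal mask (0 for positions a time step may attend to, negative infinity for future positions); the source does not spell out its mask formula, so the function names `causal_mask` and `masked_softmax_rows` and the convention itself are assumptions made for illustration.

```python
import math

def causal_mask(n):
    """Return an n x n additive mask: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def masked_softmax_rows(scores, mask):
    """Add the mask to the attention scores, then apply softmax to each row."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores for a 3-step target sequence (illustrative values only).
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax_rows(scores, causal_mask(3))
```

With uniform scores, step 1 attends only to itself, step 2 splits its weight evenly over steps 1 and 2, and step 3 over all three steps, so no step receives information from later steps.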
[0427] In the encoder-decoder attention layer, the encoder output serves as the keys and values, and the decoder output from the previous layer serves as the queries for calculating the attention scores. The per-head attention results are then concatenated into a multi-head attention output; this process is similar to encoder self-attention and is not elaborated further. Finally, a residual connection and layer normalization are applied to obtain the output:
[0428]
[0429] Here, "memory" refers to the output of the last layer of the encoder.
[0430] The feedforward transformation is performed in the feedforward network layer as follows:
[0431]
[0432] After all decoder layers have been applied, the decoder output is obtained:
[0433]
[0434] Parameter fine-tuning:
[0435] Based on the decoder output, the prediction result is obtained through projection:
[0436]
[0437] In this stage, the model is trained using mean squared error as the loss function, that is:
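[0438] $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{B \cdot L}\sum_{b=1}^{B}\sum_{t=1}^{L}\left(\hat{y}_{b,t} - y_{b,t}\right)^{2}$
[0439] where $B = 32$ is the batch size, $L = 96$ is the sequence length, and $\hat{y}_{b,t}$ and $y_{b,t}$ are the predicted and actual power of sample $b$ at time step $t$.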
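As a worked illustration of the mean squared error loss over a batch of B = 32 sequences of length L = 96, the following minimal sketch uses plain Python lists. The function name `mse_loss` and the constant toy values are assumptions chosen only to make the arithmetic easy to follow.

```python
def mse_loss(pred, target):
    """Mean squared error averaged over all B x L entries of a batch."""
    batch_size = len(pred)
    seq_len = len(pred[0])
    total = 0.0
    for pred_seq, target_seq in zip(pred, target):
        for p, y in zip(pred_seq, target_seq):
            total += (p - y) ** 2
    return total / (batch_size * seq_len)

B, L = 32, 96  # batch size and sequence length stated in the text
target = [[0.5] * L for _ in range(B)]   # toy "actual" power values
pred = [[0.75] * L for _ in range(B)]    # toy predictions, off by 0.25 everywhere
loss = mse_loss(pred, target)            # every squared error is 0.25 ** 2
```

Because every prediction is off by 0.25, the loss equals 0.25 squared, that is 0.0625.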
[0440] At the start of training, the encoder weights are directly transferred from the pre-trained part and frozen during training; the adapter parameters, decoder parameters, and output projection parameters are randomly initialized and iteratively updated by gradient descent under the guidance of the loss function.
[0441] Short-term power prediction (model inference):
[0442] Input meteorological and power data from 96 historical time steps. After model preprocessing, encoder feature extraction, and decoder sequence generation, output photovoltaic power prediction results for the next 24 hours in 15-minute increments.
[0443] The decoder operates in an autoregressive manner during the prediction phase, meaning it uses its own predictions as subsequent inputs to gradually generate the output sequence. Initially, the decoder input is a zero-value tensor.
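The autoregressive inference loop can be sketched as follows. The decoder itself is replaced by a hypothetical stand-in `toy_step`, so this shows only the control flow (zero-value start input, each prediction fed back as the next input), not the trained model; the horizon of 96 steps follows from 24 hours at 15-minute increments.

```python
def autoregressive_forecast(step_fn, memory, horizon):
    """Generate `horizon` predictions one step at a time.

    step_fn(memory, decoder_inputs) is a hypothetical stand-in for one
    decoder pass that returns the next predicted value.
    """
    decoder_inputs = [0.0]  # the initial decoder input is a zero value
    outputs = []
    for _ in range(horizon):
        next_value = step_fn(memory, decoder_inputs)
        outputs.append(next_value)
        decoder_inputs.append(next_value)  # feed the prediction back in
    return outputs

def toy_step(memory, decoder_inputs):
    """Toy decoder: mean of the encoder memory plus half of the last input."""
    return sum(memory) / len(memory) + 0.5 * decoder_inputs[-1]

HORIZON = 24 * 60 // 15   # 24 hours in 15-minute increments = 96 steps
memory = [1.0] * 96       # placeholder for the encoder output of 96 history steps
forecast = autoregressive_forecast(toy_step, memory, HORIZON)
```

Each output depends on all earlier outputs through `decoder_inputs`, which is why prediction errors can accumulate along the horizon in this style of decoding.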
[0444] The method presented in this example is compared with four other methods—LSTM, GRU, Transformer, and CNN—in a photovoltaic prediction experiment. The results are shown in Table 1 below.
[0445] Table 1
[0446]
[0447] As Table 1 shows, the method provided in this example achieves the lowest RMSE and MAE among the compared methods, which indicates that it has higher prediction accuracy and stability in the short-term prediction of photovoltaic power generation.
[0448] The methods provided in the above embodiments of this application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking operation on an electronic device as an example, Figure 6 is a hardware structure block diagram of an electronic device for the power generation prediction method provided in this application. As shown in Figure 6, the electronic device 10 may include one or more processors 02 (only one is shown in the figure; a processor 02 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 04 for storing data, and a transmission module 06 for communication functions. Those skilled in the art will understand that the structure shown in Figure 6 is for illustrative purposes only and does not limit the structure of the electronic device described above. For example, the electronic device 10 may include more or fewer components than shown in Figure 6, or have a configuration different from that shown in Figure 6.
[0449] The memory 04 can be used to store software programs and modules of application software, such as the program instructions / modules corresponding to the power generation prediction method in this embodiment. The processor 02 executes various functional applications and data processing by running the software programs and modules stored in the memory 04, thereby realizing the power generation prediction method of the aforementioned application. The memory 04 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 04 may further include memory remotely located relative to the processor 02, and these remote memories can be connected to the electronic device 10 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0450] The transmission module 06 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication provider of the electronic device 10. In one example, the transmission module 06 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission module 06 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.
[0451] At the software level, the aforementioned power generation prediction apparatus may be as shown in Figure 7 and includes:
[0452] The acquisition module 701 is used to acquire historical power generation data of the target photovoltaic power plant and meteorological forecast data of various categories for the corresponding time period as the raw dataset;
[0453] Extraction module 702 is used to extract data in preset proportions from daytime and nighttime data in the original dataset to form a sampled dataset;
[0454] The selection module 703 is used to randomly select a sample from the sampled dataset for each sample in the original dataset to form a sample pair, which serves as input data for the pre-training stage.
[0455] The pre-training module 704 is used to perform contrastive learning pre-training on the original prediction model using the input data, so as to learn the diurnal differences in photovoltaic power generation.
[0456] The adjustment module 705 is used to retain the photovoltaic day and night power generation patterns in the model encoder through parameter sharing, and to fine-tune the parameters in combination with the learnable adapter to obtain the prediction model.
[0457] The prediction module 706 is used to make short-term predictions of photovoltaic power generation of the target photovoltaic power plant through the prediction model.
[0458] In one embodiment, the prediction module 706 can specifically input the original dataset into the prediction model, which performs model preprocessing, encoder feature extraction, and decoder power output calculation, so as to output photovoltaic power prediction results over a future period of predetermined length at intervals of a preset number of minutes; wherein the initial input of the decoder in the prediction model is a zero-value tensor, and during the prediction stage the decoder uses its own predictions as subsequent inputs to gradually generate the output sequence.
[0459] In one implementation, the pre-training module 704 may include:
[0460] The conversion unit is used to convert the time strings in the input data into time objects and extract the time features of each object, wherein the time features include year, month, day, and hour, and to apply periodic encoding to the hour to preserve its temporal periodicity;
[0461] The alignment unit is used to perform spatial averaging on the meteorological elements in the input data and to align the meteorological elements with the power data in time;
[0462] The extraction unit is used to extract features, through the encoder, from the input feature matrix containing the time features and meteorological features. The encoder is composed of N identical encoder blocks stacked together, and each encoder block contains multiple sub-layers: multi-head self-attention, a feedforward network, and residual connections with layer normalization, where N is a positive integer;
[0463] The enhancement unit is used to perform feature enhancement on the output of the encoder;
[0464] The pre-training unit is used to perform contrastive learning pre-training on the feature-enhanced output.
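The work of the conversion and alignment units can be sketched as follows. The timestamp format, the sine/cosine form of the periodic hour encoding, and the helper names `time_features` and `spatial_mean` are assumptions for illustration; the source states only that the hour is periodically encoded and that meteorological elements are spatially averaged.

```python
import math
from datetime import datetime

def time_features(time_string, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a time string and return calendar features plus a cyclic hour encoding."""
    ts = datetime.strptime(time_string, fmt)
    angle = 2.0 * math.pi * ts.hour / 24.0  # map the 24-hour cycle onto a circle
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
    }

def spatial_mean(grid_values):
    """Average one meteorological element over its spatial grid points."""
    return sum(grid_values) / len(grid_values)

features = time_features("2025-06-21 06:00:00")
irradiance = spatial_mean([410.0, 430.0, 420.0, 440.0])  # toy grid values
```

The point of the cyclic encoding is that hour 23 and hour 0 end up close together on the circle, whereas the raw hour values 23 and 0 are far apart.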
[0465] In one implementation, the extraction unit may specifically calculate for each encoder block in the encoder as follows:
[0466] Calculate the projections of the query (Q), key (K), and value (V):
[0467] $Q^{l} = Z^{l-1}W_{Q}^{l}, \quad K^{l} = Z^{l-1}W_{K}^{l}, \quad V^{l} = Z^{l-1}W_{V}^{l}$
[0468] where $Q^{l}$ is the query matrix of the $l$-th layer, the index $l$ indicating the layer number; $Z^{l-1}$ is the output of layer $l-1$ and the input of layer $l$; $K^{l}$ is the key matrix of the $l$-th layer; $V^{l}$ is the value matrix of the $l$-th layer; $W_{Q}^{l}$, $W_{K}^{l}$, and $W_{V}^{l}$ are the weight matrices of the query, key, and value of the $l$-th layer, each of dimension $d_{\mathrm{model}} \times d_{n}$, where $d_{\mathrm{model}}$ is the dimension of the encoder and $d_{n}$ is the feature dimension of the queries, keys, and values;
[0469] Calculate the attention scores:
[0470] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{n}}}\right)V$
[0471] $\mathrm{head}_{i}^{l} = \mathrm{Attention}\left(Z^{l-1}W_{Q_{i}}^{l},\ Z^{l-1}W_{K_{i}}^{l},\ Z^{l-1}W_{V_{i}}^{l}\right)$
[0472] where $K$ is the key matrix used to calculate the attention scores and $K^{T}$ is its transpose, taken to satisfy the requirements of matrix multiplication; $\mathrm{head}_{i}^{l}$ is the attention result of the $i$-th head in layer $l$; $W_{Q_{i}}^{l}$, $W_{K_{i}}^{l}$, and $W_{V_{i}}^{l}$ are the weight matrices of the $i$-th head in layer $l$ for the query matrix Q, the key matrix K, and the value matrix V, respectively, where $i$ is the head index. After the single-head attention results are obtained, they are concatenated to obtain the multi-head attention:
[0473] $\mathrm{MultiHead}^{l} = \mathrm{Concat}\left(\mathrm{head}_{1}^{l}, \ldots, \mathrm{head}_{h}^{l}\right)W_{O}^{l}$
[0474] where Concat is the concatenation operation on matrices or vectors, and $W_{O}^{l}$ is the output weight matrix of the multi-head attention of the $l$-th encoder layer, in which the subscript $O$ indicates the output, the superscript $l$ indicates the layer number, and $h$ is the number of heads.
[0475] A residual connection and layer normalization are applied to the output of the multi-head self-attention sub-layer:
[0476] $Z'^{l} = \mathrm{LayerNorm}\left(Z^{l-1} + \mathrm{MultiHead}^{l}\right)$
[0477] where LayerNorm is layer normalization and $Z'^{l}$ is the output of the multi-head self-attention sub-layer in the $l$-th encoder layer; the prime distinguishes it from the output $Z^{l}$ of the $l$-th encoder layer;
[0478] In the feedforward network sub-layer, perform the feedforward transformation:
[0479] $\mathrm{FFN}^{l}(x), \quad x = Z'^{l}$
[0480] where $\mathrm{FFN}^{l}$ is the feedforward neural network in the $l$-th encoder layer and $x$ is the input of this network.
[0481] The output of the feedforward transformation is then subjected to a residual connection and layer normalization:
[0482] $Z^{l} = \mathrm{LayerNorm}\left(Z'^{l} + \mathrm{FFN}^{l}(Z'^{l})\right)$
[0483] After passing through N identical encoder blocks, the output of the encoder is obtained:
[0484] $Z^{N} = \mathrm{Encoder}(Z^{N-1}), \quad Z^{0} = X$
[0485] where $Z^{N}$ is the output of the $N$-th encoder layer, which is also the final output of the encoder, $N$ being the number of encoder blocks; $X$ is the input feature matrix containing the time and meteorological features; and Encoder denotes all operations of a single encoder layer.
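The attention computation inside an encoder block can be illustrated with a minimal single-head sketch. It assumes standard scaled dot-product attention with scaling by the square root of the query/key feature dimension; the helper names and the toy identity weight matrices are assumptions for illustration and are not taken from the source.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d_n = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_n) for s in row]) for row in scores]
    return matmul(weights, v)

z_prev = [[1.0, 0.0], [0.0, 1.0]]    # toy layer input: 2 time steps, 2 features
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy projection weights for Q, K and V
q = matmul(z_prev, identity)
k = matmul(z_prev, identity)
v = matmul(z_prev, identity)
head = attention(q, k, v)
```

In a multi-head layer, several such heads with separate projection weights would be computed, concatenated, and multiplied by an output weight matrix before the residual connection and layer normalization.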
[0486] In one implementation, the enhancement unit can specifically construct a feature vector based on the encoder's output; and use mean pooling to obtain the global features of the entire sequence, resulting in a pooled feature vector.
[0487] $h = \frac{1}{T}\sum_{t=1}^{T} Z^{N}_{t}$
[0488] where $t$ is the time step index in the sequence, $Z^{N}_{t}$ is the encoder output at time step $t$, $h$ is the pooled feature vector, and $T$ is the length of the entire sequence;
[0489] Layer normalization normalizes the internal dimensions of the feature vector of a single sample, calculates the mean and variance of all dimensions of the vector, and adjusts them to a mean of 0 and a variance of 1 to eliminate the dimensional differences of different feature dimensions within the same sample.
[0490] Batch normalization normalizes the feature vectors of all samples within the same batch, calculates the mean and variance of all samples in the same dimension, and adjusts them to a stable distribution to eliminate feature distribution shifts between different samples, accelerates model training convergence, and avoids comparison loss calculation errors caused by differences in batch data distribution.
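The difference between the two normalizations can be made concrete with a small sketch. It assumes plain standardization without learnable scale and shift parameters, and it applies layer normalization before batch normalization; both choices, and the function names, are simplifying assumptions rather than details confirmed by the source.

```python
import math

def mean_pool(sequence):
    """Average T feature vectors (T x d) into one d-dimensional vector."""
    t_len = len(sequence)
    return [sum(col) / t_len for col in zip(*sequence)]

def standardize(values, eps=1e-5):
    """Shift and scale values to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(vector):
    """Normalize across the feature dimensions of a single sample."""
    return standardize(vector)

def batch_norm(batch):
    """Normalize each feature dimension across all samples in the batch."""
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

seq_a = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # toy encoder outputs, T = 2
seq_b = [[0.0, 0.0, 0.0, 8.0], [0.0, 0.0, 0.0, 8.0]]
pooled = [mean_pool(seq_a), mean_pool(seq_b)]
layer_normed = [layer_norm(h) for h in pooled]
enhanced = batch_norm(layer_normed)
```

Layer normalization works along the rows (one sample at a time), while batch normalization works along the columns (one feature dimension across samples).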
[0491] In one implementation, the pre-training unit can specifically perform a projection transformation on the feature-enhanced output $\tilde{h}$ according to the following formula:
[0492] $z = W_{q2}\,\phi\left(W_{q1}\tilde{h} + b_{q1}\right) + b_{q2}$
[0493] where $z$ is the result of the projection transformation; $W_{q2}$ is the second weight matrix of the projection transformation, the subscript $q$ indicating the projection transformation; $\phi$ is the activation function; $W_{q1}$ is the first weight matrix of the projection transformation; $b_{q1}$ is the first bias vector of the projection transformation; and $b_{q2}$ is the second bias vector of the projection transformation;
[0494] The similarity between the two samples in a sample pair is calculated from the projection transformation results:
[0495] $s = \mathrm{sim}(z_{a}, z_{b}) / \tau$
[0496] where $s$ is the similarity between the two samples in a sample pair, $z_{a}$ and $z_{b}$ are their projection transformation results, $\mathrm{sim}$ is the similarity measure, and $\tau$ is a temperature parameter used to control the shape of the similarity function curve. The smaller the value of $\tau$, the larger the differences in similarity between samples and the steeper the function curve. For learning the day and night patterns of photovoltaic power generation, a smaller $\tau$ can help the model better distinguish between daytime and nighttime patterns.
[0497] Calculate the similarity probability value of the sample pair from the similarity between its two samples:
[0498] $p = \sigma(s)$
[0499] where $p$ is the similarity probability value of the sample pair and $\sigma$ is the sigmoid function;
[0500] Based on the similarity probability values of the sample pairs, contrastive learning pre-training is performed using the following contrastive learning loss function:
[0501] $\mathcal{L}_{\mathrm{CL}} = -\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}\left[y_{i}\log(p_{i} + \epsilon) + (1 - y_{i})\log(1 - p_{i} + \epsilon)\right]$
[0502] where $\epsilon$ is a very small positive constant that prevents the logarithm from being evaluated at 0; $\mathcal{L}_{\mathrm{CL}}$ is the contrastive learning loss; $N_{p}$ is the total number of sample pairs; $y_{i}$ is the true label of the $i$-th sample pair; log is the logarithmic function; and $p_{i}$ is the similarity probability value of the $i$-th sample pair.
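The chain from similarity to loss can be sketched in a few lines. This sketch assumes cosine similarity as the similarity measure, a label of 1 for pairs from the same day/night class and 0 otherwise, and a binary cross-entropy form of the loss; the source names a temperature, a sigmoid, a true label, and a small constant inside the logarithm, but does not confirm these specific choices.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_probability(z_a, z_b, tau):
    """Similarity probability of a pair: sigmoid of temperature-scaled similarity."""
    return sigmoid(cosine_similarity(z_a, z_b) / tau)

def contrastive_loss(probs, labels, eps=1e-8):
    """Binary cross-entropy over sample pairs; eps keeps log away from 0."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return -total / len(probs)

tau = 0.1  # illustrative temperature, not a value from the source
p_same = pair_probability([1.0, 0.0], [1.0, 0.0], tau)   # identical projections
p_diff = pair_probability([1.0, 0.0], [-1.0, 0.0], tau)  # opposite projections
loss_correct = contrastive_loss([p_same, p_diff], [1, 0])
loss_wrong = contrastive_loss([p_same, p_diff], [0, 1])
```

With a small temperature, similar and dissimilar pairs are pushed toward probabilities near 1 and 0, so the loss is small when the labels agree with the similarities and large when they do not.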
[0503] In one embodiment, the adapter of the above-described model encoder may include a down-projection layer, a hidden layer, an up-projection layer, and a residual connection with normalized output, wherein:
[0504] The down-projection layer reduces the dimensionality of the input:
[0505] $c_{\mathrm{down}} = W_{\mathrm{down}}\,c + b_{\mathrm{down}}$
[0506] where $c$ is the input feature vector, $W_{\mathrm{down}}$ is the down-projection weight matrix, $b_{\mathrm{down}}$ is the down-projection bias vector, and $c_{\mathrm{down}}$ is the feature vector after dimensionality reduction, whose dimension is $d_{\mathrm{model}}/k$, where $k$ is a multiple of 2 and $d_{\mathrm{model}}$ is the dimension of the encoder;
[0507] The hidden layer is calculated according to the following formula:
[0508] $c_{\mathrm{hid}} = \mathrm{LAct}_{\alpha}(c_{\mathrm{down}})$
[0509] where $\alpha$ is a learnable scalar parameter. At initialization, $\mathrm{LAct}_{\alpha}$ is equivalent to $\tanh$; during training, $\alpha$ is updated so that the activation function of the hidden layer can be adjusted flexibly. Here $\tanh$ is the hyperbolic tangent function and LAct is the activation function of this layer.
[0510] The up-projection layer is calculated according to the following formula:
[0511] $c_{\mathrm{up}} = W_{\mathrm{up}}\,c_{\mathrm{hid}} + b_{\mathrm{up}}$
[0512] where $c_{\mathrm{hid}}$ is the output of the hidden layer, $W_{\mathrm{up}}$ is the up-projection weight matrix, $b_{\mathrm{up}}$ is the up-projection bias vector, and $c_{\mathrm{up}}$ is the feature vector after the dimension is restored;
[0513] The adapter output is obtained after a residual connection and layer normalization:
[0514] $\mathrm{Adapter}(c) = \mathrm{LayerNorm}(c + c_{\mathrm{up}})$
[0515] Here, Adapter represents all operations of the adapter, LayerNorm represents layer normalization operations, and c represents the input of the adapter.
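The adapter's data flow can be sketched as below. The activation form tanh(alpha * x) is an assumption: the source says only that a learnable scalar adjusts the hidden activation and that the hyperbolic tangent is involved. The weight values, the dimensions (4 reduced to 2), and the function names are likewise illustrative, and layer normalization is written without learnable scale and shift.

```python
import math

def linear(weights, bias, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def layer_norm(x, eps=1e-5):
    """Standardize a vector across its feature dimensions."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adapter(c, w_down, b_down, w_up, b_up, alpha=1.0):
    """Down-project, apply the scalar-controlled activation, up-project, add residual."""
    down = linear(w_down, b_down, c)                 # d_model -> reduced dimension
    hidden = [math.tanh(alpha * v) for v in down]    # assumed form of LAct
    up = linear(w_up, b_up, hidden)                  # reduced dimension -> d_model
    return layer_norm([ci + ui for ci, ui in zip(c, up)])

c = [1.0, 2.0, 3.0, 4.0]                             # toy input, d_model = 4
w_down = [[0.1, 0.1, 0.1, 0.1], [-0.1, -0.1, -0.1, -0.1]]
b_down = [0.0, 0.0]
w_up = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [0.0, 0.0]]
b_up = [0.0, 0.0, 0.0, 0.0]
out = adapter(c, w_down, b_down, w_up, b_up)

# With a zero up-projection the adapter reduces to layer normalization of its input.
zero_up = [[0.0, 0.0] for _ in range(4)]
passthrough = adapter(c, w_down, b_down, zero_up, b_up)
```

The residual connection means the adapter only has to learn a correction to its input, which is one reason adapters of this shape are commonly paired with a frozen pre-trained encoder.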
[0516] In one implementation, the decoder obtains power output by: after obtaining feature representation through the encoder, inputting the feature representation into the decoder to obtain power output; wherein the decoder is composed of M decoder blocks stacked together, each decoder block containing: a masked multi-head self-attention layer, an encoder-decoder attention layer, a feedforward network layer, a residual connection and layer normalization, where M is a positive integer.
[0517] In one implementation, inputting the feature representation into the decoder to obtain the power output may include:
[0518] Embed the target sequence:
[0519] $Y_{\mathrm{embed}} = Y\,W_{\mathrm{embed}} + b_{\mathrm{embed}}$
[0520] where $Y$ is the target sequence, $W_{\mathrm{embed}}$ is the weight matrix for embedding the target sequence, $Y_{\mathrm{embed}}$ is the embedded target sequence in the power prediction stage, and $b_{\mathrm{embed}}$ is the bias vector for embedding the target sequence;
[0521] The embedded target sequence is then fused with the positional encoding:
[0522] $D^{0} = Y_{\mathrm{embed}} + PE_{\mathrm{tgt}}$
[0523] where $D^{0}$ is the initial input of the decoder and $PE_{\mathrm{tgt}}$ is the positional encoding of the target sequence;
[0524] The query, key, and value matrices of self-attention are calculated in the masked multi-head self-attention layer;
[0525] The masked attention is calculated from the query, key, and value matrices;
[0526] The masked attention heads are concatenated to obtain the masked attention concatenation result;
[0527] The power output is obtained by applying a residual connection and layer normalization to the masked attention concatenation result.
[0528] The embodiments of this application also provide a specific implementation of an electronic device capable of implementing all steps in the power generation prediction method of the above embodiments. The electronic device specifically includes: a processor, a memory, a communication interface, and a bus; wherein the processor, memory, and communication interface communicate with each other through the bus; the processor is used to call a computer program in the memory, and when the processor executes the computer program, it implements all steps in the power generation prediction method of the above embodiments. For example, when the processor executes the computer program, it implements the following steps:
[0529] Step 1: Obtain historical power generation data of the target photovoltaic power plant and meteorological forecast data of various categories for the corresponding time period as the original dataset;
[0530] Step 2: Extract data in preset proportions from the daytime data and the nighttime data in the original dataset, respectively, to form a sampled dataset;
[0531] Step 3: For each sample in the original dataset, randomly select a sample from the sampled dataset to form a sample pair, which serves as the input data for the pre-training stage;
[0532] Step 4: Use the input data to perform contrastive learning pre-training on the original prediction model to learn the diurnal differences in photovoltaic power generation;
[0533] Step 5: Retain the photovoltaic day and night power generation patterns in the model encoder through parameter sharing, and fine-tune the parameters in combination with the learnable adapter to obtain the prediction model;
[0534] Step 6: Use the prediction model to make a short-term prediction of photovoltaic power generation for the target photovoltaic power plant.
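Steps 2 and 3 can be sketched as follows. The day/night rule (06:00 to 18:00 counted as daytime), the sampling ratios, the pair label (1 when both samples belong to the same day/night class, 0 otherwise), and the function name `build_pairs` are all assumptions made for illustration; the source specifies only preset proportions and random pairing.

```python
import random

def build_pairs(samples, day_ratio, night_ratio, seed=0):
    """Sample day and night data at preset ratios, then pair every original
    sample with a randomly chosen sample from the sampled dataset."""
    rng = random.Random(seed)
    day = [s for s in samples if s["is_day"]]
    night = [s for s in samples if not s["is_day"]]
    sampled = (rng.sample(day, int(len(day) * day_ratio))
               + rng.sample(night, int(len(night) * night_ratio)))
    pairs = []
    for anchor in samples:
        partner = rng.choice(sampled)
        # Assumed labeling rule: 1 if both samples share the same day/night class.
        label = 1 if anchor["is_day"] == partner["is_day"] else 0
        pairs.append((anchor, partner, label))
    return sampled, pairs

# Toy original dataset: one sample per hour, daytime assumed to be 06:00-18:00.
samples = [{"hour": h, "is_day": 6 <= h < 18} for h in range(24)]
sampled, pairs = build_pairs(samples, day_ratio=0.5, night_ratio=0.25)
```

Sampling the two classes at separate ratios lets the share of daytime and nighttime partners be controlled independently of their share in the original data.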
[0535] Embodiments of this application also provide a computer-readable storage medium capable of implementing all steps of the power generation prediction method in the above embodiments. The computer-readable storage medium stores a computer program that, when executed by a processor, implements all steps of the power generation prediction method in the above embodiments. For example, when the processor executes the computer program, it implements the following steps:
[0536] Step 1: Obtain historical power generation data of the target photovoltaic power plant and meteorological forecast data of various categories for the corresponding time period as the original dataset;
[0537] Step 2: Extract data in preset proportions from the daytime data and the nighttime data in the original dataset, respectively, to form a sampled dataset;
[0538] Step 3: For each sample in the original dataset, randomly select a sample from the sampled dataset to form a sample pair, which serves as the input data for the pre-training stage;
[0539] Step 4: Use the input data to perform contrastive learning pre-training on the original prediction model to learn the diurnal differences in photovoltaic power generation;
[0540] Step 5: Retain the photovoltaic day and night power generation patterns in the model encoder through parameter sharing, and fine-tune the parameters in combination with the learnable adapter to obtain the prediction model;
[0541] Step 6: Use the prediction model to make a short-term prediction of photovoltaic power generation for the target photovoltaic power plant.
[0542] As described above, this embodiment of the application obtains historical power generation data of the target photovoltaic power plant and meteorological forecast data of various categories for the corresponding time period as the original dataset. It then extracts data in preset proportions from the daytime and nighttime data in the original dataset to form a sampled dataset, and pairs each sample in the original dataset with a randomly selected sample from the sampled dataset to form the input data for the pre-training stage. The original prediction model is pre-trained by contrastive learning on the input data to learn the diurnal differences in photovoltaic power generation. The model encoder retains the photovoltaic day and night power generation patterns through parameter sharing and is combined with a learnable adapter for parameter fine-tuning to obtain the prediction model, which enables short-term prediction of photovoltaic power generation at the target photovoltaic power plant. By introducing diurnal differences into photovoltaic power generation prediction, the technical problem of low accuracy in existing photovoltaic power generation predictions can be addressed, achieving the technical effect of improving the accuracy of photovoltaic power generation prediction.
[0543] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to interchangeably. Each embodiment focuses on its differences from other embodiments. In particular, hardware + program embodiments are relatively simple in description because they are fundamentally similar to method embodiments; relevant parts can be referred to the descriptions in the method embodiments.
[0544] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.
[0545] While this application provides the method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps listed in the embodiments is merely one possible execution order among many and does not represent the only execution order. In actual device or client product execution, the methods shown in the embodiments or drawings can be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment).
[0546] While this specification provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive means. The order of steps listed in the embodiments is merely one possible execution order among many and does not represent the only execution order. In actual device or end product execution, the methods shown in the embodiments or drawings may be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment, or even a distributed data processing environment). The terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, product, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, product, or apparatus. Without further limitations, the presence of other identical or equivalent elements in the process, method, product, or apparatus that includes said elements is not excluded.
[0547] For ease of description, the above devices are described in terms of function, divided into various modules. Of course, in implementing the embodiments of this specification, the functions of each module can be implemented in one or more software and / or hardware components, or a module that performs the same function can be implemented by a combination of multiple sub-modules or sub-units. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between devices or units, and may be electrical, mechanical, or other forms.
[0548] Those skilled in the art will also know that, besides implementing the controller using purely computer-readable program code, the same functions can be achieved by logically programming the method steps, making the controller function as logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers (PLCs), and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the devices within it used to implement various functions can also be considered structures within that hardware component. Alternatively, the devices used to implement various functions can be considered as both software modules implementing the method and structures within a hardware component.
[0549] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0550] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, the embodiments of this specification can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, the embodiments of this specification can take the form of computer program products implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0551] The embodiments described in this specification can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. The embodiments of this specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0552] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, system embodiments are basically similar to method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments. In the description of this specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of the embodiments in this specification. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described can be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification and the features of different embodiments or examples.
[0553] The above description is merely an embodiment of the present specification and is not intended to limit the embodiments of the present specification. For those skilled in the art, various modifications and variations can be made to the embodiments of the present specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the embodiments of the present specification should be included within the scope of the claims of the embodiments of the present specification.
Claims
1. A method for predicting power generation, characterized in that the method comprises:
obtaining historical power generation data of a target photovoltaic power plant and the various categories of meteorological forecast data for the corresponding time period as an original dataset;
extracting data in preset proportions from the daytime data and the nighttime data in the original dataset, respectively, to form a sampled dataset;
for each sample in the original dataset, randomly selecting a sample from the sampled dataset to form a sample pair, the sample pairs serving as input data for a pre-training stage;
performing contrastive learning pre-training on an original prediction model using the input data, so as to learn the diurnal differences in photovoltaic power generation;
saving the daytime and nighttime photovoltaic power generation patterns in a parameter-sharing manner by means of a model encoder, and fine-tuning the parameters in combination with a learnable adapter to obtain a prediction model; and
performing short-term prediction of the photovoltaic power generation of the target photovoltaic power plant using the prediction model;
wherein performing contrastive learning pre-training on the original prediction model using the input data, so as to learn the diurnal differences in photovoltaic power generation, comprises:
converting the time string in the input data into a time object and extracting time features of the object, the time features including year, month, day and hour;
applying periodic encoding to the hour so as to preserve temporal periodicity;
spatially averaging the meteorological elements in the input data and aligning the meteorological elements in time with the power data;
extracting features, by the encoder, from an input feature matrix containing the time features and the meteorological features, wherein the encoder is composed of N identical encoder blocks stacked together, each encoder block containing multiple sub-layers, namely multi-head self-attention, a feedforward network, residual connections and layer normalization, N being a positive integer;
performing feature enhancement on the output of the encoder; and
performing contrastive learning pre-training on the feature-enhanced output;
wherein the adapter of the model encoder comprises a down-projection layer, a hidden layer, an up-projection layer, a residual connection and a normalized output, wherein:
the down-projection layer reduces the dimensionality of the input:
$$h_{\mathrm{down}} = c\,W_{\mathrm{down}} + b_{\mathrm{down}}$$
where $c$ is the input feature vector, $W_{\mathrm{down}}$ is the down-projection weight matrix, $b_{\mathrm{down}}$ is the down-projection bias vector, and $h_{\mathrm{down}}$ is the feature vector after dimensionality reduction, the reduced dimension being specified in terms of a multiple of 2 and $d_{\mathrm{model}}$, the dimension of the encoder (the exact relation is not legible in the source text);
the hidden layer is calculated as:
$$h_{\mathrm{hid}} = \mathrm{LAct}(h_{\mathrm{down}})$$
where LAct denotes the activation function of the hidden layer, which is built on the hyperbolic tangent function and contains a learnable scalar parameter; the scalar parameter is set at initialization and adjusted during training so as to adjust the activation function of the hidden layer, and $x$ denotes the input of the neural network (the explicit expression of LAct and the initial value of the scalar are not legible in the source text);
the up-projection layer is calculated as:
$$h_{\mathrm{up}} = h_{\mathrm{hid}}\,W_{\mathrm{up}} + b_{\mathrm{up}}$$
where $h_{\mathrm{hid}}$ is the output of the hidden layer, $W_{\mathrm{up}}$ is the up-projection weight matrix, $b_{\mathrm{up}}$ is the up-projection bias vector, and $h_{\mathrm{up}}$ is the feature vector after the dimension is restored; and
the adapter output is obtained after the residual connection and layer normalization:
$$\mathrm{Adapter}(c) = \mathrm{LayerNorm}(c + h_{\mathrm{up}})$$
where Adapter denotes all operations of the adapter, LayerNorm denotes the layer normalization operation, and $c$ denotes the input of the adapter.
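By way of non-limiting illustration, the adapter recited in claim 1 (down-projection, hidden layer, up-projection, residual connection, layer normalization) might be sketched as follows. The class name `Adapter`, the row-vector convention, the weight initialization, and in particular the hidden activation `tanh(alpha * x)` with `alpha` initialized to 1 are assumptions made for this sketch; the claim text does not preserve the exact activation formula.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class Adapter:
    """Bottleneck adapter: down-projection, hidden activation, up-projection, residual, LayerNorm."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        self.b_down = np.zeros(d_bottleneck)
        self.w_up = rng.normal(0.0, 0.02, size=(d_bottleneck, d_model))
        self.b_up = np.zeros(d_model)
        # ASSUMPTION: learnable scalar inside the activation, initialized to 1.
        self.alpha = 1.0

    def __call__(self, c):
        h_down = c @ self.w_down + self.b_down   # reduce dimensionality
        h_hid = np.tanh(self.alpha * h_down)     # ASSUMED form of the hidden activation
        h_up = h_hid @ self.w_up + self.b_up     # restore the encoder dimension
        return layer_norm(c + h_up)              # residual connection + layer normalization


adapter = Adapter(d_model=8, d_bottleneck=4)
features = np.random.default_rng(1).normal(size=(3, 8))
adapted = adapter(features)
```

In a fine-tuning setting of this kind, only the adapter parameters would typically be updated while the pre-trained encoder weights stay fixed; the sketch shows the forward pass only.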
2. The method according to claim 1, characterized in that performing short-term prediction of the photovoltaic power generation of the target photovoltaic power plant using the prediction model comprises:
inputting the original dataset into the prediction model; and
performing, by the prediction model, model preprocessing, encoder feature extraction and decoding by a decoder to obtain the power output, so as to output a photovoltaic power prediction result for every preset number of minutes over a future period of predetermined length;
wherein, in the prediction stage, the decoder of the prediction model initially takes a zero-valued tensor as input and uses its own predictions as subsequent inputs, so as to generate the output sequence step by step.
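By way of non-limiting illustration, the step-by-step generation recited in claim 2 (a zero-valued initial decoder input, with each prediction fed back as the next input) might be sketched as follows. The function names `autoregressive_forecast` and `toy_decoder_step` are hypothetical, and the toy step function merely stands in for a trained decoder.

```python
import numpy as np


def autoregressive_forecast(decoder_step, memory, horizon, d_out=1):
    """Generate `horizon` steps, starting from a zero tensor and feeding predictions back in."""
    generated = np.zeros((1, d_out))                    # initial decoder input: zero-valued tensor
    for _ in range(horizon):
        next_value = decoder_step(generated, memory)    # predict from everything generated so far
        generated = np.vstack([generated, np.reshape(next_value, (1, d_out))])
    return generated[1:]                                # drop the zero start row


def toy_decoder_step(generated, memory):
    """Hypothetical stand-in for a trained decoder: last value plus the mean of the encoder memory."""
    return generated[-1] + memory.mean()


memory = np.full((4, 2), 0.5)                           # placeholder encoder output
forecast = autoregressive_forecast(toy_decoder_step, memory, horizon=3)
```

With this toy step function the forecast grows by 0.5 at each step, which makes the feedback of earlier predictions easy to see.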
3. The method according to claim 1, characterized in that extracting features, by the encoder, from the input feature matrix containing the time features and the meteorological features comprises, for each encoder block in the encoder:
calculating the projections of the query, key and value:
$$Q^{l} = Z^{l-1} W_{Q}^{l}, \qquad K^{l} = Z^{l-1} W_{K}^{l}, \qquad V^{l} = Z^{l-1} W_{V}^{l}$$
where $Q^{l}$ denotes the query matrix of the $l$-th layer, the index $l$ indicating the layer number; $Z^{l-1}$ denotes the output of layer $l-1$, which is the input of layer $l$; $K^{l}$ denotes the key matrix of the $l$-th layer; $V^{l}$ denotes the value matrix of the $l$-th layer; and $W_{Q}^{l}$, $W_{K}^{l}$, $W_{V}^{l}$ denote the weight matrices of the query, key and value of the $l$-th layer, each of dimension $d_{\mathrm{model}} \times d_{n}$, where $d_{\mathrm{model}}$ is the dimension of the encoder and $d_{n}$ is the feature dimension of the queries, keys and values;
calculating the attention score:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{n}}}\right) V$$
$$\mathrm{head}_{i}^{l} = \mathrm{Attention}\!\left(Z^{l-1} W_{Q,i}^{l},\; Z^{l-1} W_{K,i}^{l},\; Z^{l-1} W_{V,i}^{l}\right)$$
where $K$ is the key matrix used to calculate the attention score and $K^{T}$ is its transpose, taken to satisfy the requirements of matrix multiplication; $\mathrm{head}_{i}^{l}$ denotes the attention result of the $i$-th single head in layer $l$; $W_{Q,i}^{l}$, $W_{K,i}^{l}$ and $W_{V,i}^{l}$ denote the weight matrices of the query matrix $Q$, key matrix $K$ and value matrix $V$ of the $i$-th head in the $l$-th layer; and $i$ indexes the head;
after the single-head attention results are obtained, concatenating them to obtain the multi-head attention:
$$\mathrm{MultiHead}(Z^{l-1}) = \mathrm{Concat}\!\left(\mathrm{head}_{1}^{l}, \ldots, \mathrm{head}_{h_e}^{l}\right) W_{O}^{l}$$
where Concat denotes the concatenation of matrices or vectors; $W_{O}^{l}$ denotes the output weight matrix of the multi-head attention of the $l$-th encoder layer, the subscript $O$ indicating the output and the superscript $l$ the layer number; and $h_e$ denotes the number of heads;
performing residual connection and layer normalization on the output of the multi-head self-attention sub-layer:
$$Z'^{\,l} = \mathrm{LayerNorm}\!\left(Z^{l-1} + \mathrm{MultiHead}(Z^{l-1})\right)$$
where LayerNorm denotes layer normalization and $Z'^{\,l}$ denotes the output of the multi-head self-attention sub-layer in the $l$-th encoder layer, the prime distinguishing it from the output $Z^{l}$ of the $l$-th encoder layer;
performing the feedforward transformation $\mathrm{FFN}^{l}(x)$ in the feedforward network sub-layer, where $\mathrm{FFN}^{l}$ denotes the feedforward neural network in the $l$-th encoder layer, $x$ denotes the input of the neural network, $W_{1}^{l} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{ff}}$ and $W_{2}^{l} \in \mathbb{R}^{d_{ff} \times d_{\mathrm{model}}}$ denote the two weight matrices used for the feedforward transformation in the $l$-th layer, and $d_{ff}$ denotes the dimension of the hidden layer of the feedforward neural network;
performing residual connection and layer normalization on the output of the feedforward transformation:
$$Z^{l} = \mathrm{LayerNorm}\!\left(Z'^{\,l} + \mathrm{FFN}^{l}(Z'^{\,l})\right)$$
and, after passing through the N identical encoder blocks, obtaining the output of the encoder:
$$Z^{N} = \mathrm{Encoder}\big(\cdots \mathrm{Encoder}\big(\mathrm{Encoder}(X)\big)\big)$$
where $Z^{N}$ denotes the output of the $N$-th encoder layer, which is also the final output of the encoder; $N$ is the number of encoder blocks, so that Encoder is applied $N$ times; $X$ denotes the input feature matrix containing the time features and the meteorological features; and Encoder denotes all operations of a single encoder layer.
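By way of non-limiting illustration, one encoder block as recited in claim 3 (per-head query, key and value projections, scaled dot-product attention, concatenation, residual connections with layer normalization, and a two-matrix feedforward network) might be sketched as follows. The helper names, the parameter dictionary layout, the random initialization, and the ReLU activation in the feedforward network are assumptions made for this sketch; the claim text does not state the feedforward activation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_block(d_model, n_heads, d_ff, rng, scale=0.1):
    """Hypothetical parameter container for one encoder block."""
    d_n = d_model // n_heads
    return {
        "w_q": rng.normal(0, scale, (n_heads, d_model, d_n)),
        "w_k": rng.normal(0, scale, (n_heads, d_model, d_n)),
        "w_v": rng.normal(0, scale, (n_heads, d_model, d_n)),
        "w_o": rng.normal(0, scale, (n_heads * d_n, d_model)),
        "w_1": rng.normal(0, scale, (d_model, d_ff)),
        "w_2": rng.normal(0, scale, (d_ff, d_model)),
    }


def encoder_block(z_prev, p):
    """Map the previous layer output (T, d_model) to this layer's output (T, d_model)."""
    d_n = p["w_q"].shape[-1]
    heads = []
    for w_q, w_k, w_v in zip(p["w_q"], p["w_k"], p["w_v"]):
        q, k, v = z_prev @ w_q, z_prev @ w_k, z_prev @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_n))        # scaled dot-product attention
        heads.append(weights @ v)
    multi_head = np.concatenate(heads, axis=-1) @ p["w_o"]
    z_attn = layer_norm(z_prev + multi_head)             # residual + LayerNorm
    ffn = np.maximum(0.0, z_attn @ p["w_1"]) @ p["w_2"]  # ASSUMPTION: ReLU activation
    return layer_norm(z_attn + ffn)                      # residual + LayerNorm


def encoder(x, blocks):
    """Apply N stacked encoder blocks in sequence."""
    z = x
    for p in blocks:
        z = encoder_block(z, p)
    return z


rng = np.random.default_rng(0)
blocks = [init_block(d_model=8, n_heads=2, d_ff=16, rng=rng) for _ in range(3)]
encoded = encoder(rng.normal(size=(5, 8)), blocks)       # 5 time steps, 8 features
```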
4. The method according to claim 1, characterized in that performing feature enhancement on the output of the encoder comprises:
constructing feature vectors based on the output of the encoder;
obtaining the global features of the entire sequence by mean pooling, yielding the pooled feature vector:
$$h = \frac{1}{T} \sum_{t=1}^{T} Z_{t}^{N}$$
where $t$ is the time-step index in the sequence, $Z_{t}^{N}$ is the encoder output at time step $t$, $h$ is the pooled feature vector, and $T$ is the length of the entire sequence;
applying layer normalization, which normalizes across the dimensions of the feature vector of a single sample: the mean and variance over all dimensions of that sample's feature vector are calculated and the vector is adjusted to a mean of 0 and a variance of 1, so as to eliminate scale differences between feature dimensions within the same sample; and
applying batch normalization, which normalizes the feature vectors of all samples within the same batch: the mean and variance of all samples in the current batch are calculated for each dimension and the features are adjusted to a stable distribution, so as to eliminate feature distribution shifts between samples, accelerate the convergence of model training, and avoid errors in the contrastive loss calculation caused by differences in data distribution between batches.
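By way of non-limiting illustration, the feature enhancement recited in claim 4 (mean pooling over time, per-sample layer normalization, per-feature batch normalization) might be sketched as follows. The function names, the order in which the two normalizations are chained, and the omission of learnable scale and shift parameters are assumptions made for this sketch.

```python
import numpy as np


def mean_pool(z):
    """Average over the time axis: (batch, T, d) -> (batch, d)."""
    return z.mean(axis=1)


def layer_norm(h, eps=1e-5):
    """Per sample: zero mean and unit variance across the feature dimensions."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def batch_norm(h, eps=1e-5):
    """Per feature dimension: zero mean and unit variance across the samples of the batch."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def enhance(z):
    """ASSUMED order: mean pooling, then layer normalization, then batch normalization."""
    return batch_norm(layer_norm(mean_pool(z)))


encoder_output = np.random.default_rng(0).normal(size=(4, 6, 8))  # batch=4, T=6, d=8
enhanced = enhance(encoder_output)
```

The two normalizations act on different axes: `layer_norm` uses `axis=-1` (within one sample), while `batch_norm` uses `axis=0` (across the batch).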
5. The method according to claim 1, characterized in that performing contrastive learning pre-training on the feature-enhanced output comprises:
applying a projection transformation to the feature-enhanced output:
$$z = W_{q2}\,\phi\!\left(W_{q1} h + b_{q1}\right) + b_{q2}$$
where $z$ denotes the result of the projection transformation, $h$ denotes the feature-enhanced output, $W_{q2}$ denotes the second weight matrix of the projection transformation, the subscript $q$ indicating the projection transformation, $\phi$ denotes the activation function, $W_{q1}$ denotes the first weight matrix of the projection transformation, $b_{q1}$ denotes the first bias vector of the projection transformation, and $b_{q2}$ denotes the second bias vector of the projection transformation;
calculating, from the projection transformation results, the similarity $s$ between the two samples in a sample pair, where $\tau$ denotes the temperature parameter (the explicit similarity formula is not legible in the source text);
calculating the similarity probability value of the sample pair from the similarity between the two samples in the sample pair:
$$p = \sigma(s)$$
where $p$ is the similarity probability value of the sample pair and $\sigma$ denotes the sigmoid function; and
performing contrastive learning pre-training, based on the similarity probability values of the sample pairs, using the following contrastive learning loss function:
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \Big[\, y_{i} \log(p_{i} + \varepsilon) + (1 - y_{i}) \log(1 - p_{i} + \varepsilon) \Big]$$
where $\varepsilon$ is a very small positive constant used to prevent numerical overflow when the input of the logarithm is 0, $\mathcal{L}$ denotes the contrastive learning loss function, $n$ denotes the total number of sample pairs, $y_{i}$ denotes the true label of the $i$-th sample pair, $\log$ denotes the logarithmic function, and $p_{i}$ denotes the similarity probability value of the $i$-th sample pair.
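By way of non-limiting illustration, the pre-training objective recited in claim 5 (a two-layer projection, a temperature-scaled similarity, a sigmoid probability, and a logarithmic loss over labeled sample pairs with a small stabilizing constant) might be sketched as follows. The ReLU activation, the use of cosine similarity divided by the temperature, the binary cross-entropy form of the loss, and all function names are assumptions made for this sketch; the claim text does not preserve the exact formulas.

```python
import numpy as np


def project(h, w1, b1, w2, b2):
    """Two-layer projection head (row-vector convention); ASSUMPTION: ReLU activation."""
    return np.maximum(0.0, h @ w1 + b1) @ w2 + b2


def pair_similarity(z_a, z_b, tau=0.5):
    """ASSUMPTION: cosine similarity between the two projections, scaled by temperature tau."""
    num = (z_a * z_b).sum(axis=-1)
    den = np.linalg.norm(z_a, axis=-1) * np.linalg.norm(z_b, axis=-1) + 1e-12
    return num / den / tau


def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))


def contrastive_loss(p, y, eps=1e-8):
    """Binary cross-entropy over sample pairs; eps keeps the logarithm away from log(0)."""
    return float(-np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))


rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(8, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
h_a, h_b = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))   # six sample pairs
labels = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])             # hypothetical pair labels
probs = sigmoid(pair_similarity(project(h_a, w1, b1, w2, b2),
                                project(h_b, w1, b1, w2, b2)))
loss = contrastive_loss(probs, labels)
```

A lower temperature `tau` sharpens the probabilities, since the same similarity is mapped further from zero before the sigmoid is applied.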
6. The method according to claim 2, characterized in that obtaining the power output by the decoder comprises:
after the feature representation is obtained through the encoder, inputting the feature representation into the decoder to obtain the power output;
wherein the decoder is composed of M stacked decoder blocks, each decoder block comprising a masked multi-head self-attention layer, an encoder-decoder attention layer, a feedforward network layer, residual connections and layer normalization, M being a positive integer.
7. The method according to claim 6, characterized in that inputting the feature representation into the decoder to obtain the power output comprises:
embedding the target sequence:
$$Y_{\mathrm{embed}} = W_{\mathrm{embed}}\,Y + b_{\mathrm{embed}}$$
where $Y_{\mathrm{embed}}$ denotes the embedded target sequence, $W_{\mathrm{embed}}$ denotes the weight matrix for calculating the embedded target sequence, $Y$ denotes the encoder output during the power prediction stage, and $b_{\mathrm{embed}}$ denotes the bias vector for calculating the embedded target sequence;
fusing the embedded target sequence with the positional encoding:
$$D_{0} = Y_{\mathrm{embed}} + PE_{\mathrm{tgt}}$$
where $D_{0}$ denotes the initial input of the decoder and $PE_{\mathrm{tgt}}$ denotes the positional encoding of the target;
calculating the query, key and value of the self-attention through the masked multi-head self-attention layer;
calculating the masked attention from the query, key and value;
concatenating the masked attention results of the heads to obtain a masked attention concatenation result; and
performing residual connection and layer normalization on the masked attention concatenation result to obtain the power output.
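By way of non-limiting illustration, the decoder input path recited in claim 7 (embedding of the target sequence, addition of a positional encoding, and masked self-attention) might be sketched as follows. The sinusoidal form of the positional encoding, the single attention head, the row-major layout (one time step per row), and all function names are assumptions made for this sketch.

```python
import numpy as np


def positional_encoding(length, d_model):
    """ASSUMPTION: sinusoidal positional encoding; the claim does not fix its form."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))


def embed_target(y, w_embed, b_embed):
    """Linear embedding of the target sequence plus positional encoding (decoder input D0)."""
    return y @ w_embed + b_embed + positional_encoding(len(y), w_embed.shape[1])


def masked_self_attention(d0, w_q, w_k, w_v):
    """Single-head self-attention in which step t may only attend to steps <= t."""
    q, k, v = d0 @ w_q, d0 @ w_k, d0 @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)                 # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
target = rng.normal(size=(4, 1))                               # 4 steps of scalar power values
d0 = embed_target(target, rng.normal(size=(1, 8)), np.zeros(8))
attended, attn_weights = masked_self_attention(
    d0, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```

The mask makes the attention weight matrix lower triangular, which is what allows the decoder to be run step by step at prediction time without seeing future values.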
8. A device for predicting power generation, characterized in that the device comprises:
an acquisition module, configured to acquire historical power generation data of a target photovoltaic power plant and the various categories of meteorological forecast data for the corresponding time period as an original dataset;
an extraction module, configured to extract data in preset proportions from the daytime data and the nighttime data in the original dataset, respectively, to form a sampled dataset;
a selection module, configured to randomly select, for each sample in the original dataset, a sample from the sampled dataset to form a sample pair, the sample pairs serving as input data for a pre-training stage;
a pre-training module, configured to perform contrastive learning pre-training on an original prediction model using the input data, so as to learn the diurnal differences in photovoltaic power generation;
an adjustment module, configured to save the daytime and nighttime photovoltaic power generation patterns in a parameter-sharing manner by means of a model encoder, and to fine-tune the parameters in combination with a learnable adapter to obtain a prediction model; and
a prediction module, configured to perform short-term prediction of the photovoltaic power generation of the target photovoltaic power plant using the prediction model;
wherein performing contrastive learning pre-training on the original prediction model using the input data, so as to learn the diurnal differences in photovoltaic power generation, comprises:
converting the time string in the input data into a time object and extracting time features of the object, the time features including year, month, day and hour;
applying periodic encoding to the hour so as to preserve temporal periodicity;
spatially averaging the meteorological elements in the input data and aligning the meteorological elements in time with the power data;
extracting features, by the encoder, from an input feature matrix containing the time features and the meteorological features, wherein the encoder is composed of N identical encoder blocks stacked together, each encoder block containing multiple sub-layers, namely multi-head self-attention, a feedforward network, residual connections and layer normalization, N being a positive integer;
performing feature enhancement on the output of the encoder; and
performing contrastive learning pre-training on the feature-enhanced output;
wherein the adapter of the model encoder comprises a down-projection layer, a hidden layer, an up-projection layer, a residual connection and a normalized output, wherein:
the down-projection layer reduces the dimensionality of the input:
$$h_{\mathrm{down}} = c\,W_{\mathrm{down}} + b_{\mathrm{down}}$$
where $c$ is the input feature vector, $W_{\mathrm{down}}$ is the down-projection weight matrix, $b_{\mathrm{down}}$ is the down-projection bias vector, and $h_{\mathrm{down}}$ is the feature vector after dimensionality reduction, the reduced dimension being specified in terms of a multiple of 2 and $d_{\mathrm{model}}$, the dimension of the encoder (the exact relation is not legible in the source text);
the hidden layer is calculated as:
$$h_{\mathrm{hid}} = \mathrm{LAct}(h_{\mathrm{down}})$$
where LAct denotes the activation function of the hidden layer, which is built on the hyperbolic tangent function and contains a learnable scalar parameter; the scalar parameter is set at initialization and adjusted during training so as to adjust the activation function of the hidden layer, and $x$ denotes the input of the neural network (the explicit expression of LAct and the initial value of the scalar are not legible in the source text);
the up-projection layer is calculated as:
$$h_{\mathrm{up}} = h_{\mathrm{hid}}\,W_{\mathrm{up}} + b_{\mathrm{up}}$$
where $h_{\mathrm{hid}}$ is the output of the hidden layer, $W_{\mathrm{up}}$ is the up-projection weight matrix, $b_{\mathrm{up}}$ is the up-projection bias vector, and $h_{\mathrm{up}}$ is the feature vector after the dimension is restored; and
the adapter output is obtained after the residual connection and layer normalization:
$$\mathrm{Adapter}(c) = \mathrm{LayerNorm}(c + h_{\mathrm{up}})$$
where Adapter denotes all operations of the adapter, LayerNorm denotes the layer normalization operation, and $c$ denotes the input of the adapter.
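By way of non-limiting illustration, the work of the extraction and selection modules of claim 8, together with the periodic encoding of the hour, might be sketched as follows. The fixed 06:00 to 18:00 daytime window, the sine and cosine form of the periodic encoding, the time string format, the rule that a pair is labeled 1 when both samples fall in the same period, and all function names are assumptions made for this sketch; the claims do not specify them.

```python
import math
import random
from datetime import datetime


def is_daytime(ts, sunrise_hour=6, sunset_hour=18):
    """ASSUMPTION: a fixed hour window separates daytime from nighttime samples."""
    return sunrise_hour <= ts.hour < sunset_hour


def time_features(time_string):
    """Parse a time string and extract year, month, day and a cyclic encoding of the hour."""
    ts = datetime.strptime(time_string, "%Y-%m-%d %H:%M:%S")   # ASSUMED string format
    angle = 2.0 * math.pi * ts.hour / 24.0
    return {"year": ts.year, "month": ts.month, "day": ts.day,
            "hour_sin": math.sin(angle), "hour_cos": math.cos(angle)}


def build_sample_pairs(samples, day_ratio, night_ratio, seed=0):
    """Sample day and night data in preset proportions, then pair every original sample."""
    rng = random.Random(seed)
    day = [s for s in samples if is_daytime(s["time"])]
    night = [s for s in samples if not is_daytime(s["time"])]
    sampled = (rng.sample(day, int(len(day) * day_ratio))
               + rng.sample(night, int(len(night) * night_ratio)))
    pairs = []
    for anchor in samples:
        partner = rng.choice(sampled)
        # ASSUMED labeling rule: 1 if both samples are from the same period, else 0.
        label = int(is_daytime(anchor["time"]) == is_daytime(partner["time"]))
        pairs.append((anchor, partner, label))
    return pairs


samples = [{"time": datetime(2025, 1, 1, hour), "power": float(hour)} for hour in range(24)]
pairs = build_sample_pairs(samples, day_ratio=0.5, night_ratio=0.5)
features = time_features("2025-01-01 06:00:00")
```

Encoding the hour as a sine and cosine pair places 23:00 next to 00:00 in feature space, which a raw integer hour would not do.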
9. An electronic device comprising a processor and a memory for storing processor-executable instructions, characterized in that, when the processor executes the instructions, the steps of the method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium having a computer program/instructions stored thereon, characterized in that, when the computer program/instructions are executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.
11. A computer program product comprising a computer program/instructions, characterized in that, when the computer program/instructions are executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.