A cloud image prediction method based on frequency domain enhancement and causal attention
By introducing a dual-scale spatiotemporal feature backbone network with a learnable discrete wavelet enhancement module and a causal temporal aggregation module into satellite cloud image prediction, the problems of high-frequency detail attenuation and insufficient long-range dependence in satellite cloud image prediction are solved, and clearer and more stable multi-step prediction results are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- NAT SATELLITE METEOROLOGICAL CENT
- Filing Date
- 2026-04-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing satellite cloud image prediction methods suffer from high-frequency detail attenuation and insufficient long-range dependence in multi-step prediction, resulting in blurred boundaries, texture breaks, and error accumulation, which affects the clarity and consistency of predicted details.
The method predicts cloud images using frequency domain enhancement and causal attention. A learnable discrete wavelet enhancement module (L-DWT) in the shallow layer and a causal temporal aggregation module (CTAM) in the bottleneck layer are added to a dual-scale spatiotemporal feature backbone network, providing frequency domain detail compensation and forming long-term trend anchors.
The method significantly improves the detail clarity, structural consistency, and robustness of multi-step cloud image prediction.
Figure CN122244597A_ABST (abstract drawing)
Description
Technical Field
[0001] This invention relates to a cloud image prediction method based on frequency domain enhancement and causal attention, belonging to the field of meteorological cloud image prediction technology.
Background Technology
[0002] Image region analysis is a crucial component of computer vision and intelligent image understanding. In meteorology, geostationary meteorological satellites continuously acquire cloud images over large areas at a temporal resolution of 10–15 minutes, which provides a highly timely data foundation for short-term forecasting, severe convective weather monitoring, aviation operations support, and energy dispatching. Existing methods generally employ convolutional neural networks (CNNs) as the fundamental module for feature extraction and reconstruction. Through two-dimensional or three-dimensional convolution, multi-scale spatial features are extracted at the encoding end, and resolution and details are gradually restored at the decoding end, thereby completing end-to-end cloud image extrapolation.
[0003] Early video and cloud image prediction work often directly used standard convolutional networks for extrapolation between adjacent frames, modeling local correlations in the spatial domain by stacking convolutional layers, and gradually developing encoder-decoder structures and pyramid-style feature extraction frameworks to enhance the expressive power of multi-scale targets. While improving semantic abstraction capabilities, such methods also introduced a long-standing side effect: to expand the receptive field and obtain stronger high-level semantic representations, the network usually employs multi-level downsampling and layer-by-layer convolutional smoothing, which causes high-frequency components in the image to be continuously suppressed in deep features. For satellite cloud images, structures such as cloud edges, thin clouds, and narrow cloud bands often rely on high-frequency textures and sharp boundaries for representation. In multi-scale downsampling chains, these structures are more prone to boundary blurring, texture breakage, and disappearance of fine lines, thus affecting the morphological reliability of subsequent multi-step predictions and the identification value of hazardous cloud systems. Therefore, how to explicitly preserve and enhance the high-frequency details that gradually decay during downsampling without significantly increasing computational overhead has become one of the key issues in the spatial representation level of satellite cloud image prediction in recent years.
[0004] Alongside the development of spatial representation, the ability to model temporal dynamics has been continuously enhanced. Numerous studies have begun to explicitly integrate temporal information onto the convolutional backbone: on the one hand, 3D convolutions or spatiotemporal convolutions are used to simultaneously process spatial and temporal dimensions within the network; on the other hand, CNNs are combined with recurrent units or attention mechanisms to ensure that convolutional features retain both local texture and a certain degree of temporal memory. A series of methods, including F-CLSTM, typically use CNNs as spatiotemporal feature extractors, combined with different forms of temporal modeling modules to complete video and cloud image prediction. Srivastava et al. introduced LSTM into the field of video prediction, using LSTMs as both the encoder and decoder of the prediction model to achieve video prediction, and verified the feasibility of combining CNNs and LSTMs to improve prediction accuracy. Shi et al., drawing on the above ideas, proposed ConvLSTM and designed an encoding-forecasting (EF) structure based on the information flow transmission in the network, applying it to precipitation prediction with good results. Subsequently, Wang et al. further integrated 3D convolution, spatiotemporal memory units, and gradient highways into ConvLSTM, proposing PredRNN, PredRNN++, and E3D-LSTM, making significant contributions to the development of video prediction technology. Regarding specialized modeling of satellite cloud images, Tan et al. proposed F-CLSTM, which, through a multi-scale hierarchical ConvLSTM structure and the Forecaster loss function, achieved joint prediction of cloud intensity and cloud morphology, making it more adaptable to the multi-scale feature distribution of cloud images compared to general ConvLSTM. 
However, these methods still rely primarily on the recursive propagation of gated memory to characterize long-range temporal dependencies. As the number of prediction steps increases, errors tend to accumulate and drift during recursion, which shows up as unstable movement direction and speed and as reduced morphological consistency, so cross-step consistency still has room for improvement.
[0005] In recent years, Gao et al. proposed the SimVP framework, which is entirely based on CNNs. This framework decouples spatiotemporal learning into spatial encoding, temporal translation, and spatial decoding, achieving superior performance on various video prediction datasets with a relatively simple structure and significantly reducing training costs. However, its temporal translation module still relies heavily on local convolutions, which has limitations in capturing long-range dynamic processes such as large-scale slow transport, cross-scale interactions, and sudden cloud formation. For tropical cyclone cloud image prediction, Lian et al. constructed the sequence-to-sequence SCSTque model, using multi-scale convolutions and a temporal encoder-decoder structure to fully exploit the spatiotemporal features of satellite cloud images. They also constructed a tropical cyclone cloud cover dataset, achieving good results in extreme weather scenarios. However, SCSTque still primarily operates within a spatial domain convolution and multi-level downsampling framework, with relatively insufficient explicit constraints on high-frequency details. Furthermore, the model structure is relatively complex, resulting in high computational costs, which is not conducive to efficient training and rapid inference on large-scale operational satellite data.
[0006] Overall, existing methods either lean towards semantic abstraction in the spatial dimension while neglecting detail fidelity, or lean towards local dependence in the temporal dimension, which makes it difficult to form stable long-term trend anchors. These two types of problems are superimposed and amplified in multi-step prediction tasks.
Summary of the Invention
[0007] The technical problem to be solved by this invention is to provide a cloud image prediction method based on frequency domain enhancement and causal attention. The method constructs a dual-scale spatiotemporal feature backbone, introduces a learnable discrete wavelet enhancement module L-DWT in the shallow layer for frequency domain detail compensation, and introduces a causal temporal aggregation module CTAM in the bottleneck layer to form long-term trend anchors. Details and semantics are then fused and reconstructed at the same scale, which improves the detail clarity, temporal coherence, and overall robustness of multi-step extrapolation.
[0008] To solve the above technical problem, the present invention adopts the following technical solution: a cloud image prediction method based on frequency domain enhancement and causal attention, in which the following steps A to C are performed to obtain a cloud image prediction model that is then used for cloud image prediction;
[0009] Step A. For each of at least one preset target-direction sky region, cloud images of that sky region are collected at a preset time step. From the frames collected for the same sky region, taken in acquisition order, a run of consecutive frames followed by a further run of consecutive frames is selected and combined into a single sample, with a preset target number of time steps between the first run and the subsequent run. The samples obtained in this way form the sample set; then proceed to step B; where... ;
[0010] Step B. Based on the dual-scale SeqConv spatiotemporal backbone network, a learnable discrete wavelet enhancement module L-DWT is introduced in the shallow layer, and a causal temporal aggregation module CTAM is introduced in the bottleneck layer to construct the network to be trained, and then proceed to step C;
[0011] Step C. Based on the sample set, the network to be trained is trained with the earlier run of consecutive cloud image frames in each sample as input and the subsequent run of consecutive cloud image frames as output, yielding the cloud image prediction model.
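The sample construction in step A and the input/output split in step C can be sketched as a sliding window over the frames of one sky region. This is a minimal sketch under stated assumptions: the names `build_samples`, `n_in`, `n_out`, and `gap` are hypothetical, `gap` is read as the number of frames skipped between the two runs, and the window is assumed to advance one frame at a time.

```python
from typing import List, Sequence, Tuple


def build_samples(frames: Sequence, n_in: int, n_out: int, gap: int = 0) -> List[Tuple[list, list]]:
    """Cut (input, target) pairs from frames that are already in acquisition order.

    n_in, n_out and gap are hypothetical parameter names: the number of input
    frames, the number of target frames, and the number of frames skipped
    between the two runs.
    """
    span = n_in + gap + n_out
    samples = []
    # Assumption: windows advance by one frame at a time.
    for start in range(len(frames) - span + 1):
        inputs = list(frames[start:start + n_in])
        target_start = start + n_in + gap
        targets = list(frames[target_start:target_start + n_out])
        samples.append((inputs, targets))
    return samples


# Integers stand in for the cloud image frames of one sky region.
frames = list(range(10))
samples = build_samples(frames, n_in=4, n_out=3, gap=1)
print(len(samples), samples[0])
```

Each pair keeps the temporal order of the source frames, so the input run can be fed to the network in step C and the target run used as the training output.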
[0012] As a preferred technical solution of the present invention: the network structure to be trained includes a shallow layer, a pooling layer, a bottleneck layer, a decoder layer, an upsampling layer, and a splicing module;
[0013] The shallow layer includes the learnable discrete wavelet enhancement module L-DWT and a set of sub-shallow branches, one per input frame. Each sub-shallow branch consists of two first sequential convolutional feature extraction modules (SeqConv1) and a first convolutional long short-term memory module (ConvLSTM1) connected in series from its input to its output. The input of the first SeqConv1 forms the input of the sub-shallow branch, and the output of ConvLSTM1 forms the output of the sub-shallow branch. The inputs of the sub-shallow branches receive, one-to-one, the frames of the consecutive input cloud images. Following the order of those frames, the output of each sub-shallow branch other than the last is connected to the input of ConvLSTM1 in the next adjacent sub-shallow branch. The output of the last sub-shallow branch is connected to the input of L-DWT, which outputs the enhanced shallow hidden state features. At the same time, the output of each sub-shallow branch is passed through the pooling layer, producing one pooled output path per branch in the same order;
[0014] The bottleneck layer includes the causal temporal aggregation module CTAM, an element-wise addition module, and a set of sub-bottleneck branches, one per pooled output path. Each sub-bottleneck branch consists of two second sequential convolutional feature extraction modules (SeqConv2) and a second convolutional long short-term memory module (ConvLSTM2) connected in series from its input to its output. The input of the first SeqConv2 forms the input of the sub-bottleneck branch, and the output of ConvLSTM2 forms the output of the sub-bottleneck branch. The inputs of the sub-bottleneck branches are connected one-to-one to the pooled output paths. Following the order of those paths, the output of each sub-bottleneck branch other than the last is connected to the input of ConvLSTM2 in the next adjacent sub-bottleneck branch. The output of the last sub-bottleneck branch is connected to one input of the element-wise addition module. At the same time, the outputs of all sub-bottleneck branches are connected to the input of CTAM, and the output of CTAM is connected to the other input of the element-wise addition module, which outputs the enhanced bottleneck hidden state features. The output of the element-wise addition module passes through the decoder layer and the upsampling layer in turn and is then connected to one input of the splicing module. The output of L-DWT is connected to the other input of the splicing module, and the output of the splicing module provides the run of consecutive cloud image frames that follows the consecutive input frames after the preset target number of time steps.
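The wiring described above can be followed by tracking feature shapes from the shallow layer to the splicing module. The sketch below only does shape bookkeeping. All channel counts are hypothetical, the pooling layer is assumed to halve height and width, and the decoder is assumed to keep the bottleneck channel count; none of these values come from the description.

```python
def trace_shapes(n_frames, height, width, c_shallow, c_bottleneck):
    """Return the feature shape at each stage of the two-scale network.

    All channel counts are hypothetical. The decoder is assumed to keep
    c_bottleneck channels, and pooling is assumed to halve height and width.
    """
    if height % 2 or width % 2:
        raise ValueError("height and width must be even for stride-2 pooling")
    half = (height // 2, width // 2)
    return {
        "shallow_hidden_per_frame": (c_shallow, height, width),
        "pooled_per_frame": (c_shallow,) + half,
        "bottleneck_hidden_per_frame": (c_bottleneck,) + half,
        "ctam_tokens": (n_frames, c_bottleneck),          # one pooled vector per frame
        "ctam_output_broadcast": (c_bottleneck,) + half,  # trend anchor at bottleneck scale
        "enhanced_bottleneck": (c_bottleneck,) + half,    # element-wise addition keeps the shape
        "decoded_and_upsampled": (c_bottleneck, height, width),
        "ldwt_output": (c_shallow, height, width),
        "spliced": (c_bottleneck + c_shallow, height, width),
    }


shapes = trace_shapes(n_frames=4, height=64, width=64, c_shallow=16, c_bottleneck=32)
for name, shape in shapes.items():
    print(name, shape)
```

The last two entries show why the splicing happens at the same scale: the upsampled bottleneck path and the L-DWT output share height and width, so they can be concatenated along the channel axis.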
[0015] As a preferred embodiment of the present invention: each first sequential convolutional feature extraction module SeqConv1 in each sub-shallow branch comprises, connected in series from input to output, a convolution module, a normalization layer, and a ReLU activation layer. The input of the convolution module forms the input of SeqConv1, and the output of the ReLU activation layer forms the output of SeqConv1. Each sub-shallow branch receives its corresponding frame of the consecutive input cloud images, and its two sequential SeqConv1 modules process the received cloud image according to the following formula:
[0016] ;
[0017] This yields the shallow features corresponding to each cloud image frame, which are output to the first convolutional long short-term memory module ConvLSTM1. The quantities in the formula are: the frame index within the consecutive input frames; the cloud image of that frame; the shallow features corresponding to that frame; the convolution kernel weights and the bias terms of the convolution modules in the two SeqConv1 modules of each sub-shallow branch; the two-dimensional convolution operation; the normalization function; the ReLU function of the sub-shallow branch; and the computation performed by SeqConv1 as a whole.
[0018] Further, the first convolutional long short-term memory module ConvLSTM1 processes the received shallow features together with the shallow hidden state features and shallow memory cell features output by ConvLSTM1 in the preceding adjacent sub-shallow branch, according to the following formula:
[0019] ;
[0020] This yields the shallow hidden state features and shallow memory cell features of the current frame, which are output. The quantities in the formula are: the shallow hidden state features and shallow memory cell features of the current frame; the shallow hidden state features and shallow memory cell features of the preceding frame; the height, width, and number of channels of the shallow feature map; and the computation performed by the first convolutional long short-term memory module ConvLSTM1;
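A form of the shallow-branch computation that is consistent with the definitions above can be written as follows. Every symbol name here is an assumption chosen for illustration ($X_t$ for the cloud image of frame $t$, $F_t$ for its shallow features, $H_t$ and $C_t$ for the hidden state and memory cell features). The gate equations are the widely used ConvLSTM form without peephole terms, given only to show what the ConvLSTM1 function typically contains; the description does not state which variant the invention uses.

```latex
\begin{aligned}
F_t &= \mathrm{ReLU}\Bigl(\mathrm{Norm}\bigl(W^{(2)} * \mathrm{ReLU}\bigl(\mathrm{Norm}(W^{(1)} * X_t + b^{(1)})\bigr) + b^{(2)}\bigr)\Bigr),\\
(H_t, C_t) &= \mathrm{ConvLSTM1}(F_t, H_{t-1}, C_{t-1}),\\[4pt]
i_t &= \sigma(W_{xi} * F_t + W_{hi} * H_{t-1} + b_i),\\
f_t &= \sigma(W_{xf} * F_t + W_{hf} * H_{t-1} + b_f),\\
o_t &= \sigma(W_{xo} * F_t + W_{ho} * H_{t-1} + b_o),\\
g_t &= \tanh(W_{xg} * F_t + W_{hg} * H_{t-1} + b_g),\\
C_t &= f_t \odot C_{t-1} + i_t \odot g_t,\\
H_t &= o_t \odot \tanh(C_t).
\end{aligned}
```

Here $*$ is two-dimensional convolution, $\odot$ is element-wise multiplication, and $\sigma$ is the logistic sigmoid. The first line applies the convolution, normalization, and ReLU sequence twice, matching the two SeqConv1 modules in series.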
[0021] The second sequential convolutional feature extraction module SeqConv2 in each sub-bottleneck branch has the same structure as the first sequential convolutional feature extraction module SeqConv1. Further, according to the following formula:
[0022] ;
[0023] Each sub-shallow branch outputs its hidden state features to the pooling layer. After spatial downsampling by pooling, the result is sent to the corresponding sub-bottleneck branch, where the second sequential convolutional feature extraction modules SeqConv2 first produce the bottleneck-scale features. The second convolutional long short-term memory module ConvLSTM2 then processes the received bottleneck-scale features together with the hidden state features and memory cell features output by ConvLSTM2 in the preceding adjacent sub-bottleneck branch, according to the following formula:
[0024] ;
[0025] This yields the bottleneck hidden state features and bottleneck memory cell features, which are output. The quantities in the formula are: the bottleneck-scale features of the current frame; the pooling function of the pooling layer POOLING, defined by its pooling window and a stride of 2; the bottleneck hidden state features and bottleneck memory cell features of the current frame; the bottleneck hidden state features and bottleneck memory cell features of the preceding frame; the number of channels of the bottleneck feature map; and the computation performed by the second convolutional long short-term memory module ConvLSTM2.
[0026] As a preferred embodiment of the present invention: the learnable discrete wavelet enhancement module L-DWT includes a depthwise separable convolution module, a splicing module, an element-wise addition module, and two branches with identical structures. The input of the depthwise separable convolution module forms the input of L-DWT. Each branch consists of an upsampling module and a convolutional layer connected in series from its input to its output; the input of the upsampling module forms the input of the branch, and the output of the convolutional layer forms the output of the branch. The output of the depthwise separable convolution module is connected to the inputs of the two branches, and the outputs of the two branches are connected to the two inputs of the splicing module. The output of the splicing module is connected to one input of the element-wise addition module, whose other input is connected to the input of L-DWT. The output of the element-wise addition module forms the output of L-DWT. L-DWT processes the shallow hidden state features received from the last sub-shallow branch as follows:
[0027] First, the depthwise separable convolutional module is used according to the following formula:
[0028] ;
[0029] The shallow hidden state features are first passed through a downsampling transform with a stride of 2, and a pointwise convolution then forms linear combinations of the channels, yielding a learnable low-frequency component and a high-frequency component. The two operators in the formula are the learnable low-frequency band decomposition operator and the learnable high-frequency band decomposition operator, respectively;
[0030] Next, the low-frequency component and the high-frequency component are each processed according to the following formula:
[0031] ;
[0032] Each component is first upsampled by bilinear interpolation and then channel-aligned by a 1×1 convolution, yielding a low-frequency enhancement term and a high-frequency enhancement term with the same spatial size as the shallow hidden state features. The functions in the formula are the bilinear interpolation upsampling function, a nonlinear activation function, and the 1×1 convolution function;
[0033] Then, global scalar gating coefficients are introduced for the low-frequency and high-frequency enhancement terms and, according to the following formula:
[0034] ;
[0035] The gated enhancement terms are combined with the shallow hidden state features output by the last sub-shallow branch to obtain the enhanced shallow hidden state features.
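The three L-DWT steps above (band decomposition at half resolution, upsampling back to the input size, gated residual injection) can be sketched in plain Python on a single-channel grid. This is a simplified stand-in rather than the module itself: fixed Haar-like 2×2 filters replace the learnable stride-2 depthwise and pointwise convolutions, nearest-neighbour upsampling replaces bilinear interpolation, the 1×1 convolution and activation are omitted, and the gated sum `x + alpha * low + beta * high` is an assumed form of the fusion. The names `split_bands`, `upsample_nearest`, `ldwt_sketch`, `alpha`, and `beta` are hypothetical.

```python
def split_bands(x):
    """Split a 2D grid (even height and width) into half-resolution low and high bands."""
    low, high = [], []
    for r in range(0, len(x), 2):
        low_row, high_row = [], []
        for c in range(0, len(x[0]), 2):
            a, b, d, e = x[r][c], x[r][c + 1], x[r + 1][c], x[r + 1][c + 1]
            low_row.append((a + b + d + e) / 4.0)   # 2x2 average: low-pass
            high_row.append((a - b - d + e) / 4.0)  # diagonal difference: high-pass
        low.append(low_row)
        high.append(high_row)
    return low, high


def upsample_nearest(x):
    """Double height and width by repeating each value in a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def ldwt_sketch(x, alpha, beta):
    """Add gated low and high band terms back onto the input (assumed fusion form)."""
    low, high = split_bands(x)
    low_up, high_up = upsample_nearest(low), upsample_nearest(high)
    return [
        [x[r][c] + alpha * low_up[r][c] + beta * high_up[r][c] for c in range(len(x[0]))]
        for r in range(len(x))
    ]


flat = [[2.0] * 4 for _ in range(4)]   # no detail: the high band is zero
checker = [[1.0, 0.0], [0.0, 1.0]]     # pure diagonal detail
enhanced_flat = ldwt_sketch(flat, alpha=0.5, beta=1.0)
enhanced_checker = ldwt_sketch(checker, alpha=0.0, beta=1.0)
print(enhanced_flat[0], enhanced_checker)
```

Setting both gating coefficients to zero returns the input unchanged, which is the property that makes a residual form of this kind safe to insert into an existing backbone.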
[0036] As a preferred technical solution of the present invention: the causal time-series aggregation module CTAM includes, in series from the input end to the output end, a global average pooling GAP, a linear mapping module, a causal multi-head attention module, a splicing module, a linear mapping module, and a broadcast alignment module, wherein the input end of the global average pooling GAP constitutes the input end of the causal time-series aggregation module CTAM, and the output end of the broadcast alignment module constitutes the output end of the causal time-series aggregation module CTAM.
[0037] The causal time-series aggregation module CTAM processes the bottleneck hidden state features output by each sub-bottleneck branch as follows:
[0038] First, global average pooling (GAP) is applied to the bottleneck hidden state features according to the following formula:
[0039] ;
[0040] Each two-dimensional feature map is thereby compressed into a time token, and the tokens are stacked into a sequence. The quantities in the formula are: the global average pooling function; the global pooling feature vector corresponding to each bottleneck hidden state feature; and the global pooling feature vector matrix formed by stacking these vectors;
[0041] Next, the first linear mapping module in the series applies a linear mapping to the global pooling feature vector matrix as follows:
[0042] ;
[0043] This yields the corresponding query, key, and value matrices; the mapping matrices in the formula are learnable parameters;
[0044] The causal multi-head attention module then applies an upper triangular causal mask, so that each time step attends only to itself and to earlier time steps, according to the following formula:
[0045] ;
[0046] This yields the aggregation result of each attention head. The quantities in the formula are: the feature dimension of a single attention head; the query matrix, key matrix, and value matrix of each attention head; the transpose; the normalization function; and the aggregation result of each attention head under the causal constraint.
[0047] The splicing module then concatenates the aggregation results of all attention heads to obtain the trend aggregation vector sequence, expressed as:
[0048]
[0049] where the formula involves the number of attention heads and concatenation along the feature dimension, and D2 denotes the feature dimension after concatenation.
[0050] Finally, the trend aggregation vector at the last time step of the trend aggregation vector sequence is selected as the long-term trend representation. The linear mapping module and the broadcast alignment module process it in turn, first by a linear mapping of channels and then by broadcast alignment over the spatial dimensions. The result forms the output of the causal time-series aggregation module CTAM, which is further combined with the bottleneck hidden state features of the last sub-bottleneck branch.
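The CTAM steps above (global average pooling into time tokens, causally masked attention, selection of the last time step, broadcast back to the feature map) can be sketched in plain Python. This is a reduced sketch under stated assumptions: a single attention head is used, the learnable query, key, and value mappings and the final channel mapping are replaced by the identity, and the function names `global_average_pool`, `causal_attention`, and `ctam_sketch` are hypothetical.

```python
import math


def global_average_pool(state):
    """state: list of channels, each a 2D list. Returns one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in state]


def causal_attention(tokens):
    """Single-head scaled dot-product attention in which step i sees only steps 0..i."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for i, query in enumerate(tokens):
        # Scoring only j <= i is equivalent to masking the upper triangle with -inf.
        scores = [
            sum(q * k for q, k in zip(query, tokens[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(tokens) - i - 1)
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][k] for j in range(len(tokens))) for k in range(dim)]
        )
    return outputs, weights


def ctam_sketch(hidden_states):
    """hidden_states: one bottleneck hidden state per time step, each [channel][row][col]."""
    tokens = [global_average_pool(state) for state in hidden_states]
    aggregated, weights = causal_attention(tokens)
    trend = aggregated[-1]  # last time step serves as the long-term trend representation
    last = hidden_states[-1]
    # Broadcast each channel's trend value over the spatial grid and add it element-wise.
    enhanced = [[[v + trend[c] for v in row] for row in ch] for c, ch in enumerate(last)]
    return enhanced, weights


# Three time steps, two channels, 2x2 feature maps (toy values).
states = [
    [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]],
    [[[2.0, 2.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]],
    [[[0.0, 4.0], [4.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]],
]
enhanced, attn = ctam_sketch(states)
print(attn)
```

The attention weight matrix is lower triangular, so the trend vector for any time step depends only on that step and earlier ones; this is the property that prevents future information from leaking into the prediction.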
[0051] Corresponding to the above, the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the cloud image prediction method based on frequency domain enhancement and causal attention.
[0052] Furthermore, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the cloud image prediction method based on frequency domain enhancement and causal attention.
[0053] Compared with the prior art, the cloud image prediction method based on frequency domain enhancement and causal attention described in this invention has the following technical advantages:
[0054] This invention presents a cloud image prediction method based on frequency domain enhancement and causal attention. It addresses two key issues in multi-step extrapolation of satellite cloud images: high-frequency detail attenuation, and insufficient characterization of long-range dependencies that leads to error accumulation and structural drift. The method uses a dual-scale SeqConv spatiotemporal backbone network. A learnable discrete wavelet enhancement module (L-DWT) is introduced in the shallow layer for frequency domain detail compensation. A causal temporal aggregation module (CTAM) is introduced in the bottleneck layer to aggregate cross-temporal context, forming stable trend anchors. This achieves explicit compensation for key high-frequency structures such as cloud edges and narrow cloud bands, and stable constraints on long-term evolution trends. Finally, details and semantics are fused and reconstructed at the same scale to build the network to be trained, and the cloud image prediction model is obtained by training on a sample set. In cloud image prediction applications, this method significantly improves the detail clarity, structural consistency, and robustness of multi-step prediction.
Attached Figure Description
[0055] Figure 1 is the overall structural block diagram of the WA-CLSTM network in the cloud image prediction method based on frequency domain enhancement and causal attention of this invention;
[0056] Figure 2 is a schematic diagram of the spatiotemporal feature extraction network structure, comprising the dual-scale SeqConv spatiotemporal backbone network and the two-layer ConvLSTM, in the WA-CLSTM network designed in this invention;
[0057] Figure 3 is a schematic diagram of the structure of the learnable discrete wavelet enhancement module L-DWT in the WA-CLSTM network designed in this invention;
[0058] Figure 4 is a schematic diagram of the causal temporal aggregation module CTAM in the WA-CLSTM network designed in this invention;
[0059] Figure 5 is a qualitative visual comparison of the multi-step prediction results of the WA-CLSTM designed in this invention and the comparison methods on the FY-4B cloud image dataset;
[0060] Figure 6 is a visualization of the ablation experiment results.
Detailed Implementation
[0061] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.
[0062] This invention designs a cloud image prediction method based on frequency domain enhancement and causal attention to solve two key bottlenecks commonly found in multi-step prediction of geostationary meteorological satellite cloud images: First, multiple downsampling and convolutional smoothing cause high-frequency details such as cloud edges, thin clouds, and narrow cloud bands to gradually attenuate, resulting in blurred boundaries and texture breaks; Second, relying solely on local convolution and gated memory makes it difficult to form stable long-range evolution constraints, and errors are prone to accumulate in rolling prediction, causing cloud cluster position drift, structural phase misalignment, and morphological distortion, thereby resulting in insufficient cross-step consistency of prediction results and limited operational availability.
[0063] In a specific application, the cloud image prediction method based on frequency domain enhancement and causal attention designed in this invention executes the following steps A to C to obtain a cloud image prediction model, which is then used for cloud image prediction.
[0064] Step A. For each of at least one preset target-direction sky region, cloud images of that sky region are collected at a preset time step. From the frames collected for the same sky region, taken in acquisition order, a run of consecutive frames followed by a further run of consecutive frames is selected and combined into a single sample, with a preset target number of time steps between the first run and the subsequent run. The samples obtained in this way form the sample set; then proceed to step B; where... .
[0065] Step B. Based on the dual-scale SeqConv spatiotemporal backbone network, a learnable discrete wavelet enhancement module L-DWT is introduced in the shallow layer, and a causal temporal aggregation module CTAM is introduced in the bottleneck layer to construct the network to be trained (WA-CLSTM), and then proceed to step C.
[0066] Step C. Based on the sample set, the network to be trained (WA-CLSTM) is trained with the earlier run of consecutive cloud image frames in each sample as input and the subsequent run of consecutive cloud image frames as output, yielding the cloud image prediction model.
[0067] Regarding the designed network to be trained (WA-CLSTM), three design directions are specifically introduced, including the following:
[0068] (1) Dual-scale temporal feature extraction and lightweight representation method: The dual-scale SeqConv spatiotemporal backbone is adopted, and only one spatial downsampling is retained. While reducing the computational overhead, it also reduces the irreversible loss of information caused by excessive downsampling. This enables the model to have both fine-grained structural expression at high-resolution scale and global morphological convergence representation at low-resolution scale, thereby providing a stable spatiotemporal joint representation for multi-step prediction.
[0069] (2) Frequency domain adaptive detail enhancement method based on learnable discrete wavelet transform: Learnable subband decomposition is performed on the latest time feature in the shallow layer to obtain low frequency trend components and high frequency detail components. The dual frequency information is adaptively injected into the shallow layer representation through channel alignment and gated residual back injection to highlight key high frequency structures such as cloud edge gradient, texture and narrow cloud band, and suppress the detail degradation caused by excessive smoothing, thereby improving the boundary clarity and structure fidelity of multi-step extrapolation.
[0070] (3) Causal constraint long-term time series aggregation and trend anchor injection method: The causal time series aggregation module CTAM is introduced at the bottleneck scale to perform cross-time context aggregation of historical sequences under strict causal constraints, form long-term trend representation and inject bottleneck features, so that the model can enhance the ability to characterize the long-term evolution trend of cloud system without leaking future information, suppress error accumulation, phase error and shadow phenomenon in rolling prediction, thereby improving the temporal consistency and overall structural stability of multi-step prediction.
[0071] Based on the above design points, in practical applications, taking the FY-4B infrared brightness temperature cloud image of Fengyun-4B satellite as an example, such as... Figure 1 As shown, the design of the network to be trained (WA-CLSTM) includes a shallow layer, a pooling layer, a bottleneck layer, a decoder layer, an upsampling layer, and a concatenation module; among which, the shallow layer includes a learnable discrete wavelet enhancement module L-DWT and Each sub-shallow branch consists of two first sequential convolutional feature extraction modules (SeqConv1) and a first convolutional long short-term memory module (ConvLSTM1, CLSTM) connected in series from its input to its output. The input of the first SeqConv1 module forms the input of the sub-shallow branch, and the output of the ConvLSTM1 module forms the output. Each sub-shallow branch receives its corresponding input. Each frame of the cloud image in the frame cloud image is continuous. In the sequence of shallow sub-branches corresponding to the frame cloud image, the first The outputs of the shallow branches of each sub-branch are connected to the inputs of the first convolutional long short-term memory module (ConvLSTM1) in the next adjacent shallow branch. The output of the last shallow branch is connected to the input of the learnable discrete wavelet enhancement module (L-DWT), which outputs the enhanced shallow latent state features. Simultaneously, the outputs of each sequential shallow branch are processed by a pooling layer to generate sequential... Output path.
[0072] Regarding the shallow layer specifically, as shown in Figure 2, the first sequential convolutional feature extraction module SeqConv1 in each sub-shallow branch consists of a convolutional module, a normalization layer, and a ReLU activation layer connected in series from its input to its output. The input of the convolutional module constitutes the input of SeqConv1, and the output of the ReLU activation layer constitutes the output of SeqConv1. Each sub-shallow branch receives its corresponding frame of the consecutive input cloud images, and its two SeqConv1 modules in series process the received cloud image as follows:
[0073] ;
[0074] This yields the shallow features corresponding to each cloud image, which are output to the first convolutional long short-term memory module ConvLSTM1. The symbols in the formula denote, in order: the range of the frame index; the cloud image at that index among the consecutive input frames; the shallow features corresponding to that cloud image; the convolution kernel weights of the convolutional modules in the two SeqConv1 modules of each sub-shallow branch; the bias terms of those convolutional modules; the two-dimensional convolution operation; the normalization function; the ReLU function in the sub-shallow branch; and the computation function of the first sequential convolutional feature extraction module SeqConv1.
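As a reading aid, the two-stage SeqConv1 computation described above (convolution, normalization, and ReLU, applied twice in series) can be written compactly as follows. This is a sketch only: the symbol names $X_t$, $F_t^{s}$, $W_k$, $b_k$, and $T$ are assumptions chosen for illustration and are not taken from the patent's own formula.

```latex
F_t^{s} = \mathrm{SeqConv1}(X_t)
        = \mathrm{ReLU}\Big(\mathrm{Norm}\big(W_2 * \mathrm{ReLU}(\mathrm{Norm}(W_1 * X_t + b_1)) + b_2\big)\Big),
\qquad t = 1, \dots, T
```

Here $*$ stands for two-dimensional convolution, $W_1, b_1$ and $W_2, b_2$ for the kernel weights and biases of the first and second SeqConv1 module, and $T$ for the assumed number of input frames.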
[0075] The shallow features are then fed into the first convolutional long short-term memory module (ConvLSTM1) for recursive temporal updates, which continuously accumulate dynamic information about how local details evolve over time. ConvLSTM1 processes the received shallow features, together with the shallow hidden state features and shallow memory cell features output by ConvLSTM1 in the preceding adjacent sub-shallow branch, as follows:
[0076] ;
[0077] This yields the shallow hidden state features and shallow memory cell features of the current frame. The shallow scale outputs a high-resolution representation that continuously characterizes high-frequency structures such as cloud edge gradients, narrow cloud bands, and convective cells. The symbols in the formula denote: the shallow hidden state features and shallow memory cell features corresponding to the current frame; the shallow hidden state features and shallow memory cell features corresponding to the preceding frame; the height, width, and number of channels of the shallow feature map; and the computation function of the first convolutional long short-term memory module ConvLSTM1.
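For orientation, a commonly used form of the ConvLSTM update is sketched below. It is the standard formulation without peephole connections, given here under the assumption that ConvLSTM1 follows it; the patent's own formula may differ, and all symbol names ($F_t^{s}$, $H_t^{s}$, $C_t^{s}$, the gate weights and biases) are illustrative.

```latex
\begin{aligned}
i_t &= \sigma\big(W_{xi} * F_t^{s} + W_{hi} * H_{t-1}^{s} + b_i\big) \\
f_t &= \sigma\big(W_{xf} * F_t^{s} + W_{hf} * H_{t-1}^{s} + b_f\big) \\
o_t &= \sigma\big(W_{xo} * F_t^{s} + W_{ho} * H_{t-1}^{s} + b_o\big) \\
g_t &= \tanh\big(W_{xg} * F_t^{s} + W_{hg} * H_{t-1}^{s} + b_g\big) \\
C_t^{s} &= f_t \odot C_{t-1}^{s} + i_t \odot g_t \\
H_t^{s} &= o_t \odot \tanh\big(C_t^{s}\big)
\end{aligned}
```

In this form $*$ is two-dimensional convolution, $\odot$ is element-wise multiplication, and $H_t^{s}, C_t^{s} \in \mathbb{R}^{H \times W \times C}$ share the spatial size of the shallow feature map.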
[0078] In satellite cloud image prediction tasks, the fine textures within cloud edges, thin cloud bands, and cloud clusters typically correspond to strong high-frequency information. These details determine the visual clarity of the prediction results and directly affect the structural stability of multi-step extrapolation. However, the spatiotemporal recursive modeling and decoding reconstruction process is often accompanied by a smoothing effect, which can easily lead to problems such as blurred boundaries and broken narrow cloud bands. To alleviate the attenuation of high-frequency details and enhance the structural fidelity of shallow representations, this invention designs a learnable discrete wavelet enhancement module (L-DWT) to be introduced into the shallow branches. Through learnable frequency band decomposition and dual-frequency residual backinjection, the shallow features are explicitly enhanced at the frequency domain level.
[0079] As shown in Figure 3, the learnable discrete wavelet enhancement module L-DWT includes a depthwise separable convolutional module, a concatenation module, an element-wise addition module, and two branches with identical structures. The input of the depthwise separable convolutional module forms the input of the L-DWT. Each branch connects an upsampling module and a convolutional layer in series from its input to its output; the input of the upsampling module forms the input of the branch, and the output of the convolutional layer forms the output of the branch. The output of the depthwise separable convolutional module is connected to the inputs of the two branches, and the outputs of the two branches are connected to the two inputs of the concatenation module. The output of the concatenation module is connected to one input of the element-wise addition module, whose other input is connected to the input of the L-DWT. The output of the element-wise addition module forms the output of the L-DWT. The L-DWT processes the shallow hidden state features output by the last sub-shallow branch as follows.
[0080] First, the depthwise separable convolutional module is used according to the following formula:
[0081] ;
[0082] A downsampling transform with a stride of 2 is first applied to the received shallow hidden state features, followed by pointwise convolution to complete the linear combination of channels, yielding a learnable low-frequency component and a learnable high-frequency component. The two operators in the formula are the learnable low-frequency band decomposition operator and the learnable high-frequency band decomposition operator, respectively.
[0083] The low-frequency component mainly preserves the relatively smooth background and large-scale intensity variations, giving a more stable characterization of the main outline and slow evolution trend of cloud clusters. The high-frequency component responds more strongly to high-frequency structural information such as cloud edge gradients, textures, and narrow cloud bands. Using both components for coordinated compensation in the frequency domain enhancement stage allows multi-step extrapolation to balance overall consistency and clarity of detail.
[0084] To inject the high-frequency information back into the shallow features at their original resolution, the low-frequency component and the high-frequency component are then processed separately according to the following formula:
[0085] ;
[0086] Bilinear interpolation upsampling is performed first, and channel alignment is then completed through a 1×1 convolution, yielding a low-frequency enhancement term and a high-frequency enhancement term with the same spatial size as the input shallow hidden state features. The functions in the formula are the bilinear interpolation upsampling function, a nonlinear activation function, and the 1×1 convolution function.
[0087] Global scalar gating coefficients for the low-frequency and high-frequency terms are then introduced to adaptively adjust the injection intensity of each band. This avoids noise amplification or artifact accumulation caused by excessive enhancement and improves the stability of the training process, according to the following formula:
[0088] ;
[0089] Finally, combined with the shallow hidden state features output by the last sub-shallow branch, the dual-frequency enhancement terms are fused back into the shallow features by residual back-injection, yielding the enhanced shallow hidden state features. Because the two gating coefficients are learned adaptively during training, a large low-frequency coefficient makes the model emphasize low-frequency compensation of the main outline and background trend, while a large high-frequency coefficient makes it emphasize high-frequency enhancement of cloud edges and texture details. The enhancement magnitude is thus dynamically weighted under different weather conditions and brightness temperature gradients.
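The band decomposition and gated residual back-injection described above can be illustrated with a minimal single-channel sketch. It is not the patented module: fixed 2×2 Haar-like filters stand in for the learnable stride-2 kernels, nearest-neighbor upsampling stands in for bilinear interpolation, the 1×1 convolution and activation are omitted, and the names `decompose`, `upsample2`, `ldwt_enhance`, `alpha`, and `beta` are hypothetical.

```python
# Illustrative single-channel sketch of dual-band gated residual back-injection.
# Assumptions: fixed Haar-like filters replace the learnable stride-2 kernels,
# nearest-neighbor upsampling replaces bilinear interpolation, and the input
# grid has even height and width.

def decompose(x):
    """Split a 2D grid into low (local mean) and high (diagonal detail) bands."""
    low, high = [], []
    for i in range(0, len(x), 2):
        low_row, high_row = [], []
        for j in range(0, len(x[0]), 2):
            a, b, c, d = x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]
            low_row.append((a + b + c + d) / 4.0)   # smooth background
            high_row.append((a - b - c + d) / 4.0)  # edge and texture response
        low.append(low_row)
        high.append(high_row)
    return low, high


def upsample2(band):
    """Nearest-neighbor upsampling by a factor of 2 in both directions."""
    out = []
    for row in band:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def ldwt_enhance(x, alpha, beta):
    """Residual back-injection: x + alpha * low term + beta * high term."""
    low, high = decompose(x)
    e_low, e_high = upsample2(low), upsample2(high)
    return [
        [x[i][j] + alpha * e_low[i][j] + beta * e_high[i][j]
         for j in range(len(x[0]))]
        for i in range(len(x))
    ]
```

With both gates at zero the sketch returns its input unchanged, which reflects the role of the residual path: the enhancement terms only add to the original shallow features, and the scalar gates control how strongly each band is injected.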
[0090] The bottleneck layer includes the causal temporal aggregation module CTAM, an element-wise addition module, and a set of sub-bottleneck branches. Each sub-bottleneck branch consists of two second sequential convolutional feature extraction modules (SeqConv2) and a second convolutional long short-term memory module (ConvLSTM2, CLSTM) connected in series from its input to its output. The input of the first SeqConv2 module forms the input of the sub-bottleneck branch, and the output of the ConvLSTM2 module forms the output of the sub-bottleneck branch. The inputs of the sub-bottleneck branches are connected one-to-one to the sequential output paths of the pooling layer. Following the order of the corresponding pooling output paths, the output of each sub-bottleneck branch except the last is connected to the input of the ConvLSTM2 module in the next adjacent sub-bottleneck branch. The output of the last sub-bottleneck branch is connected to one input of the element-wise addition module. At the same time, the outputs of all sub-bottleneck branches are connected to the input of the CTAM, whose output is connected to the other input of the element-wise addition module. The element-wise addition module outputs the enhanced bottleneck hidden state features.
[0091] In the bottleneck layer, the second sequential convolutional feature extraction module SeqConv2 in each sub-bottleneck branch has the same structure as the first sequential convolutional feature extraction module SeqConv1, and further operates as follows:
[0092] ;
[0093] The hidden state features output by each sub-shallow branch are sent to the pooling layer. After spatial downsampling by pooling, they are passed to the corresponding sub-bottleneck branch, where the second sequential convolutional feature extraction modules SeqConv2 first produce the bottleneck scale features. The second convolutional long short-term memory module ConvLSTM2 then processes the received bottleneck scale features, together with the hidden state features and memory cell features output by ConvLSTM2 in the preceding adjacent sub-bottleneck branch, according to the following formula:
[0094] ;
[0095] This yields the bottleneck hidden state features and bottleneck memory cell features of the current frame. The bottleneck scale focuses on modeling the morphology and slow evolution trend of large-scale cloud systems and provides stable semantic support for the subsequent causal temporal aggregation. The symbols in the formulas denote: the bottleneck scale features corresponding to the current frame; the pooling function of the pooling layer POOLING, with its pooling window and a stride of 2; the bottleneck hidden state features and bottleneck memory cell features corresponding to the current frame; the bottleneck hidden state features and bottleneck memory cell features corresponding to the preceding frame; the number of channels of the bottleneck feature map; and the computation function of the second convolutional long short-term memory module ConvLSTM2.
[0096] The temporal evolution of satellite cloud images exhibits significant multi-scale characteristics. Local cloud edges and narrow cloud bands change rapidly on short-term scales, while the organization and overall translational trend of large-scale cloud systems span a longer time range and show a relatively smooth and continuous evolution. In multi-step extrapolation scenarios, relying solely on the stepwise recursive memory of ConvLSTM to advance future sequences can easily lead to the gradual accumulation of errors during rolling predictions, manifesting as cloud cluster position drift, cloud band structure breaks, and morphological distortions. To enhance the modeling capability for long-range evolution trends, this invention introduces a causal temporal aggregation module (CTAM) at the bottleneck scale.
[0097] As shown in Figure 4, the causal temporal aggregation module CTAM consists of, in series from input to output, a global average pooling (GAP) module, a linear mapping module, a causal multi-head attention module, a concatenation module, a linear mapping module, and a broadcast alignment module. The input of the GAP module constitutes the input of the CTAM, and the output of the broadcast alignment module constitutes the output of the CTAM.
[0098] The causal temporal aggregation module CTAM processes the bottleneck hidden state features output by the sub-bottleneck branches as follows:
[0099] First, the global average pooling (GAP) module processes the bottleneck hidden state features according to the following formula:
[0100] ;
[0101] Each two-dimensional feature map is compressed into a time token, and the tokens are stacked into a token sequence. This compression lets the subsequent temporal modeling focus on macroscopic evolution trends rather than local noise, while significantly reducing the computational complexity of temporal aggregation. The symbols in the formula denote the global average pooling function, the global pooling feature vector corresponding to each frame, and the matrix of global pooling feature vectors formed from the bottleneck hidden state features;
[0102] Next, the first linear mapping module applies a linear mapping to the matrix of global pooling feature vectors as follows:
[0103] ;
[0104] This yields the corresponding query, key, and value matrices, where the projection matrices in the formula are learnable parameters;
[0105] The causal multi-head attention module then applies an upper triangular causal mask, which prevents each time step from attending to later time steps, according to the following formula:
[0106] ;
[0107] This yields the aggregation result of each attention head. The symbols in the formula denote: the feature dimension of a single attention head; the query, key, and value matrices corresponding to that attention head; the transpose; the normalization function; and the aggregation result of that attention head under the causal constraint.
[0108] The concatenation module then concatenates the aggregation results of all attention heads to obtain the trend aggregation vector sequence, expressed as:
[0109]
[0110] where the symbols denote the number of attention heads and concatenation along the feature dimension, and D2 denotes the feature dimension after concatenation.
[0111] Finally, the trend aggregation vector at the last time step of the trend aggregation vector sequence is selected as the long-term trend representation, so that a bottleneck enhancement representation combining local recursive information with global trend constraints can be formed. This vector passes sequentially through a linear mapping module and a broadcast alignment module, that is, channel-wise linear mapping followed by spatial broadcast alignment, and the result constitutes the output of the causal temporal aggregation module CTAM.
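The token compression, causal attention, and last-step selection described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the patented module: it uses a single attention head, identity query/key/value projections in place of the learnable linear mappings, and omits the final linear mapping and spatial broadcast. The function names are hypothetical.

```python
import math

# Illustrative single-head sketch of causal temporal aggregation.
# Assumptions: identity Q/K/V projections and no output projection.

def global_average_pool(feature_map):
    """feature_map: list of channels, each a 2D grid. Returns one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def causal_attention(q, k, v):
    """Scaled dot-product attention where step i only sees steps 0..i."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Restricting j to range(i + 1) is equivalent to masking future scores.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[j] / total * v[j][c] for j in range(i + 1))
            for c in range(len(v[0]))
        ])
    return out


def trend_anchor(hidden_states):
    """hidden_states: one feature map per time step. Returns the last-step trend vector."""
    tokens = [global_average_pool(h) for h in hidden_states]
    aggregated = causal_attention(tokens, tokens, tokens)
    return aggregated[-1]
```

Because each step only attends to itself and earlier steps, changing a later frame cannot alter the aggregated vectors of earlier frames, which is the no-future-leakage property the causal mask is meant to guarantee.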
[0112] The output of the element-wise addition module passes sequentially through the decoder layer and the upsampling layer and is then connected to one input of the concatenation module. The output of the learnable discrete wavelet enhancement module (L-DWT) is connected to the other input of the concatenation module. The output of the concatenation module provides the consecutive frames of cloud images that follow the consecutive input frames after the preset target number of time steps.
[0113] In practical applications, the above design further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of a cloud map prediction method based on frequency domain enhancement and causal attention. Simultaneously, a computer-readable storage medium is designed, on which a computer program is stored. When the computer program is executed by the processor, it implements the steps of a cloud map prediction method based on frequency domain enhancement and causal attention.
[0114] The above design scheme was applied in practice to Central China. Central China lies in a monsoon climate transition zone of China, and its terrain is a mix of plains, hills, and mountains. The Jianghan Plain and the Dongting-Poyang Lake system provide abundant water vapor, and the region is also influenced by the westerlies, the East Asian monsoon, and the seasonal advance and retreat of the subtropical high. It is therefore one of the regions prone to severe weather events such as strong convection, torrential rain, and persistent precipitation. Conducting cloud image sequence prediction experiments in Central China is thus highly representative and has significant practical value.
[0115] Considering day-night consistency and the ability to characterize cloud-top radiation, the brightness temperature (TBB) of the infrared channel CH13 (10.8 μm) was selected as the main prediction variable. The raw observation data were stored as digital number (DN) values, and the brightness temperature field was obtained after radiometric calibration and geometric registration, while maintaining the full-disk projection and the regular row-column grid. To improve reading efficiency and facilitate training, the brightness temperature frames were sorted by timestamp and uniformly saved as .npy arrays. The FY-4B full-disk scan cycle is 15 minutes, so the time interval between adjacent frames is fixed at 15 minutes.
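One way to turn timestamp-sorted frames into training samples is sketched below. It is a minimal illustration under assumptions: the function name `build_samples`, the parameters `n_in`, `n_out`, and `gap_steps`, and the rule of discarding windows that contain a missing frame are all hypothetical and not taken from the patent.

```python
from datetime import timedelta

# Illustrative sketch: build (input, target) index pairs from frame timestamps.
# Assumption: a window is kept only if every adjacent pair of frames in it is
# exactly one scan interval (15 minutes) apart.

def build_samples(timestamps, n_in, n_out, gap_steps=1,
                  step=timedelta(minutes=15)):
    """Return a list of (input_indices, target_indices) over time-sorted frames.

    gap_steps is the number of time steps between the last input frame and
    the first target frame; gap_steps=1 means the targets follow immediately.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    ts = [timestamps[i] for i in order]
    span = n_in + gap_steps - 1 + n_out
    samples = []
    for start in range(len(ts) - span + 1):
        window = ts[start:start + span]
        if all(b - a == step for a, b in zip(window, window[1:])):
            first_target = start + n_in - 1 + gap_steps
            inputs = order[start:start + n_in]
            targets = order[first_target:first_target + n_out]
            samples.append((inputs, targets))
    return samples
```

The returned indices refer to positions in the original `timestamps` list, so they can be used to look up the corresponding saved arrays.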
[0116] In the network to be trained (WA-CLSTM) designed in this invention, the number of hidden channels in the shallow and bottleneck ConvLSTM layers are set to 64 and 128, respectively, and the temporal aggregation module adopts a 4-head causal multi-head self-attention mechanism. All models uniformly adopt a hybrid loss function composed of SSIM and L1, and are trained using the AdamW optimizer with an initial learning rate of 8×10⁻⁶. -4The learning rate is adaptively adjusted by combining ReduceLROnPlateau and CosineAnnealingWarmRestarts. The experiments were implemented on the PyTorch platform and training was completed on a single NVIDIA GPU.
[0117] To verify the effectiveness of the proposed satellite cloud image multi-step prediction algorithm (WA-CLSTM) based on frequency domain enhancement and causal attention, comparative and ablation experiments were conducted on the FY-4B satellite cloud image dataset. The algorithm is based on the dual-scale temporal backbone network SeqConv. A learnable discrete wavelet transform (L-DWT) is introduced at the shallow level to perform sub-band decomposition and residual back-injection of the latest time-series features. At the bottleneck scale, a causal temporal aggregation module (CTAM) is introduced to aggregate cross-temporal context and form trend anchors. At the decoding end, temporal aggregation and shallow enhancement features are fused to achieve same-scale reconstruction.
[0118] Based on this, a comparison was made with several representative cloud image prediction methods, including ConvLSTM, PredRNN++, SCSTque, SimVP, and F-CLSTM, and the results are as follows.
[0119] Table 1 shows a comparison of the performance metrics of different models. It can be seen that in the multi-step prediction task of FY-4B brightness temperature cloud maps in Central China, different models exhibit a relatively consistent performance ranking in terms of MSE, SSIM, and PSNR. The traditional ConvLSTM, as a classic spatiotemporal recursive baseline, does not perform ideally overall, indicating that relying solely on local recursive memory for modeling limits its ability to reconstruct the detailed structure of cloud systems and the overall radiation field, making the prediction results more prone to blurring and smoothing. PredRNN++ achieves significant improvements, demonstrating that its stronger spatiotemporal coupling modeling mechanism can more effectively utilize historical sequence information, thereby enhancing its ability to characterize cloud evolution. SCSTque and SimVP further improve their performance, indicating that after introducing a stronger spatiotemporal representation structure, the models are more stable in restoring the texture and overall morphology of cloud maps. Among them, F-CLSTM achieves superior overall results, reflecting its better expressive ability of cloud morphology in combining multi-scale features and temporal recursion. The proposed WA-CLSTM achieves optimal performance across all three metrics. Compared to the strongest baseline F-CLSTM, it further reduces MSE, and simultaneously improves SSIM and PSNR, indicating that this method not only has smaller pixel-level errors but also more closely approximates the real sequence in terms of structural similarity and overall reconstruction quality. 
Comparative experimental results show that, under a multi-step prediction setting with a 15-minute temporal resolution, WA-CLSTM can more effectively maintain the coherence of the main cloud structure and reduce detail degradation, thus demonstrating more stable and reliable prediction performance in quantitative evaluation.
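For reference, two of the three metrics named above can be computed as in the following sketch. It assumes frames normalized to the range [0, 1] (so `data_range=1.0`), which the text does not state, and it omits SSIM; the exact averaging used to produce the reported values is not specified, so this is illustrative only.

```python
import math

# Illustrative sketch of MSE and PSNR for a single predicted frame.
# Assumption: pixel values are normalized to [0, data_range].

def mse(pred, target):
    """Mean squared error between two 2D grids of equal size."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
            count += 1
    return total / count


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / err)
```

Lower MSE and higher PSNR both indicate smaller pixel-level error, while SSIM captures structural similarity that these two metrics do not measure.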
[0120] Table 1
[0121]
[0122] To verify the contribution of each structural modification and newly added module of WA-CLSTM to the prediction performance, ablation experiments were conducted on each module, and the results were evaluated using both quantitative indicators and visualizations. The experimental results show the contribution of each module and their synergistic effect. The quantitative results are shown in Table 2, and the visualization results are shown in Figure 6. Using the original three-scale F-CLSTM as a baseline, six configurations were constructed for comparison while keeping the ConvLSTM temporal recursion and the output head unchanged. The first configuration is the original F-CLSTM; the second is 2S-F-CLSTM, with the backbone compressed from three scales to two, used to separately evaluate the impact of downscaling and backbone lightweighting on accuracy; the third adds the learnable discrete wavelet transform (L-DWT) to the shallow layer of the two-scale backbone; the fourth adds the bottleneck causal temporal aggregation module (CTAM) to the two-scale backbone; the fifth enables both L-DWT and CTAM on the original three scales; and the sixth enables both L-DWT and CTAM on the two-scale backbone, forming the complete WA-CLSTM.
[0123] Table 2
[0124]
[0125] The ablation results in Table 2 quantitatively show that simply compressing the three-scale backbone into a two-scale (2S-F-CLSTM) does not bring any accuracy gains: compared to the three-scale baseline F-CLSTM, the MSE of 2S-F-CLSTM increased from 0.0123 to 0.0131 (+0.0008, approximately +6.5%), the SSIM decreased from 0.687 to 0.681 (−0.006), and the PSNR decreased from 21.57 dB to 20.65 dB (−0.92 dB). This indicates that while reducing one scale layer lowers computational cost, without a compensation mechanism, it weakens the model's ability to represent multi-scale cloud morphology and temporal dynamics, thus leading to a decline in reconstruction quality. Adding L-DWT (2S-F-CLSTM+L-DWT) to the two-scale backbone resulted in a significant performance improvement: compared to 2S-F-CLSTM, MSE decreased from 0.0131 to 0.0120 (−0.0011, approximately −8.4%), SSIM increased from 0.681 to 0.688 (+0.007), and PSNR increased from 20.65 dB to 21.81 dB (+1.16 dB); simultaneously, compared to the three-scale baseline F-CLSTM, this configuration essentially matched and slightly surpassed it. These results demonstrate that shallow frequency domain enhancement can directly compensate for high-frequency structures such as cloud edges and narrow cloud bands, alleviating the detail passivation problem caused by two-scale compression. Furthermore, when only the bottleneck causal time series aggregation module CTAM (2S-F-CLSTM+CTAM) is introduced, the performance improvement is more stable and larger: compared to 2S-F-CLSTM, MSE decreases from 0.0131 to 0.0115 (−0.0016, approximately −12.2%), SSIM increases from 0.681 to 0.699 (+0.018), and PSNR increases from 20.65dB to 22.45dB (+1.80dB); it also achieves a significant improvement compared to F-CLSTM. This indicates that in multi-step extrapolation, the aggregation of long-range time context and causal constraints make a more significant contribution to suppressing error accumulation and reducing morphological drift. 
Finally, the complete model 2S-F-CLSTM+L-DWT+CTAM (WA-CLSTM) achieves the best overall performance, with MSE = 0.0107, SSIM = 0.702, and PSNR = 22.59 dB. Compared to the three-scale baseline F-CLSTM, MSE decreased by 0.0016 (approximately −13.0%), SSIM increased by 0.015, and PSNR increased by 1.02 dB; compared to the two-scale backbone 2S-F-CLSTM, the improvements were larger (MSE −0.0024, approximately −18.3%; SSIM +0.021; PSNR +1.94 dB). Compared with adding only L-DWT or only CTAM, the complete model also further reduced MSE to 0.0107, increased SSIM to 0.702, and increased PSNR to 22.59 dB. This shows that the high-frequency detail compensation of L-DWT and the long-range temporal modeling of CTAM are complementary and can simultaneously enhance detail fidelity and global temporal consistency, thereby supporting the overall performance improvement of WA-CLSTM.
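The differences quoted in this paragraph can be rechecked directly from the reported metric values. The short sketch below does so; the dictionary `results` and the helper `deltas` are illustrative names, and the numbers are copied from the text above.

```python
# Recompute the ablation differences from the values reported in the text.
# Each entry is (MSE, SSIM, PSNR in dB).
results = {
    "F-CLSTM": (0.0123, 0.687, 21.57),
    "2S-F-CLSTM": (0.0131, 0.681, 20.65),
    "2S-F-CLSTM+L-DWT": (0.0120, 0.688, 21.81),
    "2S-F-CLSTM+CTAM": (0.0115, 0.699, 22.45),
    "WA-CLSTM": (0.0107, 0.702, 22.59),
}


def deltas(model, baseline):
    """Absolute and relative change of `model` with respect to `baseline`."""
    m, b = results[model], results[baseline]
    return {
        "mse_abs": round(m[0] - b[0], 4),
        "mse_pct": round(100.0 * (m[0] - b[0]) / b[0], 1),
        "ssim": round(m[1] - b[1], 3),
        "psnr_db": round(m[2] - b[2], 2),
    }


for name in ("2S-F-CLSTM", "2S-F-CLSTM+L-DWT", "2S-F-CLSTM+CTAM", "WA-CLSTM"):
    base = "F-CLSTM" if name in ("2S-F-CLSTM", "WA-CLSTM") else "2S-F-CLSTM"
    print(name, "vs", base, deltas(name, base))
```

The recomputed values agree with the figures stated in the text, for example an MSE reduction of about 13.0% and a PSNR gain of 1.02 dB for WA-CLSTM relative to F-CLSTM.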
[0126] Further, as the qualitative results in Figure 5 show, all models can roughly track the main distribution of the cloud system at the early prediction steps. However, as the prediction step increases, differences in the structural degradation caused by error accumulation gradually become apparent. In the real sequence T7–T10, taken as a reference, overall translation of the cloud cluster coexists with local deformation, there are obvious brightness temperature gradient changes near the cloud edges, and narrow cloud bands and thin cloud textures are continuously stretched and locally fractured as they evolve. These are also the areas where the models are most prone to smoothing or drift.
[0127] ConvLSTM tends to introduce granular noise at cloud edges, which appears as irregular spots and pseudo-textures near the boundary. This noise is amplified over time, so that cloud edges gradually become blunt at longer prediction steps, accompanied by local structural drift, and the connectivity of some cloud bands is inconsistent with the actual evolution. PredRNN++ maintains a relatively complete cloud outline, but detail attenuation is more pronounced at longer prediction steps: in the T9–T10 stage, cloud boundaries are excessively smoothed, local contrast decreases, and cloud edges and thin cloud layers are smeared, giving an overall blurring trend. F-CLSTM has relatively better overall stability, with smaller deviations in the position and scale of the main cloud body. However, detail washout still occurs in areas of rapid morphological change such as cloud merging and splitting: cloud edge gradients are weakened and local gray levels are gradually compressed, resulting in more homogeneous texture and reduced boundary clarity. SCSTque's predictions are generally smooth and suppress some noise, but at the cost of high-frequency detail: narrow cloud bands and thin cloud textures are more likely to be compressed into blocky structures in the middle and late steps, the continuity of cloud bands deteriorates, and local stretching is difficult to maintain. SimVP performs better in global shape tracking, maintaining the translation trend of the cloud body well with relatively natural boundary transitions, which alleviates structural drift and edge blunting to some extent. However, it still tends to smooth out narrow cloud bands and texture hierarchy, and local details gradually weaken at longer prediction steps. In contrast, WA-CLSTM maintains a more stable structure across the four prediction steps, with clearer gradient transitions and better continuity at cloud edges. Narrow cloud bands are not significantly smoothed out, and there are fewer spikes, breaks, or grainy artifacts at the boundaries.
[0128] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A cloud map prediction method based on frequency domain enhancement and causal attention, characterized in that steps A through C are performed to obtain a cloud map prediction model, which is then used to perform cloud map prediction. Step A. For each of at least one preset target-direction sky region, cloud images of that sky region are collected at a preset time step. From the cloud images of the same sky region, arranged in acquisition order, consecutive frames are taken to construct a single sample, in which the leading consecutive frames and the subsequent consecutive frames are separated by a preset target number of time steps. The samples obtained in this way form a sample set, and the method then proceeds to step B. Step B. Based on the dual-scale SeqConv spatiotemporal backbone network, a learnable discrete wavelet enhancement module L-DWT is introduced in the shallow layer, and a causal temporal aggregation module CTAM is introduced in the bottleneck layer to construct the network to be trained, and the method then proceeds to step C. Step C. Based on the sample set, the network to be trained is trained with the leading consecutive frames of each sample as input and the subsequent consecutive frames as output, to obtain the cloud map prediction model.
2. The cloud map prediction method based on frequency domain enhancement and causal attention according to claim 1, characterized in that: the network to be trained includes a shallow layer, a pooling layer, a bottleneck layer, a decoder layer, an upsampling layer, and a concatenation module. The shallow layer includes the learnable discrete wavelet enhancement module L-DWT and a set of sub-shallow branches, one for each frame of the consecutive input cloud images. Each sub-shallow branch consists of two first sequential convolutional feature extraction modules (SeqConv1) and a first convolutional long short-term memory module (ConvLSTM1) connected in series from its input to its output. The input of the first SeqConv1 module forms the input of the sub-shallow branch, and the output of the ConvLSTM1 module forms the output of the sub-shallow branch. The inputs of the sub-shallow branches receive the frames of the consecutive input cloud images one-to-one. Following the frame order of the corresponding cloud images, the output of each sub-shallow branch except the last is connected to the input of the ConvLSTM1 module in the next adjacent sub-shallow branch. The output of the last sub-shallow branch is connected to the input of the L-DWT, which outputs the enhanced shallow hidden state features. At the same time, the outputs of the sub-shallow branches are processed by the pooling layer to form the same number of sequential output paths. The bottleneck layer includes the causal temporal aggregation module CTAM, an element-wise addition module, and a set of sub-bottleneck branches. Each sub-bottleneck branch consists of two second sequential convolutional feature extraction modules (SeqConv2) and a second convolutional long short-term memory module (ConvLSTM2) connected in series from its input to its output. The input of the first SeqConv2 module forms the input of the sub-bottleneck branch, and the output of the ConvLSTM2 module forms the output of the sub-bottleneck branch. The inputs of the sub-bottleneck branches are connected one-to-one to the sequential output paths of the pooling layer. Following the order of the corresponding pooling output paths, the output of each sub-bottleneck branch except the last is connected to the input of the ConvLSTM2 module in the next adjacent sub-bottleneck branch. The output of the last sub-bottleneck branch is connected to one input of the element-wise addition module. At the same time, the outputs of all sub-bottleneck branches are connected to the input of the CTAM, whose output is connected to the other input of the element-wise addition module, which outputs the enhanced bottleneck hidden state features. The output of the element-wise addition module passes sequentially through the decoder layer and the upsampling layer and is then connected to one input of the concatenation module. The output of the L-DWT is connected to the other input of the concatenation module, and the output of the concatenation module provides the consecutive frames of cloud images that follow the consecutive input frames after the preset target number of time steps.
3. The cloud image prediction method based on frequency domain enhancement and causal attention according to claim 2, characterized in that the first sequential convolutional feature extraction module SeqConv1 in each sub-shallow branch comprises a convolution module, a normalization layer, and a ReLU activation layer connected in series from input to output; the input of the convolution module constitutes the input of SeqConv1, and the output of the ReLU activation layer constitutes the output of SeqConv1.
Each sub-shallow branch receives its corresponding cloud image among the consecutive input frames, and its two SeqConv1 modules process the received cloud image in sequence as follows:
$$F_t^{s} = \mathcal{F}_{\mathrm{SeqConv1}}(X_t) = \mathrm{ReLU}\left(\mathrm{Norm}\left(W_2^{s} * \mathrm{ReLU}\left(\mathrm{Norm}\left(W_1^{s} * X_t + b_1^{s}\right)\right) + b_2^{s}\right)\right);$$
to obtain the shallow features $F_t^{s}$ corresponding to the cloud image $X_t$, which are output to the first convolutional long short-term memory module ConvLSTM1, where $t = 1, \dots, T$; $X_t$ denotes the $t$-th frame of the $T$ consecutive input frames of cloud images; $F_t^{s}$ denotes the shallow features corresponding to $X_t$; $W_1^{s}$ and $W_2^{s}$ denote the convolution kernel weights of the convolution modules in the two SeqConv1 modules of the sub-shallow branch; $b_1^{s}$ and $b_2^{s}$ denote the bias terms of those convolution modules; $*$ denotes the two-dimensional convolution operation; $\mathrm{Norm}$ denotes the normalization function; $\mathrm{ReLU}$ denotes the ReLU function in the sub-shallow branch; and $\mathcal{F}_{\mathrm{SeqConv1}}$ denotes the processing function of the first sequential convolutional feature extraction module SeqConv1.
Further, ConvLSTM1 processes the received shallow features $F_t^{s}$ together with the shallow hidden state features $H_{t-1}^{s}$ and the shallow memory cell features $C_{t-1}^{s}$ output by ConvLSTM1 in the preceding adjacent sub-shallow branch as follows:
$$\left(H_t^{s}, C_t^{s}\right) = \mathcal{F}_{\mathrm{ConvLSTM1}}\left(F_t^{s}, H_{t-1}^{s}, C_{t-1}^{s}\right);$$
to obtain and output the shallow hidden state features $H_t^{s}$ and the shallow memory cell features $C_t^{s}$, where $H_t^{s}, C_t^{s} \in \mathbb{R}^{h_s \times w_s \times c_s}$; $H_t^{s}$ denotes the shallow hidden state features corresponding to $X_t$; $C_t^{s}$ denotes the shallow memory cell features corresponding to $X_t$; $H_{t-1}^{s}$ denotes the shallow hidden state features corresponding to $X_{t-1}$; $C_{t-1}^{s}$ denotes the shallow memory cell features corresponding to $X_{t-1}$; $h_s$ denotes the height of the shallow feature map; $w_s$ denotes the width of the shallow feature map; $c_s$ denotes the number of channels of the shallow feature map; and $\mathcal{F}_{\mathrm{ConvLSTM1}}$ denotes the processing function of the first convolutional long short-term memory module ConvLSTM1.
The structure of the second sequential convolutional feature extraction module SeqConv2 in each sub-bottleneck branch is the same as that of SeqConv1, and its processing is defined by the following formula:
$$F_t^{b} = \mathcal{F}_{\mathrm{SeqConv2}}\left(\mathrm{Pool}\left(H_t^{s}\right)\right);$$
that is, each sub-shallow branch outputs its hidden state features $H_t^{s}$ to the pooling layer, and after spatial downsampling by pooling the result is sent to the corresponding sub-bottleneck branch, where it is first processed by SeqConv2 to obtain the bottleneck scale features $F_t^{b}$. The second convolutional long short-term memory module ConvLSTM2 then processes the received bottleneck scale features $F_t^{b}$ together with the bottleneck hidden state features $H_{t-1}^{b}$ and the bottleneck memory cell features $C_{t-1}^{b}$ output by ConvLSTM2 in the preceding adjacent sub-bottleneck branch according to the following formula:
$$\left(H_t^{b}, C_t^{b}\right) = \mathcal{F}_{\mathrm{ConvLSTM2}}\left(F_t^{b}, H_{t-1}^{b}, C_{t-1}^{b}\right);$$
to obtain and output the bottleneck hidden state features $H_t^{b}$ and the bottleneck memory cell features $C_t^{b}$, where $F_t^{b}$ denotes the bottleneck scale features corresponding to $X_t$; $\mathrm{Pool}$ denotes the pooling function of the pooling layer POOLING with its pooling window and a stride of 2; $H_t^{b}, C_t^{b} \in \mathbb{R}^{\frac{h_s}{2} \times \frac{w_s}{2} \times c_b}$; $H_t^{b}$ denotes the bottleneck hidden state features corresponding to $X_t$; $C_t^{b}$ denotes the bottleneck memory cell features corresponding to $X_t$; $H_{t-1}^{b}$ denotes the bottleneck hidden state features corresponding to $X_{t-1}$; $C_{t-1}^{b}$ denotes the bottleneck memory cell features corresponding to $X_{t-1}$; $c_b$ denotes the number of channels of the bottleneck feature map; and $\mathcal{F}_{\mathrm{ConvLSTM2}}$ denotes the processing function of the second convolutional long short-term memory module ConvLSTM2.
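The convolution, normalization, and ReLU chain of a SeqConv module and the stride-2 pooling between the two scales can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions: the claims do not fix the kernel size, the type of normalization, or the type of pooling, so the sketch assumes an odd square kernel with zero padding, a per-channel normalization over the spatial dimensions, and 2×2 max pooling. All function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Stride-1, zero-padded 2D convolution (cross-correlation form).

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k; b: (C_out,).
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.empty((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]               # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3) + b
    return out

def channel_norm(x, eps=1e-5):
    """Assumed normalization: zero mean, unit variance per channel."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def seq_conv(x, w, b):
    """One SeqConv-style block: convolution, normalization, ReLU."""
    return np.maximum(channel_norm(conv2d_same(x, w, b)), 0.0)

def pool2x2(x):
    """Assumed pooling: 2x2 max pooling with stride 2 on (C, H, W)."""
    c, h, w = x.shape
    x = x[:, :h - h % 2, :w - w % 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))
```

Applying `seq_conv` twice to a frame gives features at the shallow scale. In the claimed network the pooling acts on the ConvLSTM hidden state rather than directly on these features; the recurrent cell is omitted here for brevity.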
4. The cloud image prediction method based on frequency domain enhancement and causal attention according to claim 3, characterized in that the learnable discrete wavelet enhancement module L-DWT comprises a depthwise separable convolution module, a concatenation module, an element-wise addition module, and two branches of identical structure. The input of the depthwise separable convolution module constitutes the input of L-DWT. Each branch comprises an upsampling module and a convolutional layer connected in series from its input to its output; the input of the upsampling module constitutes the input of the branch, and the output of the convolutional layer constitutes the output of the branch. The output of the depthwise separable convolution module is connected to the inputs of the two branches, and the outputs of the two branches are connected to the two inputs of the concatenation module. The output of the concatenation module is connected to one input of the element-wise addition module, the other input of the element-wise addition module is connected to the input of L-DWT, and the output of the element-wise addition module constitutes the output of L-DWT.
For the received shallow hidden state features $H_T^{s}$ output by the last sub-shallow branch, L-DWT executes as follows.
First, the depthwise separable convolution module operates according to the following formula:
$$Y_L = \mathcal{D}_L\left(H_T^{s}\right), \quad Y_H = \mathcal{D}_H\left(H_T^{s}\right);$$
that is, a downsampling transform with a stride of 2 is first applied to $H_T^{s}$, followed by a pointwise convolution that completes the linear combination of channels, thereby obtaining the learnable low-frequency component $Y_L$ and high-frequency component $Y_H$, where $\mathcal{D}_L$ and $\mathcal{D}_H$ denote the learnable low-frequency band decomposition operator and high-frequency band decomposition operator, respectively.
Next, the low-frequency component $Y_L$ and the high-frequency component $Y_H$ are processed separately according to the following formula:
$$E_L = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Up}\left(Y_L\right)\right)\right), \quad E_H = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Up}\left(Y_H\right)\right)\right);$$
that is, bilinear interpolation upsampling is performed first, and channel alignment is then completed through a 1×1 convolution, to obtain the low-frequency enhancement term $E_L$ and the high-frequency enhancement term $E_H$ with the same spatial size as $H_T^{s}$, where $\mathrm{Up}$ denotes the bilinear interpolation upsampling function, $\sigma$ denotes a nonlinear activation function, and $\mathrm{Conv}_{1\times 1}$ denotes the 1×1 convolution function.
Then global scalar gating coefficients $\alpha$ and $\beta$ are introduced, and the enhancement terms are combined with the shallow hidden state features $H_T^{s}$ output by the last sub-shallow branch according to the following formula:
$$\tilde{H}^{s} = H_T^{s} + \mathrm{Concat}\left(\alpha E_L, \beta E_H\right);$$
to obtain the enhanced shallow hidden state features $\tilde{H}^{s}$, where ...
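The data flow of L-DWT (stride-2 depthwise filtering, pointwise channel mixing, bilinear upsampling, 1×1 channel alignment, gated residual addition) can be sketched in NumPy as below. Several choices are assumptions that the claim does not state: the depthwise filters are 2×2 and initialized from Haar low-pass and diagonal high-pass kernels, the nonlinear activation is `tanh`, each branch outputs half of the channels so that the concatenation matches the input, and the upsampling uses the half-pixel sampling convention. All names are hypothetical.

```python
import numpy as np

def haar_init(c):
    """Assumed starting point for the learnable band filters, shape (C, 2, 2)."""
    lo = np.tile(np.array([[0.5, 0.5], [0.5, 0.5]]), (c, 1, 1))
    hi = np.tile(np.array([[0.5, -0.5], [-0.5, 0.5]]), (c, 1, 1))
    return lo, hi

def depthwise_stride2(x, k):
    """Per-channel 2x2 filter with stride 2. x: (C, H, W), H and W even."""
    c, h, w = x.shape
    blocks = x.reshape(c, h // 2, 2, w // 2, 2)
    return np.einsum("cipjq,cpq->cij", blocks, k)

def pointwise(x, p):
    """1x1 convolution: channel mixing with p of shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", p, x)

def upsample2x_bilinear(x):
    """Bilinear x2 upsampling of (C, H, W) with half-pixel sample centres."""
    c, h, w = x.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    rows = x[:, y0, :] * (1 - wy) + x[:, y1, :] * wy
    return rows[:, :, x0] * (1 - wx) + rows[:, :, x1] * wx

def ldwt(h_last, k_lo, k_hi, p_lo, p_hi, m_lo, m_hi, alpha, beta):
    """Gated frequency-band enhancement added back onto the input features."""
    y_lo = pointwise(depthwise_stride2(h_last, k_lo), p_lo)   # low band, half size
    y_hi = pointwise(depthwise_stride2(h_last, k_hi), p_hi)   # high band, half size
    e_lo = np.tanh(pointwise(upsample2x_bilinear(y_lo), m_lo))
    e_hi = np.tanh(pointwise(upsample2x_bilinear(y_hi), m_hi))
    return h_last + np.concatenate([alpha * e_lo, beta * e_hi], axis=0)
```

With both gates at zero the module reduces to the identity, which is one reason a residual form with scalar gates is convenient: the enhancement can be phased in during training without disturbing the backbone features at initialization.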
5. The cloud image prediction method based on frequency domain enhancement and causal attention according to claim 3, characterized in that the causal temporal aggregation module CTAM comprises, connected in series from input to output, a global average pooling module GAP, a first linear mapping module, a causal multi-head attention module, a concatenation module, a second linear mapping module, and a broadcast alignment module. The input of GAP constitutes the input of CTAM, and the output of the broadcast alignment module constitutes the output of CTAM.
For the bottleneck hidden state features $H_t^{b}$ output by the sub-bottleneck branches, CTAM executes as follows.
First, GAP processes the bottleneck hidden state features $H_t^{b}$ according to the following formula:
$$z_t = \mathrm{GAP}\left(H_t^{b}\right);$$
compressing each two-dimensional feature map into a temporal token $z_t$ and stacking the tokens as $Z = [z_1; z_2; \dots; z_T]$, where $\mathrm{GAP}$ denotes the global average pooling function, $z_t$ denotes the global pooled feature vector corresponding to $X_t$, and $Z$ denotes the global pooled feature vector matrix corresponding to the bottleneck hidden state features.
Next, the first linear mapping module applies a linear mapping to the global pooled feature vector matrix $Z$ as follows:
$$Q = Z W_Q, \quad K = Z W_K, \quad V = Z W_V;$$
to obtain the corresponding query $Q$, key $K$, and value $V$, where $W_Q$, $W_K$, and $W_V$ denote learnable parameters.
Then the causal multi-head attention module uses the upper triangular causal mask $M$, where $M_{ij} = -\infty$ when $j > i$ and $M_{ij} = 0$ otherwise, according to the following formula:
$$\mathrm{head}_k = \mathrm{softmax}\left(\frac{Q_k K_k^{\top}}{\sqrt{d}} + M\right) V_k;$$
to obtain the aggregation result of the $k$-th attention head, where $d$ denotes the feature dimension of a single attention head; $Q_k$, $K_k$, and $V_k$ denote the query matrix, key matrix, and value matrix corresponding to the $k$-th attention head, respectively; $\top$ denotes transposition; $\mathrm{softmax}$ denotes the normalization function; and $\mathrm{head}_k$ denotes the aggregation result of the $k$-th attention head under the causal constraint.
The concatenation module then concatenates the aggregation results of the attention heads to obtain the trend aggregation vector sequence $U$, expressed as:
$$U = \mathrm{Concat}\left(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_{N_h}\right) \in \mathbb{R}^{T \times D_2};$$
where $N_h$ denotes the number of attention heads, $\mathrm{Concat}$ denotes concatenation along the feature dimension, and $D_2$ denotes the feature dimension after concatenation.
Finally, the trend aggregation vector $u_T$ at the last time step of the trend aggregation vector sequence $U$ is selected as the long-term trend representation. It is passed through the second linear mapping module and the broadcast alignment module, first undergoing a channel linear mapping and then broadcast alignment over the spatial dimensions, to obtain a trend feature map of the same size as the bottleneck hidden state features, which constitutes the output of CTAM.
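The CTAM computation (global average pooling to temporal tokens, causally masked multi-head attention, selection of the last token, channel mapping, spatial broadcast) can be sketched in NumPy as follows. This is a minimal illustration rather than the claimed implementation. The function names, the output projection `w_o`, and the assumption that the head dimension is the concatenated dimension divided by the number of heads are all introduced here for the example.

```python
import numpy as np

def causal_multihead(z, w_q, w_k, w_v, n_heads):
    """Causally masked multi-head self-attention over T temporal tokens.

    z: (T, C) tokens; w_q, w_k, w_v: (C, D2). Returns (T, D2).
    """
    t = z.shape[0]
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    assert q.shape[1] % n_heads == 0, "D2 must be divisible by n_heads"
    d = q.shape[1] // n_heads                      # per-head feature dimension
    idx = np.arange(t)
    # -inf strictly above the diagonal: token i may only attend to tokens j <= i
    mask = np.where(idx[None, :] > idx[:, None], -np.inf, 0.0)
    heads = []
    for h in range(n_heads):
        sl = slice(h * d, (h + 1) * d)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d) + mask
        scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)
        heads.append(weights @ v[:, sl])
    return np.concatenate(heads, axis=1)

def ctam(hidden, w_q, w_k, w_v, w_o, n_heads):
    """hidden: (T, C, H, W) bottleneck hidden states. Returns a (C, H, W) map."""
    _, c, h, w = hidden.shape
    z = hidden.mean(axis=(2, 3))                   # global average pooling -> (T, C)
    u = causal_multihead(z, w_q, w_k, w_v, n_heads)
    trend = u[-1] @ w_o                            # last-step trend vector -> (C,)
    return np.broadcast_to(trend[:, None, None], (c, h, w))
```

Because the mask blocks attention to later tokens, the aggregated vector at each time step depends only on that step and earlier ones. The last token therefore summarizes the whole input sequence, which is why it is the one selected as the trend representation.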
6. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 5 are implemented.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 5 are implemented.