Real-world video super-resolution methods, systems, devices, and media
By combining a dual-axis spatiotemporal attention mechanism and rotational position coding, the problems of information loss and high computational cost in real-world video super-resolution are solved, achieving high-quality and low-cost video reconstruction results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
- Filing Date
- 2025-02-21
- Publication Date
- 2026-06-23
AI Technical Summary
Existing video super-resolution methods perform poorly in real-world applications, especially in low-level vision tasks where they are prone to information loss. Local attention limits the receptive field and scalability, and optical flow alignment leads to accumulated errors, impacting performance.
A dual-axis spatiotemporal attention mechanism is adopted, which processes video features through vertical-temporal and horizontal-temporal attention blocks. Combined with rotational position encoding and two-dimensional convolution, it realizes the synchronous utilization of spatial and temporal information. A pre-training-fine-tuning strategy is adopted to adapt to different degradation scenarios.
It improves video reconstruction quality and robustness, reduces computational costs, and generates high-quality video output in complex real-world scenarios.
Figure CN120339063B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of video image processing, and in particular to a real-world video super-resolution method, system, device, and medium.
Background Technology
[0002] Real-world video super-resolution (VSR) tasks aim to recover high-resolution results from low-quality, degraded video inputs that may contain noise, blur, and compression artifacts. However, directly applying the ViViT architecture to real-world VSR tasks yields poor results. Unlike high-level vision tasks, low-level vision tasks require pixel-level accuracy and consistency of detail, making them more susceptible to information loss. The compression patching, tokenization, and sequential spatiotemporal attention processes in ViViT can lead to significant detail loss, weakening the model's ability to generate high-quality reconstructions.
[0003] To address these issues, ViT-based image and video super-resolution models introduce specific adjustments, such as smaller patch sizes, CNN-based pixel refinement, and local attention mechanisms, combined with optical flow-based feature alignment and fusion modules. By using recurrent structures and auxiliary modules, the CNN-based RealBasicVSR and the transformer-based RealViformer achieve significantly better results than ViViT-VSR. However, these approaches still face limitations: local attention restricts the receptive field and scalability, while reliance on optical flow for alignment often leads to accumulated errors over long sequences or under realistic degradation conditions, significantly impacting performance.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of the prior art and provide a real-world video super-resolution method, system, device, and medium that offer high video reconstruction quality, high robustness, and low cost.
[0005] The objective of this invention can be achieved through the following technical solutions:
[0006] According to a first aspect of the present invention, a real-world video super-resolution method is provided, comprising:
[0007] Input the original video sequence;
[0008] Video embedding: Spatial features of each frame are extracted from the original video sequence to obtain the first feature. The first feature is then input into the dual-axis spatiotemporal attention mechanism module to obtain the second feature. The dual-axis spatiotemporal attention mechanism module includes a vertical-temporal attention block and a horizontal-temporal attention block. The feature blocks generated by partitioning the first feature are converted into token sequences through rotational position encoding. The token sequences are rearranged and then fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics.
[0009] Spatiotemporal reconstruction: Based on the second feature and the original video sequence, temporal attention is used to integrate temporal information to reconstruct and generate video output with higher spatiotemporal quality.
[0010] Preferably, the step of extracting spatial features from each frame of the original video sequence specifically involves: using two-dimensional convolution to extract spatial features from each frame of the original video sequence, and using a temporal attention mechanism to enhance temporal continuity.
[0011] Preferably, the feature blocks are generated by partitioning the first feature into 1×1 blocks along the time, height, and width dimensions.
[0012] Preferably, the rotational position encoding is achieved by multiplication by an offset in complex vector space. For the u-th query vector $q_u$ and the v-th key vector $k_v$, the rotational position encoding is expressed as:
[0013] $f_q(q_u, u) = e^{iu\Theta} q_u$,
[0014] $f_k(k_v, v) = e^{iv\Theta} k_v$,
[0015] In the formula, $\Theta$ is a diagonal matrix with elements $\theta_d = b^{-2d/|D|}$, where $b$ is the rotation base and $D$ is the embedding dimension of the attention block. The score $A_{u,v}$ of the attention block is the real part of the inner product of the two complex vectors $f_q$ and $f_k$.
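As a minimal sketch of the encoding above, the following Python snippet applies the rotation component-wise (Θ is diagonal) and computes the score as the real part of the complex inner product. The helper names `rope_angles`, `rotate`, and `score`, the toy vectors, and the pairing of D real dimensions into D/2 complex components are assumptions made for illustration; the base b = 10000 is the value stated in the embodiment. Since the query and the key are rotated with the same Θ, the score depends only on the offset u − v, which the last lines check numerically.

```python
import cmath


def rope_angles(num_complex, base=10000.0):
    """theta_d = base ** (-2d / D), assuming D = 2 * num_complex real dimensions."""
    dim = 2 * num_complex
    return [base ** (-2.0 * d / dim) for d in range(num_complex)]


def rotate(vec, pos, thetas):
    """Apply f(x, pos) = e^{i * pos * Theta} x, one complex component at a time."""
    return [x * cmath.exp(1j * pos * t) for x, t in zip(vec, thetas)]


def score(q, u, k, v, thetas):
    """Real part of the complex inner product <f_q(q, u), f_k(k, v)>."""
    fq = rotate(q, u, thetas)
    fk = rotate(k, v, thetas)
    return sum(a * b.conjugate() for a, b in zip(fq, fk)).real


thetas = rope_angles(2)
q = [1 + 2j, 0.5 - 1j]        # toy query, hypothetical values
k = [0.3 + 0.7j, -1 + 0.2j]   # toy key, hypothetical values

s1 = score(q, 5, k, 2, thetas)    # positions (5, 2), offset 3
s2 = score(q, 13, k, 10, thetas)  # positions (13, 10), same offset 3
print(abs(s1 - s2) < 1e-9)        # scores agree up to rounding
```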
[0016] Preferably, in the dual-axis spatiotemporal attention mechanism module, the token sequence is rearranged and then fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to simulate spatial texture and motion characteristics, specifically including:
[0017] First, token embedding is performed along the height-time dimension: the token dimensions are rearranged from $[B\ D\ n_N\ n_H\ n_W]$ to $[(B\ D\ n_W)\ n_H\ n_N]$, where $B$ is the batch size, $D$ is the embedding dimension of the attention block, and $n_N$, $n_H$, $n_W$ are the numbers of blocks along the time, height, and width directions. The rearranged tokens are fed into the Vertical-Temporal Attention Block (VTAB), which computes self-attention on the vertical-temporal plane and models spatial texture and vertical motion.
[0018] Then, token embedding is performed along the width-time dimension: the token dimensions are rearranged from $[(B\ D\ n_W)\ n_H\ n_N]$ to $[(B\ D\ n_H)\ n_W\ n_N]$. The rearranged tokens are fed into the Horizontal-Temporal Attention Block (HTAB), which computes self-attention on the horizontal-temporal plane and models spatial texture and horizontal motion.
[0019] Preferably, the step of integrating temporal information using temporal attention based on the second feature and the original video sequence to reconstruct and generate a video output with higher spatiotemporal quality specifically includes:
[0020] Integrate temporal information using temporal attention;
[0021] Perform frame-by-frame reconstruction using 2D convolution and pixel shuffle operations to generate video output with higher spatiotemporal quality.
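The pixel rearrangement used in frame-by-frame reconstruction (commonly called pixel shuffle) can be sketched in NumPy as follows. This is an illustrative sketch, not the patented implementation: the function name `pixel_shuffle` is hypothetical, and the channel layout follows the convention of PyTorch's `PixelShuffle`, which is an assumption about how the convolution output is arranged.

```python
import numpy as np


def pixel_shuffle(x, s):
    """Rearrange [C*s*s, H, W] into [C, H*s, W*s].

    Each group of s*s channels is spread over an s-by-s spatial neighbourhood,
    so a 2D convolution that outputs C*s*s channels yields an s-times larger frame.
    """
    cs2, height, width = x.shape
    assert cs2 % (s * s) == 0, "channel count must be divisible by s*s"
    c = cs2 // (s * s)
    x = x.reshape(c, s, s, height, width)
    x = x.transpose(0, 3, 1, 4, 2)  # -> [C, H, s, W, s]
    return x.reshape(c, height * s, width * s)


frame = np.arange(16 * 2 * 3).reshape(16, 2, 3)  # toy feature map, 16 = 1 * 4 * 4
upscaled = pixel_shuffle(frame, 4)               # magnification ratio 4
print(upscaled.shape)
```

With a magnification ratio of 4, the convolution preceding the shuffle has to produce 16 times as many channels as the output frame.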
[0022] Preferably, the dual-axis spatiotemporal attention mechanism module is trained using a pre-training-fine-tuning strategy.
[0023] According to a second aspect of the present invention, a system employing the aforementioned real-world video super-resolution method is provided, comprising:
[0024] The input module is used to input the raw video sequence;
[0025] The video embedding module is used to extract the spatial features of each frame from the original video sequence to obtain the first feature, and to input the first feature into the dual-axis spatiotemporal attention mechanism module to obtain the second feature; wherein the dual-axis spatiotemporal attention mechanism module includes a vertical-temporal attention block and a horizontal-temporal attention block. The feature blocks generated by partitioning the first feature are converted into token sequences through rotational position encoding. The token sequences are rearranged and then fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics.
[0026] The spatiotemporal reconstruction module is used to integrate temporal information with temporal attention, based on the second feature and the original video sequence, to reconstruct and generate video output with higher spatiotemporal quality.
[0027] According to a third aspect of the present invention, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the program to implement any of the methods described above.
[0028] According to a fourth aspect of the invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements any of the methods described herein.
[0029] Compared with the prior art, the present invention has the following beneficial effects:
[0030] (1) High video reconstruction quality: The dual-axis spatiotemporal attention mechanism module constructed in this invention embeds video along both the height-time and width-time dimensions, and applies vertical-temporal attention blocks and horizontal-temporal attention blocks to use spatial features and motion characteristics together, enhancing the ability to capture details and dynamic changes and achieving more accurate, higher-quality video reconstruction results.
[0031] (2) Higher robustness: This invention applies rotational position encoding to the features extracted from the original video sequence, which further enhances the ability to capture long-range dependencies, ensures spatial and temporal consistency, and enables high-quality video super-resolution output in complex real-world degradation scenarios; at the same time, the pre-training-fine-tuning strategy allows high-quality results to be output stably for different types of low-quality video inputs.
[0032] (3) Low cost and high quality video super-resolution output: In the spatiotemporal reconstruction process, the present invention improves the traditional frame-by-frame processing method, so that spatial and temporal information are integrated at the same time during the reconstruction process. Compared with three-dimensional convolution, using two-dimensional convolution for reconstruction not only reduces the computational cost, but also has almost no performance loss. That is, while maintaining low computational overhead, it generates video output with higher spatiotemporal quality.
Attached Figure Description
[0033] Figure 1 is a schematic diagram of the real-world video super-resolution method of the present invention;
[0034] Figure 2 is a schematic diagram of the dual-axis spatiotemporal attention mechanism;
[0035] Figure 3 illustrates the structural differences between the model of the present invention and existing models.
Detailed Implementation
[0036] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0037] Example
[0038] As shown in Figure 1, this embodiment constructs a dual-axis spatiotemporal transformer model for real-world video super-resolution tasks, aiming to improve the resolution quality of videos. The input low-resolution video is converted into a high-resolution output through a series of processing steps. The specific implementation steps of the real-world video super-resolution method include:
[0039] S1. Input the original video sequence. The original low-resolution video sequence consists of frames indexed by i, where B is the batch size, C is the number of channels, N is the total number of frames, and H and W are the height and width of each frame, respectively.
[0040] S2, Video Embedding: Extract the spatial features of each frame from the original video sequence to obtain the first feature, and input the first feature into the dual-axis spatiotemporal attention mechanism module to obtain the second feature.
[0041] In this embodiment, the dual-axis spatiotemporal attention mechanism module includes a vertical-temporal attention block and a horizontal-temporal attention block. The feature blocks generated by partitioning the first feature are converted into token sequences through rotational position encoding. These token sequences are then rearranged and fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics. The specific process includes:
[0042] S21. Apply two-dimensional convolution to each frame to capture its spatial features and obtain shallow features, where d denotes the dimension of the shallow features.
[0043] S22. Perform a 1×1 partitioning operation on the shallow features, with temporal continuity enhanced through a temporal attention mechanism, to divide the shallow features into blocks, where $n_N$, $n_H$, and $n_W$ are the numbers of blocks along the time, height, and width directions, and n, h, and w are the block sizes in the corresponding dimensions.
[0044] S23. Embed the tokens along the height-time dimension: the token dimensions are rearranged from $[B\ D\ n_N\ n_H\ n_W]$ to $[(B\ D\ n_W)\ n_H\ n_N]$, where $B$ is the batch size, $D$ is the embedding dimension of the attention block, and $n_N$, $n_H$, $n_W$ are the numbers of blocks along the time, height, and width directions. The rearranged tokens are fed into the Vertical-Temporal Attention Block (VTAB), which computes self-attention on the vertical-temporal plane and models spatial texture and vertical motion.
[0045] Specifically, Rotational Position Encoding (RoPE) is used to embed the partitioned feature blocks into tokens, where D is the embedding dimension of the attention block. For the u-th query vector $q_u$ and the v-th key vector $k_v$, RoPE is applied as a multiplication by an offset in complex vector space: $f_q(q_u, u) = e^{iu\Theta} q_u$ and $f_k(k_v, v) = e^{iv\Theta} k_v$, where $\Theta$ is a diagonal matrix with elements $\theta_d = b^{-2d/|D|}$ and the rotation base is $b = 10000$. The attention score is the real part of the inner product of the two complex vectors: $A_{u,v} = \mathrm{Re}\langle f_q(q_u, u), f_k(k_v, v)\rangle$. The advantage of RoPE lies in its better handling of long-range dependencies and its preservation of spatial and temporal consistency, which is crucial for video super-resolution tasks.
[0046] S24. Embed the tokens along the width-time dimension: the token dimensions are rearranged from $[(B\ D\ n_W)\ n_H\ n_N]$ to $[(B\ D\ n_H)\ n_W\ n_N]$. The rearranged tokens are fed into the Horizontal-Temporal Attention Block (HTAB), which computes self-attention on the horizontal-temporal plane and models spatial texture and horizontal motion.
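The partitioning and rearrangement in steps S22 to S24 can be sketched with NumPy reshapes. This is a hedged illustration rather than the embodiment's code: the function names are hypothetical, flattening each block into the channel axis is an assumed ViT-style choice, and the sketch keeps the embedding dimension D as a trailing channel axis so that each column (or row) of blocks becomes one token sequence over the height-time (or width-time) plane. The grouping written in the text lists the same axes without saying which one carries the channels.

```python
import numpy as np


def partition(feat, n, h, w):
    """Split [B, d, N, H, W] into blocks of size (n, h, w).

    Returns [B, d*n*h*w, nN, nH, nW]; with n = h = w = 1 the input is unchanged.
    """
    B, d, N, H, W = feat.shape
    assert N % n == 0 and H % h == 0 and W % w == 0
    nN, nH, nW = N // n, H // h, W // w
    x = feat.reshape(B, d, nN, n, nH, h, nW, w)
    x = x.transpose(0, 1, 3, 5, 7, 2, 4, 6)  # -> [B, d, n, h, w, nN, nH, nW]
    return x.reshape(B, d * n * h * w, nN, nH, nW)


def to_vertical_temporal(tokens):
    """[B, D, nN, nH, nW] -> [(B*nW), (nH*nN), D]: one sequence per column."""
    B, D, nN, nH, nW = tokens.shape
    return tokens.transpose(0, 4, 3, 2, 1).reshape(B * nW, nH * nN, D)


def to_horizontal_temporal(tokens):
    """[B, D, nN, nH, nW] -> [(B*nH), (nW*nN), D]: one sequence per row."""
    B, D, nN, nH, nW = tokens.shape
    return tokens.transpose(0, 3, 4, 2, 1).reshape(B * nH, nW * nN, D)


shallow = np.zeros((1, 8, 4, 6, 6))          # toy shallow features [B, d, N, H, W]
tokens = partition(shallow, 1, 1, 1)         # 1x1 partitioning keeps every position
print(to_vertical_temporal(tokens).shape)    # sequences fed to VTAB
print(to_horizontal_temporal(tokens).shape)  # sequences fed to HTAB
```

In this layout every VTAB sequence holds all time steps and all rows of a single column, so attention inside it sees vertical structure and motion together. HTAB does the same for rows.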
[0047] In this embodiment, the specific settings for the Vertical-Temporal Attention Block (VTAB) and the Horizontal-Temporal Attention Block (HTAB) are as follows: the input token sequence A1 is sequentially passed through a convolutional layer and a multi-head dot-product attention block to generate feature A2. Feature A1 and feature A2 are fused to obtain feature A3. Feature A3 is sequentially processed by layer normalization and a multilayer perceptron to obtain feature A4. Features A3 and A4 are fused to obtain feature A5, which is used as the output of the Vertical-Temporal Attention Block (VTAB) or the Horizontal-Temporal Attention Block (HTAB).
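The A1 to A5 data flow of a single attention block can be sketched in NumPy as below. Several details are assumptions, since the text does not fix them: the convolutional layer is modelled as a pointwise (kernel size 1) convolution, "fused" is taken to mean residual addition, the MLP uses a ReLU activation, layer normalization has no learned scale or shift, and the rotational position encoding is left out for brevity. All names and the random weights are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(x, wq, wk, wv, wo, heads):
    """Scaled dot-product self-attention over [S, L, D] token sequences."""
    S, L, D = x.shape
    dh = D // heads

    def split(t):  # [S, L, D] -> [S, heads, L, dh]
        return t.reshape(S, L, heads, dh).transpose(0, 2, 1, 3)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh))  # [S, heads, L, L]
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(S, L, D)
    return out @ wo


def axis_attention_block(a1, p, heads=2):
    """A1 -> conv -> attention -> A2; A3 = A1 + A2; A4 = MLP(LN(A3)); A5 = A3 + A4."""
    a2 = multi_head_attention(a1 @ p["conv"], p["wq"], p["wk"], p["wv"], p["wo"], heads)
    a3 = a1 + a2                                              # first fusion (assumed addition)
    a4 = np.maximum(layer_norm(a3) @ p["w1"], 0.0) @ p["w2"]  # layer norm, then MLP
    return a3 + a4                                            # second fusion gives A5


rng = np.random.default_rng(0)
D = 8
p = {name: rng.standard_normal((D, D)) * 0.1 for name in ("conv", "wq", "wk", "wv", "wo")}
p["w1"] = rng.standard_normal((D, 4 * D)) * 0.1
p["w2"] = rng.standard_normal((4 * D, D)) * 0.1

tokens = rng.standard_normal((6, 20, D))  # e.g. 6 columns, 20 height-time tokens each
out = axis_attention_block(tokens, p)
print(out.shape)
```

The same function serves as VTAB or HTAB; only the rearrangement of the input tokens differs.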
[0048] By setting the model as described above, spatial features and motion characteristics can be effectively utilized in both vertical and horizontal directions, enhancing the ability to capture details and dynamic changes.
[0049] S3. Spatiotemporal Reconstruction: Based on the second feature and the original video sequence, temporal attention is used to integrate temporal information, reconstructing and generating a video output with higher spatiotemporal quality. The output is a high-resolution video, where s denotes the magnification ratio, which is usually set to 4.
[0050] Specifically, in the decoding and reconstruction process after the blocks are merged back into feature maps, temporal attention is first applied to integrate temporal information, followed by frame-by-frame reconstruction using operations such as 2D convolution and pixel shuffle. Experiments show that 3D convolution does not provide a performance improvement over 2D convolution; instead, it significantly increases computational costs. Therefore, 2D convolution was chosen for reconstruction. This approach allows the model to combine spatial and temporal information, resulting in video output with higher spatiotemporal quality.
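The cost gap between the two convolution types can be estimated with a simple multiply-accumulate count. The sketch below assumes stride 1, "same" padding, cubic or square kernels of side k, and ignores bias terms; the function names and the layer sizes in the example are illustrative and not taken from the experiments.

```python
def conv2d_macs(c_in, c_out, k, frames, height, width):
    """Multiply-accumulates of a k x k 2D convolution applied to every frame."""
    return c_in * c_out * k * k * frames * height * width


def conv3d_macs(c_in, c_out, k, frames, height, width):
    """Multiply-accumulates of a k x k x k 3D convolution over the whole clip."""
    return c_in * c_out * k ** 3 * frames * height * width


# Hypothetical layer: 64 -> 64 channels, kernel side 3, 16 frames of 64 x 64.
macs_2d = conv2d_macs(64, 64, 3, 16, 64, 64)
macs_3d = conv3d_macs(64, 64, 3, 16, 64, 64)
print(macs_3d // macs_2d)  # the 3D layer costs k times as much
```

Under these assumptions the 3D layer needs k times the arithmetic and k times the weights of the 2D layer. This is consistent with the choice of 2D convolution once temporal attention has already mixed information across frames.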
[0051] To adapt the model to real-world degradation, a pre-training-fine-tuning strategy was employed. The entire architecture is based on the ViViT design and includes input video embedding and spatiotemporal reconstruction modules. A dual-axis spatiotemporal attention mechanism was specifically introduced to achieve enhanced fusion and modeling of spatiotemporal information.
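To make the overall data flow easier to follow, the tensor shapes at each stage of steps S1 to S3 can be traced as below. This is only a bookkeeping sketch: the function name `trace_shapes`, the axis order of each tuple, and the choice D = d·n·h·w for the token dimension are assumptions, and no actual computation is performed.

```python
def trace_shapes(B, C, N, H, W, d, n=1, h=1, w=1, s=4):
    """Return the assumed tensor shape at each stage of the pipeline."""
    nN, nH, nW = N // n, H // h, W // w   # block counts along time, height, width
    D = d * n * h * w                     # token dimension if blocks are flattened
    return {
        "input": (B, C, N, H, W),           # S1: low-resolution clip
        "shallow": (B, d, N, H, W),         # S21: per-frame 2D convolution
        "tokens": (B, D, nN, nH, nW),       # S22: partitioned feature blocks
        "vtab_in": (B * nW, nH * nN, D),    # S23: one sequence per column
        "htab_in": (B * nH, nW * nN, D),    # S24: one sequence per row
        "output": (B, C, N, H * s, W * s),  # S3: frames enlarged by the ratio s
    }


for stage, shape in trace_shapes(B=1, C=3, N=8, H=32, W=32, d=64).items():
    print(stage, shape)
```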
[0052] One of the core features of this invention is the dual-axis spatiotemporal attention mechanism, proposed here for the first time, as shown in Figure 2:
[0053] In re-examining the attention mechanism in Video Transformers, it's noteworthy that different types of attention mechanisms each have their own advantages and disadvantages. Vision Transformer (ViT) captures static spatial information by segmenting the image into non-overlapping blocks and applying self-attention to them. However, in video tasks, spatial attention alone cannot handle inter-frame temporal dependencies. ViViT introduces temporal attention to capture dynamic relationships between frames, computing self-attention by connecting tokens at the same spatial location across time. This helps capture motion patterns but may lead to a loss of intra-frame spatial details. Spatiotemporal attention, combining spatial and temporal information, allows the model to focus on key regions within a frame while also understanding inter-frame dynamics.
[0054] Existing methods typically use spatial and temporal attention sequentially or alternately, but still process spatial and temporal information separately, limiting their ability to capture complex spatiotemporal patterns. Local window attention is widely used in video super-resolution models (such as RVRT, PSRT, and IART), which employ a Swin Transformer architecture to compute attention within a local window, improving performance on low-level tasks. However, this mechanism is limited by a finite receptive field and poor scalability, making it difficult to capture long-range dependencies. To investigate the impact of different attention mechanisms on video super-resolution (VSR) performance, this embodiment extends the ViViT architecture to the VSR task and conducts experiments on the REDS dataset. The results show that using only spatial or only temporal attention performs poorly, with temporal attention performing slightly better because it can use inter-frame similarity to compensate for the lack of spatial details.
[0055] The dual-axis spatiotemporal attention mechanism of this invention integrates vertical-temporal and horizontal-temporal attention, effectively combining spatial and temporal information. The input video is segmented into a series of tokens, each representing a spatiotemporal region. These tokens are first embedded along the height-time dimension and then fed into a vertical-temporal attention block (VTAB) to model spatial texture and vertical motion. Subsequently, they are embedded along the width-time dimension, and a horizontal-temporal attention block (HTAB) is applied to capture spatial texture and horizontal motion. This method enables the model to utilize spatial features and motion characteristics simultaneously in both the vertical and horizontal directions, enhancing its ability to capture details and dynamic changes. Experiments show that this method not only improves parameter efficiency and running speed but also achieves significant performance improvements in video super-resolution tasks.
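One way to see why attending along two planes is cheaper than attending over all tokens at once is to count query-key pairs. The sketch below is a back-of-the-envelope illustration, assuming one VTAB pass and one HTAB pass; the comparison against full joint spatiotemporal attention, the function names, and the example grid size are assumptions rather than figures from the experiments.

```python
def full_attention_pairs(nN, nH, nW):
    """Query-key pairs when every token attends to every other token."""
    length = nN * nH * nW
    return length * length


def dual_axis_pairs(nN, nH, nW):
    """Query-key pairs for one vertical-temporal plus one horizontal-temporal pass."""
    vertical = nW * (nH * nN) ** 2    # nW sequences of nH*nN tokens each
    horizontal = nH * (nW * nN) ** 2  # nH sequences of nW*nN tokens each
    return vertical + horizontal


# Hypothetical token grid: 8 time steps, 32 x 32 spatial positions.
full = full_attention_pairs(8, 32, 32)
dual = dual_axis_pairs(8, 32, 32)
print(full // dual)
```

Each token still reaches every other token in its column and its row across all frames, so the receptive field is not confined to a local window.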
[0056] Table 1
[0057]
[0058] This embodiment also provides a real-world video super-resolution system, which includes:
[0059] The input module is used to input the raw video sequence;
[0060] The video embedding module is used to extract the spatial features of each frame from the original video sequence to obtain the first feature, and to input the first feature into the dual-axis spatiotemporal attention mechanism module to obtain the second feature; wherein the dual-axis spatiotemporal attention mechanism module includes a vertical-temporal attention block and a horizontal-temporal attention block. The feature blocks generated by partitioning the first feature are converted into token sequences through rotational position encoding. The token sequences are rearranged and then fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics.
[0061] The spatiotemporal reconstruction module is used to integrate temporal information with temporal attention, based on the second feature and the original video sequence, to reconstruct and generate video output with higher spatiotemporal quality.
[0062] In summary, this invention, through in-depth analysis of the core spatial and temporal attention mechanisms of the video Transformer, reveals that existing spatiotemporal attention mechanisms typically process spatial and temporal information independently and sequentially. This sequential approach fails to fully integrate spatiotemporal information, limiting coherent video representation. Therefore, this invention proposes a novel dual-axis spatial and temporal Transformer for Real-World Video Super-Resolution (DualX-VSR).
[0063] Unlike traditional spatiotemporal attention mechanisms, this invention handles vertical-temporal and horizontal-temporal attention together, projecting spatiotemporal information along orthogonal directions to achieve integrated modeling of spatial and temporal information. This avoids the sequential stacking of spatial and temporal modules, providing a more unified and coherent representation. The effectiveness of this method has been verified in real-world video super-resolution tasks. As shown in Figure 3, DualX-VSR retains the simplicity of the ViViT architecture without adding auxiliary modules, demonstrating superior performance compared to current video super-resolution methods. By eliminating explicit motion estimation and long-range feature propagation, DualX-VSR provides a robust solution to the unique challenges of real-world video super-resolution, setting a new standard for high-quality video restoration in complex real-world scenes.
[0064] The electronic device of this invention includes a central processing unit (CPU), which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) or loaded from a storage unit into random access memory (RAM). The RAM may also store various programs and data required for device operation. The CPU, ROM, and RAM are interconnected via a bus. Input / output (I / O) interfaces are also connected to the bus.
[0065] Multiple components in the device are connected to the I / O interface, including: input units such as keyboards and mice; output units such as various types of displays and speakers; storage units such as disks and optical discs; and communication units such as network interface cards (NICs), modems, and wireless transceivers. The communication unit allows the device to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0066] The processing unit executes the various methods and processes described above, such as methods S1 to S3. For example, in some embodiments, methods S1 to S3 may be implemented as computer software programs tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and / or installed on the device via ROM and / or a communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of methods S1 to S3 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to execute methods S1 to S3 by any other suitable means (e.g., by means of firmware).
[0067] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0068] The program code used to implement the methods of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
[0069] In the context of this invention, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0070] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A real-world video super-resolution method, characterized in that it comprises: inputting the original video sequence; video embedding: spatial features of each frame are extracted from the original video sequence to obtain the first feature, and the first feature is input into the dual-axis spatiotemporal attention mechanism module to obtain the second feature, wherein the dual-axis spatiotemporal attention mechanism module includes a vertical-temporal attention block and a horizontal-temporal attention block, the feature blocks generated by partitioning the first feature are converted into token sequences through rotational position encoding, and the token sequences are rearranged and then fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics; spatiotemporal reconstruction: based on the second feature and the original video sequence, temporal attention is used to integrate temporal information to reconstruct and generate video output with higher spatiotemporal quality; in the dual-axis spatiotemporal attention mechanism module, rearranging the token sequences and feeding them into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics specifically includes: first, token embedding is performed along the height-time dimension, and the token dimensions are rearranged from $[B\ D\ n_N\ n_H\ n_W]$ to $[(B\ D\ n_W)\ n_H\ n_N]$, where $B$ is the batch size, $D$ is the embedding dimension of the attention block, and $n_N$, $n_H$, $n_W$ are the numbers of blocks along the time, height, and width directions, after which the tokens are fed into the Vertical-Temporal Attention Block (VTAB) to calculate self-attention on the vertical-temporal plane and perform spatial texture and vertical motion modeling; then, token embedding is performed along the width-time dimension, and the token dimensions are rearranged from $[(B\ D\ n_W)\ n_H\ n_N]$ to $[(B\ D\ n_H)\ n_W\ n_N]$, after which the tokens are fed into the Horizontal-Temporal Attention Block (HTAB) to calculate self-attention on the horizontal-temporal plane and perform spatial texture and horizontal motion modeling; the specific settings for the Vertical-Temporal Attention Block (VTAB) and the Horizontal-Temporal Attention Block (HTAB) are as follows: the input token sequence A1 is sequentially passed through a convolutional layer and a multi-head dot-product attention block to generate feature A2; feature A1 and feature A2 are fused to obtain feature A3; feature A3 is sequentially processed by layer normalization and a multilayer perceptron to obtain feature A4; features A3 and A4 are fused to obtain feature A5, which is used as the output of the Vertical-Temporal Attention Block (VTAB) or the Horizontal-Temporal Attention Block (HTAB).
2. The real-world video super-resolution method according to claim 1, characterized in that the extraction of spatial features from each frame of the original video sequence specifically involves: using two-dimensional convolution to extract spatial features from each frame of the original video sequence, and using a temporal attention mechanism to enhance temporal continuity.
3. The real-world video super-resolution method according to claim 1, characterized in that the feature blocks are generated by partitioning the first feature into 1×1 blocks along the time, height, and width dimensions.
4. The real-world video super-resolution method according to claim 1, characterized in that the rotational position encoding is achieved by multiplication by an offset in complex vector space; for the u-th query vector $q_u$ and the v-th key vector $k_v$, the rotational position encoding is expressed as: $f_q(q_u, u) = e^{iu\Theta} q_u$, $f_k(k_v, v) = e^{iv\Theta} k_v$, where $\Theta$ is a diagonal matrix with elements $\theta_d = b^{-2d/|D|}$, $b$ is the rotation base, and $D$ is the embedding dimension of the attention block; the score $A_{u,v}$ of the attention block is the real part of the inner product of the two complex vectors $f_q$ and $f_k$.
5. The real-world video super-resolution method according to claim 1, characterized in that integrating temporal information using temporal attention, based on the second feature and the original video sequence, to reconstruct and generate video output with higher spatiotemporal quality specifically includes: integrating temporal information using temporal attention; and performing frame-by-frame reconstruction using 2D convolution and pixel shuffle operations to generate video output with higher spatiotemporal quality.
6. The real-world video super-resolution method according to claim 1, characterized in that the dual-axis spatiotemporal attention mechanism module is trained using a pre-training-fine-tuning strategy.
7. A system employing the real-world video super-resolution method of claim 1, characterized in that it comprises: an input module, used to input the original video sequence; a video embedding module, used to extract the spatial features of each frame from the original video sequence to obtain the first feature, and to input the first feature into the dual-axis spatiotemporal attention mechanism module to obtain the second feature, wherein the dual-axis spatiotemporal attention mechanism module includes a vertical-temporal attention block and a horizontal-temporal attention block, the feature blocks generated by partitioning the first feature are converted into token sequences through rotational position encoding, and the token sequences are rearranged and then fed into the vertical-temporal attention block and the horizontal-temporal attention block respectively to model spatial texture and motion characteristics; and a spatiotemporal reconstruction module, used to integrate temporal information with temporal attention, based on the second feature and the original video sequence, to reconstruct and generate video output with higher spatiotemporal quality.
8. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1 to 6.