Video super-resolution method based on long-range-short-range combination
By combining short-range and long-range feature extraction in a video super-resolution algorithm, with a sliding window for the short-range features and a recurrent neural network for the long-range features, the method addresses the insufficient combination of short-range and long-range information in existing algorithms and achieves a better video super-resolution effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2023-09-26
- Publication Date
- 2026-06-12
AI Technical Summary
Existing video super-resolution algorithms struggle to effectively combine short-range and long-range information, resulting in poor video super-resolution performance. Furthermore, the transformer has a large number of parameters and requires a large number of floating-point operations, leading to an imbalance between resources and performance.
A video super-resolution method based on long-range and short-range combination is adopted. Short-range features are extracted by sliding window and long-range features are generated by recurrent neural network. The features are then combined with shallow features for feature fusion and reconstruction to generate high-resolution images.
By combining short-range and long-range information, the generated video super-resolution effect is better, with richer details, stronger temporal correlation and consistency.
Figure CN117333363B_ABST
Description
Technical Field
[0001] This invention belongs to the field of video super-resolution technology, and particularly relates to a video super-resolution method based on a long-range-short-range combination.
Background Technology
[0002] Super-resolution is defined as reconstructing a high-resolution output from a low-resolution input by learning a mapping relationship from low resolution to high resolution. With the significant improvement in computing power of hardware devices in recent years, users' demands for image quality have also increased. Because directly transmitting high-quality images or videos requires excessive bandwidth, low-quality media is often transmitted during the actual transmission process, and then super-resolution is performed on the media before it reaches the display terminal to improve image quality. Super-resolution currently has numerous practical applications, such as medical image reconstruction, satellite image remote sensing, digital high-definition, microscopic imaging, game streaming, and movie videos.
[0003] Super-resolution can be divided into image super-resolution and video super-resolution: image super-resolution reconstructs a high-resolution image from a single low-resolution input image, whereas video super-resolution reconstructs a high-resolution image from multiple consecutive input frames by exploiting inter-frame relationships. Traditional video super-resolution methods often use affine transformations or probabilistic statistics for estimation, but these methods struggle to estimate the complex motion of objects and the scene transitions in videos. With the tremendous success of deep learning in various fields, deep learning-based video super-resolution algorithms have been widely studied. Currently, deep learning-based super-resolution algorithms mainly adopt three structures: convolutional neural network (CNN) based structures, which use a sliding window to extract short-range features between adjacent frames for use in the calculation; recurrent neural network (RNN) based structures, which use hidden states to store long-range features of historical or future frames for use in the calculation, with the features continuously updated as the calculation proceeds; and Transformer based structures, which use attention mechanisms to extract features for use in the calculation.
[0004] Since video is a continuous sequence, adjacent frames closer to the target frame often provide more useful information. Convolutional neural networks utilize local sliding window feature extraction to extract and fuse features, making good use of short-range information between frames. Simultaneously, there is a certain correlation between consecutive frames in a video; a relatively distant frame may significantly aid in the reconstruction of the target frame. Therefore, if the network can focus on longer video sequences or has a larger temporal receptive field, it can achieve better video super-resolution results. Currently, existing networks do not effectively combine short-range and long-range information. Some transformers can achieve this combination using attention matching mechanisms between queries and keys, but transformers have a large number of parameters and high floating-point computation, failing to achieve a good balance between performance and resources.
Summary of the Invention
[0005] To address the aforementioned technical problems, this invention discloses a video super-resolution method based on a long-range-short-range combination. The method obtains short-range features from the low-resolution images through a sliding window, then inputs these short-range features into a recurrent network to obtain long-range features. Fusing and reconstructing these two kinds of features yields a high-resolution image. The invention makes full use of both the short-range and the long-range information in the video sequence, producing super-resolution images with rich detail.
[0006] The technical solution adopted by this invention to solve its technical problem is as follows:
[0007] A video super-resolution method based on long-range and short-range combination includes the following steps:
[0008] Step (1). Acquire low-resolution video and perform image enhancement frame by frame;
[0009] Step (2). Extract shallow features from each frame of the image-enhanced low-resolution video;
[0010] Step (3). Calculate inter-frame motion compensation information based on the shallow features of each image-enhanced frame and the shallow features of its adjacent frames;
[0011] Step (4). Treat each frame as the target frame in turn, warp the adjacent frames using deformable convolution and the inter-frame motion compensation information calculated in step (3), and align the shallow features of the adjacent frames with the shallow features of the target frame;
[0012] Step (5). Fuse the shallow features of the target frame with the shallow features of adjacent frames after alignment to obtain the short-range features of the target frame;
[0013] Step (6). Based on the short-range features of the target frame and adjacent frames, a long-range feature extraction module based on a recurrent neural network is used to generate long-range features of the target frame;
[0014] Step (7). Traverse all target frames, fuse the shallow features, short-range features and long-range features of each frame of the low-resolution video obtained in steps (2), (5) and (6) to obtain reconstructed features, and perform upsampling interpolation and channel dimension transformation on the reconstructed features to generate high-resolution video.
[0015] Furthermore, in step (1), image enhancement includes mirror symmetry, horizontal 90° flip, and vertical 90° flip.
[0016] Furthermore, in step (2), each frame of the low-resolution video after image enhancement obtained in step (1) is expanded in channel dimension through a convolutional layer while maintaining the resolution, to obtain shallow features of each frame.
[0017] In steps (1)-(2), image enhancement is first performed on the input low-resolution video sequence segment, and then the spatial features of each frame of the video sequence are extracted. Because these are shallow features, they preserve the information of the original image relatively completely. They serve as guiding information in the subsequent calculation of reconstructed features, helping the model converge quickly and achieve good performance.
Further, in step (3), the shallow features of each frame obtained in step (2) are processed sequentially with the shallow features of its adjacent frames to obtain the inter-frame motion compensation information of the adjacent frames, as follows:
[0018]
[0019]
[0020] where $F_t$ denotes the shallow features of the t-th frame image; $F_{t-1}$ and $F_{t+1}$ denote the shallow features of the adjacent frames before and after the t-th frame; R(·) denotes the stacked residual block structure; ReLU(·) denotes the activation function; $O_{t-1\to t}$ denotes the backward inter-frame motion compensation that aligns the shallow features of the (t-1)-th frame image to the shallow features of the t-th frame image; and $O_{t+1\to t}$ denotes the forward inter-frame motion compensation that aligns the shallow features of the (t+1)-th frame image to the shallow features of the t-th frame image.
[0021] In step (3), motion compensation is calculated for each frame of the low-resolution video sequence to determine the forward and backward optical flow of each frame, i.e., backward inter-frame motion compensation and forward inter-frame motion compensation.
[0022] Furthermore, in step (4), deformable convolution is used, with the inter-frame motion compensation serving as the positional offset between the target frame and its adjacent frames. The shallow features of the historical frame and of the future frame are aligned with the shallow features of the target frame, yielding the aligned shallow features of the historical frame and the aligned shallow features of the future frame.
[0023] In step (4), the previously obtained motion compensation information is used to warp the adjacent frames and align them to the target frame. Specifically, the motion compensation information obtained in step (3) is used as the offset in the deformable convolution dconv, and the shallow features of the adjacent frames are used as its input, so that these shallow features are warped toward the target frame.
[0024] Furthermore, in step (5), the shallow features of the historical frame after alignment, the shallow features of the target frame, and the shallow features of the future frame are concatenated in the channel dimension, and then the features are fused sequentially through a convolutional layer, an activation function layer, and stacked residual blocks to obtain the short-range features of the target frame.
[0025] Furthermore, in step (6), the calculation formula for the long-range feature extraction module based on the recurrent neural network is as follows:
[0026]
[0027]
[0028]
[0029] where $S_t$, $S_{t-1}$, and $S_{t+1}$ denote the short-range features of the target frame and of its adjacent historical and future frames, respectively; $h_{t-1}$ denotes the hidden state of the historical frame; c(·,·) denotes concatenation along the channel dimension; conv(·) denotes a convolutional layer; ReLU(·) denotes an activation function; R(·) denotes a stacked residual block structure; RNN(·) denotes a recurrent neural network; $h_t$ denotes the hidden state of the current frame; $L_t$ denotes the long-range features of the target frame; and $X_t$ denotes the features obtained by fusing the short-range features of the target frame, the short-range features of the historical frame, the short-range features of the future frame, and the hidden state of the historical frame.
[0030] Furthermore, in step (7), the shallow features, short-range features, and long-range features of each frame image are concatenated in the channel dimension, and then the features are fused sequentially through a convolutional layer, an activation function layer, and stacked residual blocks to obtain the reconstructed features.
[0031] Furthermore, in step (7), the reconstructed features are upsampled and interpolated through a subpixel convolutional layer, and then the upsampled and interpolated features are restored to their original channel dimensions to generate a high-resolution video.
[0032] The beneficial effects of this invention are:
[0033] This invention designs a complete network structure for video super-resolution based on a long-range-short-range combination, including a data preprocessing module, a shallow feature extraction module, a short-range feature extraction module, a long-range feature extraction module, and a super-resolution reconstruction module, belonging to a multi-layered network architecture. The network structure first uses the shallow feature extraction module to calculate shallow features for each low-resolution image frame. Next, it uses a local feature extraction sliding window to extract short-range features. Finally, it uses a recurrent neural network to calculate the long-range features for each low-resolution image frame and fuses the three types of features obtained to obtain the final high-resolution image. By separately calculating the features of neighboring frames (short-range features) and distant frames (long-range features) of the target frame, and then performing feature fusion, it can effectively utilize the context-related features of the video sequence, resulting in better temporal correlation and consistency in the super-resolution video.
Attached Figure Description
[0034] Figure 1 is a structural block diagram of the video super-resolution method based on long-range-short-range combination used in the embodiments of the present invention;
[0035] Figure 2 is an overall flowchart of an embodiment of the present invention.
Detailed Implementation
[0036] The method of the present invention will be further described below with reference to the accompanying drawings. The accompanying drawings are merely illustrative diagrams of the present invention. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities can be implemented in software, or in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0037] The structural block diagram of the video super-resolution method based on long-range-short-range combination of the present invention is shown in Figure 1. It mainly includes five modules: a data preprocessing module, a shallow feature extraction module, a short-range feature extraction module, a long-range feature extraction module, and a super-resolution reconstruction module. The implementation steps are shown in Figure 2.
[0038] The data preprocessing module is used to process the input raw video data stream and execute the method in step (1) below.
[0039] Step (1). Obtain the low-resolution video sequence and apply mirror symmetry, horizontal 90° flip, and vertical 90° flip to each frame in turn to achieve image enhancement. Denote the image-enhanced low-resolution video sequence as $\{I_t\}_{t=1}^{T}$, where T is the number of video frames and $I_t$ is the image at frame t. Each frame is then input into the shallow feature extraction module.
[0040] The shallow feature extraction module is used to extract shallow features of each frame in a low-resolution video by performing the method in step (2) below.
[0041] Step (2). Take the t-th frame (t∈[1,T]) of the low-resolution video obtained in step (1). The input frame's channel dimension is increased from 3 to 64 using convolutional layers, followed by shallow feature extraction.
[0042] The shallow feature extraction process is represented as follows:
[0043] $$F_t = R\big(\mathrm{ReLU}(\mathrm{conv}(I_t))\big)$$
[0044] where $F_t$ denotes the shallow features of the image $I_t$ of frame t; R(·) denotes the stacked residual block structure used for feature extraction; ReLU(·) denotes the activation function; and conv(·) denotes the convolutional layer. The spatial resolution remains constant throughout the shallow feature calculation.
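The shallow feature extraction described above (a convolution that expands 3 channels to 64 at constant resolution, an activation, then stacked residual blocks) can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: the 3x3 kernel size, the internal layout of each residual block, the number of blocks, and the helper names `conv3x3`, `residual_block`, `init_params`, and `shallow_features` are all assumptions, and the weights are random.

```python
import numpy as np

def conv3x3(x, w, b):
    """3x3 convolution, stride 1, zero padding 1, so H and W are preserved.
    x: (C_in, H, W); w: (C_out, C_in, 3, 3); b: (C_out,)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Contribution of kernel tap (dy, dx) to every output pixel.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out + b[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, p):
    """x + conv(ReLU(conv(x))): an assumed residual block layout."""
    y = conv3x3(relu(conv3x3(x, p["w1"], p["b1"])), p["w2"], p["b2"])
    return x + y

def init_params(rng, channels=64, n_blocks=2):
    """Random weights for shape checking only (hypothetical helper)."""
    def conv_init(c_out, c_in):
        return 0.1 * rng.standard_normal((c_out, c_in, 3, 3)), np.zeros(c_out)
    w0, b0 = conv_init(channels, 3)
    blocks = []
    for _ in range(n_blocks):
        w1, b1 = conv_init(channels, channels)
        w2, b2 = conv_init(channels, channels)
        blocks.append({"w1": w1, "b1": b1, "w2": w2, "b2": b2})
    return {"w0": w0, "b0": b0, "blocks": blocks}

def shallow_features(frame, params):
    """frame: (3, H, W) -> shallow features (channels, H, W)."""
    x = relu(conv3x3(frame, params["w0"], params["b0"]))  # 3 -> channels
    for p in params["blocks"]:                            # stacked residual blocks
        x = residual_block(x, p)
    return x

rng = np.random.default_rng(0)
params = init_params(rng, channels=64, n_blocks=2)
frame = rng.random((3, 16, 16))
feat = shallow_features(frame, params)
print(feat.shape)  # (64, 16, 16)
```

The output keeps the 16x16 spatial size of the input while the channel dimension grows from 3 to 64, which matches the statement that the resolution is maintained during shallow feature extraction.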
[0045] A core design feature of this invention is that the network extracts short-range and long-range features from each frame of the low-resolution video and then fuses and reconstructs them. The short-range feature extraction module is used to extract short-range features from the shallow features of each frame of the low-resolution video and executes the methods in steps (3)-(4) below.
[0046] Step (3). Convolution calculations are performed on the shallow features $F_t$ of the t-th frame of the low-resolution video obtained in step (2) and the shallow features of its adjacent frames, as follows:
[0047] Step (3.1). Feature calculations are performed on the shallow features $F_t$ of the t-th frame of the low-resolution video obtained in step (2) and the shallow features $F_{t-1}$ of its adjacent historical frame to obtain the backward inter-frame motion compensation:
[0048]
[0049] where $O_{t-1\to t}$ denotes the backward inter-frame motion compensation that aligns the shallow features $F_{t-1}$ of the (t-1)-th frame image to the shallow features $F_t$ of the t-th frame image; R(·) denotes the stacked residual block structure, used here to calculate the compensation; and ReLU(·) denotes the activation function.
[0050] Step (3.2). Feature calculations are performed on the shallow features $F_t$ of the t-th frame of the low-resolution video obtained in step (2) and the shallow features $F_{t+1}$ of its adjacent future frame to obtain the forward inter-frame motion compensation:
[0051]
[0052] where $O_{t+1\to t}$ denotes the forward inter-frame motion compensation that aligns the shallow features $F_{t+1}$ of the (t+1)-th frame image to the shallow features $F_t$ of the t-th frame image; R(·) denotes the stacked residual block structure, used here to calculate the compensation; and ReLU(·) denotes the activation function.
[0053] In this embodiment, during the calculation of motion compensation for the first and last frames of the low-resolution video, at t=1, let When t = T, let
[0054] In this scheme, implicit inter-frame alignment is chosen, so deformable convolution is used to warp the frames. The historical and future frames are aligned to the target frame using the backward and forward inter-frame motion compensation information of each frame obtained in step (3), by performing the method in step (4) below.
[0055] Step (4). Implicit motion compensation is applied to the shallow features of the adjacent frames of the image-enhanced low-resolution video obtained in step (2), according to the inter-frame motion compensation obtained in step (3), specifically as follows:
[0056] Step (4.1). Align the shallow features $F_{t-1}$ of the (t-1)-th frame image of the low-resolution video to the shallow features $F_t$ of the t-th frame image:
[0057] $$\hat{F}_{t-1} = \mathrm{dconv}(F_{t-1},\, O_{t-1\to t})$$
[0058] where $\hat{F}_{t-1}$ denotes the shallow features of the (t-1)-th frame image after alignment to the shallow features of the t-th frame image, referred to as the aligned shallow features of the historical frame; $O_{t-1\to t}$ denotes the backward inter-frame motion compensation obtained in step (3.1); and dconv(·,·) denotes deformable convolution.
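[0059] Step (4.2). Align the shallow features $F_{t+1}$ of the (t+1)-th frame image of the low-resolution video to the shallow features $F_t$ of the t-th frame image:
[0060] $$\hat{F}_{t+1} = \mathrm{dconv}(F_{t+1},\, O_{t+1\to t})$$
[0061] where $\hat{F}_{t+1}$ denotes the shallow features of the (t+1)-th frame image after alignment to the shallow features of the t-th frame image, referred to as the aligned shallow features of the future frame; $O_{t+1\to t}$ denotes the forward inter-frame motion compensation obtained in step (3.2); and dconv(·,·) denotes deformable convolution.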
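The idea of using motion compensation as a sampling offset can be illustrated with a deliberately simplified sketch. A real deformable convolution learns a separate offset for every kernel tap and also applies kernel weights; the hypothetical function `align_with_offsets` below instead uses a single (dy, dx) offset per pixel and bilinear sampling, which corresponds to a 1x1 deformable convolution with identity weights. The offset layout and the clamping at image borders are assumptions.

```python
import numpy as np

def align_with_offsets(feat, offset):
    """Sample `feat` at positions displaced by per-pixel offsets.

    feat:   (C, H, W) features of an adjacent frame.
    offset: (2, H, W) with offset[0] = dy and offset[1] = dx (assumed layout).
    Returns (C, H, W) features where output pixel (y, x) is the bilinear
    sample of feat at (y + dy, x + dx), clamped to the image borders.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(ys + offset[0], 0, h - 1)
    x = np.clip(xs + offset[1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = y - y0  # vertical interpolation weight
    wx = x - x0  # horizontal interpolation weight
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# Toy check: an offset of +1 pixel in x pulls each value from its right neighbour.
feat = np.arange(12, dtype=float).reshape(1, 3, 4)
shift_right = np.zeros((2, 3, 4))
shift_right[1] = 1.0
aligned = align_with_offsets(feat, shift_right)
```

With a zero offset the features pass through unchanged, so the quality of the alignment depends entirely on how well the offsets computed in step (3) describe the motion between the adjacent frame and the target frame.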
[0062] Step (4.3). Fuse the aligned shallow features $\hat{F}_{t-1}$ of the historical frame obtained in step (4.1) and the aligned shallow features $\hat{F}_{t+1}$ of the future frame obtained in step (4.2) with the shallow features $F_t$ of the t-th frame image:
[0063] $$S_t = R\big(\mathrm{ReLU}(\mathrm{conv}(c(\hat{F}_{t-1},\, F_t,\, \hat{F}_{t+1})))\big)$$
[0064] where $S_t$ denotes the short-range features of the image of frame t; R(·) denotes the stacked residual block structure used to calculate the short-range features; ReLU(·) denotes the activation function; conv(·) denotes the convolutional layer; and c(·,·) denotes concatenation of the features along the channel dimension.
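The fusion pattern used here (concatenate along the channel dimension, then convolution, activation, and stacked residual blocks) can be sketched as below. To keep the example short, 1x1 convolutions stand in for the convolutional layers, and the residual block layout and the names `conv1x1` and `fuse_short_range` are assumptions rather than details taken from the patent.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W); w: (C_out, C_in); b: (C_out,)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def fuse_short_range(prev_aligned, target, next_aligned, p):
    """Fuse aligned neighbour features with the target frame's features."""
    x = np.concatenate([prev_aligned, target, next_aligned], axis=0)  # (3C, H, W)
    x = relu(conv1x1(x, p["w_in"], p["b_in"]))                        # 3C -> C
    for w1, b1, w2, b2 in p["blocks"]:                                # residual blocks
        x = x + conv1x1(relu(conv1x1(x, w1, b1)), w2, b2)
    return x

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6
p = {
    "w_in": 0.1 * rng.standard_normal((C, 3 * C)),
    "b_in": np.zeros(C),
    "blocks": [
        (0.1 * rng.standard_normal((C, C)), np.zeros(C),
         0.1 * rng.standard_normal((C, C)), np.zeros(C))
        for _ in range(2)
    ],
}
prev_a, target, next_a = (rng.random((C, H, W)) for _ in range(3))
short = fuse_short_range(prev_a, target, next_a, p)
print(short.shape)  # (8, 6, 6)
```

Concatenating before the first convolution lets that layer weigh the historical, target, and future features against each other at every pixel, while reducing the channel count back to that of a single frame.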
[0065] After the short-range features of each frame of the low-resolution video are obtained, the long-range feature extraction module calculates the long-range features of each frame from the short-range features of the target frame and its adjacent frames together with the hidden state output by the previous state. Then, the methods in steps (5) to (6) are executed.
[0066] Step (5). The short-range features $S_t$ of the t-th frame of the low-resolution video sequence obtained in step (4), the short-range features $S_{t-1}$ and $S_{t+1}$ of its adjacent frames, and the hidden state features $h_{t-1}$ output by the previous state are input into the long-range feature extraction module to calculate the long-range features of the current frame, specifically:
[0067] Step (5.1). Fuse the short-range features $S_t$ of the t-th frame of the low-resolution video obtained in step (4.3), the short-range features $S_{t-1}$ and $S_{t+1}$ of its adjacent frames, and the hidden state features $h_{t-1}$ output by the previous state:
[0068] $$X_t = R\big(\mathrm{ReLU}(\mathrm{conv}(c(S_{t-1},\, S_t,\, S_{t+1},\, h_{t-1})))\big)$$
[0069] where $X_t$ denotes the features obtained by fusing the short-range features of the image of frame t with the short-range features of its neighboring frames and the hidden state features; R(·) denotes the stacked residual block structure used for fusion reconstruction. In this embodiment, eight stacked residual blocks are used. Residual blocks are used to reduce the learning difficulty of the network and to avoid gradient problems during backpropagation. ReLU(·) denotes the activation function; conv(·) denotes the convolutional layer; and c(·,·) denotes concatenation of the features along the channel dimension.
[0070] Step (5.2). Input the fused features $X_t$ of the t-th frame of the low-resolution video obtained in step (5.1) into the recurrent neural network:
[0071] $$Y_t = \mathrm{RNN}(X_t)$$
[0072] where $Y_t$ denotes the features of the image of frame t extracted by the recurrent neural network, and RNN(·) denotes the recurrent neural network.
[0073] Step (5.3). Pass the features $Y_t$ of the t-th frame image obtained in step (5.2) through a convolutional layer and then an activation function to obtain:
[0074] $$h_t = \mathrm{ReLU}(\mathrm{conv}(Y_t))$$
[0075] where $h_t$ denotes the hidden state features obtained when calculating the t-th frame image; ReLU(·) denotes the activation function; and conv(·) denotes the convolutional layer.
[0076] Step (5.4). Perform further feature extraction on the hidden state features $h_t$ of the t-th frame image obtained in step (5.3):
[0077] $$L_t = \mathrm{ReLU}(\mathrm{conv}(h_t))$$
[0078] where $L_t$ denotes the long-range features of the image of frame t; ReLU(·) denotes the activation function; and conv(·) denotes the convolutional layer.
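The recurrence in steps (5.1) to (5.4) can be sketched as a loop that carries a hidden state from frame to frame. The sketch below is illustrative only: 1x1 convolutions replace the convolutional layers, the stacked residual blocks are omitted, a `tanh` layer is a stand-in for the unspecified RNN(·), the initial hidden state is assumed to be zero, and out-of-range neighbours are handled by clamping the frame index, which is also an assumption. The names `conv1x1` and `long_range_features` are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """Bias-free 1x1 convolution. x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def relu(x):
    return np.maximum(x, 0.0)

def long_range_features(short, p):
    """short: list of T short-range feature maps, each (C, H, W).
    Returns a list of T long-range feature maps, each (C, H, W)."""
    T = len(short)
    h = np.zeros_like(short[0])             # assumed zero initial hidden state
    out = []
    for t in range(T):
        prev_s = short[max(t - 1, 0)]       # assumed boundary handling: clamp
        next_s = short[min(t + 1, T - 1)]
        x = np.concatenate([prev_s, short[t], next_s, h], axis=0)  # (4C, H, W)
        x = relu(conv1x1(x, p["w_fuse"]))   # fusion (residual blocks omitted)
        y = np.tanh(conv1x1(x, p["w_rnn"])) # stand-in for RNN(.)
        h = relu(conv1x1(y, p["w_h"]))      # hidden state of the current frame
        out.append(relu(conv1x1(h, p["w_out"])))  # long-range features
    return out

rng = np.random.default_rng(0)
C, H, W, T = 4, 5, 5, 6
p = {
    "w_fuse": rng.standard_normal((C, 4 * C)) / np.sqrt(4 * C),
    "w_rnn": rng.standard_normal((C, C)) / np.sqrt(C),
    "w_h": rng.standard_normal((C, C)) / np.sqrt(C),
    "w_out": rng.standard_normal((C, C)) / np.sqrt(C),
}
short = [rng.random((C, H, W)) for _ in range(T)]
long_feats = long_range_features(short, p)
print(len(long_feats), long_feats[0].shape)  # 6 (4, 5, 5)
```

Because the hidden state is threaded through every iteration, the long-range features of a late frame depend on frames well outside its three-frame window, which is what gives the module a temporal receptive field larger than the sliding window alone.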
[0079] Step (6). Feature fusion is performed on the shallow features $F_t$ of the t-th frame image obtained in step (2), the short-range features $S_t$ obtained in step (4), and the long-range features $L_t$ obtained in step (5), specifically as follows:
[0080] Step (6.1). Fuse the shallow features $F_t$ of the t-th frame of the low-resolution video obtained in step (2), the short-range features $S_t$ obtained in step (4), and the long-range features $L_t$ obtained in step (5) to generate the reconstructed features:
[0081] $$G_t = R\big(\mathrm{ReLU}(\mathrm{conv}(c(F_t,\, S_t,\, L_t)))\big)$$
[0082] where $G_t$ denotes the reconstructed features of the image of frame t; R(·) denotes the stacked residual block structure used for fusion reconstruction; ReLU(·) denotes the activation function; conv(·) denotes the convolutional layer; and c(·,·) denotes concatenation of the features along the channel dimension.
[0083] Step (6.2). Perform upsampling and channel dimension transformation on the reconstructed features $G_t$ of the t-th frame of the low-resolution video obtained in step (6.1):
[0084] $$I_t^{SR} = \mathrm{conv}(\mathrm{PixelShuffle}(G_t))$$
[0085] where $I_t^{SR}$ denotes the high-resolution image of frame t after super-resolution; PixelShuffle(·) denotes subpixel convolution; and conv(·) denotes the convolution operation, used to change the channel dimension of the high-resolution reconstructed features of frame t from 64 to 3.
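The subpixel (pixel shuffle) rearrangement followed by a channel projection can be sketched in NumPy as below. The `pixel_shuffle` function follows the usual convention in which an input of shape (C·r², H, W) becomes (C, H·r, W·r). The scale factor of 2, the 1x1 projection to 3 channels, and the exact channel counts around the shuffle are assumptions made for illustration.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r).

    Output pixel (c, h * r + i, w * r + j) takes its value from input
    channel c * r * r + i * r + j at position (h, w).
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)   # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, i, w, j)
    return x.reshape(c_out, h * r, w * r)

def conv1x1(x, w, b):
    """1x1 convolution used here as the final channel projection."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

rng = np.random.default_rng(0)
recon = rng.random((64, 8, 8))                 # reconstructed features, 64 channels
up = pixel_shuffle(recon, 2)                   # (16, 16, 16): 2x upsampling
w_rgb = 0.1 * rng.standard_normal((3, up.shape[0]))
sr_frame = conv1x1(up, w_rgb, np.zeros(3))     # project to 3 output channels
print(up.shape, sr_frame.shape)  # (16, 16, 16) (3, 16, 16)
```

Pixel shuffle moves information from the channel dimension into space without interpolating, so all upsampling detail has to be produced by the layers that precede it.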
[0086] The final super-resolution result is the high-resolution video obtained from the low-resolution video by the video super-resolution method based on long-range-short-range combination proposed in this invention.
[0087] The above description is merely a specific embodiment of this application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the concept of this application. For example, technical solutions formed by substituting the above features with (but not limited to) technical features with similar functions disclosed in this application.
Claims
1. A video super-resolution method based on a long-range-short-range combination, characterized in that it includes the following steps: Step (1). Acquire low-resolution video and perform image enhancement frame by frame; Step (2). Extract shallow features from each frame of the image-enhanced low-resolution video; Step (3). Calculate inter-frame motion compensation information based on the shallow features of each image-enhanced frame and the shallow features of its adjacent frames; Step (4). Treat each frame as the target frame in turn, warp the adjacent frames using deformable convolution and the inter-frame motion compensation information calculated in step (3), and align the shallow features of the adjacent frames with the shallow features of the target frame; Step (5). Fuse the shallow features of the target frame with the shallow features of adjacent frames after alignment to obtain the short-range features of the target frame; Step (6). Based on the short-range features of the target frame and adjacent frames, use a long-range feature extraction module based on a recurrent neural network to generate long-range features of the target frame; Step (7). Traverse all target frames, fuse the shallow features, short-range features and long-range features of each frame of the low-resolution video obtained in steps (2), (5) and (6) to obtain reconstructed features, and perform upsampling interpolation and channel dimension transformation on the reconstructed features to generate high-resolution video.
2. The video super-resolution method based on long-range-short-range combination according to claim 1, characterized in that, in step (1), image enhancement includes mirror symmetry, horizontal 90° flip, and vertical 90° flip.
3. The video super-resolution method based on long-range-short-range combination according to claim 1, characterized in that, in step (2), each frame of the low-resolution video after image enhancement obtained in step (1) is expanded in channel dimension through a convolutional layer while maintaining the resolution, to obtain shallow features of each frame.
4. The video super-resolution method based on long-range-short-range combination according to claim 1, characterized in that, in step (3), the shallow features of each frame obtained in step (2) are processed sequentially with the shallow features of its adjacent frames to obtain the inter-frame motion compensation information of the preceding and following frames, as shown below:
where $F_t$ denotes the shallow features of the t-th frame image; $F_{t-1}$ and $F_{t+1}$ denote the shallow features of the adjacent frames before and after the t-th frame; R(·) denotes the stacked residual block structure; ReLU(·) denotes the activation function; $O_{t-1\to t}$ denotes the backward inter-frame motion compensation that aligns the shallow features of the (t-1)-th frame image to the shallow features of the t-th frame image; and $O_{t+1\to t}$ denotes the forward inter-frame motion compensation that aligns the shallow features of the (t+1)-th frame image to the shallow features of the t-th frame image.
5. The video super-resolution method based on long-range-short-range combination according to claim 4, characterized in that, in step (4), deformable convolution is used, with the inter-frame motion compensation serving as the positional offset between the target frame and its adjacent frames, to align the shallow features of the historical frame and of the future frame with the shallow features of the target frame, yielding the aligned shallow features of the historical frame and the aligned shallow features of the future frame.
6. The video super-resolution method based on long-range-short-range combination according to claim 5, characterized in that, in step (5), the shallow features of the historical frame after alignment, the shallow features of the target frame, and the shallow features of the future frame after alignment are concatenated in the channel dimension, and then the features are fused through the convolutional layer, the activation function layer and the stacked residual blocks in sequence to obtain the short-range features of the target frame.
7. The video super-resolution method based on long-range-short-range combination according to claim 1, characterized in that, in step (6), the calculation formula for the long-range feature extraction module based on the recurrent neural network is as follows:
where $S_t$, $S_{t-1}$, and $S_{t+1}$ denote the short-range features of the target frame and of its adjacent historical and future frames, respectively; $h_{t-1}$ denotes the hidden state of the historical frame; c(·,·) denotes concatenation along the channel dimension; conv(·) denotes a convolutional layer; ReLU(·) denotes an activation function; R(·) denotes a stacked residual block structure; RNN(·) denotes a recurrent neural network; $h_t$ denotes the hidden state of the current frame; $L_t$ denotes the long-range features of the target frame; and $X_t$ denotes the features obtained by fusing the short-range features of the target frame, the short-range features of the historical frame, the short-range features of the future frame, and the hidden state of the historical frame.
8. The video super-resolution method based on long-range-short-range combination according to claim 1, characterized in that, in step (7), the shallow features, short-range features, and long-range features of each frame image are concatenated in the channel dimension, and then the features are fused sequentially through a convolutional layer, an activation function layer, and stacked residual blocks to obtain the reconstructed features.
9. The video super-resolution method based on long-range-short-range combination according to claim 1, characterized in that, in step (7), the reconstructed features are upsampled and interpolated through subpixel convolutional layers, and then the upsampled and interpolated features are restored to their original channel dimensions to generate a high-resolution video.