An infrared video deblurring method based on cross-modal style transfer and optical flow guided sparse attention
By employing cross-modal style transfer and optical flow-guided sparse attention mechanisms, the deblurring problem of infrared video under occlusion and complex motion was solved, achieving high-precision infrared video deblurring and improving image quality and temporal consistency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING TECH & BUSINESS UNIV
- Filing Date
- 2026-03-29
- Publication Date
- 2026-06-23
AI Technical Summary
Existing infrared video deblurring methods have significant limitations when dealing with occlusion and complex motion, making it difficult to effectively recover dynamic details, and the optical flow estimation accuracy is insufficient, affecting the performance of subsequent detection, recognition and tracking tasks.
Cross-modal style transfer technology is used to map visible light optical flow data to the infrared domain. Combined with optical flow-guided sparse attention mechanism and global motion aggregation optical flow estimation, the spatiotemporal dependence between adjacent frames is captured through sparse sampling and multi-head self-attention computation. The pixel-level and perception-level joint optimization strategy is used to balance image accuracy and structural details and eliminate inter-frame jitter.
It significantly improves the peak signal-to-noise ratio and structural similarity of infrared video deblurring, effectively restores the thermal radiation structural details of infrared images, eliminates inter-frame jitter in video, and improves the visual quality of images and the performance of subsequent tasks.
Figure CN122265093A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of image processing and computer vision, at the intersection of computer vision and infrared imaging technology. It specifically relates to an infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention.
Background Technology
[0002] Infrared imaging technology plays an irreplaceable role in security monitoring, industrial inspection, and military reconnaissance due to its all-weather operation and sensitivity to thermal radiation. However, infrared video is highly susceptible to equipment vibration and target motion during acquisition, resulting in severe intra-frame and inter-frame blurring, which seriously impairs the performance of subsequent visual tasks. Although optical flow estimation has been widely used to model inter-frame motion information and facilitate de-blurring reconstruction, existing methods still have significant limitations in handling occlusion and adapting to infrared imaging modes. These limitations manifest as infrared thermal radiation structural blurring, artifacts or ghosting at dynamic target edges, and severe temporal jitter between video sequences. This blurring not only reduces the visual quality of the image but also seriously affects subsequent detection, recognition, and tracking tasks, limiting the application of infrared imaging technology in critical fields.
[0003] With the development of deep learning technology, researchers have proposed various image deblurring and related processing methods, which have improved image quality to some extent. Wu Jun'an et al. disclosed an infrared image deblurring algorithm based on an attention mechanism residual network model in patent CN202210955369.2. This method constructs an end-to-end network containing multi-scale downsampling layers and dense attention residual blocks, and extracts infrared image features using channel and spatial attention mechanisms, achieving deblurring of single-frame infrared images. However, it cannot handle blurring caused by motion in infrared video. Lü Hengyi et al. disclosed an event-driven multimodal fusion image motion deblurring method in patent CN202411372115.3. This method introduces a high temporal resolution event stream generated by an event camera, constructs a deep spatiotemporal event voxel grid, and combines it with degraded images for multimodal fusion deblurring. This method depends heavily on specific hardware (an event camera) to achieve multimodal fusion and is difficult to apply directly to conventional infrared camera systems. Pan Gang et al. disclosed a visible light and infrared image fusion method for enhancing degraded images in patent CN202510649244.0. This method uses a cross-modal style transfer model and a visible light-infrared fusion model, generating the fusion result through an attention matrix to enhance image quality in degraded scenes. Zhou Zhihu et al. disclosed an optical flow estimation method and system based on a mixture-of-experts network in patent CN202511733028.0. This method uses a mixture-of-experts feature extractor (MoEE) and a mixture-of-experts updater (MoEU), extracting features and iteratively optimizing optical flow through a dynamic routing strategy and sparse activation mechanism, improving the accuracy and efficiency of optical flow estimation. However, it lacks cross-modal training data support for infrared weak texture features.
[0004] In summary, the field of infrared video deblurring still faces the following challenges: improving the ability to recover dynamic details by utilizing multi-frame temporal information; enhancing the accuracy of optical flow estimation for image features such as weak texture and thermal radiation distribution in infrared imaging; and developing a joint optimization strategy that balances pixel-level accuracy and visual perception quality. Therefore, there is an urgent need for an infrared video deblurring method that can adapt to the characteristics of infrared imaging and effectively handle occlusion and complex motion.
Summary of the Invention
[0005] The purpose of this invention is to disclose an infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention. This method utilizes cross-modal style transfer technology based on generative adversarial networks to map a visible light dataset with precise optical flow labels to the infrared domain. An optical flow estimation model guides the attention window to perform sparse sampling and multi-head self-attention computation, capturing long-distance spatiotemporal dependencies between adjacent frames. A joint optimization strategy at the pixel and perceptual levels is used to balance the numerical accuracy and structural details of the image, constrain temporal consistency, eliminate inter-frame jitter, and generate a high-precision deblurred video that meets temporal consistency constraints. The specific process includes: (1) constructing an infrared video deblurring dataset; (2) generating infrared optical flow data; (3) constructing a global motion aggregation optical flow estimation network, training the network using infrared optical flow data, and obtaining an optical flow estimation model; (4) constructing an infrared video deblurring network, using the optical flow estimation model to guide the fusion of spatiotemporal features, and constructing a joint loss function to calculate the total error; (5) training the deblurring network to obtain an infrared video deblurring model; (6) using the infrared video deblurring model to process the input blurred infrared video and output a clear infrared video with high precision and meeting the temporal consistency constraint.
[0006] Specifically, the method of the present invention includes the following steps:
[0007] A. Construct an infrared video deblurring dataset, the specific steps are as follows:
[0008] A1 Acquiring basic data for real-world scenarios: Collecting unpaired visible light images and clear infrared video frames of specific industrial scenarios to construct a basic image dataset for real-world scenarios;
[0009] A2. Extract sharp infrared video sequences from the base image dataset as source images, and generate paired sharp-blurred infrared video datasets using a trajectory-modeling-based motion blur enhancement method. The specific process includes:
[0010] A2.1 Decompose the motion parameters of the equipment into motion length and main motion direction, establish probability distribution models for each, and introduce constrained random disturbance terms to construct a motion vector model;
[0011] A2.2 Using an improved Bresenham line generation algorithm, combined with sub-pixel trajectory calculation and a dynamic energy decay function, construct a normalized motion blur kernel matrix $h$;
[0012] A2.3 Using the Fourier transform, multiply the source image $I_s$ (the sharp infrared image after boundary expansion) by the motion blur kernel matrix $h$ in the frequency domain, then apply the inverse transform to obtain the simulated blurred infrared image $I_b$:
[0013] $$I_b = \mathcal{F}^{-1}\big(\mathcal{F}(I_s) \odot \mathcal{F}(h)\big)$$
[0014] where $\mathcal{F}$ denotes the Fourier transform, $\mathcal{F}^{-1}$ denotes the inverse Fourier transform, and $\odot$ denotes element-wise multiplication;
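The frequency-domain blurring of step A2.3 can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: the function name `fft_blur`, the reflect padding mode, and the requirement of an odd-sized kernel are assumptions made for the example.

```python
import numpy as np

def fft_blur(sharp: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Blur a 2-D image by multiplying spectra, then crop back to the input size.

    Assumes an odd-sized kernel that is smaller than the image.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Boundary expansion: reflect-pad so that wrap-around from the circular
    # convolution falls in the padding rather than in the image.
    padded = np.pad(sharp, ((ph, ph), (pw, pw)), mode="reflect")
    # Embed the kernel in a zero array of the padded size and move its centre
    # to index (0, 0) so that the output is not translated.
    k_full = np.zeros(padded.shape, dtype=np.float64)
    k_full[:kh, :kw] = kernel
    k_full = np.roll(k_full, shift=(-ph, -pw), axis=(0, 1))
    spectrum = np.fft.fft2(padded) * np.fft.fft2(k_full)  # element-wise product
    blurred = np.real(np.fft.ifft2(spectrum))
    # Crop the padding away to restore the original resolution.
    return blurred[ph:ph + sharp.shape[0], pw:pw + sharp.shape[1]]
```

Padding before the transform matters because the discrete Fourier transform implements circular convolution; without it, content from one image edge would bleed into the opposite edge.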
[0015] B. Generate infrared optical flow data. The specific steps are as follows:
[0016] B1. Constructing a cross-modal style transfer model, the specific steps are as follows:
[0017] B1.1 Perform physical prior-guided preprocessing on the visible light image to obtain a preprocessed image containing only thermal radiation structure information;
[0018] B1.2 Construct a cycle-consistent generative adversarial network containing two generators and two discriminators;
[0019] B1.3 Construct a hybrid loss function that includes adversarial loss, cycle consistency loss, identity loss, and perceptual loss. The perceptual loss is calculated as follows:
[0020] $$\mathcal{L}_{per} = \sum_{l} w_l \left\| \phi_l(\hat{I}) - \phi_l(I) \right\|_1$$
[0021] where $\mathcal{L}_{per}$ denotes the perceptual loss, $\hat{I}$ denotes the generated infrared image, $I$ denotes the reference target image, $\phi_l(\cdot)$ denotes the feature map of the $l$-th layer of the feature extraction network, $w_l$ denotes the weight of the $l$-th layer feature map, and $\|\cdot\|_1$ denotes the L1 norm;
[0022] B2. Using the unpaired visible light and infrared images from the basic image dataset described in step A, train the cycle-consistent generative adversarial network to learn the cross-modal style mapping;
[0023] B3. Input a publicly available visible light video dataset with precise optical flow labels into the trained cycle-consistent generative adversarial network to transfer its image style to the infrared domain;
[0024] B4. Generate a cross-modal infrared optical flow dataset that retains the exact optical flow labels of the source dataset, for supervised training of the subsequent optical flow estimation model.
[0025] C. Construct a global motion aggregation optical flow estimation network, train the network using infrared optical flow data, and obtain the optical flow estimation model. The specific steps are as follows:
[0026] C1. Construct an optical flow estimation network based on the global motion aggregation mechanism, consisting of a feature encoder, a global motion aggregation module, and a gated recurrent unit;
[0027] C2. Input the infrared optical flow data generated in step B into the optical flow estimation network, and use the feature encoder to extract multi-scale feature maps of two adjacent infrared frames, denoted $F_1$ and $F_2$;
[0028] C3. Execute the global motion aggregation module, which constructs motion propagation equations based on the feature similarity of global pixel pairs. The specific steps are as follows:
[0029] C3.1 Map feature map $F_1$ to the query vector $Q$, and map feature map $F_2$ to the key vector $K$ and the value vector $V$;
[0030] C3.2 Calculate the normalized dot-product similarity between the query vector $Q$ and the key vector $K$ to construct the global motion propagation weight matrix. The specific calculation formula is as follows:
[0031] $$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right)$$
[0032] where $A$ is the global motion propagation weight matrix, $d$ is the feature dimension, and $\top$ denotes the transpose operation;
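The weight computation of step C3 can be illustrated with a short NumPy sketch. It assumes flattened feature maps of shape (pixels, channels) and random stand-in projection matrices; the function name `global_motion_aggregate` and the choice of which frame feeds the query versus the key and value are assumptions for illustration, not details confirmed by the text.

```python
import numpy as np

def global_motion_aggregate(f1, f2, wq, wk, wv):
    """Scaled dot-product aggregation over all pixel pairs.

    f1, f2: (N, C) flattened feature maps of two adjacent frames.
    wq, wk, wv: (C, d) projection matrices (learned in a real network).
    Returns the (N, N) propagation weights and the (N, d) aggregated features.
    """
    q, k, v = f1 @ wq, f2 @ wk, f2 @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights, weights @ v
```

Each row of the weight matrix sums to one, so every pixel receives a convex combination of value vectors drawn from the whole frame, which is what lets well-textured regions inform occluded or weakly textured ones.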
[0033] C4. Execute a self-similarity matching mechanism and use a quadtree block strategy for dynamic motion residual optimization. The specific steps are as follows:
[0034] C4.1 Divide the image features into several non-overlapping image blocks and calculate attention independently within each block;
[0035] C4.2 Aggregate features using the intra-block submatrices, which reduces computational complexity and balances efficiency against accuracy. The specific calculation formula is as follows:
[0036] $$\mathrm{Attn}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i$$
[0037] where $Q_i$, $K_i$, and $V_i$ are the query, key, and value submatrices within the $i$-th quadtree image block, and $d$ is the feature dimension;
[0038] C5. Perform iterative optimization and sub-pixel-level correction, using a gated recurrent unit to correct the optical flow field over multiple iterations. The specific steps are as follows:
[0039] C5.1 Input the current optical flow estimate $f_n$, the global aggregation features, and the context information into the gated recurrent unit to calculate the optical flow correction $\Delta f_n$ of the $n$-th iteration;
[0040] C5.2 Update the current optical flow field based on the optical flow correction. The update equation is as follows:
[0041] $$f_{n+1} = f_n + \Delta f_n$$
[0042] C5.3 Construct the endpoint error loss function $\mathcal{L}_{EPE}$, which calculates the Euclidean distance between the predicted optical flow field and the ground-truth infrared optical flow label to drive network parameter updates. The specific calculation formula is as follows:
[0043] $$\mathcal{L}_{EPE} = \left\| f_{gt} - f_{pred} \right\|_2$$
[0044] where $f_{gt}$ is the ground-truth optical flow label, $f_{pred}$ is the predicted optical flow field, and $\|\cdot\|_2$ denotes the L2 norm;
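The residual update and the endpoint error of step C5 can be sketched as below. The gated recurrent unit is replaced by an arbitrary callable `delta_fn`, which is a stand-in for illustration only; the function names and the averaging of the endpoint error over pixels are likewise assumptions.

```python
import numpy as np

def refine_flow(flow0, delta_fn, num_iters=12):
    """Repeatedly add a predicted correction to the current flow estimate.

    delta_fn stands in for the GRU update operator; 12 iterations is the
    count quoted later in the embodiment.
    """
    flow = flow0.copy()
    for _ in range(num_iters):
        flow = flow + delta_fn(flow)
    return flow

def endpoint_error(flow_pred, flow_gt):
    """Mean per-pixel Euclidean distance between two (H, W, 2) flow fields."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())
```

With a toy corrector that closes half of the remaining gap at each step, for example `lambda cur: 0.5 * (gt - cur)`, the endpoint error halves per iteration, which shows why a fixed number of small residual updates can reach sub-pixel accuracy.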
[0045] D. Construct an infrared video deblurring network, use an optical flow estimation model to guide spatiotemporal feature fusion, and construct a joint loss function to calculate the total error. The specific steps are as follows:
[0046] D1. Construct an optical flow-guided attention module, which consists, in order, of a layer normalization layer, an optical flow-guided sparse window multi-head self-attention unit, and a feedforward neural network;
[0047] D2. Construct an infrared video deblurring network based on a U-shaped architecture, consisting of an encoder, a bottleneck layer, and a decoder; the encoder and decoder are each built by stacking several optical flow-guided attention modules;
[0048] D3. Perform optical flow-guided sparse window multi-head self-attention computation, using the global motion aggregation optical flow estimation model obtained in step C to guide sparse sampling of spatiotemporal key elements. The specific steps are as follows:
[0049] D3.1 Divide the input video frame feature map into non-overlapping local windows, and take the window features at the current time $t$ as the reference, denoted as the query features;
[0050] D3.2 Using the optical flow estimation model trained in step C, calculate the optical flow offset $O_{t \to t'}$ between the reference frame $I_t$ and an adjacent frame $I_{t'}$;
[0051] D3.3 Based on the optical flow offset, locate in the adjacent frame $I_{t'}$ the spatial region corresponding to the current query window, and sparsely sample the pixels within that region to obtain a set of key elements $\Omega$ that is highly correlated with the query features;
[0052] D3.4 Map the features of the current query window to the query vector $Q$, and map the key element set $\Omega$ to the key vector $K$ and the value vector $V$;
[0053] D3.5 Calculate the multi-head self-attention weights between the query vector $Q$ and the key vector $K$, and use them to compute the window feature output that incorporates cross-frame spatiotemporal information;
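Steps D3.1 to D3.5 can be sketched as follows. This is a simplified, single-head illustration under stated assumptions: the window is relocated by its mean flow offset, sparse sampling is a fixed stride, the learned projections are omitted so the sampled keys double as values, and the names `sample_keys` and `window_attention` are hypothetical.

```python
import numpy as np

def sample_keys(neigh_feat, flow, y0, x0, win=4, stride=2):
    """Gather sparse key features from the flow-shifted window of an adjacent frame.

    neigh_feat: (H, W, C) features of the adjacent frame.
    flow:       (H, W, 2) offsets (dx, dy) from the reference to the adjacent frame.
    """
    H, W, C = neigh_feat.shape
    # Relocate the whole window by the mean offset inside the query window.
    dx, dy = flow[y0:y0 + win, x0:x0 + win].reshape(-1, 2).mean(axis=0)
    ys = np.clip(np.arange(y0, y0 + win, stride) + int(round(dy)), 0, H - 1)
    xs = np.clip(np.arange(x0, x0 + win, stride) + int(round(dx)), 0, W - 1)
    return neigh_feat[np.ix_(ys, xs)].reshape(-1, C)

def window_attention(query, keys):
    """Single-head scaled dot-product attention of window queries over sampled keys."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys  # keys double as values in this sketch
```

Sampling only a strided subset of the flow-aligned region keeps the key set small, so attention cost grows with the number of sampled keys rather than with the full frame size.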
[0054] D4. Construct a joint loss function that includes pixel loss, structural similarity loss, and optical flow constraints. The specific steps are as follows:
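[0055] D4.1 Construct the pixel consistency loss function $\mathcal{L}_{pix}$:
[0056] $$\mathcal{L}_{pix} = \frac{1}{T \cdot H \cdot W} \sum_{t=1}^{T} \left\| \hat{I}_t - I_t^{gt} \right\|_1$$
[0057] where $\|\cdot\|_1$ denotes the L1 norm, $T$ is the temporal length, $H \times W$ is the spatial resolution, $\hat{I}_t$ is the deblurred output, and $I_t^{gt}$ is the sharp ground-truth (GT) image;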
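[0058] D4.2 Construct the structural similarity loss function $\mathcal{L}_{ssim}$. The specific calculation formula is as follows:
[0059] $$\mathcal{L}_{ssim} = 1 - \mathrm{SSIM}(x, y)$$
[0060] $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
[0061] where $x$ denotes a deblurred image patch, $y$ denotes a ground-truth image patch, $\mu$ denotes the mean, $\sigma^2$ denotes the variance, $\sigma_{xy}$ denotes the covariance, and $C_1$ and $C_2$ are constants that maintain computational stability;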
[0062] D4.3 Construct the optical flow temporal consistency constraint loss function $\mathcal{L}_{flow}$. The specific calculation formula is as follows:
[0063] $$\mathcal{L}_{flow} = \left\| \hat{I}_t - \mathcal{W}\big(\hat{I}_{t+1}, f_{t \to t+1}\big) \right\|_1$$
[0064] where $\hat{I}_t$ denotes the deblurred frame at the current moment, $\hat{I}_{t+1}$ denotes the deblurred frame at the next moment, $f_{t \to t+1}$ denotes the predicted optical flow field, $\mathcal{W}(\cdot)$ denotes the spatial transformation (warping) operation based on optical flow, and $\|\cdot\|_1$ denotes the L1 norm;
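[0065] D4.4 Construct the overall joint loss function $\mathcal{L}_{total}$ as a weighted sum of the pixel consistency loss, the structural similarity loss, and the optical flow temporal consistency constraint loss:
[0066] $$\mathcal{L}_{total} = (1 - \alpha)\,\mathcal{L}_{pix} + \alpha\,\mathcal{L}_{ssim} + \beta\,\mathcal{L}_{flow}$$
[0067] where $\mathcal{L}_{pix}$, $\mathcal{L}_{ssim}$, and $\mathcal{L}_{flow}$ are the pixel consistency, structural similarity, and optical flow temporal consistency losses; $\alpha$ is the perception-pixel balance factor, which adjusts the relative contribution of the SSIM and L1 losses; and $\beta$ is the strength of the temporal regularization;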
[0068] D4.5 Calculate the total error using the joint loss function;
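The joint loss of step D4 can be sketched as below. Several simplifications are assumptions made for the example and are not taken from the patent: SSIM is computed over the whole frame instead of local patches, warping uses nearest-neighbour sampling with integer-rounded flow, the balance between the L1 and SSIM terms is written as a convex combination, and the default values of `alpha` and `beta` are placeholders.

```python
import numpy as np

def l1_loss(pred, gt):
    return float(np.abs(pred - gt).mean())

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over whole arrays (one window); real pipelines use local windows."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def warp_nearest(frame, flow):
    """Backward-warp `frame` with integer-rounded flow (dx, dy); edges are clamped."""
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(xs + np.rint(flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(ys + np.rint(flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def joint_loss(pred, gt, flows, alpha=0.5, beta=0.1):
    """pred, gt: (T, H, W) in [0, 1]; flows: (T-1, H, W, 2) from frame t to t+1."""
    pix = l1_loss(pred, gt)
    ssim = np.mean([1.0 - ssim_global(p, g) for p, g in zip(pred, gt)])
    # Temporal term: each frame should match the next frame warped back onto it.
    temp = np.mean([l1_loss(pred[t], warp_nearest(pred[t + 1], flows[t]))
                    for t in range(len(flows))])
    return (1 - alpha) * pix + alpha * ssim + beta * temp
```

The temporal term penalizes frame-to-frame differences that the motion field cannot explain, which is the mechanism the text relies on to suppress inter-frame jitter.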
[0069] E. Train the deblurring network to obtain the infrared video deblurring model. The specific steps are as follows:
[0070] E1. Dataset partitioning: divide the paired sharp-blurred infrared video deblurring dataset constructed in step A into a training set and a test set;
[0071] E2. Network parameter initialization: initialize the parameters of the deblurring network constructed in step D, load the parameters of the optical flow estimation model trained in step C into the network, and either freeze or fine-tune them;
[0072] E3. End-to-end training and output: feed the training set into the network for forward propagation, calculate the joint loss, and update the network parameters by backpropagation using the stochastic gradient descent algorithm until the model converges; then output the final infrared video deblurring model.
[0073] F. Use the infrared video deblurring model to process the input blurred infrared video and output a high-precision, sharp infrared video that meets the temporal consistency constraint. The specific steps are as follows:
[0074] F1. Acquire the real blurred infrared video sequence to be processed and preprocess it into consecutive multi-frame sequences in the format required by the network, as input data for the inference stage;
[0075] F2. Input the preprocessed blurred infrared video sequence into the trained infrared video deblurring model obtained in step E, and use the optical flow estimation network module embedded in the model to calculate the optical flow offsets between adjacent frames in the sequence;
[0076] F3. Perform spatiotemporal feature reconstruction: according to the optical flow offsets, the deblurring network module inside the model guides the sparse windows to sample key elements in the corresponding spatial regions of adjacent frames, and fuses cross-frame spatiotemporal features through multi-head self-attention to restore the thermal radiation structure and texture details of the current frame;
[0077] F4. Output the consecutive video frames that have undergone spatiotemporal feature reconstruction and deblurring, and assemble them at the original video frame rate to generate a high-precision, sharp infrared video that meets the temporal consistency constraint, completing the end-to-end infrared video deblurring task.
[0078] Compared with existing technologies, this invention has the following advantages: It implements a complete infrared video deblurring method. Considering the scarcity of infrared optical flow labels and the significant differences between visible and infrared modes, it utilizes cross-modal style transfer technology to transfer visible light datasets with accurate optical flow labels to the infrared domain, generating infrared optical flow data. This effectively solves the problem of poor generalization performance of infrared motion models due to a lack of supervised data. It fully exploits the spatiotemporal dependencies between adjacent frames using an optical flow-guided sparse window multi-head self-attention mechanism, combined with an improved global motion aggregation optical flow estimator, effectively solving the problems of occlusion processing and dynamic blur recovery under weak infrared texture conditions. It employs a pixel-level and perception-level joint optimization strategy and introduces optical flow temporal consistency constraints, effectively restoring the thermal radiation structural details of infrared images and eliminating inter-frame jitter while ensuring image pixel accuracy. This invention integrates the advantages of cross-modal transfer learning, optical flow-guided attention, and joint loss optimization, significantly improving the peak signal-to-noise ratio and structural similarity of infrared video deblurring, and has broad application and promotion value in complex industrial scenarios.
Attached Figure Description
[0079] Figure 1 is a flowchart illustrating the method described in this invention.
Detailed Implementation
[0080] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0081] The purpose of this invention is to disclose an infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention. It utilizes cross-modal style transfer technology to map visible light optical flow data to the infrared domain, addressing the problems of scarce infrared optical flow labels and reduced generalization performance due to modal differences. An improved global motion aggregation optical flow estimator enhances the ability to capture motion in weakly textured and occluded infrared regions. An optical flow-guided sparse window multi-head self-attention mechanism accurately captures long-distance spatiotemporal dependencies between adjacent frames during the feature extraction stage, solving the problems of missing contextual information and poor dynamic region recovery in single-frame deblurring. A pixel-level and perception-level joint optimization strategy balances the numerical accuracy and structural details of the deblurred image while constraining temporal consistency. The method consists of six stages: constructing an infrared video deblurring dataset; generating infrared optical flow data; training a global motion aggregation optical flow estimation network based on the infrared optical flow data; constructing a deblurring network including an optical flow-guided attention module and designing a joint loss function; training the deblurring network; inputting the blurred infrared video into the infrared video deblurring model for deblurring processing, and outputting a clear infrared video. 
The specific process includes: (1) constructing an infrared video deblurring dataset; (2) generating infrared optical flow data; (3) constructing a global motion aggregation optical flow estimation network, training the network using infrared optical flow data, and obtaining an optical flow estimation model; (4) constructing an infrared video deblurring network, using the optical flow estimation model to guide the fusion of spatiotemporal features, and constructing a joint loss function to calculate the total error; (5) training the deblurring network to obtain an infrared video deblurring model; (6) using the infrared video deblurring model to process the input blurred infrared video and output a clear infrared video with high precision and meeting the temporal consistency constraint.
[0082] 1. Construct an infrared video deblurring dataset. The specific steps are as follows:
[0083] 1.1 Constructing a real-world infrared dataset and a synthetic deblurred dataset, the specific steps are as follows:
[0084] 1.1.1 Obtaining Unpaired Infrared-Visible Light Image Sets in Real-World Scenarios. This embodiment constructs the "Coal Mine Rock Tunneling Scene High Coal Dust and Fog Dual-Light Contrast Dataset" (CMRRD-HCFDLC). The data acquisition environment is an underground coal mine tunneling operation site, and the hardware equipment uses an intrinsically safe 300,000-pixel long-wave infrared (LWIR) camera and a visible light camera. The acquisition scenes cover drilling operations under strong interference, operation transitions under weak interference, and interference-free idle states, acquiring a total of 8,447 images. This dataset contains unpaired infrared and visible light images, serving as the training data source for the subsequent style transfer network;
[0085] 1.1.2 Sharp infrared images are selected as source images, and a paired sharp-blurred infrared video dataset (CMRRD-HCFBI) is generated using a trajectory-modeling-based motion blur enhancement method. The specific process includes:
[0086] (1) Decompose the motion parameters of the equipment into motion length and main motion direction. In this embodiment, the motion length follows a uniform distribution over a range specified in pixels; the dominant motion direction is sampled at random within a specified angular interval; and a random perturbation term following a Gaussian distribution is introduced, subject to a constraint, to simulate the nonlinear vibration characteristics of mechanical equipment on uneven surfaces or during operation;
[0087] (2) Construct a normalized motion blur kernel matrix using the improved Bresenham line generation algorithm. The blur kernel size is set so that it is always odd. A third-order energy decay function is introduced to simulate energy dissipation during motion, and a normalization threshold is set: when the kernel element values fall below this threshold, the unit impulse function is used instead;
[0088] (3) Using the Fourier transform, the source image is convolved with the motion blur kernel matrix. To eliminate boundary artifacts, the original infrared image is padded before convolution and cropped back to its original size afterwards, yielding the simulated blurred infrared image. In this embodiment, the final training images are uniformly resized to [resolution value], which yields the paired sharp-blurred infrared video training set;
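The kernel construction of sub-steps (1) and (2) can be approximated with the sketch below. It deliberately swaps the improved Bresenham rasterizer for plain sub-pixel sampling with bilinear splatting, and uses a cubic falloff as a stand-in for the third-order energy decay function; the function name, the sampling ranges, the decay form, and the threshold value are all hypothetical, since the original values are not available in the text.

```python
import numpy as np

def motion_kernel(length, angle_deg, decay=0.0, eps=1e-8):
    """Rasterise a straight motion path into a normalised, odd-sized blur kernel."""
    size = max(2 * int(np.ceil(length / 2)) + 1, 3)  # odd size that covers the path
    kernel = np.zeros((size, size))
    c = size // 2
    theta = np.deg2rad(angle_deg)
    steps = max(int(np.ceil(length * 4)), 1)         # about 4 sub-pixel samples per pixel
    for s in np.linspace(-0.5, 0.5, steps):
        x = c + s * length * np.cos(theta)
        y = c - s * length * np.sin(theta)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        fx, fy = x - x0, y - y0
        w = 1.0 - decay * abs(2 * s) ** 3            # cubic falloff, decay in [0, 1]
        # Bilinear splat of the sample onto its four neighbouring cells.
        for yy, xx, ww in ((y0, x0, (1 - fx) * (1 - fy)), (y0, x0 + 1, fx * (1 - fy)),
                           (y0 + 1, x0, (1 - fx) * fy), (y0 + 1, x0 + 1, fx * fy)):
            if 0 <= yy < size and 0 <= xx < size:
                kernel[yy, xx] += w * ww
    total = kernel.sum()
    if total < eps:                                  # degenerate path: unit impulse
        kernel[:] = 0.0
        kernel[c, c] = 1.0
        return kernel
    return kernel / total

# Hypothetical sampling of the motion parameters (ranges are placeholders).
rng = np.random.default_rng(0)
length = rng.uniform(5.0, 15.0)
angle = rng.uniform(0.0, 180.0) + float(np.clip(rng.normal(0.0, 2.0), -5.0, 5.0))
kernel = motion_kernel(length, angle, decay=0.5)
```

Normalizing the kernel to unit sum keeps the mean brightness of the blurred frame equal to that of the sharp frame, which matters for infrared data where intensity encodes temperature.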
[0089] 2. Generate infrared optical flow data. The specific steps are as follows:
[0090] 2.1 Physical prior-guided preprocessing of visible light images. The visible light image is passed through a Gaussian low-pass filter with a fixed kernel size and a preset standard deviation to filter out high-frequency color texture noise. The filtered image is then converted to the YUV color space and only the Y channel (luminance component) is extracted, giving a preprocessed image containing only thermal radiation structure information;
[0091] 2.2 Construct a cycle-consistent generative adversarial network containing two generators and two discriminators. Train the network using the CMRRD-HCFDLC dataset described in step 1.1.1;
[0092] 2.3 Construct a hybrid loss function comprising adversarial loss, cycle consistency loss, identity loss, and perceptual loss. The perceptual loss uses a pre-trained VGG16 network as a feature extractor and calculates the Euclidean distance between the generated image and the target image on the second- and third-layer feature maps. During training, fixed weights are assigned to the cycle consistency loss and the perceptual loss in the total loss function, which ensures that the generated infrared images retain their geometric structure while taking on a realistic infrared thermal radiation style;
[0093] 2.4 Using the trained first generator, the visible light images from the FlyingChairs dataset are converted into corresponding infrared-style images, while the optical flow labels (.flo files) of the original dataset are retained. This produces an infrared optical flow training dataset (FlyingChairs-IR) containing approximately 22,000 sample pairs, which is used for supervised training of the subsequent optical flow estimation network.
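The preprocessing of step 2.1 (Gaussian low-pass filtering followed by Y-channel extraction) can be sketched as follows. The kernel size and standard deviation are placeholder values, because the embodiment's settings are not recoverable from the text, and the BT.601 luma weights are an assumption about which YUV conversion is meant.

```python
import numpy as np

def gaussian_kernel1d(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def preprocess_visible(rgb, size=5, sigma=1.0):
    """rgb: (H, W, 3) float array in [0, 1]; returns the low-pass-filtered Y channel."""
    k = gaussian_kernel1d(size, sigma)
    pad = size // 2
    out = np.pad(rgb, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Separable Gaussian: filter along the width axis, then along the height axis.
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, out)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, out)
    # BT.601 luma weights give the Y component of YUV.
    return out @ np.array([0.299, 0.587, 0.114])
```

Dropping chroma and high-frequency texture before style transfer narrows the gap the generator has to bridge, since infrared frames carry neither color nor fine surface texture.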
[0094] 3. Construct a global motion aggregation optical flow estimation network, train the network using infrared optical flow data, and obtain the optical flow estimation model. The specific steps are as follows:
[0095] 3.1 Constructing the Global Motion Aggregation Optical Flow Estimator (SM-GMA):
[0096] In this embodiment, an SM-GMA network is constructed to replace the traditional SPyNet. The network mainly consists of a feature encoder, a global motion aggregation module, a self-similarity matching mechanism, and a gated recurrent unit (GRU). The feature encoder uses a residual network structure to extract features, and the channel dimension of the output feature map is fixed to balance computational efficiency with feature representation capability;
[0097] 3.2 Data Input and Feature Extraction:
[0098] The infrared-style optical flow dataset (FlyingChairs-IR) generated in step 2 is input into the network. The feature encoder extracts multi-scale feature maps of two adjacent infrared frames, denoted $F_1$ and $F_2$. In this embodiment, the resolution of the feature map is [missing information] of the input image, and the feature dimension is $d$;
[0099] 3.3 Perform global motion aggregation module calculations, the specific steps of which are as follows:
[0100] 3.3.1 Map feature map $F_1$ to the query vector $Q$, and map feature map $F_2$ to the key vector $K$ and the value vector $V$;
[0101] 3.3.2 Calculate the normalized dot-product similarity between the query vector $Q$ and the key vector $K$ to construct the global motion propagation weight matrix. To prevent vanishing gradients caused by excessively large dot products, the dot product is divided by the scaling factor $\sqrt{d}$, where $d$ is the feature dimension. The specific calculation formula is as follows:
[0102] $$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right)$$
[0103] The weight matrix $A$ is used to propagate, in a weighted manner, high-confidence texture features from unoccluded areas to occluded areas, addressing the problem of optical flow breakage under weak infrared textures.
[0104] 3.4 Perform self-similarity matching and quadtree block optimization. The specific steps are as follows:
[0105] 3.4.1 A quadtree block attention mechanism is introduced to divide the feature map into non-overlapping image blocks: the feature map is recursively divided into sub-blocks of different scales based on a quadtree strategy;
[0106] 3.4.2 Attention is calculated independently within each block and then aggregated. The specific calculation formula is as follows:
[0107] $$\mathrm{Attn}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i$$
[0108] where $Q_i$, $K_i$, and $V_i$ are the query, key, and value submatrices of the $i$-th quadtree image block, and $d$ is the feature dimension. Large-scale blocks capture the overall large displacement of an object, and small-scale blocks resolve the minute movements at object edges, achieving a balance between computational complexity and accuracy.
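Block-wise attention as described in step 3.4 can be sketched as below. The sketch assumes a fixed-depth quadtree, so every node is split the same number of times and the leaves form a uniform grid; the criterion the method uses to decide where to split is not given in the text. Identity projections are used, so each block's features serve as query, key, and value alike.

```python
import numpy as np

def block_attention(feat, depth=1):
    """Self-attention computed independently inside each quadtree leaf.

    feat: (H, W, C) float array with H and W divisible by 2**depth.
    """
    H, W, C = feat.shape
    n = 2 ** depth
    bh, bw = H // n, W // n
    out = np.empty_like(feat)
    for i in range(n):
        for j in range(n):
            rows = slice(i * bh, (i + 1) * bh)
            cols = slice(j * bw, (j + 1) * bw)
            blk = feat[rows, cols].reshape(-1, C)
            scores = blk @ blk.T / np.sqrt(C)            # scaled dot products
            scores -= scores.max(axis=-1, keepdims=True)
            w = np.exp(scores)
            w /= w.sum(axis=-1, keepdims=True)
            out[rows, cols] = (w @ blk).reshape(bh, bw, C)
    return out
```

Splitting an N-pixel map into B equal blocks reduces the number of pairwise scores from N² to N²/B, which is the complexity saving the text refers to.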
[0109] 3.5 Perform iterative optimization and sub-pixel level correction. The specific steps are as follows:
[0110] The optical flow field is updated in multiple steps using GRU units. In this embodiment, the total number of iterations is set to $N = 12$;
[0111] 3.5.1 In the $n$-th iteration ($n = 0, 1, \dots, N-1$), the current optical flow estimate $f_n$, the global aggregated features, and the contextual features are input into the GRU, which outputs the optical flow correction $\Delta f_n$;
[0112] 3.5.2 Update the current optical flow field based on the optical flow correction. The update equation is as follows:
[0113] $$f_{n+1} = f_n + \Delta f_n$$
[0114] After 12 iterations, the final high-precision predicted optical flow field $f_{pred}$ is output;
[0115] 3.5.3 Calculate the endpoint error loss function $\mathcal{L}_{EPE}$ to drive network parameter updates:
[0116] $$\mathcal{L}_{EPE} = \left\| f_{gt} - f_{pred} \right\|_2$$
[0117] where $f_{gt}$ is the ground-truth infrared optical flow label;
[0118] 3.6 Network training parameter settings:
[0119] During training in this step, the AdamW optimizer is used with a weight decay factor of [value missing]. A one-cycle learning rate scheduling strategy is adopted, with the initial learning rate set to [value missing] and adjusted dynamically over the course of training. A total of 100,000 iterations are performed on the FlyingChairs-IR dataset, with a batch size of 8.
[0120] 4. Construct an infrared video deblurring network, use an optical flow estimation model to guide spatiotemporal feature fusion, and construct a joint loss function to calculate the total error. The specific steps are as follows:
[0121] 4.1 Constructing the Optical Flow Guided Attention Module (FGAB):
[0122] 4.1.1 Construct the FGAB module, which includes LayerNorm, Optical Flow Guided Sparse Window Multi-Head Self-Attention Unit (FGSW-MSA), and Feedforward Neural Network (FFN).
[0123] 4.1.2 Set up the FFN: the feedforward neural network consists of two fully connected layers and a GELU activation function, with an expansion ratio of 2;
[0124] 4.2 Construct an Infrared Video Deblurring Network (CFST) based on a U-shaped architecture. The specific steps are as follows:
[0125] 4.2.1 The network input is a sequence of $T$ consecutive blurred infrared video frames, where $T$ is set to 7 (that is, 7 frames are input each time and the middle frame, the 4th, is reconstructed).
[0126] 4.2.2 The input sequence first passes through a [value missing] convolutional layer and 5 residual blocks to extract shallow features;
[0127] 4.2.3 The downsampling stage is then performed. Each stage consists of two optical flow guided attention modules (FGAB) and a convolutional layer with a stride of 2. The number of feature channels doubles at each stage (e.g., 64 -> 128 -> 256).
[0128] 4.2.4 The bottleneck layer contains 5 residual blocks to further refine deep spatiotemporal features;
[0129] 4.2.5 The upsampling stage consists of an upsampling layer and an FGAB module. Upsampling is performed by pixel rearrangement (PixelShuffle), and features of the same scale are fused;
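The pixel rearrangement used for upsampling can be illustrated in NumPy. This sketch follows the common PixelShuffle layout, in which a block of r×r channels is traded for an r-times larger spatial grid; the function name and the toy input are illustrative only.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2)  # 4 channels, 2x2 spatial grid
y = pixel_shuffle(x, r=2)                  # 1 channel, 4x4 spatial grid
```

Since the rearrangement only moves values, it halves the channel count by four while doubling each spatial dimension, mirroring the channel doubling of the stride-2 downsampling stages in reverse.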
[0130] 4.3 Perform optical flow-guided multi-head self-attention calculation for sparse windows. The specific steps are as follows:
[0131] 4.3.1 Window partitioning: the input feature map (size: [value missing]) is divided into non-overlapping local windows, with the window size set to [value missing]. The window features at the current time are used as queries.
[0132] 4.3.2 Optical flow guidance: using the SM-GMA optical flow estimation model trained in step 3, the optical flow offset between the reference frame and each adjacent frame is calculated. The parameters of the optical flow estimation model are frozen during the initial training phase to provide stable motion guidance;
[0133] 4.3.3 Sparse sampling: based on the calculated optical flow offset, the spatial region in the adjacent frame that corresponds to the current query window is located. Unlike traditional global attention scanning, this embodiment samples only the [value missing] most relevant key elements within this corresponding region to construct a set of key elements. This approach exploits long-range spatiotemporal dependencies while avoiding the redundancy of full-map computation.
[0134] 4.3.4 Multi-head attention calculation:
[0135] The current window features are mapped to the query vector $Q$, and the sparsely sampled set of key elements is mapped to the key vector $K$ and the value vector $V$;
[0136] Multi-head self-attention is then calculated, with the number of attention heads set to [value missing];
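The flow-guided sparse sampling of step 4.3 can be sketched for a single query position and a single head. Everything below is a simplified assumption: the search `radius`, the default `top_k`, the use of raw features as both keys and values, and the function name are hypothetical and not taken from the embodiment.

```python
import numpy as np

def flow_guided_sparse_attention(query, neighbor_feat, center, flow,
                                 radius=2, top_k=4):
    """Attend from one query vector to the top-k keys near the flow target.

    query:         (D,) feature at the query position in the current frame
    neighbor_feat: (H, W, D) feature map of an adjacent frame
    center:        (y, x) position of the query in the current frame
    flow:          (dy, dx) optical flow offset towards the adjacent frame
    """
    H, W, D = neighbor_feat.shape
    # Locate the corresponding position in the adjacent frame.
    cy = int(np.clip(round(center[0] + flow[0]), 0, H - 1))
    cx = int(np.clip(round(center[1] + flow[1]), 0, W - 1))
    y0, y1 = max(cy - radius, 0), min(cy + radius + 1, H)
    x0, x1 = max(cx - radius, 0), min(cx + radius + 1, W)
    candidates = neighbor_feat[y0:y1, x0:x1].reshape(-1, D)

    scores = candidates @ query / np.sqrt(D)
    keep = np.argsort(scores)[-top_k:]        # indices of the top-k keys
    weights = np.exp(scores[keep] - scores[keep].max())
    weights /= weights.sum()
    return weights @ candidates[keep]         # keys double as values here

neighbor = np.zeros((8, 8, 4))
neighbor[5, 6] = 10.0                         # the only informative feature
query = np.ones(4)
guided = flow_guided_sparse_attention(query, neighbor, (3, 3), (2, 3))
unguided = flow_guided_sparse_attention(query, neighbor, (1, 1), (0, 0), radius=1)
```

In the toy example the flow offset steers the search region onto the informative feature, whereas a search that ignores the motion never sees it. A multi-head version would split the feature dimension and repeat the same procedure per head.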
[0137] 4.4 Design a joint loss function that includes pixel loss, structural similarity loss, and optical flow constraints. The specific steps are as follows:
[0138] 4.4.1 Construct the pixel consistency loss function $\mathcal{L}_{pix}$: the mean absolute error (L1 loss) between the deblurred infrared image $\hat{I}$ output by the network and the ground-truth sharp infrared image $I^{gt}$ is calculated to constrain the fidelity of the deblurred image at the pixel level. The specific calculation formula is as follows:
[0139] $\mathcal{L}_{pix} = \frac{1}{T H W} \sum_{t=1}^{T} \left\| \hat{I}_t - I_t^{gt} \right\|_1$
[0140] where $\|\cdot\|_1$ denotes the L1 norm, $T$ is the temporal length, $H \times W$ is the spatial resolution, $\hat{I}_t$ is the deblurred output, and $I_t^{gt}$ is the sharp ground-truth (GT) image. This loss function directly optimizes the pixel-level differences of the image;
[0141] 4.4.2 Constructing the structural similarity loss function:
[0142] To enhance the recovery of thermal radiation structures and edge details in infrared images, a perceptual loss based on SSIM is constructed.
[0143] Calculation method: a sliding Gaussian window of size [value missing] is used to calculate the local structural similarity between the deblurred image and the ground-truth image.
[0144] Parameter settings: the constants in the formula are set to [value missing] and [value missing];
[0145] Loss definition: in this embodiment, the perception-pixel balance factor is set to [value missing] to improve visual perceptual quality while ensuring pixel accuracy;
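A structural similarity loss of this kind can be sketched as follows. The sketch makes several assumptions that are not from the source: it computes the statistics over the whole patch instead of a sliding Gaussian window, it defines the loss as 1 − SSIM, and it uses the commonly quoted constants `c1 = 0.01**2` and `c2 = 0.03**2` for images normalized to [0, 1], since the embodiment's values are not available.

```python
import numpy as np

def ssim_index(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two patches in [0, 1], computed from patch-wide statistics.

    c1 and c2 are assumed stabilizing constants, not the embodiment's values.
    """
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def ssim_loss(x, y):
    """Assumed loss form: 1 - SSIM, so identical patches give zero loss."""
    return 1.0 - ssim_index(x, y)

patch = np.linspace(0.0, 1.0, 64).reshape(8, 8)
same = ssim_loss(patch, patch)            # identical structure
inverted = ssim_loss(patch, 1.0 - patch)  # inverted structure
```

The loss is driven by means, variances, and covariance rather than per-pixel differences, which is why it complements the L1 term when recovering edges and thermal structures.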
[0146] 4.4.3 Constructing the optical flow temporal consistency constraint loss function:
[0147] To suppress non-physical jitter between video frames, a temporal constraint is introduced using the optical flow estimation model trained in step 3 (with its parameters frozen).
[0148] Specific steps: the optical flow model predicts the optical flow field from the current deblurred frame to the next deblurred frame, and a spatial transformer network (STN) applies a warping transformation based on this optical flow field to obtain the aligned image;
[0149] Loss calculation: the L1 distance between the aligned image and the corresponding deblurred frame is calculated:
[0150]
[0151] Weight settings: the weight coefficient of this loss term is set to [value missing] to constrain the smoothness of motion trajectories between adjacent frames;
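The warping and the temporal consistency term can be sketched in NumPy. The conventions here are assumptions of the sketch, since the source formula is not available: the flow stores (dx, dy) per pixel, the next frame is backward-warped onto the current frame with bilinear sampling, and the L1 distance is averaged over all pixels. The functions `warp` and `temporal_loss` are hypothetical stand-ins for the STN-based operation.

```python
import numpy as np

def warp(image, flow):
    """Backward-warp `image` (H, W) with `flow` (H, W, 2) storing (dx, dy).

    Output pixel (y, x) is sampled bilinearly at (y + dy, x + dx).
    """
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def temporal_loss(frame_t, frame_next, flow):
    """Mean absolute difference between frame t and the aligned next frame."""
    return np.abs(frame_t - warp(frame_next, flow)).mean()

frame_t = np.arange(36, dtype=float).reshape(6, 6)
frame_next = np.roll(frame_t, 1, axis=1)   # content shifted right by 1 pixel
flow = np.zeros((6, 6, 2))
flow[..., 0] = 1.0                         # matching motion: dx = 1
aligned = warp(frame_next, flow)
```

With the matching flow the aligned frame agrees with the current frame everywhere except at the clipped border column, so the loss is small; with a zero flow the same pair produces a much larger loss.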
[0152] 4.4.4 Constructing the overall joint loss function:
[0153] The three loss components are weighted and summed to yield the final total loss function used for network backpropagation:
[0154]
[0155] By minimizing this total loss function, the network is driven to maintain temporal stability while recovering image details;
[0156] 4.4.5 Calculate the total error using the joint loss function;
[0157] 5. Train the deblurring network to obtain the infrared video deblurring model. The specific steps are as follows:
[0158] 5.1 Dataset Partitioning and Preprocessing:
[0159] 5.1.1 Divide the constructed infrared video deblurring dataset (such as UIRD or self-built CMRRD-HCFBI) into training and testing sets according to a preset ratio;
[0160] 5.1.2 During training, data augmentation operations are performed on the input video frames, including random horizontal flipping and random rotation ([value missing]);
[0161] 5.1.3 Randomly crop the images into patches of [value missing] pixels to fit within GPU memory limits and increase sample diversity.
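The augmentation steps can be sketched so that the blurred and sharp sequences always receive the same transform. The rotation angles and crop size are not recoverable from the text, so this sketch assumes rotations by multiples of 90 degrees and takes the crop size as a parameter; `augment_pair` is a hypothetical helper.

```python
import numpy as np

def augment_pair(blurred, sharp, crop, rng):
    """Apply one random flip, rotation and crop to both sequences alike.

    blurred, sharp: (T, H, W) arrays with matching shapes.
    """
    if rng.random() < 0.5:                    # random horizontal flip
        blurred, sharp = blurred[..., ::-1], sharp[..., ::-1]
    k = int(rng.integers(0, 4))               # assumed: multiples of 90 degrees
    blurred = np.rot90(blurred, k, axes=(1, 2))
    sharp = np.rot90(sharp, k, axes=(1, 2))
    _, H, W = blurred.shape
    top = int(rng.integers(0, H - crop + 1))
    left = int(rng.integers(0, W - crop + 1))
    window = (slice(None), slice(top, top + crop), slice(left, left + crop))
    return blurred[window].copy(), sharp[window].copy()

rng = np.random.default_rng(0)
sharp_seq = rng.random((7, 32, 48))
blurred_seq = sharp_seq + 1.0                 # toy pair with a known offset
b, s = augment_pair(blurred_seq, sharp_seq, crop=16, rng=rng)
```

Drawing the random parameters once and applying them to both sequences keeps every blurred pixel aligned with its sharp counterpart, which the pixel-level loss depends on.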
[0162] 5.2 Network Initialization Strategy:
[0163] 5.2.1 Initialize the parameters of the CFST deblurring network constructed in step 4;
[0164] 5.2.2 Load the pre-trained SM-GMA optical flow estimation model parameters from step 3 into the optical flow branch of the network. During the first 10,000 iterations of training, freeze the parameters of the optical flow branch and update only the parameters of the deblurring backbone network, to prevent gradient fluctuations in the early stages of training from destroying the optical flow prior. After 10,000 iterations, unfreeze the optical flow branch and perform end-to-end joint fine-tuning of the entire network.
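The two-phase schedule can be expressed as a small helper that decides which parameter groups are updated at a given iteration. The group names are hypothetical labels, and iterations are assumed to be counted from zero.

```python
FREEZE_ITERS = 10_000  # the embodiment freezes the flow branch for 10,000 iterations

def flow_branch_trainable(iteration):
    """Return True once the optical flow branch should receive updates."""
    return iteration >= FREEZE_ITERS

def trainable_groups(iteration):
    """Names of the parameter groups updated at a given iteration."""
    groups = ["deblur_backbone"]
    if flow_branch_trainable(iteration):
        groups.append("flow_branch")
    return groups

early = trainable_groups(0)
late = trainable_groups(10_000)
```

In a training loop, the result would typically be used to toggle gradient computation for the flow branch before each optimizer step.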
[0165] 5.3 Model training parameter settings:
[0166] This embodiment uses the SGD optimizer to optimize the network, and the specific training parameters are set as follows:
[0167] Batch size: Set to 8 (adapted for 90GB video memory environments);
[0168] Learning rate: the learning rate of the optical flow branch is set to 0.25 times the global learning rate; the initial learning rate of the deblurring backbone network is set to [value missing];
[0169] Learning rate scheduling strategy: a cosine annealing with restarts strategy is used to dynamically adjust the learning rate, with a period of 200,000 iterations and a minimum learning rate of [value missing];
[0170] Total number of iterations: the entire training process runs for 200,000 iterations, until the total joint loss function converges;
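The learning rate schedule can be sketched as a plain function of the iteration count. The period of 200,000 iterations and the 0.25 scale for the optical flow branch come from the settings above; `BASE_LR` and `MIN_LR` are hypothetical placeholder values, because the actual rates are not available in the text.

```python
import math

def cosine_annealed_lr(iteration, base_lr, min_lr, period):
    """Cosine annealing with a restart every `period` iterations."""
    progress = (iteration % period) / period
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

BASE_LR, MIN_LR = 1e-3, 1e-6  # hypothetical values, not from the source
PERIOD = 200_000              # period stated in the embodiment
FLOW_LR_SCALE = 0.25          # flow branch uses 0.25x the global rate

lr = cosine_annealed_lr(100_000, BASE_LR, MIN_LR, PERIOD)
flow_lr = FLOW_LR_SCALE * lr
```

Since the period equals the total number of iterations, the schedule completes exactly one cosine cycle over the whole run under these settings.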
[0171] 5.4 Model Export and Inference:
[0172] After training, the model weights with the highest PSNR on the validation set are saved as the final infrared video deblurring model. During the inference phase, an infrared blurred video sequence of any length is input into the model, and a clear and continuous infrared video stream is output.
[0173] 6. Use an infrared video deblurring model to process the input blurred infrared video and output a high-precision, clear infrared video that meets the temporal consistency constraint. The specific implementation steps are as follows:
[0174] 6.1 Acquire the real blurred infrared video to be processed and perform continuous multi-frame serialization preprocessing: use the FFmpeg decoding library to extract the original infrared video file into consecutive still frames at a frame rate of 30 fps; set the temporal sliding window length to 7 (that is, the current frame at the center plus the three frames before and the three frames after it); align the resolution of each single-channel infrared image to the specified 256 × 256 using boundary padding, and normalize the pixel grayscale values to the [value missing] interval, so as to construct a five-dimensional input tensor with dimensions (Batch=1, Time=7, Channel=1, Height=256, Width=256).
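The tensor construction can be sketched as follows, starting from frames that have already been decoded (the FFmpeg step is not shown). Several details are assumptions of this sketch: frames are 8-bit and no larger than 256 × 256, padding is reflective on the bottom and right edges, sequence boundaries repeat the first or last frame, and values are normalized to [0, 1]. The helper name `build_input_tensor` is hypothetical.

```python
import numpy as np

def build_input_tensor(frames, center, window=7, size=256):
    """Build a (1, T, 1, H, W) float tensor around frame index `center`.

    frames: list of 2-D uint8 arrays (single-channel infrared frames),
            each no larger than `size` in height and width.
    """
    half = window // 2
    clip = []
    for i in range(center - half, center + half + 1):
        i = min(max(i, 0), len(frames) - 1)            # assumed: repeat edge frames
        frame = frames[i].astype(np.float32) / 255.0   # assumed range [0, 1]
        pad_h, pad_w = size - frame.shape[0], size - frame.shape[1]
        frame = np.pad(frame, ((0, pad_h), (0, pad_w)), mode="reflect")
        clip.append(frame)
    return np.stack(clip)[None, :, None, :, :]

frames = [np.full((240, 250), i, dtype=np.uint8) for i in range(10)]
x = build_input_tensor(frames, center=0)
```

Sliding `center` over every frame index yields one seven-frame window per output frame, which is how a video of arbitrary length can be processed with a fixed-size input.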
[0175] 6.2 Load the deblurring model weights and calculate the optical flow offsets: load the model weight file trained to convergence in step 5 into GPU memory, and enable evaluation mode to freeze all network parameters; feed the five-dimensional input tensor into the global motion aggregation optical flow estimation network built into the model, and calculate the optical flow offset between adjacent frames in the sequence (for example, the relative motion between one frame and the next), which is output as a two-channel optical flow offset feature map with dimensions (1, 2, 256, 256).
[0176] 6.3 Perform optical flow-guided spatiotemporal feature reconstruction: in the deblurring network module, set the local attention window size to [value missing]. Guided by the optical flow offset feature map output in step 6.2, extract the Top-K (K=4) key pixel elements with the highest relevance in the corresponding regions of adjacent frames; set the number of heads for multi-head self-attention to 8, calculate the attention weights between the query vector and the key elements, perform feature aggregation, and output the deblurred single-frame infrared feature map.
[0177] 6.4 Output the deblurred infrared video and complete sequence reconstruction: perform inverse normalization on the single-frame infrared feature map output in step 6.3 by multiplying the values by 255 and truncating them to the [value missing] integer range; crop away the extra edge pixels that were padded for size alignment in step 6.1; stitch all reconstructed sharp infrared frames together in temporal order and compress them with H.264 encoding at the original frame rate of 30 fps, finally outputting a sharp infrared video file (.mp4 format) that is free of artifacts and satisfies temporal consistency.
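The inverse normalization and cropping can be sketched as below; the H.264 encoding step is not shown. The sketch assumes the network output is normalized to [0, 1], that the target is an 8-bit range of 0 to 255, and that padding was added on the bottom and right edges. The name `to_uint8_frame` is hypothetical.

```python
import numpy as np

def to_uint8_frame(output, orig_h, orig_w):
    """Convert a normalized network output back to an 8-bit frame.

    output: (H, W) float array, assumed normalized to [0, 1] and padded
            on the bottom and right edges.
    """
    frame = np.clip(np.rint(output * 255.0), 0, 255).astype(np.uint8)
    return frame[:orig_h, :orig_w]  # drop the padding added for alignment

padded_output = np.array([[-0.1, 0.2, 0.0],
                          [1.2, 1.0, 0.0]])
frame = to_uint8_frame(padded_output, orig_h=2, orig_w=2)
```

Clipping before the integer cast avoids wrap-around for values that overshoot the valid range, which would otherwise appear as speckle artifacts in the encoded video.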
Claims
1. An infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention, comprising the following steps:
A. Construct an infrared video deblurring dataset. The specific steps are as follows: A1. Obtain basic data for real-world scenarios: collect unpaired visible light images and sharp infrared video frames of specific industrial scenarios to construct a basic image dataset for real-world scenarios; A2. Extract sharp infrared video sequences from the basic image dataset as source images, and generate a paired sharp-blurred infrared video dataset using a trajectory-modeling-based motion blur enhancement method. The specific process includes: A2.1 Decompose the motion parameters of the equipment into motion length and main motion direction, establish a probability distribution model for each, and introduce constrained random disturbance terms to construct a motion vector model; A2.2 Using an improved Bresenham line generation algorithm combined with sub-pixel trajectory calculation and a dynamic energy decay function, construct a normalized motion blur kernel matrix $K_{blur}$; A2.3 Using the Fourier transform, multiply the source image $I_{src}$ (the sharp infrared image after boundary expansion) by the motion blur kernel matrix $K_{blur}$ in the frequency domain and apply the inverse transform to obtain the simulated blurred infrared image $I_{blur}$: $I_{blur} = \mathcal{F}^{-1}\left(\mathcal{F}(I_{src}) \odot \mathcal{F}(K_{blur})\right)$, where $\mathcal{F}$ denotes the Fourier transform, $\mathcal{F}^{-1}$ denotes the inverse Fourier transform, and $\odot$ denotes element-wise multiplication;
B. Generate infrared optical flow data. The specific steps are as follows: B1. Construct a cross-modal style transfer model; B2. Using the unpaired visible light and infrared images from the basic image dataset described in step A, train a cycle-consistent generative adversarial network to learn the cross-modal style mapping; B3. Input a publicly available visible light video dataset with accurate optical flow labels into the trained cycle-consistent generative adversarial network and transfer its image style to the infrared domain; B4. Generate a cross-modal infrared optical flow dataset with absolutely accurate optical flow labels for supervised training of the subsequent optical flow estimation model;
C. Construct and train a global motion aggregation optical flow estimation network to obtain an optical flow estimation model. The specific steps are as follows: C1. Construct an optical flow estimation network based on the global motion aggregation mechanism, consisting of a feature encoder, a global motion aggregation module, and a gated recurrent unit; C2. Input the infrared optical flow data generated in step B into the optical flow estimation network, and use the feature extraction network to extract multi-scale feature maps of the two adjacent infrared images; C3. Execute the global motion aggregation module to construct a motion propagation equation based on the feature similarity of global pixel pairs: C3.1 Map the extracted feature maps to the query vector $Q$, the key vector $K$, and the value vector $V$; C3.2 Calculate the normalized dot-product similarity between the query vector $Q$ and the key vector $K$ to construct the global motion propagation weight matrix. The specific calculation formula is as follows: $A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{D}}\right)$, where $A$ is the global motion propagation weight, $D$ is the feature dimension, and $\top$ denotes the transpose operation; C4. Implement a self-similarity matching mechanism and use a quadtree block strategy to optimize dynamic motion residuals; C5. Perform iterative optimization and sub-pixel level correction, using gated recurrent units to perform multi-step iterative correction of the optical flow field;
D. Construct an infrared video deblurring network, use the optical flow estimation model to guide the fusion of spatiotemporal features, and construct a joint loss function. The specific steps are as follows: D1. Construct an optical flow-guided attention module, which consists of a layer normalization layer, an optical flow-guided sparse window multi-head self-attention unit, and a feedforward neural network; D2. Construct an infrared video deblurring network based on a U-shaped architecture. The network consists of an encoder, a bottleneck layer, and a decoder, and the encoder and decoder are each composed of several stacked optical flow-guided attention modules; D3. Perform the optical flow-guided sparse window multi-head self-attention calculation, using the global motion aggregation optical flow estimation model obtained in step C to guide the sparse sampling of spatiotemporal key elements; D4. Construct a joint loss function that includes pixel loss, structural similarity loss, and optical flow constraints;
E. Train the deblurring network end-to-end to obtain the infrared video deblurring model. The specific steps are as follows: E1. Split the dataset: divide the paired sharp-blurred infrared video deblurring dataset constructed in step A into a training set and a test set; E2. Initialize network parameters: initialize the parameters of the deblurring network constructed in step D, load the optical flow estimation model parameters trained in step C into the network, and freeze or fine-tune them; E3. Perform end-to-end training and output: input the training set into the network for forward propagation, calculate the total joint loss function, and use the stochastic gradient descent algorithm to backpropagate and update the network parameters until the model converges, then output the final infrared video deblurring model;
F. Input the blurred infrared video into the model for processing, and output a sharp infrared video. The specific steps are as follows: F1. Obtain the real blurred infrared video sequence to be processed, and perform continuous multi-frame serialization preprocessing according to the format required by the network, to be used as input data for the inference stage; F2. Input the preprocessed blurred infrared video sequence into the trained infrared video deblurring model obtained in step E, and use the optical flow estimation network module embedded in the model to calculate the optical flow offset between adjacent frames in the sequence; F3. Perform spatiotemporal feature reconstruction: using the deblurring network module inside the model, guide the sparse window to sample key elements in the corresponding spatial region of adjacent frames according to the optical flow offset, and fuse cross-frame spatiotemporal features through multi-head self-attention calculation to restore the thermal radiation structure and texture details of the current frame; F4. Output the consecutive video frames that have undergone spatiotemporal feature reconstruction and deblurring, and combine them at the original video frame rate to generate a high-precision, sharp infrared video that meets the temporal consistency constraint, thus completing the end-to-end infrared video deblurring task.
2. The infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention as described in claim 1, wherein the cross-modal style transfer model is constructed by the following specific steps: B1.1 Perform physical prior-guided preprocessing on the visible light image to obtain a preprocessed image containing only thermal radiation structure information; B1.2 Construct a cycle-consistent generative adversarial network containing two generators and two discriminators; B1.3 Construct a hybrid loss function that includes adversarial loss, cycle consistency loss, identity loss, and perceptual loss. The specific calculation formula is as follows: $\mathcal{L}_{perc} = \sum_{l} w_l \left\| \phi_l(\hat{I}) - \phi_l(I_{ref}) \right\|_1$, where $\mathcal{L}_{perc}$ denotes the perceptual loss, $\hat{I}$ denotes the generated infrared image, $I_{ref}$ denotes the reference target image, $\phi_l$ denotes the feature map of the $l$-th layer of the feature extraction network, $w_l$ denotes the weight of the $l$-th layer feature map, and $\|\cdot\|_1$ denotes the L1 norm.
3. The infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention as described in claim 1, wherein the self-similarity matching mechanism is implemented and the quadtree block strategy is used to optimize dynamic motion residuals by the following specific steps: C4.1 Divide the image features into several non-overlapping image blocks and calculate attention independently within each block; C4.2 Use the intra-block submatrices for feature aggregation to reduce computational complexity and balance computational efficiency and accuracy. The specific calculation formula is as follows: $\mathrm{Attn}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{D}}\right) V_i$, where $Q_i$, $K_i$, and $V_i$ are the query, key, and value submatrices within the $i$-th quadtree image block, and $D$ is the feature dimension.
4. The infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention as described in claim 1, wherein iterative optimization and sub-pixel-level correction are performed and a gated recurrent unit is used to perform multi-step iterative correction of the optical flow field by the following specific steps: C5.1 Input the current optical flow estimate, the global aggregation features, and the context information into the gated recurrent unit to calculate the optical flow correction $\Delta f_k$ of the $k$-th iteration; C5.2 Update the current optical flow field based on the optical flow correction. The update equation is as follows: $f_{k+1} = f_k + \Delta f_k$; C5.3 Construct the endpoint error loss function $\mathcal{L}_{EPE}$, which calculates the Euclidean distance between the predicted optical flow field and the ground-truth infrared optical flow label to drive network parameter updates. The specific calculation formula is as follows: $\mathcal{L}_{EPE} = \left\| f_{gt} - f_{pred} \right\|_2$, where $f_{gt}$ is the ground-truth optical flow label, $f_{pred}$ is the predicted optical flow field, and $\|\cdot\|_2$ denotes the L2 norm.
5. The infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention as described in claim 1, wherein the optical flow-guided sparse window multi-head self-attention calculation is performed and the global motion aggregation optical flow estimation model obtained in step C is used to guide the sparse sampling of spatiotemporal key elements by the following specific steps: D3.1 Divide the input video frame feature map into non-overlapping local windows, and take the window features at the current time as the reference, denoted as the query features; D3.2 Using the optical flow estimation model trained in step C, calculate the optical flow offset between the reference frame and the adjacent frame; D3.3 Based on the optical flow offset, locate the spatial region in the adjacent frame that corresponds to the current query window, and sparsely sample the pixels within that region to obtain a set of key elements highly correlated with the query features; D3.4 Map the features of the current query window to the query vector $Q$, and map the set of key elements to the key vector $K$ and the value vector $V$; D3.5 Calculate the multi-head self-attention weights between the query vector $Q$ and the key vector $K$, and compute the window feature output that incorporates cross-frame spatiotemporal information.
6. The infrared video deblurring method based on cross-modal style transfer and optical flow-guided sparse attention as described in claim 1, wherein the joint loss function including pixel loss, structural similarity loss, and optical flow constraints is constructed by the following specific steps: D4.1 Construct the pixel consistency loss function $\mathcal{L}_{pix}$: $\mathcal{L}_{pix} = \frac{1}{T H W} \sum_{t=1}^{T} \left\| \hat{I}_t - I_t^{gt} \right\|_1$, where $\|\cdot\|_1$ denotes the L1 norm, $T$ is the temporal length, $H \times W$ is the spatial resolution, $\hat{I}_t$ is the deblurred output, and $I_t^{gt}$ is the sharp GT image; D4.2 Construct the structural similarity loss function $\mathcal{L}_{ssim}$. The specific calculation formula is as follows: $\mathcal{L}_{ssim} = 1 - \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$, where $x$ denotes a deblurred image patch, $y$ denotes a ground-truth image patch, $\mu$ denotes the mean, $\sigma^2$ denotes the variance, $\sigma_{xy}$ denotes the covariance, and $C_1$ and $C_2$ are constants that maintain computational stability; D4.3 Construct the optical flow temporal consistency constraint loss function $\mathcal{L}_{flow}$. The specific calculation formula is as follows: [formula missing], where $\hat{I}_t$ denotes the deblurred frame at the current moment, $\hat{I}_{t+1}$ denotes the deblurred frame at the next moment, $f_{t \to t+1}$ denotes the predicted optical flow field, $\mathcal{W}(\cdot)$ denotes the spatial transformation operation based on optical flow, and $\|\cdot\|_1$ denotes the L1 norm; D4.4 Weight and sum the pixel consistency loss function, the structural similarity loss function, and the optical flow temporal consistency constraint loss function to construct the overall joint loss function $\mathcal{L}_{total}$: [formula missing], where $\alpha$ is the perception-pixel balance factor used to adjust the contribution ratio of the SSIM and L1 losses, and $\lambda$ is the strength of the temporal regularization; D4.5 Calculate the total error using the joint loss function.