A real scene video deblurring system and method based on a single-step video diffusion model

By employing frame-by-frame latent space coding and a temporal window masking mechanism based on a single-step video diffusion model, the problems of insufficient generalization ability and high inference cost in real-world video deblurring are solved, achieving efficient and stable deblurring results.

CN122265096A · Pending · Publication Date: 2026-06-23 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date: 2026-05-27
Publication Date: 2026-06-23

Application Information

IPC: G06T5/73; G06T5/70; G06T5/60; G06V10/80; G06V10/75; G06V10/766; G06N3/045; G06N3/0455; G06N3/096; G06N5/025; G06N5/04

AI Tagging

Application Domain

Image enhancement; Biological models

Technology Topics

Pattern recognition; Computer graphics (images)

AI Technical Summary

Technical Problem

Existing video deblurring technologies lack generalization ability in real-world scenarios, resulting in overly smooth deblurring results, high inference costs, and poor stability in long video processing. They also struggle to balance deblurring quality, temporal consistency, and engineering deployment efficiency.

Method used

A real-scene video deblurring system based on a single-step video diffusion model is adopted. Through frame-by-frame latent space coding, single-step diffusion distillation and temporal window masking mechanisms, the system enhances the frame-by-frame difference representation of real blurred videos, reduces inference overhead, and limits the temporal domain interaction range of long videos.

Benefits of technology

It can recover more high-frequency texture and structural information in real-world scenarios, reduce inference latency, improve the stability and engineering efficiency of long video processing, and improve deblurring quality and temporal consistency.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a real-scene video deblurring system and method based on a single-step video diffusion model. The system comprises an encoding module, a denoising module and a decoding module. The encoding module performs latent space encoding on each frame of a blurred video sequence to be recovered, generating a frame-by-frame latent space representation in one-to-one correspondence with the input video frames. The denoising module performs single-step denoising on the frame-by-frame latent space representation to obtain a latent space representation corresponding to a clear video. The decoding module decodes the latent space representation corresponding to the clear video into image frames frame by frame and outputs them in the original temporal order of the input video to obtain a deblurred video result. Frame-by-frame latent space encoding preserves frame-by-frame blur differences, and single-step diffusion distillation reduces inference latency, so that deblurring quality and inference efficiency are both taken into account.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a real-scene video deblurring system and method based on a single-step video diffusion model.

Background Technology

[0002] Real-world videos are prone to blurring during filming due to camera shake, movement of objects in the scene, and optical defocus. This blurring destroys texture details, structural boundaries, and temporal continuity in the video, further affecting the performance of downstream tasks such as mobile video enhancement and 3D reconstruction. Video deblurring is a fundamental task in low-level visual processing, aiming to recover sharp image frames from degraded video sequences.

[0003] Currently, learning-based video deblurring techniques mainly include explicit motion alignment, implicit correspondence learning, and feature modeling based on spatiotemporal transformers. These methods generally rely on supplementary information from neighboring frames to recover the current frame, and most belong to deterministic regression frameworks trained with a mean squared error objective. When dealing with highly underdetermined inverse problems, they tend to output the statistical mean of all possible solutions and lack the ability to model the distribution of real images. They therefore struggle to recover plausible details and high-frequency textures in blurred regions where information has been destroyed, producing over-smoothed deblurred video. Because deblurring training datasets are small and limited to a few specific scenes, these methods also cannot generatively model the distribution of clear videos. Although they can achieve good distortion metrics on synthetic test sets, in real-world scenarios they generally suffer from insufficient generalization, overly smooth results, high inference costs, and poor stability on long videos. They are prone to insufficient high-frequency texture recovery, blurred structural boundaries, and degraded cross-scene generalization, making it difficult to achieve deblurring quality, temporal consistency, and engineering deployment efficiency at the same time.

[0004] In recent years, diffusion-generative models have been increasingly used for image and video deblurring tasks. Compared with traditional regression methods, diffusion-generative models can incorporate stronger natural image and video priors, thus showing greater potential in perceptual quality. Existing video diffusion-based deblurring schemes typically build upon pre-trained video generation models, fine-tuning existing video generation diffusion models. While preserving the original video generation backbone, temporally compressed latent space representation, and multi-step iterative sampling process, these models are adapted for video deblurring through task adaptation. While these approaches improve some perceptual quality issues, they still have shortcomings in real-world video deblurring tasks. Firstly, existing video diffusion-based deblurring schemes typically use the temporal compression representation in the original video generation model. The original model employs a causal variational autoencoder with temporal compression, which assumes smooth transitions between adjacent frames and maps multiple frames to a shared latent space. However, in real-world blurred sequences, the size and orientation of the blur kernels between adjacent frames differ significantly. Temporal compression leads to entanglement of frame-by-frame degradation information and loss of high-frequency details, making it difficult to fully preserve the information on the strength and orientation changes of blur in each frame of the real video. This approach is more suitable for video generation scenarios with smooth transitions between adjacent frames, but not for real-world blurred videos with significant differences in blur levels between adjacent frames. Secondly, existing video diffusion-based deblurring processes mostly require multiple iterations. 
The inference latency of multi-step diffusion sampling is high and its computational cost is heavy, making it difficult to meet practical engineering needs. Furthermore, existing models rely on global self-attention and rotary position encoding to model the temporal sequence. For long video inputs, global temporal attention causes memory usage to grow quadratically with sequence length. When the input length exceeds the training length during inference, relative position distances fall outside the range seen in training, so position-encoding extrapolation becomes unstable, which distorts attention weights and produces temporal inconsistency and artifacts.

Summary of the Invention

[0005] To address some or all of the problems in existing technologies, and to simultaneously achieve deblurring quality, long-sequence stability, and engineering implementation efficiency in real-scene video deblurring, the first aspect of this invention provides a real-scene video deblurring system based on a single-step video diffusion model, comprising: an encoding module, used to perform latent space encoding on each frame of the blurred video sequence to be recovered, generating a frame-by-frame latent space representation that corresponds one-to-one with the input video frames; a denoising module, used to perform single-step denoising on the frame-by-frame latent space representation to obtain the latent space representation corresponding to the clear video; and a decoding module, used to decode the latent space representation corresponding to the clear video frame by frame into image frames, and output them in the original temporal order of the input video to obtain the deblurred video result.

[0006] Furthermore, the encoding module includes a two-dimensional image encoder.

[0007] Furthermore, the denoising module includes: a preprocessing module, used to fuse the frame-by-frame latent space representation with the noisy latent space embedding; and a video diffusion transformer, used to denoise the fused data.

[0008] Furthermore, the preprocessing module includes: a feature transformation module, which includes a conditional mapping network and is used to perform feature transformation on the frame-by-frame latent space representation to obtain conditional features; and a fusion module, used to fuse the conditional features with the noisy latent space embedding.

[0009] Furthermore, the video diffusion transformer includes at least one transformer module, and each transformer module includes a spatiotemporal self-attention unit, a cross-attention unit, and a feedforward network unit, wherein the cross-attention unit receives an empty text embedding.

[0010] Furthermore, the video diffusion transformer is obtained by pre-training as follows: a teacher model, a student model, and a discriminant model are constructed, wherein the three models have the same structure but independent low-rank adaptation parameters; the student model denoises the blurred video to obtain the student output, wherein the student model starts from random noise and directly predicts the latent space representation of the denoised video in one forward computation; noise is added to the denoised video latent space representation; the teacher model denoises the noise-added latent space representation to obtain the teacher output, wherein the teacher model retains its multi-step denoising capability; the discriminant model estimates the distribution difference between the student output and the teacher output, and the distribution difference is fed back to the student model; the latent space results output by the student model are decoded, and the image domain reconstruction loss and perceptual similarity loss are determined in the image domain; the discriminant model is trained with standard flow matching on the detached results output by the student model to determine the flow matching loss; and the trained student model is used as the video diffusion transformer.

[0011] Based on the aforementioned real-scene video deblurring system, a second aspect of the present invention provides a real-scene video deblurring method based on a single-step video diffusion model, comprising: obtaining the blurred video sequence to be recovered, and performing latent space encoding on each frame of the video to generate a frame-by-frame latent space representation that corresponds one-to-one with the input video frames; fusing the frame-by-frame latent space representation with the noisy latent space embedding to obtain fused features; denoising the fused features using a pre-trained video diffusion transformer to obtain a latent space representation corresponding to the clear video; and decoding the latent space representation corresponding to the clear video frame by frame into image frames, and outputting them in the original temporal order of the input video to obtain the deblurred video result.

[0012] Furthermore, the real-scene video deblurring method also includes: when the length of the blurred video sequence to be recovered is greater than the training length, introducing a time window mask.

[0013] Furthermore, introducing the time window mask includes: for each current frame, allowing temporal attention interaction only with neighboring frames within a predetermined number of frames before and after that frame.

[0014] Furthermore, the length of the time window is no greater than the time sequence length used in the training phase.

[0015] This invention provides a real-scene video deblurring system and method based on a single-step video diffusion model. While retaining the existing spatiotemporal priors of the pre-trained video generation model, it enhances the ability to express frame-by-frame blur differences in real-scene blurred videos through frame-by-frame latent space encoding, significantly reduces inference overhead through single-step diffusion distillation, and limits the temporal interaction range during long video deblurring through an inference mechanism based on a time window mask. It thereby balances deblurring quality, long-sequence stability, and engineering implementation efficiency in real-scene video deblurring.

Attached Figure Description

[0016] To further illustrate the above and other advantages and features of the various embodiments of the present invention, a more specific description of the various embodiments of the present invention will be presented with reference to the accompanying drawings. It is to be understood that these drawings depict only typical embodiments of the invention and are therefore not intended to limit its scope. In the drawings, identical or corresponding parts will be indicated by identical or similar reference numerals for clarity.

[0017] Figure 1 illustrates the structure of a real-scene video deblurring system based on a single-step video diffusion model according to an embodiment of the present invention. Figure 2 illustrates the pre-training process of a video diffusion model according to an embodiment of the present invention. Figure 3 illustrates a flowchart of a real-scene video deblurring method based on a single-step video diffusion model according to an embodiment of the present invention. Figure 4 illustrates a time window mask according to an embodiment of the present invention.

Detailed Implementation

[0018] In the following description, the invention is described with reference to various embodiments. However, those skilled in the art will recognize that the embodiments may be practiced without one or more specific details or in conjunction with other alternatives and / or additional methods or components. In other instances, well-known structures, materials, or operations are not shown or described in detail so as not to obscure the inventive points of the invention. Similarly, for illustrative purposes, specific numbers and configurations are set forth to provide a comprehensive understanding of embodiments of the invention. However, the invention is not limited to these specific details. Furthermore, it should be understood that the embodiments shown in the drawings are illustrative representations and are not necessarily drawn to scale.

[0019] In this specification, references to "an embodiment" or "this embodiment" mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in at least one embodiment of the invention. The phrase "in one embodiment" appearing throughout this specification does not necessarily refer to the same embodiment in all instances.

[0020] It should be noted that the embodiments of the present invention describe the method steps in a specific order; however, this is only for illustrating the specific embodiment and not for limiting the order of the steps. On the contrary, in different embodiments of the present invention, the order of the steps can be adjusted according to actual needs.

[0021] To address the problems of existing diffusion models requiring multiple iterations, high inference costs, and difficulty in handling long video sequences, this invention provides a real-scene video deblurring system and method based on a single-step video diffusion model. It systematically adapts the input encoding and decoding representation, the diffusion denoising method, and the long video inference method to the real-scene video deblurring task. While retaining the existing spatiotemporal priors of the pre-trained video generation model, it preserves frame-by-frame blur differences through frame-by-frame latent space encoding, reduces inference latency through single-step diffusion distillation, and improves the stability of long video processing through time window mask inference, thereby achieving real-scene video deblurring that balances deblurring quality, inference efficiency, and long-sequence stability. The system and method first acquire the blurred video sequence to be recovered and perform latent space encoding on each frame of the video to form a frame-by-frame latent space representation that corresponds one-to-one with the input video frames, which avoids the loss of inter-frame blur difference information caused by conventional temporally compressed representations. Second, during model training or inference, the frame-by-frame latent space representation is introduced into the video generation backbone, so that the backbone is constrained by the blurred observation during the denoising process. Third, distillation training compresses the original multi-step diffusion denoising process into a single-step denoising process, enabling the model to directly output the deblurred result in a single forward computation. Finally, in the long video inference stage, an inference mechanism based on time window masks restricts the original global temporal correlation to a local temporal range, thereby reducing inference costs and minimizing temporal artifacts when processing ultra-long sequences.
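
The difference in inference cost between multi-step sampling and a single forward computation can be illustrated with a minimal sketch. It is not the patented model: the flow convention (t = 1 is noise, t = 0 is the clean latent, velocity = noise minus clean) and the `CountingVelocity` toy network are assumptions introduced only for illustration.

```python
from typing import Callable, List

Latent = List[float]
Velocity = Callable[[Latent, float], Latent]


def multi_step_sample(z: Latent, velocity: Velocity, steps: int) -> Latent:
    """Euler integration from t=1 (noise) to t=0 (clean): `steps` model calls."""
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


def single_step_sample(z: Latent, velocity: Velocity) -> Latent:
    """One model call that jumps directly from t=1 to t=0."""
    v = velocity(z, 1.0)
    return [zi - vi for zi, vi in zip(z, v)]


class CountingVelocity:
    """Hypothetical stand-in for a network: constant velocity, counts its calls."""

    def __init__(self, noise: Latent, clean: Latent) -> None:
        self.v = [n - c for n, c in zip(noise, clean)]
        self.calls = 0

    def __call__(self, z: Latent, t: float) -> Latent:
        self.calls += 1
        return list(self.v)


noise, clean = [1.0, -2.0, 0.5], [0.2, 0.4, 0.6]
slow = CountingVelocity(noise, clean)
fast = CountingVelocity(noise, clean)
z_slow = multi_step_sample(list(noise), slow, steps=50)
z_fast = single_step_sample(list(noise), fast)
print(slow.calls, fast.calls)
```

In this toy the velocity field is exactly straight, so both samplers land on the same latent. A real pre-trained model does not behave this way, which is why the text relies on distillation training to make the single call sufficient.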

[0022] The technical solution of the present invention will be further described below with reference to the accompanying drawings of the embodiments.

[0023] Figure 1 illustrates the structure of a real-scene video deblurring system based on a single-step video diffusion model according to an embodiment of the present invention. As shown in Figure 1, the system includes an encoding module 101, a denoising module 102, and a decoding module 103. The encoding module 101 is used to perform latent space encoding on each frame of the blurred video sequence to be recovered, generating a frame-by-frame latent space representation that corresponds one-to-one with the input video frames. The denoising module 102 is used to perform single-step denoising on the frame-by-frame latent space representation to obtain the latent space representation corresponding to the clear video. The decoding module 103 is used to decode the latent space representation corresponding to the clear video frame by frame into image frames and output them in the original temporal order of the input video to obtain the deblurred video result.
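
The data flow through modules 101, 102 and 103 can be sketched as a composition of three callables. The function name `deblur_video` and the toy encoder, denoiser and decoder below are hypothetical placeholders; the sketch only shows the frame-count and ordering contract described above, not any real network.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Latent = TypeVar("Latent")


def deblur_video(
    blurred_frames: Sequence[Frame],
    encode_frame: Callable[[Frame], Latent],
    denoise_once: Callable[[List[Latent]], List[Latent]],
    decode_frame: Callable[[Latent], Frame],
) -> List[Frame]:
    # Encoding module 101: one latent per input frame, in input order.
    latents = [encode_frame(frame) for frame in blurred_frames]
    # Denoising module 102: a single forward pass over the latent sequence.
    sharp_latents = denoise_once(latents)
    if len(sharp_latents) != len(latents):
        raise ValueError("denoiser must return one latent per input frame")
    # Decoding module 103: frame-by-frame decoding, original order preserved.
    return [decode_frame(latent) for latent in sharp_latents]


denoiser_calls = []


def toy_denoiser(latents: List[float]) -> List[float]:
    denoiser_calls.append(len(latents))  # record each call for inspection
    return [z + 1.0 for z in latents]


result = deblur_video(
    [0.0, 1.0, 2.0],
    encode_frame=lambda f: f * 2.0,
    denoise_once=toy_denoiser,
    decode_frame=lambda z: z / 2.0,
)
print(result, denoiser_calls)
```

Passing the whole latent sequence to `denoise_once` reflects that the denoiser sees all frames jointly, while encoding and decoding stay strictly per frame.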

[0024] In one embodiment of the present invention, the real-scene video deblurring system uses a video generation model pre-trained on large-scale image and video data as its basic backbone, including a video variational autoencoder (VAE) and a video diffusion transformer (DiT). The VAE is used to map the input video into a latent space representation, and the DiT is used to denoise the noisy representation in the latent space. Existing video generation models typically use a causal three-dimensional variational autoencoder to compress the input video both spatially and temporally, which reduces the computational load of subsequent diffusion denoising. However, the real-scene video deblurring task differs from general video generation tasks. Blur is formed within the exposure time of each frame, so blur level, blur direction, and degradation strength are inconsistent between adjacent frames. If temporal compression is used directly, degradation information from multiple frames enters the same compressed representation during the encoding stage, causing frame-by-frame degradation information to become entangled. For this reason, in one embodiment of the present invention, the encoding module 101 includes a two-dimensional image encoder instead of the original temporally compressed encoder. Each frame of the input blurred video is treated as an independent image and fed into the two-dimensional image encoder for latent space encoding. The latent space representations of the frames are then combined into a video latent space sequence in the original temporal order, thereby maintaining the one-to-one correspondence between input frames and latent space frames.
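
The entanglement argument can be made concrete by tracking which frames end up in which latent. The sketch below is a simplification under stated assumptions: temporal compression is modeled as plain grouping of `stride` consecutive frames, and the stride of 4 is a hypothetical value. Real causal encoders may group frames differently.

```python
from typing import Dict, List


def framewise_latent_index(num_frames: int) -> List[int]:
    """Frame-by-frame encoding: frame i maps to its own latent i."""
    return list(range(num_frames))


def compressed_latent_index(num_frames: int, stride: int) -> List[int]:
    """Simplified temporal compression: `stride` consecutive frames share a latent."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [i // stride for i in range(num_frames)]


def frames_per_latent(index: List[int]) -> Dict[int, List[int]]:
    """Invert the frame-to-latent map to list the frames behind each latent."""
    groups: Dict[int, List[int]] = {}
    for frame, latent in enumerate(index):
        groups.setdefault(latent, []).append(frame)
    return groups


shared = frames_per_latent(compressed_latent_index(8, stride=4))
separate = frames_per_latent(framewise_latent_index(8))
print(shared)
print(separate)
```

With grouping, four frames with potentially different blur kernels are summarized by one latent, whereas the frame-wise map keeps exactly one frame behind each latent.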

[0025] After obtaining the frame-by-frame latent space representation of the blurred video, the blurred observation information needs to be introduced into the pre-trained video diffusion transformer to transform it from an unconditional generation model into a conditional denoising model. Based on this, in one embodiment of the present invention, the denoising module 102 includes a preprocessing module 121 and a video diffusion transformer 122, wherein the preprocessing module 121 is used to fuse the frame-by-frame latent space representation with the noisy latent space embedding, and the video diffusion transformer 122 is used to denoise the fused data.

[0026] In one embodiment of the present invention, the preprocessing module 121 includes a feature transformation module and a fusion module. The feature transformation module includes a conditional mapping network, through which the frame-by-frame latent space representation can be transformed to obtain conditional features. The fusion module is used to fuse the transformed conditional features with the noisy latent space embedding at the input of the video diffusion transformer 122, and then send it to the video diffusion transformer 122 for denoising.
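
A minimal sketch of the preprocessing module follows. The source does not specify the architecture of the conditional mapping network or the fusion rule, so both are assumptions here: the mapping is a single linear layer and the fusion is element-wise addition. All names (`map_condition`, `fuse`, `preprocess`) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def map_condition(
    latent: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Vector:
    """Assumed conditional mapping network: one linear layer per frame latent."""
    return [sum(w * x for w, x in zip(row, latent)) + b for row, b in zip(weight, bias)]


def fuse(condition: Sequence[float], noisy_embedding: Sequence[float]) -> Vector:
    """Assumed fusion rule: element-wise addition at the transformer input."""
    if len(condition) != len(noisy_embedding):
        raise ValueError("condition and noisy embedding must have the same width")
    return [c + n for c, n in zip(condition, noisy_embedding)]


def preprocess(
    frame_latents: Sequence[Sequence[float]],
    noisy_embeddings: Sequence[Sequence[float]],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[Vector]:
    """Map and fuse each frame independently, keeping the frame order."""
    if len(frame_latents) != len(noisy_embeddings):
        raise ValueError("need one noisy embedding per frame latent")
    return [
        fuse(map_condition(z, weight, bias), n)
        for z, n in zip(frame_latents, noisy_embeddings)
    ]


fused = preprocess(
    frame_latents=[[1.0, 2.0], [3.0, 4.0]],
    noisy_embeddings=[[0.5, 0.5], [1.0, 1.0]],
    weight=[[1.0, 0.0], [0.0, 2.0]],
    bias=[0.5, -0.5],
)
print(fused)
```

Other fusion rules, such as channel concatenation followed by a projection, would fit the same description; addition is used only because it keeps the input width of the transformer unchanged.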

[0027] As mentioned above, the real-scene video deblurring system uses a video generation model pre-trained on large-scale image and video data as its backbone. In one embodiment of the present invention, the video diffusion transformer 122 is composed of stacked transformer modules, each of which includes at least a spatiotemporal self-attention unit, a cross-attention unit, and a feedforward network unit. Typically, the cross-attention unit receives text conditions; however, to prevent text information from interfering with the video deblurring task, in one embodiment of the present invention the cross-attention unit receives empty text embeddings, which keeps the model in a text-free state while retaining its original network structure. In one embodiment of the present invention, to accommodate long video inputs, the spatiotemporal self-attention unit is based on a time window mask, under which each frame is allowed temporal attention interaction only with neighboring frames within a predetermined number of frames before and after it.
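
The time window mask in the self-attention unit can be sketched as a boolean matrix applied before the softmax. The convention used here is an assumption: frame i may attend to frame j when |i - j| < `window`, so the window counts the frame itself plus `window - 1` neighbors on each side. The function names are hypothetical.

```python
import math
from typing import List


def temporal_window_mask(num_frames: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j."""
    if window < 1:
        raise ValueError("window must be >= 1 so each frame can attend to itself")
    return [[abs(i - j) < window for j in range(num_frames)] for i in range(num_frames)]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over allowed positions only; masked positions get weight 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = temporal_window_mask(num_frames=5, window=2)
weights = masked_softmax([0.0] * 5, mask[0])
print(mask[2])
print(weights)
```

Frames outside the window contribute exactly zero weight, so they neither influence the current frame nor need to be scored.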

[0028] As mentioned earlier, the original video diffusion model typically requires multiple rounds of iterative denoising to gradually turn random noise into a clear video latent space representation. To reduce inference latency, in the embodiments of this invention, the original multi-step diffusion denoising process is compressed into a single-step denoising process through distillation training. Figure 2 illustrates the pre-training process of a video diffusion model according to an embodiment of the present invention. As shown in Figure 2, in one embodiment of the present invention, a teacher model, a student model, and a discriminant model are first constructed for distillation training. These models share the same frozen diffusion transformer backbone and differ only in that each has its own independent low-rank adaptation parameters. The student model then denoises the blurred video, specifically the blurred video latent space representation fused with noise, to obtain the student output. The student model starts from random noise and directly predicts the denoised video latent space representation in a single forward computation. Noise is then added to the denoised video latent space representation, and the teacher model denoises the noise-added representation to obtain the teacher output. The teacher model retains its original multi-step denoising capability to provide a high-quality target distribution. Next, the discriminant model estimates the distribution difference between the student output and the teacher output, and feeds this distribution difference back to the student model. At the same time, the latent space result output by the student model is decoded, and the image domain reconstruction loss and perceptual similarity loss are determined in the image domain. The discriminant model is then trained with standard flow matching on the detached result output by the student model to determine the flow matching loss. Finally, the trained student model is used as the video diffusion transformer. The student model is thus simultaneously subject to distribution matching constraints, image domain reconstruction constraints, and perceptual similarity constraints during training, which compresses the original multi-step diffusion denoising process into a single-step denoising process and enables the model to directly output the deblurred result in a single forward computation.
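
One way to write down the training objectives described above is given here as a hedged sketch. The source names the loss terms but gives no formulas, so the notation, the loss weights, and the linear-interpolation flow matching convention are all assumptions rather than the patent's definitions.

```latex
% Assumed notation, not taken from the source:
%   \hat{z}              student output (single-step latent prediction)
%   \hat{x}, x           decoded student frames and sharp reference frames
%   \lambda_{rec}, \lambda_{perc}   hypothetical loss weights
%   v_\phi               discriminant model with its own low-rank parameters \phi
\mathcal{L}_{\text{student}}
  = \mathcal{L}_{\text{dist}}
  + \lambda_{\text{rec}} \, \mathcal{L}_{\text{rec}}(\hat{x}, x)
  + \lambda_{\text{perc}} \, \mathcal{L}_{\text{perc}}(\hat{x}, x)

% Flow matching loss for the discriminant model on detached student outputs,
% assuming the interpolation z_t = (1 - t)\,\hat{z} + t\,\epsilon with
% \epsilon \sim \mathcal{N}(0, I):
\mathcal{L}_{\text{FM}}
  = \mathbb{E}_{t,\epsilon}
    \bigl\lVert v_\phi(z_t, t) - (\epsilon - \hat{z}) \bigr\rVert_2^2
```

Here the distribution term stands for the feedback derived from the difference between the teacher output and the discriminant model's estimate; its exact form is not stated in the source and is left abstract.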

[0029] To reduce training overhead while preserving the prior knowledge of pre-trained video generation, one embodiment of the present invention employs a parameter-efficient fine-tuning method to adapt the diffusion backbone to the task. Specifically, the parameters of the video diffusion transformer backbone are kept frozen, and low-rank adaptation parameters are loaded only in each transformer module. Simultaneously, the aforementioned conditional mapping network is trained, and the video variational autoencoder remains frozen, not participating in deblurring training updates. This avoids the need for overall retraining of the large-scale pre-trained model and also facilitates the inheritance of the clear video distribution prior learned by the original video generation model.
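
Low-rank adaptation on a frozen weight can be sketched in a few lines. This is a generic illustration of the technique, not the patent's configuration: the matrix sizes, the rank, and the names `lora_weight` and `trainable_fraction` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(frozen: Matrix, down: Matrix, up: Matrix, scale: float = 1.0) -> Matrix:
    """Effective weight W + scale * (up @ down); `frozen` itself is never modified."""
    delta = matmul(up, down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(frozen, delta)
    ]


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)


frozen = [[1.0, 0.0], [0.0, 1.0]]   # shared backbone weight, kept frozen
down = [[3.0, 4.0]]                 # rank-1 down projection (1 x 2)
student_up = [[1.0], [2.0]]         # adapter owned by one model (2 x 1)
other_up = [[0.0], [0.0]]           # a second model's adapter, zero-initialized

student_w = lora_weight(frozen, down, student_up)
other_w = lora_weight(frozen, down, other_up)
print(student_w, other_w, trainable_fraction(4096, 4096, 16))
```

Because each adapter is stored separately from `frozen`, several models can share one backbone in memory while behaving differently, which matches the description of the teacher, student and discriminant models having independent low-rank adaptation parameters.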

[0030] Figure 3 illustrates a flowchart of a real-scene video deblurring method, based on the system described above, according to an embodiment of the present invention. As shown in Figure 3, a real-scene video deblurring method based on a single-step video diffusion model includes the following steps. First, in step 301, video encoding is performed. The blurred video sequence to be recovered is obtained, and latent space encoding is performed on each frame of the video to generate a frame-by-frame latent space representation that corresponds one-to-one with the input video frames. Next, in step 302, feature fusion is performed. The frame-by-frame latent space representation is fused with the noisy latent space embedding to obtain fused features. Next, in step 303, denoising is performed. The fused features are denoised using a pre-trained video diffusion transformer to obtain the latent space representation corresponding to the clear video. For long video inputs, especially those longer than the training length, if a global temporal attention mechanism is still used, each frame must interact with all frames in the entire video during deblurring, so memory and computational load increase rapidly with sequence length. It also makes rotary position encoding extrapolation unstable when the inference sequence length exceeds the training length. Therefore, in one embodiment of the present invention, a time window mask is introduced during the inference stage, i.e., the denoising process. As shown in Figure 4, for any given frame, temporal attention interaction is only allowed with neighboring frames within a predetermined range before and after it. Frames outside this window range W are masked while the current frame is denoised and do not participate in the attention calculation. In one embodiment of the invention, the time window range W, i.e., the length of the time window, is no greater than the temporal length used in the training phase, and is preferably equal to the training length, ensuring that the relative temporal distance between any query frame and key frame always falls within the range learned during training. By introducing a time window mask, global temporal attention can be transformed into local time window attention without retraining the model, thereby reducing memory consumption during long video inference and suppressing artifacts and temporal instability caused by long-sequence extrapolation. Finally, in step 304, data decoding is performed. The latent space representation corresponding to the clear video is decoded frame by frame into image frames, which are output in the original temporal order of the input video to obtain the deblurred video result. Since frame-by-frame latent space encoding is used at the input end, frame-by-frame latent space decoding is preferably also used at the output end to maintain the correspondence between input and output frames, which facilitates direct downstream use in video enhancement, 3D reconstruction preprocessing, and depth estimation preprocessing.
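
The cost argument for the time window mask can be checked by counting frame pairs that take part in temporal attention. The sketch assumes the convention that frame i attends to frame j when |i - j| < W; the sequence length of 1000 and training length of 16 are hypothetical numbers chosen only for illustration.

```python
def global_attention_pairs(num_frames: int) -> int:
    """Every frame attends to every frame: quadratic in sequence length."""
    return num_frames * num_frames


def windowed_attention_pairs(num_frames: int, window: int) -> int:
    """Count pairs (i, j) with |i - j| < window, clipped at the sequence ends."""
    return sum(
        min(num_frames - 1, i + window - 1) - max(0, i - window + 1) + 1
        for i in range(num_frames)
    )


def max_relative_distance(num_frames: int, window: int) -> int:
    """Largest |i - j| that any query/key pair can reach under the mask."""
    return min(num_frames, window) - 1


train_len = 16     # hypothetical training sequence length
video_len = 1000   # hypothetical long input
print(global_attention_pairs(video_len))               # 1000000
print(windowed_attention_pairs(video_len, train_len))  # 30760
print(max_relative_distance(video_len, train_len))     # 15
```

Under this convention the windowed count grows linearly with video length, and with W equal to the training length the largest relative distance is the same as the largest one seen in a training clip, which is the condition the paragraph above relies on for stable position encoding.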

[0031] This invention provides a real-scene video deblurring system and method based on a single-step video diffusion model. By replacing the causal temporal compression coding method in the video generation model with frame-by-frame latent space coding, it enhances the ability to express frame-by-frame blur differences in real videos. Through single-step diffusion distillation, the deblurring process, which originally required multiple iterations, is compressed into a single deblurring process, thereby significantly reducing inference latency and improving deployment efficiency. Furthermore, through the inference mechanism of temporal window masking, the temporal information interaction during long video deblurring is restricted to a local temporal range, improving the stability of ultra-long video processing and reducing computational resource consumption. Compared with traditional regression-based video deblurring methods, this system and method utilize the clear video prior provided by the pre-trained video generation model, enabling the recovery of more high-frequency texture and structural information in realistic and complex blurred scenes, mitigating the oversmoothing problem that easily occurs in traditional methods. The method and system have been experimentally verified. Specifically, they have been quantitatively evaluated on multiple real-world video deblurring benchmarks, including multiple real-world blurred video datasets. The deblurring effect is comprehensively analyzed using distortion index, perceptual quality index, no-reference quality index, and temporal consistency index. Experimental results show that the method and system have achieved strong comprehensive performance on multiple real-world benchmarks.

[0032] Although various embodiments of the invention have been described above, it should be understood that they are presented by way of example only and not as limitations. It will be apparent to those skilled in the art that various combinations, modifications, and alterations can be made without departing from the spirit and scope of the invention. Therefore, the breadth and scope of the invention disclosed herein should not be limited by the exemplary embodiments disclosed above, but should be defined solely by the appended claims and their equivalents.

Claims

1. A real-scene video deblurring system based on a single-step video diffusion model, characterized in that it comprises: an encoding module configured to perform latent space encoding on each frame in the blurred video sequence to be recovered, generating a frame-by-frame latent space representation that corresponds one-to-one with the input video frames; a denoising module configured to perform single-step denoising on the frame-by-frame latent space representation to obtain the latent space representation corresponding to the clear video; and a decoding module configured to decode the latent space representation corresponding to the clear video frame by frame into image frames, and to output them in the original temporal order of the input video to obtain the deblurred video result.

2. The real-scene video deblurring system as described in claim 1, characterized in that the encoding module includes a two-dimensional image encoder.

3. The real-scene video deblurring system as described in claim 1, characterized in that the denoising module includes: a preprocessing module configured to fuse the frame-by-frame latent space representation with the noisy latent space embedding to obtain fused data; and a video diffusion transformer configured to denoise the fused data.

4. The real-scene video deblurring system as described in claim 3, characterized in that the preprocessing module includes: a feature transformation module, comprising a conditional mapping network, configured to perform feature transformation on the frame-by-frame latent space representation to obtain conditional features; and a fusion module configured to fuse the conditional features with the noisy latent space embedding to obtain the fused data.

5. The real-scene video deblurring system as described in claim 3, characterized in that the video diffusion transformer includes at least one transformer module, and each transformer module includes: a spatiotemporal self-attention unit which, based on a time window mask, allows each frame temporal attention interaction only with neighboring frames within a predetermined number of preceding and following frames; a cross-attention unit configured to receive empty text embeddings as input; and a feed-forward network unit.

6. The real-scene video deblurring system as described in claim 3, characterized in that the video diffusion transformer is obtained through pre-training comprising: constructing a teacher model, a student model, and a discriminant model, wherein the teacher model, the student model, and the discriminant model have the same structure but independent low-rank adaptation parameters; using the student model to denoise the blurred video to obtain the student output, wherein the student model starts from random noise and directly predicts the latent space representation of the denoised video in one forward computation; adding noise to the latent space representation of the denoised video; using the teacher model to denoise the noise-added latent space representation of the denoised video to obtain the teacher output, wherein the teacher model has multi-step denoising capability; using the discriminant model to estimate the distribution difference between the student output and the teacher output, and feeding the distribution difference back to the student model; decoding the latent space results output by the student model, and determining an image domain reconstruction loss and a perceptual similarity loss in the image domain; performing standard flow matching training of the discriminant model based on the separation results output by the student model to determine the flow matching loss; and using the trained student model as the video diffusion transformer.

7. A method for deblurring real-scene videos based on a single-step video diffusion model, characterized in that it comprises: obtaining the blurred video sequence to be recovered, and performing latent space encoding on each frame of the video to generate a frame-by-frame latent space representation that corresponds one-to-one with the input video frames; fusing the frame-by-frame latent space representation with the noisy latent space embedding to obtain fused data; denoising the fused data to obtain the latent space representation corresponding to the clear video; and decoding the latent space representation corresponding to the clear video frame by frame into image frames, and outputting them in the original temporal order of the input video to obtain the deblurred video result.

8. The method for deblurring real-scene videos as described in claim 7, characterized in that it further comprises: introducing a time window mask when the length of the blurred video sequence to be recovered is greater than the training length.

9. The method for deblurring real-scene videos as described in claim 8, characterized in that introducing the time window mask includes: for each current frame, allowing temporal attention interaction only with neighboring frames within a predetermined number of frames before and after it.

10. The method for deblurring real-scene videos as described in claim 8, characterized in that the length of the time window is no greater than the temporal length used in the training phase.