A video coloring method and apparatus
By combining a global motion aggregation module and an optical flow-guided self-attention module, and utilizing a Transformer network and color reference frames, the problems of color overflow and artifacts in motion scene video colorization are solved, achieving more stable and accurate color video generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- COMMUNICATION UNIVERSITY OF CHINA
- Filing Date
- 2023-06-26
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies suffer from issues such as color overflow and artifacts when coloring videos that contain motion. In addition, reference-free colorization produces diverse results, which can lead to the same object being colored inconsistently.
Optical flow is estimated using a Global Motion Aggregation (GMA) module, which is combined with an optical flow-guided self-attention module and a Transformer network. Two color reference frames, the first and last frames of the sequence, guide bidirectional, optical flow-guided color propagation through the video frame sequence, thereby improving the coloring effect.
The approach effectively alleviates the color overflow and color loss caused by excessive motion, improves the consistency and stability of video coloring, and enhances the accuracy of colorization results for videos with large motion.
Figure CN116828195B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a video coloring method and apparatus.
Background Technology
[0002] Video colorization is the process of converting black-and-white video into color video using technical means to improve the visual quality of the video. It is widely used in film restoration. In recent years, with the development of deep learning, video colorization has also made significant progress. Deep learning-based video colorization is divided into two categories: reference image-based video colorization and referenceless video colorization. Referenceless video colorization produces diverse results, providing users with multiple choices. However, this diversity may lead to the same object being colored in different colors, failing to guarantee the consistency of the entire video segment. For example, in a landscape video, a tree might be colored yellow or green, and referenceless colorization cannot determine the accurate color of the object. Reference image-based colorization algorithms take a reference image as input, extract color information from the reference image, and propagate it to the black-and-white video frames. This greatly improves the stability of video colorization.
[0003] However, in existing technologies, there are undesirable issues such as color overflow and artifacts in video coloring related to motion scenes.
Summary of the Invention
[0004] To address the aforementioned technical problems, embodiments of this application provide a video coloring method and apparatus to improve the coloring effect of black and white videos related to motion scenes.
[0005] In a first aspect, an embodiment of this application provides a video coloring apparatus, comprising:
[0006] Global Motion Aggregation (GMA) module, first residual module, encoder, optical flow guided self-attention module, decoder and second residual module;
[0007] The inputs to the GMA module and the first residual module are the video frame sequences to be colored;
[0008] The GMA module is configured to determine a bidirectional optical flow sequence based on the sequence of video frames to be colored, the bidirectional optical flow sequence serving as an input to the encoder, the optical flow-guided self-attention module, and the decoder;
[0009] The first residual module is configured to determine a first input vector based on the sequence of video frames to be colored;
[0010] The encoder is configured to determine a first output sequence and a second output sequence based on the bidirectional optical flow sequence and the first input vector;
[0011] The optical flow-guided self-attention module is configured to determine a third output sequence based on the first output sequence and the bidirectional optical flow sequence;
[0012] The decoder is configured to determine a sixth output sequence based on the second output sequence, the third output sequence, and the bidirectional optical flow sequence;
[0013] The second residual module is configured to map the sixth output sequence and add it to the video frame sequence to be colored to obtain a color video frame sequence;
[0014] The video frame sequence to be colored is a black and white video frame sequence containing two color reference frames at the beginning and end.
[0015] Preferably, the optical flow-guided self-attention module includes:
[0016] Layer normalization module LN, sparse window multi-head self-attention FGSW-MSA and feedforward network FFN;
[0017] The output of the LN module is the input of the FGSW-MSA;
[0018] The output of the FGSW-MSA is added point-by-point to the input of the LN module, and then used as the input of the FFN.
[0019] The output of the FFN is then added point-by-point to the input of the FFN, and the result is used as the output of the optical flow-guided self-attention module.
[0020] Preferably, the encoder includes: an optical flow-guided self-attention module and a patch merging downsampling module.
[0021] Preferably, the encoder is configured to determine a first output sequence and a second output sequence based on the bidirectional optical flow sequence and the first input vector, including:
[0022] The encoder's optical flow-guided self-attention module determines a first output sequence based on the bidirectional optical flow sequence and the first input vector;
[0023] The encoder's patch merging downsampling module determines the second output sequence based on the first output sequence.
[0024] Preferably, the decoder includes: an optical flow-guided self-attention module and a patch-extended upsampling module.
[0025] Preferably, the decoder is configured to determine a sixth output sequence based on the second output sequence, the third output sequence, and the bidirectional optical flow sequence, including:
[0026] The decoder's patch extension upsampling module determines a fourth output sequence based on the third output sequence, and performs channel-level concatenation processing on the fourth output sequence and the second output sequence to obtain a fifth output sequence;
[0027] After performing a convolution operation on the fifth output sequence, the seventh output sequence is obtained;
[0028] The optical flow-guided self-attention module of the decoder determines the sixth output sequence based on the fourth output sequence, the seventh output sequence, and the bidirectional optical flow sequence.
[0029] Preferably, in this application, the GMA module is based on a Transformer.
[0030] In a second aspect, embodiments of this application also provide a video coloring method, including:
[0031] The apparatus of the present invention is used to implement a method for colorizing black and white video into color video.
[0032] The video coloring method and apparatus of the present invention introduce an optical flow estimation module, which uses optical flow to estimate the motion between consecutive frames and assists in the temporal propagation of color. They also use two color reference frames, the first and last frames of the sequence, to guide the coloring of the entire video frame sequence. The first and last frames provide more color information and support bidirectional propagation guided by optical flow, thereby improving the video coloring effect and effectively alleviating the problems of color overflow and color loss caused by excessive motion amplitude.
Description of the Drawings
[0033] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0034] Figure 1 is a schematic diagram of the structure of the video coloring device provided in an embodiment of this application;
[0035] Figure 2 is a schematic diagram of the input and output signal positions of the modules of the video coloring device provided in an embodiment of this application;
[0036] Figure 3 is a schematic diagram of the optical flow-guided self-attention module of the video coloring apparatus provided in an embodiment of this application.
Detailed Implementation
[0037] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0038] The following are explanations of some of the words that appear in the text:
[0039] 1. In the embodiments of this invention, the term "and / or" describes the relationship between associated objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. The character " / " generally indicates that the preceding and following associated objects have an "or" relationship.
[0040] 2. In the embodiments of this application, the term "multiple" refers to two or more, and other quantifiers are similar.
[0041] In the past, a large number of black and white films were produced. To better disseminate traditional culture, video colorization technology was needed for colorizing these older films. Advances in video colorization technology rely on the development of neural network frameworks. This invention proposes a colorization neural network architecture that meets the needs of the film colorization industry. To accurately estimate object displacement information in a video scene, this invention introduces an optical flow estimation module into the video colorization algorithm. Optical flow is used to estimate motion between consecutive frames and to assist in the temporal propagation of color. To stably propagate determined color information into the black and white video, this invention employs video colorization based on reference frames. To overcome the propagation problem of long sequence frames, this invention uses the first and last two color reference frames to guide the entire video frame sequence for colorization. These first and last frames provide more color information and perform bidirectional propagation guided by optical flow, improving the video colorization effect and effectively mitigating color overflow and loss caused by excessive motion.
[0042] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0043] It should be noted that the order in which the embodiments of this application are presented only represents the chronological order of the embodiments and does not represent the superiority or inferiority of the technical solutions provided by the embodiments.
[0044] Referring to Figure 1, an embodiment of this application provides a schematic diagram of a video coloring device. As shown in Figure 1, the device includes the following modules:
[0045] Global Motion Aggregation (GMA) module, first residual module, encoder, optical flow guided self-attention module, decoder and second residual module;
[0046] The inputs to the GMA module and the first residual module are the video frame sequences to be colored;
[0047] The GMA module is configured to determine a bidirectional optical flow sequence based on the sequence of video frames to be colored, the bidirectional optical flow sequence serving as an input to the encoder, the optical flow-guided self-attention module, and the decoder;
[0048] The first residual module is configured to determine a first input vector based on the sequence of video frames to be colored;
[0049] The encoder is configured to determine a first output sequence and a second output sequence based on the bidirectional optical flow sequence and the first input vector;
[0050] The optical flow-guided self-attention module is configured to determine a third output sequence based on the first output sequence and the bidirectional optical flow sequence;
[0051] The decoder is configured to determine a sixth output sequence based on the second output sequence, the third output sequence, and the bidirectional optical flow sequence;
[0052] The second residual module is configured to map the sixth output sequence and add it to the video frame sequence to be colored to obtain a color video frame sequence;
[0053] The video frame sequence to be colored is a black and white video frame sequence containing two color reference frames at the beginning and end.
[0054] The coloring framework in this invention is based on the Transformer network.
[0055] As shown in Figure 1, in this invention, the Global Motion Aggregation (GMA) module is used to estimate optical flow. The input is a black and white sequence of frames and the output is a bidirectional optical flow sequence. The input black and white sequence is the sequence of video frames to be colored.
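As an illustration of what a bidirectional optical flow sequence may look like, the following minimal sketch pairs adjacent frames in both temporal directions. It assumes that flow is estimated only between neighboring frames, which the text does not state explicitly. The names `bidirectional_flows`, `estimate_flow`, and `zero_flow` are hypothetical, and `zero_flow` is only a placeholder for a real estimator such as GMA.

```python
from typing import Callable, List, Sequence, Tuple

Flow = List[List[Tuple[float, float]]]  # H x W field of (dx, dy) displacements


def bidirectional_flows(frames: Sequence, estimate_flow: Callable) -> Tuple[list, list]:
    """Return (forward, backward) flow lists for a frame sequence.

    forward[i]  is the flow from frame i     to frame i + 1
    backward[i] is the flow from frame i + 1 to frame i
    Assumption: only adjacent frames are paired.
    """
    pairs = range(len(frames) - 1)
    forward = [estimate_flow(frames[i], frames[i + 1]) for i in pairs]
    backward = [estimate_flow(frames[i + 1], frames[i]) for i in pairs]
    return forward, backward


def zero_flow(src, dst) -> Flow:
    """Hypothetical stand-in for GMA: reports no motion at any pixel."""
    return [[(0.0, 0.0) for _ in row] for row in src]


if __name__ == "__main__":
    frames = [[[0.0] * 3 for _ in range(2)] for _ in range(4)]  # 4 frames, 2 x 3 each
    fwd, bwd = bidirectional_flows(frames, zero_flow)
    print(len(fwd), len(bwd))  # one flow field per adjacent pair, per direction
```

A sequence of T frames yields T - 1 flow fields in each direction under this assumption.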
[0056] The first residual module is used to map the input sequence to an input vector.
[0057] The encoder includes the optical flow-guided self-attention module (Flow-Guided Attention Block, FGAB) and the Patch Merging downsampling module.
[0058] The decoder includes an optical flow guided self-attention module (FGAB) and a patch expanding upsampling module.
[0059] The second residual module is used to map the features output by the decoder. The mapping result is added point by point to the input sequence matrix to obtain the final color video frame sequence.
[0060] In this invention, the positions of the input and output signals of each module are as shown in Figure 2. The sequence of video frames to be colored serves as the input to the GMA module and the first residual module, and also as the input to the second residual module for matrix point-by-point addition processing;
[0061] The output of the GMA module is a bidirectional optical flow sequence, which serves as the input to the encoder, FGAB, and decoder.
[0062] The output of the first residual module is the first input vector, which is then used as the input to the encoder.
[0063] The output of the encoder's FGAB module is the second output sequence, which is concatenated at the channel level with the fourth output sequence produced by the decoder's patch expanding upsampling module.
[0064] The output of the encoder's patch-merging downsampling module is used as the first output sequence and is then used as the input to the FGAB module.
[0065] The output of the FGAB module is the third output sequence, which is then used as the input to the decoder.
[0066] The decoder outputs the sixth output sequence, which is then used as the input to the second residual module.
[0067] The output of the second residual module is added point by point to the video frame sequence to be colored to obtain the color video frame sequence.
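The signal flow described above can be sketched as plain function wiring. This is only an illustrative sketch: every module is passed in as a hypothetical callable, the dictionary keys are invented names, and the numbering of the output sequences follows paragraphs [0063] and [0064] (encoder FGAB output is the second output sequence, patch merging output is the first). The tracing stubs return nested tuples so that the wiring can be inspected without any real network.

```python
def colorize(frames, m):
    """Wire the modules together; `m` maps hypothetical module names to callables."""
    flows = m["gma"](frames)                       # bidirectional optical flow sequence
    x = m["residual_in"](frames)                   # first input vector
    second = m["enc_fgab"](x, flows)               # encoder FGAB output (skip connection)
    first = m["patch_merge"](second)               # encoder downsampled output
    third = m["mid_fgab"](first, flows)            # standalone FGAB between encoder and decoder
    fourth = m["patch_expand"](third)              # decoder upsampled output
    fifth = m["concat"](fourth, second)            # channel-level concatenation
    seventh = m["conv"](fifth)                     # convolution on the concatenated features
    sixth = m["dec_fgab"](fourth, seventh, flows)  # decoder FGAB output
    color = m["add"](m["residual_out"](sixth), frames)  # point-by-point addition with the input
    return {"flows": flows, "first": first, "second": second, "third": third,
            "fourth": fourth, "fifth": fifth, "sixth": sixth,
            "seventh": seventh, "color": color}


def tracing_modules():
    """Build stub modules that record their name and arguments as a tuple."""
    names = ["gma", "residual_in", "enc_fgab", "patch_merge", "mid_fgab",
             "patch_expand", "concat", "conv", "dec_fgab", "residual_out", "add"]
    return {n: (lambda *args, _n=n: (_n,) + args) for n in names}


if __name__ == "__main__":
    out = colorize("frames", tracing_modules())
    print(out["color"][0])  # name of the last operation applied
```

Swapping the stubs for real layers would keep the same wiring; only the callables change.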
[0068] As shown in Figure 3, the optical flow-guided self-attention module includes:
[0069] Layer normalization module LN, sparse window multi-head self-attention FGSW-MSA and feedforward network FFN;
[0070] The output of the LN module is the input of the FGSW-MSA;
[0071] The output of the FGSW-MSA is added point-by-point to the input of the LN module, and then used as the input of the FFN.
[0072] The output of the FFN is then added point-by-point to the input of the FFN, and the result is used as the output of the optical flow-guided self-attention module.
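The two residual connections of the optical flow-guided self-attention module can be sketched as follows. This is a minimal illustration, not the patented implementation: `attention` and `ffn` are hypothetical placeholder callables standing in for FGSW-MSA and the feedforward network, and the layer normalization shown here omits the learnable scale and shift that practical implementations usually include.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token (a list of floats) to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_pointwise(a, b):
    """Element-wise sum of two equally shaped lists of tokens."""
    return [[u + v for u, v in zip(ta, tb)] for ta, tb in zip(a, b)]


def fgab(tokens, flows, attention, ffn):
    """tokens: list of feature vectors; attention and ffn are placeholder callables."""
    normed = [layer_norm(t) for t in tokens]   # LN output feeds the attention
    attended = attention(normed, flows)        # stands in for FGSW-MSA
    y = add_pointwise(tokens, attended)        # attention output + LN input
    return add_pointwise(y, ffn(y))            # FFN output + FFN input


if __name__ == "__main__":
    zeros = lambda toks, flows: [[0.0] * len(t) for t in toks]
    double = lambda toks: [[2.0 * v for v in t] for t in toks]
    print(fgab([[1.0, 2.0]], None, zeros, double))
```

With an attention stub that returns zeros, the block reduces to `y + ffn(y)`, which makes the second residual path easy to check in isolation.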
[0073] As a preferred example, the encoder includes: an optical flow-guided self-attention module and a patch merging downsampling module.
[0074] As an optional example, the encoder is configured to determine a first output sequence and a second output sequence based on the bidirectional optical flow sequence and the first input vector, including:
[0075] The encoder's optical flow-guided self-attention module determines a first output sequence based on the bidirectional optical flow sequence and the first input vector;
[0076] The encoder's patch merging downsampling module determines the second output sequence based on the first output sequence.
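The text does not specify how patch merging downsampling is carried out. One common form, used in Swin-style Transformers, gathers each 2 x 2 neighborhood and concatenates its features along the channel axis, halving the spatial resolution. The sketch below assumes that form and omits the linear projection that usually follows; the function name `patch_merge` is hypothetical.

```python
def patch_merge(feat):
    """Merge 2 x 2 neighborhoods: (H x W x C) nested lists -> (H/2 x W/2 x 4C).

    Assumption: Swin-style merging without the trailing linear projection.
    """
    h, w = len(feat), len(feat[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # Concatenate the four neighbors' channel vectors into one vector.
            row.append(feat[i][j] + feat[i + 1][j] + feat[i][j + 1] + feat[i + 1][j + 1])
        out.append(row)
    return out


if __name__ == "__main__":
    feat = [[[1], [2]],
            [[3], [4]]]           # 2 x 2 positions, 1 channel each
    print(patch_merge(feat))      # a single position holding 4 channels
```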
[0077] As an optional example, the decoder includes: an optical flow-guided self-attention module and a patch-extended upsampling module.
[0078] As an optional example, the decoder is configured to determine a sixth output sequence based on the second output sequence, the third output sequence, and the bidirectional optical flow sequence, including:
[0079] The decoder's patch extension upsampling module determines a fourth output sequence based on the third output sequence, and performs channel-level concatenation processing on the fourth output sequence and the second output sequence to obtain a fifth output sequence;
[0080] After performing a convolution operation on the fifth output sequence, the seventh output sequence is obtained;
[0081] The optical flow-guided self-attention module of the decoder determines the sixth output sequence based on the fourth output sequence, the seventh output sequence, and the bidirectional optical flow sequence.
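The decoder's fusion steps (upsampling, channel-level concatenation, convolution) can be illustrated on nested lists. This sketch makes several assumptions that the text does not state: patch expanding is modeled as redistributing four channel groups over a 2 x 2 neighborhood, and the convolution is taken to be a 1 x 1 convolution, i.e. a per-position linear map. All function names are hypothetical.

```python
def patch_expand(feat):
    """(H x W x 4C) -> (2H x 2W x C) by spreading channel groups over 2 x 2 blocks."""
    h, w = len(feat), len(feat[0])
    c = len(feat[0][0]) // 4
    out = [[None] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            v = feat[i][j]
            out[2 * i][2 * j] = v[0:c]
            out[2 * i + 1][2 * j] = v[c:2 * c]
            out[2 * i][2 * j + 1] = v[2 * c:3 * c]
            out[2 * i + 1][2 * j + 1] = v[3 * c:4 * c]
    return out


def concat_channels(a, b):
    """Channel-level concatenation of two feature maps with equal H and W."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def conv1x1(feat, weight):
    """Apply a C_out x C_in weight matrix independently at every position."""
    return [[[sum(wk * xk for wk, xk in zip(wrow, px)) for wrow in weight]
             for px in row] for row in feat]


if __name__ == "__main__":
    third = [[[1, 3, 2, 4]]]                     # 1 x 1 positions, 4 channels
    second = [[[10], [20]], [[30], [40]]]        # skip features, 2 x 2 x 1
    fourth = patch_expand(third)                 # 2 x 2 x 1
    fifth = concat_channels(fourth, second)      # 2 x 2 x 2
    seventh = conv1x1(fifth, [[1.0, 0.5]])       # back to 1 channel
    print(seventh)
```

The concatenation only works because the upsampled features and the skip features share the same spatial size, which is why the skip connection is taken before downsampling.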
[0082] The video colorization apparatus of this invention operates as follows: The overall input is a sequence of black and white video frames containing two color reference frames (first and last). This sequence is then processed by an optical flow-guided video colorization Transformer network, ultimately outputting a sequence of color video frames. The Global Motion Aggregation (GMA) module estimates the optical flow, taking the black and white sequence as input and outputting a bidirectional optical flow sequence. This module, based on the Transformer method, can discover long-term dependencies between frames and perform global information aggregation on the corresponding motion features. The first residual module maps the input sequence to an input vector. This input vector and the bidirectional optical flow sequence are input into the encoder, and its output is processed by a separate FGAB before entering the decoder. The encoder consists of an FGAB and a Patch Merging downsampling block, while the decoder consists of a Patch Expanding upsampling block and an FGAB. The second residual module maps the features output by the decoder, and the mapped result is added point-by-point to the input sequence to obtain the final color video frame sequence.
[0083] The workflow of the optical flow-guided self-attention module shown in Figure 3 is as follows: Because the receptive field of previous window-based Transformers is limited, their attention modules could only aggregate local information. In this framework, the attention module FGAB, which incorporates motion estimation information, is crucial for colorizing large-motion videos. As shown in Figure 1, FGAB consists of a layer normalization (LN), an optical flow-guided sparse window multi-head self-attention (FGSW-MSA), and a feedforward network (FFN). In FGSW-MSA, the attention query and value are calculated from the input vector. Additionally, optical flow cues provide prior information about self-similar and highly correlated frames, and the key is sampled from the corresponding patch for attention computation. This self-attention mechanism preserves the prior information of the original image, and the features captured by this framework are spatially global but temporally local.
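The text does not detail how keys are sampled under optical flow guidance. One plausible reading is that, for a query position in one frame, the flow vector points to the corresponding location in a neighboring frame, and keys are taken from a small window around that location. The sketch below illustrates only this sampling step under that assumption; the function name, the `radius` parameter, and the rounding and clamping choices are all hypothetical.

```python
def flow_guided_positions(y, x, flow, radius=1):
    """Return key positions in the neighboring frame for the query at (y, x).

    flow[y][x] is an assumed (dx, dy) displacement toward the neighboring frame.
    The window is centered on the flow-displaced location, clamped to the frame.
    """
    h, w = len(flow), len(flow[0])
    dx, dy = flow[y][x]
    cy = min(max(int(round(y + dy)), 0), h - 1)
    cx = min(max(int(round(x + dx)), 0), w - 1)
    positions = []
    for yy in range(cy - radius, cy + radius + 1):
        for xx in range(cx - radius, cx + radius + 1):
            if 0 <= yy < h and 0 <= xx < w:   # drop positions outside the frame
                positions.append((yy, xx))
    return positions


if __name__ == "__main__":
    flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
    flow[1][1] = (2.0, 1.0)                   # this pixel moved right by 2 and down by 1
    print(flow_guided_positions(1, 1, flow))  # window around (2, 3), not around (1, 1)
```

Because the window follows the motion instead of staying at the query's own coordinates, a fast-moving object can still attend to its own pixels in the neighboring frame.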
[0084] In this invention, a Transformer structure, which has excellent performance in capturing long-range spatial dependencies, is used instead of a convolutional neural network. First, the Transformer can learn the non-local self-similarity of frames. Second, the framework of this invention tightly couples the optical flow estimation module with the subsequent colorization network. A Global Motion Aggregation (GMA) module is used to estimate optical flow. GMA, a Transformer-based method, can discover long-term dependencies between frames and aggregate corresponding motion features globally. The GMA module improves the accuracy of large motion estimation, thereby further improving the colorization results. Then, to overcome the limitations of existing optical flow estimation modules and alignment strategies, an optical flow-guided Transformer is introduced to tightly couple optical flow information. Utilizing a self-attention mechanism, prior information of the original image can be preserved, and the features captured by this framework are spatially global but temporally local.
[0085] In this invention, from the perspective of optimizing the training of the optical flow-guided video coloring neural network, the performance of the model can be improved through an edge loss function to suppress color overflow at edges. To better enhance the blurred edges and texture information in the coloring results, this invention utilizes the Sobel operator to detect image edges and designs a loss function to further enhance blurred edges and textures. The optical flow-guided video coloring network effectively solves the temporal inconsistency problem in video coloring networks and alleviates the difficulty of coloring large-motion videos.
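The exact form of the edge loss is not given in the text. As a minimal sketch, one possible form compares Sobel gradient magnitudes of the predicted and target images with a mean absolute difference. The L1 comparison, the single-channel input, and the unpadded border handling are assumptions made for illustration; `sobel_magnitude` and `edge_loss` are hypothetical names.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude at interior pixels of an H x W image (no padding)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def edge_loss(pred, target):
    """Mean absolute difference between the Sobel edge maps of two images."""
    ep, et = sobel_magnitude(pred), sobel_magnitude(target)
    diffs = [abs(a - b) for rp, rt in zip(ep, et) for a, b in zip(rp, rt)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    edge = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]   # vertical edge on the right
    print(edge_loss(flat, edge))
```

A prediction whose edges are blurred or shifted relative to the target produces a larger value, so minimizing a term of this kind penalizes color bleeding across object boundaries.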
[0086] In this invention, first, a Transformer, which performs well at capturing long-range spatial dependencies, is used to learn the non-local self-similarity of frames. Then, a global motion aggregation optical flow module is employed to establish relationships between frames and aggregate global information of corresponding motion features, improving the accuracy of large motion estimation and colorization results. Finally, to overcome the limitations of previous optical flow estimation modules and alignment strategies, an optical flow-guided self-attention mechanism is adopted, improving the accuracy of optical flow results in fast-moving regions.
[0087] This invention incorporates an optical flow-guided self-attention module. Previous work simply concatenated extracted optical flow information with features for convolution or other operations. Because the receptive field of traditional window-based Transformers is limited, attention blocks can only aggregate local information. Combining the attention module with motion estimation information is crucial for achieving colorization of high-motion videos. This invention proposes using optical flow as guidance to sample sparse features from spatiotemporal pixels to improve video colorization performance.
[0088] Based on the same inventive concept, this invention also provides a video coloring method, which uses the video coloring apparatus of this invention to colorize a black-and-white video into a color video.
[0089] It should be noted that the device and method of the present invention belong to the same inventive concept, solve the same technical problem, and achieve the same technical effect. The device can implement all methods, and the similarities will not be repeated.
[0090] This application also proposes a processor-readable storage medium. This processor-readable storage medium stores a computer program, which, when executed by a processor, implements any of the video colorization methods in Embodiment 1.
[0091] It should be noted that the division of units in the embodiments of this application is illustrative and only represents one logical functional division. In actual implementation, other division methods may be used. Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated units described above can be implemented in hardware or as software functional units.
[0092] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
[0093] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0094] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0095] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A video coloring device, characterized in that it includes: Global Motion Aggregation (GMA) module, first residual module, encoder, optical flow guided self-attention module, decoder and second residual module; The inputs to the GMA module and the first residual module are the video frame sequences to be colored; The GMA module is configured to determine a bidirectional optical flow sequence based on the sequence of video frames to be colored, the bidirectional optical flow sequence serving as an input to the encoder, the optical flow-guided self-attention module, and the decoder; The first residual module is configured to determine a first input vector based on the sequence of video frames to be colored; The encoder is configured to determine a first output sequence and a second output sequence based on the bidirectional optical flow sequence and the first input vector; The optical flow-guided self-attention module is configured to determine a third output sequence based on the first output sequence and the bidirectional optical flow sequence; The decoder is configured to determine a sixth output sequence based on the second output sequence, the third output sequence, and the bidirectional optical flow sequence; The second residual module is configured to map the sixth output sequence and perform matrix point-by-point addition with the video frame sequence to be colored to obtain a color video frame sequence; The video frame sequence to be colored is a black and white video frame sequence containing two color reference frames at the beginning and end.
The encoder includes: Optical flow-guided self-attention module and patch merging downsampling module; The encoder is configured to determine a first output sequence and a second output sequence based on the bidirectional optical flow sequence and the first input vector, including: The encoder's optical flow-guided self-attention module determines a first output sequence based on the bidirectional optical flow sequence and the first input vector; The encoder's patch merging downsampling module determines the second output sequence based on the first output sequence; The decoder includes: Optical flow-guided self-attention module and patch-extended upsampling module.
2. The apparatus according to claim 1, characterized in that, The optical flow-guided self-attention module includes: Layer normalization module LN, sparse window multi-head self-attention FGSW-MSA and feedforward network FFN; The output of the LN module is the input of the FGSW-MSA; The output of the FGSW-MSA is added point-by-point to the input of the LN module, and then used as the input of the FFN. The output of the FFN is then added point-by-point to the input of the FFN, and the result is used as the output of the optical flow-guided self-attention module.
3. The apparatus according to claim 1, characterized in that, The decoder is configured to determine a sixth output sequence based on the second output sequence, the third output sequence, and the bidirectional optical flow sequence, including: The decoder's patch extension upsampling module determines a fourth output sequence based on the third output sequence, and performs channel-level concatenation processing on the fourth output sequence and the second output sequence to obtain a fifth output sequence; After performing a convolution operation on the fifth output sequence, the seventh output sequence is obtained; The optical flow-guided self-attention module of the decoder determines the sixth output sequence based on the fourth output sequence, the seventh output sequence, and the bidirectional optical flow sequence.
4. The apparatus according to claim 1, characterized in that, The GMA module is based on the Transformer.
5. A video colorization method, characterized in that, A method for colorizing a black-and-white video into a color video is implemented using the apparatus as described in any one of claims 1 to 4.