A variable frame rate video generation method based on optical flow estimation

By introducing the OpFode-Net model for optical flow estimation, combining neural ODE and ConvGRU, and using hybrid masking and adversarial learning to optimize the loss function, the problem of video generation under flexible frame rates in existing technologies is solved, achieving high-quality video frame interpolation and prediction effects.

CN116708869B · Active · Publication Date: 2026-06-26 · TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
Filing Date
2023-04-12
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing video generation models are inadequate in handling flexible frame rates and struggle to effectively handle complex motion and lighting changes, resulting in poor video interpolation and prediction performance.

Method used

We employ the OpFode-Net model based on optical flow estimation, combined with neural ODE and ConvGRU for dynamic modeling, introduce optical flow supervision information, improve generation quality through hybrid masking and adversarial learning, and optimize model training using multiple loss functions.

Benefits of technology

It achieves high-quality video frame generation at any frame rate, solves problems such as occlusion, lighting changes and motion blur, and improves the effect of video frame interpolation and prediction.


Abstract

A variable frame rate video generation method based on optical flow estimation, which introduces optical flow supervision information into an OpFode-Net model, the OpFode-Net model comprising an encoder-decoder structure; the encoder uses an ODE-ConvGRU to embed the input video sequence $X_T$ into a hidden state $h_T$, wherein the ODE-ConvGRU uses a ConvGRU as a node of a neural ODE and embeds it into the neural ODE to realize dynamic modeling of the video sequence; the decoder starts from $h_T$ and uses an ODE solver to generate new video frames at any time step S. The method achieves more accurate prediction results and optimal performance in video interpolation and video prediction tasks.

Description

Technical Field

[0001] This invention relates to the field of computer vision, and in particular to a variable frame rate video generation method based on optical flow estimation.

Background Technology

[0002] Video frame prediction (VFP) and video frame interpolation (VFI) are important research directions in video generation and are currently among the hot topics in computer vision. The goal of video prediction is to predict future video frames given past video frames; the goal of video interpolation is to synthesize several new video frames between two adjacent frames of the original video.

[0003] Video prediction has applications in autonomous driving, robot navigation, and human-computer interaction, providing more efficient and convenient processing methods for these fields. Video frame interpolation refers to inserting new video frames between given video frames. Applications of video frame interpolation include video playback, video streaming, and video compression. For example, in video playback, frame interpolation can increase the frame rate, making playback smoother; in video streaming, it can transmit higher-quality video with limited bandwidth.

[0004] Current research primarily employs autoregressive or spatiotemporal decoupling methods for video prediction, while optical flow mapping is the main approach for video frame interpolation. We note that optical flow can be considered a reasonable way to describe motion information between video frames. However, research in the field of video prediction rarely uses optical flow as a motion description method. We believe this is because optical flow methods face inherent problems, such as occlusion, lighting variations, object deformation, and motion blur, making it difficult to handle complex motions in complex environments using only optical flow mapping methods. Furthermore, optical flow describes the instantaneous motion between video frames, which is insufficient to effectively represent long-term, multimodal motion.

[0005] Video frame interpolation and video prediction are cutting-edge technologies applying deep learning to the field of computer vision. With the increasing demands for video quality, high-definition video applications are becoming more and more common. Therefore, the research and application of video frame interpolation and prediction are particularly important and urgent.

[0006] Video frame interpolation has wide applications in computer vision, such as slow-motion video, novel view synthesis, frame rate conversion, and frame restoration in video streams. High frame rate video can avoid common artifacts such as temporal jitter and motion blur, thus making it more visually appealing to viewers.

[0007] Video frame interpolation aims to synthesize an intermediate frame between two consecutive video frames, which can be used to improve frame rate and enhance visual quality. Video frame interpolation is challenging due to the complex and large nonlinear motion and lighting variations in the real world. Currently, commonly used methods for video frame interpolation in existing technologies include:

[0008] (1) Warp the input frames according to the approximate optical flow;

[0009] (2) Use a convolutional neural network (CNN) to fuse and refine the warped frames.

[0010] These methods for video frame interpolation suffer from problems such as complex preprocessing and complex models.

[0011] Video prediction is the process of reasonably inferring the next video frame based on known consecutive video frames. The difference between video interpolation and video prediction lies only in the temporal distribution of the results; the former is interpolation, while video prediction is extrapolation (extending).

[0012] Video generation models typically operate under the assumption of a fixed frame rate, which often results in unsatisfactory performance when dealing with flexible frame rates. Existing video generation models have limitations in their ability to handle arbitrary frame rates.

[0013] It should be noted that the information disclosed in the background section above is only for understanding the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0014] This invention provides a variable frame rate video generation method based on optical flow estimation to achieve more accurate prediction results and achieve optimal performance in video interpolation and video prediction tasks.

[0015] To achieve the above objectives, the present invention adopts the following technical solution:

[0016] A variable frame rate video generation method based on optical flow estimation incorporates optical flow supervision information into the OpFode-Net model, which includes an encoder-decoder structure. The encoder uses ODE-ConvGRU to embed the input video sequence $X_T$ into a hidden state $h_T$, wherein the ODE-ConvGRU uses ConvGRU as a node of the neural ODE and embeds it into the neural ODE to achieve dynamic modeling of video sequences; starting from $h_T$, the decoder uses the ODE solver to generate new video frames at any time step S.

[0017] Furthermore, two discriminators are used, one at the image level and the other at the video sequence level, to improve the output quality of spatial appearance and temporal dynamics.

[0018] Furthermore, OpFode-Net combines neural ODE and ConvGRU, utilizing the ODE solver to achieve video generation at arbitrary time steps, while introducing adversarial learning to improve generation quality.

[0019] Furthermore, the OpFode-Net model is formalized as follows:

[0020]

[0021]

[0022] Let $h_0$ be the initial value of the hidden state of the model. Solving with the ODESolve module yields the hidden state at time $t_i$. The video frame at time $t_i$ is encoded by encoder E, the hidden state is then updated by the ConvGRUCell module, and the result is recorded as the updated hidden state at time $t_i$. This continues until the hidden states corresponding to all input times $t_1, t_2, \ldots, t_T$ are obtained. The initial state for frame generation is then selected; one choice is used for video frame interpolation tasks and another for video prediction tasks. Starting from that initial state, ODESolve is applied again to obtain the hidden states corresponding to the output times $s_1, s_2, \ldots, s_S$.

[0023]

[0024] Decoder G decodes the hidden states of two adjacent frames (the preceding and the following one) to obtain the optical flow, the difference map, and the hybrid mask. The final target frame is then synthesized by combining the optical flow mapping with the hybrid mask.

[0025]

[0026]

[0027] Furthermore, the method includes the following specific workflow:

[0028] Step 1: First, the image frame at time $t_1$ is selected and given to encoder E; after one iteration of the encoder, the hidden state at time $t_1$ is obtained. After the video frame is encoded by encoder E, the hidden state is updated by the ConvGRUCell module. The updated hidden state at time $t_1$ serves as one of the inputs to the next step;

[0029] Next, the image frame at time $t_2$ is selected and given to encoder E; after iterative processing by the encoder, the hidden state at time $t_2$ is obtained. Combining it with the result obtained at time $t_1$ yields the updated hidden state at time $t_2$.

[0030] The above operation is repeated n times until the hidden states at time $t_T$ are obtained.

[0031] Step 2: First, the hidden state obtained by the encoder at time $t_T$ is given directly to the generator G. The same state is also advanced by ODE-Solve, and the result is likewise given to G. This yields three results: the optical flow, the difference map, and the hybrid mask.

[0032] Next, while the advanced state is input into G, it is also passed along another path to the next G: it is advanced again by ODE-Solve and the result is also given to G, yielding three more results (optical flow, difference map, and hybrid mask).

[0033] The above operation is repeated n times until the results at time $s_S$ are obtained.

[0034] Furthermore, a hybrid mask based on optical flow divergence is used to blend the optical-flow-mapped image with the difference image provided by the reconstruction network, thereby obtaining a higher quality frame generation result; the hybrid mask is calculated from the optical flow divergence map.

[0035]

[0036] where netG is the hybrid mask generation module in the decoding network; $\mathrm{Mask}^-$ is the hybrid mask generated directly by the decoding network without optical flow refinement; and Flow is the optical flow map generated by the optical flow estimation network.

[0037] Furthermore, the model transforms the variable frame rate video generation problem into a video prediction or video frame interpolation problem for processing:

[0038] Let $X_t \in \mathbb{R}^{w \times h \times c}$ denote the t-th frame of the video sequence, where w, h, and c are the width, height, and number of channels of the video frame, respectively. A video sequence of length T from time $t_1$ to $t_T$ is represented as $X_T = (X_{t_1}, X_{t_2}, \ldots, X_{t_T})$, where $t_1 < t_2 < \cdots < t_T$. Given the input sequence $X_T$ and the time points to be solved $(s_1, s_2, \ldots, s_S)$, where $s_1 < s_2 < \cdots < s_S$, the S-frame video sequence at the corresponding times is estimated;

[0039] Depending on the values of $(s_1, s_2, \ldots, s_S)$: when $t_T < s_1$, the video sequence to be estimated follows the input video sequence X, and the variable frame rate video generation problem is transformed into a video prediction problem; when $t_1 < s_i < t_T$ and $s_i \neq t_j$ (i = 1, 2, ..., S; j = 1, 2, ..., T), the video sequence to be estimated lies within the input video sequence X, and the variable frame rate video generation problem is transformed into a video frame interpolation problem.
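The case distinction above can be sketched in a few lines of Python. This is a minimal illustration, not part of the claimed method: the function name `classify_task` and the extra "mixed" label for requests that fit neither case are assumptions introduced here.

```python
from typing import Sequence


def classify_task(input_times: Sequence[float], target_times: Sequence[float]) -> str:
    """Return 'prediction', 'interpolation', or 'mixed' for the requested times.

    Both sequences are assumed to be sorted in increasing order.
    """
    t_first, t_last = input_times[0], input_times[-1]
    known = set(input_times)
    # Prediction: every requested time lies after the last input time (t_T < s_1).
    if all(s > t_last for s in target_times):
        return "prediction"
    # Interpolation: every requested time lies strictly inside (t_1, t_T)
    # and does not coincide with an input time (s_i != t_j).
    if all(t_first < s < t_last and s not in known for s in target_times):
        return "interpolation"
    # Anything else is not covered by the two cases described in the text.
    return "mixed"


print(classify_task([0, 1, 2, 3], [4, 5.5]))          # requested times after t_T
print(classify_task([0, 1, 2, 3], [0.5, 1.25, 2.5]))  # requested times between inputs
```

Because the requested times are arbitrary real numbers rather than frame indices, the same check covers any output frame rate.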

[0040] The objective function L used to train the model is a linear weighted combination of the following loss functions:

[0041]

[0042] where $L_{flow}$ is the optical flow loss function, $L_{recon}$ is the reconstruction loss function, $L_{diff}$ is the difference loss function, and the two adversarial terms are the loss functions that model the image discriminator $D_{img}$ and the sequence discriminator $D_{seq}$; $\lambda_{diff}$, $\lambda_{img}$, and $\lambda_{seq}$ are hyperparameters that control the relative importance of the different losses, and $\lambda_{flow}$ is the hyperparameter that controls the optical flow loss.

[0043] L flow The optical flow loss function includes optical flow reconstruction loss and optical flow smoothing loss;

[0044] The optical flow reconstruction loss and the smoothing loss are calculated by the following two formulas:

[0045]

[0046]

[0047] where the two fields compared are the generated optical flow field and the real optical flow field; $k \in \{x, y\}$ denotes taking the gradient of the optical flow field along the x-axis and y-axis; and $s_i$ is any one of the time points to be solved $(s_1, s_2, \ldots, s_S)$. The final optical flow loss function is the weighted sum of the optical flow reconstruction loss and the smoothing loss, that is:

[0048]-[0049] $L_{flow} = \lambda_{FlowRecon} L_{FlowRecon} + \lambda_{FlowSmooth} L_{FlowSmooth}$

[0050] where $\lambda_{FlowRecon}$ and $\lambda_{FlowSmooth}$ are hyperparameters that weight the two optical flow loss terms.

[0051] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the variable frame rate video generation method.

[0052] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0053] This invention provides a variable frame rate video generation method based on optical flow estimation. Optical flow supervision information is introduced into the OpFode-Net model. This invention proposes the OpFode-Net model by combining the Neural-ODE method with an optical flow estimation network, achieving high-quality video frame generation at any time and any frame rate. It can be used for both video frame interpolation and video prediction tasks. The advantages of this invention are specifically reflected in the following aspects:

[0054] (1) To address the difficulties in optical flow estimation, such as occlusion, illumination changes, and motion blur, the optical flow guidance network netF is introduced into the model, thereby obtaining better optical flow estimation results.

[0055] (2) To address the holes and overlaps caused by optical flow mapping, a hybrid masking method based on optical flow divergence, Mask+, is proposed. Based on this hybrid mask, the image after optical flow mapping is mixed with the difference image given by the reconstruction network, thereby obtaining higher quality frame generation results.

[0056] (3) A FlowLoss term has been added to the loss function to train the aforementioned modules.

[0057] The variable frame rate video generation technology of this invention is expected to be widely used in practical applications.

Description of the Drawings

[0058] Figure 1 shows the structure of the OpFode-Net model, the variable frame rate video generation technology of an embodiment of the present invention.

[0059] Figure 2 shows the hybrid mask based on optical flow divergence in an embodiment of the present invention.

[0060] Figure 3 shows the test results of an embodiment of the present invention on the KTH Action dataset.

Detailed Description

[0061] The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary and not intended to limit the scope and application of the present invention.

[0062] Existing variable frame rate video generation frameworks estimate optical flow with self-supervised methods, and the resulting quality is poor. To address this, the invention constructs a continuous-time video generation model that combines the neural ordinary differential equation model Vid-ODE with pixel-level video processing techniques. This model is named the OpFode-Net model.

[0063] In some embodiments, the present invention incorporates the optical flow guidance network netF into the model.

[0064] In some embodiments, the present invention proposes an occlusion relationship evaluation method, which uses a hybrid mask based on optical flow divergence to solve the problems that optical flow estimation encounters in scenarios such as occlusion, illumination changes, and motion blur.

[0065] In some embodiments, the present invention proposes a novel loss function that takes into account the contextual information of the video to obtain higher quality optical flow estimation results.

[0066] The mathematical theoretical foundation of the constructed model

[0067] A video sequence consists of a series of frames, each composed of pixel values. These frames can be represented as a three-dimensional tensor, where the first and second dimensions represent the width and height of the frame, and the third dimension represents the number of color channels. Thus, the variable frame rate video generation problem can be formally described as follows:

[0068] Let $X_t \in \mathbb{R}^{w \times h \times c}$ denote the t-th frame of the video sequence, where w, h, and c represent the width, height, and number of channels of the video frame, respectively. A video sequence of length T from time $t_1$ to $t_T$ can be represented as $X_T = (X_{t_1}, X_{t_2}, \ldots, X_{t_T})$, where $t_1 < t_2 < \cdots < t_T$. The variable frame rate video generation problem is: given the input sequence $X_T$ and the time points to be solved $(s_1, s_2, \ldots, s_S)$, where $s_1 < s_2 < \cdots < s_S$, estimate the S-frame video sequence at the corresponding times.

[0069] Depending on the values of $(s_1, s_2, \ldots, s_S)$, the variable frame rate video generation problem can be transformed into a video prediction or a video frame interpolation problem.

[0070] When $t_T < s_1$, the video sequence to be estimated follows the input video sequence X, and the problem is transformed into a video prediction problem.

[0071] When $t_1 < s_i < t_T$ and $s_i \neq t_j$ (i = 1, 2, ..., S; j = 1, 2, ..., T), the video sequence to be estimated lies within the input video sequence X, and the problem is transformed into a video frame interpolation problem.

OpFode-Net model structure

[0072] To explore variable frame rate video generation methods based on optical flow estimation and achieve more accurate prediction results, reaching optimal performance in both video frame interpolation and video prediction tasks, we propose a novel technical solution and design an architecture that incorporates optical flow supervision information into the OpFode-Net model. The specific solution and architecture are shown in Figure 1.

[0073] The OpFode-Net model consists of an encoder-decoder structure. First, the encoder uses ODE-ConvGRU to embed the input video sequence $X_T$ into the hidden state $h_T$. The ODE-ConvGRU proposed in this invention combines neural ordinary differential equations (neural ODEs) with ConvGRU: it uses a ConvGRU as a node of the neural ODE and embeds it into the neural ODE to achieve dynamic modeling of video sequences. Then, starting from $h_T$, the decoder uses the ODE solver to generate new video frames at any time step S. To further improve the quality of the output, the embodiment includes two discriminators in the framework, which improve the output through adversarial learning.

[0074] In summary, OpFode-Net combines neural ODE and ConvGRU, utilizing the ODE solver to achieve video generation at arbitrary time steps, while introducing adversarial learning to improve generation quality. The model's formula is as follows:

[0075]

[0076]

[0077] Let $h_0$ be the initial value of the hidden state of the model. Solving with the ODESolve module yields the hidden state at time $t_i$. The video frame at time $t_i$ is encoded by encoder E, the hidden state is then updated by the ConvGRUCell module, and the result is recorded as the updated hidden state at time $t_i$. This continues until the hidden states corresponding to all input times $t_1, t_2, \ldots, t_T$ are obtained. The initial state for frame generation is then selected; one choice is used for video frame interpolation tasks and another for video prediction tasks. Starting from that initial state, ODESolve is applied again to obtain the hidden states corresponding to the output times $s_1, s_2, \ldots, s_S$.

[0078]

[0079] Decoder G decodes the hidden states of two adjacent frames (the preceding and the following one) to obtain the optical flow, the difference map, and the hybrid mask. The final target frame is then synthesized by combining the optical flow mapping with the hybrid mask.

[0080]

[0081]
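The formulas for this part are not reproduced in the text above. One possible reading, pieced together from the prose description and from the Vid-ODE formulation on which the model builds, is sketched below. The symbols $h^{-}_{t_i}$, $f_\theta$, $f_\psi$, $h_{s_0}$, $F$, $D$, $M$, and the warping operator $\mathcal{W}$, as well as the exact blending form in the last line, are assumptions made for illustration and may differ from the patent's own notation.

```latex
\begin{aligned}
h^{-}_{t_i} &= \mathrm{ODESolve}\big(f_\theta,\; h_{t_{i-1}},\; (t_{i-1}, t_i)\big), \\
h_{t_i} &= \mathrm{ConvGRUCell}\big(h^{-}_{t_i},\; E(X_{t_i})\big), \qquad i = 1, \dots, T, \\
h_{s_1}, \dots, h_{s_S} &= \mathrm{ODESolve}\big(f_\psi,\; h_{s_0},\; (s_1, \dots, s_S)\big), \\
F_{s_i},\; D_{s_i},\; M_{s_i} &= G\big(h_{s_i},\; h_{s_{i-1}}\big), \\
\hat{X}_{s_i} &= M_{s_i} \odot \mathcal{W}\big(F_{s_i},\; \hat{X}_{s_{i-1}}\big) + (1 - M_{s_i}) \odot D_{s_i}.
\end{aligned}
```

Here $h_{s_0}$ stands for the initial state selected for generation, and $\odot$ is element-wise multiplication.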

[0082] Workflow

[0083] Step 1: First, the image frame at time $t_1$ is selected and given to encoder E; after one iteration of the encoder, the hidden state at time $t_1$ is obtained. After the video frame is encoded by encoder E, the hidden state is updated by the ConvGRUCell module. The updated hidden state at time $t_1$ serves as one of the inputs to the next step.

[0084] Next, the image frame at time $t_2$ is selected and given to encoder E; after iterative processing by the encoder, the hidden state at time $t_2$ is obtained. Combining it with the result obtained at time $t_1$ yields the updated hidden state at time $t_2$.

[0085] The above operation is repeated n times until the hidden states at time $t_T$ are obtained.

[0086] Step 2: First, the hidden state obtained by the encoder at time $t_T$ is given directly to the generator G. The same state is also advanced by ODE-Solve, and the result is likewise given to G. This yields three results: the optical flow, the difference map, and the hybrid mask.

[0087] Next, while the advanced state is input into G, it is also passed along another path to the next G: it is advanced again by ODE-Solve and the result is also given to G.

[0088] This yields three more results: the optical flow, the difference map, and the hybrid mask.

[0089] The above operation is repeated n times until the results at time $s_S$ are obtained. This is the overall workflow of the proposed model structure.
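The control flow of the two steps can be sketched as follows. This is a toy illustration under stated assumptions, not the model itself: a fixed-step Euler integrator stands in for ODESolve, and the names `ode_solve`, `encode`, `decode_states`, `f_toy`, `cell_toy`, and `embed_toy` are hypothetical stand-ins for the learned ODE dynamics, the ConvGRUCell, and encoder E.

```python
import numpy as np


def ode_solve(f, h, t0, t1, steps=10):
    """Fixed-step Euler integration of dh/dt = f(h) from t0 to t1."""
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h


def encode(frames, times, f_enc, cell, embed, h0):
    """Step 1: evolve the state to each input time, then update it with the frame."""
    h, t_prev = h0, times[0]
    for x, t in zip(frames, times):
        h_minus = ode_solve(f_enc, h, t_prev, t)  # state just before seeing frame x
        h = cell(h_minus, embed(x))               # state updated with the encoded frame
        t_prev = t
    return h


def decode_states(h_start, t_start, target_times, f_dec):
    """Step 2 (state part only): evolve the starting state to every output time."""
    states, h, t_prev = [], h_start, t_start
    for s in target_times:
        h = ode_solve(f_dec, h, t_prev, s)
        states.append(h)
        t_prev = s
    return states


# Toy stand-ins (assumptions): linear decay dynamics, a fixed 50/50 gate,
# and a channel average instead of learned convolutional networks.
f_toy = lambda h: -0.5 * h
cell_toy = lambda h, e: 0.5 * h + 0.5 * e
embed_toy = lambda x: x.mean(axis=-1)

frames = [np.full((4, 4, 3), v, dtype=float) for v in (0.2, 0.4, 0.6)]
h_T = encode(frames, [0.0, 1.0, 2.0], f_toy, cell_toy, embed_toy, np.zeros((4, 4)))
future = decode_states(h_T, 2.0, [2.5, 3.0, 4.5], f_toy)
```

The output times passed to `decode_states` need not be evenly spaced, which is what lets the same loop serve any requested frame rate. Decoding each pair of adjacent states into optical flow, difference map, and mask is left out of the sketch.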

[0090] Blending Mask+

[0091] Starting with optical flow itself, we can identify regions that optical flow cannot handle. Optical flow can be viewed as a 2D vector field. For any video frame, pixels move in the direction of the optical flow, and the divergence of the optical flow describes the pixel aggregation. Where the divergence is less than zero, pixels will overlap; where the divergence is greater than zero, pixels will separate, creating holes. Applying optical flow to generate new images in such locations is unreliable, and divergence indicates these locations. Using this, we can calculate a blending mask based on the optical flow divergence map to blend the optical flow mapping results with the reconstruction results from the generative network.

[0092]

[0093] where netG is the hybrid mask generation module in the decoding network; $\mathrm{Mask}^-$ is the hybrid mask generated directly by the decoding network without optical flow refinement; and Flow is the optical flow map generated by the optical flow estimation network.

[0094] Figure 2 shows the hybrid mask based on optical flow divergence in an embodiment of the present invention.
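The idea of using divergence to flag unreliable regions can be illustrated with NumPy. The exact formula that combines $\mathrm{Mask}^-$ with the divergence map is not given in this text, so the exponential down-weighting in `refine_mask`, the `sharpness` parameter, and all function names below are assumptions chosen only to show the mechanism.

```python
import numpy as np


def flow_divergence(flow):
    """Divergence of a flow field of shape (H, W, 2) holding (u, v) per pixel."""
    du_dx = np.gradient(flow[..., 0], axis=1)  # change of horizontal flow along x
    dv_dy = np.gradient(flow[..., 1], axis=0)  # change of vertical flow along y
    return du_dx + dv_dy


def refine_mask(mask_minus, flow, sharpness=4.0):
    """Lower the warp weight where |divergence| is large (overlaps or holes)."""
    reliability = np.exp(-sharpness * np.abs(flow_divergence(flow)))
    return mask_minus * reliability


def blend(warped, reconstructed, mask_plus):
    """Per-pixel mix of the flow-warped frame and the reconstructed frame."""
    m = mask_plus[..., None]  # broadcast the (H, W) mask over the channel axis
    return m * warped + (1.0 - m) * reconstructed


# A uniform translation has zero divergence, so the mask is left unchanged.
uniform = np.ones((5, 5, 2))
# A flow that expands away from the centre has positive divergence (holes).
ys, xs = np.mgrid[0:5, 0:5].astype(float)
expanding = np.stack([xs - 2.0, ys - 2.0], axis=-1)
mask_uniform = refine_mask(np.ones((5, 5)), uniform)
mask_expanding = refine_mask(np.ones((5, 5)), expanding)
```

In this sketch a mask value near 1 trusts the warped image and a value near 0 falls back to the reconstruction, so regions with strongly positive or negative divergence lean on the reconstruction network.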

[0095] Loss function

[0096] The following describes the loss function used to train the OpFode-Net model:

[0097] Adversarial loss

[0098] The preferred embodiment uses two discriminators to improve output quality, one at the image level and the other at the video sequence level, to improve the output quality of spatial appearance and temporal dynamics.

[0099] The image discriminator $D_{img}$ distinguishes, for each target time step $s_i$, the real image from the generated image. The sequence discriminator $D_{seq}$ distinguishes the real sequence from the generated sequence over all time steps in S.

[0100] $D_{img}$ and $D_{seq}$ are modeled with a method similar to LS-GAN; their loss functions are as follows:

[0101]

[0102]

[0103] where T:S is the union of the time series T and S, and $X_{T:S}$ is the video sequence formed by concatenating the frames from $t_1$ to $t_T$ with the frames from $s_1$ to $s_S$.
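For orientation, the standard least-squares GAN objective that "similar to LS-GAN" refers to can be written as below. This is the generic textbook form with the usual 0.5 scaling, not necessarily the exact loss used for $D_{img}$ and $D_{seq}$; the function names are hypothetical, and the inputs are raw discriminator scores.

```python
import numpy as np


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss: push scores on real data to 1 and on generated data to 0."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * float(np.mean((d_real - 1.0) ** 2)) + 0.5 * float(np.mean(d_fake ** 2))


def lsgan_generator_loss(d_fake):
    """Least-squares loss: push discriminator scores on generated data to 1."""
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * float(np.mean((d_fake - 1.0) ** 2))


# A discriminator that scores real data as 1 and generated data as 0 has zero
# loss, while the generator is penalised for being detected.
perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
fooled_g = lsgan_generator_loss([1.0, 1.0])
```

The same pair of functions would be applied once with per-frame scores (image level) and once with per-sequence scores (sequence level).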

[0104] Reconstruction loss

[0105] The reconstruction loss computes the pixel-level L1 distance between the predicted video frames and the real frames. Its loss function is:

[0106]

[0107] Difference loss

[0108] The difference loss helps the model learn image differences, that is, the pixel differences between adjacent video frames. Its loss function is:

[0109]

[0110] where the target term is the image difference between two adjacent frames.

Optical flow loss FlowLoss

[0111] The preferred embodiment employs an additional loss function, FlowLoss, to optimize the optical flow guidance network. This loss function utilizes optical flow information to constrain motion between adjacent frames in the video, helping to generate smoother and more continuous video. FlowLoss consists of two parts: optical flow reconstruction loss and optical flow smoothing loss.

[0112] The optical flow reconstruction loss computes the L1 distance between the generated optical flow field and the real optical flow field. The optical flow smoothing loss computes the L1 distance between the optical flow field and the optical flow field of adjacent frames, which encourages smoothness of the optical flow field in space and time. The weight of the optical flow loss is controlled by the hyperparameter $\lambda_{flow}$.

[0113] The optical flow reconstruction loss and smoothing loss are calculated by the following two equations:

[0114]

[0115]

[0116] where the two fields compared are the generated optical flow field and the real optical flow field; $k \in \{x, y\}$ denotes taking the gradient of the optical flow field along the x-axis and y-axis; and $s_i$ is any one of the time points to be solved $(s_1, s_2, \ldots, s_S)$. The final FlowLoss is the weighted sum of the optical flow reconstruction loss and the smoothing loss, that is:

[0117] $L_{flow} = \lambda_{FlowRecon} L_{FlowRecon} + \lambda_{FlowSmooth} L_{FlowSmooth}$

[0118] where $\lambda_{FlowRecon}$ and $\lambda_{FlowSmooth}$ are hyperparameters that weight the two optical flow loss terms. Optimizing this objective function enables the decoder to generate more realistic, smooth, and continuous video.
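A minimal NumPy sketch of the two terms and their weighted sum follows. The two component formulas are not reproduced in this text, so the concrete forms here are assumptions: the reconstruction term is a mean L1 distance, and the smoothing term is the mean absolute spatial gradient along x and y (the $k \in \{x, y\}$ part of the description). The temporal part of the smoothing and the default weights are left out or are placeholders.

```python
import numpy as np


def flow_recon_loss(flow_pred, flow_true):
    """Mean L1 distance between generated and reference flow, shape (S, H, W, 2)."""
    return float(np.mean(np.abs(flow_pred - flow_true)))


def flow_smooth_loss(flow_pred):
    """Mean absolute spatial gradient of the flow along x and y."""
    grad_x = np.abs(np.diff(flow_pred, axis=2))  # differences between horizontal neighbours
    grad_y = np.abs(np.diff(flow_pred, axis=1))  # differences between vertical neighbours
    return float(grad_x.mean() + grad_y.mean())


def flow_loss(flow_pred, flow_true, lam_recon=1.0, lam_smooth=0.1):
    """L_flow = lam_recon * L_FlowRecon + lam_smooth * L_FlowSmooth (placeholder weights)."""
    return (lam_recon * flow_recon_loss(flow_pred, flow_true)
            + lam_smooth * flow_smooth_loss(flow_pred))


# A constant flow field is perfectly smooth, so only the reconstruction term remains.
constant = np.ones((2, 4, 4, 2))
loss_constant = flow_loss(constant, np.zeros((2, 4, 4, 2)))

# A horizontal ramp in the u channel has unit gradient along x in that channel only.
ramp = np.zeros((1, 3, 4, 2))
ramp[..., 0] = np.arange(4.0)
smooth_ramp = flow_smooth_loss(ramp)
```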

[0119] Overall losses

[0120] The final objective function L is a linearly weighted combination of the above loss functions, and the final objective function formula is:

[0121]

[0122] where $\lambda_{diff}$, $\lambda_{img}$, and $\lambda_{seq}$ are hyperparameters that control the relative importance of the different losses, and $\lambda_{flow}$ is the hyperparameter that controls the optical flow loss. By optimizing this objective function through end-to-end training, the decoder of the preferred embodiment of the present invention can generate realistic, smooth, high-quality video frames while maintaining temporal and spatial consistency.
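The linear weighting can be sketched as follows. Since the formula itself is not reproduced in this text, two things are assumptions: the reconstruction term is given weight 1 because no $\lambda_{recon}$ is named, and the L1 forms of `recon_loss` and `diff_loss` follow the prose description of those two losses. The weight values in the usage lines are placeholders, not values from the patent.

```python
import numpy as np


def l1(a, b):
    """Mean absolute (pixel-level L1) distance."""
    return float(np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))


def recon_loss(pred_frames, real_frames):
    """L1 distance between predicted and real frames, shape (S, H, W, C)."""
    return l1(pred_frames, real_frames)


def diff_loss(pred_diffs, real_frames):
    """L1 distance between predicted difference maps and real adjacent-frame differences."""
    real = np.asarray(real_frames, dtype=float)
    return l1(pred_diffs, real[1:] - real[:-1])


def total_loss(terms, weights):
    """Linear weighted combination of the individual loss terms."""
    total = terms["recon"]  # assumed weight 1: no lambda_recon appears in the text
    for name in ("diff", "img", "seq", "flow"):
        total += weights[name] * terms[name]
    return total


# Placeholder loss values and weights, for illustration only.
terms = {"recon": 1.0, "diff": 2.0, "img": 3.0, "seq": 4.0, "flow": 5.0}
weights = {"diff": 0.5, "img": 0.5, "seq": 0.5, "flow": 0.5}
objective = total_loss(terms, weights)
```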

[0123] Example

[0124] We used the proposed OpFode-Net model to perform video frame interpolation and video prediction tasks on the KTH Action and Penn Action datasets, and compared the experimental data. We evaluated the model using three metrics: SSIM, PSNR, and LPIPS. These metrics are widely used in image processing, computer vision, and image quality assessment.

[0125] SSIM: the Structural Similarity Index is a metric designed to evaluate the structural similarity and preservation of detail between two images. Its values range from -1 to 1, where 1 indicates perfect similarity and -1 indicates complete dissimilarity. In our experiments, a higher SSIM indicates better model performance.

[0126] LPIPS: Learned Perceptual Image Patch Similarity, is a perceptual image patch similarity metric used to evaluate the perceptual similarity of images. It is calculated based on image features using a deep neural network, and its value is a real number greater than or equal to zero, representing the degree of similarity, where a value of 0 indicates complete similarity, and a larger value indicates lower similarity. In our experiments, a lower metric indicates better model performance.

[0127] PSNR: the Peak Signal-to-Noise Ratio is a metric used to evaluate image reconstruction quality. It is expressed in dB, with higher values indicating higher image quality. In our experiments, a higher PSNR indicates better model performance.
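Of the three metrics, PSNR is simple enough to state directly: it is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MAX is the largest possible pixel value. The sketch below uses the standard definition; the function name and the default `max_value=255.0` for 8-bit images are assumptions, not details from the experiments.

```python
import math

import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = float(np.mean((ref - est) ** 2))
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of one tenth of the pixel range gives MAX^2 / MSE = 100.
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 25.5)
score = psnr(reference, estimate)
```

SSIM and LPIPS need windowed statistics and a pretrained network respectively, so they are usually taken from an existing library rather than reimplemented.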

[0128] Tables 1 and 2 present the technical metrics obtained from the video interpolation and video prediction experiments, respectively. The data shows that we outperform existing benchmark methods in multiple metrics, including SSIM, PSNR, and LPIPS.

[0129] In the ablation experiments, we removed the optical flow-guided network netF, the hybrid masking method Mask+ based on optical flow divergence, and the FlowLoss term of the loss function, respectively, and conducted quantitative experimental comparisons with the complete network, thereby demonstrating the effectiveness of our proposed innovations.

[0130] Test results on the KTH Action dataset are shown in Figure 3. The first row is the ground truth, used as the reference for training; the second row is the model's prediction results; the third row is the frame-by-frame difference map of the ground truth; the fourth row is the difference map of the model's output; the fifth row is a visualization of the model's optical flow prediction results, where the hue of the image represents the direction of the optical flow and the saturation represents its magnitude; the last row is the model's output mask.
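The hue-and-saturation encoding of optical flow described for the fifth row can be sketched with the Python standard library. The angle convention (hue 0 for flow pointing along +x), the normalisation by `max_magnitude`, and the fixed brightness of 1 are assumptions; the figure may use a different colour wheel.

```python
import colorsys
import math


def flow_to_rgb(u, v, max_magnitude):
    """Map one flow vector (u, v) to an RGB triple with components in [0, 1]."""
    # Direction of the flow selects the hue on the colour wheel.
    hue = (math.atan2(v, u) % (2.0 * math.pi)) / (2.0 * math.pi)
    # Magnitude of the flow, clipped at max_magnitude, selects the saturation.
    saturation = min(math.hypot(u, v) / max_magnitude, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


still = flow_to_rgb(0.0, 0.0, 1.0)   # no motion: zero saturation
right = flow_to_rgb(1.0, 0.0, 1.0)   # full-strength motion along +x
```

With this mapping, static regions render as white and faster motion renders as more vivid colour.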

[0131] Table 1. Results of the interpolation experiment

[0132]

[0133] Table 2. Experimental results of extrapolation

[0134]

[0135]

[0136] Ablation experiment

[0137] To verify the validity of the two experimental results, we conducted ablation experiments on them. The experimental results are listed in Tables 3 and 4.

[0138] Table 3 Results of interpolation ablation experiments

[0139]

[0140] Table 4 Results of extrapolation ablation experiments

[0141]

[0142] The ablation experiments show that the introduced modules significantly improve both video frame interpolation and video prediction, with netF, Mask+, and the FlowLoss term each making its own contribution. Introducing these modules is therefore an effective means of improving the performance of video generation tasks.

[0143] In summary, the embodiments of the present invention have the following features and advantages:

[0144] (1) Optical flow supervision information is introduced to obtain more refined optical flow estimation results;

[0145] (2) A new method for evaluating occlusion relationships is proposed to combine the optical flow mapping results with the reconstruction results of the generative network;

[0146] (3) Loss functions for the optical flow and for the hybrid mask are proposed to improve the accuracy of optical flow estimation and the quality of frame generation.

[0147] The embodiments of the present invention can accept input of any frame rate and synthesize videos of any frame rate; it can be used for both video prediction and video frame interpolation tasks; it solves the difficulties faced by optical flow estimation such as occlusion, illumination changes, and motion blur, as well as the hole and overlap problems that will occur when synthesizing new frames only through optical flow mapping.

[0148] The optical flow estimation network in this invention has explicit supervision information, which yields more accurate optical flow estimates. The proposed occlusion evaluation method, based on optical flow divergence, blends the optical flow mapping results with the reconstruction results of the generator network to produce higher-quality video frames. The proposed loss functions for the optical flow and for the hybrid mask speed up training convergence.
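The idea of using optical flow divergence to decide how much to trust the flow-mapped image can be sketched as follows. The divergence computation and the convex blend are standard; the refinement rule in `refine_mask` (attenuating the mask by `exp(-sharpness * |divergence|)`) is a hypothetical stand-in, since the formula for the hybrid mask is not reproduced in this text. All function names and the single-channel nested-list image layout are assumptions.

```python
import math

def divergence(flow):
    """Finite-difference divergence du/dx + dv/dy of a flow field.

    flow: list of rows, each row a list of (u, v) tuples. Central differences
    are used in the interior and one-sided differences at the borders.
    """
    h, w = len(flow), len(flow[0])
    div = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            du_dx = (flow[y][x1][0] - flow[y][x0][0]) / max(x1 - x0, 1)
            dv_dy = (flow[y1][x][1] - flow[y0][x][1]) / max(y1 - y0, 1)
            div[y][x] = du_dx + dv_dy
    return div

def refine_mask(mask, div, sharpness=1.0):
    """Hypothetical rule: trust the warped image less where |divergence| is large."""
    return [[m * math.exp(-sharpness * abs(d)) for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(mask, div)]

def blend(warped, reconstructed, mask):
    """Per-pixel convex combination of the warped and reconstructed images."""
    return [[m * a + (1.0 - m) * b for a, b, m in zip(w_row, r_row, m_row)]
            for w_row, r_row, m_row in zip(warped, reconstructed, mask)]
```

Where the flow is locally uniform the divergence is zero and the mask is unchanged; where the flow expands or contracts, which is where warping alone tends to leave holes or overlaps, the blend shifts toward the reconstruction.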

[0149] The video generation method of the present invention can be used in the following fields:

[0150] 1. It can be used for video frame interpolation. Video frame interpolation is a key technology for video stream restoration and post-processing. When network bandwidth is limited, low frame rate video can be transmitted and then interpolated at the receiving end, reducing the amount of data that must be transmitted. Because the quality of the interpolation determines the quality of the reconstructed frames, this technology is of great significance for video compression and transmission. Video frame interpolation can also post-process existing video, increasing its frame rate so that it plays back more smoothly, and it is important in applications such as slow-motion playback.

[0151] 2. It can be used for video prediction. Whereas video frame interpolation inserts new video frames between given video frames, video prediction generates frames that follow the given frames. Application scenarios include autonomous driving, robot navigation, and human-computer interaction, providing these fields with more efficient and convenient processing methods. Tasks where video prediction has been successfully applied include predicting activities and events, long-term planning, predicting the future location of objects, predicting instance or semantic segmentation maps, predicting pedestrian trajectories in traffic, anomaly detection, precipitation nowcasting, and autonomous driving.
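Which of the two tasks a request falls under depends on where the requested output times lie relative to the input times, as claim 6 also sets out. A minimal sketch, assuming time stamps are plain numbers; the function name and the "mixed" label for requests that fit neither case are hypothetical.

```python
def classify_task(input_times, output_times):
    """Decide whether a generation request is interpolation or prediction.

    Output times strictly after the last input time are treated as prediction;
    output times strictly between the first and last input times are treated
    as interpolation. Anything else is reported as "mixed".
    """
    first, last = min(input_times), max(input_times)
    if all(s > last for s in output_times):
        return "prediction"
    if all(first < s < last for s in output_times):
        return "interpolation"
    return "mixed"
```

For example, with input frames at times 0.0, 0.5, and 1.0, requesting frames at 0.25 and 0.75 is an interpolation request, while requesting frames at 1.5 and 2.0 is a prediction request.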

[0152] In film and television production, this method can be used in video production and editing to improve the efficiency of producing video content. In online media, this method can be applied to media transmission and to video encoding and compression, reducing redundancy in video data, lowering the bandwidth required for transmission, and improving the smoothness of online video, live streaming, and video conferencing. In autonomous driving, this method can predict future frames of driving footage, helping autonomous driving systems make judgments, better perceive their surroundings, and improve their safety and stability.

[0153] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0154] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0155] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0156] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0157] The background section of this invention may include background information about the problems or environment in which the invention is being developed, and is not necessarily a description of prior art. Therefore, the content included in the background section does not constitute an admission of prior art by the applicant.

[0158] The above description provides a further detailed explanation of the present invention in conjunction with specific / preferred embodiments, and it should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various substitutions or modifications can be made to these described embodiments without departing from the concept of the present invention, and all such substitutions or modifications should be considered within the scope of protection of the present invention. In the description of this specification, the reference to terms such as "an embodiment," "some embodiments," "preferred embodiment," "example," "specific example," or "some examples," etc., indicates that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in any suitable manner in one or more embodiments or examples. Without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification and the features of different embodiments or examples. Although the embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions, and modifications can be made herein without departing from the scope of protection of the patent application.

Claims

1. A variable frame rate video generation method based on optical flow estimation, characterized in that optical flow supervision information is introduced into the OpFode-Net model, which includes an encoder-decoder structure; the encoder uses an ODE-ConvGRU to embed the input video sequence X_T into a hidden state h_T, wherein the ODE-ConvGRU uses a ConvGRU as a node of the neural ODE and embeds it into the neural ODE to achieve dynamic modeling of video sequences; the decoder starts from h_T and uses an ODE solver to generate new video frames at any time step S; the method employs a hybrid mask+ based on optical flow divergence to blend the optical-flow-mapped image with the difference image provided by the reconstruction network, thereby obtaining a higher-quality frame generation result; the hybrid mask+ is calculated from the optical flow divergence map, using the module in the decoding network that generates the hybrid mask, the hybrid mask generated directly by the decoding network without optical flow refinement, and the optical flow map generated by the optical flow estimation network.

2. The variable frame rate video generation method as described in claim 1, characterized in that two discriminators are used, one at the image level and the other at the video sequence level, to improve the output quality of spatial appearance and temporal dynamics.

3. The variable frame rate video generation method as described in claim 1, characterized in that the OpFode-Net combines a neural ODE and a ConvGRU, using an ODE solver to generate videos at arbitrary time steps, while introducing adversarial learning to improve the quality of the generated videos.

4. The variable frame rate video generation method as described in claim 1, characterized in that the OpFode-Net model is formalized as follows: the hidden state of the model is set to an initial value; solving with the ODESolve module yields the hidden state at the next input time; the video frame at that time is encoded by encoder E, the hidden state is updated by the ConvGRUCell module, and the updated state is recorded as the hidden state at that time; and so on, until the hidden states corresponding to all input times are obtained; the initial state for generating frames is then selected, with one choice when the model is used for video frame interpolation tasks and another when it is used for video prediction tasks; starting from this initial state, the hidden states corresponding to the output times are obtained by solving with the ODESolve module again; decoder G decodes the hidden states of two adjacent frames to obtain the optical flow, the difference map, and the blending mask; finally, the target frame is synthesized by combining optical flow mapping with the hybrid mask.

5. The variable frame rate video generation method as described in claim 4, characterized in that the method includes the following specific workflow: Step 1: first, the frame data at the first input time is given to encoder E, and one pass of the encoder's processing yields the hidden state at that time; the video frame is encoded by encoder E, the hidden state is updated by the ConvGRUCell module, and the updated hidden state is recorded and used as one of the inputs of the next step; next, the frame data at the following input time is given to encoder E, whose processing yields the hidden state at that time, which is then combined with the hidden state from the previous step to obtain the updated hidden state; the above operation is repeated n times, yielding the hidden states at the final input time; Step 2: first, the frame state obtained by encoder E is given directly to generator G, and the state obtained after transformation by the ODE-Solve module is also given to G, yielding three results: the optical flow, the difference map, and the blending mask; next, while that state is input to G, it is also passed on to the next G together with the state obtained after a further transformation by the ODE-Solve module, yielding three more results; the above steps are repeated until the results at the final output time are obtained.

6. The variable frame rate video generation method according to any one of claims 1 to 5, characterized in that the model transforms the variable frame rate video generation problem into a video prediction or video frame interpolation problem: each frame of the video sequence is described by the width, height, and number of channels of the video frame; the input video sequence X_T is a video sequence of a given length running from a start time to an end time; for the given input sequence and the times S to be solved, the video frames at the corresponding times are estimated; depending on the values of S, when the video sequence to be estimated lies after the input video sequence, the variable frame rate video generation problem is transformed into a video prediction problem; when the video sequence to be estimated lies within the input video sequence, the variable frame rate video generation problem is transformed into a video frame interpolation problem.

7. The variable frame rate video generation method according to any one of claims 1 to 5, characterized in that the objective function L used to train the model is a linearly weighted combination of the following loss functions: the optical flow loss function, the reconstruction loss function, the difference loss function, and the loss functions of the image discriminator and the sequence discriminator; the weights are hyperparameters that control the relative importance of the different losses, including a hyperparameter that controls the optical flow loss.

8. The variable frame rate video generation method as described in claim 7, characterized in that the optical flow loss function includes an optical flow reconstruction loss and an optical flow smoothing loss; the optical flow reconstruction loss is calculated from the generated optical flow field and the real optical flow field, and the optical flow smoothing loss is calculated from the gradients of the optical flow field along the x-axis and y-axis, each evaluated at any one of the times to be solved; the final optical flow loss function is the weighted sum of the optical flow reconstruction loss and the smoothing loss, the two weights being hyperparameters that control the optical flow loss.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the variable frame rate video generation method as described in any one of claims 1 to 8.
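The alternation described in claims 4 and 5, in which the hidden state is evolved by an ODE solver between time stamps and updated by a recurrent cell whenever a frame arrives, can be illustrated with a toy sketch. Everything here is a scalar stand-in chosen for illustration: the fixed-step Euler solver, the `gru_update` mixing rule in place of ConvGRUCell, the identity `encoder`, and all function names are assumptions, not the OpFode-Net implementation.

```python
def ode_solve(f, h, t0, t1, steps=100):
    """Fixed-step Euler integration of dh/dt = f(h) from t0 to t1."""
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h

def gru_update(h, x, gate=0.5):
    """Scalar stand-in for ConvGRUCell: convex mix of state and encoded input."""
    return (1.0 - gate) * h + gate * x

def encode(frames, times, f, encoder=lambda x: x):
    """Alternate ODE evolution and recurrent update over irregular time stamps."""
    h, t_prev = 0.0, times[0]
    for x, t in zip(frames, times):
        h = ode_solve(f, h, t_prev, t)     # evolve the state up to time t
        h = gru_update(h, encoder(x))      # fold in the frame observed at t
        t_prev = t
    return h

def decode_states(h, t_start, out_times, f):
    """Evolve the chosen initial state to each requested output time."""
    states, t_prev = [], t_start
    for s in out_times:
        h = ode_solve(f, h, t_prev, s)
        states.append(h)
        t_prev = s
    return states
```

Because the solver accepts arbitrary time intervals, neither the input time stamps nor the requested output times need to be evenly spaced, which is what allows input and output at flexible frame rates. In the full model, the decoded states would be passed to decoder G to obtain the optical flow, difference map, and mask.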