Method and apparatus for training video frame interpolation model, and video frame interpolation method using same model
The method effectively addresses the challenge of accurately representing non-uniform motion in video frame interpolation by employing a motion information field and neural networks to generate sharper intermediate frames.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- KOREA ADVANCED INST OF SCI & TECH
- Filing Date
- 2025-12-04
- Publication Date
- 2026-06-18
AI Technical Summary
Existing video frame interpolation methods struggle to accurately represent non-uniform motion between video frames, leading to motion ambiguity and blurriness in generated frames.
A training method for a video frame interpolation model uses a motion information field that includes pixel-wise motion vectors and angle difference information; the field is processed through a first and a second neural network to generate clear intermediate frames, which addresses the representation of non-uniform motion.
The method reduces motion ambiguity by accurately representing the motion between video frames, enabling the generation of clearer intermediate frames without blurriness.
Description
[0001] Training method and apparatus for a video frame interpolation model, and a video frame interpolation method using the model
[0002] Video frame interpolation (VFI) is widely used to improve the frame rate of a video. Video frame interpolation is a technique that generates a new video frame between two consecutive video frames of the original video. Through video frame interpolation, a video with a low frame rate can be converted into a video with a high frame rate. Videos with a high frame rate can be visually smoother and more natural than videos with a low frame rate.
[0003] The present invention was developed with support from the Ministry of Science and ICT (Project No.: RS-2022-00144444, Project Name: Information and Communication / Broadcasting Technology Development Project, Research Project Name: Research on Learning and Rendering Spatial Image Representation of Static and Dynamic Scenes Based on Deep Learning, Lead Institution: Korea Advanced Institute of Science and Technology, Research Management Agency: Korea Institute of Information and Communication Technology Planning and Evaluation).
[0004] A training method for a video frame interpolation model according to one embodiment comprises: generating image feature maps corresponding to each image frame based on a first image sequence including a first image frame corresponding to a first time point, a second image frame corresponding to a second time point after the first time point, and an intermediate image frame corresponding to an intermediate time point between the first time point and the second time point; generating a motion information field including pixel-wise motion vectors between the first image frame and the intermediate image frame and pixel-wise angle difference information between the second image frame and the intermediate image frame; inputting the image feature maps and motion information field corresponding to the first image frame and the second image frame into a first neural network that estimates a bidirectional motion vector field to generate a first inverse motion vector field and a first forward motion vector field; inputting the first inverse motion vector field and the first forward motion vector field into a second neural network that estimates an image frame at an intermediate time point to generate a first output image frame; and training the first neural network and the second neural network based on the difference between the intermediate image frame and the first output image frame.
[0005] According to one embodiment, a training device comprises one or more processors and a memory comprising instructions executable by the one or more processors, and when the instructions are executed by the one or more processors, the instructions cause the training device to: generate image feature maps corresponding to each image frame based on a first image sequence comprising a first image frame corresponding to a first time point, a second image frame corresponding to a second time point after the first time point, and an intermediate image frame corresponding to an intermediate time point between the first time point and the second time point; generate a motion information field including pixel-wise motion vectors between the first image frame and the intermediate image frame and pixel-wise angle difference information with respect to the pixel-wise motion vectors between the second image frame and the intermediate image frame; input the image feature maps corresponding to the first image frame and the second image frame and the motion information field to a first neural network that estimates a bidirectional motion vector field to generate a first inverse motion vector field and a first forward motion vector field; input the first inverse motion vector field and the first forward motion vector field to a second neural network that estimates an image frame at an intermediate time point to generate a first output image frame; and train the first neural network and the second neural network based on the difference between the intermediate image frame and the first output image frame.
[0006] A video frame interpolation method according to one embodiment comprises the steps of: initializing a pre-trained video frame interpolation model; receiving an input video sequence including a first input video frame at a first input time point and a second input video frame at a second input time point; determining a target motion information field including pixel-wise motion vectors between the first input video frame and a target video frame corresponding to a target time point between the first input time point and the second input time point, and pixel-wise angle difference information between the second input video frame and the target video frame; and inputting the input video sequence and the target motion information field into the trained video frame interpolation model to generate a target video frame corresponding to a target time point.
[0007] FIG. 1 is a block diagram illustrating the operation of generating a target image frame of an electronic device according to one embodiment.
[0008] FIG. 2 is a diagram illustrating the operation of generating a target image frame of a video frame interpolation model according to one embodiment.
[0009] FIG. 3 is a diagram illustrating the performance difference of a video frame interpolation model according to a training index, according to one embodiment.
[0010] FIG. 4 is a block diagram schematically illustrating a process of generating a target image frame using a video frame interpolation model according to one embodiment.
[0011] FIG. 5 is a diagram illustrating an exemplary inference process of a first neural network according to one embodiment.
[0012] FIG. 6 is a block diagram schematically illustrating the process of training a video frame interpolation model according to one embodiment.
[0013] FIG. 7 is a flowchart exemplarily illustrating a method for a training device to train a video frame interpolation model according to one embodiment.
[0014] FIG. 8 is a block diagram showing the configuration of a video frame interpolation device according to one embodiment.
[0015] FIG. 9 is a block diagram showing the configuration of a training device according to one embodiment.
[0016] FIG. 10 is a block diagram exemplarily illustrating the configuration of an electronic device for training a video frame interpolation model according to one embodiment.
[0017] Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, or substitutions included in the technical concept described by the embodiments.
[0018] Terms such as "first" or "second" may be used to describe various components, but these terms should be interpreted solely for the purpose of distinguishing one component from another. For example, the first component may be named the second component, and similarly, the second component may be named the first component.
[0019] When it is stated that a component is "connected" to another component, it should be understood that it may be directly connected to or coupled with that other component, or that there may be other components in between.
[0020] The singular expression includes the plural expression unless the context clearly indicates otherwise. In this specification, terms such as "comprising" or "having" are intended to specify the existence of the described features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.
[0021] Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as generally understood by those skilled in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this specification.
[0022] Hereinafter, embodiments will be described in detail with reference to the attached drawings. In the description with reference to the attached drawings, identical components are given the same reference numeral regardless of the drawing number, and redundant descriptions thereof will be omitted.
[0023] FIG. 1 is a block diagram illustrating the operation of generating a target image frame of an electronic device according to one embodiment. Referring to FIG. 1, the electronic device (100) can receive an input image sequence (110). The input image sequence (110) may include a plurality of image frames. The input image sequence (110) may correspond to a video with a low frame rate.
[0024] The electronic device (100) may be referred to as a video frame interpolation (VFI) device. The electronic device (100) may include a video frame interpolation model (120). The video frame interpolation model (120) may correspond to a software module. The electronic device (100) may perform video frame interpolation on an input video sequence (110) using the video frame interpolation model (120).
[0025] The electronic device (100) can input an input video sequence (110) into a video frame interpolation model (120). The video frame interpolation model (120) can generate a new video frame between two consecutive video frames of the input video so as to improve the frame rate of the video. For example, the video frame interpolation model (120) can generate a new target video frame (130) between two consecutive video frames of the input video sequence (110).
[0026] The video frame interpolation model (120) can generate a video with an improved frame rate based on the input video and the generated video frames. For example, the video frame interpolation model (120) can generate and / or output an output video sequence including an input video sequence (110) and a target video frame (130). The output video sequence can correspond to a video with a high frame rate.
[0027] The video frame interpolation model (120) may correspond to a model trained to generate a video with a higher frame rate in response to an input video. The video frame interpolation model (120) may include one or more neural networks. In one embodiment, the electronic device (100) may include a training device for training the video frame interpolation model (120). For example, the training device may train one or more neural networks of the video frame interpolation model (120). After being trained based on deep learning, the one or more neural networks of the video frame interpolation model (120) may perform inference suited to the purpose by mapping input data to output data that are in a non-linear relationship with each other.
[0028] The training device can train a video frame interpolation model (120) based on a training video sequence containing three video frames. The training device can train the video frame interpolation model (120) to estimate, from two video frames of the training video sequence, the video frame between them. The training device can utilize an index of the training video sequence to train the video frame interpolation model (120). The index may include, for example, information about the time point corresponding to each video frame, information about the magnitude (distance) of the motion vector between video frames, and / or information about the direction (angle) of the motion vector between video frames.
[0029] If the index of the training video sequence is not properly utilized during the training of the video frame interpolation model (120), the motion between the video frames of the training video sequence may not be accurately represented. In this case, motion ambiguity may occur during the video frame generation process of the video frame interpolation model (120) due to the motion not accurately represented in the training video sequence. Motion ambiguity may be referred to as time-to-location ambiguity. One or more neural networks of the video frame interpolation model (120) may not be trained to determine an appropriate motion among the many motions that may exist between two consecutive video frames of the input video sequence (110). In this case, the target video frame (130) generated by the video frame interpolation model (120) may appear blurred.
[0030] The training device of the electronic device (100) can train a video frame interpolation model (120) by utilizing an index containing information about the magnitude (distance) of motion vectors between video frames and / or information about the direction (angle) of motion vectors between video frames. The index can represent non-uniform motion (e.g., non-linear and non-uniform motion) between video frames of a training video sequence. A video frame interpolation model (120) trained with a video sequence representing non-uniform motion can generate a clear target video frame (130) without blurriness.
[0031] FIG. 2 is a diagram illustrating the operation of generating a target image frame of a video frame interpolation model according to one embodiment. Referring to FIG. 2, an input image sequence (210) may be input to a video frame interpolation model (220). The input image sequence (210) and the video frame interpolation model (220) may correspond to the input image sequence (110) and the video frame interpolation model (120) of FIG. 1. The input image sequence (210) may include a first image frame (211) and a second image frame (212). The first image frame (211) and the second image frame (212) may correspond to two consecutive image frames of the input image sequence (210). The first image frame (211) may be an image frame corresponding to a first time point. The second image frame (212) may be an image frame corresponding to a second time point.
[0032] The video frame interpolation model (220) can generate and / or output a target image frame (231) based on an input image sequence (210). The target image frame (231) may correspond to the target image frame (130) of FIG. 1. The video frame interpolation model (220) can generate and / or output an output image sequence (230) based on the input image sequence (210) and the target image frame (231). The output image sequence (230) may include a first image frame (211), a second image frame (212), and a target image frame (231).
[0033] The target image frame (231) may correspond to an intermediate time point (e.g., t=0.5) between the first time point (e.g., t=0) corresponding to the first image frame (211) and the second time point (e.g., t=1) corresponding to the second image frame (212) within the output image sequence (230). Although FIG. 2 illustrates the video frame interpolation model (220) generating one image frame (the target image frame (231)) corresponding to one time point between two consecutive image frames (the first image frame (211) and the second image frame (212)), the video frame interpolation model (220) may also generate multiple image frames corresponding to multiple time points (e.g., t=0.25, t=0.5, t=0.75) between the two consecutive image frames.
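The time points and the ordering of the output image sequence described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and even spacing of the intermediate time points is an assumption suggested by the examples t=0.25, t=0.5, t=0.75 rather than a requirement of the embodiment.

```python
from fractions import Fraction


def intermediate_time_points(frames_between):
    """Evenly spaced time points strictly between t=0 and t=1 (assumed spacing)."""
    if frames_between < 1:
        raise ValueError("frames_between must be at least 1")
    step = Fraction(1, frames_between + 1)
    return [float(step * i) for i in range(1, frames_between + 1)]


def build_output_sequence(first_frame, second_frame, target_frames):
    """Place the generated target frames between the two input frames."""
    return [first_frame, *target_frames, second_frame]


print(intermediate_time_points(1))  # [0.5]
print(intermediate_time_points(3))  # [0.25, 0.5, 0.75]
print(build_output_sequence("I0", "I1", ["It"]))  # ['I0', 'It', 'I1']
```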
[0034] FIG. 3 is a diagram illustrating the performance difference of a video frame interpolation model according to a training index according to one embodiment. A training device can train a video frame interpolation model by utilizing the training index of video frames of a training video sequence. An example of a training index illustrated in FIG. 3 is explained through three video frames of a training video sequence. The three video frames of the training video sequence may include a first video frame (t=0), a second video frame (t=1), and an intermediate video frame (t=0.5).
[0035] Referring to FIG. 3, the first training index (310) includes information about the time point, but may not include information about the magnitude (distance) of the motion vector between video frames or information about the direction (angle) of the motion vector between video frames. The second training index (320) includes information about the time point and information about the magnitude (distance) of the motion vector between video frames, but may not include information about the direction (angle) of the motion vector between video frames. According to the second training index (320), the magnitude of the motion between the first video frame and the intermediate video frame and the magnitude of the motion between the intermediate video frame and the second video frame may be expressed differently, but motion in different directions between the first video frame and the second video frame cannot be expressed. The third training index (330) may include information about the time point, information about the magnitude (distance) of the motion vector between video frames, and information about the direction (angle) of the motion vector between video frames. According to the third training index (330), the magnitude and direction of motion between the first video frame and the intermediate video frame, and the magnitude and direction of motion between the intermediate video frame and the second video frame, may be expressed differently.
[0036] Information regarding the magnitude (distance) of motion vectors between video frames may include information regarding the magnitude of motion vectors per pixel. In one embodiment, information regarding the magnitude of motion vectors per pixel between video frames may include information regarding the 'difference' in the magnitude of motion vectors per pixel between video frames. For example, information regarding the magnitude of motion vectors per pixel between video frames may include information regarding the difference in the magnitude of motion vectors between a first video frame and an intermediate video frame and the magnitude of motion vectors between an intermediate video frame and a second video frame. In one embodiment, information regarding the difference in the magnitude of motion vectors may be expressed as a ratio between the magnitudes of two motion vectors.
[0037] Compared to using the first training index (310) or the second training index (320), when a video frame interpolation model is trained using an index for distance and an index for angle, such as the third training index (330), changes in motion within the video frames used for training can be accurately represented, and the position of an object within an intermediate video frame can be accurately represented, as illustrated in FIG. 3. When a video frame interpolation model is trained using the third training index (330), the problem of time-to-location ambiguity can be significantly reduced, and the video frame interpolation model can generate a clearer image.
[0038] Information regarding the magnitude (distance) of the motion vector of the third training index (330) and information regarding the direction (angle) of the motion vector between video frames can be expressed based on the pixel-wise motion vector between video frames. The motion vector can represent a change in the position of a pixel between video frames. The pixel-wise motion vector between video frames can be referred to as a motion vector field. For example, the motion vector field between the first video frame (t=0) and the intermediate video frame (t=0.5) can be referred to as the reverse motion vector field, and the motion vector field between the intermediate video frame (t=0.5) and the second video frame (t=1) can be referred to as the forward motion vector field. The reverse motion vector field can include the pixel-wise motion vector from the intermediate video frame to the previous video frame. The forward motion vector field can include the pixel-wise motion vector from the intermediate video frame to the subsequent video frame.
[0039] The motion vector field between video frames of a training video sequence can be determined and / or estimated in various ways. For example, the motion vector field can be determined and / or estimated based on the brightness of pixels between video frames. The motion vector field corresponding to the brightness of pixels can be referred to as optical flow. Additionally, for example, the motion vector field can be determined and / or estimated by a neural network.
[0040] [Mathematical Formula 1]
[0041]
[0042] Information regarding the magnitude (distance) of the motion vector of the third training index (330) and information regarding the direction (angle) of the motion vector between video frames can be expressed as a vector field, for example, as in Equation 1 above. t can represent a time point corresponding to an intermediate video frame between the video frame of time point 0 and the video frame of time point 1.
[0043] The vector field of Equation 1 can be referred to as a motion information field. One component of the field can represent the pixel-wise ratio of motion vector magnitudes (distances): given the pixel-wise magnitude of the reverse motion vector and the pixel-wise magnitude of the forward motion vector, this component can represent the pixel-wise normalized ratio between the two. Another component can represent, for each pixel, the difference between the angle of the pixel-wise reverse motion vector and the angle of the pixel-wise forward motion vector. The angle of a pixel-wise motion vector can be determined, for example, as in Equation 2 below, which gives the angle for a motion vector such as (2, 4) at a single pixel.
[0044] [Mathematical Formula 2]
[0045]
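The two components of the motion information field can be illustrated with a small standard-library sketch. Several points here are assumptions, not statements of the embodiment: the normalized ratio is taken as the reverse magnitude divided by the sum of the reverse and forward magnitudes, the angle is computed with the atan2 convention, the difference is taken as reverse angle minus forward angle wrapped to [0, 2π), and the function names and the `eps` guard are hypothetical.

```python
import math


def motion_information(reverse_vec, forward_vec, eps=1e-8):
    """Return (distance ratio, angle difference) for one pixel.

    reverse_vec: motion vector from the intermediate frame to the first frame.
    forward_vec: motion vector from the intermediate frame to the second frame.
    """
    rx, ry = reverse_vec
    fx, fy = forward_vec
    reverse_mag = math.hypot(rx, ry)
    forward_mag = math.hypot(fx, fy)
    # Assumed normalization: the ratio lies in [0, 1]; eps guards static pixels.
    ratio = reverse_mag / (reverse_mag + forward_mag + eps)
    # Assumed convention: atan2 angle per vector, difference wrapped to [0, 2*pi).
    angle_diff = (math.atan2(ry, rx) - math.atan2(fy, fx)) % (2 * math.pi)
    return ratio, angle_diff


def motion_information_field(reverse_field, forward_field):
    """Apply motion_information to every pixel of two H x W vector fields."""
    return [
        [motion_information(r, f) for r, f in zip(r_row, f_row)]
        for r_row, f_row in zip(reverse_field, forward_field)
    ]


# Under the atan2 convention, the vector (2, 4) has an angle of about 63.43 degrees.
print(math.degrees(math.atan2(4, 2)))
```

With a reverse vector of (-1, -2) and a forward vector of (3, 6), this sketch yields a ratio of about 0.25 and an angle difference of about π, that is, motion along a straight line with a quarter of the total distance on the reverse side.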
[0046] FIG. 4 is a block diagram schematically illustrating a process for generating a target image frame using a video frame interpolation model according to one embodiment. Referring to FIG. 4, an electronic device may input an input image sequence (410) into a video frame interpolation model (400). Through the video frame interpolation model (400), the electronic device may generate a target image frame (442) at a target time point based on the input image sequence (410). The target time point may correspond to a time point between a first time point and a second time point, which correspond respectively to a consecutive first image frame and a second image frame of the input image sequence. The target image frame (442) may correspond to the target image frame (130) of FIG. 1.
[0047] The video frame interpolation model (400) may be a pre-trained model that performs video frame interpolation based on an input video sequence (e.g., input video sequence (410)) to generate a new video frame (e.g., target video frame (442)). The training method of the video frame interpolation model (400) is described in more detail through FIG. 6 below. The electronic device may initialize the video frame interpolation model (400) before inputting the input video sequence (410) into the video frame interpolation model (400). Initialization of the video frame interpolation model (400) may mean loading the pre-trained video frame interpolation model (400) into memory. For example, all parameters of the pre-trained first neural network (430) and second neural network (440) may be loaded according to the initialization of the video frame interpolation model (400). When the video frame interpolation model (400) is initialized, the value input to the video frame interpolation model (400) when using the previous video frame interpolation model (400) may not affect the initialized video frame interpolation model (400).
[0048] The electronic device can generate a pyramid image sequence (412) based on the input image sequence (410). The electronic device can generate the pyramid image sequence (412) based on a plurality of encoding levels. The plurality of encoding levels may include L predetermined levels. The electronic device can generate an image sequence of each encoding level by performing downsampling on the input image sequence (410). k may represent an encoding level. The sizes of the image frames of the image sequences of the respective encoding levels of the pyramid image sequence (412) may differ from one another. As the encoding level increases, the size of the image frames of the image sequence may decrease. For example, an image sequence of encoding level k may have a scale reduced by a factor of 2^k relative to the input image sequence (410). The pyramid image sequence (412) may include the input image sequence (410). The input image sequence (410) may be the image sequence corresponding to encoding level 0.
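A minimal sketch of building such a pyramid for a single frame is given below. The 2x2 averaging filter, the function names, and the single-channel list-of-lists frame representation are assumptions chosen for illustration; the embodiment only states that downsampling is performed and that level k is reduced by a factor of 2^k.

```python
def downsample_by_two(image):
    """Halve height and width by averaging each 2x2 block (assumed filter)."""
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(width)
        ]
        for y in range(height)
    ]


def build_pyramid(frame, num_levels):
    """Level 0 is the input frame; level k is reduced by a factor of 2**k."""
    levels = [frame]
    for _ in range(1, num_levels):
        levels.append(downsample_by_two(levels[-1]))
    return levels


frame = [[float(x + y) for x in range(8)] for y in range(8)]
pyramid = build_pyramid(frame, num_levels=3)
print([(len(level), len(level[0])) for level in pyramid])  # [(8, 8), (4, 4), (2, 2)]
```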
[0049] The electronic device can generate image feature maps by performing pyramid encoding (414) on the pyramid image sequence (412). The image feature maps may include motion feature maps (422) and context feature maps (424). The video frame interpolation model (400) may include a motion feature extractor for generating the motion feature maps (422) and a context feature extractor for generating the context feature maps (424). The motion feature maps (422) may be feature maps used to estimate a bidirectional motion vector field. The context feature maps (424) may be feature maps used to estimate a target image frame at a time point between the time points of two image frames.
[0050] The pyramid encoding (414) may include multiple encoding levels. The multiple encoding levels of the pyramid encoding (414) may correspond to multiple encoding levels of the pyramid image sequence (412). The electronic device may generate image feature maps corresponding to each encoding level through the pyramid encoding (414). For example, the motion feature maps (422) may include motion feature maps corresponding to the input image sequence (410) of encoding level 0 and motion feature maps corresponding to the image sequence of encoding level (L-1).
[0051] Motion feature maps (422) may include motion feature maps corresponding to each encoding level. Motion feature maps corresponding to each encoding level may include motion feature maps corresponding to each video frame of the video sequence corresponding to each encoding level. Context feature maps (424) may include context feature maps corresponding to each encoding level. Context feature maps corresponding to each encoding level may include context feature maps corresponding to each video frame of the video sequence corresponding to each encoding level.
[0052] The video frame interpolation model (400) may include a first neural network (430) and a second neural network (440). An electronic device may perform pyramid decoding using the first neural network (430) and the second neural network (440). An electronic device may perform pyramid decoding based on motion feature maps (422), context feature maps (424), and a motion information field (426). An electronic device may generate a target video frame (442) by performing pyramid decoding. The motion information field (426) may correspond to the motion information field described through FIG. 3 and / or Equation 1.
[0053] Pyramid decoding may include multiple decoding levels. Multiple decoding levels may correspond to multiple encoding levels. For example, decoding level k of pyramid decoding may utilize information from encoding level k. Pyramid decoding may start from decoding level (L-1). At decoding level k, the electronic device may input motion feature maps (422) and a motion information field (426) into the first neural network (430). At decoding level k, the motion feature maps (422) may represent motion feature maps corresponding to a first time point and a second time point generated in correspondence with encoding level k. The first neural network (430) may be a neural network trained to estimate a bidirectional motion vector field from a target time point to the time points of two video frames of the input video sequence (410) (e.g., a first time point and a second time point). At decoding level k, the electronic device can use the first neural network (430) to generate a bidirectional motion vector field (432) corresponding to decoding level k.
[0054] At decoding level k, the motion information field (426) may have a size corresponding to the video sequence of encoding level k. The electronic device may determine the motion information field (426) to include pixel-wise angle difference information and / or pixel-wise motion vector magnitude difference information between (i) the pixel-wise forward motion vector between the first input video frame and the target video frame of the input video sequence (410) and (ii) the pixel-wise reverse motion vector between the second input video frame and the target video frame of the input video sequence (410).
[0055] In one embodiment, the electronic device may estimate information of the motion information field (426) and then determine the motion information field (426) based on the estimated information. For example, the electronic device may estimate information on the angle difference between a pixel-wise forward motion vector and a pixel-wise reverse motion vector and / or information on the magnitude difference between the pixel-wise motion vectors, and may determine the motion information field (426) to include the estimated information. The information on the motion vectors regarding the target image frame may be estimated, for example, through a separately trained neural network (not shown).
[0056] In one embodiment, instead of estimating the exact motion at the target time point between the time points of consecutive image frames of the input image sequence (410), the electronic device may determine the motion information field (426) by assuming that the motion between the time points of the consecutive image frames is uniform. In this case, the electronic device may determine the motion information field (426) as shown in Equation 3 below. In Equation 3, t may correspond to the target time point, and H and W may correspond to the height and width, respectively, of the image frames of the image sequence of encoding level k. According to Equation 3, the electronic device may determine the pixel-wise angle difference between the forward motion vector and the reverse motion vector to be 180° (π).
[0057] [Mathematical Formula 3]
[0058]
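As a minimal sketch of the uniform-motion case described in paragraph [0056], the field can be filled with constants. Only the angle channel (π at every pixel) is stated in the text. The magnitude-ratio channel set to t (which follows if the two motion vectors have lengths proportional to t and 1 - t and the ratio is normalized by their sum), the two-channel layout, and the name `uniform_motion_field` are assumptions for illustration and are not taken from Equation 3.

```python
import numpy as np

def uniform_motion_field(t, height, width):
    """Hypothetical helper: constant motion information field of shape (2, H, W)."""
    if not 0.0 < t < 1.0:
        raise ValueError("t is expected to lie strictly between the two input time points")
    # Channel 0 (assumed): normalized magnitude ratio, equal to t under uniform motion.
    ratio = np.full((height, width), t, dtype=np.float64)
    # Channel 1 (stated in the text): angle difference of 180 degrees at every pixel.
    angle = np.full((height, width), np.pi, dtype=np.float64)
    return np.stack([ratio, angle])

field = uniform_motion_field(0.5, 4, 6)
```

Under this assumption the field carries no per-pixel information; it only encodes the target time point t and the fact that the two motion vectors point in opposite directions.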
[0059] At decoding level k, the electronic device may input the bidirectional motion vector field (432), the context feature maps (424), and an image sequence of the pyramid image sequence (412) into the second neural network (440). The image sequence input at decoding level k may be the image sequence of the pyramid image sequence (412) corresponding to encoding level k. At decoding level k, the context feature maps (424) may represent the context feature maps corresponding to the first time point and the second time point generated at encoding level k. The second neural network (440) may be a neural network trained to estimate an image frame at a target time point. An image frame output by the second neural network (440) at decoding level k may have the same size as an image frame of the image sequence at encoding level k. That is, the image frame at the target time point estimated at decoding level k may have a scale reduced by a factor of $2^{k}$ relative to the target image frame (442). The second neural network (440) can be trained to better estimate the occlusion mask.
[0060] In one embodiment, the second neural network (440) may include an upsampling neural network and an image frame synthesis network. At decoding level k, the electronic device may input the bidirectional motion vector field (432) and context feature maps (424) into the upsampling neural network to generate a bidirectional motion vector field of a larger size than the bidirectional motion vector field (432). This may correspond to an adaptive upsampling model. At decoding level k, the electronic device may input the bidirectional motion vector field, context feature maps (424), and the image sequence corresponding to encoding level k into the image frame synthesis network to estimate and / or generate an image frame and / or occlusion mask at a target time point. The image frame synthesis network may correspond to a U-net architecture.
[0061] The electronic device can use data generated at decoding level k in pyramid decoding at decoding level (k-1). At decoding level (k-1), the electronic device can input a bidirectional motion vector field upsampled at decoding level k and an occlusion mask into the first neural network (430).
[0062] Pyramid decoding can be terminated at decoding level 0. At decoding level 0, the electronic device can estimate and / or generate a target image frame (442) having the same size as the image frame of the input image sequence (410) through the second neural network (440).
[0063] FIG. 5 is a diagram illustrating an exemplary inference process of a first neural network according to one embodiment. Referring to FIG. 5, the inference process of the first neural network (500) at decoding level $l$ is schematically illustrated.
[0064] At decoding level $l$, the electronic device may input the bidirectional motion vector field generated at decoding level $(l+1)$ into the first neural network (500). This bidirectional motion vector field may correspond to the bidirectional motion vector field output by the upsampling neural network of FIG. 4 at decoding level $(l+1)$, and may consist of a reverse motion vector field and a forward motion vector field. Here, t may represent the target time point, and 0 and 1 may represent the time points corresponding to consecutive video frames of the input video sequence.
[0065] At decoding level $l$, the electronic device may input motion feature maps generated by pyramid encoding into the first neural network (500). The motion feature maps may correspond to the motion feature maps (422) of FIG. 4 and may be, among the motion feature maps of the relevant encoding level, the motion feature maps for time points 0 and 1. At decoding level $l$, the electronic device may also input the corresponding occlusion mask into the first neural network (500).
[0066] The first neural network (500) may downsample the input reverse motion vector field and the input forward motion vector field to generate downsampled motion vector fields. For example, the input motion vector fields may have a scale reduced by a factor of $2^{l+1}$ relative to the video frames of the input video sequence, and the downsampled motion vector fields may have a scale reduced by a factor of $2^{l+2}$. The first neural network (500) may warp each of the two motion feature maps using the respective downsampled motion vector field to generate two warped motion feature maps. The first neural network (500) may then generate a cost volume to find the correspondence relationship (e.g., similarity) between the two warped motion feature maps.
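The cost volume mentioned above can be illustrated with a local correlation between two feature maps. This is a minimal sketch only: the text does not specify how the cost volume is computed, so the channel-averaged dot product, the search radius, zero padding at the borders, and the name `cost_volume` are all assumptions. Warping is omitted; the inputs are taken to be already-warped feature maps.

```python
import numpy as np

def cost_volume(feat_a, feat_b, radius=1):
    """Hypothetical local correlation between two (C, H, W) feature maps.

    Returns an array of shape ((2 * radius + 1) ** 2, H, W). The channel for
    offset (dy - radius, dx - radius) holds the channel-averaged product of
    feat_a at (y, x) and feat_b at (y + dy - radius, x + dx - radius).
    """
    _, h, w = feat_a.shape
    # Zero-pad the spatial axes so that shifted windows stay in bounds.
    padded = np.pad(feat_b, ((0, 0), (radius, radius), (radius, radius)))
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, dy:dy + h, dx:dx + w]
            costs.append((feat_a * shifted).mean(axis=0))
    return np.stack(costs)

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4, 5))
b = rng.standard_normal((3, 4, 5))
vol = cost_volume(a, b, radius=1)
```

A larger radius lets the correlation capture larger residual displacements at the cost of more channels.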
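[0067] The first neural network (500) may downsample the input occlusion mask and apply convolution to the downsampled result, and may then combine the convolved result with the two warped motion feature maps and the generated cost volume and apply convolution to the combination. Through this convolution, the first neural network (500) may generate a feature map.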
[0068] The electronic device may input a motion information field into the first neural network (500). The two components of the motion information field may respectively represent the pixel-wise ratio of the magnitudes (distances) of the motion vectors and the pixel-wise "difference between the angle of the reverse motion vector and the angle of the forward motion vector." The motion information field may correspond to the motion information field described with reference to FIG. 3 and / or Equation 1. The first neural network (500) may include a distance embedding module (DEM) and an angle embedding module (AEM). The first neural network (500) may input the magnitude ratio component into the DEM to generate a feature map, and may input the angle difference component into the AEM to generate another feature map.
[0069] The first neural network (500) may input the feature map generated by the convolution, the feature map generated by the DEM, and the feature map generated by the AEM into residual blocks (ResBlocks) and perform pixel-wise multiplication on the output results. Pixel-wise multiplication may be referred to as element-wise multiplication. The first neural network (500) may add the result of the pixel-wise multiplication to a feature map and convolve the sum to generate bidirectional residual motion vectors. The first neural network (500) may add the bidirectional residual motion vectors to the bidirectional motion vectors of decoding level $(l+1)$ to generate bidirectional motion vectors, which may correspond to the bidirectional motion vector field (432) of FIG. 4. The generated bidirectional motion vectors may have a scale reduced by a factor of $2^{l+2}$ relative to the video frames of the input video sequence, and may be upsampled by a factor of $2^{2}$ by the upsampling neural network of the second neural network so as to have a scale reduced by a factor of $2^{l}$ relative to the video frames of the input video sequence.
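The scale change at the end of this paragraph can be illustrated with a fixed, non-learned upsampler. This is a sketch under stated assumptions: the described upsampling neural network is learned, whereas the nearest-neighbour repetition below is only a stand-in, and the factor of 4 assumes the two scales mentioned above differ by $2^{2}$. The name `upsample_flow_nearest` is hypothetical.

```python
import numpy as np

def upsample_flow_nearest(flow, factor=4):
    """Hypothetical stand-in: upsample a (2, H, W) motion vector field.

    Each vector is repeated over a factor x factor block, and its components
    are multiplied by the factor because displacements are measured in pixels
    of the (now finer) grid.
    """
    up = np.repeat(np.repeat(flow, factor, axis=1), factor, axis=2)
    return up * factor

coarse = np.ones((2, 3, 5))
fine = upsample_flow_nearest(coarse, factor=4)
```

The rescaling of the vector components is the part that carries over to any upsampling scheme: a displacement of one pixel on the coarse grid spans four pixels on a grid that is four times finer.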
[0070] FIG. 6 is a block diagram schematically illustrating a process for training a video frame interpolation model according to one embodiment. Referring to FIG. 6, the training device can generate, from an original image sequence, a pyramid image sequence comprising image sequences of multiple scales corresponding to multiple encoding levels. The training image sequence (612) may correspond to one of the image sequences of the pyramid image sequence. For example, the training image sequence (612) may be the image sequence of the pyramid image sequence corresponding to encoding level $l$. The training image sequence (612) may include a first image frame corresponding to a first time point, a second image frame corresponding to a second time point, and an intermediate image frame corresponding to an intermediate time point. The second time point may be a time point after the first time point. The intermediate time point may be a time point between the first time point and the second time point.
[0071] The training device can generate motion feature maps and context feature maps by performing pyramid encoding (602) on a pyramid video sequence. The training device can generate motion feature maps (614) and context feature maps (616) corresponding to the training video sequence (612). The motion feature maps (614) may include motion feature maps corresponding to a first video frame, a second video frame, and an intermediate video frame. The context feature maps (616) may include context feature maps corresponding to a first video frame and a second video frame.
[0072] The training device can train the first neural network (620) and the second neural network (630) so that a target image frame (632) similar to an intermediate image frame is estimated according to motion feature maps (6142) corresponding to the first image frame and the second image frame. The first neural network (620) and the second neural network (630) may correspond to the first neural network (430) and the second neural network (440) of FIG. 4. The target image frame (632) may be an image frame having the same size as the image frame of the training image sequence (612). If the training image sequence (612) is the original image sequence among the pyramid image sequences used for training, the target image frame (632) may correspond to the target image frame (130) of FIG. 1.
[0073] The training device can perform first training (604) based on the motion feature maps (614) and the context feature maps (616). In the first training (604), the training device can input the motion feature maps (6144) corresponding to the first image frame and the intermediate image frame, together with the motion information field (654), into the first neural network (620). In this case, because the information of the intermediate image frame is input in place of the information of the second image frame, the first neural network (620) can estimate the reverse motion vector field (6242) from the intermediate time point to the first time point and a motion vector field from the intermediate time point to the intermediate time point. Additionally, in the first training (604), the training device can input the motion feature maps (6146) corresponding to the intermediate image frame and the second image frame, together with the motion information field (656), into the first neural network (620). In this case, because the information of the intermediate image frame is input in place of the information of the first image frame, the first neural network (620) can estimate a motion vector field from the intermediate time point to the intermediate time point and the forward motion vector field (6244) from the intermediate time point to the second time point. Preferably, the motion vector field from the intermediate time point to the intermediate time point is a field composed of zeros.
[0074] In one embodiment, a loss function can be determined based on the difference between a vector field consisting of zeros and a ‘motion vector field from intermediate point to intermediate point’ that is incidentally generated during the process of generating a bidirectional motion vector field (624), and the first neural network (620) can be trained so that the loss function is reduced.
[0075] In the first training (604), the motion vector field from the intermediate image frame to itself should preferably be composed of zeros, so the angle of its pixel-wise motion vectors cannot be properly defined. Accordingly, the motion information field (654) and the motion information field (656) can be determined as shown in Equations 4 and 5 below, respectively. In Equations 4 and 5, one index can represent the first training (604), and another index can represent the encoding level of the training image sequence (612). The two angle difference terms may each have random values in the range [0°, 360°] (that is, [0, 2π]). Accordingly, the motion information field (654) and the motion information field (656) may contain random pixel-wise angle difference information.
[0076] [Mathematical Formula 4]
[0077]
[0078] [Mathematical Formula 5]
[0079]
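The random angle difference information described in paragraph [0075] could be drawn as follows. This is a minimal sketch, not a rendering of Equations 4 and 5: drawing an independent value for every pixel is an assumption (the text could also be read as one random value per field), only the angle channel is modelled, and the name `random_angle_field` is hypothetical.

```python
import numpy as np

def random_angle_field(height, width, rng=None):
    """Hypothetical helper: per-pixel angle differences drawn uniformly from [0, 2*pi)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(0.0, 2.0 * np.pi, size=(height, width))

angles = random_angle_field(4, 6, np.random.default_rng(0))
```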
[0080] In the first training (604), the training device may input the bidirectional motion vector field (624) and the context feature maps (616) corresponding to the first image frame and the second image frame into the second neural network (630). Although not shown in FIG. 6, the training device may also input the first image frame and the second image frame into the second neural network (630). Accordingly, the second neural network (630) may estimate and / or generate a target image frame (634) corresponding to the size of the image frame of the training image sequence (612). The target image frame (634) may be referred to as the output image frame. Additionally, the second neural network (630) may estimate and / or generate an upsampled bidirectional motion vector field and an occlusion mask, and these may be input into the first neural network (620) at decoding level $(l-1)$.
[0081] In the first training (604), the training device can train the first neural network (620) and the second neural network (630) based on the difference between the intermediate image frame of the training image sequence (612) and the target image frame (634). The training device can determine a Charbonnier loss function based on the difference between the intermediate image frame of the training image sequence (612) and the target image frame (634). The training device can train the first neural network (620) and the second neural network (630) so that the value of one or more loss functions is reduced. In one embodiment, the training device can determine a census loss based on the intermediate image frame of the training image sequence (612) and the target image frame (634), and train the first neural network (620) and the second neural network (630) based on the Charbonnier loss function and the census loss.
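For reference, the Charbonnier loss named above is commonly computed as the mean of sqrt(d² + ε²) over the pixel-wise differences d. The sketch below assumes that common form; the value of ε and the name `charbonnier_loss` are assumptions, and the census loss is not shown.

```python
import numpy as np

def charbonnier_loss(output_frame, reference_frame, eps=1e-3):
    """Mean of sqrt(d^2 + eps^2) over all elements (a smooth variant of the L1 loss)."""
    diff = np.asarray(output_frame, dtype=np.float64) - np.asarray(reference_frame, dtype=np.float64)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

# With a constant difference of 3 and eps = 4, every element contributes sqrt(9 + 16) = 5.
demo_loss = charbonnier_loss(np.zeros((2, 2)), np.full((2, 2), 3.0), eps=4.0)
```

The ε term keeps the loss differentiable where the difference is zero, which a plain absolute difference is not.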
[0082] The training device can perform second training (606) based on the motion feature maps (614), the context feature maps (616), and a motion information field (652). The training device can calculate the motion information field (652) from the reverse motion vector field (6242) and the forward motion vector field (6244) using Equation 1 above. That is, the training device can generate the motion information field (652) containing pixel-wise angle difference information based on the angle information of the pixel-wise motion vectors of the reverse motion vector field (6242) and the forward motion vector field (6244). The training device can generate a bidirectional motion vector field (622) by inputting the motion feature maps (6142) corresponding to the first image frame and the second image frame, together with the motion information field (652), into the first neural network (620).
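The computation of a motion information field from two motion vector fields can be sketched as follows. Equation 1 is not reproduced in this section, so the exact form is an assumption: the ratio is taken as the reverse magnitude divided by the sum of both magnitudes, the angle difference is taken as the reverse angle minus the forward angle wrapped into [0, 2π), and the channel layout and the name `motion_information_field` are hypothetical.

```python
import numpy as np

def motion_information_field(reverse_flow, forward_flow, eps=1e-8):
    """Hypothetical sketch: (2, H, W) field from two (2, H, W) flows of (dx, dy)."""
    mag_rev = np.hypot(reverse_flow[0], reverse_flow[1])
    mag_fwd = np.hypot(forward_flow[0], forward_flow[1])
    # Assumed normalization: ratio in [0, 1); eps avoids division by zero.
    ratio = mag_rev / (mag_rev + mag_fwd + eps)
    ang_rev = np.arctan2(reverse_flow[1], reverse_flow[0])
    ang_fwd = np.arctan2(forward_flow[1], forward_flow[0])
    # Angle of the reverse vector minus angle of the forward vector, wrapped.
    angle_diff = np.mod(ang_rev - ang_fwd, 2.0 * np.pi)
    return np.stack([ratio, angle_diff])

# Reverse vectors (-1, 0) and forward vectors (3, 0): opposite directions, lengths 1 and 3.
rev = np.stack([np.full((2, 2), -1.0), np.zeros((2, 2))])
fwd = np.stack([np.full((2, 2), 3.0), np.zeros((2, 2))])
bim = motion_information_field(rev, fwd)
```

In this example the motion is collinear but not uniform in speed, so the angle channel equals π while the ratio channel departs from 0.5.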
[0083] In the second training (606), the bidirectional motion vector field (622) can be estimated based on the bidirectional motion vector field (624). Accordingly, in one embodiment, the training device can determine a loss function based on the difference between the bidirectional motion vector field (622) and the bidirectional motion vector field (624). For example, the training device can determine a loss function based on the difference between the reverse motion vector field of the bidirectional motion vector field (622) and the reverse motion vector field (6242), and the difference between the forward motion vector field of the bidirectional motion vector field (622) and the forward motion vector field (6244). The training device can train the first neural network (620) so that the loss function is reduced.
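A loss of the kind described in this paragraph can be sketched as the sum of two difference terms. The text only says the loss is based on the two differences; the mean absolute difference, the equal weighting of the two terms, and the name `flow_difference_loss` are assumptions.

```python
import numpy as np

def flow_difference_loss(est_reverse, est_forward, ref_reverse, ref_forward):
    """Hypothetical loss: mean absolute difference of reverse fields plus that of forward fields."""
    reverse_term = np.mean(np.abs(est_reverse - ref_reverse))
    forward_term = np.mean(np.abs(est_forward - ref_forward))
    return float(reverse_term + forward_term)

zeros = np.zeros((2, 4, 4))
ones = np.ones((2, 4, 4))
same = flow_difference_loss(ones, ones, ones, ones)
apart = flow_difference_loss(zeros, zeros, ones, ones)
```

Here the estimated fields play the role of the bidirectional motion vector field (622) and the reference fields play the role of the fields (6242) and (6244).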
[0084] In the second training (606), the training device may input the bidirectional motion vector field (622) and the context feature maps (616) corresponding to the first image frame and the second image frame into the second neural network (630). Although not shown in FIG. 6, the training device may also input the first image frame and the second image frame into the second neural network (630). Accordingly, the second neural network (630) may estimate and / or generate a target image frame (632) corresponding to the size of the image frame of the training image sequence (612). The target image frame (632) may be referred to as the output image frame. Additionally, the second neural network (630) may estimate and / or generate an upsampled bidirectional motion vector field and an occlusion mask, and these may be input into the first neural network (620) at decoding level $(l-1)$.
[0085] In the second training (606), the training device can train the first neural network (620) and the second neural network (630) based on the difference between the intermediate image frame of the training image sequence (612) and the target image frame (632). The training device can determine a Charbonnier loss function based on the difference between the intermediate image frame of the training image sequence (612) and the target image frame (632). The training device can train the first neural network (620) and the second neural network (630) so that the value of one or more loss functions is reduced. In one embodiment, the training device can determine a census loss based on the intermediate image frame of the training image sequence (612) and the target image frame (632), and train the first neural network (620) and the second neural network (630) based on the Charbonnier loss function and the census loss.
[0086] With reference to FIG. 6, training of the first neural network (620) and the second neural network (630) at decoding level $l$ has been described. However, the first neural network (620) and the second neural network (630) can be trained at all decoding levels (k = 0, 1, ..., L-1) using the other image sequences of the pyramid image sequence in addition to the training image sequence (612).
[0087] FIG. 7 is a flowchart illustrating, in an exemplary manner, a method for a training device to train a video frame interpolation model according to one embodiment. Referring to FIG. 7, in step (710), the training device may generate image feature maps corresponding to each image frame based on a first image sequence. The first image sequence may include a first image frame corresponding to a first time point, a second image frame corresponding to a second time point after the first time point, and an intermediate image frame corresponding to an intermediate time point between the first time point and the second time point.
[0088] In step (720), the training device may generate a motion information field. The motion information field may include pixel-wise angle difference information between the pixel-wise motion vectors between the first image frame and the intermediate image frame and the pixel-wise motion vectors between the second image frame and the intermediate image frame.
[0089] The training device can generate a second reverse motion vector field by inputting image feature maps corresponding to the first image frame and the intermediate image frame into the first neural network. For example, the training device can generate the second reverse motion vector field by inputting, into the first neural network, the image feature maps corresponding to the first image frame and the intermediate image frame and a motion information field containing random pixel-wise angle difference information. The training device can generate a second forward motion vector field by inputting image feature maps corresponding to the second image frame and the intermediate image frame into the first neural network. For example, the training device can generate the second forward motion vector field by inputting, into the first neural network, the image feature maps corresponding to the second image frame and the intermediate image frame and a motion information field containing random pixel-wise angle difference information. The training device can generate the motion information field containing pixel-wise angle difference information based on the angle information of the pixel-wise motion vectors of the second reverse motion vector field and the second forward motion vector field.
[0090] The training device can generate a second output image frame by inputting the second reverse motion vector field and the second forward motion vector field into the second neural network. The training device can train the first neural network and the second neural network based on the difference between the intermediate image frame and the second output image frame.
[0091] In step (730), the training device may generate a first reverse motion vector field and a first forward motion vector field by inputting the image feature maps and the motion information field into the first neural network. The first neural network may be a neural network trained to estimate the bidirectional motion vector field. The training device may train the first neural network based on the difference between the first reverse motion vector field and the second reverse motion vector field, and the difference between the first forward motion vector field and the second forward motion vector field.
[0092] In step (740), the training device may generate a first output image frame by inputting a first reverse motion vector field and a first forward motion vector field into a second neural network. The second neural network may be a neural network trained to estimate an image frame at an intermediate time point.
[0093] In step (750), the training device can train the first neural network and the second neural network based on the difference between the intermediate image frame and the first output image frame.
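The data flow of steps (710) to (750) can be summarized in a short sketch. Everything below is a placeholder for illustration: the callables `encode`, `flow_net`, `synth_net`, and `field_from_flows` stand in for the feature extraction, the first neural network, the second neural network, and the field computation; the mean absolute difference stands in for the losses; and the motion information field is reduced to an angle map. None of these names or choices come from the text.

```python
import numpy as np

def mean_abs(a, b):
    return float(np.mean(np.abs(a - b)))

def training_losses(f0, fm, f1, encode, flow_net, synth_net, field_from_flows, rng):
    """Hypothetical sketch of one training pass; returns the three loss values."""
    feat0, featm, feat1 = encode(f0), encode(fm), encode(f1)       # step 710
    h, w = f0.shape[-2:]
    # Frame pairs that include the intermediate frame get random angle information.
    rand_a = rng.uniform(0.0, 2.0 * np.pi, size=(h, w))
    rand_b = rng.uniform(0.0, 2.0 * np.pi, size=(h, w))
    rev2, _ = flow_net(feat0, featm, rand_a)   # second reverse field (intermediate to first)
    _, fwd2 = flow_net(featm, feat1, rand_b)   # second forward field (intermediate to second)
    out2 = synth_net(rev2, fwd2, f0, f1)       # second output image frame
    field = field_from_flows(rev2, fwd2)                            # step 720
    rev1, fwd1 = flow_net(feat0, feat1, field)                      # step 730
    out1 = synth_net(rev1, fwd1, f0, f1)                            # step 740
    return {
        "second_output": mean_abs(out2, fm),
        "flow": mean_abs(rev1, rev2) + mean_abs(fwd1, fwd2),
        "first_output": mean_abs(out1, fm),                         # basis of step 750
    }

# Toy stand-ins so the sketch runs; they do not model real networks.
def _encode(frame):
    return frame

def _flow_net(feat_a, feat_b, field):
    zero = np.zeros((2,) + feat_a.shape)
    return zero, zero

def _synth_net(rev, fwd, f0, f1):
    return 0.5 * (f0 + f1)

def _field(rev, fwd):
    return np.zeros(rev.shape[1:])

f0, fm, f1 = np.zeros((4, 4)), np.ones((4, 4)), np.full((4, 4), 2.0)
losses = training_losses(f0, fm, f1, _encode, _flow_net, _synth_net, _field,
                         np.random.default_rng(0))
```

In an actual implementation the three values would be combined into an objective and minimized with respect to the parameters of both networks; how they are weighted is not stated in this section.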
[0094] FIG. 8 is a block diagram illustrating the configuration of a video frame interpolation device according to one embodiment. Referring to FIG. 8, the video frame interpolation device (800) includes a processor (810) and a memory (820). The memory (820) is connected to the processor (810) and can store instructions executable by the processor (810), data to be computed by the processor (810), or data processed by the processor (810). The memory (820) may include a non-transitory computer-readable medium, such as high-speed random access memory and / or a non-volatile computer-readable storage medium (e.g., one or more disk storage devices, flash memory devices, or other non-volatile solid-state memory devices).
[0095] The processor (810) may execute instructions for performing the operations of FIGS. 1 to 7, FIG. 9, and FIG. 10. For example, the processor (810) may receive an input image sequence including a first input image frame at a first input time point and a second input image frame at a second input time point; determine a target motion information field including pixel-wise angle difference information between the pixel-wise motion vectors between the first input image frame and a target image frame, which corresponds to a target time point between the first input time point and the second input time point, and the pixel-wise motion vectors between the second input image frame and the target image frame; and input the input image sequence and the target motion information field into a trained video frame interpolation model to generate the target image frame corresponding to the target time point. In addition, the description of FIGS. 1 to 7, FIG. 9, and FIG. 10 may be applied to the video frame interpolation device (800).
[0096] FIG. 9 is a block diagram showing the configuration of a training device according to one embodiment. Referring to FIG. 9, the training device (900) includes a processor (910) and a memory (920). The memory (920) is connected to the processor (910) and can store instructions executable by the processor (910), data to be computed by the processor (910), or data processed by the processor (910). The memory (920) may include a non-transitory computer-readable medium, such as high-speed random access memory and / or a non-volatile computer-readable storage medium (e.g., one or more disk storage devices, flash memory devices, or other non-volatile solid-state memory devices).
[0097] The processor (910) can execute instructions to perform the operations of FIGS. 1 to 8 and FIG. 10. For example, the processor (910) can generate image feature maps corresponding to each image frame based on a first image sequence including a first image frame corresponding to a first time point, a second image frame corresponding to a second time point after the first time point, and an intermediate image frame corresponding to an intermediate time point between the first time point and the second time point; generate a motion information field including pixel-wise angle difference information between the pixel-wise motion vectors between the first image frame and the intermediate image frame and the pixel-wise motion vectors between the second image frame and the intermediate image frame; input the image feature maps corresponding to the first image frame and the second image frame and the motion information field into a first neural network that estimates a bidirectional motion vector field to generate a first reverse motion vector field and a first forward motion vector field; input the first reverse motion vector field and the first forward motion vector field into a second neural network that estimates an image frame at the intermediate time point to generate a first output image frame; and train the first neural network and the second neural network based on the difference between the intermediate image frame and the first output image frame. In addition, the descriptions of FIGS. 1 to 8 and FIG. 10 may be applied to the training device (900).
[0098] FIG. 10 is a block diagram illustrating, in an exemplary manner, the configuration of an electronic device for training a video frame interpolation model according to one embodiment. Referring to FIG. 10, the electronic device (1000) may include one or more processors (1010), memory (1020), storage (1030), I / O (input / output) devices (1040), and network interfaces (1050), which may communicate with each other via a communication bus (1060). For example, the electronic device (1000) may be implemented as at least part of a mobile device such as a mobile phone, smartphone, PDA, netbook, tablet computer, laptop computer, etc., a wearable device such as a smart watch, smart band, smart glasses, etc., a computing device such as a desktop, server, etc., a home appliance such as a television, smart television, refrigerator, etc., a security device such as a door lock, etc., an autonomous vehicle, a smart vehicle, etc. The electronic device (1000) may structurally and / or functionally include the video frame interpolation device (800) of FIG. 8 and / or the training device (900) of FIG. 9.
[0099] One or more processors (1010) may execute instructions stored in memory (1020) or storage (1030). When executed by one or more processors (1010), the instructions may cause the electronic device (1000) to perform the operations described with reference to FIGS. 1 to 9. The memory (1020) may include a computer-readable storage medium or a computer-readable storage device. The memory (1020) may store instructions to be executed by one or more processors (1010) and may store relevant information while software and / or applications are executed by the electronic device (1000).
[0100] Storage (1030) may include a computer-readable storage medium or a computer-readable storage device. Storage (1030) may store a larger amount of information than memory (1020) and may store information for a longer period. For example, storage (1030) may include a magnetic hard disk, an optical disk, a flash memory, a floppy disk, or other forms of non-volatile memory known in the art.
[0101] The I / O device (1040) can receive input from a user through traditional input methods such as a keyboard and mouse, and new input methods such as touch input, voice input, and image input. For example, the I / O device (1040) may include a keyboard, mouse, touch screen, microphone, or any other device capable of detecting input from a user and transmitting the detected input to an electronic device (1000). The I / O device (1040) may provide output of the electronic device (1000) to the user through visual, auditory, or tactile channels. The I / O device (1040) may include, for example, a display, touch screen, speaker, vibration generator, or any other device capable of providing output to the user. The network interface (1050) may communicate with an external device through a wired or wireless network.
[0102] The embodiments described above may be implemented as hardware components, software components, and / or combinations of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing unit may execute an operating system (OS) and software applications executed on said operating system. Additionally, the processing unit may access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing unit may be described as being used as a single unit, but those skilled in the art will understand that the processing unit may include multiple processing elements and / or multiple types of processing elements. For example, the processing unit may include multiple processors or one processor and one controller. In addition, other processing configurations, such as parallel processors, are also possible.
[0103] Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or instruct the processing unit independently or collectively. Software and / or data may be stored on any type of machine, component, physical device, virtual equipment, computer storage medium, or device so as to be interpreted by the processing unit or to provide instructions or data to the processing unit. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on computer-readable recording media.
[0104] The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, etc., either alone or in combination, and the program instructions recorded on the medium may be those specifically designed and configured for the embodiment or may be those known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.
[0105] The hardware device described above may be configured to operate as one or more software modules to perform the operation of the embodiment, and vice versa.
[0106] Although the embodiments have been described above with reference to the limited drawings, those skilled in the art can apply various technical modifications and variations based thereon. For example, suitable results may be achieved even if the described techniques are performed in a different order than described, and / or if the components of the described system, structure, device, circuit, etc. are combined or assembled in a form different from described, or replaced or substituted by other components or equivalents.
[0107] Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims set forth below.
Claims
1. A training method for a video frame interpolation model, the method comprising:
generating image feature maps corresponding to each image frame based on a first image sequence comprising a first image frame corresponding to a first time point, a second image frame corresponding to a second time point after the first time point, and an intermediate image frame corresponding to an intermediate time point between the first time point and the second time point;
generating a motion information field including pixel-wise angle difference information between pixel-wise motion vectors between the first image frame and the intermediate image frame and pixel-wise motion vectors between the second image frame and the intermediate image frame;
generating a first reverse motion vector field and a first forward motion vector field by inputting the image feature maps corresponding to the first image frame and the second image frame and the motion information field into a first neural network that estimates a bidirectional motion vector field;
generating a first output image frame by inputting the first reverse motion vector field and the first forward motion vector field into a second neural network that estimates an image frame at the intermediate time point; and
training the first neural network and the second neural network based on a difference between the intermediate image frame and the first output image frame.
2. The training method of claim 1, wherein generating the motion information field comprises:
generating a second reverse motion vector field by inputting the image feature maps corresponding to the first image frame and the intermediate image frame into the first neural network;
generating a second forward motion vector field by inputting the image feature maps corresponding to the second image frame and the intermediate image frame into the first neural network; and
generating the motion information field including the pixel-wise angle difference information based on angle information of the pixel-wise motion vectors of the second reverse motion vector field and the second forward motion vector field.
3. The training method of claim 2, further comprising training the first neural network based on a difference between the first reverse motion vector field and the second reverse motion vector field and a difference between the first forward motion vector field and the second forward motion vector field.
4. The training method of claim 2, wherein:
generating the second reverse motion vector field comprises inputting, into the first neural network, the image feature maps corresponding to the first image frame and the intermediate image frame together with a motion information field including random pixel-wise angle difference information; and
generating the second forward motion vector field comprises inputting, into the first neural network, the image feature maps corresponding to the second image frame and the intermediate image frame together with a motion information field including random pixel-wise angle difference information.
5. The training method of claim 2, further comprising:
generating a second output image frame by inputting the second reverse motion vector field and the second forward motion vector field into the second neural network; and
training the first neural network and the second neural network based on a difference between the intermediate image frame and the second output image frame.
6. The training method of claim 1, wherein the motion information field further includes a pixel-wise normalized ratio between the magnitude of the pixel-wise motion vector between the first image frame and the intermediate image frame and the magnitude of the pixel-wise motion vector between the second image frame and the intermediate image frame.
7. A training device comprising:
one or more processors; and
a memory comprising instructions executable by the one or more processors,
wherein the instructions, when executed by the one or more processors, cause the training device to:
generate image feature maps corresponding to each image frame based on a first image sequence comprising a first image frame corresponding to a first time point, a second image frame corresponding to a second time point after the first time point, and an intermediate image frame corresponding to an intermediate time point between the first time point and the second time point;
generate a motion information field including pixel-wise angle difference information between pixel-wise motion vectors between the first image frame and the intermediate image frame and pixel-wise motion vectors between the second image frame and the intermediate image frame;
generate a first reverse motion vector field and a first forward motion vector field by inputting the image feature maps corresponding to the first image frame and the second image frame and the motion information field into a first neural network that estimates a bidirectional motion vector field;
generate a first output image frame by inputting the first reverse motion vector field and the first forward motion vector field into a second neural network that estimates an image frame at the intermediate time point; and
train the first neural network and the second neural network based on a difference between the intermediate image frame and the first output image frame.
8. The training device of claim 7, wherein the instructions, when executed by the one or more processors, cause the training device, in generating the motion information field, to:
generate a second reverse motion vector field by inputting the image feature maps corresponding to the first image frame and the intermediate image frame into the first neural network;
generate a second forward motion vector field by inputting the image feature maps corresponding to the second image frame and the intermediate image frame into the first neural network; and
generate the motion information field including the pixel-wise angle difference information based on angle information of the pixel-wise motion vectors of the second reverse motion vector field and the second forward motion vector field.
9. The training device of claim 8, wherein the instructions, when executed by the one or more processors, cause the training device to train the first neural network based on a difference between the first reverse motion vector field and the second reverse motion vector field and a difference between the first forward motion vector field and the second forward motion vector field.
10. The training device of claim 8, wherein the instructions, when executed by the one or more processors, cause the training device to:
generate the second reverse motion vector field by inputting, into the first neural network, the image feature maps corresponding to the first image frame and the intermediate image frame together with a motion information field including random pixel-wise angle difference information; and
generate the second forward motion vector field by inputting, into the first neural network, the image feature maps corresponding to the second image frame and the intermediate image frame together with a motion information field including random pixel-wise angle difference information.
11. The training device of claim 8, wherein the instructions, when executed by the one or more processors, cause the training device to:
generate a second output image frame by inputting the second reverse motion vector field and the second forward motion vector field into the second neural network; and
train the first neural network and the second neural network based on a difference between the intermediate image frame and the second output image frame.
12. The training device of claim 7, wherein the motion information field further includes a pixel-wise normalized ratio between the magnitude of the pixel-wise motion vector between the first image frame and the intermediate image frame and the magnitude of the pixel-wise motion vector between the second image frame and the intermediate image frame.
13. A video frame interpolation method comprising:
initializing a pre-trained video frame interpolation model;
receiving an input image sequence including a first input image frame at a first input time point and a second input image frame at a second input time point;
determining a target motion information field including pixel-wise angle difference information between pixel-wise motion vectors between the first input image frame and a target image frame corresponding to a target time point between the first input time point and the second input time point, and pixel-wise motion vectors between the second input image frame and the target image frame; and
generating the target image frame corresponding to the target time point by inputting the input image sequence and the target motion information field into the pre-trained video frame interpolation model.
14. The method of claim 13, wherein generating the target image frame corresponding to the target time point comprises:
generating image feature maps corresponding to the first input image frame and the second input image frame;
generating a reverse motion vector field and a forward motion vector field by inputting the image feature maps and the target motion information field into a first neural network that estimates a bidirectional motion vector field; and
generating the target image frame by inputting the reverse motion vector field and the forward motion vector field into a second neural network that estimates an image frame at the target time point.
15. The method of claim 13, wherein determining the target motion information field comprises determining the target motion information field to include a pixel-wise angle difference of 180°.
16. The method of claim 13, wherein determining the target motion information field comprises determining the target motion information field to include pixel-wise angle difference information and pixel-wise motion vector magnitude difference information between the pixel-wise motion vector between the first input image frame and the target image frame and the pixel-wise motion vector between the second input image frame and the target image frame.
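For illustration only, and not as part of the claims, the motion information field recited in claims 1, 6 and 15 can be sketched in Python with NumPy. Several details below are assumptions that the claims do not specify: the function names `motion_information_field` and `uniform_angle_field`, the (H, W, 2) array layout, both motion vectors being anchored at the intermediate frame, the angle being taken as an unsigned value in degrees, and the normalized ratio being computed as |v0| / (|v0| + |v1|).

```python
import numpy as np


def motion_information_field(flow_t0, flow_t1, eps=1e-8):
    """Sketch of a per-pixel motion information field (hypothetical layout).

    flow_t0, flow_t1: arrays of shape (H, W, 2) holding per-pixel motion
    vectors from the intermediate frame toward the first and second frame.
    Returns an (H, W, 2) array:
      channel 0: unsigned angle difference in degrees, in [0, 180]
      channel 1: normalized magnitude ratio |v0| / (|v0| + |v1|), in [0, 1]
    """
    # The dot product and the 2D cross product give the cosine and sine
    # terms of the angle between the two vectors at every pixel.
    dot = np.sum(flow_t0 * flow_t1, axis=-1)
    cross = flow_t0[..., 0] * flow_t1[..., 1] - flow_t0[..., 1] * flow_t1[..., 0]
    angle_deg = np.degrees(np.arctan2(np.abs(cross), dot))

    # Assumed normalization: share of the total displacement covered by v0.
    mag0 = np.linalg.norm(flow_t0, axis=-1)
    mag1 = np.linalg.norm(flow_t1, axis=-1)
    ratio = mag0 / (mag0 + mag1 + eps)

    return np.stack([angle_deg, ratio], axis=-1)


def uniform_angle_field(height, width, angle_deg=180.0):
    """Angle channel set to one value everywhere, e.g. 180 degrees at inference."""
    return np.full((height, width), angle_deg)
```

Under these assumptions, a pixel whose two motion vectors point in exactly opposite directions gets an angle difference of 180°, which corresponds to motion along a straight line through the intermediate frame. That is the value claim 15 assigns to every pixel when no intermediate frame is available at inference time.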