Artificial intelligence-based animation video frame rate adjustment method and system

By constructing motion saliency features and a binarized gated mask, static and moving regions in animated videos are distinguished. A deep learning frame interpolation network is used to calculate only the effective moving regions, which solves the pseudo-motion problem in static regions in existing technologies and improves the stability and smoothness of animated videos.

CN122265482A · Pending · Publication Date: 2026-06-23 · BEIJING ZHONGCHUANG ZHONGYUE NETWORK TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ZHONGCHUANG ZHONGYUE NETWORK TECH CO LTD
Filing Date
2026-05-28
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing AI-based animation video frame interpolation methods cannot effectively distinguish between disordered pixel fluctuations caused by video compression noise and real local motion. This results in ripple-like creeping or edge distortion in static areas when generating intermediate frames, which damages the stability and visual appeal of 2D animation.

Method used

By acquiring pixel-level optical flow calculations from adjacent frames in the video, motion saliency features are constructed and a binary motion-gated mask is generated to distinguish between static background regions and effective motion regions. A deep learning frame interpolation network is used to calculate only the effective motion regions, while static background regions are directly mapped to the original image pixels, thus avoiding invalid calculations.

Benefits of technology

It effectively identifies and blocks meaningless interpolation calculations in static areas, improving the frame rate adjustment efficiency and viewing quality of animated videos, ensuring the frame rate improvement effect in key local motion areas, while maintaining the stability of static areas, and eliminating the image distortion and creep problems caused by traditional frame interpolation.


Figure CN122265482A_ABST

Abstract

The application belongs to the technical field of image processing and provides an artificial intelligence-based animation video frame rate adjustment method and system. The method comprises: obtaining an initial optical flow field through pixel-level optical flow calculation; combining optical flow amplitude and local spatial variance to calculate motion saliency features; generating a binarized motion gating mask from the motion saliency features, which divides the picture into a static background area and an effective motion area; generating the corresponding intermediate frame pixels for the effective motion area with a deep learning frame interpolation network; performing no network inference for the static background area and instead directly mapping the pixels at the corresponding positions of the original image as intermediate frame pixels; and finally stitching the pixels of the two types of areas, outputting a complete interpolated frame, and inserting it into the original video sequence to raise the frame rate.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, specifically an AI-based method and system for adjusting the frame rate of animated videos.

Background Technology

[0002] Deep learning-based video frame rate enhancement (video frame interpolation) techniques typically synthesize intermediate frames by calculating the global optical flow field or pixel-level motion vectors between adjacent frames. However, when existing interpolation models are applied to 2D limited animation, they often cause severe local creep artifacts in still images. The drawing process of limited animation produces many scenes with large areas of absolute stillness and only minor local movement. For example, in a typical dialogue scene, the character's torso, the background and most of the screen remain completely still, with only local areas such as the lips or eyes moving. At the same time, weak compression noise or high-frequency flickering at edges is inevitably introduced during the compression, encoding or network transmission of animation videos. Most existing AI frame interpolation systems are trained on real-world video footage (which includes global camera micro-motion and environmental white noise) and apply indiscriminate global optical flow interpolation logic. When processing the animation scenes described above, existing optical flow estimation networks are extremely sensitive to pixel changes and cannot effectively distinguish disordered pixel fluctuations caused by video compression noise from real, effective local motion. Existing technology misjudges tiny noise in static areas (such as flatly painted clothing color blocks or backgrounds) as physical displacement, and forces optical flow smoothing and pixel fusion calculations onto these areas, which have not actually moved, when generating intermediate frames. This processing not only wastes computing power but also amplifies background noise that is otherwise difficult to detect with the naked eye, so that image areas that should be absolutely static show continuous ripple-like creep or edge distortion during playback, which seriously damages the characteristic image stability and visual experience of 2D animation. Existing animation video frame interpolation methods therefore lack a motion saliency assessment and gating mechanism for background noise, and cannot lock and protect large static regions from meaningless interpolation calculations while increasing the frame rate of local effective motion regions. To address this, the present invention provides an animation video frame rate adjustment method and system based on artificial intelligence.

Summary of the Invention

[0003] The present invention aims to overcome the shortcomings of the prior art and to solve at least one of the technical problems raised in the background art.

[0004] The technical solution adopted by this invention to solve its technical problem is as follows. In one aspect, the present invention provides an AI-based method for adjusting the frame rate of animated videos, comprising:
Step S1: Obtain two adjacent original frames from the video and perform pixel-level optical flow calculation on the two original frames to obtain the initial optical flow field.
Step S2: Calculate the motion saliency features of the corresponding pixel coordinates based on the amplitude of the optical flow of each pixel in the initial optical flow field and the local spatial variance of the optical flow of each pixel.
Step S3: Compare the motion saliency features with the noise threshold to generate a binarized motion gating mask. The binarized motion gating mask divides the image area corresponding to either of the two original frames into a static background area and an effective motion area.
Step S4: Construct a deep learning frame interpolation network. Based on the binarized motion gating mask, the deep learning frame interpolation network calculates and generates the intermediate frame pixels corresponding to the pixel coordinates of the image pixels in the effective motion area.
Step S5: For the pixel coordinates in the static background area, without deep learning frame interpolation network inference, directly map the image pixels at the corresponding pixel coordinates in the two adjacent original frames as intermediate frame pixels, and stitch the intermediate frame pixels of the static background area and the effective motion area to output the complete interpolated frame.

[0005] Preferably, the specific process for obtaining the two original images is as follows: The input animation video stream to be processed is read by a video decoder, and consecutive frame t and frame t+1 are extracted according to the time sequence as two original frames.
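The pairing of consecutive frames, together with the final insertion of each interpolated frame between its two source frames as described in the abstract, can be sketched as follows. This is a minimal illustration only: the function names `adjacent_pairs` and `insert_interpolated` are hypothetical, frames are treated as opaque objects already decoded into a list, and the `interpolate` callback stands in for the whole S1 to S5 pipeline.

```python
def adjacent_pairs(frames):
    """Return an iterator of (frame_t, frame_t_plus_1) over a list of frames."""
    return zip(frames, frames[1:])


def insert_interpolated(frames, interpolate):
    """Return a new list with one interpolated frame between each adjacent pair.

    `interpolate(frame_t, frame_next)` is a placeholder for the actual
    interpolation pipeline; here it can be any two-argument callable.
    """
    if not frames:
        return []
    output = [frames[0]]
    for frame_t, frame_next in adjacent_pairs(frames):
        output.append(interpolate(frame_t, frame_next))
        output.append(frame_next)
    return output


# Strings stand in for decoded frames so the ordering is easy to inspect.
frames = ["f0", "f1", "f2"]
doubled = insert_interpolated(frames, lambda a, b: f"{a}+{b}")
```

With N input frames this yields 2N - 1 output frames, so playing them over the same duration roughly doubles the frame rate.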

[0006] Preferably, the specific process of obtaining the initial optical flow field is as follows: The color space of the two original images is uniformly converted to the RGB color space, and the pixel values of the two original images are normalized to a preset range to obtain an image tensor of size H×W×3, where H is the image height, W is the image width, and 3 is the number of color channels. The two original images are input into the optical flow estimation network to predict the motion displacement from each pixel coordinate in the t-th original image to the corresponding pixel coordinate in the (t+1)-th original image, thus obtaining the initial optical flow field. The initial optical flow field is a floating-point tensor of size H×W×2 defined over the two-dimensional image grid, where H and W match the height and width of the t-th and (t+1)-th images, and the third dimension has 2 channels: the horizontal displacement vector and the vertical displacement vector of the pixel.
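The normalization part of this preprocessing can be sketched as follows. This is a minimal illustration using plain Python lists so that it runs without dependencies; the function name `normalize_frame` and the 8-bit RGB input are assumptions, and the two target ranges [0,1] and [-1,1] are the examples given in the detailed implementation.

```python
def normalize_frame(frame, signed=False):
    """Scale 8-bit RGB values to [0, 1], or to [-1, 1] when signed=True.

    `frame` is an H x W x 3 nested list of ints in 0..255 (an assumption;
    the source only says pixel values are normalized to a preset range).
    """
    def scale(value):
        unit = value / 255.0
        return 2.0 * unit - 1.0 if signed else unit

    return [[[scale(c) for c in pixel] for pixel in row] for row in frame]


frame = [[[0, 128, 255], [255, 255, 255]]]  # H=1, W=2, 3 channels
unit_range = normalize_frame(frame)
signed_range = normalize_frame(frame, signed=True)
```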

[0007] Preferably, the specific process for calculating the motion saliency features of the corresponding pixel coordinates is as follows: Traverse all pixel coordinates (x, y) in the initial optical flow field, extract the horizontal displacement vector u(x, y) and the vertical displacement vector v(x, y) of the current pixel coordinate (x, y), and calculate the magnitude A(x, y) of the pixel optical flow corresponding to the current pixel coordinate (x, y):

$$A(x,y) = \sqrt{u(x,y)^2 + v(x,y)^2}$$

Centered on the current pixel coordinates (x, y), construct a local sliding window of size K×K and calculate the local spatial variance V(x, y) of the optical flow vectors of all pixels within the local sliding window. First, calculate the local means $\mu_u$ and $\mu_v$ of the u-channel and v-channel within the local sliding window respectively, then calculate the variance:

$$\mu_u = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} u(i,j), \qquad \mu_v = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} v(i,j)$$

$$V(x,y) = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} \left[\left(u(i,j)-\mu_u\right)^2 + \left(v(i,j)-\mu_v\right)^2\right]$$

For each pixel coordinate (x, y), calculate the motion saliency feature S(x, y):

$$S(x,y) = \frac{A(x,y)}{V(x,y) + \epsilon}$$

where $\epsilon$ is a preset minimal positive number, (i,j) denotes the pixel coordinates within the local sliding window centered at pixel coordinates (x,y) and is used to traverse all pixels within the window, Window represents the local sliding window of size K×K centered at pixel coordinates (x,y), K is a preset window size parameter, and K² is the total number of pixels within the local sliding window.
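The per-pixel computation described here can be sketched in plain Python as follows. It is an illustration under stated assumptions, not the patent's implementation: it assumes the saliency takes the form S = A / (V + ε), which matches the description of ε as guarding the denominator and of S rising with amplitude and falling with variance. The function name `motion_saliency`, the default `eps=1e-6`, and the handling of windows at the image border are all hypothetical choices; a practical version would be vectorized with a tensor library.

```python
import math


def motion_saliency(flow, k=5, eps=1e-6):
    """Return the H x W saliency map S = A / (V + eps) for an H x W x 2 flow.

    `flow[y][x]` is a (u, v) pair. Window pixels outside the image are
    skipped (a border policy the source does not specify), so the mean and
    variance near the border are taken over fewer than k*k pixels.
    """
    height, width = len(flow), len(flow[0])
    r = k // 2
    saliency = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            u, v = flow[y][x]
            amplitude = math.hypot(u, v)  # A(x, y)
            window = [
                flow[j][i]
                for j in range(max(0, y - r), min(height, y + r + 1))
                for i in range(max(0, x - r), min(width, x + r + 1))
            ]
            n = len(window)
            mu_u = sum(p[0] for p in window) / n
            mu_v = sum(p[1] for p in window) / n
            variance = sum(
                (p[0] - mu_u) ** 2 + (p[1] - mu_v) ** 2 for p in window
            ) / n  # V(x, y)
            saliency[y][x] = amplitude / (variance + eps)
    return saliency


# Coherent motion: every pixel moves 2 px to the right.
coherent = [[(2.0, 0.0)] * 5 for _ in range(5)]
# Noise-like flow: small displacements that flip sign between neighbours.
noisy = [[(0.1 if (x + y) % 2 else -0.1, 0.0) for x in range(5)] for y in range(5)]
s_coherent = motion_saliency(coherent)[2][2]
s_noisy = motion_saliency(noisy)[2][2]
```

In this toy input the coherent patch has zero local variance, so its saliency is orders of magnitude larger than that of the noisy patch, which is the separation the threshold in the next step relies on.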

[0008] Preferably, the specific process for generating the binarized motion-gated mask is as follows: The motion saliency feature corresponding to each pixel coordinate is compared with a preset noise threshold, and a value is assigned to the pixel coordinate based on the comparison result. If the motion saliency feature corresponding to the current pixel coordinate is less than the noise threshold, the pixel coordinate is assigned a value of 0; if it is greater than or equal to the noise threshold, the pixel coordinate is assigned a value of 1. The assigned values of all coordinates are organized into a two-dimensional matrix of size H×W, which is the binarized motion gating mask.
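The comparison rule above maps directly onto a per-pixel threshold test. The following is a minimal sketch; the function name `gating_mask` and the sample values are illustrative only.

```python
def gating_mask(saliency, threshold):
    """Binarize an H x W saliency map: 1 where S >= threshold, else 0."""
    return [[1 if s >= threshold else 0 for s in row] for row in saliency]


saliency = [[0.2, 0.5], [3.0, 0.49]]
mask = gating_mask(saliency, threshold=0.5)
```

Note that a value exactly equal to the threshold is classified as motion, following the "greater than or equal to" wording.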

[0009] Preferably, the specific process by which the binarized motion gating mask divides the image region corresponding to any one of the two original images into a static background region and an effective motion region is as follows: When the motion saliency feature corresponding to the pixel coordinate is less than the noise threshold, the mask value of the corresponding pixel coordinate in the binarized motion gating mask is set to 0, and the set of all pixel coordinates with a mask value of 0 constitutes the static background region. When the motion saliency feature corresponding to the pixel coordinate is greater than or equal to the noise threshold, the mask value of the corresponding pixel coordinate in the binarized motion gating mask is set to 1, and the set of all pixel coordinates with a mask value of 1 constitutes the effective motion region.

[0010] Preferably, the specific process of generating the intermediate frame pixels corresponding to the pixel coordinates in the intermediate frame is as follows: The two original images, the initial optical flow field, and the binarized motion-gated mask are combined to form an inference input tensor, which is then input into the deep learning frame interpolation network. When the current pixel coordinates are detected to have a mask value of 1 in the binarized motion-gated mask, the corresponding pixel features in the two original images are subjected to forward or backward warping (deformation) based on the initial optical flow field. The pixel features are then projected onto the target interpolation time to obtain a full-size candidate intermediate frame image tensor. The binarized motion-gated mask is used as a spatial extraction index matrix, and only the pixel RGB values corresponding to the pixel coordinates with a mask value of 1 are extracted from the full-size image tensor as the intermediate frame pixels.

[0011] Preferably, the specific process of directly mapping the image pixels with corresponding pixel coordinates in two adjacent original images as intermediate frame pixels is as follows: Extract the original pixel RGB value at the current pixel coordinates from the t-th or t+1-th original image of the two original images, and directly assign the extracted original pixel RGB value to the current pixel coordinates as the corresponding intermediate frame pixel in the intermediate frame.

[0012] Preferably, the specific process of outputting the complete interpolated frame is as follows: Initialize a blank image tensor of size H×W×3. Using the binarized motion-gated mask as the spatial coordinate index matrix, fill the pixel coordinates of the effective motion region with the intermediate frame pixels generated by the deep learning frame interpolation network, and fill the pixel coordinates of the static background region with the intermediate frame pixels mapped directly from the original image. After pixel stitching and fusion, obtain and output the complete interpolated frame.
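The mask-indexed stitching can be sketched as follows. This is an illustrative version in plain Python: the function name `stitch_frame` is hypothetical, and instead of initializing a blank tensor and filling it in two passes, it builds the output in one pass, which gives the same result.

```python
def stitch_frame(mask, interpolated, original):
    """Compose the output frame pixel by pixel from an H x W mask.

    mask == 1 -> take the pixel from `interpolated` (network output);
    mask == 0 -> copy the pixel from `original` (frame t or frame t+1).
    """
    return [
        [
            list(interp_px) if m == 1 else list(orig_px)
            for m, interp_px, orig_px in zip(mask_row, interp_row, orig_row)
        ]
        for mask_row, interp_row, orig_row in zip(mask, interpolated, original)
    ]


mask = [[0, 1]]                                   # H=1, W=2
original = [[[10, 10, 10], [20, 20, 20]]]         # frame t
interpolated = [[[11, 11, 11], [25, 25, 25]]]     # candidate intermediate frame
frame = stitch_frame(mask, interpolated, original)
```

The static pixel keeps its original value even though the candidate frame differs there, which is the locking behaviour the method relies on to suppress creep in still regions.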

[0013] In another aspect, the present invention provides an AI-based animation video frame rate adjustment system, comprising:
Image and optical flow calculation module: acquires two adjacent original images from the video and performs pixel-level optical flow calculation on the two original images to obtain the initial optical flow field.
Motion saliency calculation module: calculates the motion saliency features of the corresponding pixel coordinates based on the amplitude of the optical flow of each pixel in the initial optical flow field and the local spatial variance of the optical flow of each pixel.
Motion-gated mask generation module: compares the motion saliency features with the noise threshold to generate a binarized motion-gated mask. The binarized motion-gated mask divides the image region corresponding to either of the two original images into a static background region and an effective motion region.
Motion region frame interpolation module: constructs a deep learning frame interpolation network. Based on the binarized motion gating mask, the deep learning frame interpolation network calculates and generates the corresponding intermediate frame pixels for the pixel coordinates of the image pixels in the effective motion region.
Static region mapping and fusion output module: for the pixel coordinates in the static background region, without deep learning frame interpolation network inference, directly maps the image pixels at the corresponding pixel coordinates in the two adjacent original images as intermediate frame pixels, and stitches the intermediate frame pixels of the static background region and the effective motion region to output a complete interpolated frame.

[0014] The beneficial effects of this invention are as follows: 1. By combining optical flow amplitude and local spatial variance to construct motion saliency features, it can effectively identify disordered pseudo-motion caused by compression noise and edge flicker in animated videos, avoiding misjudging noise in static areas as valid motion. It uses a binarized motion gating mask to calculate and block static background areas, without participating in deep learning frame interpolation inference, and directly maps the original image pixels to achieve absolute locking of static area pixels on the time axis. This completely solves the problems of image distortion and water ripple creep caused by traditional frame interpolation. It only performs deep learning frame interpolation and deformation calculation on valid motion areas, ensuring the frame rate improvement effect of key local motion areas such as the lips and eyes of the character, and making the transition of intermediate frames more in line with the laws of physical motion.

[0015] 2. By using a gating mechanism to skip redundant calculations such as feature extraction, deformation, and color reconstruction in static areas, it significantly reduces unnecessary computing power and improves the overall processing efficiency of frame rate adjustment. It is specifically designed for scenarios with large areas of stillness and small local movements in limited animation, and effectively adapts to the noise generated during animation compression and transmission, giving it greater versatility and practicality. Using the mask as a spatial coordinate index, it achieves precise pixel-by-pixel stitching, with smooth transitions between moving and static areas. The final output video has both smooth motion and stable images, greatly improving the viewing quality of animated videos.

Attached Figure Description

[0016] The invention will now be further described with reference to the accompanying drawings.

[0017] Figure 1 is a flowchart of the steps of the AI-based animation video frame rate adjustment method of the present invention; Figure 2 is a system module diagram of the AI-based animation video frame rate adjustment system of the present invention.

Detailed Implementation

[0018] To make the technical means, creative features, objectives and effects of this invention easier to understand, the invention will be further described below in conjunction with specific embodiments.

[0019] Example 1: As shown in Figure 1, in this embodiment of the present invention, the animation video frame rate adjustment method based on artificial intelligence includes:
Step S1: Obtain two adjacent original frames from the video, and perform pixel-level optical flow calculation on the two original frames to obtain the initial optical flow field.
First, two adjacent original frames are obtained from the video. Specifically, the input animation video stream to be processed is read by a video decoder, and the consecutive t-th frame and (t+1)-th frame are extracted in time order as the two adjacent original frames. To meet the input requirements of the subsequent neural network, the two original images are preprocessed: their color space is converted to the RGB color space and their pixel values are normalized to a preset range (e.g., [0,1] or [-1,1]), yielding an image tensor of size H×W×3, where H is the image height, W is the image width, and 3 is the number of color channels.
Secondly, pixel-level optical flow calculation is performed on the two original images. In a specific implementation, a pre-trained optical flow estimation network (such as a RAFT network, a PWC-Net network or a similar convolutional neural network architecture; the specific network structure and training process of the optical flow estimation network are well-known technologies and are not described in detail here) is used as the carrier for feature extraction and motion estimation. The two preprocessed original images are input into the optical flow estimation network. The network first extracts multi-scale feature maps of the two original images through a feature encoder. Then, by calculating the correlation between the multi-scale feature maps of the two original images, a multi-scale cost volume is constructed. Next, based on the cost volume, iterative regression or upsampling decoding is performed to predict the motion displacement from each pixel coordinate in the t-th original image to the corresponding pixel coordinate in the (t+1)-th original image.
Finally, the initial optical flow field is obtained. After the above pixel-level optical flow calculation, the result output by the network is the initial optical flow field. In this embodiment, the data structure of the initial optical flow field is a floating-point tensor of size H×W×2 defined over the two-dimensional image grid, where H and W match the height and width of the t-th frame image and the (t+1)-th frame image, and the third dimension has 2 channels, which represent the displacement vector of the pixel in the horizontal direction (usually denoted as the u channel) and the displacement vector in the vertical direction (usually denoted as the v channel), respectively. The initial optical flow field completely records the absolute spatial motion trend of each pixel in the animation video stream to be processed, providing basic data support for the subsequent evaluation of the local motion state.
Step S2: Calculate the motion saliency features of the corresponding pixel coordinates based on the amplitude of the optical flow of each pixel in the initial optical flow field and the local spatial variance of the optical flow of each pixel.
First, calculate the amplitude of the optical flow of each pixel in the initial optical flow field: traverse all pixel coordinates (x, y) in the initial optical flow field (i.e., the floating-point tensor of size H×W×2), and extract the horizontal displacement vector u(x, y) and the vertical displacement vector v(x, y) of the current pixel coordinate (x, y); according to the Euclidean distance formula, calculate the amplitude A(x, y) of the pixel optical flow corresponding to the current pixel coordinate (x, y). The calculation formula is:

$$A(x,y) = \sqrt{u(x,y)^2 + v(x,y)^2}$$

The magnitude A(x,y) of the pixel optical flow represents the absolute motion speed or displacement of the pixel between two adjacent frames.
Secondly, the local spatial variance of the optical flow of each pixel in the initial optical flow field is calculated. The pseudo-motion caused by compression noise or edge flicker in the static area of the animation video is usually disordered and random (i.e., the motion direction and magnitude of adjacent pixels are inconsistent), while real local motion (such as the opening and closing of a character's lips) has high spatial coherence in the local area. Therefore, a local sliding window of size K×K (e.g., a square window of K=5 or K=7) is constructed with the current pixel coordinates (x,y) as the center, and the local spatial variance V(x,y) of the optical flow vectors of all pixels in the local sliding window is calculated. Specifically, the local means $\mu_u$ and $\mu_v$ of the u channel and v channel in the local sliding window are calculated first, and then the variance is calculated:

$$\mu_u = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} u(i,j), \qquad \mu_v = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} v(i,j)$$

$$V(x,y) = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} \left[\left(u(i,j)-\mu_u\right)^2 + \left(v(i,j)-\mu_v\right)^2\right]$$

Here (i,j) represents the pixel coordinates within the local sliding window centered at pixel coordinates (x,y), used to traverse all pixels within the window; Window represents the local sliding window of size K×K centered at pixel coordinates (x,y), where K is a preset window size parameter (e.g., K=5 or K=7) that limits the local spatial range for variance calculation; K² represents the total number of pixels within the local sliding window. The local spatial variance V(x,y) of the pixel optical flow characterizes the consistency of motion in the local area around the pixel. The larger the variance, the more chaotic the motion in the area, which is very likely to be pseudo-motion caused by noise. The smaller the variance, the smoother and more consistent the motion in the area, which is likely to be real object motion.
Finally, combining the amplitude and local spatial variance of the pixel optical flow, the motion saliency feature of the corresponding pixel is calculated. In order to effectively amplify real motion and suppress background noise, a joint evaluation function of amplitude and local spatial variance is constructed. The formula for calculating the motion saliency feature S(x,y) of each pixel coordinate (x,y) is as follows:

$$S(x,y) = \frac{A(x,y)}{V(x,y) + \epsilon}$$

where $\epsilon$ is a preset very small positive number (e.g., a value of...), used to prevent the denominator from being zero and to maintain numerical stability. Based on the above calculation logic, when the pixel coordinates (x, y) are in the actual effective motion region, the optical flow amplitude A(x, y) is large and the local spatial variance V(x, y) is small, resulting in a very high motion saliency feature value.
Conversely, when the pixel coordinates (x, y) are in the static background region containing small noise, the optical flow amplitude A(x, y) is small and the local spatial variance V(x, y) is relatively large due to the randomness of the noise, resulting in a very low motion saliency feature value. After pixel-by-pixel calculation, a motion saliency feature map of size H×W is finally obtained.
Step S3: Compare the motion saliency features with a preset noise threshold to generate a binarized motion gating mask. The binarized motion gating mask divides the image area corresponding to either of the two original images into a static background area and an effective motion area.
First, the preset noise threshold used for comparison is determined. The preset noise threshold represents the critical value that distinguishes pseudo-motion caused by background compression noise from real object motion. In specific implementations, the preset noise threshold can be an empirical constant obtained through offline calibration. For example, a large number of absolutely still animation video frames containing only compression noise are collected in advance, and the maximum value of the motion saliency features of these still animation video frames is calculated and used as the preset noise threshold. In order to adapt to video streams of different image quality, the preset noise threshold can also be a dynamically calculated adaptive threshold. For example, the global mean and standard deviation of the motion saliency features S(x,y) of all pixels in the motion saliency feature map of size H×W output in step S2 are calculated, and the global mean plus one standard deviation is set as the preset noise threshold. For ease of description, the preset noise threshold is uniformly referred to as Th in this embodiment.
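The two threshold options just described can be sketched with the Python standard library. The function names `calibrated_threshold` and `adaptive_threshold` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import fmean, pstdev


def calibrated_threshold(static_saliency_maps):
    """Offline calibration: largest saliency seen in noise-only still frames."""
    return max(s for sal_map in static_saliency_maps for row in sal_map for s in row)


def adaptive_threshold(saliency):
    """Per-frame threshold: global mean plus one standard deviation.

    Uses the population standard deviation (an assumption).
    """
    values = [s for row in saliency for s in row]
    return fmean(values) + pstdev(values)


still_maps = [[[0.1, 0.3]], [[0.2, 0.05]]]   # two 1x2 maps from still frames
th_fixed = calibrated_threshold(still_maps)

current_map = [[0.0, 0.0], [0.0, 4.0]]       # one strongly moving pixel
th_adaptive = adaptive_threshold(current_map)
```

The fixed threshold is computed once and reused, while the adaptive one is recomputed from each frame pair's saliency map and therefore follows changes in image quality.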
Next, the motion saliency features are compared pixel by pixel with the preset noise threshold to generate the binarized motion gating mask: traverse each pixel coordinate (x,y) in the motion saliency feature map of size H×W, and compare the motion saliency feature S(x,y) corresponding to each pixel coordinate (x,y) with the preset noise threshold Th. The pixel coordinates (x, y) are assigned values based on the comparison results: if the motion saliency feature S(x,y) corresponding to the current pixel coordinate (x,y) is less than the preset noise threshold Th, the pixel coordinate (x,y) is assigned the value 0; if S(x,y) is greater than or equal to Th, the pixel coordinate (x,y) is assigned the value 1. The assigned values of all coordinates are organized into a two-dimensional matrix of size H×W, which is the binarized motion gating mask (denoted as M). Finally, the binarized motion-gated mask is used to divide the image region corresponding to either of the two original images into a static background region and an effective motion region. The specific mask assignment and region division rules are as follows: When the motion saliency feature S(x,y) corresponding to pixel coordinate (x,y) is less than the preset noise threshold Th, the motion at pixel coordinate (x,y) is determined to be pseudo-motion caused by background noise, and the mask value of the corresponding pixel coordinate (x,y) in the binarized motion gating mask is set to 0 (i.e., M(x,y)=0).
In the binarized motion-gated mask, the set of all pixel coordinates with a mask value of 0 together constitutes the static background region. When the motion saliency feature S(x,y) corresponding to pixel coordinate (x,y) is greater than or equal to the preset noise threshold Th, the motion at pixel coordinate (x,y) is determined to be real and effective motion (e.g., the lip movement of an animated character), and the mask value of the corresponding pixel coordinate (x,y) in the binarized motion gating mask is set to 1 (i.e., M(x,y)=1). In the binarized motion gating mask, the set of all pixel coordinates with a mask value of 1 together constitutes the effective motion region. Through the above comparison and assignment operations, the binarized motion gating mask of size H×W is output, and the precise numerical decoupling of the noise creeping region and the real motion region in the video image is completed.
Step S4: Construct a deep learning frame interpolation network. Based on the binarized motion gating mask, the deep learning frame interpolation network calculates and generates intermediate frame pixels corresponding to the pixel coordinates of the image pixels in the effective motion area.
First, an existing deep learning frame interpolation network is selected and loaded, and inference input data adapted to the gating mechanism is constructed. The deep learning frame interpolation network adopts an existing well-known architecture (such as DAIN, RIFE, SuperSloMo, etc.), which belongs to the existing technology in the field of video frame interpolation. It inherently includes a warping module for feature alignment and a synthesis module for color reconstruction, so the basic structure of the network itself does not need to be modified. This step innovatively introduces a binarized motion-gated mask as an additional input.
The two original images obtained in step S1, the calculated initial optical flow field, and the binarized motion-gated mask generated in step S3 are combined to form an inference input tensor, which is input into the deep learning frame interpolation network. This ensures that the deep learning frame interpolation network can combine the original image information, pixel motion trends, and region segmentation results to achieve directional frame interpolation only for effective motion regions, avoiding static regions from participating in invalid calculations. Secondly, based on the initial optical flow field and the binarized motion-gated mask, pixel-level gated deformation inference is performed for the effective motion region: Based on the deformation module of the deep learning frame interpolation network, this step adds spatial gating control logic, using the binarized motion gating mask as the enable switch for spatial dimension calculation; iterates through all pixel coordinates (x,y) in the image of size H×W, and when it is detected that the mask value corresponding to the current pixel coordinate (x,y) in the binarized motion gating mask is 1 (i.e., the pixel coordinate (x,y) belongs to the effective motion region), the deformation module activates the interpolation calculation for the pixel coordinate (x,y); Using the displacement vector corresponding to the pixel coordinates (x, y) in the initial optical flow field, the corresponding pixel features in the two original images are subjected to forward warping or backward warping. The pixel features in the two original images are projected along the motion trajectory to the target interpolation time (e.g., timestamp t+0.5), providing an aligned feature basis for the color reconstruction of the pixels in the subsequent intermediate frames. 
For the static background region, where the mask value in the binarized motion gating mask is 0, the warping and interpolation calculations at pixel coordinate (x,y) are skipped and the original pixel information is kept unchanged, so that noise does not take part in frame interpolation and cause creep artifacts.

Finally, based on the gated inference, the intermediate frame pixels of the effective motion region are generated. The synthesis module of the deep learning frame interpolation network receives the warped and projected pixel features and performs feature fusion and color reconstruction through a multi-layer convolutional neural network, outputting a full-size candidate intermediate frame image tensor. To ensure that only the real motion region is updated and that the static region is not modified by mistake, the binarized motion gating mask is used as a spatial extraction index matrix: from the full-size image tensor reconstructed by the synthesis module, only the RGB values at pixel coordinates (x,y) whose mask value is 1 are extracted. The pixels output by this extraction, with their new RGB values, are the intermediate frame pixels generated by the deep learning frame interpolation network for the effective motion region. Each intermediate frame pixel corresponds one-to-one with a pixel coordinate (x, y) in the effective motion region and occupies the same spatial position in the intermediate frame. The intermediate frame pixels of the static background region directly use the original pixel values at the corresponding pixel coordinates (x, y) in the two original images, without network reconstruction, which further ensures the stability of the static region.
Step S5: For the pixel coordinates in the static background region, without inference through the deep learning frame interpolation network, directly map the image pixels at the corresponding pixel coordinates in the two adjacent original images as intermediate frame pixels, and stitch the intermediate frame pixels of the static background region and the effective motion region to output a complete interpolated frame.
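Using the mask as a spatial extraction index matrix can be sketched with NumPy integer-array indexing. The function name `extract_motion_pixels` and the toy candidate tensor are assumptions for illustration; in practice the candidate would be the output of the synthesis module.

```python
import numpy as np


def extract_motion_pixels(candidate: np.ndarray, mask: np.ndarray):
    """Return the (row, col) coordinates with mask == 1 and their RGB values.

    candidate: H x W x 3 full-size candidate intermediate frame.
    mask:      H x W binarized motion gating mask.
    """
    ys, xs = np.nonzero(mask)
    coords = np.stack([ys, xs], axis=1)   # N x 2, one row per motion pixel
    rgb = candidate[ys, xs]               # N x 3, same order as coords
    return coords, rgb


# Toy 2 x 2 candidate frame: each pixel's RGB is (10*row + col) repeated.
candidate = np.array([[[0, 0, 0], [1, 1, 1]],
                      [[10, 10, 10], [11, 11, 11]]], dtype=np.float32)
mask = np.array([[0, 1],
                 [1, 0]], dtype=np.uint8)

coords, rgb = extract_motion_pixels(candidate, mask)
```

Keeping the coordinates alongside the RGB values preserves the one-to-one correspondence between each extracted pixel and its position in the intermediate frame.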

[0020] First, blocking and direct mapping are performed on the static background region to generate its intermediate frame pixels. All pixel coordinates (x, y) in the H×W image are traversed. When the mask value at the current pixel coordinate (x, y) in the binarized motion gating mask is 0 (i.e., (x, y) belongs to the static background region), any feature extraction, warping, or color reconstruction by the deep learning frame interpolation network for that coordinate is explicitly blocked at the system control flow level, which prevents noise in the static region from being amplified into pseudo-motion during network inference. Direct mapping logic is then executed: the original RGB value at (x, y) is extracted from either the first original image (the t-th frame) or the second original image (the (t+1)-th frame) and is assigned directly to (x, y) as the corresponding intermediate frame pixel. With this zero-interpolation pass-through mapping, the slight compression noise in the original image is held fixed along the time axis, which cuts off at the source the path by which the AI network would amplify noise into pseudo-motion. Second, the intermediate frame pixels of the static background region and the effective motion region are stitched together by spatial coordinate: a blank image tensor of size H×W×3 is initialized in memory to hold the final complete image, and the binarized motion gating mask is used as the spatial coordinate index matrix to ensure that the stitching is coordinate-accurate.
For all pixel coordinates (x, y) with a mask value of 1 in the binarized motion gating mask, the intermediate frame pixels of the effective motion region generated by the deep learning frame interpolation network in step S4 are filled into the corresponding pixel coordinates (x, y) of the blank image tensor. For all pixel coordinates (x, y) with a mask value of 0, the intermediate frame pixels of the static background region obtained through the direct mapping operation are filled into the corresponding pixel coordinates (x, y) of the blank image tensor. After this pixel-by-pixel filling and stitching by absolute coordinate position, the pixel data from the two different sources is fused into a single image.
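The direct mapping and stitching just described can be sketched with boolean-mask indexing in NumPy. This is an illustrative sketch: the function name is hypothetical, and the choice of frame t (rather than frame t+1) as the source for static pixels is an assumption, since the text allows either.

```python
import numpy as np


def stitch_interpolated_frame(candidate, frame_t, mask):
    """Fill a blank H x W x 3 tensor from two sources selected by the mask.

    mask == 1: pixel taken from the network's candidate intermediate frame.
    mask == 0: pixel copied unchanged from the original frame t.
    """
    out = np.zeros_like(frame_t)       # blank image tensor
    motion = mask.astype(bool)
    out[motion] = candidate[motion]    # effective motion region
    out[~motion] = frame_t[~motion]    # static background region (direct mapping)
    return out


frame_t = np.zeros((2, 2, 3), dtype=np.float32)        # original frame t
candidate = np.full((2, 2, 3), 9.0, dtype=np.float32)  # stand-in for network output
mask = np.array([[1, 0],
                 [0, 1]], dtype=np.uint8)

interpolated = stitch_interpolated_frame(candidate, frame_t, mask)
```

Since every coordinate has a mask value of either 0 or 1, the two assignments together fill the blank tensor completely, and static pixels are bit-identical to the source frame.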

[0021] Finally, the complete interpolated frame is output. After the stitching operation, the blank image tensor is completely filled and forms a full-size RGB image tensor with complete image information; this tensor is the complete interpolated frame. In the effective motion region, the complete interpolated frame shows a smooth transition consistent with the physical motion, while in the static background region it preserves the static state of the original image, eliminating the local creep artifacts caused by background noise. The complete interpolated frame is then inserted in time order between the two original images (frame t and frame t+1), and the animation video stream with the adjusted frame rate is output.
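Inserting each interpolated frame between its two source frames can be sketched as follows. The helper name `insert_interpolated_frames` is hypothetical, and the example assumes one interpolated frame per adjacent pair (target time t+0.5), so N input frames produce 2N-1 output frames; plain numbers stand in for image tensors.

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")


def insert_interpolated_frames(
    frames: List[Frame],
    interpolate: Callable[[Frame, Frame], Frame],
) -> List[Frame]:
    """Return the sequence with one interpolated frame between each adjacent pair."""
    if not frames:
        return []
    out = [frames[0]]
    for prev, nxt in zip(frames, frames[1:]):
        out.append(interpolate(prev, nxt))   # frame at t + 0.5
        out.append(nxt)                      # original frame t + 1
    return out


# Scalars stand in for frames; the midpoint stands in for the interpolation step.
sequence = insert_interpolated_frames([0.0, 1.0, 2.0], lambda a, b: (a + b) / 2)
```

The original frames are passed through unchanged and keep their relative order, so only the new intermediate frames depend on the interpolation step.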

[0022] Embodiment 2: As shown in Figure 2, based on the specific implementation of Embodiment 1, the present invention provides an AI-based animation video frame rate adjustment system, comprising: an image and optical flow calculation module, which acquires two adjacent original images from the video and performs pixel-level optical flow calculation on the two original images to obtain the initial optical flow field; a motion saliency calculation module, which calculates the motion saliency feature of each pixel coordinate based on the amplitude of the optical flow of each pixel in the initial optical flow field and the local spatial variance of the optical flow of each pixel; a motion gating mask generation module, which compares the motion saliency features with the noise threshold to generate a binarized motion gating mask, the binarized motion gating mask dividing the image region corresponding to either of the two original images into a static background region and an effective motion region; a motion region frame interpolation module, which constructs a deep learning frame interpolation network that, based on the binarized motion gating mask, calculates and generates the intermediate frame pixels for the pixel coordinates of the image pixels in the effective motion region; and a static region mapping and fusion output module, which, for the pixel coordinates in the static background region, without inference through the deep learning frame interpolation network, directly maps the image pixels at the corresponding pixel coordinates in the two adjacent original images as intermediate frame pixels, and stitches the intermediate frame pixels of the static background region and the effective motion region to output a complete interpolated frame.
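The way the five modules hand data to one another can be sketched as a small pipeline object. Everything named here is hypothetical: `FrameRatePipeline` and its fields are illustrative, the three callables are injected placeholders for the optical flow network, the saliency computation, and the frame interpolation network, and the stand-ins used in the demonstration (flow magnitude as saliency, frame averaging as interpolation) are not the method of this document.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

Array = np.ndarray


@dataclass
class FrameRatePipeline:
    estimate_flow: Callable[[Array, Array], Array]             # frames -> H x W x 2 flow
    compute_saliency: Callable[[Array], Array]                 # flow -> H x W saliency
    interpolate: Callable[[Array, Array, Array, Array], Array] # -> H x W x 3 candidate
    noise_threshold: float

    def run(self, frame_t: Array, frame_t1: Array) -> Array:
        flow = self.estimate_flow(frame_t, frame_t1)       # image and optical flow module
        saliency = self.compute_saliency(flow)             # motion saliency module
        mask = saliency >= self.noise_threshold            # gating mask module
        out = frame_t.copy()                               # static region: direct mapping
        if mask.any():                                     # motion region interpolation
            candidate = self.interpolate(frame_t, frame_t1, flow, mask)
            out[mask] = candidate[mask]                    # fusion output
        return out


# Demonstration with stand-in callables (assumptions, for illustration only).
demo_flow = np.zeros((2, 2, 2), dtype=np.float32)
demo_flow[0, 0] = (3.0, 0.0)                               # one moving pixel

pipeline = FrameRatePipeline(
    estimate_flow=lambda a, b: demo_flow,
    compute_saliency=lambda f: np.hypot(f[..., 0], f[..., 1]),
    interpolate=lambda a, b, f, m: (a + b) / 2,
    noise_threshold=1.0,
)
frame_a = np.zeros((2, 2, 3), dtype=np.float32)
frame_b = np.ones((2, 2, 3), dtype=np.float32)
result = pipeline.run(frame_a, frame_b)
```

Injecting the three callables keeps the gating and fusion logic independent of which flow estimator or interpolation network is plugged in, which mirrors the text's statement that the basic structure of the existing network is not modified.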

[0023] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of the present invention is defined by the appended claims and their equivalents.

Claims

1. An AI-based method for adjusting the frame rate of animated videos, characterized by comprising: Step S1: obtain two adjacent original images from the video, and perform pixel-level optical flow calculation on the two original images to obtain the initial optical flow field; Step S2: calculate the motion saliency feature of each pixel coordinate based on the amplitude of the optical flow of each pixel in the initial optical flow field and the local spatial variance of the optical flow of each pixel; Step S3: compare the motion saliency features with the noise threshold to generate a binarized motion gating mask, the binarized motion gating mask dividing the image region corresponding to either of the two original images into a static background region and an effective motion region; Step S4: construct a deep learning frame interpolation network, and, based on the binarized motion gating mask, use the deep learning frame interpolation network to calculate and generate the intermediate frame pixels for the pixel coordinates of the image pixels in the effective motion region; Step S5: for the pixel coordinates in the static background region, without inference through the deep learning frame interpolation network, directly map the image pixels at the corresponding pixel coordinates in the two adjacent original images as intermediate frame pixels, and stitch the intermediate frame pixels of the static background region and the effective motion region to output the complete interpolated frame.

2. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process of obtaining the two original images is as follows: the input animation video stream to be processed is read by a video decoder, and consecutive frames t and t+1 are extracted in time order as the two original images.

3. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process for obtaining the initial optical flow field is as follows: the color space of the two original images is uniformly converted to the RGB color space, and the pixel values of the two original images are normalized to a preset range to obtain image tensors of size H×W×3, where H is the image height, W is the image width, and 3 is the number of color channels; the two original images are input into the optical flow estimation network to predict the motion displacement from each pixel coordinate in the t-th frame image to the corresponding pixel coordinate in the (t+1)-th frame image, thus obtaining the initial optical flow field; the initial optical flow field is a floating-point tensor of size H×W×2, where H and W are consistent with the height and width of the t-th and (t+1)-th frame images, and the 2 channels of the third dimension are the horizontal displacement and the vertical displacement of the pixel.

4. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process for calculating the motion saliency features of the corresponding pixel coordinates is as follows: traverse all pixel coordinates (x, y) in the initial optical flow field, extract the horizontal displacement u(x, y) and the vertical displacement v(x, y) at the current pixel coordinate (x, y), and calculate the magnitude A(x, y) of the pixel optical flow at the current pixel coordinate (x, y): $A(x,y) = \sqrt{u(x,y)^2 + v(x,y)^2}$; centered on the current pixel coordinate (x, y), construct a local sliding window of size K×K, and calculate the local spatial variance V(x, y) of the optical flow vectors of all pixels within the local sliding window: first calculate the local means $\mu_u$ and $\mu_v$ of the u-channel and v-channel within the local sliding window, $\mu_u = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} u(i,j)$ and $\mu_v = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} v(i,j)$, then calculate the variance $V(x,y) = \frac{1}{K^2}\sum_{(i,j)\in \mathrm{Window}} \left[ (u(i,j)-\mu_u)^2 + (v(i,j)-\mu_v)^2 \right]$; for each pixel coordinate (x, y), calculate the motion saliency feature S(x, y) from A(x, y) and V(x, y) (the formula for S(x, y) and the symbol of its preset minimal positive number are not reproduced in the source text); wherein (i, j) denotes the pixel coordinates within the local sliding window, taking the center pixel coordinate (x, y) as the origin, and is used to traverse all pixels within the window, Window denotes the local sliding window of size K×K centered at pixel coordinate (x, y), K is a preset window size parameter, and K² is the total number of pixels within the local sliding window.

5. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process for generating the binarized motion gating mask is as follows: the motion saliency feature corresponding to each pixel coordinate is compared with a preset noise threshold, and a value is assigned to the pixel coordinate based on the comparison result; if the motion saliency feature corresponding to the current pixel coordinate is less than the noise threshold, the pixel coordinate is assigned a value of 0; if the motion saliency feature corresponding to the current pixel coordinate is greater than or equal to the noise threshold, the pixel coordinate is assigned a value of 1; the values assigned to all pixel coordinates are organized into a two-dimensional matrix of size H×W, and this two-dimensional matrix is the binarized motion gating mask.

6. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process by which the binarized motion gating mask divides the image region corresponding to either of the two original images into a static background region and an effective motion region is as follows: when the motion saliency feature corresponding to a pixel coordinate is less than the noise threshold, the mask value of that pixel coordinate in the binarized motion gating mask is set to 0, and the set of all pixel coordinates with a mask value of 0 constitutes the static background region; when the motion saliency feature corresponding to a pixel coordinate is greater than or equal to the noise threshold, the mask value of that pixel coordinate in the binarized motion gating mask is set to 1, and the set of all pixel coordinates with a mask value of 1 constitutes the effective motion region.

7. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process of generating the intermediate frame pixels corresponding to the pixel coordinates in the intermediate frame is as follows: the two original images, the initial optical flow field, and the binarized motion gating mask are combined to form an inference input tensor, which is input into the deep learning frame interpolation network; when the mask value of the current pixel coordinate in the binarized motion gating mask is detected to be 1, the corresponding pixel features in the two original images are forward warped or backward warped based on the initial optical flow field, and the pixel features are projected to the target interpolation time to obtain a full-size candidate intermediate frame image tensor; the binarized motion gating mask is used as a spatial extraction index matrix, and only the RGB values at the pixel coordinates with a mask value of 1 are extracted from the full-size image tensor as the intermediate frame pixels.

8. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process of directly mapping the image pixels at the corresponding pixel coordinates in the two adjacent original images as intermediate frame pixels is as follows: extract the original RGB value at the current pixel coordinate from the t-th or the (t+1)-th frame image of the two original images, and directly assign the extracted original RGB value to the current pixel coordinate as the corresponding intermediate frame pixel in the intermediate frame.

9. The method for adjusting the frame rate of animated videos based on artificial intelligence according to claim 1, characterized in that the specific process of outputting the complete interpolated frame is as follows: initialize a blank image tensor of size H×W×3; use the binarized motion gating mask as the spatial coordinate index matrix; fill the intermediate frame pixels of the effective motion region into the corresponding pixel coordinates of the blank image tensor, and fill the intermediate frame pixels of the static background region into the corresponding pixel coordinates of the blank image tensor; after this pixel stitching and fusion, obtain and output the complete interpolated frame.

10. An AI-based animation video frame rate adjustment system, used to perform the method described in any one of claims 1-9, characterized by comprising: an image and optical flow calculation module, which acquires two adjacent original images from the video and performs pixel-level optical flow calculation on the two original images to obtain the initial optical flow field; a motion saliency calculation module, which calculates the motion saliency feature of each pixel coordinate based on the amplitude of the optical flow of each pixel in the initial optical flow field and the local spatial variance of the optical flow of each pixel; a motion gating mask generation module, which compares the motion saliency features with the noise threshold to generate a binarized motion gating mask, the binarized motion gating mask dividing the image region corresponding to either of the two original images into a static background region and an effective motion region; a motion region frame interpolation module, which constructs a deep learning frame interpolation network that, based on the binarized motion gating mask, calculates and generates the intermediate frame pixels for the pixel coordinates of the image pixels in the effective motion region; and a static region mapping and fusion output module, which, for the pixel coordinates in the static background region, without inference through the deep learning frame interpolation network, directly maps the image pixels at the corresponding pixel coordinates in the two adjacent original images as intermediate frame pixels, and stitches the intermediate frame pixels of the static background region and the effective motion region to output a complete interpolated frame.