Adaptive video frame filtering method based on scene complexity and early exit mechanism and related device

By calculating the absolute difference between adjacent frames and scene complexity, and combining it with a multi-layer lightweight neural network for frame filtering, the problem of redundant calculation and resource consumption in video frame filtering is solved, realizing an adaptive video frame filtering method that is suitable for scenarios such as security monitoring, intelligent transportation, and industrial quality inspection.

CN122244752A · Pending · Publication Date: 2026-06-19 · SHENZHEN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN UNIV
Filing Date
2026-03-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing video frame filtering methods lead to redundant computation when video content changes slowly or the scene is simple, ignoring the semantic complexity of the scene and the importance of the task, resulting in resource consumption and deployment difficulties.

Method used

By calculating the absolute difference between adjacent frames, frame difference information is obtained and scene complexity is calculated. A multi-layer lightweight neural network is used for frame admission and early exit, and the threshold is dynamically adjusted to adaptively determine whether to retain a frame.

Benefits of technology

It significantly reduces computational load while preserving task accuracy, adapts frame filtering to different scenarios, and suits devices with limited computing resources.



Abstract

This invention relates to the field of computer vision analysis technology, and discloses an adaptive video frame filtering method and related apparatus based on scene complexity and an early exit mechanism. The method includes: calculating the absolute difference of pixels between two adjacent video frames in a real-time video stream to obtain frame difference information; calculating the scene complexity value of the real-time video stream based on the frame difference information; inputting each video frame in the real-time video stream into a frame admission and early exit module based on inter-frame differences to obtain a decision result; wherein the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network, the multi-layer lightweight neural network including multiple progressively deeper feature extraction layers, each feature extraction layer having an early exit branch. This invention achieves adaptive determination of whether to retain the current frame, ensuring task accuracy while significantly reducing computational load.

Description

Technical Field

[0001] This invention relates to the field of computer vision analysis technology, specifically to an adaptive video frame filtering method, apparatus, computer device, and storage medium based on scene complexity and an early exit mechanism.

Background Technology

[0002] In scenarios such as security monitoring, intelligent transportation, and industrial quality inspection, real-time video analysis often relies on DNN / Transformer networks to achieve high accuracy. However, when performing frame-by-frame inference on all frames, the computation and memory usage are extremely high; when multiple video streams are processed concurrently, resource contention leads to increased latency and decreased accuracy.

[0003] To alleviate the load, existing frame filtering / frame sampling methods reduce downstream computation by removing temporal redundancy. However, most existing methods only consider inter-frame pixel or shallow feature redundancy, ignoring the differences in semantic importance in different scenarios; even in scenarios with slow changes or simple content, they still use filtering networks and rules with fixed complexity, resulting in extra computational waste and making them difficult to deploy on devices with limited computing power.

[0004] Specifically, the inventors of this application have discovered the following problems with the prior art:

1. When the video content changes slowly or the scene is simple, the differences between frames are small, and frame-by-frame inference leads to a lot of redundant calculations.

[0005] 2. Existing frame filtering methods only determine whether to discard a frame based on a fixed threshold frame difference or motion estimation, ignoring the semantic complexity of the scene and the importance of the task.

[0006] 3. Fixed-structure filtering networks still perform all computational layers in simple scenarios, leading to unnecessary resource consumption.

Summary of the Invention

[0007] In view of the above problems, embodiments of the present invention provide an adaptive video frame filtering method, apparatus, computer device and storage medium based on scene complexity and early exit mechanism, which is used to solve the problems of low frame filtering efficiency, poor semantic adaptability and difficult deployment in the prior art.

[0008] According to one aspect of the present invention, an adaptive video frame filtering method based on scene complexity and an early exit mechanism is provided, the method comprising:

calculating the absolute difference of pixels between two adjacent video frames in a real-time video stream to obtain the frame difference information of the real-time video stream;

calculating the scene complexity value of the real-time video stream based on the frame difference information, wherein the scene complexity value characterizes the intensity of changes in the video scene in the real-time video stream;

inputting each video frame in the real-time video stream into the frame admission and early exit module based on inter-frame differences to obtain the decision result, wherein the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network, the multi-layer lightweight neural network includes multiple progressively deeper feature extraction layers, and each feature extraction layer is provided with an early exit branch.

Specifically, when a video frame is input to the current feature extraction layer, the current feature extraction layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the calculation of subsequent feature extraction layers is immediately stopped, and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, the calculation continues to propagate to deeper feature extraction layers. The retention confidence score represents the probability, as estimated by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

[0009] In one optional approach, calculating the absolute difference of pixels between two adjacent video frames in the real-time video stream to obtain the frame difference information of the real-time video stream includes: reading the original video frames of the real-time video stream frame by frame; converting the original video frames to grayscale and downsampling them to obtain preprocessed video frames of the real-time video stream; and calculating the pixel-wise absolute difference between two adjacent preprocessed video frames to obtain the frame difference information.

[0010] In one optional approach, the frame difference information comprises multiple frame difference images; the step of calculating the scene complexity value of the real-time video stream based on the frame difference information includes: calculating the mean, the variance, and the proportion of non-zero pixels of all the frame difference images; and obtaining the scene complexity value of the real-time video stream by weighted fusion of the mean, the variance, and the proportion of non-zero pixels of all the frame difference images.

[0011] In one optional approach, before inputting each video frame into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: obtaining training video stream samples, the training video stream samples including sample video streams labeled with retention confidence values; and inputting the training video stream samples into a multi-layer lightweight neural network for iterative training to obtain a trained multi-layer lightweight neural network.

[0012] In one optional approach, the dynamic threshold is determined by: obtaining the current device computing resource information and the scene complexity value; when the current device computing resources are lower than the first resource threshold or the scene complexity value is lower than the first complexity threshold, obtaining the dynamic threshold by adjusting the preset base threshold according to the first weight; and when the current device computing resources are greater than or equal to the first resource threshold or the scene complexity value is greater than or equal to the first complexity threshold, obtaining the dynamic threshold by adjusting the preset base threshold according to the second weight; wherein the first weight is less than or equal to 1 and the second weight is greater than 1.

[0013] In one optional approach, the decision result is either to delete or to retain the current video frame; after inputting each video frame in the real-time video stream into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: when the decision result is to delete the current video frame, deleting the current video frame from the real-time video stream; and when the decision result is to retain the current video frame, retaining the current video frame in the real-time video stream and continuing with the judgment of the next video frame.

[0014] According to another aspect of the present invention, an adaptive video frame filtering device based on scene complexity and an early exit mechanism is provided, comprising:

a first calculation module, used to calculate the pixel-wise absolute difference between two adjacent video frames in the real-time video stream and obtain the frame difference information of the real-time video stream;

a second calculation module, used to calculate the scene complexity value of the real-time video stream based on the frame difference information, wherein the scene complexity value characterizes the intensity of changes in the video scene in the real-time video stream;

a decision module, used to input each video frame in the real-time video stream into the frame admission and early exit module based on inter-frame differences to obtain the decision result, wherein the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network, the multi-layer lightweight neural network includes multiple progressively deeper feature extraction layers, and each feature extraction layer is provided with an early exit branch.

Specifically, when a video frame is input to the current feature extraction layer, the current feature extraction layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the calculation of subsequent feature extraction layers is immediately stopped, and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, the calculation continues to propagate to deeper feature extraction layers. The retention confidence score represents the probability, as estimated by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

[0015] In one optional approach, the frame difference information comprises multiple frame difference images; the step of calculating the scene complexity value of the real-time video stream based on the frame difference information includes: calculating the mean, the variance, and the proportion of non-zero pixels of all the frame difference images; and obtaining the scene complexity value of the real-time video stream by weighted fusion of the mean, the variance, and the proportion of non-zero pixels of all the frame difference images.

[0016] According to another aspect of the present invention, a computer device is provided, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus; The memory is used to store at least one executable instruction that causes the processor to perform the operation of the adaptive video frame filtering method based on scene complexity and early exit mechanism.

[0017] According to another aspect of the present invention, a computer-readable storage medium is provided, the storage medium storing at least one executable instruction, which, when executed on a computer device, causes the computer device to perform the operation of the adaptive video frame filtering method based on scene complexity and early exit mechanism.

[0018] This invention provides a method to obtain frame difference information for a real-time video stream by calculating the absolute difference between pixels in adjacent frames. The scene complexity of the real-time video stream is then calculated based on this frame difference information. Each video frame in the real-time video stream is input into a frame admission and early exit module based on inter-frame differences to obtain a decision result. This module is a multi-layered lightweight neural network, comprising multiple progressively deeper feature extraction layers, each with an early exit branch. This invention achieves adaptive determination of whether to retain the current frame, ensuring task accuracy while significantly reducing computational load.

[0019] The above description is merely an overview of the technical solutions of the embodiments of the present invention. In order to better understand the technical means of the embodiments of the present invention and to implement them in accordance with the contents of the specification, and to make the above and other objects, features and advantages of the embodiments of the present invention more apparent and understandable, specific embodiments of the present invention are described below.

Description of the Drawings

[0020] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:

Figure 1 is a flowchart of the adaptive video frame filtering method based on scene complexity and early exit mechanism provided in an embodiment of the present invention.

Figure 2 is a flowchart of an adaptive video frame filtering method based on scene complexity and early exit mechanism provided in an embodiment of the present invention.

Figure 3 shows the structure of the frame admission and early exit module based on inter-frame differences according to an embodiment of the present invention.

Figure 4 is a schematic diagram of the adaptive video frame filtering device based on scene complexity and early exit mechanism provided in an embodiment of the present invention.

Figure 5 is a schematic diagram of the structure of a computer device provided in an embodiment of the present invention.

Detailed Description

[0021] Exemplary embodiments of the invention will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be implemented in various forms and should not be limited to the embodiments set forth herein.

[0022] Figure 1 shows a flowchart of an adaptive video frame filtering method based on scene complexity and an early exit mechanism provided in an embodiment of the present invention. This method is executed by a computer device. The computer device can be a desktop computer, laptop computer, smart terminal, distributed device, wearable device, etc., and the embodiments of the present invention do not impose specific limitations. As shown in Figure 1, the method includes the following steps:

Step 110: Calculate the absolute difference of pixels between two adjacent video frames in the real-time video stream to obtain the frame difference information of the real-time video stream.

[0023] First, the original video frames of the real-time video stream are read frame by frame. To reduce computational complexity, each original video frame is converted to grayscale and then downsampled before the frame difference is calculated, yielding the preprocessed video frames of the real-time video stream. Because the original video frames have high resolution and a large number of pixels, calculating the frame difference on them directly is time-consuming. Downsampling reduces the number of pixels, and therefore the computation, without affecting the detection of the overall motion trend.
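The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the source only says "grayscale" and "downsample", so the luminance weights (ITU-R BT.601), the decimation strategy, and the function names `to_grayscale`, `downsample` and `preprocess` are all assumptions made for the example. Frames are represented as nested lists of `(R, G, B)` tuples to keep the sketch dependency-free.

```python
def to_grayscale(frame_rgb):
    """Convert an H x W grid of (R, G, B) tuples into an H x W grid of ints."""
    # Assumption: BT.601 luminance weights; the source does not name a formula.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in frame_rgb
    ]


def downsample(gray, factor=2):
    """Keep every `factor`-th pixel in each direction (plain decimation)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [row[::factor] for row in gray[::factor]]


def preprocess(frame_rgb, factor=2):
    """Grayscale first, then downsample, in the order given in the text."""
    return downsample(to_grayscale(frame_rgb), factor)
```

With `factor=2`, a 4 x 4 frame becomes a 2 x 2 frame, so the later frame difference touches a quarter of the pixels.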

[0024] Next, the pixel-wise absolute difference between two adjacent preprocessed video frames is calculated to obtain the frame difference information. This frame difference information consists of multiple frame difference images. Specifically, the pixel change at a given location is quantified by calculating the absolute difference between corresponding pixels in adjacent frames. A larger difference indicates a more significant pixel change at that location, likely indicating a moving area; a difference of 0 indicates no change at that location, such as a background area. After calculating the absolute difference at each pixel location for each pair of adjacent preprocessed frames, a frame difference image with the same size as the preprocessed frame is obtained.
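A minimal sketch of the frame difference step, assuming preprocessed frames are equally sized grids of integer gray values. The function names `frame_difference` and `frame_differences` are illustrative and do not come from the source.

```python
def frame_difference(prev, curr):
    """Pixel-wise absolute difference of two equally sized grayscale frames."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same size")
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def frame_differences(frames):
    """One frame difference image per pair of adjacent frames."""
    return [frame_difference(a, b) for a, b in zip(frames, frames[1:])]
```

A stream of K preprocessed frames therefore yields K - 1 frame difference images, each the same size as the preprocessed frames.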

[0025] Step 120: Calculate the scene complexity value of the real-time video stream based on the frame difference information.

[0026] The scene complexity value is used to characterize the intensity of changes in the video scene in the real-time video stream.

[0027] In this embodiment of the invention, the scene complexity is calculated in the following way: First, calculate the mean, variance, and proportion of non-zero pixels for all the frame difference images. The mean reflects the average level of pixel differences across all frame difference images; a larger mean indicates more drastic overall pixel changes between video frames and higher overall scene motion intensity. Specifically, determine the statistical window: select N consecutive frame difference images, denoted as the frame difference image set $F = \{F_1, F_2, \ldots, F_N\}$. Calculate the mean of a single frame difference image: for each frame difference image $F_i$, calculate the average of all pixel values: $\mu_i = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} F_i(x, y)$, where $W \times H$ is the resolution of the frame difference image and $F_i(x, y)$ is the pixel value at position $(x, y)$ of the $i$-th frame difference image.

[0028] Calculate the global mean of all frame difference images: take the average of the individual frame means of the N frame difference images, using the following formula: $\mu = \frac{1}{N} \sum_{i=1}^{N} \mu_i$. Variance reflects the dispersion of pixel values in frame differences. A large variance indicates that some areas experience drastic pixel changes while others remain unchanged, resulting in uneven distribution of scene motion; a small variance indicates that scene changes are uniform.

[0029] Calculate the variance over the pixel values of all frame difference images: combine the pixel values of all N frame difference images into a one-dimensional array. Its variance is: $\sigma^2 = \frac{1}{N \times W \times H} \sum_{k=1}^{N \times W \times H} (p_k - \mu)^2$, where $N \times W \times H$ is the total number of pixels, $p_k$ is the value of a single pixel, and $\mu$ is the global mean.
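The mean and variance statistics described above can be sketched in plain Python as follows. The function names are illustrative, and the sketch assumes all frame difference images in the window share the same resolution, in which case the average of the per-frame means equals the mean of the pooled pixels.

```python
def frame_mean(diff):
    """Average pixel value of one frame difference image."""
    height, width = len(diff), len(diff[0])
    return sum(sum(row) for row in diff) / (width * height)


def global_mean(diffs):
    """Average of the per-frame means over the N images in the window."""
    return sum(frame_mean(d) for d in diffs) / len(diffs)


def global_variance(diffs):
    """Population variance of all pixels pooled into one flat array."""
    pixels = [p for d in diffs for row in d for p in row]
    mu = sum(pixels) / len(pixels)
    return sum((p - mu) ** 2 for p in pixels) / len(pixels)
```

A window with one locally changing frame and one static frame gives a small mean but a comparatively large variance, which is the "uneven motion" case the text describes.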

[0030] Calculate the proportion of non-zero pixels across all frame difference images. The proportion of non-zero pixels reflects the percentage of the scene where changes have occurred. A higher proportion indicates a wider area of motion; a lower proportion indicates only localized changes.

[0031] In this embodiment of the invention, for each frame difference image $F_i$, first perform threshold filtering: set pixel values less than the threshold $T$ to 0, and retain the original values of pixels greater than or equal to $T$, to obtain a binarized frame difference image $F_i'$, where $F_i'(x, y) = F_i(x, y)$ if $F_i(x, y) \ge T$ and $F_i'(x, y) = 0$ otherwise. Calculate the proportion of non-zero pixels in a single frame difference image: count the number of non-zero pixels $C_i$ in $F_i'$ and divide by the total number of pixels $W \times H$, using the following formula: $r_i = \frac{C_i}{W \times H}$. Calculate the global non-zero pixel ratio of all frame difference images: take the average of the single-frame ratios of the N frame difference images, using the following formula: $r = \frac{1}{N} \sum_{i=1}^{N} r_i$.
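A minimal sketch of the threshold filtering and non-zero pixel ratio described above. The function names and the threshold value used in any call are assumptions for illustration; the source does not specify a value for the threshold T.

```python
def threshold_filter(diff, t):
    """Zero out pixels below t; pixels >= t keep their original value."""
    return [[p if p >= t else 0 for p in row] for row in diff]


def nonzero_ratio(diff, t):
    """Share of pixels that are still non-zero after threshold filtering."""
    filtered = threshold_filter(diff, t)
    total = len(filtered) * len(filtered[0])
    nonzero = sum(1 for row in filtered for p in row if p != 0)
    return nonzero / total


def global_nonzero_ratio(diffs, t):
    """Average of the single-frame ratios over the N images in the window."""
    return sum(nonzero_ratio(d, t) for d in diffs) / len(diffs)
```

The threshold suppresses small differences such as sensor noise, so only pixels with a meaningful change count toward the ratio.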

[0032] Secondly, the mean, variance, and proportion of non-zero pixels of all the frame difference images are weighted and fused to obtain the scene complexity value of the real-time video stream. Since the numerical ranges of the mean, the variance, and the proportion of non-zero pixels differ significantly, direct weighting would let the variance dominate the result and render the weights of the mean and the proportion ineffective. Therefore, this embodiment first normalizes the three statistics to a common scale and then performs weighted fusion to obtain the final scene complexity value. Different weights are assigned to the mean, the variance, and the proportion of non-zero pixels, and these weights can be set according to the specific business scenario.
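One possible way to normalize and fuse the three statistics is sketched below. The source does not specify the normalization or the weights, so everything concrete here is an assumption: 8-bit pixels, division of the mean by 255, division of the variance by its maximum for values in [0, 255] (which is (255/2)^2), and the hypothetical default weights `(0.4, 0.3, 0.3)`.

```python
def scene_complexity(mean, variance, ratio,
                     weights=(0.4, 0.3, 0.3), max_pixel=255.0):
    """Weighted fusion of normalized statistics; result lies in [0, 1]."""
    w_mean, w_var, w_ratio = weights  # hypothetical weights, set per scenario
    # Assumed normalization: scale each statistic by its largest possible value.
    mean_n = min(max(mean / max_pixel, 0.0), 1.0)
    var_n = min(max(variance / ((max_pixel / 2.0) ** 2), 0.0), 1.0)
    ratio_n = min(max(ratio, 0.0), 1.0)
    total = w_mean + w_var + w_ratio
    return (w_mean * mean_n + w_var * var_n + w_ratio * ratio_n) / total
```

Dividing by the weight sum keeps the result in [0, 1] even when the chosen weights do not sum to 1, which makes a later comparison against a fixed complexity threshold easier to reason about.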

[0033] Step 130: Input each video frame in the real-time video stream into the frame admission and early exit module based on inter-frame differences to obtain the decision result.

[0034] As shown in Figures 2 and 3, the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network. The multi-layer lightweight neural network includes multiple progressively deeper feature extraction layers, and each feature extraction layer has an early exit branch.

[0035] Before video frames are input into the frame admission and early exit module based on inter-frame differences, the module is trained. Specifically, training video stream samples are obtained; the training video stream samples include sample video streams labeled with retention confidence values; the training video stream samples are input into a multi-layer lightweight neural network for iterative training to obtain a trained multi-layer lightweight neural network.

[0036] Specifically, when a video frame is input to the current feature extraction layer, the current feature extraction layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the calculation of subsequent feature extraction layers is immediately stopped, and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, the calculation continues to propagate to deeper feature extraction layers. The retention confidence score represents the probability, as estimated by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

[0037] Specifically, each video frame in the real-time video stream passes through the feature extraction layers of this multi-layer lightweight neural network in order. At each layer, the corresponding early exit branch outputs a retention confidence value, which represents the probability, as estimated by that layer, that the video frame should be retained. The system then compares this value with the dynamic threshold. If the retention confidence exceeds the dynamic threshold, the network stops calculating subsequent layers and immediately outputs the decision result of that layer. If the confidence is insufficient, the calculation continues to propagate to deeper layers. With this layer-by-layer dynamic exit mechanism, simple scenarios exit at shallow layers while complex scenarios automatically deepen the inference path, so that computational depth matches scene complexity and the average frame processing time is significantly reduced.
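The layer-by-layer exit logic can be sketched without any neural network library by modelling each layer as a pair of callables. This is an illustration under stated assumptions, not the patented network: `extract` and `exit_head` are hypothetical stand-ins for a feature extraction layer and its early exit branch, and treating a frame whose score never exceeds the threshold as "not retained" after the deepest layer is an assumption, since the source does not describe that case.

```python
def early_exit_decide(frame, layers, dynamic_threshold):
    """Run layers shallow to deep and stop at the first confident exit.

    layers: list of (extract, exit_head) pairs, where extract maps features
    to deeper features and exit_head maps features to a score in [0, 1].
    Returns (retain, confidence, layers_run).
    """
    if not layers:
        raise ValueError("at least one layer is required")
    features = frame
    confidence = 0.0
    for depth, (extract, exit_head) in enumerate(layers, start=1):
        features = extract(features)
        confidence = exit_head(features)
        if confidence > dynamic_threshold:
            # Early exit: deeper layers are never evaluated for this frame.
            return True, confidence, depth
    # Assumption: no exit fired, so the deepest layer's verdict is "drop".
    return False, confidence, len(layers)
```

The third element of the result, `layers_run`, makes the saving visible: a frame that exits at the first layer costs one layer of computation instead of the full depth.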

[0038] The device's computing resource utilization (the current device computing resource information) is monitored in real time and considered together with the scene complexity value to assess device load and scene stability. When device load increases or the scene stabilizes, the threshold is lowered to allow earlier exit; when drastic scene changes are detected, the threshold is raised to increase analysis depth, thereby adaptively matching computing power to the task. Specifically, the dynamic threshold is calculated as follows: obtain the current device computing resource information and the scene complexity value; when the current device computing resources are lower than the first resource threshold or the scene complexity value is lower than the first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to the first weight; when the current device computing resources are greater than or equal to the first resource threshold or the scene complexity value is greater than or equal to the first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to the second weight; wherein the first weight is less than or equal to 1 and the second weight is greater than 1.
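A sketch of the dynamic threshold rule follows, with several labeled assumptions. The source says the base threshold is "adjusted according to" a weight; multiplication is assumed here. The two conditions in the text overlap because both use "or", so this sketch checks the lowering condition first and treats the raising case as the remainder. The resource measure is assumed to be the fraction of computing resources still available, the result is capped at 1.0 because it is compared with a probability, and all default values are hypothetical.

```python
def dynamic_threshold(base, free_resources, complexity,
                      resource_limit=0.3, complexity_limit=0.2,
                      first_weight=0.8, second_weight=1.2):
    """Scale the preset base threshold by one of two weights."""
    if not (first_weight <= 1.0 < second_weight):
        raise ValueError("need first_weight <= 1 < second_weight")
    if free_resources < resource_limit or complexity < complexity_limit:
        # Busy device or stable scene: lower the bar so frames exit earlier.
        weight = first_weight
    else:
        # Enough resources and a changing scene: raise the bar, go deeper.
        weight = second_weight
    # Assumption: multiplicative adjustment, capped at 1.0.
    return min(base * weight, 1.0)
```

With a base threshold of 0.5, a loaded device gets 0.4 and an idle device watching a busy scene gets 0.6, so the same confidence score of 0.45 exits early in the first case and propagates deeper in the second.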

[0039] After obtaining the decision result, the method further includes: when the decision result is to delete the current video frame, the current video frame is deleted from the real-time video stream; when the decision result is to retain the current video frame, the current video frame is retained in the real-time video stream, and the judgment of the next video frame continues.

[0040] The video frames retained in the real-time video stream are input to the corresponding image analysis module for analysis to obtain the analysis results. This image analysis module can be an image recognition module, a target recognition module, etc. This embodiment of the invention does not impose specific limitations.
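The delete-or-retain step and the hand-off to downstream analysis can be sketched as a generator over the stream. The callables `decide` and `analyze` are hypothetical placeholders for the frame admission module and the image analysis module; neither name comes from the source.

```python
def filter_and_analyze(frames, decide, analyze):
    """Yield analysis results for retained frames only.

    decide(frame) -> bool, True means "retain".
    analyze(frame) -> any, the downstream analysis result.
    """
    for frame in frames:
        if decide(frame):
            # Retained frames go to the (expensive) analysis module.
            yield analyze(frame)
        # Deleted frames are skipped; the loop moves to the next frame.
```

Using a generator keeps the sketch suitable for a real-time stream, since each frame is judged as it arrives rather than after the whole stream is buffered.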

[0041] This invention provides a method to obtain frame difference information for a real-time video stream by calculating the absolute difference between pixels in adjacent frames. The scene complexity of the real-time video stream is then calculated based on this frame difference information. Each video frame in the real-time video stream is input into a frame admission and early exit module based on inter-frame differences to obtain a decision result. This module is a multi-layered lightweight neural network, comprising multiple progressively deeper feature extraction layers, each with an early exit branch. This invention achieves adaptive determination of whether to retain the current frame, ensuring task accuracy while significantly reducing computational load.

[0042] Figure 4 shows a schematic diagram of the adaptive video frame filtering device based on scene complexity and early exit mechanism provided in an embodiment of the present invention. As shown in Figure 4, the device 200 includes:

The first calculation module 210, used to calculate the pixel-wise absolute difference between two adjacent video frames in a real-time video stream and obtain the frame difference information of the real-time video stream.

The second calculation module 220, used to calculate the scene complexity value of the real-time video stream based on the frame difference information; the scene complexity value characterizes the intensity of changes in the video scene in the real-time video stream.

The decision module 230, used to input each video frame in the real-time video stream into the frame admission and early exit module based on inter-frame differences to obtain a decision result; wherein the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network, the multi-layer lightweight neural network includes multiple progressively deeper feature extraction layers, and each feature extraction layer is provided with an early exit branch.

Specifically, when a video frame is input to the current feature extraction layer, the current feature extraction layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the calculation of subsequent feature extraction layers is immediately stopped, and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, the calculation continues to propagate to deeper feature extraction layers. The retention confidence score represents the probability, as estimated by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

[0043] In one optional approach, calculating the absolute difference of pixels between two adjacent video frames in the real-time video stream to obtain the frame difference information of the real-time video includes: Read the raw video frames of the real-time video stream frame by frame; The original video frames are grayscaled and downsampled to obtain preprocessed video frames in the real-time video stream. The absolute difference between pixels between two adjacent preprocessed video frames is calculated to obtain the frame difference information.

[0044] In one optional approach, the frame difference information comprises multiple frame difference images; the step of calculating the scene complexity value of the real-time video stream based on the frame difference information includes: Calculate the mean, variance, and proportion of non-zero pixels for all the frame difference images; The scene complexity value of the real-time video stream is obtained by weighted fusion of the mean, variance, and proportion of non-zero pixels of all the frame difference images.

[0045] In an alternative embodiment, the device further includes a training module for: Obtain training video stream samples; the training video stream samples include sample video streams with retained confidence value labels; The training video stream samples are input into a multi-layer lightweight neural network for iterative training to obtain a trained multi-layer lightweight neural network.

[0046] In one alternative approach, the dynamic threshold is determined by: Obtain the current device computing resource information and the scene complexity value; When the current device computing resources are lower than the first resource threshold or the scene complexity value is lower than the first complexity threshold, the dynamic threshold is obtained by adjusting according to the first weight and the preset basic threshold. When the current device computing resources are greater than or equal to the first resource threshold or the scene complexity value is greater than or equal to the first complexity threshold, the dynamic threshold is obtained by adjusting according to the second weight and the preset basic threshold; wherein, the first weight is less than or equal to 1 and the second weight is greater than 1.

[0047] In one alternative approach, the decision result is either to delete or to retain the current video frame; the device further includes an execution module used for: when the decision result is to delete the current video frame, deleting the current video frame from the real-time video stream; and when the decision result is to retain the current video frame, retaining the current video frame in the real-time video stream and continuing with the judgment of the next video frame.
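The behavior of the execution module can be pictured as a streaming filter. The sketch below is only an illustration: `decide` is a hypothetical callable standing in for the frame admission and early exit module, returning True for "retain" and False for "delete".

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")

def filter_stream(frames: Iterable[Frame],
                  decide: Callable[[Frame], bool]) -> Iterator[Frame]:
    """Yield retained frames in order; deleted frames are simply skipped."""
    for frame in frames:
        if decide(frame):
            yield frame  # retained, then move on to the next frame
        # a "delete" decision drops the frame and processing continues
```

Writing the filter as a generator keeps memory use constant, which suits a real-time stream of unknown length.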

[0048] The specific working process of the device in this embodiment is largely the same as the method steps in the foregoing embodiments, and will not be repeated here.

[0049] This invention provides a method to obtain frame difference information for a real-time video stream by calculating the absolute difference between pixels in adjacent frames. The scene complexity of the real-time video stream is then calculated based on this frame difference information. Each video frame in the real-time video stream is input into a frame admission and early exit module based on inter-frame differences to obtain a decision result. This module is a multi-layered lightweight neural network, comprising multiple progressively deeper feature extraction layers, each with an early exit branch. This invention achieves adaptive determination of whether to retain the current frame, ensuring task accuracy while significantly reducing computational load.

[0050] Figure 5 shows a structural schematic of a computer device provided in an embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computer device.

[0051] As shown in Figure 5, the computer device may include: a processor 402, a communication interface 404, a memory 406, and a communication bus 408.

[0052] The processor 402, communication interface 404, and memory 406 communicate with each other via communication bus 408. Communication interface 404 is used to communicate with other network elements such as clients or other servers. The processor 402 executes program 410, specifically performing the relevant steps described above in the embodiment of the adaptive video frame filtering method based on scene complexity and early exit mechanism.

[0053] Specifically, program 410 may include program code, which includes computer-executable instructions.

[0054] Processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computer device includes one or more processors, which may be processors of the same type, such as one or more CPUs; or processors of different types, such as one or more CPUs and one or more ASICs.

[0055] Memory 406 is used to store program 410. Memory 406 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.

[0056] Specifically, program 410 can be called by processor 402 to cause the computer device to perform the following operations: calculating the absolute pixel difference between two adjacent video frames in a real-time video stream to obtain frame difference information of the real-time video stream; calculating a scene complexity value of the real-time video stream based on the frame difference information, the scene complexity value characterizing the intensity of scene changes in the real-time video stream; and inputting each video frame in the real-time video stream into a frame admission and early exit module based on inter-frame differences to obtain a decision result. The frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network that includes multiple progressively deeper feature extraction layers, each provided with an early exit branch. When a video frame is input to the current feature extraction layer, that layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the computation of subsequent feature extraction layers is stopped immediately and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, computation continues to the deeper feature extraction layers. The retention confidence score represents the probability, as determined by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.
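The early exit control flow can be sketched independently of any particular network. In the sketch below each feature extraction layer with its exit branch is modeled as a hypothetical callable returning deeper features and a retention confidence score. Two behaviors are assumptions not stated in this document: an exit is treated as a "retain" decision, and a frame whose score never exceeds the threshold is treated as "delete".

```python
from typing import Callable, Sequence, Tuple

# One stage = a feature extraction layer plus its early exit branch (modeled abstractly).
Stage = Callable[[Sequence[float]], Tuple[Sequence[float], float]]

def early_exit_decision(features: Sequence[float], stages: Sequence[Stage],
                        threshold: float) -> Tuple[bool, int, float]:
    """Return (retain, index of the last stage evaluated, its confidence)."""
    if not stages:
        raise ValueError("at least one stage is required")
    confidence = 0.0
    for index, stage in enumerate(stages):
        features, confidence = stage(features)
        if confidence > threshold:
            # Confident enough: skip all deeper stages.
            return True, index, confidence
    # No branch exited (assumed fallback: delete the frame).
    return False, len(stages) - 1, confidence
```

Because the loop returns as soon as one score exceeds the threshold, deeper stages are never evaluated for that frame, which is where the computational saving comes from.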

[0057] In one optional approach, calculating the absolute pixel difference between two adjacent video frames in the real-time video stream to obtain the frame difference information includes: reading the raw video frames of the real-time video stream frame by frame; converting the raw video frames to grayscale and downsampling them to obtain preprocessed video frames; and calculating the absolute pixel difference between two adjacent preprocessed video frames to obtain the frame difference information.

[0058] In one optional approach, the frame difference information comprises multiple frame difference images, and calculating the scene complexity value of the real-time video stream based on the frame difference information includes: calculating the mean, the variance, and the proportion of non-zero pixels of all the frame difference images; and obtaining the scene complexity value of the real-time video stream by weighted fusion of the mean, the variance, and the proportion of non-zero pixels.

[0059] In an alternative approach, before each video frame is input into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: obtaining training video stream samples, the training video stream samples including sample video streams labeled with retention confidence values; and inputting the training video stream samples into the multi-layer lightweight neural network for iterative training to obtain a trained multi-layer lightweight neural network.

[0060] In one alternative approach, the dynamic threshold is determined as follows: the current device computing resource information and the scene complexity value are obtained; when the current device computing resources are lower than a first resource threshold or the scene complexity value is lower than a first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to a first weight; when the current device computing resources are greater than or equal to the first resource threshold or the scene complexity value is greater than or equal to the first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to a second weight; wherein the first weight is less than or equal to 1 and the second weight is greater than 1.

[0061] In one alternative approach, the decision result is either to delete or to retain the current video frame; after each video frame in the real-time video stream is input into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: when the decision result is to delete the current video frame, deleting the current video frame from the real-time video stream; and when the decision result is to retain the current video frame, retaining the current video frame in the real-time video stream and continuing with the judgment of the next video frame.

[0062] This invention provides a method to obtain frame difference information for a real-time video stream by calculating the absolute difference between pixels in adjacent frames. The scene complexity of the real-time video stream is then calculated based on this frame difference information. Each video frame in the real-time video stream is input into a frame admission and early exit module based on inter-frame differences to obtain a decision result. This module is a multi-layered lightweight neural network, comprising multiple progressively deeper feature extraction layers, each with an early exit branch. This invention achieves adaptive determination of whether to retain the current frame, ensuring task accuracy while significantly reducing computational load.

[0063] This invention provides a computer-readable storage medium storing at least one executable instruction. When the executable instruction is executed on a computer device, the computer device performs the adaptive video frame filtering method based on scene complexity and early exit mechanism in any of the above method embodiments.

[0064] The executable instructions can be used to cause the computer device to perform the following operations: calculating the absolute pixel difference between two adjacent video frames in a real-time video stream to obtain frame difference information of the real-time video stream; calculating a scene complexity value of the real-time video stream based on the frame difference information, the scene complexity value characterizing the intensity of scene changes in the real-time video stream; and inputting each video frame in the real-time video stream into a frame admission and early exit module based on inter-frame differences to obtain a decision result. The frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network that includes multiple progressively deeper feature extraction layers, each provided with an early exit branch. When a video frame is input to the current feature extraction layer, that layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the computation of subsequent feature extraction layers is stopped immediately and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, computation continues to the deeper feature extraction layers. The retention confidence score represents the probability, as determined by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

[0065] In one optional approach, calculating the absolute pixel difference between two adjacent video frames in the real-time video stream to obtain the frame difference information includes: reading the raw video frames of the real-time video stream frame by frame; converting the raw video frames to grayscale and downsampling them to obtain preprocessed video frames; and calculating the absolute pixel difference between two adjacent preprocessed video frames to obtain the frame difference information.

[0066] In one optional approach, the frame difference information comprises multiple frame difference images, and calculating the scene complexity value of the real-time video stream based on the frame difference information includes: calculating the mean, the variance, and the proportion of non-zero pixels of all the frame difference images; and obtaining the scene complexity value of the real-time video stream by weighted fusion of the mean, the variance, and the proportion of non-zero pixels.

[0067] In an alternative approach, before each video frame is input into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: obtaining training video stream samples, the training video stream samples including sample video streams labeled with retention confidence values; and inputting the training video stream samples into the multi-layer lightweight neural network for iterative training to obtain a trained multi-layer lightweight neural network.

[0068] In one alternative approach, the dynamic threshold is determined as follows: the current device computing resource information and the scene complexity value are obtained; when the current device computing resources are lower than a first resource threshold or the scene complexity value is lower than a first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to a first weight; when the current device computing resources are greater than or equal to the first resource threshold or the scene complexity value is greater than or equal to the first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to a second weight; wherein the first weight is less than or equal to 1 and the second weight is greater than 1.

[0069] In one alternative approach, the decision result is either to delete or to retain the current video frame; after each video frame in the real-time video stream is input into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: when the decision result is to delete the current video frame, deleting the current video frame from the real-time video stream; and when the decision result is to retain the current video frame, retaining the current video frame in the real-time video stream and continuing with the judgment of the next video frame.

[0070] This invention provides a method to obtain frame difference information for a real-time video stream by calculating the absolute difference between pixels in adjacent frames. The scene complexity of the real-time video stream is then calculated based on this frame difference information. Each video frame in the real-time video stream is input into a frame admission and early exit module based on inter-frame differences to obtain a decision result. This module is a multi-layered lightweight neural network, comprising multiple progressively deeper feature extraction layers, each with an early exit branch. This invention achieves adaptive determination of whether to retain the current frame, ensuring task accuracy while significantly reducing computational load.

[0071] This invention provides an adaptive video frame filtering device based on scene complexity and an early exit mechanism, used to execute the aforementioned adaptive video frame filtering method based on scene complexity and an early exit mechanism.

[0072] This invention provides a computer program that can be called by a processor to enable a computer device to execute the adaptive video frame filtering method based on scene complexity and early exit mechanism in any of the above method embodiments.

[0073] This invention provides a computer program product, which includes a computer program stored on a computer-readable storage medium. The computer program includes program instructions, which, when executed on a computer, cause the computer to perform the adaptive video frame filtering method based on scene complexity and early exit mechanism in any of the above method embodiments.

[0074] The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used in conjunction with the teachings herein. The required structure for constructing such systems is apparent from the above description. Furthermore, the embodiments of the present invention are not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented using various programming languages, and the above description of specific languages is for the purpose of disclosing the best mode of implementation of the invention.

[0075] Numerous specific details are set forth in the specification provided herein. However, it will be understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.

[0076] Similarly, it should be understood that, in order to streamline the invention and aid in understanding one or more of the various aspects of the invention, features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this disclosure should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.

[0077] Those skilled in the art will understand that modules in the device of the embodiments can be adaptively changed and placed in one or more devices different from that embodiment. Modules, units, or components in the embodiments can be combined into a single module, unit, or component, and can be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and / or processes or units are mutually exclusive, any combination can be used to combine all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature that serves the same, equivalent, or similar purpose.

[0078] It should be noted that the above embodiments are illustrative of the invention and not restrictive, and that those skilled in the art can devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses should not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in the claims. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by the same item of hardware. The use of the words first, second, and third, etc., does not indicate any order. These words can be interpreted as names. The steps in the above embodiments, unless otherwise specified, should not be construed as limiting the order of execution.

Claims

1. An adaptive video frame filtering method based on scene complexity and an early exit mechanism, characterized in that the method includes: calculating the absolute pixel difference between two adjacent video frames in a real-time video stream to obtain frame difference information of the real-time video stream; calculating a scene complexity value of the real-time video stream based on the frame difference information, the scene complexity value characterizing the intensity of scene changes in the real-time video stream; and inputting each video frame in the real-time video stream into a frame admission and early exit module based on inter-frame differences to obtain a decision result; wherein the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network, the multi-layer lightweight neural network includes multiple progressively deeper feature extraction layers, and each feature extraction layer is provided with an early exit branch. When a video frame is input to the current feature extraction layer, the current feature extraction layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the computation of subsequent feature extraction layers is stopped immediately and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, computation continues to the deeper feature extraction layers. The retention confidence score represents the probability, as determined by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

2. The method according to claim 1, characterized in that calculating the absolute pixel difference between two adjacent video frames in the real-time video stream to obtain the frame difference information includes: reading the raw video frames of the real-time video stream frame by frame; converting the raw video frames to grayscale and downsampling them to obtain preprocessed video frames; and calculating the absolute pixel difference between two adjacent preprocessed video frames to obtain the frame difference information.

3. The method according to claim 1, characterized in that the frame difference information comprises multiple frame difference images, and calculating the scene complexity value of the real-time video stream based on the frame difference information includes: calculating the mean, the variance, and the proportion of non-zero pixels of all the frame difference images; and obtaining the scene complexity value of the real-time video stream by weighted fusion of the mean, the variance, and the proportion of non-zero pixels.

4. The method according to claim 1, characterized in that, before each video frame in the real-time video stream is input into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: obtaining training video stream samples, the training video stream samples including sample video streams labeled with retention confidence values; and inputting the training video stream samples into the multi-layer lightweight neural network for iterative training to obtain a trained multi-layer lightweight neural network.

5. The method according to any one of claims 1-4, characterized in that the dynamic threshold is determined as follows: the current device computing resource information and the scene complexity value are obtained; when the current device computing resources are lower than a first resource threshold or the scene complexity value is lower than a first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to a first weight; when the current device computing resources are greater than or equal to the first resource threshold or the scene complexity value is greater than or equal to the first complexity threshold, the dynamic threshold is obtained by adjusting the preset base threshold according to a second weight; wherein the first weight is less than or equal to 1 and the second weight is greater than 1.

6. The method according to claim 5, characterized in that the decision result is either to delete or to retain the current video frame; after each video frame in the real-time video stream is input into the frame admission and early exit module based on inter-frame differences to obtain the decision result, the method further includes: when the decision result is to delete the current video frame, deleting the current video frame from the real-time video stream; and when the decision result is to retain the current video frame, retaining the current video frame in the real-time video stream and continuing with the judgment of the next video frame.

7. An adaptive video frame filtering device based on scene complexity and an early exit mechanism, characterized in that the device includes: a first calculation module, used to calculate the absolute pixel difference between two adjacent video frames in a real-time video stream to obtain frame difference information of the real-time video stream; a second calculation module, used to calculate a scene complexity value of the real-time video stream based on the frame difference information, the scene complexity value characterizing the intensity of scene changes in the real-time video stream; and a decision module, used to input each video frame in the real-time video stream into a frame admission and early exit module based on inter-frame differences to obtain a decision result; wherein the frame admission and early exit module based on inter-frame differences is a multi-layer lightweight neural network, the multi-layer lightweight neural network includes multiple progressively deeper feature extraction layers, and each feature extraction layer is provided with an early exit branch. When a video frame is input to the current feature extraction layer, the current feature extraction layer outputs a retention confidence score. When the retention confidence score is greater than a dynamic threshold, the computation of subsequent feature extraction layers is stopped immediately and the decision result of the current feature extraction layer is output. When the retention confidence score is less than or equal to the dynamic threshold, computation continues to the deeper feature extraction layers. The retention confidence score represents the probability, as determined by the current feature extraction layer, that the current video frame needs to be retained. The dynamic threshold is obtained by adjusting a preset base threshold based on the current device computing resource information and the scene complexity value.

8. The device according to claim 7, characterized in that the frame difference information comprises multiple frame difference images, and calculating the scene complexity value of the real-time video stream based on the frame difference information includes: calculating the mean, the variance, and the proportion of non-zero pixels of all the frame difference images; and obtaining the scene complexity value of the real-time video stream by weighted fusion of the mean, the variance, and the proportion of non-zero pixels.

9. A computer device, characterized in that it includes: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other via the communication bus; the memory is used to store at least one executable instruction that causes the processor to perform the operations of the adaptive video frame filtering method based on scene complexity and an early exit mechanism as described in any one of claims 1-6.

10. A computer-readable storage medium, characterized in that the storage medium stores at least one executable instruction which, when executed on a computer device, causes the computer device to perform the operations of the adaptive video frame filtering method based on scene complexity and an early exit mechanism as described in any one of claims 1-6.