A motion estimation-based video recognition acceleration method
By employing a motion estimation-based approach, utilizing a Bayer domain-based model and a perceptual residual correction network to directly process Bayer data, and combining this with a GPU parallel architecture for video recognition, the video recognition process is accelerated. This solves the problems of computational redundancy and inaccurate artifact correction in existing technologies, achieving faster and more efficient video recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
- Filing Date
- 2026-01-29
- Publication Date
- 2026-06-26
AI Technical Summary
Existing video recognition acceleration methods suffer from computational redundancy and low efficiency when processing raw Bayer-format signals. In particular, converting sensor-captured Bayer data into RGB data adds redundant computation at the front of the pipeline, and existing motion estimation tools are slow and cannot be applied directly to Bayer data. Furthermore, traditional artifact correction methods suffer from redundancy in the correction area and low accuracy.
A motion estimation-based approach is adopted. By identifying keyframes and non-keyframes, perceptual features are extracted using a Bayer domain basic model. Video recognition is performed by combining a fast motion estimation module and a perceptual residual correction network. Multi-level matching search is performed using a pyramid block structure and a GPU parallel architecture. Numerical correction is performed through the perceptual residual correction network. Bayer data is directly processed to reduce computational complexity.
It achieves faster and more efficient video recognition, reduces computational redundancy, improves the accuracy and efficiency of motion estimation, overcomes the computational bottleneck and inaccurate artifact correction problems in traditional methods, and achieves accurate and efficient artifact correction results.
Figure CN122290004A_ABST (abstract figure; image not reproduced)
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to a method for accelerating video recognition based on motion estimation.
Background Technology
[0002] Currently, video recognition acceleration technologies primarily reduce computation by exploiting temporal redundancy between video frames, mainly falling into two categories: feature reuse-based methods and sparse computation-based methods. For feature reuse-based methods, in deep convolutional neural networks, shallow features typically encode low-level visual information such as edges and textures, while deep features encode high-level information such as object categories and semantic attributes. Research shows that compared to the drastic fluctuations caused by lighting changes or minor jitter in the original RGB pixel space, deep semantic features exhibit stronger stability over time, the so-called "slow features." Based on this insight, researchers have proposed using low-cost motion estimation to propagate high-cost deep features, thereby avoiding full-network forward computation for each frame. This mainly includes deep feature optical flow techniques, the TapLab method, and edge and residual map-guided methods. For sparse computation methods, in the human visual system, most neurons in the retina only respond when objects change, rather than continuously outputting static signals. Based on this principle, researchers have proposed a series of sparse computation methods, mainly including split-up update methods and motion-compensated split-up update methods.
[0003] While existing video recognition acceleration methods have made progress in acceleration and model performance, they still suffer from the following problems: In traditional methods, sensors capture raw signals in Bayer format, and image signal processing units convert the Bayer data into human-readable RGB data. For video recognition tasks, this results in computational redundancy at the front end of the computation pipeline. Current motion estimation schemes are mainly based on optical flow models and video coding tools. Optical flow models, based on deep learning methods, require a large amount of computation, limiting computational efficiency. Video coding tools are primarily designed for video compression, involving a large amount of complex computation, and their CPU-based architecture, which processes data serially, results in slow computation speed and cannot be directly applied to Bayer data. Traditional artifact correction methods use residual maps to locate the regions that need correction and employ a lightweight method for correction. However, these methods suffer from redundancy in the correction regions and low accuracy. First, residual maps cannot accurately locate semantically incorrect regions; second, model performance degrades in low-resolution local areas.
[0004] Therefore, existing technologies have significant limitations and cannot adequately reduce computational redundancy or improve the speed of video recognition models.
Summary of the Invention
[0005] In view of this, the present invention provides a video recognition acceleration method based on motion estimation to solve the above problems.
[0006] This invention provides a motion estimation-based method for accelerating video recognition, comprising: determining keyframes and non-keyframes in a video sequence; extracting features from the keyframes using a Bayer domain basic model to obtain perceptual features; calculating motion vectors between the non-keyframes and a reference frame using a fast motion estimation module, wherein the reference frame is the frame preceding the non-keyframe, wherein the fast motion estimation module adopts a pyramid block structure and performs multi-level matching search from coarse to fine under a GPU parallel architecture; performing deformation operations on the features of the reference frame using the motion vectors to obtain propagation features; predicting the perceptual residual of the current frame using a perceptual residual correction network, using the residual to numerically correct the propagation features, and outputting the corrected propagation features, wherein the perceptual residual correction network is a lightweight network structure; and performing a video recognition task based on the corrected propagation features.
[0007] In another implementation of the present invention, the fast motion estimation module adopts a pyramid block structure containing P layers; a coarse search is performed at the top layer using large blocks, and the matching is constrained by larger contextual information to ensure global consistency; the search is refined layer by layer, dividing the large blocks into smaller blocks, and a fine search is performed based on the search results of the previous layer to capture subtle motions.
[0008] In another implementation of the present invention, the propagation feature is represented as:
$$\tilde{Z}_T = \mathrm{Warp}(Z_R, MV)$$
[0009] where $\tilde{Z}_T$ is the propagation feature, $Z_R$ denotes the features of the reference frame, and $MV$ is the motion vector.
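[0010] In another implementation of the present invention, the formula for the numerical correction is:
$$\hat{Z}_T = \tilde{Z}_T + Res$$
[0011] where $\tilde{Z}_T$ is the propagation feature, $\hat{Z}_T$ is the corrected propagation feature, and $Res$ is the perceptual residual of the current frame.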
[0012] In another implementation of the present invention, the depth of the perceptual residual correction network and its number of feature channels are jointly controlled by two hyperparameters, where:
[0013] the first is the base depth hyperparameter and the second is the scaling factor.
[0014] In another implementation of the present invention, the loss function of the perceptual residual correction network is expressed as:
[0015] Combining the characteristics of L1 and L2 norms, the L1 norm is used when the perceived residual is less than a preset threshold, and the L2 norm is used when the perceived residual is greater than the preset threshold.
[0016] In another implementation of the present invention, the method further includes: introducing a weighting strategy based on region size into the perceptual residual correction network.
[0017] Here, $|\cdot|$ denotes the size of a region.
[0018] In another aspect, the present invention provides a video recognition acceleration system based on motion estimation, comprising: a video sequence processing module for determining keyframes and non-keyframes in a video sequence; a feature extraction module for extracting features from the keyframes using a Bayer domain basic model to obtain perceptual features; a fast motion estimation module for calculating motion vectors between the non-keyframes and a reference frame, wherein the reference frame is the frame preceding the non-keyframe, wherein the fast motion estimation module adopts a pyramid block structure and performs multi-level matching search from coarse to fine under a GPU parallel architecture; a feature correction module for deforming the features of the reference frame using the motion vectors to obtain propagation features; predicting the perceptual residual of the current frame using a perceptual residual correction network, numerically correcting the propagation features using the residual, and outputting the corrected propagation features, wherein the perceptual residual correction network is a lightweight network structure; and a video recognition module for performing video recognition tasks based on the corrected propagation features.
[0019] In another aspect, the present invention provides an electronic device comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of a motion estimation-based video recognition acceleration method as described in any of the preceding claims. In another aspect, the present invention provides a computer storage medium storing a computer program that, when executed by a processor, implements the steps of a motion estimation-based video recognition acceleration method as described in any of the preceding claims.
[0020] This invention presents a motion estimation-based video recognition acceleration method that utilizes a video recognition pipeline based on raw Bayer data, abandoning the traditional image signal processing unit. It leverages a model trained on Bayer data to directly process the Bayer data captured by the sensor, achieving faster and more efficient video recognition. Not only does it consider the simplicity of the motion estimation algorithm, but it also incorporates modern GPUs and employs a parallel computing design. By reducing computational complexity and utilizing a parallel architecture, it provides faster and more accurate motion estimation, overcoming the computational bottleneck of motion estimation. Furthermore, it uses a perceptual residual-based approach for artifact correction. Perceptual residuals extend the traditional video coding residual map scheme to the perceptual level. Through model learning of sparse residual correction, artifacts are numerically corrected at the perceptual level, avoiding the inaccurate guidance of traditional residual maps and local resolution issues, achieving accurate and efficient artifact correction. Through fast and efficient motion estimation and accurate and concise artifact correction, it overcomes the shortcomings of existing technologies.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. By reading the detailed description of the embodiments below, the advantages and benefits of the solutions will become clear to those skilled in the art. The accompanying drawings are only for illustrating preferred embodiments and are not intended to limit the present invention. In the accompanying drawings: Figure 1 is a schematic diagram of a video recognition acceleration method based on motion estimation according to an embodiment of the present invention.
[0022] Figure 2 is an architecture diagram of a video recognition acceleration system based on motion estimation according to an embodiment of the present invention.
[0023] Figure 3 is a schematic diagram comparing the prediction performance of the algorithm of the present invention with that of a conventional algorithm, according to an embodiment of the present invention.
Detailed Implementation
[0024] To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and thoroughly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art should fall within the protection scope of the present invention.
[0025] Figure 1 is a schematic diagram of a video recognition acceleration method based on motion estimation provided in an embodiment of the present invention. As shown in Figure 1, this embodiment mainly includes: S101. Determine the keyframes and non-keyframes in the video sequence.
[0026] S102. Use the Bayer domain basic model to extract features from the keyframes to obtain perceptual features.
[0027] S103. The motion vector between the non-key frame and the reference frame is calculated by the fast motion estimation module. The reference frame is the frame preceding the non-key frame. The fast motion estimation module adopts a pyramid block structure and performs multi-level matching search from coarse to fine under the GPU parallel architecture.
[0028] S104. The features of the reference frame are deformed using the motion vector to obtain the propagation features.
[0029] S105. Predict the perceptual residual of the current frame through the perceptual residual correction network, use the residual to numerically correct the propagation feature, and output the corrected propagation feature, wherein the perceptual residual correction network is a lightweight network structure.
[0030] S106. Perform video recognition task based on the corrected propagation features.
[0031] For example, for a video sequence, the inference process is as follows: Keyframe processing: for each keyframe, the Bayer domain base model (with parameters $\theta$) extracts the perceptual features of the keyframe, which are decoded to obtain high-quality prediction results.
[0032] Motion estimation: for the $i$-th non-keyframe, the fast motion estimation module calculates the motion vector $MV$ between it and the reference frame.
[0033] Feature transformation: warping the features of the reference frame using $MV$ yields a coarse propagation feature $\tilde{Z}_T$.
[0034] Residual correction: the perceptual residual correction network predicts the feature-level residual $Res$, which is used to numerically correct $\tilde{Z}_T$ and obtain the final features $\hat{Z}_T$.
[0035] The keyframe is processed first, followed by the first frame after the keyframe, the second frame, and so on. Each processing step references the result of the previous frame. Only the keyframes are processed using the base model.
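The inference order described above can be sketched as a simple loop. This is only an illustrative outline, not the patented implementation: the fixed keyframe interval and the function names (`run_pipeline`, `toy_base_model`, and so on) are assumptions introduced here, and the toy callables stand in for the Bayer domain base model, the fast motion estimation module, and the perceptual residual correction network.

```python
def run_pipeline(frames, keyframe_interval, base_model,
                 estimate_motion, warp, predict_residual):
    """Propagate features frame by frame; only keyframes run the base model.

    Assumption: keyframes are chosen at a fixed interval. The source does
    not say how keyframes are selected.
    """
    features = []
    prev_frame, prev_feat = None, None
    for idx, frame in enumerate(frames):
        if idx % keyframe_interval == 0:
            # Keyframe: full (expensive) feature extraction.
            feat = base_model(frame)
        else:
            # Non-keyframe: the reference is the immediately preceding frame.
            mv = estimate_motion(frame, prev_frame)
            coarse = warp(prev_feat, mv)            # propagation feature
            res = predict_residual(frame, coarse)   # perceptual residual
            feat = [c + r for c, r in zip(coarse, res)]  # numerical correction
        features.append(feat)
        prev_frame, prev_feat = frame, feat
    return features


# Toy stand-ins (hypothetical) so the control flow can be exercised.
calls = {"base": 0}

def toy_base_model(frame):
    calls["base"] += 1
    return list(frame)

def toy_motion(current, reference):
    return 0  # pretend there is no motion

def toy_warp(feat, mv):
    return list(feat)

def toy_residual(frame, coarse):
    # An ideal residual predictor: exactly the remaining difference.
    return [f - c for f, c in zip(frame, coarse)]

frames = [[float(i), float(i) + 1.0] for i in range(6)]
out = run_pipeline(frames, 3, toy_base_model, toy_motion, toy_warp, toy_residual)
```

With six frames and an interval of three, the base model runs only on frames 0 and 3; every other frame reuses the previous frame's features.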
[0036] To support learning in the Bayer domain, an invertible ISP model is used to inversely transform a large-scale RGB dataset, synthesizing a Bayer format dataset. Simultaneously, the input layer of standard computer vision models (such as PSPNet, SegFormer, DETR, etc.) is modified to accept single-channel Bayer pattern input, and the models are retrained on the transformed data.
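As a rough illustration of what single-channel Bayer pattern input looks like, the sketch below performs only the mosaicking step on a tiny RGB image. It assumes an RGGB layout and is not the invertible ISP model mentioned above, which would also have to undo steps such as white balance and tone mapping; the function name `rgb_to_bayer_rggb` is hypothetical.

```python
def rgb_to_bayer_rggb(rgb):
    """Keep one color sample per pixel following an RGGB layout (assumed).

    rgb: list of rows, each row a list of (r, g, b) tuples.
    Returns a single-channel image of the same height and width.
    """
    bayer = []
    for y, row in enumerate(rgb):
        out_row = []
        for x, (r, g, b) in enumerate(row):
            if y % 2 == 0:
                # Even rows alternate R, G, R, G, ...
                out_row.append(r if x % 2 == 0 else g)
            else:
                # Odd rows alternate G, B, G, B, ...
                out_row.append(g if x % 2 == 0 else b)
        bayer.append(out_row)
    return bayer


rgb = [[(10, 20, 30), (11, 21, 31)],
       [(12, 22, 32), (13, 23, 33)]]
mosaic = rgb_to_bayer_rggb(rgb)
```

The output has one channel instead of three, which is why the input layer of a standard model has to be modified before retraining.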
[0037] This invention presents a motion estimation-based video recognition acceleration method that utilizes a video recognition pipeline based on raw Bayer data, abandoning the traditional image signal processing unit. It leverages a model trained on Bayer data to directly process the Bayer data captured by the sensor, achieving faster and more efficient video recognition. Not only does it consider the simplicity of the motion estimation algorithm, but it also incorporates modern GPUs and employs a parallel computing design. By reducing computational complexity and utilizing a parallel architecture, it provides faster and more accurate motion estimation, overcoming the computational bottleneck of motion estimation. Furthermore, it uses a perceptual residual-based approach for artifact correction. Perceptual residuals extend the traditional video coding residual map scheme to the perceptual level. Through model learning of sparse residual correction, artifacts are numerically corrected at the perceptual level, avoiding the inaccurate guidance of traditional residual maps and local resolution issues, achieving accurate and efficient artifact correction. Through fast and efficient motion estimation and accurate and concise artifact correction, it overcomes the shortcomings of existing technologies.
[0038] In another implementation of the present invention, the fast motion estimation module adopts a pyramid block structure containing P layers; a coarse search is performed at the top layer using large blocks, and the matching is constrained by larger contextual information to ensure global consistency; the search is refined layer by layer, dividing the large blocks into smaller blocks, and a fine search is performed based on the search results of the previous layer to capture subtle motions.
[0039] For example, to address the inconsistency between pixel-level matching and perceptual semantics caused by local texture blurring in the Bayer domain, a pyramid block structure is introduced. This structure contains P layers, and the block size is defined as:
[0040] At the top level, a coarse search is performed using large blocks (e.g., 64x64), leveraging significant contextual information to constrain the matching and ensure global consistency. This is then refined layer by layer, dividing the large blocks into smaller blocks (e.g., 32x32, 16x16) and performing a finer search based on the results from the previous layer to capture subtle movements. This hierarchical design utilizes contextual constraints while maintaining local matching accuracy.
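The layer-by-layer refinement can be illustrated with a small helper that lists the block size per layer and splits a parent block into its children. Halving the block size at each layer is an assumption drawn from the example sizes (64, 32, 16); the function names are hypothetical.

```python
def pyramid_block_sizes(top_block, num_layers):
    """Block size per pyramid layer, coarse to fine (halving is assumed)."""
    sizes = []
    size = top_block
    for _ in range(num_layers):
        if size < 1:
            raise ValueError("too many layers for this top block size")
        sizes.append(size)
        size //= 2
    return sizes


def split_block(y, x, size):
    """Split a square block at (y, x) into four half-size child blocks."""
    half = size // 2
    return [(y, x, half), (y, x + half, half),
            (y + half, x, half), (y + half, x + half, half)]


sizes = pyramid_block_sizes(64, 3)
children = split_block(0, 64, 64)
```

Each child block would start its finer search from the motion vector found for its parent, which is how the coarse layer's context constrains the finer layers.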
[0041] The parallel coarse-to-fine matching algorithm employs a three-level hierarchical search strategy with coarse, medium, and fine levels. At level $k$, the search range is $R^{(k)}$ and the search step size is $\Delta^{(k)}$, satisfying $R^{(1)} > R^{(2)} > \dots > R^{(k)}$ and $\Delta^{(1)} > \Delta^{(2)} > \dots > \Delta^{(k)}$.
[0042] To maximize throughput, a customized CUDA parallel architecture was designed. Each image patch in the target frame is assigned to an independent CUDA block. The CUDA block loads the target block into shared memory to reduce global memory access and maximize data reuse. Multiple threads within the CUDA block compute the similarity of all candidate locations within the search range in parallel, using the sum of absolute differences (SAD) as the evaluation metric. The candidate block with the smallest SAD is selected, and its offset is recorded as the motion vector $MV$.
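A sequential sketch of SAD-based coarse-to-fine matching for a single block is given below. It mirrors the logic only: in the described design each block maps to a CUDA block and the candidate loop runs across threads, whereas this version is plain Python. The three-level schedule `[(4, 4), (2, 2), (1, 1)]`, the synthetic ramp image, and all function names are assumptions chosen so that the result is easy to check by hand.

```python
def sad(target, ref, ty, tx, ry, rx, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(target[ty + dy][tx + dx] - ref[ry + dy][rx + dx])
    return total


def search_level(target, ref, ty, tx, size, center, radius, step):
    """Test every offset on a grid around `center`; return (best_mv, best_cost)."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for oy in range(center[0] - radius, center[0] + radius + 1, step):
        for ox in range(center[1] - radius, center[1] + radius + 1, step):
            ry, rx = ty + oy, tx + ox
            if ry < 0 or rx < 0 or ry + size > h or rx + size > w:
                continue  # candidate block falls outside the reference frame
            cost = sad(target, ref, ty, tx, ry, rx, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (oy, ox), cost
    return best_mv, best_cost


def coarse_to_fine(target, ref, ty, tx, size, schedule):
    """Refine the motion vector level by level; schedule is [(radius, step), ...]."""
    mv, cost = (0, 0), None
    for radius, step in schedule:
        cand, cand_cost = search_level(target, ref, ty, tx, size, mv, radius, step)
        if cand is not None:
            mv, cost = cand, cand_cost
    return mv, cost


# Synthetic data: the target block equals the reference shifted by (4, -3).
H = W = 24
ref = [[100 * y + x for x in range(W)] for y in range(H)]
target = [[100 * (y + 4) + (x - 3) for x in range(W)] for y in range(H)]
mv, cost = coarse_to_fine(target, ref, 8, 8, 4, [(4, 4), (2, 2), (1, 1)])
```

The three levels test 27 candidates in total instead of the 225 an exhaustive search over the same reach of ±7 would need. A greedy hierarchical search like this one can miss the global minimum on some textures, which is one reason the coarse levels use large blocks with more context.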
[0043] The algorithm complexity of this method is... By employing a hierarchical design and GPU parallelism, the latency of each block is significantly reduced. Experiments show that this method processes 1080P video in milliseconds on an A100 GPU, which is 120 times faster than traditional diamond search.
[0044] Through the pyramid block structure and GPU parallel architecture, this invention addresses two problems: existing video codecs rely on complex decision trees and serial CPU processing for motion estimation, and deep optical flow models are computationally expensive and applicable only to the RGB domain.
[0045] In another implementation of the present invention, the propagation feature is represented as:
$$\tilde{Z}_T = \mathrm{Warp}(Z_R, MV)$$
[0046] where $\tilde{Z}_T$ is the propagation feature, $Z_R$ denotes the features of the reference frame, and $MV$ is the motion vector.
[0047] In another implementation of the present invention, the formula for the numerical correction is:
$$\hat{Z}_T = \tilde{Z}_T + Res$$
[0048] where $\hat{Z}_T$ is the corrected propagation feature and $Res$ is the perceptual residual of the current frame.
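A minimal sketch of the two operations, deforming the reference features with motion vectors and then adding the residual, is shown below. Nearest-neighbor sampling, border clamping, one motion vector per feature block, and the convention that the vector points from the current position to the reference position are all assumptions made for illustration; the function names are hypothetical.

```python
def warp_features(ref_feat, mv_field, block):
    """Build the propagation feature by sampling the reference features.

    ref_feat: 2-D list of feature values (one channel, for brevity).
    mv_field: 2-D list of (dy, dx) offsets, one per block x block tile.
    """
    h, w = len(ref_feat), len(ref_feat[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = mv_field[y // block][x // block]
            sy = min(max(y + dy, 0), h - 1)  # clamp to the feature map
            sx = min(max(x + dx, 0), w - 1)
            row.append(ref_feat[sy][sx])
        out.append(row)
    return out


def correct(propagated, residual):
    """Numerical correction: add the perceptual residual element-wise."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(propagated, residual)]


ref_feat = [[1, 2, 3, 4],
            [5, 6, 7, 8]]
mv_field = [[(0, 1), (0, 0)]]          # left tile moves, right tile is static
warped = warp_features(ref_feat, mv_field, block=2)
residual = [[0, 0, 0, 0],
            [0, 0, 1, 0]]              # sparse: a single non-zero entry
corrected = correct(warped, residual)
```

The residual only needs to be non-zero where warping alone gives the wrong value, which is what keeps the correction network small.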
[0049] For example, perceptual residuals are inherently sparse, meaning that values are close to zero in most regions, with significant values appearing only in areas where motion estimation fails or where there is occlusion. This allows the correction module to be very lightweight, requiring only the learning of sparse feature correction terms.
[0050] In another implementation of the present invention, the depth of the perceptual residual correction network and its number of feature channels are jointly controlled by two hyperparameters, where:
[0051] the first is the base depth hyperparameter and the second is the scaling factor.
[0052] For example, in order to learn the perceptual residual, a perceptual residual correction network is proposed, which utilizes the sparsity of the perceptual residual to achieve lightweight perceptual residual generation.
[0053] The perceptual residual correction network comprises a lightweight encoder, a fusion layer, a correction body, and a projection head. The encoder consists of a series of compact depthwise separable convolutions that extract structural features from the current Bayer frame and, using convolutions with a stride of 2, downsample the feature map to the same resolution as the propagated perceptual features $\tilde{Z}_T$. Because a depthwise separable convolution factorizes a standard convolution into a per-channel spatial convolution followed by a 1×1 pointwise convolution, the encoder has low computational cost and high speed while maintaining performance. The fusion layer concatenates the features extracted by the encoder with the propagated features $\tilde{Z}_T$ and fuses them using 1×1 convolutions. The correction body learns from the fused features and focuses on the regions that need correction; it is the core of the perceptual residual correction network. To balance accuracy and efficiency, a dynamic scaling design is employed: its depth and number of feature channels are jointly controlled by the base depth hyperparameter and the scaling factor. This scaling mechanism lets the model size be adjusted to the needs of the residual, achieving a more streamlined computational scale.
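The cost saving of depthwise separable convolution can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations per output position for a standard convolution and for its depthwise separable counterpart; the layer sizes (3×3 kernel, 32 input channels, 64 output channels) are example values chosen here, not figures from the source.

```python
def standard_conv_macs(k, c_in, c_out):
    """Multiply-accumulates per output position for a k x k standard convolution."""
    return k * k * c_in * c_out


def depthwise_separable_macs(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


std = standard_conv_macs(3, 32, 64)
sep = depthwise_separable_macs(3, 32, 64)
ratio = sep / std  # equals 1/c_out + 1/k**2
```

For these sizes the separable form needs roughly one eighth of the operations, and the ratio depends only on the kernel size and the number of output channels.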
[0054] The specific hyperparameters of the model can be determined using the two formulas above. The projection head uses a 1×1 convolution initialized to 0. This ensures that in the initial training phase the perceptual residual correction is equivalent to the identity mapping, after which the network progressively learns to predict the non-zero residual term $Res$.
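The effect of zero initialization can be seen in a few lines. In the sketch below a 1×1 convolution is written as a per-pixel linear map; with all weights set to zero the predicted residual is zero, so the corrected features equal the propagated ones. Setting the bias to zero as well is an assumption, and the function name and the tiny tensors are hypothetical.

```python
def pointwise_conv(pixels, weights, bias):
    """1 x 1 convolution: the same linear map applied to every pixel.

    pixels:  list of pixels, each a list of C_in channel values.
    weights: C_out rows of C_in weights.
    bias:    C_out bias values.
    """
    return [[sum(w * v for w, v in zip(w_row, px)) + b
             for w_row, b in zip(weights, bias)]
            for px in pixels]


hidden = [[0.5, -1.2], [2.0, 0.3]]               # correction-body output, 2 channels
propagated = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # propagation feature, 3 channels

zero_weights = [[0.0, 0.0] for _ in range(3)]    # zero-initialized projection
zero_bias = [0.0, 0.0, 0.0]                      # assumption: bias is zero as well

res = pointwise_conv(hidden, zero_weights, zero_bias)
corrected = [[p + r for p, r in zip(prow, rrow)]
             for prow, rrow in zip(propagated, res)]
```

Training therefore starts from plain feature propagation and only moves away from it where the loss rewards a non-zero residual.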
[0055] Traditional methods typically use pixel-level residual maps to guide correction, but this often leads to overcorrection or undercorrection because pixel errors and perceptual errors are not perfectly aligned. This invention proposes extending the concept of residuals from the pixel domain to the perceptual feature domain, enabling precise numerical correction of predicted features and avoiding the problems associated with pixel-level residual maps.
[0056] In another implementation of the present invention, the loss function of the perceptual residual correction network is expressed as:
[0057] where $c$ is usually 0.3.
[0058] Combining the characteristics of L1 and L2 norms, the L1 norm is used when the perceived residual is less than a preset threshold, and the L2 norm is used when the perceived residual is greater than the preset threshold.
[0059] For example, during training, the perceptual residuals contain minute noise. This noise has a negligible impact on perceptual quality but disrupts the sparsity of the residuals. To focus on correcting perceptually significant errors, three strategies are employed: threshold filtering, an inverse Huber loss, and region-size-based weighting.
[0060] Specifically, noise introduced by the model calculation appears in the perceptual residuals as differences of very small amplitude. These differences have a negligible impact on the final perception result, but they significantly reduce the sparsity of the perceptual residuals, which would then require a larger model to learn. Therefore, this invention filters the residuals with a threshold so that the perceptual residual correction model learns only the perceptual residuals that matter.
[0061] The loss function of this invention combines the characteristics of L1 and L2 norms. It uses the L1 norm when the perceptual residual is small and the L2 norm when the perceptual residual is large. This design allows the model to focus on large perceptual differences that have a greater impact on downstream tasks.
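A loss with this behavior can be sketched using the commonly used reverse Huber (berHu) form, which is linear below a threshold and quadratic above it. Whether the patent uses exactly this expression is not recoverable from the text, so the formula, the noise threshold `eps`, and the choice to apply the loss to the difference between predicted and target residuals are assumptions; only the default `c = 0.3` comes from the description.

```python
def berhu(x, c=0.3):
    """Reverse Huber: |x| for |x| <= c, (x^2 + c^2) / (2c) otherwise.

    The two branches meet at |x| = c, so the function is continuous.
    """
    a = abs(x)
    if a <= c:
        return a
    return (a * a + c * c) / (2.0 * c)


def filter_small(residual, eps):
    """Threshold filtering: zero out residual entries with magnitude below eps."""
    return [r if abs(r) >= eps else 0.0 for r in residual]


def berhu_loss(pred, target, c=0.3):
    """Mean berHu penalty over all elements."""
    return sum(berhu(p - t, c) for p, t in zip(pred, target)) / len(pred)


# eps = 0.05 is a hypothetical noise threshold.
target = filter_small([0.01, -0.02, 0.9, 0.0], eps=0.05)
pred = [0.0, 0.1, 0.3, 0.0]
loss = berhu_loss(pred, target)
```

In this toy case the single large error (0.6) contributes 0.75 to the sum while the small error (0.1) contributes only 0.1, so the gradient is dominated by the perceptually significant region.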
[0062] In another implementation of the present invention, the method further includes: introducing a weighting strategy based on region size into the perceptual residual correction network.
[0063] Here, $|\cdot|$ denotes the size of a region.
[0064] For example, since the number of correctly predicted regions far exceeds the number of incorrectly predicted regions, weights are introduced to mitigate this imbalance. This weighting strategy prevents the model from converging to trivial zero-residual solutions and encourages it to learn meaningful perceptual residuals.
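One way such a weighting could work is sketched below. The weight formula itself is not recoverable from the text, so this example simply assumes inverse-size weights that give the correctly predicted region and the incorrectly predicted region the same total weight; the function name and the mask are hypothetical.

```python
def region_size_weights(error_mask):
    """Per-element weights inversely proportional to the size of the element's region.

    error_mask: list of booleans, True where the propagated feature is wrong.
    Each non-empty region ends up with a total weight of len(error_mask) / 2.
    """
    total = len(error_mask)
    n_err = sum(1 for m in error_mask if m)
    n_ok = total - n_err
    weights = []
    for m in error_mask:
        region_size = n_err if m else n_ok  # never zero for a member element
        weights.append(total / (2.0 * region_size))
    return weights


mask = [False, False, False, True]
weights = region_size_weights(mask)
```

Without such a weight, predicting zero everywhere would already be nearly optimal when most elements are correct; with it, the few erroneous elements count as much as all the correct ones together.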
[0065] It should be understood that, for the perceptual residual correction network used in perceptual residual learning, since perceptual residual learning theory is a general framework, models with other architectures can be used instead of the model proposed in this invention.
[0066] Example 1: A comparative experiment on computational efficiency was conducted, and the results are shown in Tables 1, 2, 3, and 4.
[0067] Table 1. FLOPs of the model on the CamVid dataset
[0068] Table 2. FLOPs of the model on the Cityscapes dataset
[0069] Table 3. Model speed on the CamVid dataset
[0070] Table 4. Model speed on the Cityscapes dataset
[0071] As can be seen, compared to frame-by-frame inference, this invention reduces GFLOPs by approximately 85-88% in video semantic segmentation. On NVIDIA A100 GPUs, the processing speed is improved by 4 to 8 times.
[0072] A motion estimation performance experiment was conducted, and the results are shown in Table 5.
[0073] Table 5. Speed comparison between the proposed algorithm and traditional algorithms
[0074] It can be seen that the proposed FME algorithm is 120 times faster than traditional codec algorithms, and maintains a high degree of matching consistency in semantically critical regions.
[0075] A Bayer domain validity comparison experiment was conducted, and the results are shown in Table 6 and Figure 3.
[0076] Table 6. Comparison of model training performance on RGB and Bayer.
[0077] It can be seen that the mIoU difference between the model trained in the Bayer domain and the model trained in the RGB domain is usually within 1%, proving the feasibility of directly using raw data.
[0078] The experimental results above demonstrate that the method of the present invention achieves significant acceleration in both video semantic segmentation (CamVid, Cityscapes) and video object detection (ImageNet VID) tasks, with minimal loss of accuracy.
[0079] Technical advantages of the present invention: High-Efficiency Video Recognition Pipeline: The ISP bottleneck is eliminated and a low-latency Bayer domain perception pipeline is built. Traditional video vision systems rely on image signal processing units to convert raw sensor data into RGB images, a computationally expensive process that introduces significant latency. The present invention removes this step and performs feature extraction and model inference directly in the Bayer domain, eliminating the computational cost and latency associated with the front-end ISP. It also uses invertible ISP technology to synthesize a large-scale Bayer dataset, addressing the scarcity of training data in the Bayer domain and enabling model training. This pipeline achieves highly efficient end-to-end processing.
[0080] Fast Motion Estimation: Addressing the shortcomings of existing motion estimation methods, which either rely on CPU serial logic or are based on deep learning, a fast motion estimation method is designed. This technique employs a pyramid block structure and a coarse-to-fine search strategy, and achieves full GPU parallelization based on the CUDA architecture. This significantly improves backend processing throughput while effectively capturing motion.
[0081] Perceptual Residual Learning: To address the accumulated errors caused by motion estimation, the present invention proposes a perceptual residual learning mechanism, abandoning traditional pixel-level residual repair schemes. This mechanism leverages the natural sparsity of differences in the feature space of videos, using a lightweight network to predict and correct numerical biases in propagated features. Combined with the inverse Huber loss and the region-size-based weighting strategy, the perceptual residual correction model can accurately repair significant perceptual errors with minimal computational cost. This achieves a significant reduction in computational load through temporal redundancy while maintaining recognition accuracy comparable to frame-by-frame inference.
[0082] In another aspect of the present invention, as shown in Figure 2, a video recognition acceleration system based on motion estimation is provided, comprising: Video sequence processing module: Identifies keyframes and non-keyframes in a video sequence.
[0083] Feature extraction module: Uses the Bayer domain basic model to extract features from the keyframes to obtain perceptual features.
[0084] Fast motion estimation module: calculates the motion vector between a non-keyframe and a reference frame, where the reference frame is the frame preceding the non-keyframe. The module adopts a pyramid block structure and performs a multi-level, coarse-to-fine matching search on a GPU parallel architecture.
[0085] Feature correction module: warps the features of the reference frame using the motion vector to obtain the propagation feature; predicts the perceptual residual of the current frame with the perceptual residual correction network; numerically corrects the propagation feature using the residual; and outputs the corrected propagation feature. The perceptual residual correction network is a lightweight network structure.
[0086] Video recognition module: Performs video recognition tasks based on the corrected propagation features.
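The way the modules above fit together can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the implementation of the invention: keyframes are taken at a fixed interval, feature maps are single-channel and share the frame resolution, warping uses one integer vector per block with border clamping, and the callables `backbone`, `estimate_motion`, `predict_residual`, and `head` are hypothetical placeholders for the Bayer domain basic model, the fast motion estimation module, the perceptual residual correction network, and the recognition task.

```python
def warp(features, motion, block):
    """Warp a 2-D feature map using one (dy, dx) vector per block."""
    h, w = len(features), len(features[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Look up the vector of the block that contains (y, x).
            dy, dx = motion[(y // block * block, x // block * block)]
            sy = min(max(y + dy, 0), h - 1)   # clamp at the borders
            sx = min(max(x + dx, 0), w - 1)
            out[y][x] = features[sy][sx]
    return out


def correct(propagated, residual):
    """Numerically correct the propagation feature by adding the residual."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(propagated, residual)]


def run_pipeline(frames, backbone, estimate_motion, predict_residual, head,
                 key_interval=4, block=2):
    """Full inference on keyframes, propagate-and-correct on non-keyframes."""
    results, prev_frame, prev_feat = [], None, None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            # Keyframe: extract perceptual features with the full model.
            feat = backbone(frame)
        else:
            # Non-keyframe: reuse the preceding frame's features.
            motion = estimate_motion(frame, prev_frame)
            propagated = warp(prev_feat, motion, block)
            feat = correct(propagated, predict_residual(frame, propagated))
        results.append(head(feat))
        prev_frame, prev_feat = frame, feat
    return results
```

In this sketch the expensive `backbone` runs only once every `key_interval` frames, which is where the saving from temporal redundancy comes from. The inputs passed to `predict_residual` are an assumption, since the description does not state what the correction network receives.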
[0087] The present invention provides a motion estimation-based video recognition acceleration system built on a video recognition pipeline that operates on raw Bayer data and dispenses with the traditional image signal processing unit. A model trained on Bayer data processes the Bayer data captured by the sensor directly, which makes video recognition faster and more efficient. The motion estimation algorithm is kept simple and is designed for parallel execution on modern GPUs; the reduced computational complexity and the parallel architecture together yield faster and more accurate motion estimation and overcome its computational bottleneck. Artifacts are corrected using perceptual residuals, which extend the residual map scheme of traditional video coding to the perceptual level. A model learns a sparse residual correction, so that artifacts are corrected numerically at the perceptual level. This avoids the inaccurate guidance and local resolution problems of traditional residual maps and yields accurate and efficient artifact correction. With fast motion estimation and accurate, compact artifact correction, the invention overcomes the shortcomings of existing technologies.
[0088] In another aspect of the present invention, an electronic device is provided, comprising: a processor, a memory, a communication bus, and a communication interface.
[0089] Wherein: the processor, the memory, and the communication interface communicate with one another via the communication bus.
[0090] The communication interface is used to communicate with other electronic devices or servers.
[0091] The processor is used to execute a program, specifically to perform the steps of any of the motion estimation-based video recognition acceleration methods described in the above embodiments.
[0092] Specifically, the program may include program code, which includes computer operation instructions.
[0093] The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
[0094] The memory is used to store the program. The memory may include high-speed RAM and may also include non-volatile memory, such as at least one disk storage device.
[0095] Specifically, the program can be used to cause the processor to execute steps that implement any of the motion estimation-based video recognition acceleration methods described in the embodiments. For the specific implementation of each step in the program, refer to the corresponding descriptions of the steps and units in any of the motion estimation-based video recognition acceleration methods described above; they are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the devices and modules described above correspond to the process descriptions in the foregoing method embodiments.
[0096] An exemplary embodiment of this application also provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the methods of various embodiments of this application.
[0097] The methods described above according to embodiments of the present invention can be implemented in hardware or firmware, as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored on a remote recording medium or a non-transitory machine-readable medium, downloaded via a network, and subsequently stored on a local recording medium. Thus, the methods described herein can be carried out by such software stored on a recording medium, using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It is understood that the computer, processor, microprocessor, controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) capable of storing or receiving software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code used to implement the methods shown herein, the execution of the code transforms the general-purpose computer into a dedicated computer for executing those methods.
[0098] Specific embodiments of the present invention have now been described. Other embodiments are within the scope of the appended claims. In some cases, the actions described in the claims can be performed in a different order and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result.
[0099] It should be noted that all directional indications (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are used only to explain the relative positional relationship between the components in a particular orientation (as shown in the figures). If that orientation changes, the directional indications change accordingly.
[0100] In the description of this invention, the terms "first" and "second" are used only for convenience in describing different components or names, and should not be construed as indicating or implying a sequential relationship, relative importance, or implicitly specifying the number of technical features indicated. Thus, a feature defined with "first" and "second" may explicitly or implicitly include at least one of that feature.
[0101] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
[0102] It should be noted that although specific embodiments of the present invention have been described in detail with reference to the accompanying drawings, this should not be construed as limiting the scope of protection of the present invention. Various modifications and variations that can be made by those skilled in the art without inventive effort within the scope described in the claims still fall within the scope of protection of the present invention.
[0103] The examples of the embodiments of the present invention are intended to concisely illustrate the technical features of the embodiments of the present invention, so that those skilled in the art can intuitively understand the technical features of the embodiments of the present invention, and are not intended to be an improper limitation of the embodiments of the present invention.
[0104] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A video recognition acceleration method based on motion estimation, characterized in that it comprises: identifying keyframes and non-keyframes in a video sequence; extracting features from the keyframes using a Bayer domain basic model to obtain perceptual features; calculating, by a fast motion estimation module, the motion vector between a non-keyframe and a reference frame, wherein the reference frame is the frame preceding the non-keyframe, and the fast motion estimation module adopts a pyramid block structure and performs a multi-level, coarse-to-fine matching search on a GPU parallel architecture; warping the features of the reference frame using the motion vector to obtain a propagation feature; predicting the perceptual residual of the current frame by a perceptual residual correction network, numerically correcting the propagation feature using the residual, and outputting the corrected propagation feature, wherein the perceptual residual correction network is a lightweight network structure; and performing a video recognition task based on the corrected propagation feature.
2. The method according to claim 1, characterized in that the fast motion estimation module uses a pyramid block structure containing P layers; a coarse search is performed at the top layer using large blocks, where the greater amount of contextual information constrains the matching and ensures global consistency; and the search is then refined layer by layer, with each large block divided into smaller blocks and a finer search performed starting from the search result of the previous layer, so as to capture subtle motion.
3. The method according to claim 1, characterized in that the propagation feature is expressed as: [formula not reproduced] where $Z_R$ denotes the features of the reference frame and $MV$ denotes the motion vector.
4. The method according to claim 3, characterized in that the formula for the numerical correction is: [formula not reproduced] where $Res$ denotes the perceptual residual of the current frame.
5. The method according to claim 1, characterized in that the depth and the number of feature channels of the perceptual residual correction network are jointly controlled by two hyperparameters, a base depth and a scaling factor, where: [formula not reproduced] in which the terms denote the base depth hyperparameter and the scaling factor.
6. The method according to claim 5, characterized in that the loss function of the perceptual residual correction network is expressed as: [formula not reproduced] The loss combines the characteristics of the L1 and L2 norms: the L1 norm is used when the perceptual residual is less than a preset threshold, and the L2 norm is used when the perceptual residual is greater than the preset threshold.
7. The method according to claim 6, characterized in that it further comprises: introducing a region-size-based weighting strategy into the perceptual residual correction network: [formula not reproduced] where $|\cdot|$ denotes the size of a region.
8. A video recognition acceleration system based on motion estimation, characterized in that it comprises: a video sequence processing module, which identifies keyframes and non-keyframes in a video sequence; a feature extraction module, which uses a Bayer domain basic model to extract features from the keyframes to obtain perceptual features; a fast motion estimation module, which calculates the motion vector between a non-keyframe and a reference frame, wherein the reference frame is the frame preceding the non-keyframe, and which adopts a pyramid block structure and performs a multi-level, coarse-to-fine matching search on a GPU parallel architecture; a feature correction module, which warps the features of the reference frame using the motion vector to obtain a propagation feature, predicts the perceptual residual of the current frame through a perceptual residual correction network, numerically corrects the propagation feature using the residual, and outputs the corrected propagation feature, wherein the perceptual residual correction network is a lightweight network structure; and a video recognition module, which performs a video recognition task based on the corrected propagation feature.
9. An electronic device, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the motion estimation-based video recognition acceleration method according to any one of claims 1 to 7.
10. A computer storage medium, characterized in that the computer storage medium stores a computer program which, when executed by a processor, implements the steps of the motion estimation-based video recognition acceleration method according to any one of claims 1 to 7.