Video stabilization method and device based on video understanding large model, equipment and medium

By combining depth estimation and point cloud data processing with a video stabilization method based on a large video understanding model, the problem of poor adaptability and insufficient stabilization accuracy in existing technologies is solved, achieving high-quality video stabilization in complex scenes and adapting to different scenarios and user needs.

CN122243770A · Pending · Publication Date: 2026-06-19 · NINGBO SIMSHINE INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NINGBO SIMSHINE INTELLIGENT TECH CO LTD
Filing Date
2026-01-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing video stabilization technologies do not fully integrate large model technology, have poor adaptability and insufficient stabilization accuracy, cannot effectively identify the main subject of the video and prioritize the stability of the core area, ignore depth information leading to inaccurate motion estimation, cannot adaptively adjust the stabilization style according to user instructions, and lack a deformation detection mechanism for the stabilized image.

Method used

An approach based on a large video understanding model is adopted. It acquires consecutive frames of monocular video data, performs depth estimation and point cloud processing on them to obtain image motion parameters, and uses the semantic analysis capabilities of the large model together with user commands and those motion parameters to stabilize the video. Content-aware inpainting is applied where the stabilized image is deformed or incomplete.

Benefits of technology

It achieves high accuracy and adaptability in video stabilization in complex scenarios, avoiding image distortion and loss, improving video stabilization quality and user experience, and adapting to different scenarios and user needs.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of large-scale model technology, and provides a video stabilization method, apparatus, device, and medium based on a large-scale model for video understanding. The method includes: acquiring several consecutive preceding frames and the current frame from monocular video data; inputting the preceding frames and the current frame into a large-scale model to determine whether the current frame requires stabilization; if stabilization is required, performing depth estimation on the preceding frames and the current frame using a preset depth estimation algorithm to obtain corresponding point cloud data; obtaining image motion parameters based on the point cloud data; and inputting user instructions, the image motion parameters, and the current frame into the large-scale model for video stabilization to obtain the target frame image. This invention improves the accuracy of motion estimation and the stabilization effect, and is applicable to real-time video processing, UAV photography, mobile device photography, and other fields.

Description

Technical Field

[0001] This invention relates to the field of large model technology, and in particular to a video stabilization method, apparatus, device and medium based on a large video understanding model.

Background Technology

[0002] Shaking is a common problem during video shooting, especially with handheld devices or in dynamic shooting scenarios. It produces blurry, jittery images, which degrades both the viewing experience and the accuracy of downstream image processing. Existing video stabilization technologies fall mainly into two groups. Optical stabilization methods compensate for camera shake by adjusting the position of the lens or sensor; they are limited by equipment structure, are costly, and have limited ability to handle severe shaking. Electronic stabilization methods rely on image processing algorithms to estimate and compensate for motion in video frames; common algorithms include optical flow, feature point matching, and global motion estimation. Most of these algorithms process two-dimensional image data and ignore image depth information, which can lead to inaccurate motion estimation in complex scenes and weaken the stabilization effect.

[0003] Existing patent CN109862253A (Digital Video Stabilization Method Based on Deep Learning) discloses a stabilization scheme built on a deep convolutional neural network, but it lacks semantic intent judgment for video content, cannot identify video subjects (people, vehicles) to guide the stabilization strategy, and may damage the core content of the image. Existing electronic image stabilization algorithms have the following problems. First, they do not fully integrate the semantic understanding and decision-making capabilities of large models, so subjects such as people and vehicles are recognized with insufficient accuracy, the stability of core areas is not prioritized, and the subject is easily shifted or cropped during stabilization, which harms the integrity and coherence of the video. Second, they ignore depth information in the scene, so motion states in complex scenes are represented inaccurately and stabilization correction is poorly targeted. Third, they cannot adapt the stabilization style to user instructions, and their mechanisms for detecting deformation and missing regions after stabilization are too weak to support high-quality content-aware repair, resulting in unsatisfactory stabilization.

[0004] In summary, existing image stabilization technologies do not fully integrate large model technology, resulting in poor adaptability and insufficient image stabilization accuracy, which urgently need to be addressed.

Summary of the Invention

[0005] In view of this, embodiments of the present invention provide a video stabilization method, apparatus, device and storage medium based on a large video understanding model, in order to solve the technical problems that existing stabilization technology does not fully integrate large model technology, adapts poorly, and stabilizes with insufficient accuracy.

[0006] In a first aspect, embodiments of the present invention provide a video stabilization method based on a large video understanding model, the method comprising: acquiring several consecutive preceding frames and the current frame from monocular video data; inputting the preceding frames and the current frame image into the large model to determine whether the current frame image needs image stabilization processing; if the current frame image needs image stabilization, performing depth estimation on the preceding frames and the current frame image according to a preset depth estimation algorithm to obtain the corresponding depth images respectively; obtaining the corresponding point cloud data based on each depth image; obtaining the image motion parameters based on the point cloud data; and inputting the user command, the image motion parameters, and the current frame image into the large model for video stabilization processing to obtain the target frame image.

[0007] Preferably, the step of inputting the preceding frames and the current frame image into the large model to determine whether the current frame image needs image stabilization includes: inputting the preceding frames and the current frame image into the large model to obtain the semantic analysis results; obtaining the motion type in the current frame image based on the semantic analysis results and the preset motion type recognition rule library; and determining, based on the motion type, whether the current frame image requires image stabilization.

[0008] Preferably, the step of acquiring corresponding point cloud data based on each of the depth images includes: obtaining the pixel coordinate information of each of the depth images; and transforming the pixel coordinate information to obtain the corresponding point cloud data.

[0009] Preferably, obtaining image motion parameters based on the point cloud data includes: normalizing each of the point cloud data to obtain the respective normalized point cloud data; performing multi-plane fitting based on the normalized point cloud data to obtain the first fitting plane and the second fitting plane; obtaining semantic information provided by the large model, wherein the semantic information includes main region information; obtaining, based on the semantic information, the plane parameters of the first fitting plane and the second fitting plane, wherein the plane parameters include the number of data points and the plane area; taking the plane with the largest number of data points or the largest plane area as the principal plane of the first fitting plane and of the second fitting plane, denoted the first principal plane and the second principal plane respectively; and obtaining the image motion parameters based on the first principal plane and the second principal plane.

[0010] Preferably, obtaining the image motion parameters based on the first principal plane and the second principal plane includes: obtaining the first normal vector of the first principal plane and the second normal vector of the second principal plane, respectively; calculating the rotation angle between the first principal plane and the second principal plane based on the first normal vector and the second normal vector; and obtaining the image motion parameters based on the rotation angle and the preset rotation matrix.

[0011] Preferably, the step of inputting the user command, the image motion parameters, and the current frame image into the large model for video stabilization processing to obtain the target frame image includes: inputting the user command, the image motion parameters, and the current frame image into the large model and stabilizing the current frame image to obtain a preliminarily stabilized image; performing, based on the image detection algorithm, structural similarity analysis between the preliminarily stabilized image and the current frame image to calculate the similarity; determining, based on a preset similarity threshold, whether the preliminarily stabilized image is deformed or has missing parts; if there is no deformation and nothing is missing, using the preliminarily stabilized image as the target frame image; and if deformation or missing parts exist, performing content-aware repair on the preliminarily stabilized image to obtain the target frame image.

[0012] Preferably, the step of inputting the user command, the image motion parameters, and the current frame image into the large model and stabilizing the current frame image to obtain a preliminarily stabilized image includes: inputting the user command, the image motion parameters, and the current frame image into the large model to obtain the image of each main region; semantically parsing and mapping the user instruction, based on the user instruction, each of the main regions, and the image motion parameters, to obtain quantization parameters, wherein the user instruction includes natural language and the quantization parameters include one or more of jitter compensation intensity, smoothing filter coefficient, and rotation correction threshold; obtaining the image of the region to be processed based on the preset rotation matrix and the quantization parameters; planning a path based on the current frame image, the image of the region to be processed, and the main region image to obtain the image stabilization processing path; and stabilizing the current frame image according to the image stabilization processing path to obtain the preliminarily stabilized image.

[0013] Preferably, when running on an edge computing device, the large model adopts a quantized model with multi-bit integer precision; when running on a cloud server, the large model adopts a floating-point precision model.

[0014] In a second aspect, embodiments of the present invention provide a video stabilization device based on a large video understanding model, the device comprising: a data acquisition module, used to acquire several consecutive preceding frames and the current frame from monocular video data; an image stabilization judgment module, used to input the preceding frames and the current frame image into the large model to determine whether the current frame image needs image stabilization processing; a depth acquisition module, used to perform depth estimation on the preceding frames and the current frame image according to a preset depth estimation algorithm when the current frame image needs image stabilization processing, and to obtain the corresponding depth images respectively; a point cloud data acquisition module, used to acquire the corresponding point cloud data according to each of the depth images; a motion parameter acquisition module, used to acquire image motion parameters based on the point cloud data; and an image stabilization module, used to input user commands, the image motion parameters, and the current frame image into the large model, perform video stabilization processing, and obtain the target frame image.

[0015] In a third aspect, embodiments of the present invention provide an electronic device, including: at least one processor, at least one memory, and computer program instructions stored in the memory, which, when executed by the processor, implement the method of the first aspect described above.

[0016] In a fourth aspect, embodiments of the present invention provide a storage medium storing computer program instructions, which, when executed by a processor, implement the method of the first aspect described above.

[0017] In summary, the beneficial effects of the present invention are as follows. The video stabilization method, apparatus, device, and storage medium based on a large video understanding model provided in this invention acquire several consecutive preceding frames and the current frame from monocular video data; input the preceding frames and the current frame into a large model to determine whether the current frame requires stabilization; if stabilization is required, perform depth estimation on the preceding frames and the current frame using a preset depth estimation algorithm to obtain the corresponding depth images; acquire the corresponding point cloud data from each depth image; obtain image motion parameters from the point cloud data; and input the user instruction, the image motion parameters, and the current frame into the large model for video stabilization to obtain the target frame image. The invention combines the semantic analysis capabilities of a large video understanding model with the 3D information obtained from monocular depth estimation, providing a video stabilization method that joins semantic guidance with physical constraints. On the one hand, the large model can understand scene semantics, key subjects, and motion intent from multi-frame images, avoiding the misjudgments and incorrect compensation that arise when relying solely on depth estimation in complex scenarios such as occlusion and rapid movement. On the other hand, by introducing depth images, point cloud data, and the motion parameters derived from them, the stabilization process is grounded in real-world spatial motion relationships, which makes the large model's stabilization decisions more objective and robust. This collaborative approach overcomes the rigidity of traditional algorithmic strategies. The large model intervenes only in key stages such as the stabilization decision and the final processing; combined with a lightweight depth estimation algorithm, this balances stabilization quality against computational cost. While maintaining real-time performance, the method avoids distortion or loss in the stabilized image, improves the quality of video stabilization in complex application scenarios, and broadens the applicability of the method to video shooting in complex scenes.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments of the present invention will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort, and these are all within the protection scope of the present invention.

[0019] Figure 1 is a flowchart of the video stabilization method based on a large video understanding model according to an embodiment of the present invention; Figure 2 is a schematic diagram of the video stabilization device based on a large video understanding model according to an embodiment of the present invention; Figure 3 is a schematic diagram of the structure of an electronic device according to an embodiment of the present invention.

Detailed Implementation

[0020] The features and exemplary embodiments of various aspects of the present invention will now be described in detail. To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it. For those skilled in the art, the present invention can be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of it.

[0021] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.

[0022] It should be noted that all actions involving the acquisition of signals, information, or data in this invention are carried out in compliance with the relevant data protection laws and regulations of the locality and with authorization from the owner of the relevant device.

[0023] Example 1. Referring to Figure 1, in a first aspect, embodiments of the present invention provide a video stabilization method based on a large video understanding model, the method comprising: S1: Acquire several consecutive preceding frames and the current frame from the monocular video data; Specifically, continuous video frame data is obtained from a monocular camera, and the current frame image and several adjacent preceding frames are extracted. If the current frame is the nth frame, the preceding frames can be designated n-1, n-2, n-3, and so on, with the specific number of frames adjusted to the application scenario so that temporal correlation is preserved. These images serve as input for subsequent analysis of the motion changes and continuity between images, which in turn determines whether video stabilization processing is needed in later steps.
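As an illustration of S1 only, the frame selection can be sketched as a fixed-length sliding window. This is a minimal sketch, not the applicant's implementation: the class name `FrameWindow` and the parameter `num_previous` are hypothetical, and plain integers stand in for decoded images.

```python
from collections import deque


class FrameWindow:
    """Keeps the current frame plus the most recent preceding frames."""

    def __init__(self, num_previous=3):
        # num_previous is a hypothetical setting; the source leaves the count open.
        self._frames = deque(maxlen=num_previous + 1)

    def push(self, frame):
        """Add the newest frame; it becomes the current frame n."""
        self._frames.append(frame)

    def ready(self):
        """True once enough preceding frames have been collected."""
        return len(self._frames) == self._frames.maxlen

    def split(self):
        """Return (preceding_frames, current_frame), oldest first."""
        frames = list(self._frames)
        return frames[:-1], frames[-1]


window = FrameWindow(num_previous=3)
for n in range(6):  # integers stand in for decoded images
    window.push(n)
previous, current = window.split()
```

With `maxlen` set, the deque drops the oldest frame automatically, so the window always holds frames n-3 to n once it has filled.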

[0024] S2: Input the preceding frames and the current frame image into the large model, and determine whether the current frame image needs image stabilization processing; Specifically, a large model here refers to a large-scale neural network built on deep learning architectures (such as Transformer, 3D CNN, ViT, or Swin Transformer) that, after training, can understand the content of video images. Image stabilization refers to processing images or videos to eliminate the blurring or instability caused by factors such as camera shake and lens movement, thereby improving video quality. The preceding frames and the current frame are input into the trained large model, which analyzes the motion features of the images to determine whether the current frame needs stabilization. In one embodiment, the large model is called every N frames (e.g., N=10) for deep semantic analysis, while lightweight optical flow is used for non-key frames, with the large model's analysis results guiding the parameters of the optical flow method; this addresses the real-time constraint. Analyzing image content with the large model makes it possible to determine accurately whether the current frame requires stabilization because of motion or other causes, and avoids stabilizing frames that do not need it. Because the decision draws on the motion and changes across neighboring frames, stabilization is applied more efficiently.
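The every-N-frames embodiment can be sketched as a simple schedule. This is a hedged sketch under stated assumptions: the function name `plan_analysis`, the path labels, and the choice to make frame 0 a key frame are illustrative and not taken from the source.

```python
def plan_analysis(num_frames, interval=10):
    """Label each frame index with the analysis path it would take.

    Every `interval`-th frame goes to the large model; the other frames use
    the lightweight optical-flow path and record which key frame's
    large-model result would guide their parameters.
    """
    plan = []
    last_key = None
    for idx in range(num_frames):
        if idx % interval == 0:
            last_key = idx
            plan.append((idx, "large_model", last_key))
        else:
            plan.append((idx, "optical_flow", last_key))
    return plan


schedule = plan_analysis(25, interval=10)
```

Each tuple is (frame index, path, guiding key frame), so frame 24 in this example would reuse the analysis obtained at frame 20.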

[0025] S3: If the current frame image needs image stabilization, perform depth estimation on the preceding frames and the current frame image according to the preset depth estimation algorithm, and obtain the corresponding depth images respectively; Specifically, the preset depth estimation algorithm first estimates depth values for the preceding frames and the current frame, and the corresponding depth images are then built from these values. The depth information of the scene is thus inferred from monocular video, providing the foundation for subsequent point cloud generation. Depth estimation considers not only each single frame but also the consistency of depth between frames: a temporal smoothing constraint reduces flickering and jitter and keeps the depth temporally consistent. A depth image represents the depth of each pixel on a two-dimensional plane and provides the basic data for three-dimensional point cloud generation. The depth estimation algorithm adapts to the application scenario, selecting the depth estimation model dynamically according to device computing power or image scene complexity. For example, MiDaS is used when computing power is high and MobileDepth when it is low, and scene complexity is initially judged from information such as image gradients. MobileDepth is a lightweight depth estimation algorithm suited to environments with limited computing resources, such as mobile devices. Using it yields depth information while preserving computational efficiency, which supports real-time video stabilization on a range of devices.
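The model selection and temporal smoothing described in S3 might be sketched as follows. Everything numeric here is an assumption: the source names MiDaS and MobileDepth and mentions gradients, but it does not give a complexity measure, a threshold, or a smoothing formula. The sketch uses mean gradient magnitude, a hypothetical threshold, and an exponential moving average as stand-ins, and it only returns a model name rather than loading any model.

```python
import numpy as np


def scene_complexity(gray):
    """Mean gradient magnitude of a grayscale image (higher means busier)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))


def choose_depth_model(gray, high_compute, complexity_threshold=5.0):
    """Pick a model name; the rule and threshold are illustrative only."""
    if high_compute and scene_complexity(gray) >= complexity_threshold:
        return "MiDaS"
    return "MobileDepth"


def smooth_depth(prev_smoothed, new_depth, alpha=0.8):
    """Exponential moving average across frames to damp depth flicker."""
    if prev_smoothed is None:
        return new_depth.astype(np.float64)
    return alpha * prev_smoothed + (1.0 - alpha) * new_depth
```

A larger `alpha` weights the depth history more heavily, which suppresses flicker at the cost of slower response to real depth changes.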

[0026] S4: Based on each depth image, obtain the corresponding point cloud data respectively; Specifically, point cloud data is a set of a large number of points in three-dimensional space that represents the scene and helps in understanding its structure. For each depth image, the depth value and pixel coordinates of each pixel are first extracted. A coordinate transformation algorithm, for example a transformation based on the camera intrinsic parameters, then converts the two-dimensional pixel coordinates and depth values into three-dimensional spatial coordinates. Point cloud data is used to capture the motion trajectory of the camera or scene, so that three-dimensional motion changes caused by device shake are captured more accurately, the limitations of two-dimensional analysis are avoided, and data support is provided for subsequent image stabilization processing.

[0027] S5: Obtain image motion parameters based on the point cloud data described above; Specifically, image motion parameters are quantified parameters characterizing the motion state of the current frame relative to the preceding frames. The core parameters are rotation angle and translation amount, and they are the basis for correcting jitter in subsequent image stabilization processing. First, a rotation matrix is constructed from the point cloud data, and then the motion parameters between two frames are calculated. These parameters describe the motion relationship between frames and provide the motion compensation information for subsequent stabilization. In some embodiments, Kalman filtering is introduced to smooth the motion trajectory and remove random camera jitter, such as abrupt changes in inter-frame motion vectors caused by hand shake or equipment vibration. The output is a set of smoothed motion parameters, so that transitions between frames are natural and free of significant jitter. Pixel-level motion compensation can then be performed with the smoothed motion parameters output by the Kalman filter to generate a stabilized frame.
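To make the Kalman smoothing step concrete, here is a minimal scalar sketch applied to one motion parameter, such as a per-frame rotation angle. The source does not specify the state model or noise settings, so the random-walk model, the initial variance of 1.0, and the default `process_var` and `measurement_var` values are all assumptions.

```python
def kalman_smooth(measurements, process_var=1e-3, measurement_var=1e-1):
    """Smooth a 1-D sequence with a scalar Kalman filter (random-walk model)."""
    estimate = measurements[0]  # assumed initialisation: first measurement
    error_var = 1.0             # assumed initial uncertainty
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the state carries over and its uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        error_var = (1.0 - gain) * error_var
        smoothed.append(estimate)
    return smoothed


jittery_angles = [0.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
smoothed_angles = kalman_smooth(jittery_angles)
```

A smaller `process_var` relative to `measurement_var` trusts the motion model more and gives a smoother trajectory; a practical system would likely track rotation and translation jointly rather than one scalar at a time.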

[0028] S6: Input the user command, the image motion parameters, and the current frame image into the large model, perform video stabilization processing, and obtain the target frame image.

[0029] Specifically, user commands refer to natural language instructions input by the user based on personalized needs, such as maintaining a documentary-style smoothness, preserving some handheld vibrations, and prioritizing the preservation of the main subject. These instructions guide the large-scale model to adjust its stabilization strategy. The model receives these user commands in natural language form through a user interface and converts them into stabilization control parameters that the large-scale model can understand. The large-scale model performs video stabilization processing on the current frame based on motion parameters and user commands, generating a stable target frame image. The large-scale model dynamically adjusts the stabilization strategy based on user commands, adjusting the intensity of jitter compensation. Depth estimation ensures that the strategy is executed based on real geometric data. By compensating for jitter, a smoother video output is achieved, improving the quality of video stabilization.
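The conversion of a natural-language command into control parameters can be illustrated with a deliberately simple stand-in. In the source this parsing is done by the large model; the keyword lookup below replaces it purely for illustration. The parameter names follow the quantization parameters listed in the summary (jitter compensation intensity, smoothing filter coefficient, rotation correction threshold), while the preset names and every numeric value are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class StabilizationParams:
    jitter_compensation: float     # 0 leaves shake untouched, 1 removes it fully
    smoothing_coefficient: float   # weight on the smoothed trajectory
    rotation_threshold_deg: float  # rotations below this are not corrected


# Hypothetical presets; the numbers are placeholders, not values from the source.
PRESETS = {
    "documentary": StabilizationParams(0.9, 0.9, 0.5),
    "handheld": StabilizationParams(0.4, 0.5, 2.0),
}
DEFAULT = StabilizationParams(0.7, 0.7, 1.0)


def parse_command(command):
    """Keyword stand-in for the large model's semantic parsing of a command."""
    text = command.lower()
    for keyword, params in PRESETS.items():
        if keyword in text:
            return params
    return DEFAULT
```

For example, `parse_command("Keep a documentary-style smoothness")` returns the stronger-compensation preset, while an unrecognized command falls back to `DEFAULT`.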

[0030] In one embodiment, S2 includes: S21: Input the previous several frames of images and the current frame of images into the large model to obtain semantic analysis results; Specifically, the preceding and current frames are input into a large model. Through training, the model extracts semantic information from the images, including scene type, object category, and object motion trajectory, generating semantic analysis results. Self-attention or cross-attention mechanisms are used to mine pixel-level, feature-level, and semantic-level associations between consecutive frames, ultimately outputting semantic analysis results. These results provide strong support for subsequent image stabilization processing, making it more intelligent and enabling adjustments to processing strategies based on different scenarios.

[0031] S22: Based on the semantic analysis results and the preset motion type recognition rule library, obtain the motion type in the current frame image; Specifically, the motion type recognition rule base is a set of preset rules, and motion types include: motions that do not require image stabilization and effective jitter motions (motions that require image stabilization). By analyzing the semantic analysis results and combining them with the preset motion type recognition rule base, the motion mode of the image is inferred, the motion type in the current frame image is identified, and accurate motion information is provided for subsequent image stabilization processing to ensure that the processing scheme matches the motion characteristics of the image.

[0032] S23: Determine whether the current frame image needs image stabilization processing based on the motion type.

[0033] Specifically, based on the motion type information in the image, it is determined whether the current image requires image stabilization and whether there is image blurring or jitter caused by motion. After identifying the motion type, if the current frame image's motion type does not require image stabilization, the subsequent image stabilization algorithm execution steps are skipped, and the current frame image is directly output as a stabilized frame, ensuring smooth video playback and saving computational resources. If the current frame image's motion type is "effective jitter motion," the subsequent image stabilization process continues. By determining the motion type, it is ensured that image stabilization is only performed when necessary, improving system efficiency and response speed.
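Steps S22 and S23 can be sketched as a rule lookup followed by an early exit. The rule base contents, the `motion_label` key, and the choice to stabilize when the label is unknown are assumptions made for this sketch; the source only states that motion types divide into those that need stabilization and those that do not.

```python
# Hypothetical rule base: semantic motion label -> needs stabilization?
MOTION_RULES = {
    "static": False,
    "intentional_pan": False,
    "subject_tracking": False,
    "hand_shake": True,
    "vehicle_vibration": True,
}


def classify_motion(semantic_result):
    """Map a semantic analysis result to one of the two motion types."""
    label = semantic_result.get("motion_label", "unknown")
    # Unknown labels are stabilized; this conservative default is an assumption.
    needs_stabilization = MOTION_RULES.get(label, True)
    return "effective_jitter" if needs_stabilization else "no_stabilization_needed"


def process_frame(frame, semantic_result, stabilize):
    """Skip the stabilization pipeline when the motion type does not need it."""
    if classify_motion(semantic_result) == "no_stabilization_needed":
        return frame  # output the current frame directly
    return stabilize(frame)
```

Here `stabilize` is any callable standing in for steps S3 to S6, so the early return is what saves the depth estimation and point cloud work.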

[0034] In one embodiment, S4 includes: S41: Obtain the pixel coordinate information of each depth image; Specifically, the coordinate information of all pixels in each depth image is extracted. This coordinate information includes the position of each pixel in the image and its corresponding depth value. Usually, the top left corner of the image is taken as the origin, the horizontal direction is the X-axis, and the vertical direction is the Y-axis. The coordinate values are in pixels (for example, if the depth image resolution is 1920×1080, the coordinates of the top left pixel are (0,0) and the coordinates of the bottom right pixel are (1919,1079)).

[0035] S42: Perform coordinate transformation on the pixel coordinate information to obtain the corresponding point cloud data.

[0036] Specifically, the two-dimensional pixel coordinates and their depth values are converted into point cloud data in three-dimensional space. Each pixel in the image is thereby mapped to a three-dimensional point, and the resulting point cloud provides the spatial information needed for subsequent motion estimation and image stabilization.
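One common way to carry out this transformation is pinhole back-projection with the camera intrinsics, where a pixel (u, v) with depth Z maps to X = (u - cx)·Z/fx and Y = (v - cy)·Z/fy. The sketch below assumes that model; the source mentions camera intrinsic parameter transformation but does not give the formula, and the parameter names `fx`, `fy`, `cx`, `cy` are the conventional ones rather than names from the source.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an HxW depth image into an (H*W, 3) array of XYZ points."""
    h, w = depth.shape
    # u is the column index (X-axis), v is the row index (Y-axis).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Points are returned in row-major pixel order, so row `v * w + u` of the result corresponds to pixel (u, v). Monocular depth is often only known up to scale, in which case the resulting cloud inherits that ambiguity.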

[0037] In one embodiment, S5 includes: S51: Normalize each of the point cloud data to obtain each normalized point cloud data respectively; Specifically, the point cloud data is normalized by mapping the coordinates of each point to a preset uniform numerical range, generating normalized point cloud data separately to eliminate the effects of scale differences and noise, thus providing consistent data for subsequent calculations.
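As one possible reading of S51, the sketch below centers each cloud on its centroid and scales it so that the farthest point lies at distance 1. The source only says that coordinates are mapped to a preset uniform range, so this unit-sphere convention is an assumption, as is the function name.

```python
import numpy as np


def normalize_point_cloud(points):
    """Center an (N, 3) cloud and scale it so the farthest point has norm 1.

    Returns (normalized_points, centroid, scale) so the mapping can be undone.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    scale = np.linalg.norm(centered, axis=1).max()
    if scale == 0:
        # All points coincide; nothing to scale.
        return centered, centroid, 1.0
    return centered / scale, centroid, scale
```

Returning the centroid and scale keeps the step reversible, which matters if translation is later recovered in the original units.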

[0038] S52: Perform multi-plane fitting based on the normalized point cloud data to obtain the first fitting plane and the second fitting plane. Specifically, multi-plane fitting refers to performing plane fitting separately on the normalized point cloud data of the previous frame and of the current frame image, using an algorithm such as the least squares method or RANSAC to fit the planes that best represent the spatial distribution of the points. This yields the first fitting plane and the second fitting plane, which describe the key spatial structures in the images. According to a preset principal plane selection rule, the most representative plane is then selected from the first fitting plane and from the second fitting plane as the first principal plane and the second principal plane, respectively. These principal planes usually reflect the most significant moving parts of the scene.
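The two algorithms named above can be combined in a short sketch: RANSAC finds the dominant plane despite outliers, and a least-squares fit (via SVD) refines it on the inliers. This fits a single plane; repeating it on the remaining points would give multiple planes. The threshold, iteration count, and seed are illustrative assumptions, not values from the source.

```python
import numpy as np


def fit_plane_least_squares(points):
    """Return (unit normal n, offset d) with n . p + d = 0, using SVD."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]  # direction of least variance
    return normal, -float(normal @ centroid)


def fit_plane_ransac(points, threshold=0.01, iterations=200, seed=0):
    """Fit the dominant plane in an (N, 3) cloud; returns (normal, d, inlier_mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iterations):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        length = np.linalg.norm(normal)
        if length < 1e-12:
            continue  # degenerate (collinear) sample
        normal = normal / length
        distances = np.abs((points - sample[0]) @ normal)
        inliers = distances < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no non-degenerate sample found")
    normal, d = fit_plane_least_squares(points[best_inliers])
    return normal, d, best_inliers
```

The sign of the returned normal is arbitrary, since n and -n describe the same plane; any later step that compares normals across frames needs a consistent orientation convention.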

[0039] S53: Obtain semantic information provided by the large model, wherein the semantic information includes main region information; Specifically, the large model identifies the main subject of the image and determines the regions of the video frame that carry the main semantic meaning or interest value. This main region information covers, for example, the main person, a moving target, a foreground object, or a scene region with clear structural features.

[0040] S54: Based on the semantic information, obtain the plane parameters of the first fitting plane and the second fitting plane, wherein the plane parameters include the number of data points and the plane area; Specifically, planar parameters of the first and second fitted planes are extracted based on semantic information. The number of data points refers to the number of data points in the main region or the number of point cloud data points constituting the plane, thereby determining the key planes or key regions used to calculate motion parameters. The plane area is the size of the fitted plane in three-dimensional space.

[0041] S55: The plane with the largest number of data points or the largest plane area is used as the principal plane of the first fitting plane and the second fitting plane, respectively denoted as the first principal plane and the second principal plane. Specifically, based on the plane parameters, the plane with the most data points or the largest area is selected as the first principal plane of the first fitting plane and the second principal plane of the second fitting plane, respectively. This rule ensures that the most representative plane in the scene is selected, thereby improving the accuracy of motion estimation and reducing noise interference.
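The fitting and selection steps S52 to S55 can be sketched as follows, under stated assumptions: RANSAC (one of the algorithms named above) extracts planes one after another, and the principal plane is the one with the most inlier points. The function names, the inlier threshold, the iteration count, and the fixed random seed are all hypothetical choices; selection by plane area and the semantic filtering of S53 and S54 are not shown.

```python
import math
import random


def plane_from_points(p, q, r):
    """Return (unit normal, offset d) of the plane n.x + d = 0, or None if collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(sum(c * c for c in n))
    if length < 1e-12:
        return None
    n = tuple(c / length for c in n)
    return n, -sum(n[i] * p[i] for i in range(3))


def ransac_plane(points, threshold, iterations, rng):
    """Keep the candidate plane supported by the largest number of inliers."""
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [p for p in points
                   if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


def fit_planes(points, max_planes=2, threshold=1e-6, iterations=200, seed=0):
    """Multi-plane fitting: fit a plane, remove its inliers, repeat."""
    rng = random.Random(seed)
    remaining, planes = list(points), []
    while len(planes) < max_planes and len(remaining) >= 3:
        plane, inliers = ransac_plane(remaining, threshold, iterations, rng)
        if plane is None:
            break
        planes.append((plane, inliers))
        inlier_set = set(inliers)
        remaining = [p for p in remaining if p not in inlier_set]
    return planes


def select_principal_plane(planes):
    """Selection rule used here: the plane with the most data points."""
    return max(planes, key=lambda item: len(item[1]))


# Synthetic scene: a 20-point "floor" (z = 0) and a 9-point "wall" (x = 10).
floor = [(float(x), float(y), 0.0) for x in range(5) for y in range(4)]
wall = [(10.0, float(y), float(z)) for y in range(3) for z in range(1, 4)]
planes = fit_planes(floor + wall)
(principal_normal, principal_d), principal_inliers = select_principal_plane(planes)
```

In this synthetic scene the floor is chosen as the principal plane because it contains more points than the wall. Running the same procedure on the previous frame and on the current frame would yield the first and second principal planes.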

[0042] S56: Obtain the image motion parameters based on the first principal plane and the second principal plane.

[0043] Specifically, image motion parameters refer to the characteristics that describe image changes, such as rotation and translation.

[0044] This step extracts the image's motion parameters by analyzing the changes in the first and second principal planes. Common methods include calculating the plane's normal vector and rotation angle. Obtaining accurate image motion parameters is the core of image stabilization, ensuring a more stable image after processing. By precisely acquiring motion parameters, accurate image stabilization can be achieved, improving the quality of the final image.

[0045] In one embodiment, S56 includes: S561: Obtain the first normal vector of the first principal plane and the second normal vector of the second principal plane respectively; Specifically, the normal vector of each principal plane is extracted. A normal vector describes the orientation of a plane in three-dimensional space and is the basis for calculating the rotation angle and the motion parameters, providing the geometric information needed for subsequent image stabilization processing.

[0046] S562: Calculate the angle based on the first normal vector and the second normal vector to obtain the rotation angle between the first principal plane and the second principal plane; Specifically, the angle between the first normal vector and the second normal vector is calculated to obtain the rotation angle between the two principal planes, which is used to describe the directional changes between frames.
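As a small worked sketch of S562, the angle follows from the dot product of the two normals: θ = arccos(n1·n2 / (|n1||n2|)). Because a fitted plane's normal is only defined up to sign, the helper `orient_like` (a hypothetical addition, not from the source) flips the second normal into the same half-space as the first, which assumes the plane rotates by less than 90° between consecutive frames.

```python
import math


def orient_like(n_ref, n):
    """Flip n if necessary so that it points into the same half-space as n_ref."""
    dot = sum(a * b for a, b in zip(n_ref, n))
    return n if dot >= 0 else tuple(-c for c in n)


def angle_between(n1, n2):
    """Angle in radians between two (not necessarily unit) normal vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.acos(cos_theta)


first_normal = (0.0, 0.0, 1.0)
second_normal = orient_like(first_normal, (0.0, -1.0, 0.0))
rotation_angle = angle_between(first_normal, second_normal)
```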

[0047] S563: Obtain the image motion parameters based on the rotation angle and the preset rotation matrix.

[0048] Specifically, based on the rotation angle, a rotation matrix is constructed, which can be expressed using the following formula (the Rodrigues rotation formula):

$$R = I + \sin\theta \, K + (1 - \cos\theta) \, K^{2}$$

[0049] In the formula, R is the rotation matrix, I is the 3×3 identity matrix, θ is the rotation angle, and K is the antisymmetric matrix formed from the unit rotation axis. The formula combines the sine and cosine of the rotation angle with the rotation axis to construct a matrix describing the rotational relationship between images, providing a foundation for subsequent image stabilization. Using the constructed rotation matrix, the motion parameters between the two frames are calculated. These parameters describe the motion relationship between frames and provide the motion compensation information for subsequent image stabilization processing. In some embodiments, the image motion parameters also include a translation vector T: after plane fitting, not only the rotation of the normal vector is calculated, but also the translation of the plane's centroid.
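A minimal sketch of this construction, assuming the formula in question is the Rodrigues rotation formula R = I + sin θ · K + (1 − cos θ) · K², with K the antisymmetric matrix of the unit rotation axis. Taking the axis as the cross product of the two normals is an assumption of this sketch, as are the function names; the parallel-normals case is simply mapped to the identity.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def mat_vec(m, v):
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def rodrigues(axis, theta):
    """R = I + sin(theta) * K + (1 - cos(theta)) * K^2 for a rotation about `axis`."""
    length = math.sqrt(sum(c * c for c in axis))
    kx, ky, kz = (c / length for c in axis)
    K = [[0.0, -kz, ky],
         [kz, 0.0, -kx],
         [-ky, kx, 0.0]]                       # antisymmetric matrix of the unit axis
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    s, c = math.sin(theta), math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + (1.0 - c) * K2[i][j]
             for j in range(3)] for i in range(3)]


def rotation_between(n1, n2):
    """Rotation matrix that turns normal n1 onto normal n2 (assumed axis: n1 x n2)."""
    axis = cross(n1, n2)
    sin_part = math.sqrt(sum(c * c for c in axis))
    cos_part = sum(a * b for a, b in zip(n1, n2))
    if sin_part < 1e-12:                       # parallel normals: no rotation axis
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    return rodrigues(axis, math.atan2(sin_part, cos_part))


R = rotation_between((0.0, 0.0, 1.0), (0.0, 1.0, 0.0))
rotated = mat_vec(R, (0.0, 0.0, 1.0))          # should land on (0, 1, 0)
```

Using `atan2` on the cross-product length and the dot product avoids the loss of precision that `acos` shows for very small angles, which matter for frame-to-frame jitter.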

[0050] In one embodiment, S6 includes: S61: Input the user command, the image motion parameters and the current frame image into the large model, perform image stabilization processing on the current frame image, and obtain a preliminarily stabilized image; Specifically, user commands are natural language instructions that the user inputs according to personalized needs, such as maintaining a documentary-style smoothness, retaining some handheld vibration, or prioritizing the preservation of the main subject, and they guide the large model in adjusting its image stabilization strategy. The target frame image is the final output frame that meets the stability, integrity, and user requirements after image stabilization processing. The large model processes the current frame image according to the user command and the image motion parameters to produce the preliminarily stabilized image, in which the instability and blurring caused by image motion have been initially removed, providing a clear and stable basis for the subsequent steps.

[0051] S62: Based on the image detection algorithm, perform structural similarity analysis between the preliminarily stabilized image and the current frame image, and calculate the similarity. Specifically, structural similarity analysis is an image comparison method that measures how similar two images are based on their structural information. The similarity between the preliminarily stabilized image and the current frame image is calculated with a structural similarity analysis algorithm. This helps determine whether the stabilization process was successful and whether the image was distorted during stabilization, so that image distortion after processing can be detected.
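To illustrate the similarity calculation, here is a simplified sketch of the structural similarity index (SSIM) computed over the whole image as a single window. The usual SSIM implementations average the same expression over local sliding windows, so this global version is an assumption made for brevity; the constants 0.01 and 0.03 are the conventional SSIM stabilizing constants, and the function name is illustrative.

```python
def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images (lists of rows)."""
    a = [float(p) for row in img_a for p in row]
    b = [float(p) for row in img_b for p in row]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2           # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2           # stabilizes the contrast/structure term
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


frame = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
same_score = global_ssim(frame, frame)         # identical images
diff_score = global_ssim(frame, inverted)      # structurally opposite images
```

Identical images score 1, and the score drops as structure diverges, which is what makes a fixed threshold usable in the next step.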

[0052] S63: Based on the preset similarity threshold, determine whether the image after preliminary stabilization has deformation or missing parts; Specifically, the similarity threshold is a pre-set standard value used to determine whether image distortion or loss occurs after image processing. By comparing the calculated similarity with the preset threshold, it is determined whether the image after preliminary stabilization meets the stabilization standard, thus avoiding the output of images that do not meet the requirements.

[0053] S64: If there is no deformation or missing parts, the image after preliminary stabilization shall be used as the target frame image; Specifically, if the image after initial stabilization is not distorted or missing, it is directly output as the target frame image, simplifying the processing and improving processing efficiency while ensuring that the image quality meets the requirements.

[0054] S65: If there is deformation or missing parts, perform content-aware repair on the image after preliminary stabilization to obtain the target frame image.

[0055] Specifically, content-aware inpainting fills in missing areas based on the content of the surrounding image, restoring lost regions. If the preliminarily stabilized image contains distortions or missing parts, content-aware inpainting is applied to obtain the final target frame image. This makes the stabilized image more complete, avoids blank areas or distortion, and improves the user experience.
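The decision logic of S63 to S65 can be sketched as a single branch. The threshold value 0.85, the function name, and the two callables `similarity_fn` and `inpaint_fn` are hypothetical placeholders: the source specifies neither the threshold nor the inpainting method.

```python
def select_target_frame(pre_stabilized, current, similarity_fn, inpaint_fn,
                        threshold=0.85):
    """Return (target_frame, similarity, repaired_flag).

    If the similarity reaches the preset threshold, the preliminarily
    stabilized image is used as-is; otherwise it is passed to the
    content-aware repair function first.
    """
    similarity = similarity_fn(pre_stabilized, current)
    if similarity >= threshold:
        return pre_stabilized, similarity, False
    return inpaint_fn(pre_stabilized), similarity, True


# Stand-in callables for demonstration only.
good = select_target_frame("stabilized", "current",
                           similarity_fn=lambda a, b: 0.93,
                           inpaint_fn=lambda img: img + "+repaired")
bad = select_target_frame("stabilized", "current",
                          similarity_fn=lambda a, b: 0.40,
                          inpaint_fn=lambda img: img + "+repaired")
```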

[0056] In one embodiment, S61 includes: S611: Input the user command, the image motion parameters and the current frame image into the large model to obtain images of each main region; Specifically, the system identifies the main subject regions in the image, such as people, vehicles, and key objects that users are particularly interested in. The large model obtains images of each subject region, which guides the image stabilization algorithm to prioritize the stability of that region in subsequent steps. Furthermore, during the image stabilization process, the system intelligently plans the cropping path based on the images of each subject region.

[0057] S612: Based on the user instruction, each of the main regions and image motion parameters, perform semantic parsing and mapping processing on the user instruction to obtain quantization parameters, wherein the user instruction includes natural language, and the quantization parameters include one or more of jitter compensation intensity, smoothing filter coefficient, and rotation correction threshold; Specifically, the user interface receives user commands in natural language and converts them into quantization parameters that the large model can understand. Quantization parameters refer to the specific numerical parameters that the user's natural language commands are converted into and can be used in the image stabilization algorithm, including one or more of the following: jitter compensation intensity, smoothing filter coefficient, and rotation correction threshold. The jitter compensation intensity represents the strength of the image stabilization algorithm in correcting jitter; a larger value indicates a more thorough correction. The smoothing filter coefficient represents the smoothness of the stabilized image; a larger value results in a smoother but potentially more rigid image. The rotation correction threshold represents the minimum rotation angle required to trigger rotation correction; a smaller value indicates greater sensitivity. A mapping relationship between commands and quantization parameters is established by combining the priority information of each subject region and image motion parameters. Through a pre-defined mapping rule base, the parsed user requirements are converted into specific quantization parameter values. 
For example, documentary-style stabilization is mapped to a jitter compensation intensity of 0.9, a smoothing filter coefficient of 0.8, and a rotation correction threshold of 0.5°, while preserving handheld vibration is mapped to a jitter compensation intensity of 0.6, a smoothing filter coefficient of 0.4, and a rotation correction threshold of 1.0°. The final output is a set of quantization parameters such as jitter compensation intensity, smoothing filter coefficient, and rotation correction threshold. Through semantic parsing and mapping, user commands are thus transformed into specific quantization parameters that guide adjustments during the image stabilization process, allowing the stabilization effect to be fine-tuned for different scenarios and user needs.
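The mapping rule base can be pictured as a lookup table keyed by the parsed intent. In the sketch below, the two presets use the values given in the example above, while the keyword matching is only a stand-in for the large model's semantic parsing; the class, dictionary, and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantizationParams:
    jitter_compensation_intensity: float
    smoothing_filter_coefficient: float
    rotation_correction_threshold_deg: float


# Rule base: preset values taken from the documentary and handheld examples.
PRESETS = {
    "documentary": QuantizationParams(0.9, 0.8, 0.5),
    "handheld": QuantizationParams(0.6, 0.4, 1.0),
}


def map_command(command):
    """Map a natural-language command to quantization parameters.

    Keyword matching stands in for semantic parsing by the large model.
    Returns None when no rule applies.
    """
    text = command.lower()
    for keyword, params in PRESETS.items():
        if keyword in text:
            return params
    return None


documentary = map_command("Keep a documentary-style smoothness")
handheld = map_command("Retain some handheld vibration")
```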

[0058] S613: Obtain the image of the region to be processed based on the preset rotation matrix and quantization parameters; Specifically, the image of the region to be processed refers to the region in the current frame that requires image stabilization correction (rotation and translation compensation). Unlike the main region image, it typically includes jitter-prone areas and background areas that need correction, and it is used for subsequent targeted image stabilization processing. Based on the preset rotation matrix, the regions of the current frame image that are offset by rotational jitter (i.e., areas requiring rotation correction) are identified. Combining the rotation correction threshold and the jitter compensation intensity in the quantization parameters, the regions requiring image stabilization are selected, such as regions whose rotation angle is greater than the rotation correction threshold and regions whose jitter offset is greater than the threshold corresponding to the compensation intensity. At the same time, based on the coordinate position information of the main region image, it is ensured that the region to be processed does not contain critical parts of the core main region, or the core main region is separately marked for priority protection. Finally, the region to be processed is segmented from the current frame image to form the image of the region to be processed, and its coordinate position information is recorded for subsequent processing.

[0059] The target frame image is expressed by the following formula:

$$I_{stable} = \mathrm{Warp}(I_t, R^{-1})$$

In the formula, $I_{stable}$ is the target frame image, $I_t$ is the current frame image, and R is the rotation matrix.

[0060] $I_{stable}$ is the target frame image after stabilization, i.e., the output after jitter compensation. $I_t$ is the current frame image, i.e., the image before stabilization. $R^{-1}$ is the inverse of the rotation matrix and represents the direction of the compensating rotation. Warp is the image transformation operation that realigns and corrects the current frame image according to the inverse rotation matrix. Simply put, this step applies the inverse of the rotation matrix to the current frame image, correcting and aligning it to eliminate the offset caused by shaking or motion during shooting, thereby obtaining a stable, jitter-free target frame image.
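As an illustration of the Warp operation, the sketch below handles only the simplest case: an in-plane rotation by angle θ about the image center, undone with nearest-neighbor sampling. Restricting R to a 2D rotation, the function name, and the zero fill value are assumptions made to keep the example short; a full 3D rotation would additionally need the camera intrinsics.

```python
import math


def warp_inverse_rotation(image, theta):
    """Apply the inverse of an in-plane rotation by `theta` about the image center.

    Uses backward mapping: each output pixel p is sampled from the input at
    R * p, so that input content ends up moved by R^-1. Output pixels whose
    source falls outside the frame stay 0 (uncovered area).
    """
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            sx = round(cos_t * dx - sin_t * dy + cx)
            sy = round(sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


current_frame = [[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]]
unchanged = warp_inverse_rotation(current_frame, 0.0)
compensated = warp_inverse_rotation(current_frame, math.pi / 2)
```

For small jitter angles on real frames, the corners of the output are left uncovered, which is the kind of missing region that the similarity check and content-aware repair are meant to catch.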

[0061] S614: Based on the current frame image, the image of the region to be processed, and the main body image, perform path planning to obtain the image stabilization processing path; Specifically, the image stabilization processing path refers to the optimal path planned based on the preceding steps, used to guide the image stabilization algorithm's processing. The large model analyzes the current frame image, the region to be processed, and the main subject region image to plan the image processing path, ensuring that each step in the image stabilization process is coherent and efficient. Path planning can rationally arrange processing steps, ensuring the efficiency and accuracy of image processing. Through path planning, unnecessary computations can be reduced, improving the overall efficiency and quality of image stabilization processing.

[0062] S615: Perform image stabilization processing on the current frame image according to the image stabilization processing path to obtain a preliminarily stabilized image.

[0063] Specifically, based on the planned image stabilization processing path, the current frame image is processed to obtain a preliminarily stabilized image. This ensures the processing follows the optimal path, maximizing both stabilization effectiveness and processing efficiency. Optimizing the image stabilization processing path makes the process more precise and efficient, ultimately resulting in a high-quality image.

[0064] In one embodiment, when running on an edge computing device, the large model adopts a quantized multi-bit integer model; when running on a cloud server, the large model adopts a floating-point precision model.

[0065] Specifically, edge computing devices refer to local end-device hardware such as drones, action cameras, mobile phones, or in-vehicle devices. These devices have limited computing power and memory and prioritize low latency and low power consumption. Quantized multi-bit integer models are models whose weights and/or activation values are converted from floating-point representation to integer representation, with integer bit widths such as INT8 and INT16, so that floating-point operations are replaced with integer arithmetic. Cloud servers refer to data center environments with stronger GPU or AI acceleration resources. Floating-point precision models are models that use floating-point precision such as FP16 and FP32 for inference, maintaining higher numerical expressive power. Using quantized integer models on edge devices can significantly reduce model size and computing power requirements and reduce inference latency, making image stabilization judgment, semantic parsing, and image stabilization processing more suitable for real-time video streams. Using floating-point precision models in the cloud can improve the stability and detail quality of semantic analysis and generative repair and reduce output fluctuations caused by numerical errors. This ensures system availability while maintaining the upper limit of the image stabilization effect, achieving a balance between performance and efficiency.
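To make the floating-point-to-integer conversion concrete, the following sketch shows symmetric per-tensor INT8 quantization of a weight list. This is one common scheme, chosen here as an assumption; the source does not state which quantization method is used, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by q * scale, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from the integer codes."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.25, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four (FP32), and the rounding error per weight is bounded by half the scale, which is the accuracy cost traded for the smaller, faster edge model.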

[0066] Example 2. Referring to Figure 2, this invention provides a video stabilization device based on a large video understanding model, the device comprising: a data acquisition module, used to acquire consecutive previous frames and the current frame from monocular video data; an image stabilization judgment module, used to input the previous several frames of images and the current frame image into the large model to determine whether the current frame image needs image stabilization processing; a depth acquisition module, used to perform depth estimation on the previous several frames and the current frame image according to a preset depth estimation algorithm when the current frame image needs image stabilization processing, and obtain the corresponding depth images respectively; a point cloud data acquisition module, used to acquire the corresponding point cloud data according to each of the depth images; a motion parameter acquisition module, used to acquire image motion parameters based on the point cloud data; and an image stabilization module, used to input user commands, the image motion parameters, and the current frame image into the large model, perform video image stabilization processing, and obtain the target frame image.

[0067] It should be noted that each module and unit in the video stabilization device based on the large video understanding model in this embodiment corresponds one-to-one with each step in the video stabilization method based on the large video understanding model in the aforementioned embodiment. Therefore, the specific implementation of this embodiment can refer to the implementation of the video stabilization method based on the large video understanding model in the aforementioned embodiment, and will not be repeated here.

[0068] Example 3. In addition, the video stabilization method based on a large video understanding model described in this embodiment of the invention with reference to Figure 1 can be implemented by an electronic device. Figure 3 shows a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present invention.

[0069] Electronic devices may include processors and memory storing computer program instructions.

[0070] Specifically, the processor may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement embodiments of the present invention.

[0071] The memory may include a large-capacity storage device for data or instructions. By way of example and not limitation, the memory may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk drive, a magneto-optical disk drive, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory may include removable or non-removable (or fixed) media. Where appropriate, the memory may be internal or external to a data processing device. In a particular embodiment, the memory is a non-volatile solid-state memory. In a particular embodiment, the memory includes a read-only memory (ROM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.

[0072] The processor reads and executes computer program instructions stored in memory to implement any of the video stabilization methods based on large video understanding models in the above embodiments.

[0073] In one example, the electronic device may also include a communication interface and a bus. For example, as shown in Figure 3, the processor 401, memory 402, and communication interface 403 are connected through bus 410 and communicate with each other over it.

[0074] The communication interface is mainly used to enable communication between various modules, devices, units and / or equipment in the embodiments of the present invention.

[0075] A bus, including hardware, software, or both, couples components of an electronic device together. By way of example and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, a bus may include one or more buses. While specific buses are described and illustrated in embodiments of the invention, the invention contemplates any suitable bus or interconnect.

[0076] Example 4. Furthermore, in conjunction with the video stabilization method based on a large video understanding model in the above embodiments, this invention can be implemented using a computer-readable storage medium. This computer-readable storage medium stores computer program instructions; when executed by a processor, these computer program instructions implement any of the video stabilization methods based on a large video understanding model in the above embodiments.

[0077] In summary, this invention synergistically integrates the semantic analysis capabilities of a large-scale video understanding model with the 3D information obtained from monocular depth estimation, providing a video stabilization method that combines semantic guidance with physical constraints. On one hand, the large-scale model can understand scene semantics, key subjects, and motion intentions based on multiple frames of images, avoiding misjudgments and incorrect compensations caused by relying solely on depth estimation in complex scenarios such as occlusion and rapid movement. On the other hand, by introducing depth images, point cloud data, and the resulting motion parameters, the stabilization process is based on real-world spatial motion relationships, enhancing the objectivity and robustness of the large-scale model's stabilization decisions. Through this collaborative approach, this invention overcomes the rigidity of traditional algorithms, and the large-scale model only intervenes in key stages such as stabilization determination and final processing. Combined with a lightweight depth estimation algorithm, it achieves a balance between stabilization effectiveness and computational efficiency. While ensuring real-time performance, it effectively avoids distortion or loss in the stabilized image, significantly improving the quality of video stabilization in complex application scenarios and greatly enhancing the method's applicability, providing a superior solution for video shooting in complex scenes.

[0078] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.

[0079] The functional blocks shown in the above-described structural diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the required tasks. The programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.

[0080] It should also be noted that the exemplary embodiments mentioned in this invention describe methods or systems based on a series of steps or apparatus. However, this invention is not limited to the order of the steps described above; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.

[0081] The above description is merely a specific embodiment of the present invention. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, modules, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. It should be understood that the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the protection scope of the present invention.

Claims

1. A video stabilization method based on a large video understanding model, characterized in that, The method includes: Acquire consecutive preceding frames and the current frame from monocular video data; Input the previous few frames and the current frame image into the large model to determine whether the current frame image needs image stabilization processing. If the current frame image needs image stabilization, then depth estimation is performed on the previous few frames and the current frame image according to a preset depth estimation algorithm to obtain the corresponding depth images respectively; Based on each depth image, obtain the corresponding point cloud data respectively; Based on the point cloud data, obtain the image motion parameters; The user command, the image motion parameters, and the current frame image are input into a large model for video stabilization processing to obtain the target frame image.

2. The method according to claim 1, characterized in that, The step of inputting the previous several frames and the current frame image into a large model to determine whether the current frame image needs image stabilization processing includes: Input the previous few frames and the current frame image into the large model to obtain the semantic analysis results; Based on the semantic analysis results and the preset motion type recognition rule library, the motion type in the current frame image is obtained; Based on the motion type, determine whether the current frame image requires image stabilization.

3. The method according to claim 1, characterized in that, The step of obtaining image motion parameters based on the point cloud data includes: Normalize each of the point cloud data to obtain each normalized point cloud data; Perform multi-plane fitting based on the normalized point cloud data to obtain the first fitting plane and the second fitting plane. Obtain semantic information provided by the large model, wherein the semantic information includes main region information; Based on the semantic information, the plane parameters of the first fitting plane and the second fitting plane are obtained, wherein the plane parameters include the number of data points and the plane area; The plane with the largest number of data points or the largest plane area is used as the principal plane of the first fitting plane and the second fitting plane, respectively denoted as the first principal plane and the second principal plane. The image motion parameters are obtained based on the first principal plane and the second principal plane.

4. The method according to claim 3, characterized in that, The step of obtaining the image motion parameters based on the first principal plane and the second principal plane includes: Obtain the first normal vector of the first principal plane and the second normal vector of the second principal plane, respectively; The rotation angles of the first principal plane and the second principal plane are calculated based on the first normal vector and the second normal vector. The image motion parameters are obtained based on the rotation angle and the preset rotation matrix.

5. The method according to claim 1, characterized in that, The step of inputting the user command, the image motion parameters, and the current frame image into a large model for video stabilization processing to obtain the target frame image includes: The user command, the image motion parameters, and the current frame image are input into the large model, and the current frame image is stabilized to obtain a preliminarily stabilized image. Based on the image detection algorithm, structural similarity analysis is performed between the initially stabilized image and the current frame image to calculate the similarity. Based on a preset similarity threshold, determine whether the image after initial stabilization is deformed or missing. If there is no deformation or missing parts, the image after preliminary stabilization is used as the target frame image; If deformation or missing parts exist, content-aware repair is performed on the initially stabilized image to obtain the target frame image.

6. The method according to claim 5, characterized in that, The step of inputting user commands, image motion parameters, and the current frame image into a large model, and performing image stabilization processing on the current frame image to obtain a preliminarily stabilized image includes: The user command, the image motion parameters, and the current frame image are input into the large model to obtain images of each main region. Based on the user instruction, each of the main regions and image motion parameters, the user instruction is semantically parsed and mapped to obtain quantization parameters. The user instruction includes natural language, and the quantization parameters include one or more of jitter compensation intensity, smoothing filter coefficient, and rotation correction threshold. The image of the region to be processed is obtained based on the preset rotation matrix and quantization parameters; Based on the current frame image, the image of the region to be processed, and the main body image, a path is planned to obtain the image stabilization processing path; The current frame image is stabilized according to the image stabilization processing path to obtain a preliminarily stabilized image.

7. The method according to claim 1, characterized in that, When running on an edge computing device, the large model adopts a quantized multi-bit integer model; when running on a cloud server, the large model adopts a floating-point precision model.

8. A video stabilization device based on a large video understanding model, characterized in that, The device includes: The data acquisition module is used to acquire consecutive previous frames and the current frame from monocular video data; The image stabilization judgment module is used to input the previous several frames of images and the current frame image into the large model to determine whether the current frame image needs image stabilization processing. The depth acquisition module is used to perform depth estimation on the previous few frames and the current frame image according to a preset depth estimation algorithm when the current frame image needs image stabilization processing, and obtain the corresponding depth images respectively. The point cloud data acquisition module is used to acquire the corresponding point cloud data according to each of the depth images. The motion parameter acquisition module is used to acquire image motion parameters based on the point cloud data. The image stabilization module is used to input user commands, the image motion parameters, and the current frame image into a large model, perform video image stabilization processing, and obtain the target frame image.

9. An electronic device, characterized in that it comprises: at least one processor, at least one memory, and computer program instructions stored in the memory, which, when executed by the processor, implement the method as described in any one of claims 1-7.

10. A storage medium storing computer program instructions thereon, characterized in that the method as described in any one of claims 1-7 is implemented when the computer program instructions are executed by a processor.