A dynamic object scene recognition method and system for compensating for camera self-motion

By constructing a motion field and combining it with inertial information to correct and adaptively modulate the visual background field, the stability and continuity issues of dynamic object scene perception in existing technologies are solved, enabling intuitive dynamic object scene cognition under conditions of high-speed motion and limited computing resources.

CN122244101A · Pending · Publication Date: 2026-06-19 · ZHEJIANG UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2026-05-21
Publication Date: 2026-06-19


IPC: G06T7/246; G06T7/269; G06T7/73; G06F18/25; G06V10/762; G06V10/82; G06V20/56

AI Tagging

Application Domain

Image analysis; Character and pattern recognition


AI Technical Summary

Technical Problem

Existing technologies struggle to compensate for the self-motion of high-speed robots in dynamic environments, and to generate intuitive and coherent dynamic object scene cognition results, without pre-trained semantic priors and with limited computational resources. Furthermore, existing methods are susceptible to lighting changes and missing textures, and traditional visual-inertial odometry schemes suffer from integration drift and have difficulty extracting dynamic points.

Method used

A motion field containing physical motion information is constructed, and inertial information is used to collaboratively correct the visual background field and adaptively modulate its thresholds. The inertial rotation angle and acceleration are extracted and smoothed. Multi-scale pyramid optical flow estimation is combined with epipolar geometry constraints to separate background points from outliers. A visual-inertial collaborative trust decision and dynamic threshold modulation are then performed to generate continuous dynamic event regions.

Benefits of technology

It achieves stable perception of dynamic scenes under complex motion conditions, generates continuous dynamic object scenes that conform to intuitive human cognition, reduces the interference of rapid camera movement with recognition results, suits mobile robot platforms with limited computing resources, and avoids both dynamic false alarms and detection failures in environments lacking visual texture.


Abstract

This invention discloses a dynamic object scene cognition method and system that compensates for camera self-motion. The invention constructs an optical flow motion field, using pyramid optical flow and the brightness constancy constraint to achieve continuous, holistic modeling of pixel-level motion states in complex scenes; it constructs a visual-inertial cooperative model of the dominant background motion, using inertial measurement unit data to verify and correct the visual background model, which solves the inaccurate background estimation that a vision-only scheme suffers under rapid motion or missing texture; it performs anomaly detection on a residual motion field obtained by self-motion compensation, and uses the robot's motion state computed from inertial data to adaptively adjust the dynamic judgment threshold, effectively suppressing high-frequency noise while preserving high dynamic sensitivity; and it performs cognitive clustering based on the spatial consistency of the residual field, aggregating discrete moving pixels into continuous dynamic event regions that conform to human observational intuition, which solves the problem that traditional methods only output sparse feature points and lack intuitive semantic expression.


Description

Technical Field

[0001] This invention belongs to the field of computer vision and image processing technology, and relates to a method and system for recognizing dynamic objects and scenes by compensating for the camera's own motion.

Background Technology

[0002] With the rapid development of mobile robots and teleoperation technologies, the ability of robots to autonomously explore and remotely operate in complex, unstructured environments has become particularly important. In real-world telepresence and autonomous navigation tasks, the environment is often no longer an ideal static scene, but rather filled with constantly changing and highly unpredictable dynamic objects. These dynamic objects are not only an important component of the scene, but their motion states are also crucial for robots to make safe obstacle avoidance and decision-making. However, when the robot itself is in motion, the image contains a coupling between the background movement caused by the camera's own motion and the motion of the dynamic objects. Therefore, in practical applications, in highly dynamic and uncontrolled real-world environments, real-time and accurate decoupling of these two motions remains a core challenge for robot environmental perception.

[0003] In handling dynamic environments, current mainstream methods tend to incorporate deep learning or semantic information to explicitly identify dynamic objects. For example, object detection networks (such as the YOLO series) or semantic segmentation networks (such as Mask R-CNN) are used to remove or mask regions identified as potentially dynamic. While these methods based on strong semantic priors are effective in structured scenarios, they have significant limitations: First, their core mechanisms heavily rely on predefined semantic categories and training data distribution. When unknown dynamic entities or targets with ambiguous semantic boundaries appear in the environment, the system often fails to identify them or makes misjudgments. Second, these methods typically involve significant computational overhead, relying on high-performance computing power, which is difficult to meet the stringent requirements of low power consumption and low latency for mobile robot edge deployment. Existing semantic methods usually assume relatively stable camera motion. When the robot performs high-speed maneuvers or rapid turns, image blurring and sudden changes in viewpoint can cause the detection network to fail, thus failing to maintain continuous understanding of the dynamic environment. Furthermore, current inertial assistance schemes typically perform double integration of acceleration to calculate camera translation, leading to full-screen dynamic false alarms when the robot experiences physical bumps.

[0004] Furthermore, in scenarios involving human-machine collaboration, the ultimate service recipient of environmental cognition is the human, not merely the simultaneous localization and mapping (SLAM) algorithm. Current work on dynamic object processing primarily focuses on the robustness of mapping algorithms, i.e., removing dynamic points as outliers or clustering sparse feature points to support backend pose optimization. However, such machine-optimized clustering results are often discrete and sparse, lacking the continuity and intuitiveness that align with human visual perception. Remote operators need visual feedback that intuitively and clearly presents the spatial structure and motion trends of dynamic objects, rather than a scattered set of feature points. Existing technologies lack a systematic solution capable of densely and coherently clustering and presenting dynamic objects, in a manner consistent with human visual perception, after compensating for the camera's own motion.

[0005] Therefore, how to achieve self-motion compensation for high-speed robots and generate intuitive and coherent dynamic object scene cognition results without the need for pre-trained semantic priors and under limited computational resources has become a technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0006] The technical problem this invention aims to solve is that, in existing methods for dynamic scene perception in mobile robots, pure vision-based solutions are susceptible to changes in lighting and missing textures, leading to feature tracking failures. Traditional visual-inertial odometry (VIO) solutions, which directly utilize IMU integration to remove dynamic points, suffer from integration drift and difficulty in extracting complete dynamic semantics. Furthermore, existing technologies often treat dynamic elements as discrete, anomalous feature points for removal, lacking a description of continuous dynamic events that aligns with human intuition.

[0007] To this end, the present invention provides a dynamic object scene cognition method and system that compensates for the camera's own motion. By constructing a motion field containing physical motion information and using inertial information to perform collaborative correction and adaptive threshold modulation on the visual background field, continuous and robust perception of motion events in dynamic scenes is achieved.

[0008] The technical solution adopted by the present invention to solve the above-mentioned technical problems is as follows:

[0009] This invention discloses a dynamic object scene recognition method that compensates for camera motion, comprising:

[0010] Extract the inertial rotation angle and acceleration from the inertial sensor and accelerometer and smooth them; calculate the dynamic threshold modulation factor based on the instantaneous acceleration and the smoothed acceleration; extract the set of feature points from the camera image.

[0011] Multi-scale pyramid optical flow estimation is performed on two adjacent frames to obtain the set of observed coordinates of feature points in the next frame;

[0012] Background points and outliers in the image are separated based on epipolar geometry constraints; the optimal affine matrix and the visual background inlier rate are obtained using the background points;

[0013] The visual rotation angle is extracted from the optimal affine matrix, and the minimum periodic difference between it and the smoothed inertial rotation angle (the visual-inertial angle difference) is calculated. Based on this minimum periodic difference, the visual inlier rate, and preset thresholds, a one-way override trust decision for visual-inertial coordination is made to obtain the final system reference rotation angle.

[0014] The magnitude and orientation angle of the two-dimensional optical flow residual vector are calculated using the optimal affine matrix, and the residual candidate set is extracted using outliers and residual magnitudes.

[0015] The residual candidate set is statistically analyzed, and the average residual magnitude and average motion direction within the grid are calculated.

[0016] By using a dynamic threshold modulation factor to relax the thresholds for the basic grid amplitude and residual judgment, candidate dynamic event regions are extracted. The candidate dynamic event regions are then subject to final judgment to remove static noise and output realistic dynamic objects.

[0017] Furthermore, the step of extracting the inertial rotation angle and acceleration from the inertial sensor and accelerometer and smoothing them, and calculating the dynamic threshold modulation factor based on the instantaneous acceleration and the smoothed acceleration, includes:

[0018] Discrete-time integration is performed on the Z-axis angular velocity of the gyroscope of the sensor to obtain the original inertial rotation angle at the current moment, and exponential smoothing filter is introduced to obtain the final inertial reference rotation angle.

[0019] The triaxial acceleration data is extracted to calculate the acceleration magnitude. A first-order low-pass filter is introduced to smooth the instantaneous acceleration magnitude, and the smoothed reference acceleration is updated and obtained. The abrupt difference between the current instantaneous acceleration magnitude and the smoothed reference acceleration is calculated as the dynamic threshold modulation factor.

[0020] Furthermore, the separation of background points and outliers in an image based on epipolar geometry constraints includes: transforming the two-dimensional pixel coordinates of the preceding and following frames into homogeneous representations $\tilde{p}_i$ and $\tilde{p}'_i$ by appending a constant dimension; introducing, in the homogeneous representation, a fundamental matrix $F$ that reflects the epipolar geometric relationship and describes the epipolar constraint between the two image frames; constructing the Sampson distance error equation from this constraint to measure how far a point pair deviates from a static rigid scene; and obtaining the epipolar constraint error $e_i$ of the $i$-th point pair;

[0021] When the epipolar constraint error is less than the preset epipolar error threshold, the point is considered a background point; otherwise, it is considered an outlier.

[0022] Furthermore, obtaining the optimal affine matrix representing the camera's true motion and the visual background inlier rate using background points includes: constructing a two-dimensional affine transformation model on the subset of background points, and solving for the optimal affine matrix that represents only the camera's true motion by minimizing a robust reprojection error with the random sample consensus (RANSAC) algorithm.

[0023] Set a reprojection error threshold $\tau_r$. If a point satisfies $\lVert p'_i - A^{*}\tilde{p}_i \rVert < \tau_r$, the point is strictly determined to be an affine inlier, and the visual background inlier rate is calculated from the number of affine inliers. Here $p'_i$ is the actual observed coordinate of the feature point in the next frame, $A^{*}$ is the optimal affine matrix, and $\tilde{p}_i$ is the homogeneous pixel coordinate in the previous frame.

[0024] Furthermore, the step of making a one-way override trust decision based on the minimum periodic difference of the visual-inertial angle, the visual inlier rate, and preset thresholds to obtain the final system reference rotation angle includes: accepting the inertial output by default; and, when the visual inlier rate is greater than a preset quality threshold and the minimum periodic difference of the visual-inertial angle is greater than a preset divergence threshold, determining that vision has captured more accurate local motion and overriding the inertial output with the visual rotation angle.

[0025] Furthermore, the calculation of the magnitude and orientation angle of the two-dimensional optical flow residual vector using the optimal affine matrix includes:

[0026] Motion compensation is performed on the feature points of the previous frame using the optimal affine matrix to calculate the theoretical predicted position. The theoretical predicted position is then subtracted from the actual observed position to generate the core two-dimensional optical flow residual vector. The amplitude and orientation angle of this residual vector are then extracted.

[0027] Furthermore, the extraction of the residual candidate set using outliers and residual magnitudes includes:

[0028] Among the outliers obtained by separating background points from outliers, any feature point whose residual magnitude reaches the basic motion magnitude threshold is added to the residual candidate set. When the number of candidate points is less than the minimum confidence limit for connected-component clustering, global residual screening is used instead.

[0029] Furthermore, the step of using a dynamic threshold modulation factor to relax the thresholds for basic grid amplitude and residual determination to extract candidate dynamic event regions includes: multiplying the dynamic threshold modulation factor by the basic grid amplitude and residual determination threshold respectively to obtain relaxed grid amplitude thresholds and residual determination thresholds; constructing a binarized active grid mask using the relaxed residual determination thresholds; indicating that the mask of a grid cell is active if the average residual amplitude of the grid cell is greater than or equal to the relaxed residual determination threshold; and aggregating spatially adjacent active grids to extract candidate dynamic event regions.
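
The relaxation, masking, and aggregation described in [0029] can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `active_grid_mask` and `candidate_regions`, the use of 4-connectivity for "spatially adjacent", and the example numbers are all assumptions made for illustration.

```python
from collections import deque


def active_grid_mask(mean_residual, base_threshold, modulation):
    """Binarize a grid of mean residual magnitudes.

    The base threshold is multiplied by the modulation factor (relaxation),
    and a cell is active when its mean residual is >= the relaxed threshold.
    """
    relaxed = base_threshold * modulation
    return [[1 if v >= relaxed else 0 for v in row] for row in mean_residual]


def candidate_regions(mask):
    """Group active cells into regions (4-connectivity is an assumption).

    Returns a list of regions, each a list of (row, col) grid cells.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            region = []
            while queue:  # breadth-first flood fill over active neighbours
                cr, cc = queue.popleft()
                region.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(region)
    return regions
```

With a base threshold of 1.0 and a modulation factor of 2.0, only cells whose mean residual is at least 2.0 stay active, so a brief global disturbance that raises every residual slightly does not light up the whole grid.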

[0030] Furthermore, the final decision on the candidate dynamic event region includes:

[0031] Extract the overall average motion direction of each candidate region, and compute a quantile of the residual magnitudes of all feature points within the region. The region is determined to be dynamic when the decision condition is satisfied; this condition involves the dynamically relaxed residual judgment threshold, the system reference rotation angle, the threshold on the directional deviation, and the minimum magnitude required for the directional deviation test.

[0032] On the other hand, a dynamic object scene recognition system for compensating for camera motion to implement the method is also disclosed, comprising:

[0033] The hardware data preprocessing module is used to extract basic physical parameters from hardware sensors in advance, extract the inertial rotation angle and acceleration of inertial sensors and accelerometers and perform smoothing, calculate the dynamic threshold modulation factor based on instantaneous acceleration and smoothed acceleration, and extract the feature point set in the camera image.

[0034] The background main motion estimation module performs multi-scale pyramid optical flow estimation on two adjacent frames and separates background points from outliers based on epipolar geometry constraints; it uses the background points to obtain the optimal affine matrix representing the camera's true motion and the visual background inlier rate.

[0035] The visual-inertial coordination decision module measures the visual-inertial angle difference to obtain the minimum periodic difference of the visual-inertial angle. Based on this minimum periodic difference, the visual inlier rate, and preset thresholds, it performs a one-way override trust decision for visual-inertial coordination to obtain the final system reference rotation angle.

[0036] The cognitive decision-making module uses a threshold adaptive mechanism to piece together a set of discrete points into a continuous dynamic object block that conforms to human observation intuition.

[0037] The beneficial effects of this invention are as follows:

[0038] By modeling pixel-level motion information as a whole into a motion field, dynamic events are expressed in the form of continuous regions, avoiding the description of dynamic information with sparse feature points, which is more in line with the intuitive way humans perceive motion phenomena. By performing affine modeling of the main camera motion and combining it with inertial information for auxiliary judgment, epipolar constraints strictly separate pure background points. The judgment threshold is adaptively modulated using inertial data, which effectively reduces the interference of the camera's rapid self-motion on the dynamic event recognition results and improves the stability of the system under complex motion conditions.

[0039] This invention utilizes a method for constructing a residual field using background motion compensation. It does not rely on predefined semantic categories or highly complex deep learning models, and achieves stable perception of the dynamic environment while ensuring real-time performance. It is suitable for mobile robot platforms with limited computing resources.

[0040] This invention extracts the first-order abrupt difference of the smooth acceleration modulus and avoids full-screen dynamic false alarms from the robot during physical bumps by dynamically relaxing the downstream detection threshold.

[0041] This invention assigns trust weights to vision and inertia based on visual confidence. When encountering environments with missing visual textures (rapid robot movement), the weights tend to trust inertia more, avoiding the weight failure phenomenon caused by the current vision-inertia fusion scheme that subtracts visual prediction from inertial detection.

Description of the Drawings

[0042] Figure 1 is a flowchart of the present invention;

[0043] Figure 2 illustrates the hierarchical processing of pyramid optical flow in the present invention;

[0044] Figure 3 is a schematic diagram of the threshold mechanism under different robot movement speeds in the present invention;

[0045] Figure 4 is a schematic diagram of the residual field and local anomaly clustering in the present invention;

[0046] Figure 5 is a schematic diagram of the dynamic determination by visual-inertial linkage in the present invention.

Detailed Implementation

[0047] To make the objectives, technical solutions, and advantages of the present invention clearer, the specific technical solutions of the present invention will be clearly and completely described below in conjunction with the accompanying drawings. Obviously, the described examples are only a part of the examples of the present invention. All other examples obtained based on the examples of the present invention without inventive effort are within the protection scope of the present invention.

[0048] As shown in Figure 1, the overall process of the method of the present invention includes the following steps. The present invention takes continuously acquired RGB image sequences and synchronized inertial measurement data as input, constructs a pixel-level motion model, and introduces a self-motion compensation mechanism to achieve stable perception of dynamic objects.

[0049] The method of the present invention will be described in detail below.

[0050] Step 1: Synchronous acquisition and kinematic filtering of data from multiple heterogeneous sensors:

[0051] This step aims to extract fundamental physical parameters (rotation angle, translation abrupt change factor, image feature points) from the hardware sensors (inertial sensors and cameras) beforehand. These prior parameters will serve as the underlying inputs for subsequent motion compensation, adaptive threshold adjustment, and dynamic feature tracking. Specifically:

[0052] 1.1 Acquisition of angular velocity and rotational integration using inertial sensor:

[0053] To obtain the camera's absolute yaw angle change in the physical world, and thus provide a reliable rotational reference if vision subsequently fails, gyroscope data is first extracted from the inertial sensor. Discrete-time integration of the gyroscope's Z-axis angular velocity $\omega_z$ yields the raw inertial rotation angle $\theta_t$ at the current moment:

[0054] $\theta_t = \theta_{t-1} + \omega_z \, \Delta t$

[0055] In the formula, $\theta_t$ is the raw inertial rotation angle at time $t$;

[0056] $\theta_{t-1}$ is the raw inertial rotation angle at time $t-1$;

[0057] $\omega_z$ is the angular velocity about the gyroscope's Z-axis;

[0058] $\Delta t$ is the sampling time interval.

[0059] Since the raw integration result $\theta_t$ is highly susceptible to high-frequency mechanical vibration noise, an exponential smoothing filter is introduced to ensure numerical smoothness, giving the final inertial reference rotation angle $\hat{\theta}_t$:

[0060] $\hat{\theta}_t = \alpha \, \hat{\theta}_{t-1} + (1 - \alpha) \, \theta_t$

[0061] In the formula, $\hat{\theta}_t$ is the current inertial reference rotation angle after smoothing;

[0062] $\hat{\theta}_{t-1}$ is the smoothed inertial reference rotation angle from the previous moment;

[0063] $\alpha$ is a preset exponential smoothing coefficient (between 0 and 1) that controls the relative weighting of historical data and current data.
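
Step 1.1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the patent's implementation: the function name `integrate_and_smooth_yaw` is hypothetical, both angles are assumed to start at zero, and the smoothing coefficient is assumed to weight the previous smoothed value (the text only says it balances historical and current data).

```python
def integrate_and_smooth_yaw(omega_z_samples, dt, alpha=0.8):
    """Integrate Z-axis angular velocity, then exponentially smooth the angle.

    omega_z_samples: gyroscope Z-axis angular velocities (rad/s)
    dt:              sampling interval (s)
    alpha:           weight on the previous smoothed value (assumed convention)
    Returns (raw_angles, smoothed_angles), both in radians.
    """
    raw, smoothed = [], []
    theta = 0.0      # raw integrated rotation angle (assumed initial value)
    theta_hat = 0.0  # smoothed inertial reference angle (assumed initial value)
    for omega_z in omega_z_samples:
        theta += omega_z * dt                               # discrete-time integration
        theta_hat = alpha * theta_hat + (1.0 - alpha) * theta  # exponential smoothing
        raw.append(theta)
        smoothed.append(theta_hat)
    return raw, smoothed
```

A larger `alpha` under this convention yields a steadier reference at the cost of more lag behind the raw integrated angle.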

[0064] 1.2 Acceleration smoothing based on low-pass filtering:

[0065] To suppress high-frequency measurement noise from the accelerometer and extract the translational reference state of the camera motion, the triaxial acceleration data $(a_x, a_y, a_z)$ is extracted and the acceleration modulus $a_t$ is calculated:

[0066] $a_t = \sqrt{a_x^2 + a_y^2 + a_z^2}$

[0067] In the formula, $a_t$ is the instantaneous acceleration modulus at the current moment;

[0068] $a_x$, $a_y$, $a_z$ are the three-axis acceleration components output by the accelerometer.

[0069] Subsequently, a first-order low-pass filter is introduced to smooth the instantaneous acceleration modulus $a_t$ and update the smoothed reference acceleration $\bar{a}_t$:

[0070] $\bar{a}_t = \beta \, \bar{a}_{t-1} + (1 - \beta) \, a_t$

[0071] In the formula, $\bar{a}_t$ is the smoothed reference acceleration at time $t$;

[0072] $\bar{a}_{t-1}$ is the smoothed reference acceleration at time $t-1$;

[0073] $\beta$ is a preset acceleration smoothing coefficient (0.6 in this embodiment), used to control the update rate and smoothness of the reference state.
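
As a rough illustration of step 1.2, the following Python sketch computes the acceleration modulus and applies a first-order low-pass filter. The function name `smooth_acceleration`, the initialization of the reference with the first sample, and the convention that the coefficient weights the previous reference are assumptions; only the value 0.6 comes from the embodiment.

```python
import math


def smooth_acceleration(samples, beta=0.6):
    """Return a list of (instantaneous modulus, smoothed reference) pairs.

    samples: iterable of (ax, ay, az) accelerometer readings
    beta:    weight on the previous reference value (assumed convention)
    """
    out = []
    a_ref = None
    for ax, ay, az in samples:
        a_inst = math.sqrt(ax * ax + ay * ay + az * az)  # acceleration modulus
        if a_ref is None:
            a_ref = a_inst  # assumption: start the filter at the first sample
        else:
            a_ref = beta * a_ref + (1.0 - beta) * a_inst  # first-order low-pass
        out.append((a_inst, a_ref))
    return out
```

Because only the modulus is filtered and nothing is integrated twice, the reference state cannot accumulate positional drift.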

[0074] 1.3 Drift-free translational mutation detection and modulation factor generation:

[0075] To capture the instantaneous magnitude of camera translation and dynamically raise the threshold for subsequent dynamic object detection while the camera accelerates or brakes sharply, thereby preventing false alarms across the entire screen, the abrupt difference between the current instantaneous acceleration modulus $a_t$ and the smoothed reference state $\bar{a}_t$ is calculated and mapped to the dynamic threshold modulation factor $\gamma_t$:

[0076] $\gamma_t = \begin{cases} \lambda, & \lvert a_t - \bar{a}_t \rvert > \tau_a \\ 1, & \text{otherwise} \end{cases}$

[0077] In the formula, $\gamma_t$ is the dynamic threshold modulation factor;

[0078] $\tau_a$ is the preset translation judgment threshold;

[0079] $\lambda$ is the threshold relaxation factor. It is passed to downstream steps to amplify the anomaly judgment threshold when a sharp translation is detected, thereby accommodating global motion blur within a short period of time.
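
One plausible reading of step 1.3 is a simple switch: if the jump between the instantaneous and smoothed acceleration exceeds the translation judgment threshold, the relaxation factor is emitted, otherwise the thresholds are left unchanged. The sketch below encodes that reading; the function name `modulation_factor`, the piecewise form, and the default numbers are assumptions and do not come from the patent text.

```python
def modulation_factor(a_inst, a_ref, translation_threshold=1.5, relaxation=2.0):
    """Map the abrupt acceleration difference to a threshold multiplier.

    a_inst:                instantaneous acceleration modulus
    a_ref:                 smoothed reference acceleration
    translation_threshold: hypothetical translation judgment threshold
    relaxation:            hypothetical threshold relaxation factor (> 1)
    """
    jump = abs(a_inst - a_ref)  # abrupt difference, no integration involved
    return relaxation if jump > translation_threshold else 1.0
```

Downstream thresholds multiplied by this value become harder to exceed during a bump or hard braking, which is how a full-screen false alarm would be avoided.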

[0080] 1.4 Visual Image Acquisition and Sparse Feature Point Initialization:

[0081] To select the pixels with the richest texture and easiest tracking in the image as anchor points for subsequent optical flow calculations, sparse corner features are extracted from the current frame image. These points constitute the initial set of tracking points $P = \{p_i\}$, where $p_i$ is the two-dimensional pixel coordinate of the $i$-th feature point.

[0082] Step 2: Background principal motion estimation based on epipolar geometric constraints

[0083] Existing methods often include feature points of dynamic objects in the calculation when estimating camera motion, leading to significant deviations in the calculated camera motion model. This step proposes a two-layer mechanism: the first layer (step 2.2) uses epipolar geometry as a constraint to separate a pure set of background points; the second layer (step 2.3) estimates camera motion only within this pure background point set, fundamentally cutting off the path by which dynamic objects contaminate the background model.

[0084] 2.1 Multi-scale optical flow estimation of image pyramids:

[0085] To obtain the actual pixel displacement of feature points between two adjacent frames, a coarse-to-fine multi-scale pyramid optical flow algorithm is used, as shown in Figure 2. Based on the assumption of local brightness constancy, the motion vector $d_i$ of each feature point $p_i$ is calculated, which gives the set of observed coordinates of the feature points in the next frame. The calculation formula is as follows:

[0086] $p'_i = p_i + d_i$

[0087] In the formula, $p'_i$ is the actual observed coordinate of the feature point in the next frame;

[0088] $p_i$ is the coordinate of the feature point in the current frame;

[0089] $d_i$ is the motion vector of the feature point.

[0090] 2.2 Background and outlier separation based on epipolar geometry constraints:

[0091] To use the epipolar geometry of 3D spatial projection to initially filter out feature points that clearly do not conform to the rigid motion of the background, the 2D pixel coordinates are transformed into homogeneous representations $\tilde{p}_i$ and $\tilde{p}'_i$ by appending a constant dimension. The fundamental matrix $F$ is introduced to describe the epipolar geometric constraint between the two image frames. An error equation is constructed based on the Sampson distance; this error measures the degree to which a point pair deviates from a static, rigid scene:

[0092] $e_i = \dfrac{\left(\tilde{p}_i^{\prime\top} F \, \tilde{p}_i\right)^2}{(F\tilde{p}_i)_1^2 + (F\tilde{p}_i)_2^2 + (F^{\top}\tilde{p}'_i)_1^2 + (F^{\top}\tilde{p}'_i)_2^2}$

[0093] In the formula, $e_i$ is the epipolar constraint error of the $i$-th point pair;

[0094] $\tilde{p}_i$ and $\tilde{p}'_i$ are the homogeneous pixel coordinates in the preceding and following frames;

[0095] $F$ is the fundamental matrix that reflects the epipolar geometric relationship, and $(\cdot)_k$ denotes the $k$-th component of a vector.

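[0096] Set the epipolar error threshold $\tau_e$, in pixels. When the epipolar constraint error $e_i$ satisfies $e_i < \tau_e$, the point is determined to conform to the rigid background law and is marked with mask $m_i = 1$; otherwise, the point deviates from the background motion and is marked with mask $m_i = 0$:

[0097] $m_i = \begin{cases} 1, & e_i < \tau_e \\ 0, & \text{otherwise} \end{cases}$

[0098] In the formula, $m_i$ is the background determination mask of the $i$-th point;

[0099] $\tau_e$ is the set epipolar error threshold.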
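
The first-layer purification of step 2.2 can be sketched in pure Python using the standard Sampson distance for a fundamental matrix. This is a minimal sketch: the function names, the treatment of a zero denominator as an outlier, and the mask values 1 (background) and 0 (outlier) are assumptions, and how the fundamental matrix itself is estimated is outside the scope of the example.

```python
def _mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def _transpose(M):
    return [[M[c][r] for c in range(3)] for r in range(3)]


def sampson_error(F, p_prev, p_next):
    """Sampson distance of one correspondence under fundamental matrix F."""
    x = [p_prev[0], p_prev[1], 1.0]    # homogeneous point, previous frame
    xp = [p_next[0], p_next[1], 1.0]   # homogeneous point, next frame
    Fx = _mat_vec(F, x)
    Ftxp = _mat_vec(_transpose(F), xp)
    num = sum(xp[k] * Fx[k] for k in range(3)) ** 2
    den = Fx[0] ** 2 + Fx[1] ** 2 + Ftxp[0] ** 2 + Ftxp[1] ** 2
    # Assumption: a degenerate pair (zero denominator) is treated as an outlier.
    return num / den if den > 0.0 else float("inf")


def background_mask(F, pts_prev, pts_next, tau_e):
    """1 for points consistent with a static rigid scene, 0 for outliers."""
    return [1 if sampson_error(F, p, q) < tau_e else 0
            for p, q in zip(pts_prev, pts_next)]
```

For a camera translating purely along its X axis, `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]` forces matched points to keep the same image row, so a point that also moves vertically receives a large error and is masked out.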

[0100] 2.3 Affine Model Construction and Parameter Estimation for Background Principal Motion:

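[0101] After potential dynamic points are stripped away, this step uses only the pure background point set $\mathcal{B}$ selected by the mask, so that the camera's 2D motion model is computed accurately and is not contaminated by dynamic objects. A two-dimensional affine transformation model is constructed on this subset. By minimizing a robust reprojection error with the random sample consensus (RANSAC) algorithm, the optimal affine matrix $A^{*}$, which represents only the true motion of the camera itself, is obtained:

[0102] $A^{*} = \arg\min_{A} \sum_{i \in \mathcal{B}} \rho\left(\lVert p'_i - A\,\tilde{p}_i \rVert\right), \qquad A = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{bmatrix}$

[0103] In the formula, $A^{*}$ is the optimal affine matrix obtained by solving, $p'_i$ is the observed coordinate in the next frame, and $\tilde{p}_i$ is the homogeneous pixel coordinate in the previous frame;

[0104] $a_{11}, a_{12}, a_{21}, a_{22}, t_x, t_y$ are the variables of the two-dimensional affine transformation matrix;

[0105] $\mathcal{B}$ is the set of background points that have passed the first layer of purification;

[0106] $\rho(\cdot)$ is the robust kernel function of random sample consensus.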

[0107] Set the reprojection error threshold $\tau_r$, in pixels. If a point satisfies $\lVert p'_i - A^{*}\tilde{p}_i \rVert < \tau_r$, where $A^{*}$ is the optimal affine matrix, the point is strictly determined to be an affine inlier, indicating that it fully obeys the planar motion law of the background. Based on the number of affine inliers, the visual background inlier rate $\eta$ is calculated:

[0108] $\eta = \dfrac{N_{\mathrm{in}}}{N_{\mathrm{bg}}}$

[0109] In the formula, $\eta$ is the visual background inlier rate, reflecting the reliability of the current visual estimation model;

[0110] $N_{\mathrm{bg}}$ is the total number of pure background points used in the calculation;

[0111] $N_{\mathrm{in}}$ is the number of affine inliers that ultimately satisfy the reprojection error constraint.
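
A simplified, dependency-free Python sketch of step 2.3 follows. It fits an exact affine map to three random correspondences per iteration and keeps the hypothesis with the most inliers; it omits the final least-squares refinement that production RANSAC implementations usually add. The function names, the iteration count, the fixed seed, and the default threshold are assumptions for illustration.

```python
import math
import random


def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def affine_from_three(src, dst):
    """Exact 2x3 affine matrix mapping three src points onto three dst points."""
    M = [[x, y, 1.0] for x, y in src]
    d = _det3(M)
    if abs(d) < 1e-9:          # collinear sample, no unique affine map
        return None
    rows = []
    for k in range(2):         # solve separately for the x' and y' rows
        rhs = [q[k] for q in dst]
        row = []
        for col in range(3):   # Cramer's rule
            Mc = [r[:] for r in M]
            for i in range(3):
                Mc[i][col] = rhs[i]
            row.append(_det3(Mc) / d)
        rows.append(row)
    return rows


def apply_affine(A, p):
    return (A[0][0] * p[0] + A[0][1] * p[1] + A[0][2],
            A[1][0] * p[0] + A[1][1] * p[1] + A[1][2])


def ransac_affine(src, dst, tau_r=1.0, iters=200, seed=0):
    """Return (best affine matrix, inlier rate) over the given background points."""
    n = len(src)
    if n < 3:
        return None, 0.0
    rng = random.Random(seed)
    best_A, best_count = None, 0
    for _ in range(iters):
        idx = rng.sample(range(n), 3)
        A = affine_from_three([src[i] for i in idx], [dst[i] for i in idx])
        if A is None:
            continue
        count = sum(1 for i in range(n)
                    if math.dist(apply_affine(A, src[i]), dst[i]) < tau_r)
        if count > best_count:
            best_A, best_count = A, count
    return best_A, best_count / n
```

The returned inlier rate is the quantity that later acts as the confidence of the visual estimate: a low value signals that even the purified background points do not agree on a single planar motion.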

[0112] Step 3: Visual-Inertial Collaborative Trust Decision and Optical Flow Residual Vector Construction

[0113] This step compares the inertial sensor data from Step 1 with the visual model from Step 2 to decide whether to trust the camera or the inertial sensor (preventing visual tracking from failing when the robot turns its head quickly). It then uses self-motion compensation to isolate the residuals that belong only to moving objects, providing direct evidence for subsequent detection.

[0114] 3.1 Measurement of the visual-inertial angle difference:

[0115] To quantify the discrepancy between the rotation angle estimated purely visually and the rotation angle obtained from the hardware integration of the inertial sensor, from the affine matrix... Decompose the visual rotation angle To avoid sign jumps at the angle boundaries, modulo operation is used to define the minimum period difference of the apparent angle. :

[0116]

[0117] In the formula, —This represents the minimum period difference of the viewing angle;

[0118] — Rotation angle calculated for visual processing;

[0119] —The inertial reference rotation angle obtained in step 1.1.
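A minimal Python sketch of the angle comparison in 3.1 follows. The text does not state how the rotation angle is decomposed from the affine matrix, so reading it as `atan2` of the first column is an assumption that holds when the matrix is close to a rotation plus translation. The function names are hypothetical.

```python
import math

def rotation_from_affine(A):
    """Rotation angle (rad) of the linear part of a 2x3 affine matrix.

    Assumes the linear part is close to a pure rotation (little shear or
    anisotropic scale), so the first column is roughly (cos t, sin t).
    """
    return math.atan2(A[1][0], A[0][0])

def min_period_diff(theta_vis, theta_imu):
    """Smallest absolute difference between two angles, wrapped into [0, pi]."""
    wrapped = (theta_vis - theta_imu + math.pi) % (2.0 * math.pi) - math.pi
    return abs(wrapped)
```

For example, angles of +179° and -179° differ by 358° numerically but by only 2° on the circle, which is the value the wrapped difference returns.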

[0120] The prediction results describe the background motion that a pixel should produce under the assumption of a static scene.

[0121] 3.2 One-way coverage trust decision for visual-inertial collaboration:

[0122] To decide dynamically which rotation angle serves as the system reference when visual tracking is on the edge of failure, the system uses the robust inertial output by default. If and only if the visual inlier ratio $\eta$ is greater than the preset quality threshold $\tau_{\eta}$ and the visual-inertial angle difference $\Delta\theta$ is greater than the divergence threshold $\tau_{\theta}$, the system determines that vision has captured a more precise local motion and overwrites the inertial output $\theta_{imu}$ with the visual rotation angle $\theta_{vis}$, obtaining the final system reference rotation angle $\theta_{ref}$:

[0123] $$\theta_{ref} = \begin{cases} \theta_{vis}, & \eta > \tau_{\eta} \ \text{and} \ \Delta\theta > \tau_{\theta} \\ \theta_{imu}, & \text{otherwise} \end{cases}$$

[0124] In the formula, $\theta_{ref}$ is the final system reference rotation angle;

[0125] $\tau_{\theta}$ is the preset angle divergence threshold;

[0126] $\tau_{\eta}$ is the preset confidence threshold for the inlier ratio.
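The one-way decision rule of 3.2 reduces to a single conditional. The sketch below is illustrative only; the function name and the default threshold values are hypothetical and are not taken from the source.

```python
def select_reference_angle(theta_imu, theta_vis, inlier_ratio, angle_diff,
                           quality_thresh=0.7, divergence_thresh=0.05):
    """Return the system reference rotation angle.

    The inertial angle is the default. Vision overrides it only when the
    visual model is reliable (high inlier ratio) AND disagrees with the
    inertial angle by more than the divergence threshold.
    """
    if inlier_ratio > quality_thresh and angle_diff > divergence_thresh:
        return theta_vis
    return theta_imu
```

The override is one-way: a poor inlier ratio or a small disagreement both leave the inertial output untouched.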

[0127] The residual motion field mainly reflects the motion deviation caused by dynamic objects or local non-rigid structures, and is the core input for subsequent dynamic discrimination.

[0128] 3.3 Self-motion compensation and optical flow residual vector construction:

[0129] To remove the global background displacement caused by camera movement and restore the independent motion of objects in the image, the optimal affine matrix $A^{*}$ is used to apply motion compensation to the feature points of the previous frame, giving the theoretically predicted positions $\hat{q}_i = A^{*}\tilde{p}_i$. The theoretically predicted positions are subtracted from the actual observed positions to generate the core two-dimensional optical flow residual vector $r_i$:

[0130] $$r_i = q_i - \hat{q}_i$$

[0131] In the formula, $r_i$ is the two-dimensional optical flow residual vector of the $i$-th pixel;

[0132] $q_i$ are the actual observed coordinates;

[0133] $\hat{q}_i$ are the theoretically predicted coordinates calculated using the camera's self-motion model.

[0134] Subsequently, the magnitude $m_i = \left\| r_i \right\|$ and the direction angle $\varphi_i$ of the residual vector are extracted.
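As an illustration of 3.3, the residual, its magnitude and its direction angle can be computed in a few NumPy lines. The function name and the sample coordinates are hypothetical, and taking the direction angle as `arctan2` of the residual components is an assumption.

```python
import numpy as np

def flow_residuals(A, prev_pts, obs_pts):
    """Residual = observed position minus the affine-predicted position.

    A        : (2, 3) affine matrix describing the camera's own motion
    prev_pts : (N, 2) feature coordinates in the previous frame
    obs_pts  : (N, 2) tracked coordinates in the next frame
    """
    prev_h = np.hstack([prev_pts, np.ones((len(prev_pts), 1))])
    predicted = prev_h @ A.T
    residual = obs_pts - predicted
    magnitude = np.linalg.norm(residual, axis=1)
    direction = np.arctan2(residual[:, 1], residual[:, 0])
    return residual, magnitude, direction

# Camera motion: pure 2-pixel shift in x. The first point follows it exactly,
# the second moves an extra (3, 4) pixels on its own.
A_cam = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]])
prev_demo = np.array([[10.0, 10.0], [20.0, 20.0]])
obs_demo = np.array([[12.0, 10.0], [25.0, 24.0]])
res_demo, mag_demo, dir_demo = flow_residuals(A_cam, prev_demo, obs_demo)
```

A background point ends up with a residual near zero, while the independently moving point keeps exactly its own displacement.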

[0135] 3.4 Active Residual Extraction Combined with Epipolar Constraints:

[0136] To use the already computed outlier attribute as prior knowledge, pre-identifying high-probability dynamic feature points and preventing omissions, this step uses the outliers obtained in step 2.2. As shown in step 2.2, these points do not conform to the rigid projection law and are potential dynamic targets. The system therefore prioritizes them when extracting the candidate residual set $\mathcal{S}_{epi}$:

[0137] $$\mathcal{S}_{epi} = \left\{\, i \;\middle|\; M_{bg}(i) = 0,\ m_i \ge T_m \,\right\}$$

[0138] In the formula, $\mathcal{S}_{epi}$ is the set of residual features extracted based on epipolar priors;

[0139] $m_i$ is the residual magnitude of the feature point;

[0140] $T_m$ is the basic motion amplitude threshold;

[0141] $M_{bg}$ is the mask for background determination, with $M_{bg}(i) = 0$ marking a point that was not accepted as background.

[0142] To ensure robustness in extreme cases, when the number of candidates satisfies $\left| \mathcal{S}_{epi} \right| < N_{min}$, where $N_{min}$ is the minimum confidence count for connected component clustering (typically 5 to 15), the epipolar prior information is insufficient. In this case, the system switches to a global residual screening strategy to prevent missed detection of moving objects.
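The prior-guided extraction and its fallback can be sketched as below. The names are hypothetical, and the reading of "global residual screening" as applying the same magnitude threshold to every feature point regardless of the epipolar label is an assumption, since the text does not spell the strategy out.

```python
import numpy as np

def candidate_residual_indices(magnitude, is_outlier, base_thresh, min_count=10):
    """Indices of feature points kept as dynamic candidates.

    magnitude  : (N,) residual magnitudes
    is_outlier : (N,) bool, True where the epipolar test rejected the point
    base_thresh: basic motion amplitude threshold
    min_count  : minimum number of candidates needed for clustering
                 (the text cites 5 to 15 as typical)
    """
    strong = magnitude >= base_thresh
    candidates = np.flatnonzero(strong & is_outlier)
    if len(candidates) < min_count:
        # Too few prior-backed points: fall back to all strong residuals.
        candidates = np.flatnonzero(strong)
    return candidates

mag_demo = np.array([0.5, 3.0, 4.0, 6.0])
outlier_demo = np.array([False, True, False, True])
with_prior = candidate_residual_indices(mag_demo, outlier_demo, 2.0, min_count=2)
fallback = candidate_residual_indices(mag_demo, outlier_demo, 2.0, min_count=5)
```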

[0143] Step 4: Continuous Dynamic Event Recognition Based on Adaptive Gridding

[0144] Dense optical flow consumes a great deal of computation, while sparse points lack a continuous shape. This step applies grid statistics to the sparse residuals, assembling discrete point sets into continuous dynamic object blocks that match human visual intuition at very low computational cost, and introduces an adaptive threshold mechanism.

[0145] 4.1 Grid-based consistency modeling of the residual motion field:

[0146] To reduce the dimensionality of the discrete point coordinates and give them spatial structure so that continuous object contours can be extracted, the image plane is divided evenly into grid cells. For any grid cell $g$, let $P_g$ be the set of feature points that fall within this cell and belong to the effective residual feature set finally generated in step 3.4. The average residual magnitude $\bar{m}_g$ and the average direction of motion $\bar{\varphi}_g$ within the cell are calculated:

[0147] $$\bar{m}_g = \frac{1}{n_g} \sum_{i \in P_g} m_i, \qquad \bar{\varphi}_g = \frac{1}{n_g} \sum_{i \in P_g} \varphi_i$$

[0148] In the formula, $\bar{m}_g$ is the average residual amplitude of grid cell $g$;

[0149] $\bar{\varphi}_g$ is the average direction of motion of grid cell $g$;

[0150] $n_g$ is the total number of effective residual feature points in grid cell $g$;

[0151] $m_i$ is the residual magnitude of a single feature point;

[0152] $\varphi_i$ is the direction angle of a single feature point.
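A small NumPy sketch of the per-cell statistics follows. The function name, the grid size and the sample points are hypothetical. One deliberate substitution: the mean direction is computed as a circular mean (from summed sines and cosines) rather than a plain average of angles, because a plain average misbehaves for angles near ±π; this is a design choice of the sketch, not something the source specifies.

```python
import numpy as np

def grid_statistics(points, magnitude, direction, image_size, grid_shape):
    """Per-cell mean residual magnitude, mean direction and point count.

    points     : (N, 2) pixel coordinates (x, y) of candidate feature points
    image_size : (width, height) in pixels
    grid_shape : (rows, cols) of the grid laid over the image
    """
    width, height = image_size
    rows, cols = grid_shape
    col = np.clip((points[:, 0] * cols / width).astype(int), 0, cols - 1)
    row = np.clip((points[:, 1] * rows / height).astype(int), 0, rows - 1)

    count = np.zeros((rows, cols))
    mag_sum = np.zeros((rows, cols))
    sin_sum = np.zeros((rows, cols))
    cos_sum = np.zeros((rows, cols))
    # np.add.at accumulates correctly when several points share a cell.
    np.add.at(count, (row, col), 1.0)
    np.add.at(mag_sum, (row, col), magnitude)
    np.add.at(sin_sum, (row, col), np.sin(direction))
    np.add.at(cos_sum, (row, col), np.cos(direction))

    mean_mag = np.divide(mag_sum, count, out=np.zeros_like(mag_sum),
                         where=count > 0)
    mean_dir = np.arctan2(sin_sum, cos_sum)  # circular mean; 0 for empty cells
    return mean_mag, mean_dir, count.astype(int)

pts_demo = np.array([[5.0, 5.0], [6.0, 7.0], [95.0, 95.0]])
mag_demo = np.array([2.0, 4.0, 10.0])
ang_demo = np.array([0.0, np.pi / 2, np.pi])
mean_mag, mean_dir, count = grid_statistics(pts_demo, mag_demo, ang_demo,
                                            (100, 100), (2, 2))
```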

[0153] 4.2 Adaptive Connected Component Clustering with Fusion Acceleration Factors:

[0154] To merge adjacent, scattered grid cells that are all in motion into a single object region, and to use an acceleration factor so that the whole-image grid is not misjudged as motion during bumps, the modulation factor $\lambda$ from step 1.2 is introduced to dynamically relax the basic grid amplitude threshold $T_g$ and the basic residual judgment threshold $T_r$:

[0155] $$T'_g = \lambda\, T_g, \qquad T'_r = \lambda\, T_r$$

[0156] In the formula, $T'_g$ is the grid amplitude threshold after dynamic relaxation;

[0157] $T'_r$ is the residual judgment threshold after dynamic relaxation;

[0158] $T_g$ is the preset basic grid amplitude threshold;

[0159] $T_r$ is the preset residual judgment threshold;

[0160] $\lambda$ is the dynamic threshold modulation factor output in step 1.2.

[0161] As shown in Figure 3, when the robot moves rapidly, the basic residual judgment threshold $T_r$ is dynamically relaxed.

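[0162] A binarized active grid mask $M_g$ is constructed using the relaxed threshold:

[0163] $$M_g = \begin{cases} 1, & \bar{m}_g \ge T'_r \\ 0, & \text{otherwise} \end{cases}$$

[0164] In the formula, $M_g$ is a binary mask indicating whether the grid cell is active (1 for active, 0 for inactive), $\bar{m}_g$ is the average residual amplitude of the grid cell, and $T'_r$ is the residual judgment threshold after dynamic relaxation.

[0165] For the generated binary mask $M_g$, spatially adjacent grid cells with value 1 are aggregated to extract the candidate dynamic event regions $R_k$, as shown in Figure 4.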
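The relaxation, masking and aggregation steps of 4.2 can be sketched as follows. This is a hedged illustration: the function names and numbers are hypothetical, the modulation factor is treated as a multiplier supplied by step 1.2 (as claim 8 words it), and 8-neighbour adjacency is an assumption because the text only says "spatially adjacent".

```python
from collections import deque
import numpy as np

def relax_threshold(base_thresh, modulation):
    """Scale a base threshold by the dynamic modulation factor."""
    return modulation * base_thresh

def active_mask(mean_mag, relaxed_thresh):
    """1 where the cell's mean residual reaches the relaxed threshold."""
    return (mean_mag >= relaxed_thresh).astype(np.uint8)

def connected_regions(mask):
    """Group active cells into regions using 8-neighbour breadth-first search."""
    rows, cols = mask.shape
    visited = np.zeros((rows, cols), dtype=bool)
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r, c] == 0 or visited[r, c]:
                continue
            visited[r, c] = True
            queue, cells = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            queue.append((ny, nx))
            regions.append(cells)
    return regions

mean_mag_demo = np.array([[3.0, 3.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0, 5.0]])
relaxed = relax_threshold(2.0, modulation=1.25)   # 2.5
mask_demo = active_mask(mean_mag_demo, relaxed)
regions_demo = connected_regions(mask_demo)
```

With a larger modulation factor the same grid produces fewer active cells, which is how strong ego-acceleration is kept from lighting up the whole image.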

[0166] 4.3 Dynamic Judgment Based on Dual Criteria of Visual-Inertial Linkage:

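[0167] To make a final decision on the extracted candidate regions, removing static noise and outputting the real dynamic objects, the overall average direction of motion $\bar{\varphi}_k$ is extracted for each candidate region $R_k$, together with the set of residual magnitudes of all feature points within the region. A preset quantile of this set is denoted $Q_k$. Combined with the system reference rotation angle $\theta_{ref}$, the dual-criterion decision is executed:

[0168] $$R_k \ \text{is dynamic} \iff Q_k \ge T'_r \ \ \text{or} \ \ \left( \Delta\!\left( \bar{\varphi}_k, \theta_{ref} \right) > \tau_{dir} \ \ \text{and} \ \ Q_k \ge T_{min} \right)$$

[0169] In the formula, $Q_k$ is the quantile of the residual amplitudes in this region, and $T'_r$ is the residual judgment threshold after dynamic relaxation;

[0170] $\bar{\varphi}_k$ is the overall average direction of motion in the region, and $\Delta(\cdot,\cdot)$ is the minimum period difference between two angles;

[0171] $\theta_{ref}$ is the system reference rotation angle;

[0172] $\tau_{dir}$ is the threshold for determining the difference in direction;

[0173] $T_{min}$ is the minimum amplitude threshold required to determine directional deviation.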

[0174] As shown in Figure 5, a car moving laterally at very high speed produces a very large residual amplitude, which directly triggers the amplitude criterion, so the car is classified as a dynamic object. A pedestrian walking in the opposite direction may not reach the strong abrupt-change threshold in residual amplitude, but the pedestrian's motion vector direction is completely opposite to the main background optical flow direction (to the right), which triggers the direction criterion, so the pedestrian is also classified as a dynamic object. Fire hydrants or background trees may show slight residuals due to tracking noise, but their direction agrees with the main background motion and their amplitude is below the judgment threshold, so the system removes them as static background.
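The three cases in this example can be reproduced with a short Python sketch of the dual criterion. Everything numeric here is hypothetical, including the 0.9 quantile level, which the text does not preserve. The sketch follows the text in comparing the region's mean direction against the system reference angle, and it assumes the amplitude test uses "greater than or equal".

```python
import math
import numpy as np

def is_dynamic_region(magnitudes, mean_direction, reference_angle,
                      relaxed_residual_thresh, direction_thresh,
                      min_amplitude, quantile=0.9):
    """Dual-criterion test for one candidate region.

    Criterion 1: the residual quantile alone reaches the relaxed threshold.
    Criterion 2: the direction deviates strongly from the reference AND the
                 residual quantile reaches a lower minimum amplitude.
    """
    q = float(np.quantile(magnitudes, quantile))
    if q >= relaxed_residual_thresh:
        return True
    wrapped = (mean_direction - reference_angle + math.pi) % (2 * math.pi) - math.pi
    return abs(wrapped) > direction_thresh and q >= min_amplitude

# Hypothetical thresholds shared by the three cases below.
params = dict(reference_angle=0.0, relaxed_residual_thresh=10.0,
              direction_thresh=math.pi / 2, min_amplitude=2.0)
car = is_dynamic_region([20.0, 22.0, 25.0], 0.2, **params)
pedestrian = is_dynamic_region([3.0, 3.5, 4.0], math.pi, **params)
hydrant = is_dynamic_region([0.4, 0.5, 0.6], 0.1, **params)
```

The car passes on amplitude alone, the pedestrian passes on direction plus a modest amplitude, and the hydrant fails both tests.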

[0175] This invention achieves stable modeling of pixel-level motion relationships in complex scenes by constructing a multi-scale optical flow motion field and combining it with geometric consistency constraints. By introducing inertial information for collaborative estimation of the camera's own motion, it effectively eliminates the instability problem of motion estimation under rapid motion and visual degradation conditions. Based on this, a residual motion field is constructed by compensating for the camera's own motion, making the local abnormal motion characteristics of dynamic objects explicitly prominent. Through spatial structured modeling and a multi-criteria joint judgment mechanism, stable identification of dynamic object regions is achieved. Finally, a fusion decision-making mechanism using visual and inertial information is used to perform consistency verification and adaptive correction on the dynamic discrimination results, further enhancing the system's robustness and reliability in complex dynamic environments, thus achieving high-precision recognition of dynamic object scenes.

[0176] The above embodiments are used to explain and illustrate the present invention, but not to limit the present invention. Any modifications and changes made to the present invention within the spirit and scope of the claims shall fall within the protection scope of the present invention.

Claims

1. A method for recognizing dynamic objects and scenes by compensating for camera motion, characterized in that it comprises: extracting and smoothing the inertial rotation angle and acceleration from the inertial sensor and accelerometer, and calculating a dynamic threshold modulation factor from the instantaneous acceleration and the smoothed acceleration; extracting a feature point set from the camera image; performing multi-scale pyramid optical flow estimation on two adjacent frames to obtain the set of observed coordinates of the feature points in the next frame; separating background points and outliers in the image based on epipolar geometry constraints; obtaining the optimal affine matrix and the visual background inlier ratio from the background points; decomposing the visual rotation angle from the optimal affine matrix, and calculating the minimum period difference of the visual-inertial angle using the smoothed inertial rotation angle; making a one-way coverage trust decision for visual-inertial collaboration based on the minimum period difference of the visual-inertial angle, the visual inlier ratio and preset thresholds, to obtain the final system reference rotation angle; calculating the magnitude and direction angle of the two-dimensional optical flow residual vector using the optimal affine matrix, and extracting a residual candidate set using the outliers and the residual magnitudes; performing grid statistics on the residual candidate set to calculate the average residual magnitude and the average motion direction within each grid cell; relaxing the basic grid amplitude threshold and the residual judgment threshold with the dynamic threshold modulation factor, and extracting candidate dynamic event regions; and making a final decision on the candidate dynamic event regions to remove static noise and output the real dynamic objects.

2. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that extracting and smoothing the inertial rotation angle and acceleration from the inertial sensor and accelerometer, and calculating the dynamic threshold modulation factor from the instantaneous acceleration and the smoothed acceleration, comprises: performing discrete-time integration of the Z-axis angular velocity of the gyroscope of the sensor to obtain the raw inertial rotation angle at the current moment, and applying an exponential smoothing filter to obtain the final inertial reference rotation angle; extracting the triaxial acceleration data to calculate the acceleration magnitude; applying a first-order low-pass filter to smooth the instantaneous acceleration magnitude and update the smoothed reference acceleration; and calculating the abrupt difference between the current instantaneous acceleration magnitude and the smoothed reference acceleration as the dynamic threshold modulation factor.

3. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that separating background points and outliers in the image based on epipolar geometry constraints comprises: transforming the two-dimensional pixel coordinates into homogeneous representations $\tilde{p}_i$ and $\tilde{q}_i$ by adding a constant dimension; under the homogeneous representation, introducing a fundamental matrix $F$ reflecting the epipolar geometric relation to describe the epipolar geometric constraint between the two image frames; constructing the Sampson distance error equation based on the epipolar geometric constraint to measure the degree to which a point pair deviates from a static rigid scene, and obtaining the epipolar constraint error of the $i$-th point pair; and, when the epipolar constraint error is less than the preset epipolar error threshold, considering the point a background point, and otherwise considering it an outlier.

4. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that obtaining the optimal affine matrix and the visual background inlier ratio from the background points comprises: constructing a two-dimensional affine transformation model on the subset of background points, minimizing the robust reprojection error under the random sample consensus algorithm, and solving for the optimal affine matrix that represents only the true motion of the camera; setting a reprojection error threshold $\tau_r$; and, if a point satisfies $\left\| q_i - A^{*}\tilde{p}_i \right\| < \tau_r$, strictly determining the point to be an affine inlier, and calculating the visual background inlier ratio from the number of affine inliers; where $q_i$ is the actual observed coordinate of the feature point in the next frame, $A^{*}$ is the optimal affine matrix, and $\tilde{p}_i$ is the homogeneous pixel coordinate in the previous frame.

5. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that making the one-way coverage trust decision for visual-inertial collaboration based on the minimum period difference of the visual-inertial angle, the visual inlier ratio and preset thresholds to obtain the final system reference rotation angle comprises: accepting the inertial output by default; and, when the visual inlier ratio is greater than a preset quality threshold and the minimum period difference of the visual-inertial angle is greater than a preset divergence threshold, determining that vision has captured a more accurate local motion and overwriting the inertial output with the visual rotation angle.

6. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that calculating the magnitude and direction angle of the two-dimensional optical flow residual vector using the optimal affine matrix comprises: applying motion compensation to the feature points of the previous frame using the optimal affine matrix to calculate the theoretically predicted positions; subtracting the theoretically predicted positions from the actual observed positions to generate the core two-dimensional optical flow residual vector; and extracting the magnitude and direction angle of this residual vector.

7. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that extracting the residual candidate set using the outliers and the residual magnitudes comprises: taking the outliers obtained from the separation of background points and outliers, and adding a feature point to the residual candidate set when the residual magnitude of that outlier feature point reaches the basic motion amplitude threshold; and using global residual screening when the number of candidate points is less than the minimum confidence limit of connected component clustering.

8. The dynamic object scene recognition method for compensating for camera motion according to claim 1, characterized in that relaxing the basic grid amplitude threshold and the residual judgment threshold with the dynamic threshold modulation factor and extracting candidate dynamic event regions comprises: multiplying the dynamic threshold modulation factor by the basic grid amplitude threshold and by the residual judgment threshold respectively to obtain the relaxed grid amplitude threshold and the relaxed residual judgment threshold; constructing a binarized active grid mask using the relaxed residual judgment threshold, in which a grid cell is marked active if its average residual magnitude is greater than or equal to the relaxed residual judgment threshold; and aggregating spatially adjacent active grid cells to extract the candidate dynamic event regions.

9. A method for recognizing dynamic objects and scenes by compensating for camera motion according to claim 1, characterized in that making the final decision on the candidate dynamic event regions comprises: extracting the overall average direction of motion $\bar{\varphi}_k$ of each candidate region; taking the set of residual magnitudes of all feature points within the candidate region and denoting a preset quantile of this set as $Q_k$; and determining the region to be dynamic when $Q_k \ge T'_r$, or when $\Delta\!\left( \bar{\varphi}_k, \theta_{ref} \right) > \tau_{dir}$ and $Q_k \ge T_{min}$; where $T'_r$ is the residual judgment threshold after dynamic relaxation, $\theta_{ref}$ is the system reference rotation angle, $\Delta(\cdot,\cdot)$ is the minimum period difference between two angles, $\tau_{dir}$ is the threshold for determining the difference in direction, and $T_{min}$ is the minimum amplitude threshold required to determine directional deviation.

10. A dynamic object scene recognition system for compensating for camera motion, for implementing the method of any one of claims 1-9, characterized in that it comprises: a hardware data preprocessing module, used to extract basic physical parameters from the hardware sensors in advance, namely to extract and smooth the inertial rotation angle and acceleration from the inertial sensor and accelerometer, calculate the dynamic threshold modulation factor from the instantaneous acceleration and the smoothed acceleration, and extract the feature point set from the camera image; a background main motion estimation module, used to perform multi-scale pyramid optical flow estimation on two adjacent frames, separate background points and outliers based on epipolar geometry constraints, and obtain the optimal affine matrix and the visual background inlier ratio from the background points; a visual-inertial collaborative decision module, used to measure the visual-inertial angle difference to obtain the minimum period difference of the visual-inertial angle and, based on that difference, the visual inlier ratio and preset thresholds, to make the one-way coverage trust decision for visual-inertial collaboration and obtain the final system reference rotation angle; and a cognitive decision-making module, used to assemble discrete point sets into continuous dynamic object blocks that match human visual intuition by means of an adaptive threshold mechanism.