An AR-based video tag dynamic overlay anti-shake method and system

By constructing a dynamic perspective projection model and a local feature search window, the pixel displacement of AR tags is calculated and compensated, solving the drift and jitter problems of AR tags in complex environments, and achieving a high-precision video overlay effect with low computational overhead.

CN122293995A · Pending · Publication Date: 2026-06-26 · 广西信安锐达科技有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
广西信安锐达科技有限公司
Filing Date
2026-03-23
Publication Date
2026-06-26


Abstract

This invention discloses an AR-based video tag dynamic overlay anti-shake method and system, comprising: acquiring a real-time video stream and the device pose parameters at the corresponding timestamps; constructing a dynamic perspective projection model based on the device pose parameters and calculating the initial pixel coordinates of the target AR tag in the current video frame; planning a local feature search window based on the initial pixel coordinates; calculating the average pixel displacement vector of image feature points within the local feature search window between the current video frame and adjacent historical video frames; performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates, and rendering the target AR tag onto the current video frame based on the target rendering coordinates. Compared with the prior art, this invention can overcome the shortcomings of large drift errors caused by relying solely on mechanical pose, high computational power consumption caused by global matching, and high-frequency tag jitter caused by slight device vibrations, thereby improving tag mapping accuracy, reducing system computational power consumption, and improving the visual stability of video overlay.

Description

Technical Field

[0001] This invention relates to the field of digital image processing technology, and more specifically, to an AR-based video tag dynamic overlay anti-shake method and system.

Background Technology

[0002] With the development of urban governance and digital twin technology, AR real-scene command and dispatch systems are widely used in complex scenarios such as transportation hubs and urban high points. In actual operation, front-end monitoring equipment often faces multiple environmental interferences. On the one hand, at traffic nodes such as train station squares, dense large vehicles can easily occlude large areas of the image, causing conventional image tracking algorithms to mistake foreground moving targets for background references. On the other hand, when high-point equipment uses high-magnification zoom to observe distant details, the local texture of the image is sparse, and high-altitude wind shear can easily cause mechanical micro-vibrations of the poles and roll offset of the lens.

[0003] While existing technologies can process video stream tags by directly reading the mechanical parameters of the camera gimbal and combining them with the perspective projection matrix to achieve 3D spatial mapping and initial image overlay of AR tags, existing technologies still suffer from problems such as large cumulative drift error of tags, excessive computing power consumption, and high-frequency jitter and visual tearing of tags.

[0004] Therefore, a technical problem that urgently needs to be solved by those skilled in the art is how to provide an AR-based video tag dynamic overlay anti-shake method that overcomes the large drift errors caused by relying solely on mechanical pose, the high computational power consumption caused by global matching, and the high-frequency tag jitter caused by minor device vibrations, thereby improving tag mapping accuracy, reducing system computational power consumption, and enhancing the visual stability of video overlay.

Summary of the Invention

[0005] To address the aforementioned technical problems, this invention provides an AR-based video tag dynamic overlay anti-shake method, which overcomes the drawbacks of large drift errors caused by relying solely on mechanical pose, high computational power consumption due to global matching, and high-frequency tag jitter caused by minor device vibrations. This method improves tag mapping accuracy, reduces system computational power consumption, and enhances the visual stability of video overlay.

[0006] The first technical solution provided by this invention is as follows: This invention provides a video tag dynamic overlay anti-shake method based on AR, comprising the following steps: S1 determining a front-end monitoring device, acquiring a real-time video stream, and acquiring the device pose parameters of the front-end monitoring device at the corresponding timestamp; S2 constructing a dynamic perspective projection model based on the device pose parameters, and calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; S3 planning a local feature search window based on the initial pixel coordinates, the local feature search window representing a local image region centered on the initial pixel coordinates; S4 calculating the average pixel displacement vector of the image feature points of the current video frame and adjacent historical video frames within the local feature search window; S5 performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain target rendering coordinates, and rendering the target AR tag to the current video frame based on the target rendering coordinates.

[0007] Further, in a preferred embodiment of the present invention, the step of planning a local feature search window based on the initial pixel coordinates includes: parsing the real-time optical zoom ratio from the device pose parameters; multiplying the real-time optical zoom ratio by the preset base window side length to obtain the target window side length of the feature tracking window; and, in the current video frame, cropping a region of the target window side length centered on the initial pixel coordinates to obtain the local feature search window.

[0008] Furthermore, in a preferred embodiment of the present invention, the step of constructing a dynamic perspective projection model based on the device pose parameters includes: the device pose parameters include the latitude and longitude coordinates, altitude, yaw angle, pitch angle, and roll angle of the front-end monitoring device; the difference between the latitude and longitude coordinates and altitude of the front-end monitoring device and the three-dimensional geographic coordinates of the target AR tag is calculated to obtain the relative translation vector; the camera extrinsic rotation matrix is constructed based on the yaw angle, pitch angle, and roll angle; and the dynamic perspective projection model is synthesized by combining the pre-acquired camera intrinsic parameter matrix, the relative translation vector, and the extrinsic rotation matrix.
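
The synthesis step can be sketched as follows. This is a minimal illustration that assumes the textbook pinhole form P = K [R | -R C], where C is the camera position in a local Cartesian frame, so that applying P to a tag position X yields K R (X - C), i.e. the rotated relative translation vector. The patent does not state the exact matrix layout; the function name `compose_projection` and the sample intrinsic values are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def compose_projection(K: Sequence[Sequence[float]],
                       R: Sequence[Sequence[float]],
                       cam_pos: Sequence[float]) -> Matrix:
    """Build the 3x4 matrix P = K [R | -R C] (assumed pinhole convention)."""
    # Translation column t = -R C, with C the camera position.
    t = [-sum(R[i][k] * cam_pos[k] for k in range(3)) for i in range(3)]
    Rt = [list(R[i]) + [t[i]] for i in range(3)]
    return matmul(K, Rt)


# Hypothetical intrinsics: focal length 1000 px, principal point (960, 540).
K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
P = compose_projection(K, R, cam_pos=[1.0, 2.0, 3.0])
```

Because the pose parameters change from frame to frame, a function of this kind would be re-evaluated for every frame, which is presumably what makes the model "dynamic".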

[0009] Further, in a preferred embodiment of the present invention, the step of calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model includes: converting the three-dimensional geographic coordinates of the target AR tag into homogeneous coordinates, and multiplying these homogeneous coordinates by the dynamic perspective projection model to obtain two-dimensional homogeneous coordinates; and normalizing the two-dimensional homogeneous coordinates by perspective division to obtain the initial pixel coordinates of the target AR tag on the pixel plane of the current video frame.
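
A short sketch of this projection step, assuming the projection model is available as a 3x4 matrix. The matrix values below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from typing import Sequence, Tuple


def project_point(P: Sequence[Sequence[float]],
                  xyz: Sequence[float]) -> Tuple[float, float]:
    """Map a 3D point to pixel coordinates with a 3x4 projection matrix."""
    X = [xyz[0], xyz[1], xyz[2], 1.0]              # homogeneous 3D coordinates
    u, v, w = (sum(row[i] * X[i] for i in range(4)) for row in P)
    if w <= 0.0:
        raise ValueError("point is not in front of the camera")
    return (u / w, v / w)                          # perspective division


# Hypothetical projection matrix: camera at the origin looking along +Z.
P = [[1000.0, 0.0, 960.0, 0.0],
     [0.0, 1000.0, 540.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
pixel = project_point(P, (1.0, 2.0, 10.0))
```

Here the two-dimensional homogeneous coordinates are (10600, 7400, 10), and dividing by the third component gives the pixel position (1060, 740).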

[0010] Further, in a preferred embodiment of the present invention, the step of calculating the average pixel displacement vector of image feature points of the current video frame and adjacent historical video frames within the local feature search window includes: the local feature search window includes a first local feature search window located in the adjacent historical video frame and a second local feature search window located in the current video frame; the first local feature search window and the second local feature search window are each converted to grayscale; a corner detection algorithm is used to extract multiple background feature points within the first local feature search window; within the second local feature search window, the target pixel matching each background feature point is searched for, and the coordinate difference between the target pixel and the corresponding background feature point is calculated to obtain the pixel displacement vector of each background feature point; and all the pixel displacement vectors are summed and averaged to obtain the average pixel displacement vector.
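
The matching and averaging steps can be illustrated with a small pure-Python sketch. Several things here are assumptions rather than the patent's method: the feature points are hard-coded stand-ins for the output of a corner detector, both windows are already grayscale, and matching is done by a brute-force sum-of-squared-differences (SSD) patch search, since the text does not name a matching algorithm.

```python
import random
from typing import List, Sequence, Tuple

Gray = Sequence[Sequence[float]]


def match_displacement(prev: Gray, curr: Gray, pt: Tuple[int, int],
                       patch: int = 2, radius: int = 3) -> Tuple[int, int]:
    """Find the (dx, dy) whose patch in curr best matches the patch at pt in prev.

    pt must lie at least `patch` pixels inside the border of prev.
    """
    x, y = pt
    h, w = len(curr), len(curr[0])
    best_ssd, best = float("inf"), (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose patch would leave the window.
            if not (patch <= x + dx < w - patch and patch <= y + dy < h - patch):
                continue
            ssd = sum((curr[y + dy + j][x + dx + i] - prev[y + j][x + i]) ** 2
                      for j in range(-patch, patch + 1)
                      for i in range(-patch, patch + 1))
            if ssd < best_ssd:
                best_ssd, best = ssd, (dx, dy)
    return best


def mean_displacement(vectors: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Sum all displacement vectors and divide by their count."""
    if not vectors:
        raise ValueError("no displacement vectors to average")
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


# Synthetic check: a random 24x24 texture shifted right by 2 and down by 1.
rng = random.Random(0)
prev_win = [[rng.randint(0, 255) for _ in range(24)] for _ in range(24)]
curr_win = [[prev_win[y - 1][x - 2] if y >= 1 and x >= 2 else 0
             for x in range(24)] for y in range(24)]
features = [(8, 8), (15, 10), (10, 16)]   # stand-ins for detected corners
vectors = [match_displacement(prev_win, curr_win, p) for p in features]
mean_vec = mean_displacement(vectors)
```

A production system would more likely use an optimized detector and tracker; the point of the sketch is only that the search stays inside the small window, which is where the computational saving over global matching comes from.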

[0011] Furthermore, in a preferred embodiment of the present invention, the method further includes performing abnormal feature removal on the pixel displacement vector of each background feature point, specifically: local data samples are obtained by randomly sampling all the pixel displacement vectors multiple times, and a basic affine transformation model is fitted to the local data samples; the reprojection error between the pixel displacement vector of each background feature point and the fitted basic affine transformation model is calculated; background feature points whose reprojection error is greater than a preset error threshold are identified as foreground motion interference points and removed; and the pixel displacement vectors of background feature points whose reprojection error is less than or equal to the preset error threshold are retained, thus completing the abnormal feature removal.
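
The procedure resembles a RANSAC-style consensus search, which can be sketched as below. This is an illustration under stated assumptions: each random sample contains three correspondences (the minimum for an affine model), the model with the largest inlier set is kept, and the function names, iteration count, and threshold are hypothetical. The patent text does not specify how the final model is selected.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Affine = List[Tuple[float, float, float]]


def _det3(m: Sequence[Sequence[float]]) -> float:
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> Optional[Affine]:
    """Exact affine map from 3 correspondences (Cramer's rule); None if collinear."""
    A = [[x, y, 1.0] for x, y in src]
    d = _det3(A)
    if abs(d) < 1e-9:
        return None
    rows = []
    for k in (0, 1):                       # solve for the x row, then the y row
        coeffs = []
        for col in range(3):
            M = [row[:] for row in A]
            for r in range(3):
                M[r][col] = dst[r][k]
            coeffs.append(_det3(M) / d)
        rows.append((coeffs[0], coeffs[1], coeffs[2]))
    return rows


def ransac_inliers(points: Sequence[Point], disps: Sequence[Point],
                   err_thresh: float = 1.0, iters: int = 100,
                   seed: int = 0) -> List[int]:
    """Indices of feature points consistent with the dominant affine motion."""
    n = len(points)
    if n < 3:
        return list(range(n))
    rng = random.Random(seed)
    targets = [(p[0] + d[0], p[1] + d[1]) for p, d in zip(points, disps)]
    best: List[int] = []
    for _ in range(iters):
        idx = rng.sample(range(n), 3)
        model = fit_affine([points[i] for i in idx], [targets[i] for i in idx])
        if model is None:
            continue
        (a, b, c), (d, e, f) = model
        inliers = [i for i, (x, y) in enumerate(points)
                   if math.hypot(a * x + b * y + c - targets[i][0],
                                 d * x + e * y + f - targets[i][1]) <= err_thresh]
        if len(inliers) > len(best):
            best = inliers
    return best


# Eight background points moving by (2, 1) plus two foreground outliers.
pts = [(5, 5), (20, 7), (35, 6), (8, 22), (22, 25), (37, 21), (6, 38), (24, 36),
       (30, 30), (12, 15)]
dsp = [(2, 1)] * 8 + [(15, -9), (-12, 7)]
kept = ransac_inliers(pts, dsp)
```

Only the displacement vectors at the returned indices would then enter the averaging step, so a passing vehicle inside the window does not bias the estimate.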

[0012] Furthermore, in a preferred embodiment of the present invention, the method further includes applying an anti-shake tolerance dead zone check to the average pixel displacement vector, specifically as follows: determining whether the magnitude of the average pixel displacement vector is greater than a preset pixel fluctuation threshold; if the magnitude of the average pixel displacement vector is less than or equal to the pixel fluctuation threshold, the average pixel displacement vector is set to zero before reverse displacement compensation is performed; if the magnitude of the average pixel displacement vector is greater than the pixel fluctuation threshold, reverse displacement compensation is performed with the vector unchanged.
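
The dead zone check amounts to a single comparison on the vector magnitude. A minimal sketch, with a hypothetical function name and threshold value:

```python
import math
from typing import Tuple


def apply_dead_zone(mean_disp: Tuple[float, float],
                    threshold_px: float) -> Tuple[float, float]:
    """Zero out displacements whose magnitude does not exceed the threshold."""
    dx, dy = mean_disp
    if math.hypot(dx, dy) <= threshold_px:
        return (0.0, 0.0)      # treated as noise: no compensation is applied
    return (dx, dy)            # real motion: passed through unchanged


small = apply_dead_zone((0.3, 0.2), threshold_px=0.5)
large = apply_dead_zone((3.0, 4.0), threshold_px=0.5)
```

Suppressing sub-threshold displacements keeps the tag from twitching in response to sensor or matching noise while leaving genuine motion untouched.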

[0013] Further, in a preferred embodiment of the present invention, performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates includes: parsing the horizontal coordinate offset and the vertical coordinate offset of the average pixel displacement vector; negating the horizontal coordinate offset and the vertical coordinate offset to obtain the reverse compensation horizontal component and the reverse compensation vertical component; and adding the reverse compensation horizontal component and the reverse compensation vertical component to the horizontal and vertical coordinates of the initial pixel coordinates, respectively, to obtain the target rendering coordinates.
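
As a worked illustration of the compensation arithmetic (the function name and the sample coordinates are hypothetical):

```python
from typing import Tuple


def reverse_compensate(initial_xy: Tuple[float, float],
                       mean_disp: Tuple[float, float]) -> Tuple[float, float]:
    """Add the negated displacement components to the initial pixel coordinates."""
    u, v = initial_xy
    dx, dy = mean_disp
    comp_x, comp_y = -dx, -dy          # reverse compensation components
    return (u + comp_x, v + comp_y)


target = reverse_compensate((960.0, 540.0), (2.0, -1.5))
```

With an initial position of (960, 540) and an average displacement of (2, -1.5), the tag is rendered at (958, 541.5).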

[0014] The present invention provides a second technical solution as follows: This invention also provides an AR-based video tag dynamic overlay stabilization system, including: The video pose acquisition module identifies the front-end monitoring device, acquires the real-time video stream, and acquires the device pose parameters of the front-end monitoring device at the corresponding timestamp. The initial coordinate calculation module constructs a dynamic perspective projection model based on the device pose parameters, and calculates the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model. The search window planning module plans a local feature search window based on the initial pixel coordinates. The local feature search window represents a local image region centered on the initial pixel coordinates. The pixel displacement calculation module calculates the average pixel displacement vector of the image feature points of the current video frame and adjacent historical video frames within the local feature search window; The reverse compensation rendering module performs reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates, and renders the target AR tag to the current video frame based on the target rendering coordinates.

[0015] Furthermore, in a preferred embodiment of the present invention, the search window planning module includes: a zoom ratio parsing unit, which parses the real-time optical zoom ratio from the device pose parameters; a target side length calculation unit, which calculates the target window side length by multiplying the real-time optical zoom ratio by the preset base window side length; and a local window delineation unit, which, in the current video frame, crops a region of the target window side length centered on the initial pixel coordinates to obtain the local feature search window.

[0016] This invention provides an AR-based video tag dynamic overlay anti-shake method, which overcomes the shortcomings of relying solely on mechanical pose leading to large drift errors, global matching resulting in high computational consumption, and minor device vibrations causing high-frequency tag jitter. It improves tag mapping accuracy, reduces system computational overhead, and enhances the visual stability of video overlay. The AR-based video tag dynamic overlay anti-shake method includes: S1 determining a front-end monitoring device, acquiring a real-time video stream, and acquiring the device pose parameters of the front-end monitoring device at the corresponding timestamp; S2 constructing a dynamic perspective projection model based on the device pose parameters, and calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; S3 planning a local feature search window based on the initial pixel coordinates, where the local feature search window represents a local image region centered on the initial pixel coordinates; S4 calculating the average pixel displacement vector of the image feature points of the current video frame and adjacent historical video frames within the local feature search window; S5 performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain target rendering coordinates, and rendering the target AR tag to the current video frame based on the target rendering coordinates. This invention first obtains the real-time video stream and the device pose parameters at the corresponding timestamp from the front-end monitoring device, and calculates the initial pixel coordinates of the target AR tag in the current video frame using a dynamic perspective projection model, thereby establishing a macroscopic spatial mapping benchmark. 
Subsequently, based on these initial pixel coordinates, a local feature search window is planned and restricted to a local image region centered on these initial pixel coordinates. By delimiting a local tracking boundary within the full frame, redundant background pixel matching operations are largely eliminated, directly overcoming the technical problem of excessive computational power consumption in existing technologies. Under this local spatial constraint, the average pixel displacement vector of image feature points in the current video frame and adjacent historical video frames is further extracted and calculated. This approach abandons the sole reliance on camera pan-tilt parameters, which are prone to mechanical errors, and instead captures inter-frame relative offsets based on real image features, effectively solving the problem of large cumulative tag drift errors. Finally, the obtained average pixel displacement vector is used as a mathematical correction factor to perform reverse displacement compensation on the initial pixel coordinates. This derives the target rendering coordinates with anti-shake properties, and the target AR tag is rendered onto the current video frame accordingly. This reverse compensation mechanism uses the extracted inter-frame instantaneous displacement to apply an opposing correction to the initial projection coordinates, smoothing out the abrupt high-frequency coordinate changes caused by minor device vibrations. This solves the problems of high-frequency tag jitter and visual tearing, ultimately achieving dynamic anti-shake rendering of AR tags that combines low computational cost, high spatial overlay accuracy, and excellent visual stability. 
Compared to existing technologies, this invention overcomes the shortcomings of relying solely on mechanical pose leading to large drift errors, global matching leading to high computational cost, and minor device vibrations causing high-frequency tag jitter. It improves tag mapping accuracy, reduces system computational cost, and enhances the visual stability of video overlay.

Description of the Drawings

[0017] To illustrate the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of the steps of the AR-based video tag dynamic overlay anti-shake method provided in an embodiment of the present invention;
Figure 2 is a flowchart of the steps for constructing a dynamic perspective projection model provided in an embodiment of the present invention;
Figure 3 is a logical framework diagram for calculating the average pixel displacement vector provided in an embodiment of the present invention;
Figure 4 is a logical framework diagram for abnormal feature removal on pixel displacement vectors provided in an embodiment of the present invention;
Figure 5 is a performance comparison chart of feature extraction based on the dynamic search window adjustment mechanism provided in an embodiment of the present invention.

Detailed Implementation

[0019] To enable those skilled in the art to better understand the technical solutions of this invention, the technical solutions of the embodiments of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.

[0020] It should be noted that when a component is referred to as being "fixed to" or "set on" another component, it can be directly on or indirectly set on the other component; when a component is referred to as being "connected to" another component, it can be directly connected to or indirectly connected to the other component.

[0021] It should be understood that the terms "length", "width", "upper", "lower", "front", "rear", "first", "second", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the present invention.

[0022] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" or "several" means two or more, unless otherwise explicitly specified.

[0023] It should be noted that the structures, proportions, sizes, etc., shown in the accompanying drawings of this specification are only for the purpose of assisting those skilled in the art in understanding and reading the content disclosed in the specification, and are not intended to limit the conditions under which the present invention can be implemented. Therefore, they have no substantial technical significance. Any modifications to the structure, changes in the proportions, or adjustments to the size, without affecting the effects and objectives that the present invention can produce, should still fall within the scope of the technical content disclosed in the present invention.

[0024] As shown in Figures 1 to 5, the present invention provides an AR-based video tag dynamic overlay anti-shake method, which can overcome the shortcomings of large drift error caused by relying solely on mechanical pose, high computing power consumption caused by global matching, and high-frequency tag jitter caused by slight vibration of the device. It can improve the accuracy of tag mapping, reduce the computing power consumption of the system, and improve the visual stability of video overlay.

[0025] This invention provides a video tag dynamic overlay anti-shake method based on AR, specifically including: S1 determining the front-end monitoring device, acquiring the real-time video stream, and acquiring the device pose parameters of the front-end monitoring device at the corresponding timestamp; S2 constructing a dynamic perspective projection model based on the device pose parameters, and calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; S3 planning a local feature search window based on the initial pixel coordinates, the local feature search window representing a local image region centered on the initial pixel coordinates; S4 calculating the average pixel displacement vector of the image feature points in the current video frame and adjacent historical video frames within the local feature search window; S5 performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates, and rendering the target AR tag to the current video frame based on the target rendering coordinates. This invention first obtains the real-time video stream and corresponding timestamp device pose parameters of the front-end monitoring device, and calculates the initial pixel coordinates of the target AR tag in the current video frame by combining a dynamic perspective projection model, thereby establishing a macroscopic spatial mapping benchmark. Subsequently, a local feature search window is planned according to the initial pixel coordinates, limiting it to a local image region centered on the initial pixel coordinates. By planning a local tracking boundary from the full frame, the matching calculation of redundant background pixels is greatly eliminated, directly overcoming the technical problem of excessive computing power consumption in the prior art. 
Under this local spatial constraint, the average pixel displacement vector of image feature points in the current video frame and adjacent historical video frames is further extracted and calculated. In this way, the single dependence on the camera pan-tilt parameters that are prone to mechanical errors is abandoned, and instead the relative offset between frames is captured based on the physical features of the real image, thereby effectively solving the technical problem of large cumulative drift error of the tag. Finally, the obtained average pixel displacement vector is used as a mathematical correction factor to perform reverse displacement compensation calculation on the aforementioned initial pixel coordinates, deriving the target rendering coordinates with anti-vibration properties, and rendering the target AR tag to the current video frame accordingly. This reverse compensation mechanism uses the extracted inter-frame instantaneous displacement features to apply a reverse cancellation force to the initial projection coordinates, perfectly smoothing out high-frequency coordinate changes caused by minor device vibrations, and completely solving the phenomenon of high-frequency tag jitter and visual tearing. Ultimately, it achieves the effect of AR tag dynamic anti-shake rendering technology with low computing power consumption, high spatial superposition accuracy, and excellent visual stability. Compared with the existing technology, this invention can overcome the shortcomings of large drift error caused by simply relying on mechanical pose, high computing power consumption caused by global matching, and high-frequency tag jitter caused by minor device vibrations, improve tag mapping accuracy, reduce system computing power consumption, and improve the visual stability of video superposition.

[0026] It should be further explained that, regarding the acquisition in step S1 of the real-time video stream and the device pose parameters at the corresponding timestamp, the actual hardware architecture exhibits an inherent asymmetric delay between the video encoding, compression, and transmission link and the telemetry reporting link for the gimbal's mechanical pose (for example, video frame encoding usually takes longer than sensor data packaging). To ensure strict time synchronization between the two and to avoid AR tag projection drift caused by asynchronous transmission, this embodiment adopts a timestamp alignment and interpolation mechanism based on double-ended queues. In this embodiment: when the front-end monitoring device triggers image frame exposure and reads the gimbal mechanical sensor parameters at the hardware level, the same high-precision clock source stamps both with an absolute physical timestamp; at the backend receiving end, a video frame buffer queue and a pose data buffer queue are established; when the image timestamp of the current video frame is extracted, the two pose timestamps closest to it are retrieved from the pose data buffer queue. If no exactly matching timestamp is found, linear interpolation weighted by the time difference is performed on the pose parameters of the two timestamps immediately before and after, thereby deriving fitted device pose parameters that correspond to the exposure moment of the current video frame.
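
The alignment and interpolation logic can be sketched as follows. The sketch assumes the pose buffer is a `collections.deque` of `(timestamp, parameters)` pairs kept in ascending time order, and that angles do not wrap around (for example from 359 to 1 degree) between two consecutive samples; the field names and the function name are hypothetical.

```python
from bisect import bisect_left
from collections import deque
from typing import Deque, Dict, Tuple

Pose = Dict[str, float]


def interpolate_pose(pose_queue: Deque[Tuple[float, Pose]],
                     frame_ts: float) -> Pose:
    """Return pose parameters aligned to the exposure time of a video frame."""
    times = [ts for ts, _ in pose_queue]
    i = bisect_left(times, frame_ts)
    if i < len(times) and times[i] == frame_ts:
        return dict(pose_queue[i][1])          # exact timestamp match
    if i == 0 or i == len(times):
        raise ValueError("frame timestamp lies outside the buffered pose range")
    (t0, p0), (t1, p1) = pose_queue[i - 1], pose_queue[i]
    w = (frame_ts - t0) / (t1 - t0)            # time-difference weight in (0, 1)
    # Plain linear interpolation; angle wrap-around is NOT handled here.
    return {k: p0[k] + w * (p1[k] - p0[k]) for k in p0}


poses: Deque[Tuple[float, Pose]] = deque(maxlen=256)
poses.append((100.0, {"yaw": 10.0, "pitch": -5.0, "zoom": 2.0}))
poses.append((140.0, {"yaw": 14.0, "pitch": -3.0, "zoom": 4.0}))
aligned = interpolate_pose(poses, 110.0)
```

A frame exposed at t = 110 lies a quarter of the way between the two pose samples, so each parameter is moved a quarter of the way from its earlier value to its later one.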

[0027] The following describes in detail the steps of the AR-based video tag dynamic overlay anti-shake method with specific embodiments.

[0028] Specifically, in a specific embodiment of the present invention, the step of planning a local feature search window based on the initial pixel coordinates includes: parsing the real-time optical zoom ratio from the device pose parameters; multiplying the real-time optical zoom ratio by the preset base window side length to obtain the target window side length of the feature tracking window; and, in the current video frame, cropping a region of the target window side length centered on the initial pixel coordinates to obtain the local feature search window. In this embodiment, the real-time optical zoom ratio is first parsed from the device pose parameters. Specifically, the value representing the lens optical magnification (e.g., 2.5x or 4.0x) is extracted from the hardware telemetry data stream transmitted in real time by the front-end monitoring device and used as the real-time optical zoom ratio. Next, the target window side length is calculated from this ratio and a preset base window side length, which is calibrated in advance during system initialization. In this embodiment, under 1x wide-angle (no zoom) conditions, feature detection rate analysis is performed on the test image stream of a typical monitoring scene to identify the smallest square region that stably contains at least 30 effective background feature points, and its side length (e.g., 64 pixels) is stored in the system configuration file. During actual operation, the system multiplies the acquired real-time optical zoom ratio by this base window side length to derive the dynamic target window side length. To prevent the local window from expanding excessively under extreme telephoto conditions (e.g., 20x zoom), which would cause a surge in redundant computation, the system presets a maximum window side length threshold, for example, half the pixel resolution of the shorter side of the video frame. 
The system compares the theoretical product with this maximum window side length threshold and takes the smaller value as the final target window side length for feature tracking, then crops the local feature search window. In the current video frame, using the pre-calculated initial pixel coordinates as the geometric center, a square image sub-region is delineated by extending outwards by half the target window side length in both the horizontal and vertical directions. During this delineation, if the boundary coordinates of the target window fall outside the pixel extent of the current video frame (that is, a coordinate becomes negative or exceeds the maximum pixel value of the image width or height), an asymmetric boundary cropping strategy is triggered. Specifically, the center position given by the initial pixel coordinates is kept fixed, and each out-of-bounds window edge is clipped directly to the corresponding frame boundary. Clipping is therefore performed only on the side or adjacent sides that exceed the frame, turning the original square window into a non-square rectangular local feature search window that fits the frame edge, which completes the final cropping. Through this mechanism of binding the window size to the zoom ratio, the system adapts to changes in the camera's field of view: at the telephoto end, where image texture is sparse, it automatically expands the search range to secure enough feature points; at the wide-angle end, it correspondingly shrinks the window to exclude irrelevant environmental interference. This strategy significantly reduces overall redundant computation while ensuring adaptive feature tracking.
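
The window planning logic described above can be sketched as follows. The default base side length of 64 pixels and the cap of half the shorter frame side are the example values given in the text; the function name and the half-open `(x0, y0, x1, y1)` return convention are assumptions made for illustration.

```python
from typing import Tuple


def plan_search_window(center: Tuple[float, float], zoom: float,
                       frame_w: int, frame_h: int,
                       base_side: int = 64) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1), half-open, for the local feature search window."""
    max_side = min(frame_w, frame_h) // 2              # cap: half the shorter side
    side = min(int(round(zoom * base_side)), max_side)
    half = side // 2
    cx, cy = int(round(center[0])), int(round(center[1]))
    # Asymmetric clipping: only edges that leave the frame are moved.
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, frame_w), min(cy + half, frame_h)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("initial pixel coordinates lie outside the frame")
    return (x0, y0, x1, y1)


centered = plan_search_window((960, 540), 2.5, 1920, 1080)
near_corner = plan_search_window((10, 20), 4.0, 1920, 1080)
telephoto = plan_search_window((960, 540), 20.0, 1920, 1080)
```

In the second call the window is clipped on its left and top edges only, giving the non-square rectangle described in the text; in the third call the 1280-pixel product is capped at 540 pixels.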

[0029] To verify the effectiveness of the above scheme, which dynamically adjusts the local feature search window size based on the real-time optical zoom, a targeted control experiment was designed and executed, as shown in Figure 5. The experimental dataset was a 120-minute zoom test stream containing 20,000 consecutive video frames captured by surveillance cameras on a real urban main road. This test stream covered a continuous zoom process from 1x wide-angle to 20x telephoto. The traditional scheme of cropping with a fixed pixel side length was configured as the control group, and the aforementioned scheme, with a statistically derived base side length and a target window side length calculated dynamically from the zoom ratio, as the experimental group. The average time per frame for extracting the local feature window and performing basic feature point detection, and the number of effective feature points detected, were used as the core evaluation indicators. Running the test and collecting the underlying log data showed that in the short-focus stage of 1x to 5x zoom, the control group, because it maintained an excessively large fixed search window, had an average region cropping and feature traversal time of up to 15 milliseconds per frame, while the experimental group, by dynamically reducing the target window side length through the product calculation, significantly reduced the average per-frame time, stabilizing it in the 4-6 millisecond range. At 15x-20x zoom, the background is extremely magnified, resulting in extremely sparse local texture features. The control group, with its fixed search window, could only capture an average of 3-5 effective feature points, leading to a very high target tracking loss rate. 
In contrast, the experimental group, with its target window side length increasing proportionally with the real-time optical zoom, could still stably capture high-quality local feature search windows containing 25-40 effective feature points even at extremely long zoom levels. The combined experimental data demonstrates that the dynamic boundary capture embodiment provided in this example can reduce overall computational overhead by approximately 60%, while simultaneously increasing feature extraction efficiency to over 98% under high zoom conditions. This fully demonstrates the superior performance advantages of the window capture mechanism, derived from preset parameters and combined with optical zoom, in multi-scale video stream processing tasks.

[0030] Specifically, as shown in Figure 2, in a specific embodiment of the present invention, the step of constructing a dynamic perspective projection model based on the device pose parameters includes: S21, the device pose parameters include the latitude and longitude coordinates, altitude, yaw angle, pitch angle, and roll angle of the front-end monitoring device; S22, a difference calculation is performed between the latitude and longitude coordinates and altitude of the front-end monitoring device and the three-dimensional geographic coordinates of the target AR tag to obtain a relative translation vector; S23, a camera extrinsic rotation matrix is constructed based on the yaw angle, pitch angle, and roll angle; S24, the dynamic perspective projection model is synthesized by combining the pre-acquired camera intrinsic matrix, the relative translation vector, and the extrinsic rotation matrix.

[0031] In an embodiment of the present invention, the device pose parameters first include the latitude and longitude coordinates, altitude, yaw angle, pitch angle, and roll angle of the front-end monitoring device. For example, the front-end monitoring device is located at 116.40 degrees east longitude and 39.90 degrees north latitude, at an altitude of 50.0 meters. A difference calculation is then performed between the latitude and longitude coordinates and altitude of the front-end monitoring device and the three-dimensional geographic coordinates of the target AR tag. Specifically, the latitude and longitude coordinates and altitude are converted into a unified Cartesian three-dimensional coordinate system based on the local East-North-Up (ENU) rectangular coordinate system. The physical distance differences between the front-end monitoring device and the target AR tag along the three orthogonal spatial axes are then calculated, thereby obtaining a relative translation vector representing the relative spatial position of the two. Next, the camera's extrinsic rotation matrix is constructed based on the yaw, pitch, and roll angles. Specifically, the rotation component about the vertical axis is extracted for the yaw angle, the rotation component about the horizontal axis for the pitch angle, and the rotation component about the camera's optical axis for the roll angle. Using sine and cosine calculations, the rotation components of these three spatial dimensions are multiplied together as matrices, resulting in a 3x3 camera extrinsic rotation matrix. This matrix represents the absolute pointing attitude of the camera lens in three-dimensional physical space, including its tilt and torsion. The pre-acquired camera intrinsic parameter matrix, the relative translation vector, and the extrinsic rotation matrix are then combined to synthesize the dynamic perspective projection model. The pre-acquired camera intrinsic parameter matrix is obtained as follows: during the manufacturing stage of the monitoring equipment, multi-angle shooting and corner point extraction calibration are performed using a standard black and white checkerboard calibration board. The matrix contains the lens focal length and the optical center pixel coordinates. During system initialization, this intrinsic parameter matrix is read directly from the underlying hardware memory to ensure mapping accuracy. During the synthesis calculation stage, the relative translation vector is appended as a translation column vector to the right side of the camera extrinsic rotation matrix to form a complete extrinsic transformation matrix with 3 rows and 4 columns. Matrix multiplication is then performed between the camera's intrinsic parameter matrix and the concatenated extrinsic transformation matrix to obtain a dynamic perspective projection model with 3 rows and 4 columns. This preferred implementation, which constructs the dynamic perspective projection model by integrating multi-dimensional spatial parameters, maps the three-dimensional coordinates of the real geographic world into the camera's image coordinate system. It can also update the three-dimensional spatial mapping relationship in real time when the camera pan-tilt unit undergoes continuous mechanical rotation or is subjected to external environmental forces, such as wind load vibration that causes the optical axis to roll and tilt.
This completely eliminates label rotation tearing caused by ignoring the roll angle, and thus lays a solid and error-free geometric topological foundation for subsequent high-precision video frame pixel coordinate transformation and anti-shake rendering.
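The synthesis of the projection model from the intrinsic matrix, the rotation matrix, and the translation vector can be sketched as below. This is only a sketch under stated assumptions: the axis convention (x to the right, y vertical, z along the optical axis), the multiplication order of the three rotation components, and all numeric values are hypothetical, since the text does not fix them. The conversion of latitude, longitude, and altitude to local coordinates is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def rotation_matrix(yaw, pitch, roll):
    """3x3 extrinsic rotation from yaw, pitch, roll given in radians.

    Assumed camera axes: x to the right, y vertical, z along the optical axis.
    Assumed composition order: R = Rz(roll) * Rx(pitch) * Ry(yaw).
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    r_yaw = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]    # vertical axis
    r_pitch = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]  # horizontal axis
    r_roll = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]   # optical axis
    return matmul(r_roll, matmul(r_pitch, r_yaw))


def projection_matrix(k, r, t):
    """Append t as a fourth column to r (3x4), then left-multiply by k (3x3)."""
    extrinsic = [r[i] + [t[i]] for i in range(3)]
    return matmul(k, extrinsic)


# Hypothetical intrinsics: focal length 1000 px, optical center (960, 540).
K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
R = rotation_matrix(math.radians(30.0), math.radians(-10.0), math.radians(2.0))
P = projection_matrix(K, R, [12.0, -3.5, 80.0])  # hypothetical translation, metres
```

The result `P` has 3 rows and 4 columns, so it can be applied directly to a four-component homogeneous point. Recomputing `R` and `P` whenever new pose parameters arrive is what makes the model dynamic.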

[0032] Specifically, in a specific embodiment of the present invention, the step of calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model includes: performing homogeneous coordinate transformation on the three-dimensional geographic coordinates of the target AR tag; performing matrix multiplication between the dynamic perspective projection model and the homogeneous three-dimensional geographic coordinates to obtain two-dimensional homogeneous coordinates; and performing perspective division normalization on the two-dimensional homogeneous coordinates to calculate the initial pixel coordinates of the target AR tag on the pixel plane of the current video frame.

[0033] In an embodiment of the present invention, firstly, the three-dimensional geographic coordinates of the target AR tag are transformed into homogeneous coordinates. Specifically, the spatial point of the target AR tag, obtained through high-precision surveying in the real geographic world and carrying latitude, longitude, and altitude attributes, is converted into three floating-point values in a unified Cartesian coordinate system. For example, the x-coordinate is 500.5, the y-coordinate is 300.2, and the depth coordinate is 150.8. Following the standard homogeneous coordinate convention, a scalar of value 1 is appended to the end of these three-dimensional geographic coordinates, raising them to a four-dimensional column vector, that is, the homogeneous three-dimensional geographic coordinates. This dimension-raising operation provides the mathematical foundation for expressing the subsequent translation and rotation as a single linear matrix multiplication. Next, the two-dimensional homogeneous coordinates are obtained. Specifically, the dynamic perspective projection model constructed above, a floating-point transformation matrix with 3 rows and 4 columns, is multiplied by the four-dimensional column vector formed from the three-dimensional geographic coordinates. Following the row-by-column dot product of matrix multiplication, the elements of each row of the dynamic perspective projection model are multiplied by the corresponding elements of the four-dimensional column vector and summed. After three such dot products, a new column vector containing three values, the two-dimensional homogeneous coordinates, is output. For example, the three components of this column vector are 1500.0, 900.0, and 3.0. Next, the initial pixel coordinates are obtained by normalization. Perspective division normalization is performed on the two-dimensional homogeneous coordinates. Since the third component of the two-dimensional homogeneous coordinates represents the distance, or scaling factor, of the target AR tag from the camera's optical center along the projection depth direction, the influence of this depth dimension must be removed to map the point onto the two-dimensional image plane. Therefore, based on the principle of perspective projection, the first two components of the two-dimensional homogeneous coordinates are each divided by the third component. For example, dividing 1500.0 by 3.0 yields a horizontal pixel coordinate of 500.0, and dividing 900.0 by 3.0 yields a vertical pixel coordinate of 300.0. After the division, the three components are reduced to a point on the standard two-dimensional image plane, which gives the initial pixel coordinates of the target AR tag on the pixel plane of the current video frame. This preferred implementation based on homogeneous coordinate transformation and perspective division turns the nonlinear three-dimensional perspective mapping into linear matrix multiplication and division operations that computer hardware executes efficiently. While keeping the spatial coordinate mapping at sub-pixel accuracy, it can complete the spatial projection calculation for massive numbers of AR tags with very low CPU clock cycle consumption, meeting the stringent timeliness requirements of front-end monitoring equipment for real-time, high-frame-rate overlay on video streams.
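The three steps of this paragraph (append 1, multiply by the 3x4 model, divide by the third component) can be sketched as follows. The function names and the sample projection matrix are hypothetical. The handling of a zero third component is an added assumption, since the text does not discuss that case.

```python
def to_image_homogeneous(p, point3d):
    """Multiply a 3x4 projection matrix by the homogeneous form of a 3D point."""
    x = list(point3d) + [1.0]  # raise to a four-component column vector
    return tuple(sum(a * b for a, b in zip(row, x)) for row in p)


def perspective_divide(h):
    """Divide the first two homogeneous components by the third."""
    u, v, w = h
    if w == 0:
        # Assumption: such a point has no finite image position.
        raise ValueError("third homogeneous component is zero")
    return u / w, v / w


def project(p, point3d):
    """Full projection from a 3D point to initial pixel coordinates."""
    return perspective_divide(to_image_homogeneous(p, point3d))


# The worked numbers from the text: (1500.0, 900.0, 3.0) -> (500.0, 300.0).
print(perspective_divide((1500.0, 900.0, 3.0)))  # (500.0, 300.0)

# Hypothetical 3x4 model and point, to show the full chain.
P = [[1000.0, 0.0, 960.0, 0.0], [0.0, 1000.0, 540.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(project(P, (1.0, 2.0, 10.0)))  # (1060.0, 740.0)
```

Each tag costs twelve multiplications and two divisions, with no trigonometric calls per point, which is the property the paragraph relies on for batch projection.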

[0034] To verify the computational efficiency and spatial mapping accuracy of the above scheme for deriving initial pixel coordinates through coordinate transformation and perspective division based on the dynamic perspective projection model, a dedicated system-level test experiment was designed and executed for a high-frequency concurrent projection task involving massive numbers of spatial points. The experiment ran on an intelligent video analysis node equipped with a general-purpose edge computing chip. The input test data consisted of the three-dimensional geographic coordinates of 10,000 real-world AR tags covering an area of 10 square kilometers, pre-calibrated with a high-precision total station. The benchmark video stream resolution was set to 1920 x 1080 pixels at a frame rate of 60 frames per second. The traditional analytic geometry scheme, which solves the three-dimensional projection relationship in multiple steps using nonlinear trigonometric functions, was configured as the control group, while the aforementioned full matrix operation scheme based on homogeneous coordinate dimension raising, joint matrix multiplication, and perspective division normalization was configured as the experimental group. The core evaluation indicators were the average processing time for completing the full coordinate projection of 10,000 target points within a single frame, and the root mean square error between the theoretical projection points and the actual calibrated pixel points. Through a continuous 120-minute stress test and the collection of performance monitoring data from the underlying operating system, it was found that the control group, because it frequently called the underlying floating-point coprocessor to perform nonlinear sine and cosine operations for each coordinate point and additionally performed complex non-homogeneous spatial translation superposition, had an average processing time of up to 45 milliseconds for the projection of 10,000 points in a single frame, far exceeding the per-frame limit and causing significant frame drops in the video stream. Furthermore, the cumulative floating-point truncation during the multi-step analytical calculation led to a final root mean square error of 3.5 pixels. In contrast, the experimental group, based on low-level cascaded optimization of linear matrix multiplication and a single basic division step, parallelized and pipelined the dimension reduction from three-dimensional physical space points to the two-dimensional pixel plane. This significantly reduced the average processing time for the simultaneous projection of 10,000 points in a single frame, stabilizing it in the range of 8 to 10 milliseconds, well below the single-frame processing limit of approximately 16.7 milliseconds for a 60 fps video stream. Meanwhile, owing to high-precision floating-point matrix transformation, its root mean square error was kept at a sub-pixel level of 0.8 pixels. The above quantitative comparison shows that the spatial mapping embodiment based on two-dimensional homogeneous projection and perspective division normalization provided in this embodiment can reduce the computational cost of the coordinate system transformation by nearly 80%, while significantly improving the positioning accuracy of pixel coordinates under massive concurrent target projection. This confirms the advantages and coordinate transformation reliability of the homogeneous matrix transformation and perspective division mechanism in high-frequency dynamic rendering of augmented reality video stream tags at very large scale.

[0035] Specifically, as shown in Figure 3, in this embodiment of the invention, the step of calculating the average pixel displacement vector of image feature points within the local feature search window between the current video frame and the adjacent historical video frame includes: the local feature search window includes a first local feature search window located in the adjacent historical video frame and a second local feature search window located in the current video frame; the first local feature search window and the second local feature search window are each converted to grayscale; multiple background feature points are extracted in the first local feature search window using a corner detection algorithm; target pixels that match the background feature points are searched for in the second local feature search window, and the coordinate difference between each target pixel and the corresponding background feature point is calculated to obtain the pixel displacement vector corresponding to each background feature point; and the average pixel displacement vector is obtained by averaging all pixel displacement vectors.

[0036] In a specific embodiment of the invention, firstly, grayscale conversion is performed on the two search windows. Specifically, for each pixel within the window, its RGB three-channel values are extracted and multiplied by the corresponding psychovisual weight coefficients. These coefficients come from the ITU-R BT.601 international standard: 0.299 for the red channel, 0.587 for the green channel, and 0.114 for the blue channel. The three weighted products are then summed to give a single brightness value. This step converts the original color image matrix into a single-channel 8-bit grayscale matrix, significantly reducing memory overhead while preserving the spatial gradient structure of the image. Next, a corner detection algorithm is used to extract background feature points within the first local feature search window. Specifically, the grayscale gradient of each pixel in the window's grayscale matrix is calculated in the horizontal and vertical directions, and a spatial gradient covariance matrix is constructed over its local neighborhood. The two eigenvalues of this covariance matrix are then calculated, and the eigenvalue determination logic is applied: the smaller of the two eigenvalues is compared with a preset corner quality level threshold (e.g., set to 0.01). If the smaller eigenvalue is less than or equal to the preset threshold, the pixel lies in a smooth region or on a single edge of the image, where optical flow tracking is prone to slippage, so it is judged to be an invalid corner and removed. If the smaller eigenvalue is greater than the preset threshold, the pixel has strong grayscale changes in multiple directions and is retained as a candidate corner.
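The grayscale weighting and the smaller-eigenvalue test just described can be sketched as below. This is an illustrative sketch: the function names are hypothetical, the gradients are passed in as precomputed (gx, gy) pairs for one pixel's neighborhood, and the threshold is compared directly against the eigenvalue on whatever scale the gradients use, which is an assumption about how the 0.01 value is applied.

```python
import math


def to_gray(r, g, b):
    """BT.601 weighted sum of 8-bit RGB, rounded to an 8-bit gray value."""
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def min_eigenvalue(gradients):
    """Smaller eigenvalue of the 2x2 gradient covariance matrix.

    `gradients` is a list of (gx, gy) pairs from one pixel's neighborhood.
    """
    sxx = sum(gx * gx for gx, _ in gradients)
    syy = sum(gy * gy for _, gy in gradients)
    sxy = sum(gx * gy for gx, gy in gradients)
    # Closed form for the symmetric matrix [[sxx, sxy], [sxy, syy]].
    return 0.5 * (sxx + syy - math.sqrt((sxx - syy) ** 2 + 4.0 * sxy * sxy))


def is_candidate_corner(gradients, threshold=0.01):
    """Keep the pixel only if the smaller eigenvalue exceeds the threshold."""
    return min_eigenvalue(gradients) > threshold


edge = [(1.0, 0.0)] * 4                                        # one direction only
corner = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]      # two directions
print(is_candidate_corner(edge), is_candidate_corner(corner))  # False True
```

A pure edge has gradient energy in one direction only, so its smaller eigenvalue is zero and it is rejected, while a neighborhood with gradients in two directions passes.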
It should be noted that the preset corner quality level threshold is obtained in advance during the system's offline debugging phase. Specifically, multiple rounds of gradient distribution density statistics are collected on a large number of typical static urban monitoring background samples (e.g., samples containing building outlines and road markings) to derive the floating-point boundary value that stably screens out smooth regions and weak edge noise. After the eigenvalue determination, to prevent candidate corners from clustering excessively, a preset minimum pixel spacing threshold (e.g., set to 5 pixels) is introduced for spatial distance constraint screening. The set of pixel coordinates that satisfy both the eigenvalue condition and the spatial distance constraint is taken as the extracted background feature points. Then, matching target pixels are searched for in the second local feature search window. Using the pyramid optical flow tracking algorithm, the position of each of the aforementioned background feature points in the new frame is tracked within the bounds of the second local feature search window. When a highly similar grayscale matching region is located, the center coordinates of that region are extracted as the target pixel. Next, the horizontal and vertical coordinates of the corresponding background feature point are subtracted from the horizontal and vertical coordinates of the target pixel to obtain the two-dimensional pixel displacement vector corresponding to each background feature point. Finally, the average of all successfully tracked pixel displacement vectors is calculated. Specifically, the horizontal and vertical components of the displacement vectors are each summed, and each sum is divided by the total number of valid feature points involved in the calculation, giving the average pixel displacement vector that represents the overall motion trend of the local region.
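The spacing constraint and the displacement averaging in this paragraph can be sketched as follows. The optical flow tracking itself is not reproduced here. The function names are hypothetical, the greedy first-come spacing filter is one possible reading of the distance constraint, and returning a zero vector when no point was tracked is an added assumption.

```python
import math


def enforce_min_spacing(corners, min_dist=5.0):
    """Greedy filter: keep a corner only if it is at least min_dist pixels
    away from every corner already kept (input order decides priority)."""
    kept = []
    for x, y in corners:
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
    return kept


def displacement_vectors(prev_pts, curr_pts):
    """Target pixel minus background feature point, per tracked pair."""
    return [(cx - px, cy - py)
            for (px, py), (cx, cy) in zip(prev_pts, curr_pts)]


def mean_displacement(vectors):
    """Component-wise sum divided by the number of valid vectors."""
    if not vectors:
        return 0.0, 0.0  # assumption: no tracked points means no correction
    n = len(vectors)
    return (sum(dx for dx, _ in vectors) / n,
            sum(dy for _, dy in vectors) / n)


prev_pts = [(10.0, 10.0), (20.0, 20.0)]   # positions in the historical frame
curr_pts = [(12.0, 9.0), (24.0, 21.0)]    # tracked positions in the current frame
print(mean_displacement(displacement_vectors(prev_pts, curr_pts)))  # (3.0, 0.0)
```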

[0037] It should be noted that the adjacent historical video frame is set to the frame immediately preceding the current video frame in the time series, that is, the frame whose timestamp precedes that of the current frame by a single video acquisition cycle. This ensures that the subsequent pyramid optical flow tracking algorithm has high texture coherence and pixel matching accuracy when dealing with mechanical micro-vibrations. In addition, as an alternative implementation of the present invention, under working conditions where the computing power of the front-end device is limited or the monitoring picture drifts extremely slowly, the adjacent historical video frame can also be dynamically configured as a historical key frame that is a preset step size N (e.g., N=3 or 5) before the current video frame. This reduces the overall computing load of the system while increasing the difference between frames, making minute movements easier to capture.

[0038] Specifically, as shown in Figure 4, in specific embodiments of the present invention, the method further includes removing abnormal features from the pixel displacement vectors corresponding to the background feature points. Specifically: randomly sampling all pixel displacement vectors multiple times to obtain local data samples, and fitting a basic affine transformation model based on the local data samples; calculating the reprojection error between the pixel displacement vector of each background feature point and the fitted basic affine transformation model; and identifying background feature points whose reprojection error is greater than a preset error threshold as foreground motion interference points and removing them, while retaining the pixel displacement vectors corresponding to background feature points whose reprojection error is less than or equal to the preset error threshold, thus completing the removal of abnormal features.

[0039] In an embodiment of the present invention, firstly, based on the random sample consensus algorithm, multiple random samplings are performed on all successfully tracked feature points. In each sampling, three non-collinear background feature points are randomly selected to form a local data sample. From the coordinate changes of these three feature points between the preceding and current frames, a basic affine transformation model covering translation and rotation mapping is solved. Within a preset number of iterations (e.g., 100 iterations), the random sampling and model fitting operations are repeated to find the optimal rigid body transformation model, namely the one with the most consensus points. Next, for each fitted basic affine transformation model, the reprojection error of the remaining feature points is calculated. Specifically, the original pixel coordinates of a background feature point in the historical video frame are input into the currently fitted basic affine transformation model, and a theoretical predicted coordinate is obtained through matrix multiplication and translation addition. The two-dimensional Euclidean distance between this theoretical predicted coordinate and the target pixel coordinates obtained from actual optical flow tracking is then calculated, and this distance is defined as the reprojection error. Subsequently, outlier detection and removal are performed. The system identifies feature points whose reprojection error is greater than a preset error threshold (e.g., a pixel distance of 2.5) as foreground motion interference points and removes them, while feature points whose error is less than or equal to this threshold are retained. The displacement vectors corresponding to the retained feature points are the background displacement data that conform to the rigid body transformation constraint.
It should be noted that in this embodiment, the preset error threshold is obtained during the offline development phase of the algorithm by collecting a large number of typical surveillance videos containing moving vehicles and pedestrians. The projection error distribution curves caused by camera mechanical micro-vibration and by independent foreground motion are analyzed statistically and separately, and a floating-point critical value that effectively separates the two peaks of the bimodal distribution is found and fixed in the system configuration file. Through the above mechanism based on multiple sampling and error verification, the tracking algorithm is prevented from being severely misled even when the image is occluded by large moving objects such as heavy trucks. This method effectively removes environmental motion noise and provides a high-fidelity global background motion benchmark for the subsequent image stabilization compensation of the system.

To verify the accuracy of the above method, which fits an affine model by multiple random sampling and removes outlier feature points based on reprojection error, in extracting the image stabilization benchmark under complex foreground interference, a targeted high dynamic range scene comparison experiment was designed and executed. The experimental test dataset consisted of a 60-minute high dynamic range video stream collected by an ultra-high-definition surveillance camera deployed at a busy urban intersection, containing heavy vehicle and pedestrian traffic with severe occlusion during the morning and evening rush hours. In this video stream, independently moving foreground objects occupied more than 40% of the image on average, and the camera itself was subject to high-frequency wind-induced vibration. A global average displacement estimation scheme that directly takes a simple arithmetic average over all detected feature points, without any consistency error screening, was configured as the control group, while the aforementioned scheme based on iterative sampling and fitting of the basic affine transformation model and strict removal of foreground interference points using a preset error threshold of 2.5 pixels was configured as the experimental group. The root mean square error between the background motion reference displacement calculated by the system and the theoretical pixel displacement converted from the actual camera pose change, acquired synchronously by a high-precision physical gyroscope, was used as the core quantitative evaluation index. Through continuous processing and the collection of low-level displacement tracking data on a high-concurrency analysis server, it was found that the control group, when processing video frames containing dense traffic, blindly included the specular reflection points and license plate corner features attached to the surfaces of fast-moving vehicles in the background displacement calculation pool, causing the calculated overall displacement vector to be severely pulled in one direction by the moving traffic. The root mean square error of the displacement in a single frame rose to a maximum of 18.5 pixels, leading to severe distortion of the image stabilization baseline and failure of the coordinate overlay. In contrast, the experimental group, by introducing rigid body transformation constraint verification based on local data sample fitting in the middle of the algorithm, quickly locked onto the consensus feature point set representing the real background buildings over multiple iterations. Based on the preset error threshold, it identified and eliminated more than 95% of the abnormal interference feature points attached to moving vehicles, yielding a clean background feature point set that characterizes the camera's physical vibration. This set was extremely stable, and the root mean square error of its background displacement tracking was kept within a sub-pixel range of 0.6 to 0.9 pixels throughout the morning and evening peak testing periods. The quantitative comparison data from multiple sets of complex working conditions show that the abnormal feature removal embodiment based on multiple sampling and error verification provided in this example can improve the accuracy of the background displacement benchmark by more than 20 times under occlusion and interference from dynamic foreground objects occupying a very large proportion of the image. This demonstrates the anti-interference robustness and practical industrial value of the abnormal feature removal mechanism in stripping away motion noise in complex environments and locking onto the underlying camera vibration trend.
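The sampling, fitting, and error screening procedure of paragraph [0039] can be sketched as follows. This is a minimal sketch, not the patented implementation: it fits a general six-parameter affine model from three point pairs by Cramer's rule, uses a fixed random seed for repeatability, and all function names and sample coordinates are hypothetical.

```python
import math
import random


def det3(m):
    """Determinant of a 3x3 matrix stored as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_affine(src, dst):
    """Solve x' = a*x + b*y + c and y' = d*x + e*y + f from three pairs.
    Returns (a, b, c, d, e, f), or None if the three points are collinear."""
    m = [[x, y, 1.0] for x, y in src]
    d = det3(m)
    if abs(d) < 1e-9:
        return None

    def solve(rhs):
        # Cramer's rule: replace one column at a time by the right-hand side.
        out = []
        for col in range(3):
            mc = [row[:] for row in m]
            for i in range(3):
                mc[i][col] = rhs[i]
            out.append(det3(mc) / d)
        return out

    return tuple(solve([p[0] for p in dst]) + solve([p[1] for p in dst]))


def reprojection_error(model, p, q):
    """Euclidean distance between the model's prediction for p and q."""
    a, b, c, d, e, f = model
    return math.hypot(a * p[0] + b * p[1] + c - q[0],
                      d * p[0] + e * p[1] + f - q[1])


def remove_outliers(prev_pts, curr_pts, threshold=2.5, iterations=100, seed=0):
    """Indices of points consistent with the best-supported affine model."""
    idx = range(len(prev_pts))
    if len(prev_pts) < 3:
        return list(idx)  # assumption: too few points to fit a model
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        sample = rng.sample(idx, 3)
        model = fit_affine([prev_pts[i] for i in sample],
                           [curr_pts[i] for i in sample])
        if model is None:
            continue
        inliers = [i for i in idx
                   if reprojection_error(model, prev_pts[i], curr_pts[i]) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    return best


# Eight background points shifted by (3, -2), two foreground points by (30, 0).
background = [(10.0, 10.0), (100.0, 20.0), (200.0, 15.0), (30.0, 120.0),
              (120.0, 110.0), (210.0, 130.0), (20.0, 220.0), (110.0, 210.0)]
foreground = [(150.0, 60.0), (60.0, 160.0)]
prev_pts = background + foreground
curr_pts = ([(x + 3.0, y - 2.0) for x, y in background]
            + [(x + 30.0, y) for x, y in foreground])
kept = remove_outliers(prev_pts, curr_pts)
```

In this synthetic example, `kept` holds the indices of the eight background points, and only their displacement vectors would then be averaged.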

[0040] Specifically, in the embodiments of the present invention, the method further includes performing an anti-shake tolerance dead zone determination on the average pixel displacement vector, specifically: determining whether the magnitude of the average pixel displacement vector is greater than a preset pixel fluctuation threshold; if the magnitude of the average pixel displacement vector is less than or equal to the pixel fluctuation threshold, setting the average pixel displacement vector to zero and then performing reverse displacement compensation; and if the magnitude of the average pixel displacement vector is greater than the pixel fluctuation threshold, performing reverse displacement compensation directly.

[0041] In a specific embodiment of the present invention, firstly, the sum of the squares of the horizontal and vertical components of the average pixel displacement vector is calculated according to the two-dimensional planar distance formula, and the square root of this sum is taken to obtain a scalar representing the absolute offset length, i.e., the magnitude. The calculated magnitude is then compared, as a floating-point value, with a preset pixel fluctuation threshold. The preset pixel fluctuation threshold is set directly from the feature tracking error statistics generated by the front-end monitoring device in historical operating states; for example, it is set to 1.5 pixels. Next, a flow control operation is performed based on the result of the comparison. If the magnitude of the average pixel displacement vector is less than or equal to the preset pixel fluctuation threshold, the algorithm determines that the extremely small feature point displacement currently observed is caused by the algorithm's own noise floor or sensor thermal noise rather than by real physical mechanical vibration of the camera. The horizontal and vertical components of the average pixel displacement vector are therefore forcibly reset to zero, and the resulting zero vector is passed to the next-stage rendering pipeline for reverse displacement compensation. If the magnitude of the average pixel displacement vector is strictly greater than the preset pixel fluctuation threshold, the algorithm determines that real physical mechanical vibration of the camera exceeding the environmental and electronic noise floor has occurred. In this case, the original components of the average pixel displacement vector are preserved intact and reverse displacement compensation is performed directly. This preferred implementation, which constructs the anti-shake tolerance dead zone based on magnitude comparison, cuts off the reverse compensation oscillation caused by small tracking errors when the camera is completely stationary. It prevents the target AR tag from shaking continuously in place at high frequency and small amplitude against a static background, thereby presenting a stable AR overlay to the terminal monitoring user.
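The dead zone determination reduces to one magnitude comparison, sketched below. The function name is hypothetical and the default threshold is the 1.5 pixel example value from the text.

```python
import math


def apply_dead_zone(mean_disp, threshold=1.5):
    """Zero the average displacement when its magnitude is within the
    pixel fluctuation threshold; otherwise pass it through unchanged."""
    dx, dy = mean_disp
    if math.hypot(dx, dy) <= threshold:
        return 0.0, 0.0   # treated as tracking or sensor noise
    return dx, dy         # treated as real camera vibration


print(apply_dead_zone((1.0, 1.0)))    # (0.0, 0.0), magnitude is about 1.41
print(apply_dead_zone((5.5, -3.2)))   # (5.5, -3.2), magnitude is about 6.36
```

Because the zeroed vector still flows into the compensation step, the downstream code path is the same in both branches, and only the correction amount differs.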

[0042] Specifically, in a specific embodiment of the present invention, the step of performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates includes: resolving the horizontal and vertical offsets of the average pixel displacement vector; negating the horizontal and vertical offsets to obtain the reverse compensation horizontal and vertical components; and adding the reverse compensation horizontal and vertical components to the horizontal and vertical coordinates of the initial pixel coordinates, respectively, to obtain the target rendering coordinates.

[0043] In a specific embodiment of the present invention, firstly, the horizontal and vertical offsets of the average pixel displacement vector are analyzed. Specifically, based on the underlying data structure, the two orthogonal floating-point numerical components of the average pixel displacement vector in a two-dimensional Cartesian coordinate system are read. For example, the horizontal offset is +5.5 pixels and the vertical offset is -3.2 pixels. Then, the horizontal and vertical offsets are inverted to obtain the inverse compensation horizontal and vertical components. Specifically, the extracted horizontal and vertical offsets are inverted in the arithmetic logic unit of the central processing unit to achieve an absolute 180-degree flip of the vector space direction. For example, the inverse of the aforementioned +5.5 pixel horizontal offset is used to obtain the -5.5 pixel inverse compensation horizontal component, and the inverse of the -3.2 pixel vertical offset is used to obtain the +3.2 pixel inverse compensation vertical component. Then, the inverse compensation horizontal component and the inverse... The target rendering coordinates are obtained by adding the compensated ordinate components to the x and y coordinates of the initial pixel coordinates, respectively. Specifically, the signed inverse compensated ordinate component is added to the x-axis value of the initial pixel coordinates calculated by dynamic perspective projection, and the signed inverse compensated ordinate component is added to the y-axis value of the initial pixel coordinates. For example, if the x-axis of the initial pixel coordinates is 800.0 and the y-axis is 600.0, after addition, the x-axis becomes 794.5 and the y-axis becomes 603.2, thus deriving the final target rendering coordinates used for graphics pipeline rendering. 
This preferred implementation, based on negation and independent per-axis addition, cancels in the pixel plane the vibration error that the external environment induces in the front-end monitoring equipment. It avoids the high computing power overhead of three-dimensional inverse projection and thus provides coordinate-level anti-shake correction output for augmented reality video streams within a millisecond time window.
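The arithmetic of this example can be checked with a minimal Python sketch. The function name `reverse_compensate` and the tuple representation are illustrative assumptions, not part of the disclosure.

```python
from typing import Tuple

def reverse_compensate(initial_xy: Tuple[float, float],
                       avg_displacement: Tuple[float, float]) -> Tuple[float, float]:
    """Negate the average displacement and add it to the initial coordinates."""
    x0, y0 = initial_xy
    dx, dy = avg_displacement
    comp_x, comp_y = -dx, -dy            # reverse-compensated components
    return (x0 + comp_x, y0 + comp_y)    # target rendering coordinates

# Values taken from the example in the paragraph above.
target = reverse_compensate((800.0, 600.0), (5.5, -3.2))
```

With the values from the paragraph, `target` is approximately (794.5, 603.2), up to floating-point rounding.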

[0044] It is important to note that in actual urban video surveillance scenarios, the mechanical vibrations of front-end monitoring equipment caused by high-altitude wind loads, resonance from passing heavy vehicles, or gimbal motor gear backlash often manifest as high-frequency, very small-angle attitude micro-deflections (such as minute changes in yaw, pitch, or roll angle). According to the principles of spatial analytic geometry and projective geometry, when the camera's deflection angle approaches zero, the three-dimensional perspective deformation it induces in a local area of the image pixel plane can be approximated as a pure linear translation on a two-dimensional plane. Therefore, in this embodiment, parsing the horizontal and vertical offsets of the average pixel displacement vector and negating them for reverse displacement compensation is a dimension-reduction approximation strategy adopted specifically for these very small-angle micro-vibration conditions. This strategy avoids three-dimensional matrix inversion and global reprojection operations, while ensuring that the visual overlay of AR tags within the local feature search window does not show obvious perspective tearing or deformation. It thereby achieves sub-pixel-level, high-frequency stabilization of the video overlay at very low CPU clock cycle cost, supporting the core objectives of this invention: reducing system computing power overhead and improving the visual stability of video overlay.
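The translation approximation can be checked numerically with a short Python sketch. It uses a one-dimensional pinhole model in which a small pan adds a fixed angle to every viewing ray. The focal length, deflection angle, and window offsets are hypothetical values chosen for illustration; they do not come from the disclosure.

```python
import math

def pixel_shift_under_pan(u: float, focal_px: float, theta_rad: float) -> float:
    """Horizontal shift of an image point located u pixels from the principal
    point when its viewing-ray angle changes by theta_rad (1-D pinhole model)."""
    alpha = math.atan2(u, focal_px)                  # ray angle before the deflection
    return focal_px * math.tan(alpha + theta_rad) - u

FOCAL_PX = 4000.0                 # hypothetical focal length in pixels (long-focus)
THETA = math.radians(0.05)        # hypothetical micro-deflection of 0.05 degrees
offsets = [-100.0, -50.0, 0.0, 50.0, 100.0]   # points across a 200-pixel window

shifts = [pixel_shift_under_pan(u, FOCAL_PX, THETA) for u in offsets]
spread = max(shifts) - min(shifts)   # how far the shift deviates from a pure translation
```

Under these assumed values every point in the window moves by roughly `FOCAL_PX * THETA` (about 3.5 pixels), and the difference between the largest and smallest shift stays well under a hundredth of a pixel. This is consistent with treating the deformation inside a local window as a uniform translation.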

[0045] This invention also provides an AR-based video tag dynamic overlay anti-shake system, comprising: a video pose acquisition module, which determines the front-end monitoring device, acquires a real-time video stream, and acquires the device pose parameters of the front-end monitoring device at the corresponding timestamp; an initial coordinate calculation module, which constructs a dynamic perspective projection model based on the device pose parameters and calculates the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; a search window planning module, which plans a local feature search window based on the initial pixel coordinates, wherein the local feature search window represents a local image region centered on the initial pixel coordinates; a pixel displacement calculation module, which calculates the average pixel displacement vector of image feature points in the current video frame and adjacent historical video frames within the local feature search window; and a reverse compensation rendering module, which performs reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates and renders the target AR tag to the current video frame based on the target rendering coordinates.

[0046] Specifically, in a specific embodiment of the present invention, the search window planning module includes: a zoom ratio parsing unit, which parses the real-time optical zoom ratio in the device pose parameters; a target side length calculation unit, which multiplies the real-time optical zoom ratio by a preset base window side length to obtain the target window side length of the target feature tracking window; and a local window delineation unit, which, based on the current video frame, crops a region whose side length equals the target window side length and whose center point is the initial pixel coordinates, to obtain the local feature search window.
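The three units can be sketched as one Python function. The name `plan_search_window`, the rounding to integer pixel bounds, and the clipping to the frame edges are assumptions made for illustration. The `max_side` parameter reflects the maximum side length constraint mentioned for long-focus scenarios; its value here is hypothetical.

```python
from typing import Tuple

def plan_search_window(center_xy: Tuple[float, float],
                       zoom_ratio: float,
                       base_side: float,
                       max_side: float,
                       frame_size: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a square window centered on center_xy."""
    side = min(zoom_ratio * base_side, max_side)   # target side length, capped
    half = side / 2.0
    cx, cy = center_xy
    width, height = frame_size
    # Clip to the frame so the crop never reads outside the image.
    left = max(0, int(round(cx - half)))
    top = max(0, int(round(cy - half)))
    right = min(width, int(round(cx + half)))
    bottom = min(height, int(round(cy + half)))
    return left, top, right, bottom

# 4x zoom with a 64-pixel base side gives a 256-pixel window around (800, 600).
window = plan_search_window((800.0, 600.0), 4.0, 64.0, 512.0, (1920, 1080))
```

With a NumPy-style image array, the window would then be cropped as `frame[top:bottom, left:right]`.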

[0047] Furthermore, in a preferred embodiment of the present invention, the AR-based video tag dynamic overlay anti-shake method and system are specifically deployed and applied to high-altitude command point monitoring and core transportation hub (such as railway station squares and key trunk roads) scenarios in the urban three-dimensional security prevention and control system.

[0048] Specifically, taking core transportation hub scenarios such as train station plazas as an example, large vehicles are densely packed in these environments, easily obscuring surveillance footage over large areas, and traffic flow often causes low-frequency mechanical resonance in the surveillance poles. This embodiment utilizes a RANSAC-based anomaly feature removal mechanism to directly filter out interference data generated by vehicle movement. The system relies solely on the extracted clean background displacement vector for reverse compensation to counteract pole vibration. Thus, even in heavy traffic, the superimposed AR police force distribution or warning tags can be stably anchored to their actual geographical location, avoiding drift and displacement due to traffic flow, ensuring the accuracy of command and dispatch.
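The filtering idea can be sketched in Python as follows. This is a simplified illustration and not the claimed procedure: it fits a translation-only model from a single sampled vector, whereas the claims describe fitting a basic affine transformation model. The threshold, iteration count, and sample data are hypothetical.

```python
import random
from typing import List, Tuple

Vec = Tuple[float, float]

def ransac_filter_displacements(vectors: List[Vec], threshold: float = 1.0,
                                iterations: int = 50, seed: int = 0) -> List[Vec]:
    """Keep the largest set of displacement vectors that agree with one
    sampled translation hypothesis to within `threshold` pixels."""
    if not vectors:
        return []
    rng = random.Random(seed)
    best_inliers: List[Vec] = []
    for _ in range(iterations):
        mx, my = rng.choice(vectors)     # minimal sample: one vector defines a translation
        inliers = [v for v in vectors
                   if ((v[0] - mx) ** 2 + (v[1] - my) ** 2) ** 0.5 <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

def mean_vector(vectors: List[Vec]) -> Vec:
    """Average the retained displacement vectors."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)

# Seven background vectors near (5.5, -3.2) plus three vectors from moving vehicles.
background = [(5.4, -3.1), (5.6, -3.3), (5.5, -3.2), (5.3, -3.2),
              (5.7, -3.2), (5.5, -3.0), (5.5, -3.4)]
vehicles = [(30.0, 1.0), (28.0, 2.0), (31.0, 0.0)]
kept = ransac_filter_displacements(background + vehicles)
clean_mean = mean_vector(kept)
```

In this example the vehicle vectors fall outside the consensus set, so the average is computed from the background vectors only and stays close to (5.5, -3.2).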

[0049] This solution is also suited to high-altitude surveillance scenarios near government compounds or key schools. When the device zooms in at 15x to 20x to observe distant details, the image texture is sparse, and high-altitude wind loads can easily cause high-frequency micro-vibrations and optical-axis roll of the lens. To address this, this embodiment uses a dynamic search window mechanism linked to the zoom ratio to enlarge the search area and maintain the feature detection rate, and uses a built-in maximum side length constraint to prevent excessive computational load. It also incorporates extrinsic parameter projection compensation, including the roll angle, to offset image rotation distortion caused by wind loads. This ensures the accuracy of AR real-scene overlay under long-focus conditions and provides stable visualization support for digital twin systems.
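A perspective projection whose extrinsic rotation includes the roll angle might be sketched in Python as below. The axis conventions (x right, y down, z forward), the rotation order (yaw, then pitch, then roll), and the function names are assumptions for illustration. The disclosure does not specify them, and it derives the relative vector from latitude, longitude, and altitude, which this sketch takes as already computed.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def extrinsic_rotation(yaw: float, pitch: float, roll: float) -> Matrix:
    """Assumed convention: yaw about y, pitch about x, roll about the optical axis z."""
    cyw, syw = math.cos(yaw), math.sin(yaw)
    cpt, spt = math.cos(pitch), math.sin(pitch)
    crl, srl = math.cos(roll), math.sin(roll)
    r_yaw = [[cyw, 0.0, syw], [0.0, 1.0, 0.0], [-syw, 0.0, cyw]]
    r_pitch = [[1.0, 0.0, 0.0], [0.0, cpt, -spt], [0.0, spt, cpt]]
    r_roll = [[crl, -srl, 0.0], [srl, crl, 0.0], [0.0, 0.0, 1.0]]
    return matmul(r_roll, matmul(r_pitch, r_yaw))

def project_tag(rel_xyz: Sequence[float], yaw: float, pitch: float, roll: float,
                fx: float, fy: float, cx: float, cy: float
                ) -> Optional[Tuple[float, float]]:
    """Project a tag position given relative to the camera into pixel coordinates."""
    r = extrinsic_rotation(yaw, pitch, roll)
    xc, yc, zc = [sum(r[i][k] * rel_xyz[k] for k in range(3)) for i in range(3)]
    if zc <= 0.0:
        return None                               # tag is behind the camera
    return (fx * xc / zc + cx, fy * yc / zc + cy)  # perspective division

# Hypothetical intrinsics: focal length 1000 px, principal point (960, 540).
no_roll = project_tag((1.0, 0.0, 10.0), 0.0, 0.0, 0.0, 1000.0, 1000.0, 960.0, 540.0)
with_roll = project_tag((1.0, 0.0, 10.0), 0.0, 0.0, math.pi / 2,
                        1000.0, 1000.0, 960.0, 540.0)
```

Under these assumptions, a 90-degree roll moves the projected point from the right of the principal point to directly below it. This illustrates why an extrinsic model without roll would misplace the tag when the lens rotates about its optical axis.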

[0050] As described above, the AR-based video tag dynamic overlay anti-shake method disclosed in this embodiment of the invention can overcome the shortcomings of large drift errors caused by relying solely on mechanical pose, high computational power consumption caused by global matching, and high-frequency jitter of tags caused by minor device vibrations. It improves tag mapping accuracy, reduces system computational power consumption, and enhances the visual stability of video overlay. The AR-based video tag dynamic overlay anti-shake method includes: S1 determining the front-end monitoring device, acquiring the real-time video stream, and acquiring the device pose parameters of the front-end monitoring device at the corresponding timestamp; S2 constructing a dynamic perspective projection model based on the device pose parameters, and calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; S3 planning a local feature search window based on the initial pixel coordinates, where the local feature search window represents a local image region centered on the initial pixel coordinates; S4 calculating the average pixel displacement vector of the image feature points of the current video frame and adjacent historical video frames within the local feature search window; S5 performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates, and rendering the target AR tag to the current video frame based on the target rendering coordinates. This invention first obtains the real-time video stream and the device pose parameters at the corresponding timestamp from the front-end monitoring device, and calculates the initial pixel coordinates of the target AR tag in the current video frame using a dynamic perspective projection model, thereby establishing a macroscopic spatial mapping benchmark. 
Subsequently, based on these initial pixel coordinates, a local feature search window is defined as a local image region centered on the initial pixel coordinates. Restricting tracking to this local boundary rather than the full frame eliminates redundant background pixel matching, directly addressing the excessive computational power consumption of existing technologies. Within this local spatial constraint, the average pixel displacement vector of image feature points between the current video frame and adjacent historical video frames is then calculated. This approach no longer relies solely on camera pan-tilt parameters, which are prone to mechanical errors, and instead captures inter-frame relative offsets from real image features, effectively reducing the cumulative tag drift error. The key step is to use the obtained average pixel displacement vector as a correction factor for reverse displacement compensation of the initial pixel coordinates. This yields target rendering coordinates with anti-shake properties, and the target AR tag is rendered onto the current video frame accordingly. The reverse compensation mechanism uses the extracted instantaneous inter-frame displacement to apply an opposing offset to the initial projection coordinates, smoothing out the abrupt high-frequency coordinate changes caused by minor device vibrations. This addresses the problems of high-frequency tag jitter and visual tearing, ultimately achieving dynamic anti-shake rendering of AR tags that combines low computational cost, high spatial overlay accuracy, and good visual stability. 
Compared to existing technologies, this invention overcomes the shortcomings of relying solely on mechanical pose leading to large drift errors, global matching leading to high computational cost, and minor device vibrations causing high-frequency tag jitter. It improves tag mapping accuracy, reduces system computational cost, and enhances the visual stability of video overlay.
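Steps S4 and S5, together with the dead-zone check recited in the claims, can be sketched in Python as follows. The sketch assumes that feature matching has already produced paired point lists for the historical and current frames. The function names and the 0.5-pixel fluctuation threshold are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def average_displacement(prev_pts: List[Point], curr_pts: List[Point]) -> Point:
    """Mean of (current - historical) coordinate differences over matched points."""
    if not prev_pts or len(prev_pts) != len(curr_pts):
        raise ValueError("need equally sized, non-empty lists of matched points")
    n = len(prev_pts)
    dx = sum(c[0] - p[0] for p, c in zip(prev_pts, curr_pts)) / n
    dy = sum(c[1] - p[1] for p, c in zip(prev_pts, curr_pts)) / n
    return dx, dy

def apply_dead_zone(vec: Point, fluctuation_threshold: float) -> Point:
    """Zero the vector when its magnitude does not exceed the threshold."""
    if math.hypot(vec[0], vec[1]) <= fluctuation_threshold:
        return (0.0, 0.0)
    return vec

def stabilized_coordinates(initial_xy: Point, prev_pts: List[Point],
                           curr_pts: List[Point],
                           fluctuation_threshold: float = 0.5) -> Point:
    """S4 and S5: average displacement, dead zone, then reverse compensation."""
    dx, dy = apply_dead_zone(average_displacement(prev_pts, curr_pts),
                             fluctuation_threshold)
    return (initial_xy[0] - dx, initial_xy[1] - dy)

# Two matched points moving by (6, -4) and (5, -3): average displacement (5.5, -3.5).
moved = stabilized_coordinates((800.0, 600.0),
                               [(10.0, 10.0), (20.0, 20.0)],
                               [(16.0, 6.0), (25.0, 17.0)])
# A 0.25-pixel fluctuation falls inside the dead zone and leaves the tag untouched.
still = stabilized_coordinates((800.0, 600.0), [(10.0, 10.0)], [(10.25, 10.0)])
```

Here `moved` is (794.5, 603.5) and `still` is (800.0, 600.0). The dead zone keeps the tag from reacting to sub-threshold noise, while larger displacements are fully compensated.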

[0051] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An AR-based video tag dynamic overlay anti-shake method, characterized in that it comprises: S1, determining the front-end monitoring device, acquiring the real-time video stream, and acquiring the device pose parameters of the front-end monitoring device at the corresponding timestamp; S2, constructing a dynamic perspective projection model based on the device pose parameters, and calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; S3, planning a local feature search window based on the initial pixel coordinates, wherein the local feature search window represents a local image region centered on the initial pixel coordinates; S4, calculating the average pixel displacement vector of the image feature points of the current video frame and adjacent historical video frames within the local feature search window; S5, performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates, and rendering the target AR tag to the current video frame based on the target rendering coordinates.

2. The AR-based video tag dynamic overlay anti-shake method according to claim 1, characterized in that the step of planning a local feature search window based on the initial pixel coordinates includes: parsing the real-time optical zoom ratio in the device pose parameters; multiplying the real-time optical zoom ratio by the preset base window side length to obtain the target window side length of the target feature tracking window; and, based on the current video frame, cropping a region whose side length equals the target window side length and whose center point is the initial pixel coordinates, to obtain the local feature search window.

3. The AR-based video tag dynamic overlay anti-shake method according to claim 1, characterized in that the step of constructing a dynamic perspective projection model based on the device pose parameters includes: the device pose parameters include the latitude and longitude coordinates, altitude, yaw angle, pitch angle, and roll angle of the front-end monitoring device; comparing the latitude and longitude coordinates and altitude of the front-end monitoring device with the three-dimensional geographic coordinates of the target AR tag to obtain the relative translation vector; constructing the camera extrinsic rotation matrix based on the yaw angle, the pitch angle, and the roll angle; and combining the pre-acquired camera intrinsic parameter matrix, the relative translation vector, and the extrinsic rotation matrix to synthesize the dynamic perspective projection model.

4. The AR-based video tag dynamic overlay anti-shake method according to claim 3, characterized in that the step of calculating the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model includes: transforming the three-dimensional geographic coordinates of the target AR tag into homogeneous coordinates, and multiplying the homogeneous coordinates by the dynamic perspective projection model to obtain two-dimensional homogeneous coordinates; and normalizing the two-dimensional homogeneous coordinates by perspective division to calculate the initial pixel coordinates of the target AR tag on the pixel plane of the current video frame.

5. The AR-based video tag dynamic overlay anti-shake method according to claim 1, characterized in that the step of calculating the average pixel displacement vector of image feature points of the current video frame and adjacent historical video frames within the local feature search window includes: the local feature search window includes a first local feature search window located in the adjacent historical video frames and a second local feature search window located in the current video frame; converting the first local feature search window and the second local feature search window to grayscale respectively; extracting multiple background feature points within the first local feature search window using a corner detection algorithm; searching the second local feature search window for target pixels that match the background feature points, and calculating the coordinate difference between each target pixel and the corresponding background feature point to obtain the pixel displacement vector corresponding to each background feature point; and summing and averaging all the pixel displacement vectors to obtain the average pixel displacement vector.

6. The AR-based video tag dynamic overlay anti-shake method according to claim 5, characterized in that it also includes performing anomaly removal on the pixel displacement vector corresponding to each of the background feature points, specifically: obtaining local data samples by randomly sampling all the pixel displacement vectors multiple times, and fitting the local data samples to obtain a basic affine transformation model; calculating the reprojection error between the pixel displacement vector of each background feature point and the fitted basic affine transformation model; identifying background feature points whose reprojection error is greater than a preset error threshold as foreground motion interference points and removing them; and retaining the pixel displacement vectors corresponding to background feature points whose reprojection error is less than or equal to the preset error threshold, thus completing the abnormal feature removal.

7. The AR-based video tag dynamic overlay anti-shake method according to claim 1, characterized in that it also includes performing an anti-shake tolerance dead-zone determination on the average pixel displacement vector, specifically: determining whether the magnitude of the average pixel displacement vector is greater than a preset pixel fluctuation threshold; if the magnitude of the average pixel displacement vector is less than or equal to the pixel fluctuation threshold, setting the average pixel displacement vector to zero and performing reverse displacement compensation; and if the magnitude of the average pixel displacement vector is greater than the pixel fluctuation threshold, performing reverse displacement compensation.

8. The AR-based video tag dynamic overlay anti-shake method according to claim 1, characterized in that performing reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates includes: parsing the horizontal and vertical offsets of the average pixel displacement vector; negating the horizontal offset and the vertical offset to obtain the reverse-compensated horizontal component and the reverse-compensated vertical component; and adding the reverse-compensated horizontal component and the reverse-compensated vertical component to the horizontal and vertical coordinates of the initial pixel coordinates, respectively, to obtain the target rendering coordinates.

9. An AR-based video tag dynamic overlay anti-shake system, characterized in that the system is used to perform the AR-based video tag dynamic overlay anti-shake method according to any one of claims 1 to 8, the system comprising: a video pose acquisition module, which determines the front-end monitoring device, acquires the real-time video stream, and acquires the device pose parameters of the front-end monitoring device at the corresponding timestamp; an initial coordinate calculation module, which constructs a dynamic perspective projection model based on the device pose parameters and calculates the initial pixel coordinates of the target AR tag in the current video frame based on the dynamic perspective projection model; a search window planning module, which plans a local feature search window based on the initial pixel coordinates, wherein the local feature search window represents a local image region centered on the initial pixel coordinates; a pixel displacement calculation module, which calculates the average pixel displacement vector of the image feature points of the current video frame and adjacent historical video frames within the local feature search window; and a reverse compensation rendering module, which performs reverse displacement compensation on the initial pixel coordinates using the average pixel displacement vector to obtain the target rendering coordinates, and renders the target AR tag to the current video frame based on the target rendering coordinates.

10. The AR-based video tag dynamic overlay anti-shake system according to claim 9, characterized in that the search window planning module includes: a zoom ratio parsing unit, which parses the real-time optical zoom ratio in the device pose parameters; a target side length calculation unit, which multiplies the real-time optical zoom ratio by the preset base window side length to obtain the target window side length of the target feature tracking window; and a local window delineation unit, which, based on the current video frame, crops a region whose side length equals the target window side length and whose center point is the initial pixel coordinates, to obtain the local feature search window.