Robot localization method based on visual tags and related devices

By simultaneously acquiring robot motion parameters and visual label images, performing tensor fusion and multi-scale feature extraction, the problem of high complexity in existing visual label detection algorithms is solved, achieving high-precision and rapid positioning under high-speed motion, reducing power consumption and extending battery life.

CN122244550A · Pending · Publication Date: 2026-06-19 · ZHIHAN XINGTU (SUZHOU) TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHIHAN XINGTU (SUZHOU) TECH CO LTD
Filing Date
2026-04-15
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing visual label detection algorithms are highly complex when processing high-resolution images, resulting in excessive CPU/GPU computing resource consumption and failing to meet the millisecond-level control response requirements of mobile robots in high-speed motion.

Method used

By synchronously acquiring the robot's motion parameters and visual label images, tensor fusion is performed to extract multi-scale spatial features, which are mapped to a global environmental quality index to determine the visual label search area. Visual label recognition and localization are then performed, and a lightweight visual language model and adaptive operators are used to handle complex environments.

Benefits of technology

It reduces the complexity of visual label recognition, enhances the real-time control performance in high-speed motion scenarios, significantly reduces the overall power consumption of the mobile robot, extends its battery life, and achieves high-precision and rapid positioning.

✦ Generated by Eureka AI based on patent content.

Figure CN122244550A_ABST

Abstract

This invention relates to a robot localization method and related apparatus based on visual tags, belonging to the field of computer vision technology. The method includes: simultaneously acquiring the robot's motion parameters and visual tag images collected by the robot; performing tensor fusion on the motion parameters and visual tag images to obtain an environmental context tensor; extracting multi-scale spatial features from the environmental context tensor to obtain an environmental interference factor vector, and mapping the environmental interference factor vector to a global environmental quality index; predicting the robot's pose based on the motion parameters, and determining a visual tag search region in the visual tag images based on the predicted pose and the global environmental quality index; recognizing visual tags in the visual tag search region, and localizing the robot based on the recognition results and motion parameters. This invention enables accurate and rapid robot localization.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a robot localization method and related apparatus based on visual tags.

Background Technology

[0002] In embodied intelligence fields such as indoor navigation for mobile robots, autonomous landing of drones, and augmented reality, positioning technology based on artificial visual tags has become an effective positioning solution due to its advantages such as low deployment cost, strong resistance to cumulative errors, and absolute pose reference.

[0003] In existing technologies, visual tag detection algorithms employ global sliding-window or full-pixel gradient traversal search strategies. When processing 1080P or 4K high-resolution images, non-target background regions account for more than 95% of the frame, yet gradient magnitude calculations and clustering operations are still performed on them to no effect. This highly complex processing results in excessive CPU/GPU resource consumption and low detection frame rates, and fails to meet the millisecond-level control response requirements of mobile robots operating at high speed.

[0004] Existing technologies therefore involve high complexity in visual tag detection and cannot meet the demand for rapid response.

Summary of the Invention

[0005] In view of this, it is necessary to provide a robot localization method and related device based on visual tags to solve the problem that the existing technology has high complexity in visual inspection and cannot meet the needs of rapid response.

[0006] To address the aforementioned problems, in a first aspect, the present invention provides a robot localization method based on visual tags, comprising: synchronously acquiring the robot's motion parameters and the visual tag images collected by the robot, and performing tensor fusion on the motion parameters and the visual tag images to obtain an environmental context tensor; performing multi-scale spatial feature extraction on the environmental context tensor to obtain an environmental interference factor vector, and mapping the environmental interference factor vector to a global environmental quality index; predicting the robot's pose based on the motion parameters, and determining a visual tag search region in the visual tag images based on the predicted pose and the global environmental quality index; and recognizing visual tags in the visual tag search region, and localizing the robot based on the recognition results and the motion parameters.

[0007] In one possible implementation, performing tensor fusion on the motion parameters and the visual tag images to obtain the environmental context tensor includes: extracting the robot's motion state tensor based on the motion parameters, and downsampling and pixel-normalizing the visual tag image to obtain an image feature tensor; and embedding the motion state tensor into the image feature tensor as global semantics to obtain the environmental context tensor.

[0008] In one possible implementation, multi-scale spatial feature extraction is performed on the environmental context tensor to obtain an environmental disturbance factor vector, and this vector is then mapped to a global environmental quality index, including: A pre-defined lightweight visual language model is used to extract multi-scale spatial features from the environmental context tensor. An environmental quantization regression head is then used to map the extracted feature space into an environmental interference factor vector. The environmental interference factor vector includes the illumination consistency coefficient, the degree of motion blur of image gradient diffusion, the probability that visual labels are occluded by dynamic objects or structures, and the texture complexity reflecting the distribution of feature points in the scene. The environmental disturbance factor vector is weighted, summed, and normalized using a pre-defined nonlinear fusion function to obtain the global environmental quality index.

[0009] In one possible implementation, the robot's pose is predicted based on motion parameters, and the visual label search region in the visual label image is determined based on the predicted pose and the global environmental quality index, including: Based on motion parameters, the robot's motion trajectory is predicted, resulting in a spatial prediction cone for the robot's motion trajectory. Based on spatial prediction cones, visual labels are mapped onto visual label images to obtain the center point of visual labels in visual label images; The search margin for visual labels is determined based on the degree of motion blur in the image gradient diffusion, and the search region for visual labels in the image is determined based on the center point of the visual labels and the search margin.

[0010] In one possible implementation, visual label recognition is performed on the visual label search area, and the robot is localized based on the visual label recognition results and motion parameters, including: The visual detection operator for the visual label image is determined based on the global environmental quality index, and the visual detection operator is used to perform visual detection on the visual label image to obtain the position of the visual label in the visual label image. The reprojection residual of the visual tag is calculated based on the position of the visual tag in the visual tag image and the global environmental quality index. The robot's motion trajectory is constructed by combining the reprojection residual with the robot's motion parameters, and the robot is then located based on the motion trajectory.

[0011] In one possible implementation, determining the visual detection operator for the visual tag image based on the global environmental quality index includes: when the illumination consistency coefficient indicates strong backlight or local overexposure in the visual tag image, using an adaptive histogram equalization strategy to calculate the optimal clipping threshold for the visual tag image and reconstruct local contrast; and when the degree of motion blur of image gradient diffusion exceeds a preset threshold, enhancing the image gradient with a Laplacian sharpening convolution kernel or an edge enhancement operator.

[0012] In one possible implementation, the robot's motion trajectory is constructed based on the reprojection residual combined with the robot's motion parameters, including: Construct a full-state confidence operator based on reprojection residuals and global environmental quality index; The Kalman gain of the extended Kalman filter algorithm is adjusted by combining the full-state confidence operator, and the adjusted extended Kalman filter algorithm is used to construct the robot's motion trajectory line by combining the robot's motion parameters.
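The gain adjustment described in [0012] can be pictured with a one-dimensional sketch. It is a scalar linear Kalman update rather than a full extended Kalman filter, and it assumes that the full-state confidence operator reduces to a single value in (0, 1] that inflates the measurement noise. The function name kalman_update, its parameters, and the rule r_eff = r / confidence are illustrative assumptions, not taken from the patent text.

```python
from typing import Tuple


def kalman_update(x_pred: float, p_pred: float, z: float,
                  r: float, confidence: float) -> Tuple[float, float, float]:
    """Scalar Kalman measurement update with a confidence-scaled gain.

    x_pred, p_pred: predicted state and its variance (from motion parameters)
    z, r:           visual tag measurement and its nominal noise variance
    confidence:     assumed confidence value in (0, 1]; lower means less trust
    """
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")
    r_eff = r / confidence             # low confidence inflates measurement noise
    gain = p_pred / (p_pred + r_eff)   # Kalman gain shrinks as r_eff grows
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, gain


if __name__ == "__main__":
    print(kalman_update(0.0, 1.0, 1.0, 1.0, 1.0))    # full trust in the tag
    print(kalman_update(0.0, 1.0, 1.0, 1.0, 0.25))   # degraded image, smaller gain
```

With a lower confidence the update leans more on the motion-based prediction and less on the visual observation, which is the qualitative behavior the paragraph describes.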

[0013] Secondly, the present invention also provides a robot localization device based on visual tags, comprising: The information acquisition module is used to synchronously acquire the robot's motion parameters and the visual label images collected by the robot, and to perform tensor fusion on the motion parameters and visual label images to obtain the environmental context tensor. The feature extraction module is used to extract multi-scale spatial features from the environmental context tensor to obtain the environmental interference factor vector, and then map the environmental interference factor vector to the global environmental quality index. The search region determination module is used to predict the robot's pose based on motion parameters, and to determine the visual label search region in the visual label image based on the predicted pose and the global environmental quality index. The localization module is used to identify visual labels in the visual label search area and to locate the robot based on the identification results and motion parameters.

[0014] Thirdly, the present invention also provides an electronic device, including a memory and a processor, wherein the memory is used to store a program, and the processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps of the visual tag-based robot localization method in any of the above implementations.

[0015] Fourthly, the present invention also provides a computer-readable storage medium for storing a computer-readable program or instructions, which, when executed by a processor, can implement the steps of the visual tag-based robot localization method described above.

[0016] The beneficial effects of this invention are as follows: The robot localization method based on visual tags provided by this invention synchronously acquires the robot's motion parameters and the visual tag images collected by the robot, and performs tensor fusion of the motion parameters and visual tag images to obtain an environmental context tensor. This ensures that the image exposure time is precisely aligned with the sampling time of the inertial measurement unit and wheel speedometer on the nanosecond-level time axis, achieving symmetry of sampling frequencies among heterogeneous sensors and improving the accuracy of visual tag recognition. The environmental context tensor not only preserves the visual semantics of the scene, but also provides multi-dimensional constraint priors for subsequent large-scale model inference of imaging quality through the introduction of dynamic features, effectively bridging the limitations of single-modality representation in complex dynamic scenes. Multi-scale spatial feature extraction is performed on the environmental context tensor to obtain the environmental disturbance factor vector, which is then mapped to the global environmental quality index. The robot's pose is predicted based on motion parameters, and the visual label search region in the visual label image is determined based on the predicted pose and the global environmental quality index. This method accurately locates the region of the visual label in the visual label image, successfully avoiding redundant background calculations, reducing the complexity of visual label recognition, enhancing real-time control in high-speed motion scenarios, significantly reducing the overall power consumption of the mobile robot, and extending its battery life. Visual label recognition is performed on the visual label search region, and the robot is located based on the recognition results and motion parameters. 
This method achieves high-precision and high-speed positioning by using the recognized visual labels.

Brief Description of the Drawings

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of a robot localization method based on visual tags provided in an embodiment of the present invention;
Figure 2 is a flowchart of a method for determining an environmental context tensor provided in an embodiment of the present invention;
Figure 3 is a flowchart of a method for determining a global environmental quality index provided in an embodiment of the present invention;
Figure 4 is a flowchart of a method for determining a search region provided in an embodiment of the present invention;
Figure 5 is a flowchart of a robot localization method provided in an embodiment of the present invention;
Figure 6 is a flowchart of a visual detection operator construction method provided in an embodiment of the present invention;
Figure 7 is a flowchart of a method for determining a motion trajectory line provided in an embodiment of the present invention;
Figure 8 is a schematic diagram of a robot localization device based on visual tags provided in an embodiment of the present invention;
Figure 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0019] Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form part of this application and are used together with the embodiments of the present invention to illustrate the principles of the present invention, but are not intended to limit the scope of the present invention.

[0020] In the description of the embodiments of the present invention, unless otherwise stated, "multiple" means two or more. "And / or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone.

[0021] The terms "first," "second," etc., used in the embodiments of this invention are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a technical feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature.

[0022] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a mutually exclusive, independent, or alternative embodiment. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0023] A specific embodiment of the present invention, as shown in Figure 1, discloses a robot localization method based on visual tags, including: S101, synchronously acquiring the robot's motion parameters and the visual tag images collected by the robot, and performing tensor fusion on the motion parameters and the visual tag images to obtain the environmental context tensor.

[0024] In this embodiment of the invention, to achieve synchronization between the robot's motion parameters and visual tag image acquisition, a sub-millisecond synchronous acquisition framework based on a hard-triggered mechanism is established through the underlying hardware abstraction layer. This framework utilizes a global shutter synchronization pulse generated by the microcontroller unit to drive the monocular camera and high-frequency sensor array to perform concurrent sampling, ensuring that the image exposure time is precisely aligned with the sampling time of the inertial measurement unit and wheel speedometer on the nanosecond-level time axis. The motion parameters include, but are not limited to, the robot's left and right wheel speeds, velocity, acceleration, and angular velocity.

[0025] In this embodiment of the invention, a one-dimensional motion parameter vector is constructed based on the robot's motion parameters, and an image tensor is constructed based on the pixels of the visual label image. The motion parameter vector and the image tensor are then fused to obtain an environmental context tensor. The specific process of tensor fusion will be described in detail later in this invention.

[0026] S102, multi-scale spatial feature extraction is performed on the environmental context tensor to obtain the environmental disturbance factor vector, and the environmental disturbance factor vector is mapped to the global environmental quality index.

[0027] In this embodiment of the invention, after the environmental context tensor containing rich contextual information is obtained, the key interference indicators affecting visual recognition performance need to be extracted from it. Three parallel convolutional branches scan the tensor. The first branch uses a small convolutional kernel; it is highly sensitive to local pixel changes and is mainly used to extract blur features characterizing the degree of image blur. When the robot rotates rapidly, image edges produce ghosting, and the activation value output by this branch increases significantly. The second branch uses a medium-sized convolutional kernel; it senses a slightly larger block of pixels and is mainly used to analyze the distribution of image brightness, that is, to extract illumination features characterizing the uniformity of illumination. If the image has large overexposed or underexposed areas, the response features of this branch shift. The third branch uses a larger convolutional kernel; its receptive field covers the main area of the entire image and is used to evaluate the overall contrast decay of the image, that is, to extract contrast features characterizing whether the image has lost detail due to fog, stains, or strong backlighting. The blur, illumination, and contrast features output by the three parallel branches are summed element-wise at their corresponding spatial locations and fused into an environmental interference factor vector that comprehensively describes the local image degradation. The environmental interference factor vector is input into a fully connected network layer, which acts as a scorer: it computes a weighted sum of the degradation indicators in the vector and outputs a scalar value between zero and one. This value is the global environmental quality index. The closer the value is to one, the more ideal the current visual environment; the closer the value is to zero, the more severe the visual interference.
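The data flow of [0027] (three parallel branches, element-wise summation, a fully connected scorer) can be sketched structurally as follows. This is a minimal illustration under stated assumptions: each branch uses a uniform averaging kernel as a placeholder for learned convolution weights, the kernel sizes 3, 5 and 7 are hypothetical, and a sigmoid is assumed as the squashing function that keeps the score between zero and one. None of these specifics come from the patent text.

```python
import math
from typing import List, Sequence

Grid = List[List[float]]


def box_branch(x: Grid, k: int) -> Grid:
    """One branch: k x k averaging kernel (placeholder for learned weights)."""
    h, w, r = len(x), len(x[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            rows = range(max(0, i - r), min(h, i + r + 1))
            cols = range(max(0, j - r), min(w, j + r + 1))
            total = sum(x[a][b] for a in rows for b in cols)
            row.append(total / (len(rows) * len(cols)))
        out.append(row)
    return out


def fuse_branches(x: Grid, kernels: Sequence[int] = (3, 5, 7)) -> List[float]:
    """Sum the branch outputs element-wise and flatten them into a vector."""
    maps = [box_branch(x, k) for k in kernels]
    h, w = len(x), len(x[0])
    return [sum(m[i][j] for m in maps) for i in range(h) for j in range(w)]


def quality_index(vec: List[float], weights: List[float], bias: float) -> float:
    """Fully connected scorer: weighted sum plus bias, squashed to (0, 1)."""
    z = bias + sum(wt * v for wt, v in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    image = [[0.5] * 4 for _ in range(4)]          # toy single-channel input
    vec = fuse_branches(image)
    print(quality_index(vec, [-0.0625] * len(vec), 1.5))
```

In a trained system the kernel weights, scorer weights and bias would be learned, so the numbers above only demonstrate the shape of the computation.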

[0028] S103, based on motion parameters, predicts the robot's pose, and based on the predicted pose and global environmental quality index, determines the visual label search area in the visual label image.

[0029] In this embodiment of the invention, based on the acquired motion parameters and the robot's precise pose from the previous moment, a coarse pose of the robot at the current moment is calculated; this pose is called the predicted pose. Although the predicted pose accumulates errors, it has high relative accuracy over a short period. When the global environmental quality index is greater than a preset threshold, it indicates that the image is clear and the lighting is good. At this time, the visual features are robust, and the labels are easily detected. To save computational resources, only a small rectangular window is defined around the projection point of the predicted pose as the visual label search area. When the global environmental quality index is less than or equal to the preset threshold, it indicates that the image is blurry or too dark. At this time, the predicted pose may have some deviation, and the visual algorithm itself is prone to missing detections.
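The threshold rule in [0029] can be sketched as a small window selector. The paragraph states only that a small window is used when the index exceeds the threshold; using a larger window otherwise is an assumption made here for illustration. The threshold of 0.6, the half-widths of 40 and 120 pixels, and the 1920 x 1080 frame size are likewise hypothetical values.

```python
from typing import Tuple

Window = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def search_window(center: Tuple[int, int], quality: float,
                  threshold: float = 0.6, small_half: int = 40,
                  large_half: int = 120, width: int = 1920,
                  height: int = 1080) -> Window:
    """Pick a rectangular search window around the projected tag center.

    A good-quality frame gets the small window; otherwise a larger one is
    used (assumed behavior). The window is clamped to the image bounds.
    """
    half = small_half if quality > threshold else large_half
    u, v = center
    x0, y0 = max(0, u - half), max(0, v - half)
    x1, y1 = min(width, u + half), min(height, v + half)
    return x0, y0, x1, y1


if __name__ == "__main__":
    print(search_window((960, 540), 0.9))   # clear image, small window
    print(search_window((10, 10), 0.3))     # degraded image near the corner
```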

[0030] S104: Perform visual label recognition on the visual label search area, and locate the robot based on the visual label recognition results and motion parameters.

[0031] In this embodiment of the invention, after determining the visual tag search area, visual tags are identified within that area. Based on the identification results and motion parameters, the robot is positioned to achieve precise localization. The specific localization method will be described in detail later in this invention.

[0032] The robot localization method based on visual tags provided by this invention synchronously acquires the robot's motion parameters and the visual tag images collected by the robot. It then performs tensor fusion of the motion parameters and visual tag images to obtain an environmental context tensor. This ensures that the image exposure time is precisely aligned with the sampling time of the inertial measurement unit and wheel speedometer on the nanosecond-level time axis, achieving symmetry in the sampling frequency among heterogeneous sensors and improving the accuracy of visual tag recognition. The environmental context tensor not only preserves the visual semantics of the scene but also provides multi-dimensional constraint priors for subsequent large-scale model inference of imaging quality through the introduction of dynamic features, effectively bridging the limitations of single-modality representation in complex dynamic scenes. Multi-scale spatial feature extraction is performed on the environmental context tensor to obtain the environmental disturbance factor vector, which is then mapped to the global environmental quality index. The robot's pose is predicted based on motion parameters, and the visual label search region in the visual label image is determined based on the predicted pose and the global environmental quality index. This method accurately locates the region of the visual label in the visual label image, successfully avoiding redundant background calculations, reducing the complexity of visual label recognition, enhancing real-time control in high-speed motion scenarios, significantly reducing the overall power consumption of the mobile robot, and extending its battery life. Visual label recognition is performed on the visual label search region, and the robot is located based on the recognition results and motion parameters. This method achieves high-precision and high-speed positioning by using the recognized visual labels.

[0033] In some possible embodiments of the present invention, as shown in Figure 2, performing tensor fusion on the motion parameters and the visual tag images to obtain the environmental context tensor includes: S201, extracting the robot's motion state tensor based on the motion parameters, and downsampling and pixel-normalizing the visual tag image to obtain the image feature tensor; S202, embedding the motion state tensor into the image feature tensor as global semantics to obtain the environmental context tensor.

[0034] In this embodiment of the invention, the motion state tensor of the robot is extracted based on the motion parameters. In the data preprocessing stage, the original high-resolution image stream undergoes bilinear-interpolation downsampling and pixel-level normalization, which transforms it into a feature tensor matching the input dimensions of a lightweight visual language model. The dimension-expanded motion state vector is then embedded, as global semantics, into the feature space of the image tensor, ultimately constructing an environmental context tensor that represents the global environmental features. This tensor not only preserves the visual semantics of the scene but also, by introducing dynamic features, provides multi-dimensional constraint priors for the subsequent large-model inference of imaging quality, effectively overcoming the limitations of single-modality representation in complex dynamic scenes.
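A minimal sketch of steps S201 and S202 follows. The text does not specify the embedding operator, so this sketch assumes that the motion state vector is repeated at every pixel and appended as extra channels. Averaging 2 x 2 blocks is used as a simple stand-in for bilinear downsampling at a factor of two, and all function names are illustrative.

```python
from typing import List

Image = List[List[float]]           # grayscale rows, pixel values in [0, 255]
Tensor3 = List[List[List[float]]]   # H x W x C


def downsample_2x(img: Image) -> Image:
    """Halve the resolution by averaging each 2 x 2 block."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]


def normalize(img: Image) -> Image:
    """Scale pixel values from [0, 255] to [0, 1]."""
    return [[p / 255.0 for p in row] for row in img]


def fuse(img: Image, motion: List[float]) -> Tensor3:
    """Build an H x W x (1 + len(motion)) context tensor.

    Channel 0 holds the normalized image; the remaining channels repeat the
    motion state vector at every pixel (assumed embedding scheme).
    """
    feat = normalize(downsample_2x(img))
    return [[[p] + list(motion) for p in row] for row in feat]


if __name__ == "__main__":
    frame = [[0, 255, 0, 255], [255, 0, 255, 0],
             [0, 0, 255, 255], [0, 0, 255, 255]]
    motion_state = [0.5, 0.1, 0.02]   # e.g. speed, acceleration, angular velocity
    ctx = fuse(frame, motion_state)
    print(len(ctx), len(ctx[0]), len(ctx[0][0]))
```

Appending the motion values as constant channels keeps the image layout intact while making the motion state visible at every spatial position, which is one simple reading of "global semantic embedding".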

[0035] In some possible embodiments of the present invention, as shown in Figure 3, performing multi-scale spatial feature extraction on the environmental context tensor to obtain the environmental interference factor vector, and mapping the environmental interference factor vector to the global environmental quality index, includes: S301, using a preset lightweight visual language model to extract multi-scale spatial features from the environmental context tensor, and using an environmental quantization regression head to map the extracted feature space into the environmental interference factor vector, where the environmental interference factor vector includes the illumination consistency coefficient, the degree of motion blur of image gradient diffusion, the probability that visual tags are occluded by dynamic objects or structures, and the texture complexity reflecting the distribution of feature points in the scene; S302, using a preset nonlinear fusion function to perform weighted summation and normalization on the environmental interference factor vector to obtain the global environmental quality index.

[0036] In this embodiment of the invention, a lightweight visual language model (VLM) deployed at the edge is used as the perception decision unit. Internally, an encoder built on a Vision Transformer (ViT) performs multi-scale spatial feature extraction on the input environmental context tensor. Using the Transformer's self-attention mechanism, the model can scan the entire image domain for salient illumination non-uniformity, specular reflection caused by surface water, and edge diffusion caused by motion. Furthermore, the VLM's inference backend does not directly output classification results. Instead, a specially designed environmental quantization regression head maps the abstract feature space into a structured environmental interference factor vector S. The four components of S are, respectively: the illumination consistency coefficient, which represents the uniformity of illumination; the degree of motion blur of image gradient diffusion; the predicted probability that a visual tag is occluded by a dynamic object or structure; and the texture complexity, which reflects the distribution of feature points in the scene. To establish a unified benchmark for optimization feedback, a nonlinear fusion function is further introduced to map the high-dimensional vector S to a scalar global environmental quality index. Through weighted summation and normalization, this index takes into account the sensitivity weight of each interference dimension with respect to positioning accuracy.
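One possible form of the nonlinear fusion function is sketched below, assuming each component of S lies in [0, 1] and that a sigmoid serves as the normalization. The weights, their signs (illumination consistency raising the index; blur, occlusion and texture complexity lowering it) and the bias are hypothetical choices for illustration, since the text gives no concrete values.

```python
import math
from typing import NamedTuple, Sequence


class InterferenceFactors(NamedTuple):
    """Components of the vector S, each assumed to lie in [0, 1]."""
    illumination: float   # illumination consistency coefficient
    motion_blur: float    # degree of motion blur
    occlusion: float      # predicted occlusion probability
    texture: float        # texture complexity


def global_quality(s: InterferenceFactors,
                   weights: Sequence[float] = (2.0, -3.0, -3.0, -1.0),
                   bias: float = 1.0) -> float:
    """Weighted sum of the factors followed by a sigmoid, giving a value in (0, 1)."""
    z = bias + sum(w * v for w, v in zip(weights, s))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    clear = InterferenceFactors(1.0, 0.0, 0.0, 0.0)
    degraded = InterferenceFactors(0.2, 0.8, 0.5, 0.5)
    print(global_quality(clear), global_quality(degraded))
```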

[0037] In some possible embodiments of the present invention, as shown in Figure 4, predicting the robot's pose based on the motion parameters, and determining the visual tag search region in the visual tag image based on the predicted pose and the global environmental quality index, includes: S401, predicting the robot's motion trajectory based on the motion parameters to obtain a spatial prediction cone of the robot's motion trajectory; S402, mapping the visual tag onto the visual tag image based on the spatial prediction cone to obtain the center point of the visual tag in the visual tag image; S403, determining the search margin of the visual tag based on the degree of motion blur of image gradient diffusion, and determining the visual tag search region in the visual tag image based on the center point of the visual tag and the search margin.

[0038] In this embodiment of the invention, to optimize computational efficiency and real-time performance, a predictive perception framework based on spatiotemporal causality is constructed. The process first takes the posterior pose of the previous sampling period as the starting state. By performing second-order dynamic integration on the high-frequency angular velocity and acceleration data acquired by the IMU and on the instantaneous linear velocity increment provided by the wheel speedometer, a spatial prediction cone capable of covering the robot's nonlinear motion trajectory is constructed. This prediction cone undergoes a perspective projection transformation through the camera intrinsic parameter matrix, which maps the tag distribution probability in three-dimensional space onto the two-dimensional imaging plane and thereby yields the predicted projection center (u, v) of the visual tag in the current high-resolution frame. Building on this, a dynamic margin adjustment mechanism based on semantic environment awareness is introduced. Unlike a traditional fixed-radius search, this mechanism computes the search margin Δ in real time from the motion blur factor combined with the robot's current instantaneous angular velocity. The formula uses two weighting coefficients to balance the displacement deviation caused by mechanical motion against the feature blurring caused by image degradation. Subsequently, with (u, v) as the center and Δ as the boundary extension, a dynamic ROI clipping window with strong spatiotemporal constraints is generated: Region = [(u − Δ, v − Δ), (u + Δ, v + Δ)]. Gradient calculation and disjoint-set clustering are performed only within this limited local pixel domain, which directly blocks the interference of background noise from non-target areas. This not only reduces computational complexity, avoiding more than 80% of redundant computation, but also significantly improves the signal-to-noise ratio and success rate of feature extraction in complex dynamic environments.
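The projection and ROI construction in [0038] can be sketched as follows. The sketch assumes the tag position has already been expressed in the camera frame from the predicted pose and uses a standard pinhole model. The linear margin rule (alpha times the absolute angular velocity plus beta times the blur factor) is an assumed form, because the text states only that two weighting coefficients are used. All names and numeric values are illustrative.

```python
import math
from typing import Tuple


def project(point_cam: Tuple[float, float, float],
            fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def search_margin(omega: float, blur: float, alpha: float, beta: float) -> float:
    """Assumed linear margin: motion term plus image-degradation term, in pixels."""
    return alpha * abs(omega) + beta * blur


def roi(u: float, v: float, delta: float,
        width: int, height: int) -> Tuple[int, int, int, int]:
    """Region = [(u - delta, v - delta), (u + delta, v + delta)], clamped to the frame."""
    x0 = max(0, math.floor(u - delta))
    y0 = max(0, math.floor(v - delta))
    x1 = min(width, math.ceil(u + delta))
    y1 = min(height, math.ceil(v + delta))
    return x0, y0, x1, y1


if __name__ == "__main__":
    u, v = project((0.5, -0.25, 2.0), fx=800.0, fy=800.0, cx=960.0, cy=540.0)
    delta = search_margin(omega=1.5, blur=0.5, alpha=20.0, beta=60.0)
    x0, y0, x1, y1 = roi(u, v, delta, 1920, 1080)
    fraction = (x1 - x0) * (y1 - y0) / (1920 * 1080)
    print((x0, y0, x1, y1), fraction)
```

The printed fraction shows how small the processed window is relative to the full frame for this particular toy input; actual savings depend on the margin and the frame size.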

[0039] In some possible embodiments of the present invention, as shown in Figure 5, performing visual tag recognition on the visual tag search region, and localizing the robot based on the visual tag recognition results and the motion parameters, includes: S501, determining the visual detection operator for the visual tag image based on the global environmental quality index, and using the visual detection operator to perform visual detection on the visual tag image to obtain the position of the visual tag in the visual tag image; S502, calculating the reprojection residual of the visual tag based on the position of the visual tag in the visual tag image and the global environmental quality index; S503, constructing the robot's motion trajectory line based on the reprojection residual combined with the robot's motion parameters, and localizing the robot based on the motion trajectory line.

[0040] Traditional visual label detection typically uses edge detection operators with fixed thresholds to extract the black-and-white boundaries of the labels. However, in scenarios with motion blur or uneven lighting, image edges become smooth or broken, and detection operators with fixed parameters often fail to extract contours, leading to detection failure. In this embodiment of the invention, when the global environmental quality index is high, a visual detection operator with standard sensitivity is invoked. This operator has a high response threshold to changes in image gradients, so it accurately captures the clear, sharp edges of the labels and rejects interference from background textures. When the global environmental quality index drops to a moderate level, the visual detection operator is automatically switched to a low-threshold, large-smoothing-window mode. Specifically, a slight smoothing filter is first applied to the search area image to suppress noise; at the same time, the system lowers the edge determination threshold, so that even light-dark boundaries smoothed by blurring can still be identified as potential edges. Furthermore, for the binarization process, the system adaptively widens the tolerance range for judging bright and dark pixel values according to how far the environmental quality index has fallen, so that the black and white modules of the labels can still be correctly segmented under uneven lighting. When the global environmental quality index is extremely low, a morphological operator specifically optimized for degraded images can additionally be invoked. Through a combination of erosion and dilation operations, broken label outlines are reconnected, and the approximate position of the visual label center point in the image pixel coordinate system is output.
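The three operating modes described above can be sketched as a small parameter selector. This is an illustrative sketch only: the tier boundaries `high` and `low`, the base threshold of 60, and the scaling rules are all hypothetical values chosen for the example, and the quality index is assumed to be normalized to [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorConfig:
    smooth_window: int          # side of the smoothing window in pixels (1 = none)
    edge_threshold: float       # minimum gradient magnitude accepted as an edge
    binarize_tolerance: float   # tolerance band for bright/dark classification
    morphological_repair: bool  # run erosion/dilation to reconnect broken outlines


def select_detector(q, high=0.7, low=0.3):
    """Map a quality index q in [0, 1] to detector parameters (assumed tiers)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("quality index must lie in [0, 1]")
    if q >= high:
        # Clean image: no smoothing, strict edge threshold.
        return DetectorConfig(1, 60.0, 0.05, False)
    # Degraded image: relax the thresholds in proportion to the drop in quality.
    drop = (high - q) / high
    return DetectorConfig(
        smooth_window=5,
        edge_threshold=60.0 * (1.0 - 0.5 * drop),
        binarize_tolerance=0.05 + 0.20 * drop,
        morphological_repair=q < low,
    )
```

Returning an immutable configuration object rather than switching global state keeps each frame's detector settings explicit, which makes it straightforward to log which mode produced a given detection.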

[0041] In this embodiment of the invention, based on the robot's predicted pose derived from the motion parameters in the aforementioned embodiments, and combined with the known fixed position of the visual label in the world coordinate system, the system calculates the theoretical position of the visual label in the image using the perspective projection principle. This position is called the reprojection prediction point. The actually detected position of the visual label serves as the observation point, and the pixel distance between the observation point and the reprojection prediction point is calculated. A key improvement of this invention is that the evaluation of this distance is further modulated by the global environmental quality index. If the global environmental quality index indicates severe interference in the current image, a larger uncertainty margin is automatically assigned to the observation even if the observation point and the prediction point are close; conversely, if the environmental quality is excellent, a smaller uncertainty margin is assigned. This mirrors human vision: when an image is blurry, the edges of objects appear diffuse and their positions imprecise. The system models this perceptual uncertainty by expanding the confidence interval of the reprojection residual. A reprojection residual descriptor with environmental confidence weights, comprising magnitude and direction, is then output. Finally, the robot's motion trajectory line is constructed based on the reprojection residual and the robot's motion parameters, and the robot is localized based on the motion trajectory line.
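A minimal sketch of the residual descriptor described above follows. The linear interpolation between `sigma_min` and `sigma_max` is an assumption made for illustration, as are the parameter names and default values; the text specifies only that lower environmental quality yields a larger uncertainty margin.

```python
import math


def weighted_residual(observed, predicted, q, sigma_min=0.5, sigma_max=8.0):
    """Reprojection residual with a quality-dependent uncertainty (pixels).

    q is the global environmental quality index, assumed to lie in [0, 1].
    """
    du = observed[0] - predicted[0]
    dv = observed[1] - predicted[1]
    # Assumed mapping: uncertainty shrinks linearly as quality improves.
    sigma = sigma_max - (sigma_max - sigma_min) * q
    return {
        "magnitude": math.hypot(du, dv),
        "direction": math.atan2(dv, du),
        "sigma": sigma,
        "weight": 1.0 / (sigma * sigma),  # inverse-variance confidence weight
    }
```

For the same pixel offset, a frame with `q = 0.0` receives `sigma = 8.0` while a frame with `q = 1.0` receives `sigma = 0.5`, so the blurry observation contributes far less to any downstream weighted estimate.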

[0042] In some possible embodiments of the present invention, as shown in Figure 6, determining the visual detection operator for the visual label image based on the global environmental quality index includes: S601, when the illumination consistency coefficient indicates that the visual label image has strong backlight or local overexposure, using an adaptive histogram equalization strategy to calculate the optimal clipping threshold of the visual label image and reconstruct the local contrast; S602, when the degree of motion blur of the image gradient exceeds a preset threshold, enhancing the image gradient with a Laplacian sharpening convolution kernel or an edge enhancement operator.

[0043] In this embodiment of the invention, the environmental interference factors and the global environmental quality index obtained in the aforementioned embodiments are converted into an executable parameter instruction set for visual detection processing. Specifically, to resolve the conflict between accuracy and power consumption in complex imaging environments, the original image resolution is maintained in the core area of the predicted label projection, which preserves complete edge gradients and guarantees sub-pixel positioning accuracy, while in the outer region the downsampling factor is dynamically increased based on feedback from the global environmental quality index. At the operator tuning level, hot-swappable reconfiguration of operators is achieved through low-level configuration descriptors or dynamic function pointers. When the illumination factor indicates strong backlighting or localized overexposure, a command is issued immediately to activate the local adaptive histogram equalization module, and the optimal clipping threshold is calculated in real time to reconstruct local contrast; if the motion blur factor exceeds the threshold, the system injects a 3×3 Laplacian sharpening convolution kernel or an edge enhancement operator into the feature retrieval pipeline, pre-enhancing the image gradient to compensate for the loss of corner detection accuracy caused by motion blur. The entire injection is completed in the inter-frame gap of the detection thread, which keeps the processing flow continuous. To keep the system stable on embedded terminals, a closed-loop adjustment based on processor load feedback is also integrated. The system monitors CPU/NPU utilization and the current frame processing latency in real time. Once the computational load reaches the high-water mark, the LLM forcibly increases the downsampling rate or shuts down non-core preprocessing operators through policy injection. Within the bounds of the available computing resources, this minimizes the impact of illumination fluctuations and gradient blur on positioning robustness and balances computational performance against positioning accuracy in dynamic environments.
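The operator injection and load feedback described above can be sketched in plain Python. This is a sketch under stated assumptions: the kernel shown is one common 3×3 Laplacian-based sharpening kernel (the text does not give its coefficients), a low illumination consistency coefficient is assumed to indicate backlight or overexposure, and the thresholds `illum_min`, `blur_max`, and `load_high`, along with the operator names, are hypothetical.

```python
# One common 3x3 sharpening kernel: identity minus the 4-neighbour Laplacian.
SHARPEN_KERNEL = ((0, -1, 0), (-1, 5, -1), (0, -1, 0))


def sharpen(image):
    """Sharpen a 2D list of grayscale values; border pixels are left unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += SHARPEN_KERNEL[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = max(0, min(255, acc))  # clamp to the 8-bit range
    return out


def plan_preprocessing(illumination, blur, load, downsample=1,
                       illum_min=0.4, blur_max=0.6, load_high=0.85):
    """Choose preprocessing operators and the outer-region downsampling factor."""
    ops = []
    if illumination < illum_min:
        ops.append("adaptive_hist_eq")
    if blur > blur_max:
        ops.append("sharpen")
    if load >= load_high:
        # High-water mark: shed the optional operators and downsample harder.
        ops = []
        downsample *= 2
    return ops, downsample
```

In this sketch both load-shedding actions are applied together for simplicity; the text allows either increasing the downsampling rate or disabling non-core operators.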

[0044] In some possible embodiments of the present invention, as shown in Figure 7, constructing the robot's motion trajectory line based on the reprojection residual and the robot's motion parameters includes: S701, constructing a full-state confidence operator based on the reprojection residual and the global environmental quality index; S702, adjusting the Kalman gain of the extended Kalman filter algorithm using the full-state confidence operator, and using the adjusted extended Kalman filter algorithm together with the robot's motion parameters to construct the robot's motion trajectory line.

[0045] In this embodiment of the invention, a nonlinear dynamic mapping model based on confidence scores is established during the data fusion stage. First, the global environmental quality index from the previous embodiments is combined with the reprojection residual fed back by the visual label detection operator, and a full-state confidence operator is constructed using a weighted fusion function. This operator reflects both the reliability of the external physical environment and the geometric self-consistency of the visual solution. Subsequently, this operator is used to correct the observation noise covariance matrix of the extended Kalman filter online. To keep the system responsive under drastic environmental changes, the observation noise covariance matrix evolves according to an exponential mapping of the confidence operator. In practical engineering implementation, the Kalman gain is adjusted in real time by dynamically scaling the diagonal elements of the observation noise covariance matrix. When the VLM detects an extremely harsh environment (such as complete occlusion or extreme blurring caused by severe shaking) that causes the full-state confidence operator to drop below a preset safety threshold, the eigenvalues of the observation noise covariance matrix grow exponentially, causing the Kalman gain to rapidly approach zero. This automatically triggers a visual isolation protection mechanism, which mathematically cuts off the influence of unreliable visual observations on state updates and prevents erroneous data from causing the system to diverge. At the same time, because the visual weights are weakened, the state estimator automatically switches, following the EKF prediction step, to a calculation mode dominated by IMU pre-integration and wheel odometry. Since the aforementioned steps have ensured the continuity of motion data through hardware synchronization, the system can maintain smooth trajectory output using inertial and dynamic constraints until the VLM detects a recovery in the global environmental quality index and the visual solution regains high confidence. The visual weights are then smoothly restored by reducing the values of the observation noise covariance matrix. This semantically guided dynamic weighting mechanism overcomes the frequent trajectory jitter and divergence that traditional fusion algorithms suffer in complex environments, so that the robot can maintain centimeter-level positioning consistency and robustness across working conditions.
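The gain-adjustment mechanism described above can be illustrated with a scalar Kalman update. The sketch makes several assumptions not given in the text: the confidence operator is an equal-weight blend of the quality index and an exponentially decaying function of the residual, the exponential mapping takes the form R = R0 · exp(k · (1 − c)), and the constants `k`, `w_env`, and `residual_scale` are hypothetical. A full EKF would use matrices; a scalar state is used here only to make the effect on the gain visible.

```python
import math


def confidence(q, residual_px, residual_scale=5.0, w_env=0.5):
    """Assumed weighted fusion of environment quality and geometric consistency."""
    geometric = math.exp(-residual_px / residual_scale)
    return w_env * q + (1.0 - w_env) * geometric


def observation_noise(r0, c, k=12.0):
    """Assumed exponential mapping: noise variance grows as confidence c falls."""
    return r0 * math.exp(k * (1.0 - c))


def kalman_update(x, p, z, r):
    """Scalar Kalman measurement update; returns (state, variance, gain)."""
    gain = p / (p + r)
    return x + gain * (z - x), (1.0 - gain) * p, gain


# Reliable visual fix: the update moves the state halfway toward the measurement.
x_hi, p_hi, k_hi = kalman_update(0.0, 1.0, 10.0, observation_noise(1.0, 1.0))
# Unreliable visual fix: the gain collapses and the prediction dominates.
x_lo, p_lo, k_lo = kalman_update(0.0, 1.0, 10.0, observation_noise(1.0, 0.0))
```

With zero confidence the measurement is effectively ignored and the state variance stays close to its predicted value, which corresponds to the visual isolation behaviour described above; as confidence recovers, the noise variance falls and the visual measurement regains its influence smoothly.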

[0046] Finally, after each frame's state update, the system enters a self-supervised verification phase. First, using the fused posterior pose output by the extended Kalman filter (EKF) together with the camera intrinsic parameter matrix, the label corner points in 3D space are reprojected onto the current image plane and compared with the pixel corner coordinates extracted by the original visual operator to calculate the reprojection residual. This residual is treated as the gold standard for measuring visual perception accuracy. If a large reprojection residual appears even though the VLM reports a high environmental quality score, the semantic evaluation logic of the current VLM is judged to deviate from the physical facts. In that case, the system triggers a weight fine-tuning mechanism that uses this deviation as a feedback signal to correct, online, the nonlinear mapping weights between the VLM's internal environmental interference factors and the confidence operator, thereby eliminating systematic evaluation errors caused by site-specific lighting characteristics. The system also provides a high-confidence sample mining function. When the reprojection residual is extremely small and the sensor fusion consistency is high, the system stores the current environmental context tensor, together with the successful policy parameters injected by the LLM, in a local experience knowledge base of environments and parameters. In this way, the system achieves unsupervised adaptation and continuous evolution in heterogeneous physical environments.
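The decision logic of this verification phase can be sketched as follows. The thresholds, the action labels, and the use of a plain list as the experience knowledge base are all hypothetical; the actual fine-tuning of the mapping weights is not specified in the text and is therefore represented only by a returned action label.

```python
def verify_frame(q, residual_px, fusion_consistent, context, policy, experience,
                 q_high=0.7, residual_large=4.0, residual_small=0.5):
    """Classify a frame after the state update and mine high-confidence samples.

    Returns "recalibrate" when the quality score and the residual disagree,
    "stored" when the frame was added to the experience base, otherwise "ok".
    """
    if q >= q_high and residual_px >= residual_large:
        # The quality score was optimistic, but the geometry says otherwise.
        return "recalibrate"
    if residual_px <= residual_small and fusion_consistent:
        experience.append((context, policy))
        return "stored"
    return "ok"


experience = []
first = verify_frame(0.9, 6.0, True, "ctx-a", {"downsample": 1}, experience)
second = verify_frame(0.9, 0.2, True, "ctx-b", {"downsample": 2}, experience)
third = verify_frame(0.5, 2.0, True, "ctx-c", {"downsample": 2}, experience)
```

A caller receiving `"recalibrate"` would feed the size of the disagreement back into whatever update rule adjusts the mapping weights.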

[0047] This invention addresses the tendency of traditional methods to produce pose jumps under drastic lighting changes or motion blur. By correcting the downsampling factors and the edge enhancement convolution kernels and their parameters online, it preserves the integrity of feature extraction in low signal-to-noise ratio environments, stabilizes corner extraction accuracy at the sub-pixel level, significantly reduces the root mean square error of pose calculation, and makes the robot's positioning trajectory on complex working surfaces smoother and more continuous.

[0048] To better implement the visual tag-based robot localization method of the embodiments of this invention, and corresponding to that method, an embodiment of the invention also provides a visual tag-based robot localization device, as shown in Figure 8. The visual tag-based robot localization device 800 includes: an information acquisition module 801, used to synchronously acquire the robot's motion parameters and the visual label images collected by the robot, and to perform tensor fusion on the motion parameters and the visual label images to obtain the environmental context tensor; a feature extraction module 802, used to perform multi-scale spatial feature extraction on the environmental context tensor to obtain the environmental interference factor vector, and to map the environmental interference factor vector to the global environmental quality index; a search region determination module 803, used to predict the robot's pose based on the motion parameters and to determine the visual label search region in the visual label image based on the predicted pose and the global environmental quality index; and a positioning module 804, used to perform visual label recognition in the visual label search region and to localize the robot based on the visual label recognition results and the motion parameters.
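The data flow between the four modules can be sketched as a small pipeline object. The class name, the field names, and the callable signatures are hypothetical and simplified (for instance, the feature extraction stage here returns only the quality index); the sketch shows only the order in which the modules hand data to one another.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class VisualTagLocalizer:
    acquire: Callable[[], Tuple[Any, Any, Any]]  # 801: motion, image, context tensor
    extract: Callable[[Any], float]              # 802: context tensor -> quality index
    search_region: Callable[[Any, float], Any]   # 803: motion, quality -> region
    locate: Callable[[Any, Any, Any], Any]       # 804: image, region, motion -> pose

    def step(self):
        """Run one localization cycle through the four modules in order."""
        motion, image, context = self.acquire()
        quality = self.extract(context)
        region = self.search_region(motion, quality)
        return self.locate(image, region, motion)


# Stub modules standing in for the real implementations.
localizer = VisualTagLocalizer(
    acquire=lambda: ({"omega": 0.1}, "frame", "context"),
    extract=lambda context: 0.8,
    search_region=lambda motion, quality: ("roi", quality),
    locate=lambda image, region, motion: {"pose": (0.0, 0.0, 0.0), "region": region},
)
result = localizer.step()
```

Injecting each module as a callable keeps the stages independently replaceable, which is convenient for testing one stage against stubs of the others.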

[0049] The visual tag-based robot localization device 800 provided in the above embodiments can realize the technical solutions described in the above embodiments of the visual tag-based robot localization method. The specific implementation principles of each module or unit can be found in the corresponding content in the above embodiments of the visual tag-based robot localization method, which will not be repeated here.

[0050] As shown in Figure 9, the present invention also provides an electronic device 900. The electronic device 900 includes a processor 901, a memory 902, and a display 903. Figure 9 shows only some components of the electronic device 900; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.

[0051] In some embodiments, processor 901 may be a central processing unit (CPU), a microprocessor, or other data processing chip, used to run program code stored in memory 902 or process data, such as the visual tag-based robot localization method of the present invention.

[0052] In some embodiments, processor 901 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, processor 901 may be local or remote. In some embodiments, processor 901 may be implemented on a cloud platform. In some embodiments, the cloud platform may include private cloud, public cloud, hybrid cloud, community cloud, distributed cloud, internal cloud, multi-cloud, etc., or any combination thereof.

[0053] In some embodiments, memory 902 may be an internal storage unit of electronic device 900, such as a hard disk or memory of electronic device 900. In other embodiments, memory 902 may also be an external storage device of electronic device 900, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. equipped on electronic device 900.

[0054] Furthermore, the memory 902 may include both internal storage units of the electronic device 900 and external storage devices. The memory 902 is used to store application software and various types of data installed on the electronic device 900.

[0055] In some embodiments, display 903 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. Display 903 is used to display information from electronic device 900 and to display a visual user interface. Components 901-903 of electronic device 900 communicate with each other via a system bus.

[0056] In some embodiments, when the processor 901 executes a visual tag-based robot localization program in the memory 902, the following steps may be performed: simultaneously acquiring the robot's motion parameters and the visual label images collected by the robot, and performing tensor fusion on the motion parameters and the visual label images to obtain the environmental context tensor; performing multi-scale spatial feature extraction on the environmental context tensor to obtain the environmental interference factor vector, and mapping the environmental interference factor vector to the global environmental quality index; predicting the robot's pose based on the motion parameters, and determining the visual label search area in the visual label image based on the predicted pose and the global environmental quality index; and performing visual label recognition in the visual label search area, and localizing the robot based on the recognition results and the motion parameters.

[0057] It should be understood that when the processor 901 executes the visual tag-based robot localization program in the memory 902, in addition to the functions mentioned above, it can also perform other functions, as detailed in the description of the corresponding method embodiments above.

[0058] Furthermore, this embodiment of the invention does not specifically limit the type of electronic device 900 mentioned. Electronic device 900 can be a mobile phone, tablet computer, personal digital assistant (PDA), wearable device, laptop computer, or other portable electronic device. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The aforementioned portable electronic device can also be other portable electronic devices, such as a laptop computer with a touch-sensitive surface (e.g., a touch panel). It should also be understood that in some other embodiments of the invention, electronic device 900 may not be a portable electronic device, but rather a desktop computer with a touch-sensitive surface (e.g., a touch panel).

[0059] Accordingly, this application also provides a computer-readable storage medium for storing a computer-readable program or instruction. When the program or instruction is executed by a processor, it can implement the steps or functions of the visual tag-based robot localization method provided in the above-described method embodiments.

[0060] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware, and the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.

[0061] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A robot localization method based on visual tags, characterized in that it comprises: simultaneously acquiring the robot's motion parameters and the visual label images collected by the robot, and performing tensor fusion on the motion parameters and the visual label images to obtain an environmental context tensor; performing multi-scale spatial feature extraction on the environmental context tensor to obtain an environmental interference factor vector, and mapping the environmental interference factor vector to a global environmental quality index; predicting the robot's pose based on the motion parameters, and determining a visual label search area in the visual label image based on the predicted pose and the global environmental quality index; and performing visual label recognition in the visual label search area, and localizing the robot based on the visual label recognition results and the motion parameters.

2. The robot localization method based on visual tags according to claim 1, characterized in that the step of performing tensor fusion on the motion parameters and the visual label images to obtain an environmental context tensor includes: extracting the motion state tensor of the robot based on the motion parameters, and downsampling and pixel-normalizing the visual label image to obtain an image feature tensor; and embedding the motion state tensor into the image feature tensor as global semantics to obtain the environmental context tensor.

3. The robot localization method based on visual tags according to claim 1, characterized in that the step of performing multi-scale spatial feature extraction on the environmental context tensor to obtain an environmental interference factor vector, and mapping the environmental interference factor vector to a global environmental quality index, includes: using a preset lightweight visual language model to extract multi-scale spatial features from the environmental context tensor, and using an environmental quantization regression head to map the extracted feature space into the environmental interference factor vector, wherein the environmental interference factor vector includes the illumination consistency coefficient, the degree of motion blur of image gradient diffusion, the probability that visual labels are occluded by dynamic objects or structural components, and the texture complexity reflecting the distribution of feature points in the scene; and weighting, summing, and normalizing the environmental interference factor vector using a preset nonlinear fusion function to obtain the global environmental quality index.

4. The robot localization method based on visual tags according to claim 3, characterized in that the step of predicting the robot's pose based on the motion parameters and determining the visual label search area in the visual label image based on the predicted pose and the global environmental quality index includes: predicting the robot's motion trajectory based on the motion parameters to obtain a spatial prediction cone of the robot's motion trajectory; mapping the visual label onto the visual label image based on the spatial prediction cone to obtain the visual label center point in the visual label image; and determining the search margin of the visual label based on the degree of motion blur of the image gradient diffusion, and determining the search area of the visual label in the visual label image based on the visual label center point and the search margin.

5. The robot localization method based on visual tags according to claim 1, characterized in that the step of performing visual label recognition in the visual label search area and localizing the robot based on the visual label recognition results and the motion parameters includes: determining a visual detection operator for the visual label image based on the global environmental quality index, and using the visual detection operator to perform visual detection on the visual label image to obtain the position of the visual label in the visual label image; calculating the reprojection residual of the visual label based on the position of the visual label in the visual label image and the global environmental quality index; and constructing the robot's motion trajectory line based on the reprojection residual and the robot's motion parameters, and localizing the robot based on the motion trajectory line.

6. The robot localization method based on visual tags according to claim 5, characterized in that the step of determining the visual detection operator for the visual label image based on the global environmental quality index includes: when the illumination consistency coefficient indicates that the visual label image has strong backlight or local overexposure, using an adaptive histogram equalization strategy to calculate the optimal clipping threshold of the visual label image and reconstruct the local contrast; and when the degree of motion blur of the image gradient exceeds a preset threshold, enhancing the image gradient with a Laplacian sharpening convolution kernel or an edge enhancement operator.

7. The robot localization method based on visual tags according to claim 5, characterized in that the step of constructing the robot's motion trajectory line based on the reprojection residual and the robot's motion parameters includes: constructing a full-state confidence operator based on the reprojection residual and the global environmental quality index; and adjusting the Kalman gain of the extended Kalman filter algorithm using the full-state confidence operator, and using the adjusted extended Kalman filter algorithm together with the robot's motion parameters to construct the motion trajectory line of the robot.

8. A robot localization device based on visual tags, characterized in that it comprises: an information acquisition module, used to synchronously acquire the robot's motion parameters and the visual label images collected by the robot, and to perform tensor fusion on the motion parameters and the visual label images to obtain an environmental context tensor; a feature extraction module, used to perform multi-scale spatial feature extraction on the environmental context tensor to obtain an environmental interference factor vector, and to map the environmental interference factor vector to a global environmental quality index; a search region determination module, used to predict the pose of the robot based on the motion parameters, and to determine the visual label search area in the visual label image based on the predicted pose and the global environmental quality index; and a positioning module, used to perform visual label recognition in the visual label search area and to localize the robot based on the visual label recognition results and the motion parameters.

9. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory is used to store a program, and the processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps of the visual tag-based robot localization method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store computer-readable programs or instructions which, when executed by a processor, implement the steps of the visual tag-based robot localization method according to any one of claims 1 to 7.