A method for estimating and tracking the pose of catenary insulators based on depth perception
By using depth sensing together with the YOLOv8-OBB and RefineNet algorithms for catenary insulator attitude estimation and tracking, the method addresses the low efficiency and low accuracy of manual inspection in traditional methods and enables efficient and accurate insulator condition monitoring.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTHWEST JIAOTONG UNIV
- Filing Date
- 2025-06-23
- Publication Date
- 2026-06-23
AI Technical Summary
Traditional methods for inspecting catenary insulators rely on manual identification, which suffers from high workload, poor real-time performance, and low accuracy, with efficiency and precision dropping further in complex environments.
A depth-sensing-based approach is adopted: the YOLOv8-OBB target detection algorithm locates the insulator region, and the depth map is combined with the RefineNet semantic segmentation model for pose estimation and tracking. Multi-scale feature extraction and pose correction enable multi-target detection and inter-frame pose tracking.
The method improves the accuracy and robustness of insulator attitude estimation, reduces manpower consumption, achieves efficient attitude tracking and detection, and meets real-time requirements.
Figure CN120953564B_ABST
Description
Technical Field
[0001] This invention relates to the field of railway engineering technology, and in particular to a method for attitude estimation and tracking of catenary insulators based on depth perception.
Background Technology
[0002] Insulators are crucial components of the railway catenary, providing suspension and maintaining electrical isolation between the energized parts of the suspension system and the insulated parts. Because of long-term outdoor exposure, insulators are subject to fatigue wear and environmental factors that cause damage and flashover, which degrades their insulation performance and ultimately affects train safety. Condition monitoring of catenary insulators is therefore critical to train safety. Traditional inspection typically uses inspection vehicles to capture images of the insulators, which are then checked manually. Manual inspection, however, depends on the skill level of the inspectors and suffers from high workload, poor real-time performance, and low accuracy, which easily leads to missed or false detections.
[0003] In railway catenary maintenance, insulator attitude estimation and tracking is crucial and directly affects the safety and reliability of railway operations. Traditional image processing methods can achieve good results on specific detection tasks, but they rely on manually designed features, involve a large workload, require heavy customization, and are easily affected by target shape, lighting, and background. When faced with new problems, they therefore generally suffer from low operating efficiency and poor detection accuracy.
[0004] This invention therefore proposes a depth-sensing-based attitude estimation and tracking method for catenary insulators. By using YOLOv8-OBB and attitude estimation technology, it improves accuracy, robustness, and processing speed, reduces the consumption of human resources, and improves work efficiency.
Summary of the Invention
[0005] The purpose of this invention is to provide a method for attitude estimation and tracking of catenary insulators based on depth perception.
[0006] To achieve the above objectives, the present invention is implemented according to the following technical solution:
[0007] This invention provides a method for attitude estimation and tracking of catenary insulators based on depth sensing, comprising:
[0008] Collect RGB and depth images of insulators in the overhead contact system;
[0009] The YOLOv8-OBB target detection algorithm is used to locate the insulator region in the RGB image and generate a mask image of the detection result.
[0010] The initial translation vector of the insulator is estimated using the mask image and the depth image, and initial pose hypotheses of the insulator are generated based on the estimate.
[0011] After the initial pose hypotheses are cropped and normalized, the pose of the insulator is iteratively optimized using the semantic segmentation model RefineNet to obtain candidate pose estimation results.
[0012] The optimal pose is selected from the candidate pose estimation results, and orientation alignment and coordinate offset correction are applied to it to obtain the image frame of the final pose.
[0013] The attitude of the insulator is tracked based on the image frames.
[0014] Preferably, the method for locating the insulator region in the RGB image based on the YOLOv8-OBB target detection algorithm specifically comprises:
[0015] Extract multi-scale spatial features from the RGB image and output the geometric parameters of each rotated bounding box, including center point coordinates, width, height, and rotation angle.
[0016] Preferably, the method for generating the mask image of the detection result is as follows:
[0017] Obtain the positions of the four vertices of the rotated bounding box in the image coordinate system to obtain the quadrilateral region;
[0018] Initialize a binary mask matrix with the same size as the RGB image, and fill the corresponding quadrilateral regions in the mask matrix with 1s, while keeping the remaining regions with 0s.
[0019] Preferably, the method for estimating the initial translation vector of the insulator using a mask image and a depth image includes:
[0020] Extract the set of pixel coordinates with a mask value of 1 from the mask image, and use the centroid of the set as the image center point of the target region;
[0021] Read the depth value at the image center point from the depth map, and back-project the image center point into the camera coordinate system to obtain the initial translation vector:
[0022] $$t_0 = z_c \, K^{-1} \begin{bmatrix} u_c & v_c & 1 \end{bmatrix}^{T}$$
[0023] In the formula, $z_c$ is the depth value at the image center point, $K$ is the camera intrinsic parameter matrix, and $(u_c, v_c)$ is the image center point in the depth map;
[0024] where
[0025] $$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
[0026] In the formula, $f_x$ and $f_y$ are the focal lengths in the horizontal and vertical directions, respectively, and $(c_x, c_y)$ are the coordinates of the principal point.
[0027] Preferably, the method for generating the initial pose hypotheses of the insulator based on the estimation result specifically includes:
[0028] N direction vectors are generated on a unit sphere using an equidistant sampling method.
[0029] For each direction vector, rotate about the vector itself in its normal plane at intervals of ... to generate M attitude transformations, so that each direction vector has multiple rotation attitudes with different azimuth angles, which gives the rotation hypotheses;
[0030] Combining the initial translation vector with the rotation hypotheses forms a total of N × M initial pose hypotheses:
[0031] $$T_{ij} = \begin{bmatrix} R_{ij} & t_0 \\ 0^{T} & 1 \end{bmatrix}$$
[0032] In the formula, $T_{ij}$ is the initial pose hypothesis in row $i$, column $j$, $R_{ij}$ is the rotation hypothesis in row $i$, column $j$, and $t_0$ is the initial translation vector.
[0033] Preferably, the method for cropping and normalizing the initial pose hypotheses specifically includes:
[0034] Obtain the CAD model of the insulator, and transform the 3D point set of the CAD model into the camera coordinate system based on each initial pose hypothesis, where each point satisfies:
[0035] $$P_k^{c} = R \, P_k + t$$
[0036] In the formula, $P_k^{c}$ is the transformed point corresponding to $P_k$, $P_k$ is the $k$-th three-dimensional point in the 3D point set, and $R$ and $t$ are the rotation and translation of the pose hypothesis;
[0037] The transformed 3D points are projected onto the image plane using the camera intrinsic parameter matrix to obtain the pixel coordinates:
[0038] $$\begin{bmatrix} u_k & v_k & 1 \end{bmatrix}^{T} = \frac{1}{z_k} \, K \, P_k^{c}$$
[0039] In the formula, $(u_k, v_k)$ are the pixel coordinates corresponding to $P_k^{c}$, $z_k$ is the depth value of $P_k^{c}$ in the camera coordinate system, and $K$ is the camera intrinsic parameter matrix;
[0040] The minimum bounding rectangle of the insulator is obtained based on the set of two-dimensional pixels, and the bounding rectangle is expanded by 10% of its size.
[0041] Based on the expanded bounding box, image patches are cropped from the RGB image and the depth image respectively;
[0042] The size of the image patches in the RGB image is scaled to the preset network input size, and the non-zero depth values in the image patches of the depth image are normalized.
[0043] Determine the initial center point of the image patch in the camera coordinate system, and project this initial center point onto the image center position of the image patch in the normalized depth map, which serves as the visual reference center.
[0044] Preferably, the architecture of the semantic segmentation model RefineNet includes:
[0045] The input layer has a dual-branch structure, including a real image branch and a synthetic image branch. The real image branch is used to input image patches after cropping and normalization, while the synthetic image branch is used to input multimodal image data, including RGB images, depth maps, normal maps, and 3D point cloud maps.
[0046] The encoding layer is a dual-path symmetric encoder structure, including shallow and deep encoder structures. Each path is composed of modules of the ResNet-34 architecture; the two paths process the image data of the real image branch and the synthetic image branch respectively, and are used to extract multi-scale features from the image data and construct cross-modal residual feature maps.
[0047] The feature fusion layer includes multiple residual convolutional units and attention mechanism units, which are used to perform spatial compression and semantic fusion on the residual feature maps;
[0048] The attitude increment regression layer is used to output attitude corrections, including rotation increments and translation increments.
[0049] The method for iteratively optimizing the attitude of the insulator is as follows:
[0050] The maximum number of iterations is set to 5. The current pose is iteratively updated with the pose correction to obtain the candidate pose estimation results. The iterative update is:
[0051] $$R_{k+1} = \Delta R_k \, R_k$$
[0052] $$t_{k+1} = t_k + \Delta t_k$$
[0053] In the formula, $R_{k+1}$ and $t_{k+1}$ are the rotation and translation estimates at iteration $k+1$; $\Delta R_k$ and $\Delta t_k$ are the rotation increment and the translation increment vector at iteration $k$; and $R_k$ and $t_k$ are the rotation and translation estimates at iteration $k$.
[0054] Preferably, the method for selecting the optimal pose from the candidate pose estimation results specifically includes:
[0055] Based on the candidate pose estimation results, the CAD model of the insulator is projected onto the image plane, and the corresponding RGB image, depth map and three-dimensional spatial coordinate map are generated by image rendering as the first image feature;
[0056] Based on the candidate pose estimation results, the projection region is determined in the original RGB image and cropped. The cropped region image and its corresponding depth information and three-dimensional coordinate map are extracted as the second image features.
[0057] The first and second image features are combined into an image feature pair, and channel-wise differencing, concatenation, and attention-weighted fusion are then performed. The fused features are mapped to a scalar score that represents how well the pose hypothesis matches the actual observation. The pose with the highest scalar score is selected as the optimal pose.
[0058] If the origin of the CAD model is inconsistent with the geometric centroid, the optimal pose is updated using the offline-predicted correction matrix:
[0059] $$T_{\text{final}} = T_{\text{opt}} \, T_{\text{corr}}$$
[0060] In the formula, $T_{\text{final}}$ is the final pose, $T_{\text{opt}}$ is the optimal pose, and $T_{\text{corr}}$ is the correction matrix.
[0061] Preferably, the method for aligning the orientation and correcting the coordinate offset of the optimal posture specifically includes:
[0062] Extract the z-axis direction vector $z$ of the final pose and, to align $z$ with the unit vector $e$, construct the alignment rotation matrix:
[0063] $$R_{\text{align}} = I + [v]_{\times} + [v]_{\times}^{2} \, \frac{1}{1 + c}$$
[0064] In the formula, $I$ is the identity matrix, $v$ and $c$ are respectively the cross product and dot product of $z$ with the unit vector $e$, and $[v]_{\times}$ is the antisymmetric matrix of $v$;
[0065] Multiplying the alignment rotation matrix by the fixed rotation matrix yields the rotation correction matrix, where the fixed rotation matrix is:
[0066]
[0067] The rotation component in the optimal pose is updated based on the rotation correction matrix to obtain the image frame of the final pose.
[0068] Preferably, the method for attitude tracking of insulators based on image frames specifically includes:
[0069] Get the current image frame and the previous image frame;
[0070] Edge erosion and bilateral filtering are performed on the depth map of the current image frame. Then, a point cloud map is generated from the processed depth map using the depth back projection relationship.
[0071] The RGB image, depth image, point cloud image of the current image frame and the optimal pose of the previous image frame are input into the semantic segmentation model RefineNet, and the pose correction is output.
[0072] The attitude estimate of the current image frame is updated based on the attitude correction amount and used as the initial reference attitude for attitude tracking of subsequent image frames, thereby realizing attitude tracking of the insulator.
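The depth back-projection used in the tracking step can be illustrated with a minimal sketch. It covers only the conversion of a (already filtered) depth map into an organized point map; edge erosion and bilateral filtering are not shown. The function name `depth_to_point_map`, the list-of-rows image layout, and the convention of mapping invalid (non-positive) depths to the zero point are assumptions made for illustration and are not taken from the source.

```python
def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a depth map into an organized map of (X, Y, Z) points.

    depth is a list of rows; depth[v][u] is the depth at pixel (u, v).
    Assumption: pixels with non-positive depth are invalid and map to (0, 0, 0).
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            if z > 0:
                # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
                out_row.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
            else:
                out_row.append((0.0, 0.0, 0.0))
        point_map.append(out_row)
    return point_map
```

Keeping the output organized (one point per pixel) lets it be cropped with the same bounding box as the RGB and depth patches.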
[0073] Compared with the prior art, the embodiments of the present invention have at least the following advantages or beneficial effects:
[0074] (1) By constructing a pixel-level semantic matching scoring mechanism based on rendered and real images, and by discretizing the rotation hypothesis space through uniform sampling of the unit sphere, this invention addresses two problems of traditional pose estimation methods: low accuracy caused by static matching, and unstable estimates caused by the multiple pose solutions of symmetrical objects.
[0075] (2) By using YOLOv8-OBB-based multi-target detection and instance segmentation, running the attitude estimation pipeline independently for each detected instance, and constructing an independent tracker based on inter-frame attitude correction, this invention enables simultaneous pose tracking of multiple catenary insulators.
[0076] (3) The present invention generates semantic matching pairs of real and rendered images for each group of pose candidates, evaluates their spatial consistency, selects the optimal pose as the final result, effectively eliminates redundant poses and erroneous estimations, and achieves accurate screening of multiple candidate poses.
[0077] (4) This invention achieves stable 6D attitude tracking over time by constructing an inter-frame attitude tracking mechanism that combines depth-map edge erosion and bilateral filtering preprocessing, inverse camera projection, and continuous attitude correction.
Attached Figure Description
[0078] Figure 1 is a flowchart of the depth-perception-based method for attitude estimation and tracking of catenary insulators provided by the present invention.
[0079] Figure 2 is an architecture diagram of the YOLOv8-OBB target detection algorithm in an embodiment of the present invention.
[0080] Figure 3 is an architecture diagram of the RefineNet semantic segmentation model in an embodiment of the present invention.
[0081] Figure 4 is an architecture diagram of the scoring module ScoreNet in an embodiment of the present invention.
Detailed Implementation
[0082] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0083] In this document, suffixes such as "module," "part," or "unit" used to denote elements are used for illustrative purposes only and have no specific meaning in themselves. Therefore, "module," "part," or "unit" may be used interchangeably.
[0084] In this document, the terms "upper," "lower," "inner," "outer," "front," "rear," "one end," and "the other end," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the present invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
[0085] In this document, "and / or" includes any and all combinations of one or more of the listed related items.
[0086] In this document, "multiple" means two or more, that is, it includes two, three, four, five, etc.
[0087] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0088] In this document, unless otherwise explicitly specified and limited, the terms "installed," "equipped with," "connected," etc., should be interpreted broadly. For example, "connection" can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection, a direct connection, or an indirect connection through an intermediate medium; it can be a connection within two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.
[0089] In this document, the "YOLOv8-OBB target detection algorithm" is a rotated-target detection algorithm built on the YOLOv8 target detection framework. It can identify targets with an orientation angle. Compared with the traditional axis-aligned bounding box, the oriented bounding box (OBB) expresses the orientation of the target through an angle parameter, which makes it especially suitable for catenary insulator targets that are rotated, tilted, or occluded.
[0090] As shown in Figure 1, this invention provides a method for attitude estimation and tracking of catenary insulators based on depth sensing, including steps S1 to S6:
[0091] Step S1: Collect the RGB image and depth image of the insulators in the catenary.
[0092] In this embodiment, the ZED-Mini binocular depth camera is fixedly mounted on the end effector of the robotic arm. Based on the actual installation position of the insulator in the contact network system, the robotic arm is controlled to move to a suitable observation pose to ensure that the insulator is completely in the center of the camera's field of view, avoiding edge occlusion or image cropping, and ensuring the accuracy of subsequent image processing, target detection, and attitude estimation.
[0093] Once the robotic arm reaches the designated pose, the ZED-Mini camera is activated to collect data, simultaneously acquiring a high-resolution RGB image and the corresponding depth map of the target scene.
[0094] In some embodiments, in order to adapt to different ambient lighting conditions on site, the camera's exposure parameters can be automatically adjusted according to the ambient brightness through an automatic exposure adjustment algorithm during the acquisition process.
[0095] Adjusting the camera's exposure parameters can continuously provide high-quality images in complex environments such as strong outdoor light, shadows, or low light at night, effectively avoiding overexposure or underexposure.
[0096] Step S2: Locate the insulator region in the RGB image based on the YOLOv8-OBB target detection algorithm and generate a mask image of the detection results.
[0097] Specifically, the method for locating the insulator region in an RGB image based on the YOLOv8-OBB target detection algorithm is as follows:
[0098] Extract multi-scale spatial features from the RGB image and output the geometric parameters of each rotated bounding box, including center point coordinates, width, height, and rotation angle.
[0099] Further, a mask image of the detection results is generated:
[0100] Obtain the positions of the four vertices of the rotated bounding box in the image coordinate system to obtain the quadrilateral region;
[0101] Initialize a binary mask matrix with the same size as the RGB image, and fill the corresponding quadrilateral regions in the mask matrix with 1s, while keeping the remaining regions with 0s.
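The mask generation described above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming the detector returns each rotated box as `(cx, cy, w, h, angle)` with the angle in radians; the function names and the point-in-quadrilateral fill are illustrative choices, not the patent's implementation (an image library polygon fill would normally be used instead).

```python
import math


def obb_vertices(cx, cy, w, h, angle_rad):
    """Return the four corners of a rotated box, ordered around its perimeter."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half_extents]


def inside_convex_quad(px, py, quad, eps=1e-9):
    """True if (px, py) lies inside or on the boundary of a convex quadrilateral."""
    crosses = []
    for i in range(4):
        x0, y0 = quad[i]
        x1, y1 = quad[(i + 1) % 4]
        crosses.append((x1 - x0) * (py - y0) - (y1 - y0) * (px - x0))
    # Inside when the point is on the same side of all four edges.
    return all(v >= -eps for v in crosses) or all(v <= eps for v in crosses)


def obb_mask(height, width, box):
    """Binary mask (list of rows): 1 inside the rotated box, 0 elsewhere."""
    quad = obb_vertices(*box)
    return [[1 if inside_convex_quad(x, y, quad) else 0 for x in range(width)]
            for y in range(height)]
```

For several detections, one mask per box keeps the instances separate so that each can be passed through the pose pipeline independently.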
[0102] As shown in Figure 2, the model architecture of the YOLOv8-OBB object detection algorithm includes Backbone, Neck, and Head modules. The Backbone module consists of multiple Conv-BN-SiLU (CBS) modules and a cross-stage fusion module (C2f); it extracts semantic information at multiple scales and generates feature maps from high to low resolution. The Neck module uses upsampling and feature concatenation (Concat) and further integrates features from upper and lower layers through the C2f module to strengthen the representation of targets at different scales, outputting three sets of fused feature maps. In the Head module, each scale has a set of three output branches that predict the bounding box parameters (Box), target category (Cls), and target rotation angle (Angle), respectively. Each branch contains two CBS modules and one convolutional layer (Conv2d).
[0103] Step S3: Estimate the initial translation vector of the insulator using the mask image and depth image, and generate initial pose hypotheses of the insulator based on the estimate.
[0104] Due to the near-cylindrical geometric characteristics of the insulator, a uniform sampling strategy based on a unit sphere is adopted to generate multiple viewing directions. Specifically, the method for estimating the initial translation vector of the insulator using a mask image and a depth map includes:
[0105] Extract the set of pixel coordinates with a mask value of 1 from the mask image, and use the centroid of the set as the image center point of the target region;
[0106] Read the depth value at the image center point from the depth map, and back-project the image center point into the camera coordinate system to obtain the initial translation vector:
[0107] $$t_0 = z_c \, K^{-1} \begin{bmatrix} u_c & v_c & 1 \end{bmatrix}^{T}$$
[0108] In the formula, $z_c$ is the depth value at the image center point, $K$ is the camera intrinsic parameter matrix, and $(u_c, v_c)$ is the image center point in the depth map;
[0109] where
[0110] $$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
[0111] In the formula, $f_x$ and $f_y$ are the focal lengths in the horizontal and vertical directions, respectively, and $(c_x, c_y)$ are the coordinates of the principal point.
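As a rough illustration of this step, the sketch below computes the mask centroid and back-projects it with the pinhole model, which is what multiplying by the inverse intrinsic matrix amounts to. The function names, the list-of-rows image layout, and reading the depth at the rounded centroid pixel are assumptions for the example; the source does not say how a non-integer centroid is sampled.

```python
def mask_centroid(mask):
    """Centroid (u, v) of all pixels whose mask value is 1."""
    sum_u = sum_v = count = 0
    for v, row in enumerate(mask):
        for u, value in enumerate(row):
            if value == 1:
                sum_u += u
                sum_v += v
                count += 1
    if count == 0:
        raise ValueError("mask contains no foreground pixels")
    return sum_u / count, sum_v / count


def back_project(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection: z * K^-1 * [u, v, 1]^T."""
    return (z * (u - cx) / fx, z * (v - cy) / fy, z)


def initial_translation(mask, depth, fx, fy, cx, cy):
    """Initial translation vector from the mask centroid and its depth value."""
    u, v = mask_centroid(mask)
    # Assumption: sample the depth map at the nearest pixel to the centroid.
    z = depth[int(round(v))][int(round(u))]
    return back_project(u, v, z, fx, fy, cx, cy)
```

In practice the depth at a single pixel can be missing or noisy, so a median over the masked region may be a more robust choice; that variation is not described in the source.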
[0112] Furthermore, based on the estimation results, the initial pose hypotheses of the insulator are generated:
[0113] N direction vectors are generated on a unit sphere using an equidistant sampling method.
[0114] For each direction vector, rotate about the vector itself in its normal plane at intervals of ... to generate M attitude transformations, so that each direction vector has multiple rotation attitudes with different azimuth angles, which gives the rotation hypotheses;
[0115] Combining the initial translation vector with the rotation hypotheses forms a total of N × M initial pose hypotheses:
[0116] $$T_{ij} = \begin{bmatrix} R_{ij} & t_0 \\ 0^{T} & 1 \end{bmatrix}$$
[0117] In the formula, $T_{ij}$ is the initial pose hypothesis in row $i$, column $j$, $R_{ij}$ is the rotation hypothesis in row $i$, column $j$, and $t_0$ is the initial translation vector.
[0118] The initial translation vector is the three-dimensional position estimate of the insulator in the camera coordinate system, representing the translation component in the initial attitude.
[0119] In this embodiment, 40 direction vectors are generated on a unit sphere using an equidistant sampling method, with each direction vector representing a possible observation direction.
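One way to realize the N × M rotation hypotheses is sketched below. It assumes a Fibonacci lattice as the "equidistant" sphere sampling and spaces the M in-plane rotations evenly over a full turn; both choices, the value of M, and all function names are assumptions for illustration, since the source does not give the sampling scheme or the angular interval.

```python
import math


def fibonacci_sphere(n):
    """n approximately evenly spaced unit vectors (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    dirs = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = golden_angle * i
        dirs.append((r * math.cos(phi), r * math.sin(phi), z))
    return dirs


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def normalize(a):
    norm = math.sqrt(sum(v * v for v in a))
    return tuple(v / norm for v in a)


def rotation_hypotheses(n_dirs, m_inplane):
    """Return n_dirs * m_inplane rotation matrices (row-major 3x3 lists).

    The third column of each matrix is the sampled direction; the first two
    columns are rotated about that direction in equal angular steps.
    """
    hypotheses = []
    for d in fibonacci_sphere(n_dirs):
        # Any helper axis not parallel to d gives a valid orthonormal frame.
        helper = (1.0, 0.0, 0.0) if abs(d[0]) < 0.9 else (0.0, 1.0, 0.0)
        x = normalize(cross(helper, d))
        y = cross(d, x)
        for j in range(m_inplane):
            a = 2.0 * math.pi * j / m_inplane
            ca, sa = math.cos(a), math.sin(a)
            col0 = tuple(ca * xi + sa * yi for xi, yi in zip(x, y))
            col1 = tuple(-sa * xi + ca * yi for xi, yi in zip(x, y))
            hypotheses.append([[col0[r], col1[r], d[r]] for r in range(3)])
    return hypotheses
```

Pairing every returned rotation with the same initial translation vector gives the full set of initial pose hypotheses, for example `rotation_hypotheses(40, 6)` for the 40 directions of this embodiment and a hypothetical M of 6.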
[0120] Step S4: After the initial pose hypotheses are cropped and normalized, the pose of the insulator is iteratively optimized using the semantic segmentation model RefineNet to obtain candidate pose estimation results.
[0121] Specifically, the method for cropping and normalizing the initial pose hypotheses is as follows:
[0122] Obtain the CAD model of the insulator, and transform the 3D point set of the CAD model into the camera coordinate system based on each initial pose hypothesis, where each point satisfies:
[0123] $$P_k^{c} = R \, P_k + t$$
[0124] In the formula, $P_k^{c}$ is the transformed point corresponding to $P_k$, $P_k$ is the $k$-th three-dimensional point in the 3D point set, and $R$ and $t$ are the rotation and translation of the pose hypothesis;
[0125] The transformed 3D points are projected onto the image plane using the camera intrinsic parameter matrix to obtain the pixel coordinates:
[0126] $$\begin{bmatrix} u_k & v_k & 1 \end{bmatrix}^{T} = \frac{1}{z_k} \, K \, P_k^{c}$$
[0127] In the formula, $(u_k, v_k)$ are the pixel coordinates corresponding to $P_k^{c}$, $z_k$ is the depth value of $P_k^{c}$ in the camera coordinate system, and $K$ is the camera intrinsic parameter matrix;
[0128] The minimum bounding rectangle of the insulator is obtained based on the set of two-dimensional pixels, and the bounding rectangle is expanded by 10% of its size.
[0129] Based on the expanded bounding box, image patches are cropped from the RGB image and the depth image respectively;
[0130] The size of the image patches in the RGB image is scaled to the preset network input size, and the non-zero depth values in the image patches of the depth image are normalized.
[0131] Determine the initial center point of the image patch in the camera coordinate system, and project this initial center point onto the image center position of the image patch in the normalized depth map, which serves as the visual reference center.
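The geometric part of the cropping step (transform, project, bound, expand) can be sketched as follows. The function names are illustrative, and the interpretation of "expanded by 10% of its size" as a 10% total growth in width and height, split evenly between the two sides, is an assumption; the source does not state the convention. Resizing and depth normalization are not shown.

```python
def transform_points(R, t, points):
    """Apply P_cam = R * P_model + t to every model point."""
    return [tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
            for p in points]


def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points to pixel coordinates."""
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points_cam]


def expanded_bbox(pixels, ratio=0.10):
    """Axis-aligned bounding box (u_min, v_min, u_max, v_max), grown by `ratio`.

    Assumption: width and height each grow by `ratio` in total,
    half on each side.
    """
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    u0, u1, v0, v1 = min(us), max(us), min(vs), max(vs)
    du = (u1 - u0) * ratio / 2.0
    dv = (v1 - v0) * ratio / 2.0
    return (u0 - du, v0 - dv, u1 + du, v1 + dv)
```

A real pipeline would also clip the expanded box to the image bounds before cropping the RGB and depth patches.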
[0132] As shown in Figure 3, the architecture of the semantic segmentation model RefineNet includes:
[0133] The input layer has a dual-branch structure, including a real image branch and a synthetic image branch. The real image branch is used to input image patches after cropping and normalization, while the synthetic image branch is used to input multimodal image data, including RGB images, depth maps, normal maps, and 3D point cloud maps.
[0134] The encoding layer is a dual-path symmetric encoder structure, including shallow and deep encoder structures, which correspond to the convolution-normalization-activation modules and residual modules in Figure 3. Each path is composed of modules of the ResNet-34 architecture; the two paths process the image data of the real image branch and the synthetic image branch respectively, and are used to extract multi-scale features from the image data and construct cross-modal residual feature maps.
[0135] Furthermore, the feature maps of the real image branch and the synthetic image branch are concatenated along the channel dimension by the concatenation module, after which the convolution-normalization-activation module and the residual module extract deeper features.
[0136] To enhance the neural network's ability to perceive spatial order information, the vector sequence of each spatial location obtained after the feature map is processed by the encoder is input into the position embedding module. This module adds a unique position information encoding to the vector of each spatial location, thereby enabling the subsequent Transformer module to recognize the relative relationships and semantic structure between the features at each location;
[0137] The feature fusion layer, corresponding to the Transformer encoder in the diagram, consists of two Transformer sub-networks with similar structure but separate parameters. These sub-networks independently learn translational and rotational features and output the corresponding targets. Specifically, a multi-head self-attention mechanism captures global dependencies within the sequence; residual connections and layer normalization modules (i.e., residual convolutional units) ensure training stability and gradient propagation; and fully connected layers provide non-linear mapping to enhance feature representation.
[0138] The Transformer encoder output is mapped to a 3D translation vector space through a linear transformation to obtain the 3D translation features corresponding to the vector sequence at each spatial location. The rotation branch has a similar structure to the translation branch, with independent multi-head self-attention layers, residual connections, and feedforward networks.
[0139] The attitude increment regression layer is used to output attitude corrections, including rotation increments and translation increments.
[0140] Furthermore, the method for iteratively optimizing the attitude of the insulator is as follows:
[0141] The maximum number of iterations is set to 5. The current pose is iteratively updated with the pose correction to obtain the candidate pose estimation results. The iterative update is:
[0142] $$R_{k+1} = \Delta R_k \, R_k$$
[0143] $$t_{k+1} = t_k + \Delta t_k$$
[0144] In the formula, $R_{k+1}$ and $t_{k+1}$ are the rotation and translation estimates at iteration $k+1$; $\Delta R_k$ and $\Delta t_k$ are the rotation increment and the translation increment vector at iteration $k$; and $R_k$ and $t_k$ are the rotation and translation estimates at iteration $k$.
[0145] In this embodiment, the image patches of the RGB image are scaled to the preset network input size of 128.
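The iterative refinement can be pictured as a short loop around the network. The sketch below assumes that the rotation increment is left-multiplied onto the current rotation and that the translation increment is added, which is a common convention but is not confirmed by the source; `predict_increment` is a hypothetical callable standing in for a RefineNet forward pass.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices given as row-major nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def refine_pose(R, t, predict_increment, max_iters=5):
    """Run a fixed number of refinement rounds on one pose hypothesis.

    predict_increment(R, t) must return (dR, dt): a 3x3 rotation increment
    and a length-3 translation increment for the current estimate.
    """
    for _ in range(max_iters):
        dR, dt = predict_increment(R, t)
        R = matmul3(dR, R)                           # assumed left-multiplied update
        t = tuple(ti + di for ti, di in zip(t, dt))  # additive translation update
    return R, t
```

Running this loop once per initial pose hypothesis yields the set of candidate poses that are scored afterwards.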
[0146] Step S5: Select the optimal pose from the candidate pose estimation results, perform orientation alignment and coordinate offset correction on the optimal pose, and obtain the image frame of the final pose.
[0147] It should be noted that because insulators are strongly axisymmetric, their appearance does not change when they rotate about their axis of symmetry. This easily leads to multiple solutions in image-based 6D pose estimation, especially across consecutive frames, and shows up as discontinuities, drift, or even abrupt changes between frames in the rotational component of the pose matrix (particularly the angle about the axis of symmetry). This problem severely affects the stability and executability of industrial grasping paths. Therefore, after the initial 6D pose estimation is complete, the pose matrix needs to be aligned to a uniform orientation and its coordinates fine-tuned.
[0148] Specifically, the method for selecting the optimal pose from the candidate pose estimation results is as follows:
[0149] Based on the candidate pose estimation results, the CAD model of the insulator is projected onto the image plane, and the corresponding RGB image, depth map and three-dimensional spatial coordinate map are generated by image rendering as the first image feature;
[0150] Based on the candidate pose estimation results, the projection region is determined in the original RGB image and cropped. The cropped region image and its corresponding depth information and three-dimensional coordinate map are extracted as the second image features.
[0151] The first and second image features are combined into an image feature pair, and channel-wise differencing, concatenation, and attention-weighted fusion are then performed. The fused features are mapped to a scalar score that represents how well the pose hypothesis matches the actual observation. The pose with the highest scalar score is selected as the optimal pose.
[0152] If the origin of the CAD model is inconsistent with the geometric centroid, the optimal pose is updated using the offline-predicted correction matrix:
[0153] $T_{final} = T_{best} \cdot T_{align}$
[0154] In the formula, $T_{final}$ is the final pose, $T_{best}$ is the optimal pose, and $T_{align}$ is the correction matrix.
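The selection and correction steps can be illustrated with a short sketch. It assumes that the correction matrix is applied by right-multiplication, $T_{final} = T_{best} \cdot T_{align}$, that is, as a change of origin expressed in the CAD model frame; the multiplication order is an assumption, since the formula is not legible in this text. The candidate poses, the scores, and the 0.1 m centroid offset are hypothetical values chosen for illustration only.

```python
def matmul4(a, b):
    """Multiply two 4x4 homogeneous matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    """Homogeneous matrix for a pure translation."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def select_best(candidates, scores):
    """Return the candidate pose with the highest scalar score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best_index]

# Hypothetical candidate poses: the second is a 90-degree turn about z at 2 m depth.
candidates = [
    translation(0.0, 0.0, 1.5),
    [[0.0, -1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 2.0],
     [0.0, 0.0, 0.0, 1.0]],
    translation(0.2, 0.0, 2.0),
]
scores = [0.12, 0.87, 0.45]           # hypothetical scoring-network outputs

T_best = select_best(candidates, scores)
T_align = translation(0.1, 0.0, 0.0)  # hypothetical centroid offset in the model frame
T_final = matmul4(T_best, T_align)    # assumed order: T_final = T_best * T_align
```

Because the offset is expressed in the model frame, it is rotated by the optimal pose before it reaches the camera frame: the 0.1 m offset along the model x-axis becomes a 0.1 m shift along the camera y-axis, while the rotation block is unchanged.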
[0155] Furthermore, the method for orientation alignment and coordinate offset correction of the optimal pose is as follows:
[0156] Extract the z-axis direction vector $V_z$ of the final pose and, to align $V_z$ with the unit vector, construct the alignment rotation matrix:
[0157] $R_{align} = I + [V]_\times + [V]_\times^2 \cdot \frac{1}{1 + c}$
[0158] In the formula, $I$ is the $3 \times 3$ identity matrix, $V$ and $c$ are respectively the cross product and the dot product of $V_z$ with the unit vector, and $[V]_\times$ is the antisymmetric matrix of $V$;
[0159] Multiplying the alignment rotation matrix by the fixed rotation matrix yields the rotation correction matrix, where the fixed rotation matrix is:
[0160]
[0161] The rotation component in the optimal pose is updated based on the rotation correction matrix to obtain the image frame of the final pose.
[0162] Here, the cross product $V$ of the z-axis direction vector $V_z$ with the unit vector defines the axis of rotation, and the dot product $c$ of $V_z$ with the unit vector is the cosine of the rotation angle.
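The construction of the alignment rotation matrix can be sketched as below. The sketch assumes the standard closed form $I + [V]_\times + [V]_\times^2 / (1 + c)$ that matches the quantities named above, and it assumes the target unit vector is the z-axis $(0, 0, 1)$, which this text does not specify. The fixed rotation matrix is not legible in this text and is therefore left out; all function names are illustrative.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def skew(v):
    """Antisymmetric (cross-product) matrix of a 3-vector."""
    return [[0.0, -v[2], v[1]],
            [v[2], 0.0, -v[0]],
            [-v[1], v[0], 0.0]]

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [dot(row, v) for row in m]

def alignment_rotation(v_z, target=(0.0, 0.0, 1.0)):
    """Rotation that turns the unit vector v_z onto the unit vector target."""
    v = cross(v_z, target)     # rotation axis (unnormalized)
    c = dot(v_z, target)       # cosine of the rotation angle
    if abs(1.0 + c) < 1e-9:
        raise ValueError("v_z is antiparallel to the target; the formula is singular")
    k = skew(v)
    k2 = matmul3(k, k)
    s = 1.0 / (1.0 + c)
    return [[(1.0 if i == j else 0.0) + k[i][j] + s * k2[i][j] for j in range(3)]
            for i in range(3)]

v_z = normalize([1.0, 2.0, 2.0])   # hypothetical z-axis of an estimated pose
R_align = alignment_rotation(v_z)
aligned = matvec3(R_align, v_z)    # approximately (0, 0, 1)
```

The closed form avoids computing the rotation angle explicitly, but it is undefined when the two vectors are exactly opposite ($c = -1$), so the sketch raises an error in that case rather than returning a meaningless matrix.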
[0163] In this embodiment, the scoring model ScoreNet shown in Figure 4 performs channel-wise differencing, concatenation, and attention-weighted fusion to select the optimal pose.
[0164] The ScoreNet scoring model works as follows. The first image feature and the second image feature are combined into an image pair (A, B). Each image is downsampled by a convolution-normalization-activation module to extract semantic information, and the resulting feature map is enhanced by a residual module, which preserves information while increasing nonlinear expressive power. The feature maps of each image pair are then concatenated by the concatenation module. The concatenated feature map is passed through further convolution-normalization-activation and residual modules, which gradually abstract local matching features into high-level global matching semantics. Finally, the flatten layer flattens the feature map into a sequence, and a positional embedding is added so that the Transformer can perceive spatial structure.
[0165] For each pair of (A, B) matching features that have been flattened into a sequence, a multi-head self-attention mechanism is used to capture the interrelationships between the internal regions of A and B. Then, an average pooling layer is used to perform average pooling along the spatial dimension, fusing all attention information in the space into a global matching feature vector for each pair of A and B. The matching feature vectors of all pose candidates are then passed through a cross-candidate multi-head attention module, allowing all pose candidates to "communicate" with each other and reinforce which candidates are more reasonable or more conflicting. Finally, a linear transformation MLP scoring module is applied to each candidate pose vector, ultimately outputting the score for all candidate poses for each sample.
[0166] Step S6: Perform attitude tracking on the insulator based on the image frames.
[0167] It should be explained that after acquiring the current frame image, in order to ensure the continuity and stability of the pose estimation in the time series, it is necessary to perform high-precision tracking and updating of the pose estimation result of the previous frame.
[0168] Specifically, the method for attitude tracking of insulators based on image frames includes:
[0169] Get the current image frame and the previous image frame;
[0170] Edge erosion and bilateral filtering are performed on the depth map of the current image frame. Then, a point cloud map is generated from the processed depth map using the depth back projection relationship.
[0171] The RGB image, depth image, point cloud image of the current image frame and the optimal pose of the previous image frame are input into the semantic segmentation model RefineNet, and the pose correction is output.
[0172] The attitude estimate of the current image frame is updated based on the attitude correction amount and used as the initial reference attitude for attitude tracking of subsequent image frames, thereby realizing attitude tracking of the insulator.
[0173] Among them, performing edge erosion and bilateral filtering on the original depth map can improve the spatial consistency of the image boundary and avoid irregular noise from interfering with subsequent projection calculations.
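The depth preprocessing and back-projection step can be sketched as follows. This is a simplified illustration under stated assumptions: "edge erosion" is interpreted here as discarding valid depth pixels that touch an invalid (zero) pixel, bilateral filtering is omitted for brevity, and the intrinsic parameters `fx`, `fy`, `cx`, `cy` and the tiny depth map are hypothetical values.

```python
def erode_valid(depth):
    """Zero out every pixel that has an invalid (<= 0) 4-neighbor inside the image."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for v in range(h):
        for u in range(w):
            for dv, du in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nv, nu = v + dv, u + du
                if 0 <= nv < h and 0 <= nu < w and depth[nv][nu] <= 0.0:
                    out[v][u] = 0.0
                    break
    return out

def backproject_depth(depth, fx, fy, cx, cy):
    """Convert a depth map (metres, 0 = invalid) into camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # skip invalid depth readings
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# Hypothetical 3x3 depth map with one missing reading at row 1, column 2.
depth = [[1.0, 1.0, 1.0],
         [1.0, 1.0, 0.0],
         [1.0, 1.0, 1.0]]
cleaned = erode_valid(depth)
points = backproject_depth(cleaned, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
```

Eroding before back-projection removes the pixels next to depth holes, which are the ones most likely to carry mixed foreground and background readings. In a tracking loop, the resulting point cloud would be passed to the refinement model together with the RGB image, the depth map, and the pose of the previous frame.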
[0174] In this embodiment, performance was evaluated on an NVIDIA GeForce RTX 4090 platform. In tests on 1200 complex industrial scene images, the target detection accuracy reached 98.6% with an average inference time of less than 0.05 seconds, meeting the requirements for real-time industrial operation. In attitude estimation, the average error was less than 5 cm and the angle error was less than 10°, with fast attitude convergence and strong robustness. In simulated continuous industrial operation scenarios, the average error of the 30-frame tracking experiment was kept within 2.7 cm, and the error jitter amplitude was less than 1 cm.
[0175] In summary, the attitude estimation and tracking method of this invention achieves industrial-grade performance in terms of speed, accuracy, and robustness, and has a wide range of applications.
[0176] The above description is merely an example and illustration of the structure of the present invention. Those skilled in the art can make various modifications or additions to the specific embodiments described, or use similar methods to replace them, as long as they do not deviate from the structure of the invention or exceed the scope defined in the claims, all of which should fall within the protection scope of the present invention.
Claims
1. A method for attitude estimation and tracking of overhead contact line insulators based on depth sensing, characterized in that it includes the following steps: collect RGB and depth images of insulators in the overhead contact system; use the YOLOv8-OBB target detection algorithm to locate the insulator region in the RGB image and generate a mask image of the detection result; estimate the initial translation vector of the insulator using the mask image and the depth image; based on the estimation result, generate the initial pose hypotheses of the insulator, specifically as follows: N direction vectors are generated on a unit sphere surface of the insulator using an equidistant sampling method; each direction vector is rotated around itself at 60° intervals in its normal plane to generate M attitude transformations, so that each direction vector contains multiple rotation attitudes with different azimuth angles, and the rotation hypotheses are obtained; the initial translation vector is combined with the rotation hypotheses to form a total of N×M initial pose hypotheses: $T_{ij} = \begin{bmatrix} R_{ij} & t_0 \\ 0^T & 1 \end{bmatrix}$, where $T_{ij}$ is the initial pose hypothesis in the i-th row and j-th column, $R_{ij}$ is the rotation hypothesis in the i-th row and j-th column, and $t_0$ is the initial translation vector; after cropping and normalizing the initial pose hypotheses, the pose of the insulator is iteratively optimized using the semantic segmentation model RefineNet to obtain candidate pose estimation results; the optimal pose is selected from the candidate pose estimation results, and orientation alignment and coordinate offset correction are performed on the optimal pose to obtain the image frame of the final pose.
The method for aligning the optimal pose in direction and correcting for coordinate offset is as follows: extract the z-axis direction vector $V_z$ of the final attitude and, to align $V_z$ with the unit vector, construct the alignment rotation matrix: $R_{align} = I + [V]_\times + [V]_\times^2 \cdot \frac{1}{1 + c}$, where $I$ is the 3×3 identity matrix, $V$ and $c$ are respectively the cross product and the dot product of $V_z$ with the unit vector, and $[V]_\times$ is the antisymmetric matrix of $V$; multiplying the alignment rotation matrix by the fixed rotation matrix yields the rotation correction matrix, where the fixed rotation matrix is: The rotation component in the optimal pose is updated based on the rotation correction matrix to obtain the image frame of the final pose. The attitude of the insulator is tracked based on the image frames.
2. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the method for locating the insulator region in the RGB image using the YOLOv8-OBB target detection algorithm is as follows: extract multi-scale spatial features from the RGB image and output the geometric parameters of each rotated bounding box, including center point coordinates, width, height, and rotation angle.
3. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the method for generating the mask image of the detection result is as follows: obtain the positions of the four vertices of the rotated bounding box in the image coordinate system to obtain the quadrilateral region; initialize a binary mask matrix with the same size as the RGB image, fill the corresponding quadrilateral region in the mask matrix with 1s, and keep the remaining regions at 0.
4. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the method for estimating the initial translation vector of the insulator using the mask image and the depth image includes: extract the set of pixel coordinates with a mask value of 1 from the mask image, and use the centroid of the set as the image center point of the target region; read the depth value corresponding to the image center point from the depth map, and back-project the image center point into the camera coordinate system to obtain the initial translation vector: $t_0 = z_c K^{-1} [U_c, V_c, 1]^T$, where $z_c$ is the depth value corresponding to the image center point, $K$ is the camera intrinsic parameter matrix, and $(U_c, V_c)$ is the image center point in the depth map; here, $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $f_x$ and $f_y$ are the focal lengths in the horizontal and vertical directions, respectively, and $(c_x, c_y)$ are the coordinates of the principal point.
5. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the method for cropping and normalizing the initial pose hypotheses is as follows: obtain the CAD model of the insulator, and transform the set of 3D points of the CAD model into the camera coordinate system based on the initial pose hypothesis, where each point satisfies: $x_i' = R x_i + t$, where $R$ and $t$ are the rotation and translation of the initial pose hypothesis, $x_i'$ is the transformed point corresponding to $x_i$, and $x_i$ is the i-th 3D point in the set of 3D points; the transformed 3D points are projected onto the image plane using the camera intrinsic parameter matrix to obtain the pixel coordinates: $[U_i, V_i, 1]^T = \frac{1}{z_i'} K x_i'$, where $(U_i, V_i)$ are the pixel coordinates corresponding to $x_i'$, and $z_i'$ is the depth value of $x_i'$ in the camera coordinate system; the minimum bounding rectangle of the insulator is obtained from the set of two-dimensional pixels, and the bounding rectangle is expanded by 10% of its size; based on the expanded bounding box, image patches are cropped from the RGB image and the depth image respectively; the image patches of the RGB image are scaled to the preset network input size, and the non-zero depth values in the image patches of the depth image are normalized; the initial center point of the image patch in the camera coordinate system is determined, and this initial center point is projected onto the image center position of the image patch in the normalized depth map, which serves as the visual reference center.
6. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the architecture of the semantic segmentation model RefineNet includes: an input layer with a dual-branch structure, including a real image branch and a synthetic image branch, where the real image branch is used to input the image patches after cropping and normalization, and the synthetic image branch is used to input multimodal image data, including RGB images, depth maps, normal maps, and 3D point cloud maps; a coding layer with a two-way symmetrical encoder structure, including shallow and deep encoder structures, where each path is composed of modules of the ResNet-34 architecture, and the two paths process the image data of the real image branch and the synthetic image branch respectively, extract multi-scale features from the image data, and construct cross-modal residual feature maps; a feature fusion layer, including multiple residual convolutional units and attention mechanism units, used to perform spatial compression and semantic fusion on the residual feature maps; and an attitude increment regression layer, used to output attitude corrections, including rotation increments and translation increments. The method for iteratively optimizing the attitude of the insulator is as follows: the maximum number of iterations is set to 5, and the current pose is iteratively updated based on the pose correction to obtain the candidate pose estimation results, the iterative update operation being: $R_{k+1} = \Delta R_k R_k$, $t_{k+1} = t_k + \Delta t_k$, where $R_{k+1}$ and $t_{k+1}$ are the rotation and translation estimates of the $(k+1)$-th iteration, $\Delta R_k$ and $\Delta t_k$ are the rotation increment and translation increment vector of the $k$-th iteration, and $R_k$ and $t_k$ are the rotation and translation estimates of the $k$-th iteration.
7. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the method for selecting the optimal pose from the candidate pose estimation results is as follows: based on the candidate pose estimation results, the CAD model of the insulator is projected onto the image plane, and the corresponding RGB image, depth map and three-dimensional spatial coordinate map are generated by image rendering as the first image feature; based on the candidate pose estimation results, the projection region is determined in the original RGB image and cropped, and the cropped region image and its corresponding depth information and three-dimensional coordinate map are extracted as the second image feature; the first and second image features are combined into an image feature pair, and channel-wise differencing, concatenation and attention-weighted fusion are performed; the fused features are mapped to scalar scores, which represent the degree of matching between the pose hypothesis and the actual observation; the pose with the highest scalar score is selected as the optimal pose; if the origin of the CAD model is inconsistent with the geometric centroid, the optimal pose is updated using the offline-predicted correction matrix: $T_{final} = T_{best} \cdot T_{align}$, where $T_{final}$ is the final pose, $T_{best}$ is the optimal pose, and $T_{align}$ is the correction matrix.
8. The method for attitude estimation and tracking of overhead contact line insulators based on depth sensing according to claim 1, characterized in that the method for attitude tracking of insulators based on image frames is as follows: obtain the current image frame and the previous image frame; perform edge erosion and bilateral filtering on the depth map of the current image frame, and then generate a point cloud map from the processed depth map using the depth back-projection relationship; input the RGB image, depth image and point cloud image of the current image frame, together with the optimal pose of the previous image frame, into the semantic segmentation model RefineNet, which outputs the pose correction; update the attitude estimate of the current image frame based on the attitude correction and use it as the initial reference attitude for attitude tracking of subsequent image frames, thereby realizing attitude tracking of the insulator.