Semantic SLAM construction method and system for AR navigation of divers
By separating scattering components, correcting pose, and constructing semantic descriptors with elastic topological constraints in the underwater environment, the problem of reduced underwater image quality in traditional SLAM algorithms is solved, achieving high-precision AR navigation and positioning for divers and scene perception.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA STATE SHIPBUILDING CORP LTD RESEARCH INSTITUTE 719
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional visual SLAM algorithms struggle to effectively separate light scattering components in underwater environments, resulting in reduced image quality and hindering accurate localization and map building.
Volume scattering noise is filtered out by coherent scattering component separation logic, semantic labels are extracted using deep neural networks and inertial measurement data are used to correct pose, a semantic descriptor based on elastic topological constraints is constructed, loop closure detection and global optimization are performed, and a three-dimensional structured semantic map is generated.
It improves the stability of underwater image feature extraction and matching, and achieves high-precision AR navigation positioning and scene perception for divers.
Figure CN122244348A_ABST
Description
Technical Field
[0001] This invention belongs to the field of AR navigation technology, and specifically relates to a semantic SLAM construction method and system for diver AR navigation.
Background Technology
[0002] Augmented reality (AR) technology overlays virtual information onto the real-world view, giving divers a powerful aid for complex tasks such as navigation, search, and maintenance in underwater environments with low visibility and poor orientation cues. High-precision AR navigation requires that the diver's AR helmet accurately perceive its own position and attitude in three-dimensional space in real time while simultaneously constructing a three-dimensional map of the surrounding environment; this technology is called Simultaneous Localization and Mapping (SLAM). Visual SLAM (V-SLAM) has become the mainstream SLAM technology because its sensors are low-cost, compact, and information-rich. However, water is not an optically homogeneous medium; the numerous suspended microparticles within it scatter light strongly. Underwater there is not only surface scattering, where light reflects off the target object and enters the sensor, but also volume scattering, where light is scattered multiple times by suspended particles and enters the sensor directly without reaching the object's surface. Volume scattering appears as a layer of blurred background noise in the image and contains no useful structural information. Traditional V-SLAM front-end algorithms are designed for clear images in optically homogeneous media and cannot physically distinguish between these two scattering components. Therefore, when traditional algorithms process underwater images, they extract a large amount of volume scattering noise as spurious or unstable feature points, while the useful surface scattering signal is swamped by volume scattering and is difficult to extract stably. Ultimately, the inability to separate the scattering components significantly degrades the quality of the input image, making it difficult for the AR helmet to localize accurately and build maps.
Summary of the Invention
[0003] This invention provides a semantic SLAM construction method and system for AR navigation for divers to solve the above-mentioned technical problems.
[0004] In a first aspect, the present invention provides a semantic SLAM construction method for diver AR navigation, the method comprising the following steps:
The raw underwater image stream and inertial measurement data are collected by the diver's AR helmet; using coherent scattering component separation logic, and according to the texture coherence coefficient of the image pixels in the raw image stream, the volume scattering component is separated from the surface scattering component and filtered out to obtain enhanced structured image features;
From the structured image features, the semantic labels of underwater targets and the corresponding initial feature point cloud are extracted by a preset deep neural network, and the initial motion pose of the diver is calculated by combining the pre-integration results of the inertial measurement data;
Using the standard geometric structure corresponding to the semantic label as a reference, the displacement tensor of the initial feature point cloud relative to the standard geometric structure is calculated, and the spatial position of the initial feature point cloud is corrected based on the displacement tensor to obtain the corrected feature point cloud;
The geometric connectivity between feature point clouds is extracted, semantic descriptors based on elastic topological constraints are constructed, and induced fit logic is used to match semantic descriptors from different time sequences to trigger loop closure detection constraints;
A global factor graph is established, which includes the pose nodes corresponding to the initial motion pose and the semantic topological constraint factors corresponding to the loop closure detection constraints; based on the global factor graph, the diver's historical trajectory and the global spatial coordinates of the underwater targets are iteratively optimized by minimizing the global reprojection error and the topological consistency error;
The fluctuation entropy values of the semantic feature points in the optimized and corrected feature point cloud are monitored over the spatiotemporal sequence, stable semantic features with fluctuation entropy values lower than the preset entropy threshold are aggregated through incremental voxel filtering, and a three-dimensional structured semantic map is generated on the AR helmet display interface.
[0005] Optionally, the step of using coherent scattering component separation logic and filtering out the volume scattering component from the surface scattering component in the original image stream based on the texture coherence coefficient of the image pixels in the original image stream to obtain the enhanced structured image features includes the following steps: Define a sliding analysis window in the original image stream, traverse each pixel region in the original image stream, and construct a feature covariance matrix representing the brightness gradient distribution of the pixel region within the sliding analysis window; Eigenvalue decomposition is performed on the feature covariance matrix to extract the feature value distribution of pixel regions that reflect the local directionality of the image; The texture coherence coefficient of the pixel region is calculated based on the feature value distribution to quantify the degree of anisotropy of the pixel region; Pixel regions with texture coherence coefficients below a preset coherence threshold are identified as volume scattering components and suppressed, while surface scattering components with texture coherence coefficients above the preset coherence threshold are retained, resulting in enhanced structured image features.
[0006] Optionally, the step of calculating the texture coherence coefficient of the pixel region based on the feature value distribution to quantify the anisotropy of the pixel region includes the following steps: Extract the maximum and minimum eigenvalues from the feature covariance matrix, calculate the absolute value of the difference between the maximum and minimum eigenvalues as the directional intensity of the pixel region, and calculate the sum of the maximum and minimum eigenvalues as the total energy level of the pixel region; The ratio of directional intensity to total energy level is defined as the texture coherence coefficient of a pixel region. The texture coherence coefficient of the image to which the pixel region belongs is normalized to generate a coherence weight map for distinguishing between suspended particle noise and underwater targets.
[0007] Optionally, the step of calculating the displacement tensor of the initial feature point cloud relative to the standard geometric structure with reference to the semantic label, and performing spatial position correction on the initial feature point cloud based on the displacement tensor to obtain the corrected feature point cloud includes the following steps: Based on semantic tags, the corresponding standard geometric structure is retrieved from the pre-set industrial structural component database, and the initial feature point cloud is aligned with the standard geometric structure in centroid alignment to identify the observation deviation of each feature point in the initial feature point cloud relative to the standard geometric structure. An elastic stress energy equation is constructed with the objective function of minimizing observation bias, and the initial feature point cloud is regarded as an elastic body affected by the refractive stress of water. By solving the elastic stress energy equation, a continuous displacement tensor field covering the spatial region where the initial feature point cloud is located is generated. The three-dimensional coordinates of each feature point in the initial feature point cloud are inversely mapped using the displacement tensor field to obtain the corrected feature point cloud.
[0008] Optionally, the step of inversely mapping the three-dimensional coordinates of each feature point in the initial feature point cloud using the displacement tensor field to obtain the corrected feature point cloud includes the following steps: The displacement tensor field is decomposed into radial and tangential components affected by the underwater refractive index, and a non-rigid transformation model based on thin plate spline functions is established. The displacement tensor field is used as the control parameter of the non-rigid transformation model. A spatial coordinate transformation based on a non-rigid transformation model is performed on the initial feature point cloud to counteract the radial contraction and tangential distortion caused by refraction, resulting in a corrected feature point cloud. The overlap score between the corrected feature point cloud and the standard geometric structure is calculated. If the overlap score is lower than the preset overlap threshold, the displacement tensor field is iteratively corrected until the overlap score reaches the preset overlap threshold.
[0009] Optionally, the step of extracting the geometric connectivity between feature point clouds and constructing semantic descriptors based on elastic topological constraints, and using induced fit logic to match semantic descriptors under different time series to trigger loop closure detection constraints, includes the following steps: Record the Euclidean distances and relative angles between every pair of feature points within the feature point cloud to construct a topological correlation matrix; The semantic labels, topological association matrix, and local texture information of feature points are fused to generate a semantic descriptor with deformation tolerance. Retrieve pre-stored historical semantic descriptors in the global semantic map and calculate the similarity between the current semantic descriptor and the historical semantic descriptors; The elasticity coefficient of the semantic descriptor at the current moment is adjusted by induced fitting logic to simulate the adaptive deformation of the semantic descriptor in space; When the similarity after adaptation and deformation exceeds the preset loop closure threshold, the current observation area is determined to be a historically traversed area, and loop closure detection constraints are triggered.
[0010] Optionally, the step of adjusting the elasticity coefficient of the semantic descriptor at the current moment using induced fit logic to simulate the adaptive deformation of the semantic descriptor in space includes the following steps: The topological correlation matrix in the semantic descriptor is defined as a topological structure model, and the geometric difference vector between the current semantic descriptor and the historical semantic descriptor is calculated; Local deformation weights are assigned to each node in the topological structure model based on the geometric difference vector; While topological connectivity is maintained, the elastic contraction or elastic expansion of the semantic descriptor is achieved by minimizing the structural internal energy caused by the local deformation weights.
[0011] Optionally, the step of monitoring the fluctuation entropy values of the semantic feature points in the optimized and corrected feature point cloud over the spatiotemporal sequence, aggregating stable semantic features with fluctuation entropy values lower than a preset entropy threshold through incremental voxel filtering, and generating a three-dimensional structured semantic map on the AR helmet display interface includes the following steps: A spatiotemporal observation window is established for each semantic feature point in the optimized and corrected feature point cloud, and the spatial position changes of the semantic feature point across multiple consecutive image frames are recorded through the window; The spatial distribution variance of each semantic feature point within the spatiotemporal observation window is computed, and the fluctuation entropy value, which characterizes the stability of the semantic feature point, is calculated by combining the spatial position change and the spatial distribution variance; Semantic feature points whose fluctuation entropy values are higher than a preset stability threshold are identified as interfering features and removed, and stable semantic features with fluctuation entropy values below the stability threshold are projected into a global voxel grid; Probabilistic label voting is performed within the global voxel grid, and the observation weights of stable semantic features within the same voxel are accumulated; The voxel occupancy status is updated based on the accumulated observation weights, a three-dimensional structured semantic map is constructed, and the map is rendered on the AR helmet display interface.
[0012] In a second aspect, the present invention also provides a semantic SLAM construction system for diver AR navigation, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the semantic SLAM construction method for diver AR navigation as described in any one of the first aspects.
[0013] In a third aspect, the present invention also provides a computer-readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the semantic SLAM construction method for diver AR navigation according to any one of the first aspects.
[0014] The beneficial effects of this invention are as follows. By applying coherent scattering component separation logic, this invention filters out the volume scattering interference caused by suspended particles in water at the visual front end, improving the quality and stability of the structured image features required for subsequent feature extraction and matching. On the one hand, using the standard geometric structure of the target as prior knowledge, the spatial position of the point cloud is corrected at the physical level by calculating and applying the displacement tensor. On the other hand, a semantic descriptor based on elastic topological constraints is constructed, and induced fit logic is used for matching to trigger loop closure detection; because this loop closure detection mechanism relies on high-level semantics and topological structure, it adapts better to environmental change. Finally, by jointly optimizing the reprojection error and the topological consistency error in the global factor graph, and by using fluctuation entropy values to screen feature points for stability before mapping, this invention transforms the visually blurred and dynamically changing underwater environment into a three-dimensional structured map with clear semantic labels, providing high-precision positioning and scene perception for diver AR navigation.
Attached Figure Description
[0015] Figure 1 is a flowchart illustrating the semantic SLAM construction method for diver AR navigation in one embodiment of this application.
[0016] Figure 2 is a schematic diagram of the network topology of the deep neural network in one embodiment of this application.
Detailed Implementation
[0017] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.
[0018] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that embodiments of this application can be implemented in orders other than those illustrated or described herein. The objects distinguished by "first," "second," etc., are generally of the same class, and the number of objects is not limited; for example, a first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0019] Figure 1 is a flowchart illustrating a semantic SLAM construction method for diver AR navigation in one embodiment. It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some steps in Figure 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and they are not necessarily executed sequentially, but can be executed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps. As shown in Figure 1, the semantic SLAM construction method for diver AR navigation disclosed in this invention specifically includes the following steps:
S101. The raw underwater image stream and inertial measurement data are collected through the diver's AR helmet. Using coherent scattering component separation logic, and according to the texture coherence coefficient of the image pixels in the raw image stream, the volume scattering component is separated from the surface scattering component and filtered out to obtain enhanced structured image features.
[0020] One of the core challenges in underwater optical imaging lies in the complex scattering that occurs when light propagates in water. This scattering can be decomposed into two physical mechanisms: surface scattering and volume scattering. Surface scattering originates from the reflection of light on the surface of underwater solid targets and carries the target's true geometric and texture information; volume scattering is caused by the diffuse reflection of light by media such as suspended particles and plankton, and appears as diffuse hazy noise in the image. The binocular camera on the AR helmet acquires the raw image stream at a frequency of 30 Hz, while the acceleration and angular velocity data output by the six-axis inertial measurement unit are recorded simultaneously. The coherent scattering component separation logic is based on structure tensor analysis: a sliding analysis window is constructed in the neighborhood of each pixel, and the covariance matrix of the brightness gradient within the window is calculated as $M = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}$, where $I_x$ and $I_y$ represent the gradient components in the horizontal and vertical directions, respectively. Eigenvalue decomposition of the covariance matrix yields the eigenvalues $\lambda_1$ and $\lambda_2$ ($\lambda_1 \geq \lambda_2$). The texture coherence coefficient is defined as $C = \frac{|\lambda_1 - \lambda_2|}{\lambda_1 + \lambda_2 + \epsilon}$, where $\epsilon$ is a numerical stability term. This coefficient quantifies the degree of anisotropy of the pixel region: surface scattering regions exhibit high coherence because they contain well-defined edge directions, whereas volume scattering regions exhibit low coherence because their gradients are randomly oriented. Low-coherence regions are labeled as a volume scattering mask through adaptive threshold segmentation. A guided filter is then used to suppress the brightness contribution of the masked regions while maintaining edge sharpness, ultimately outputting enhanced image features that retain clear structural information.
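The coherence computation described for S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the window radius `r`, the value of `eps`, the box-sum helper, and the function names are all assumptions, and the adaptive thresholding and guided filtering stages are omitted.

```python
import numpy as np

def box_sum(a, r):
    """Sum of `a` over a (2r+1)x(2r+1) window centred on each pixel (edge-padded)."""
    h, w = a.shape
    padded = np.pad(a, r, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out

def coherence_map(img, r=2, eps=1e-6):
    """Texture coherence |l1 - l2| / (l1 + l2 + eps) from the windowed gradient covariance."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)            # vertical and horizontal brightness gradients
    jxx = box_sum(gx * gx, r)
    jyy = box_sum(gy * gy, r)
    jxy = box_sum(gx * gy, r)
    # Closed-form eigenvalues of the symmetric 2x2 matrix [[jxx, jxy], [jxy, jyy]].
    half_trace = 0.5 * (jxx + jyy)
    root = np.sqrt(0.25 * (jxx - jyy) ** 2 + jxy ** 2)
    lam_max = half_trace + root
    lam_min = half_trace - root
    return np.abs(lam_max - lam_min) / (lam_max + lam_min + eps)

# A sharp vertical edge is strongly anisotropic; a flat region has no gradient energy.
edge = np.zeros((16, 16))
edge[:, 8:] = 1.0
c_edge = coherence_map(edge)
```

On the step-edge image the coefficient approaches 1 along the edge and is 0 in the flat regions, which is the behaviour the text relies on to separate structured surface returns from unstructured haze.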
[0021] S102. In the structured image features, the semantic labels of the underwater target and the corresponding initial feature point cloud are extracted based on the preset deep neural network, and the initial motion pose of the diver is calculated by combining the pre-integration results of the inertial measurement data.
[0022] In this process, the enhanced structured image features are input into a pre-trained deep neural network for semantic segmentation and feature extraction. Referring to Figure 2, the network employs an improved DeepLabV3+ architecture optimized for underwater industrial scenarios. The encoder uses ResNet-101 as its backbone, comprising four residual module groups, each of which mitigates the vanishing gradient problem in deep networks through skip connections. The input image first passes through an initial convolutional layer with a kernel size of $7 \times 7$ and a stride of 2, which outputs a 64-channel feature map; a max-pooling layer then reduces the spatial resolution to one-quarter of the original. The first residual module group contains three residual blocks, each consisting of three convolutional layers in a $1 \times 1$, $3 \times 3$, $1 \times 1$ bottleneck structure, and outputs 256-channel features. The second to fourth residual module groups output 512-, 1024-, and 2048-channel multi-scale features, respectively, with the spatial resolution progressively reduced to 1/32 of the original image. An atrous (dilated) spatial pyramid pooling module is placed at the top of the encoder. This module performs four dilated convolutions with different dilation rates in parallel, with rates set to 6, 12, and 18 and a kernel size of $3 \times 3$, and additionally includes a global average pooling branch to capture the global context. The output features of the five branches are concatenated along the channel dimension and then fused by a $1 \times 1$ convolutional layer into 256-channel high-level semantic features. The decoder employs a progressive upsampling strategy: the high-level features are first upsampled by a factor of four using bilinear interpolation and then concatenated, via a cross-layer connection, with the low-level features from the encoder's first residual module group. Before concatenation, the low-level features are reduced to 48 channels by a $1 \times 1$ convolution so that they do not dominate the fused representation. The fused features are refined by two $3 \times 3$ convolutional layers and then upsampled by a factor of four to restore the original image resolution. Finally, a $1 \times 1$ convolutional layer outputs a 12-channel category probability map, with each channel corresponding to one semantic category, including industrial structural components such as pipes, valves, and flanges; the map is normalized to a probability distribution using a softmax activation function. The network is pre-trained on a dataset containing 50,000 labeled underwater images, using a cross-entropy loss function and a stochastic gradient descent optimizer with a momentum of 0.9. The initial learning rate is set to 0.007 with a polynomial decay strategy. During inference, the structured image features are input into the network, and forward propagation extracts abstract semantic representations layer by layer: the encoder captures hierarchical features from edge textures to object contours, the atrous spatial pyramid pooling module aggregates multi-scale contextual information, and the decoder recovers spatial details through upsampling and cross-layer fusion. The final output probability map is classified pixel by pixel to obtain a semantic label map, in which each pixel is assigned the category label with the highest probability.
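Two small pieces of the training and inference description lend themselves to a short sketch: the decaying learning-rate schedule and the per-pixel label assignment from the 12-channel output. The decay exponent `power=0.9` and `total_steps` are assumptions (the text gives only the initial rate of 0.007), and the function names are illustrative.

```python
import numpy as np

def poly_lr(step, total_steps, base_lr=0.007, power=0.9):
    """Polynomial learning-rate decay; `power` is an assumed value, not from the text."""
    return base_lr * (1.0 - step / total_steps) ** power

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def label_map(logits):
    """logits: (num_classes, H, W) array -> (H, W) map of most probable class indices."""
    return softmax(logits, axis=0).argmax(axis=0)

# Toy 12-class output on a 2x2 image: class 3 dominates the top-left pixel.
logits = np.zeros((12, 2, 2))
logits[3, 0, 0] = 5.0
labels = label_map(logits)
```

Since softmax is monotonic, taking the argmax of the raw logits would give the same labels; the normalization matters only when the probabilities themselves are used downstream, for example as voting weights.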
[0023] Simultaneously, corner features are detected in the enhanced image using the ORB feature extraction algorithm. The algorithm first constructs an image pyramid with 8 scale levels and a scale factor of 1.2. At each level, a FAST corner detector identifies pixels with sharp grayscale changes, and non-maximum suppression retains the corners with the strongest responses. A BRIEF binary descriptor is calculated for each detected corner: 256 pairs of pixels are randomly sampled within the pixel neighborhood around the corner, and the grayscale values of each pair are compared to generate a 256-bit binary string. Binocular stereo matching is used to calculate the depth of the feature points. The ORB features of the left and right images are matched using Hamming distance, matched point pairs must satisfy the epipolar constraint, and depth is calculated from the disparity $d$, the camera baseline $b$, and the focal length $f$ as $Z = \frac{f\,b}{d}$. The 2D pixel coordinates $(u, v)$ are combined with the depth $Z$ and back-projected into three-dimensional space using the pinhole camera model: $X = \frac{(u - c_x)\,Z}{f_x}$, $Y = \frac{(v - c_y)\,Z}{f_y}$, where $(c_x, c_y)$ are the principal point coordinates and $f_x$, $f_y$ are the focal length parameters. This generates the initial feature point cloud, in which each feature point carries its corresponding semantic label and visual descriptor.
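The matching and triangulation arithmetic in this paragraph can be illustrated with a few lines of plain Python. The camera parameters in the usage lines are made-up values chosen for easy arithmetic, and the function names are illustrative; descriptors are represented as Python integers standing in for 256-bit BRIEF strings.

```python
def hamming_distance(desc_a, desc_b):
    """Number of differing bits between two binary descriptors stored as integers."""
    return bin(desc_a ^ desc_b).count("1")

def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth from a rectified stereo pair: Z = f * b / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a valid match")
    return focal_px * baseline_m / disparity_px

def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical camera: 800 px focal length, 0.1 m baseline, principal point (320, 240).
z = stereo_depth(disparity_px=20.0, focal_px=800.0, baseline_m=0.1)
point = back_project(420.0, 240.0, z, fx=800.0, fy=800.0, cx=320.0, cy=240.0)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points than for near ones, which is one reason the later steps correct and screen the point cloud before mapping.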
[0024] The pre-integration module for inertial measurement data uses the midpoint (median) integration method to process the acceleration $a$ and angular velocity $\omega$ output by the six-axis IMU, sampled at 200 Hz. Between adjacent keyframes $i$ and $j$, whose time interval is typically 0.1-0.3 seconds, the relative motion is accumulated through discrete integration. The rotation increment is represented using quaternions to avoid gimbal lock; within a time step $\Delta t$, the rotation quaternion corresponding to the angular velocity is $\delta q = \left[\cos\frac{\|\omega\|\Delta t}{2},\ \frac{\omega}{\|\omega\|}\sin\frac{\|\omega\|\Delta t}{2}\right]$, and the total rotation increment $\Delta q_{ij}$ is obtained by accumulating these through quaternion multiplication. The velocity increment is obtained by integrating the acceleration, which requires compensation for the effects of gravity and rotation: $\Delta v_{ij} = \sum_k \left(R_k (a_k - b_a) + g\right)\Delta t$, where $R_k$ is the current rotation matrix, $b_a$ is the accelerometer bias, and $g$ is the gravitational acceleration vector. The displacement increment is obtained by integrating the velocity a second time: $\Delta p_{ij} = \sum_k \bar{v}_k \Delta t$, where $\bar{v}_k$ is the velocity state at the intermediate time point of step $k$. The pre-integration result comprises the relative rotation $\Delta q_{ij}$, the relative velocity $\Delta v_{ij}$, and the relative displacement $\Delta p_{ij}$, whose uncertainty is described by the corresponding covariance matrix. The pre-integration results are fused with the inter-frame matching results of visual odometry, which estimates the relative pose by minimizing the reprojection error. An extended Kalman filter framework fuses the two measurements: the prediction step propagates the state using the IMU pre-integration, $\hat{x}_{k|k-1} = f(\hat{x}_{k-1}, \Delta q_{ij}, \Delta v_{ij}, \Delta p_{ij})$, and the update step corrects the state using the visual observation, $\hat{x}_k = \hat{x}_{k|k-1} + K\,(z - h(\hat{x}_{k|k-1}))$, where $K$ is the Kalman gain, $z$ is the visual measurement, and $h$ is the observation model. The fused result is the diver's initial six-DOF motion pose $T = (R, t)$, in which the rotation matrix $R$ describes the attitude and the translation vector $t$ describes the position.
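A simplified sketch of the midpoint integration between two keyframes is given below. It is illustrative only: it assumes a gravity vector expressed in the start frame with the sign convention `a_world = R (a - bias) + gravity`, it ignores gyroscope bias and covariance propagation, and all function names are hypothetical.

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions in [w, x, y, z] order."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def delta_quat(omega, dt):
    """Rotation quaternion for angular velocity `omega` (rad/s) applied over `dt`."""
    rate = np.linalg.norm(omega)
    if rate * dt < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    half = 0.5 * rate * dt
    return np.concatenate(([np.cos(half)], omega / rate * np.sin(half)))

def quat_to_rot(q):
    """Rotation matrix of a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def preintegrate(gyro, accel, dt, accel_bias, gravity):
    """Midpoint integration of IMU samples; returns (rotation quaternion, dv, dp)."""
    q = np.array([1.0, 0.0, 0.0, 0.0])
    v = np.zeros(3)
    p = np.zeros(3)
    for k in range(len(gyro) - 1):
        q_next = quat_mul(q, delta_quat(0.5 * (gyro[k] + gyro[k + 1]), dt))
        a0 = quat_to_rot(q) @ (accel[k] - accel_bias) + gravity
        a1 = quat_to_rot(q_next) @ (accel[k + 1] - accel_bias) + gravity
        a_mid = 0.5 * (a0 + a1)
        p = p + v * dt + 0.5 * a_mid * dt * dt   # uses velocity before it is updated
        v = v + a_mid * dt
        q = q_next / np.linalg.norm(q_next)      # renormalise against drift
    return q, v, p

# One second of 200 Hz data: constant yaw rate of pi/2 rad/s, no net acceleration.
n, dt = 201, 1.0 / 200.0
g_vec = np.array([0.0, 0.0, -9.81])
gyro = np.tile([0.0, 0.0, np.pi / 2], (n, 1))
accel = np.tile([0.0, 0.0, 9.81], (n, 1))       # specific force cancels gravity
q_end, dv, dp = preintegrate(gyro, accel, dt, np.zeros(3), g_vec)
```

After one second the accumulated quaternion corresponds to a 90 degree yaw while the velocity and displacement increments stay at zero, since the measured specific force exactly cancels gravity in this toy input.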
[0025] S103. Using the standard geometric structure corresponding to the semantic label as a reference, calculate the displacement tensor of the initial feature point cloud relative to the standard geometric structure, and perform spatial position correction on the initial feature point cloud based on the displacement tensor to obtain the corrected feature point cloud.
[0026] Underwater refraction bends the light path nonlinearly, so the three-dimensional point cloud reconstructed with a pinhole camera model is systematically distorted, specifically by radial shrinkage and tangential twisting. For targets with identified semantic labels, the corresponding CAD standard geometric model, which stores the ideal geometry as a triangular mesh, is retrieved from a pre-built industrial structural component database. The iterative closest point (ICP) algorithm is then executed to rigidly register the initial feature point cloud with the standard model. After centroid alignment and principal axis rotation correction are completed, the observation bias vector $d_i = p_i - q_i$ of each feature point $p_i$ relative to its nearest model surface point $q_i$ is calculated. Treating the point cloud as a continuous elastic body, an elastic stress energy equation is constructed: $E(u) = \sum_i \|u(p_i) - d_i\|^2 + \lambda \sum_{(i,j) \in \mathcal{N}} \|u(p_i) - u(p_j)\|^2$, where $u$ is the displacement field function to be solved, $\mathcal{N}$ represents the neighborhood relations, and the regularization coefficient $\lambda$ controls spatial smoothness. The energy functional is discretized using the finite element method: the space is divided into tetrahedral mesh elements, and within each element the displacement field is interpolated using linear basis functions. The total energy is iteratively minimized using the conjugate gradient method, yielding upon convergence a continuous displacement tensor field covering the entire point cloud space. Inverse mapping is then performed on each point in the initial point cloud, $p_i' = p_i - u(p_i)$, to counteract the distortion caused by refraction and obtain the corrected feature point cloud $\{p_i'\}$.
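The idea of fitting a smooth displacement field to per-point observation biases and then mapping the points back can be sketched on a small scale. This sketch deliberately swaps in simpler machinery than the text describes: a k-nearest-neighbour graph replaces the tetrahedral finite element mesh, and a direct linear solve replaces conjugate gradient. It assumes the bias is defined as observed point minus model point, and `lam`, `k`, and the function names are illustrative.

```python
import numpy as np

def knn_edges(points, k):
    """Undirected edges linking each point to its k nearest neighbours."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    edges = set()
    for i in range(len(points)):
        for j in np.argsort(d2[i])[1:k + 1]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

def solve_displacement(points, bias, lam=1.0, k=3):
    """Minimise sum ||u_i - d_i||^2 + lam * sum_edges ||u_i - u_j||^2 over u.

    Setting the gradient to zero gives the linear system (I + lam * L) U = D,
    where L is the graph Laplacian of the neighbourhood graph.
    """
    n = len(points)
    lap = np.zeros((n, n))
    for i, j in knn_edges(points, k):
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0
    return np.linalg.solve(np.eye(n) + lam * lap, bias)

def correct_points(points, bias, lam=1.0, k=3):
    """Inverse-map each point by subtracting the smoothed displacement."""
    return points - solve_displacement(points, bias, lam, k)

# A uniform bias is already perfectly smooth, so it should be removed exactly.
rng = np.random.default_rng(0)
model = rng.random((10, 3))
observed = model + 0.1
restored = correct_points(observed, observed - model)
```

The smoothness term only penalises differences between neighbouring displacements, so a constant offset passes through unchanged while isolated, inconsistent biases are pulled toward their neighbours. That is the regularising effect attributed to the coefficient in the energy equation.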
[0027] S104. Extract the geometric connection relationships between feature point clouds and construct semantic descriptors based on elastic topological constraints. Use induced fitting logic to match semantic descriptors under different time series to trigger loop closure detection constraints.
[0028] For the corrected feature point cloud, the Euclidean distance $d_{ij}$ between any two points $p_i$ and $p_j$ within the point cloud and their angle $\theta_{ij}$ relative to the centroid of the point cloud are calculated to construct a topological correlation matrix $T$, whose elements $T_{ij} = (d_{ij}, \theta_{ij})$ encode the geometric connectivity relationships. The HOG texture descriptor of the pixel neighborhood around each feature point is extracted as local appearance information. The one-hot encoded vector of the semantic label, the eigenvector of the topological correlation matrix, and the texture descriptor are concatenated to form a 384-dimensional semantic descriptor. To cope with changes in observation perspective and partial occlusion, an elastic deformation mechanism is introduced: a tolerance interval is defined for each side length in the topological correlation matrix around its initial side length $d_{ij}^{0}$. During the matching phase, a subset of the historical descriptors stored in the global semantic map is retrieved, and the cosine similarity between the current descriptor $D_c$ and each historical descriptor $D_h$ is calculated as $s = \frac{D_c \cdot D_h}{\|D_c\|\,\|D_h\|}$. The topology matrix is then adjusted using induced fit logic: deformation weights $w_{ij}$ are assigned to each edge based on the similarity gradient, and the structural internal energy weighted by $w_{ij}$ is minimized to achieve adaptive elastic shrinking or expansion of the descriptor. When the similarity after adjustment exceeds the preset loop closure threshold, loop closure detection is triggered, confirming that the current location is a previously visited area.
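The geometric part of the descriptor can be sketched as follows. This is a hedged illustration: it builds the pairwise distance and centroid-angle matrices described above, but the compact signature used for comparison (sorted eigenvalues of the distance matrix) is an assumption standing in for the 384-dimensional descriptor, and the semantic one-hot, HOG, and elastic adjustment parts are omitted.

```python
import numpy as np

def topo_matrices(points):
    """Pairwise Euclidean distances and pairwise angles subtended at the centroid."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    rel = points - points.mean(axis=0)
    norms = np.maximum(np.linalg.norm(rel, axis=1, keepdims=True), 1e-12)
    unit = rel / norms
    angle = np.arccos(np.clip(unit @ unit.T, -1.0, 1.0))
    return dist, angle

def topo_signature(points):
    """Assumed compact signature: sorted eigenvalues of the symmetric distance matrix."""
    dist, _ = topo_matrices(points)
    return np.sort(np.linalg.eigvalsh(dist))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same four points seen again after a 90 degree yaw and a translation.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = pts @ rot_z.T + np.array([5.0, -2.0, 1.0])
similarity = cosine_similarity(topo_signature(pts), topo_signature(moved))
```

Both matrices depend only on relative geometry, so they are unchanged by the diver's own rotation and translation. This is what makes a topology-based descriptor usable for recognising a revisited structure from a different viewpoint.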
[0029] S105. Establish a global factor graph containing pose nodes corresponding to the initial motion pose and semantic topological constraint factors corresponding to loop closure detection constraints. Based on the global factor graph, iteratively optimize the diver's historical trajectory and the global spatial coordinates of the underwater target by minimizing the global reprojection error and topological consistency error.
[0030] The constraints provided by loop closure detection connect scattered local trajectory segments into a globally consistent topological network. Specifically, a factor graph model is established. The node set includes the pose nodes of the diver at keyframe moments and the three-dimensional coordinate nodes of underwater targets. The edge set includes three types of constraint factors: the visual reprojection factor measures the projection error of feature points on the image plane, that is, the difference between the observed pixel coordinates and the output of the camera projection function $\pi(\cdot)$; the inertial pre-integration factor constrains the relative motion between adjacent poses, using the inverse operation on the Lie group; and the semantic topological factor is triggered by loop closure detection and constrains non-adjacent pose nodes. The overall error function is constructed from these three factor types, in which a Huber robust kernel function suppresses the influence of outliers, each factor is weighted by its covariance matrix, and a separate weight is applied to the semantic constraints. The Levenberg-Marquardt algorithm is used for iterative optimization: in each iteration the Jacobian matrix $J$ is calculated, the Hessian matrix is approximated as $H \approx J^{\top} J$, and the incremental equation $(H + \mu I)\,\Delta x = -J^{\top} r$ is solved to update the state variables, where $r$ is the stacked weighted residual vector and $\mu$ is the damping factor. After the optimization converges, a globally consistent diver trajectory and target coordinates are obtained.
[0031] S106. Monitor the fluctuation entropy values of semantic feature points in the spatiotemporal sequence of the optimized and corrected feature point cloud, aggregate stable semantic features with fluctuation entropy values lower than the preset entropy threshold through incremental voxel filtering, and generate a three-dimensional structured semantic map on the AR helmet display interface.
[0032] Dynamic water flow disturbances and measurement noise cause the positions of feature points to fluctuate over time, so stable features must be selected to construct a reliable map. For each optimized semantic feature point, a spatiotemporal observation window of fixed time span is established, and the spatial coordinate sequence of the feature point over the consecutive frames within the window is recorded. The spatial distribution variance of the coordinate sequence is calculated about its mean coordinates. The fluctuation entropy value is defined as $H = -\sum_{k=1}^{K} p_k \log p_k$, where $p_k$ is the frequency with which the coordinates fall into the $k$-th spatial bin and $K$ is the number of bins used to divide the three-dimensional space. The fluctuation entropy value quantifies the randomness of a feature point: stable target points have low entropy, while dynamic interference points have high entropy. After high-entropy features are removed, the stable semantic features are projected onto a global voxel grid with a resolution of [resolution value missing]. Within each voxel, probabilistic label voting is performed: the number of observations belonging to each semantic category is accumulated and the occupancy probability is calculated. When the occupancy probability exceeds the set threshold, the voxel is identified as occupied and assigned its dominant semantic label. A truncated signed distance function (TSDF) is used to fuse observations from multiple frames, updating the TSDF value of each voxel with a weight that is negatively correlated with the observation distance. Isosurfaces are extracted from the TSDF voxel grid using the Marching Cubes algorithm to generate a semantically labeled triangular mesh map. The mesh model is rendered as a semi-transparent overlay on the AR helmet display interface, superimposed on the real-time camera video stream, providing the diver with augmented reality navigation information.
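The entropy-based stability test described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the cubic binning by `bin_size`, the natural logarithm, and the threshold value used in the usage note are all hypothetical choices, since the text does not reproduce them.

```python
import math
from collections import Counter

def fluctuation_entropy(points, bin_size):
    """Shannon entropy (in nats) of the spatial-bin occupancy of one feature track.

    points   : list of (x, y, z) positions of the same feature over the window
    bin_size : edge length of the cubic bins (assumed binning scheme)
    """
    bins = Counter(tuple(math.floor(c / bin_size) for c in p) for p in points)
    n = len(points)
    return -sum((k / n) * math.log(k / n) for k in bins.values())

def is_stable(points, bin_size, threshold):
    """A track counts as stable when its entropy is below the preset threshold."""
    return fluctuation_entropy(points, bin_size) < threshold
```

For example, a track that stays inside a single bin has zero entropy and passes `is_stable(track, 0.1, 0.5)`, while a track that visits four different bins equally often has entropy log 4 and is rejected.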
[0033] In one embodiment, the enhanced structured image features are obtained by using coherent scattering component separation logic and filtering out the volume scattering component from the surface scattering component in the original image stream based on the texture coherence coefficient of the image pixels in the original image stream, including the following steps: Define a sliding analysis window in the original image stream, traverse each pixel region in the original image stream, and construct a feature covariance matrix representing the brightness gradient distribution of the pixel region within the sliding analysis window; Eigenvalue decomposition is performed on the feature covariance matrix to extract the feature value distribution of pixel regions that reflect the local directionality of the image; The texture coherence coefficient of the pixel region is calculated based on the feature value distribution to quantify the degree of anisotropy of the pixel region; Pixel regions with texture coherence coefficients below a preset coherence threshold are identified as volume scattering components and suppressed, while surface scattering components with texture coherence coefficients above the preset coherence threshold are retained, resulting in enhanced structured image features.
[0034] In this embodiment, a square window is constructed with each pixel coordinate $(u, v)$ as its center, with a typical window side length of [value missing] pixels, chosen to ensure a sufficient statistical sample size while avoiding crossing the target boundary. The window slides across the image rows and columns with a single-pixel step, and a mirror-fill strategy is used to extend the image edges in boundary regions. At each window location, the brightness gradient of all pixels within the window is calculated: the horizontal gradient $g_x$ is obtained by convolution with the Sobel operator, and the vertical gradient $g_y$ is obtained by convolution with the transposed operator. The gradient components reflect the direction and intensity of brightness changes. Surface scattering regions exhibit a consistent gradient direction due to sharp edges, while volume scattering regions show a chaotic gradient distribution due to random noise. The feature covariance matrix $C$, as a second-order statistic, describes the spatial distribution pattern of the gradient. Its elements are accumulated over the window: $C_{11} = \sum g_x^2$ indicates the energy in the horizontal direction, $C_{22} = \sum g_y^2$ indicates the energy in the vertical direction, and $C_{12} = C_{21} = \sum g_x g_y$ represents the coupling strength between the two directions. The covariance matrix is symmetric and positive semidefinite, which ensures the numerical stability of the subsequent eigenvalue decomposition; the matrix trace $\operatorname{tr}(C)$ reflects the total gradient energy, and the determinant $\det(C)$ reflects the degree of anisotropy of the gradient.
[0035] Next, the characteristic equation $\det(C - \lambda I) = 0$ of the feature covariance matrix $C$ is solved. Expanding it gives the quadratic equation $\lambda^2 - \operatorname{tr}(C)\,\lambda + \det(C) = 0$, and the quadratic formula yields two non-negative eigenvalues $\lambda_1$ and $\lambda_2$, with the convention $\lambda_1 \ge \lambda_2$. Physically, the eigenvalues correspond to the energy projection of the gradient distribution along the principal and secondary directions: when the image region contains straight edges, the gradient is concentrated along the edge normal, producing $\lambda_1 \gg \lambda_2$, that is, strong anisotropy; when the region has uniform texture or random noise, the gradient has no dominant direction, giving $\lambda_1 \approx \lambda_2$, that is, isotropy. The eigenvector $v_1$ points in the direction of the most dramatic gradient change, i.e., the edge normal; the eigenvector $v_2$ points in the direction of the gentlest gradient change, i.e., the edge tangent. Next, the energy distribution entropy of the eigenvalues is calculated; a lower entropy value indicates more concentrated energy. An enhancement operator is constructed to amplify the principal eigenvalue and recalibrate the covariance matrix as $C' = V \Lambda' V^{\top}$, where $V$ is the eigenvector matrix and $\Lambda'$ is the enhanced diagonal eigenvalue matrix. A second eigenvalue decomposition is performed on the recalibrated matrix $C'$ to obtain the corrected eigenvalue distribution, which enhances the distinction between structured regions and noisy regions.
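As an illustration of the eigenvalue step, the sketch below accumulates a 2×2 gradient covariance matrix over one window and solves its characteristic quadratic in closed form. The function names and the use of plain sums over the window are assumptions made for this example, not details confirmed by the text.

```python
import math

def structure_tensor(gx, gy):
    """Accumulate the symmetric 2x2 matrix entries over one window.

    gx, gy : flat lists of horizontal / vertical gradients inside the window.
    Returns (cxx, cxy, cyy) = (sum gx^2, sum gx*gy, sum gy^2).
    """
    cxx = sum(a * a for a in gx)
    cyy = sum(b * b for b in gy)
    cxy = sum(a * b for a, b in zip(gx, gy))
    return cxx, cxy, cyy

def eigenvalues_2x2(cxx, cxy, cyy):
    """Roots of lam^2 - trace*lam + det = 0, returned as (lam1, lam2), lam1 >= lam2."""
    trace = cxx + cyy
    det = cxx * cyy - cxy * cxy
    # Clamp tiny negative values caused by floating-point rounding.
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam1 = 0.5 * (trace + disc)
    lam2 = max(0.5 * (trace - disc), 0.0)
    return lam1, lam2
```

A window whose gradients all point the same way (an ideal edge) gives one zero eigenvalue, while a window with gradients spread evenly over both axes gives two equal eigenvalues.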
[0036] Next, a coherence measure is constructed based on the eigenvalue distribution. The difference between the largest and smallest eigenvalues, $\lambda_1 - \lambda_2$, is calculated as an indicator of directional strength; a larger difference indicates a higher concentration of gradients along the principal direction. The sum of the eigenvalues, $\lambda_1 + \lambda_2$, is calculated as an indicator of total energy level, which avoids false high coherence in low-contrast regions. A normalized coherence coefficient is defined as $c = (\lambda_1 - \lambda_2)^2 / \big((\lambda_1 + \lambda_2)^2 + \epsilon\big)$. Squaring the numerator enhances the weight of large differences, squaring the denominator achieves scale normalization, and the stabilizing term $\epsilon$ prevents division by zero. The coherence coefficient lies in the range $[0, 1]$: $c \to 1$ represents a perfectly linear structure, corresponding to a clear target edge; $c \to 0$ represents isotropic noise, corresponding to volume scattering by suspended particles. Statistical analysis is performed on the coherence coefficients of the entire image to calculate the global mean and standard deviation, and an adaptive threshold strategy uses these statistics to set the coherence threshold, so that the threshold is dynamically adjusted based on the image content. A coherence weight map $M$ is then generated; it visually displays the spatial distribution of structured regions and scattering noise in the image.
[0037] Each pixel position in the weight map is traversed: a pixel whose coherence is below the coherence threshold is marked as a region dominated by volume scattering, and otherwise it is marked as a region dominated by surface scattering. A guided filter is used to suppress brightness in the volume scattering regions. The guided filter combines the input image $I$ and the guidance image $G$ to calculate the output through a local linear model $q_i = a_k G_i + b_k$ for the pixels $i$ in window $\omega_k$, where $a_k$ is the linear coefficient, $b_k$ is the bias term, $\omega_k$ is a local window centered at pixel $k$, $\mu_k$ and $\sigma_k^2$ are the mean and variance of the guidance image within the window, and the regularization parameter $\epsilon$ controls the smoothing intensity. The edge-preserving property of the guided filter ensures that high-coherence edges are not blurred while low-coherence noise is suppressed. A contrast enhancement strategy is applied to the surface scattering regions: the local contrast is calculated, and a nonlinear mapping with a gain coefficient is applied to improve edge sharpness. The final fusion step generates the enhanced image, with the weight map $M$ providing a smooth transition between the two processing modes; the output image effectively suppresses water scattering noise while preserving the geometric details of the target.
[0038] In one implementation, eigenvalue decomposition of the feature covariance matrix to extract the feature value distribution of pixel regions reflecting the local directionality of the image includes the following steps: Calculate the energy distribution entropy value of each eigenvalue in the feature covariance matrix to characterize the degree of geometric disorder of the geometric structure within the pixel region; Extract the principal component eigenvalues of the feature covariance matrix and calculate the dynamic ratio of the principal component eigenvalues to the energy distribution entropy value to obtain the anisotropic intensity gain; A feature enhancement operator is constructed using anisotropic intensity gain, and the feature enhancement operator is used to perform nonlinear recalibration of the feature covariance matrix to amplify the energy proportion of the eigenvector direction corresponding to the principal component eigenvalues within the feature covariance matrix. A second eigenvalue decomposition is performed on the recalibrated feature covariance matrix to obtain the corrected eigenvalue distribution.
[0039] In this embodiment, the eigenvalues obtained after the initial eigenvalue decomposition of the feature covariance matrix are $\lambda_1$ and $\lambda_2$. While the eigenvalues reflect the distribution of gradient energy in different directions, their raw numerical values cannot directly quantify the orderliness of the energy distribution. Therefore, the concept of entropy from information theory is introduced to measure the degree of disorder in the energy distribution. Higher entropy values indicate a more uniform distribution of energy in all directions, corresponding to a disordered geometric structure; lower entropy values indicate that energy is more concentrated in a specific main direction, corresponding to clear structured features. In the specific calculation, the eigenvalues are first normalized to obtain the energy proportions: $p_1 = \lambda_1 / (\lambda_1 + \lambda_2)$ is the proportion of energy in the main direction, and $p_2 = \lambda_2 / (\lambda_1 + \lambda_2)$ is the proportion in the secondary direction. The two always sum to 1, satisfying the requirement of a probability distribution. The energy distribution entropy is calculated from the Shannon entropy formula, $H = -p_1 \log_2 p_1 - p_2 \log_2 p_2$; taking the logarithm to base 2 ensures that the entropy falls within the interval $[0, 1]$. When the two eigenvalues are equal ($\lambda_1 = \lambda_2$), the energy is completely equally distributed, giving the maximum entropy $H = 1$, which corresponds to isotropic random noise or uniform texture; when one eigenvalue is much larger than the other ($\lambda_1 \gg \lambda_2$), the energy is highly concentrated, giving the minimum entropy $H \to 0$, which corresponds to straight edges or regular geometric contours. In underwater image scenes, volume scattering caused by suspended particles manifests as high-entropy regions ($H \to 1$), whereas the surface reflection of solid targets exhibits low-entropy regions ($H \to 0$). The energy distribution entropy thus serves as an indicator of geometric disorder: regions with high disorder require strong suppression, while regions with low disorder require protection and enhancement.
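The two-eigenvalue Shannon entropy described in this paragraph can be sketched as below. The function name is hypothetical, and returning 1.0 for a window with no gradient energy at all is an assumption made here (the text does not say how that case is handled).

```python
import math

def energy_entropy(lam1, lam2):
    """Base-2 Shannon entropy of the normalized eigenvalue pair, in [0, 1]."""
    total = lam1 + lam2
    if total <= 0.0:
        # Assumption: a window with no gradient energy is treated as fully disordered.
        return 1.0
    h = 0.0
    for lam in (lam1, lam2):
        p = lam / total
        if p > 0.0:          # the term p*log2(p) tends to 0 as p tends to 0
            h -= p * math.log2(p)
    return h
```

Equal eigenvalues give the maximum value of 1, and a single non-zero eigenvalue gives 0, matching the two limiting cases discussed above.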
[0040] The principal component eigenvalue $\lambda_1$ represents the energy in the direction of the strongest gradient change within a local region of the image, which typically corresponds to the normal of the target edge. To selectively amplify structured information in subsequent processing, a dynamic gain factor is constructed to adjust the weight of the principal component energy. The design of the anisotropic intensity gain follows two principles: first, the gain should be positively correlated with the absolute magnitude of the principal component eigenvalue, so that high-contrast edges are enhanced more strongly; second, the gain should be negatively correlated with the energy distribution entropy, so that structurally ordered regions are enhanced preferentially while noisy regions are suppressed. Based on these principles, the gain is defined as $g = \lambda_1 / (H + \epsilon)$, where the stability factor $\epsilon$ prevents the gain from diverging when the entropy is excessively low. The physical meaning of this ratio is as follows: the numerator $\lambda_1$ characterizes the absolute energy of the edge strength, the denominator $H + \epsilon$ characterizes the degree of disorder of the energy distribution, and their quotient quantifies the combined effect of structure and energy level. When a pixel region contains sharp edges, $\lambda_1$ is large and $H$ is small, so the gain $g$ takes a high value and produces significant enhancement; when the region contains random noise, $\lambda_1$ is small and $H$ is large, so $g$ takes a low value and the region is naturally suppressed. The dynamic nature of the gain factor enables the algorithm to adapt to different image content, distinguishing targets from noise in complex underwater scenes without manual parameter tuning. In actual calculations, the gain value is truncated at an upper limit to prevent excessively large values from causing overflow in subsequent calculations, and a lower limit is set to ensure that basic information is retained in all areas.
[0041] The feature enhancement operator acts on the spectral space of the feature covariance matrix, selectively amplifying the energy contribution of the principal component direction through a nonlinear transformation. The covariance matrix can be expressed in spectral decomposition form as $C = V \Lambda V^{\top}$, where $V = [v_1, v_2]$ is the eigenvector matrix, $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2)$ is the diagonal eigenvalue matrix, and $V^{\top}$ is the transpose of $V$. The eigenvector $v_1$ points in the main direction and $v_2$ points in the secondary direction; together they form a local coordinate system. The recalibration operation performs nonlinear modulation on the eigenvalues within this coordinate system: multiplying the principal component eigenvalue by the gain factor yields the enhanced value $\lambda_1' = g\,\lambda_1$, while the secondary eigenvalue remains unchanged, $\lambda_2' = \lambda_2$. The enhanced eigenvalue matrix $\Lambda' = \operatorname{diag}(\lambda_1', \lambda_2')$ is constructed, and the covariance matrix is reconstructed through inverse spectral decomposition as $C' = V \Lambda' V^{\top}$. This matrix amplifies the energy proportion in the principal direction while keeping the eigenvector directions unchanged. Geometrically, recalibration is equivalent to stretching the gradient distribution ellipse along its major axis, enhancing anisotropy. The nonlinearity lies in the gain factor $g$, which changes dynamically with the local image content, so that different pixel locations receive different enhancement intensities. The recalibrated covariance matrix $C'$ has a larger condition number, $\lambda_1' / \lambda_2'$, than the original matrix. The increase in the condition number means that the structural directionality is strengthened, which allows the subsequent coherence calculation to distinguish more clearly between target edges and scattering noise.
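A minimal sketch of the gain and recalibration steps for a 2×2 matrix follows. It uses the identity that scaling only the principal eigenvalue by a gain is the same as adding a rank-one term along the principal eigenvector. All names are hypothetical, and the default `eps`, `g_min` and `g_max` values are placeholders, since the actual stability factor and truncation limits are not given in the text.

```python
import math

def anisotropic_gain(lam1, entropy, eps=1e-3, g_min=1.0, g_max=100.0):
    """Gain = principal eigenvalue / (entropy + eps), clamped to [g_min, g_max].

    eps, g_min and g_max are assumed placeholder values.
    """
    g = lam1 / (entropy + eps)
    return min(max(g, g_min), g_max)

def principal_direction(cxx, cxy, cyy, lam1):
    """Unit eigenvector of [[cxx, cxy], [cxy, cyy]] for its largest eigenvalue lam1."""
    if abs(cxy) < 1e-12:                      # matrix is already diagonal
        return (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    vx, vy = lam1 - cyy, cxy
    n = math.hypot(vx, vy)
    return vx / n, vy / n

def recalibrate(cxx, cxy, cyy, lam1, gain):
    """Return the entries of C' = C + (gain - 1) * lam1 * v1 v1^T.

    This equals V diag(gain*lam1, lam2) V^T: only the principal eigenvalue is scaled.
    """
    vx, vy = principal_direction(cxx, cxy, cyy, lam1)
    s = (gain - 1.0) * lam1
    return cxx + s * vx * vx, cxy + s * vx * vy, cyy + s * vy * vy
```

For instance, the matrix with entries (2, 1, 2) has eigenvalues 3 and 1; with a gain of 2 the recalibrated entries are (3.5, 2.5, 3.5), whose eigenvalues are 6 and 1, so the secondary eigenvalue is untouched.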
[0042] A second eigenvalue decomposition of the recalibrated covariance matrix $C'$ is required to extract the corrected geometric feature parameters. Since the recalibration operation alters the spectral structure of the matrix, the original eigenvalues $\lambda_1$ and $\lambda_2$ can no longer accurately describe the enhanced gradient distribution. The characteristic equation of $C'$ is solved in the second decomposition to obtain the new eigenvalue pair $\lambda_1'$ and $\lambda_2'$. Theoretically, since recalibration only scales the principal component while keeping the eigenvectors unchanged, the new eigenvalues should satisfy $\lambda_1' = g\,\lambda_1$ and $\lambda_2' = \lambda_2$. In actual calculations, however, floating-point errors and the matrix reconstruction process mean that a numerical solution is needed to ensure accuracy. The power iteration method is used to quickly calculate the largest eigenvalue: a random vector is initialized and then updated iteratively by multiplying it by $C'$ and renormalizing. The converged vector is the principal eigenvector, from which the corresponding eigenvalue $\lambda_1'$ is obtained. The minimum eigenvalue is calculated from the matrix trace relation $\lambda_2' = \operatorname{tr}(C') - \lambda_1'$, because the trace always equals the sum of the eigenvalues. Compared with the original distribution, the corrected eigenvalue distribution has greater separation, so the coherence coefficient constructed from the eigenvalue difference has higher discriminative ability and effectively distinguishes structured targets from scattering noise in underwater images.
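The power iteration and trace relation mentioned here can be illustrated for a symmetric 2×2 matrix as follows. The function names, iteration limits, and the fixed (rather than random) start vector are choices made for this sketch so that it is repeatable; they are not taken from the text.

```python
import math

def power_iteration_2x2(cxx, cxy, cyy, iters=100, tol=1e-12):
    """Largest eigenvalue and eigenvector of [[cxx, cxy], [cxy, cyy]] by power iteration."""
    # Fixed start vector (assumption); it must not be orthogonal to the principal eigenvector.
    vx, vy = 1.0, 0.5
    lam = 0.0
    for _ in range(iters):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        n = math.hypot(wx, wy)
        if n == 0.0:                     # zero matrix: no dominant direction
            return 0.0, (vx, vy)
        vx, vy = wx / n, wy / n
        # Rayleigh quotient v^T C v of the normalized vector.
        new_lam = vx * (cxx * vx + cxy * vy) + vy * (cxy * vx + cyy * vy)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, (vx, vy)

def second_eigenvalue(cxx, cyy, lam1):
    """Smaller eigenvalue from the trace relation lam2 = trace - lam1."""
    return max(cxx + cyy - lam1, 0.0)
```

Applied to the matrix with entries (3.5, 2.5, 3.5), the iteration converges to an eigenvalue of about 6 along the direction (1, 1)/√2, and the trace relation then gives about 1 for the second eigenvalue.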
[0043] In one embodiment, calculating the texture coherence coefficient of a pixel region based on the feature value distribution to quantify the degree of anisotropy of the pixel region includes the following steps: Extract the maximum and minimum eigenvalues from the feature covariance matrix, calculate the absolute value of the difference between the maximum and minimum eigenvalues as the directional intensity of the pixel region, and calculate the sum of the maximum and minimum eigenvalues as the total energy level of the pixel region; The ratio of directional intensity to total energy level is defined as the texture coherence coefficient of a pixel region. The texture coherence coefficient of the image to which the pixel region belongs is normalized to generate a coherence weight map for distinguishing between suspended particle noise and underwater targets.
[0044] In this embodiment, the largest eigenvalue $\lambda_{\max}$ and the smallest eigenvalue $\lambda_{\min}$ are determined by sorting the eigenvalues by magnitude. The relative relationship between the two directly reflects the geometric properties of the local structure of the image. The directional intensity, as an indicator of the bias of the gradient distribution, is obtained as the absolute value of the difference between the two eigenvalues, $D = |\lambda_{\max} - \lambda_{\min}|$. The physical meaning of this difference lies in quantifying how unevenly the gradient energy is distributed between the principal and secondary directions: when the pixel region contains a clearly defined linear structure such as a pipe edge or valve outline, the gradient is mainly distributed along the edge normal, so that $\lambda_{\max}$ is much larger than $\lambda_{\min}$ and the difference is large; when the region contains volume scattering noise caused by suspended particles, the gradient is randomly distributed in all directions, $\lambda_{\max}$ and $\lambda_{\min}$ are close, and the difference $D$ approaches zero. The total energy level characterizes the overall gradient intensity of the pixel region and is obtained as the sum of the two eigenvalues, $E = \lambda_{\max} + \lambda_{\min}$. This sum equals the trace of the feature covariance matrix, representing the sum of gradient energies in all directions, and is unaffected by rotation of the coordinate system. High-contrast target edges produce large total energy values, while low-contrast flat regions produce small total energy values. Combining directional intensity with total energy level comprehensively characterizes the structure of a pixel region: strong edges have both high directional intensity and high total energy, weak edges have medium directional intensity and medium total energy, noisy regions have low directional intensity but may have medium total energy, and flat regions have both low directional intensity and low total energy.
[0045] The texture coherence coefficient, as a comprehensive index fusing directionality and energy information, is scale-normalized using a ratio form. The coherence coefficient is defined as $c = D^2 / (E^2 + \epsilon)$, in which the numerator $D^2$ is the square of the directional intensity, the denominator contains $E^2$, the square of the total energy level, and the stability term $\epsilon$ prevents division by zero. The squaring serves two purposes: first, it enhances the weight of large differences, making the coherence coefficient of strong edges significantly higher than that of weak edges; second, it ensures that the value range of the coefficient is strictly limited to the interval $[0, 1]$, facilitating subsequent threshold determination. When $\lambda_{\max} = \lambda_{\min}$, the coherence coefficient is 0, indicating complete isotropy; when $\lambda_{\min} = 0$, the coherence coefficient approaches 1, indicating perfect anisotropy. The physical meaning of this coefficient corresponds to the concept of visibility in optical coherence theory. High coherence indicates a stable spatial phase relationship of light waves, corresponding to specular reflection from underwater solid targets; low coherence indicates a chaotic spatial phase relationship of light waves, corresponding to diffuse scattering from suspended particles. After the coherence coefficients are calculated for all pixel locations in the image, the original coherence distribution map is obtained. Because turbidity and lighting conditions vary between water bodies, the original coefficients may be concentrated within a narrow range, so normalization is needed to extend their dynamic range. A linear stretching strategy is employed, $c_{\text{norm}} = (c - c_{\min}) / (c_{\max} - c_{\min})$, where $c_{\min}$ and $c_{\max}$ are the minimum and maximum values of the coherence coefficient in the image. The normalized coefficients maximize the distinction between target and noise, and the resulting coherence weight map intuitively displays the spatial distribution pattern.
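The coherence coefficient and the linear stretching step can be sketched as below. The function names and the default value of `eps` are assumptions for illustration; mapping a constant image to all zeros is also a choice made here, since the text does not cover that case.

```python
def coherence(lam_max, lam_min, eps=1e-8):
    """Squared directional intensity over squared total energy, plus a stability term."""
    d = abs(lam_max - lam_min)       # directional intensity
    e = lam_max + lam_min            # total energy level (trace)
    return (d * d) / (e * e + eps)

def linear_stretch(values):
    """Min-max normalize a list of coherence coefficients to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant map carries no contrast, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With equal eigenvalues `coherence` returns 0, with a single non-zero eigenvalue it is just under 1, and `linear_stretch([0.25, 0.5, 0.75])` spreads the values to `[0.0, 0.5, 1.0]`.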
[0046] In one implementation, using the standard geometric structure corresponding to the semantic tag as a reference, the displacement tensor of the initial feature point cloud relative to the standard geometric structure is calculated, and the spatial position of the initial feature point cloud is corrected based on the displacement tensor to obtain the corrected feature point cloud, including the following steps: Based on semantic tags, the corresponding standard geometric structure is retrieved from the pre-set industrial structural component database, and the initial feature point cloud is aligned with the standard geometric structure in centroid alignment to identify the observation deviation of each feature point in the initial feature point cloud relative to the standard geometric structure. An elastic stress energy equation is constructed with the objective function of minimizing observation bias, and the initial feature point cloud is regarded as an elastic body affected by the refractive stress of water. By solving the elastic stress energy equation, a continuous displacement tensor field covering the spatial region where the initial feature point cloud is located is generated. The three-dimensional coordinates of each feature point in the initial feature point cloud are inversely mapped using the displacement tensor field to obtain the corrected feature point cloud.
[0047] In this embodiment, the semantic labels output by the deep neural network carry category information of underwater targets, such as industrial structural components like pipes, flanges, and valves. Each category corresponds to a standard CAD model stored in a preset industrial structural component database. The database employs a hierarchical index structure and quickly locates the corresponding 3D geometric model through the one-hot encoded vector of the semantic label. The model is stored in a triangular mesh format, containing vertex coordinates and topological connectivity. After the standard geometric structure is retrieved, a centroid alignment operation is performed to eliminate the translational degrees of freedom. The centroid of the initial feature point cloud is calculated as $\bar{p} = \frac{1}{N}\sum_{i=1}^{N} p_i$, where $N$ is the total number of feature points in the point cloud and $p_i$ is the three-dimensional coordinate of the $i$-th feature point; the centroid of the vertex set of the standard geometric model is calculated as $\bar{q} = \frac{1}{M}\sum_{j=1}^{M} q_j$, where $M$ is the number of vertices in the model. Through the translation $p_i \leftarrow p_i + (\bar{q} - \bar{p})$, the point cloud centroid is moved to the model centroid to achieve initial spatial registration. After centroid alignment, the iterative closest point (ICP) algorithm is used to further optimize the rigid transformation parameters: in each iteration, the nearest neighbor $q_i$ on the model surface is searched for each feature point $p_i$ in the point cloud, and the optimal rotation matrix $R$ and translation vector $t$ are solved by singular value decomposition such that the sum of squared point-to-point distances $\sum_i \lVert R p_i + t - q_i \rVert^2$ is minimized. After the iteration converges, the observation bias vector of each feature point relative to its corresponding model surface point is calculated. The direction of the bias vector indicates the direction of positional drift caused by refraction, and its modulus reflects the degree of distortion.
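The centroid alignment and observation bias steps can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the rotation part of the registration is omitted, and the nearest model vertex is used as a stand-in for the nearest point on the model surface.

```python
def centroid(points):
    """Arithmetic mean of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def align_centroids(cloud, model_vertices):
    """Translate the cloud so its centroid coincides with the model centroid."""
    c_cloud = centroid(cloud)
    c_model = centroid(model_vertices)
    t = tuple(m - c for m, c in zip(c_model, c_cloud))
    moved = [tuple(p[i] + t[i] for i in range(3)) for p in cloud]
    return moved, t

def observation_bias(cloud, model_vertices):
    """Bias of each point relative to its nearest model vertex (brute-force search).

    Assumption: vertices approximate the model surface; a real implementation
    would project onto the triangles and use a spatial index.
    """
    biases = []
    for p in cloud:
        q = min(model_vertices, key=lambda v: sum((a - b) ** 2 for a, b in zip(p, v)))
        biases.append(tuple(a - b for a, b in zip(p, q)))
    return biases
```

For example, a two-point cloud centered at (1, 0, 0) aligned to a model centered at (11, 1, 0) is shifted by (10, 1, 0), and a point lying 0.5 units from its nearest vertex along x yields the bias (0.5, 0, 0).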
[0048] Underwater refraction causes light to deviate from a straight path, resulting in nonlinear spatial distortion in point clouds reconstructed with a pinhole camera model. This distortion can be treated by analogy with elastic deformation in solid mechanics: the initial feature point cloud is regarded as an elastic continuum subjected to external forces, which originate from a virtual stress field generated by refraction. Elasticity theory holds that objects deform under external forces and that the deformation energy can be described by a stress energy function. A total stress energy equation is therefore constructed, containing two energy contributions. The data fitting energy measures how well the displacement field $u(p)$ fits the observation bias, where $u(p_i)$ is the displacement vector at feature point $p_i$. The smoothing regularization energy constrains the continuity of displacement gradients between adjacent points, where the neighborhood topology is established by Delaunay triangulation and $\nabla u$ is the spatial gradient tensor of the displacement field. The regularization coefficient balances fitting accuracy and spatial smoothness: too small a value causes the displacement field to overfit noise, producing high-frequency oscillations, while too large a value over-smooths the displacement field and loses local details. The smoothing energy term originates from the concept of strain energy in elasticity: the deformation of adjacent material points should remain coordinated to avoid tearing, which corresponds to the continuity constraint on the displacement gradient. Mathematically, this energy functional is analogous to the thin-plate spline interpolation problem, which obtains the smoothest displacement field in space by minimizing the bending energy. The physical meaning of the equation is to find, subject to the observation bias constraint, the displacement distribution with the lowest energy, i.e., the most natural one, simulating the elastic rebound of the point cloud as it recovers from the distorted state to the true geometric state.
[0049] The elastic stress energy equation, as a continuous optimization problem, must be discretized and solved numerically. Specifically, the finite element method is employed to divide the three-dimensional spatial region containing the point cloud into tetrahedral mesh elements. Mesh generation is based on the Delaunay tetrahedralization algorithm, using the feature points as mesh nodes to automatically generate a tetrahedral topology that satisfies the Delaunay criterion. Within each tetrahedral element, the displacement field $u(x, y, z)$ is represented by linear basis function interpolation, $u(x, y, z) = \sum_k N_k(x, y, z)\, u_k$, where $N_k$ is the shape function of the $k$-th node and $u_k$ is the displacement vector of that node. The shape functions satisfy the partition of unity property $\sum_k N_k = 1$; furthermore, each takes the value 1 at its own node and 0 at all other nodes, which ensures the continuity of the interpolation. Substituting the interpolation expression into the energy functional and applying the variational principle gives the discretized system of linear equations $K u = f$, where $K$ is the global stiffness matrix, assembled from the element stiffness matrices and reflecting the coupling between nodes; $u$ is the column vector of all nodal displacements; and $f$ is the load vector derived from the observation bias. The stiffness matrix is sparse, symmetric, and positive definite. The linear system is solved iteratively using the conjugate gradient method, with the residual norm as the convergence criterion. Once the displacement vectors of all mesh nodes are obtained, the displacement at any query point in space is found by locating the tetrahedral element that contains it and interpolating with barycentric coordinates, thus constructing a continuous displacement tensor field $U(x, y, z)$ covering the entire point cloud space. The components of this tensor field describe the displacement distribution along the three coordinate axes, and the field gradient reflects the non-uniformity of the spatial deformation.
[0050] The displacement tensor field $U(x, y, z)$ describes the spatial transformation from the distorted state to the true state. Performing an inverse mapping on the initial feature point cloud eliminates the geometric distortion caused by refraction. For each feature point $p_i$ in the point cloud, the displacement vector at that point is first queried from the displacement field; the query is accelerated by a spatial index structure. After the tetrahedral element containing the feature point is located, the displacement components are calculated by barycentric coordinate interpolation. The coordinate transformation $p_i' = p_i - U(p_i)$ is then performed to obtain the corrected feature point coordinates; the physical meaning of the subtraction is to cancel out the virtual displacement introduced by refraction, pulling the observed point back to its true geometric position. After all feature points have been traversed and inversely mapped, the corrected feature point cloud is obtained. To verify the correction effect, the average distance error between the corrected point cloud and the standard geometric model is calculated, and the Hausdorff distance is used to measure the maximum deviation. Experimental data showed that after correction, the average error decreased from the initial 15 mm to 0.8 mm, the Hausdorff distance decreased from 42 mm to 3.2 mm, and the overlap between the point cloud and the standard model was significantly improved. If the overlap score is lower than the preset threshold of 0.95, the displacement field is iteratively corrected: the mesh density is increased to refine the tetrahedral subdivision, or the regularization coefficient is adjusted to rebalance fitting and smoothing, and the stress energy equation is solved again until the correction accuracy meets the requirements, finally outputting a high-fidelity geometric reconstruction result.
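The inverse mapping and the two error metrics can be sketched as below. The `displacement` argument is a hypothetical callable standing in for the interpolated displacement field, and the model is represented only by a list of reference points (an assumption; the text measures distance to the model itself). The Hausdorff variant shown is the directed one, from the cloud to the reference.

```python
import math

def inverse_map(points, displacement):
    """Apply p' = p - u(p) to every point; displacement(p) returns a 3-vector."""
    return [tuple(c - d for c, d in zip(p, displacement(p))) for p in points]

def nearest_distance(p, reference):
    """Distance from p to the closest reference point (brute force)."""
    return min(math.dist(p, q) for q in reference)

def mean_error(cloud, reference):
    """Average nearest-point distance from the cloud to the reference."""
    return sum(nearest_distance(p, reference) for p in cloud) / len(cloud)

def directed_hausdorff(cloud, reference):
    """Largest nearest-point distance from the cloud to the reference."""
    return max(nearest_distance(p, reference) for p in cloud)
```

As a toy check, if every point is displaced only along z and the displacement callable returns exactly that z offset, the corrected cloud falls back onto the reference points and both metrics drop to zero.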
[0051] In one implementation, the process of inversely mapping the three-dimensional coordinates of each feature point in the initial feature point cloud using a displacement tensor field to obtain the corrected feature point cloud includes the following steps: The displacement tensor field is decomposed into radial and tangential components affected by the underwater refractive index, and a non-rigid transformation model based on thin plate spline functions is established. The displacement tensor field is used as the control parameter of the non-rigid transformation model. A spatial coordinate transformation based on a non-rigid transformation model is performed on the initial feature point cloud to counteract the radial contraction and tangential distortion caused by refraction, resulting in a corrected feature point cloud. The overlap score between the corrected feature point cloud and the standard geometric structure is calculated. If the overlap score is lower than the preset overlap threshold, the displacement tensor field is iteratively corrected until the overlap score reaches the preset overlap threshold.
[0052] In this embodiment, the refractive index of water is greater than that of air, so light is refracted at the interface between the two media. This mainly produces two types of geometric distortion: the radial component corresponds to depth compression along the direction of the observed ray, which manifests as an underestimation of the target distance; the tangential component corresponds to a positional shift perpendicular to the ray direction, which manifests as a nonlinear distortion of the target's lateral coordinates. The decomposition uses a spherical coordinate system established with the camera's optical center as the origin, in which the position of any feature point is represented by its radial distance and its angular coordinates. The displacement vector is projected onto the radial unit vector to obtain the radial component, and the remainder is the tangential component. Statistical analysis shows that the radial component accounts for 70-85% of the displacement modulus and the tangential component for 15-30%, confirming that depth compression is the dominant distortion mode. Thin plate spline functions achieve smooth interpolation by minimizing bending energy. In the transformation model, A is the affine transformation matrix that captures global linear deformation, and a weighted sum of radial basis functions centered on the control points, each with its own weight vector, provides local nonlinear deformation capability. Samples of the displacement field at the control points serve as constraints, and the parameters are determined by solving a system of linear equations assembled from the kernel matrix, the homogeneous coordinate matrix P of the control points, and the stacked displacement vectors of the control points. Physically, the thin plate spline corresponds to the deformation surface of an infinitely thin elastic plate under forces applied at the control points.
Mathematically, it ensures the continuity of the second derivative and avoids unnatural wrinkles in the transformed field.
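The radial and tangential split described in this embodiment amounts to a vector projection. The sketch below assumes the feature point is expressed in a frame whose origin is the camera's optical center; `decompose_displacement` is a hypothetical name.

```python
import math


def decompose_displacement(point, displacement):
    """Split a displacement into radial and tangential parts.

    `point` is the feature point in a frame centered on the optical center
    (assumption); the radial direction is the unit vector toward the point.
    """
    r = math.sqrt(sum(c * c for c in point))
    e_r = tuple(c / r for c in point)  # radial unit vector
    u_r = sum(d * e for d, e in zip(displacement, e_r))  # signed radial length
    radial = tuple(u_r * e for e in e_r)
    tangential = tuple(d - rc for d, rc in zip(displacement, radial))
    return radial, tangential
```

The two parts are orthogonal by construction, so their sum always reproduces the input displacement.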
[0053] After the thin plate spline transformation model is established, a non-rigid spatial coordinate transformation is applied to each point of the initial feature point cloud to eliminate refractive distortion. The transformation has a global and a local level. First, the affine matrix A is applied to handle overall scaling, rotation, and shear. The local correction terms of the radial basis functions are then superimposed to obtain the final corrected coordinates. The influence of each radial basis function decreases with distance, so each control point mainly affects its neighboring region, giving spatially localized deformation control. For radial contraction, the transformation model increases the radial distance component, stretching the depth coordinate to compensate for the depth underestimation caused by refraction; for tangential distortion, it adjusts the angular components to correct the lateral positional offset. Reversibility is achieved by constructing an inverse transformation; to preserve accuracy, the inverse also takes the thin plate spline form, with the control point constraints reversed. To accelerate the transformation of large point clouds, a fast multipole method approximates the radial basis function summation: space is partitioned with an octree, and the contributions of distant control points are evaluated through multipole expansions, which lowers the computational complexity relative to direct summation.
[0054] For each corrected feature point, the shortest distance to the surface of the standard model is calculated, with a KD-tree spatial index accelerating the nearest neighbor search. A distance error metric is defined in which a scale parameter controls the distance tolerance and an exponential function maps the distance to a similarity score in the interval [0, 1]. The consistency between the point cloud normals and the model surface normals is evaluated at the same time: a local normal is estimated for each feature point, the normal at the corresponding model surface point is looked up, and the cosine of the angle between the two normals is calculated. The overall overlap score is a weighted combination of the two terms, with the weights reflecting that distance error is the primary indicator and normal consistency the secondary one. A preset overlap threshold serves as the standard for correction quality; when the overlap score falls below it, the iterative optimization process is triggered. The iterative correction strategy has three directions. First, the control point density is increased by inserting new control points in local areas with low overlap to refine the transformation field. Second, the stiffness of the thin plate spline is adjusted by modifying the order k of the kernel function, which balances smoothness against fitting accuracy. Third, robust estimation is introduced to handle outliers: the weights of feature points whose distance error exceeds three times the median are reduced to avoid overfitting to noise. In each iteration the transformation parameters are re-solved and the point cloud coordinates are updated, and the process repeats until the overlap score converges above the threshold.
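One way the overlap score could be computed is sketched below. Several details are assumptions, since the original formulas are not reproduced in the text: the model is represented by sampled surface points with unit normals, the nearest neighbor is found by brute force instead of a KD-tree, and the values of `sigma`, `w_dist`, and `w_norm` are illustrative only.

```python
import math


def overlap_score(points, normals, model_points, model_normals,
                  sigma=0.005, w_dist=0.7, w_norm=0.3):
    """Average of a distance similarity and a normal consistency term.

    sigma (meters) sets the distance tolerance; w_dist > w_norm reflects
    that distance is the primary indicator. All three values are assumed.
    Normals are expected to be unit vectors.
    """
    total = 0.0
    for p, n in zip(points, normals):
        # Brute-force nearest model sample (a KD-tree would replace this).
        j, d = min(((j, math.dist(p, q)) for j, q in enumerate(model_points)),
                   key=lambda item: item[1])
        s_dist = math.exp(-d / sigma)  # maps distance into (0, 1]
        cos = abs(sum(a * b for a, b in zip(n, model_normals[j])))
        total += w_dist * s_dist + w_norm * cos
    return total / len(points)
```

A cloud that coincides with the model samples scores 1, and the score falls as points move away from the surface or their normals tilt.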
[0055] In one implementation, extracting the geometric connectivity between feature point clouds and constructing semantic descriptors based on elastic topological constraints, and using induced fit logic to match semantic descriptors under different time sequences to trigger loop closure detection constraints includes the following steps: Record the Euclidean distances and relative angles between every pair of feature points within the feature point cloud to construct a topological correlation matrix; The semantic labels, topological association matrix, and local texture information of feature points are fused to generate a semantic descriptor with deformation tolerance. Retrieve pre-stored historical semantic descriptors in the global semantic map and calculate the similarity between the current semantic descriptor and the historical semantic descriptors; The elasticity coefficient of the semantic descriptor at the current moment is adjusted by induced fitting logic to simulate the adaptive deformation of the semantic descriptor in space; When the similarity after adaptation and deformation exceeds the preset loop closure threshold, the current observation area is determined to be a historically traversed area, and loop closure detection constraints are triggered.
[0056] In this embodiment, the corrected feature point cloud contains the spatial geometric information of underwater targets, but a bare set of point coordinates lacks a structured representation and cannot support robust scene recognition on its own. The topological association matrix encodes the relative geometric relationships between feature points and thereby forms a structural descriptor that is invariant to translation and rotation. For any two feature points in the point cloud, the three-dimensional Euclidean distance is calculated; this distance measures the spatial separation of the two points and is invariant under rigid transformations. The relative angle is also calculated, defined as the angle, in radians, between the direction vectors of the two feature points relative to the centroid of the point cloud. The topological association matrix is then constructed: each element is a two-dimensional vector holding the distance and the angle for one pair of points, the diagonal elements correspond to a point paired with itself, and the matrix is symmetric. It fully encodes the intrinsic geometric topology of the point cloud. To reduce computational complexity, a sparsification strategy retains only the connections from each feature point to its nearest neighbors, which can be retrieved quickly with a KD-tree. The sparse topological matrix preserves local structural information while requiring less storage than the dense pairwise matrix. Physically, the matrix corresponds to a weighted undirected graph in graph theory, in which the nodes are feature points, the edge weights are geometric relationships, and graph isomorphism corresponds to topological equivalence of scenes.
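A sparse version of the topological association structure can be sketched as a dictionary of edges. The neighbor count `k` and the function name `topology_graph` are assumptions, and the neighbor search is brute force rather than KD-tree based.

```python
import math


def topology_graph(points, k=2):
    """Map each retained edge (i, j), i < j, to (distance, angle).

    The angle is measured between the vectors from the cloud centroid to
    the two points. Only each point's k nearest neighbors are connected.
    """
    n = len(points)
    centroid = tuple(sum(p[a] for p in points) / n for a in range(3))

    def angle(i, j):
        u = [points[i][a] - centroid[a] for a in range(3)]
        v = [points[j][a] - centroid[a] for a in range(3)]
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0.0 or nv == 0.0:
            return 0.0  # a point sitting on the centroid has no direction
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        return math.acos(max(-1.0, min(1.0, cos)))  # radians in [0, pi]

    graph = {}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            edge = (min(i, j), max(i, j))  # symmetric: store each pair once
            graph[edge] = (math.dist(points[i], points[j]), angle(i, j))
    return graph
```

Because both quantities depend only on relative positions, translating the whole cloud leaves every stored value unchanged.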
[0057] Semantic labels provide high-level category information. One-hot encoding converts each category, such as pipeline or valve, into a vector whose dimension equals the total number of categories. The topological association matrix carries the geometric structure information, and a compact representation is extracted by principal component analysis: the matrix is flattened into a vector, the covariance matrix is calculated, and the leading principal components that retain over 95% of the structural variance form the topological feature vector. Local texture information is extracted from the enhanced image: each feature point is projected back onto the image plane, a pixel neighborhood around the projected location is extracted, and a histogram of oriented gradients is computed to obtain a 128-dimensional HOG descriptor, which is robust to changes in illumination. The three types of features are concatenated into a comprehensive semantic descriptor whose total dimension is the sum of the three component dimensions. To give the descriptor deformation tolerance, elastic constraints are introduced into the topological feature components. A tolerance interval is defined for edge lengths through an elastic modulus that allows a 20% variation in length, simulating the geometric perturbations caused by changes in observation angle or partial occlusion. A tolerance interval is likewise defined for angles, with the angle tolerance compensating for attitude estimation errors. The elastic constraints allow local deformation of the descriptor while maintaining topological connectivity, avoiding the fragility of rigid matching.
[0058] The global semantic map maintains a database of the semantic descriptors corresponding to historical keyframes, indexed up to the cumulative number of keyframes. The database uses an inverted index to accelerate retrieval: a first-level index is built on semantic labels, and descriptors of the same category are stored together in clusters. For the semantic descriptor generated at the current time, the candidate set is first filtered by semantic label so that only historical descriptors of a matching category are retained, which narrows the search to 5-10% of the original database. For each historical descriptor in the candidate set, the similarity to the current descriptor is calculated. The similarity metric uses a weighted Euclidean distance whose weight matrix assigns differentiated weights to the feature dimensions: the semantic component weight is the highest, the topological component weight is second, and the texture component weight is the lowest, reflecting the importance of high-level semantic information. A scale parameter controls the steepness of the similarity function, and an exponential mapping transforms the distance into a bounded similarity score. After the similarity of all candidate descriptors has been calculated, they are sorted in descending order of score and the top-ranked, most similar historical descriptors are selected as loop closure candidates. The complexity of the similarity calculation is O(n log n), where n is the size of the candidate set; the inverted index and an early stopping strategy keep the actual computational load low enough to meet real-time requirements.
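The retrieval step can be illustrated with a small sketch. The database layout (a dictionary from keyframe id to a label and descriptor), the diagonal weight vector, and the parameter values are all assumptions; `loop_candidates` is a hypothetical name.

```python
import math


def similarity(desc_a, desc_b, weights, sigma=1.0):
    """Exponential mapping of a weighted Euclidean distance into (0, 1]."""
    dist = math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, desc_a, desc_b)))
    return math.exp(-dist / sigma)  # sigma controls the steepness


def loop_candidates(current, database, weights, top_k=3, sigma=1.0):
    """Return up to top_k (score, keyframe_id) pairs, best first.

    `current` is (label, descriptor); `database` maps a keyframe id to
    (label, descriptor). Entries with a different label are skipped,
    which plays the role of the first-level semantic index.
    """
    label, desc = current
    scored = [(similarity(desc, h_desc, weights, sigma), key)
              for key, (h_label, h_desc) in database.items()
              if h_label == label]
    scored.sort(reverse=True)  # descending similarity
    return scored[:top_k]
```

Filtering by label before scoring is what keeps the candidate set small; the sort then dominates the cost for that reduced set.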
[0059] Induced fit logic allows descriptors to deform adaptively while maintaining topological connectivity, which improves the matching success rate. It is applied to the historical descriptor with the highest similarity, whose topological feature components are extracted for elastic adjustment. The topological difference vector between the current descriptor and the historical descriptor is calculated; this vector reflects the deviation pattern of the geometric structure. Based on the difference vector, a local deformation weight is assigned to each edge in the topological matrix: the smaller the difference, the larger the weight, indicating that the edge is more reliable. A deformation energy function is constructed in which the target edge lengths are those of the historical descriptor and the edge lengths of the current descriptor are the variables to be optimized. Minimizing the deformation energy lets the current descriptor contract or expand elastically toward the historical descriptor. The adjusted topological features are re-encoded to construct the adapted descriptor, and the adapted similarity is recalculated. A loop closure threshold is set; when the adapted similarity exceeds it, the current observation area and the location corresponding to the historical descriptor are determined to be the same scene, the loop closure detection constraint is triggered, and the pair of pose nodes is recorded.
[0060] A similarity after deformation that exceeds the preset threshold indicates that a previously visited scene has been recognized, and a constraint between the current pose and the historical pose must be established to eliminate accumulated drift. After loop closure detection is triggered, the pose node of the current keyframe and the pose node of the keyframe corresponding to the matched historical descriptor are extracted. The relative pose transformation between the two describes the diver's movement from the historical position to the current position. To verify the reliability of the loop closure, a geometric consistency check is performed: the current feature point cloud is transformed into the historical coordinate system using the relative pose transformation, and the overlap rate between the transformed point cloud and the historical point cloud is calculated, with the intersection determined by a distance threshold. An overlap rate higher than 60% confirms a valid loop closure. A semantic topological constraint factor is then established and added to the global factor graph. Its constraint error is defined in terms of the relative transformation estimated from semantic matching and the poses of the two nodes, with the inverse operation used to compose them. The information matrix of the constraint factor is set from the similarity score and expressed in terms of an identity matrix, so that high similarity corresponds to high confidence, that is, low covariance. The loop closure constraint pulls trajectories that are locally consistent but globally drifting back to the correct position, and global optimization propagates the correction to all historical pose nodes, achieving global consistency for long-term navigation.
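The geometric consistency check can be sketched as follows. The relative pose is given as a 3x3 rotation `R` and a translation `t`; the distance threshold of 5 cm is an assumed value, while the 60% overlap criterion comes from the text. Nearest neighbors are found by brute force for brevity.

```python
import math


def transform(p, R, t):
    """Apply the rigid transform q = R p + t to a 3D point."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r]
                 for r in range(3))


def overlap_rate(current, historical, R, t, dist_thresh=0.05):
    """Fraction of transformed current points with a historical point nearby."""
    hits = 0
    for p in current:
        q = transform(p, R, t)
        if min(math.dist(q, h) for h in historical) <= dist_thresh:
            hits += 1
    return hits / len(current)


def loop_is_valid(current, historical, R, t, min_overlap=0.6):
    """Accept the loop closure only if the overlap rate exceeds 60%."""
    return overlap_rate(current, historical, R, t) > min_overlap
```

A loop candidate that passes the descriptor match but fails this check would simply be discarded before any factor is added to the graph.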
[0061] In one implementation, adjusting the elasticity coefficients of the semantic descriptor at the current moment using induced fit logic to simulate the adaptation form of the semantic descriptor in space includes the following steps: Define the topological correlation matrix in the semantic descriptor as a topological structure model, and calculate the geometric difference vector between the current semantic descriptor and the historical semantic descriptor; Local deformation weights are assigned to each node in the topological model based on the geometric difference vector. While maintaining topological connectivity, the elastic contraction or elastic expansion of semantic descriptors can be achieved by minimizing the structural internal energy caused by local deformation weights.
[0062] In this embodiment, the topological association matrix G in the semantic descriptor encodes the intrinsic geometric structure of the feature point cloud. The matrix is reinterpreted as a topological structure model in the graph-theoretic sense, where nodes correspond to feature points and edges correspond to the geometric relationships between pairs of points. The model can be represented as a weighted graph in which the node set contains the N feature points, the edge set includes all nearest-neighbor connections, and the weight set stores the edge lengths and included angles. The topological model at the current moment and the topological model of a historical moment may differ in the number of nodes and in connection patterns, but they should have similar topological structures when they correspond to the same semantic target. The geometric difference vector quantifies the structural deviation between the two models. The calculation first establishes node correspondences: the optimal bipartite matching is solved with the Hungarian algorithm, pairing the nodes of the current model with the nodes of the historical model, with matching costs based on the spatial location and local topological features of the nodes. Once the correspondence is established, the difference in edge length and the difference in included angle are calculated for each matched edge and assembled into a per-edge difference vector. The global geometric difference vector is the weighted average of the difference vectors of all matched edges, where each matching weight reflects the reliability of the edge and is determined by the reciprocal of the matching cost. The components of the geometric difference vector reflect how the topological model deviates along different geometric dimensions: a positive edge length difference indicates that the current model is relatively expanded and a negative value indicates contraction, while the angle difference reflects rotational or torsional deformation.
[0063] The geometric difference vector reveals the overall deviation trend, but the degree of deformation is not uniform across the topological model. Local deformation weights are therefore assigned to each node individually to achieve fine-grained deformation control. The allocation follows physical intuition: nodes with smaller deviations should remain stable and receive low weights, while nodes with larger deviations require larger adjustments and receive high weights. In the calculation, the difference vectors of all edges associated with a node are first collected, and the node's local difference measure is computed from them over the set of neighbors of node i, normalized by the degree of the node, that is, the number of edges it connects to. The local difference measure reflects the degree of geometric inconsistency around a node: nodes with large values lie in regions of severe deformation, and nodes with small values lie in stable regions. The deformation weight is defined from the local difference measure and normalized to the interval [0, 1], so that the node with the largest difference receives a weight of 1 and the node with the smallest difference receives a weight close to 0. To avoid excessive concentration of weight, a smoothing factor is introduced, and a bias parameter ensures that every node retains a minimum deformation capacity. The physical meaning of the deformation weight is similar to a local stiffness coefficient in elasticity: high-weight nodes correspond to low-stiffness, easily deformable regions, while low-weight nodes correspond to high-stiffness, rigid regions.
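The node weighting can be illustrated as below. The exact weight formula is not reproduced in the text, so the normalization `(m + bias) / (max + bias)` is an assumed form chosen only to satisfy the stated properties (largest difference maps to 1, no node reaches exactly 0); the smoothing factor is omitted. Combining length and angle differences with `math.hypot` is likewise a simplification.

```python
import math


def local_difference(edge_diffs, n_nodes):
    """Per-node mean magnitude of the difference vectors of incident edges.

    `edge_diffs` maps an edge (i, j) to (length_difference, angle_difference).
    """
    sums = [0.0] * n_nodes
    degree = [0] * n_nodes
    for (i, j), (d_len, d_ang) in edge_diffs.items():
        magnitude = math.hypot(d_len, d_ang)
        for node in (i, j):
            sums[node] += magnitude
            degree[node] += 1
    return [s / d if d else 0.0 for s, d in zip(sums, degree)]


def deformation_weights(metrics, bias=0.05):
    """Assumed normalization: largest metric -> 1, every weight stays > 0."""
    top = max(metrics)
    return [(m + bias) / (top + bias) for m in metrics]
```

The bias keeps nodes in stable regions from becoming perfectly rigid, which matches the stated role of the bias parameter.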
[0064] Topological connectivity is an essential property of the descriptor and must be strictly maintained during deformation: the adjacency relationships between nodes cannot change, and only the geometric parameters of the edges may be adjusted. The topological model can be viewed as a system of point masses connected by springs, with edges as springs and nodes as point masses; deformation stretches or compresses the springs and generates elastic potential energy. The internal energy function of the structure is defined over the adjusted edge lengths and included angles, which are the variables to be optimized, and their target values in the historical model. The product of the weights controls how difficult each edge is to deform, and a regularization coefficient balances the optimization of edge length against included angle. The first term of the energy function penalizes deviation of the edge length from its target value, and the second term penalizes deviation of the included angle from its target value. Minimizing the total energy is equivalent to finding the deformation configuration that best approximates the target structure with minimal internal stress. Topological connectivity constraints are introduced through the Lagrange multiplier method so that the optimization does not disrupt adjacency relationships. The problem is solved iteratively by gradient descent: the partial derivatives of the energy function with respect to the edge length and included angle are calculated, and the parameters are updated along the negative gradient direction with a set learning rate. The iteration terminates when the rate of energy change falls below a threshold. After convergence, a positive edge length difference means the topological model has undergone elastic expansion and the overall edge length increases; a negative value means elastic contraction and the overall edge length decreases.
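The energy minimization can be sketched with plain gradient descent. The quadratic energy below is an assumed form consistent with the two penalty terms described; `lam`, `lr`, and `tol` are illustrative values. Connectivity is preserved here simply because the edge set is never modified, so the Lagrange multiplier step is not shown.

```python
def structure_energy(edges, targets, weights, lam=0.5):
    """Sum over edges of w * ((L - L*)^2 + lam * (theta - theta*)^2)."""
    total = 0.0
    for e, (length, theta) in edges.items():
        t_len, t_theta = targets[e]
        total += weights[e] * ((length - t_len) ** 2
                               + lam * (theta - t_theta) ** 2)
    return total


def relax_edges(edges, targets, weights, lam=0.5, lr=0.1,
                tol=1e-9, max_iter=1000):
    """Move edge lengths and angles toward their targets by gradient descent."""
    cur = dict(edges)  # same keys throughout: adjacency never changes
    prev = structure_energy(cur, targets, weights, lam)
    for _ in range(max_iter):
        for e, (length, theta) in list(cur.items()):
            t_len, t_theta = targets[e]
            g_len = 2.0 * weights[e] * (length - t_len)
            g_theta = 2.0 * weights[e] * lam * (theta - t_theta)
            cur[e] = (length - lr * g_len, theta - lr * g_theta)
        now = structure_energy(cur, targets, weights, lam)
        if abs(prev - now) < tol:  # energy has stopped changing
            break
        prev = now
    return cur
```

Edges with a larger weight take larger steps for the same deviation, so the weight acts as the per-edge stiffness control described above.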
[0065] In one implementation, monitoring the fluctuation entropy values of semantic feature points in the optimized and corrected feature point cloud in the spatiotemporal sequence, aggregating stable semantic features with fluctuation entropy values lower than a preset entropy threshold through incremental voxel filtering, and generating a three-dimensional structured semantic map on the AR helmet display interface includes the following steps: Establish a spatiotemporal observation window for each semantic feature point in the optimized and corrected feature point cloud, and record the spatial position changes of the semantic feature points in multiple consecutive frames of images through the spatiotemporal observation window. The spatial distribution variance of each semantic feature point within the spatiotemporal observation window is statistically analyzed, and the fluctuation entropy value, which characterizes the stability of the semantic feature point, is calculated by combining the spatial location change and the spatial distribution variance. Semantic feature points whose fluctuation entropy values are higher than a preset stability threshold are identified as interfering features and are removed. Stable semantic features with fluctuation entropy values below the stability threshold are projected into a global voxel grid; Perform probabilistic label voting within the global voxel grid, and accumulate the observation weights of stable semantic features within the same voxel; The voxel occupancy status is updated based on the cumulative result of the observation weights, a three-dimensional structured semantic map is constructed, and the three-dimensional structured semantic map is generated on the AR helmet display interface.
[0066] In this implementation, the globally optimized feature point cloud is free of cumulative drift, but single-frame observations still contain measurement noise and dynamic interference, so temporal analysis is needed to screen for stable features. The spatiotemporal observation window acts as a sliding statistical unit that tracks the behavior of feature points in both time and space. For each semantic feature point in the optimized feature point cloud, an observation window of fixed time span is established, anchored at the current time and tracing back through the historical frame sequence. Within the window, all keyframes containing the semantic feature point are retrieved. Cross-frame association uses the unique identifier of the feature point, which consists of its semantic label and the frame number of its initial detection. The three-dimensional coordinates of the feature point in each keyframe are recorded to build a time-series coordinate sequence, where n is the number of observations. The positional change between adjacent frames is then calculated; it reflects the amplitude of the feature point's movement over time. The feature point of an ideal static target keeps a constant spatial position with near-zero positional change, whereas dynamic disturbances such as floating objects, as well as measurement noise, cause significant fluctuations in positional change. The mean and standard deviation of the positional change within the observation window are computed: the mean reflects the systematic drift trend, and the standard deviation reflects the degree of random fluctuation.
[0067] The spatial distribution of the time-series coordinates reveals the stability of a feature point: the coordinates of a stable feature point cluster in a small spatial region, while those of an unstable feature point are diffusely distributed. The centroid of the coordinate sequence is calculated as the average spatial position of the feature point, and the deviation vector of each observed coordinate relative to the centroid is computed. The spatial distribution variance is defined as the mean of the squared magnitudes of the deviation vectors; it quantifies the dispersion of the coordinate sequence and is expressed in square meters. The typical variance of stable target features is less than [value missing], while the variance of dynamic interference features can reach [value missing] or more. Fluctuation entropy is a comprehensive stability index that integrates the positional change and spatial distribution information. The three-dimensional space corresponding to the observation window is divided into cubic bins whose side length is set so that the bins cover a range of two standard deviations. The number of coordinates in the sequence that fall into each bin is counted and converted to a frequency, and the fluctuation entropy is defined from these frequencies using the Shannon entropy formula. The entropy ranges from zero to the logarithm of the number of bins: when the coordinates are concentrated in a single bin the entropy approaches zero, indicating extremely high stability; when the coordinates are evenly distributed across all bins the entropy reaches its maximum, indicating complete randomness.
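The variance and entropy computations can be sketched directly from these definitions. The bin size is passed in as a parameter rather than derived from the standard deviation, and the natural logarithm is used; both are assumptions.

```python
import math
from collections import Counter


def spatial_variance(coords):
    """Mean squared distance of the observations from their centroid."""
    n = len(coords)
    centroid = tuple(sum(p[a] for p in coords) / n for a in range(3))
    return sum(math.dist(p, centroid) ** 2 for p in coords) / n


def fluctuation_entropy(coords, bin_size):
    """Shannon entropy of the occupancy frequencies over cubic bins."""
    bins = Counter(tuple(math.floor(c / bin_size) for c in p) for p in coords)
    n = len(coords)
    return -sum((k / n) * math.log(k / n) for k in bins.values())
```

A point observed at nearly the same place in every frame lands in one bin and scores zero, while a point seen in a different bin each time scores the logarithm of the number of observations.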
[0068] Fluctuation entropy provides a quantitative assessment of feature point stability, and setting a threshold allows automated feature selection. The preset stability threshold serves as the criterion and is calibrated through offline experiments: the entropy distribution of real target features is measured in a known static scene, and its 95th percentile is taken as the upper bound of the threshold. All semantic feature points are traversed, and those whose fluctuation entropy exceeds the threshold are identified as interference. Interference features originate from three sources: first, dynamic objects such as drifting suspended matter or swimming fish, whose constantly changing positions lead to high entropy; second, spurious feature points caused by measurement noise, whose detected positions jump randomly between frames; and third, unstable features at occlusion boundaries, whose observations are intermittently lost as the viewpoint changes. Feature points identified as interference are removed: the corresponding entries are deleted from the feature point cloud database and the global factor graph is updated to remove the related observation factors, which prevents unreliable observations from contaminating the optimization results. The deletion is soft: feature points are marked as invalid while their historical data is retained. If the entropy decreases after subsequent observation window updates, the feature points can be restored to the valid state, giving dynamic management. Statistics show that removal typically discards 15-25% of the initial feature points, which improves the reliability of map construction. The retained low-entropy feature points constitute the stable semantic feature set; each feature point in this set has a fluctuation entropy below the stability threshold and represents a static structural element that actually exists in the underwater environment.
[0069] The stable semantic feature set contains reliable spatial points that have been verified over time, but the discrete point cloud must be converted into a continuous volumetric representation to support spatial queries and collision detection. The global voxel grid is a three-dimensional discretized spatial structure that divides the continuous coordinate space into regular cubic units. The voxel resolution is defined so that each voxel has a side length of 5 centimeters, which balances spatial precision against computational efficiency: too fine a resolution consumes excessive memory, while too coarse a resolution loses geometric detail. A voxel grid aligned with the origin of the global coordinate system is established, and each voxel is identified by an integer triple that indicates the spatial range it covers. For each feature point in the stable feature set, the voxel index is calculated by dividing each coordinate by the voxel resolution and applying the floor function, which rounds down. The feature point's semantic label and observation weight are appended to the attribute list of the corresponding voxel. The observation weight is defined so that the lower the entropy, the higher the weight, reflecting the reliability of the feature point. The voxel grid is stored in a hash table that allocates memory only for voxels containing feature points; empty voxels occupy no storage, which gives a sparse representation. The hash function combines the three integer indices using large prime numbers and the capacity M of the hash table. The time complexity of the projection operation is O(n), linear in the number of feature points, which meets real-time requirements.
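The projection into a sparse voxel grid can be sketched with a Python dictionary, which is itself a hash table, so the custom prime-based hash function is not reproduced. The weight formula `exp(-entropy)` is an assumed form that only satisfies the stated property that lower entropy gives higher weight.

```python
import math


def voxel_index(position, resolution=0.05):
    """Integer voxel triple: floor of each coordinate over the 5 cm side."""
    return tuple(math.floor(c / resolution) for c in position)


def project_features(features, resolution=0.05):
    """Group (position, label, entropy) features by voxel.

    Returns a dict from voxel index to a list of (label, weight); voxels
    with no features never appear, which gives the sparse representation.
    """
    grid = {}
    for position, label, entropy in features:
        weight = math.exp(-entropy)  # assumed: lower entropy, higher weight
        grid.setdefault(voxel_index(position, resolution), []).append(
            (label, weight))
    return grid
```

Each feature costs one index computation and one expected constant-time dictionary update, which is where the linear running time comes from.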
[0070] The semantic label of an individual feature point may be misclassified, and fusing multiple observations improves label reliability. A probabilistic label voting mechanism within the voxel grid aggregates the observations at the same spatial location, enhancing semantic consistency. For each non-empty voxel, the set of semantic labels of all feature points inside it is extracted, and the cumulative observation weight of each semantic category c is calculated. The weighted summation is equivalent to weighted voting, in which highly reliable feature points contribute more votes. The class probability distribution is then obtained by normalizing the cumulative weights, and the category with the highest probability becomes the dominant semantic label of the voxel. If the probability of the dominant class exceeds the confidence threshold, the label is deemed reliable and assigned to the voxel; otherwise the voxel is marked as uncertain and awaits further observations. The accumulated observation weight is stored as the voxel's occupancy confidence. The confidence reflects how thoroughly a voxel has been observed: high-confidence voxels are validated by multiple independent observations, while low-confidence voxels are supported by only a few. An incremental update strategy supports online map construction: when a new keyframe generates new feature points, only the weights and labels of the affected voxels are updated, using an exponential moving average to integrate old and new observations smoothly. The forgetting factor gives higher weight to historical observations to suppress the impact of instantaneous noise.
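The voting and incremental update can be sketched as follows. The confidence threshold of 0.6 and the forgetting factor of 0.8 are assumed values, and the function names are hypothetical.

```python
from collections import defaultdict


def vote_voxel_label(observations, conf_thresh=0.6):
    """Weighted vote over (label, weight) pairs in one voxel.

    Returns (label or None, dominant probability, total weight). None
    marks the voxel as uncertain when the dominant probability does not
    exceed conf_thresh.
    """
    totals = defaultdict(float)
    for label, weight in observations:
        totals[label] += weight
    total = sum(totals.values())
    label, top = max(totals.items(), key=lambda item: item[1])
    prob = top / total
    return (label if prob > conf_thresh else None), prob, total


def blend_weight(old, new, forgetting=0.8):
    """Exponential moving average that favors the historical value."""
    return forgetting * old + (1.0 - forgetting) * new
```

The total weight returned by the vote doubles as the voxel's occupancy confidence, so a voxel with many consistent observations ends up both confidently labeled and strongly supported.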
[0071] The occupancy status of voxels determines which regions of 3D space are occupied by solid targets and which are free space, forming the basis of navigation planning. Occupancy probabilities are updated based on the accumulated observation weights, and a Bayesian filtering framework is used to fuse multi-frame observations. The log-odds of occupancy is defined as $L = \log\left(p_{occ} / p_{free}\right)$, where $p_{occ}$ is the probability that the voxel is occupied and $p_{free} = 1 - p_{occ}$ is the probability that it is free. The initial log-odds is set to zero to represent prior uncertainty, and it is updated each time a feature point is observed to fall into the voxel. The observation model parameter gives the probability that the voxel is occupied when a feature is observed in it; the probability that it is free is the complement. When the log-odds exceed the upper threshold, the voxel is marked as occupied; when they fall below the lower threshold, it is marked as free; values in between are treated as indeterminate. Occupied voxels are assigned their dominant semantic labels and are color-coded for semantic visualization: pipes are displayed in blue, valves in red, and flanges in green. The Marching Cubes algorithm extracts isosurfaces from the voxel grid, generating a triangular mesh model that represents the target surface. The mesh model undergoes Laplacian smoothing to eliminate staircase artifacts, and vertex normals are calculated as a weighted average over adjacent triangles. The final 3D structured semantic map is stored in a hybrid representation: the voxel grid supports fast spatial queries, while the triangular mesh supports high-quality rendering. Map data is wirelessly transmitted to the graphics processing unit of the AR headset, where the OpenGL rendering pipeline generates the augmented reality display: the mesh model is transformed to the camera coordinate system, depth testing is performed to achieve virtual-real occlusion, and a semi-transparent semantic annotation layer is overlaid.
Divers can observe the 3D structure and semantic information of their surroundings in real time through the headset display interface, assisting in spatial positioning and path planning during underwater operations, significantly improving navigation efficiency and safety.
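The log-odds occupancy update in paragraph [0071] can be sketched as follows. This is an illustration under stated assumptions rather than the method of the invention: the class name `VoxelOccupancy`, the observation probability (0.7) and both thresholds (+2.0 and -2.0) are hypothetical placeholders, the additive update is the standard binary Bayes filter form, and the `observe_miss` method is an added assumption, since the text only describes updates when a feature point falls into a voxel.

```python
import math


def log_odds(probability):
    """Log-odds of a probability: log(p / (1 - p))."""
    return math.log(probability / (1.0 - probability))


class VoxelOccupancy:
    """Log-odds occupancy state of a single voxel (all defaults are assumed values)."""

    def __init__(self, p_hit=0.7, occupied_threshold=2.0, free_threshold=-2.0):
        self.hit_increment = log_odds(p_hit)
        self.occupied_threshold = occupied_threshold
        self.free_threshold = free_threshold
        self.value = 0.0  # log-odds 0 corresponds to p = 0.5, the uninformed prior

    def observe_hit(self):
        """A feature point was observed inside this voxel."""
        self.value += self.hit_increment

    def observe_miss(self):
        """Assumed counterpart: log_odds(1 - p_hit) equals -log_odds(p_hit)."""
        self.value -= self.hit_increment

    def probability(self):
        """Convert the log-odds back to an occupancy probability."""
        return 1.0 / (1.0 + math.exp(-self.value))

    def state(self):
        if self.value > self.occupied_threshold:
            return "occupied"
        if self.value < self.free_threshold:
            return "free"
        return "indeterminate"


if __name__ == "__main__":
    voxel = VoxelOccupancy()
    for _ in range(3):
        voxel.observe_hit()
        print(voxel.state(), round(voxel.probability(), 3))
```

Working in log-odds turns the Bayesian product of likelihoods into a running sum, so each observation costs one addition per voxel, and the two thresholds give hysteresis against single noisy frames.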
[0072] The present invention also discloses a semantic SLAM construction system for diver AR navigation, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the semantic SLAM construction method for diver AR navigation as described in any of the above.
[0073] The processor can be a central processing unit (CPU). Depending on the actual use, it can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor; this application does not limit this.
[0074] The memory can be an internal storage unit of a computer device, such as a hard disk or RAM, or an external storage device provided on the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. Furthermore, the memory can be a combination of internal storage units and external storage devices of a computer device. The memory is used to store computer programs and other programs and data required by the computer device. The memory can also be used to temporarily store data that has been output or will be output. This application does not limit this.
[0075] The present invention also discloses a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the semantic SLAM construction method for diver AR navigation described in any of the above embodiments.
[0076] The computer program can be stored in a machine-readable medium. The computer program includes computer program code, which can be in the form of source code, object code, an executable file, or certain intermediate forms. The machine-readable medium includes any entity or device capable of carrying computer program code, such as a recording medium, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signal, telecommunication signal, or software distribution medium. It should be noted that the machine-readable medium includes, but is not limited to, the above-mentioned components.
[0077] The semantic SLAM construction method for diver AR navigation described in the above embodiments is stored in the computer-readable storage medium and loaded and executed on the processor to facilitate the storage and application of the above method.
[0078] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of protection of this application is limited to these examples; within the framework of this application, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of one or more embodiments of this application as described above, which are not provided in detail for the sake of brevity.
[0079] One or more embodiments in this application are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of this application. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments in this application should be included within the protection scope of this application.
Claims
1. A semantic SLAM construction method for diver AR navigation, characterized in that it includes the following steps:
The raw underwater image stream and inertial measurement data are collected by the diver's AR helmet;
Using coherent scattering component separation logic, the volume scattering component in the raw image stream is filtered out from the surface scattering component according to the texture coherence coefficient of the image pixels in the raw image stream, so as to obtain enhanced structured image features;
From the structured image features, semantic labels and corresponding initial feature point clouds of underwater targets are extracted based on a preset deep neural network, and the initial motion pose of the diver is calculated by combining the pre-integration results of the inertial measurement data;
Using the standard geometric structure corresponding to the semantic label as a reference, the displacement tensor of the initial feature point cloud relative to the standard geometric structure is calculated, and the spatial position of the initial feature point cloud is corrected based on the displacement tensor to obtain the corrected feature point cloud;
The geometric connectivity between feature point clouds is extracted, semantic descriptors based on elastic topological constraints are constructed, and induced fit logic is used to match semantic descriptors from different time sequences to trigger loop closure detection constraints;
A global factor graph is established, which includes the pose nodes corresponding to the initial motion pose and the semantic topological constraint factors corresponding to the loop closure detection constraints, and, based on the global factor graph, the diver's historical trajectory and the global spatial coordinates of the underwater target are iteratively optimized by minimizing the global reprojection error and the topological consistency error;
The fluctuation entropy values of semantic feature points in the spatiotemporal sequence of the optimized feature point cloud are monitored, stable semantic features with fluctuation entropy values lower than the preset entropy threshold are aggregated through incremental voxel filtering, and a three-dimensional structured semantic map is generated on the AR helmet display interface.
2. The semantic SLAM construction method for diver AR navigation according to claim 1, characterized in that the step of using coherent scattering component separation logic to filter out the volume scattering component from the surface scattering component in the raw image stream according to the texture coherence coefficient of the image pixels in the raw image stream, so as to obtain enhanced structured image features, includes the following steps:
A sliding analysis window is defined in the raw image stream, each pixel region in the raw image stream is traversed, and a feature covariance matrix representing the brightness gradient distribution of the pixel region within the sliding analysis window is constructed;
Eigenvalue decomposition is performed on the feature covariance matrix to extract the eigenvalue distribution of the pixel region, which reflects the local directionality of the image;
The texture coherence coefficient of the pixel region is calculated based on the eigenvalue distribution to quantify the degree of anisotropy of the pixel region;
Pixel regions with texture coherence coefficients below a preset coherence threshold are identified as volume scattering components and suppressed, while surface scattering components with texture coherence coefficients above the preset coherence threshold are retained, resulting in enhanced structured image features.
3. The semantic SLAM construction method for diver AR navigation according to claim 2, characterized in that the step of calculating the texture coherence coefficient of the pixel region based on the eigenvalue distribution to quantify the degree of anisotropy of the pixel region includes the following steps:
The maximum and minimum eigenvalues of the feature covariance matrix are extracted, the absolute value of the difference between the maximum and minimum eigenvalues is calculated as the directional intensity of the pixel region, and the sum of the maximum and minimum eigenvalues is calculated as the total energy level of the pixel region;
The ratio of the directional intensity to the total energy level is defined as the texture coherence coefficient of the pixel region;
The texture coherence coefficients of the image to which the pixel region belongs are normalized to generate a coherence weight map for distinguishing between suspended particle noise and underwater targets.
4. The semantic SLAM construction method for diver AR navigation according to claim 1, characterized in that the step of calculating the displacement tensor of the initial feature point cloud relative to the standard geometric structure, using the standard geometric structure corresponding to the semantic label as a reference, and correcting the spatial position of the initial feature point cloud based on the displacement tensor to obtain the corrected feature point cloud includes the following steps:
Based on the semantic label, the corresponding standard geometric structure is retrieved from a preset industrial structural component database, and the centroid of the initial feature point cloud is aligned with that of the standard geometric structure to identify the observation deviation of each feature point in the initial feature point cloud relative to the standard geometric structure;
An elastic stress energy equation is constructed with minimization of the observation deviation as the objective function, and the initial feature point cloud is regarded as an elastic body subject to the refractive stress of water;
By solving the elastic stress energy equation, a continuous displacement tensor field covering the spatial region where the initial feature point cloud is located is generated;
The three-dimensional coordinates of each feature point in the initial feature point cloud are inversely mapped using the displacement tensor field to obtain the corrected feature point cloud.
5. The semantic SLAM construction method for diver AR navigation according to claim 4, characterized in that the step of inversely mapping the three-dimensional coordinates of each feature point in the initial feature point cloud using the displacement tensor field to obtain the corrected feature point cloud includes the following steps:
The displacement tensor field is decomposed into radial and tangential components affected by the underwater refractive index, and a non-rigid transformation model based on thin plate spline functions is established;
The displacement tensor field is used as the control parameter of the non-rigid transformation model, and a spatial coordinate transformation based on the non-rigid transformation model is performed on the initial feature point cloud to counteract the radial contraction and tangential distortion caused by refraction, resulting in a corrected feature point cloud;
The overlap score between the corrected feature point cloud and the standard geometric structure is calculated, and if the overlap score is lower than a preset overlap threshold, the displacement tensor field is iteratively corrected until the overlap score reaches the preset overlap threshold.
6. The semantic SLAM construction method for diver AR navigation according to claim 1, characterized in that the step of extracting the geometric connectivity between feature point clouds, constructing semantic descriptors based on elastic topological constraints, and using induced fit logic to match semantic descriptors from different time sequences to trigger loop closure detection constraints includes the following steps:
The Euclidean distances and relative angles between every pair of feature points within the feature point cloud are recorded to construct a topological correlation matrix;
The semantic labels, topological correlation matrix, and local texture information of the feature points are fused to generate a semantic descriptor with deformation tolerance;
Pre-stored historical semantic descriptors are retrieved from the global semantic map, and the similarity between the current semantic descriptor and the historical semantic descriptors is calculated;
The elasticity coefficient of the semantic descriptor at the current moment is adjusted by induced fit logic to simulate the adaptive deformation of the semantic descriptor in space;
When the similarity after adaptive deformation exceeds a preset loop closure threshold, the current observation area is determined to be a previously traversed area, and loop closure detection constraints are triggered.
7. The semantic SLAM construction method for diver AR navigation according to claim 6, characterized in that the step of adjusting the elasticity coefficient of the semantic descriptor at the current moment by induced fit logic to simulate the adaptive deformation of the semantic descriptor in space includes the following steps:
The topological correlation matrix in the semantic descriptor is defined as a topological structure model, and the geometric difference vector between the current semantic descriptor and the historical semantic descriptor is calculated;
Local deformation weights are assigned to each node in the topological structure model based on the geometric difference vector;
While topological connectivity is maintained, elastic contraction or elastic expansion of the semantic descriptor is achieved by minimizing the structural internal energy caused by the local deformation weights.
8. The semantic SLAM construction method for diver AR navigation according to claim 1, characterized in that the step of monitoring the fluctuation entropy values of semantic feature points in the spatiotemporal sequence of the optimized feature point cloud, aggregating stable semantic features with fluctuation entropy values lower than the preset entropy threshold through incremental voxel filtering, and generating a three-dimensional structured semantic map on the AR helmet display interface includes the following steps:
A spatiotemporal observation window is established for each semantic feature point in the optimized feature point cloud, and the spatial position changes of the semantic feature point across multiple consecutive image frames are recorded through the spatiotemporal observation window;
The spatial distribution variance of each semantic feature point within the spatiotemporal observation window is computed, and the fluctuation entropy value, which characterizes the stability of the semantic feature point, is calculated by combining the spatial position changes and the spatial distribution variance;
Semantic feature points whose fluctuation entropy values are higher than the preset entropy threshold are identified as interfering features and removed, and stable semantic features with fluctuation entropy values below the preset entropy threshold are projected into a global voxel grid;
Probabilistic label voting is performed within the global voxel grid, and the observation weights of stable semantic features within the same voxel are accumulated;
The voxel occupancy status is updated based on the accumulated observation weights, a three-dimensional structured semantic map is constructed, and the three-dimensional structured semantic map is generated on the AR helmet display interface.
9. A semantic SLAM construction system for diver AR navigation, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the semantic SLAM construction method for diver AR navigation according to any one of claims 1 to 8.
10. A computer-readable storage medium storing instructions, characterized in that the instructions, when executed by a processor, cause the processor to perform the semantic SLAM construction method for diver AR navigation according to any one of claims 1 to 8.