A multi-head feature-geometric consistency image enhancement method for visual SLAM
By employing a multi-head feature-geometric consistency image enhancement method that uses a shared feature encoder and multi-task constraints, the problem of unstable feature detection in visual SLAM systems in complex environments is addressed, improving positioning accuracy and tracking stability. The method can be deployed as a plug-and-play module in visual SLAM systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
Existing visual SLAM systems suffer from image degradation in rainy, nighttime, and low-light environments. This destabilizes feature detection and matching, which in turn degrades pose estimation and causes trajectory error to accumulate. Furthermore, learning-based SLAM systems are highly complex and poorly compatible with existing pipelines.
A multi-head feature-geometric consistency image enhancement method is adopted. Multi-scale features are extracted by a shared feature encoder and constrained by an image reconstruction head, a feature consistency head, and a geometric supervision head. A teacher-student network is constructed for training. By combining pixel, structural, and geometric supervision, feature distribution destruction and structural drift during the enhancement process are suppressed.
Without altering the classic SLAM system structure, this method improves positioning accuracy and tracking stability, reduces the negative impact of the enhancement process on visual SLAM, and is suitable for image processing and localization mapping under complex weather conditions.
Figure CN122243775A_ABST
Description
Technical Field
[0001] This invention relates to the field of robot visual localization mapping and image enhancement technology, and in particular to a multi-head feature-geometric consistency image enhancement method for visual SLAM.
Background Technology
[0002] Visual Simultaneous Localization and Mapping (VSLAM) is a core foundational technology in mobile robots, unmanned systems, and intelligent sensing devices. This technology estimates camera pose and constructs environmental maps using visual sensors, providing fundamental support for path planning, obstacle avoidance, and scene understanding. Compared to high-cost sensors such as LiDAR, visual sensors offer advantages such as low cost, high information density, and flexible deployment. Therefore, classic feature-based visual SLAM systems are widely used in indoor and outdoor robots, autonomous driving assistance, and augmented reality scenarios.
[0003] However, in rainy, nighttime, and low-light environments, input images exhibit degradation phenomena such as rain streaks, decreased contrast, blurred edges, and increased noise. These significantly disrupt the feature detection, description, and matching processes at the visual SLAM front end, further leading to unstable pose estimation, difficulty in relocalization, and accumulation of trajectory errors. To mitigate these issues, existing techniques typically overlay low-light enhancement, rain removal, or general image restoration models at the visual SLAM front end.
[0004] Most existing image enhancement methods primarily aim to improve the visual quality of a single frame, focusing on optimizing pixel reconstruction quality, perceptual quality, or subjective image visibility, while lacking explicit constraints on downstream geometric tasks. Practice shows that while some image enhancement methods can make images visually clearer, they may introduce structural inconsistencies in the temporal dimension, manifesting as abnormal fluctuations in the number of feature matches, long-tailed expansion of geometric constraint distribution, and local jumps in pose estimation, thus negatively impacting the geometric optimization of visual SLAM.
[0005] On the other hand, learning-based SLAM or multimodal SLAM can improve robustness under complex conditions to some extent, but these methods often require refactoring the front-end feature extraction and matching process, resulting in high system complexity, high deployment costs, and poor compatibility with classic feature-based SLAM systems, making it difficult to plug and play in existing mature pipelines.
[0006] Therefore, there is an urgent need to propose a new image enhancement method that can not only improve the visual quality of degraded images, but also explicitly maintain the feature stability and geometric consistency related to visual SLAM during the enhancement process. This would improve the overall localization performance and tracking stability of the system under harsh visual conditions without changing the structure of the classic visual SLAM system.
Summary of the Invention
[0007] To address the aforementioned problems in existing technologies, this invention proposes a multi-head feature-geometric consistency image enhancement method for visual SLAM. This method does not solely aim to improve the subjective visual quality of images, but rather addresses the practical needs of visual SLAM by simultaneously introducing feature consistency constraints and geometric supervision constraints during the image enhancement stage. This ensures that the enhanced output not only improves image usability but also maintains the stability of key features and scene structure upon which the downstream SLAM front-end relies.
[0008] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0009] A multi-head feature-geometric consistency image enhancement method for visual SLAM, which mainly includes the following steps:
[0010] Preprocessing: Construct a sequence of degraded visual images in a simulated environment, and establish clean reference images and depth supervision data that are spatiotemporally aligned with the degraded images for the training phase;
[0011] Step S1: Feed the degraded input image into the shared feature encoder to extract multi-scale visual features;
[0012] Step S2: Input the multi-scale visual features into the image reconstruction head to obtain an enhanced image, and maintain scene texture, edge and contour information through pixel consistency constraints and structural consistency constraints;
[0013] Step S3: Input the shared features into the feature consistency head to obtain the feature representation related to the visual SLAM front end. By constraining the difference between the degraded image and the enhanced image in the feature space, the destruction of the key feature distribution by the enhancement process is suppressed.
[0014] Step S4: Input the shared features into the geometric supervision head to predict the scene's geometric or depth representation, and use depth supervision data aligned with the current image as geometric anchors to constrain it in order to suppress structural drift during the enhancement process;
[0015] Step S5: Construct the teacher network and the student network. The teacher network does not participate in gradient backpropagation, but is updated through the exponential moving average of the student network parameters to provide a time-stable reference target for feature consistency constraints and geometric supervision constraints.
[0016] Step S6: The image reconstruction loss, feature consistency loss, geometric supervision loss, and teacher-student consistency loss are weighted and combined, and the weight of each loss is dynamically scheduled in stages during training to obtain the trained enhancement model.
[0017] Step S7: During the deployment phase, the trained enhancement model is used to enhance the real-time acquired degraded images, and the enhancement results are directly input into the classic feature-based visual SLAM system to improve the localization accuracy, tracking stability and geometric constraint quality without changing the original front-end and back-end structure of SLAM.
[0018] Through the above design, this invention constructs a multi-head network consisting of a shared feature encoder, an image reconstruction head, a feature consistency head, and a geometric supervision head. The degraded input image is processed by the shared feature encoder to extract multi-scale features, which are then fed into three task branches. The image reconstruction head is used to restore the enhanced image, the feature consistency head is used to extract mid-to-high-level feature representations related to visual SLAM, and the geometric supervision head is used to predict depth or geometric structure representations. During the training phase, network parameters are collaboratively optimized through four types of constraints: pixel-to-structure consistency, feature consistency, geometric supervision, and teacher-student consistency. During the deployment phase, the trained enhancement model is integrated as an independent front-end module into a classic feature-based visual SLAM system.
[0019] The method provided by this invention can integrate the enhancement module as a plug-and-play module without changing the classic feature-based SLAM pipeline, and is suitable for image processing and localization mapping under complex weather conditions.
[0020] Preferably, in the preprocessing step, the image acquisition unit sets degradation conditions and constructs a set of paired degraded-clean image sequences, specifically including the following steps:
[0021] S-A1: The image acquisition unit first sets the degradation synthesis operator in the CARLA simulation environment as follows:
[0022] $I_d = \mathcal{G}(I_c, D; \boldsymbol{\phi})$;
[0023] where $I_d$ is the degraded image; $\mathcal{G}(\cdot)$ is the degradation operator that combines depth-guided rain streak rendering with low-light perturbation; $I_c$ is the clean reference image; $D$ is the depth ground truth used for supervision; and $\boldsymbol{\phi}$ denotes the degradation control parameters, namely the adjustable parameters that affect the visual quality of the images acquired in the simulation environment, such as rainfall intensity, exposure, noise amplitude, and cloud thickness.
[0024] S-A2: The image acquisition unit acquires images under the above degradation conditions and enforces spatiotemporal alignment by sampling along the same path and in the same pose, formally:
[0025] $T^{d}_{t} = T^{c}_{t}$;
[0026] where $T^{d}_{t}$ is the camera pose of the degraded image at frame $t$ and $T^{c}_{t}$ is the camera pose of the clean reference image at the same frame;
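[0027] S-A3: The image acquisition unit generates a training sample set from the spatiotemporally aligned images:
[0028] $\mathcal{S} = \{(I_d^{(i)}, I_c^{(i)}, D^{(i)})\}_{i=1}^{N}$, where each sample groups a degraded image, its clean reference image, and its depth ground truth, and $N$ is the number of samples;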
[0029] Through the above design, a controllable degraded image can be generated by mapping clean reference images and ground truth depth data in a simulation environment using a unified degradation synthesis model. By using the registration constraint of "same path, same pose, same timestamp", the degraded image, clean reference image and depth supervision data can be strictly aligned in time and space. A set of paired training samples with consistent registration of pixel supervision, structural supervision and geometric supervision can be constructed, providing a high-quality, adjustable and reproducible training data foundation for image enhancement models for visual SLAM.
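The pairing described above can be illustrated with a minimal Python sketch. It is not the CARLA-based pipeline of the invention: the `degrade` function is a hypothetical toy stand-in for the degradation operator (darkening, depth-dependent rain pixels, Gaussian noise), images are plain nested lists, and the names `Sample`, `degrade`, and `build_samples` as well as all default values are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # grayscale image as rows of values in [0, 1]


@dataclass
class Sample:
    degraded: Image
    clean: Image
    depth: Image
    frame_id: int  # shared index, standing in for "same path, same pose, same timestamp"


def degrade(clean: Image, depth: Image, exposure: float, noise_std: float,
            rain_prob: float, rng: random.Random) -> Image:
    """Toy degradation: darken, add depth-dependent rain pixels, add Gaussian noise."""
    degraded = []
    for clean_row, depth_row in zip(clean, depth):
        row = []
        for value, z in zip(clean_row, depth_row):
            v = value * exposure                 # low-light perturbation
            if rng.random() < rain_prob:
                v += 0.5 / (1.0 + z)             # nearer surfaces get brighter streak pixels
            v += rng.gauss(0.0, noise_std)       # sensor noise
            row.append(min(1.0, max(0.0, v)))    # clamp to the valid range
        degraded.append(row)
    return degraded


def build_samples(cleans: List[Image], depths: List[Image], exposure: float = 0.25,
                  noise_std: float = 0.03, rain_prob: float = 0.1,
                  seed: int = 0) -> List[Sample]:
    """Pair every clean frame with its depth map and a degraded copy of itself."""
    rng = random.Random(seed)
    return [
        Sample(degrade(c, d, exposure, noise_std, rain_prob, rng), c, d, i)
        for i, (c, d) in enumerate(zip(cleans, depths))
    ]
```

Because the degraded frame is derived from the clean frame and its depth map, the three items of each sample are aligned by construction, and a fixed seed makes the set reproducible.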
[0030] Preferably, in step S1, the shared feature encoder uses a lightweight U-Net-style encoding structure to perform layer-by-layer downsampling and encoding, and its output features are shared by the three subsequent heads. Specifically, this includes the following steps:
[0031] S-B1: Layer 0 of the shared encoder encodes pixel-space information into the feature space using convolution operations while keeping the spatial resolution unchanged:
[0032] $F_0 = \mathcal{C}_0(I_d)$;
[0033] where $F_0$ is the shallow feature map and $\mathcal{C}_0(\cdot)$ is the first convolutional block of the shared encoder;
[0034] S-B2: The shared encoder performs layer-by-layer downsampling encoding on the input image features and outputs the deepest bottleneck feature:
[0035] $F_l = \mathcal{C}_l(\mathrm{Down}(F_{l-1})), \quad l = 1, \dots, L$;
[0036] where $F_l$ are the skip features obtained from downsampling encoding, $l = 1, \dots, L$ indexes the downsampling layers, $L$ is the number of downsampling layers, $\mathrm{Down}(\cdot)$ is the downsampling operation, and $\mathcal{C}_l(\cdot)$ is a convolutional block. The deepest bottleneck output is:
[0037] $F_b = F_L$;
[0038] S-B3: The shared encoder passes the bottleneck feature and the layer-by-layer skip features to the three subsequent heads:
[0039] $\hat{I} = H_{\mathrm{rec}}(F_b, \{F_l\}), \quad Z = H_{\mathrm{feat}}(F_b), \quad G = H_{\mathrm{geo}}(F_b)$;
[0040] where $\hat{I}$ is the enhanced output image produced by the image reconstruction head, $H_{\mathrm{rec}}$ is the mapping function of the image reconstruction head, $Z$ is the dense feature representation produced by the feature consistency head, $H_{\mathrm{feat}}$ is the mapping function of the feature consistency head, $G$ is the geometric feature map output by the geometric supervision head, and $H_{\mathrm{geo}}$ is the mapping function of the geometric supervision head.
[0041] Through the above design, shallow texture edge information and deep semantic structure information can be preserved simultaneously in the shared encoder using a lightweight U-Net-style layer-by-layer downsampling encoding method. Furthermore, bottleneck features and multi-scale skip features are uniformly distributed to the image reconstruction head, feature consistency head, and geometric supervision head, thereby achieving multi-task collaborative optimization under the same shared representation.
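The layer-by-layer structure of the shared encoder can be sketched as follows. This is only an illustration of the data flow, not the network of the invention: the learned convolutional blocks are replaced by a fixed pointwise ReLU placeholder, downsampling is assumed to be 2x2 average pooling, and the names `conv_block`, `downsample`, and `encode` are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # single-channel feature map as rows of values


def conv_block(x: FeatureMap) -> FeatureMap:
    """Placeholder for a learned convolutional block: pointwise ReLU, resolution unchanged."""
    return [[max(0.0, v) for v in row] for row in x]


def downsample(x: FeatureMap) -> FeatureMap:
    """2x2 average pooling with stride 2 (assumes even height and width)."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def encode(image: FeatureMap, num_levels: int) -> Tuple[List[FeatureMap], FeatureMap]:
    """Return (skip features at scales 0..L-1, bottleneck feature at scale L)."""
    features = [conv_block(image)]                 # layer 0 keeps the input resolution
    for _ in range(num_levels):
        features.append(conv_block(downsample(features[-1])))
    return features[:-1], features[-1]
```

With a 16x16 input and four downsampling layers, the skip features have side lengths 16, 8, 4, and 2, and the bottleneck is 1x1; all three heads would read from this one set of features.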
[0042] Preferably, in step S2, the image reconstruction head maintains the stability of scene edges and local structures while removing rain streaks and low-light noise, specifically including the following steps:
[0043] S-C1: The image reconstruction head performs feature expansion on the bottleneck feature output by the shared encoder:
[0044] $U_L = \Phi_{\mathrm{exp}}(F_b)$;
[0045] where $U_L$ is the initial feature of the image reconstruction head at the deepest scale (layer $L$) obtained after feature expansion, and $\Phi_{\mathrm{exp}}(\cdot)$ is the feature expansion module, which applies a learned transformation to the bottleneck feature $F_b$;
[0046] S-C2: The image reconstruction head upsamples layer by layer and fuses the decoded features with the layer-by-layer skip features:
[0047] $U_l = \Phi_{\mathrm{fuse}}(\mathrm{Concat}(\tilde{U}_l, F_l))$;
[0048] $\tilde{U}_l = \mathrm{Up}(U_{l+1})$;
[0049] where $U_l$ is the decoded feature of the image reconstruction head at layer $l$ after fusion with the encoder feature $F_l$ of the same scale, $\Phi_{\mathrm{fuse}}(\cdot)$ is the feature fusion module, $\mathrm{Concat}(\cdot)$ is channel concatenation, $\tilde{U}_l$ is the candidate decoded feature obtained by upsampling the decoded feature $U_{l+1}$ of the previous layer to the spatial scale of layer $l$, and $\mathrm{Up}(\cdot)$ is the upsampling operation;
[0050] S-C3: The image reconstruction head maps the fused shallowest-layer decoded feature back to the RGB image space to obtain the enhanced image:
[0051] $\hat{I} = \Phi_{\mathrm{out}}(U_0)$;
[0052] where $\hat{I}$ is the enhanced output image and $\Phi_{\mathrm{out}}(\cdot)$ is the output mapping module;
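[0053] S-C4: The image reconstruction head loss comprises a pixel reconstruction loss, a gradient structure consistency loss, a structural similarity loss, and a total variation regularization term. Let the set of pixel coordinates be:
[0054] $\Omega = \{(x, y) \mid 1 \le x \le W,\ 1 \le y \le H\}$;
[0055] The losses of the image reconstruction head are then:
[0056] $\mathcal{L}_{\mathrm{pix}} = \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} \lVert \hat{I}(x,y) - I_c(x,y) \rVert_1$;
[0057] $\mathcal{L}_{\mathrm{grad}} = \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} \lVert \nabla \hat{I}(x,y) - \nabla I_c(x,y) \rVert_1$;
[0058] $\mathrm{SSIM}(\hat{I}, I_c) = \frac{(2\mu_{\hat{I}}\mu_{I_c} + C_1)(2\sigma_{\hat{I} I_c} + C_2)}{(\mu_{\hat{I}}^2 + \mu_{I_c}^2 + C_1)(\sigma_{\hat{I}}^2 + \sigma_{I_c}^2 + C_2)}$;
[0059] $\mathcal{L}_{\mathrm{ssim}} = -\log \mathrm{SSIM}(\hat{I}, I_c)$;
[0060] $\mathcal{L}_{\mathrm{tv}} = \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} \left( \lvert \nabla_x \hat{I}(x,y) \rvert + \lvert \nabla_y \hat{I}(x,y) \rvert \right)$;
[0061] where $x$ and $y$ are the horizontal and vertical pixel coordinates, $W$ and $H$ are the image width and height, $\mathcal{L}_{\mathrm{pix}}$ is the pixel reconstruction loss, $|\Omega|$ is the number of elements in the pixel set, $\mathcal{L}_{\mathrm{grad}}$ is the gradient structure consistency loss, $\nabla$ is the gradient operator, $\mathrm{SSIM}$ is the structural similarity index, $\mu_{\hat{I}}$ and $\mu_{I_c}$ are the means of $\hat{I}$ and $I_c$ within a local window, $\sigma_{\hat{I}}^2$ and $\sigma_{I_c}^2$ are their variances within a local window, $\sigma_{\hat{I} I_c}$ is their covariance within a local window, $C_1$ and $C_2$ are the stability constants in SSIM, $\mathcal{L}_{\mathrm{ssim}}$ is the log-SSIM structural loss, $\mathcal{L}_{\mathrm{tv}}$ is the total variation regularization term, and $\nabla_x$ and $\nabla_y$ are the difference operators in the $x$ and $y$ directions. The total loss of the image reconstruction head is:
[0062] $\mathcal{L}_{\mathrm{rec}} = \lambda_{\mathrm{pix}} \mathcal{L}_{\mathrm{pix}} + \lambda_{\mathrm{grad}} \mathcal{L}_{\mathrm{grad}} + \lambda_{\mathrm{ssim}} \mathcal{L}_{\mathrm{ssim}} + \lambda_{\mathrm{tv}} \mathcal{L}_{\mathrm{tv}}$;
[0063] where $\lambda_{\mathrm{pix}}$, $\lambda_{\mathrm{grad}}$, $\lambda_{\mathrm{ssim}}$, and $\lambda_{\mathrm{tv}}$ are the weighting coefficients of the respective terms.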
[0064] Through the above design, the image reconstruction head restores details and preserves edge structure by fusing feature expansion and multi-scale upsampling; at the same time, it uses pixel, gradient, log-SSIM and TV joint loss constraints to suppress rain streaks, low light noise and artifacts, avoid over-smoothing and structural drift, so that the enhanced image can balance sharpness and structural stability, and provide reliable input for subsequent feature consistency and geometric supervision.
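The four reconstruction loss terms can be illustrated with a small pure-Python sketch on grayscale images stored as nested lists. Several choices here are assumptions rather than details from the text: the L1 form of the pixel and gradient terms, forward differences as the gradient operator, an SSIM computed over a single window covering the whole image, the usual constants C1 = 0.01^2 and C2 = 0.03^2 for a unit dynamic range, and the illustrative loss weights. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def l1_mean(a: Image, b: Image) -> float:
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def forward_diffs(img: Image) -> Tuple[Image, Image]:
    """Horizontal and vertical forward differences, zero on the last column and row."""
    h, w = len(img), len(img[0])
    dx = [[img[y][x + 1] - img[y][x] if x + 1 < w else 0.0 for x in range(w)] for y in range(h)]
    dy = [[img[y + 1][x] - img[y][x] if y + 1 < h else 0.0 for x in range(w)] for y in range(h)]
    return dx, dy


def gradient_loss(pred: Image, ref: Image) -> float:
    pdx, pdy = forward_diffs(pred)
    rdx, rdy = forward_diffs(ref)
    return l1_mean(pdx, rdx) + l1_mean(pdy, rdy)


def tv_loss(pred: Image) -> float:
    dx, dy = forward_diffs(pred)
    n = len(pred) * len(pred[0])
    return sum(abs(a) + abs(b) for ra, rb in zip(dx, dy) for a, b in zip(ra, rb)) / n


def global_ssim(a: Image, b: Image, c1: float = 1e-4, c2: float = 9e-4) -> float:
    """SSIM over one window that covers the whole image (a simplification)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((v - mx) ** 2 for v in xs) / n
    vy = sum((v - my) ** 2 for v in ys) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


def log_ssim_loss(pred: Image, ref: Image) -> float:
    return -math.log(max(global_ssim(pred, ref), 1e-6))  # floor avoids log of non-positive values


def reconstruction_loss(pred: Image, ref: Image, w_pix: float = 1.0, w_grad: float = 0.5,
                        w_ssim: float = 0.5, w_tv: float = 0.01) -> float:
    """Weighted sum of the four terms; the weights are illustrative only."""
    return (w_pix * l1_mean(pred, ref) + w_grad * gradient_loss(pred, ref)
            + w_ssim * log_ssim_loss(pred, ref) + w_tv * tv_loss(pred))
```

When the prediction equals the reference, the pixel, gradient, and log-SSIM terms vanish and only the total variation term remains, which shows why its weight is normally kept small: it penalizes genuine edges as well as noise.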
[0065] Preferably, in step S3, the feature consistency head maps the shared bottleneck feature to a dense feature representation (slam_feat) closely related to visual SLAM through projection and embedding operations, and applies a feature consistency constraint between the degraded image and the image enhanced by the image reconstruction head, specifically including the following steps:
[0066] S-D1: The feature consistency head projects the bottleneck feature onto a specified feature channel dimension:
[0067] $P = W_p F_b + b_p$;
[0068] where $P$ is the intermediate projected feature, $W_p$ is the weight of the channel projection layer, and $b_p$ is the bias of the channel projection layer;
[0069] S-D2: The feature consistency head embeds the projected feature and normalizes its scale and distribution to obtain slam_feat:
[0070] $Z_d = \mathcal{N}(\Phi_{\mathrm{emb}}(P))$;
[0071] where $Z_d$ is the slam_feat corresponding to the degraded image $I_d$, $\mathcal{N}(\cdot)$ is the normalization operator, and $\Phi_{\mathrm{emb}}(\cdot)$ is the embedding transformation module;
[0072] S-D3: The enhanced image $\hat{I}$ output by the image reconstruction head is passed through the shared encoder again to obtain the bottleneck feature of the enhanced image:
[0073] $F_b^{e} = E(\hat{I}; \theta_E)$;
[0074] where $\theta_E$ is the parameter set of the shared encoder and $E(\cdot; \theta_E)$ denotes forward encoding by the shared encoder with parameters $\theta_E$. The slam_feat of the enhanced image is then:
[0075] $Z_e = H_{\mathrm{feat}}(F_b^{e})$;
[0076] where $H_{\mathrm{feat}}$ is the overall mapping function of the feature consistency head, taking the bottleneck feature as input and outputting slam_feat;
[0077] S-D4: The feature consistency head applies a position-wise L1 loss to the pixel-aligned dense features. Let $\Lambda$ be the set of feature grid coordinates; then:
[0078] $\mathcal{L}_{\mathrm{feat}} = \frac{1}{|\Lambda|} \sum_{u \in \Lambda} \lVert Z_e(u) - Z_d(u) \rVert_1$;
[0079] where $\mathcal{L}_{\mathrm{feat}}$ is the feature consistency loss and $u$ is a grid coordinate in the feature map;
[0080] S-D5: The feature consistency head uses a threshold gating function to delay activation of the feature constraint:
[0081] $g_{\mathrm{feat}}(p) = \begin{cases} 1, & p \ge p_{\mathrm{feat}} \\ 0, & p < p_{\mathrm{feat}} \end{cases}$;
[0082] where $g_{\mathrm{feat}}$ is the gating function of the feature consistency head, $p$ is the training progress variable, and $p_{\mathrm{feat}}$ is the training progress threshold at which the feature consistency constraint is enabled.
[0083] Through the above design, the feature consistency head projects, embeds, and normalizes the shared bottleneck features into slam_feat, and applies position-wise L1 consistency constraints to the corresponding dense features of the degraded and enhanced images, suppressing the destruction of key feature distribution and matching friendliness by the enhancement process; at the same time, it reduces early gradient conflicts by gating and delaying activation, improving the stability of multi-task training and the adaptability of the SLAM front-end.
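The position-wise L1 constraint and its delayed activation can be sketched in a few lines of Python. The sketch assumes that the normalization operator is an L2 normalization of each channel vector, which the text does not specify, and it represents a dense feature map as one channel vector per grid position. The names `l2_normalize`, `slam_feat`, `feature_consistency_loss`, and `gate` are hypothetical.

```python
import math
from typing import List

Feature = List[List[float]]  # one channel vector per feature-grid position


def l2_normalize(vector: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a channel vector to unit length (assumed form of the normalization operator)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


def slam_feat(projected: Feature) -> Feature:
    """Normalize every position of an already projected and embedded feature map."""
    return [l2_normalize(vec) for vec in projected]


def feature_consistency_loss(z_degraded: Feature, z_enhanced: Feature) -> float:
    """Mean over grid positions of the L1 distance between aligned channel vectors."""
    if len(z_degraded) != len(z_enhanced):
        raise ValueError("feature maps must cover the same grid")
    total = sum(abs(a - b) for va, vb in zip(z_degraded, z_enhanced) for a, b in zip(va, vb))
    return total / len(z_degraded)


def gate(progress: float, threshold: float) -> float:
    """Step gate: 0 before the threshold, 1 from the threshold onward."""
    return 1.0 if progress >= threshold else 0.0
```

During training, the loss contribution would be `gate(progress, threshold) * feature_consistency_loss(z_d, z_e)`, so the constraint contributes nothing until the reconstruction branch has had time to settle.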
[0084] Preferably, in step S4, the geometric supervision head uses the depth ground truth obtained from the simulation platform as the geometric supervision anchor and is constrained by a geometric consistency error, specifically including the following steps:
[0085] S-E1: The geometric supervision head outputs geometric features from the bottleneck feature output by the shared encoder:
[0086] $G = H_{\mathrm{geo}}(F_b)$;
[0087] where $G$ is the geometric feature map output by the geometric supervision head and $H_{\mathrm{geo}}$ is the mapping function of the geometric supervision head;
[0088] S-E2: The depth prediction is obtained by projecting the geometric feature map to depth:
[0089] $\hat{D} = \Pi(G) = d_{\min} + (d_{\max} - d_{\min})\, \sigma(W_d G + b_d)$;
[0090] where $\hat{D}$ is the predicted depth map output by the geometric supervision head, $\Pi(\cdot)$ is the depth projection function from geometric features to depth, $d_{\min}$ is the calibration constant for the minimum depth, $d_{\max}$ is the calibration constant for the maximum depth, $\sigma(\cdot)$ is the sigmoid function that compresses the linear output to $(0, 1)$, $W_d$ is the weight of the depth projection layer, and $b_d$ is the bias of the depth projection layer;
[0091] S-E3: The geometric supervision head computes the L1 depth regression error against the known depth ground truth:
[0092] $\mathcal{L}_{\mathrm{geo}} = \frac{1}{|\Omega|} \sum_{i \in \Omega} \lvert \hat{D}_i - D_i \rvert$;
[0093] where $|\Omega|$ is the number of elements in the pixel set, $i$ is the pixel position index, and $D_i$ is the depth ground truth at pixel $i$.
[0094] Through the above design, the geometric supervision head predicts geometric features using shared bottleneck features and outputs a depth map through depth projection. It uses the simulated synchronous depth ground truth as a geometric anchor point to apply L1 regression constraints, so that the enhancement process remains consistent at the three-dimensional structure level, suppressing implicit structural distortion and geometric drift. This improves the quality of subsequent SLAM front-end geometric constraints and the overall stability of localization and tracking.
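The bounded depth projection and the L1 depth regression can be sketched as follows. This is an illustrative sketch under stated assumptions: the projection layer is modeled as a per-pixel dot product between a weight vector and the pixel's geometric feature vector, and the depth bounds and names (`project_depth`, `depth_l1_loss`) are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def project_depth(geo_feat: List[List[float]], weight: List[float], bias: float,
                  d_min: float, d_max: float) -> List[float]:
    """Map one geometric feature vector per pixel to a depth inside (d_min, d_max)."""
    depths = []
    for vec in geo_feat:
        linear = sum(w * v for w, v in zip(weight, vec)) + bias
        depths.append(d_min + (d_max - d_min) * sigmoid(linear))
    return depths


def depth_l1_loss(pred: List[float], truth: List[float]) -> float:
    """Mean absolute depth error over all pixels."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)
```

The sigmoid keeps every prediction inside the calibrated depth range regardless of how large the linear output becomes, so the regression target never has to compete with physically impossible depths.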
[0095] This invention optimizes images in multiple dimensions by designing multi-head consistency constraints. While suppressing rain patterns, low-light noise and other visual degradations, it maintains the scene edge and contour structure and constrains the distance between the degraded image and the enhanced image in the feature space. This reduces the damage to the distribution of key visual SLAM features during the enhancement process and avoids implicit geometric distortions in the enhancement results through structural constraints, thus promoting image optimization towards SLAM features.
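[0096] Preferably, in step S5, the teacher network parameters are obtained as an exponential moving average (EMA) of the current student network parameters with a preset decay coefficient, specifically including the following steps:
[0097] S-F1: The teacher network obtains its parameter set by an EMA update from the student parameters:
[0098] $\theta_T \leftarrow \alpha\, \theta_T + (1 - \alpha)\, \theta_S$;
[0099] where $\theta_T$ is the parameter set of the teacher network, $\theta_S$ is the parameter set of the student network, and $\alpha$ is the EMA decay coefficient;
[0100] S-F2: The teacher network outputs the teacher feature references, and a stop-gradient operation ensures that no gradient is backpropagated into the teacher parameters:
[0101] $Z_T = \operatorname{sg}\big(Z(\theta_T)\big), \quad \hat{D}_T = \operatorname{sg}\big(\hat{D}(\theta_T)\big)$;
[0102] where $Z_T$ is the slam_feat output by the teacher, $\hat{D}_T$ is the predicted depth map output by the teacher, and $Z(\theta_T)$ and $\hat{D}(\theta_T)$ denote the slam_feat and predicted depth map computed with the teacher parameters;
[0103] S-F3: The teacher-student consistency loss is computed so that features and geometry are constrained simultaneously:
[0104] $\mathcal{L}_{\mathrm{ts}} = \frac{1}{|\Lambda|} \sum_{u \in \Lambda} \lVert Z_e(u) - \operatorname{sg}(Z_T(u)) \rVert_1 + \frac{1}{|\Omega|} \sum_{i \in \Omega} \lvert \hat{D}_i - \operatorname{sg}(\hat{D}_{T,i}) \rvert$;
[0105] where $\mathcal{L}_{\mathrm{ts}}$ is the teacher-student consistency regularization term, $Z_e$ is the slam_feat of the student network in the enhanced image domain, $\hat{D}$ is the depth map predicted by the student network, and $\operatorname{sg}(\cdot)$ is the stop-gradient operator.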
[0106] Through the above design, the teacher network is updated smoothly by EMA and serves as a stable reference. Together with the stop-gradient consistency regularization on features and geometry, this reduces training oscillation and degenerate solutions and improves temporal stability.
[0107] This invention introduces an EMA teacher-student mechanism. The student network participates in regular forward propagation and gradient backpropagation, while the teacher network does not participate in backpropagation. Instead, it updates the teacher network through an exponential moving average of the historical parameters of the student network, providing a time-stable reference target for feature consistency constraints and geometric supervision constraints.
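The EMA update of the teacher parameters can be shown with a minimal sketch that stores parameters as a dictionary of named scalars. The function name `ema_update` and the decay value used in the usage note are assumptions; a real network would apply the same rule to every tensor.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float) -> Dict[str, float]:
    """Return new teacher parameters: decay * teacher + (1 - decay) * student."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {name: decay * teacher[name] + (1.0 - decay) * student[name] for name in teacher}
```

For example, `ema_update({"w": 1.0}, {"w": 0.0}, 0.9)` moves the teacher weight one tenth of the way toward the student. Since the teacher is only ever written by this rule and never by an optimizer step, it changes slowly and can serve as a stable target for the consistency terms.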
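[0108] Preferably, in step S6, the dynamic weight scheduling strategy lets the network converge to a stable image restoration space in the early stage of training and gradually introduces the feature consistency loss and the geometric supervision loss in the later stage of training for higher-level refinement, specifically including the following steps:
[0109] S-G1: The total network loss is first defined as follows:
[0110] $\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} + g_{\mathrm{feat}}(p)\, \lambda_{\mathrm{feat}} \mathcal{L}_{\mathrm{feat}} + g_{\mathrm{geo}}(p)\, \lambda_{\mathrm{geo}} \mathcal{L}_{\mathrm{geo}} + \lambda_{\mathrm{ts}} \mathcal{L}_{\mathrm{ts}}$;
[0111] where $\mathcal{L}_{\mathrm{total}}$ is the total loss, $\lambda_{\mathrm{rec}}$, $\lambda_{\mathrm{feat}}$, $\lambda_{\mathrm{geo}}$, and $\lambda_{\mathrm{ts}}$ are the weight coefficients of the respective terms, $g_{\mathrm{feat}}$ is the gating function of the feature consistency head, $g_{\mathrm{geo}}$ is the gating function of the geometric supervision head, and $\mathcal{L}_{\mathrm{geo}}$ is the geometric loss term;
[0112] S-G2: The gating function of the geometric supervision head is defined as:
[0113] $g_{\mathrm{geo}}(p) = \begin{cases} 1, & p \ge p_{\mathrm{geo}} \\ 0, & p < p_{\mathrm{geo}} \end{cases}$;
[0114] where $p$ is the training progress variable and $p_{\mathrm{geo}}$ is the training progress threshold at which the geometric supervision constraint is enabled.
[0115] S-G3: The dynamic weight scheduling function makes the feature and geometric weights increase gradually as training progresses.
[0116]
[0117] where $\lambda_{\mathrm{feat}}^{\max}$ and $\lambda_{\mathrm{geo}}^{\max}$ are the maximum weights of the feature consistency loss and the geometric supervision loss, respectively.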
[0118] Through the above design, gating combined with gradually increasing weights makes training focus on image reconstruction in the early stage so that it converges stably, and then gradually introduces feature consistency and geometric supervision for high-level refinement in the middle and later stages. This reduces gradient conflicts and training oscillation among the tasks, improves the feature matching and structural consistency of the enhancement results, and thus improves how well the output suits SLAM and how stable localization is.
[0119] This invention employs a dynamic weight scheduling strategy: in the early stages of training, image reconstruction-related constraints are prioritized to allow the network to enter a stable recovery space; in the later stages of training, the weights of feature consistency and geometric supervision loss are gradually increased, and the feature consistency head and geometric supervision head are activated sequentially according to the preset training stages to achieve orderly collaborative optimization among multiple task objectives.
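The staged schedule can be illustrated with a short sketch that combines a step gate with a weight that climbs to its maximum. The linear shape of the ramp is an assumption made for illustration, as are the thresholds, the ramp length, the maximum weights, the constant weight on the teacher-student term, and the names `Schedule`, `gate`, `ramp`, and `total_loss`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Schedule:
    feat_start: float = 0.2   # progress at which the feature consistency gate opens
    geo_start: float = 0.4    # progress at which the geometric supervision gate opens
    ramp_length: float = 0.2  # progress span over which a weight climbs to its maximum
    w_rec: float = 1.0
    w_feat_max: float = 0.5
    w_geo_max: float = 0.25
    w_ts: float = 0.1


def gate(progress: float, start: float) -> float:
    return 1.0 if progress >= start else 0.0


def ramp(progress: float, start: float, length: float, w_max: float) -> float:
    """Linear climb from 0 at `start` to `w_max` at `start + length` (assumed shape)."""
    fraction = (progress - start) / length
    return w_max * min(1.0, max(0.0, fraction))


def total_loss(losses: Dict[str, float], progress: float, s: Schedule) -> float:
    """Weighted sum of the four loss terms at a training progress in [0, 1]."""
    w_feat = gate(progress, s.feat_start) * ramp(progress, s.feat_start, s.ramp_length, s.w_feat_max)
    w_geo = gate(progress, s.geo_start) * ramp(progress, s.geo_start, s.ramp_length, s.w_geo_max)
    return (s.w_rec * losses["rec"] + w_feat * losses["feat"]
            + w_geo * losses["geo"] + s.w_ts * losses["ts"])
```

With these illustrative values, only the reconstruction and teacher-student terms are active at the start, the feature term is at half of its maximum weight at progress 0.3, and both gated terms have reached their maximum by the end of training.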
[0120] Preferably, in step S7, the trained enhancement model is used as a plug-and-play module that requires no changes to the classic SLAM structure during deployment. The enhancement model can be viewed as a deterministic front-end mapping, expressed as:
[0121] $\hat{I}_t = f(I_t; \theta^{*})$;
[0122] where $I_t$ is the degraded image acquired at time $t$, $\hat{I}_t$ is its enhanced counterpart, and $\theta^{*}$ denotes the parameters of the converged student network after training. The classic feature-based SLAM system can be treated as a black-box operator, expressed as:
[0123] $(\mathcal{T}, \mathcal{M}) = \mathrm{SLAM}(\{\hat{I}_t\})$;
[0124] where $\mathcal{T}$ is the trajectory pose sequence and $\mathcal{M}$ denotes internal states such as maps and keyframes.
[0125] Through the above design, the enhancement module can be plugged and played into SLAM without structural modifications, improving robust localization and tracking stability.
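The plug-and-play composition can be sketched as a wrapper that places an enhancement function in front of an unmodified tracker. Both components below are hypothetical stand-ins: `gamma_enhance` is a simple gamma curve used in place of the trained model, and `RecordingTracker` only records the frames it receives instead of running a real feature-based SLAM system.

```python
from typing import Callable, List, TypeVar

Frame = List[List[float]]
Pose = TypeVar("Pose")


def make_frontend(enhance: Callable[[Frame], Frame],
                  track: Callable[[Frame], Pose]) -> Callable[[Frame], Pose]:
    """Compose enhancement and tracking without touching the tracker itself."""
    def process(frame: Frame) -> Pose:
        return track(enhance(frame))
    return process


def gamma_enhance(frame: Frame, gamma: float = 0.5) -> Frame:
    """Stand-in for the trained enhancement model: brighten with a gamma curve."""
    return [[value ** gamma for value in row] for row in frame]


class RecordingTracker:
    """Stand-in for a black-box SLAM system: stores inputs and returns a frame index."""

    def __init__(self) -> None:
        self.received: List[Frame] = []

    def __call__(self, frame: Frame) -> int:
        self.received.append(frame)
        return len(self.received) - 1
```

The tracker sees only enhanced frames and needs no knowledge of the enhancement step, which is the sense in which the module is plug-and-play.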
[0126] The beneficial effects of this invention are as follows: This invention provides a multi-head feature-geometric consistency task-oriented image enhancement method for visual SLAM. Without modifying the front-end and back-end structures of classic feature-based visual SLAM, the enhancement module is deployed as an independent preprocessing unit, which is highly compatible and easy to implement in engineering. Through joint modeling of three types of constraints (image reconstruction, feature consistency, and geometric supervision), the enhancement result no longer only pursues subjective visual quality, but explicitly serves the feature stability and geometric estimation requirements of visual SLAM. Depth supervision is used as a geometric anchor point to effectively suppress structural drift during the enhancement process and reduce abnormal fluctuations in front-end geometric constraints. At the same time, the combination of the EMA teacher-student mechanism and dynamic weight scheduling strategy improves the stability and convergence reliability of multi-task joint training, thereby significantly improving the localization accuracy, tracking stability, and front-end constraint quality of the visual SLAM system in degraded scenarios such as rain, low light, and complex lighting changes.
Attached Figure Description
[0127] Figure 1 is a schematic diagram of the overall method flow of the present invention;
[0128] Figure 2 is a nighttime rain image captured in an example of the present invention;
[0129] Figure 3 is a daytime visible rain image collected in an example of the present invention;
[0130] Figure 4 is a schematic diagram of the complete enhancement network in this invention.
Detailed Implementation
[0131] The technical solution of the present invention will be further described below with reference to the accompanying drawings and embodiments, but the scope of protection of the present invention is not limited to the following embodiments. Any equivalent substitutions made based on the concept of the present invention to the network structure, loss term form, training phase division, and deployment method shall fall within the scope of protection of the present invention.
[0132] As shown in Figure 1, a multi-head feature-geometric consistency image enhancement method for visual SLAM includes the following steps:
[0133] Preprocessing: Construct a sequence of degraded visual images in a simulated environment, and establish clean reference images and depth supervision data that are spatiotemporally aligned with the degraded images for the training phase;
[0134] Step S1: Feed the degraded input image into the shared feature encoder to extract multi-scale visual features;
[0135] Step S2: Input the multi-scale visual features into the image reconstruction head to obtain an enhanced image, and maintain scene texture, edge and contour information through pixel consistency constraints and structural consistency constraints;
[0136] Step S3: Input the shared features into the feature consistency head to obtain the feature representation related to the visual SLAM front end. By constraining the difference between the degraded image and the enhanced image in the feature space, the destruction of the key feature distribution by the enhancement process is suppressed.
[0137] Step S4: Input the shared features into the geometric supervision head to predict the scene's geometric or depth representation, and use depth supervision data aligned with the current image as geometric anchors to constrain it in order to suppress structural drift during the enhancement process;
[0138] Step S5: Construct the teacher network and the student network. The teacher network does not participate in gradient backpropagation, but is updated through the exponential moving average of the student network parameters to provide a time-stable reference target for feature consistency constraints and geometric supervision constraints.
[0139] Step S6: The image reconstruction loss, feature consistency loss, geometric supervision loss, and teacher-student consistency loss are weighted and combined, and the weight of each loss is dynamically scheduled in stages during training to obtain the trained enhancement model.
[0140] Step S7: During the deployment phase, the trained enhancement model is used to enhance the real-time acquired degraded images, and the enhancement results are directly input into the classic feature-based visual SLAM system to improve the localization accuracy, tracking stability and geometric constraint quality without changing the original front-end and back-end structure of SLAM.
[0141] Furthermore, in the preprocessing step, the image acquisition unit sets degradation conditions and constructs a set of paired degradation-cleaning image sequences, specifically including the following steps:
[0142] S-A1: The image acquisition unit first sets the degradation synthesis operator in the CARLA simulation environment as follows:
[0143] ;
[0144] in, For degraded images, This indicates a degradation resulting from a combination of depth-guided rain wisps rendering and low-light perturbation. To clean the reference image, For deep supervision of truth value, The degradation control parameters are as follows: rain streak rendering intensity 1.2, rain streak quantity 4500, exposure factor 0.25, photon noise 0.03, Jiaxing Gaussian noise floor 0.005, cloud thickness 100, rain streak length range 8~28, wind direction tilt y-axis 10, x-axis 1, rain streak blur x-axis 1, y-axis 2.5, near blur 2, far blur 1.
[0145] S-A2: The image acquisition unit acquires images under the above degradation conditions and enforces spatiotemporal alignment by sampling along the same path with the same poses, formally:
[0146] $$T_t^{d} = T_t^{c}, \quad \forall t$$;
[0147] where $T_t^{d}$ is the camera pose corresponding to degraded image frame $t$, and $T_t^{c}$ is the camera pose corresponding to the clean reference image of the same frame.
[0148] S-A3: The image acquisition unit generates a training sample set from the spatiotemporally aligned images.
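The alignment condition and sample construction of steps S-A2 and S-A3 can be sketched as below. This is a minimal illustration only: the pose representation (a flat tuple of numbers), the tolerance, and the function names `poses_aligned` and `build_samples` are assumptions, not part of the embodiment.

```python
def poses_aligned(degraded_poses, clean_poses, tol=1e-6):
    """Return True when every frame pair shares the same camera pose.

    Each pose is a flat tuple of numbers, e.g. (x, y, z, roll, pitch, yaw).
    """
    if len(degraded_poses) != len(clean_poses):
        return False
    for pose_d, pose_c in zip(degraded_poses, clean_poses):
        if len(pose_d) != len(pose_c):
            return False
        if any(abs(a - b) > tol for a, b in zip(pose_d, pose_c)):
            return False
    return True


def build_samples(degraded, clean, depth, degraded_poses, clean_poses):
    """Pair frames into (degraded, clean, depth) triples after the pose check."""
    if not poses_aligned(degraded_poses, clean_poses):
        raise ValueError("degraded and clean sequences are not pose-aligned")
    return list(zip(degraded, clean, depth))
```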
[0149] The acquired images are shown in Figures 2 and 3.
[0150] Furthermore, in step S1, the shared feature encoder uses a lightweight U-Net-style encoder structure to perform layer-by-layer downsampling and encoding, and its output features are shared by the three subsequent heads. Specifically, this includes the following steps:
[0151] S-B1: The shared encoder's layer 0 encodes pixel space information into the feature space using convolution operations while maintaining the same spatial resolution.
[0152] $$F_0 = E_0(I_d)$$;
[0153] where $F_0$ is the shallow feature map, $I_d$ is the degraded input image, and $E_0(\cdot)$ is the first-layer (layer 0) convolutional block of the shared encoder.
[0154] S-B2: The shared encoder performs layer-by-layer downsampling encoding on the input image features and outputs the deepest bottleneck feature:
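[0155] $$F_l = \mathcal{C}_l\big(\mathrm{Down}(F_{l-1})\big), \quad l = 1, \dots, L$$;
[0156] where $F_l$ are the skip features obtained by downsampling encoding (with $F_0$ the shallow feature map from S-B1), $l = 1, \dots, L$ indexes the downsampling layers, $\mathrm{Down}(\cdot)$ is the downsampling operation, and $\mathcal{C}_l(\cdot)$ is a convolutional block. The deepest bottleneck output is:
[0157] $$B = F_L$$;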
[0158] S-B3: The shared encoder outputs the obtained bottleneck features and layer-by-layer skip features to the subsequent three heads:
[0159] $$\hat{I} = H_{rec}(B, \{F_l\}), \quad S_d = H_{feat}(B), \quad G = H_{geo}(B)$$;
[0160] where $B$ is the bottleneck feature and $\{F_l\}$ are the layer-by-layer skip features; $\hat{I}$ is the enhanced output image produced by the image reconstruction head; $H_{rec}$ is the image reconstruction head mapping function; $S_d$ is the dense feature representation produced by the feature consistency head; $H_{feat}$ is the feature consistency head mapping function; $G$ is the geometric feature map output by the geometric supervision head; and $H_{geo}$ is the geometric supervision head mapping function. $L$ is set to 4 in this example.
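To make the scale structure of the shared encoder concrete, the following sketch lists the feature map sizes for four downsampling layers. It assumes stride-2 downsampling, channel doubling per layer, a base width of 32 channels, and a 480 x 640 input; none of these values is stated in the embodiment, and `encoder_scales` is a hypothetical helper.

```python
def encoder_scales(height, width, num_levels=4, base_channels=32):
    """List (level, height, width, channels) for layers 0 through num_levels.

    Assumes stride-2 downsampling and channel doubling per level; neither
    the stride nor the channel widths are stated in the source.
    """
    scales = [(0, height, width, base_channels)]
    for level in range(1, num_levels + 1):
        _, h, w, c = scales[-1]
        scales.append((level, h // 2, w // 2, c * 2))
    return scales


scales = encoder_scales(480, 640)
bottleneck = scales[-1]   # deepest feature, shared by all three heads
skips = scales[:-1]       # shallower features reused by the reconstruction head
```

Under these assumptions the bottleneck is 16 times smaller than the input along each spatial axis, which is why the feature consistency and geometric supervision heads operate on a coarse grid while only the reconstruction head returns to full resolution.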
[0161] Furthermore, in step S2, the image reconstruction head preserves the stability of scene edges and local structures while removing rain streaks and low-light noise, specifically including the following steps:
[0162] S-C1: The image reconstruction head performs feature expansion on the bottleneck features output by the shared encoder.
[0163] $$U_L = \Phi_{exp}(B)$$;
[0164] where $U_L$ is the initial feature of the image reconstruction head at the deepest scale (layer $L$) obtained after feature expansion, and $\Phi_{exp}(\cdot)$ is the feature expansion module, which applies a learnable transformation to the bottleneck feature $B$.
[0165] S-C2: The image reconstruction head performs layer-by-layer upsampling and fusion of bottleneck features and layer-by-layer skip features.
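[0166] $$\tilde{U}_l = \mathrm{Up}(U_{l+1})$$;
[0167] $$U_l = \mathcal{F}_l\big(\mathrm{Concat}(\tilde{U}_l, F_l)\big)$$;
[0168] where $U_l$ is the decoded feature of the image reconstruction head at layer $l$ after fusion with the encoder-side feature $F_l$ of the same scale, $\mathcal{F}_l(\cdot)$ is the feature fusion module, $\mathrm{Concat}(\cdot)$ is channel concatenation, $\tilde{U}_l$ is the candidate decoded feature obtained by upsampling the previous layer's decoded feature $U_{l+1}$ to the spatial scale of layer $l$, and $\mathrm{Up}(\cdot)$ is the upsampling operation.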
[0169] S-C3: The image reconstruction head maps the fused shallowest layer decoded features back to the RGB image space to obtain an enhanced image.
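[0170] $$\hat{I} = \Psi_{out}(U_0)$$;
[0171] where $\hat{I}$ is the enhanced output image, $U_0$ is the fused shallowest-layer decoded feature, and $\Psi_{out}(\cdot)$ is the output mapping module.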
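[0172] S-C4: The image reconstruction head loss comprises a pixel reconstruction loss, a gradient structure consistency loss, a structural similarity loss, and a total variation regularization term. Let the set of pixel coordinates be:
[0173] $$\Omega = \{(x, y) \mid 1 \le x \le W,\ 1 \le y \le H\}$$;
[0174] The losses of the image reconstruction head are then:
[0175] $$\mathcal{L}_{pix} = \frac{1}{|\Omega|} \sum_{(x,y)\in\Omega} \big| \hat{I}(x,y) - I_c(x,y) \big|$$;
[0176] $$\mathcal{L}_{grad} = \frac{1}{|\Omega|} \sum_{(x,y)\in\Omega} \big\| \nabla \hat{I}(x,y) - \nabla I_c(x,y) \big\|_1$$;
[0177] $$\mathrm{SSIM}(\hat{I}, I_c) = \frac{(2\mu_{\hat{I}}\mu_{I_c} + C_1)(2\sigma_{\hat{I}I_c} + C_2)}{(\mu_{\hat{I}}^2 + \mu_{I_c}^2 + C_1)(\sigma_{\hat{I}}^2 + \sigma_{I_c}^2 + C_2)}$$;
[0178] $$\mathcal{L}_{ssim} = -\log \mathrm{SSIM}(\hat{I}, I_c)$$;
[0179] $$\mathcal{L}_{tv} = \frac{1}{|\Omega|} \sum_{(x,y)\in\Omega} \big( |\nabla_x \hat{I}(x,y)| + |\nabla_y \hat{I}(x,y)| \big)$$;
[0180] where $x$ is the horizontal pixel coordinate, $y$ is the vertical pixel coordinate, $W$ is the image width, and $H$ is the image height; $\hat{I}$ is the enhanced image and $I_c$ is the clean reference image; $\mathcal{L}_{pix}$ is the pixel reconstruction loss and $|\Omega|$ is the number of elements in the pixel set; $\mathcal{L}_{grad}$ is the gradient structure consistency loss and $\nabla$ is the gradient operator; $\mathrm{SSIM}$ is the structural similarity index; $\mu_{\hat{I}}$ and $\mu_{I_c}$ are the means of $\hat{I}$ and $I_c$ within a local window, $\sigma_{\hat{I}}^2$ and $\sigma_{I_c}^2$ are their variances within a local window, $\sigma_{\hat{I}I_c}$ is their covariance within a local window, and $C_1$, $C_2$ are the stability constants in SSIM; $\mathcal{L}_{ssim}$ is the log-SSIM structural loss; $\mathcal{L}_{tv}$ is the total variation regularization term; and $\nabla_x$, $\nabla_y$ are the difference operators in the x and y directions, respectively. The total loss of the image reconstruction head is:
[0181] $$\mathcal{L}_{rec} = \lambda_{pix}\mathcal{L}_{pix} + \lambda_{grad}\mathcal{L}_{grad} + \lambda_{ssim}\mathcal{L}_{ssim} + \lambda_{tv}\mathcal{L}_{tv}$$;
[0182] where $\lambda_{pix}$, $\lambda_{grad}$, $\lambda_{ssim}$, and $\lambda_{tv}$ are the weight coefficients of the respective terms; in this example, $\lambda_{pix} = 1.0$, $\lambda_{grad} = 1.0$, $\lambda_{ssim} = 1.0$, and $\lambda_{tv} = 0.1$.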
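The four reconstruction terms and their weighted sum can be sketched for single-channel images stored as nested lists (at least 2 x 2). Several choices here are assumptions made for illustration: forward differences as the gradient and difference operators, an L1 form for the pixel and gradient terms, SSIM evaluated over the whole image as one window instead of local windows, the constants `c1` and `c2`, and a clamp before the logarithm. Only the weights 1.0, 1.0, 1.0, and 0.1 come from this example.

```python
import math


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def _flat(img):
    return [v for row in img for v in row]


def _dx(img):
    """Forward difference along x (columns)."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in img]


def _dy(img):
    """Forward difference along y (rows)."""
    return [[img[y + 1][x] - img[y][x] for x in range(len(img[0]))]
            for y in range(len(img) - 1)]


def pixel_loss(pred, ref):
    return _mean(abs(p - r) for p, r in zip(_flat(pred), _flat(ref)))


def gradient_loss(pred, ref):
    return pixel_loss(_dx(pred), _dx(ref)) + pixel_loss(_dy(pred), _dy(ref))


def ssim_global(pred, ref, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM with the whole image as a single window (a simplification)."""
    p, r = _flat(pred), _flat(ref)
    mu_p, mu_r = _mean(p), _mean(r)
    var_p = _mean((v - mu_p) ** 2 for v in p)
    var_r = _mean((v - mu_r) ** 2 for v in r)
    cov = _mean((a - mu_p) * (b - mu_r) for a, b in zip(p, r))
    return ((2 * mu_p * mu_r + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_r ** 2 + c1) * (var_p + var_r + c2))


def log_ssim_loss(pred, ref, eps=1e-6):
    # Clamp before the logarithm; the clamp value is an assumption.
    return -math.log(max(ssim_global(pred, ref), eps))


def tv_loss(pred):
    return (_mean(abs(v) for v in _flat(_dx(pred)))
            + _mean(abs(v) for v in _flat(_dy(pred))))


def reconstruction_loss(pred, ref, w_pix=1.0, w_grad=1.0, w_ssim=1.0, w_tv=0.1):
    return (w_pix * pixel_loss(pred, ref) + w_grad * gradient_loss(pred, ref)
            + w_ssim * log_ssim_loss(pred, ref) + w_tv * tv_loss(pred))
```

When the prediction equals the reference, the pixel, gradient, and log-SSIM terms vanish and only the total variation term remains, which illustrates why its weight is kept small relative to the other three.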
[0183] Furthermore, in step S3, the feature consistency head maps the shared bottleneck features, through projection and embedding operations, into a dense feature representation (slam_feat) closely related to visual SLAM, and applies the corresponding feature consistency constraint between the degraded image and the image enhanced by the image reconstruction head, specifically including the following steps:
[0184] S-D1: The feature consistency head projects the bottleneck feature onto a specified feature channel dimension:
[0185] $$Z = W_p B + b_p$$;
[0186] where $Z$ is the intermediate feature after projection, $B$ is the bottleneck feature, and $W_p$ and $b_p$ are the weight and bias parameters of the channel projection layer, respectively.
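[0187] S-D2: The feature consistency head embeds the projected feature and normalizes its scale and distribution to obtain slam_feat:
[0188] $$S_d = \mathrm{Norm}\big(\mathrm{Emb}(Z)\big)$$;
[0189] where $S_d$ is the slam_feat corresponding to the degraded image $I_d$, $Z$ is the intermediate projected feature, $\mathrm{Norm}(\cdot)$ is the normalization operator, and $\mathrm{Emb}(\cdot)$ is the embedding transformation module.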
[0190] S-D3: The enhanced image output by the image reconstruction head is passed through the shared encoder again to obtain the bottleneck feature of the enhanced image:
[0191] $$\hat{B} = E_{\theta_E}(\hat{I})$$;
[0192] where $\hat{I}$ is the enhanced image, $\theta_E$ is the parameter set of the shared encoder, and $E_{\theta_E}(\cdot)$ denotes forward encoding by the shared encoder with parameters $\theta_E$. The slam_feat of the enhanced image is then obtained as:
[0193] $$\hat{S} = H_{feat}(\hat{B})$$;
[0194] where $H_{feat}(\cdot)$ is the overall mapping function of the feature consistency head, taking the bottleneck feature as input and outputting slam_feat.
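[0195] S-D4: The feature consistency head applies a position-wise L1 constraint loss to the pixel-aligned dense features. Let $\Lambda$ be the set of feature grid coordinates:
[0196] $$\mathcal{L}_{feat} = \frac{1}{|\Lambda|} \sum_{p\in\Lambda} \big\| \hat{S}(p) - S_d(p) \big\|_1$$;
[0197] where $\mathcal{L}_{feat}$ is the feature consistency loss, $\hat{S}$ and $S_d$ are the slam_feat of the enhanced image and the degraded image, respectively, and $p$ is a grid coordinate in the feature map.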
[0198] S-D5: The feature consistency head uses a threshold gating function to delay activation of the feature constraint:
[0199] $$g_{feat}(e) = \begin{cases} 1, & e \ge e_{feat} \\ 0, & e < e_{feat} \end{cases}$$;
[0200] where $g_{feat}$ is the gating function of the feature consistency head, $e$ is the training progress variable, and $e_{feat}$ is the training progress threshold at which the feature consistency constraint is enabled; in this example, $e_{feat} = 10$.
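The position-wise L1 feature loss and the delayed-start gate can be sketched as follows. The data layout (a list of grid positions, each holding a channel vector), the function names, and the choice that the gate opens exactly at the threshold are assumptions; the unit of the training progress variable is likewise not fixed by the text.

```python
def feature_consistency_loss(feat_enhanced, feat_degraded):
    """Mean position-wise L1 distance between two dense feature maps.

    Each map is a list of grid positions; each position holds a channel vector.
    """
    if len(feat_enhanced) != len(feat_degraded):
        raise ValueError("feature maps must cover the same grid")
    total = 0.0
    for vec_e, vec_d in zip(feat_enhanced, feat_degraded):
        total += sum(abs(a - b) for a, b in zip(vec_e, vec_d))
    return total / len(feat_enhanced)


def feature_gate(progress, start=10):
    """Step gate: 0 before `start`, 1 from `start` onward (assumed inclusive)."""
    return 1.0 if progress >= start else 0.0


# The constraint contributes nothing until training progress reaches the threshold.
loss_early = feature_gate(3) * feature_consistency_loss([[1.0, 2.0]], [[0.0, 0.0]])
loss_late = feature_gate(12) * feature_consistency_loss([[1.0, 2.0]], [[0.0, 0.0]])
```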
[0201] Furthermore, in step S4, the geometric supervision head uses the ground-truth depth obtained from the simulation platform as the geometric supervision anchor and is constrained by a geometric consistency error, specifically including the following steps:
[0202] S-E1: The geometry head outputs geometric features based on the bottleneck features output by the shared encoder.
[0203] $$G = H_{geo}(B)$$;
[0204] where $G$ is the geometric feature map output by the geometric supervision head, $B$ is the bottleneck feature, and $H_{geo}(\cdot)$ is the mapping function of the geometric supervision head.
[0205] S-E2: The depth prediction is obtained by projecting the geometric feature map to depth:
[0206] $$\hat{D} = P_{d}(G) = d_{min} + (d_{max} - d_{min})\,\sigma(W_d G + b_d)$$;
[0207] where $\hat{D}$ is the predicted depth map output by the geometric supervision head, $P_d(\cdot)$ is the depth projection function from geometric features to depth, $d_{min}$ is the calibration constant for the minimum depth, $d_{max}$ is the calibration constant for the maximum depth, $\sigma(\cdot)$ is the Sigmoid function, which compresses the linear output to (0,1), $W_d$ is the weight of the depth projection layer, and $b_d$ is the bias of the depth projection layer.
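[0208] S-E3: The geometric supervision head computes the L1 depth regression error against the known ground-truth depth:
[0209] $$\mathcal{L}_{geo} = \frac{1}{|\Omega|} \sum_{i\in\Omega} \big| \hat{D}(i) - D_{gt}(i) \big|$$;
[0210] where $|\Omega|$ is the number of elements in the pixel set, $i$ is the pixel position index, $\hat{D}(i)$ is the predicted depth at that pixel, and $D_{gt}(i)$ is the ground-truth depth at that pixel.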
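A minimal sketch of the bounded depth projection and the L1 depth error follows. The depth bounds 0.1 and 100.0, the per-pixel vector layout, and the function names are hypothetical; the embodiment only states that a Sigmoid compresses the linear output to (0,1) before it is rescaled between the minimum and maximum depth constants.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def project_depth(geo_vec, weights, bias, d_min=0.1, d_max=100.0):
    """Map one geometric feature vector to a depth between d_min and d_max."""
    linear = sum(w * g for w, g in zip(weights, geo_vec)) + bias
    return d_min + (d_max - d_min) * sigmoid(linear)


def depth_l1_loss(pred_depth, gt_depth):
    """Mean absolute error over a flat list of per-pixel depths."""
    return sum(abs(p - g) for p, g in zip(pred_depth, gt_depth)) / len(pred_depth)


pred = [project_depth([0.0, 0.0], [1.0, -1.0], 0.0),
        project_depth([2.0, 1.0], [1.0, -1.0], 0.0)]
loss = depth_l1_loss(pred, [50.05, 50.05])
```

Because the Sigmoid output is bounded, the prediction cannot leave the calibrated depth range, which keeps the L1 error finite even early in training.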
[0211] Furthermore, in step S5, the teacher network parameters are obtained by applying an exponential moving average, with a preset decay coefficient, to the student network parameters at the current moment. This specifically includes the following steps:
[0212] S-F1: The teacher network obtains its parameter set from the student parameters through an EMA update:
[0213] $$\theta_T \leftarrow \alpha\,\theta_T + (1 - \alpha)\,\theta_S$$;
[0214] where $\theta_T$ is the teacher network parameter set, $\theta_S$ is the student network parameter set, and $\alpha$ is the EMA decay coefficient.
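[0215] S-F2: The teacher network outputs the teacher reference features, and a stop-gradient is applied so that no gradient is backpropagated to the teacher parameters:
[0216] $$(S_T, \hat{D}_T) = \mathrm{sg}\big(f_{\theta_T}(\cdot)\big)$$;
[0217] where $f_{\theta_T}(\cdot)$ denotes the forward pass of the teacher network, $\mathrm{sg}(\cdot)$ is the stop-gradient operator, $S_T$ is the slam_feat output by the teacher, and $\hat{D}_T$ is the predicted depth map output by the teacher.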
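[0218] S-F3: The teacher-student consistency loss is computed so that features and geometry are constrained simultaneously:
[0219] $$\mathcal{L}_{ts} = \big\| \hat{S} - \mathrm{sg}(S_T) \big\|_1 + \big\| \hat{D} - \mathrm{sg}(\hat{D}_T) \big\|_1$$;
[0220] where $\mathcal{L}_{ts}$ is the teacher-student consistency regularization term, $\hat{S}$ is the slam_feat of the student network in the enhanced image domain, $\hat{D}$ is the depth map predicted by the student network, $S_T$ and $\hat{D}_T$ are the corresponding teacher outputs, and $\mathrm{sg}(\cdot)$ is the stop-gradient operator.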
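The EMA update of step S-F1 and the consistency term of step S-F3 can be sketched on plain Python numbers. The decay value 0.99 is a hypothetical choice (the embodiment does not state it), parameters are stored in a dictionary for illustration, and the equal weighting of the feature and depth parts is an assumption. In an automatic differentiation framework the teacher outputs would additionally be detached from the graph; here they are ordinary constants, so no gradient can reach the teacher.

```python
def ema_update(teacher, student, decay=0.99):
    """Return updated teacher parameters: decay*teacher + (1-decay)*student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


def teacher_student_loss(student_feat, teacher_feat, student_depth, teacher_depth):
    """L1 consistency on features and depth, averaged over elements."""
    feat = sum(abs(s - t) for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    depth = sum(abs(s - t) for s, t in zip(student_depth, teacher_depth)) / len(student_depth)
    return feat + depth


teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(500):
    # The teacher drifts toward the (here fixed) student without any gradient step.
    teacher = ema_update(teacher, student)
```

A decay close to 1 makes the teacher change slowly, which is what provides the temporally stable reference target described in step S5.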
[0221] Furthermore, in step S6, the dynamic weight scheduling strategy lets the network first converge to a stable image restoration space in the early stage of training; in the later stage, the feature consistency loss and the geometric supervision loss are introduced gradually for higher-level refinement and optimization. The final complete network is shown in Figure 4. The specific steps include:
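[0222] S-G1: The dynamic adjustment strategy first defines the total network loss as:
[0223] $$\mathcal{L}_{total} = \lambda_{rec}\mathcal{L}_{rec} + g_{feat}(e)\,\lambda_{feat}\mathcal{L}_{feat} + g_{geo}(e)\,\lambda_{geo}\mathcal{L}_{geo} + \lambda_{ts}\mathcal{L}_{ts}$$;
[0224] where $\mathcal{L}_{total}$ is the total loss; $\lambda_{rec}$, $\lambda_{feat}$, $\lambda_{geo}$, and $\lambda_{ts}$ are the weight coefficients of the respective terms; $g_{feat}$ is the gating function of the feature consistency head; $g_{geo}$ is the gating function of the geometric supervision head; and $\mathcal{L}_{geo}$ is the geometric loss term. In this example, the specific weights of the reconstruction term follow point 4, and $\lambda_{feat} = 0.05$, $\lambda_{geo} = 0.01$, $\lambda_{ts} = 0.05$.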
[0225] S-G2: The dynamic weight scheduling function defines the gating function of the geometric supervision head as:
[0226] $$g_{geo}(e) = \begin{cases} 1, & e \ge e_{geo} \\ 0, & e < e_{geo} \end{cases}$$;
[0227] where $e$ is the training progress variable and $e_{geo}$ is the training progress threshold at which the geometric supervision constraint is enabled; in this example, $e_{geo} = 25$.
[0228] S-G3: The dynamic weight scheduling function controls the feature and geometric weights so that they increase gradually as training progresses.
[0229]
[0230] where $\lambda_{feat}^{max}$ and $\lambda_{geo}^{max}$ are the maximum weights of the corresponding losses; in this example, $\lambda_{feat}^{max} = 1.5$ and $\lambda_{geo}^{max} = 0.1$.
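One possible realization of the staged schedule is sketched below. The text gives the thresholds (10 and 25), the initial weights (0.05, 0.01, and 0.05 for the feature, geometric, and teacher-student terms, as read from the order of the loss terms), and the maximum weights (1.5 and 0.1), but not the shape of the increase. The linear ramp, the ramp length of 20 progress units, the unit weight on the reconstruction loss, and the ungated teacher-student term are therefore all assumptions of this sketch.

```python
def step_gate(progress, start):
    """1.0 once training progress reaches `start`, otherwise 0.0."""
    return 1.0 if progress >= start else 0.0


def ramp_weight(progress, start, ramp_len, w_init, w_max):
    """Linear ramp from w_init at `start` to w_max after `ramp_len` units."""
    if progress <= start:
        return w_init
    frac = min(1.0, (progress - start) / ramp_len)
    return w_init + frac * (w_max - w_init)


def total_loss(progress, l_rec, l_feat, l_geo, l_ts,
               feat_start=10, geo_start=25, ramp_len=20, w_ts=0.05):
    """Weighted sum of the four loss terms with staged activation."""
    w_feat = ramp_weight(progress, feat_start, ramp_len, 0.05, 1.5)
    w_geo = ramp_weight(progress, geo_start, ramp_len, 0.01, 0.1)
    return (l_rec
            + step_gate(progress, feat_start) * w_feat * l_feat
            + step_gate(progress, geo_start) * w_geo * l_geo
            + w_ts * l_ts)


# Effective total for unit-valued losses at a few points in training.
schedule = [(p, total_loss(p, 1.0, 1.0, 1.0, 1.0)) for p in (0, 10, 30, 100)]
```

With this layout the reconstruction loss dominates early training, the feature term switches on first, and the geometric term joins later, matching the staged behaviour described for step S6.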
[0231] Furthermore, in step S7, the trained enhancement model can be used as a plug-and-play module that requires no change to the classic SLAM structure during deployment. The enhancement model can be viewed as a deterministic front-end mapping, expressed as:
[0232] $$\hat{I}_t = f_{\theta^*}(I_t^{d})$$;
[0233] where $\theta^*$ denotes the parameters of the converged student network after training, $I_t^{d}$ is the degraded image acquired at time $t$, and $\hat{I}_t$ is the corresponding enhanced image. The classic feature-based SLAM system can be treated as a black-box operator, expressed as:
[0234] $$(\{T_t\}, \mathcal{M}) = \mathrm{SLAM}(\{\hat{I}_t\})$$;
[0235] where $\{T_t\}$ is the trajectory pose sequence and $\mathcal{M}$ comprises internal states such as the map and keyframes.
Step S7: Deployment method.
[0236] After training, the enhancement model is deployed as an independent front-end preprocessing module ahead of the classic feature-based visual SLAM system. At runtime, only the real-time degraded images need to be acquired; they are input into the enhancement model, and the enhancement result is fed directly into ORB-SLAM2, ORB-SLAM3, or another classic feature-based visual SLAM system. Because this invention does not change the keyframe management, feature matching, loop closure detection, or back-end optimization structure of the SLAM system, it has good compatibility and deployability.
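The deployment arrangement can be sketched as a thin wrapper that enhances each frame and hands it to an unmodified SLAM system. Both callables are hypothetical placeholders: `enhance` stands in for the trained student network and `slam_step` for a black-box SLAM system that returns one pose per frame; neither corresponds to an actual ORB-SLAM interface.

```python
def run_pipeline(frames, enhance, slam_step):
    """Enhance each frame, then pass it to an unmodified SLAM system.

    `enhance` and `slam_step` are hypothetical callables supplied by the
    caller; the SLAM system is treated as a black box.
    """
    trajectory = []
    for frame in frames:
        enhanced = enhance(frame)
        trajectory.append(slam_step(enhanced))
    return trajectory


# Stub usage: brighten a scalar "frame" and return a dummy pose per frame.
poses = run_pipeline(
    [0.1, 0.2, 0.3],
    enhance=lambda frame: min(1.0, frame * 4.0),
    slam_step=lambda frame: ("pose", frame),
)
```

Because the wrapper only changes the images that reach the SLAM system, clean reference images and ground-truth depth are not needed at this stage.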
[0237] In this example, the CARLA simulation platform is used to construct the training data. Specifically, visual image sequences under clean and degraded conditions are acquired along the same path, with depth and pose information recorded simultaneously. Degraded conditions include, but are not limited to, daytime rain, nighttime rain, and nighttime rain with low illumination. This strictly spatiotemporally aligned data construction allows the training data required for pixel-level, structural, and geometric supervision to be acquired simultaneously. During training, the clean image serves as the reference for the image reconstruction constraint, and the depth map serves as the anchor for geometric supervision. During deployment, only the enhancement network and the visual SLAM system are retained, so clean reference images and ground-truth depth are no longer required as input.
[0238] In this example, the present invention is used as a front-end module connected to a classic stereo visual SLAM system, and localization tests are performed on a continuous degraded sequence. The results are shown in Table 1:
[0239] Table 1. Comparison of ATE (unit: meters) after processing the dataset using different methods.
[0240]
[0241] The results show that this invention can reduce abnormal fluctuations in the number of feature matches under degradation conditions, decrease overall trajectory error, and enhance trajectory continuity and tracking stability. Compared with enhancement methods that only focus on image visual quality optimization, this invention, by considering both feature stability and geometric consistency, exhibits more stable and interpretable performance gains in visual SLAM applications.
[0242] This example demonstrates the effectiveness of the Feature Consistency Head (FH) and Geometric Supervision Head (GH) in the model through ablation experiments. The experimental setup includes four configurations: (1) using only the image reconstruction head; (2) combining the image reconstruction head with the Feature Consistency Head; (3) combining the image reconstruction head with the Geometric Supervision Head; and (4) a complete model integrating all modules. The results are shown in Table 2.
[0243] Table 2 Performance Comparison of Ablation Module Models
[0244]
[0245] The results show that the absolute trajectory error (ATE) is worst when both the feature consistency head (FH) and the geometric supervision head (GH) are removed, leaving only the image reconstruction branch. This indicates that pixel and local structural constraints alone are insufficient to suppress feature drift and geometric distortion. With only FH enabled, the mean ATE decreases from 2.234 to 1.858 and the RMSE decreases to 2.052, indicating that feature consistency alleviates the feature distribution perturbations caused by appearance degradation; however, the standard deviation increases, and cumulative structural drift persists. With only GH enabled, the mean ATE and RMSE decrease further to 1.794 and 1.991, respectively, and the minimum error is lower, indicating that geometric supervision effectively suppresses structural distortion and provides more consistent geometric input, although fluctuations remain. The complete model, with FH and GH enabled simultaneously, achieves the best results: the mean and RMSE decrease by approximately 36.0% and 30.8% relative to the baseline, respectively, and the median decreases to 1.334. The errors are more concentrated and stable, demonstrating that the two heads are complementary in suppressing feature drift and constraining geometric distortion.
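The relative improvements quoted above can be checked with a short calculation. The three mean ATE values are taken from the paragraph above; the full-model mean is back-computed from the stated 36.0% reduction and is an inferred value, not a figure read from Table 2.

```python
def relative_reduction(baseline, value):
    """Fractional decrease of `value` with respect to `baseline`."""
    return (baseline - value) / baseline


baseline_mean = 2.234   # image reconstruction head only (mean ATE, metres)
fh_mean = 1.858         # image reconstruction head + FH
gh_mean = 1.794         # image reconstruction head + GH

fh_gain = relative_reduction(baseline_mean, fh_mean)
gh_gain = relative_reduction(baseline_mean, gh_mean)

# Mean ATE implied by the quoted 36.0% reduction (inferred, not from Table 2).
implied_full_mean = baseline_mean * (1.0 - 0.360)
```

Each head alone reduces the mean ATE by under 20%, well short of the 36.0% reduction of the complete model, which is consistent with the complementary behaviour described above.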
[0246] It should be noted that the number of network layers, loss function form, stage division method, feature dimension, training rounds, simulation platform type, and specific SLAM system type used in the above embodiments can all be adjusted according to the actual application scenario. As long as the core idea remains to explicitly constrain feature stability and geometric consistency simultaneously during image enhancement, and to achieve stable training through a teacher-student mechanism and dynamic scheduling strategy, it falls within the protection scope of this invention.
Claims
1. A multi-head feature-geometric consistency image enhancement method for visual SLAM, characterized in that the method uses an image processing neural network for SLAM and is performed according to the following steps:
Preprocessing: construct a sequence of degraded visual images in a simulated environment, and establish clean reference images and depth supervision data that are spatiotemporally aligned with the degraded images for the training phase;
Step S1: feed the degraded input image into the shared feature encoder to extract multi-scale visual features;
Step S2: input the multi-scale visual features into the image reconstruction head to obtain an enhanced image, and maintain scene texture, edge, and contour information through pixel consistency constraints and structural consistency constraints;
Step S3: input the shared features into the feature consistency head to obtain the feature representation related to the visual SLAM front end, and, by constraining the difference between the degraded image and the enhanced image in the feature space, suppress the destruction of the key feature distribution by the enhancement process;
Step S4: input the shared features into the geometric supervision head to predict the scene's geometric or depth representation, and use depth supervision data aligned with the current image as geometric anchors to constrain it, in order to suppress structural drift during the enhancement process;
Step S5: construct the teacher network and the student network, wherein the teacher network does not participate in gradient backpropagation but is updated through the exponential moving average of the student network parameters, to provide a temporally stable reference target for the feature consistency constraints and geometric supervision constraints;
Step S6: combine the image reconstruction loss, feature consistency loss, geometric supervision loss, and teacher-student consistency loss as a weighted sum, and dynamically schedule the weight of each loss in stages during training to obtain the trained enhancement model;
Step S7: during the deployment phase, use the trained enhancement model to enhance the degraded images acquired in real time, and input the enhancement results directly into the classic feature-based visual SLAM system to improve localization accuracy, tracking stability, and geometric constraint quality without changing the original front-end and back-end structure of SLAM.
2. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in the preprocessing step, the image acquisition unit sets the degradation conditions and constructs a set of paired degraded-clean image sequences, specifically including the following steps:
S-A1: the image acquisition unit first sets the degradation synthesis operator in the CARLA simulation environment to $I_d = \mathcal{D}(I_c, D_{gt}; \phi)$, where $I_d$ is the degraded image, $\mathcal{D}(\cdot)$ denotes the degradation obtained by combining depth-guided rain streak rendering with low-light perturbation, $I_c$ is the clean reference image, $D_{gt}$ is the depth supervision ground truth, and $\phi$ denotes the degradation control parameters, including adjustable parameters that affect the visual quality of the acquired images in the simulation environment, such as rainfall intensity, exposure, noise amplitude, and cloud thickness;
S-A2: the image acquisition unit acquires images under the above degradation conditions and enforces spatiotemporal alignment by sampling along the same path with the same poses, formally $T_t^{d} = T_t^{c}$ for all $t$, where $T_t^{d}$ is the camera pose corresponding to degraded image frame $t$ and $T_t^{c}$ is the camera pose corresponding to the clean reference image of the same frame;
S-A3: the image acquisition unit generates a training sample set from the spatiotemporally aligned images.
3. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S1, the shared feature encoder uses a lightweight U-Net-style encoder structure to perform layer-by-layer downsampling and encoding, and its output features are shared by the three subsequent heads, specifically including the following steps:
S-B1: layer 0 of the shared encoder encodes pixel-space information into the feature space using convolution operations while maintaining the same spatial resolution, $F_0 = E_0(I_d)$, where $F_0$ is the shallow feature map and $E_0(\cdot)$ is the first-layer convolutional block of the shared encoder;
S-B2: the shared encoder performs layer-by-layer downsampling encoding on the input image features and outputs the deepest bottleneck feature, $F_l = \mathcal{C}_l(\mathrm{Down}(F_{l-1}))$, where $F_l$ are the skip features obtained by downsampling encoding, $l = 1, \dots, L$ indexes the downsampling layers, $\mathrm{Down}(\cdot)$ is the downsampling operation, and $\mathcal{C}_l(\cdot)$ is a convolutional block; the deepest bottleneck output is $B = F_L$;
S-B3: the shared encoder outputs the obtained bottleneck feature and layer-by-layer skip features to the three subsequent heads, $\hat{I} = H_{rec}(B, \{F_l\})$, $S_d = H_{feat}(B)$, $G = H_{geo}(B)$, where $\hat{I}$ is the enhanced output image produced by the image reconstruction head, $H_{rec}$ is the image reconstruction head mapping function, $S_d$ is the dense feature representation produced by the feature consistency head, $H_{feat}$ is the feature consistency head mapping function, $G$ is the geometric feature map output by the geometric supervision head, and $H_{geo}$ is the geometric supervision head mapping function.
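4. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S2, the image reconstruction head preserves the stability of scene edges and local structures while removing rain streaks and low-light noise, specifically including the following steps:
S-C1: the image reconstruction head performs feature expansion on the bottleneck feature output by the shared encoder, $U_L = \Phi_{exp}(B)$, where $U_L$ is the initial feature of the image reconstruction head at the deepest scale (layer $L$) obtained after feature expansion, and $\Phi_{exp}(\cdot)$ is the feature expansion module, which applies a learnable transformation to the bottleneck feature $B$;
S-C2: the image reconstruction head performs layer-by-layer upsampling and fusion of the bottleneck feature and the layer-by-layer skip features, $\tilde{U}_l = \mathrm{Up}(U_{l+1})$ and $U_l = \mathcal{F}_l(\mathrm{Concat}(\tilde{U}_l, F_l))$, where $U_l$ is the decoded feature at layer $l$ after fusion with the encoder-side feature $F_l$ of the same scale, $\mathcal{F}_l(\cdot)$ is the feature fusion module, $\mathrm{Concat}(\cdot)$ is channel concatenation, $\tilde{U}_l$ is the candidate decoded feature obtained by upsampling the previous layer's decoded feature $U_{l+1}$ to the spatial scale of layer $l$, and $\mathrm{Up}(\cdot)$ is the upsampling operation;
S-C3: the image reconstruction head maps the fused shallowest-layer decoded feature back to the RGB image space to obtain the enhanced image, $\hat{I} = \Psi_{out}(U_0)$, where $\hat{I}$ is the enhanced output image and $\Psi_{out}(\cdot)$ is the output mapping module;
S-C4: the image reconstruction head loss comprises a pixel reconstruction loss, a gradient structure consistency loss, a structural similarity loss, and a total variation regularization term; let the set of pixel coordinates be $\Omega = \{(x, y) \mid 1 \le x \le W,\ 1 \le y \le H\}$; the losses of the image reconstruction head are then $\mathcal{L}_{pix} = \frac{1}{|\Omega|}\sum_{(x,y)\in\Omega} |\hat{I}(x,y) - I_c(x,y)|$; $\mathcal{L}_{grad} = \frac{1}{|\Omega|}\sum_{(x,y)\in\Omega} \|\nabla\hat{I}(x,y) - \nabla I_c(x,y)\|_1$; $\mathrm{SSIM}(\hat{I}, I_c) = \frac{(2\mu_{\hat{I}}\mu_{I_c} + C_1)(2\sigma_{\hat{I}I_c} + C_2)}{(\mu_{\hat{I}}^2 + \mu_{I_c}^2 + C_1)(\sigma_{\hat{I}}^2 + \sigma_{I_c}^2 + C_2)}$; $\mathcal{L}_{ssim} = -\log \mathrm{SSIM}(\hat{I}, I_c)$; $\mathcal{L}_{tv} = \frac{1}{|\Omega|}\sum_{(x,y)\in\Omega} (|\nabla_x\hat{I}(x,y)| + |\nabla_y\hat{I}(x,y)|)$; where $x$ is the horizontal pixel coordinate, $y$ is the vertical pixel coordinate, $W$ is the image width, $H$ is the image height, $\mathcal{L}_{pix}$ is the pixel reconstruction loss, $|\Omega|$ is the number of elements in the pixel set, $\mathcal{L}_{grad}$ is the gradient structure consistency loss, $\nabla$ is the gradient operator, $\mathrm{SSIM}$ is the structural similarity index, $\mu_{\hat{I}}$ and $\mu_{I_c}$ are the means of $\hat{I}$ and $I_c$ within a local window, $\sigma_{\hat{I}}^2$ and $\sigma_{I_c}^2$ are their variances within a local window, $\sigma_{\hat{I}I_c}$ is their covariance within a local window, $C_1$ and $C_2$ are the stability constants in SSIM, $\mathcal{L}_{ssim}$ is the log-SSIM structural loss, $\mathcal{L}_{tv}$ is the total variation regularization term, and $\nabla_x$, $\nabla_y$ are the difference operators in the x and y directions, respectively; the total loss of the image reconstruction head is $\mathcal{L}_{rec} = \lambda_{pix}\mathcal{L}_{pix} + \lambda_{grad}\mathcal{L}_{grad} + \lambda_{ssim}\mathcal{L}_{ssim} + \lambda_{tv}\mathcal{L}_{tv}$, where $\lambda_{pix}$, $\lambda_{grad}$, $\lambda_{ssim}$, and $\lambda_{tv}$ are the weight coefficients of the respective terms.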
5. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S3, the feature consistency head maps the shared bottleneck features, through projection and embedding operations, into a dense feature representation (slam_feat) closely related to visual SLAM, and applies the corresponding feature consistency constraint between the degraded image and the image enhanced by the image reconstruction head, specifically including the following steps:
S-D1: the feature consistency head projects the bottleneck feature onto a specified feature channel dimension, $Z = W_p B + b_p$, where $Z$ is the intermediate feature after projection, and $W_p$ and $b_p$ are the weight and bias parameters of the channel projection layer;
S-D2: the feature consistency head embeds the projected feature and normalizes its scale and distribution to obtain slam_feat, $S_d = \mathrm{Norm}(\mathrm{Emb}(Z))$, where $S_d$ is the slam_feat corresponding to the degraded image $I_d$, $\mathrm{Norm}(\cdot)$ is the normalization operator, and $\mathrm{Emb}(\cdot)$ is the embedding transformation module;
S-D3: the enhanced image output by the image reconstruction head is passed through the shared encoder again to obtain the bottleneck feature of the enhanced image, $\hat{B} = E_{\theta_E}(\hat{I})$, where $\theta_E$ is the parameter set of the shared encoder and $E_{\theta_E}(\cdot)$ denotes forward encoding by the shared encoder with parameters $\theta_E$; the slam_feat of the enhanced image is obtained as $\hat{S} = H_{feat}(\hat{B})$, where $H_{feat}(\cdot)$ is the overall mapping function of the feature consistency head, taking the bottleneck feature as input and outputting slam_feat;
S-D4: the feature consistency head applies a position-wise L1 constraint loss to the pixel-aligned dense features; letting $\Lambda$ be the set of feature grid coordinates, $\mathcal{L}_{feat} = \frac{1}{|\Lambda|}\sum_{p\in\Lambda} \|\hat{S}(p) - S_d(p)\|_1$, where $\mathcal{L}_{feat}$ is the feature consistency loss and $p$ is a grid coordinate in the feature map;
S-D5: the feature consistency head uses a threshold gating function to delay activation of the feature constraint, $g_{feat}(e) = 1$ if $e \ge e_{feat}$ and $g_{feat}(e) = 0$ otherwise, where $g_{feat}$ is the gating function of the feature consistency head, $e$ is the training progress variable, and $e_{feat}$ is the training progress threshold at which the feature consistency constraint is enabled.
6. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S4, the geometric supervision head uses the ground-truth depth obtained from the simulation platform as the geometric supervision anchor and is constrained by a geometric consistency error, specifically including the following steps:
S-E1: the geometric supervision head outputs geometric features from the bottleneck feature output by the shared encoder, $G = H_{geo}(B)$, where $G$ is the geometric feature map output by the geometric supervision head and $H_{geo}(\cdot)$ is the mapping function of the geometric supervision head;
S-E2: the depth prediction is obtained by projecting the geometric feature map to depth, $\hat{D} = P_d(G) = d_{min} + (d_{max} - d_{min})\,\sigma(W_d G + b_d)$, where $\hat{D}$ is the predicted depth map output by the geometric supervision head, $P_d(\cdot)$ is the depth projection function from geometric features to depth, $d_{min}$ is the calibration constant for the minimum depth, $d_{max}$ is the calibration constant for the maximum depth, $\sigma(\cdot)$ is the Sigmoid function, which compresses the linear output to (0,1), $W_d$ is the weight of the depth projection layer, and $b_d$ is the bias of the depth projection layer;
S-E3: the geometric supervision head computes the L1 depth regression error against the known ground-truth depth, $\mathcal{L}_{geo} = \frac{1}{|\Omega|}\sum_{i\in\Omega} |\hat{D}(i) - D_{gt}(i)|$, where $|\Omega|$ is the number of elements in the pixel set, $i$ is the pixel position index, and $D_{gt}(i)$ is the ground-truth depth at that pixel.
7. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S5, the teacher network parameters are obtained by applying an exponential moving average, with a preset decay coefficient, to the student network parameters at the current moment, specifically including the following steps:
S-F1: the teacher network obtains its parameter set from the student parameters through an EMA update, $\theta_T \leftarrow \alpha\,\theta_T + (1 - \alpha)\,\theta_S$, where $\theta_T$ is the teacher network parameter set, $\theta_S$ is the student network parameter set, and $\alpha$ is the EMA decay coefficient;
S-F2: the teacher network outputs the teacher reference features, and a stop-gradient is applied so that no gradient is backpropagated to the teacher parameters, $(S_T, \hat{D}_T) = \mathrm{sg}(f_{\theta_T}(\cdot))$, where $S_T$ is the slam_feat output by the teacher and $\hat{D}_T$ is the predicted depth map output by the teacher;
S-F3: the teacher-student consistency loss is computed so that features and geometry are constrained simultaneously, $\mathcal{L}_{ts} = \|\hat{S} - \mathrm{sg}(S_T)\|_1 + \|\hat{D} - \mathrm{sg}(\hat{D}_T)\|_1$, where $\mathcal{L}_{ts}$ is the teacher-student consistency regularization term, $\hat{S}$ is the slam_feat of the student network in the enhanced image domain, $\hat{D}$ is the depth map predicted by the student network, and $\mathrm{sg}(\cdot)$ is the stop-gradient operator.
8. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S6, the dynamic weight scheduling strategy lets the network first converge to a stable image restoration space in the early stage of training, and then gradually introduces the feature consistency loss and the geometric supervision loss for higher-level refinement and optimization in the later stage of training, specifically including the following steps:
S-G1: the dynamic adjustment strategy first defines the total network loss as $\mathcal{L}_{total} = \lambda_{rec}\mathcal{L}_{rec} + g_{feat}(e)\,\lambda_{feat}\mathcal{L}_{feat} + g_{geo}(e)\,\lambda_{geo}\mathcal{L}_{geo} + \lambda_{ts}\mathcal{L}_{ts}$, where $\mathcal{L}_{total}$ is the total loss, $\lambda_{rec}$, $\lambda_{feat}$, $\lambda_{geo}$, and $\lambda_{ts}$ are the weight coefficients of the respective terms, $g_{feat}$ is the gating function of the feature consistency head, $g_{geo}$ is the gating function of the geometric supervision head, and $\mathcal{L}_{geo}$ is the geometric loss term;
S-G2: the dynamic weight scheduling function defines the gating function of the geometric supervision head as $g_{geo}(e) = 1$ if $e \ge e_{geo}$ and $g_{geo}(e) = 0$ otherwise, where $e_{geo}$ is the training progress threshold at which the geometric supervision constraint is enabled;
S-G3: the dynamic weight scheduling function controls the feature and geometric weights so that they increase gradually as training progresses, where $\lambda_{feat}^{max}$ and $\lambda_{geo}^{max}$ are the maximum weights of the corresponding losses.
9. The multi-head feature-geometric consistency image enhancement method for visual SLAM according to claim 1, characterized in that, in step S7, the trained enhancement model can be used as a plug-and-play module that requires no change to the classic SLAM structure during deployment; the enhancement model can be viewed as a deterministic front-end mapping, expressed as $\hat{I}_t = f_{\theta^*}(I_t^{d})$, where $\theta^*$ denotes the parameters of the converged student network after training, $I_t^{d}$ is the degraded image acquired at time $t$, and $\hat{I}_t$ is the corresponding enhanced image; the classic feature-based SLAM system can be treated as a black-box operator, expressed as $(\{T_t\}, \mathcal{M}) = \mathrm{SLAM}(\{\hat{I}_t\})$, where $\{T_t\}$ is the trajectory pose sequence and $\mathcal{M}$ comprises internal states such as the map and keyframes.