Intelligent underwater structure inspection method based on visible light and sonar image multi-modal deep learning

By combining multimodal deep learning methods with visible light and sonar images, the problem of balancing texture details and structural contours in underwater structure inspection was solved, achieving efficient and stable defect identification and quantitative assessment, and reducing human intervention and risks.

CN122265818A · Pending · Publication Date: 2026-06-23 · HARBIN INST OF TECH


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HARBIN INST OF TECH
Filing Date: 2026-05-07
Publication Date: 2026-06-23

Patent Timeline

07 May 2026: Application
23 Jun 2026: Publication (CN122265818A)

IPC: G06V20/05; G06V10/764; G06V10/774; G06V10/30; G06V10/143; G06V10/82; G06N3/045; G06N3/0464

AI Tagging

Application Domain

Character and pattern recognition; Biological models

Technology Topics

Engineering; Spatial registration




AI Technical Summary

Technical Problem

Traditional underwater structure inspection methods struggle to capture both texture details and structural outlines, and are unstable in identifying defects in complex aquatic environments. Furthermore, the inspection results are difficult to correlate with the spatial structure and lack quantitative assessment.

Method used

An intelligent inspection method based on multimodal deep learning of visible light and sonar images is adopted. By constructing a technical process of multi-source acquisition, preprocessing and registration, multimodal fusion enhancement, intelligent detection of defects and 3D modeling, it can realize the identification of apparent defects of underwater structures, spatial positioning and extraction of geometric parameters.

Benefits of technology

It achieves efficient and stable image analysis and disease identification in complex aquatic environments, reduces the degree of human intervention, improves detection accuracy and safety, and provides reliable structural condition assessment.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

The application proposes an intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images. The method comprises the following steps: step one, constructing a multimodal dataset of visible light images and sonar images of underwater structures; step two, preprocessing the visible light and sonar images and performing staged spatial registration under cross-modal fusion constraints; step three, constructing a deep learning-based visible light-sonar multimodal fusion enhancement model; step four, intelligently detecting apparent defects of the underwater structure based on the fused image; step five, building a local three-dimensional representation of the underwater structure and extracting key parameters under the constraint of the fusion recognition results; step six, producing the intelligent inspection output by associating the multimodal fusion results with the recognition results. By establishing this intelligent inspection method, the application enables more efficient and stable image analysis and defect identification by an intelligent inspection robot in complex water environments, reducing manual participation and the risk of field operations.


Description

Technical Field

[0001] This invention relates to the technical fields of intelligent underwater structure inspection, structural health monitoring and detection, and intelligent bridge operation and maintenance, and in particular to an intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images. Specifically, it involves technologies such as underwater visible light imaging, imaging sonar sensing, multimodal image fusion, deep learning target detection, 3D reconstruction, and quantitative assessment of defects. This method can be applied to defect detection, geometric state recognition, and intelligent inspection of underwater piers of large bridges, caisson foundations, wharf pile foundations, hydraulic structures, and other underwater structures.

Background Technology

[0002] As infrastructure such as bridges, hydraulic structures, and ports remains in service for longer, a large number of underwater structures, including bridge piers, caissons, and pile foundations, are subjected to the long-term effects of water flow erosion, sediment abrasion, water corrosion, biological adhesion, and complex load coupling, making them prone to defects such as cracks, spalling, scouring, and exposed rebar. These defects are well concealed and develop slowly, yet cause significant harm. If they are not detected and accurately assessed in time, they may lead to reduced load-bearing capacity, durability degradation, or even structural instability.

[0003] Underwater bridge piers are a crucial component of a bridge's load-bearing system. Traditional inspection methods suffer from poor safety, low efficiency, and strong subjectivity. Traditional bridge structural inspections primarily rely on manual methods, requiring professional personnel for everything from on-site photography and damage identification to written documentation and comprehensive assessment. This process is inefficient, costly, and the results are highly subjective. Existing underwater structural inspection methods mainly include manual diving inspection, single visible light visual inspection, and sonar scanning. Manual diving relies on experience-based judgment, which is not only inefficient but also carries significant operational risks, making it unsuitable for large-scale, routine inspections.

[0004] With the continuous advancement of artificial intelligence and image processing technologies, computer vision-based structural damage identification methods have been widely applied. These technologies analyze structural images to achieve non-contact, high-precision identification of surface defects, significantly improving inspection efficiency and safety. While visible-light-only methods can provide rich texture details, under conditions of turbid water, low illumination, and significant water scattering and absorption, images are prone to blurring, distortion, low contrast, and color shifts, making it difficult to reliably identify subtle defects such as cracks and spalling. Although sonar-only methods are robust to turbidity and have long detection ranges, the images often have low resolution and strong speckle noise, resulting in insufficient representation of small surface defects.

[0005] In recent years, deep learning technology has made rapid progress in underwater image enhancement, target detection, and 3D reconstruction. Existing research has begun to improve the accuracy of defect detection and 3D reconstruction through convolutional neural networks, Transformer architectures, semantic segmentation networks, and point cloud processing models. However, most existing methods focus on single-modal optimization and lack collaborative visible light and sonar perception tailored to underwater structure inspection scenarios. They also lack an integrated technology chain that connects "image fusion, defect detection, 3D modeling, and parameter quantification."

[0006] Therefore, there is an urgent need to propose an intelligent underwater structure inspection method that can comprehensively utilize the advantages of visible light image texture details and sonar image structural contours to solve problems such as poor underwater image quality, unstable defect identification, difficulty in matching detection results with spatial structures, and insufficient quantitative assessment capabilities.

Summary of the Invention

[0007] The purpose of this invention is to address the limitations of traditional single-sensor methods in simultaneously capturing texture details, structural contours, and adaptability to complex aquatic environments. It provides an intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images. This method constructs an integrated technical process encompassing "multi-source acquisition—preprocessing and registration—multimodal fusion enhancement—intelligent defect detection—3D modeling and parameter extraction—inspection result output," enabling the identification of apparent defects, spatial localization, geometric parameter extraction, and condition assessment of underwater structures.

[0008] This invention is achieved through the following technical solution: an intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images, the method comprising:
Step 1: Construct a multimodal dataset of visible light and sonar images of underwater structures; using visible light cameras and imaging sonar mounted on underwater robots, submersibles, or fixed inspection platforms, visible light images and sonar images of the target underwater structure area are collected synchronously or quasi-synchronously, along with the pose, depth, or timestamp information corresponding to the collection process, to construct a multimodal raw dataset for the same inspection object.
Step 2: Preprocessing and staged spatial registration of visible light and sonar images under cross-modal fusion constraints; first, the visible light image undergoes brightness distribution correction, detail enhancement, and edge preservation processing; second, the sonar image undergoes speckle noise suppression, significant echo enhancement, and contour stabilization processing. On this basis, a two-stage registration method of "coarse-scale initial alignment + fine-scale feature correction" is adopted: first, the initial spatial transformation is determined using time synchronization relationships, pose information, or overall structural contour correspondence to complete coarse-scale registration; then, local deviations are further estimated using edge contours, local salient regions, and cross-modal feature similarity to complete fine-scale registration.
Step 3: Construct a deep learning-based visible light-sonar multimodal fusion enhancement model; the preprocessed visible light image and sonar image are input into the two branches of a dual-branch feature encoding network to extract shallow texture information and deep semantic structure information of the two modalities. Bidirectional information interaction between the two modalities is realized through an iterative cross-attention module, so that the visible light features take in the stable contour information of the sonar and the sonar features take in the detailed texture information of the visible light. A DETR-based global context modeling module is then used to strengthen the expression of long-distance dependencies. Finally, the fused image is generated by the residual decoder.
Step 4: Conduct intelligent detection of apparent defects in underwater structures based on the fused images.
Step 5: Local 3D representation and key parameter extraction of underwater structures constrained by the fusion recognition results; to serve both the identification of surface defects of underwater bridge piers and the identification of foundation scour, after the fused images and defect detection results are obtained, the detected cracks, spalling, local defect areas, and bridge pier foundation boundary areas are used as spatial constraint objects to construct a local three-dimensional representation process associated with the identification results.
Step 6: Intelligent inspection output based on the correlation between multimodal fusion results and recognition results; the fused and enhanced image obtained in Step 3, the defect identification results obtained in Step 4, and the local three-dimensional parameter results obtained in Step 5 are uniformly correlated to generate an intelligent inspection output that includes defect type, location, spatial parameters, and bridge pier foundation scour assessment information, realizing an integrated inspection process of fused image visualization, recognition result expression, and parameter evaluation output.

[0009] Furthermore, step one specifically includes:
Step 11: Multimodal synchronous acquisition; underwater visible light imaging equipment and imaging sonar equipment are used to conduct synchronous or quasi-synchronous inspection and acquisition of the target underwater structure; the visible light imaging equipment acquires the surface texture, edges, color, and local defects of the structure, and the imaging sonar equipment acquires the structure outline, boundary echo, and overall geometric response information in turbid environments.
Step 12: Parameter acquisition and pose recording; while acquiring image and sonar data, record one or more of the following: timestamp, depth, track coordinates, attitude angle, heading angle, relative installation relationship of equipment, and sampling frequency, for subsequent time alignment, spatial registration, and scale recovery between the modalities.
Step 13: Cross-modal sample association and matching; based on the same inspection object, the same area, the same time period, or adjacent pose conditions, establish associations between visible light images and sonar images to form initial multimodal sample pairs; for data that is not strictly synchronized, establish candidate matching pairs based on temporal proximity, pose difference constraints, and structural contour similarity.
Step 14: Construction of a complex working condition dataset; collect multimodal samples under different water depths, turbidities, illuminance levels, sampling distances, structural types, and defect types to form a dataset covering complex working conditions; the defect types include one or more of the following: cracks, spalling, erosion, local defects, cavitation, and abnormal adhesion.
Step 15: Sample screening and task labeling; exclude samples with severe blurring, overexposure, invalid echoes, abrupt pose changes, missing targets, or for which no effective correspondence can be established, and retain the acquisition parameters and metadata tags of valid samples; for image fusion tasks, establish paired input relationships between visible light images and sonar images; for defect detection tasks, label defect areas with rectangular boxes, polylines, key points, or pixel-level masks; for 3D modeling tasks, establish mapping relationships between multi-view image sequences, pose information, and scale references.
Step 16: Stratified sampling and dataset partitioning; perform stratified sampling of the data according to the distribution of structure objects, defect categories, and working conditions, and divide the data into a training set, a validation set, and a test set, avoiding the data leakage that arises when adjacent samples of the same target area fall into different subsets.
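The leakage constraint in Step 16 can be illustrated by splitting at the level of target areas rather than individual frames. The following is a minimal sketch under stated assumptions, not the procedure of the application: the `area_id` key, the 70/15/15 ratios, and the function name `split_by_area` are hypothetical, and stratification by defect category and working condition is omitted for brevity.

```python
import random
from collections import defaultdict


def split_by_area(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole target areas to train/val/test subsets.

    `samples` is a list of dicts carrying a hypothetical "area_id" key.
    Because an area is never divided, adjacent frames of the same area
    cannot end up in different subsets.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["area_id"]].append(sample)

    area_ids = sorted(groups)
    random.Random(seed).shuffle(area_ids)

    n_train = int(len(area_ids) * ratios[0])
    n_val = int(len(area_ids) * ratios[1])
    parts = {
        "train": area_ids[:n_train],
        "val": area_ids[n_train:n_train + n_val],
        "test": area_ids[n_train + n_val:],
    }
    return {name: [s for a in ids for s in groups[a]]
            for name, ids in parts.items()}


# Illustrative data: 4 piers x 5 zones, 3 adjacent frames per zone.
samples = [{"area_id": f"pier{p}-zone{z}", "frame": f}
           for p in range(4) for z in range(5) for f in range(3)]
split = split_by_area(samples)
```

Grouping by area before shuffling is the design choice that matters here; a per-frame shuffle would scatter near-duplicate frames across subsets and inflate validation scores.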

[0010] Furthermore, step two specifically includes:
Step 21: Visible light image preprocessing for crack texture preservation and cross-modal fusion stability; the acquired underwater visible light images are subjected to luminance mapping correction, local contrast enhancement, and edge preservation enhancement. Luminance mapping correction is represented using the gamma transform as follows:

$$I_v'(x, y) = c \cdot I_v(x, y)^{\gamma}$$

[0011] In the formula, $I_v(x, y)$ represents the grayscale value of the original visible light image at pixel $(x, y)$; $I_v'(x, y)$ represents the corrected grayscale value; $c$ is the proportionality coefficient; and $\gamma$ is the gamma transform parameter.
Step 22: Sonar image preprocessing for structural contour preservation and significant echo enhancement; the sonar images are processed by speckle noise suppression, local echo enhancement, and contour stabilization. The speckle noise suppression is represented by Gaussian filtering as follows:

[0012] $$I_s'(x, y) = I_s(x, y) * G_{\sigma}(x, y)$$

[0013] In the formula, $I_s(x, y)$ represents the original sonar image; $I_s'(x, y)$ represents the filtered sonar image; $*$ represents the convolution operation; and $G_{\sigma}$ represents the Gaussian kernel function with standard deviation $\sigma$.
Step 23: Standardize the input scale and format; resample the preprocessed visible light image and sonar image, normalize the pixels, map the channels, and standardize the data format so that the two modalities meet the requirements for parallel input to the subsequent deep network.
Step 24: Perform coarse-scale initial registration and fine-scale feature registration.
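The two preprocessing formulas of Steps 21 and 22 can be sketched in plain Python on small grayscale images stored as nested lists with values in [0, 1]. This is an illustrative sketch only: the default values of `c`, `gamma`, `size`, and `sigma`, the clamping to 1.0, and the border replication are assumptions that the application does not specify.

```python
import math


def gamma_correct(img, c=1.0, gamma=0.6):
    """Apply I'(x, y) = c * I(x, y) ** gamma, clamped to 1.0 (assumption)."""
    return [[min(1.0, c * (v ** gamma)) for v in row] for row in img]


def gaussian_kernel(size=3, sigma=1.0):
    """Return a normalized size x size Gaussian kernel."""
    half = size // 2
    kernel = [[math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
               for j in range(-half, half + 1)]
              for i in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def gaussian_filter(img, size=3, sigma=1.0):
    """Convolve img with a Gaussian kernel, replicating border pixels."""
    h, w = len(img), len(img[0])
    half = size // 2
    kernel = gaussian_kernel(size, sigma)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(-half, half + 1):
                for j in range(-half, half + 1):
                    yy = min(max(y + i, 0), h - 1)  # clamp row index
                    xx = min(max(x + j, 0), w - 1)  # clamp column index
                    acc += kernel[i + half][j + half] * img[yy][xx]
            out[y][x] = acc
    return out


# A single bright speckle on a dark background is spread out by filtering.
speckle = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
smoothed = gaussian_filter(speckle)
brightened = gamma_correct([[0.25]], gamma=0.5)
```

With `gamma` below 1 the transform lifts dark regions, which suits low-illumination underwater frames; Gaussian smoothing trades some edge sharpness for speckle suppression, which is why the step pairs it with contour stabilization.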

[0014] Furthermore, Step 24 specifically includes: first, based on the correspondence between timestamps, pose information, acquisition order, or the overall outline of the bridge pier, an initial spatial mapping relationship is established between the visible light image and the sonar image to obtain a coarse-scale registration result; then, based on the coarse-scale registration result, local deviations are further corrected using edge contours, salient regions, or cross-modal similarity features to obtain a fine-scale registration result. The affine transformation relationship of the coarse-scale registration is expressed as follows:

$$\begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix} = T_0 \begin{bmatrix} x_v \\ y_v \\ 1 \end{bmatrix}$$

[0015] In the formula, $(x_v, y_v)$ represents the coordinates in the visible light image; $(x_s, y_s)$ represents the coordinates in the sonar image; and $T_0$ represents the initial transformation matrix at the coarse scale. Furthermore, fine-scale registration is achieved by minimizing the inconsistency between the two modalities in local regions, and its objective function is expressed as:

$$\min \; L_{reg} = \alpha L_{contour} + \beta L_{local}$$

[0016] In the formula, $L_{contour}$ represents the contour consistency loss, used to constrain the correspondence of the pier's external boundary between the two modalities; $L_{local}$ represents the local region consistency loss, used to constrain the spatial matching of local content; and $\alpha$ and $\beta$ are the weighting coefficients.
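The two-stage registration can be sketched as a coarse affine mapping followed by a small search that minimizes a weighted sum of a contour term and a local-region term. Everything below is a simplified illustration under explicit assumptions: point correspondences between the modalities are taken as known, both loss terms are mean squared point distances, and the fine stage only searches integer translations. The function names and default weights are hypothetical.

```python
def apply_affine(T, point):
    """Map (x, y) through a 3x3 homogeneous affine matrix T."""
    x, y = point
    return (T[0][0] * x + T[0][1] * y + T[0][2],
            T[1][0] * x + T[1][1] * y + T[1][2])


def mean_sq_dist(pts_a, pts_b):
    """Mean squared distance between corresponding 2D points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(pts_a, pts_b)) / len(pts_a)


def refine_translation(T0, vis_contour, son_contour, vis_local, son_local,
                       alpha=1.0, beta=0.5, radius=2):
    """Fine stage: search shifts around T0 minimizing
    alpha * contour_loss + beta * local_loss."""
    best_loss, best_T = None, None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            T = [row[:] for row in T0]
            T[0][2] += dx
            T[1][2] += dy
            l_contour = mean_sq_dist(
                [apply_affine(T, p) for p in vis_contour], son_contour)
            l_local = mean_sq_dist(
                [apply_affine(T, p) for p in vis_local], son_local)
            loss = alpha * l_contour + beta * l_local
            if best_loss is None or loss < best_loss:
                best_loss, best_T = loss, T
    return best_loss, best_T


# Coarse estimate (e.g. from pose data) is off by (+1, -1) pixels.
T0 = [[2, 0, 10], [0, 2, 5], [0, 0, 1]]
vis_contour = [(0, 0), (4, 0), (4, 6)]
son_contour = [(11, 4), (19, 4), (19, 16)]
vis_local = [(1, 1), (2, 3)]
son_local = [(13, 6), (15, 10)]
loss, T_fine = refine_translation(T0, vis_contour, son_contour,
                                  vis_local, son_local)
```

A practical implementation would estimate correspondences from edges or learned features and optimize all affine parameters; the brute-force translation search is used here only to keep the objective visible.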

[0017] Furthermore, step three specifically includes:
Step 31: Visible light and sonar dual-branch convolutional feature encoding; construct the visible light convolutional feature encoder and the sonar convolutional feature encoder corresponding to the fusion architecture diagram; input the preprocessed and registered visible light image $\hat{I}_v$ and sonar image $\hat{I}_s$ into the visible light convolutional feature encoder and the sonar convolutional feature encoder respectively to obtain the local visible light representation features $F_v$ and the local sonar representation features $F_s$, calculated as follows:

$$F_v = E_v(\hat{I}_v), \qquad F_s = E_s(\hat{I}_s)$$

[0018] In the formula, $E_v$ represents the visible light convolutional feature encoder; $E_s$ represents the sonar convolutional feature encoder; $F_v$ represents the visible light features, which contain information about crack textures and spalling boundaries; and $F_s$ represents the sonar features, which contain information about the pier outline and foundation boundaries.
Step 32: Construct hierarchical feature representations; during encoding, the receptive field is gradually expanded through strided convolution, pooling, or downsampling operations to form multi-scale hierarchical features that maintain both local detail expression and overall contour perception.
Step 33: BICAF dual-modal iterative cross-attention module; the visible light features $F_v$ and sonar features $F_s$ are input into the dual-modal iterative cross-attention fusion module to calculate the sonar-guided visible light update features and the visible light-guided sonar update features, respectively; in the first branch, visible light features are used as queries and sonar features as keys and values, and the roles are reversed in the second branch. The cross-attention calculation is expressed as follows:

[0019] $$\tilde{F}_v = \mathrm{softmax}\!\left(\frac{(F_v W_Q^{v})(F_s W_K^{s})^{\top}}{\sqrt{d}}\right) F_s W_V^{s}, \qquad \tilde{F}_s = \mathrm{softmax}\!\left(\frac{(F_s W_Q^{s})(F_v W_K^{v})^{\top}}{\sqrt{d}}\right) F_v W_V^{v}$$

[0020] In the formula, $W_Q^{v}$, $W_K^{s}$, $W_V^{s}$, $W_Q^{s}$, $W_K^{v}$, and $W_V^{v}$ represent the projection matrices for bimodal interaction, and $\sqrt{d}$ is the feature dimension normalization coefficient. The two cross-attention branches supplement the structural boundary representation in the visible light image with stable sonar contour information, and supplement the detail representation in the sonar image with local visible light texture information, respectively.
Step 34: Iterative gated update; gating coefficients are applied to the features updated through cross-attention to balance the contributions of the original modal features and the cross-modal compensation features; the update formula is expressed as:

[0021] $$F_v^{(t+1)} = g_v^{(t)} \odot \tilde{F}_v^{(t)} + \left(1 - g_v^{(t)}\right) \odot F_v^{(t)}, \qquad F_s^{(t+1)} = g_s^{(t)} \odot \tilde{F}_s^{(t)} + \left(1 - g_s^{(t)}\right) \odot F_s^{(t)}$$

[0022] In the formula, $F_v^{(t)}$ and $F_s^{(t)}$ represent the visible light and sonar features at the $t$-th iteration; $g_v^{(t)}$ and $g_s^{(t)}$ represent the gating weights at the $t$-th iteration; and $\odot$ represents element-wise multiplication.
Step 35: DETR global context modeling; the bimodal fusion features updated through iterative cross-attention are input into the DETR encoder to model the long-distance dependencies between the main pier body, local defect areas, and foundation boundaries, thus obtaining the global fusion features, calculated as:

$$F_g = \Phi_{\mathrm{DETR}}(F_{fuse})$$

[0023] In the formula, $F_{fuse}$ represents the fused features after bimodal interaction; $\Phi_{\mathrm{DETR}}$ represents the DETR global context modeling module; and $F_g$ represents the global fusion feature, which combines local defect details with the overall structural outline.
Step 36: Residual decoder construction; the global fusion feature $F_g$ is input into the residual decoder, which outputs the fused image $I_f$; the residual decoding process follows the scheme of "visible light base image + sonar supplemented residual":

$$I_f = \hat{I}_v + R$$

[0024] In the formula, $R$ represents the supplementary residual map generated by the residual decoder, used to enhance the representation of the pier structure outline, foundation boundaries, and local salient areas.
Step 37: Establish the training objective for the fusion model; the loss function of the fusion model includes one or more of pixel reconstruction loss, structure preservation loss, and edge consistency loss, and its total loss is expressed as:

$$L_{fuse} = \lambda_1 L_{rec} + \lambda_2 L_{edge} + \lambda_3 L_{sal} + \lambda_4 L_{con}$$

[0025] In the formula, $L_{rec}$ represents the basic reconstruction error; $L_{edge}$ represents the structural edge preservation constraint; $L_{sal}$ represents the sonar salient region constraint; $L_{con}$ represents the sonar local contrast constraint; and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weighting coefficients.
Step 38: Output fusion enhancement results; apply the trained fusion model to the multimodal samples to be tested and output the fused image.
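The bidirectional cross-attention and gated update of Steps 33 and 34 can be sketched on tiny feature matrices (tokens x channels) using only the standard library. This is an illustrative sketch, not the trained BICAF module: the projection matrices are identity matrices, the gate is a fixed constant rather than a learned weight, and the convex-combination form of the gate is an assumption about how the original and cross-modal features are balanced.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def cross_attention(f_q, f_kv, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d)) V, queries from one modality and
    keys/values from the other."""
    q, k, v = matmul(f_q, w_q), matmul(f_kv, w_k), matmul(f_kv, w_v)
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row]
              for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def gated_update(f_old, f_cross, gate):
    """Element-wise blend: gate * cross-modal + (1 - gate) * original."""
    return [[g * c + (1.0 - g) * o for o, c, g in zip(ro, rc, rg)]
            for ro, rc, rg in zip(f_old, f_cross, gate)]


def bicaf_iterate(f_v, f_s, proj_v, proj_s, gate_value=0.5, steps=2):
    """Alternate both attention directions for a fixed number of steps."""
    for _ in range(steps):
        cross_v = cross_attention(f_v, f_s, *proj_v)  # sonar-guided visible
        cross_s = cross_attention(f_s, f_v, *proj_s)  # visible-guided sonar
        gate_v = [[gate_value] * len(r) for r in f_v]
        gate_s = [[gate_value] * len(r) for r in f_s]
        f_v = gated_update(f_v, cross_v, gate_v)
        f_s = gated_update(f_s, cross_s, gate_s)
    return f_v, f_s


eye = [[1.0, 0.0], [0.0, 1.0]]
f_v = [[1.0, 0.0], [0.0, 1.0]]   # toy visible light tokens
f_s = [[2.0, 0.0], [0.0, 2.0]]   # toy sonar tokens
cross_v = cross_attention(f_v, f_s, eye, eye, eye)
out_v, out_s = bicaf_iterate(f_v, f_s, (eye, eye, eye), (eye, eye, eye))
```

Each row of `cross_v` is a convex combination of sonar value rows, weighted toward the sonar token most similar to the visible light query; the gate then decides how much of that compensation replaces the original feature.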

[0026] Furthermore, step four specifically includes:
Step 41: Construct a sample set of bridge pier surface defects; mark the crack areas, spalling areas, and localized defects in the fused images to form a sample set $\mathcal{D}$ of bridge pier surface defects; the labeling information for the $i$-th defect target is as follows:

[0027] $$a_i = (c_i, b_i), \qquad b_i = (x_i, y_i, w_i, h_i)$$

[0028] In the formula, $c_i$ represents the defect category label, which includes at least two categories, cracks and spalling; $b_i$ represents the target bounding box parameters; $(x_i, y_i)$ represents the center position of the target box; and $w_i$ and $h_i$ represent its width and height, respectively.
Step 42: Establish a defect detection backbone network; use a convolutional backbone network to extract features from the fused image to obtain defect representation feature maps at different scales.
Step 43: Introduce a coordinate attention module to enhance the perception of defect locations; for the output features of the backbone network, one-dimensional global aggregation is performed along the height and width directions respectively to obtain a direction-aware feature description of the bridge pier defect area; the coordinate attention is represented as:

[0029] $$z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

[0030] In the formula, $x_c$ represents the defect detection feature map of the $c$-th channel; $z_c^{h}$ represents the feature encoding result along the height direction; $z_c^{w}$ represents the feature encoding result along the width direction; and $H$ and $W$ represent the height and width of the feature map, respectively.
Step 44: Construct a multi-scale feature enhancement module; perform cross-layer aggregation and semantic fusion on the features output from different layers of the backbone network, so that the shallow high-resolution features and the deep high-semantic features work together to improve recognition of small defects, weak-texture defects, and defects with large scale variation.
Step 45: Establish a Transformer-based target decoding structure; use a Transformer encoder-decoder structure to perform global modeling and target querying on the fused features, and output the defect category and corresponding location parameters.
Step 46: Construct a joint loss function for bridge pier defect detection; optimization uses a combination of classification loss and bounding box regression loss, and the total loss is expressed as follows:

$$L_{det} = \lambda_{cls} L_{cls} + \lambda_{box} L_{box}$$

[0031] In the formula, $L_{cls}$ is the classification loss, $L_{box}$ is the bounding box regression loss, and $\lambda_{cls}$ and $\lambda_{box}$ are the weighting coefficients.
Step 47: Output defect detection results; output the defect category, bounding box coordinates, target confidence, area, length, or relative position parameters, and establish the correspondence between defect targets and original structural regions.
Step 48: Perform post-processing of defect results; based on the detection confidence threshold, category constraints, spatial adjacency relationships, or structural prior rules, screen, merge, and correct the initial detection results to improve the stability and interpretability of the final detection results.
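Step 48 leaves the screening and merging rules open. One common way to realize confidence screening with a category constraint is greedy non-maximum suppression, sketched below. Note that NMS is a substituted standard technique chosen for illustration, not a rule stated in the application; the dictionary keys and thresholds are hypothetical, and boxes use the (center x, center y, width, height) form from Step 41.

```python
def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def postprocess(dets, conf_thr=0.5, iou_thr=0.5):
    """Drop low-confidence detections, then suppress same-category
    boxes that overlap a higher-scoring kept box."""
    kept = []
    candidates = sorted((d for d in dets if d["score"] >= conf_thr),
                        key=lambda d: d["score"], reverse=True)
    for d in candidates:
        if all(k["category"] != d["category"]
               or iou(k["box"], d["box"]) < iou_thr for k in kept):
            kept.append(d)
    return kept


dets = [
    {"category": "crack", "box": (10.0, 10.0, 4.0, 4.0), "score": 0.9},
    {"category": "crack", "box": (10.5, 10.0, 4.0, 4.0), "score": 0.8},
    {"category": "spalling", "box": (10.0, 10.0, 4.0, 4.0), "score": 0.7},
    {"category": "crack", "box": (30.0, 30.0, 4.0, 4.0), "score": 0.3},
]
final = postprocess(dets)
```

Suppression is applied per category so that a crack running through a spalled area is not discarded; structural prior rules, such as rejecting detections outside the pier outline, would be added as further filters.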

[0032] Furthermore, step five specifically includes: Step 51: Construct a multi-view observation sequence; Step 52: Perform feature matching and camera pose recovery; Step 53: Recover local 3D points based on multi-view projection relationships; Step 54: Construct a point cloud screening and denoising mechanism constrained by the defect area; Step 55: Establish the mapping relationship between defect identification results and the 3D point cloud; Step 56: Extract spatial parameters of apparent defects; Step 57: Extract the scour depth and key dimensional parameters of the bridge pier foundation; Step 58: Construct the local three-dimensional representation result.
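Steps 56 and 57 name the parameters to extract but not the formulas. As a hedged illustration, a crack length can be approximated by the length of a 3D polyline through the points mapped from the detected crack, and a scour depth by the drop of the lowest scoured point below a reference bed elevation. Both definitions, the vertical z-axis convention, and the function names are assumptions made for this sketch.

```python
import math


def polyline_length(points):
    """Sum of segment lengths along an ordered list of 3D points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def scour_depth(reference_bed_z, scoured_points):
    """Depth of the lowest scoured point below the reference bed
    elevation, with z pointing up (assumption). Never negative."""
    lowest = min(p[2] for p in scoured_points)
    return max(0.0, reference_bed_z - lowest)


# Illustrative crack trace and scour-pit points in metres.
crack_trace = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
pit_points = [(1.0, 2.0, -10.5), (1.2, 2.1, -11.2), (1.4, 2.3, -10.8)]
length = polyline_length(crack_trace)
depth = scour_depth(-10.0, pit_points)
```

Both quantities depend on the point cloud carrying metric scale, which is why Step 12 records depth and pose information for scale recovery.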

[0033] Furthermore, step six specifically includes:
Step 61: Construct an inspection result association set; the fused image results obtained in Step 3, the defect identification results obtained in Step 4, and the local 3D parameter results obtained in Step 5 are uniformly organized to form an inspection result association set, which characterizes the pier surface defect information, foundation scour information, and their corresponding spatial parameter information; the inspection result association set is represented as follows:

$$\mathcal{A} = \{ I_f, \; Y_{det}, \; P_{3D} \}$$

[0034] In the formula, $I_f$ represents the image fusion result; $Y_{det}$ represents the defect identification results; and $P_{3D}$ represents the local three-dimensional parameter results.
Step 62: Generate visualized and structured inspection results; on the fused image, overlay the category, location, confidence level, and corresponding spatial parameters of each defect target, and simultaneously generate structured result entries; the structured results include one or more of the following: defect category, two-dimensional location, three-dimensional location, local size parameters, scour depth parameters, and key geometric parameters of the pier.
Step 63: Output intelligent inspection results; output the visualized inspection results and structured parameter results in a unified manner to form the intelligent inspection results for underwater bridge pier inspection tasks, which are used for structural status analysis, inspection record retention, and subsequent re-inspection and comparison; the output includes at least the fused enhanced image, the defect detection markings, the defect spatial parameters, and the bridge pier foundation scour and key dimension assessment results.
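The structured result entries of Step 62 can be sketched as a small record type serialized to JSON. The field names, units, and file path below are hypothetical choices made for illustration; the application lists only the kinds of information a structured result should carry.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DefectEntry:
    category: str                                  # e.g. "crack", "spalling"
    box_2d: Tuple[float, float, float, float]      # (cx, cy, w, h) in pixels
    confidence: float
    position_3d: Optional[Tuple[float, float, float]] = None
    length_m: Optional[float] = None


@dataclass
class InspectionResult:
    fused_image_path: str
    defects: List[DefectEntry] = field(default_factory=list)
    scour_depth_m: Optional[float] = None

    def to_json(self) -> str:
        """Serialize the association of fused image, detections,
        and 3D parameters into one structured entry."""
        return json.dumps(asdict(self), indent=2)


result = InspectionResult(
    fused_image_path="pier03_fused.png",
    defects=[DefectEntry("crack", (120.0, 80.0, 40.0, 6.0), 0.91,
                         position_3d=(1.2, 0.4, -7.5), length_m=0.35)],
    scour_depth_m=1.2,
)
record = json.loads(result.to_json())
```

Keeping the fused image reference, the detections, and the 3D parameters in one record mirrors the association set of Step 61 and makes later re-inspection comparisons a matter of diffing two records.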

[0035] The present invention also proposes an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the intelligent underwater structure inspection method based on multimodal deep learning of visible light and sonar images.

[0036] The present invention also proposes a computer-readable storage medium for storing computer instructions, which, when executed by a processor, implement the steps of the intelligent underwater structure inspection method based on multimodal deep learning of visible light and sonar images.

[0037] The beneficial effects of this invention are: 1. By combining visible light images with sonar images, we can fully leverage the advantages of visible light images in expressing texture details and the advantages of sonar images in representing the stability of structural contours in murky and low-light environments. 2. Overcome the limitations of single visible light images being severely affected by water scattering and absorption, and single sonar images having limited resolution and insufficient detail, and obtain fused images that have both visual clarity and structural integrity. 3. Using fused images as detection input, compared to using only a single modality image for recognition, it can more effectively preserve edge information, local texture information, and structural background information of diseased areas such as cracks and peeling. 4. By introducing a coordinate attention module and a multi-scale feature enhancement module into the detection network, the ability to express targets with weak features in slender cracks, small-sized peeling, and complex backgrounds can be enhanced, thereby reducing the false detection rate and the false detection rate, and improving the detection accuracy and inference stability. 5. The fused image obtained by this invention is superior in terms of structural contour integrity and surface detail representation, which is beneficial for subsequent visible light-sonar data alignment, point cloud generation and 3D model reconstruction, and provides a more reliable data foundation for quantitative analysis of pier scour depth, key dimensions and spatial distribution of local defects. 6. This invention establishes an intelligent inspection method based on multimodal deep learning, which enables more efficient and stable image analysis and disease identification in complex aquatic environments through intelligent inspection robots, reducing the degree of human intervention and lowering on-site operation risks. 
Description of the Drawings

[0038] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0039] Figure 1 is an overall flowchart of the intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images.

[0040] Figure 2 is a schematic diagram of the construction of the multimodal dataset of visible light and sonar images of underwater structures.

[0041] Figure 3 is a schematic diagram of the preprocessing and cross-modal spatial registration of visible light images and sonar images.

[0042] Figure 4 is a schematic diagram of the deep learning-based visible light-sonar multimodal fusion enhancement model.

[0043] Figure 5 is a schematic diagram of the intelligent detection model for apparent defects in underwater structures based on fused images.

[0044] Figure 6 is a schematic diagram of the local 3D representation and key parameter extraction constrained by the fusion recognition results.

[0045] Figure 7 is a schematic diagram of the intelligent inspection output based on the association of multimodal fusion results and recognition results.

Detailed Description of the Embodiments

[0046] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0047] Specifically, with reference to Figures 1-7, the present invention proposes an intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images. The method includes:
Step 1: Construct a multimodal dataset of visible light and sonar images of underwater structures; using visible light cameras and imaging sonar mounted on underwater robots, submersibles, or fixed inspection platforms, visible light images and sonar images of the target underwater structure area are collected synchronously or quasi-synchronously, together with the pose information, depth information, or timestamp information of the collection process, to construct a multimodal raw dataset for the same inspection object. Furthermore, Step 1 specifically includes:
Step 11: Multimodal synchronous acquisition; underwater visible light imaging equipment and imaging sonar equipment are used to inspect and acquire data of the target underwater structure synchronously or quasi-synchronously; the visible light imaging equipment acquires the surface texture, edges, color, and local defects of the structure, and the imaging sonar equipment acquires the structural outline, boundary echoes, and overall geometric response in turbid environments.
Step 12: Parameter acquisition and pose recording; while image and sonar data are acquired, one or more of the following are recorded: timestamp, depth, track coordinates, attitude angle, heading angle, relative installation relationship of the equipment, and sampling frequency, for subsequent time alignment, spatial registration, and scale recovery between the modalities.
Step 13: Cross-modal sample association and matching; based on the same inspection object, the same area, the same time period, or adjacent pose conditions, associations are established between visible light images and sonar images to form initial multimodal sample pairs; for data that is not strictly synchronized, candidate matching pairs are established based on temporal proximity, pose difference constraints, and structural contour similarity.
Step 14: Construction of a complex working condition dataset; multimodal samples are collected under different water depths, turbidities, illuminance levels, sampling distances, structural types, and defect types to form a dataset covering complex working conditions; the defect types include one or more of the following: cracks, spalling, erosion, local defects, cavitation, and abnormal adhesion.
Step 15: Sample screening and task labeling; samples with severe blurring, overexposure, invalid echoes, abrupt pose changes, missing targets, or for which no effective correspondence can be established are excluded, and the acquisition parameters and metadata tags of valid samples are retained; for image fusion tasks, paired input relationships between visible light images and sonar images are established; for defect detection tasks, defect areas are labeled with rectangular boxes, polylines, key points, or pixel-level masks; for 3D modeling tasks, mapping relationships between multi-view image sequences, pose information, and scale references are established.
Step 16: Stratified sampling and dataset partitioning; the data is sampled in a stratified manner according to the structure object, defect category, and working condition distribution, and divided into a training set, a validation set, and a test set, so as to avoid the data leakage caused by adjacent samples of the same target area falling into different data subsets.
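As a non-limiting illustration, the leakage-avoiding partition of Step 16 may be sketched in Python as follows. The sketch keeps all samples of one target area in a single subset and stratifies the areas by defect category. The dictionary keys `region_id` and `defect`, the function name `split_by_region`, and the 70/15/15 ratios are assumptions made for illustration and are not specified in the present disclosure.

```python
import random
from collections import defaultdict

def split_by_region(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split samples into train/val/test so that each target area stays in one subset."""
    rng = random.Random(seed)

    # Group samples by target area so adjacent frames are never separated.
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["region_id"]].append(sample)

    # Stratify the areas by defect category (assumption: one category per area,
    # taken from the first sample of the area).
    strata = defaultdict(list)
    for region_id, members in groups.items():
        strata[members[0]["defect"]].append(region_id)

    subsets = {"train": [], "val": [], "test": []}
    for stratum in sorted(strata):
        region_ids = sorted(strata[stratum])
        rng.shuffle(region_ids)
        n_train = round(ratios[0] * len(region_ids))
        n_val = round(ratios[1] * len(region_ids))
        for index, region_id in enumerate(region_ids):
            if index < n_train:
                name = "train"
            elif index < n_train + n_val:
                name = "val"
            else:
                name = "test"
            subsets[name].extend(groups[region_id])
    return subsets
```

Working condition (turbidity, depth, and the like) could be added to the stratum key in the same way; it is omitted here to keep the sketch short.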

[0048] Step 2: Preprocessing and staged spatial registration of visible light and sonar images under cross-modal fusion constraints; To address the significant differences between visible light images and sonar images in imaging mechanisms, grayscale response, edge features, and spatial scale, the following approach is adopted: First, the visible light images undergo brightness distribution correction, detail enhancement, and edge preservation processing to highlight cracks, spalling boundaries, and local texture information. The sonar images undergo speckle noise suppression, significant echo enhancement, and contour stabilization processing to enhance the main contour of the bridge piers, scour boundaries, and structural shape response. Based on this, a two-stage registration method of "coarse-scale initial alignment + fine-scale feature correction" is employed: first, the initial spatial transformation is determined using temporal synchronization relationships, pose information, or the correspondence of the overall structural contour, completing coarse-scale registration; then, local deviations are further estimated using edge contours, local salient regions, and cross-modal feature similarity to complete fine-scale registration. Through these processes, the differences in brightness distribution, noise interference, and spatial correspondence between the two modal data are reduced, providing a consistent input basis for subsequent bi-branch feature encoding, cross-modal cross-attention fusion, and defect detection based on the fused image.

[0049] Furthermore, step two specifically includes: Step 21: Visible light image preprocessing for crack texture preservation and cross-modal fusion stability; Brightness mapping correction, local contrast enhancement, and edge preservation enhancement are performed on the acquired underwater visible light images to improve the discernibility of crack edges, spalling contours, and details on the bridge pier surface, while reducing grayscale compression caused by water scattering, absorption, and low illumination. Brightness mapping correction is represented using gamma transform as follows:

[0050] $$I_v'(x, y) = c \cdot I_v(x, y)^{\gamma}$$
In the formula, $I_v(x, y)$ represents the grayscale value of the original visible light image at pixel $(x, y)$; $I_v'(x, y)$ represents the corrected grayscale value; $c$ is the proportionality coefficient; and $\gamma$ is the gamma transform parameter. This processing enables the dark crack textures and spalling boundaries in visible light images, which are originally less affected by turbidity, to obtain higher responses, providing more stable input features for the subsequent visible light branch feature encoding and cross-modal attention fusion.
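As a non-limiting illustration, the gamma transform of Step 21 may be sketched in Python with NumPy as follows. The function name `gamma_correct`, the 8-bit input range, and the default values `c=1.0` and `gamma=0.6` are assumptions chosen for illustration; the present disclosure does not fix these values.

```python
import numpy as np

def gamma_correct(image, c=1.0, gamma=0.6):
    """Apply I' = c * I**gamma to an 8-bit grayscale image."""
    normalized = image.astype(np.float64) / 255.0   # map to [0, 1]
    corrected = c * np.power(normalized, gamma)     # gamma < 1 brightens dark regions
    return np.clip(corrected * 255.0, 0, 255).round().astype(np.uint8)
```

With `gamma` below 1, dark pixels are lifted more than bright pixels, which matches the stated aim of raising the response of details in dark areas.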

[0051] Step 22: Sonar image preprocessing for structural contour preservation and significant echo enhancement; the sonar images are processed with speckle noise suppression, local echo enhancement, and contour stabilization to highlight the main boundaries of the bridge piers, the scour boundaries of the foundations, and local strong-echo areas, thereby reducing the interference of speckle noise and artifacts with the subsequent establishment of cross-modal correspondences. The speckle noise suppression is represented by Gaussian filtering as follows:

[0052] $$I_s'(x, y) = I_s(x, y) * G_{\sigma}(x, y)$$

[0053] In the formula, $I_s$ represents the original sonar image; $I_s'$ represents the filtered sonar image; $*$ represents the convolution operation; and $G_{\sigma}$ represents the Gaussian kernel function with standard deviation $\sigma$. Through the above processing, the structural information related to the overall outline and foundation morphology of the bridge pier in the sonar image is enhanced, making it more suitable as a cross-modal supplementary feature of the visible light image.
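As a non-limiting illustration, the Gaussian filtering of Step 22 may be sketched in Python with NumPy as follows. The function names, the kernel radius of about three standard deviations, and the reflective border handling are assumptions made for illustration and are not specified in the present disclosure.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Build a normalized 2D Gaussian kernel with a radius of about 3 * sigma."""
    radius = int(3 * sigma + 0.5)
    axis = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(axis, axis)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_filter(image, sigma=1.0):
    """Convolve a 2D image with a Gaussian kernel (reflective borders)."""
    kernel = gaussian_kernel(sigma)
    radius = kernel.shape[0] // 2
    padded = np.pad(image.astype(np.float64), radius, mode="reflect")
    height, width = image.shape
    filtered = np.zeros((height, width), dtype=np.float64)
    # The kernel is symmetric, so correlation and convolution coincide.
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            filtered += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return filtered
```

A larger `sigma` suppresses more speckle but also blurs the contours, so in practice the value would be tuned against the contour-preservation goal stated above.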

[0054] Step 23: Standardize the input scale and format; the preprocessed visible light image and sonar image are resampled, pixel-normalized, channel-mapped, and converted to a uniform data format, so that the two modalities meet the requirements for parallel input to the subsequent deep network. Step 24: Perform coarse-scale initial registration and fine-scale feature registration.

[0055] Furthermore, Step 24 specifically includes: First, based on the correspondence between timestamps, pose information, acquisition order, or the overall outline of the bridge pier, an initial spatial mapping relationship is established between the visible light image and the sonar image to obtain a coarse-scale registration result, whose purpose is to map the main pier region in the two modal images roughly to the same reference range. Subsequently, on the basis of the coarse registration result, local deviations are further corrected using edge contours, salient regions, or cross-modal similarity features to obtain a fine-scale registration result, whose purpose is to improve the spatial consistency of crack areas, spalling boundaries, and local contours in the dual-modal images. The affine transformation relationship of the coarse-scale registration is expressed as:

[0056] $$\begin{bmatrix} x_v \\ y_v \\ 1 \end{bmatrix} = T_0 \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$
In the formula, $(x_v, y_v)$ represents the coordinates of the visible light image; $(x_s, y_s)$ represents the coordinates of the sonar image; and $T_0$ represents the initial transformation matrix at the coarse scale. Furthermore, fine-scale registration is achieved by minimizing the inconsistency in the local regions of the two modalities, and its objective function is expressed as:

[0057] $$L_{reg} = \lambda_1 L_{contour} + \lambda_2 L_{local}$$
In the formula, $L_{contour}$ represents the contour consistency loss, used to constrain the correspondence of the pier's external boundary between the two modalities; $L_{local}$ represents the local region consistency loss, used to constrain the spatial matching of local content such as spalling areas and significant structural boundary regions; and $\lambda_1$ and $\lambda_2$ are the weighting coefficients. Through two-stage registration at the coarse and fine scales, the dual-modal inputs satisfy both the overall structural correspondence and the spatial consistency of local defect areas.
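As a non-limiting illustration, the coarse affine mapping and the weighted fine-registration objective may be sketched in Python with NumPy as follows. The sketch assumes that sonar coordinates are mapped into the visible light frame and that corresponding contour points and local-region points are already available as matched pairs; the mean squared distance used for both loss terms, the function names, and the default weights are likewise assumptions and are not specified in the present disclosure.

```python
import numpy as np

def apply_affine(T, points):
    """Map an (N, 2) array of points with a 3x3 homogeneous affine matrix T."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ T.T)[:, :2]

def registration_objective(T, sonar_contour, visible_contour,
                           sonar_local, visible_local,
                           w_contour=1.0, w_local=0.5):
    """Weighted sum of contour and local-region inconsistency (mean squared distance)."""
    contour_loss = np.mean(
        np.sum((apply_affine(T, sonar_contour) - visible_contour) ** 2, axis=1))
    local_loss = np.mean(
        np.sum((apply_affine(T, sonar_local) - visible_local) ** 2, axis=1))
    return w_contour * contour_loss + w_local * local_loss
```

Fine-scale registration would then search for the matrix `T` that minimizes this objective, starting from the coarse-scale estimate.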

[0058] Step 3: Construct a deep learning-based visible light-sonar multimodal fusion enhancement model; the preprocessed visible light image and sonar image are input into a dual-branch feature encoding network to extract the shallow texture information and deep semantic structure information of the two modalities. Bidirectional information interaction between the two modalities is realized through an iterative cross-attention module, so that the visible light features take in the stable contour information of the sonar and the sonar features take in the detailed texture information of the visible light. A DETR-based global context modeling module is then used to strengthen the expression of long-distance dependencies. Finally, the fused image is generated by the residual decoder. Furthermore, Step 3 specifically includes: Step 31: Visible light and sonar dual-branch convolutional feature encoding; a visible light convolutional feature encoder and a sonar convolutional feature encoder corresponding to the fusion architecture diagram are constructed; the preprocessed and registered visible light image $\tilde{I}_v$ and sonar image $\tilde{I}_s$ are input into the visible light convolutional feature encoder and the sonar convolutional feature encoder, respectively, to obtain the local visible light representation features $F_v$ and the local sonar representation features $F_s$, calculated as follows:

[0059] $$F_v = E_v(\tilde{I}_v), \qquad F_s = E_s(\tilde{I}_s)$$
In the formula, $E_v$ represents the visible light convolutional feature encoder; $E_s$ represents the sonar convolutional feature encoder; $F_v$ represents the visible light features, which contain information about crack textures and spalling boundaries; and $F_s$ represents the sonar features, which contain information about the pier outline and foundation boundaries. Step 32: Construct hierarchical feature representations; during encoding, the receptive field is gradually expanded through strided convolution, pooling, or downsampling operations to form multi-scale hierarchical features, so as to maintain both local detail expression and overall contour perception. Step 33: BICAF dual-modal iterative cross-attention module; the visible light features $F_v$ and the sonar features $F_s$ are input into the dual-modal iterative cross-attention fusion module to calculate the sonar-guided visible light update features and the visible light-guided sonar update features, respectively; with the visible light features used as queries and the sonar features used as keys and values, and symmetrically for the other branch, the cross-attention calculation is expressed as follows:

[0060] $$\hat{F}_v = \mathrm{softmax}\!\left(\frac{(F_v W_Q^v)(F_s W_K^s)^{\top}}{\sqrt{d}}\right) F_s W_V^s, \qquad \hat{F}_s = \mathrm{softmax}\!\left(\frac{(F_s W_Q^s)(F_v W_K^v)^{\top}}{\sqrt{d}}\right) F_v W_V^v$$

[0061] In the formula, $W_Q^v$, $W_K^s$, $W_V^s$, $W_Q^s$, $W_K^v$, and $W_V^v$ represent the projection matrices for bimodal interaction, and $\sqrt{d}$ is the feature dimension normalization coefficient; the two cross-attention branches are used, respectively, to supplement the structural boundary representation in the visible light image with the stable contour information of the sonar and to supplement the detail representation in the sonar image with the local texture information of the visible light. Step 34: Iterative gated update; gating coefficients are introduced into the features updated through cross-attention to balance the contributions of the original modal features and the cross-modal compensation features; the update formula is expressed as:

[0062] $$F_v^{(t+1)} = g_v^{(t)} \odot F_v^{(t)} + \left(1 - g_v^{(t)}\right) \odot \hat{F}_v^{(t)}, \qquad F_s^{(t+1)} = g_s^{(t)} \odot F_s^{(t)} + \left(1 - g_s^{(t)}\right) \odot \hat{F}_s^{(t)}$$

[0063] In the formula, $F_v^{(t)}$ and $F_s^{(t)}$ represent the visible light and sonar features at the $t$-th iteration; $\hat{F}_v^{(t)}$ and $\hat{F}_s^{(t)}$ represent the corresponding cross-attention outputs; $g_v^{(t)}$ and $g_s^{(t)}$ represent the gating weights at the $t$-th iteration; and $\odot$ indicates element-wise multiplication. Through this gating mechanism, the fusion process maintains a balance between preserving the original texture and contour information and introducing cross-modal compensation information.
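As a non-limiting illustration, one iteration of the bidirectional cross-attention and gated update of Steps 33 and 34 may be sketched in Python with NumPy as follows. Features are assumed to be token matrices of shape (number of tokens, feature dimension); the convex-combination form of the gate, the fixed scalar gates, the random projection matrices, and all function and key names are assumptions made for illustration, whereas in a trained model the projections and gates would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(query_feat, context_feat, W_q, W_k, W_v):
    """Scaled dot-product attention: queries from one modality, keys/values from the other."""
    Q, K, V = query_feat @ W_q, context_feat @ W_k, context_feat @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def gated_update(original, compensated, gate):
    """Blend original features with cross-modal compensation (assumed convex form)."""
    return gate * original + (1.0 - gate) * compensated

def bicaf_step(F_v, F_s, params):
    """One iteration: sonar-guided visible update and visible-guided sonar update."""
    A_v = cross_attention(F_v, F_s, *params["v"])   # visible queries, sonar keys/values
    A_s = cross_attention(F_s, F_v, *params["s"])   # sonar queries, visible keys/values
    return (gated_update(F_v, A_v, params["g_v"]),
            gated_update(F_s, A_s, params["g_s"]))
```

Calling `bicaf_step` repeatedly on its own outputs gives the iterative behaviour described above; a gate of 1 keeps the original modality unchanged and a gate of 0 replaces it entirely with the cross-modal compensation.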

[0064] Step 35: DETR global context modeling; the bimodal fusion features updated through iterative cross-attention are input into the DETR encoder to model the long-distance dependencies between the main pier body, local defect areas, and foundation boundaries, thus obtaining the global fusion features, calculated as:

[0065] $$F_g = \Phi_{DETR}(F_f)$$
In the formula, $F_f$ represents the fused features after bimodal interaction; $\Phi_{DETR}$ represents the DETR global context modeling module; and $F_g$ represents the global fusion features, which combine the relationship between local defect details and the overall structural outline. Step 36: Residual decoder construction; the global fusion features $F_g$ are input into the residual decoder $D_{res}$, which outputs the fused image $I_f$; the residual decoding process follows the form "visible light base image + sonar-supplemented residual":

[0066] $$I_f = \tilde{I}_v + R, \qquad R = D_{res}(F_g)$$
In the formula, $\tilde{I}_v$ represents the preprocessed and registered visible light image, and $R$ represents the supplementary residual map generated by the residual decoder, used to enhance the representation of the pier structure outline, foundation boundaries, and local salient areas. Step 37: Establish the training objective of the fusion model; the loss function of the fusion model includes one or more of a pixel reconstruction loss, a structure preservation loss, and an edge consistency loss, and its total loss is expressed as:

[0067] $$L_{fuse} = \lambda_1 L_{rec} + \lambda_2 L_{edge} + \lambda_3 L_{sal} + \lambda_4 L_{con}$$
In the formula, $L_{rec}$ represents the basic reconstruction error; $L_{edge}$ represents the structure edge preservation constraint; $L_{sal}$ represents the sonar significant region constraint; $L_{con}$ represents the sonar local contrast constraint; and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weighting coefficients. Step 38: Output the fusion enhancement results; the trained fusion model is applied to the multimodal samples to be tested, and a fused image with high texture clarity, strong boundary stability, and good resistance to turbidity interference is output.
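As a non-limiting illustration, the residual composition of Step 36 and a weighted four-term fusion loss of the kind named in Step 37 may be sketched in Python with NumPy as follows. The present disclosure names the loss terms but does not give their formulas, so every term definition below (mean absolute error, gradient-magnitude matching, a mean-plus-one-standard-deviation saliency mask, and a standard-deviation contrast proxy), the default weights, and the function names are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def fuse(visible, residual):
    """Fused image = visible base image + sonar-supplemented residual, clipped to [0, 1]."""
    return np.clip(visible + residual, 0.0, 1.0)

def gradient_magnitude(image):
    grad_y, grad_x = np.gradient(image)
    return np.hypot(grad_x, grad_y)

def fusion_loss(fused, visible, sonar, weights=(1.0, 0.5, 0.5, 0.1)):
    """Weighted sum of four illustrative terms; each definition is an assumption."""
    w_rec, w_edge, w_sal, w_con = weights
    # Reconstruction: stay close to the visible base image.
    rec = np.mean(np.abs(fused - visible))
    # Edge preservation: match the stronger edge response of the two inputs.
    target_edges = np.maximum(gradient_magnitude(visible), gradient_magnitude(sonar))
    edge = np.mean(np.abs(gradient_magnitude(fused) - target_edges))
    # Sonar salient regions: follow the sonar where its echo is strong.
    salient = sonar > sonar.mean() + sonar.std()
    sal = np.mean(salient * np.abs(fused - sonar))
    # Contrast: crude global proxy for the sonar local contrast constraint.
    con = abs(fused.std() - sonar.std())
    return w_rec * rec + w_edge * edge + w_sal * sal + w_con * con
```

In a real training setup these terms would be computed on tensors inside an automatic differentiation framework; NumPy is used here only to make the structure of the weighted sum concrete.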

[0068] Step 4: Conduct intelligent detection of apparent defects in underwater structures based on fused images; first, a detection dataset for typical defects such as cracks, spalling, local defects, and erosion marks is constructed, and an object detection model based on the DETR framework is established with fused images as input. Second, a coordinate attention module is introduced after the backbone network to improve the network's sensitivity to defect location. Third, a multi-scale feature enhancement module is provided to improve the model's feature representation of small-scale cracks, weak-texture spalling, and irregular defect areas in complex backgrounds. Finally, a Transformer encoder-decoder is used to complete defect category prediction and bounding box regression.

[0069] Furthermore, Step 4 specifically includes: Step 41: Construct a sample set of bridge pier surface defects; the cracked areas, spalling areas, and localized defects in the fused images are marked to form the sample set of bridge pier surface defects; for the $i$-th defect target, the labeling information is as follows:

[0070] $$a_i = (c_i, b_i), \qquad b_i = (x_i, y_i, w_i, h_i)$$

[0071] In the formula, $c_i$ indicates the defect category label, including at least two categories: cracks and peeling; $b_i$ indicates the target bounding box parameters; $(x_i, y_i)$ indicates the center position of the target box; and $w_i$ and $h_i$ represent its width and height, respectively. Step 42: Establish a defect detection backbone network; a convolutional backbone network is used to extract features from the fused image to obtain defect representation feature maps at different scales. Step 43: Introduce a coordinate attention module to enhance the perception of defect location; one-dimensional global aggregation is performed on the output features of the backbone network along the height and width directions, respectively, to obtain a direction-aware feature description of the bridge pier defect area; the coordinate attention is represented as:

[0072] $$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

[0073] In the formula, $x_c$ indicates the defect detection feature map of the $c$-th channel; $z_c^h$ represents the feature encoding result along the height direction; $z_c^w$ represents the feature encoding result along the width direction; and $H$ and $W$ represent the height and width of the feature map, respectively. This module encodes the elongated direction of a crack, the local boundary of a spalling area, and the relative positional relationship of defects into the detection features, thereby improving the ability to locate elongated cracks and local weak-texture defects.
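As a non-limiting illustration, the direction-wise aggregation of Step 43 may be sketched in Python with NumPy as follows. The published coordinate attention design places learned convolutions between the pooling and the sigmoid gates; this sketch omits them and applies the sigmoid directly to the pooled descriptors, which is a simplifying assumption, as are the function names and the (channels, height, width) layout.

```python
import numpy as np

def directional_pooling(features):
    """Average a (C, H, W) feature map along the width and along the height."""
    z_h = features.mean(axis=2)   # (C, H): one descriptor per row
    z_w = features.mean(axis=1)   # (C, W): one descriptor per column
    return z_h, z_w

def coordinate_attention(features):
    """Reweight features with row-wise and column-wise gates (learned layers omitted)."""
    z_h, z_w = directional_pooling(features)
    gate_h = 1.0 / (1.0 + np.exp(-z_h))   # sigmoid
    gate_w = 1.0 / (1.0 + np.exp(-z_w))
    return features * gate_h[:, :, None] * gate_w[:, None, :]
```

For a vertical, crack-like stripe, the column descriptor `z_w` peaks at the stripe's column while the row descriptor `z_h` stays flat, which is how the elongated direction of a crack becomes visible to the detector.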

[0074] Step 44: Construct a multi-scale feature enhancement module; cross-layer aggregation and semantic fusion are performed on the features output from different layers of the backbone network, so that the shallow high-resolution features and the deep high-semantic features work together to enhance the recognition of small-target defects, weak-texture defects, and defects with obvious scale changes. Step 45: Establish a Transformer-based target decoding structure; a Transformer encoder-decoder structure is used to perform global modeling and target querying on the fused features, and to output the defect category and the corresponding location parameters. Step 46: Construct a joint loss function for the detection of bridge pier defects; a combination of classification loss and bounding box regression loss is used for optimization, and the total loss is expressed as: $$L_{det} = L_{cls} + L_{box}$$

[0075] Among them, the classification loss $L_{cls}$ combines the cross-entropy loss and Focal Loss, and is expressed as follows:

[0076] $$L_{cls} = \lambda_1 L_{CE} + \lambda_2 L_{FL}$$

[0077] $$L_{CE} = -\sum_{k} y_k \log p_k, \qquad L_{FL} = -\alpha \,(1 - p_t)^{\gamma} \log p_t$$

[0078] The bounding box regression loss $L_{box}$ combines the L1 loss and the generalized intersection over union (GIoU) loss, and is expressed as: $$L_{box} = \lambda_3 L_{L1} + \lambda_4 L_{GIoU}$$

[0079] In the formula, $p_k$ indicates the predicted probability of the $k$-th defect type; $y_k$ indicates the corresponding ground-truth label; $p_t$ indicates the predicted probability of the target category; and $\lambda_1$, $\lambda_2$, $\alpha$, $\gamma$, $\lambda_3$, and $\lambda_4$ are the weighting coefficients. This joint loss function is used to improve both the classification accuracy and the location regression accuracy for imbalanced targets such as cracks and spalling. Step 47: Output the defect detection results; the defect category, bounding box coordinates, target confidence, and area, length, or relative position parameters are output, and the correspondence between defect targets and the original structural regions is established. Step 48: Post-process the defect results; based on the detection confidence threshold, category constraints, spatial adjacency relationships, or structural prior rules, the initial detection results are screened, merged, and corrected to improve the stability and interpretability of the final detection results.
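As a non-limiting illustration, the Focal Loss term and the combined L1 and generalized intersection over union (GIoU) box loss of Step 46 may be sketched in pure Python as follows. Boxes use the center format (x, y, w, h) of the labeling step; the alpha-balanced form of Focal Loss, the default values `alpha=0.25`, `gamma=2.0`, `w_l1=5.0`, and `w_giou=2.0`, and the function names are assumptions chosen for illustration and are not specified in the present disclosure.

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal Loss for one target with predicted probability p_t of the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def to_corners(box):
    """Convert (x_center, y_center, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2

def giou(box_a, box_b):
    """Generalized IoU of two center-format boxes, in the range (-1, 1]."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, w_l1=5.0, w_giou=2.0):
    """Weighted L1 + GIoU loss between a predicted and a ground-truth box."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, target))
```

The factor `(1 - p_t) ** gamma` shrinks the loss of well-classified targets, which is why Focal Loss is commonly paired with cross-entropy when defect classes are imbalanced.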

[0080] Step 5: Local 3D representation and key parameter extraction of underwater structures constrained by the fusion recognition results; to meet the needs of surface defect identification and foundation scour identification for underwater bridge piers, after the fused images and defect detection results are obtained, a local 3D representation workflow associated with the identification results is constructed, using the detected cracks, spalling, local defect areas, and pier foundation boundary areas as spatial constraints. Specifically, point cloud reconstruction of the pier surface and foundation neighborhood is performed using the registered multimodal image sequences and multi-view geometric relationships; then, the 2D defect areas, boundary contours, and structurally significant areas obtained in the defect detection stage are mapped to 3D space, and parameters such as crack location, spalling area range, pier foundation scour depth, and local geometric dimensions are extracted. This step is not a general 3D reconstruction of the entire structure, but a constrained 3D modeling and parameter extraction for the target area within the "fusion enhancement result - defect identification result - key parameter evaluation" technology chain, thereby transferring 2D identification results to spatial quantification results. Furthermore, Step 5 specifically includes: Step 51: Construct a multi-view observation sequence. The fused images of the same target bridge pier area obtained at different observation times are combined to form a multi-view observation sequence, denoted as:

[0081] $$\mathcal{I} = \{ I_f^{k} \}_{k=1}^{N_v}$$
In the formula, $I_f^{k}$ indicates the fused image from the $k$-th viewpoint, and $N_v$ indicates the number of observation viewpoints.

[0082] The defect detection results obtained in Step 4 are recorded as follows:

[0083] $$\mathcal{D} = \{ (c_i, b_i, s_i) \}_{i=1}^{N_d}$$
In the formula, $c_i$ indicates the category of the $i$-th defect target; $b_i$ indicates the target bounding box parameters; $s_i$ indicates the target confidence; and $N_d$ indicates the number of identified targets. Based on the bounding boxes of the defects and the boundary regions of the bridge pier foundations, the regions of interest to be represented in 3D are extracted from the fused images to form local modeling inputs constrained by the recognition results, rather than performing indiscriminate 3D reconstruction of the entire scene.

[0084] Step 52: Perform feature matching and camera pose recovery. Salient feature points are extracted within the constrained local region of interest and cross-view matching relationships are established. Let the corresponding feature points in the $a$-th viewpoint and the $b$-th viewpoint be $p_j^{a}$ and $p_j^{b}$, respectively; the matching set is then represented as:

[0085] $$\mathcal{M}_{ab} = \{ (p_j^{a}, p_j^{b}) \}_{j=1}^{N_m}$$
In the formula, $N_m$ indicates the number of locally matched points. Based on the extrinsic parameter relationships between viewpoints recovered from the set of matched points, the viewpoint geometric model required for local 3D reconstruction is established.

[0086] Step 53: Recover local 3D points based on multi-view projection relationships. For a 3D point $X_j$ within the local region of interest, its projection relationship in the $k$-th viewpoint is represented as follows:

[0087] $$s \begin{bmatrix} u_j^{k} \\ v_j^{k} \\ 1 \end{bmatrix} = K_k \, [\, R_k \mid t_k \,] \begin{bmatrix} X_j \\ 1 \end{bmatrix}$$
In the formula, $(u_j^{k}, v_j^{k})$ represents the pixel coordinates of the 3D point $X_j$ in the $k$-th viewpoint; $K_k$ indicates the intrinsic parameter matrix of the $k$-th viewpoint; $R_k$ and $t_k$ represent the rotation matrix and translation vector, respectively; and $s$ represents the scale factor. Based on the corresponding projection relationships from multiple viewpoints, the local region of interest is triangulated to obtain the initial local point cloud: $$P_0 = \{ X_j \}_{j=1}^{N_p}$$

[0088] In the formula, $N_p$ indicates the number of local 3D points recovered.
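As a non-limiting illustration, the pinhole projection relationship of Step 53 and a two-view linear triangulation may be sketched in Python with NumPy as follows. The direct linear transform (DLT) solved by singular value decomposition is a standard technique that is swapped in here for illustration; the present disclosure does not state which triangulation method is used, and the function names are assumptions.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 matrix K [R | t] of one viewpoint."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(K, R, t, X):
    """Project a 3D point X to pixel coordinates (u, v)."""
    uvw = K @ (R @ X + t)
    return uvw[:2] / uvw[2]        # divide by the scale factor s (the depth)

def triangulate(P1, P2, uv1, uv2):
    """Recover a 3D point from two views with the direct linear transform."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                   # right singular vector of the smallest singular value
    return X_h[:3] / X_h[3]
```

With more than two viewpoints, two rows per view are simply appended to `A`; with noisy matches, the linear solution is usually refined by minimizing the reprojection error.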

[0089] Step 54: Construct a point cloud filtering and denoising mechanism constrained by the defect area. For each point $X_i$ in the initial local point cloud, the average distance to its $k$ nearest neighbors is calculated:

[0090] $$\bar{d}_i = \frac{1}{k} \sum_{j=1}^{k} \left\lVert X_i - X_i^{(j)} \right\rVert_2$$
In the formula, $X_i^{(j)}$ is the $j$-th nearest neighbor of point $X_i$. If the following condition is met:

[0091] $$\bar{d}_i > \mu_d + \alpha \, \sigma_d$$
then the point $X_i$ is identified as an outlier and removed; here, $\mu_d$ represents the average neighborhood distance of the local point cloud, $\sigma_d$ indicates the standard deviation of the neighborhood distance, and $\alpha$ represents the threshold coefficient. On this basis, only the point cloud subsets corresponding to the defect areas of concern $P_{def}$, the boundary areas of the bridge pier foundations $P_{bnd}$, and the scour-sensitive areas $P_{sens}$ are retained to form the local valid point cloud set: $$P_{valid} = P_{def} \cup P_{bnd} \cup P_{sens}$$
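As a non-limiting illustration, the statistical outlier rule of Step 54 (mean distance to the nearest neighbors compared with the cloud-wide mean plus a multiple of the standard deviation) may be sketched in Python with NumPy as follows. The brute-force distance matrix, the default values `k=4` and `alpha=2.0`, and the function name are assumptions chosen for illustration; a spatial index would be preferable for large clouds.

```python
import numpy as np

def remove_outliers(points, k=4, alpha=2.0):
    """Drop points whose mean k-nearest-neighbor distance exceeds mean + alpha * std."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)              # (N, N) pairwise distances
    # After sorting each row, column 0 is the distance of a point to itself (0).
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    mu, sigma = knn_mean.mean(), knn_mean.std()
    keep = knn_mean <= mu + alpha * sigma
    return points[keep], keep
```

The returned boolean mask can be combined with region-of-interest masks to keep only the defect, foundation boundary, and scour-sensitive subsets.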

[0092] Step 55: Establish the mapping relationship between the defect identification results and the 3D point cloud. The 2D defect identification results obtained in Step 4 are mapped to the local 3D point cloud, and the correspondence between defect category, 2D bounding box, and 3D spatial points is constructed. For the $i$-th defect target, its 3D point set is represented as follows:

[0093] $$P_i = \{ X \in P_{valid} \mid \pi_k(X) \in \Omega_i^{k} \}$$
In the formula, $P_{valid}$ indicates the local valid point cloud set; $\pi_k$ indicates the projection mapping function under the $k$-th viewpoint; and $\Omega_i^{k}$ indicates the 2D region of the $i$-th defect target in the $k$-th viewpoint. Through this mapping relationship, 2D identification results such as cracks, spalling, and local defects are transformed into the corresponding 3D spatial objects.

[0094] Step 56: Extract the spatial parameters of apparent defects. For the 3D point set $P_i$ of the $i$-th defect target, its spatial location, extent, and local geometric quantities are extracted. Its spatial center can be represented as:

[0095] $$\bar{X}_i = \frac{1}{|P_i|} \sum_{X \in P_i} X$$
In the formula, $\bar{X}_i$ indicates the 3D center position of the $i$-th defect target. When the target is spalling or a local defect, the defect area can be estimated from the projected area of its 3D point set on the reference fitting plane; when the target is a crack, its 3D extension scale can be estimated from the principal-direction length of the crack point set.
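As a non-limiting illustration, the 2D-to-3D association of Step 55 and two of the spatial parameters of Step 56 (the 3D center and the principal-direction length of a crack) may be sketched in Python with NumPy as follows. The sketch assumes that the pixel projection of every 3D point in the viewpoint of interest is already available, that the defect region is a center-format box (x, y, w, h), and that the principal direction is taken from a singular value decomposition of the centered points; these choices and the function names are assumptions made for illustration.

```python
import numpy as np

def points_in_box(points, pixels, box):
    """Select the 3D points whose projected pixels fall inside a center-format box."""
    x, y, w, h = box
    inside = (np.abs(pixels[:, 0] - x) <= w / 2) & (np.abs(pixels[:, 1] - y) <= h / 2)
    return points[inside]

def centroid(points):
    """3D center of a defect point set."""
    return points.mean(axis=0)

def principal_length(points):
    """Extent of a point set along its first principal direction."""
    centered = points - points.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    along = centered @ Vt[0]      # coordinates along the dominant direction
    return along.max() - along.min()
```

For a crack, `principal_length` gives a straight-line extent; a strongly curved crack would need a polyline fit instead, which is outside the scope of this sketch.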

[0096] Step 57: Extract the scour depth and key dimensional parameters of the bridge pier foundation. For the bridge pier foundation boundary and scour-sensitive areas, the foundation surface point set $P_{base}$ and the scour depression point set $P_{scour}$ are extracted from the local valid point cloud. Let the reference plane equation of the bridge pier foundation be:

[0097] $$A x + B y + C z + D = 0$$
Then the distance from any point $X = (x, y, z)$ to the reference plane is expressed as: $$d(X) = \frac{\lvert A x + B y + C z + D \rvert}{\sqrt{A^2 + B^2 + C^2}}$$

[0098] In the formula, $d(X)$ is the normal distance from point $X$ to the reference plane. The maximum, average, or quantile distance from the points within the scour area to the reference plane, with the chosen statistic written as $\Phi$, is used as the measure of the scour depth of the bridge pier foundation, denoted as:

[0099] $$h_s = \Phi\big( \{ d(X) : X \in P_{scour} \} \big)$$

[0100] Furthermore, based on the spatial distribution of the pier boundary point set, key geometric parameters such as the local cross-sectional width, height, and boundary contour dimensions of the pier can be extracted.
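As a non-limiting illustration, the point-to-plane distance and the scour depth statistic of Step 57 may be sketched in pure Python as follows. The plane is given by its four coefficients; the function names and the `mode` values are assumptions chosen for illustration, and the median stands in for the more general quantile option.

```python
import math
import statistics

def plane_distance(point, plane):
    """Normal distance from a 3D point to the plane A*x + B*y + C*z + D = 0."""
    a, b, c, d = plane
    x, y, z = point
    return abs(a * x + b * y + c * z + d) / math.sqrt(a * a + b * b + c * c)

def scour_depth(scour_points, plane, mode="max"):
    """Summarize the distances of scour-area points to the reference plane."""
    distances = [plane_distance(p, plane) for p in scour_points]
    if mode == "max":
        return max(distances)
    if mode == "mean":
        return sum(distances) / len(distances)
    if mode == "median":
        return statistics.median(distances)
    raise ValueError(f"unsupported mode: {mode}")
```

The maximum is sensitive to residual outliers in the point cloud, so a mean or quantile may be the more robust choice when the cloud is noisy.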

[0101] Step 58: Construct the local 3D representation. The spatial parameters of the apparent defects, the scour depth of the bridge pier foundation, and the key dimensional parameters are uniformly organized to form the local 3D representation:

[0102] $$\mathcal{R}_{3D} = \{ \bar{X}_i, A_i, L_i, h_s, G \}$$
In the formula, $\bar{X}_i$ indicates the 3D center position of the $i$-th defect target; $A_i$ indicates the spatial area parameter of the $i$-th defect target; $L_i$ indicates the spatial length parameter of the $i$-th crack target; $h_s$ indicates the scour depth of the bridge pier foundation; and $G$ represents the set of key geometric dimensional parameters of the bridge pier foundation. The results are used to establish a closed-loop correspondence of "fused image - defect identification - spatial parameter evaluation".

[0103] Step 6: Intelligent inspection output based on the association of multimodal fusion results and recognition results; the fused and enhanced image obtained in Step 3, the defect identification results obtained in Step 4, and the local 3D parameter results obtained in Step 5 are uniformly associated to generate an intelligent inspection output that includes the defect type, location, spatial parameters, and bridge pier foundation scour assessment information, realizing an integrated inspection process of fused image visualization, recognition result expression, and parameter evaluation output.

[0104] Furthermore, Step 6 specifically includes: Step 61: Construct an inspection result association set; the fused image results obtained in Step 3, the defect identification results obtained in Step 4, and the local 3D parameter results obtained in Step 5 are uniformly organized to form an inspection result association set, which characterizes the pier surface defect information, the foundation scour information, and the corresponding spatial parameter information; the inspection result association set is represented as follows:

[0105] $$\mathcal{O} = \{ I_f, \mathcal{D}, \mathcal{R}_{3D} \}$$
In the formula, $I_f$ indicates the image fusion result; $\mathcal{D}$ indicates the defect identification results; and $\mathcal{R}_{3D}$ represents the local 3D parameter results. Step 62: Generate visualized and structured inspection results; on the basis of the fused image, the category, location, confidence, and corresponding spatial parameters of each defect target are overlaid and displayed, and structured result entries are generated at the same time; the structured results include one or more of the following: defect category, 2D location, 3D location, local size parameters, scour depth parameters, and key geometric parameters of the pier. Step 63: Output the intelligent inspection results; the visualized inspection results and structured parameter results are output in a unified manner to form the intelligent inspection results for underwater bridge pier inspection tasks, which are used for structural status analysis, inspection record retention, and subsequent re-inspection and comparison; the output includes at least the fused enhanced image, the defect detection marking results, the defect spatial parameters, and the bridge pier foundation scour and key dimension assessment results.
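As a non-limiting illustration, the structured result entries of Step 62 may be sketched in Python as follows. All class names, field names, and the use of meters and JSON are hypothetical choices made for illustration; the present disclosure lists the information to be output but does not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class DefectRecord:
    category: str                                   # e.g. "crack" or "peeling"
    bbox_xywh: Tuple[float, float, float, float]    # 2D location in the fused image
    confidence: float
    center_3d: Optional[Tuple[float, float, float]] = None
    area_m2: Optional[float] = None                 # for spalling or local defects
    length_m: Optional[float] = None                # for cracks

@dataclass
class InspectionResult:
    fused_image_path: str
    defects: List[DefectRecord] = field(default_factory=list)
    scour_depth_m: Optional[float] = None
    pier_dimensions_m: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the association set for record retention and re-inspection."""
        return json.dumps(asdict(self), indent=2)
```

A record of this kind keeps the fused image, the detections, and the 3D parameters tied together, so a later re-inspection can be compared entry by entry.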

[0106] Example: This embodiment concerns the intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images, and takes the detection of crack and spalling defects in underwater bridge piers as an example. The overall process includes: dual-modal data construction, image preprocessing, dual-modal image fusion, defect target detection, and 3D modeling and parameter extraction.

[0107] 1. Dual-modal data construction: First, visible light and sonar images of the target underwater structure region are acquired. A dual-modal sample correspondence is established based on the same structural scene or adjacent observation scenes, forming input data that can be used for image fusion and defect detection. Targets in the defect detection task are manually labeled. Each target instance is represented by bounding box parameters (x, y, w, h) and a category label, where the category label is one of two types: crack or spalling.

[0108] 2. Dual-modal image preprocessing: Before feeding the data into the deep learning network, the visible light image and the sonar image are preprocessed separately. Gamma correction is performed on the visible light image to improve the brightness distribution of the underwater image and enhance details in dark areas; Gaussian filtering is performed on the sonar image to suppress noise and improve the stability of subsequent feature extraction; if necessary, the dual-modal images are resized and contrast-enhanced to meet the input requirements of the subsequent network. Through the above processing, the impact of environmental interference on the quality of the original data can be reduced, providing a more stable data foundation for subsequent dual-modal feature extraction.
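The two preprocessing operations named above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default values of `gamma`, `c`, and `sigma`, the 3-sigma kernel radius, and the edge-replication padding are all assumptions made for this example, and images are plain nested lists so that no third-party library is needed.

```python
import math

def gamma_correct(image, gamma=0.6, c=1.0, max_value=255.0):
    """Gamma transform on a 2D list: out = c * max * (in / max) ** gamma, clipped to max."""
    return [[min(max_value, c * max_value * (p / max_value) ** gamma) for p in row]
            for row in image]

def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel; the 3-sigma radius is an assumed default."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def _convolve_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                idx = min(max(x + k - radius, 0), n - 1)
                acc += w * row[idx]
            new_row.append(acc)
        out.append(new_row)
    return out

def _transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_filter(image, sigma=1.0):
    """Separable Gaussian filtering: smooth along rows, then along columns."""
    kernel = gaussian_kernel_1d(sigma)
    blurred_rows = _convolve_rows(image, kernel)
    return _transpose(_convolve_rows(_transpose(blurred_rows), kernel))

# Example: brighten a dark visible light patch and smooth a noisy sonar patch.
visible_patch = [[10.0, 40.0], [90.0, 250.0]]
sonar_patch = [[0.0, 0.0, 0.0], [0.0, 90.0, 0.0], [0.0, 0.0, 0.0]]
enhanced_visible = gamma_correct(visible_patch, gamma=0.6)
smoothed_sonar = gaussian_filter(sonar_patch, sigma=1.0)
```

With `gamma` below 1, dark pixels are lifted more than bright ones, which matches the stated goal of enhancing details in dark areas; a larger `sigma` suppresses more sonar noise at the cost of blurring contours.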

[0109] 3. Dual-Modal Image Fusion: The preprocessed visible light and sonar images are input into a dual-branch convolutional feature encoding module to extract the corresponding local representation features. The features of the two modalities are then input into the BICAF bidirectional iterative cross-attention fusion module. This bidirectional attention mechanism enables information interaction between the modalities, so that the visible light branch is supplemented with the sonar's structural contour information and the sonar branch is supplemented with the visible light's local texture information. Next, the fused features are input into a DETR Transformer encoder to model the global dependencies and structural semantic relationships in the cross-modal features, enhancing their expressive power. Finally, the encoded fused features are input into a residual decoder to generate a fused image. This fused image preserves the visual continuity and local texture details of the visible light image while also including the supplementary structural contour and overall morphology information from the sonar image.
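The bidirectional interaction described above can be illustrated with a small scaled dot-product cross-attention sketch. It is a simplification under stated assumptions: learned projection matrices are omitted (identity projections), features are short lists of token vectors, and the scalar `gate` blend stands in for whatever update rule the BICAF module actually uses; the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V with identity projections (an assumption).

    queries: N vectors of length d; keys and values: M vectors each.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    dim_v = len(values[0])
    output = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(dim_v)])
    return output

def _blend(old, new, gate):
    """Mix original tokens with cross-modal updates using a fixed scalar gate."""
    return [[(1.0 - gate) * o + gate * n for o, n in zip(row_old, row_new)]
            for row_old, row_new in zip(old, new)]

def bidirectional_step(vis_tokens, son_tokens, gate=0.5):
    """One interaction round: each modality queries the other, then blends."""
    vis_from_son = cross_attention(vis_tokens, son_tokens, son_tokens)
    son_from_vis = cross_attention(son_tokens, vis_tokens, vis_tokens)
    return _blend(vis_tokens, vis_from_son, gate), _blend(son_tokens, son_from_vis, gate)

# Example: two visible light tokens and three sonar tokens, two interaction rounds.
vis = [[1.0, 0.0], [0.0, 1.0]]
son = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
for _ in range(2):
    vis, son = bidirectional_step(vis, son, gate=0.5)
```

Each modality keeps its own token count, so the visible light branch gains sonar-derived context without losing its spatial layout, and vice versa.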

[0110] 4. Defect Target Detection Based on Fused Images: The fused images are input into an underwater bridge pier defect target detection model based on the DETR framework. This model uses ResNet101 as the backbone network to extract high-level semantic features from the fused images. Because underwater crack and spalling targets on bridge piers have weak details, significant scale differences, and irregular boundaries, a coordinate attention module is introduced after the backbone network output to enhance the model's ability to perceive the spatial location of defects. In addition, a multi-scale feature enhancement module is introduced to strengthen the representation of small and weakly featured targets. The enhanced features are then input into a Transformer encoder-decoder structure to complete defect category prediction and bounding box regression.

[0111] 5. 3D Modeling and Parameter Extraction: First, a 3D modeling dataset for underwater bridge piers is constructed. A feature-matching-based spatial alignment method for visible light and sonar data is applied to reduce the spatial coordinate misalignment caused by differences between the two data sources. An image-driven point cloud generation method is combined with multi-view geometry principles to improve point cloud density and modeling accuracy. The point cloud is then denoised and simplified to balance modeling efficiency and detail preservation. Finally, the scour depth and key dimensional parameters of the bridge piers are extracted, and accuracy verification and error analysis are conducted.

[0112] 6. Inspection Result Output: The fused images, defect detection results, and geometric parameters obtained from 3D modeling are uniformly organized to form inspection results oriented towards engineering applications. Output content may include: fused structural images, defect categories, defect locations, defect quantities, key dimensions of piers, scour depth, and related assessment results.
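One way to organize the output content listed above is a pair of small record types that can be serialized for inspection records. The sketch below is only an illustration: the class names, field names, file name, and the use of metres as the unit are assumptions introduced here, not part of the described method.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class DefectEntry:
    """One detected defect (all field names are hypothetical)."""
    category: str                                   # e.g. "crack" or "spalling"
    box_xywh: Tuple[float, float, float, float]     # 2D box in the fused image
    confidence: float
    position_3d: Optional[Tuple[float, float, float]] = None
    size_params: Dict[str, float] = field(default_factory=dict)

@dataclass
class InspectionResult:
    """Fused image reference, defect list, and geometric parameters for one pier."""
    fused_image_path: str
    defects: List[DefectEntry] = field(default_factory=list)
    scour_depth_m: Optional[float] = None           # metres assumed
    pier_dimensions_m: Dict[str, float] = field(default_factory=dict)

    def defect_counts(self) -> Dict[str, int]:
        """Number of defects per category."""
        counts: Dict[str, int] = {}
        for d in self.defects:
            counts[d.category] = counts.get(d.category, 0) + 1
        return counts

    def to_json(self) -> str:
        """Serialize the whole record for archiving or later comparison."""
        return json.dumps(asdict(self), indent=2)

# Example record with two defects and one scour measurement.
record = InspectionResult(fused_image_path="fused_0001.png", scour_depth_m=0.8)
record.defects.append(DefectEntry("crack", (120.0, 80.0, 40.0, 6.0), 0.91))
record.defects.append(DefectEntry("spalling", (60.0, 200.0, 30.0, 25.0), 0.77))
summary = record.defect_counts()
```

Keeping the image reference, detections, and geometric parameters in one record makes re-inspection comparisons a matter of comparing two serialized entries.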

[0113] The above describes in detail the intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images proposed in this invention, including key steps such as dual-modal data construction, image preprocessing, dual-modal fusion, defect detection, and 3D modeling and parameter extraction. This invention fully leverages the complementary advantages of visible light and sonar images in underwater structure inspection, improving the quality of the fused image, defect identification capabilities, and subsequent geometric parameter extraction capabilities. For those skilled in the art, equivalent substitutions or conventional adjustments can be made to the network structure, preprocessing procedures, loss function combinations, and parameter configurations without departing from the technical concept of this invention; all such substitutions and adjustments should fall within the protection scope of this invention.

[0114] The present invention also proposes an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the intelligent underwater structure inspection method based on multimodal deep learning of visible light and sonar images.

[0115] The present invention also proposes a computer-readable storage medium for storing computer instructions, which, when executed by a processor, implement the steps of the intelligent underwater structure inspection method based on multimodal deep learning of visible light and sonar images.

[0116] The memory in this application embodiment can be volatile memory or non-volatile memory, or it can include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous linked dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory used in the methods described in this invention is intended to include, but is not limited to, these and any other suitable types of memory.

[0117] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium accessible to a computer or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., high-density digital video discs (DVDs)), or semiconductor media (e.g., solid-state disks (SSDs)).

[0118] In implementation, each step of the above method can be completed by integrated logic circuits in the processor's hardware or by instructions in software. The steps of the method disclosed in the embodiments of this application can be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method. To avoid repetition, detailed descriptions are omitted here.

[0119] It should be noted that the processor in the embodiments of this application can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method embodiments can be completed by the integrated logic circuitry in the processor's hardware or by instructions in software form. The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied as execution by a hardware decoding processor, or as a combination of hardware and software modules in the decoding processor. The software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory, and the processor reads the information in the memory and, in conjunction with its hardware, completes the steps of the above methods.

[0120] The above provides a detailed description of the intelligent underwater structure inspection method based on multimodal deep learning using visible light and sonar images proposed in this invention. Specific examples have been used to illustrate the principles and implementation methods of this invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this invention. Therefore, the content of this specification should not be construed as a limitation of this invention.

Claims

1. An intelligent underwater structure inspection method based on multimodal deep learning of visible light and sonar images, characterized in that the method includes:
Step 1: Construct a multimodal dataset of visible light and sonar images of underwater structures; using visible light cameras and imaging sonar mounted on underwater robots, submersibles, or fixed inspection platforms, visible light images and sonar images of the target underwater structure area are collected synchronously or quasi-synchronously, along with the pose information, depth information, or timestamp information corresponding to the collection process, to construct a multimodal raw dataset for the same inspection object;
Step 2: Preprocessing and staged spatial registration of visible light and sonar images under cross-modal fusion constraints; first, the visible light image undergoes brightness distribution correction, detail enhancement, and edge preservation processing; second, the sonar image undergoes speckle noise suppression, significant echo enhancement, and contour stabilization processing; on this basis, a two-stage registration method of "coarse-scale initial alignment + fine-scale feature correction" is adopted: first, the initial spatial transformation is determined using time synchronization relationships, pose information, or overall structural contour correspondence to complete coarse-scale registration; then, local deviations are further estimated using edge contours, local salient regions, and cross-modal feature similarity to complete fine-scale registration;
Step 3: Construct a deep learning-based visible light-sonar multimodal fusion enhancement model; the preprocessed visible light image and sonar image are respectively input into a dual-branch feature encoding network to extract shallow texture information and deep semantic structure information of the two modalities; bidirectional information interaction between the two modalities is realized through an iterative cross-attention module, so that the visible light features incorporate the stable contour information of the sonar, and the sonar features incorporate the detailed texture information of the visible light; then, a global context modeling module based on DETR is used to strengthen the expression of long-distance dependencies; finally, the fused image is generated by a residual decoder;
Step 4: Conduct intelligent detection of apparent defects in underwater structures based on the fused images;
Step 5: Local 3D representation and key parameter extraction of underwater structures constrained by fusion recognition results; to meet the needs of identifying surface defects of underwater bridge piers and foundation scour, after the fused images and defect detection results are obtained, the detected cracks, spalling, local defect areas and bridge pier foundation boundary areas are used as spatial constraint objects to construct a local three-dimensional representation process associated with the identification results;
Step 6: Intelligent inspection output based on the association between multimodal fusion results and recognition results; the fused and enhanced image obtained in Step 3, the defect identification results obtained in Step 4, and the local three-dimensional parameter results obtained in Step 5 are uniformly associated to generate an intelligent inspection output that includes defect type, location, spatial parameters, and bridge pier foundation scour assessment information, realizing an integrated inspection process of fused image visualization, recognition result expression, and parameter evaluation output.

2. The method of claim 1, wherein Step 1 specifically includes:
Step 11: Multimodal synchronous acquisition; underwater visible light imaging equipment and imaging sonar equipment are used to conduct synchronous or quasi-synchronous inspection and acquisition of the target underwater structure; the visible light imaging equipment is used to acquire the surface texture, edges, color and local defects of the structure, and the imaging sonar equipment is used to acquire the structure outline, boundary echoes and overall geometric response information in turbid environments;
Step 12: Parameter acquisition and pose recording; while acquiring visible light and sonar data, record one or more of the following: timestamp, depth, track coordinates, attitude angle, heading angle, relative installation relationship of the equipment, and sampling frequency, for subsequent time alignment, spatial registration, and scale recovery between the modalities;
Step 13: Cross-modal sample association and matching; based on the same inspection object, the same area, the same time period, or adjacent pose conditions, establish associations between visible light images and sonar images to form initial multimodal sample pairs; for data that is not strictly synchronized, establish candidate matching pairs based on temporal proximity, pose difference constraints, and structural contour similarity;
Step 14: Construction of a complex working condition dataset; collect multimodal samples under different water depths, turbidities, illuminance levels, sampling distances, structural types, and defect types to form a dataset covering complex working conditions; the defect types include at least one of the following: cracks, spalling, erosion, local defects, cavitation, and abnormal adhesion;
Step 15: Sample screening and task labeling; exclude samples with severe blurring, overexposure, invalid echoes, abrupt pose changes, or missing targets, or for which no effective correspondence can be established, and retain the acquisition parameters and metadata tags of valid samples; for image fusion tasks, establish paired input relationships between visible light images and sonar images; for defect detection tasks, label defect areas with rectangular boxes, polylines, key points, or pixel-level masks; for 3D modeling tasks, establish mapping relationships between multi-view image sequences, pose information, and scale references;
Step 16: Stratified sampling and dataset partitioning; perform stratified sampling of the data according to the structure object, defect category and working condition distribution, and divide the data into a training set, a validation set and a test set, so as to avoid data leakage caused by adjacent samples of the same target area falling into different data subsets.
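The leakage-avoiding partition in Step 16 can be sketched by assigning whole target regions, rather than individual frames, to each subset. This is a minimal illustration under assumptions: the sample keys `region_id` and `category`, the split ratios, and the choice to stratify regions by the set of categories they contain are all hypothetical.

```python
import random
from collections import defaultdict

def group_stratified_split(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split samples into train/val/test so that each region stays in one subset.

    samples: list of dicts with hypothetical keys "region_id" and "category".
    Regions are grouped into strata by the sorted set of categories they contain,
    and each stratum is divided according to the first two ratios (the remainder
    goes to the test subset).
    """
    regions = defaultdict(list)
    for s in samples:
        regions[s["region_id"]].append(s)

    strata = defaultdict(list)
    for region_id, items in regions.items():
        key = tuple(sorted({s["category"] for s in items}))
        strata[key].append(region_id)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        n_train = int(round(ratios[0] * len(ids)))
        n_val = int(round(ratios[1] * len(ids)))
        assignment = ["train"] * n_train + ["val"] * n_val
        assignment += ["test"] * (len(ids) - len(assignment))
        # zip stops at the shorter sequence, so rounding can never over-assign.
        for region_id, name in zip(ids, assignment):
            splits[name].extend(regions[region_id])
    return splits

# Example: 20 regions with 3 adjacent frames each.
demo_samples = [
    {"region_id": r, "category": "crack" if r % 2 else "spalling", "frame": f}
    for r in range(20) for f in range(3)
]
demo_splits = group_stratified_split(demo_samples, seed=1)
```

Because the unit of assignment is the region, adjacent frames of the same pier area can never end up on both sides of the train/test boundary.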

3. The method of claim 1, wherein Step 2 specifically includes:
Step 21: Visible light image preprocessing for crack texture preservation and cross-modal fusion stability; the acquired underwater visible light images are subjected to luminance mapping correction, local contrast enhancement, and edge-preserving enhancement; the luminance mapping correction is represented by a gamma transform, in which the corrected grey value at each pixel is computed from the grey value of the original visible light image at that pixel using a proportionality factor and a gamma transformation parameter;
Step 22: Sonar image preprocessing for structural contour preservation and significant echo enhancement; the sonar images are processed by speckle noise suppression, local echo enhancement, and contour stabilization; the speckle noise suppression is represented by Gaussian filtering, in which the filtered sonar image is obtained by convolving the original sonar image with a Gaussian kernel function characterized by its standard deviation;
Step 23: Standardize the input scale and format; resample the preprocessed visible light image and sonar image, normalize the pixels, map the channels, and standardize the data format to ensure that the data of the two modalities meet the requirements for parallel input to the subsequent deep network;
Step 24: Perform coarse-scale initial registration and fine-scale feature registration.

4. The method of claim 3, wherein Step 24 specifically includes: first, based on the correspondence between timestamps, pose information, acquisition order, or the overall outline of the bridge pier, an initial spatial mapping relationship is established between the visible light image and the sonar image to obtain a coarse-scale registration result; then, based on the coarse-scale registration result, local deviations are further corrected using edge contours, salient regions, or cross-modal similarity features to obtain a fine-scale registration result; the coarse-scale registration is expressed as an affine transformation that relates the visible light image coordinates to the sonar image coordinates through a coarse-scale initial transformation matrix; the fine-scale registration is achieved by minimizing the inconsistency between local regions of the two modalities, and its objective function combines a contour consistency loss, used to constrain the correspondence of the bridge pier shape boundary between the two modalities, and a local area consistency loss, used to constrain the spatial matching of local content, balanced by a weight coefficient.
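For orientation, the coarse-scale and fine-scale registration stages can be written compactly as below. The notation is an assumption introduced for illustration, since the formula symbols are not reproduced in this text: p_v and p_s are homogeneous pixel coordinates in the visible light and sonar images, T_0 is the coarse-scale initial transformation matrix, L_contour and L_local are the contour and local area consistency losses, and lambda is the weight coefficient. The mapping direction (visible light to sonar) and the placement of the single weight are likewise assumed.

```latex
% Coarse-scale stage: affine mapping in homogeneous coordinates (assumed direction)
\[
\mathbf{p}_s = T_0\,\mathbf{p}_v,
\qquad
T_0 =
\begin{bmatrix}
a_{11} & a_{12} & t_x \\
a_{21} & a_{22} & t_y \\
0 & 0 & 1
\end{bmatrix}
\]

% Fine-scale stage: weighted combination of the two consistency losses (assumed form)
\[
L_{\mathrm{reg}} = L_{\mathrm{contour}} + \lambda\, L_{\mathrm{local}}
\]
```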

5. The method of claim 4, wherein Step 3 specifically includes:
Step 31: Visible light and sonar dual-branch convolutional feature encoding; construct a visible light convolutional feature encoder and a sonar convolutional feature encoder corresponding to the fusion architecture; the preprocessed and registered visible light image and sonar image are input into the visible light convolutional feature encoder and the sonar convolutional feature encoder, respectively, to obtain local visible light representation features and local sonar representation features; the visible light features contain information about crack textures and spalling boundaries, and the sonar features contain information about the pier outline and foundation boundaries;
Step 32: Construct hierarchical feature representations; during encoding, the receptive field is gradually expanded through strided convolution, pooling, or downsampling operations to form multi-scale hierarchical features that maintain both local detail expression and overall contour perception;
Step 33: BICAF dual-modal iterative cross-attention module; the visible light features and sonar features are input into the dual-modal iterative cross-attention fusion module to calculate sonar-guided visible light update features and visible-light-guided sonar update features, respectively; where the visible light features are used as queries and the sonar features are used as keys and values, the cross-attention is computed using the projection matrices for bimodal interaction and a feature dimension normalization coefficient; the two cross-attention branches are used, respectively, to supplement the structural boundary representation in the visible light image with stable sonar contour information and to supplement the detail representation in the sonar image with local visible light texture information;
Step 34: Iterative gated update; gating coefficients are introduced into the features updated through cross-attention to balance the contributions of the original modal features and the cross-modal compensation features; in the update formula, the visible light and sonar features of each iteration are combined with the gating weights of that iteration through element-wise multiplication;
Step 35: DETR global context modeling; the bimodal fusion features updated through iterative cross-attention are input into the DETR encoder, which serves as the global context modeling module, to model the long-distance dependencies among the main pier body, local defect areas, and foundation boundaries, yielding global fusion features that combine local defect details with the overall structural outline;
Step 36: Residual decoder construction; the global fusion features are input into the residual decoder, which outputs the fused image; the residual decoding process takes the form "visible light base image + sonar supplementary residual", in which the supplementary residual map generated by the residual decoder is used to enhance the representation of the pier structure outline, foundation boundaries, and local salient areas;
Step 37: Establish the training objective of the fusion model; the loss function of the fusion model includes at least one of pixel reconstruction loss, structure preservation loss, and edge consistency loss; its total loss combines a basic reconstruction error term, a structure edge preservation constraint term, a sonar salient region constraint term, and a sonar local contrast constraint term, each with its own weighting coefficient;
Step 38: Output fusion enhancement results; apply the trained fusion model to the multimodal samples to be tested and output the fused image.
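As a reading aid, one plausible form of the cross-attention, gated update, residual decoding, and total loss is given below. All symbols are assumptions introduced here because the formula symbols are not reproduced in this text: F_v and F_s are the visible light and sonar features, the W matrices are the projection matrices, d is the feature dimension, g_v and g_s are gating weights, I_v is the visible light base image, R is the supplementary residual map, and the lambdas are weighting coefficients. The convex gate shown is one common choice, not necessarily the claimed one.

```latex
% Cross-attention in both directions at iteration t (standard scaled dot-product form)
\[
\tilde{F}_v^{(t)} = \operatorname{softmax}\!\left(
  \frac{\bigl(F_v^{(t)} W_Q^{v}\bigr)\bigl(F_s^{(t)} W_K^{s}\bigr)^{\top}}{\sqrt{d}}
\right) F_s^{(t)} W_V^{s},
\qquad
\tilde{F}_s^{(t)} = \operatorname{softmax}\!\left(
  \frac{\bigl(F_s^{(t)} W_Q^{s}\bigr)\bigl(F_v^{(t)} W_K^{v}\bigr)^{\top}}{\sqrt{d}}
\right) F_v^{(t)} W_V^{v}
\]

% Gated update balancing original and cross-modal features (assumed convex form)
\[
F_v^{(t+1)} = g_v^{(t)} \odot \tilde{F}_v^{(t)} + \bigl(1 - g_v^{(t)}\bigr) \odot F_v^{(t)},
\qquad
F_s^{(t+1)} = g_s^{(t)} \odot \tilde{F}_s^{(t)} + \bigl(1 - g_s^{(t)}\bigr) \odot F_s^{(t)}
\]

% Residual decoding: visible light base image plus sonar supplementary residual
\[
I_f = I_v + R
\]

% Total fusion loss as a weighted sum of the four listed terms (assumed form)
\[
L_{\mathrm{fuse}} = \lambda_1 L_{\mathrm{rec}} + \lambda_2 L_{\mathrm{edge}}
                  + \lambda_3 L_{\mathrm{sal}} + \lambda_4 L_{\mathrm{con}}
\]
```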

6. The method according to claim 1, characterized in that Step 4 specifically includes:
Step 41: Construct a sample set of bridge pier surface defects; mark the crack areas, spalling areas, and local defect areas in the fused images to form the sample set of bridge pier surface defects; the labeling information of each defect target consists of a defect category label, covering at least the two categories of cracks and spalling, and the target bounding box parameters, namely the center position of the target box and its width and height;
Step 42: Establish a defect detection backbone network; use a convolutional backbone network to extract features from the fused image to obtain defect representation feature maps at different scales;
Step 43: Introduce a coordinate attention module to enhance the perception of defect locations; one-dimensional global aggregation is performed on the backbone network output features along the height and width directions, respectively, to obtain a direction-aware feature description of the bridge pier defect area; in the coordinate attention formula, the defect detection feature map of each channel is encoded into a feature encoding result along the height direction and a feature encoding result along the width direction, computed over the height and width of the feature map;
Step 44: Construct a multi-scale feature enhancement module; perform cross-layer aggregation and semantic fusion on the features output from different layers of the backbone network, so that the shallow high-resolution features and the deep high-semantic features work together to enhance the recognition of small defects, weak-texture defects, and defects with obvious scale changes;
Step 45: Establish a Transformer-based target decoding structure; use a Transformer encoder-decoder structure to perform global modeling and target querying of the fused features, and output the defect category and corresponding location parameters;
Step 46: Construct a joint loss function for bridge pier defect target detection; the total loss combines the classification loss and the bounding box regression loss, each weighted by a weighting coefficient;
Step 47: Output defect detection results; output the defect category, bounding box coordinates, target confidence, and area, length or relative position parameters, and establish the correspondence between defect targets and the original structural regions;
Step 48: Perform post-processing of defect results; based on the detection confidence threshold, category constraints, spatial adjacency relationships or structural prior rules, screen, merge and correct the initial detection results to improve the stability and interpretability of the final detection results.
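The one-dimensional aggregation in Step 43 can be illustrated on a single-channel feature map. The sketch below is an assumption-laden simplification: it uses average pooling along each direction, which is the usual choice in coordinate attention, and replaces the learned convolutions of a real module with a plain sigmoid on the pooled values; the function names are hypothetical.

```python
import math

def directional_pooling(feature_map):
    """Average an H x W map along the width and along the height.

    Returns (z_h, z_w): z_h[i] is the mean of row i (height-direction code),
    z_w[j] is the mean of column j (width-direction code).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    z_h = [sum(row) / width for row in feature_map]
    z_w = [sum(feature_map[i][j] for i in range(height)) / height
           for j in range(width)]
    return z_h, z_w

def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def apply_coordinate_attention(feature_map):
    """Reweight each position by its row and column attention weights.

    The learned transforms of a real module are omitted (an assumption);
    the pooled codes are passed directly through a sigmoid.
    """
    z_h, z_w = directional_pooling(feature_map)
    a_h = [_sigmoid(v) for v in z_h]
    a_w = [_sigmoid(v) for v in z_w]
    return [[feature_map[i][j] * a_h[i] * a_w[j] for j in range(len(a_w))]
            for i in range(len(a_h))]

# Example: a 2 x 2 map whose second row responds more strongly.
demo_map = [[1.0, 3.0], [5.0, 7.0]]
row_codes, col_codes = directional_pooling(demo_map)
reweighted = apply_coordinate_attention(demo_map)
```

Because the row and column codes are kept separate, a strong response can be localized along both axes, which is what lets the detector attend to thin, elongated targets such as cracks.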

7. The method according to claim 1, characterized in that Step 5 specifically includes:
Step 51: Construct a multi-view observation sequence;
Step 52: Perform feature matching and camera pose recovery;
Step 53: Restore local 3D points based on multi-view projection relationships;
Step 54: Construct a point cloud screening and denoising mechanism constrained by the defect area;
Step 55: Establish the mapping relationship between defect identification results and the 3D point cloud;
Step 56: Extract spatial parameters of apparent defects;
Step 57: Extract the scour depth and key dimensional parameters of the bridge pier foundation;
Step 58: Construct the local three-dimensional representation result.

8. The method according to claim 1, characterized in that Step 6 specifically includes:
Step 61: Construct an inspection result association set; the fused image results obtained in Step 3, the defect identification results obtained in Step 4, and the local 3D parameter results obtained in Step 5 are uniformly organized to form an inspection result association set, which is used to characterize the pier surface defect information, foundation scour information, and their corresponding spatial parameter information; the three elements of the association set denote, respectively, the image fusion result, the defect identification result, and the local three-dimensional parameter result;
Step 62: Generate visualized and structured inspection results; based on the fused image, overlay and display the category, location, confidence level and corresponding spatial parameters of each defect target, and simultaneously generate structured result entries; the structured results include at least one of the following: defect category, two-dimensional location, three-dimensional location, local size parameters, scour depth parameters and key geometric parameters of the pier;
Step 63: Output intelligent inspection results; output the visualized inspection results and structured parameter results in a unified manner to form intelligent inspection results for underwater bridge pier inspection tasks, which are used for structural status analysis, inspection record retention and subsequent re-inspection and comparison; the output results include at least the fused enhanced image, the defect detection marking results, the defect spatial parameters, and the bridge pier foundation scour and key dimension assessment results.

9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1-8.

10. A computer-readable storage medium for storing computer instructions, characterized in that, when the computer instructions are executed by a processor, they implement the steps of the method according to any one of claims 1-8.