A method for detecting underwater defects of a hydraulic structure by acoustic-optical imaging semantic fusion

By employing a semantically guided acoustic-optical imaging fusion detection framework that combines multimodal information from acoustic and optical images, the problem of insufficient detection accuracy and environmental interference in underwater inspection is solved, achieving efficient and accurate defect identification.

CN120656051B · Active · Publication Date: 2026-06-26 · HOHAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HOHAI UNIV
Filing Date: 2025-06-11
Publication Date: 2026-06-26

Application Information

Patent Timeline

11 Jun 2025: Application
26 Jun 2026: Publication (CN120656051B)

IPC: G06V20/05; G06V10/54; G06V10/56; G06V10/80; G06V10/22; G06V10/764; G06V10/44; G06T5/20; G06V10/82; G06V10/774

CPC: G06V20/05; G06V10/54; G06V10/56; G06V10/80; G06V10/22; G06V10/764; G06V10/44; G06T5/20

AI Tagging

Technology Topics

Hydraulic structure; Engineering

Technical Efficacy Phrases

Improve work efficiency

AI Technical Summary

Technical Problem

Existing technologies for underwater defect detection suffer from insufficient detection accuracy, limited information acquisition, and susceptibility to interference from complex underwater environments, making it difficult to meet the demand for efficient and accurate detection. Furthermore, multimodal detection technology struggles to balance accuracy and operational efficiency in practical applications.

Method used

A semantically guided acoustic-optical imaging fusion detection framework is adopted. Deep learning models are used to build deep semantic associations between acoustic and optical images, enabling joint modeling and guided optimization of multimodal information. By combining the texture details of optical images with the structural semantic information of acoustic images, the accuracy and robustness of detection are improved.

Benefits of technology

It improves the accuracy and robustness of underwater defect detection, effectively controls algorithm complexity, and enhances detection efficiency, demonstrating strong engineering practicality and promotional value.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses an acoustic-optical imaging semantic fusion detection method for underwater defects of hydraulic structures. For underwater defects of a hydraulic structure, underwater acoustic-optical imaging sensing simultaneously acquires underwater acoustic and optical images of the structure. On this basis, a semantic feature extraction module extracts acoustic semantic features from the acoustic image, and these features guide the processing of the optical image features in the optical image. When an underwater defect target exists in the image, a defect detection result is obtained, thereby realizing acoustic-optical imaging semantic fusion detection of underwater defects. The application is suitable for health monitoring of hydraulic structures in various reservoir-dam systems and helps guarantee the operational safety and water safety of water conservancy projects.

Description

Technical Field

[0001] This invention relates to a semantic fusion detection method for underwater defects in hydraulic structures using acoustic-optical imaging, and particularly to a semantically guided acoustic-optical imaging fusion detection framework and method, as well as its application technology in health monitoring of various underwater engineering structures, belonging to the field of underwater engineering structure monitoring technology.

Background Technology

[0002] With the development of marine engineering and underwater intelligent operation and maintenance, the demand for structural health monitoring of underwater structures (such as bridge piers, pipelines, ship hulls, and offshore platforms) is becoming increasingly urgent. Their safety and integrity directly affect the long-term service capability and operational support level of water conservancy projects. Traditional underwater inspection mainly relies on single-modal imaging methods. Especially in complex environments such as low visibility and turbid water, conventional optical imaging systems are severely affected by light attenuation and scattering, have short detection distances, and are subject to many limitations, making it difficult to complete large-scale underwater defect monitoring tasks. Therefore, acoustic imaging technology, due to its good penetration ability and stable imaging effect, has been widely used in underwater target identification and defect detection. However, single acoustic imaging modes still face many challenges, such as low resolution, blurred target boundaries, poor signal-to-noise ratio, and susceptibility to background interference, which seriously affect the robustness and accuracy of the detection model. Furthermore, due to the complexity of the underwater propagation medium, sound wave propagation is often accompanied by attenuation, multipath effects, bubble interference, and seabed sediment artifacts, further exacerbating the misjudgment rate of the detection model. In recent years, computer vision and deep learning technologies have developed rapidly, and detection methods based on CNNs and Transformers have made breakthrough progress in the field of natural images. 
However, sonar images and natural images differ significantly in imaging mechanisms, texture representation, and visual attributes: First, low contrast and blurred boundaries make it difficult for traditional image detection algorithms to accurately segment defect regions; second, strong background interference significantly increases the risk of false detection; and third, the information representation capability of a single modality is limited, making it difficult to support the classification and measurement of complex defects. Acoustic and optical images are complementary in their imaging principles (the former offers strong penetration and stable imaging, while the latter offers high resolution and clear texture), so multimodal information fusion has gradually become an important development direction for underwater defect detection technology. Some studies have attempted to jointly model acoustic-optical images, such as using image registration or attention mechanisms to achieve preliminary fusion, but most methods still remain at the level of shallow feature stitching, making it difficult to deeply explore the structural relationship between the two modalities at the semantic level. Moreover, they are often accompanied by high computational complexity and poor adaptability, making them difficult to apply to complex real-world marine engineering environments.

Summary of the Invention

[0003] Objective: To address the problems and shortcomings of existing technologies, this invention proposes a semantic fusion detection method for underwater defects in hydraulic structures using acoustic-optical imaging. The aim is to construct deep semantic associations between acoustic and optical images through a deep learning model, achieving joint modeling and guided optimization of multimodal information. This method designs a semantic guidance mechanism, enabling structural semantic information in the acoustic image to effectively guide the defect discrimination process in the optical image. While fully utilizing the stability of acoustic imaging, it fuses the texture details of the optical image, achieving high-precision identification of various types of underwater defects such as cracks, erosion, and holes. Furthermore, the fusion strategy proposed in this invention effectively controls algorithm complexity and improves fusion efficiency, possessing strong engineering practicality and promotional value.

[0004] Technical solution: A semantic fusion detection method for underwater defects in hydraulic structures using acoustic-optical imaging, comprising:

[0005] Step 1: Acquire optical and sonar image data.

[0006] Step 2: Construction of the underwater image feature extraction model. The underwater image feature extraction model is used to extract multi-level and multi-dimensional visual information from optical images, including texture feature extraction, boundary feature extraction, and color feature extraction. The extracted features are then fused to obtain a unified image embedding feature representation.

[0007] Step 3: Construction of the Acoustic Image Semantic Feature Extraction Model. The acoustic image semantic feature extraction model is used to extract structural semantic information from acoustic images and encode it into embedding vectors that can be used for image detection guidance.

[0008] Step 4: Underwater Acoustic-Optical Imaging Semantic Fusion Detection Model. This step constructs a multimodal defect detection network based on a semantic guidance mechanism, aiming to deeply fuse the visual information of optical images with the semantic information of acoustic images to improve detection accuracy and robustness. The optical image embedding features extracted in Step 2 and the semantic vectors obtained in Step 3 are respectively input into parallel image encoders and text encoders; the two encoders perform deep modeling of the input image region features and semantic category information; the detection model achieves semantic-level region localization by matching the correlation between image region features and semantic vectors, that is, the optical image is guided by sonar information to focus on detecting potential defect areas; the detection model also includes a bounding box prediction branch to accurately output the spatial location of defects; finally, the defect type, location information, and detection confidence are output.

[0009] The design combines two modal imaging devices: an optical imaging sensor and an acoustic imaging sensor. The sensing areas of the acoustic and optical sensors are ensured to spatially overlap, thereby obtaining multimodal image data of the same structural region. The optical imaging sensor acquires optical data (optical images), while the acoustic imaging sensor acquires acoustic data (sonar images).

[0010] The underwater image feature extraction model comprises three feature extraction branches: 1) Texture feature extraction unit: employing a Gabor-based neural network structure, combined with a directional filter bank and a convolutional feature encoder, to extract local texture information from the image, including attributes such as texture frequency, direction, stripe structure, and roughness. This is suitable for identifying defect areas with significant texture changes, such as cracks and erosion. 2) Boundary feature extraction unit: this module combines the Sobel operator and a deep edge detection network to detect and enhance locations of abrupt structural change in the image, focusing on extracting information such as defect edges, abrupt contour changes, and crack boundaries, thereby improving the model's ability to identify linear defects such as cracks and fractures. 3) Color feature extraction unit: converting the image from RGB space to color spaces such as HSV or Lab, and extracting features such as color distribution, brightness changes, and color shifts in each channel to capture visual anomalies related to color changes, such as corrosion and erosion. The above three feature extraction branches construct the feature representation of the optical image from the three dimensions of texture, structure, and color, respectively. Their outputs are concatenated or weighted by the feature fusion module to finally generate a unified image embedding feature representation, denoted as $F_{opt}$, which is used in the subsequent multimodal fusion process of semantic alignment and defect detection tasks.

[0011] An acoustic image semantic feature extraction model is used to extract structural semantic information from acoustic images and encode it into embedding vectors that can be used for image detection guidance. This model includes: 1) A semantic representation encoding unit: First, defect category identification is performed on the sonar image. A classification model based on a deep convolutional neural network is used to perform multi-category discrimination on the sonar image, identifying the existence of defects and their specific types, such as cracks, holes, and erosion. Subsequently, based on the identified categories, corresponding structured natural language descriptions are generated through preset mapping rules, such as "There is a crack." or "There is a hole.", achieving the conversion from image to semantics. 2) A semantic knowledge recognition unit: The above natural language description is used as the semantic prompt input to the semantic knowledge recognition unit. A pre-trained text encoding network (such as BERT or Transformer) extracts semantic vector representations. This semantic vector carries prior knowledge about the defect category, used for semantic guidance of the image detection region in subsequent stages. Simultaneously, the semantic feature extraction model can integrate a preset knowledge base or class embedding vectors to achieve knowledge enhancement of semantic representation and improve the generalization ability of semantics to image understanding.

[0012] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures as described above.

[0013] A computer-readable storage medium storing a computer program that performs the acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures as described above.

[0014] Beneficial effects: Based on multimodal fusion flaw detection technology, this invention develops and constructs a semantic guidance strategy, and innovatively proposes an acoustic-optical imaging semantic fusion detection framework. While leveraging the advantages of multimodal information fusion in the high accuracy of underwater defect detection in hydraulic structures, it effectively improves the operational efficiency of underwater defect detection in hydraulic structures.

Attached Figure Description

[0015] Figure 1 is a flowchart of the present invention;

[0016] Figure 2 is a schematic diagram of image feature extraction according to the present invention;

[0017] Figure 3 is a schematic diagram of the defect detection steps in an embodiment of the present invention.

Detailed Implementation

[0018] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. After reading the present invention, any modifications of the present invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.

[0019] Existing methods for underwater defect detection in hydraulic structures generally suffer from problems such as insufficient detection accuracy, limited information acquisition, and susceptibility to interference from complex underwater environments, making it difficult to meet the practical needs of efficient and accurate underwater defect detection operations. Furthermore, current multimodal detection technologies often present a trade-off between accuracy and operational efficiency in practical applications, making it difficult to improve overall detection efficiency while maintaining detection precision.

[0020] To address the aforementioned problems, this invention proposes an underwater defect detection method based on semantic-level acoustic-optical fusion. It constructs an acoustic-optical semantic-level fusion framework to achieve deep fusion of acoustic and optical imaging information at the semantic feature level. This method fully leverages the penetrating power of acoustic imaging, effectively overcoming the limitations of optical imaging in turbid water environments. Simultaneously, it incorporates detailed information from the optical image to compensate for the insufficient resolution of the acoustic image, thereby improving the robustness, accuracy, and reliability of defect identification. Specifically, it includes the following:

[0021] Step 1: A Blueprint Subsea Oculus M750d forward-looking sonar (FLS) underwater acoustic imaging sensor and an RGB underwater optical imaging sensor are deployed in the water area of the hydraulic structure so that the imaging areas of the acoustic sensor and the optical sensor overlap, and the acoustic image and optical image of the same hydraulic structure area are acquired simultaneously.

[0022] Step 2: The optical data (underwater optical image) acquired by the underwater optical imaging sensor in Step 1 is input into the underwater image feature extraction model. This model consists of a texture feature extraction unit, a boundary feature extraction unit, and a color feature extraction unit, which together extract the image features of the optical image. The specific network architecture is shown in Figure 2. The core objective of this network architecture is to extract semantic information through image classification. First, the network extracts features from the input image at different levels using multiple convolution, batch normalization, and SiLU (CBS) modules. Then, a residual feature extraction module (C3K2) further refines the high-level semantic information of the image. The network structure also includes an SPPF (Fast Spatial Pyramid Pooling) module, which enhances the model's ability to capture features at different scales through multi-scale pooling. An upsample module increases the resolution of the feature maps for better subsequent classification. In the final part of the network, the extracted features are classified by a classification module, and the classification results are further transformed into semantic information. This semantic information is used to guide region selection and object detection in subsequent tasks.

[0023] The texture feature extraction unit convolves the image using a set of Gabor filter kernels with four directions and frequencies, as shown in the following expression:

[0024]
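As an illustration of the texture branch, the following is a minimal sketch of a four-orientation Gabor filter bank in plain Python. It is not the patent's expression, which is not reproduced in this text; the kernel size, `sigma`, `gamma`, `wavelength`, and the helper names `gabor_kernel`, `convolve_valid`, and `texture_responses` are all assumptions chosen for illustration.

```python
import math

def gabor_kernel(size, theta, wavelength, sigma=2.0, gamma=0.5):
    """Real part of a Gabor kernel as a size x size nested list (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate pixel coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def convolve_valid(image, kernel):
    """Plain 2-D cross-correlation over the 'valid' region only."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(w - k + 1)
        ]
        for i in range(h - k + 1)
    ]

def texture_responses(image, size=5, wavelength=4.0):
    """One response map per orientation (0, 45, 90, 135 degrees); values are assumptions."""
    thetas = [n * math.pi / 4 for n in range(4)]
    return [convolve_valid(image, gabor_kernel(size, t, wavelength)) for t in thetas]
```

In a learned pipeline these fixed kernels would typically initialize or precede a convolutional encoder; that coupling is left out of the sketch.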

[0025] Where $F_{edge}$ represents the edge feature map; $G_x$ represents the gradient of the image in the x-direction; $G_y$ represents the gradient of the image in the y-direction; and $I_{gray}$ is the grayscale representation of the optical image.
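The symbols above (an edge map built from x and y gradients of the grayscale image) are consistent with a Sobel gradient-magnitude map. The sketch below assumes the common form in which the edge value is the Euclidean norm of the two Sobel responses; the combining formula and the function names are assumptions, since the original expression is not reproduced here.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def correlate3(image, kernel, i, j):
    """3x3 cross-correlation with the window's top-left corner at (i, j)."""
    return sum(image[i + a][j + b] * kernel[a][b] for a in range(3) for b in range(3))

def sobel_edge_map(gray):
    """Gradient magnitude over the 'valid' region of a grayscale image (nested lists)."""
    h, w = len(gray), len(gray[0])
    edge = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            gx = correlate3(gray, SOBEL_X, i, j)  # gradient in x
            gy = correlate3(gray, SOBEL_Y, i, j)  # gradient in y
            row.append(math.hypot(gx, gy))        # assumed: sqrt(gx^2 + gy^2)
        edge.append(row)
    return edge
```

On a vertical step edge the x-gradient dominates and the y-gradient is zero, which is the behavior a crack-boundary detector relies on.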

[0026] The color feature extraction unit transforms the image from RGB to HSV and Lab spaces to extract color features, as shown in the following expression:

[0027]
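A minimal sketch of the RGB-to-HSV part of the color branch, using Python's standard `colorsys` module. The choice of per-channel means as the "color feature" and the function name `hsv_channel_means` are assumptions for illustration; the Lab conversion mentioned in the text is omitted because the standard library does not provide it.

```python
import colorsys

def hsv_channel_means(pixels):
    """Mean H, S, V over an iterable of (r, g, b) tuples with values in [0, 1].

    Note: hue is circular, so a plain mean is only a rough summary; it is used
    here purely to keep the sketch short.
    """
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]
    n = len(hsv)
    return tuple(sum(p[c] for p in hsv) / n for c in range(3))
```

A saturated patch next to a gray background shows up as a raised mean saturation, which is the kind of shift the text associates with corrosion or erosion.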

[0028] Where $F_{opt}$ represents the multidimensional optical image features after fusion, and $W_t$, $W_e$, and $W_c$ are the learned weight matrices. Different fully connected layers are used to map the three features to the same dimension, and then weighted fusion is performed using feature weighting coefficients.
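The fusion described above (project each feature to a common dimension, then combine with weights) can be sketched as follows. The matrices `W_t`, `W_e`, `W_c` follow the symbols in the text; the `weights` tuple, the absence of bias terms, and the function names are assumptions, and in practice the matrices would be learned rather than supplied by hand.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fuse_features(f_texture, f_edge, f_color, W_t, W_e, W_c,
                  weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three projected features (a stand-in for F_opt)."""
    projected = [matvec(W_t, f_texture), matvec(W_e, f_edge), matvec(W_c, f_color)]
    dim = len(projected[0])
    if any(len(p) != dim for p in projected):
        raise ValueError("projections must share one output dimension")
    return [sum(a * p[k] for a, p in zip(weights, projected)) for k in range(dim)]
```

Projecting first lets the three branches have different native sizes while still being summed element-wise.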

[0029] Step 3: Input the acoustic data (underwater sonar image) acquired by the underwater sonar imaging sensor in Step 1 into the semantic feature extraction model. The acoustic image semantic feature extraction model consists of a semantic representation encoding unit and a semantic knowledge recognition unit to extract the semantic features of the acoustic image.

[0030] Semantic representation encoding unit: This unit first uses the YOLOv11 framework to extract and classify features from the image, then generates structured semantic text and performs embedding encoding.

[0031] Sonar image semantic classification uses a deep learning model to output the probability distribution of defect categories, as shown in the following expression:

[0032] $P = f_{cls}(I_{sonar})$

[0033] Where $I_{sonar}$ is the sonar image, $f_{cls}$ is the classification model, and $P$ represents the probability distribution over the defect classes. The category corresponding to the highest probability is taken as the defect label, which is used to generate the corresponding semantic text description. $p_j$ represents the probability predicted by the acoustic image classification model for the $j$-th class of defect, with $j \in \{\text{none}, \text{crack}, \text{hole}, \text{spalling}\}$.
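The label selection described above can be sketched as a softmax followed by an argmax. The text only states that the classifier outputs a probability distribution; producing it with a softmax over raw scores is an assumption, as are the names `softmax` and `predict_label`. The class list is taken from the text.

```python
import math

CLASSES = ["none", "crack", "hole", "spalling"]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the most probable class name and the full distribution P."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs
```

Keeping the whole distribution alongside the label allows a downstream stage to apply a confidence threshold before generating text.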

[0034] Based on the classification results, a structured semantic description is generated using a fixed template, as shown in the following expression:

[0035]

[0036] TextGen() is a mapping function used to generate structured semantics.
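A fixed-template mapping of this kind can be as simple as a dictionary lookup. The sketch below uses the sentences that appear elsewhere in this document; the function name `text_gen` and the error handling for unknown labels are assumptions.

```python
TEMPLATES = {
    "none": "There is no defect.",
    "crack": "There is a crack.",
    "hole": "There is a hole.",
    "spalling": "There is a spalling.",
}

def text_gen(label):
    """Map a defect label to its fixed-template description."""
    try:
        return TEMPLATES[label]
    except KeyError:
        raise ValueError(f"unknown defect label: {label!r}") from None
```

Because the set of sentences is closed, their text embeddings can be computed once and cached rather than re-encoded for every image.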

[0037] The generated text T is input into a semantic encoder to extract a semantic vector representation, as shown in the following expression:

[0038] $t_1 = f_{text}(T)$

[0039] Where $f_{text}$ is the text embedding network and $t_1$ is the semantic embedding feature ultimately used for alignment with the optical image region. Semantic knowledge recognition unit: This unit further models the semantic vector, incorporating an attention mechanism to improve semantic alignment capability. An external defect knowledge vector library is introduced to dynamically enhance the current semantic vector, as shown in the following expression:

[0040]

[0041] Where the attention weights are those associated with the various types of knowledge; $K_j$ represents the knowledge vector of the $j$-th type of defect; $t_1$ is the semantic embedding feature being enhanced; and $t_2$ represents the enhanced semantic vector, which is used for subsequent cross-modal matching.
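One plausible reading of the knowledge enhancement step is dot-product attention over the knowledge vectors followed by a residual addition. The original expression is not reproduced in this text, so both choices (softmax of dot products for the attention weights, and adding the attended context to the input vector) are assumptions, as is the function name `enhance`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def enhance(t1, knowledge):
    """Return (t2, alphas): the enhanced vector and the attention weights.

    Assumed form: alpha_j = softmax_j(t1 . K_j),  t2 = t1 + sum_j alpha_j * K_j.
    """
    scores = [dot(t1, k) for k in knowledge]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    context = [sum(a * k[d] for a, k in zip(alphas, knowledge)) for d in range(len(t1))]
    t2 = [x + c for x, c in zip(t1, context)]
    return t2, alphas
```

With this form, the knowledge vector most similar to the current semantic vector receives the largest weight, so the enhancement mainly reinforces the predicted defect category.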

[0042] Step 4: Input the underwater optical image features and acoustic image semantic features obtained in Steps 2 and 3 into the underwater acoustic-optical imaging semantic fusion detection model. This model uses parallel Transformer-based image encoders and text encoders to process the image features of the underwater optical image and the semantic features of the acoustic image, respectively. The specific model architecture is shown in Figure 3, which illustrates a network architecture for defect detection in which semantic information is extracted from sonar images and combined with optical images. First, the network extracts semantic information (such as "no defects" or "with holes") from the sonar images using a semantic model and generates text descriptions that guide subsequent image processing. Next, the sonar and optical images undergo spatial and channel-structured convolutional processing to extract multi-level features, enhancing feature representation capability. In the feature sharing module, features from the sonar and optical images are fused and reduced in dimensionality using pooling layers, employing 3×3 and 1×1 convolutional pooling to reduce computational complexity. Subsequently, these features are input to fully connected layers for further processing, ultimately outputting classification results that determine the defect category in the image. Finally, the network generates and displays the detection results, including classification labels and bounding boxes that define the defect region, on the optical image, thus providing operators with accurate defect location information. By combining features from sonar and optical images and using the semantic information from the sonar images for guidance, this architecture improves the accuracy and robustness of defect detection in complex environments.

[0043] The underwater acoustic-optical imaging semantic fusion detection model takes image-text pairs as input, and its expression is as follows:

[0044] (F opt ,T)

[0045] Where T represents the category text semantic knowledge base, T = {"There is no defect.", "There is a crack.", "There is a hole.", "There is a spalling."}.

[0046] The image encoder and text encoder encode the underwater optical image features and acoustic image semantic features, respectively, as shown in the following expressions:

[0047]

[0048] Where $F_{img}$ represents the image encodings of the different features, comprising three components: the image encoder's encoding of the optical image texture feature $F_{texture}$, which is the defect detection candidate region obtained from $F_{texture}$; the image encoder's encoding of the optical image boundary feature $F_{edge}$, which is the defect detection candidate region obtained from $F_{edge}$; and the image encoder's encoding of the optical image color feature $F_{color}$, which is the defect detection candidate region obtained from $F_{color}$. $F_{txt}$ represents the semantic feature encoding, which is the result of the text encoder encoding the semantic vector $t_2$.

[0049] The cosine similarity between the underwater optical image feature encoding result $F_{img}$ and the acoustic image semantic feature encoding result $F_{txt}$ is calculated and used as the detection score $S_{cls}$, as shown in the following expression:

[0050]

[0051] Where $S_{cls} \in [0,1]$ represents the degree of matching between the underwater optical image feature encoding result $F_{img}$ and the acoustic image semantic feature encoding result $F_{txt}$.
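A minimal sketch of a cosine-similarity detection score. Raw cosine similarity lies in [-1, 1], while the text states a score in [0, 1]; the linear rescaling used below is an assumption made to reconcile the two, and the original may use a different normalization. The function names are also assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm

def detection_score(f_img, f_txt):
    """Assumed mapping of cosine similarity from [-1, 1] onto [0, 1]."""
    return 0.5 * (cosine_similarity(f_img, f_txt) + 1.0)
```

Under this mapping, orthogonal encodings score 0.5, so a decision threshold would need to sit above that value.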

[0052] For each candidate region, a separate box regressor is used for bounding box prediction, as shown in the following expression:

[0053]

[0054] Where $f_{reg}$ denotes the regressor, and its output gives the predicted boundary coordinates of each candidate box: the predicted x-coordinate of the box center, the predicted y-coordinate of the box center, the predicted width of the box, and the predicted height of the box.
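Since the regressor outputs a box in center format (center x, center y, width, height), drawing it on the optical image usually requires corner coordinates. The helper below is a small sketch of that conversion with clipping to the image bounds; the function name and the clipping behavior are assumptions and are not part of the described method.

```python
def center_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a (cx, cy, w, h) box to (x1, y1, x2, y2), clipped to the image."""
    x1 = max(0.0, cx - w / 2)
    y1 = max(0.0, cy - h / 2)
    x2 = min(float(img_w), cx + w / 2)
    y2 = min(float(img_h), cy + h / 2)
    return x1, y1, x2, y2
```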

[0055] The final model is optimized using a joint loss, which consists of two parts: semantic matching loss and bounding box localization loss.

[0056] The semantic alignment loss expression is as follows:

[0057]

[0058] Where the semantic alignment loss is a cross-entropy loss over the matching of image regions with text categories, and $y_{ij} \in \{0,1\}$ is the ground-truth region-category match label.

[0059] The bounding box localization loss expression is as follows:

[0060]

[0061] Where the bounding box localization loss is a smooth L1 loss function that calculates the positional error between the predicted bounding box and the ground-truth box.

[0062] The total loss expression is as follows:

[0063]

[0064] Where the coefficient in the expression is the loss weighting factor.
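The joint objective described above (a cross-entropy alignment term plus a smooth L1 localization term, combined with a weighting factor) can be sketched as follows. The exact expressions are not reproduced in this text, so the averaging over regions and boxes, the smooth L1 threshold `beta`, the placement of the weighting factor `lam` on the localization term, and all function names are assumptions.

```python
import math

def alignment_loss(match_probs, labels, eps=1e-12):
    """Mean cross-entropy; labels[i][j] is 1 when region i matches category j."""
    loss = 0.0
    for p_row, y_row in zip(match_probs, labels):
        loss -= sum(y * math.log(p + eps) for p, y in zip(p_row, y_row))
    return loss / len(match_probs)

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 summed over the coordinates of one box (x, y, w, h)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

def joint_loss(match_probs, labels, pred_boxes, gt_boxes, lam=1.0):
    """Alignment loss plus lam times the mean localization loss."""
    loc = sum(smooth_l1(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(pred_boxes)
    return alignment_loss(match_probs, labels) + lam * loc
```

Smooth L1 is quadratic for small errors and linear for large ones, which keeps a few badly placed boxes from dominating the gradient.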

[0065] Obviously, those skilled in the art should understand that the steps of the acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures described in the above embodiments of the present invention can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they can be implemented using computer-executable program code, thereby storing them in a storage device for execution by the computing device. Furthermore, in some cases, the steps shown or described can be performed in a different order than presented herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any particular hardware and software combination.

Claims

1. A method for detecting underwater defects in hydraulic structures using acoustic-optical imaging semantic fusion, characterized in that it comprises:
simultaneously acquiring underwater optical image data and sonar image data of the same hydraulic structure area;
extracting underwater optical image features, including texture, boundary, and color features, with an underwater image feature extraction model, and fusing these features into a single image feature;
extracting structural semantic information from the acoustic image with an acoustic image semantic feature extraction model, and encoding it into an embedded semantic vector that can be used for image detection guidance;
inputting the image feature and the semantic vector in parallel into an underwater acoustic-optical imaging semantic fusion detection model which, guided by the semantic features of the acoustic image, processes the image features of the optical image and outputs underwater defect detection results;
wherein, in the acoustic image semantic feature extraction model, a semantic representation encoding unit first performs feature extraction and classification of the image based on the YOLOv11 framework, then generates structured semantic text and performs embedding encoding; for sonar image semantic classification, a deep learning model outputs the probability distribution over defect categories, and the category corresponding to the highest probability, $\arg\max_j p_j$, is taken as the defect label used to generate the corresponding semantic text description, where $p_j$ represents the probability of the $j$-th type of defect predicted by the acoustic image classification model and $j \in \{\text{none}, \text{crack}, \text{hole}, \text{spalling}\}$; based on the classification result, a structured semantic description is generated from a fixed template to obtain the generated text $T$, which is then input into a semantic encoder to extract the semantic vector representation;
a semantic knowledge recognition unit further models the semantic vector, integrates an attention mechanism to improve semantic alignment, and introduces an external defect knowledge vector library to dynamically enhance the current semantic vector; the enhancement expression is defined in terms of the attention weight associated with each type of knowledge, the knowledge vector of the $j$-th type of defect, and the resulting enhanced semantic vector, which is used for subsequent cross-modal matching and is the semantic embedding feature ultimately aligned with optical image regions;
in the underwater acoustic-optical imaging semantic fusion detection model, the cosine similarity between the underwater optical image feature encoding result $F_{img}$ and the acoustic image semantic feature encoding result $F_{txt}$ is calculated and used as the detection score $S_{cls}$: $S_{cls} = \dfrac{F_{img} \cdot F_{txt}}{\lVert F_{img} \rVert \, \lVert F_{txt} \rVert}$, where $S_{cls} \in [0, 1]$ represents the degree of matching between the underwater optical image feature encoding result and the acoustic image semantic feature encoding result;
for each candidate region, a separate box regressor performs bounding box prediction, outputting the predicted boundary coordinates of the candidate box, namely the predicted x-coordinate and y-coordinate of its center point, its predicted width, and its predicted height;
the final model is optimized with a joint loss comprising two parts, a semantic matching loss and a bounding box localization loss: the semantic alignment loss is the cross-entropy loss for matching image regions with text categories, computed against the ground-truth region-category matching label; the bounding box localization loss is the localization loss computed over the predicted bounding boxes; and the total loss combines the two through a loss weighting factor.
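The scoring and joint-loss steps of claim 1 can be illustrated with a minimal sketch. The claim fixes only that the detection score is a cosine similarity, that the matching loss is a cross-entropy, and that a weighting factor combines the two losses. Everything else here is an assumption made for illustration: the smooth L1 form of the localization loss, the additive form of the total loss, the clamping constant `eps`, and all function names.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

def cross_entropy(score, label, eps=1e-7):
    """Binary cross-entropy between a matching score and a 0/1 label.
    The score is clamped into (0, 1) so the logarithm stays finite."""
    s = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(s) + (1 - label) * math.log(1.0 - s))

def smooth_l1(pred_box, true_box):
    """Assumed localization loss: smooth L1 summed over (x, y, w, h)."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def joint_loss(f_img, f_txt, label, pred_box, true_box, lam=1.0):
    """Return (detection score, matching loss + lam * localization loss)."""
    s_cls = cosine_similarity(f_img, f_txt)
    l_align = cross_entropy(s_cls, label)
    l_box = smooth_l1(pred_box, true_box)
    return s_cls, l_align + lam * l_box
```

With identical feature vectors and identical boxes the score is 1 and the loss is close to zero; a box whose height is off by 2 units adds `lam * 1.5` under the assumed smooth L1 term.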

2. The acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures according to claim 1, characterized in that the underwater image feature extraction model comprises the following three feature extraction branches: 1) a texture feature extraction unit, which employs a Gabor-based neural network structure combined with a directional filter bank and a convolutional feature encoder to extract local texture information from the image, including texture frequency, direction, stripe structure, and roughness; 2) a boundary feature extraction unit, which combines the Sobel operator with a deep edge detection network to detect and enhance locations of abrupt structural change in the image, extracting defect edges, abrupt contour changes, and crack boundary information; 3) a color feature extraction unit, which converts the image from RGB space to HSV or Lab color space and extracts the color distribution, brightness variation, and color shift features of each channel to capture visual anomalies related to corrosion, erosion, and color changes. The three branches construct the feature representation of the optical image from the dimensions of texture, structure, and color, respectively; their outputs are concatenated or weighted by a feature fusion module to generate a unified image embedding feature representation, denoted as the image feature $F_{opt}$, which is used in the subsequent multimodal fusion for semantic alignment and defect detection.
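As an illustration of the boundary branch, the following sketch applies only the classical Sobel operator to a small grayscale image stored as a list of rows. The deep edge detection network that the claim combines with it is not modeled, and the function name and the zero-valued border handling are assumptions.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(gray):
    """Gradient magnitude of a 2D grayscale image (list of rows).
    Border pixels are left at 0.0 because the 3x3 window does not fit."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    v = gray[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * v
                    gy += SOBEL_Y[i][j] * v
            out[r][c] = math.hypot(gx, gy)
    return out
```

A vertical step from 0 to 1, such as a crack boundary in an idealized image, produces a strong response along the step and none in flat regions.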

3. The acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures according to claim 1, characterized in that the device for acquiring underwater optical and sonar image data of the same hydraulic structure area is an imaging device that combines two modes: an optical imaging sensor, which acquires optical image data, and an acoustic imaging sensor, which acquires sonar image data. The sensing areas of the acoustic sensor and the optical sensor overlap spatially, so that multimodal image data of the same structural area are obtained.

4. The acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures according to claim 1, characterized in that the underwater acoustic-optical imaging semantic fusion detection model is a multimodal defect detection network built on a semantic guidance mechanism, which fuses visual information from optical images with semantic information from acoustic images. The optical image features and the semantic vectors are input into a parallel image encoder and text encoder, respectively, which model the image region features and the semantic category information. The detection model achieves semantic-level region localization by matching image region features against semantic vectors, that is, it uses sonar information to guide the detection of potential defect areas in the optical image. The detection model also includes a bounding box prediction branch that outputs the spatial location of the defect. Finally, the model outputs the defect type, location information, and detection confidence.
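The semantic guidance described in this claim can be sketched as a filter over candidate regions: each region feature is scored against the sonar-derived semantic vector, and regions that match well are reported with the defect type, box, and confidence. This is a simplified sketch, not the claimed network. The region features and boxes are assumed to come from upstream encoders and a box prediction branch, and the `threshold` value and the output dictionary layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def detect(regions, semantic_vec, defect_label, threshold=0.5):
    """Keep the regions whose features match the semantic vector.

    regions: list of (feature_vector, (x, y, w, h)) pairs, assumed to be
             produced by an image encoder and a box prediction branch.
    Returns detections sorted by confidence, highest first.
    """
    results = []
    for feature, box in regions:
        score = cosine(feature, semantic_vec)
        if score >= threshold:  # hypothetical cut-off
            results.append(
                {"type": defect_label, "box": box, "confidence": score}
            )
    results.sort(key=lambda d: d["confidence"], reverse=True)
    return results
```

For example, with a semantic vector encoding "crack", a region whose feature points in the same direction is kept and an orthogonal one is discarded.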

5. The acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures according to claim 1, characterized in that the acoustic image semantic feature extraction model extracts structural semantic information from acoustic images and encodes it into an embedding vector that can be used for image detection guidance, and comprises: 1) a semantic representation encoding unit, which first performs defect category identification on the sonar image, using a classification model based on a deep convolutional neural network to perform multi-category discrimination and so determine whether defects exist and their specific types, and then generates a corresponding structured natural language description from the identified category through a preset mapping rule, thereby converting the image into semantics; 2) a semantic knowledge recognition unit, which takes the natural language description as its semantic prompt input and extracts a semantic vector representation through a pre-trained text encoding network; this semantic vector carries prior knowledge about the defect category and provides semantic guidance for the image detection area in subsequent stages.
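The conversion from classification output to structured text can be sketched as an argmax followed by a template lookup. The four category names come from claim 1; the template sentences themselves are hypothetical, since the claims state only that a fixed template or preset mapping rule is used.

```python
# Category order follows claim 1: j in {none, crack, hole, spalling}.
CLASSES = ["none", "crack", "hole", "spalling"]

# Hypothetical template wording; the source does not give the actual text.
TEMPLATES = {
    "none": "The sonar image shows no defect in this structural area.",
    "crack": "The sonar image shows a crack defect in this structural area.",
    "hole": "The sonar image shows a hole defect in this structural area.",
    "spalling": "The sonar image shows a spalling defect in this structural area.",
}

def describe(probabilities):
    """Map a probability distribution over CLASSES to (label, text)."""
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per defect category")
    # Index of the highest probability, i.e. the argmax over j.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    label = CLASSES[best]
    return label, TEMPLATES[label]
```

The returned text would then be passed to a pre-trained text encoder to obtain the semantic vector; that encoder is outside the scope of this sketch.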

6. The acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures according to claim 2, characterized in that the texture feature extraction unit uses a set of Gabor filter kernels with four directions and frequencies to convolve the underwater optical image; the color feature extraction unit transforms the image from RGB to HSV and Lab spaces to extract color features, as shown in the following expression: $F_{color} = \mathrm{Concat}\big(\mathrm{Hist}(I_{RGB}),\ \mathrm{Hist}(I_{HSV}),\ \mathrm{Hist}(I_{Lab})\big)$, where $F_{color}$ represents the color feature map; $\mathrm{Concat}$ represents a vector concatenation operation; $\mathrm{Hist}$ represents the color histogram function; $I_{RGB}$ indicates an optimized RGB image; $I_{HSV}$ indicates an optimized HSV image; and $I_{Lab}$ represents an optimized Lab image. The texture features, boundary features, and color features are fused together, as shown in the following expression: $F_{opt} = \alpha W_{tex} F_{tex} + \beta W_{edge} F_{edge} + \gamma W_{color} F_{color}$, $\alpha + \beta + \gamma = 1$, where $F_{opt}$ represents the multidimensional optical image features after fusion; $W_{tex}$, $W_{edge}$, and $W_{color}$ are the learned weight matrices, implemented as different fully connected layers that map the three features $F_{tex}$, $F_{edge}$, and $F_{color}$ to the same dimension before weighted fusion is performed; and $\alpha$, $\beta$, and $\gamma$ represent the feature weighting coefficients.
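A minimal sketch of the Gabor filter bank and the weighted fusion follows. It assumes the standard real-valued Gabor kernel, four orientations spaced 45 degrees apart, and a single wavelength. The kernel size, `sigma`, and `gamma` values are illustrative and are not taken from the source. The fusion function assumes the three feature vectors have already been mapped to a common dimension by the fully connected layers.

```python
import math

def gabor_kernel(size, theta, wavelength, sigma=2.0, gamma=0.5):
    """Real part of a Gabor kernel: Gaussian envelope times cosine carrier."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(
                -(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma * sigma)
            )
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def gabor_bank(size=5, wavelength=4.0):
    """Four kernels at 0, 45, 90 and 135 degrees (assumed orientations)."""
    return [gabor_kernel(size, k * math.pi / 4.0, wavelength) for k in range(4)]

def fuse(f_tex, f_edge, f_color, weights):
    """Weighted sum of three same-length feature vectors.
    The weights must sum to 1, as the fusion constraint requires."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("fusion weights must sum to 1")
    return [a * t + b * e + c * k for t, e, k in zip(f_tex, f_edge, f_color)]
```

Each kernel would be convolved with the optical image to give one orientation-specific texture response; the fused vector plays the role of the unified optical image feature.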

7. A computer device, characterized in that the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of the acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures as described in any one of claims 1-6.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that performs the acoustic-optical imaging semantic fusion detection method for underwater defects in hydraulic structures as described in any one of claims 1-6.