A multi-modal underwater structure intelligent detection method and system
By employing a multimodal intelligent underwater structure detection method, combining an encoder and decoder architecture, and using LSTM and TMCA mechanisms to generate structured detection reports, the problem of inconsistent semantic information extraction and non-standard report generation in underwater detection is solved, achieving efficient and accurate detection and report generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-26
Description
Technical Field
[0001] This invention belongs to the field of underwater structure detection technology, and in particular relates to a multimodal intelligent detection method and system for underwater structures.
Background Technology
[0002] Underwater engineering plays a crucial role in infrastructure construction and maintenance, encompassing a wide range of areas including bridge piers, dams, port facilities, and subsea pipelines. However, the complexity of the underwater environment and the high difficulty of inspection work present numerous challenges to the inspection and maintenance of underwater engineering projects. In recent years, with the continuous development of underwater imaging technology, acquiring images or videos of engineering structures using underwater cameras has become an important inspection method, providing new possibilities for the efficient management and maintenance of underwater engineering projects.
[0003] Underwater structural health monitoring is a challenging task, primarily due to the complexity and unpredictability of the underwater environment. Traditional methods rely mainly on manual inspections and sensor data, but these methods are costly, inefficient, and susceptible to human error, failing to meet the high precision and efficiency requirements of modern underwater engineering management.
[0004] To overcome these limitations, deep learning-based visual analysis techniques have been gradually introduced into the field of underwater inspection. Object detection technology can identify predefined object categories and their location information from underwater images, such as various components of underwater structures and inspection equipment, and has already seen initial applications in underwater defect detection. However, relying solely on object detection technology is still insufficient for some complex underwater inspection tasks, such as detailed descriptions of defects, correlation analysis between defects and structural components, and maintenance decision support based on defect information.
[0005] To generate more comprehensive and accurate underwater inspection reports, it is necessary to further extract rich semantic information from underwater images, including features such as the type and location of defects, as well as the correlation information between defects and underwater structural components. Furthermore, this semantic information needs to be combined with underwater engineering maintenance templates to generate standardized and regulated inspection reports, providing strong support for subsequent maintenance decisions.
[0006] Currently, while some studies have attempted to extract semantic information from underwater images using individual models, these methods have several limitations. Firstly, executing multiple independent models is time-consuming and, because the models are trained on different datasets, may lead to inconsistent entity labels in the extracted semantic information. Secondly, the semantic information extracted by individual models often lacks visual connections to image regions, making it difficult to apply the extracted information directly to practical analysis and decision-making. Therefore, combining object detection with semantic information extraction, and using a template-driven approach to generate defect-aware underwater inspection reports, is a promising method. This approach can provide richer and more accurate information support for vision-based analysis and decision-making in underwater engineering, thereby improving the efficiency and quality of underwater engineering management.
Summary of the Invention
[0007] To address the above technical problems, this invention provides a multimodal intelligent detection method and system for underwater structures.
[0008] The technical solution adopted by this invention to solve its technical problem is:
[0009] A multimodal intelligent detection method for underwater structures, the method comprising the following steps:
[0010] S100: Acquire optical images of underwater structures and preprocess the images to obtain a dataset;
[0011] S200: Input the dataset into the encoder, perform feature extraction and defect recognition through the encoder, and output defect detection results, segmentation mask and image features. The defect detection results include defect category, confidence level and bounding box.
[0012] S300: Input the image features output by the encoder into the decoder, and use the Long Short-Term Memory (LSTM) network unit to perform temporal modeling on the image feature sequence output by the encoder to obtain the hidden state vector;
[0013] S400: The hidden state vector is used as the query vector and interacts with the image features output by the encoder through the Template Mask Cross Attention (TMCA) mechanism. The TMCA mechanism is based on a predefined detection report template that distinguishes between fixed and variable parts, and controls the direct copying of fixed text content and the dynamic generation of variable text content in the report. Based on the output of the TMCA mechanism, a structured detection report containing defect descriptions and segmentation mask images is generated.
[0014] S500: The encoder and decoder are trained using a phased joint training strategy and a preset loss function. After the training converges, the generated structured detection report is verified based on preset performance indicators. The performance indicators include at least a first indicator for evaluating the accuracy of fixed part replication and a second indicator for evaluating the consistency of variable part defect categories.
[0015] Preferably, S100 includes:
[0016] S110: Build a dataset from self-acquired images, including high-resolution images with a pixel size of 5568×4872 taken by a GoPro 12 and images with different pixel sizes taken by other devices; a total of 2000 valid original images are selected, covering various underwater environments such as turbid water, low light, and floating interference;
[0017] S120: Use the labelImg tool to annotate each image for classification and segmentation; the labeled categories are columns, cracks, corrosion, exposed reinforcement and holes;
[0018] S130: Convert the annotation data from JSON format to TXT format that the encoder can read;
[0019] S140: The dataset was augmented to 8000 images using a general augmentation method involving rotation, translation, scaling, blurring, and noise injection.
[0020] S150: Adjust the image pixel size to a uniform 512×512 pixels;
[0021] S160: Divide the augmented 8000 data points into training, validation, and test sets in a 7:2:1 ratio.
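The 7:2:1 division in S160 can be sketched as follows. This is a minimal illustration, not the patented procedure: the file names, the fixed random seed, and the choice to give any rounding remainder to the test set are assumptions made here.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # any rounding remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 8000 augmented images.
image_names = [f"img_{i:04d}.jpg" for i in range(8000)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))  # 5600 1600 800
```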
[0022] Preferably, the encoder in S200 includes a shared backbone network, a feature pyramid network, and a multi-task output head containing a detection head and a segmentation branch. The backbone network is a YOLOv10 model, and the segmentation branch in the multi-task output head uses the P3 feature from the multi-scale features output by the feature pyramid network to perform mask prediction. The resolution of the P3 feature is 64×64.
[0023] Preferably, the segmentation branch includes a feature enhancement module, a context aggregation module, a multi-scale fusion module, a first upsampling module, a detail restoration module, a second upsampling module, and a prediction head, which are connected in sequence. The feature enhancement module is used to enhance local feature representation, the context aggregation module is used to extract multi-scale context information, the multi-scale fusion module is used to reduce the number of channels and enhance local semantic features, the first upsampling module enlarges the feature map to 128×128, the detail restoration module is used to adjust details, the second upsampling module enlarges the feature map to 256×256, and the prediction head is used to generate the final segmentation mask and upsample it to a resolution of 512×512.
[0024] Preferably, S300 includes:
[0025] The LSTM progressively receives the feature vectors of each frame or each defect candidate region extracted by the encoder and, through its memory units and gating mechanisms, performs temporal modeling on the feature sequence, so as to capture the order in which defects appear, cross-frame contextual information, and the feature change trend of the same defect. When the input is a video or a multi-frame image sequence, it fuses the features of the same defect in different frames to generate a unified defect description and avoid duplicate reporting; when the input is a single-frame image, it treats the multiple defect candidate regions in the image as a sequence and generates a description of each region in order.
[0026] Preferably, the predefined inspection report template in S400 clearly distinguishes between fixed and variable parts. The fixed part includes the report title, inspection time, column name of the inspection result description, and its fixed structural framework; the variable part includes the title content, inspection date, and defect description entries generated by the model; wherein, each defect description includes the defect type and its corresponding segmentation mask image, and is arranged in order according to the encoder's inspection results.
[0027] Preferably, S400 includes:
[0028] Generate a binary mask matrix $M \in \{0,1\}^{L \times K}$ based on the predefined detection report template, where $L$ is the token length of the report template and $K$ is the number of image regions output by the encoder;
[0029] when $M_{l,k} = 0$, the $l$-th token in the template belongs to the fixed part, which cannot be modified, and the model is forced to copy this content;
[0030] when $M_{l,k} = 1$, the $l$-th token in the template belongs to the variable part, and the model is allowed to generate its content dynamically based on the features of the $k$-th image region;
[0031] The TMCA mechanism is implemented through the following formula:
[0032] $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + \log M\right)V$;
[0033] where Q is the query vector, which comes from the hidden state of the LSTM unit; K is the key vector, which comes from the image region features output by the encoder; V is the value vector, which comes from the image region features output by the encoder; and d is the dimension of the query vector Q and the key vector K. The mask matrix M is converted into a bias term of the Softmax function (log M, which is 0 where M = 1 and -∞ where M = 0), which is used to control the attention weights between the fixed and generated parts of the template.
[0034] Preferably, the total loss function in S500 is as follows:
[0035] $L_{total} = \lambda_1 L_{det} + \lambda_2 L_{seg} + \lambda_3 L_{copy} + \lambda_4 L_{gen}$;
[0036] where $L_{det}$ is the YOLO detection loss; $L_{seg}$ is the segmentation loss, used for structure and defect mask prediction; $L_{copy}$ is the copy loss, which constrains the fixed part of the template to be identical to the ground truth; $L_{gen}$ is the cross-entropy loss of the generated part, used to optimize the quality of the generated defect descriptions; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weighting coefficients.
[0037] Preferably, the first indicator in S500 is the copy accuracy (CA), where CA = (number of correctly copied fixed tokens) / (total number of fixed tokens);
[0038] The second metric is the Defect Consistency Rate (DCR), which is calculated as: (Number of correctly categorized defects in the generated text) / (Total number of generated defects).
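The two indicators can be computed directly from their definitions. The sketch below assumes that generated and template tokens are aligned by position and that each generated defect is paired with the detector output at the same index; the function names and sample data are illustrative only.

```python
def copy_accuracy(generated_tokens, template_tokens, fixed_flags):
    """CA = correctly copied fixed tokens / total fixed tokens."""
    fixed_positions = [i for i, fixed in enumerate(fixed_flags) if fixed]
    if not fixed_positions:
        return 0.0
    correct = sum(
        1
        for i in fixed_positions
        if i < len(generated_tokens) and generated_tokens[i] == template_tokens[i]
    )
    return correct / len(fixed_positions)


def defect_consistency_rate(generated_defects, detected_defects):
    """DCR = generated defects matching the detector's category / total generated."""
    if not generated_defects:
        return 0.0
    correct = sum(1 for g, d in zip(generated_defects, detected_defects) if g == d)
    return correct / len(generated_defects)


# Illustrative data: one fixed token ("time:") was copied incorrectly.
template = ["Report", "Title:", "{title}", "Detection", "time:", "{date}"]
fixed = [True, True, False, True, True, False]
generated = ["Report", "Title:", "Pier-3", "Detection", "date:", "2025-01-01"]

ca = copy_accuracy(generated, template, fixed)
dcr = defect_consistency_rate(
    ["crack", "corrosion", "hole"], ["crack", "corrosion", "crack"]
)
print(ca, dcr)
```

Here three of the four fixed tokens are copied correctly and two of the three generated defect categories match the detector, so CA is 3/4 and DCR is 2/3.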
[0039] A multimodal underwater structure intelligent detection system, comprising:
[0040] The image acquisition and preprocessing module is used to acquire optical images of underwater structures and preprocess the images to obtain a dataset.
[0041] The encoder module is used to receive the dataset for feature extraction and defect identification, and output defect detection results, segmentation mask and image features. The defect detection results include defect category, confidence score and bounding box.
[0042] The decoder module receives image features output by the encoder module. The decoder module includes:
[0043] The temporal modeling unit uses a Long Short-Term Memory (LSTM) network unit to perform temporal modeling on the image feature sequence output by the encoder, thereby obtaining the hidden state vector.
[0044] The Template Mask Attention Unit uses the hidden state vector as the query vector and interacts with the image features output by the encoder through the Template Mask Cross Attention (TMCA) mechanism. The TMCA mechanism is based on a predefined detection report template that distinguishes between fixed and variable parts, and controls the direct copying of fixed text content and the dynamic generation of variable text content in the report.
[0045] The report generation module is used to generate a structured inspection report containing defect descriptions and segmentation mask images based on the TMCA mechanism output.
[0046] The model training and validation module is used to train the encoder and decoder modules using a phased joint training strategy and a preset loss function. After the training converges, the generated structured detection report is validated based on preset performance indicators. The performance indicators include at least a first indicator for evaluating the accuracy of fixed part replication and a second indicator for evaluating the consistency of variable part defect categories.
[0047] This invention achieves the following significant benefits through the synergistic innovation of an encoder-decoder architecture and a template mask attention mechanism: First, it improves detection efficiency and accuracy. The integrated encoder simultaneously completes defect detection and segmentation, avoiding the error accumulation of traditional multi-stage processing and significantly improving detection efficiency and positioning accuracy. Second, it ensures report standardization and completeness. The template mask cross-attention mechanism strictly adheres to professional report formats, guaranteeing accurate replication of fixed content while allowing flexible filling of key information, ensuring the report's structural standardization and content completeness. Third, it enhances adaptability to complex environments. LSTM temporal modeling and fusion of multi-frame information effectively distinguish between real defects and transient interference, improving detection robustness in complex underwater environments such as turbid water and low light. Finally, it optimizes engineering application value. End-to-end generation of multimodal output including defect location maps and structured reports greatly enhances the interpretability of the results, providing an intuitive and reliable decision-making basis for engineering inspection.
Attached Figure Description
[0048] Figure 1 is a framework diagram of a multimodal intelligent underwater structure detection method according to an embodiment of the present invention;
[0049] Figure 2 is a flowchart of a multimodal intelligent underwater structure detection method according to an embodiment of the present invention;
[0050] Figure 3 shows example images from the underwater structure image dataset in one embodiment of the present invention, classified into five types: (a) column; (b) crack; (c) corrosion; (d) exposed reinforcement; and (e) hole;
[0051] Figure 4 is a diagram of the improved encoder architecture in one embodiment of the present invention;
[0052] Figure 5 is a diagram of the segmentation branch architecture in one embodiment of the present invention;
[0053] Figure 6 is an example diagram of report generation in one embodiment of the present invention;
[0054] Figure 7 is a schematic diagram of the decoding process of the decoder in one embodiment of the present invention;
[0055] Figure 8 is a schematic diagram of the loss curves in one embodiment of the present invention, wherein (a) is the loss curve of the training set and (b) is the loss curve of the test set.
Detailed Implementation
[0056] To enable those skilled in the art to better understand the technical solution of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings.
[0057] The purpose of this invention is to provide a multimodal intelligent underwater structure detection method. By combining object detection with template-guided natural language generation, end-to-end generation from input image to structured detection report is achieved. As shown in Figure 1, the overall working framework of this invention adopts a dual-module "encoder-decoder" structure, which can automatically identify and segment structural defects in complex underwater environments and generate multimodal output reports.
[0058] As shown in Figure 2, a multimodal intelligent detection method for underwater structures includes the following steps:
[0059] S100: Acquire optical images of underwater structures and preprocess the images to obtain a dataset;
[0060] S200: Input the dataset into the encoder, perform feature extraction and defect recognition through the encoder, and output defect detection results, segmentation mask and image features. The defect detection results include defect category, confidence level and bounding box.
[0061] S300: Input the image features output by the encoder into the decoder, and use the Long Short-Term Memory (LSTM) network unit to perform temporal modeling on the image feature sequence output by the encoder to obtain the hidden state vector;
[0062] S400: The hidden state vector is used as the query vector and interacts with the image features output by the encoder through the Template Masked Cross-Attention (TMCA) mechanism. The TMCA mechanism is based on a predefined detection report template that distinguishes between fixed and variable parts, controlling the direct copying of fixed text content and the dynamic generation of variable text content in the report. Based on the output of the TMCA mechanism, a structured detection report containing defect descriptions and segmentation mask images is generated.
[0063] S500: The encoder and decoder are trained using a phased joint training strategy and a preset loss function. After the training converges, the generated structured detection report is verified based on preset performance indicators. The performance indicators include at least a first indicator for evaluating the accuracy of fixed part replication and a second indicator for evaluating the consistency of variable part defect categories.
[0064] Specifically, the encoder employs an improved YOLOv10 model, adding a segmentation branch to output a high-precision segmentation mask for each detected target while retaining its core feature extraction capabilities. The encoder performs defect category identification, image feature extraction, and segmentation on optical image data of underwater structures, outputting defect category prediction, confidence level, segmentation mask, and a high-dimensional semantic vector after feature compression. The decoder is a hybrid model built on LSTM units and an attention mechanism, namely an attention-enhanced recurrent neural network. The decoder integrates an innovative template mask cross-attention mechanism, matching the current features with a pre-set detection report template to achieve defect type comparison verification and semantic enhancement. It processes and analyzes the information transmitted by the encoder, generates a detection report, and achieves multimodal output. This design ensures end-to-end association between detection information and generated text.
[0065] In one embodiment, S100 includes:
[0066] S110: Build a dataset from self-acquired images, including high-resolution images with a pixel size of 5568×4872 taken by a GoPro 12 and images with different pixel sizes taken by other devices; a total of 2000 valid original images are selected, covering various underwater environments such as turbid water, low light, and floating interference;
[0067] S120: Use the labelImg tool to annotate each image for classification and segmentation; the labeled categories are columns, cracks, corrosion, exposed reinforcement and holes;
[0068] S130: Convert the annotation data from JSON format to TXT format that the encoder can read;
[0069] S140: The dataset was augmented to 8000 images using a general augmentation method involving rotation, translation, scaling, blurring, and noise injection.
[0070] S150: Adjust the image pixel size to a uniform 512×512 pixels;
[0071] S160: Divide the augmented 8000 data points into training, validation, and test sets in a 7:2:1 ratio.
[0072] Specifically, this invention uses self-acquired optical images of underwater structures to construct the dataset. The original images come from two sources: high-resolution images of underwater bridge piers with a pixel size of 5568×4872 captured by a GoPro 12, and images of underwater structures of varying pixel sizes captured by other devices. The 2000 selected valid original images cover various typical underwater environments, including turbid water, low light, and floating interference. Manual annotation is performed with the labelImg tool, and five categories are labeled: column (reference structural area), cracks, corrosion, exposed reinforcement, and holes. Example annotations are shown in Figure 3. After the annotation data is converted from JSON format to a TXT format readable by the encoder, the dataset is expanded to 8000 images using data augmentation methods such as rotation, translation, scaling, blurring, and noise injection. All images are then uniformly resized to 512×512 pixels, and the dataset is divided into training, validation, and test sets in a 7:2:1 ratio to improve the model's generalization ability across multiple scenarios. The specific composition of the dataset is shown in Table 1.
[0073] Table 1 Dataset Composition
[0074]
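The JSON-to-TXT conversion in S130 is not specified further in the text. As a hedged illustration, the sketch below assumes a JSON layout with `imageWidth`, `imageHeight`, and a `shapes` list of labeled polygons, and writes the widely used YOLO detection label line `class_id cx cy w h` with coordinates normalized to the image size. The field names, the class-to-id mapping, and the reduction of each polygon to its bounding box are all assumptions, not details taken from the patent.

```python
import json

# Hypothetical class-to-id mapping for the five labeled categories.
CLASS_IDS = {
    "column": 0,
    "crack": 1,
    "corrosion": 2,
    "exposed_reinforcement": 3,
    "hole": 4,
}


def json_to_yolo_lines(annotation_json):
    """Convert one annotation (assumed JSON layout) to YOLO-style label lines."""
    data = json.loads(annotation_json)
    width, height = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        x_min, x_max = min(xs), max(xs)
        y_min, y_max = min(ys), max(ys)
        # Normalized box center and size, as expected by YOLO label files.
        cx = (x_min + x_max) / 2 / width
        cy = (y_min + y_max) / 2 / height
        w = (x_max - x_min) / width
        h = (y_max - y_min) / height
        class_id = CLASS_IDS[shape["label"]]
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


example = json.dumps({
    "imageWidth": 512,
    "imageHeight": 512,
    "shapes": [
        {"label": "crack", "points": [[128, 64], [384, 64], [384, 192], [128, 192]]}
    ],
})
yolo_lines = json_to_yolo_lines(example)
print(yolo_lines[0])  # 1 0.500000 0.250000 0.500000 0.250000
```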
[0075] In one embodiment, the encoder in S200 includes a shared backbone network, a feature pyramid network, and a multi-task output head containing a detection head and a segmentation branch. The backbone network is a YOLOv10 model, and the segmentation branch in the multi-task output head uses the P3 feature from the multi-scale features output by the feature pyramid network to perform mask prediction. The resolution of the P3 feature is 64×64.
[0076] Specifically, this invention uses YOLOv10 as the baseline model for the encoder, adding a segmentation branch to its detection head. The original YOLOv10 detection head outputs the category, bounding box, and confidence score; the expanded output also includes a segmentation mask. The segmentation mask is used for precise defect localization and can be directly overlaid on the original image to generate a visual defect localization map. The YOLOv10 structure includes a backbone network, a feature pyramid network, and a detection head. This implementation keeps the YOLOv10 backbone, neck, and original detection head unchanged, adding only a parallel segmentation branch at the detection head that reuses the feature maps used by the detection head. The improved encoder architecture is shown in Figure 4.
[0077] The segmentation branch utilizes multi-scale features from the Feature Pyramid Network (FPN) and follows the Path Aggregation Network (PAN) structure to output multi-level features {P3, P4, P5}. To obtain a high-precision mask, the P3 feature output by the Feature Pyramid Network, with a size of 64×64, is selected. This feature has high resolution and is beneficial for detailed segmentation.
[0078] In one embodiment, the segmentation branch includes a feature enhancement module, a context aggregation module, a multi-scale fusion module, a first upsampling module, a detail restoration module, a second upsampling module, and a prediction head, which are connected in sequence. The feature enhancement module is used to enhance local feature representation, the context aggregation module is used to extract multi-scale context information, the multi-scale fusion module is used to reduce the number of channels and enhance local semantic features, the first upsampling module enlarges the feature map to 128×128, the detail restoration module is used to adjust details, the second upsampling module enlarges the feature map to 256×256, and the prediction head is used to generate the final segmentation mask and upsample it to a resolution of 512×512.
[0079] Specifically, the segmentation branch architecture is shown in Figure 5. Starting from the input feature P3, the local feature representation is first enhanced by the feature enhancement module. The enhanced result is then processed by the context aggregation module to extract multi-scale contextual information. Next, the multi-scale fusion module reduces the number of channels and further enhances local semantic features. The first upsampling enlarges the feature map to 128×128, and details are adjusted by the detail restoration module. The second upsampling enlarges the feature map to 256×256, and the prediction head generates the final prediction map. Finally, the map is upsampled to the target resolution of 512×512.
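The resolution path of the segmentation branch can be traced with a few lines of code. With a 512×512 input and a 64×64 P3 feature, P3 is at 1/8 of the input resolution (512 / 64 = 8), and three ×2 upsampling steps return to 512×512. The sketch only tracks spatial size; the nearest-neighbour interpolation shown is an assumption, since the text does not name the upsampling method, and the internal layers of each module are not modeled.

```python
def upsample_nearest(feature_map, factor=2):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


# (stage name, spatial scale factor) in the order given in the text.
STAGES = [
    ("feature enhancement", 1),
    ("context aggregation", 1),
    ("multi-scale fusion", 1),
    ("first upsampling", 2),
    ("detail restoration", 1),
    ("second upsampling", 2),
    ("prediction head", 1),
    ("final upsampling", 2),
]


def trace_resolution(p3_size=64):
    """Return the feature-map side length after each stage of the branch."""
    size = p3_size
    trace = []
    for name, factor in STAGES:
        size *= factor
        trace.append((name, size))
    return trace


trace = trace_resolution()
for name, size in trace:
    print(f"{name}: {size}x{size}")

# Tiny 2x2 example of one x2 upsampling step.
mask = upsample_nearest([[0, 1], [1, 0]])
```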
[0080] In one embodiment, S300 includes:
[0081] The LSTM progressively receives the feature vectors of each frame or each defect candidate region extracted by the encoder and, through its memory units and gating mechanisms, performs temporal modeling on the feature sequence, so as to capture the order in which defects appear, cross-frame contextual information, and the feature change trend of the same defect. When the input is a video or a multi-frame image sequence, it fuses the features of the same defect in different frames to generate a unified defect description and avoid duplicate reporting; when the input is a single-frame image, it treats the multiple defect candidate regions in the image as a sequence and generates a description of each region in order.
[0082] Specifically, a Long Short-Term Memory (LSTM) network is used as the core temporal modeling unit of the decoder, receiving the feature vector extracted by the encoder for each frame as input. The encoder output is organized as a time series. The LSTM progressively receives the feature vectors of each frame or each defect candidate region extracted by the encoder. Through its memory units and gating mechanisms (input gate, forget gate, output gate), it captures the order in which defects appear (output in the order of detection time), the feature change trend of the same defect across multiple frames, and the contextual information across frames (the temporal consistency of defect location and type).
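For readers unfamiliar with the gating mechanism, the sketch below implements one standard LSTM step (input, forget, and output gates plus a candidate state) in plain Python and runs it over a short feature sequence. The sizes, random initialization, and sample feature values are illustrative assumptions; the patent does not give the decoder's dimensions or weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LSTMCell:
    """Standard LSTM cell; weights act on the concatenation [x, h]."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        concat = input_size + hidden_size
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(concat)]
                   for _ in range(hidden_size)]
            for gate in self.GATES
        }
        self.biases = {gate: [0.0] * hidden_size for gate in self.GATES}

    def step(self, x, h, c):
        z = list(x) + list(h)
        pre = {
            gate: [a + b for a, b in zip(matvec(self.weights[gate], z),
                                         self.biases[gate])]
            for gate in self.GATES
        }
        i = [sigmoid(v) for v in pre["input"]]        # how much new info to write
        f = [sigmoid(v) for v in pre["forget"]]       # how much old memory to keep
        o = [sigmoid(v) for v in pre["output"]]       # how much memory to expose
        g = [math.tanh(v) for v in pre["candidate"]]  # candidate memory content
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


# Illustrative sequence: three 4-dimensional feature vectors (frames or regions).
features = [[0.2, 0.1, 0.0, 0.5], [0.3, 0.1, 0.1, 0.4], [0.9, 0.0, 0.2, 0.1]]
cell = LSTMCell(input_size=4, hidden_size=3)
h, c = [0.0] * 3, [0.0] * 3
hidden_states = []
for x in features:
    h, c = cell.step(x, h, c)
    hidden_states.append(h)  # each h would serve as a query vector for TMCA
```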
[0083] If the input is a video sequence or multi-frame detection results, LSTM fuses the features of the same defect across multiple frames at different times to avoid duplicate reporting. If the input is a single frame with multiple bounding boxes, LSTM treats them as a sequence and generates descriptions sequentially in chronological order. LSTM updates the current hidden state based on the token and state of the previous time step; the LSTM output serves as the query vector Q, which is fed into the TMCA mechanism to interact with image features (K,V); the template mask M controls the scope of attention: fixed parts are forcibly copied, while variable parts are dynamically generated; the outputs of LSTM and TMCA are fused to obtain the probability distribution of the token at the current time step; this process is iterated until a complete report text is generated.
[0084] In one embodiment, the predefined inspection report template in S400 clearly distinguishes between fixed and variable parts. The fixed part includes the report title, inspection time, column name of the inspection result description, and its fixed structural framework; the variable part includes the title content, inspection date, and defect description entries generated by the model; wherein, each defect description contains the defect type and its corresponding segmentation mask image, and is arranged in order according to the encoder's inspection results.
[0085] Specifically, to ensure the structural consistency of the generated text, a predefined template was designed. The template content is based on the on-site quality record form of a professional inspection agency for the repair of defects in bridges over water, clearly distinguishing between fixed and variable sections. The template structure is as follows:
[0086] Report Title: {Title}
[0087] Detection time: {date}
[0088] Description of test results:
[0089] 1. [Defect Description]
[0090] Segmentation mask image
[0091] 2. [Defect Description]
[0092] Segmentation mask image
[0093] In this design, {Title}, {date}, and [Defect Description] are variable parts, while the rest is fixed content. The fixed part must reproduce the ground-truth template exactly; accurate copying is enforced through the copy loss function. This template design provides a clear structural framework for the generated text, ensuring consistent formatting of the generated content.
[0094] Report Title: the {Title} slot is entered manually afterwards. Detection time: the {date} slot is entered manually afterwards. Description of test results: the [Defect Description] entries are generated by the model. The generated content should include the defect type (e.g., cracks, corrosion, holes), taken from the category output by the encoder detection model; if multiple defects exist, they are listed in order of detection time, each with a complete description of "defect type + segmentation mask image". An example of the output format is shown in Figure 6 (the generated parts are marked with [ ]).
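The fixed/variable split of the template can be represented as a list of segments, each flagged as fixed or variable. The sketch below is a plain string-filling illustration of that structure, not the learned decoder: the segment boundaries, the lowercase slot names, and the use of a file path to stand in for each segmentation mask image are assumptions.

```python
# Each segment is (text, is_fixed). Fixed text is copied verbatim;
# variable slots are filled in later.
TEMPLATE = [
    ("Report Title: ", True),
    ("{title}", False),
    ("\nDetection time: ", True),
    ("{date}", False),
    ("\nDescription of test results:", True),
]


def render_report(title, date, defects):
    """Fill the template; defects is a list of (defect_type, mask_image_path)."""
    parts = []
    for text, is_fixed in TEMPLATE:
        if is_fixed:
            parts.append(text)  # copied unchanged
        else:
            parts.append(text.format(title=title, date=date))
    # One numbered entry per defect, in the order the detector reported them.
    for index, (defect_type, mask_path) in enumerate(defects, start=1):
        parts.append(f"\n{index}. [{defect_type}]\n{mask_path}")
    return "".join(parts)


report = render_report(
    "Pier 3 inspection",
    "2025-01-01",
    [("crack", "mask_1.png"), ("corrosion", "mask_2.png")],
)
print(report)
```

The same segment flags are what a token-level mask would be derived from: fixed segments map to mask value 0 and variable slots to mask value 1.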
[0095] In one embodiment, S400 includes:
[0096] Generate a binary mask matrix $M \in \{0,1\}^{L \times K}$ based on the predefined detection report template, where $L$ is the token length of the report template and $K$ is the number of image regions output by the encoder;
[0097] when $M_{l,k} = 0$, the $l$-th token in the template belongs to the fixed part, which cannot be modified, and the model is forced to copy this content;
[0098] when $M_{l,k} = 1$, the $l$-th token in the template belongs to the variable part, and the model is allowed to generate its content dynamically based on the features of the $k$-th image region;
[0099] The TMCA mechanism is implemented through the following formula:
[0100] $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + \log M\right)V$;
[0101] where Q is the query vector, which comes from the hidden state of the LSTM unit; K is the key vector, which comes from the image region features output by the encoder; V is the value vector, which comes from the image region features output by the encoder; and d is the dimension of the query vector Q and the key vector K. The mask matrix M is converted into a bias term of the Softmax function (log M, which is 0 where M = 1 and -∞ where M = 0), which is used to control the attention weights between the fixed and generated parts of the template.
[0102] Specifically, the template masking mechanism is implemented as follows: a predefined report template is encoded into a binary mask matrix $M \in \{0,1\}^{L \times K}$, where L is the template length and K is the number of image regions. Fixed text parts (such as "Report Title:") correspond to M=0, forcing the model to directly copy the template content; variable parts (such as defect descriptions) correspond to M=1, allowing attention-based dynamic generation.
[0103] The traditional cross-attention formula is modified as follows: the LSTM output serves as the query vector Q, the encoder image region features serve as the key vector K and the value vector V, and a predefined report template mask is added. The mask matrix M controls whether the attention mechanism allows dynamic generation: when $M_{l,k} = 0$, the l-th template token has fixed content and cannot be modified; when $M_{l,k} = 1$, the l-th template token can be dynamically generated from image features. The term log M takes the value 0 or -∞ and serves as the Softmax bias that controls the attention weights between the fixed and generated parts of the template. Ultimately, the attention output is a sequence of structured text vectors following the preset template format. The decoding process of the TMCA mechanism is shown in Figure 7.
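The masked attention described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `tmca` is hypothetical, the number of image regions is written R in the code to avoid a clash with the key matrix K, and the handling of fixed tokens is an assumption. Because a row of M that is entirely zero would make every score -∞ and leave the Softmax undefined, the sketch returns a zero context vector for such a row and flags the token as "copy from template".

```python
import math

def tmca(Q, K, V, M):
    """Template-mask cross-attention: Softmax(Q K^T / sqrt(d) + log M) V.

    Q: L x d queries (LSTM hidden states), K: R x d keys, V: R x dv values
    (image region features), M: L x R binary mask (1 = attend, 0 = blocked).
    Returns (context, copy_flags). An all-zero mask row marks a fixed token:
    its context is a zero vector and copy_flags[l] is True (assumption, the
    source does not say how such rows are handled numerically).
    """
    d = len(Q[0])
    dv = len(V[0])
    context, copy_flags = [], []
    for q, m_row in zip(Q, M):
        if not any(m_row):
            context.append([0.0] * dv)
            copy_flags.append(True)
            continue
        # log M is 0 where M = 1 and -inf where M = 0.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            + (0.0 if m else -math.inf)
            for k, m in zip(K, m_row)
        ]
        top = max(scores)  # finite, because at least one region is unmasked
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        context.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(dv)]
        )
        copy_flags.append(False)
    return context, copy_flags

# Toy example: token 0 is fixed, token 1 may only attend to region 1.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
M = [[0, 0], [0, 1]]
context, copy_flags = tmca(Q, K, V, M)
```

In the toy example the second token receives exactly the value vector of region 1, since the -∞ bias drives the weight of region 0 to zero.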
[0104] In one embodiment, the total loss function in S500 is specifically as follows:
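[0105] $$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{det} + \lambda_2 \mathcal{L}_{seg} + \lambda_3 \mathcal{L}_{copy} + \lambda_4 \mathcal{L}_{gen}$$;
[0106] where $\mathcal{L}_{det}$ is the YOLO detection loss; $\mathcal{L}_{seg}$ is the segmentation loss used for structure and defect mask prediction; $\mathcal{L}_{copy}$ is the copy loss, which constrains the fixed parts of the template to be completely identical to the ground truth; $\mathcal{L}_{gen}$ is the cross-entropy loss of the generated parts, used to optimize the quality of the generated defect descriptions; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weighting coefficients.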
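A hedged sketch of how the four terms might be combined is given below. The source states only that the copy loss constrains the fixed template tokens and that the generation loss is a cross-entropy over the generated tokens; computing the copy loss as a cross-entropy restricted to fixed positions, the default weights of 1.0, and the names `masked_cross_entropy` and `total_loss` are assumptions. The detection and segmentation losses are passed in as precomputed numbers.

```python
import math

def masked_cross_entropy(probs, targets, select):
    """Mean negative log-likelihood over the positions where select is True.

    probs[t] is a probability distribution over the vocabulary at position t,
    targets[t] is the ground-truth token id at that position.
    """
    picked = [-math.log(p[y]) for p, y, s in zip(probs, targets, select) if s]
    return sum(picked) / len(picked) if picked else 0.0

def total_loss(l_det, l_seg, probs, targets, is_fixed,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of detection, segmentation, copy and generation losses."""
    lam1, lam2, lam3, lam4 = weights
    # Copy loss: fixed template positions only (assumed cross-entropy form).
    l_copy = masked_cross_entropy(probs, targets, is_fixed)
    # Generation loss: variable positions only.
    l_gen = masked_cross_entropy(probs, targets, [not f for f in is_fixed])
    return lam1 * l_det + lam2 * l_seg + lam3 * l_copy + lam4 * l_gen

# Two-token toy report: position 0 is fixed, position 1 is generated.
probs = [[1.0, 0.0], [0.5, 0.5]]
targets = [0, 1]
is_fixed = [True, False]
loss = total_loss(1.0, 2.0, probs, targets, is_fixed)
```

Here the fixed token is predicted with probability 1, so the copy term vanishes and the total is 3 + ln 2.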
[0107] In one embodiment, the first metric in S500 is the replication accuracy CA, where CA = the number of correctly replicated fixed tokens / the total number of fixed tokens;
[0108] The second metric is the Defect Consistency Rate (DCR), which is calculated as: (Number of correctly categorized defects in the generated text) / (Total number of generated defects).
[0109] Specifically, to comprehensively evaluate the performance of the proposed decoder in generating underwater structure defect detection reports, this invention considers two aspects: the accuracy of copying fixed parts and the quality of generating variable parts. Specific metrics are as follows: Copy Accuracy (CA) measures the fidelity of the decoder's copying of the fixed template portion of the report, ensuring format conformity. Defect Consistency Rate (DCR) measures whether the defect categories in the generated text are consistent with the output of the detection model. Wherein:
[0110] CA = Number of correctly copied fixed tokens / Total number of fixed tokens;
[0111] DCR = Number of correctly categorized defects in the generated text / Total number of generated defects;
[0112] After training convergence, CA≈1 and DCR≥0.95 were achieved, indicating that the generated descriptions are almost completely faithful to the encoder's detection results; the decoder performs well in maintaining the standardization of report format and the accuracy of content.
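The two metrics can be computed as in the following sketch. Token-level exact matching for CA and order-based alignment between generated and detected defects for DCR are assumptions; the source defines only the two ratios. The function names are hypothetical.

```python
def copy_accuracy(generated, reference, is_fixed):
    """CA = correctly copied fixed tokens / total fixed tokens."""
    fixed_pairs = [(g, r) for g, r, f in zip(generated, reference, is_fixed) if f]
    if not fixed_pairs:
        raise ValueError("the template has no fixed tokens")
    return sum(g == r for g, r in fixed_pairs) / len(fixed_pairs)

def defect_consistency_rate(generated_defects, detected_defects):
    """DCR = correctly categorized generated defects / total generated defects.

    Defects are compared position by position (assumed alignment); generated
    defects without a detected counterpart count as incorrect.
    """
    if not generated_defects:
        raise ValueError("the report contains no generated defects")
    correct = sum(g == d for g, d in zip(generated_defects, detected_defects))
    return correct / len(generated_defects)

# Toy example: the variable token differs, but every fixed token is copied.
generated = ["Report", "Title:", "pier", "Inspection", "Time:"]
reference = ["Report", "Title:", "dam", "Inspection", "Time:"]
is_fixed = [True, True, False, True, True]
ca = copy_accuracy(generated, reference, is_fixed)

dcr = defect_consistency_rate(
    ["crack", "corrosion", "hole", "crack"],
    ["crack", "corrosion", "hole", "corrosion"],
)
```

In the toy example CA is 1.0 because only the variable token differs, and DCR is 0.75 because one of the four generated categories disagrees with the encoder output.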
[0113] Furthermore, to achieve end-to-end optimization of the proposed multimodal detection framework, the visual encoder (YOLOv10+Seg) and text decoder (LSTM with template mask cross-attention, TMCA) were trained in a unified manner. In this setting, the encoder extracts multi-scale visual representations from underwater structure images, while the decoder generates structured inspection reports constrained by predefined templates. Both components are updated simultaneously through a multi-task loss, ensuring that visual feature learning and language generation are mutually beneficial.
[0114] To improve stability and convergence, training is conducted in three phases:
[0115] 1. First stage – Encoder pre-training
[0116] The encoder is trained solely on the detection and segmentation objectives to produce robust visual representations.
[0117] 2. Second stage – Decoder pre-training
[0118] The encoder is frozen, and the LSTM+TMCA decoder is trained to generate report text from the given encoder features and to adapt to the template-mask constraints.
[0119] 3. Third stage – End-to-end fine-tuning
[0120] Selected encoder layers are unfrozen, and the encoder and decoder are jointly trained using the total loss function, enabling the encoder to learn features suitable for both visual recognition and language generation.
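The three phases above can be summarized as a freeze schedule, sketched below in framework-independent Python. The parameter group names, the choice of loss terms in the second stage, and the default set of encoder groups unfrozen in the third stage are hypothetical; the source does not state which encoder layers are selected.

```python
from dataclasses import dataclass

# Hypothetical parameter groups for the encoder and decoder.
ENCODER_GROUPS = ("encoder.backbone", "encoder.fpn", "encoder.heads")
DECODER_GROUPS = ("decoder.lstm", "decoder.tmca")

@dataclass(frozen=True)
class StagePlan:
    name: str
    trainable: tuple  # parameter groups updated in this stage
    losses: tuple     # loss terms that are active in this stage

def training_plan(unfrozen_encoder_groups=("encoder.heads",)):
    """Return the three-stage schedule described in the text."""
    return [
        # Stage 1: encoder only, detection and segmentation objectives.
        StagePlan("encoder pre-training", ENCODER_GROUPS, ("det", "seg")),
        # Stage 2: encoder frozen, decoder learns report generation.
        StagePlan("decoder pre-training", DECODER_GROUPS, ("copy", "gen")),
        # Stage 3: selected encoder groups unfrozen, all loss terms active.
        StagePlan(
            "end-to-end fine-tuning",
            tuple(unfrozen_encoder_groups) + DECODER_GROUPS,
            ("det", "seg", "copy", "gen"),
        ),
    ]

def is_trainable(stage, group):
    """True if the given parameter group is updated in the given stage."""
    return group in stage.trainable

plan = training_plan()
```

In a deep learning framework, `is_trainable` would typically decide whether gradients are enabled for each parameter group at the start of a stage.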
[0121] Model training was completed on four 3090 graphics cards with 24 GB of VRAM each. The encoder precision, recall, and mAP50 were 0.945, 0.89, and 0.881, respectively. Figure 8 shows the loss curves of the model on the training and test sets: (a) is the loss curve on the training set, and (b) is the loss curve on the test set.
[0122] Compared with existing defect identification methods based on single detection or segmentation, the multimodal intelligent underwater structure detection method proposed in this invention has significant advantages in the following aspects:
[0123] 1. High-precision detection and segmentation integrated for improved efficiency: Traditional methods typically perform target detection or segmentation separately, resulting in complex processes and low efficiency. This invention introduces a segmentation branch, achieving pixel-level defect recognition. Simultaneously, it utilizes high-resolution feature maps (P3) to capture clear defect boundaries and shapes, significantly improving the accuracy of defect localization. By directly outputting the detection results and pixel-level masks in the encoder, it achieves integrated processing of high-precision detection and segmentation.
[0124] 2. Report Generation Format Consistency and Information Completeness: Traditional automatic report generation methods often lack a fixed format, easily leading to missing content or inconsistent style. This invention introduces a Template Mask Cross-Attention (TMCA) mechanism into the decoder. Through a predefined detection report template, it ensures that fixed parts of the report are strictly copied, while the defect description is automatically generated by the model based on image features. "Semantic consistency" is attributed to the Template Mask Cross-Attention mechanism in the decoder. By forcing the fixed parts to be accurately copied according to the predefined format, it avoids format deviations in text generation. Simultaneously, the dynamic semantic generation of variable parts relies on the encoder's image features, ensuring consistency between content and image context. This ensures that the generated report always conforms to the semantic and format requirements of the template. "Copy Loss" guarantees complete consistency of the fixed parts, while the dynamically generated parts are constrained by image features, ensuring that the output report conforms to a professional format and contains key information.
[0125] 3. Temporal Consistency and Multi-Frame Information Fusion: Underwater structure detection is typically based on video or multi-frame images. Traditional methods process single frames independently, easily missing continuous defect information. This invention uses LSTM temporal units to model multi-frame image features, achieving temporal consistency between defect detection and report generation. By fusing temporal context through LSTM, it can better distinguish between real defects and transient noise, improving the robustness of defect identification.
[0126] 4. High interpretability and conducive to engineering applications: Compared with existing "black box" detection models, the output of this invention includes not only defect location maps (original image overlaid with bounding boxes and masks), but also structured inspection reports. The fixed parts of the report are derived from predefined templates, while the defect description and defect quantification indicators are derived from model inference. The dual-modal output of images and text enhances interpretability, facilitates manual review and engineering quality recording, and meets the standardization needs of testing institutions.
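The temporal modeling mentioned in item 3 above relies on a standard LSTM cell, which can be sketched in plain Python as follows. This is a generic LSTM forward pass under assumed settings, not the trained decoder of the invention: the feature and hidden sizes, the random initialization, and the names `LSTMCell` and `encode_sequence` are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """Standard LSTM cell with input (i), forget (f), candidate (g), output (o) gates."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n = input_size + hidden_size
        # One weight matrix and one bias vector per gate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}
        self.hidden_size = hidden_size

    def _affine(self, gate, xh):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        xh = list(x) + list(h)  # concatenate input and previous hidden state
        i = [sigmoid(a) for a in self._affine("i", xh)]
        f = [sigmoid(a) for a in self._affine("f", xh)]
        g = [math.tanh(a) for a in self._affine("g", xh)]
        o = [sigmoid(a) for a in self._affine("o", xh)]
        # The cell state keeps part of the old memory and adds new content.
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

def encode_sequence(cell, features):
    """Run the cell over per-frame (or per-region) features; return all hidden states."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    states = []
    for x in features:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states

cell = LSTMCell(input_size=3, hidden_size=4)
states = encode_sequence(cell, [[0.1, 0.2, 0.3], [0.0, 0.5, -0.2]])
```

Each hidden state in `states` depends on all earlier inputs, which is the property that lets the decoder relate the same defect across frames.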
[0127] In one embodiment, a multimodal underwater structure intelligent detection system is also provided, comprising:
[0128] The image acquisition and preprocessing module is used to acquire optical images of underwater structures and preprocess the images to obtain a dataset.
[0129] The encoder module is used to receive the dataset for feature extraction and defect identification, and output defect detection results, segmentation mask and image features. The defect detection results include defect category, confidence score and bounding box.
[0130] The decoder module receives image features output by the encoder module. The decoder module includes:
[0131] The temporal modeling unit uses a Long Short-Term Memory (LSTM) network unit to perform temporal modeling on the image feature sequence output by the encoder, thereby obtaining the hidden state vector.
[0132] The Template Mask Attention Unit uses the hidden state vector as the query vector and interacts with the image features output by the encoder through the Template Mask Cross Attention (TMCA) mechanism. The TMCA mechanism is based on a predefined detection report template that distinguishes between fixed and variable parts, and controls the direct copying of fixed text content and the dynamic generation of variable text content in the report.
[0133] The report generation module is used to generate a structured inspection report containing defect descriptions and segmentation mask images based on the TMCA mechanism output.
[0134] The model training and validation module is used to train the encoder and decoder modules using a phased joint training strategy and a preset loss function. After the training converges, the generated structured detection report is validated based on preset performance indicators. The performance indicators include at least a first indicator for evaluating the accuracy of fixed part replication and a second indicator for evaluating the consistency of variable part defect categories.
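The data flow between the modules listed above can be sketched as follows. The class names, the bounding-box convention, and the callable-based interface are assumptions for illustration; the source specifies only what each module receives and outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    category: str
    confidence: float
    bbox: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)

@dataclass
class EncoderOutput:
    detections: List[Detection]  # defect detection results
    masks: list                  # one segmentation mask per detection
    features: list               # image features passed on to the decoder

def run_inspection(images, preprocess, encoder, decoder, render_report):
    """Chain the modules: preprocessing, encoder, decoder, report generation."""
    dataset = [preprocess(image) for image in images]
    outputs = [encoder(item) for item in dataset]
    # The decoder sees the feature sequence across all frames.
    decoded = decoder([out.features for out in outputs])
    return render_report(decoded, outputs)
```

Any object with the matching call signature can stand in for a module, which makes it straightforward to test the pipeline with stubs before trained models are available.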
[0135] Specific limitations regarding the multimodal intelligent underwater structure detection system can be found in the limitations of the multimodal intelligent underwater structure detection method described above, and will not be repeated here. Each module in the aforementioned multimodal intelligent underwater structure detection system can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0136] In one embodiment, a computer device is also provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of a multimodal underwater structure intelligent detection method.
[0137] In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of a multimodal underwater structure intelligent detection method.
[0138] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.
[0139] The multimodal intelligent underwater structure detection method and system provided by the present invention have been described in detail above. Specific examples have been used to illustrate the principles and implementation of the invention, and the descriptions of these embodiments are intended only to help in understanding its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to the invention without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.
Claims
1. A multimodal intelligent detection method for underwater structures, characterized in that the method includes the following steps: S100: Acquire optical images of underwater structures and preprocess the images to obtain a dataset; S200: The dataset is input into the encoder, which performs feature extraction and defect identification, outputting defect detection results, segmentation masks, and image features. The defect detection results include defect category, confidence score, and bounding box. The encoder in S200 consists of a shared backbone network, a feature pyramid network, and a multi-task output head containing a detection head and a segmentation branch. The backbone network is a YOLOv10 model. The segmentation branch in the multi-task output head uses the P3 features from the multi-scale features output by the feature pyramid network for mask prediction. The P3 features have a resolution of 64×64. The segmentation branch comprises a feature enhancement module, a context aggregation module, a multi-scale fusion module, a first upsampling module, a detail restoration module, a second upsampling module, and a prediction head, all connected in sequence. The feature enhancement module enhances local feature representations, the context aggregation module extracts multi-scale context information, the multi-scale fusion module reduces the number of channels and enhances local semantic features, the first upsampling module enlarges the feature map to 128×128, the detail restoration module adjusts details, the second upsampling module enlarges the feature map to 256×256, and the prediction head generates the final segmentation mask and upsamples it to a resolution of 512×512. 
S300: Input the image features output by the encoder into the decoder, and use the Long Short-Term Memory (LSTM) network unit to perform temporal modeling on the image feature sequence output by the encoder to obtain the hidden state vector; S400: The hidden state vector is used as the query vector, and interacts with the image features output by the encoder through the Template Mask Cross-Attention (TMCA) mechanism. The TMCA mechanism is based on a predefined detection report template that distinguishes between fixed and variable parts, controlling the direct copying of fixed text content and the dynamic generation of variable text content in the report. Based on the output of the TMCA mechanism, a structured detection report containing defect descriptions and segmentation mask images is generated. S400 includes: generating a binary mask matrix $M \in \{0,1\}^{L \times K}$ based on a predefined detection report template, where L represents the token length of the report template and K represents the number of image regions output by the encoder; when $M_{l,k} = 0$, the l-th token in the template is a fixed part that cannot be modified, and the model is forced to copy this content; when $M_{l,k} = 1$, the l-th token in the template belongs to the variable part, allowing the model to dynamically generate content based on the features of the k-th image region; the TMCA mechanism is implemented through the following formula: $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + \log M\right)V$, where Q is the query vector, which comes from the hidden state of the LSTM unit; K is the key vector, which comes from the image region features output by the encoder; V is the value vector, which comes from the image region features output by the encoder; and d is the dimension of the query vector Q and the key vector K. The mask matrix M is converted into a bias term of the Softmax function, which is used to control the attention weights between the fixed part of the template and the generated part. 
S500: The encoder and decoder are trained using a phased joint training strategy and a preset loss function. After the training converges, the generated structured detection report is verified based on preset performance indicators. The performance indicators include at least a first indicator for evaluating the accuracy of fixed part replication and a second indicator for evaluating the consistency of variable part defect categories.
2. The method according to claim 1, characterized in that, S100 includes: S110: Using self-acquired images, including high-resolution images with a pixel size of 5568×4872 taken by GoPro12 and images with different pixel sizes taken by other devices, a total of 2000 valid original images were selected, covering various underwater environments such as turbid water, low light, and floating interference, to build a dataset. S120: Use the labelImg tool to classify and segment each image, and label the defect categories including pillars, cracks, corrosion, exposed reinforcement and holes; S130: Convert the annotation data from JSON format to TXT format that the encoder can read; S140: The dataset was augmented to 8000 images using a general augmentation method involving rotation, translation, scaling, blurring, and noise injection. S150: Adjust the image pixel size to a uniform 512×512 pixels; S160: Divide the augmented 8000 data points into training, validation, and test sets in a 7:2:1 ratio.
3. The method according to claim 2, characterized in that S300 includes: the LSTM progressively receives the feature vectors of each frame or each defect candidate region extracted by the encoder and, through memory units and gating mechanisms, performs temporal modeling on the feature sequence so as to capture the order in which defects appear, cross-frame contextual information, and the feature change trend of the same defect; when the input is a video or a multi-frame image sequence, it fuses the features of the same defect in different frames to generate a unified defect description and avoid duplicate reporting; when the input is a single-frame image, it treats the multiple defect candidate regions in the image as a sequence and generates a description of each region in order.
4. The method according to claim 3, characterized in that, The predefined inspection report template in S400 clearly distinguishes between fixed and variable parts. The fixed part includes the report title, inspection time, column names for the inspection results description, and their fixed structural framework; the variable part includes the title content, inspection date, and defect description entries generated by the model; each defect description includes the defect type and its corresponding segmentation mask image, and is arranged in order according to the encoder's inspection results.
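5. The method according to claim 4, characterized in that the total loss function in S500 is as follows: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{det} + \lambda_2 \mathcal{L}_{seg} + \lambda_3 \mathcal{L}_{copy} + \lambda_4 \mathcal{L}_{gen}$, where $\mathcal{L}_{det}$ is the YOLO detection loss; $\mathcal{L}_{seg}$ is the segmentation loss used for structure and defect mask prediction; $\mathcal{L}_{copy}$ is the copy loss, which constrains the fixed parts of the template to be completely identical to the ground truth; $\mathcal{L}_{gen}$ is the cross-entropy loss of the generated parts, used to optimize the quality of the generated defect descriptions; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weighting coefficients.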
6. The method according to claim 5, characterized in that, The first metric in S500 is the replication accuracy CA, where CA = the number of correctly replicated fixed tokens / the total number of fixed tokens. The second metric is the Defect Consistency Rate (DCR), which is calculated as: (Number of correctly categorized defects in the generated text) / (Total number of generated defects).
7. A multimodal intelligent underwater structure detection system, used to perform the steps of the method as described in any one of claims 1 to 6, characterized in that it comprises: The image acquisition and preprocessing module is used to acquire optical images of underwater structures and preprocess the images to obtain a dataset. The encoder module is used to receive the dataset for feature extraction and defect identification, and output defect detection results, segmentation mask and image features. The defect detection results include defect category, confidence score and bounding box. The decoder module receives image features output by the encoder module. The decoder module includes: The temporal modeling unit uses a Long Short-Term Memory (LSTM) network unit to perform temporal modeling on the image feature sequence output by the encoder, thereby obtaining the hidden state vector. The Template Mask Attention Unit uses the hidden state vector as the query vector and interacts with the image features output by the encoder through the Template Mask Cross Attention (TMCA) mechanism. The TMCA mechanism is based on a predefined detection report template that distinguishes between fixed and variable parts, and controls the direct copying of fixed text content and the dynamic generation of variable text content in the report. The report generation module is used to generate a structured inspection report containing defect descriptions and segmentation mask images based on the TMCA mechanism output. The model training and validation module is used to train the encoder and decoder modules using a phased joint training strategy and a preset loss function. After the training converges, the generated structured detection report is validated based on preset performance indicators. 
The performance indicators include at least a first indicator for evaluating the accuracy of fixed part replication and a second indicator for evaluating the consistency of variable part defect categories.