Defect semantic segmentation method and device for unmanned inspection robot
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HOHAI UNIV
- Filing Date
- 2025-04-28
- Publication Date
- 2026-06-26
AI Technical Summary
Existing semantic segmentation models suffer from decreased segmentation accuracy in dam inspection scenarios where defect edge information is unclear. Traditional manual inspection suffers from problems such as low efficiency, high cost, and high safety risks. Existing datasets are insufficient to fully cover defect types, and the models have low reusability.
Weakly supervised prompts are introduced and combined with a teacher-student network framework. The student network encoder is optimized through data resampling, data augmentation, pseudo-label generation and filtering, a self-training loss, and low-rank matrix updates. The model is then made lightweight with a hierarchical segmentation strategy and deployed on unmanned inspection equipment for real-time defect detection.
The method improves defect detection accuracy in scenarios with unclear defect edges, overcomes the limitations of traditional manual inspection, enables efficient and safe unmanned inspection, has strong adaptability, and reduces computational overhead.
Figure CN120411520B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a defect semantic segmentation method and apparatus for unmanned inspection robots, specifically a defect semantic segmentation method based on weakly supervised domain-adaptive segmentation. It addresses issues such as unclear defect edge information and class imbalance, thereby improving segmentation accuracy in unmanned inspection scenarios, and belongs to the field of semantic segmentation technology.
Background Technology
[0002] As a crucial component of China's engineering system, reservoirs and dams perform multiple functions, including flood control, water supply, power generation, irrigation, and ecological regulation. They are vital infrastructure for regulating water resources, optimizing their allocation, preventing water-related disasters, and protecting the ecosystem. Due to long-term exposure to water pressure, temperature variations, weathering erosion, and geological movements, dams are prone to defects such as cracks, leakage, spalling, and steel reinforcement corrosion, which can compromise their structural safety. Timely and accurate identification of dam defects is crucial for ensuring safe dam operation. Regular inspections and real-time monitoring are indispensable components of a dam safety assurance system. However, traditional dam inspection methods, which rely primarily on manual inspection or fixed sensor monitoring, suffer from low efficiency, high cost, limited coverage, and significant risks, making them unsuitable for the demands of modern smart water management.
[0003] Semantic segmentation, a crucial technology in computer vision, classifies each pixel in an image, enabling accurate defect identification and localization, and has significant application value in industrial defect detection. However, in dam inspection applications, the diverse defect types, difficulties in data annotation, and blurred target edges limit the detection accuracy of existing semantic segmentation models. Faced with the diversity of defect types, existing datasets often fail to provide comprehensive coverage, and traditional machine vision algorithms have limitations in building comprehensive models, with low model reusability. While existing large-scale semantic segmentation models possess strong generalization capabilities, they still suffer from decreased segmentation accuracy in scenarios with unclear defect edge information, making it difficult to meet practical inspection needs.
Summary of the Invention
[0004] Objective: This invention addresses the problem of decreased segmentation accuracy in domain-adaptive semantic segmentation techniques when defect edge information is unclear. It proposes a defect semantic segmentation method and apparatus for unmanned inspection. This invention introduces weakly supervised prompts and combines them with a teacher-student network framework to improve the semantic segmentation accuracy of dam defect images, particularly demonstrating strong adaptability in defect detection with blurred edges. Simultaneously, it utilizes unmanned inspection to overcome the limitations of traditional manual defect detection tasks, such as strong subjectivity, low efficiency, numerous interference factors, high labor costs, poor accuracy, and significant safety risks.
[0005] Technical Solution: A defect semantic segmentation method for unmanned inspection robots. The method processes three-channel optical images, primarily in scenarios where defect region boundaries are unclear. The method includes the following steps:
[0006] Step 1) Data Acquisition and Preprocessing: Obtain defect images collected during dam inspection, extract different types of defect samples from the concrete surface, construct an initial defect dataset, resample the defect data based on the training consistency principle, and adjust the data distribution;
[0007] Step 2) Network Construction and Data Augmentation: Construct a ternary network framework consisting of an anchor network, a teacher network, and a student network. Perform strong data augmentation and weak data augmentation on the defective images respectively. Input the augmented images into the anchor network, teacher network, and student network respectively to extract the features of the defective images.
[0008] Step 3) Generation and filtering of pseudo-labels: The teacher network generates initial pseudo-labels under the guidance of weak supervision signals (such as bounding boxes, point labels or coarse segmentation masks), and designs a filtering mechanism (confidence threshold, consistency check, etc.) to screen pseudo-labels and retain high-quality pseudo-labels for subsequent training.
[0009] Step 4) Network Training and Parameter Update: The high-quality pseudo-labels obtained in Step 3) are used to guide the training of the student network. The self-training loss, anchor loss and contrast loss are calculated. The weights of the student network encoder are optimized by combining the low-rank matrix update strategy. The teacher network synchronizes the parameters of the student network through the exponential moving average (EMA) mechanism. The defective image semantic segmentation model is obtained through iterative training.
[0010] Step 5) Model Lightweighting: Perform full-stage distillation on the trained semantic segmentation model to reduce model complexity. Combine this with a hierarchical segmentation strategy to reduce the computational overhead of SAM in the "segment everything" mode.
[0011] Step 6) Model Deployment: Deploy the lightweight, weakly supervised adaptive semantic segmentation model onto the unmanned inspection equipment to collect images of the dam surface in real time, and input the image data into the semantic segmentation model to achieve online image segmentation and defect area identification.
[0012] The data acquisition and preprocessing in step 1) includes the following steps:
[0013] 1-1) Collect images of different types of defects on the concrete surface of the dam from data obtained from the Internet and on-site shooting, and use weakly supervised annotation (such as bounding boxes, point annotations, and coarse masks) to annotate the defect areas and generate preliminary label data;
[0014] 1-2) Defective images are fed into a training consistency-based resampling module. For each category, the online average category score (ACS) is calculated using the output probability map of the teacher model, dynamically estimating the distribution of each category. At each time step, the ACS is updated using the exponential moving average (EMA).
[0015] 1-3) Based on the obtained online average category score, minority categories are determined: categories with scores below a set value are identified as minority categories. For each category k, the sampling rate $SR_k$ is computed by a normalization operation, which ensures that the sampling rate of minority categories is significantly increased.
[0016] 1-4) For each sample in the target domain, calculate its prediction consistency across multiple iterations, i.e., the reliability of each category. According to the reliability of each category, all target domain images are sorted in descending order, and the top C samples are selected as the copy-paste candidate set to ensure high reliability of the candidate samples.
[0017] 1-5) Copy and paste the images in the candidate set to complete the resampling.
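The reliability ranking of steps 1-4) and 1-5) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: masks are represented as sets of (row, column) pixels, reliability is taken to be the mean IoU between consecutive teacher predictions for one category, and the function names `reliability` and `select_candidates` are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def reliability(mask_history):
    """Assumed reliability: mean IoU between consecutive predictions of one image."""
    pairs = list(zip(mask_history, mask_history[1:]))
    if not pairs:
        return 0.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def select_candidates(histories, top_c):
    """Sort images by reliability (descending) and keep the top C as copy-paste candidates."""
    ranked = sorted(histories, key=lambda name: reliability(histories[name]), reverse=True)
    return ranked[:top_c]


# Teacher masks of one category over two iterations (toy data).
histories = {
    "img_a": [{(0, 0), (0, 1)}, {(0, 0), (0, 1)}],  # identical predictions
    "img_b": [{(0, 0), (0, 1)}, {(0, 1), (0, 2)}],  # partial overlap
    "img_c": [{(0, 0)}, {(5, 5)}],                  # no overlap
}
candidates = select_candidates(histories, top_c=2)
```

Images whose predictions change between iterations fall to the bottom of the ranking, so only stable samples are copied and pasted during resampling.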
[0018] Step 2) involves constructing a ternary network framework comprising an anchor network, a teacher network, and a student network. Strong and weak data augmentation are then applied to the defective images, and the augmented images are input into the anchor network, teacher network, and student network, respectively. This process includes the following steps:
[0019] 2-1) Initialize the anchor network, teacher network, and student network. The three networks use a pre-trained SAM image encoder as their basic architecture. The weights of the anchor network are set to a fixed state, that is, the anchor network weights are frozen and do not change with training; the initial weights of the student network are the same as those of the anchor network, and the weights are updated with training; the initial weights of the teacher network are the same as those of the student network, and the weights are updated with those of the student network.
[0020] 2-2) Perform weak data augmentation on the original target-domain data $x_i$, and input the processed image data into the anchor network to extract features.
[0021] 2-3) Perform weak data augmentation on the data resampled in step 1), and input the processed image data into the teacher network to extract features.
[0022] 2-4) Perform strong data augmentation on the data resampled in step 1), and then input it into the student network to extract features.
[0023] Step 3) introduces weak supervision as prompts for the pseudo-labels generated by the teacher model, producing a fixed prompt set. The teacher model then generates pseudo-labels based on this prompt set, and the pseudo-labels are filtered to obtain high-quality pseudo-labels that guide student network training. This specifically includes the following steps:
[0024] 3-1) Prepare the weak supervision prompts, which include: (1) Bounding box: the minimum bounding rectangle of the target defect; (2) Point annotation: randomly sampled positive points inside the defect area and negative points outside it; (3) Coarse mask: a polygon fitted to the real mask.
[0025] 3-2) Construct a prompt encoder that converts weak supervision prompts into fixed-dimensional embedding vectors. Specifically, for bounding boxes, the coordinates of the top-left and bottom-right corners of the bounding box are mapped to position embeddings; for point labels, the coordinates of positive and negative points are encoded into position embeddings respectively; and for coarse masks, the vertex coordinate sequence of the polygon is used as the position embedding.
[0026] 3-3) Feed the weak supervision prompts into the prompt encoder to obtain embedding vectors, generating the prompt set $\{e_j\}$.
[0027] 3-4) The teacher model receives the weakly augmented image and the prompt set and generates pseudo-labels corresponding to class j.
[0028] 3-5) Calculate the IoU between each pseudo-label mask and the mask generated by the anchor network. If the IoU is greater than 0.5, then retain the pseudo-label.
[0029] 3-6) Calculate the prediction consistency (stability) of each pseudo-label mask under different weak augmentations, namely small-angle rotation, translation, and scaling. If the stability across these three forms is greater than 0.8, retain the current pseudo-label mask; otherwise, discard it.
[0030] 3-7) Sort the remaining masks by stability score, retain the highest-scoring mask, remove duplicate masks whose IoU with a retained mask is greater than 0.7, and generate the high-quality pseudo-label set.
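The three filtering stages of steps 3-5) to 3-7) can be illustrated with a short sketch. It is a sketch under assumptions rather than the patented implementation: masks are sets of (row, column) pixels, stability is taken as the mean IoU between the predictions on the weakly augmented views (already mapped back to the original image frame) and the prediction on the original image, and the dictionary keys and the name `filter_pseudo_labels` are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def filter_pseudo_labels(candidates, iou_min=0.5, stability_min=0.8, duplicate_iou=0.7):
    scored = []
    for cand in candidates:
        # Stage 1: agreement with the mask of the frozen anchor network.
        if iou(cand["mask"], cand["anchor_mask"]) <= iou_min:
            continue
        # Stage 2: stability under weak augmentations (assumed: mean IoU
        # against the prediction on the original image).
        views = cand["augmented_masks"]
        stability = sum(iou(view, cand["mask"]) for view in views) / len(views)
        if stability <= stability_min:
            continue
        scored.append((stability, cand["mask"]))
    # Stage 3: visit masks from most to least stable and drop near-duplicates.
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = []
    for stability, mask in scored:
        if all(iou(mask, other) <= duplicate_iou for _, other in kept):
            kept.append((stability, mask))
    return kept


square = {(0, 0), (0, 1), (1, 0), (1, 1)}
candidates = [
    {"mask": square, "anchor_mask": square, "augmented_masks": [square] * 3},
    {"mask": square | {(2, 2)}, "anchor_mask": square,
     "augmented_masks": [square | {(2, 2)}, square | {(2, 2)}, square]},
    {"mask": {(5, 5), (5, 6)}, "anchor_mask": {(9, 9)},
     "augmented_masks": [{(5, 5), (5, 6)}] * 3},
    {"mask": {(7, 7), (7, 8)}, "anchor_mask": {(7, 7), (7, 8)},
     "augmented_masks": [{(7, 7)}, {(7, 7), (7, 8)}, {(7, 8)}]},
]
pseudo_labels = filter_pseudo_labels(candidates)
```

In this toy input, the second candidate passes both thresholds but is removed as a duplicate of the first, the third fails the anchor IoU check, and the fourth fails the stability check.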
[0031] Step 4) Use the high-quality pseudo-labels generated in Step 3) to guide the training of the student network and update the weights of the student network and teacher network. This specifically includes the following steps:
[0032] 4-1) Use the high-quality pseudo-label set generated by the teacher network to supervise the training of the student network.
[0033] 4-2) Calculating the self-training loss consists of two parts: Focal Loss and Dice Loss. Focal Loss addresses the class imbalance problem, focusing on samples that are difficult to segment, as follows:
[0034]
[0035] where $m_s$ is the mask value predicted by the student network, $m_t$ is the mask value predicted by the teacher network, and γ is the focusing parameter (set to 2 by default). The loss is computed over all pixels in the image, and $N_p$ denotes the total number of prompts. In this invention, the base of the logarithm is always e. The purpose of Dice Loss is to optimize the intersection over union (IoU) of the mask, as follows:
[0036]
[0037] Here, ε is a small constant used to prevent division by zero (default value is 1), the sum runs over all pixels (h, w), $m_s$ is the mask value predicted by the student network, and $m_t$ is the mask value predicted by the teacher network.
[0038] The total self-training loss is the sum of Focal Loss and Dice Loss, that is:
[0039] $L_{ST} = L_{focal} + L_{dice}$
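The self-training loss of step 4-2) can be sketched in plain Python. The sketch uses the commonly used binary forms of Focal Loss and Dice Loss, which are assumptions here because the exact formulas are not reproduced in this text. Masks are flat lists of per-pixel probabilities, the teacher mask acts as the target, the result is averaged over the $N_p$ prompts, and the function names are hypothetical.

```python
import math


def focal_loss(student, teacher, gamma=2.0, eps=1e-7):
    """Binary focal loss (natural log) averaged over pixels; teacher values are the targets."""
    total = 0.0
    for s, t in zip(student, teacher):
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * (1.0 - s) ** gamma * math.log(s)
        total -= (1.0 - t) * s ** gamma * math.log(1.0 - s)
    return total / len(student)


def dice_loss(student, teacher, smooth=1.0):
    """Dice loss with a smoothing constant (default 1) to prevent division by zero."""
    intersection = sum(s * t for s, t in zip(student, teacher))
    return 1.0 - (2.0 * intersection + smooth) / (sum(student) + sum(teacher) + smooth)


def self_training_loss(student_masks, teacher_masks, gamma=2.0, smooth=1.0):
    """Focal + Dice, averaged over the N_p prompts (one mask pair per prompt)."""
    n_p = len(student_masks)
    total = 0.0
    for s, t in zip(student_masks, teacher_masks):
        total += focal_loss(s, t, gamma) + dice_loss(s, t, smooth)
    return total / n_p


teacher_masks = [[1.0, 0.0, 1.0, 0.0]]
loss_good = self_training_loss([[1.0, 0.0, 1.0, 0.0]], teacher_masks)
loss_bad = self_training_loss([[0.2, 0.8, 0.3, 0.7]], teacher_masks)
```

The focal term down-weights pixels the student already predicts confidently, while the Dice term measures region overlap, so the two terms complement each other when the defect occupies only a small part of the image.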
[0040] 4-3) Use the output of the anchor network to constrain the predictions of the student and teacher networks, preventing the models from deviating from the source domain knowledge:
[0041]
[0042] where $m_s$ and $m_t$ are the prediction masks of the student network and the teacher network, respectively, and $\lambda_{tea}$ and $\lambda_{stu}$ are weighting coefficients, both set to 0.5 by default.
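One way to picture the anchor constraint of step 4-3) is a weighted sum of two Dice terms, each measured against the mask of the frozen anchor network. This form is an assumption made for illustration, since the formula itself is not reproduced in this text; only the two weights of 0.5 come from the description, and the name `anchor_loss` is hypothetical.

```python
def dice_loss(pred, target, smooth=1.0):
    """Dice loss between two flat lists of per-pixel mask values."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


def anchor_loss(student_mask, teacher_mask, anchor_mask, lambda_stu=0.5, lambda_tea=0.5):
    """Assumed form: weighted Dice distance of teacher and student masks to the anchor mask."""
    return (lambda_tea * dice_loss(teacher_mask, anchor_mask)
            + lambda_stu * dice_loss(student_mask, anchor_mask))


anchor = [1.0, 1.0, 0.0, 0.0]
no_drift = anchor_loss(anchor, anchor, anchor)
# The student predicts the opposite region, the teacher still agrees with the anchor.
drifted = anchor_loss([0.0, 0.0, 1.0, 1.0], anchor, anchor)
```

The penalty is zero while both networks agree with the anchor and grows as either one drifts away from it.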
[0043] 4-4) Extract the features $F_a$ and $F_t$ from the image encoders of the anchor network and the teacher network, and calculate the contrast loss between them. This aligns the intermediate features of the two networks and enhances feature consistency. The feature alignment is as follows:
[0044]
[0045] where τ is the temperature coefficient (set to 0.3 by default).
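A temperature-scaled contrastive loss of the kind described in step 4-4) can be sketched as follows. The InfoNCE-style form with cosine similarity is an assumption, since the formula itself is not reproduced in this text. Each feature is a flat, non-zero vector, the anchor and teacher features at the same index form the positive pair, and the name `contrastive_loss` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor_feats, teacher_feats, tau=0.3):
    """Assumed InfoNCE form: the teacher feature at the same index is the positive."""
    n = len(anchor_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchor_feats[i], teacher_feats[j]) / tau for j in range(n)]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


anchor_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(anchor_feats, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(anchor_feats, [[0.0, 1.0], [1.0, 0.0]])
```

A smaller τ sharpens the softmax over similarities, so mismatched pairs are penalized more strongly.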
[0046] 4-5) The student network updates its weights under the guidance of the high-quality pseudo-labels using a low-rank matrix update strategy. The update to the encoder weights θ of the student network is decomposed into two low-rank matrices A and B of rank r (r defaults to 4). During the weight update, only A and B are updated, and the original weights $\theta_{original}$ stay fixed; specifically, $\theta = \theta_{original} + A \cdot B$. A and B are optimized through backpropagation, which reduces memory usage and computational overhead.
[0047] 4-6) The weights of the teacher network are updated through the EMA of the student network, as follows:
[0048] $\Theta_{tea} \leftarrow \alpha \cdot \Theta_{tea} + (1 - \alpha) \cdot \Theta_{stu}$
[0049] where α is the smoothing coefficient, set to 0.999, and $\Theta_{tea}$ and $\Theta_{stu}$ are the weights of the teacher network and the student network, respectively. The teacher network indirectly inherits the low-rank matrix update of the student network through the EMA.
[0050] 4-7) Repeat steps 3) and 4). In each iteration, generate new pseudo-labels and update the student network and teacher network. After multiple iterations, gradually improve the performance of the student network. After training, the student network is used as a defect segmentation model for UAV inspection tasks.
[0051] Step 5) Lighten the semantic segmentation model obtained in Step 4) through full-stage distillation, and apply a hierarchical segmentation strategy to reduce the computational cost of SAM in the "segment everything" mode. The specific steps are as follows:
[0052] 5-1) Perform full-stage knowledge distillation on the model: Achieve knowledge transfer by aligning the features and outputs of the student and teacher networks at multiple levels. The first step is image embedding alignment, as follows:
[0053] $L_{embed} = \| E_{tea}(I) - E_{stu}(I) \|_1$
[0054] where $E_{tea}(I)$ and $E_{stu}(I)$ are the embeddings of image I output by the teacher and student networks, respectively, and $\|\cdot\|_1$ denotes the L1 distance, used to enforce feature similarity.
[0055] 5-2) Align the output features of the bidirectional Transformer in the mask decoder:
[0056]
[0057] where the two terms are the Transformer module outputs of the teacher network and the student network, respectively.
[0058] 5-3) Align the student and teacher segmentation mask outputs using Dice Loss and Focal Loss:
[0059]
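The three alignment stages of step 5) can be combined as in the sketch below. Several choices are assumptions made for illustration: the decoder features are aligned with the same L1 distance as the image embeddings, the three terms are summed with equal weight, the Focal and Dice terms use their common binary forms, and the dictionary keys and the name `full_stage_distillation_loss` are hypothetical.

```python
import math


def l1_distance(u, v):
    """L1 distance between two flat feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * (1.0 - p) ** gamma * math.log(p)
        total -= (1.0 - t) * p ** gamma * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


def full_stage_distillation_loss(teacher_out, student_out):
    """Sum of embedding, decoder-feature, and mask alignment terms (equal weights assumed)."""
    embed = l1_distance(teacher_out["embedding"], student_out["embedding"])
    decoder = l1_distance(teacher_out["decoder_features"], student_out["decoder_features"])
    mask = (focal_loss(student_out["mask"], teacher_out["mask"])
            + dice_loss(student_out["mask"], teacher_out["mask"]))
    return embed + decoder + mask


teacher_out = {"embedding": [0.2, -0.1], "decoder_features": [1.0, 0.5], "mask": [1.0, 0.0]}
matched = full_stage_distillation_loss(teacher_out, dict(teacher_out))
student_out = dict(teacher_out, embedding=[0.0, 0.0])
mismatched = full_stage_distillation_loss(teacher_out, student_out)
```

Supervising the intermediate stages as well as the final mask gives the smaller network a training signal at every level of the pipeline.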
[0060] Step 6) Transfer the semantic segmentation model based on weakly supervised domain adaptation, which has been trained in step 5), to the unmanned inspection equipment. The unmanned inspection system acquires images of the dam concrete surface in real time to achieve online defect identification; the unmanned inspection system includes:
[0061] 6-1) Data acquisition module: a camera device capable of acquiring and storing images of the dam within the shooting range;
[0062] 6-2) Image recognition module: receives defect images captured in real time by the unmanned inspection equipment, and uses the constructed weakly supervised domain adaptive semantic segmentation model to analyze the type information of the defect images in real time;
[0063] 6-3) Data transmission module: transmits the image information and mask information processed by the image recognition module;
[0064] 6-4) The inspection system receives images captured by unmanned inspection equipment and defect mask information after image recognition. Based on the identified defect locations, it optimizes the inspection route to ensure that the inspection equipment can efficiently cover all areas that need to be inspected, generates a detailed inspection task list, and sends the tasks to the control system.
[0065] 6-5) Control system: Receives instructions from the inspection system and controls the unmanned inspection equipment to take pictures, identify, and transmit data of the areas that need to be identified.
[0066] A defect semantic segmentation device for unmanned inspection includes the following components:
[0067] A dedicated dataset module for semantic segmentation in defect detection is constructed to acquire defect images obtained during dam inspections. Different types of defect images on the concrete surface are collected, and defect images of minority classes are resampled to adjust the data distribution and solve the class imbalance problem.
[0068] The data augmentation module performs weak data augmentation on the original defect image dataset, applying slight rotation, translation, scaling, etc., and inputs the processed images into the anchor network for feature extraction. It also performs weak data augmentation on the resampled dataset, applying slight rotation, translation, scaling, etc., and inputs the processed images into the teacher network for feature extraction. Finally, it performs strong data augmentation on the resampled dataset, applying color jitter, noise, large rotation, etc., and inputs the processed images into the student network for feature extraction.
[0069] The self-training module feeds the weak supervision inputs (bounding boxes, point annotations, and coarse masks) into the prompt encoder to generate a fixed prompt set. Under the guidance of weak supervision, the teacher model generates pseudo-labels, and a pseudo-label filtering strategy is applied to obtain high-quality pseudo-labels. The self-training loss is calculated from the high-quality pseudo-labels to guide the student network update, while the anchor loss and contrast loss are calculated to constrain the student network update and align features. The parameters of the student network are updated through low-rank matrices. The weights of the teacher network are synchronized with the student network parameters through EMA, indirectly inheriting the low-rank matrix update.
[0070] The lightweight module performs full-stage knowledge distillation on the trained model, including image embedding, intermediate feature output, and alignment of the final segmentation mask output.
[0071] The application device loads a pre-trained weakly supervised adaptive semantic segmentation model, and uses the device to acquire images of defects on the concrete surface of the dam in real time. The images are then input into the weakly supervised adaptive defect image semantic segmentation model to achieve online defect image segmentation.
[0072] The application device is an unmanned inspection device. It loads a pre-trained weakly supervised adaptive semantic segmentation model and collects defect images in real time to achieve online image segmentation.
[0073] The unmanned inspection system includes:
[0074] The data acquisition module, a camera mounted on the unmanned inspection device, collects and temporarily stores images of dam defects within its field of view.
[0075] The image recognition module receives defect images captured in real time and uses the constructed weakly supervised adaptive semantic segmentation model to analyze the type information of the defect images in real time.
[0076] The data transmission module transmits the image information and mask information processed by the image recognition module.
[0077] The inspection system receives images captured by unmanned inspection equipment and defect mask information after image recognition. Based on the identified defect locations, it optimizes the inspection route to ensure that the inspection equipment can efficiently cover all areas that need to be inspected, generates a detailed inspection task list, and sends the tasks to the control system.
[0078] The control system receives instructions from the inspection system and controls the inspection equipment to take pictures, identify, and transmit data of the areas that need to be identified.
[0079] Each module of the device is implemented in the same way as the corresponding step of the method.
[0080] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the defect semantic segmentation method for unmanned inspection as described above.
[0081] A computer-readable storage medium storing a computer program that performs the defect semantic segmentation method for unmanned inspection as described above.
[0082] Beneficial Effects: Given the diversity of defect types, existing datasets often fall short of comprehensive coverage. Traditional machine vision algorithms have limitations in building comprehensive models, and their reusability is low. Large-scale semantic segmentation models, with their superior generalization ability, offer new possibilities for addressing challenges under varying working conditions. However, the training sets of these large models primarily consist of natural images with strong edge information, while industrial defect detection datasets often suffer from unclear defect edge information. This invention proposes a semantic segmentation method based on weakly supervised domain adaptation, combining weakly supervised prompts and domain adaptation strategies to solve the domain shift problem caused by unclear defect edge data, thereby improving defect detection accuracy in scenarios with unclear defect edges.
Attached Figure Description
[0083] Figure 1 is a flowchart of a method according to an embodiment of the present invention;
[0084] Figure 2 is a general framework diagram of an embodiment of the present invention based on weakly supervised domain adaptation.
Detailed Implementation
[0085] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. After reading the present invention, any modifications of the present invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.
[0086] The environment used in this embodiment of the invention is as follows: the model is trained and tested, based on the PyTorch deep learning framework, on a server running the Ubuntu operating system and equipped with one NVIDIA A100 GPU, together with one Jetson AGX development board (16 GB of video memory).
[0087] A defect semantic segmentation method for unmanned inspection robots includes the following steps:
[0088] Step 1) Obtain defect images collected during dam inspection, resample the defect data based on the training consistency principle, and adjust the data distribution;
[0089] 1-1) Collect images of different types of defects on the concrete surface of the dam from data obtained from the Internet and on-site shooting, and use weakly supervised annotation (such as bounding boxes, point annotations, and coarse masks) to annotate the defect areas and generate preliminary label data;
[0090] 1-2) The defect images are fed into the training consistency-based resampling module. For each category k, the output probability map $p_t$ of the teacher model is used to calculate the average category score (ACS) and dynamically estimate the distribution of each category:
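[0091] $ACS_k = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} p_t^{(i,j,k)}$
[0092] where h·w is the resolution of the probability map and $p_t^{(i,j,k)}$ is the predicted probability of class k at position (i, j). At each time step t, the ACS is updated using an exponential moving average (EMA):
[0093] $ACS_k^{t} = \alpha \cdot ACS_k^{t-1} + (1 - \alpha) \cdot ACS_k$
[0094] where α is the smoothing coefficient (set to 0.999 by default) to ensure the stability of the ACS.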
[0095] 1-3) Based on the online average category score obtained in step 1-2), determine the minority categories: categories with scores lower than a set value are identified as minority categories. For each category k, the sampling rate $SR_k$ is computed by normalization, which ensures a significant increase in the sampling rate of minority categories:
[0096]
[0097] where K is the total number of categories, and j iterates over all categories to calculate the total score of all categories.
[0098] 1-4) For each sample in the target domain, calculate the consistency of its predictions across multiple iterations, i.e., the reliability of each category:
[0099]
[0100] where ξ(·) is the label generation function, IoU calculates the intersection over union of two predicted masks and measures the prediction consistency for class k, and the mask term denotes the mask generated by the teacher model at time t. According to the reliability of each category, all target domain images are sorted in descending order, and the top C samples are selected as the copy-paste candidate set to ensure high reliability of the candidate samples.
[0101] 1-5) Copy and paste the images in the candidate set to complete the resampling.
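The online statistics of steps 1-2) and 1-3) can be sketched in a few lines. The per-class mean and the EMA follow the description directly. The normalization used for the sampling rate, $SR_k = (1 - ACS_k) / \sum_j (1 - ACS_j)$, is only one plausible choice and is an assumption, as are the threshold value and all function names.

```python
def average_category_scores(prob_map):
    """prob_map[i][j][k] is the teacher probability of class k at pixel (i, j)."""
    h, w = len(prob_map), len(prob_map[0])
    num_classes = len(prob_map[0][0])
    return [
        sum(prob_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(num_classes)
    ]


def ema_update(previous, current, alpha=0.999):
    """Exponential moving average of the per-class scores."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]


def minority_classes(acs, threshold):
    """Classes whose score falls below the set value."""
    return [k for k, score in enumerate(acs) if score < threshold]


def sampling_rates(acs):
    """Assumed normalization: lower-scoring classes receive higher sampling rates."""
    inverse = [1.0 - score for score in acs]
    total = sum(inverse)
    return [value / total for value in inverse]


# A 1 x 2 probability map with three classes.
prob_map = [[[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]]
acs = average_category_scores(prob_map)        # approximately [0.6, 0.3, 0.1]
running = ema_update([0.5, 0.3, 0.2], acs)     # moves slowly toward acs
minority = minority_classes(acs, threshold=0.2)
rates = sampling_rates(acs)                    # approximately [0.2, 0.35, 0.45]
```

With α = 0.999 a single batch barely changes the running score, which keeps the set of minority classes from flipping between consecutive iterations.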
[0102] Step 2) Construct a ternary network framework consisting of an anchor network, a teacher network, and a student network, and perform data augmentation processing on the image data, as follows:
[0103] 2-1) Initialize the anchor network, teacher network, and student network. All three networks use a pre-trained SAM image encoder as their basic architecture. The anchor network weights are frozen and do not change during training. In contrast, the student network uses the initial weights of the anchor network as a basis and continuously adjusts and optimizes its own weights during training. The teacher network has the same initial weights as the student network, and its weight update strategy is to adjust them synchronously according to the learning progress of the student network.
[0104] 2-2) Perform weak data augmentation (small-angle rotation, translation, and scaling) on the original target-domain data $x_i$, and input the processed image data into the anchor network to extract features. Here f(·) is the anchor network encoder and $\Theta_a$ denotes the parameters of the anchor network.
[0105] 2-3) Perform weak data augmentation (small-angle rotation, translation, and scaling) on the data resampled in step 1), and then input the processed image data into the teacher network to extract features. Here f(·) is the teacher network encoder and $\Theta_t$ denotes the parameters of the teacher network.
[0106] 2-4) Perform strong data augmentation (color jitter, added noise, and large rotation of over 90 degrees) on the data resampled in step 1), and then input it into the student network to extract features:
[0107]
[0108] where f(·) is the student network encoder and $\Theta_s$ denotes the parameters of the student network.
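The difference between the weak and strong augmentation policies of step 2) can be made concrete by sampling their parameters. The sketch below only draws parameters and does not transform pixels. All numeric ranges (for example ±10 degrees for a small-angle rotation) and all names are hypothetical, since the description specifies only the kinds of transformation and the 90-degree lower bound for large rotations.

```python
import random


def sample_weak_augmentation(rng):
    """Small-angle rotation, translation, and scaling (ranges are assumed)."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),
        "translate_frac": (rng.uniform(-0.05, 0.05), rng.uniform(-0.05, 0.05)),
        "scale": rng.uniform(0.9, 1.1),
    }


def sample_strong_augmentation(rng):
    """Color jitter, added noise, and a rotation of at least 90 degrees (ranges are assumed)."""
    return {
        "rotation_deg": rng.choice([-1.0, 1.0]) * rng.uniform(90.0, 180.0),
        "brightness": rng.uniform(0.6, 1.4),
        "contrast": rng.uniform(0.6, 1.4),
        "noise_std": rng.uniform(0.01, 0.05),
    }


def build_views(rng):
    """One parameter set per network: weak for anchor and teacher, strong for student."""
    return {
        "anchor": sample_weak_augmentation(rng),
        "teacher": sample_weak_augmentation(rng),
        "student": sample_strong_augmentation(rng),
    }


views = build_views(random.Random(0))
```

The teacher sees an easy view and therefore produces cleaner pseudo-labels, while the student has to reproduce them from a heavily perturbed view.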
[0109] Step 3) Introduce weak supervision as prompts, generate a fixed prompt set, and have the teacher model generate pseudo-labels based on the prompt set. The pseudo-labels are then filtered in three stages: first, IoU is used to filter pseudo-labels; then, stability scores are used to filter out low-quality pseudo-labels; finally, duplicate masks are removed to obtain high-quality pseudo-labels. This specifically includes the following steps:
[0110] 3-1) Prepare the weak supervision prompts, which include: bounding box: the minimum bounding rectangle of the target defect; point annotation: randomly sampled positive points inside the defect area and negative points outside it; coarse mask: a polygon fitted to the true mask.
[0111] 3-2) Construct a prompt encoder g(p; Ω) that converts weak supervision prompts into fixed-dimensional embedding vectors. Specifically, for bounding boxes, the coordinates of the top-left and bottom-right corners of the bounding box are mapped to position embeddings; for point labels, the coordinates of positive and negative points are encoded into position embeddings respectively; and for coarse masks, the vertex coordinate sequence of the polygon is used as the position embedding.
[0112] 3-3) Feed the weak supervision prompts into the prompt encoder to obtain embedding vectors, generating the prompt set $\{e_j\}$.
[0113] 3-4) The teacher model receives the weakly augmented image $x_{weak}$ and the prompt set $\{e_j\}$ and generates the pseudo-labels corresponding to j:
[0114]
[0115] where $f(x_{weak}; \Theta_{tea})$ denotes the output features of the teacher network, h(·) is the mask decoder, and Φ is the decoder parameter set. Pseudo-labels are generated by combining the image features with the prompt embeddings.
[0116] 3-5) Calculate the IoU between each pseudo-label mask and the mask generated by the anchor network. If the IoU is greater than 0.5, the pseudo-label is retained.
[0117]
[0118] 3-6) Calculate the consistency of the predictions of each pseudo-label mask under different weak augmentations. If the stability is greater than 0.8, the mask is retained.
[0119]
[0120] where K is the number of weak augmentations, and the two masks compared are the prediction mask after the k-th augmentation and the prediction mask of the original image.
[0121] 3-7) Sort by stability score, retain the highest-scoring mask, remove duplicate masks whose IoU with a retained mask is greater than 0.7, and generate the high-quality pseudo-label set.
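The box and point prompts of step 3-1) can be derived from a mask as in the following sketch. It assumes masks are sets of (row, column) pixels and returns the box as (top, left, bottom, right), i.e., its top-left and bottom-right corners. The polygon fitting for the coarse mask prompt is omitted, and the function names are hypothetical.

```python
import random


def bounding_box(mask):
    """Minimum axis-aligned bounding rectangle as (top, left, bottom, right)."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return (min(rows), min(cols), max(rows), max(cols))


def sample_point_prompts(mask, height, width, n_pos, n_neg, seed=0):
    """Random positive points inside the defect and negative points outside it."""
    rng = random.Random(seed)
    inside = sorted(mask)
    outside = [(r, c) for r in range(height) for c in range(width) if (r, c) not in mask]
    return rng.sample(inside, n_pos), rng.sample(outside, n_neg)


defect = {(2, 3), (2, 4), (3, 3)}
box = bounding_box(defect)
positive, negative = sample_point_prompts(defect, height=6, width=6, n_pos=2, n_neg=3)
```

Such prompts cost far less to annotate than pixel-accurate masks, which is the reason the method relies on weak supervision.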
[0122] Step 4) Use the high-quality pseudo-labels generated in Step 3) to guide the training of the student network, calculate the self-training loss, contrastive loss, and anchor loss, and update the weights of the student network and the teacher network, as follows:
[0123] 4-1) Use the high-quality pseudo-labels generated by the teacher network to supervise the training of the student network.
[0124] 4-2) Calculating the self-training loss consists of two parts: Focal Loss and Dice Loss. Focal Loss addresses the class imbalance problem, focusing on samples that are difficult to segment, as follows:
[0125]
[0126] where $m_s$ is the mask value predicted by the student network, $m_t$ is the mask value predicted by the teacher network, and γ is the focusing parameter (set to 2 by default). The loss is computed over all pixels in the image, and $N_p$ denotes the total number of prompts. The purpose of Dice Loss is to optimize the intersection over union (IoU) of the mask, as follows:
[0127]
[0128] Here, ε is a small constant used to prevent division by zero (default value is 1), the summation runs over all pixels (h, w), and the two mask terms are the mask value predicted by the student network and the mask value predicted by the teacher network.
[0129] The total self-training loss is the sum of Focal Loss and Dice Loss, that is:
[0130]
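A minimal sketch of the self-training loss in step 4-2), assuming the standard binary forms of Focal Loss and Dice Loss, since the patent's own formulas are not reproduced in the text. It handles a single mask stored as a flat list of probabilities and treats teacher values of 0.5 or more as foreground targets. All function names are hypothetical, and averaging over the N_p prompts is left out for brevity.

```python
import math


def focal_loss(student, teacher, gamma=2.0, tiny=1e-7):
    """Mean binary focal loss over all pixels (natural logarithm)."""
    total = 0.0
    for p, t in zip(student, teacher):
        p = min(max(p, tiny), 1.0 - tiny)      # avoid log(0)
        pt = p if t >= 0.5 else 1.0 - p        # probability of the target class
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(student)


def dice_loss(student, teacher, eps=1.0):
    """1 - Dice coefficient, with eps guarding against division by zero."""
    inter = sum(p * t for p, t in zip(student, teacher))
    denom = sum(student) + sum(teacher)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def self_training_loss(student, teacher, gamma=2.0, eps=1.0):
    """Total self-training loss: Focal Loss plus Dice Loss."""
    return focal_loss(student, teacher, gamma) + dice_loss(student, teacher, eps)
```

The defaults `gamma=2.0` and `eps=1.0` follow the values stated in the description.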
[0131] 4-3) Use the output of the anchor network to constrain the predictions of the student and teacher networks, preventing the models from deviating from the source-domain knowledge:
[0132]
[0133] Here, m_s and m_t are the predicted masks of the student network and the teacher network, respectively, and λ_tea and λ_stu are the weighting coefficients, each set to 0.5 by default.
[0134] 4-4) Extract features F_a and F_t from the image encoders of the anchor network and the teacher network, respectively. The contrastive loss between the anchor network and the teacher network is calculated, and the intermediate features of the two networks are aligned to enhance feature consistency. The feature alignment is as follows:
[0135]
[0136] Where τ is the temperature coefficient (default setting is 0.3).
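A minimal sketch of the feature alignment in step 4-4), assuming an InfoNCE-style contrastive loss with cosine similarity. This form is an assumption: the patent's formula is not reproduced in the text, and only the temperature τ = 0.3 comes from the description. Anchor and teacher features at the same index are treated as the positive pair, all other teacher features as negatives, and feature vectors are assumed to be non-zero. The names `cosine` and `contrastive_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(f_anchor, f_teacher, tau=0.3):
    """InfoNCE-style loss: row i of f_anchor should match row i of f_teacher."""
    n = len(f_anchor)
    loss = 0.0
    for i in range(n):
        logits = [cosine(f_anchor[i], f_teacher[j]) / tau for j in range(n)]
        top = max(logits)  # subtract the maximum for numerical stability
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_denom - logits[i]
    return loss / n
```

A lower value means the teacher features sit closer to their matching anchor features than to the other samples in the batch.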
[0137] 4-5) The student network updates its weights under the guidance of the high-quality pseudo-labels using a low-rank matrix update strategy. The update to the encoder weights θ of the student network is decomposed into two low-rank matrices A and B, where r is the rank of the low-rank matrices and defaults to 4. During the weight update, only A and B are updated, and the original weights θ_original are kept fixed; specifically, θ = θ_original + A·B. A and B are optimized through backpropagation, which reduces memory usage and computational overhead.
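The low-rank update θ = θ_original + A·B can be sketched as follows. This is an illustrative sketch only: matrices are nested Python lists, the helper names are hypothetical, and the shapes (A is d_out × r, B is r × d_in) are an assumption consistent with a rank-r product.

```python
def matmul(a, b):
    """Plain matrix product of nested lists: (m x k) times (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def effective_weight(theta_original, A, B):
    """theta = theta_original + A.B; theta_original itself is never modified."""
    delta = matmul(A, B)
    return [
        [w + d for w, d in zip(row_w, row_d)]
        for row_w, row_d in zip(theta_original, delta)
    ]


def trainable_params(d_out, d_in, r=4):
    """Number of values in A (d_out x r) and B (r x d_in) combined."""
    return r * (d_out + d_in)
```

For a hypothetical 768 × 768 encoder weight with r = 4, `trainable_params` gives 6,144 trainable values against 589,824 in the full matrix, which is where the memory saving comes from.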
[0138] 4-6) The weights of the teacher network are updated through the EMA of the student network, as follows:
[0139] Θ_tea ← α·Θ_tea + (1 − α)·Θ_stu
[0140] Here α is the smoothing coefficient, set to 0.999, and Θ_tea and Θ_stu are the weights of the teacher network and the student network, respectively. The teacher network indirectly inherits the low-rank matrix update of the student network through the EMA.
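A minimal sketch of the EMA update in step 4-6), assuming the usual form in which the teacher keeps a fraction α of its own weights and takes the remaining 1 − α from the student. Weights are represented as a flat list of floats, and the function name `ema_update` is hypothetical.

```python
def ema_update(theta_tea, theta_stu, alpha=0.999):
    """Return the new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(theta_tea, theta_stu)]
```

With α = 0.999 the teacher moves slowly: starting from a weight of 1.0 and a constant student weight of 0.0, roughly 5,000 updates are needed before the teacher weight falls below 0.01.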
[0141] 4-7) Repeat Steps 3) and 4). In each iteration, new pseudo-labels are generated and the student network and teacher network are updated, so that the performance of the student network improves gradually over multiple iterations. After training, the student network is used as the defect segmentation model for UAV inspection tasks.
[0142] Step 5) Make the semantic segmentation model obtained in Step 4) lightweight. The specific steps are as follows:
[0143] Knowledge transfer is achieved by aligning the features and outputs of the student and teacher networks at multiple levels. First, the image embeddings of the two networks are aligned. Second, the output features of the bidirectional Transformer in the mask decoder are aligned. Third, the segmentation mask outputs of the student and teacher are aligned using Dice Loss and Focal Loss. The total distillation loss is the sum of these three alignment losses.
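The structure of the total distillation loss can be sketched as below. This is a sketch under stated assumptions: features are flat lists of floats, the L1 distance is averaged over elements, the same L1 distance is assumed for the Transformer output alignment (the text names L1 only for the image embedding), and the mask term is passed in as an already computed Dice-plus-Focal value. The names `l1_distance` and `distillation_loss` are hypothetical.

```python
def l1_distance(u, v):
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def distillation_loss(emb_tea, emb_stu, dec_tea, dec_stu, mask_loss):
    """Sum of the three alignment terms used for full-stage distillation."""
    l_embed = l1_distance(emb_tea, emb_stu)    # image embedding alignment
    l_decoder = l1_distance(dec_tea, dec_stu)  # Transformer output alignment (assumed L1)
    return l_embed + l_decoder + mask_loss     # mask_loss: Dice + Focal on mask outputs
```

The three terms are summed without weights here because the text describes the total as a plain sum.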
[0144] Step 6) Deploy the trained weakly supervised adaptive semantic segmentation model on the unmanned inspection equipment. Using the images acquired in real time by the unmanned inspection system, defect regions are identified online, specifically:
[0145] Establish a control connection between the inspection system and the unmanned inspection equipment control system;
[0146] The inspection equipment control system receives instructions from the inspection system and triggers the corresponding function to control the movement of the inspection equipment.
[0147] During the movement of the inspection equipment, images within the shooting range are collected and transmitted to the control system.
[0148] The control system inputs the acquired images into the weakly supervised domain adaptive semantic segmentation model, outputs the defect area identification results, and transmits the results and the original images to the inspection system.
[0149] The data processing model of the inspection system receives the segmentation results, parses them, and presents them in the user interface.
[0150] A defect semantic segmentation device for unmanned inspection includes the following components:
[0151] The dataset construction module builds a dedicated semantic segmentation dataset for defect detection. It acquires defect images obtained during dam inspections, collects images of different types of defects on concrete surfaces, and resamples defect images of minority classes to adjust the data distribution and address the class imbalance problem.
[0152] The data augmentation module performs weak data augmentation on the original defect image dataset, applying slight rotation, translation, scaling, etc., and inputs the processed images into the anchor network for feature extraction. It also performs weak data augmentation on the resampled dataset, applying slight rotation, translation, scaling, etc., and inputs the processed images into the teacher network for feature extraction. Finally, it performs strong data augmentation on the resampled dataset, applying color jitter, noise, large rotation, etc., and inputs the processed images into the student network for feature extraction.
[0153] The self-training module inputs weakly supervised data (bounding boxes, point annotations, and coarse masks) into the prompt encoder to generate a fixed prompt set. The teacher model generates pseudo-labels under the guidance of this weak supervision. A pseudo-label filtering strategy is applied to obtain high-quality pseudo-labels. The self-training loss is calculated from these high-quality pseudo-labels to guide the student network update. The anchor loss and contrastive loss are calculated to constrain the student network update and to align features. The parameters of the student network are updated through low-rank matrices, and the weights of the teacher network are synchronized with the student network parameters via EMA, indirectly inheriting the low-rank matrix update.
[0154] The lightweight module performs full-stage knowledge distillation on the trained model, including image embedding, intermediate feature output, and alignment of the final segmentation mask output.
[0155] The unmanned inspection system loads a pre-trained weakly supervised adaptive semantic segmentation model, and the inspection equipment collects defect images in real time to achieve online image segmentation.
[0156] Obviously, those skilled in the art should understand that the steps of the defect semantic segmentation method for unmanned inspection or the units of the defect semantic segmentation device for unmanned inspection described in the above embodiments of the present invention can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they can be implemented using computer-executable program code, thereby storing them in a storage device for execution by the computing device. In some cases, the steps shown or described can be performed in a different order than those described herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any particular hardware and software combination.
Claims
1. A defect semantic segmentation method for unmanned inspection robots, characterized in that the method includes the following steps: Step 1) Obtain defect images collected during dam inspection, extract different types of defect samples from the concrete surface, construct an initial defect dataset, and resample the defect data based on the training consistency principle to adjust the data distribution; Step 2) Construct the anchor network, teacher network, and student network; perform strong data augmentation and weak data augmentation on the defect images, and input the augmented images into the anchor network, teacher network, and student network respectively to extract defect image features; Step 3) The teacher network generates initial pseudo-labels under the guidance of weak supervision signals, and a filtering mechanism screens the pseudo-labels to obtain high-quality pseudo-labels for subsequent training; Step 4) Use the high-quality pseudo-labels obtained in Step 3) to guide the training of the student network, calculate the self-training loss, anchor loss, and contrastive loss, optimize the weights of the student network encoder using the low-rank matrix update strategy, and synchronize the teacher network with the parameters of the student network through the exponential moving average mechanism; iterative training yields the defect image semantic segmentation model; Step 5) Make the trained defect image semantic segmentation model lightweight through full-stage distillation; Step 6) Deploy the lightweight, weakly supervised adaptive semantic segmentation model onto the unmanned inspection equipment; the inspection equipment collects images of the dam surface in real time and inputs the image data into the weakly supervised adaptive semantic segmentation model to achieve online image segmentation and defect area identification.
Step 4) uses the high-quality pseudo-labels generated in Step 3) to guide the training of the student network and update the weights of the student network and the teacher network, and specifically includes the following steps: 4-1) Supervise the training of the student network using the high-quality pseudo-label set generated by the teacher network; 4-2) Calculate the self-training loss, which consists of two parts, Focal Loss and Dice Loss; in the Focal Loss formula, the two mask terms are the mask value predicted by the student network and the mask value predicted by the teacher network, γ is the focusing parameter, the loss is calculated over all pixels in the image, N_p is the total number of prompts, and the logarithm is taken to base e; the purpose of Dice Loss is to optimize the intersection-over-union of the mask; in the Dice Loss formula, ε is a small constant that prevents division by zero, the summation runs over all pixels (h, w), and the two mask terms are the mask value predicted by the student network and the mask value predicted by the teacher network; the total self-training loss is the sum of Focal Loss and Dice Loss; 4-3) Use the output of the anchor network to constrain the predictions of the student and teacher networks, preventing the models from deviating from the source-domain knowledge, where m_s and m_t are the predicted masks of the student network and the teacher network, respectively, and λ_tea and λ_stu are the weighting coefficients; 4-4) Extract features F_a and F_t from the image encoders of the anchor network and the teacher network, respectively, calculate the contrastive loss between the anchor network and the teacher network, and align the intermediate features of the two networks, where τ is the temperature coefficient; 4-5) The student network updates its weights under the guidance of the high-quality pseudo-labels using a low-rank matrix update strategy: the update to the encoder weights θ of the student network is decomposed into two low-rank matrices A and B, where r is the rank of the low-rank matrices; during the weight update, only A and B are updated and the original weights θ_original are kept fixed, specifically θ = θ_original + A·B, with A and B optimized through backpropagation; 4-6) The weights of the teacher network are updated through the EMA of the student network, where α is the smoothing coefficient and Θ_tea and Θ_stu are the weights of the teacher network and the student network, respectively; the teacher network indirectly inherits the low-rank matrix update of the student network through the EMA; 4-7) Repeat Steps 3) and 4), generating new pseudo-labels and updating the student network and teacher network in each iteration; after multiple iterations, the performance of the student network is gradually improved; after training, the student network is used as the defect segmentation model for UAV inspection tasks.
2. The defect semantic segmentation method for unmanned inspection robots according to claim 1, characterized in that Step 1) includes the following process: 1-1) Collect images of different types of defects on the concrete surface of the dam, label the defect areas using weakly supervised annotation, and generate preliminary label data; 1-2) Feed the defect images into a training-consistency-based resampling module; for each category, calculate the online average category score (ACS) using the output probability map of the teacher model and dynamically estimate the distribution of each category; at each time step, update the ACS using the exponential moving average (EMA); 1-3) Determine the minority categories from the obtained online average category scores: categories with scores lower than the set value are identified as minority categories, and the sampling rate of each category is determined through normalization; 1-4) For each sample in the target domain, calculate the reliability of each category; sort all target-domain images in descending order of per-category reliability, and select the top-ranked samples as the copy-paste candidate set; 1-5) Copy and paste the images in the candidate set to complete the resampling.
3. The defect semantic segmentation method for unmanned inspection robots according to claim 1, characterized in that Step 2) includes the following steps: 2-1) Initialize the anchor network, teacher network, and student network; all three networks use the pre-trained SAM image encoder as their basic architecture; the weights of the anchor network are kept fixed; the initial weights of the student network are the same as those of the anchor network, and the weights of the student network are updated during training; the initial weights of the teacher network are the same as those of the student network, and the weights of the teacher network are updated along with those of the student network; 2-2) Perform weak data augmentation on the original data of the target domain, and input the processed image data into the anchor network to extract features; 2-3) Perform weak data augmentation on the resampled data, and input the processed image data into the teacher network to extract features; 2-4) Perform strong data augmentation on the resampled data, and input the processed image data into the student network to extract features.
4. The defect semantic segmentation method for unmanned inspection robots according to claim 1, characterized in that Step 3) includes the following steps: 3-1) Prepare the weakly supervised prompts, which include: (1) Bounding box: the minimum bounding rectangle of the target defect; (2) Point annotation: positive points randomly sampled inside the defect area and negative points sampled outside it; (3) Coarse mask: a polygon fitted to the real mask; 3-2) Construct a prompt encoder that converts the weakly supervised prompts into fixed-dimensional embedding vectors; for bounding boxes, map the coordinates of the top-left and bottom-right corners into position embeddings; for point annotations, encode the coordinates of the positive and negative points into position embeddings respectively; for coarse masks, use the vertex coordinate sequence of the polygon as the position embedding; 3-3) Feed the weakly supervised prompts into the prompt encoder and convert them into embedding vectors to generate a prompt set; 3-4) The teacher model receives the weakly augmented images and the prompt set, and generates the pseudo-label corresponding to each prompt; 3-5) Calculate the IoU between each pseudo-label mask and the mask generated by the anchor network; if the IoU is greater than the first set threshold, retain the pseudo-label; 3-6) Calculate the prediction consistency of each pseudo-label mask under different weak augmentations; if the stability is greater than the second set threshold, retain the current pseudo-label mask; 3-7) Sort by stability score, retain the highest-scoring mask, remove masks whose IoU exceeds the third set threshold, and generate the high-quality pseudo-label set.
5. The defect semantic segmentation method for unmanned inspection robots according to claim 1, characterized in that Step 5) makes the semantic segmentation model obtained in Step 4) lightweight through full-stage distillation and applies a hierarchical segmentation strategy to reduce the computational load of SAM in the "segment everything" mode; the specific steps are as follows: 5-1) Perform full-stage knowledge distillation on the model, achieving knowledge transfer by aligning the features and outputs of the student network and the teacher network at multiple levels; the first step is image embedding alignment, in which the two aligned terms are the image embedding outputs of the teacher network and the student network, and the L1 distance is used to enforce feature similarity; 5-2) Align the output features of the bidirectional Transformer in the mask decoder, in which the two aligned terms are the Transformer module outputs of the teacher network and the student network; 5-3) Align the segmentation mask outputs of the student and teacher using Dice Loss and Focal Loss, in which the two loss terms are the Focal Loss and the Dice Loss.
6. The defect semantic segmentation method for unmanned inspection robots according to claim 1, characterized in that, in Step 6), the weakly supervised domain-adaptive semantic segmentation model made lightweight in Step 5) is deployed on the unmanned inspection equipment; the unmanned inspection system collects images of the dam concrete surface in real time to achieve online defect identification; the unmanned inspection system includes: 6-1) Data acquisition module: a camera device that acquires and stores images of the dam within its shooting range; 6-2) Image recognition module: receives defect images captured in real time by the unmanned inspection equipment, and uses the constructed weakly supervised domain-adaptive semantic segmentation model to analyze the type information of the defect images in real time; 6-3) Data transmission module: transmits the image information and mask information processed by the image segmentation module; 6-4) Inspection system: receives the images captured by the unmanned inspection equipment and the defect mask information produced by image recognition; based on the identified defect locations, it optimizes the inspection route so that the inspection equipment can efficiently cover all areas that need to be inspected, generates a detailed inspection task list, and sends the tasks to the control system; 6-5) Control system: receives instructions from the inspection system and controls the unmanned inspection equipment to photograph the areas that need to be identified, perform identification, and transmit the data.
7. A defect semantic segmentation device for implementing the method of claim 1, characterized in that it includes the following modules: The dataset construction module builds a dedicated semantic segmentation dataset for defect detection; it acquires defect images obtained during dam inspections, collects images of different types of defects on the concrete surface, resamples images of minority classes of defects, and adjusts the data distribution. The data augmentation module performs weak data augmentation on the collected original defect image dataset, applying slight transformations, and inputs the processed images into the anchor network for feature extraction; it also performs weak data augmentation on the resampled dataset, applying slight transformations, and inputs the processed images into the teacher network for feature extraction; and it performs strong data augmentation on the resampled dataset and inputs the processed images into the student network for feature extraction. The self-training module feeds the weakly supervised input into the prompt encoder to generate a fixed set of prompts. The teacher model generates pseudo-labels under the guidance of weak supervision. A pseudo-label filtering strategy is applied to obtain high-quality pseudo-labels. The self-training loss is calculated from the high-quality pseudo-labels to guide the student network update. The anchor loss and contrastive loss are calculated to constrain the update of the student network and to align features. The parameters of the student network are updated through low-rank matrices, and the weights of the teacher network are synchronized with the student network parameters through EMA, indirectly inheriting the low-rank matrix update. The lightweight module performs full-stage knowledge distillation on the trained model, including alignment of the image embedding, the intermediate feature output, and the final segmentation mask output.
The application device loads the pre-trained weakly supervised adaptive semantic segmentation model, collects images of defects on the concrete surface of the dam in real time, and inputs them into the model to achieve online defect image segmentation. The application device is an unmanned inspection device, which includes: The data acquisition module, a camera mounted on the unmanned inspection device, collects and temporarily stores images of dam defects within its field of view. The image recognition module receives defect images captured in real time and uses the constructed weakly supervised adaptive semantic segmentation model to analyze the type information of the defect images in real time. The data transmission module transmits the image information and mask information processed by the image segmentation module. The inspection system receives the images captured by the unmanned inspection equipment and the defect mask information produced by image recognition; based on the identified defect locations, it optimizes the inspection route so that the inspection equipment can efficiently cover all areas that need to be inspected, generates a detailed inspection task list, and sends the tasks to the control system. The control system receives instructions from the inspection system and controls the inspection equipment to photograph the areas that need to be identified, perform identification, and transmit the data.
8. A computer device, characterized in that: The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the defect semantic segmentation method for unmanned inspection as described in any one of claims 1-6.
9. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores a computer program that performs the defect semantic segmentation method for unmanned inspection as described in any one of claims 1-6.