Skin lesion image segmentation method based on boundary dynamic adaptive attention
By introducing a dynamic adaptive attention mechanism and a lightweight design for skin lesion image segmentation, the problems of insufficient feature fusion and high computational complexity in existing technologies are solved, achieving efficient and accurate skin lesion segmentation and boundary restoration.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-26
AI Technical Summary
Existing deep learning methods for skin lesion image segmentation suffer from problems such as insufficient feature fusion mechanisms, high computational complexity, difficulty in real-time deployment on clinical devices, and insufficient boundary recovery capabilities.
A skin lesion image segmentation method based on boundary dynamic adaptive attention is adopted. Through a dynamic adaptive convolutional attention module, an adaptive reweighted spatial attention gate module, and an efficient dynamic sampling module, multi-scale feature fusion and lightweight model design are achieved, thereby improving the ability to depict boundary details and computational efficiency.
It improves the accuracy of skin lesion segmentation and the ability to depict boundary details, while reducing computational resource consumption and enabling real-time and efficient deployment on clinical devices.
Figure CN121811053B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of medical image processing technology, and in particular to a method for skin lesion image segmentation based on boundary dynamic adaptive attention.
Background Technology
[0002] Skin cancer is one of the most common malignant tumors worldwide. Among them, malignant melanoma poses a serious threat to human health due to its highly aggressive and rapidly growing characteristics. Clinical medical research has confirmed that if melanoma can be accurately detected and diagnosed in its early stages, its cure rate can be as high as 95%. Therefore, accurate and reliable early screening of skin lesions is key to improving patient survival rates. In computer-aided diagnostic (CAD) processes, accurately segmenting the lesion area from the complex skin background is the first and crucial step in achieving automated analysis, quantitative assessment, and subsequent diagnosis.
[0003] However, the task of automatic segmentation of skin lesions faces numerous technical challenges stemming from the images themselves. Skin lesions exhibit significant inter-class and intra-class variations in morphology, size, color, and internal texture, lacking a fixed pattern. Furthermore, many lesion areas have blurred boundaries and low contrast with the surrounding normal skin tissue, making them difficult to distinguish clearly. Moreover, the quality of dermoscopy images is often affected by various factors, such as hair occlusion, air bubbles, uneven lighting, and natural skin texture; these noises further increase the difficulty of accurate segmentation.
[0004] To address these challenges, early methods primarily relied on traditional machine learning techniques such as Support Vector Machines (SVM) and Random Forests (RF). These methods require manual design and extraction of complex features, which is tedious; because the selected features directly determine segmentation performance, generalization in complex and varied scenarios is limited. With the development of deep learning technology, methods represented by Fully Convolutional Networks (FCN) and U-Net have become mainstream techniques in skin lesion segmentation due to their end-to-end pixel-level segmentation capabilities. U-Net, with its classic encoder-decoder structure and skip connection design, can effectively fuse multi-scale feature information. Building upon this, a series of improved models have been proposed, such as introducing Atrous Spatial Pyramid Pooling (ASPP) to capture richer contextual information or combining U-Net with Vision Transformers (ViT) to establish global long-range dependencies.
[0005] Despite the success of existing deep learning methods, their inherent limitations remain prominent. A core issue lies in the inadequacy of feature fusion mechanisms. Traditional U-Net skip connections fuse deep and shallow features through simple concatenation or addition, failing to effectively bridge the semantic gap between them. This results in low efficiency in fusing local spatial details (such as boundary information) with global semantic information (such as the lesion itself), ultimately leading to blurred and discontinuous lesion boundaries in the segmentation results. Furthermore, modules such as Transformers, introduced to enhance global modeling capabilities, are effective but bring a very large number of parameters and high computational complexity. This makes the models demanding on hardware resources and hinders real-time, efficient deployment on computationally limited clinical equipment, creating a trade-off between performance and efficiency. Moreover, bilinear interpolation and other upsampling methods commonly used in the decoder path follow fixed sampling strategies and cannot adapt to the feature content. This makes it difficult to recover the fine structure and sharp edges of irregular lesions during reconstruction, resulting in significant information loss.
Summary of the Invention
[0006] To address the aforementioned technical shortcomings, this invention provides a skin lesion image segmentation method based on boundary dynamic adaptive attention. This method aims to improve the accuracy of skin lesion segmentation, the ability to depict boundary details, and the computational efficiency of the model, effectively overcoming the deficiencies of existing methods in multi-scale feature fusion and complex boundary processing. This invention can effectively capture complex lesion boundaries and efficiently fuse multi-scale features while maintaining computational efficiency and lightweight design, thus meeting the practical needs of clinical applications.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0008] A skin lesion image segmentation method based on boundary dynamic adaptive attention includes the following steps:
[0009] S1. Acquire and preprocess RGB images of skin lesions and their expert-annotated segmentation masks to obtain a dataset of skin lesion images;
[0010] S2. Construct an EDAN segmentation network based on the U-Net framework, including a dynamic adaptive convolutional attention module, an efficient dynamic sampling module, and an adaptive reweighted spatial attention gate module; input the RGB image of the skin lesion into the EDAN segmentation network to generate a probability segmentation map; in each decoding layer, the dynamic adaptive convolutional attention module enhances the features to be strengthened at multiple scales through a multi-branch convolutional structure to generate refined features; the efficient dynamic sampling module automatically adjusts the sampling position and fusion strategy by combining dynamic offset sampling and depthwise separable convolution to restore the spatial resolution of the refined features; the output of the efficient dynamic sampling module is connected to the skip features of the corresponding encoder, and the multi-scale feature dynamic fusion is achieved through the adaptive reweighted spatial attention gate module, and the fusion result is used as the input of the next layer of the dynamic adaptive convolutional attention module;
[0011] S3. Verify and optimize the EDAN segmentation network;
[0012] S4. Use the optimized EDAN segmentation network to perform inference segmentation on the test set of the skin lesion image dataset.
[0013] Preferably, the multi-branch convolutional structure includes depthwise convolution, deformable convolution, and global convolution branches with different directions and scales; the deformable convolution adaptively adjusts the sampling position according to the morphology of the lesion region.
[0014] Preferably, S2 includes:
[0015] The features to be enhanced are input into the dynamic adaptive convolutional attention module;
[0016] In the dynamic adaptive convolutional attention module, channel attention weights of the feature to be enhanced are generated through the parallel results of global average pooling and max pooling; multi-scale spatial aggregation is applied to the feature to be enhanced and spatial attention fusion weights are calculated; the spatial attention fusion weights and the channel attention weights are coupled element-wise to generate refined features.
[0017] Preferably, the efficient dynamic sampling module is implemented as follows:
[0018] For the refined features output by the dynamic adaptive convolutional attention module in the current decoding stage, the coordinate offsets output by the offset prediction branch are superimposed on the base grid to generate dynamic sampling coordinates; adaptive reconstruction is performed through offset-driven grid sampling, and the upsampling result is output; the upsampling result is then subjected to depthwise separable convolution, normalization, activation, and channel compression to generate the decoding features.
[0019] Preferably, the adaptive reweighted spatial attention gate module is implemented as follows:
[0020] Channel alignment is performed on the skip features output by the encoder and the decoded features at the same resolution;
[0021] Local and global features are generated based on the skip features and the decoding features, and the two are fused to generate fused features. The fused features are then subjected to 1×1 convolution and Softmax mapping to generate the weights assigned to the two branches, namely the skip features and the decoding features, and finally the reweighted fused features are output.
[0022] Preferably, generating the fusion feature includes:
[0023] The global features are generated by averaging and max pooling the decoded features;
[0024] The local features are generated by performing depthwise convolution or pointwise convolution on the skip features.
[0025] The global features and the local features are added together and then the residual weights are generated by sigmoid activation.
[0026] The global features and the local features are fused using the residual weights to generate the fused features.
[0027] As a preferred embodiment, S2 also includes:
[0028] A set of auxiliary predictions is generated for the decoded features at different scales to apply hierarchical constraints to each resolution layer during training; the final high-resolution fused feature among the reweighted fused features is passed through a single-channel convolution mapping followed by the Sigmoid function to obtain the probability segmentation map,
[0029] where the value at each pixel represents the predicted probability that the pixel belongs to the lesion region.
[0030] Preferably, S3 includes: generating a segmentation prediction mask based on the probability segmentation map; constructing a total loss including binary cross-entropy loss and Dice loss based on the segmentation prediction mask, and simultaneously applying supervision to the main segmentation output and the auxiliary level output.
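The total loss described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source states only that binary cross-entropy and Dice terms are combined and applied to the main and auxiliary outputs. The equal weighting of the two terms, the smoothing constant, and the `aux_weight` parameter are assumptions introduced here.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(P) + sum(T) + smooth)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def total_loss(main, auxiliaries, aux_weight=0.4):
    """main and each auxiliary are (probs, targets) pairs at their own resolution.

    aux_weight is a hypothetical weighting, not a value from the source.
    """
    def one(pair):
        probs, targets = pair
        return bce_loss(probs, targets) + dice_loss(probs, targets)
    return one(main) + aux_weight * sum(one(a) for a in auxiliaries)
```

Each auxiliary output would be compared against a mask resized to its own resolution before being passed in.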
[0031] In a second aspect, a skin lesion image segmentation system based on boundary dynamic adaptive attention includes:
[0032] The acquisition and preprocessing module is used to acquire and preprocess RGB images of skin lesions and their expert-annotated segmentation masks to obtain a dataset of skin lesion images.
[0033] An EDAN segmentation network construction module is used to build an EDAN segmentation network based on the U-Net framework, including a dynamic adaptive convolutional attention module, an adaptive reweighted spatial attention gate module, and an efficient dynamic sampling block. The RGB image of the skin lesion is input into the EDAN segmentation network to generate a probability segmentation map. In each decoding layer, the dynamic adaptive convolutional attention module enhances the features to be strengthened at multiple scales through a multi-branch convolutional structure, generating refined features. The efficient dynamic sampling module automatically adjusts the sampling position and fusion strategy by combining dynamic offset sampling and depthwise separable convolution to restore the spatial resolution of the refined features. The output of the efficient dynamic sampling module is connected to the skip features of the corresponding encoder, and the multi-scale feature dynamic fusion is achieved through the adaptive reweighted spatial attention gate module. The fusion result is used as the input to the next layer of the dynamic adaptive convolutional attention module.
[0034] The verification and optimization module is used to verify and optimize the EDAN segmentation network.
[0035] The testing module is used to perform inference segmentation on the test set of the skin lesion image dataset using the optimized EDAN segmentation network.
[0036] The skin lesion image segmentation system based on boundary dynamic adaptive attention is used to implement the skin lesion image segmentation method based on boundary dynamic adaptive attention as described in the first aspect.
[0037] Compared with the prior art, the beneficial effects of the present invention are reflected in:
[0038] This invention addresses the challenges of blurred boundaries and diverse lesion morphologies that often arise during skin lesion segmentation by systematically designing a segmentation network structure based on a joint optimization of attention mechanisms and dynamic sampling. Its core innovation lies in the introduction of dynamic feature acquisition, fine-grained spatial weight allocation, and adaptive reconstruction during the decoding stage. This allows the model to autonomously identify and enhance key boundary information within skin lesion regions of varying scales and complexities, effectively improving the accuracy and continuity of segmentation.
[0039] This method dynamically adjusts the perceptual range of the feature space during the overall feature extraction and fusion process, enabling the network to have stronger representation and recovery capabilities at detailed locations such as lesion contours and boundary transitions. Simultaneously, it utilizes an interactive feature fusion mechanism to integrate global semantics and local details, allowing the model to fully cope with variations in lesion morphology, size, and texture, thus improving segmentation performance for various types of skin lesions. These series of technical optimizations fully balance lightweight model structure and computational efficiency, reducing computational resource consumption and achieving simultaneous improvement in segmentation accuracy and efficiency in practical applications.
[0040] Experimental results verify that the proposed technical system significantly outperforms traditional segmentation methods on publicly available skin lesion datasets, including ISIC2018, in terms of lesion recognition accuracy, boundary recovery ability, and model generalization performance. The comprehensive evaluation indicators and segmentation performance on complex lesion scenarios fully demonstrate the synergistic effect of the technical methods in this invention, providing a more efficient and reliable automated technical solution for clinical skin lesion analysis.
Description of the Attached Figures
[0041] Figure 1 is a flowchart of the method in Embodiment 1 of the present invention;
[0042] Figure 2 is a diagram of the EDAN segmentation network structure in Embodiment 1 of the present invention;
[0043] Figure 3 is a structural diagram of the DACA module in Embodiment 1 of the present invention;
[0044] Figure 4 is a structural diagram of the EDSB module in Embodiment 1 of the present invention;
[0045] Figure 5 is a structural diagram of the AR-SAG module in Embodiment 1 of the present invention;
[0046] Figure 6 is a comparison diagram of the segmentation results of this invention and the ground-truth labels.
Detailed Implementation
[0047] To make the technical means, inventive features, objectives, and effects of the invention readily understandable, the invention is further described below with reference to specific illustrations. However, the invention is not limited to the embodiments described below.
[0048] It should be noted that the structures, proportions, sizes, etc., illustrated in the accompanying drawings of this specification are only used to complement the content disclosed in the specification for those skilled in the art to understand and read, and are not intended to limit the conditions under which the present invention can be implemented. Therefore, they have no substantial technical significance. Any modifications to the structure, changes in the proportions, or adjustments to the size, without affecting the effects and objectives that the present invention can produce, should still fall within the scope of the technical content disclosed in the present invention.
[0049] Embodiment 1:
[0050] As shown in Figures 1 and 2, the skin lesion image segmentation method based on boundary dynamic adaptive attention includes the following steps:
[0051] S1. Acquire and preprocess RGB images of skin lesions and their expert-annotated segmentation masks to obtain a dataset of skin lesion images;
[0052] The publicly available skin lesion image dataset ISIC2018 was selected as the base data source for both training and validation. This dataset contains RGB lesion images acquired via dermoscopy and their corresponding expert-annotated segmentation masks. Preprocessing included:
[0053] First, the original images and corresponding labels are standardized to a uniform size. The color skin lesion images are scaled to 3×256×256, and the segmentation labels (masks) are scaled to 1×256×256. Nearest neighbor interpolation is used for label scaling to avoid smoothing boundary information. To enhance the model's generalization ability, online data augmentation is performed on the images and labels used during the training phase. This includes: random horizontal and vertical flipping (the trigger probability of each operation can be set to no higher than 0.5); random rotation (in one implementation, the angle range is set to −15° to +15°); random brightness and contrast perturbation (in one implementation, the coefficient relative to the original image range is set to 0.8–1.2) to simulate different lighting and acquisition conditions; and selectively using mild random cropping and padding to alleviate scale differences. The augmented images maintain the same standardized size as the model input. After the above processing, a set of image and label samples with uniform structure and resolution, which can be directly used for training the segmentation network, is obtained.
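Two of the preprocessing steps above, nearest-neighbor label scaling and paired random flipping, can be sketched as follows. This is a simplified illustration on nested lists rather than real image tensors. The function names and the single-channel image layout are assumptions made here for brevity.

```python
import random

def resize_nearest(mask, out_h, out_w):
    """Resize a 2D list with nearest-neighbor sampling so labels stay binary."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def random_flip_pair(image, mask, rng, p=0.5):
    """Apply the same random flips to a single-channel image and its mask."""
    if rng.random() < p:  # horizontal flip
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    if rng.random() < p:  # vertical flip
        image = image[::-1]
        mask = mask[::-1]
    return image, mask
```

Because nearest-neighbor scaling copies existing label values instead of averaging them, the resized mask contains only the original classes, which is the reason the text gives for avoiding smoothing interpolation on labels. Applying identical flips to image and mask keeps the annotation aligned with the pixels.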
[0054] Dataset partitioning:
[0055] After data preprocessing, all samples in the normalized ISIC2018 dataset are randomly partitioned into three subsets: approximately 70% for training, 10% for validation, and 20% for testing. A fixed random seed is used during partitioning to ensure reproducibility. In one implementation, to improve the robustness and fairness of performance evaluation, a five-fold cross-validation strategy is further adopted: the data is partitioned by folds, and a combination of four folds as the training set and one fold as the validation set is executed cyclically. The validation results of each fold are recorded for model selection and generalization ability assessment. Finally, the test set remains independent and does not participate in parameter updates; it is only used for objective performance evaluation of the best model.
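The partitioning scheme above can be sketched with the standard library alone. The seed value, the function names, and the interleaved way of forming folds are illustrative assumptions; the source specifies only the approximate 70/10/20 proportions, a fixed random seed, and five folds.

```python
import random

def split_dataset(n_samples, seed=42, train_frac=0.7, val_frac=0.1):
    """Shuffle indices with a fixed seed and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

def k_fold(indices, k=5):
    """Yield (train, val) index lists; each fold serves as validation once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```

One way to combine the two, consistent with the test set staying independent, is to call `k_fold` on the training and validation indices only and never pass the test indices to it.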
[0056] S2. Based on the U-Net framework, an EDAN segmentation network is constructed, including a dynamic adaptive convolutional attention module, an adaptive reweighted spatial attention gate module, and an efficient dynamic sampling block. RGB images of skin lesions are input into the EDAN segmentation network to generate probability segmentation maps. In each decoding layer, the dynamic adaptive convolutional attention module enhances the features to be strengthened at multiple scales through a multi-branch convolutional structure, generating refined features. The efficient dynamic sampling module automatically adjusts the sampling position and fusion strategy by combining dynamic offset sampling and depthwise separable convolution, restoring the spatial resolution of the refined features. The output of the efficient dynamic sampling module is connected to the skip features of the corresponding encoder, and the multi-scale feature dynamic fusion is achieved through the adaptive reweighted spatial attention gate module. The fusion result is used as the input to the next layer's dynamic adaptive convolutional attention module.
[0057] This implementation improves upon the U-Net framework by jointly enhancing the multi-scale semantics and fine-grained boundary information of the lesion region. The input is the color image of the skin lesion, and the output is the probability segmentation map. The improved segmentation network, called EDAN, adopts an encoder-decoder structure. The encoding stage outputs features layer by layer, each layer having its own number of channels and a corresponding skip connection feature. The decoding stage restores resolution step by step, generating decoding features. After attention and dynamic sampling processing, a hierarchical prediction is output at each decoding layer, and the final probability segmentation map is obtained after passing through the Sigmoid function. The core of the network consists of three functional modules: Dynamic Adaptive Convolutional Attention (DACA), Adaptive Reweighted Spatial Attention Gating (AR-SAG), and Efficient Dynamic Sampling Block (EDSB). In this invention, the three modules work collaboratively in a hierarchical order during the decoding stage of the segmentation network: the bottom layer first uses the DACA module to enhance deep features at multiple scales, improving feature representation capabilities; in subsequent decoding layers, the features of the previous layer are enhanced by the DACA module, and their spatial resolution is then restored by the EDSB module. The upsampled features are combined with the skip connection features of the corresponding encoder, and the multi-scale features are dynamically fused by the AR-SAG module. The fusion result serves as the input to the next layer's DACA module. The progressive embedding and layer-by-layer collaboration of the modules effectively improves the segmentation network's ability to represent complex lesion regions and restore boundary details. The overall model architecture is shown in Figure 1. In one implementation, the number of training rounds is set to 100. The detailed module construction strategy is as follows.
[0058] (1) Dynamic Adaptive Convolutional Attention Module (DACA), whose input is the feature to be enhanced and whose output is the refined feature.
[0059] This module builds on the traditional Convolutional Block Attention Module (CBAM) but innovates upon it. Unlike existing technologies, this invention designs a multi-branch convolutional structure in the spatial attention part, including depthwise convolution, deformable convolution, and global convolution branches with different directions and scales, which collaboratively extract diverse spatial features. In particular, the deformable convolution adaptively adjusts the sampling position according to the morphology of the lesion region and effectively locates irregular boundaries, a capability not found in existing attention modules such as CBAM and Squeeze-and-Excitation (SE) blocks. This design significantly improves the network's ability to perceive the boundaries and details of complex skin lesions, enhancing the contour integrity and boundary fineness of the segmentation results. Its module architecture is shown in Figure 3, and the specific construction strategy is as follows:
[0060] For any feature to be enhanced, channel recalibration is applied first to improve discrimination. The channel attention weights are generated from the parallel results of global average pooling and max pooling via a shared perceptual mapping:
[0061]
[0062] Here, the Sigmoid mapping compresses the channel responses to the interval between 0 and 1. Multi-scale spatial aggregation is then applied to the same feature.
[0063] Spatial attention uses a five-branch convolutional structure and introduces deformable and dilated large-kernel semantics. The following formula gives the spatial attention fusion weights:
[0064]
[0065] Here, the first terms are the outputs of depthwise convolutions at different scales, the deformable convolution output is used to adaptively adjust the sampling position, and the global convolution output is used to expand the receptive field. Each local and dilated branch is computed separately:
[0066]
[0067]
[0068] Here, the depthwise convolution branches capture directional and mesoscale textures, the deformable convolution branch performs adaptive offset sampling for irregular boundaries, and the global convolution branch expands the receptive field to supplement long-range context. The spatial attention weights are obtained after fusing the branches, and the enhanced output (the refined features) is obtained by coupling them element-wise with the channel weights:
[0069]
[0070] It preserves significant lesion areas while suppressing noise and low-correlation background textures.
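The channel recalibration and the element-wise coupling described for DACA can be sketched as below. Since the formulas are not reproduced in this text, the sketch assumes the CBAM-style form, Sigmoid of the sum of a shared mapping applied to both pooled vectors. The `shared_map` callable is a hypothetical stand-in for the learned shared mapping, and the spatial weights are taken as a given input instead of being computed by the five-branch convolutional structure.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(x, shared_map):
    """x: C x H x W nested lists. Returns one weight in (0, 1) per channel."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]  # global average pooling
    mx = [max(map(max, ch)) for ch in x]                            # global max pooling
    a, m = shared_map(avg), shared_map(mx)  # the same mapping is applied to both
    return [sigmoid(ai + mi) for ai, mi in zip(a, m)]

def couple(x, channel_w, spatial_w):
    """Refined feature: x scaled element-wise by channel and spatial weights."""
    return [[[v * cw * spatial_w[r][c] for c, v in enumerate(row)]
             for r, row in enumerate(ch)]
            for ch, cw in zip(x, channel_w)]
```

For example, with `shared_map = lambda v: v` an all-zero channel receives weight 0.5, and any location whose spatial weight is 0 is suppressed in every channel.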
[0071] (2) The efficient dynamic sampling module (EDSB) takes as input the low-resolution feature map output by DACA in the current decoding stage, i.e. the refined features, and outputs the high-resolution feature map obtained after dynamic offset upsampling and depthwise separable convolution, i.e. the decoded features.
[0072] Existing segmentation networks commonly use fixed sampling methods such as bilinear interpolation, which cannot adaptively handle the complex structures and irregular boundaries of lesion regions during the upsampling stage. This can easily lead to overly smoothed contours and loss of local boundary information in the segmentation results. To address this, this invention proposes combining the existing DySample (dynamic offset sampling) method with depthwise separable convolution. This automatically adjusts the sampling position and fusion strategy based on the input content, thereby flexibly restoring the true boundary morphology of various lesion regions. Its module architecture is shown in Figure 4, and the specific construction steps are as follows:
[0073] First, the current decoding features are defined. The offset prediction branch outputs coordinate offsets, which are superimposed on the base grid coordinates to form the dynamic sampling coordinates. The base grid is the standard upsampling grid and represents a fixed sampling coordinate system for each pixel. It is defined as the center grid points in two-dimensional space with a step size equal to the target scale, and is calculated as follows:
[0074]
[0075] Here, the height and width are those of the input feature map.
[0076] The offset prediction is obtained from the input features through a convolutional mapping and represents the adaptive spatial offset of each grid point. The mapping is learnable, and its formula is:
[0077]
[0078] Here, the offsets represent the sampling offsets in the horizontal and vertical directions, and N is the number of sampling points at each location.
[0079] The final dynamic sampling coordinates are the base grid coordinates plus the predicted offsets. The normalized coordinates are used as input parameters for the subsequent grid sampling operator, enabling adaptive reconstruction at scale 2 through offset-driven grid sampling:
[0080]
[0081] Here, the sampling operator is standard bilinear interpolation, and the two size terms represent the height and width of the image, respectively. This process adjusts the actual interpolation position based on the local structure of the lesion edge, reducing the excessive smoothing of sharp contours caused by standard bilinear interpolation. The upsampling result is then subjected to depthwise separable convolution, normalization, activation, and channel compression, yielding:
[0082]
[0083] The output, whose channel count is the number of channels after compression, proceeds to the subsequent fusion and prediction. This dynamic offset mechanism maintains the geometric integrity of the contour under conditions of elongated shapes and jagged boundaries.
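The offset-driven grid sampling at scale 2 can be sketched as follows for a single-channel feature map. This is a simplified illustration, not the EDSB implementation. The offsets are passed in directly, whereas the module predicts them with a convolution. The pixel-center convention for the base grid and the border clamping are assumptions made here, and the depthwise separable convolution, normalization, activation, and channel compression stages are omitted.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly sample a 2D list at fractional (y, x), clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def dynamic_upsample(feat, offsets, scale=2):
    """Upsample by `scale`; offsets[i][j] = (dy, dx) shifts each base-grid point."""
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            base_y = (i + 0.5) / scale - 0.5  # pixel-center base grid (assumed)
            base_x = (j + 0.5) / scale - 0.5
            dy, dx = offsets[i][j]
            row.append(bilinear(feat, base_y + dy, base_x + dx))
        out.append(row)
    return out
```

With all offsets set to zero this reduces to ordinary bilinear upsampling, which makes the role of the predicted offsets visible: they move individual sampling points toward structure that a fixed grid would blur.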
[0084] (3) Adaptive reweighted spatial attention gate module (AR-SAG), whose input is the skip features and the decoded features output by EDSB, and whose output is the reweighted fused features;
[0085] Unlike traditional skip connections and simple feature concatenation, this invention employs dynamic weight allocation and cross-scale fusion mechanisms to adaptively balance global semantics and local details, effectively mitigating the semantic gap between deep and shallow features during fusion. This significantly enhances the model's ability to restore blurred and irregular boundaries, improving the coherence and robustness of the segmentation results. Its module architecture is shown in Figure 5, and the specific construction method is as follows:
[0086] This module replaces traditional skip connections, effectively easing the semantic gap between shallow details and deep semantic features. For the skip features and decoding features at the same resolution, the channels of the two are first aligned.
[0087] Generate global features: Global features are obtained by jointly averaging and max pooling the decoded features.
[0088]
[0089] Here, GAP(·) denotes global average pooling and GMP(·) denotes global max pooling; their outputs are the features after global average pooling and after global max pooling, respectively. The pooled features then undergo a nonlinear transformation (e.g., 1×1 convolution + normalization) to obtain the global features:
[0090]
[0091] Here, [·,·] denotes feature concatenation.
[0092] Generate local features: local features are obtained by mapping the input skip features Fs through a set of depthwise convolutions or pointwise convolutions:
[0093]
[0094] Here, the mapping is a local convolution operation used for detail extraction.
[0095] Global and local features are summed and then activated using a Sigmoid function to generate residual weights. Specifically:
[0096]
[0097] Here, the activation is the Sigmoid function and the convolution has a kernel size of 1×1; the output weight represents the global/local proportion at each spatial location.
[0098] This weight is used to fuse the global and local features to obtain the fused features:
[0099]
[0100] Here, the operator denotes element-wise multiplication. The fused features are passed through a 1×1 convolution and a Softmax mapping to produce weights that are assigned to the two original branches, the skip features and the decoding features:
[0101]
[0102] Here, the two weights are those assigned to the shallow and deep features, respectively.
[0103] The shallow and deep features are weighted by their respective weights, then convolved and normalized to generate the reweighted fused features, which serve as the module's output:
[0104]
[0105] BN stands for batch normalization.
[0106] The above mechanism can enhance the model's mixed perception of local details and global structure at complex boundaries, and improve the continuity and robustness of segmented contours.
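The gating and reweighting steps of AR-SAG can be sketched per spatial location as below. The formulas are not reproduced in this text, so several parts are assumptions: the convex blend `r * g + (1 - r) * l` used to fuse global and local features, the `logit_fn` callable standing in for the 1×1 convolution that produces two branch logits, and the omission of the final convolution and batch normalization. Global features are assumed to be already broadcast to one value per location.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax2(a, b):
    """Numerically stable two-way softmax."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def ar_sag(skip, dec, global_feat, local_feat, logit_fn):
    """All feature inputs are flat lists of equal length (one value per location)."""
    out = []
    for s, d, g, l in zip(skip, dec, global_feat, local_feat):
        r = sigmoid(g + l)                     # residual weight from the summed features
        fused = r * g + (1.0 - r) * l          # ASSUMED blend of global and local features
        w_s, w_d = softmax2(*logit_fn(fused))  # weights for the skip and decoder branches
        out.append(w_s * s + w_d * d)          # reweighted fusion
    return out
```

Because the two branch weights come from a Softmax, they sum to 1 at every location, so the output always lies between the skip value and the decoder value there.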
[0107] (4) Multi-level prediction and parameter update: the high-resolution fusion features are taken as input, and the final probability segmentation map is output.
[0108] The decoding stage progressively restores spatial resolution and generates a set of auxiliary predictions from the decoding features at different scales. During training, these predictions impose hierarchical constraints on the low-, medium-, and high-resolution layers, stabilizing the joint representation of the lesion boundary and the main body region across deep and shallow layers. The terminal high-resolution fusion feature is passed through a single-channel convolution mapping followed by the Sigmoid function to obtain the final probability segmentation map, each value of which represents the predicted probability that the corresponding pixel belongs to the lesion region. The auxiliary predictions are used only during the training phase to enhance the consistency and boundary response of features at different scales; in the inference and evaluation phases, only the final probability segmentation map is used as the output. Parameter optimization employs an iterative update strategy with adaptive weight decay. In one implementation, the initial learning rate is selected within the range of ... to ... and can be reduced in stages according to the performance curve. An early stopping strategy is adopted for training termination: training stops when the core validation-set metric (such as IoU or the Dice coefficient) shows no substantial improvement over several consecutive epochs (e.g., 10-15 epochs), avoiding overfitting and redundant computation. The optimization process gradually updates all trainable parameters based on accumulated gradient information, so that the main prediction and the auxiliary branches at each scale converge together, ultimately yielding stable high-resolution lesion segmentation results.
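The early stopping rule described in this paragraph can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the class name `EarlyStopping`, the helper `epochs_until_stop`, and the `min_delta` threshold used to decide what counts as a "substantial improvement" are all assumptions introduced here; only the patience window of roughly 10-15 epochs comes from the text.

```python
class EarlyStopping:
    """Signal a stop once the validation metric stalls for `patience` epochs."""

    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # assumed threshold for "substantial"
        self.best = float("-inf")     # best metric seen so far (higher is better)
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, metric):
        """Record one epoch's validation metric; return True to stop training."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_until_stop(val_scores, patience=10, min_delta=1e-4):
    """Return the 1-based epoch at which training would stop, or None."""
    stopper = EarlyStopping(patience, min_delta)
    for epoch, score in enumerate(val_scores, start=1):
        if stopper.step(score):
            return epoch
    return None
```

In a training loop, `step` would be called once per epoch with the validation IoU or Dice value, and the loop would exit when it returns True.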
[0109] S3. Model validation
[0110] In one specific implementation, after each training epoch the current model is run in inference mode on the validation set, without gradient updates. The model output is first mapped through a Sigmoid function to obtain a probability map, which is then binarized with a threshold of 0.5 to generate the segmentation mask.
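The mapping and thresholding step can be illustrated with a small sketch in plain Python. It assumes the network output is a 2D grid of raw scores (logits); the function names are hypothetical, and treating a probability of exactly 0.5 as lesion is a choice made here, since the text does not specify the boundary case.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binarize(logits, threshold=0.5):
    """Map a 2D grid of raw scores to a 0/1 lesion mask.

    Each score is passed through the Sigmoid function and the resulting
    probability is compared with the threshold (0.5 in the text).
    """
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]
```

For example, `binarize([[-2.0, 0.3], [1.5, -0.1]])` marks the two pixels with positive scores as lesion and the other two as background.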
[0111] To improve the learning quality of lesion regions and boundaries, a multi-branch supervision strategy combining binary cross-entropy loss and Dice loss is adopted to simultaneously supervise the main segmentation output and the auxiliary level output.
[0112] In one implementation, the network generates four levels of segmentation predictions (p_1, p_2, p_3, p_4), and each branch is compared against its ground-truth label (a binary segmentation mask). The loss is calculated using a hybrid loss function, which is a weighted average of the binary cross-entropy loss and the Dice loss. The loss function takes the following form:
[0113]
[0114] where the mapping applied to the prediction is the Sigmoid mapping. The Dice loss is used to enhance the reconstruction of the entire target region and its boundary, and takes the following form:
[0115]
[0116] where the smoothing constant prevents the denominator from being zero. In one implementation, the total loss can be written as:
[0117]
[0118] where the weighting coefficients can be set empirically or tuned according to validation-set performance. In another optional implementation, an additional supervision term on the average of the four branch outputs can be added to enhance multi-scale consistency.
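Since the loss formulas themselves are not reproduced in the text, the sketch below shows one common way to write a hybrid binary cross-entropy plus Dice loss with multi-branch supervision. It is an illustration under stated assumptions, not the exact loss of the invention: the equal weights `w_bce = w_dice = 0.5`, the unit branch weights, the placement of the smoothing constant in both numerator and denominator, and the simplification that every branch prediction has already been brought to the mask resolution and flattened are all choices made here.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss; `smooth` keeps the denominator away from zero."""
    inter = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def hybrid_loss(probs, targets, w_bce=0.5, w_dice=0.5):
    """Weighted combination of BCE and Dice for one prediction branch."""
    return w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)


def total_loss(branch_probs, targets, branch_weights=None):
    """Sum the hybrid loss over all branches (e.g. p_1..p_4)."""
    if branch_weights is None:
        branch_weights = [1.0] * len(branch_probs)  # assumed default
    return sum(w * hybrid_loss(p, targets)
               for w, p in zip(branch_weights, branch_probs))
```

A perfect prediction drives both terms to approximately zero, while a uniform prediction of 0.5 is penalized by both the cross-entropy and the Dice term.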
[0119] Model selection during the validation phase is based primarily on Intersection over Union (IoU), supplemented by the Dice metric. When several epochs yield similar IoU, the snapshot with the higher Dice value is taken as the current best model, so that the selected parameters perform well on both the boundary and the main region.
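The selection rule can be sketched as a comparison function. The tolerance `iou_tol` that defines "similar IoU" is an assumption introduced here, as are the function names and the dictionary layout of each record; the text only states that IoU is primary and Dice breaks ties.

```python
def is_better(candidate, best, iou_tol=1e-3):
    """Return True if `candidate` should replace `best` as the saved snapshot.

    Both arguments are dicts with 'iou' and 'dice' keys. IoU decides first;
    when the two IoU values differ by at most `iou_tol`, Dice breaks the tie.
    """
    if best is None:
        return True
    diff = candidate["iou"] - best["iou"]
    if diff > iou_tol:
        return True
    if abs(diff) <= iou_tol:
        return candidate["dice"] > best["dice"]
    return False


def select_best(history, iou_tol=1e-3):
    """Scan per-epoch validation records in order and keep the best one."""
    best = None
    for record in history:
        if is_better(record, best, iou_tol):
            best = record
    return best
```

In practice the model weights would be checkpointed whenever `is_better` returns True, so that the snapshot kept at the end of training is the one this rule selects.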
[0120] To quantify the model's lesion segmentation ability on the validation set, metrics including Dice, IoU, accuracy (ACC), sensitivity (SE), specificity (SP), and precision (Precision) are calculated. Their definitions are as follows:
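[0121] $\mathrm{Dice} = \dfrac{2TP}{2TP + FP + FN}$
[0122] $\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$
[0123] $\mathrm{ACC} = \dfrac{TP + TN}{TP + TN + FP + FN}$
[0124] $\mathrm{SE} = \dfrac{TP}{TP + FN}$
[0125] $\mathrm{SP} = \dfrac{TN}{TN + FP}$
[0126] $\mathrm{Precision} = \dfrac{TP}{TP + FP}$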
[0127] According to the official ISIC dataset documentation, IoU and Dice are the most important evaluation metrics. The four basic terms are True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). TP is the number of pixels correctly classified as skin lesion, and FP is the number of pixels misclassified as skin lesion. Similarly, TN is the number of pixels correctly classified as background, and FN is the number of pixels misclassified as background. By monitoring and logging these metrics epoch by epoch, the model weights with the best overall performance on the validation set are selected for the subsequent testing phase.
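The six metrics follow directly from the four pixel counts. The sketch below uses the standard definitions of Dice, IoU, accuracy, sensitivity, specificity, and precision in terms of TP, FP, TN, and FN; the function names and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN between two binary masks given as 2D lists."""
    tp = fp = tn = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1          # lesion pixel predicted as lesion
            elif p == 1 and t == 0:
                fp += 1          # background pixel predicted as lesion
            elif p == 0 and t == 0:
                tn += 1          # background pixel predicted as background
            else:
                fn += 1          # lesion pixel predicted as background
    return tp, fp, tn, fn


def _ratio(num, den):
    """Safe division; returns 0.0 for an empty denominator (assumed convention)."""
    return num / den if den else 0.0


def segmentation_metrics(pred, truth):
    """Compute the six evaluation metrics from a predicted and a true mask."""
    tp, fp, tn, fn = confusion_counts(pred, truth)
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "acc": _ratio(tp + tn, tp + tn + fp + fn),
        "se": _ratio(tp, tp + fn),
        "sp": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }
```

For a 2×4 toy mask with two true positives, one false positive, one false negative, and four true negatives, this gives IoU = 0.5 and ACC = 0.75.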
[0128] S4. Use the optimized EDAN segmentation network to perform inference segmentation on the test set of the skin lesion image dataset.
[0129] The optimal model selected in S3 is used for inference segmentation on an independent test set. No data augmentation or parameter updates are performed during the testing phase, and the input size and preprocessing strategy are kept consistent with those used in training. The output probability map is binarized with a threshold of 0.5 and compared with the ground-truth labels of the test set. Metrics such as Dice, IoU, ACC, SE, SP, and Precision are calculated and summarized to comprehensively evaluate the model's overall lesion recognition, boundary fit, control of false negatives and false positives, and background suppression. If necessary, the mean and standard deviation of each metric can be calculated to measure stability. In one implementation, several representative samples with hair occlusion, uneven lighting, or blurred boundaries can be visualized as overlays (comparing the segmented contour with the ground-truth mask) to intuitively verify boundary continuity and detail preservation. Comparison with common segmentation methods under the same testing protocol (such as the traditional U-Net, methods based on multi-scale atrous convolution structures, or attention fusion methods) further confirms the advantages of the method of this invention in boundary clarity and overall segmentation accuracy in complex lesion regions. The test and evaluation results obtained in this step serve as the basis for determining the final effectiveness of the method of this invention, and can support subsequent integration into clinical auxiliary diagnosis or lightweight deployment on resource-constrained devices.
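The optional stability summary mentioned above can be sketched with the Python standard library. The function name `summarize`, the per-image dictionary layout, and the use of the sample standard deviation (rather than the population form) are assumptions for illustration; the text does not state which form is used.

```python
from statistics import mean, stdev


def summarize(per_image_scores):
    """Return {metric: (mean, std)} over a list of per-image metric dicts.

    Uses the sample standard deviation; a single image yields a std of 0.0.
    """
    summary = {}
    for name in per_image_scores[0]:
        values = [scores[name] for scores in per_image_scores]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary
```

Each entry of `per_image_scores` would hold the metrics computed for one test image, for example `{"dice": 0.92, "iou": 0.84}`.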
[0130] Example 2:
[0131] Using the publicly available ISIC-2018 skin lesion segmentation dataset as the validation platform, the method of this invention was evaluated in systematic experiments, and its objective performance was compared with current mainstream deep segmentation networks (such as U-Net, DeepLabv3+, TransUNet, CASF-Net, CodeNet, etc.). As shown in Table 1, under unified evaluation metrics, the method of this invention achieved the best results on the core segmentation metrics: Dice coefficient (91.95%), Intersection over Union (IoU) (85.10%), and accuracy (97.03%). Of these, IoU and Dice are the key metrics in the official ISIC evaluation. Compared with the traditional U-Net and with various models that combine convolution with Transformer and attention mechanisms, the method of this invention improves overall segmentation accuracy while keeping computational cost and parameter count low (about 15.94M total parameters and 3.83G inference FLOPs).
[0132] Furthermore, a visual comparison was conducted on representative samples from the ISIC-2018 test set exhibiting different lesion morphologies, blurred boundaries, and noise interference (as shown in Figure 6). It can be observed that the mask contour obtained by the segmentation method of the present invention closely matches the manually annotated ground-truth label, and the lesion boundary is smooth, continuous, and fits the real structure more closely. The method is robust to hair and lighting interference and retains the ability to identify small lesions and irregular regions. In some cases, the traditional U-Net and Swin-Unet are prone to excessive edge smoothing, local breaks, or missed detections, whereas the present method effectively restores the complex geometric shape of skin lesions, achieving high accuracy and high integrity of the segmented region.
[0133] Example 3:
[0134] A skin lesion image segmentation system based on boundary dynamic adaptive attention includes:
[0135] The acquisition and preprocessing module is used to acquire and preprocess RGB images of skin lesions and their expert-annotated segmentation masks to obtain a dataset of skin lesion images.
[0136] The EDAN segmentation network construction module is used to construct the EDAN segmentation network based on the U-Net framework, including a dynamic adaptive convolutional attention module, an adaptive reweighted spatial attention gate module, and an efficient dynamic sampling module. The dynamic adaptive convolutional attention module comprises a multi-branch convolutional structure for extracting diverse spatial features. The adaptive reweighted spatial attention gate module adaptively balances global semantics and local details through dynamic weight allocation and cross-scale fusion mechanisms. The efficient dynamic sampling module automatically adjusts the sampling position and fusion strategy based on the input RGB image of a skin lesion by combining dynamic offset sampling and depthwise separable convolution. The RGB image of a skin lesion is input into the EDAN segmentation network to generate a probability segmentation map.
[0137] The verification and optimization module is used to verify and optimize the EDAN segmentation network.
[0138] The test module is used to perform inference segmentation on the test set of the skin lesion image dataset using the optimized EDAN segmentation network.
[0139] Table 1: Comparison of skin lesion segmentation performance of the present invention with other segmentation networks on the ISIC 2018 dataset.
[0140]
Claims
1. A skin lesion image segmentation method based on boundary dynamic adaptive attention, characterized in that it includes the following steps: S1. Acquire and preprocess RGB images of skin lesions and their expert-annotated segmentation masks to obtain a dataset of skin lesion images; S2. Construct an EDAN segmentation network based on the U-Net framework, consisting of a dynamic adaptive convolutional attention module, an efficient dynamic sampling module, and an adaptive reweighted spatial attention gate module; the RGB image of the skin lesion is input into the EDAN segmentation network to generate a probability segmentation map; in each decoding layer, the dynamic adaptive convolutional attention module enhances the features to be enhanced at multiple scales through a multi-branch convolutional structure, generating refined features; the efficient dynamic sampling module automatically adjusts the sampling position and fusion strategy by combining dynamic offset sampling and depthwise separable convolution to restore the spatial resolution of the refined features; the output of the efficient dynamic sampling module is connected to the skip features of the corresponding encoder, and dynamic multi-scale feature fusion is achieved through the adaptive reweighted spatial attention gate module; the fusion result is used as the input of the dynamic adaptive convolutional attention module of the next layer; the efficient dynamic sampling module is specifically implemented as follows: for the refined features output by the dynamic adaptive convolutional attention module in the current decoding stage, the coordinate offsets output by the offset prediction branch are superimposed on the base grid to generate dynamic sampling coordinates; adaptive reconstruction is performed through offset-driven grid sampling, and the upsampling result is output; the upsampling result is then subjected to depthwise separable convolution, normalization, activation, and ... channel compression to generate decoding features; S3. Verify and optimize the EDAN segmentation network; S4. Use the optimized EDAN segmentation network to perform inference segmentation on the test set of the skin lesion image dataset.
2. The skin lesion image segmentation method based on boundary dynamic adaptive attention according to claim 1, characterized in that the multi-branch convolutional structure includes depthwise convolution, deformable convolution, and global convolution branches with different directions and scales; the deformable convolution adaptively adjusts the sampling position according to the morphology of the lesion region.
3. The skin lesion image segmentation method based on boundary dynamic adaptive attention according to claim 2, characterized in that S2 includes: the features to be enhanced are input into the dynamic adaptive convolutional attention module; in the dynamic adaptive convolutional attention module, the channel attention weights of the features to be enhanced are generated from the parallel results of global average pooling and max pooling; multi-scale spatial aggregation is applied to the features to be enhanced, and spatial attention fusion weights are calculated; the spatial attention fusion weights and the channel attention weights are coupled element-wise to generate the refined features.
4. The skin lesion image segmentation method based on boundary dynamic adaptive attention according to claim 1, characterized in that the adaptive reweighted spatial attention gate module is implemented as follows: channel alignment is performed on the skip features output by the encoder and the decoding features at the same resolution; local features and global features are generated based on the skip features and the decoding features, and the two are fused to generate fused features; the fused features are then subjected to a 1×1 convolution and a Softmax mapping to generate the weights assigned to the two branches of the skip features and the decoding features, and finally the reweighted fused features are output.
5. The skin lesion image segmentation method based on boundary dynamic adaptive attention according to claim 4, characterized in that generating the fused features includes: the global features are generated by applying global average pooling and max pooling to the decoding features; the local features are generated by performing depthwise convolution or pointwise convolution on the skip features; the global features and the local features are added together, and the residual weights are generated by Sigmoid activation; the global features and the local features are fused using the residual weights to generate the fused features.
6. The skin lesion image segmentation method based on boundary dynamic adaptive attention according to claim 4, characterized in that S2 also includes: a set of auxiliary predictions is generated from the decoding features at different scales to impose hierarchical constraints on each resolution layer during training; the terminal high-resolution fusion feature among the reweighted fused features is passed through a single-channel convolution mapping followed by the Sigmoid function to obtain the probability segmentation map, which represents the predicted probability that each pixel belongs to the lesion region.
7. The skin lesion image segmentation method based on boundary dynamic adaptive attention according to claim 1, characterized in that, S3 includes: generating a segmentation prediction mask based on the probability segmentation map; constructing a total loss including binary cross-entropy loss and Dice loss based on the segmentation prediction mask; and simultaneously applying supervision to the main segmentation output and the auxiliary level output.
8. A skin lesion image segmentation system based on boundary dynamic adaptive attention, characterized in that it includes: an acquisition and preprocessing module, used to acquire and preprocess RGB images of skin lesions and their expert-annotated segmentation masks to obtain a dataset of skin lesion images; an EDAN segmentation network construction module, used to construct an EDAN segmentation network based on the U-Net framework, including a dynamic adaptive convolutional attention module, an adaptive reweighted spatial attention gate module, and an efficient dynamic sampling module; the RGB image of the skin lesion is input into the EDAN segmentation network to generate a probability segmentation map; in each decoding layer, the dynamic adaptive convolutional attention module enhances the features to be enhanced at multiple scales through a multi-branch convolutional structure, generating refined features; the efficient dynamic sampling module automatically adjusts the sampling position and fusion strategy by combining dynamic offset sampling and depthwise separable convolution to restore the spatial resolution of the refined features; the output of the efficient dynamic sampling module is connected to the skip features of the corresponding encoder, and dynamic multi-scale feature fusion is achieved through the adaptive reweighted spatial attention gate module; the fusion result is used as the input of the dynamic adaptive convolutional attention module of the next layer; a verification and optimization module, used to verify and optimize the EDAN segmentation network; a testing module, used to perform inference segmentation on the test set of the skin lesion image dataset using the optimized EDAN segmentation network; the skin lesion image segmentation system based on boundary dynamic adaptive attention is used to implement the skin lesion image segmentation method based on boundary dynamic adaptive attention as described in any one of claims 1-7.