Medical image segmentation method combining curve structure prompter and deep neural network

CN118314121B · Active · Publication Date: 2026-06-26 · SICHUAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SICHUAN UNIV
Filing Date: 2024-05-13
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing medical image segmentation methods suffer from incomplete segmentation and inaccurate localization when dealing with curved structures, especially performing poorly in segmentation tasks with limited open resources and complex, detailed curved structures. Furthermore, existing deep learning methods have limited portability and adaptability.

Method used

By combining curve structure cues and deep neural networks, a lightweight medical image segmentation framework is constructed by building an image encoder based on Transformer and CNN to generate dense and sparse cue embeddings, and using a mask decoder with self-attention and cross-attention for feature fusion. Finally, the segmentation result is generated through transposed convolution.

Benefits of technology

It achieves efficient segmentation of complex and intricate curved medical images, improving segmentation accuracy and localization accuracy. It also performs well in several challenging medical image segmentation tasks with relatively few parameters and computational cost.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of medical image segmentation, and discloses a medical image segmentation method combining a curve structure prompter and a deep neural network. First, a public curve structure segmentation medical image dataset is selected and preprocessed; then, an image encoder based on a Transformer and a CNN is constructed to generate image embedding; a prompter encoder is constructed to extract the features of the prompter and generate two kinds of prompter embedding, namely dense embedding and sparse embedding; a mask decoder based on self-attention and cross-attention is designed to fuse the image embedding and the prompter embedding, decode the fused features to obtain final decoding features, and process the final decoding features to generate a final segmentation result; an encoder structure combining the Transformer and the CNN is designed to generate an image segmentation framework, and medical image segmentation for a curve structure is completed. The application uses the curve structure prompter as good segmentation prior knowledge to guide the network to segment complex and detailed curve structures.


Description

Technical Field

[0001] This invention relates to medical image segmentation technology in the field of image processing, and particularly to a medical image segmentation method based on an innovative skeleton cueing strategy specific to curve structures and a deep neural network.

Background Technology

[0002] Medical image segmentation is a widely studied and challenging subject, aiming to help clinicians focus more intently on pathological regions and extract detailed information from medical images for more accurate diagnosis and analysis. In medical image segmentation tasks, structures of interest can be categorized into two types: block structures (such as organs like the lungs and kidneys) and curved structures (such as blood vessels and nerves). Curved structures are typically elongated, tortuous, and have variable scales. Accurate segmentation of these curved structures is crucial for clinical practice. For example, segmenting retinal vessels aids in the early diagnosis of diseases such as diabetes, glaucoma, and hypertension. Furthermore, segmentation of structures such as the optic nerve and the inner retinal layer provides a non-invasive view of the central nervous system, helping to understand the progression of neurodegenerative diseases. The morphology of segmented retinal vessels can also reflect the health status of other vascular systems. Therefore, developing accurate Curve Structure Segmentation (CSS) models is key to enhancing medical practice and advancing disease research, particularly in ophthalmology.

[0003] Traditional CSS methods heavily rely on manually designed features and domain-specific knowledge, resulting in significant performance and efficiency gaps compared to deep learning-based methods. Current deep learning-based CSS methods focus on designing complex modules, network architectures, and loss functions, achieving substantial progress. However, when faced with new CSS tasks, these methods typically require training from scratch, leading to limited portability and adaptability in practical applications. For example, experiments have shown that the performance of a U-Net model trained on the HRF dataset significantly degrades when transferred to the LES-AV dataset. Similarly, experiments have shown that while a model trained on a retinal vessel segmentation dataset can perform basic vessel segmentation tasks, it struggles with more specific tasks, such as segmenting veins and arteries.

[0004] Recently, SAM (Segment Anything Model) has broken through the limitations of traditional expert models that rely on fully supervised learning through interactive training methods, demonstrating excellent performance in zero-shot and few-shot scenarios. Studies such as MedSAM and SAM-Med2D have shown that fine-tuning based on SAM is a common practice in medical imaging to quickly adapt to specific medical image segmentation tasks. However, SAM's cueing strategies—dot cues and box cues—are mainly applicable to block structures, and their segmentation accuracy for complex, detailed curve structures needs improvement. Therefore, to improve SAM's ability to handle curve structure segmentation tasks in medical images, a cue specifically for curve structures needs to be designed. Furthermore, publicly available CSS datasets are typically small in size with a limited number of annotated samples. This raises another important issue we need to address: how to utilize limited open resources for the research and application of CSS-based models.

[0005] In summary, efficiently predicting curve structures, designing a new curve structure cue, and utilizing existing open resources to research and apply CSS-based models to achieve good performance are key issues that urgently need to be addressed in medical image segmentation.

Summary of the Invention

[0006] To address the aforementioned problems, the purpose of this invention is to implement a medical image segmentation method that combines curve structure cues with deep neural networks. This method utilizes curve structure cues as valuable prior knowledge to guide the network in segmenting complex and detailed curve structures. The technical solution is as follows:

[0007] A medical image segmentation method combining curve structure cues and deep neural networks includes the following steps:

[0008] Step 1: Select a publicly available medical image dataset for curve structure segmentation, and perform data augmentation and preprocessing on the training set in the dataset;

[0009] Step 2: Construct an image encoder based on Transformer and CNN to extract image features and generate image embeddings;

[0010] Step 3: Construct a cue encoder, extract cue features, and generate two types of cue embeddings: dense embedding and sparse embedding.

[0011] Step 4: Design a mask decoder based on self-attention and cross-attention, fuse image embedding and cue embedding, and decode the fused features to obtain the final decoded features;

[0012] Step 5: Design a prediction head based on transposed convolution to process the final decoded features to generate the final segmentation result;

[0013] Step 6: Design an image segmentation framework consisting of four parts: an encoder structure combining Transformer and CNN, a cue encoder structure with dense and sparse embeddings, a mask decoder structure, and a final prediction head structure, to complete the segmentation of medical images with curved structures.

[0014] Furthermore, the datasets in step 1 include the DCA1 and CORN datasets for segmentation tasks of coronary arteries and corneal nerves, and the HRF, LES-AV, CHASEDB1, DRIVE, OCTA500, STARE, IOSTAR, and ORVS datasets for retinal vessel segmentation studies; wherein the IOSTAR and ORVS datasets are used for zero-shot testing.

[0015] The specific steps for data augmentation and preprocessing of the training set in the dataset are as follows:

[0016] On the original image and its corresponding Ground Truth, several sliding windows of different pixel sizes are used to extract local image patches to expand the dataset. At the same time, single connected component masks are randomly combined, with the number of connected components in each generated mask limited to a set maximum. Masks whose foreground is smaller than a set number of pixels are then excluded to ensure mask quality. Finally, all images are normalized.

[0017] Furthermore, step 2 specifically includes:

[0018] Step 2.1: The mapping process of any input feature map after passing through the adapter layer module can be represented as:

[0019] Step 2.1.1: Pass the input feature map through a global pooling layer to reduce the feature map resolution:

[0020] $X_{GP} = \mathrm{GlobalPooling}(X_{AL}^{In})$

[0021] where $X_{GP}$ is the feature map output by the global pooling layer, GlobalPooling represents global pooling, and $X_{AL}^{In}$ is the input feature map of the adapter layer module;

[0022] Step 2.1.2: Apply linear and nonlinear mappings to the feature map $X_{GP}$ output by the global pooling layer. The feature map $X_{GP}$ passes sequentially through an Mlp1 layer, an activation layer ReLU3, and an Mlp2 layer, followed by Sigmoid activation, giving the intermediate feature map $X_{Sig}$. This is multiplied point by point with the feature map $X_{AL}^{In}$ from before the adapter layer module to obtain $X_{Map}$. The formulas are expressed as:

[0023] $X_{Sig} = \mathrm{Sigmoid}(\mathrm{Mlp2}(\mathrm{ReLU3}(\mathrm{Mlp1}(X_{GP}))))$

[0024] $X_{Map} = X_{Sig} \otimes X_{AL}^{In}$

[0025] where $X_{Map}$ is the feature map obtained through linear and nonlinear mapping, and $\otimes$ indicates point-by-point multiplication;

[0026] Step 2.1.3: Apply further nonlinear activation to the feature map $X_{Map}$ obtained through linear and nonlinear mapping, specifically as follows:

[0027] First, the feature map $X_{Map}$ passes sequentially through a convolutional layer (Conv3), an activation layer (ReLU4), and a transposed convolutional layer (TransConv), which doubles the feature map resolution; it then passes through an activation layer (ReLU5). Finally, the result is added point by point to the feature map $X_{AL}^{In}$ from before the adapter layer module to obtain the final output $X_{AL}^{Out}$. The formula is expressed as:

[0028] $X_{AL}^{Out} = \mathrm{ReLU5}(\mathrm{TransConv}(\mathrm{ReLU4}(\mathrm{Conv3}(X_{Map})))) + X_{AL}^{In}$
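As an illustration of Steps 2.1.1 and 2.1.2 only, the following minimal Python sketch gates a feature map channel by channel. It assumes that global pooling is average pooling to one value per channel and that Mlp1 and Mlp2 are single linear layers; the function names and weight arguments are hypothetical and do not come from the patent. The convolution, transposed convolution, and residual addition of Step 2.1.3 are omitted.

```python
import math


def global_average_pool(feature_map):
    """Reduce each channel (a list of rows) to its mean value (assumed pooling type)."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


def linear(vector, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def channel_gate(feature_map, w1, b1, w2, b2):
    """Pool -> Mlp1 -> ReLU -> Mlp2 -> Sigmoid, then scale the input point by point."""
    pooled = global_average_pool(feature_map)
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(hidden, w2, b2)]
    return [[[g * px for px in row] for row in ch]
            for ch, g in zip(feature_map, gate)]


# Hypothetical 2-channel, 2x2 input; zero second-layer weights give a gate of 0.5.
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
gated = channel_gate(features, identity, [0.0, 0.0], zeros, [0.0, 0.0])
```

Because the gate is one scalar per channel, the multiplication broadcasts that scalar over every pixel of the channel, which is one plausible reading of "point-by-point multiplication" after global pooling.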

[0029] Step 2.2: The mapping process of any input feature map passing through the Transformer Block can be represented as:

[0030]

[0031]

[0032]

[0033]

[0034]

[0035]

[0036]

[0037]

[0038] Where TB stands for Transformer Block. The feature map before it is input to the Transformer Block has dimension $\mathbb{R}^{H_T \times W_T \times C_T}$, where $H_T$, $W_T$, and $C_T$ are its height, width, and channel dimension. The feature map output by the Transformer Block has the same dimension; in other words, the Transformer Block does not change the dimension of the input feature map. The Softmax operation normalizes a vector, mapping all of its elements to the range 0-1. Q, K, and V are three matrices of the same dimension, obtained by passing the input feature map through three different Mlp layers, $\mathrm{Mlp}_{Q1}$, $\mathrm{Mlp}_{K2}$, and $\mathrm{Mlp}_{V3}$, respectively; each Mlp consists of two linear layers and one activation layer. $d_k$ represents the dimension of the Q and K matrices;
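The quantities named above (Softmax, Q, K, V, and $d_k$) are those of standard scaled dot-product attention, $\mathrm{Softmax}(QK^{T}/\sqrt{d_k})\,V$. The sketch below implements that standard formula in plain Python on the assumption that the Transformer Block uses it; the Mlp projections, normalization, and residual connections of the block are not shown, and the function names are illustrative.

```python
import math


def softmax(row):
    """Map a list of scores to positive weights that sum to 1."""
    peak = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(k[0]))          # sqrt(d_k)
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


# A zero query attends equally to both keys, so the output is the mean of the values.
mixed = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [4.0, 8.0]])
```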

[0039] Step 2.3: Use a patch embedding layer to extract shallow features of the image through a convolutional layer with a stride of S and a sliding window size of K, obtaining the output result $X_{EB}$, which can be expressed by the following formula:

[0040] $X_{EB} = \mathrm{Conv}_{EB}(X_{In}, K, S)$

[0041] where $X_{In}$ represents the original image before it is input to the patch embedding layer, with dimension $\mathbb{R}^{H \times W \times C}$; H represents the height of the original image, W its width, and C its channel dimension; K represents the sliding window size of the convolution, and S represents the stride of the convolution kernel in each slide.

[0042] Step 2.4: Pass the output result $X_{EB}$ obtained in Step 2.3 through 12 stages of Transformer Block and adapter layer processing. The Transformer Block and adapter layer are the modules described in Steps 2.2 and 2.1, respectively. In each stage, the feature map passes through the Transformer Block $TB_i$ and the adapter layer $AL_i$ simultaneously, where i indicates the i-th stage. After 12 stages, the final output image embedding of the image encoder is obtained.

[0043] Furthermore, step 3 specifically includes:

[0044] Step 3.1: The process of generating dense embeddings is represented as follows:

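[0045] The skeleton prompt $X_{SK} \in \mathbb{R}^{H_{SK} \times W_{SK} \times C_{SK}}$, where $H_{SK}$, $W_{SK}$, and $C_{SK}$ represent the height, width, and dimension of the skeleton prompt, respectively, is passed sequentially through a convolutional layer $\mathrm{Conv}_{PE\_1}$, an activation layer $\mathrm{ReLU}_{PE\_1}$, a convolutional layer $\mathrm{Conv}_{PE\_2}$, and an activation layer $\mathrm{ReLU}_{PE\_2}$ to obtain the final dense embedding $X_{DE}$, where $C_{DE}$ represents the dimension of the dense embedding. The entire process is expressed by the following formula:

[0046] $X_{DE} = \mathrm{ReLU}_{PE\_2}(\mathrm{Conv}_{PE\_2}(\mathrm{ReLU}_{PE\_1}(\mathrm{Conv}_{PE\_1}(X_{SK}))))$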

[0047] Step 3.2: The process of generating sparse embeddings is represented as follows:

[0048] Sparse cue points are a number of points in the original image, each labeled 0 or 1 to indicate foreground or background. Assume there are N coordinates $[P_0(x,y), P_1(x,y), \ldots, P_{N-1}(x,y)]$. First, the coordinates are linearly transformed to $[0, 2\pi]$; these N coordinates are then combined to obtain $P_n$, which is multiplied by the Gaussian distribution matrix $G$, as shown below:

[0049] $P_n' = P_n * G$

[0050] where $P_n'$ is the product matrix generated by multiplying the coordinates $P_n$ by $G$, $*$ indicates matrix multiplication, N is the number of points in the sparse prompt, and C represents the feature dimension preset for the generated sparse embedding.

[0051] Finally, the sine and cosine values of each coordinate are concatenated to obtain the sparse embedding.

[0052] Step 3.3: Fuse the final dense embedding $X_{DE}$ obtained in Step 3.1 with the image embedding $X_{IE}$ obtained in Step 2.4 by adding them point by point; the fused result $X_{Out}^{PE}$ is used as the fused feature output by the prompt encoder. The formula is expressed as follows:

[0053] $X_{Out}^{PE} = X_{DE} + X_{IE}$

[0054] Furthermore, the mask decoder in step 4 includes two identical mask decoding processing blocks, and the processing procedure is as follows:

[0055] Step 4.1: Apply self-attention to the sparse embedding obtained in Step 3.2 to obtain the enhanced sparse embedding. The enhanced sparse embedding is then passed through an Mlp layer to obtain a query matrix. Expressed as formulas:

[0056]

[0057]

[0058] Step 4.2: Pass the fused features obtained in Step 3.3 through three Mlp layers to obtain three matrices. Two of these matrices are then used with the query matrix from Step 4.1 to perform cross-attention, and the result is added to the enhanced sparse embedding and passed through a normalization layer Norm1 to obtain the intermediate sparse embedding. The intermediate sparse embedding then passes through an Mlp layer $\mathrm{Mlp}_{MD\_r}$, the output is added element-wise to the intermediate sparse embedding, and the sum is passed through a normalization layer (Norm2) to obtain the final learned sparse embedding. Expressed using the following formulas:

[0059]

[0060]

[0061]

[0062]

[0063]

[0064] Where Cross-Attention1 represents the cross-attention operation;

[0065] Step 4.3: Pass the final sparse embedding obtained in Step 4.2 through two Mlp layers to obtain two matrices. These are then used with a matrix from Step 4.2 to perform cross-attention, yielding the enhanced fusion embedding. The enhanced fusion embedding is added point by point to the fused features obtained in Step 3.3, and the result is passed through a normalization layer (Norm3) to obtain the final output of the mask decoding processing block. The final feature map is obtained after two such processing blocks. The formulas for this step are expressed as follows:

[0066]

[0067]

[0068]

[0069]

[0070]

[0071] Where Cross-Attention2 represents the cross-attention operation. An intermediate feature map is obtained after one mask decoding processing block, and the final feature map is obtained after two mask decoding processing blocks. MD2 is the second mask decoding processing block, and its process is the same as that of the first mask decoding processing block.

[0072] Furthermore, step 5 specifically includes:

[0073] Step 5.1: The final feature map generated by the mask decoding processing blocks in Step 4.3 is restored to the size of the original image, as it was before being input into the network, by two transposed convolutions. This process can be expressed by the following formulas:

[0074]

[0075] $X_{Trans2} = \mathrm{Conv2DTranspose2}(X_{Trans1})$

[0076] where $X_{Trans1}$ is the feature map output by the first transposed convolution Conv2DTranspose1, and $X_{Trans2}$ is the feature map output by the second transposed convolution Conv2DTranspose2;

[0077] Step 5.2: Apply Sigmoid activation to the feature map $X_{Trans2}$ obtained in Step 5.1 to generate the final prediction result. The process is represented as follows:

[0078] $X_{Out}^{PH} = \mathrm{Sigmoid}(X_{Trans2})$

[0079] where $X_{Out}^{PH}$ is the final curve structure segmentation prediction map.
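To make the resolution bookkeeping of the prediction head concrete, the sketch below applies the usual output-size formula for a transposed convolution (dilation 1): `(size - 1) * stride - 2 * padding + kernel + output_padding`. The kernel size and stride of 4 in the example are assumptions chosen so that two layers undo a 16x reduction; the patent does not state these hyperparameters, and the function names are illustrative.

```python
def transposed_conv_output_size(size, kernel, stride, padding=0, output_padding=0):
    """Spatial output size of one transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def prediction_head_size(size, layers):
    """Apply a sequence of (kernel, stride) transposed convolutions to a size."""
    for kernel, stride in layers:
        size = transposed_conv_output_size(size, kernel, stride)
    return size


# Assumed configuration: two 4x upsampling layers take a 14-pixel side to 224.
restored = prediction_head_size(14, [(4, 4), (4, 4)])
```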

[0080] Furthermore, step 6 specifically includes:

[0081] Step 6.1: Input the original image into an encoder structure based on a combination of Transformer and CNN. Freeze the Transformer and use an adapter with a CNN structure to progressively learn the features of the medical image with a curved structure. Finally, output the image embedding through 12 parallel stages of Transformer and adapter.

[0082] Step 6.2: Input the skeleton prompt and the dot prompt into the prompt encoder to obtain the final dense embedding and the final sparse embedding, respectively. The final dense embedding and the image embedding from Step 6.1 are then added point by point, and the result is used as the fusion feature output by the prompt encoder.

[0083] Step 6.3: The fused features and the final sparse embedding are passed through two mask decoder blocks to obtain the final feature map.

[0084] Step 6.4: The final feature map output by the mask decoder is passed through two transposed convolutions to restore it to the size of the original image, as it was before being input into the network, and then through a Sigmoid activation function. This yields the final curve structure segmentation prediction map.

[0085]

[0086] The beneficial effects of adopting the above technical solution are as follows:

[0087] 1) This invention proposes a novel and effective medical image segmentation framework that combines skeleton cue, dot cue, and deep neural network to comprehensively solve the problem of segmenting medical images with complex and fine curve structures. This framework can handle the problems of incomplete segmentation results and inaccurate localization of curve structures with complex and fine structures in medical images.

[0088] 2) This invention organizes existing medical image datasets with curved structures and uses a skeleton-finding algorithm to extract the skeletons of the curved structure labels, thus constructing a preliminary skeleton dataset. The dataset is then expanded using sliding windows of four different sizes and randomly combined single connected component masks, while masks whose foreground is smaller than 100 pixels are excluded to ensure mask quality. Furthermore, morphological erosion operations are performed iteratively and endpoint pixels are deleted to generate a corresponding skeleton for each mask. Finally, a high-quality skeleton cue dataset is generated for use in training large CSS models.

[0089] 3) This invention constructs a lightweight mask decoder that can effectively fuse image embedding, skeleton embedding (dense embedding), and point embedding (sparse embedding) using only one self-attention and two cross-attention operations. That is, it complements the features of the different modalities and fuses them in a lightweight way, passing the final feature layer to the segmentation head, which outputs the final curve structure segmentation result.

Attached image description:

[0090] Figure 1 is the image encoder module of the present invention.

[0091] Figure 2 is the prompt encoder module of the present invention.

[0092] Figure 3 is the mask decoder module of the present invention.

[0093] Figure 4 is the prediction head module of the present invention.

[0094] Figure 5 is a flowchart of the medical image segmentation process for curve structures that combines skeleton prompts, dot prompts, and deep neural networks according to the present invention.

Detailed Implementation

[0095] The technical solutions of the present invention will be further described in detail below with reference to the accompanying drawings.

[0096] This invention designs a medical image segmentation method combining curve structure cues, dot cues, and deep neural networks. First, a Transformer Block and an Adapter are combined as a single stage, with a total of 12 stages. These 12 stages, along with a Patch Embedding layer, constitute an encoder. Second, a cue encoder is used to encode the skeleton cue and dot cues to obtain dense and sparse embeddings, respectively. These embeddings serve as strong prior knowledge for curve segmentation. Furthermore, a lightweight mask decoder is designed, comprising two decoder blocks. Each block uses only one self-attention and two cross-attention to fuse the image embedding, skeleton embedding (dense embedding), and dot embedding (sparse embedding) to generate the final feature fusion layer. Finally, a prediction head is used to upsample and activate the final feature fusion layer to obtain the final curve structure segmentation prediction result.

[0097] The proposed method was evaluated on several different challenging medical segmentation tasks involving curved structures, demonstrating superior performance compared to most state-of-the-art methods, and exhibiting fewer parameters and fewer GFlops compared to other methods.

[0098] Step 1: Select a publicly available curve structure medical image segmentation dataset, perform data augmentation and preprocessing on the dataset, and obtain our own constructed skeleton dataset.

[0099] The specific implementation of data augmentation and preprocessing of the training set and construction of the skeleton dataset is as follows:

[0100] 1) On the original images and their corresponding ground truths, we use sliding windows (sizes of 128, 224, 256 and 384, with an overlap of 0.5) to extract labeled local patch images of the curve structure, thereby expanding the dataset to 101K.

[0101] 2) The mask containing multiple connected components is split into multiple independent connected component masks. To improve the diversity of masks during interactive training, single connected component masks are randomly combined, and the number of connected components in the generated masks is limited to no more than 5. Following these steps, the number of masks increases to 582K. Since expert models cannot predict multiple masks for a single image, all expert models are trained and tested on 101K images and the corresponding masks from step 2).

[0102] 3) Masks whose foreground is smaller than 100 pixels are excluded to ensure the quality of the masks.

[0103] 4) Generate a corresponding skeleton dataset for each mask by iteratively performing morphological erosion operations and deleting endpoint pixels.

[0104] 5) All images, masks, and skeletons are saved in PNG format to ensure data loading consistency.
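The sliding-window extraction in 1) and the foreground filter in 3) can be sketched as follows. This is a minimal illustration, not the inventors' pipeline: images and masks are plain lists of rows, the step is taken as `size * (1 - overlap)`, windows larger than the image are skipped, and the 100-pixel filter is applied directly to each patch mask. These choices and all function names are assumptions. The connected-component recombination of 2) and the skeleton generation of 4) are not shown.

```python
def window_origins(length, size, overlap=0.5):
    """Start offsets of windows of `size` pixels along an axis of `length` pixels."""
    if size > length:
        return []                                  # window does not fit: skip it
    step = max(1, int(size * (1 - overlap)))       # overlap 0.5 -> step of size / 2
    return list(range(0, length - size + 1, step))


def extract_patches(image, mask, sizes=(128, 224, 256, 384), overlap=0.5,
                    min_foreground=100):
    """Yield (patch, patch_mask) pairs cut from a 2-D image and its binary mask.

    Patches whose mask holds fewer than `min_foreground` foreground pixels
    are dropped.
    """
    height, width = len(image), len(image[0])
    for size in sizes:
        for top in window_origins(height, size, overlap):
            for left in window_origins(width, size, overlap):
                patch_mask = [row[left:left + size] for row in mask[top:top + size]]
                if sum(map(sum, patch_mask)) < min_foreground:
                    continue
                patch = [row[left:left + size] for row in image[top:top + size]]
                yield patch, patch_mask


# Hypothetical 256x256 image whose mask has a single 10x10 foreground square.
image = [[0] * 256 for _ in range(256)]
mask = [[1 if r < 10 and c < 10 else 0 for c in range(256)] for r in range(256)]
patches = list(extract_patches(image, mask, sizes=(128,)))
```

In this example only the window anchored at the top-left corner contains the full 100 foreground pixels, so it is the only patch kept.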

[0105] Step 2: Construct an image encoder module based on Transformer and CNN to enable the network to extract rich image features and generate image embeddings. A Transformer Block and an Adapter are combined as one stage, with a total of 12 stages. These 12 stages, along with a Patch Embedding layer, constitute an encoder. Figure 1 shows the image encoder module of the present invention. The specific construction steps are as follows:

[0106] Step 2.1: The mapping process of any input feature map after passing through the Adapter layer module can be represented as:

[0107] ① The input feature map $X_{AL}^{In}$ is passed through a global pooling layer to reduce the feature map resolution, i.e.

[0108] $X_{GP} = \mathrm{GlobalPooling}(X_{AL}^{In})$

[0109] ② The feature map $X_{GP}$ output by the global pooling layer undergoes linear and nonlinear mapping: it first passes through an Mlp1 layer, then an activation layer (ReLU3), followed by an Mlp2 layer with Sigmoid activation, giving the feature map $X_{Sig}$. This is multiplied point by point with the feature map $X_{AL}^{In}$ from before the adapter to obtain $X_{Map}$. The formulas are expressed as:

[0110] $X_{Sig} = \mathrm{Sigmoid}(\mathrm{Mlp2}(\mathrm{ReLU3}(\mathrm{Mlp1}(X_{GP}))))$

[0111] $X_{Map} = X_{Sig} \otimes X_{AL}^{In}$

[0112] where $\otimes$ indicates point-by-point multiplication.

[0113] ③ The feature map $X_{Map}$ obtained through linear and nonlinear mapping undergoes further nonlinear activation. Specifically, $X_{Map}$ first passes through a convolutional layer (Conv3), an activation layer (ReLU4), and a transposed convolutional layer (TransConv), which doubles the feature map resolution and ensures that the feature map resolution is unchanged after passing through the Adapter Layer. It then passes through an activation layer (ReLU5). Finally, the result is added point by point to the feature map $X_{AL}^{In}$ initially input into the Adapter Layer, to correct any errors that may arise in the Adapter Layer, giving the final output $X_{AL}^{Out}$ of the Adapter Layer. The formula is expressed as:

[0114] $X_{AL}^{Out} = \mathrm{ReLU5}(\mathrm{TransConv}(\mathrm{ReLU4}(\mathrm{Conv3}(X_{Map})))) + X_{AL}^{In}$

[0115] Step 2.2: The mapping process of any input feature map passing through the Transformer Block can be represented as:

[0116]

[0117]

[0118]

[0119]

[0120]

[0121]

[0122]

[0123]

[0124] Where TB stands for Transformer Block. The feature map before it is input to the Transformer Block has dimension $\mathbb{R}^{H_T \times W_T \times C_T}$, where $H_T$, $W_T$, and $C_T$ represent its height, width, and channel dimension. The feature map output by the Transformer Block has the same dimension; in other words, the Transformer Block does not change the dimension of the input feature map. The Softmax operation normalizes a vector, mapping all of its elements to the range 0-1. Q, K, and V are three matrices of the same dimension, obtained by passing the input feature map through three different Mlp layers, $\mathrm{Mlp}_{Q1}$, $\mathrm{Mlp}_{K2}$, and $\mathrm{Mlp}_{V3}$, respectively; each Mlp consists of two linear layers and one activation layer. $d_k$ represents the dimension of the Q and K matrices;

[0125] Step 2.3: Use a patch embedding layer to extract shallow features of the image through a convolutional layer with a stride of 16 and a sliding window size of 16, obtaining the output result $X_{EB}$, which can be expressed by the following formula:

[0126] $X_{EB} = \mathrm{Conv}_{EB}(X_{In}, K, S)$

[0127] where $X_{In}$ represents the original image before it is input to the patch embedding layer, with dimension $\mathbb{R}^{H \times W \times C}$; H represents the height of the image, W its width, and C its channel dimension; K represents the sliding window size of the convolution, and S represents the stride of the convolution kernel in each slide.
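With K = S = 16, the patch embedding convolution tiles the image into non-overlapping 16x16 patches. The short sketch below computes the resulting grid size using the standard convolution output-size formula; zero padding is an assumption (the patent does not mention padding), and the function names are illustrative.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_grid(height, width, kernel=16, stride=16):
    """Number of patch rows and columns produced by the patch embedding layer."""
    return (conv_output_size(height, kernel, stride),
            conv_output_size(width, kernel, stride))


# A 224x224 input yields a 14x14 grid of patch tokens.
grid = patch_grid(224, 224)
```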

[0128] Step 2.4: Pass $X_{EB}$ obtained in Step 2.3 through 12 stages of Transformer Block and Adapter Layer processing. The Transformer Block and Adapter Layer are the modules described in Steps 2.2 and 2.1, respectively. In each stage, the feature map input to the i-th stage passes through $TB_i$ and $AL_i$ simultaneously, and the outputs of $TB_i$ and $AL_i$ are added together to obtain the intermediate output of the i-th stage, where $TB_i$ and $AL_i$ denote the i-th Transformer Block and Adapter Layer, respectively, and $i \in [1, 12]$. Each stage can be represented as $X_i = TB_i(X_{i-1}) + AL_i(X_{i-1})$, with $X_0 = X_{EB}$. After 12 stages, the final output of the Image Encoder, i.e. the image embedding, is obtained.
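The stage structure described in Step 2.4, in which each stage feeds the same input to a Transformer Block and an Adapter Layer and sums the two outputs, can be sketched as a simple loop. The blocks are passed in as callables on flat lists of numbers; the stand-in functions in the example are placeholders rather than real Transformer or adapter modules, and all names are hypothetical.

```python
def run_encoder_stages(x, transformer_blocks, adapter_layers):
    """Apply paired stages: each stage outputs TB_i(x) + AL_i(x), element-wise."""
    if len(transformer_blocks) != len(adapter_layers):
        raise ValueError("each stage needs one Transformer Block and one Adapter Layer")
    for tb, al in zip(transformer_blocks, adapter_layers):
        x = [t + a for t, a in zip(tb(x), al(x))]   # parallel branches, then sum
    return x


def keep(values):
    """Placeholder for a (frozen) Transformer Block: returns its input unchanged."""
    return values


def half(values):
    """Placeholder for a trainable Adapter Layer: halves every element."""
    return [0.5 * v for v in values]


# 12 stages, as in the text; each stand-in stage scales the input by 1.5.
embedding = run_encoder_stages([1.0, 2.0], [keep] * 12, [half] * 12)
```

Keeping the two branches parallel means the adapter output acts as a learned correction added to the Transformer Block output, which matches the description of freezing the Transformer and training only the adapter.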

[0129] Step 3: Construct a prompt encoder to extract features from skeleton prompts and dot prompts, and generate two types of prompt embeddings: dense embeddings and sparse embeddings. Both of these embeddings can provide manual settings for the segmentation position, and therefore can serve as strong prior knowledge for curve segmentation. Figure 2 shows the prompt encoder module of the present invention. The specific construction process is as follows:

[0130] Step 3.1: The process of generating dense embeddings can be represented as follows:

[0131] The skeleton prompt $X_{SK} \in \mathbb{R}^{H_{SK} \times W_{SK} \times C_{SK}}$, where $H_{SK}$ represents the height of the skeleton prompt, $W_{SK}$ its width, and $C_{SK}$ its dimension, first passes through a convolutional layer $\mathrm{Conv}_{PE\_1}$, which reduces the resolution to one-quarter of its original value, and then through an activation layer $\mathrm{ReLU}_{PE\_1}$. It then passes through a convolutional layer $\mathrm{Conv}_{PE\_2}$, which again downsamples the feature map resolution to one-quarter of its value, and through an activation layer $\mathrm{ReLU}_{PE\_2}$ to obtain the final dense embedding $X_{DE}$, where $C_{DE}$ represents the dimension of the dense embedding. The process can be represented by the following formula:

[0132] $X_{DE} = \mathrm{ReLU}_{PE\_2}(\mathrm{Conv}_{PE\_2}(\mathrm{ReLU}_{PE\_1}(\mathrm{Conv}_{PE\_1}(X_{SK}))))$

[0133] Step 3.2: The process of generating sparse embeddings can be represented as follows:

[0134] Sparse cue points are a number of points in the original image, each labeled 0 or 1 to indicate foreground or background. Assume there are N coordinates $[P_0(x,y), P_1(x,y), \ldots, P_{N-1}(x,y)]$. First, the coordinates are linearly transformed to $[0, 2\pi]$; these N coordinates are then combined to obtain $P_n$, which is multiplied by the Gaussian distribution matrix $G$, as shown below:

[0135] $P_n' = P_n * G$

[0136] where $P_n'$ is the product matrix generated by multiplying the coordinates $P_n$ by $G$, $*$ indicates matrix multiplication, N is the number of points in the sparse cue, and C represents the pre-defined feature dimension of the generated sparse embedding. Finally, the sine and cosine values of each coordinate are concatenated to obtain the sparse embedding.
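The three operations of Step 3.2 (scale coordinates to $[0, 2\pi]$, multiply by a Gaussian matrix, concatenate sine and cosine) can be sketched in plain Python as below. The shape of $G$ is not recoverable from the text, so this sketch assumes a 2 x (C/2) matrix drawn from a standard normal distribution, which makes the concatenated output C-dimensional; it also assumes coordinates are scaled by the image width and height. The foreground/background label of each point is not encoded here, and all names are hypothetical.

```python
import math
import random


def sparse_embedding(points, image_size, embed_dim=8, seed=0):
    """Random-Fourier-style positional embedding for (x, y) prompt points."""
    if embed_dim % 2:
        raise ValueError("embed_dim must be even (half sine, half cosine)")
    height, width = image_size
    half = embed_dim // 2
    rng = random.Random(seed)
    # Assumed Gaussian matrix G with shape 2 x (embed_dim / 2).
    g = [[rng.gauss(0.0, 1.0) for _ in range(half)] for _ in range(2)]
    embeddings = []
    for x, y in points:
        px = 2 * math.pi * x / width       # linear map of x to [0, 2*pi]
        py = 2 * math.pi * y / height      # linear map of y to [0, 2*pi]
        projected = [px * g[0][j] + py * g[1][j] for j in range(half)]  # P_n * G
        embeddings.append([math.sin(p) for p in projected] +
                          [math.cos(p) for p in projected])
    return embeddings


# Two hypothetical prompt points on a 64x64 image.
tokens = sparse_embedding([(0, 0), (10, 20)], (64, 64), embed_dim=8)
```

The origin always maps to zeros in the sine half and ones in the cosine half, regardless of $G$, which is a convenient sanity check.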

[0137] Step 3.3: Fuse the dense embedding $X_{DE}$ obtained in Step 3.1 with the image embedding $X_{IE}$ obtained in Step 2.4 by adding them point by point; the fused result $X_{Out}^{PE}$ is used as the output of the prompt encoder. The formula is expressed as follows:

[0138] $X_{Out}^{PE} = X_{DE} + X_{IE}$

[0139] Step 4: Construct a mask decoder based on self-attention and cross-attention to fuse the image embedding and the two cue embeddings: skeleton embedding (dense embedding) and dot embedding (sparse embedding). Decode the fused features to obtain the final decoded feature layer. During the decoding process, the fused features and dot embeddings attend to each other through cross-attention, and the positions of Q, K, and V are swapped between the two cross-attention operations to enhance the feature layer. Figure 3 shows the mask decoder module of the present invention. The specific construction process is as follows:

[0140] Step 4.1: Apply self-attention to the sparse embedding obtained in Step 3.2 to obtain the enhanced sparse embedding. The enhanced sparse embedding is then processed by an Mlp to obtain a query matrix. Expressed as a formula:

[0141]

[0142] Step 4.2: The fusion result $X_{Out}^{PE}$ obtained in step 3.3 is passed through three Mlp layers to obtain three matrices. Cross-attention is then performed between these matrices and the query matrix from step 4.1, and the result is added to the enhanced sparse embedding and passed through a normalization layer to obtain the intermediate sparse embedding. The intermediate sparse embedding then passes through another Mlp layer, the output is added to the intermediate sparse embedding, and the sum is passed through a normalization layer to obtain the final learned sparse embedding. Expressed using the following formulas:

[0143]

[0144]

[0145]

[0146]

[0147]

[0148] Here, Cross-Attention1 represents the cross-attention operation, and Norm1 and Norm2 represent two normalization layers.
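
Steps 4.1 and 4.2 can be illustrated with the following minimal sketch. It is not the patented implementation: the normalization layers are assumed to be layer normalization without learned parameters, each Mlp is assumed to consist of two linear layers around one ReLU, the weights are random and untrained, and all names (`update_sparse`, `make_mlp`, the dictionary keys) and sizes are hypothetical. The sketch also assumes that two of the three matrices derived from the fused features act as keys and values in Cross-Attention1, while the third is kept for step 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    """Normalization over the feature axis (no learned scale or shift, assumed)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def make_mlp(dim):
    """Two linear layers around one ReLU, with random untrained weights (assumed)."""
    w1 = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    w2 = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def update_sparse(sparse, fused, mlp):
    """Steps 4.1 and 4.2 for one block. sparse: (N, C); fused: (H*W, C)."""
    s_enh = attention(sparse, sparse, sparse)             # step 4.1: self-attention
    q_s = mlp["q_sparse"](s_enh)                          # step 4.1: query matrix
    # Step 4.2: three Mlp layers applied to the fused features.
    q_f, k_f, v_f = (mlp[n](fused) for n in ("q_fused", "k_fused", "v_fused"))
    s_mid = layer_norm(s_enh + attention(q_s, k_f, v_f))  # Cross-Attention1 + Norm1
    s_final = layer_norm(s_mid + mlp["ffn"](s_mid))       # Mlp + Norm2
    return s_final, q_f                                   # q_f kept for step 4.3 (assumed)

C, N, HW = 16, 3, 64                                      # hypothetical sizes
names = ("q_sparse", "q_fused", "k_fused", "v_fused", "ffn")
mlp = {n: make_mlp(C) for n in names}
sparse = rng.normal(size=(N, C))                          # stand-in sparse embedding
fused = rng.normal(size=(HW, C))                          # stand-in fused features
s_final, q_f = update_sparse(sparse, fused, mlp)
```

In this direction the point tokens are the queries, so the output keeps one row per prompt point while aggregating information from every spatial position of the fused features.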

[0149] Step 4.3: The final sparse embedding obtained in step 4.2 is passed through two Mlp layers to obtain two matrices. Cross-attention is then performed between these two matrices and the matrix obtained in step 4.2 to produce the enhanced fusion embedding, which is added point by point to the fused features obtained in step 3.3. A normalization layer is then applied to obtain the output of the mask decoding processing block. The final feature map is obtained after two such processing blocks. The formulas for this step are expressed as follows:

[0150]

[0151]

[0152]

[0153]

[0154]

[0155] Here, Cross-Attention2 denotes the cross-attention operation and Norm3 denotes a normalization layer. One mask decoding processing block yields an intermediate feature map, and the final feature map is obtained after two mask decoding processing blocks. MD2 is the second mask decoding processing block, which follows the same process as the first.
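
The swap of roles in step 4.3 can be sketched as follows, again only as an illustration. Here the fused features supply the queries and the final sparse embedding supplies the keys and values, which is one reading of the statement that Q, K, and V change positions between the two cross-attention operations. Layer normalization, the Mlp layout, the random weights, and the names `update_fused`, `q_fused`, and `s_final` are assumptions, and the inputs are random stand-ins for the quantities produced in steps 3.3 and 4.2.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    """Normalization over the feature axis (no learned scale or shift, assumed)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def make_mlp(dim):
    """Two linear layers around one ReLU, with random untrained weights (assumed)."""
    w1 = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    w2 = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def update_fused(fused, q_fused, s_final, mlp_k, mlp_v):
    """Step 4.3. fused, q_fused: (H*W, C); s_final: (N, C)."""
    k_s, v_s = mlp_k(s_final), mlp_v(s_final)   # two Mlp layers on the sparse embedding
    enhanced = attention(q_fused, k_s, v_s)     # Cross-Attention2: image queries, point keys/values
    return layer_norm(fused + enhanced)         # point-by-point addition, then Norm3

C, N, HW = 16, 3, 64                            # hypothetical sizes
fused = rng.normal(size=(HW, C))                # stand-in for the step 3.3 output
q_fused = make_mlp(C)(fused)                    # stand-in for the matrix from step 4.2
s_final = rng.normal(size=(N, C))               # stand-in for the final sparse embedding
out = update_fused(fused, q_fused, s_final, make_mlp(C), make_mlp(C))
```

The output has the same shape as the fused features, so a second block with the same structure (MD2 in the text) can consume it directly.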

[0156] Step 5: Construct a prediction head based on transposed convolution. The final decoded features are restored to the original image resolution using two transposed convolutions, an activation function is applied to add non-linearity, and the final curve structure segmentation result is generated. Figure 4 shows the prediction head module of the present invention. The specific construction process is as follows:

[0157] Step 5.1: The final feature map generated by the mask decoder in step 4.3 is restored to the original image size, i.e. its resolution before being input into the network, by two transposed convolutions. This process can be expressed by the following formulas:

[0158]

[0159]

[0160] Step 5.2: The feature map $X_{Trans2}$ obtained in step 5.1 is passed through a Sigmoid activation to generate the final prediction result, i.e. the prediction map equals $\mathrm{Sigmoid}(X_{Trans2})$.
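
A minimal sketch of the prediction head in steps 5.1 and 5.2 is given below. The kernel size and stride of the transposed convolutions are not stated in this section; the sketch assumes kernel 2 and stride 2, so that two layers upsample a feature map at one quarter of the input resolution back to full size. The channel counts, the random weights, and the names `transposed_conv2x` and `prediction_head` are hypothetical.

```python
import numpy as np

def transposed_conv2x(x, w):
    """Transposed convolution with kernel 2 and stride 2 (assumed configuration).

    x: (C_in, H, W) feature map; w: (C_in, C_out, 2, 2) kernel.
    Each input pixel is expanded into a 2x2 output patch, giving (C_out, 2H, 2W).
    """
    _, h, wd = x.shape
    c_out = w.shape[1]
    # out[o, i, a, j, b] = sum_c x[c, i, j] * w[c, o, a, b]
    out = np.einsum("cij,coab->oiajb", x, w)
    return out.reshape(c_out, 2 * h, 2 * wd)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prediction_head(feat, w1, w2):
    x_trans1 = transposed_conv2x(feat, w1)       # step 5.1, first transposed convolution
    x_trans2 = transposed_conv2x(x_trans1, w2)   # step 5.1, second transposed convolution
    return sigmoid(x_trans2)                     # step 5.2, Sigmoid activation

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 8, 8))               # hypothetical decoder output
w1 = 0.1 * rng.normal(size=(16, 4, 2, 2))        # untrained weights
w2 = 0.1 * rng.normal(size=(4, 1, 2, 2))
pred = prediction_head(feat, w1, w2)
print(pred.shape)                                # (1, 32, 32)
```

With non-overlapping 2x2 kernels, each transposed convolution doubles the height and width, so the output probabilities in `pred` lie on a grid four times finer than `feat` in each direction.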

[0161] Step 6: Construct an image segmentation framework consisting of four parts: an image encoder structure combining Transformer and CNN, a prompt encoder structure that generates dense and sparse embeddings, a mask decoder structure, and a final prediction head structure, to complete the segmentation of medical images with curve structures. Figure 5 is a network flowchart of the present invention. The specific construction process is as follows:

[0162] Step 6.1: The original image is input into the image encoder that combines Transformer and CNN. The Transformer is frozen, and an adapter with a CNN structure is used to learn the features of curve-structured medical images step by step. Finally, the image embedding is output after 12 stages in which the Transformer and the adapter run in parallel.

[0163] Step 6.2: The skeleton prompt and the point prompt are input into the prompt encoder to obtain the dense embedding and the sparse embedding, respectively. The dense embedding and the image embedding from step 6.1 are then added point by point, and the sum is used as the output of the prompt encoder.

[0164] Step 6.3: The fused features and the sparse embedding are passed through two mask decoder blocks to obtain the final feature map.

[0165] Step 6.4: The feature map output by the mask decoder is passed through two transposed convolutions to restore it to the original image size, i.e. its size before being input into the network, and then through a Sigmoid activation function. This yields the final curve structure segmentation prediction map.

Claims

1. A medical image segmentation method combining curve structure cues and deep neural networks, characterized in that it includes the following steps:
Step 1: Select a publicly available medical image dataset for curve structure segmentation, and perform data augmentation and preprocessing on the training set in the dataset;
Step 2: Construct an image encoder based on Transformer and CNN to extract image features and generate the image embedding;
Step 3: Construct a prompt encoder, extract prompt features, and generate two types of prompt embeddings: a dense embedding and a sparse embedding;
Step 4: Design a mask decoder based on self-attention and cross-attention, fuse the image embedding and the prompt embeddings, and decode the fused features to obtain the final decoded features;
Step 5: Design a prediction head based on transposed convolution to process the final decoded features and generate the final segmentation result;
Step 6: Design an image segmentation framework consisting of four parts: an encoder structure combining Transformer and CNN, a prompt encoder structure with dense and sparse embeddings, a mask decoder structure, and a final prediction head structure, to complete the segmentation of medical images with curve structures.
Step 3 specifically involves:
Step 3.1: The dense embedding is generated as follows: the skeleton prompt, described by its height, width, and dimension, is passed in sequence through a convolutional layer, an activation layer, a convolutional layer, and an activation layer to obtain the final dense embedding, whose last dimension is the dimension of the dense embedding;
Step 3.2: The sparse embedding is generated as follows: sparse prompt points are a number of points in the original image, each labeled 0 or 1 to indicate whether it marks foreground or background. Assume there are N coordinates. First, the coordinates are linearly transformed to the range $[0, 2\pi]$, and the N coordinates are combined to obtain $P_n$, which is then multiplied by the Gaussian distribution matrix $G$: $P_n' = P_n * G$; where $P_n'$ is the product matrix generated by the matrix multiplication, * denotes matrix multiplication, N is the number of points in the sparse prompt, and C is the pre-defined feature dimension of the sparse embedding. Finally, the sine and cosine values of the projected coordinates are concatenated to obtain the sparse embedding;
Step 3.3: The final dense embedding obtained in step 3.1 and the final output image embedding obtained in step 2 are fused by adding their features point by point, and the fused result is used as the fused feature output by the prompt encoder, that is, the fused feature equals the sum of the dense embedding and the image embedding.

2. The medical image segmentation method combining curve structure cues and deep neural networks according to claim 1, characterized in that the datasets in step 1 include the DCA1 and CORN datasets for the segmentation of coronary arteries and corneal nerves, and the HRF, LES-AV, CHASEDB1, DRIVE, OCTA500, STARE, IOSTAR, and ORVS datasets for retinal vessel segmentation, among which the IOSTAR and ORVS datasets are used for zero-shot testing. The specific steps for data augmentation and preprocessing of the training set are as follows: on the original image and its corresponding Ground Truth, several sliding windows of different pixel sizes are used to extract local image patches to expand the dataset. At the same time, single connected-component masks are randomly combined, and the number of connected components in each generated mask is limited to a set size. Masks whose foreground is smaller than a set number of pixels are then excluded to ensure mask quality. Finally, all images are normalized.

3. The medical image segmentation method combining curve structure cues and deep neural networks according to claim 1, characterized in that Step 2 specifically involves:
Step 2.1: The mapping of any input feature map through the adapter layer module is as follows:
Step 2.1.1: The input feature map of the adapter layer module is passed through a global pooling layer to reduce its resolution, giving the feature map output by the global pooling layer;
Step 2.1.2: Linear and nonlinear mappings are applied to the feature map output by the global pooling layer: it passes in sequence through a layer, an activation layer ReLU3, and another layer, followed by Sigmoid activation, to obtain an intermediate feature map; this intermediate feature map is multiplied point by point with the feature map that was input to the adapter layer module, giving the feature map obtained through linear and nonlinear mapping;
Step 2.1.3: Further nonlinear activation is applied to the feature map obtained in step 2.1.2: it passes in sequence through a convolutional layer Conv3, an activation layer ReLU4, and a transposed convolutional layer TransConv, which doubles the feature map resolution, and then through an activation layer ReLU5; finally, the result is added point by point to the feature map that was input to the adapter layer module to obtain the final output, that is, output = ReLU5(TransConv(ReLU4(Conv3(mapped feature map)))) + adapter input;
Step 2.2: The mapping of any input feature map through the Transformer Block (TB) is as follows: the feature map input to the Transformer Block is described by its height, width, and dimension, and the feature map output by the Transformer Block has the same dimension, in other words, the Transformer Block does not change the dimension of the input feature map; the Softmax operation normalizes a vector by mapping all of its elements to the range 0-1; Q, K, and V are three matrices of the same dimension, obtained by passing the input feature map through three different Mlp layers; each Mlp consists of two linear layers and one activation layer; the attention computation also uses the dimension of the Q and K matrices;
Step 2.3: A patch embedding layer extracts shallow features of the image through a convolutional layer with a stride of S and a sliding window size of K to obtain the output result; the input is the original image before the patch embedding layer, where H represents the height of the original image, W represents its width, and C represents its dimension; K represents the sliding window size of the max pooling, and S represents the stride of the convolution kernel in each slide;
Step 2.4: The output obtained in step 2.3 is processed by the Transformer Block and the adapter layer over 12 stages, where the Transformer Block and the adapter layer are the modules described in steps 2.2 and 2.1, respectively. In each stage, the feature map passes through the Transformer Block and the adapter layer simultaneously, where i denotes the i-th stage. After 12 stages, the final output of the image encoder, the image embedding, is obtained.

4. The medical image segmentation method combining curve structure cues and deep neural networks according to claim 3, characterized in that the mask decoder in step 4 includes two identical mask decoding processing blocks, and the processing procedure is as follows:
Step 4.1: The sparse embedding obtained in step 3.2 is enhanced through self-attention to obtain the enhanced sparse embedding, which is then passed through an Mlp layer to obtain a query matrix; that is, the enhanced sparse embedding equals Self-Attention(sparse embedding), and the query matrix equals the Mlp output of the enhanced sparse embedding;
Step 4.2: The fused features obtained in step 3.3 are passed through three Mlp layers to obtain three matrices. Two of these matrices then undergo cross-attention with the query matrix from step 4.1, and the result is added to the enhanced sparse embedding and passed through a normalization layer Norm1 to obtain the intermediate sparse embedding. The intermediate sparse embedding then passes through an Mlp layer, the output is added element-wise to the intermediate sparse embedding, and the sum is passed through a normalization layer Norm2 to obtain the final learned sparse embedding; that is, intermediate sparse embedding = Norm1(enhanced sparse embedding + Cross-Attention1(...)), and final sparse embedding = Norm2(intermediate sparse embedding + Mlp(intermediate sparse embedding)), where Cross-Attention1 denotes the cross-attention operation;
Step 4.3: The final sparse embedding obtained in step 4.2 is passed through two Mlp layers to obtain two matrices. These two matrices then undergo cross-attention (Cross-Attention2) with the matrix obtained in step 4.2 to produce the enhanced fusion embedding, which is added point by point to the fused features obtained in step 3.3. The result is passed through a normalization layer Norm3 to obtain the output of the mask decoding processing block. After two such processing blocks, the final feature map is obtained, where Cross-Attention2 denotes the cross-attention operation and MD2 is the second mask decoding processing block, whose process is the same as that of the first mask decoding processing block.

5. The medical image segmentation method combining curve structure cues and deep neural networks according to claim 4, characterized in that Step 5 specifically involves:
Step 5.1: The final feature map generated by the mask decoding processing blocks in step 4.3 is restored to the original image size, i.e. its resolution before being input into the network, by two transposed convolutions, where $X_{Trans1}$ is the feature map output by the first transposed convolution and $X_{Trans2}$ is the feature map output by the second transposed convolution;
Step 5.2: The feature map $X_{Trans2}$ obtained in step 5.1 is passed through a Sigmoid activation to generate the final prediction result, which is the final curve structure segmentation prediction map.

6. The medical image segmentation method combining curve structure cues and deep neural networks according to claim 5, characterized in that Step 6 specifically involves:
Step 6.1: The original image is input into the encoder structure that combines Transformer and CNN. The Transformer is frozen, and an adapter with a CNN structure is used to progressively learn the features of curve-structured medical images. Finally, the image embedding is output after 12 stages in which the Transformer and the adapter run in parallel;
Step 6.2: The skeleton prompt and the point prompt are input into the prompt encoder to obtain the final dense embedding and the final sparse embedding, respectively. The final dense embedding and the image embedding from step 6.1 are then added point by point, and the result is used as the fused feature output by the prompt encoder;
Step 6.3: The fused feature and the final sparse embedding are passed through two mask decoder blocks to obtain the final feature map;
Step 6.4: The final feature map output by the mask decoder is passed through two transposed convolutions to restore it to the original image size, i.e. its size before being input into the network, and then through an activation function, which yields the final curve structure segmentation prediction map.