A blood vessel segmentation method, device, equipment and storage medium
By employing a spatiotemporal correspondence attention network and deep supervised training, the method addresses the structural complexity and class imbalance encountered in vessel segmentation of X-ray coronary angiography image sequences, yielding more accurate vessel segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF TECH
- Filing Date
- 2023-11-23
- Publication Date
- 2026-06-16
AI Technical Summary
Extracting blood vessels from X-ray coronary angiography image sequences is hampered by low contrast, high noise, artifacts caused by organ motion, and complex vascular structures. Furthermore, the blood vessel segmentation task suffers from an imbalance between foreground and background pixels.
A blood vessel segmentation method based on a spatiotemporal correspondence attention network is adopted. Within an encoder-decoder structure, a spatiotemporal correspondence attention component establishes the connection between the segmented frame and the preceding frame. Spatial and channel attention modules are combined to enhance the feature representation. Deep supervised training is carried out with the CE loss function and the Dice loss function.
It improves the accuracy and noise resistance of vessel segmentation, solves the class imbalance problem in vessel segmentation tasks, and enhances the segmentation ability of foreground vessels.
Figure CN117611815B_ABST
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to a method, apparatus, device, and storage medium for coronary artery sequence segmentation based on a spatiotemporal correspondence attention network.
Background Technology
[0002] Coronary artery disease is primarily caused by atherosclerosis, which leads to narrowing or blockage of blood vessels. X-ray coronary angiography is widely recognized as the gold standard for the clinical diagnosis and interventional treatment of coronary artery disease. Extracting blood vessels from X-ray coronary angiography image sequences is a crucial prerequisite for computer-aided diagnosis and treatment.
[0003] However, coronary angiography images have low contrast and high noise; they contain artifacts caused by organ motion, such as respiratory and cardiac motion; and the vascular structures are complex, with multi-scale, bifurcating, and overlapping vessels. Efficiently and accurately extracting blood vessels from X-ray coronary angiography image sequences therefore remains a challenging task.
[0004] Convolutional neural networks (CNNs) have made significant progress in blood vessel segmentation in recent years due to their weight sharing, sparse connections, and powerful feature extraction capabilities. UNet is a CNN-based medical image processing network designed with a symmetric encoder-decoder and skip connection structure. Because this network can fuse multi-level features of medical images, many extensions of it have been used to extract complex vascular structures. However, if these networks focus only on pixel-level segmentation and ignore the temporal information of vascular structures, the fragmentation and incompleteness of blood vessel segmentation are further exacerbated. Furthermore, blood vessel segmentation tasks often suffer from class imbalance in the number of foreground and background pixels.
Summary of the Invention
[0005] In view of the above problems, the present invention provides a method, apparatus, device and storage medium for overcoming or at least partially solving the above problems.
[0006] This invention provides the following solution:
[0007] A method for segmenting blood vessels, comprising:
[0008] Acquire multiple consecutive frames of angiography images, and select the current segmentation frame image to be segmented and several consecutive preceding frame images from the multiple consecutive frames of angiography images.
[0009] A value encoder is used to encode the current segmented frame image and several preceding frame images to obtain multi-level backbone features; a memory encoder is used to encode several preceding frame images and their respective segmentation masks to obtain memory features.
[0010] The spatiotemporal correspondence attention component is used to process the highest-level backbone features and the memory features to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map. The spatiotemporal correspondence attention component includes a spatial attention module that uses the highest-level backbone features to obtain the spatial attention feature map, a channel attention module that uses the highest-level backbone features to obtain the channel attention feature map, and a spatiotemporal correspondence module that uses the highest-level backbone features and the memory features to obtain the spatiotemporal correspondence feature map.
[0011] The spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map are concatenated along the channel dimension and then input into the decoder. At the same time, the multi-level backbone features are input into the decoder in a skip connection manner so that the decoder can generate a segmentation result.
[0012] Preferably, the value encoder and the memory encoder have the same structure: both use multiple convolutional blocks to extract multi-scale features of the input data, each followed by a max-pooling downsampling layer; each convolutional block includes two 3×3 convolutions, batch normalization, and a rectified linear unit (ReLU).
[0013] Preferably, the spatial attention module is used to weight all spatial locations and selectively aggregate the features of each spatial location; the spatial attention module is used to perform the following operations:
[0014] Query, key, and value convolution operations are respectively performed on the highest-level backbone feature $B \in \mathbb{R}^{C \times H \times W}$ to generate the corresponding feature maps $Q$, $K$, and $V \in \mathbb{R}^{C \times H \times W}$, where C represents the number of channels in the feature map, and H and W are the height and width of the image;
[0015] By multiplying K with the transpose of Q and then applying a softmax layer, the spatial attention matrix is represented as follows:
[0016] $$S_{(x,y)} = \frac{\exp\left(K_x \cdot Q_y^{\top}\right)}{\sum_{y'=1}^{N} \exp\left(K_x \cdot Q_{y'}^{\top}\right)}$$
[0017] In the formula: $S_{(x,y)}$ represents the influence of the y-th position on the x-th position, $K_x$ represents the x-th position of feature map K, $Q_y^{\top}$ represents the y-th position after the feature map Q is transposed, and the sum runs over all N = H × W positions;
[0018] The feature map V is multiplied by $S_{(x,y)}$ to yield the spatial attention-enhanced feature F′.
[0019] Preferably, the channel attention module is used to perform global representation and normalization of features from different channels, and the channel attention module is used to perform the following operations:
[0020] The channel dependency matrix is obtained by multiplying the highest-level backbone feature with its transpose. After softmax processing of the channel dependency matrix, the channel attention matrix is represented as follows:
[0021] $$C_{(x,y)} = \frac{\exp\left(B_x \cdot B_y^{\top}\right)}{\sum_{y'=1}^{C} \exp\left(B_x \cdot B_{y'}^{\top}\right)}$$
[0022] In the formula: $C_{(x,y)}$ represents the influence of the y-th channel on the x-th channel, $B_x$ represents the x-th channel of the highest-level backbone feature of the current frame, and $B_y^{\top}$ represents the y-th channel after the highest-level backbone feature of the current frame is transposed;
[0023] $C_{(x,y)}$ is multiplied by B to yield the channel attention-enhanced feature F″.
[0024] Preferably, the spatiotemporal correspondence module is used to compare the correlation between the current segmented frame and the previous frame in the backbone features to obtain the spatiotemporal correlation matrix, and then extract salient features for time discrimination from the high-level memory features; the spatiotemporal correspondence module is used to perform the following operations:
[0025] A convolution operation is applied to the highest-level backbone features to generate the feature maps of the current segmented frame and of the preceding frame. The feature map of the current segmented frame is reshaped into $Q_1$, and the feature map of the preceding frame is reshaped into $Q_0$. The similarity function is evaluated between the current segmented frame $Q_1$ and the transpose of the preceding frame $Q_0$; after further processing by a softmax layer, the spatiotemporal matching matrix is obtained as follows:
[0026] $$T_{(0,1)}(x,y) = \frac{\exp\left(t\left(Q_1^{x}, Q_0^{y}\right)\right)}{\sum_{y'} \exp\left(t\left(Q_1^{x}, Q_0^{y'}\right)\right)}$$
[0027] In the formula: $Q_1^{x}$ represents point x in the current segmented frame, $Q_0^{y}$ represents point y in the preceding frame, and t represents the similarity function, i.e., the negative squared Euclidean distance;
[0028] $T_{(0,1)}$ is multiplied with the memory feature M to obtain the temporal feature F″′ associated with the current frame.
[0029] Preferably, the four outputs of the decoder are trained under deep supervision by combining the CE loss function and the DICE loss function;
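[0030] The CE loss function is expressed by the following formula:
[0031] $$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
[0032] In the formula: $y_i \in \{0,1\}$ is the gold standard for position i, $p_i \in [0,1]$ is the predicted value at position i, and N is the total number of pixels;
[0033] The Dice loss function is expressed by the following formula:
[0034] $$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} y_i p_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \varepsilon}$$
[0035] In the formula: ε is a very small constant that keeps the value numerically stable.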
[0036] Preferably, the current segmented frame image is used as the preceding frame image for the next segmentation: it is input into the value encoder together with the next frame image, and it is input into the memory encoder together with its segmentation mask, as the preceding frame, to start the segmentation of the next frame.
[0037] A blood vessel segmentation device, comprising:
[0038] The image acquisition unit is used to acquire multiple consecutive frames of angiography images, and select and determine the current segmentation frame image to be segmented and several consecutive preceding frame images from the multiple consecutive frames of angiography images.
[0039] The feature acquisition unit is used to encode the current segmented frame image and several preceding frame images using a value encoder to obtain multi-level backbone features; and to encode several preceding frame images and their respective segmentation masks using a memory encoder to obtain memory features.
[0040] An attention feature map acquisition unit is used to process the highest-level backbone features and the memory features using a spatiotemporal correspondence attention component to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map; the spatiotemporal correspondence attention component includes a spatial attention module that uses the highest-level backbone features to obtain the spatial attention feature map, a channel attention module that uses the highest-level backbone features to obtain the channel attention feature map, and a spatiotemporal correspondence module that uses the highest-level backbone features and the memory features to obtain the spatiotemporal correspondence feature map.
[0041] The segmentation unit is used to input the spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map into the decoder after concatenating them in the channel dimension, and at the same time input the multi-level backbone features into the decoder in a skip connection manner so that the decoder can generate a segmentation result.
[0042] A blood vessel segmentation device, the device including a processor and a memory:
[0043] The memory is used to store program code and transmit the program code to the processor;
[0044] The processor is used to execute the above-described blood vessel segmentation method according to the instructions in the program code.
[0045] A computer-readable storage medium for storing program code for performing the above-described blood vessel segmentation method.
[0046] According to specific embodiments provided by the present invention, the present invention discloses the following technical effects:
[0047] This application provides a blood vessel segmentation method, apparatus, device, and storage medium. The method uses a spatiotemporal correspondence module to establish a connection between the segmented frame and the preceding frame, extracting spatiotemporal features from the preceding frame to enhance the feature representation of the current segmented frame. Simultaneously, a spatial attention module and a channel attention module are employed to establish spatial and channel dependencies in the current segmented frame, enhancing the segmentation capability of foreground blood vessels. Furthermore, deep supervised training of the four outputs of the decoder can be performed using the CE loss function and the DICE loss function to address the class imbalance problem caused by the imbalance between background and foreground pixels in X-ray coronary angiography images.
[0048] Of course, any product implementing this invention does not necessarily need to achieve all of the advantages described above at the same time.
Attached Figure Description
[0049] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly described below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.
[0050] Figure 1 is a flowchart of a blood vessel segmentation method provided in an embodiment of the present invention;
[0051] Figure 2 is a schematic diagram of the network structure of a blood vessel segmentation method provided in an embodiment of the present invention;
[0052] Figure 3 is a schematic diagram of the spatiotemporal correspondence attention component provided in an embodiment of the present invention;
[0053] Figure 4 is a schematic diagram of a blood vessel segmentation device provided in an embodiment of the present invention;
[0054] Figure 5 is a schematic diagram of a blood vessel segmentation device provided in an embodiment of the present invention.
Detailed Implementation
[0055] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention are within the scope of protection of the present invention.
[0056] Referring to Figure 1, this invention provides a blood vessel segmentation method. As shown in Figure 1, the method may include:
[0057] S101: Acquire multiple consecutive frames of angiography images, and select the current segmentation frame image to be segmented and several consecutive preceding frame images from the multiple consecutive frames of angiography images.
[0058] S102: Encode the current segmented frame image and several preceding frame images using a value encoder to obtain multi-level backbone features; encode the several preceding frame images and their respective segmentation masks using a memory encoder to obtain memory features.
[0059] S103: The spatiotemporal correspondence attention component is used to process the highest-level backbone features and the memory features to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map; the spatiotemporal correspondence attention component includes a spatial attention module that uses the highest-level backbone features to obtain the spatial attention feature map, a channel attention module that uses the highest-level backbone features to obtain the channel attention feature map, and a spatiotemporal correspondence module that uses the highest-level backbone features and the memory features to obtain the spatiotemporal correspondence feature map;
[0060] Furthermore, the spatial attention module is used to weight all spatial locations and selectively aggregate the features of each spatial location; the spatial attention module is used to perform the following operations:
[0061] Query, key, and value convolution operations are respectively performed on the highest-level backbone feature $B \in \mathbb{R}^{C \times H \times W}$ to generate the corresponding feature maps $Q$, $K$, and $V \in \mathbb{R}^{C \times H \times W}$, where C represents the number of channels in the feature map, and H and W are the height and width of the image;
[0062] By multiplying K with the transpose of Q and then applying a softmax layer, the spatial attention matrix is represented as follows:
[0063] $$S_{(x,y)} = \frac{\exp\left(K_x \cdot Q_y^{\top}\right)}{\sum_{y'=1}^{N} \exp\left(K_x \cdot Q_{y'}^{\top}\right)}$$
[0064] In the formula: $S_{(x,y)}$ represents the influence of the y-th position on the x-th position, $K_x$ represents the x-th position of feature map K, $Q_y^{\top}$ represents the y-th position after the feature map Q is transposed, and the sum runs over all N = H × W positions;
[0065] The feature map V is multiplied by $S_{(x,y)}$ to yield the spatial attention-enhanced feature F′.
[0066] The channel attention module is used to perform global representation and normalization of features from different channels, and the channel attention module is used to perform the following operations:
[0067] The channel dependency matrix is obtained by multiplying the highest-level backbone feature with its transpose. After softmax processing of the channel dependency matrix, the channel attention matrix is represented as follows:
[0068] $$C_{(x,y)} = \frac{\exp\left(B_x \cdot B_y^{\top}\right)}{\sum_{y'=1}^{C} \exp\left(B_x \cdot B_{y'}^{\top}\right)}$$
[0069] In the formula: $C_{(x,y)}$ represents the influence of the y-th channel on the x-th channel, $B_x$ represents the x-th channel of the highest-level backbone feature of the current frame, and $B_y^{\top}$ represents the y-th channel after the highest-level backbone feature of the current frame is transposed;
[0070] $C_{(x,y)}$ is multiplied by B to yield the channel attention-enhanced feature F″.
[0071] The spatiotemporal correspondence module is used to compare the correlation between the current segmented frame and the previous frame in the backbone features to obtain the spatiotemporal correlation matrix, and then extract the significant features for time discrimination from the high-level memory features; the spatiotemporal correspondence module is used to perform the following operations:
[0072] A convolution operation is applied to the highest-level backbone features to generate the feature maps of the current segmented frame and of the preceding frame. The feature map of the current segmented frame is then reshaped into $Q_1$, and the feature map of the preceding frame is reshaped into $Q_0$. The similarity function is evaluated between the current segmented frame $Q_1$ and the transpose of the preceding frame $Q_0$; after further processing by the softmax layer, the spatiotemporal matching matrix is obtained as follows:
[0073] $$T_{(0,1)}(x,y) = \frac{\exp\left(t\left(Q_1^{x}, Q_0^{y}\right)\right)}{\sum_{y'} \exp\left(t\left(Q_1^{x}, Q_0^{y'}\right)\right)}$$
[0074] In the formula: $Q_1^{x}$ represents point x in the current segmented frame, $Q_0^{y}$ represents point y in the preceding frame, and t represents the similarity function, i.e., the negative squared Euclidean distance.
[0075] $T_{(0,1)}$ is multiplied with the memory feature M to obtain the temporal feature F″′ associated with the current frame;
[0076] The negative squared Euclidean distance is used as the similarity function, which is expressed as follows:
[0077] $$t\left(Q_1^{x}, Q_0^{y}\right) = -\left\lVert Q_1^{x} - Q_0^{y} \right\rVert_2^2$$
[0078] S104: The spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map are concatenated along the channel dimension and then input into the decoder. At the same time, the multi-level backbone features are input into the decoder in a skip connection manner so that the decoder can generate a segmentation result.
[0079] Furthermore, the value encoder has the same structure as the memory encoder: both employ multiple convolutional blocks to extract multi-scale features of the input data, each followed by a max-pooling downsampling layer; each convolutional block includes two 3×3 convolutions, batch normalization, and a rectified linear unit (ReLU).
[0080] In order to supervise the output of the decoder, embodiments of this application may also provide deep supervised training of the four outputs of the decoder by combining the CE loss function and the DICE loss function;
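[0081] The CE loss function is expressed by the following formula:
[0082] $$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
[0083] In the formula: $y_i \in \{0,1\}$ is the gold standard for position i, $p_i \in [0,1]$ is the predicted value at position i, and N is the total number of pixels;
[0084] The Dice loss function is expressed by the following formula:
[0085] $$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} y_i p_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \varepsilon}$$
[0086] In the formula: ε is a very small constant that keeps the value numerically stable.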
[0087] To further enable the above operations to be performed on multiple consecutive frames of images, embodiments of this application may also provide that the current segmented frame image is used as the preceding frame image for the next segmentation: it is input into the value encoder together with the next frame image, and it is input into the memory encoder together with its segmentation mask, as the preceding frame, to start the segmentation of the next frame.
[0088] The vessel segmentation method provided in this application adopts an encoder-decoder structure, comprising three components: an encoder, a spatiotemporal correspondence attention component, and a decoder. There are two encoders: a value encoder and a memory encoder. The two encoder branches have the same structure. The value encoder takes two consecutive frames of images as input, encodes the coronary angiography images of the preceding frame and the current segmentation frame, and generates multi-level backbone features encoded by the frame number. The memory encoder takes the image of the previous frame and the segmentation mask as input, and encodes the preceding frame and its corresponding mask to obtain memory features. The spatiotemporal correspondence attention component includes one spatiotemporal correspondence module and two attention modules. The spatiotemporal correspondence module is used to establish the connection between the segmentation frame and the preceding frame, extracting spatiotemporal features from the preceding frame to enhance the feature representation of the current segmentation frame. Simultaneously, spatial attention modules and channel attention modules are used to establish the spatial and channel dependencies of the current segmentation frame, enhancing the segmentation ability of foreground vessels. The decoder is structurally symmetrical with the encoder. Skip connections connect the low-level and high-level features of the value encoder to the decoder at the same scale, receiving the same input features from the encoder.
[0089] In addition, the CE loss function and the DICE loss function can be combined to perform deep supervised training on the four outputs of the decoder to solve the class imbalance problem caused by the imbalance between background and foreground pixels in X-ray coronary angiography images.
[0090] The method provided in this application embodiment will be described in detail below using the processing of coronary artery sequence vascular images as an example. The method includes:
[0091] First, acquire multiple consecutive frames of coronary angiography images. Then, determine the current segmented frame image to be processed from the multiple consecutive frames of coronary angiography images, as well as one or more previous frames as preceding frames. Input the current segmented frame image and all preceding frames into a value encoder. The value encoder encodes the coronary angiography images of the preceding frames and the current segmented frame image, and outputs multi-level backbone features encoded with the number of frames.
[0092] All preceding coronary angiography images and the segmentation mask corresponding to each preceding frame image are input into the memory encoder. The memory encoder encodes the preceding frames and their corresponding masks to obtain memory features.
[0093] As shown in Figure 2, the two encoder branches have the same structure, using multiple convolutional blocks to extract multi-scale features from the input data, each followed by a max-pooling downsampling layer. Each convolutional block consists of two 3×3 convolutions, batch normalization, and rectified linear units (ReLU).
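The encoder level described above can be illustrated with a minimal single-channel sketch in plain Python. It is only an illustration under stated assumptions: batch normalization is omitted, the pooling window is assumed to be 2×2 (the text does not state it), padding is assumed to be zero padding, and the function names `conv3x3_same`, `relu`, `max_pool2x2`, and `encoder_level` are hypothetical.

```python
def conv3x3_same(image, kernel):
    """3x3 cross-correlation with zero padding on a single-channel 2D list."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # outside pixels count as zero
                        acc += image[ii][jj] * kernel[di + 1][dj + 1]
            out[i][j] = acc
    return out


def relu(feature_map):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2x2(feature_map):
    """2x2 max pooling with stride 2 (assumed window size)."""
    h, w = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            max(
                feature_map[2 * i][2 * j],
                feature_map[2 * i][2 * j + 1],
                feature_map[2 * i + 1][2 * j],
                feature_map[2 * i + 1][2 * j + 1],
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def encoder_level(image, kernel_a, kernel_b):
    """Two 3x3 convolutions with ReLU (batch normalization omitted), then pooling.

    Returns the pre-pooling feature (usable as a skip connection) and the
    downsampled feature passed to the next level.
    """
    x = relu(conv3x3_same(image, kernel_a))
    x = relu(conv3x3_same(x, kernel_b))
    return x, max_pool2x2(x)
```

Each level halves the spatial resolution, which is what gives the multi-level backbone features their different scales.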
[0094] In this embodiment, multiple consecutive frames of coronary angiography images and the mask images of the preceding frames are used as inputs to two encoders, and feature maps of multiple consecutive frames of coronary angiography images are output.
[0095] Then, the feature maps extracted by the encoders are input into the spatiotemporal correspondence attention component, which consists of three parts: a spatial attention module, a channel attention module, and a spatiotemporal correspondence module, as shown in Figure 3.
[0096] The highest-level backbone features of the current segmented frame are input into the spatial attention module. The spatial attention module selectively aggregates the features of each spatial location by weighting all spatial locations, enabling the model to capture long-range dependencies.
[0097] Specifically, the highest-level backbone feature $B \in \mathbb{R}^{C \times H \times W}$ of the current segmented frame is processed by query, key, and value convolution operations to generate the corresponding feature maps $Q$, $K$, and $V \in \mathbb{R}^{C \times H \times W}$. These three new feature maps are then reshaped to $\mathbb{R}^{C \times N}$, where C represents the number of channels in the feature map and N = H × W, with H and W the height and width of the image. By multiplying K with the transpose of Q and then applying a softmax layer, the spatial attention matrix can be obtained, as shown below:
[0098] $$S_{(x,y)} = \frac{\exp\left(K_x \cdot Q_y^{\top}\right)}{\sum_{y'=1}^{N} \exp\left(K_x \cdot Q_{y'}^{\top}\right)} \qquad (1)$$
[0099] In the formula: $S_{(x,y)}$ represents the influence of the y-th position on the x-th position, $K_x$ represents the x-th position of feature map K, and $Q_y^{\top}$ represents the y-th position after the feature map Q is transposed.
[0100] The feature map V is multiplied by $S_{(x,y)}$ to yield the spatial attention-enhanced feature F′, which is then reshaped into $\mathbb{R}^{C \times H \times W}$. The final output of the spatial attention module is defined as B+F′.
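The spatial attention computation described above can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the query, key, and value convolutions are assumed to have been applied already, so `Q`, `K`, `V`, and `B` are passed in as C×N nested lists; the softmax is assumed to normalize over the source position y; and the names `softmax` and `spatial_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(Q, K, V, B):
    """Q, K, V, B: C x N nested lists (C channels, N = H * W positions).

    Returns (S, out) where S is the N x N spatial attention matrix and
    out = B + F' is the module output, still in C x N layout.
    """
    C, N = len(B), len(B[0])
    S = []
    for x in range(N):
        # Energy between position x of K and every position y of Q.
        energy = [sum(K[c][x] * Q[c][y] for c in range(C)) for y in range(N)]
        S.append(softmax(energy))  # assumed normalization over y
    # F'[c][x] aggregates V over all positions y, weighted by S[x][y].
    F = [
        [sum(V[c][y] * S[x][y] for y in range(N)) for x in range(N)]
        for c in range(C)
    ]
    out = [[B[c][x] + F[c][x] for x in range(N)] for c in range(C)]
    return S, out
```

Because every position attends to every other position, the cost grows with N², which is one reason to apply such a module only to the low-resolution, highest-level backbone feature.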
[0101] The highest-level backbone features of the current segmented frame are input into the channel attention module. The channel attention module enhances the interaction and comparison between the features of different channels by performing global representation and normalization, thereby improving the feature recognition ability of the model.
[0102] Specifically, the channel dependency matrix is obtained by multiplying the highest-level backbone feature of the current segmented frame with its transpose. The channel dependency matrix is processed using softmax to obtain the channel attention matrix, as shown below:
[0103] $$C_{(x,y)} = \frac{\exp\left(B_x \cdot B_y^{\top}\right)}{\sum_{y'=1}^{C} \exp\left(B_x \cdot B_{y'}^{\top}\right)} \qquad (2)$$
[0104] In the formula: $C_{(x,y)}$ represents the influence of the y-th channel on the x-th channel, $B_x$ represents the x-th channel of the highest-level backbone feature of the current frame, and $B_y^{\top}$ represents the y-th channel after the highest-level backbone feature of the current frame is transposed;
[0105] $C_{(x,y)}$ is multiplied by B to yield the channel attention-enhanced feature F″. The final output of the channel attention module is defined as B+F″.
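A minimal plain-Python sketch of the channel attention step follows. It assumes the backbone feature has already been reshaped to a C×N nested list and that the softmax normalizes over the source channel y; the names `softmax` and `channel_attention` are hypothetical, and no learnable parameters are involved in this step.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(B):
    """B: C x N nested list. Returns (A, out) with A the C x C channel
    attention matrix and out = B + F'' in C x N layout."""
    C, N = len(B), len(B[0])
    A = []
    for x in range(C):
        # Dependency between channel x and every channel y (B times B transposed).
        energy = [sum(B[x][n] * B[y][n] for n in range(N)) for y in range(C)]
        A.append(softmax(energy))  # assumed normalization over y
    # F''[x][n] mixes all channels y according to A[x][y].
    F = [[sum(A[x][y] * B[y][n] for y in range(C)) for n in range(N)] for x in range(C)]
    out = [[B[x][n] + F[x][n] for n in range(N)] for x in range(C)]
    return A, out
```

Unlike the spatial module, the attention matrix here is only C×C, so its size does not depend on the image resolution.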
[0106] The highest-level backbone features and memory features are input into the spatiotemporal correspondence module. The correlation between the current segmented frame and the previous frame in the backbone features is compared to obtain the spatiotemporal correlation matrix. Then, significant features for time discrimination are extracted from the high-level memory features.
[0107] Specifically, a convolution operation is applied to the highest-level backbone features to generate the feature maps of the current segmented frame and of the preceding frame. The feature map of the current segmented frame is then reshaped into $Q_1$, and the feature map of the preceding frame is reshaped into $Q_0$. The similarity function is evaluated between the current segmented frame $Q_1$ and the transpose of the preceding frame $Q_0$; after further processing by a softmax layer, the spatiotemporal matching matrix can be obtained, as shown below:
[0108] $$T_{(0,1)}(x,y) = \frac{\exp\left(t\left(Q_1^{x}, Q_0^{y}\right)\right)}{\sum_{y'} \exp\left(t\left(Q_1^{x}, Q_0^{y'}\right)\right)} \qquad (3)$$
[0109] In the formula: $Q_1^{x}$ represents point x in the current segmented frame, $Q_0^{y}$ represents point y in the preceding frame, and t represents the similarity function, i.e., the negative squared Euclidean distance.
[0110] $T_{(0,1)}$ is multiplied with the memory feature M to yield the temporal feature F″′ associated with the current segmented frame. The final output of the spatiotemporal correspondence module is defined as $K_1$+M+F″′.
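[0111] Specifically, to ensure that the calculation of the spatiotemporal matching matrix is fast and efficient, the negative squared Euclidean distance is used as the similarity function:
[0112] $$t\left(Q_1^{x}, Q_0^{y}\right) = -\left\lVert Q_1^{x} - Q_0^{y} \right\rVert_2^2 \qquad (4)$$
[0113] Substituting formula (4) into formula (3) yields
[0114] $$T_{(0,1)}(x,y) = \frac{\exp\left(-\left\lVert Q_1^{x} - Q_0^{y} \right\rVert_2^2\right)}{\sum_{y'} \exp\left(-\left\lVert Q_1^{x} - Q_0^{y'} \right\rVert_2^2\right)} = \frac{\exp\left(2\,Q_1^{x} \cdot Q_0^{y} - \left\lVert Q_0^{y} \right\rVert_2^2\right)}{\sum_{y'} \exp\left(2\,Q_1^{x} \cdot Q_0^{y'} - \left\lVert Q_0^{y'} \right\rVert_2^2\right)}$$
[0115] The term $-\lVert Q_1^{x} \rVert_2^2$ obtained when expanding the squared distance does not depend on y, so it cancels between the numerator and the denominator. Therefore, the similarity function can be simplified to:
[0116] $$t\left(Q_1^{x}, Q_0^{y}\right) = 2\,Q_1^{x} \cdot Q_0^{y} - \left\lVert Q_0^{y} \right\rVert_2^2$$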
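The matching step and its simplification can be checked numerically with a short plain-Python sketch. It assumes the softmax normalizes over the points y of the preceding frame, which is the case in which the simplified similarity gives the same matching matrix as the full negative squared Euclidean distance. The names `softmax`, `matching_matrix`, and `read_memory` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matching_matrix(Q1, Q0, simplified=False):
    """Q1: C x N1 (current frame), Q0: C x N0 (preceding frame).

    Returns T with T[x][y] = softmax over y of the similarity between
    point x of the current frame and point y of the preceding frame.
    """
    C, N1, N0 = len(Q1), len(Q1[0]), len(Q0[0])
    T = []
    for x in range(N1):
        row = []
        for y in range(N0):
            if simplified:
                # 2 * q.k - ||k||^2 : the ||q||^2 term is dropped.
                dot = sum(Q1[c][x] * Q0[c][y] for c in range(C))
                norm = sum(Q0[c][y] ** 2 for c in range(C))
                row.append(2.0 * dot - norm)
            else:
                # Negative squared Euclidean distance.
                row.append(-sum((Q1[c][x] - Q0[c][y]) ** 2 for c in range(C)))
        T.append(softmax(row))
    return T


def read_memory(T, M):
    """M: Cm x N0 memory feature. Returns F''' of shape Cm x N1,
    with F'''[c][x] = sum over y of T[x][y] * M[c][y]."""
    return [[sum(T[x][y] * m_row[y] for y in range(len(m_row))) for x in range(len(T))]
            for m_row in M]
```

Dropping the query-norm term saves one norm computation per current-frame point and lets the similarity be evaluated as a single matrix product plus a per-column bias.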
[0117] This embodiment integrates a spatial attention module, a channel attention module, and a spatiotemporal correspondence module into a single component. The spatiotemporal correspondence module calculates the matching degree between the current segmented frame and the preceding frame to obtain a spatiotemporal matching matrix. Then, it extracts consistent spatiotemporal features from the preceding frame using the spatiotemporal matching matrix to enhance the feature representation of the current segmented frame. The spatial attention and channel attention modules take the highest-level backbone features as input and weight the spatial location and channel of the input features, making regions with high weights represent significant coronary artery features, thereby enhancing the ability to identify coronary artery features.
[0118] The spatial attention feature map, channel attention feature map, and spatiotemporal correspondence feature map obtained from the spatiotemporal correspondence attention component are concatenated along the channel dimension and then output to the decoder. Simultaneously, to preserve the multi-scale features of the encoder, skip connections are used to connect the low-level and high-level features of the value encoder to the decoder at the same scale. The decoder is structurally symmetrical to the encoder, receiving the same input features from the encoder. Specifically, each decoder consists of multiple convolutional blocks, followed by bilinear interpolation upsampling layers. In the final layer of the network, the decoder output is passed through a 1x1 convolutional kernel to generate the final segmentation mask.
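Concatenation along the channel dimension simply stacks the channels of the three feature maps. The sketch below shows this in plain Python for feature maps stored as C×N nested lists; the function name `concat_channels` and the flattened C×N layout are assumptions made for illustration.

```python
def concat_channels(*feature_maps):
    """Concatenate C_i x N feature maps along the channel axis.

    All inputs must share the same number of positions N; the result has
    sum(C_i) channels and N positions.
    """
    if not feature_maps:
        raise ValueError("at least one feature map is required")
    n = len(feature_maps[0][0])
    for fm in feature_maps:
        for row in fm:
            if len(row) != n:
                raise ValueError("all feature maps must have the same spatial size")
    # Copy rows so the result does not alias the inputs.
    return [list(row) for fm in feature_maps for row in fm]


# Hypothetical outputs of the three modules, each with C = 2 channels, N = 3 positions.
spatial = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
channel = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
temporal = [[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]
decoder_input = concat_channels(spatial, channel, temporal)  # 6 channels x 3 positions
```

With three inputs of C channels each, the first decoder block therefore has to accept 3C input channels.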
[0119] Based on the above embodiments, in order to segment the whole coronary angiography image sequence, once the current segmentation frame has been processed it is treated as the preceding frame for the next segmentation. The current segmentation frame and the next frame are input into the value encoder, and the current segmentation frame, now acting as the preceding frame, is input together with its segmentation mask into the memory encoder to start the segmentation of the next frame, as shown in Figure 2. In this embodiment, the angiography image sequence is input into the same network and segmented frame by frame without introducing additional parameters.
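The frame-by-frame propagation described above can be sketched as a simple loop. This is an illustration under assumptions: the mask of the first frame is assumed to be available (the text does not say how it is obtained), a single preceding frame is used, and `segment_frame` is a hypothetical callable standing in for the whole network (value encoder, memory encoder, attention component, and decoder).

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def segment_sequence(
    frames: Sequence[Frame],
    first_mask: Mask,
    segment_frame: Callable[[Frame, Mask, Frame], Mask],
) -> List[Mask]:
    """Segment a sequence frame by frame with the same network.

    segment_frame(prev_frame, prev_mask, current_frame) returns the mask of
    current_frame. After each step, the frame just segmented and its mask
    become the preceding frame and memory input for the next step.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    masks: List[Mask] = [first_mask]  # assumption: mask of frame 0 is given
    prev_frame, prev_mask = frames[0], first_mask
    for current in frames[1:]:
        mask = segment_frame(prev_frame, prev_mask, current)
        masks.append(mask)
        prev_frame, prev_mask = current, mask
    return masks
```

Since the same callable is reused at every step, no extra parameters are introduced as the sequence grows, which matches the description above.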
[0120] Based on the above embodiments, deep supervised training is performed on the four outputs of the decoder by combining the CE loss function and the Dice loss function.
[0121] Specifically, the decoder obtains four mask outputs at different scales through upsampling and skip connection operations. We use these masks together with the coronary gold standard as inputs to the loss function, as shown in Equations 6 and 7, where α and β are scaling coefficients.
[0122] $L_{loss} = 1 \cdot L_1 + 0.4 L_2 + 0.2 L_3 + 0.1 L_4 \quad (6)$
[0123] $L_i = \alpha L_{ce(i)} + \beta L_{dice(i)} \quad (7)$
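Equations (6) and (7) amount to a fixed weighted sum over the four decoder outputs. The following sketch only illustrates that arithmetic: the function names are hypothetical, the source gives no values for α and β (the defaults of 1.0 are placeholders), and the assumption that index 1 corresponds to the first entry of each list is ours.

```python
from typing import Sequence

# Per-output weights taken from Equation (6).
LEVEL_WEIGHTS = (1.0, 0.4, 0.2, 0.1)


def level_loss(ce: float, dice: float, alpha: float, beta: float) -> float:
    """Equation (7): combine CE and Dice for one decoder output."""
    return alpha * ce + beta * dice


def deep_supervision_loss(ce_losses: Sequence[float], dice_losses: Sequence[float],
                          alpha: float = 1.0, beta: float = 1.0) -> float:
    """Equation (6): weighted sum of the four per-output losses."""
    if len(ce_losses) != len(LEVEL_WEIGHTS) or len(dice_losses) != len(LEVEL_WEIGHTS):
        raise ValueError("expected one CE and one Dice value per decoder output")
    return sum(w * level_loss(ce, dice, alpha, beta)
               for w, ce, dice in zip(LEVEL_WEIGHTS, ce_losses, dice_losses))


# Example: every output has CE = 0.5 and Dice = 0.5, with alpha = beta = 1.
total = deep_supervision_loss([0.5] * 4, [0.5] * 4)
```

With equal per-output losses the first output contributes 1.0 of the total weight of 1.7, so it dominates the gradient while the other three outputs act as auxiliary supervision.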
[0124] The cross-entropy loss function compares the prediction and the label pixel by pixel, with each background pixel and each blood vessel pixel contributing equally to the CE loss, as shown in Equation (8):
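[0125] $L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \quad (8)$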
[0126] In the formula, $y_i \in \{0, 1\}$ is the gold-standard label at position $i$, $p_i \in [0, 1]$ is the predicted value at position $i$, and $N$ is the total number of pixels.
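The pixel-wise cross-entropy described here can be sketched in a few lines using the standard binary form, which matches the symbols y_i, p_i, and N defined above. The function name and the clamping constant `eps` are assumptions added for numerical safety and are not taken from the source.

```python
import math
from typing import Sequence


def cross_entropy_loss(labels: Sequence[int], probs: Sequence[float],
                       eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over all pixels; every pixel has equal weight."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# One vessel pixel and one background pixel, both predicted at 0.5.
loss = cross_entropy_loss([1, 0], [0.5, 0.5])
```

Because the sum is divided by the total pixel count, a frame that is mostly background is dominated by background terms, which is the imbalance the Dice term is introduced to counter.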
[0127] Meanwhile, to avoid the problem of training being dominated by background classes with more pixels, this embodiment of the application incorporates the Dice loss function, which calculates the overlap rate between the predicted blood vessels and the gold standard mask, making the training more inclined to learn blood vessel features with fewer pixels. The Dice loss function is defined as shown in formula (9):
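[0128] $L_{dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i p_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \varepsilon} \quad (9)$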
[0129] In the formula: ε is a very small constant that keeps the value stable.
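To make the effect of the overlap-based loss concrete, the sketch below uses a common soft Dice formulation with ε added to both numerator and denominator. That exact placement of ε, the function name, and the toy data are assumptions, since the source only states that ε is a small stabilizing constant.

```python
from typing import Sequence


def dice_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-6) -> float:
    """Soft Dice loss: 1 minus the overlap ratio between prediction and gold standard."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have equal length")
    intersection = sum(y * p for y, p in zip(labels, probs))
    denominator = sum(labels) + sum(probs)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Toy frame: 5 vessel pixels out of 100, as in a sparse angiogram.
labels = [1] * 5 + [0] * 95

# A prediction that calls everything background is almost entirely "correct"
# pixel by pixel, yet its Dice loss stays close to 1.
all_background = [0.01] * 100
missed_vessels = dice_loss(labels, all_background)

# A prediction that matches the gold standard has a Dice loss of about 0.
perfect = dice_loss(labels, [float(y) for y in labels])
```

The loss depends only on the overlap with the vessel pixels, so the 95 correctly predicted background pixels do not reduce it. This is why it counteracts background dominance.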
[0130] This embodiment uses a combination of Dice loss and CE loss to perform deep supervised training on masks at different levels, in order to solve the class imbalance problem caused by the imbalance between background and foreground pixels in X-ray coronary angiography images.
[0131] In summary, the vessel segmentation method provided in this application uses a spatiotemporal correspondence module to establish the connection between the segmented frame and the preceding frame, extracting spatiotemporal features from the preceding frame to enhance the feature representation of the current segmented frame. Simultaneously, spatial attention and channel attention modules are employed to establish spatial and channel dependencies in the current segmented frame, enhancing the segmentation capability of foreground vessels. Furthermore, the CE and DICE loss functions can be combined to perform deep supervised training on the four outputs of the decoder to address the class imbalance problem caused by the imbalance between background and foreground pixels in X-ray coronary angiography images.
[0132] Referring to Figure 4, an embodiment of this application may also provide a blood vessel segmentation device. As shown in Figure 4, the device may include:
[0133] Image acquisition unit 401 is used to acquire multiple consecutive frames of angiography images, and select and determine the current segmentation frame image to be segmented and several consecutive preceding frame images from the multiple consecutive frames of angiography images.
[0134] The feature acquisition unit 402 is used to encode the current segmented frame image and several preceding frame images using a value encoder to obtain multi-level backbone features; and to encode several preceding frame images and their respective segmentation masks using a memory encoder to obtain memory features.
[0135] Attention feature map acquisition unit 403 is used to process the highest-level backbone features and memory features using a spatiotemporal correspondence attention component to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map; the spatiotemporal correspondence attention component includes a spatial attention module for obtaining the spatial attention feature map using the highest-level backbone features, a channel attention module for obtaining the channel attention feature map using the highest-level backbone features, and a spatiotemporal correspondence module for obtaining the spatiotemporal correspondence feature map using the highest-level backbone features and memory features.
[0136] The segmentation unit 404 is used to input the spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map into the decoder after concatenating them in the channel dimension, and at the same time input the multi-level backbone features into the decoder in a skip connection manner so that the decoder can generate a segmentation result.
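The division of work among units 401 to 404 can be pictured as a small pipeline in which each unit consumes the outputs of the previous one. The sketch below is a structural illustration only: the class name, field names, and string-based stub functions are hypothetical, and passing the preceding masks in as an argument to `run` is an assumption about where they come from.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class VesselSegmentationDevice:
    acquire: Callable[[], Tuple[Any, Sequence[Any]]]                      # unit 401
    extract_features: Callable[[Any, Sequence[Any], Sequence[Any]],
                               Tuple[List[Any], Any]]                     # unit 402
    attend: Callable[[Any, Any], Tuple[Any, Any, Any]]                    # unit 403
    segment: Callable[[Any, Any, Any, List[Any]], Any]                    # unit 404

    def run(self, preceding_masks: Sequence[Any]) -> Any:
        current, preceding = self.acquire()
        backbone_levels, memory = self.extract_features(current, preceding, preceding_masks)
        # Only the highest-level backbone feature enters the attention component.
        spatial, channel, temporal = self.attend(backbone_levels[-1], memory)
        # All backbone levels are still handed to the decoder as skip connections.
        return self.segment(spatial, channel, temporal, backbone_levels)


# String stubs make the data flow visible without any real image processing.
device = VesselSegmentationDevice(
    acquire=lambda: ("f2", ["f0", "f1"]),
    extract_features=lambda cur, prev, masks: (["low:" + cur, "high:" + cur],
                                               "mem:" + "+".join(prev)),
    attend=lambda top, mem: (f"S({top})", f"C({top})", f"T({top},{mem})"),
    segment=lambda s, c, t, levels: {"fused": [s, c, t], "skips": levels},
)
result: Dict[str, Any] = device.run(preceding_masks=["m0", "m1"])
```

In the stub output, the spatial and channel entries depend only on the highest-level backbone feature, while the spatiotemporal entry also depends on the memory feature, matching the description of unit 403.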
[0137] This application embodiment can also provide a blood vessel segmentation device, the device including a processor and a memory:
[0138] The memory is used to store program code and transmit the program code to the processor;
[0139] The processor is used to execute the steps of the above-described blood vessel segmentation method according to the instructions in the program code.
[0140] As shown in Figure 5, a blood vessel segmentation device provided in this application embodiment may include: a processor 10, a memory 11, a communication interface 12, and a communication bus 13. The processor 10, memory 11, and communication interface 12 all communicate with each other through the communication bus 13.
[0141] In this embodiment, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, or other programmable logic devices.
[0142] The processor 10 can call the program stored in the memory 11. Specifically, the processor 10 can execute the operations in the embodiments of the blood vessel segmentation method.
[0143] The memory 11 is used to store one or more programs. The programs may include program code, which includes computer operation instructions. In this embodiment, the memory 11 stores at least a program for implementing the following functions:
[0144] Acquire multiple consecutive frames of angiography images, and select the current segmentation frame image to be segmented and several consecutive preceding frame images from the multiple consecutive frames of angiography images.
[0145] A value encoder is used to encode the current segmented frame image and several preceding frame images to obtain multi-level backbone features; a memory encoder is used to encode several preceding frame images and their respective segmentation masks to obtain memory features.
[0146] The spatiotemporal correspondence attention component is used to process the highest-level backbone features and the memory features to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map. The spatiotemporal correspondence attention component includes a spatial attention module that uses the highest-level backbone features to obtain the spatial attention feature map, a channel attention module that uses the highest-level backbone features to obtain the channel attention feature map, and a spatiotemporal correspondence module that uses the highest-level backbone features and the memory features to obtain the spatiotemporal correspondence feature map.
[0147] The spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map are concatenated along the channel dimension and then input into the decoder. At the same time, the multi-level backbone features are input into the decoder in a skip connection manner so that the decoder can generate a segmentation result.
[0148] In one possible implementation, the memory 11 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function (such as file creation or data read / write). The data storage area may store data created during use, such as initialization data.
[0149] In addition, memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device or other volatile solid-state storage device.
[0150] Communication interface 12 can be an interface for the communication module, used to connect with other devices or systems.
[0151] Of course, it should be noted that the structure shown in Figure 5 does not constitute a limitation on the blood vessel segmentation device in the embodiments of this application. In practical applications, the blood vessel segmentation device may include more or fewer components than those shown in Figure 5, or a combination of certain components.
[0152] This application embodiment may also provide a computer-readable storage medium for storing program code for performing the steps of the above-described blood vessel segmentation method.
[0153] It should be noted that, in the embodiments of this application, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0154] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.
[0155] The various embodiments in this specification are described in a progressive manner. For similar or identical parts, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device and system embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts can be found in the description of the method embodiments. The device and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of this embodiment. Those skilled in the art can understand and implement this without creative effort.
[0156] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.
Claims
1. A method for segmenting blood vessels, characterized in that the method comprises: acquiring multiple consecutive frames of angiography images, and selecting from them the current segmentation frame image to be segmented and several consecutive preceding frame images; encoding the current segmentation frame image and the several preceding frame images using a value encoder to obtain multi-level backbone features; encoding the several preceding frame images and their respective segmentation masks using a memory encoder to obtain memory features; processing the highest-level backbone features and the memory features using a spatiotemporal correspondence attention component to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map, the spatiotemporal correspondence attention component including a spatial attention module that uses the highest-level backbone features to obtain the spatial attention feature map, a channel attention module that uses the highest-level backbone features to obtain the channel attention feature map, and a spatiotemporal correspondence module that uses the highest-level backbone features and the memory features to obtain the spatiotemporal correspondence feature map; and concatenating the spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map along the channel dimension and inputting the result into a decoder, while inputting the multi-level backbone features into the decoder via skip connections, so that the decoder generates a segmentation result.
2. The blood vessel segmentation method according to claim 1, characterized in that the value encoder has the same structure as the memory encoder, both of which use multiple convolutional blocks to extract multi-scale features of the input data and pass them through a max pooling downsampling layer; each convolutional block includes two convolutions with 3×3 kernels, batch normalization, and rectified linear units.
3. The blood vessel segmentation method according to claim 1, characterized in that the spatial attention module is used to weight all spatial locations and selectively aggregate the features of each spatial location; the spatial attention module is used to perform the following operations: performing separate convolution operations on the highest-level backbone features to generate the corresponding feature maps $K$, $Q$, and $V$, where $C$ represents the number of channels in the feature map, and $H$ and $W$ are the height and width of the image; multiplying $K$ by the transpose of $Q$ and then applying a softmax layer, whereby the spatial attention matrix is represented as follows: in the formula, $S_{(x,y)}$ represents the influence of the y-th position on the x-th position, $K_x$ represents the x-th position of feature map $K$, and $Q_y^T$ represents the y-th position of the transposed feature map $Q$; and multiplying feature map $V$ by $S_{(x,y)}$ to obtain the spatial attention-enhanced feature $F'$.
4. The blood vessel segmentation method according to claim 1, characterized in that the channel attention module is used to perform global representation and normalization of features from different channels, and the channel attention module is used to perform the following operations: obtaining the channel dependency matrix by multiplying the highest-level backbone feature $B$ by its transpose; applying softmax processing to the channel dependency matrix, whereby the channel attention matrix is represented as follows: in the formula, $C_{(x,y)}$ represents the influence of the y-th channel on the x-th channel, $B_x$ represents the x-th channel of the highest-level backbone feature of the current frame, and $B_y^T$ represents the y-th channel of the transposed highest-level backbone feature of the current frame; and multiplying $C_{(x,y)}$ by $B$ to obtain the channel attention-enhanced feature $F''$.
5. The blood vessel segmentation method according to claim 1, characterized in that the spatiotemporal correspondence module is used to compare the correlation between the current segmented frame and the preceding frame in the backbone features to obtain the spatiotemporal correlation matrix, and then extract salient features for temporal discrimination from the high-level memory features; the spatiotemporal correspondence module is used to perform the following operations: performing separate convolution operations to generate the corresponding feature maps; reshaping the feature map of the current segmented frame and reshaping the feature map of the preceding frame; applying the similarity function to the current segmented frame $Q_1$ and the transpose of the preceding frame $Q_0$, and, after further processing by a softmax layer, obtaining the spatiotemporal matching matrix as follows: in the formula, $x$ denotes a point in the current segmented frame, $y$ denotes a point in the preceding frame, and $t$ denotes the similarity function, i.e., the negative squared Euclidean distance; and multiplying $T_{(0,1)}$ by the memory feature $M$ to obtain the temporal feature $F'''$ associated with the current frame.
6. The blood vessel segmentation method according to claim 1, characterized in that the four outputs of the decoder are trained using a combination of the CE loss function and the Dice loss function; the CE loss function is expressed by the following formula: $L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$; in the formula, $y_i \in \{0, 1\}$ is the gold-standard label at position $i$, $p_i \in [0, 1]$ is the predicted value at position $i$, and $N$ is the total number of pixels; the Dice loss function is expressed by the following formula: $L_{dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i p_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \varepsilon}$; in the formula, $\varepsilon$ is a very small constant that keeps the value stable.
7. The blood vessel segmentation method according to claim 1, characterized in that the current segmented frame image is used as the preceding frame image for the next segmentation; this preceding frame image and the next frame image are input together into the value encoder; and this preceding frame image and its segmentation mask are input together into the memory encoder to start the segmentation of the next frame.
8. A blood vessel segmentation device, characterized in that the device comprises: an image acquisition unit, used to acquire multiple consecutive frames of angiography images, and to select and determine from them the current segmentation frame image to be segmented and several consecutive preceding frame images; a feature acquisition unit, used to encode the current segmented frame image and several preceding frame images using a value encoder to obtain multi-level backbone features, and to encode several preceding frame images and their respective segmentation masks using a memory encoder to obtain memory features; an attention feature map acquisition unit, used to process the highest-level backbone features and the memory features using a spatiotemporal correspondence attention component to obtain a spatial attention feature map, a channel attention feature map, and a spatiotemporal correspondence feature map, the spatiotemporal correspondence attention component including a spatial attention module that uses the highest-level backbone features to obtain the spatial attention feature map, a channel attention module that uses the highest-level backbone features to obtain the channel attention feature map, and a spatiotemporal correspondence module that uses the highest-level backbone features to obtain the spatiotemporal correspondence feature map; and a segmentation unit, used to concatenate the spatial attention feature map, the channel attention feature map, and the spatiotemporal correspondence feature map along the channel dimension and input the result into the decoder, and at the same time to input the multi-level backbone features into the decoder via skip connections so that the decoder can generate a segmentation result.
9. A blood vessel segmentation device, characterized in that the device includes a processor and a memory; the memory is used to store program code and transmit the program code to the processor; the processor is configured to execute the blood vessel segmentation method according to any one of claims 1-7 according to the instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store program code for performing the blood vessel segmentation method according to any one of claims 1-7.