A breast ultrasound video lesion segmentation method

By applying a backbone network and Transformer module to breast ultrasound videos, combined with multi-task learning and instance sequence matching, the problem of insufficient lesion segmentation accuracy in ultrasound videos was solved, and high-precision segmentation of lesion areas was achieved.

CN114359556B · Active · Publication Date: 2026-06-16 · SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
Filing Date
2021-12-09
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing methods for segmenting breast ultrasound lesions mainly rely on static images and cannot effectively utilize the continuous frame information in ultrasound videos, resulting in insufficient segmentation accuracy. This is especially true when breast ultrasound images contain noise and the lesion area is small, making it difficult to achieve high-precision automatic lesion segmentation.

Method used

The backbone network is used to extract features from consecutive video frames, which are then encoded and decoded using the Transformer module. Through multi-task learning, the lesion category, bounding box, and segmentation mask are directly predicted. Instance sequence matching and loss function optimization are used to finally achieve end-to-end lesion segmentation.

Benefits of technology

It improves the lesion segmentation accuracy of the whole image during real-time ultrasound imaging, overcomes the problem that single-frame image segmentation is easily affected by noise, and achieves accurate segmentation of lesion areas.

Abstract

The application discloses a breast ultrasound video lesion segmentation method comprising the following steps: step 1, features are extracted from each of the consecutive video frames through a backbone network, and all feature maps are concatenated to obtain the feature map of the consecutive frames; step 2, the feature map obtained in step 1 is encoded and then decoded to obtain learned instance queries output in the order of the original video frame sequence; step 3, for the learned instance queries from step 2, the class is predicted through an FC layer, and the bounding box and segmentation mask vector are each predicted through an MLP, yielding the fixed-size class, bounding box, and segmentation mask vector of each target instance, output in the order of the original video frame sequence; and step 4, instance sequence matching and loss function calculation are performed on the results of step 3. By modeling the features of consecutive video frames, the application achieves accurate segmentation of lesions in the whole image acquired during real-time ultrasound imaging.

Description

Technical Field

[0001] This invention belongs to the field of computer-aided diagnostic technology and relates to a method for segmenting breast ultrasound video lesions.

Background Technology

[0002] Breast cancer is one of the three major malignant tumors in women and has become a major public health problem in society. The global incidence of breast cancer has been on the rise since the late 1970s. According to cancer incidence data released in 2020 by the National Cancer Center and the Bureau of Disease Prevention and Control of the Ministry of Health, breast cancer has surpassed lung cancer, the most common cancer in humans, to become the leading cause of cancer-related illness and death among women. The treatment outcome of breast cancer depends on early diagnosis and treatment of breast tumors. The cure rate for early-stage carcinoma in situ is 95%, while late-stage breast cancer is very difficult to cure. Ultrasound examination, with its advantages of being non-invasive, radiation-free, fast imaging, highly sensitive, relatively inexpensive, and simple to operate, is the preferred imaging examination and preoperative assessment method for breast cancer. In recent years, machine learning, especially deep learning, has been applied to the research of lesion segmentation in breast ultrasound images. In fact, the ultrasound imaging process is dynamic. If information from consecutive frames can be integrated, the problem of segmentation results using only a single frame image being easily affected by blurred boundaries and background noise in the ultrasound image can be overcome, thereby improving the accuracy of segmentation.

[0003] To obtain a deep learning-based video lesion segmentation model, it is typically necessary to annotate pixel-level lesion masks for all frames in a collected ultrasound video dataset, and then train the lesion segmentation model based on the video data. However, since ultrasound videos can only be annotated by qualified, experienced experts, and ultrasound video annotation is time-consuming and expensive, most literature on breast ultrasound lesion segmentation focuses on static ultrasound images, rather than ultrasound videos.

[0004] Breast ultrasound lesion areas typically vary from person to person and from disease to disease and lack fixed morphological and textural features; moreover, lesion areas are usually very small, making precise lesion segmentation difficult. Furthermore, breast ultrasound images are noisy, and breast lesions are hard to distinguish from fat. Existing ultrasound lesion segmentation methods use information from only a single frame, and most of them operate on lesion regions manually cropped from the entire image, i.e., pre-processed images. The accuracy of automatic lesion segmentation for the entire image acquired during real-time ultrasound imaging therefore needs improvement.

Summary of the Invention

[0005] The purpose of this invention is to provide a method for segmenting breast ultrasound video lesions. This method achieves accurate segmentation of lesions in the entire image acquired during real-time ultrasound imaging by effectively modeling the features of consecutive video frames.

[0006] The technical solution adopted in this invention is a method for segmenting breast ultrasound video lesions, which specifically includes the following steps:

[0007] Step 1: Extract features from consecutive video frames using the backbone network. By extracting features from each frame and stitching all the feature maps together, we obtain the feature maps of consecutive frames.

[0008] Step 2: The feature map obtained in Step 1 is encoded and decoded sequentially through the Transformer module to obtain the learned instance query output in the order of the original video frame sequence;

[0009] Step 3: For the instance query learned in Step 2, predict the category through FC and predict the bounding box and segmentation mask vector respectively through MLP. This results in a fixed-size target instance category, bounding box, and segmentation mask vector output in the order of the original video frame sequence, enabling the network to predict the category, bounding box, and segmentation mask in a unified manner.

[0010] Step 4: Perform instance sequence matching and calculate the loss function on the results obtained in Step 3.

[0011] The invention is further characterized by:

[0012] The specific process of step 1 is as follows:

[0013] Assume the initial video clip consists of T frames with a resolution of H0×W0. The backbone network uses ResNet50 or ResNet100 to generate a low-resolution feature map for each frame. The features of each frame are extracted and all the feature maps are concatenated to obtain the feature map of the consecutive frames.

[0014] In step 2, the Transformer module includes a Transformer encoder E and a Transformer decoder D.

[0015] The encoding process in step 2 is as follows: First, a 1×1 convolution is applied to reduce the dimension of the feature map extracted by the backbone network from C to d, thereby generating a new feature map $f_1$. To form a clip-level feature sequence that can be input into the Transformer encoder, the spatial and temporal dimensions of $f_1$ are flattened into one dimension, resulting in a two-dimensional feature map of size d×(T×H×W); the temporal order is always consistent with the initial input order. The Transformer encoder consists of K Transformer encoding layers; $f_1$ is input into these K layers, which iteratively refine the feature map $f_k$.

[0016] In the encoding process in step 2, each Transformer encoding layer $E_k(\cdot)$ includes a multi-head self-attention module and a fully connected feedforward network.

[0017] In step 2, fixed positional encodings are used to supply position information along the x, y, and time (T) dimensions of the consecutive video frames. For each coordinate dimension, d/3 sine and cosine functions of different frequencies are used independently:

[0018]

[0019] where pos is the position in the corresponding dimension. d must be divisible by 3 because the positional encodings of the three dimensions are concatenated to form the final d-channel positional encoding.

[0020] The decoding process of the Transformer decoder D is as follows: Assuming the model decodes n instances per frame, then for T frames the total number of instance queries is N = n·T. First, a set of learnable instance queries is randomly initialized. Then, in the K Transformer decoder layers, the initial object queries $q_0$ interact with the refined feature map $f_K$ to obtain instance-aware query embeddings. Compared with a Transformer encoder layer, each Transformer decoder layer $D_k(\cdot)$ has an additional multi-head cross-attention layer with the same dimension as the input features. Therefore, with the output of encoder E and the N instance queries Q as input, the Transformer decoder D outputs N instance features $q_K$ through model learning.

[0021] In step 3, the output $q_K$ of the Transformer decoder D is used directly to predict the category, bounding box, and segmentation mask vector simultaneously.

[0022] In step 3, the classification branch is an FC layer used to predict class confidence. The localization branch is a multilayer perceptron (MLP) with a hidden layer size of 256 that predicts the normalized bounding box center, width, and height. The mask branch is also an MLP, with a hidden layer size of 1024, that predicts the mask vector, where $n_k$ is the dimension of each mask vector. To enable the network to predict the class, bounding box, and segmentation mask in a unified manner, the mask compression coding module uses the discrete cosine transform to compress the instance mask into a one-dimensional segmentation mask vector of fixed length $n_k$.

[0023] In step 4, the output $q_K$ of the Transformer decoder D consists of N fixed-length sequences. To ensure that the relative positions of identical instances remain unchanged in the predicted sequences across different frames (i.e., to find the Ground Truth corresponding to each instance in each frame), an instance sequence matching strategy is used. The decoder obtains n instances in each frame, so the number of instance sequences is also n. Let $\hat{y}$ denote the set of predicted instance sequences and y the ground truth set of instance sequences. Assuming n is greater than the number of instances in the video clip, y is treated as a set padded to size n. To find a bipartite matching between the two sets, search for the permutation $\sigma \in S_n$ of n elements with the lowest loss:

[0024]

[0025] where $L_{match}(y_i, \hat{y}_{\sigma(i)})$ is the pairwise matching loss between the ground truth $y_i$ and the predicted instance sequence with index σ(i). The matching uses the normalized center coordinates, height, and width of the bounding box together with the predicted class label, and a "background" class is added to represent no detected object. Each element i of the GT set is regarded as in equation (3):

[0026] $y_i = \{(c_i, c_i, \ldots, c_i), (b_{i,0}, b_{i,1}, \ldots, b_{i,T})\}$ (3);

[0027] where $c_i$ is the target class label of this instance, which may be..., and $b_{i,t}$ is a vector with entries in [0,1]; T represents the number of input frames. Therefore, for the prediction of the instance with index σ(i), the probability of category $c_i$ is expressed as:

[0028]

[0029] The predicted bounding box sequence is represented as:

[0030] $\hat{b}_{\sigma(i)} = \{\hat{b}_{\sigma(i),0}, \ldots, \hat{b}_{\sigma(i),T}\}$ (5);

[0031] Then define the matching loss:

[0032]

[0033] where,

[0034] The loss is a linear combination of the negative log-likelihood of the class prediction, the box loss of the instance sequence, and the mask vector loss:

[0035]

[0036] where the optimal assignment is the one calculated in the matching equation;

[0037] where:

[0038]

[0039] The beneficial effects of this invention are as follows: Compared with static ultrasound image lesion segmentation methods, this invention uses the proposed multi-task video instance segmentation model and the set prediction method to effectively model the correspondence between the features of consecutive video frames and the temporal changes of lesion instances, thereby improving the segmentation accuracy of lesions in the entire image acquired during real-time ultrasound imaging.

Attached Figure Description

[0040] Figure 1 is a flowchart of a breast ultrasound video lesion segmentation method according to the present invention.

Detailed Implementation

[0041] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.

[0042] This invention discloses a method for segmenting lesions in breast ultrasound videos. It treats lesion segmentation in a video as a video instance segmentation problem and achieves accurate segmentation of lesions in the entire image acquired during real-time ultrasound imaging by effectively modeling the features of consecutive frames. The architecture consists of four parts: a backbone network; a Transformer module; a parallel regression network for class, bounding box, and segmentation mask vectors; and an instance sequence matching and segmentation mask compression coding module. Given a breast ultrasound video sequence, the network outputs the lesion mask for each frame of the sequence, achieving end-to-end instance segmentation without any post-processing. The backbone network extracts the features of each frame and concatenates all feature maps to obtain the features of the consecutive video frames. The Transformer module includes a Transformer encoder and a Transformer decoder. The Transformer encoder models long-distance dependencies between feature maps using a self-attention mechanism. Given a fixed set of learned object queries, the Transformer decoder calculates the relationship between the query objects and the global image context, obtaining the learned instance queries output in the order of the original video frame sequence. In the parallel regression network, for the learned instance queries, the class is predicted by a fully connected (FC) layer, and the bounding box and segmentation mask vector are predicted by multilayer perceptrons (MLPs), resulting in fixed-size target instance class, bounding box, and segmentation mask vectors output in the order of the original video frame sequence. The prediction sequence in each frame output by the Transformer decoder is unordered. Instance sequence matching uses the Hungarian algorithm to find the Ground Truth (GT) corresponding to each instance in each frame for supervised training. The segmentation mask compression coding module uses the discrete cosine transform to compress the instance mask into a one-dimensional, fixed-length segmentation mask vector, enabling the network to predict the class, bounding box, and segmentation mask in a unified manner.

[0043] The process of the breast ultrasound video lesion segmentation method of this invention is shown in Figure 1. The specific process is as follows:

[0044] Step 1: The backbone network extracts the original pixel-level feature sequence of the input video clip. Assume the initial video clip consists of T frames with a resolution of H0×W0. The backbone network can be either ResNet50 or ResNet100 and generates a low-resolution feature map for each frame. The features of each frame are extracted and all the feature maps are concatenated to obtain the feature map of the consecutive frames, where H×W is the spatial resolution of the concatenated feature map and C is the dimension of the feature map.
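
As a rough illustration of the shapes involved in Step 1, the sketch below computes the size of the stacked clip-level feature map. It assumes a standard ResNet50 final stage (2048 channels, total stride 32) and frame sizes divisible by the stride; the function name and defaults are hypothetical and not taken from the patent.

```python
def clip_feature_shape(T, H0, W0, channels=2048, stride=32):
    """Return the (T, C, H, W) shape of the per-frame feature maps stacked in frame order.

    Assumption: the backbone downsamples each frame by `stride` and emits
    `channels` feature channels, as a standard ResNet50 final stage does.
    """
    if H0 % stride or W0 % stride:
        raise ValueError("this sketch assumes H0 and W0 are divisible by the stride")
    return (T, channels, H0 // stride, W0 // stride)


# A 36-frame clip of 384x384 frames gives 36 feature maps of 12x12 each.
print(clip_feature_shape(36, 384, 384))
```

Stacking along the leading axis keeps the frames in their original order, which the later steps rely on.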

[0045] Step 2: The Transformer module includes a Transformer encoder E and a Transformer decoder D. The Transformer encoder models long-distance dependencies between feature maps through a self-attention mechanism. Given a fixed set of learned object queries, the Transformer decoder calculates the relationship between the query object and the global image context, and obtains the query of learned instances output in the order of the original video frame sequence.

[0046] First, a 1×1 convolution is applied to reduce the dimension of the feature map extracted by the backbone network from C to d (d < C), thereby generating a new feature map $f_1$. To form a clip-level feature sequence that can be input into the Transformer encoder, the spatial and temporal dimensions of $f_1$ are flattened into one dimension, resulting in a two-dimensional feature map of size d×(T×H×W). The temporal order is always consistent with the initial input order. The Transformer encoder consists of K Transformer encoding layers. $f_1$ is input into these K layers, which iteratively refine the feature map $f_k$. Each Transformer encoding layer $E_k(\cdot)$ consists of a multi-head self-attention module (MHSA) and a fully connected feedforward network (FFN).
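
The flattening of the temporal and spatial dimensions can be sketched as follows. This is a minimal pure-Python illustration that assumes a frame-major, then row-major layout, which keeps the sequence in the original frame order; the helper names are hypothetical.

```python
def flatten_index(t, h, w, H, W):
    """Map (frame, row, column) to a position in the flattened clip-level sequence."""
    return (t * H + h) * W + w


def flatten_clip(features):
    """Turn features[c][t][h][w] into a d x (T*H*W) matrix, one row per channel."""
    return [[value for frame in channel for row in frame for value in row]
            for channel in features]


# Tiny example: d=2 channels, T=2 frames, H=2, W=3.
d, T, H, W = 2, 2, 2, 3
features = [[[[(c, t, h, w) for w in range(W)] for h in range(H)]
             for t in range(T)] for c in range(d)]
sequence = flatten_clip(features)
print(len(sequence), len(sequence[0]))
```

Because all positions of frame t precede those of frame t+1, the temporal order of the input is preserved in the sequence.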

[0047] Fixed positional encodings are used to supply position information along the x, y, and time (T) dimensions of the consecutive video frames. For each coordinate dimension, d/3 sine and cosine functions of different frequencies are used independently:

[0048]

[0049] where pos is the position in the corresponding dimension. d must be divisible by 3 because the positional encodings of the three dimensions are concatenated to form the final d-channel positional encoding. These encodings are added to the input of each attention layer.
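
A minimal sketch of such a three-dimensional positional encoding is given below. It assumes the usual sinusoidal form from the original Transformer (sine on even channels, cosine on odd channels, base 10000), which is also what VisTR uses; the function names, the base, and the (t, y, x) concatenation order are illustrative assumptions rather than details confirmed by the patent text.

```python
import math


def axis_encoding(pos, channels, temperature=10000.0):
    """Sinusoidal code of length `channels` for a single coordinate value."""
    code = []
    for i in range(channels):
        k = i // 2  # each frequency is shared by one sine/cosine pair
        omega = 1.0 / (temperature ** (2 * k / channels))
        code.append(math.sin(pos * omega) if i % 2 == 0 else math.cos(pos * omega))
    return code


def positional_encoding_3d(t, y, x, d):
    """Concatenate d/3-channel codes for the time, y, and x coordinates."""
    if d % 3 != 0:
        raise ValueError("d must be divisible by 3")
    c = d // 3
    return axis_encoding(t, c) + axis_encoding(y, c) + axis_encoding(x, c)


print(len(positional_encoding_3d(t=2, y=5, x=7, d=12)))
```

Each coordinate gets its own d/3 channels, so the three codes never mix and the concatenated vector has exactly d channels.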

[0050] The Transformer decoder D aims to decode the most representative features of the instances in each frame, called instance-level features. Following VisTR[1], a fixed number of learnable input embeddings, called instance queries, are used to query instance features from the features output by the encoder. Assuming the model decodes n instances per frame, then for T frames the total number of instance queries is N = n·T. First, a set of learnable instance queries is randomly initialized. Then, in the K Transformer decoder layers, the initial object queries $q_0$ interact with the refined feature map $f_K$ to obtain instance-aware query embeddings. Compared with a Transformer encoder layer, each Transformer decoder layer $D_k(\cdot)$ has an additional multi-head cross-attention layer with the same dimensionality as the input features. Thus, taking the output of encoder E and the N instance queries Q as input, the Transformer decoder D outputs N instance features $q_K$ through model learning. The overall prediction follows the order of the input frames, and instances in different frames are predicted in the same order. Therefore, instances can be tracked across frames by directly linking the items at the corresponding indices.
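
The index-based tracking described above can be sketched as follows. The sketch assumes a frame-major layout of the N = n·T outputs (all n instance slots of frame 0, then frame 1, and so on); that layout and the helper names are assumptions made for illustration.

```python
def query_index(frame, instance, n):
    """Position of instance slot `instance` in frame `frame` among the N = n*T outputs."""
    return frame * n + instance


def link_instance_track(predictions, instance, n, T):
    """Collect the predictions for one instance slot across all T frames."""
    return [predictions[query_index(t, instance, n)] for t in range(T)]


# With n=3 slots per frame and T=4 frames, slot 1 lives at indices 1, 4, 7, 10.
n, T = 3, 4
predictions = list(range(n * T))
print(link_instance_track(predictions, 1, n, T))
```

No separate tracking step is needed under this layout, because the same slot index refers to the same instance in every frame.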

[0051] Step 3: Parallel regression network for category, bounding box, and segmentation mask vectors;

[0052] To apply the idea of multi-task learning and achieve end-to-end model training, following SOLQ[2], the output $q_K$ of the Transformer decoder D is used directly to predict the class, bounding box, and segmentation mask vector simultaneously. The classification branch is a fully connected (FC) layer used to predict class confidence. The localization branch is a multilayer perceptron (MLP) with a hidden layer size of 256 that predicts the normalized bounding box center, width, and height. Similar to the localization branch, the mask branch is also a multilayer perceptron (MLP), with a hidden layer size of 1024, that predicts the mask vector, where $n_k$ is the dimension of each mask vector. To enable the network to predict the class, bounding box, and segmentation mask in a unified manner, the mask compression coding module uses the discrete cosine transform to compress the instance mask into a one-dimensional segmentation mask vector of fixed length $n_k$.
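
The mask compression idea can be illustrated with a small pure-Python sketch: a square binary mask is transformed with an orthonormal 2D DCT-II, and only the first $n_k$ low-frequency coefficients are kept as the mask vector. The coefficient ordering (by increasing u+v), the 0.5 threshold, and all function names are assumptions for illustration; the patent text does not specify them.

```python
import math


def dct_matrix(N):
    """Orthonormal DCT-II basis matrix: row k is the k-th cosine basis vector."""
    return [[math.sqrt((1 if k == 0 else 2) / N)
             * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
             for n in range(N)] for k in range(N)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def low_frequency_order(N):
    """Coefficient positions sorted from low to high frequency (assumed ordering)."""
    return sorted(((u, v) for u in range(N) for v in range(N)),
                  key=lambda p: (p[0] + p[1], p[0]))


def encode_mask(mask, n_k):
    """Compress a square mask into its first n_k low-frequency DCT coefficients."""
    D = dct_matrix(len(mask))
    coeffs = matmul(matmul(D, mask), transpose(D))
    return [coeffs[u][v] for u, v in low_frequency_order(len(mask))[:n_k]]


def decode_mask(vector, N, threshold=0.5):
    """Rebuild an N x N binary mask from a mask vector via the inverse DCT."""
    coeffs = [[0.0] * N for _ in range(N)]
    for (u, v), value in zip(low_frequency_order(N), vector):
        coeffs[u][v] = value
    D = dct_matrix(N)
    pixels = matmul(matmul(transpose(D), coeffs), D)
    return [[1 if p >= threshold else 0 for p in row] for row in pixels]


mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
print(len(encode_mask(mask, 6)))
```

Keeping all N×N coefficients reproduces the mask exactly; keeping fewer trades reconstruction detail for a shorter, fixed-length regression target.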

[0053] Step 4: Instance sequence matching strategy and loss function. The output $q_K$ of the Transformer decoder D consists of N fixed-length sequences. To ensure that the relative positions of identical instances remain unchanged across the prediction sequences in different frames (i.e., to find the Ground Truth corresponding to each instance in each frame), an instance sequence matching strategy is used. The decoder obtains n instances in each frame, so the number of instance sequences is also n. Let $\hat{y}$ denote the set of predicted instance sequences and y the ground truth set of instance sequences. Assuming n is greater than the number of instances in the video clip, y is treated as a set padded to size n. To find a bipartite matching between the two sets, search for the permutation $\sigma \in S_n$ of n elements with the lowest loss:

[0054]

[0055] where $L_{match}(y_i, \hat{y}_{\sigma(i)})$ is the pairwise matching loss between the ground truth $y_i$ and the predicted instance sequence with index σ(i). The optimal assignment is computed efficiently using the Hungarian algorithm.
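
To make the matching objective concrete, the sketch below finds the lowest-cost permutation by brute force. It does not implement the Hungarian algorithm; it enumerates every permutation, which is only practical for small n but yields the same minimum-cost assignment. The cost values and the function name are made up for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (sigma, total) minimising sum_i cost[i][sigma[i]].

    cost[i][j] is the matching loss between ground-truth sequence i and
    predicted sequence j; sigma[i] is the prediction matched to ground truth i.
    """
    n = len(cost)
    best_sigma, best_total = None, float("inf")
    for sigma in permutations(range(n)):
        total = sum(cost[i][sigma[i]] for i in range(n))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total


cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
print(best_assignment(cost))  # ((1, 0, 2), 5)
```

In practice a polynomial-time solver would replace the enumeration, since the number of permutations grows factorially with n.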

[0056] Because directly calculating the similarity of mask sequences is computationally expensive, the similarity between the predicted sequence and the sequences in the ground truth (GT) set is calculated using the normalized center coordinates, height, and width of the bounding boxes together with the predicted class label, and a "background" class is added to represent no detected object. Given the N = n·T bounding box predictions for an object prediction sequence, the n box sequences, one per instance, can be associated through their indices. The matching loss takes into account both the class prediction and the similarity between the predicted boxes and the ground truth boxes. Each element i in the GT set can be viewed as:

[0057] $y_i = \{(c_i, c_i, \ldots, c_i), (b_{i,0}, b_{i,1}, \ldots, b_{i,T})\}$ (3);

[0058] where $c_i$ is the target class label of this instance (which may be ...), and $b_{i,t}$ is a vector with entries in [0,1] that defines the center coordinates of the ground truth bounding box and its relative height and width in frame t. T represents the number of input frames. Therefore, for the prediction of the instance with index σ(i), the probability of category $c_i$ is expressed as:

[0059]

[0060] The predicted bounding box sequence is represented as:

[0061] $\hat{b}_{\sigma(i)} = \{\hat{b}_{\sigma(i),0}, \ldots, \hat{b}_{\sigma(i),T}\}$ (5);

[0062] Then define the matching loss:

[0063]
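
Assuming the matching step follows the VisTR formulation that the description cites, the matching objective and the pairwise matching loss would take roughly the following form. Here $\hat{p}_{\sigma(i)}(c_i)$ denotes the predicted probability of class $c_i$ and $L_{box}$ the sequence-level box loss; applying the terms only to non-background ground-truth entries, and the absence of extra weights, are assumptions rather than details confirmed by the patent text.

```latex
\begin{align}
\hat{\sigma} &= \arg\min_{\sigma \in S_n} \sum_{i=1}^{n} L_{match}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr) \\
L_{match}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr) &= -\hat{p}_{\sigma(i)}(c_i) + L_{box}\bigl(b_i, \hat{b}_{\sigma(i)}\bigr)
\end{align}
```

A higher predicted probability for the correct class and a smaller box discrepancy both lower the cost of pairing ground truth i with prediction σ(i).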

[0064] Based on the above criteria, a one-to-one matching of the sequences can be found using the Hungarian algorithm. Given the optimal assignment, the loss function can be calculated; this is the Hungarian loss over all pairs matched in the previous step. The loss is a linear combination of the negative log-likelihood of the class prediction, the box loss of the instance sequence, and the mask vector loss:

[0065]

[0066] Here the optimal assignment is the one calculated in the matching equation. The Hungarian loss is used to train the entire framework. The bounding box loss is defined as a linear combination of the sequence-level L1 loss and the generalized IoU loss:

[0067]
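
A sequence-level box loss of this kind can be sketched as below for boxes given as normalized (center x, center y, width, height). The generalized IoU follows its standard definition; averaging over frames, the unit weights `w_l1` and `w_giou`, and the function names are assumptions, since the patent's own equation is not available in this text. The sketch also assumes every box has positive width and height.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def generalized_iou(a, b):
    """Generalized IoU of two boxes in (cx, cy, w, h) form; ranges over (-1, 1]."""
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def sequence_box_loss(gt_boxes, pred_boxes, w_l1=1.0, w_giou=1.0):
    """Average over frames of a weighted L1 term plus a (1 - GIoU) term."""
    total = 0.0
    for gt, pred in zip(gt_boxes, pred_boxes):
        l1 = sum(abs(g - p) for g, p in zip(gt, pred))
        total += w_l1 * l1 + w_giou * (1.0 - generalized_iou(gt, pred))
    return total / len(gt_boxes)


gt = [(0.25, 0.25, 0.2, 0.2), (0.30, 0.25, 0.2, 0.2)]
print(sequence_box_loss(gt, gt))
```

The GIoU term still provides a useful signal when the predicted and ground-truth boxes do not overlap at all, which a plain IoU term would not.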

[0068] The method for segmenting breast ultrasound video lesions of the present invention has the following characteristics:

[0069] 1. Lesion segmentation in breast ultrasound videos is treated as a video instance segmentation problem. Using the set prediction method of VisTR, the correspondence between the features of consecutive video frames and the temporal changes of lesion instances is modeled effectively, which improves segmentation accuracy.

[0070] 2. The end-to-end video instance segmentation model is trained using the idea of multi-task learning, directly predicting the category, bounding box, and segmentation mask vector in parallel from the output of the Transformer decoder.

Claims

1. A method for segmenting breast ultrasound video lesions, characterized in that it comprises the following steps: Step 1: For the initial video, features are extracted from T consecutive frames of the video through the backbone network; the features of each frame are extracted and all the feature maps are concatenated to obtain the feature map of the consecutive frames. Step 2: The feature map obtained in Step 1 is encoded and then decoded by the Transformer module to obtain the learned instance queries output in the order of the original video frame sequence. The Transformer module includes a Transformer encoder E and a Transformer decoder D; with the output of encoder E and the N instance queries Q as input, the decoder outputs N instance features $q_K$; the Transformer encoder consists of K Transformer encoding layers. Step 3: For the instance queries learned in Step 2, the category is predicted through an FC layer, and the bounding box and segmentation mask vector are each predicted through an MLP. This results in a fixed-size target instance category, bounding box, and segmentation mask vector output in the order of the original video frame sequence, enabling the network to predict the category, bounding box, and segmentation mask in a unified manner. Step 4: Instance sequence matching and loss function calculation are performed on the results obtained in Step 3, including: the output $q_K$ of the Transformer decoder D consists of N fixed-length sequences, and an instance sequence matching strategy is used; the decoder obtains 100 instances in each frame, therefore the number of instance sequences is also 100; $\hat{y}$ denotes the set of predicted instance sequences and y denotes the GT set of instance sequences; assuming the number of instance sequences is greater than the number of instances in the video clip, y is treated as a set padded to that size; to find a bipartite matching between the two sets, the permutation σ of the elements with the lowest loss is searched for, where $L_{match}(y_i, \hat{y}_{\sigma(i)})$ is the pairwise matching loss between the GT $y_i$ and the predicted instance sequence with index σ(i); using the normalized center coordinates, height, and width of the bounding box together with the predicted class label, a "background" class is added to represent no detected object, and each element i of the GT set is regarded as $y_i = \{(c_i, c_i, \ldots, c_i), (b_{i,0}, b_{i,1}, \ldots, b_{i,T})\}$, where $c_i$ is the target class label of this instance, $b_{i,t}$ is a vector, and T represents the number of input frames; therefore, for the prediction of the instance with index σ(i), the probability of category $c_i$ is expressed as ...; the predicted bounding box sequence is represented as $\hat{b}_{\sigma(i)} = \{\hat{b}_{\sigma(i),0}, \ldots, \hat{b}_{\sigma(i),T}\}$; the matching loss is defined as ...; the loss is a linear combination of the negative log-likelihood of the class prediction, the box loss of the instance sequence, and the mask vector loss: ..., where the optimal assignment is the one calculated in the matching equation.

2. The method for segmenting breast ultrasound video lesions according to claim 1, characterized in that: The backbone network is either ResNet50 or ResNet100.

3. The method for segmenting breast ultrasound video lesions according to claim 1, characterized in that: the encoding process in step 2 is as follows: first, a 1×1 convolution is applied to reduce the dimension of the feature map extracted by the backbone network from C to d, thereby generating a new feature map $f_1$; to form a clip-level feature sequence that can be input into the Transformer encoder, the spatial and temporal dimensions of $f_1$ are flattened into one dimension, resulting in a two-dimensional feature map of size d×(T×H×W); the temporal order always remains consistent with the order of the initial input; $f_1$ is input into the K Transformer encoding layers, which iteratively refine the feature map $f_k$; H×W represents the resolution of the concatenated feature map and C represents the dimension of the feature map; and each Transformer encoding layer $E_k(\cdot)$ includes a multi-head self-attention module and a fully connected feedforward network.

4. The method for segmenting breast ultrasound video lesions according to claim 1, characterized in that: the decoding process of the Transformer decoder D is as follows: assuming the model decodes n instances per frame, then for T frames the total number of instance queries is N = n·T; first, a set of learnable instance queries is randomly initialized; then, in the K Transformer decoder layers, the initial object queries $q_0$ interact with the refined feature map $f_K$ to obtain instance-aware query embeddings; compared with a Transformer encoder layer, each Transformer decoder layer $D_k(\cdot)$ has an additional multi-head cross-attention layer with the same dimension as the input features.

5. The method for segmenting breast ultrasound video lesions according to claim 4, characterized in that: in step 3, the classification branch is an FC layer used to predict class confidence; the localization branch is a multilayer perceptron (MLP) with a hidden layer size of 256 that predicts the normalized bounding box center, width, and height; the mask branch is also an MLP, with a hidden layer size of 1024, that predicts the mask vector, where $n_k$ is the dimension of each mask vector; to enable the network to predict the category, bounding box, and segmentation mask in a unified manner, the mask compression coding module uses the discrete cosine transform to compress the instance mask into a one-dimensional segmentation mask vector of fixed length $n_k$.