Generative modal completion method for medical image segmentation task

By adding an attention-guided contrastive learning network to StarGANv2, the problems of unstable multi-domain transformation and missing lesion details in multimodal medical image segmentation tasks are solved, and stable generation and high-precision segmentation of multimodal images are achieved.

CN115731227B · Active · Publication Date: 2026-06-30 · HEBEI UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HEBEI UNIV OF TECH
Filing Date
2022-12-01
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing generative modality completion methods suffer from instability in multimodal transformation and loss of lesion details in medical image segmentation tasks, especially inefficient and complex multi-domain transformation.

Method used

We employ a StarGANv2-based multi-domain generative network combined with an attention-guided contrastive learning network. Through attention feature extraction and contrastive loss calculation, we stabilize multi-domain style transfer and enhance lesion detail generation, thus constructing a complete dataset for multimodal medical image segmentation tasks.

Benefits of technology

It effectively fills in the missing modality data in multimodal medical image segmentation tasks, improves the accuracy and stability of the segmentation model, reduces the complexity of multimodal transformation, and enhances the segmentation accuracy of the multimodal segmentation model.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a generative modality completion method for medical image segmentation tasks. The method includes constructing a multi-domain generative network based on attention-guided contrastive learning: the entire network uses the StarGANv2 multi-domain generative model as its backbone and adds an attention-guided contrastive learning network and a contrastive learning loss. The attention-guided contrastive learning network constrains the generator's generation details at the image patch level and includes three parts: a source domain attention feature extraction network $A_s$, a target domain attention feature extraction network $A_t$, and a feature ranking module. The multi-domain generative network based on attention-guided contrastive learning is then used to obtain the missing modalities and achieve modality completion. This improves the reliability of the generated modalities and enhances the capabilities of subsequent multimodal medical image segmentation models.

Description

Technical Field

[0001] This invention relates to a generative modality completion method for medical image segmentation tasks with modal deficiencies.

Background Technology

[0002] Generative modality completion refers to a method that uses generative adversarial networks (GANs) to learn the domain information of existing modalities and generate modal information of the target domain, thereby completing the missing modalities in the dataset for subsequent image analysis tasks. In medical imaging, for datasets with missing modalities, generative modality completion methods can complete the patient's modal information, enabling multimodal medical image segmentation tasks. This is of great help in reducing patients' medical costs and further training and practical applications of computer-aided diagnostic systems.

[0003] Currently, generative modality completion methods for medical image segmentation tasks fall into two main categories. The first is cross-domain generative completion, in which a cross-domain generative adversarial model generates the target domain modality from the source domain modality, achieving unpaired intermodal transformation. Here, the source domain refers to the known image modal information, while the target domain refers to the unknown modal information to be generated. The completed modal data is then used for medical image segmentation. For example, the article by Pan et al., "Synthesizing Missing PET from MRI with Cycle-consistent Generative Adversarial Networks for Alzheimer's Disease Diagnosis," published at Medical Image Computing and Computer Assisted Intervention (MICCAI 2018), proposed a cross-domain model for PET and MRI interconversion using two conditional generative adversarial networks (CGANs) and a cycle-consistent approach. The patent "A Multi-Domain Image Conversion Method and System Based on Generative Adversarial Networks" (2019, CN 110084863 A) establishes a cross-domain modality conversion technology based on CycleGAN. These methods generate well between two modalities, but when more than two modalities are involved, a separate one-to-one cross-domain model must be built for each pair of modalities. Multi-domain conversion is achievable this way, but it is complex and inefficient. The second category generates modalities in a many-to-many manner with a multi-domain generative model, which establishes channels between every domain and all other domains so that a single model completes the many-to-many conversion. For example, StarGAN and the improved StarGANv2 handle multi-domain to multi-domain conversion well. In particular, StarGANv2 replaces the domain labels of StarGAN with domain-specific style features, which gives it greater style diversity than StarGAN. However, these models are usually applied to natural images; on medical images they suffer from unstable generated modal styles and poor generated details (in particular, lesions in the image are not restored well). Therefore, there is an urgent need for a generative modality completion method that solves these problems.

[0004] Therefore, generative modality completion for medical image segmentation tasks with missing modalities requires a multi-domain generative adversarial model that produces stable styles and well-preserved lesion details.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, the purpose of this invention is to provide a generative modality completion method for medical image segmentation tasks with missing modalities. This method avoids the complexity of training and implementing multimodal completion by eliminating the need to build multiple cross-domain generative models. Furthermore, this method demonstrates multi-domain generative capabilities in medical images that surpass those of StarGANv2. The completed multimodal images can be used for subsequent multimodal medical image segmentation tasks, resolving the problem of poor medical image segmentation performance caused by missing modalities.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows:

[0007] A generative modality completion method for modality-deficient medical image segmentation tasks, comprising the following:

[0008] Loading Modal Missing Case Data: A modal missing database for a specific disease is obtained. This database includes randomly missing multimodal image data for each patient, as well as the corresponding ground truth annotations of the lesion regions, i.e., masks of the lesion's location and shape. The image data for all corresponding modalities of a given case are first normalized and standardized in grayscale, and recorded as source domain modal images. Then, modal category annotations corresponding one-to-one with the source domain modal images are obtained as source domain modal category annotations. The source domain modal images and source domain modal category annotations are collectively referred to as source domain information. Subsequently, the source domain information is randomly shuffled (while ensuring the one-to-one correspondence between image data and modal categories remains), yielding target domain information, i.e., target domain modal images and target domain modal category annotations.
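The shuffling step in [0008] can be sketched as follows. This is a minimal illustration, assuming the source domain information is held as a list of (image, modality label) pairs; the function name `make_target_domain` and the placeholder data are hypothetical and not taken from the patent.

```python
import random

def make_target_domain(source_pairs, seed=None):
    """Shuffle (image, modality_label) pairs as units to form target domain info."""
    rng = random.Random(seed)
    target_pairs = list(source_pairs)  # copy so the source order is left untouched
    rng.shuffle(target_pairs)          # each pair moves as a unit, so labels stay matched
    return target_pairs

# Hypothetical placeholders standing in for normalized image arrays.
source = [("img_flair_p1", "flair"), ("img_t1_p1", "t1"), ("img_t2_p2", "t2")]
target = make_target_domain(source, seed=0)
```

Shuffling whole pairs rather than images and labels separately is what keeps the one-to-one correspondence between image data and modal category.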

[0009] A multi-domain generative network based on attention-guided contrastive learning is constructed: the entire network uses the StarGANv2 multi-domain generative model as its backbone, and adds an attention-guided contrastive learning network and a contrastive learning loss. The StarGANv2 network includes a style feature mapping network M, a multi-domain generative reconstruction network, and a multi-domain image discrimination and style discrimination network. The style feature mapping network M establishes the mapping relationship between the modal category labels of the source domain and the target domain. The multi-domain generative reconstruction network uses cycle consistency to learn the similarities and differences between source domain and target domain modalities; it mainly includes a source domain generator $G_s$, which establishes the mapping from the source domain to the target domain, and a target domain generator $G_t$, which establishes the mapping from the target domain to the source domain. $G_s$ takes source domain information as input and outputs generated target domain information; $G_t$ takes the generated target domain information output by $G_s$ as input and outputs generated source domain information. The multi-domain image discrimination and style discrimination network identifies whether an image is a generated image and whether the style of the generated image is consistent with the target style.

[0010] The attention-guided contrastive learning network constrains the generator's generation details at the image patch level. Specifically, it includes three parts: a source domain attention feature extraction network $A_s$, a target domain attention feature extraction network $A_t$, and a feature ranking module. The network outputs image patch-level information, which is then used to calculate the contrastive loss during training, thereby constraining the generated details.

[0011] The following section details the construction of attention-guided contrastive learning networks:

[0012] First, the source domain attention feature extraction network $A_s$ and the target domain attention feature extraction network $A_t$ share the same structure, consisting of a channel and spatial attention module (CBAM) and a multilayer perceptron (MLP). The CBAM extracts the channel and spatial features of the source / target domain modal images that matter to the generation task; these are called attention features. The MLP applies a nonlinear mapping to the attention features to bring out the features important to the network, so that the subsequent contrastive loss can be calculated in a targeted way.
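The two kinds of attention that a CBAM-style module computes can be sketched in plain Python. This is a simplified illustration only: the learned shared MLP (channel branch) and the learned convolution (spatial branch) of a real CBAM are omitted, and the function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(fmap):
    """fmap: list of C channels, each an H x W nested list. Returns C weights."""
    weights = []
    for channel in fmap:
        flat = [v for row in channel for v in row]
        avg, mx = sum(flat) / len(flat), max(flat)  # spatial compression
        weights.append(sigmoid(avg + mx))           # learned shared MLP omitted
    return weights

def spatial_attention(fmap):
    """Returns an H x W map of weights, one per spatial position."""
    height, width = len(fmap[0]), len(fmap[0][0])
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            vals = [channel[i][j] for channel in fmap]  # channel-wise compression
            row.append(sigmoid(sum(vals) / len(vals) + max(vals)))  # learned conv omitted
        out.append(row)
    return out

# Toy feature map with two channels of size 2 x 2.
fmap = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]]
channel_weights = channel_attention(fmap)
spatial_weights = spatial_attention(fmap)
```

In this toy input the second channel and the bottom-right position receive the largest weights, because they carry the largest activations.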

[0013] Secondly, the feature ranking module calculates the entropy of the attention features output by the source domain attention feature extraction network, sorts the entropy values in descending order, takes the N largest (most "valuable") entropy values, and obtains the corresponding image patches in the source domain modal image. Simultaneously, the feature ranking module calculates the entropy of the attention features output by the target domain attention feature extraction network, and obtains the corresponding image patches in the target domain based on the positions of the N patches selected from the source domain. Finally, a contrastive loss is calculated using the corresponding image patches from the source domain and the generated target domain.
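The ranking step can be sketched as below. This is a minimal illustration under one assumption that the text does not specify: the entropy of a patch is taken as the Shannon entropy of its softmax-normalized feature values. The function names are hypothetical.

```python
import math

def patch_entropy(values):
    """Shannon entropy of one patch's feature values after softmax normalization."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_n_patch_indices(source_features, n):
    """Positions of the n patches with the highest entropy, in descending order."""
    entropies = [patch_entropy(f) for f in source_features]
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return order[:n]

def select_patches(patches, indices):
    """Pick patches at the given positions (used for source and target alike)."""
    return [patches[i] for i in indices]

# Three toy source patches, each with four feature values.
source_features = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
indices = top_n_patch_indices(source_features, 2)
target_patches = select_patches(["t0", "t1", "t2"], indices)
```

The same index list is applied to the generated target domain patches, so source and target patches are compared position by position.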

[0014] The overall workflow of the attention-guided contrastive learning network is roughly as follows:

[0015] 1) Input the source domain modal image into the source domain attention feature extraction network $A_s$ to obtain the nonlinearly mapped source domain attention features. Then input these attention features into the feature ranking module, which calculates their entropy, sorts the entropy values in descending order, and takes the top N entropy values to identify the corresponding source domain image patches in the input source domain modal image.

[0016] 2) Input the generated target domain modal image into the target domain attention feature extraction network $A_t$ to obtain the nonlinearly mapped target domain attention features. Then input these features into the feature ranking module, which in this case performs no entropy calculation or sorting; it only locates, in the generated target domain image, the target domain image patches at the positions given by the top N entropy values of the source domain attention features.

[0017] 3) Finally, calculate the contrastive loss based on the generated target domain image patches and source domain image patches. Select any one image patch from the N source domain image patches as a positive sample, and the remaining N-1 image patches as negative samples. Obtain the attention-guided contrastive learning loss by narrowing the distance between positive samples and widening the distance between negative samples.
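Step 3) can be sketched as an InfoNCE-style patch loss. This is an illustration under stated assumptions: patches are represented as feature vectors, similarity is the cosine similarity, and the temperature value 0.07 is a placeholder; none of these choices is specified in the text, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against a zero vector
    return [x / norm for x in v]

def patch_contrastive_loss(query, positive, negatives, tau=0.07):
    """Loss is small when query is close to positive and far from the negatives."""
    q = normalize(query)
    pos = dot(q, normalize(positive)) / tau
    logits = [pos] + [dot(q, normalize(n)) / tau for n in negatives]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - pos  # equals -log(softmax probability of the positive)

# query: generated target patch; positive: source patch at the same position.
matched = patch_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
mismatched = patch_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this value narrows the distance to the positive sample and widens the distance to the N-1 negative samples, which is the behavior the step describes.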

[0018] In summary, an attention feature extraction network is constructed to extract attention features from the source / target domain modal images, highlighting important feature weights. A feature ranking module then calculates the entropy of these feature weights and selects the N image patches with the highest entropy values in the source / target domain modal images. By narrowing the distance between source and target domain image patches at the same position and widening the distance between patches at different positions, the style and content of the generated details are controlled.

[0019] Training a multi-domain generative network based on attention-guided contrastive learning: calculating adversarial loss, multi-domain reconstruction loss, style reconstruction loss, and attention-guided contrastive loss;

[0020] The attention-guided contrastive loss enables the model to generate style-stable multimodal images that preserve lesion details, which are then used for subsequent multimodal segmentation tasks. During training, contrastive learning is used to maximize the mutual information between image patches of the source domain modal image X and the generated target domain modal image. The two images are input into the source domain attention feature extraction network $A_s$ and the target domain attention feature extraction network $A_t$, respectively, to obtain attention features, which are then input into the feature ranking module to obtain the image patches of the source domain modal image X and the generated target domain modal image that correspond to the top N entropy values. First, the entropy values of the attention features output by $A_s$ are calculated and sorted from largest to smallest (Formula 1). Then, from the resulting set $\{x_1, \ldots, x_N\}$, one image patch is selected as the positive sample and the remaining N-1 patches as negative samples, and the attention-guided contrastive loss $l(x, x^+, x^-)$ is calculated (Formula 2).

[0021] $\{x_1, \ldots, x_N\} = \operatorname{Range}_{\mathrm{Max} \to \mathrm{Min}}\big(D(A(X))\big)$ (Formula 1)

[0022] Here, A refers to the attention feature extraction network in general, without distinguishing between the source domain attention feature extraction network and the target domain attention feature extraction network. D represents entropy calculation.

[0023]

[0024] Here, $x$ is an image patch from the generated target domain modal image, of which there are N in total; $x^+$ is the positive sample, an image patch from the source domain modal image X; $x^-$ denotes the remaining N-1 negative image patch samples; and $\tau$ is the temperature hyperparameter. Note that image patch $x$ is always located in the generated target domain modal image, and its positive sample $x^+$ is located at the same position in the source domain modal image X. The N-1 negative patches are randomly selected from X.
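Formula 2 itself is not reproduced in this text. Given the quantities described in [0024] (a query patch, one positive patch, N-1 negative patches, and a temperature), one common form of such a loss is the InfoNCE-style patch loss below. This is an assumption about the general shape of the loss, not the patent's exact formula; the index $n$ over negatives and the dot-product similarity are illustrative choices.

```latex
l(x, x^{+}, x^{-}) =
  -\log \frac{\exp\left(x \cdot x^{+} / \tau\right)}
             {\exp\left(x \cdot x^{+} / \tau\right)
              + \sum_{n=1}^{N-1} \exp\left(x \cdot x^{-}_{n} / \tau\right)}
```

In this form the loss decreases as the query patch becomes more similar to its positive sample and less similar to the negatives, and a smaller temperature sharpens that contrast.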

[0025] Obtaining complete multimodal data: After the multi-domain generative network based on attention-guided contrastive learning has been trained, the available modal images of every case with missing modalities are input into the network, which outputs the corresponding complete set of modal images. For example, with four modalities, inputting the information of one modality (modal image and modality category) generates images of the other three. Merging the generated modal images with the original available images yields a complete dataset with all four modalities for each patient. Because different modalities capture different organ and tissue information, the modalities differ significantly in appearance, but lesion locations differ little between them. The completed data can then be used for multimodal medical image segmentation tasks.
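The merging of generated and original modalities can be sketched as follows. This is a minimal illustration, assuming a trained generator is available as a callable `generate(source_image, source_modality, target_modality)`; that signature, the choice of the first available modality as the source, and the stub generator are all hypothetical.

```python
MODALITIES = ("flair", "t1", "t1ce", "t2")

def complete_case(available, generate):
    """Fill in every missing modality of one case from an available one."""
    if not available:
        raise ValueError("at least one modality must be available")
    # Assumption: use the first available modality (in fixed order) as the source.
    src_mod = next(m for m in MODALITIES if m in available)
    completed = dict(available)  # original images are kept unchanged
    for modality in MODALITIES:
        if modality not in completed:
            completed[modality] = generate(available[src_mod], src_mod, modality)
    return completed

def stub_generate(image, src_mod, tgt_mod):
    """Stand-in for the trained network: just records what would be generated."""
    return f"generated_{tgt_mod}_from_{src_mod}"

case = complete_case({"flair": "real_flair", "t1": "real_t1"}, stub_generate)
```

After the call, the case holds all four modalities: the two original images plus two generated ones.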

[0026] Compared with the prior art, the beneficial effects of the present invention are:

[0027] 1) This invention proposes a novel generative modality completion method for multimodal medical image segmentation. Based on the generative multi-domain style transfer network (StarGANv2), this invention incorporates an attention-guided contrastive learning network. This network constrains the stability of multi-domain style information and the completeness of content information. In real-world medical scenarios, complete multimodal image data is often difficult to obtain due to patients' financial and time constraints. This necessitates domain transformation from source domain images to target domain images. However, multi-domain style transfer networks are unstable for grayscale medical images, and the generation of lesion targets for medical image segmentation tasks is completely lacking. The attention-guided contrastive learning network incorporated in this invention strengthens the transformed style and content information by comparing the mutual information content of the "regions of interest" between the source and target domain information. This improves the reliability of the generated modality, completes the missing modality dataset, and thereby enhances the capabilities of subsequent multimodal medical image segmentation models.

[0028] 2) This invention adds an attention-guided contrastive learning network to the multi-domain generative network. This network consists of three parts: a source domain attention feature extraction network, a target domain attention feature extraction network, and a feature ranking module. The network outputs image patch-level information, which is then used to calculate the contrastive loss during training, thereby constraining the generated details. The source domain attention feature extraction network and the target domain attention feature extraction network extract attention features from the source domain modal image and the generated target domain modal image, respectively. The feature ranking module then selects "valuable" image patches from both the source domain modal image and the generated target domain modal image. Finally, the contrastive learning loss is calculated from the positive and negative samples among these image patches, ensuring a high correlation between the mutual information of the source domain modal image and the generated target domain modal image, thus stabilizing the style transfer information and the fine-grained content information of the modality. Specifically, channel attention assigns an attention weight to each channel of the feature map by compressing the image features spatially, thereby finding the important channel features. Spatial attention assigns attention weights to spatial positions by compressing the image features along the channel dimension, thereby finding the important spatial features.

[0029] 3) This application uses the multi-domain generative network based on attention-guided contrastive learning to generate the modal images missing from medical image diagnosis (target domain modal images), which, together with the original images (source domain modal images), constitute a complete multimodal medical image database, thereby improving the accuracy of the multimodal segmentation model. Compared with other generative modality completion methods for segmentation tasks, this invention performs multimodal-to-multimodal generation without constructing a separate network for each modality-to-modality transformation, which reduces the complexity of multimodal transformation models. Furthermore, through attention-guided contrastive learning, the method compares the original image (source domain modal image) and the generated image (target domain modal image) at the image patch level, thereby establishing more robust multi-domain connections and generating more realistic multimodal images. Incorporating the generative modality completion method of this invention into the multimodal segmentation model significantly improves segmentation accuracy, which shows that the method effectively completes modality-missing data and improves the segmentation model's ability to segment multimodal data.

Attached Figure Description

[0030] Figure 1 is a schematic diagram of the generative modal completion method for medical image segmentation tasks according to the present invention.

Detailed Implementation

[0031] The present invention will be further explained below with reference to the embodiments and accompanying drawings, but this is not intended to limit the scope of protection of this application.

[0032] This invention provides a generative modality completion method for multimodal medical image segmentation tasks. As shown in Figure 1, the method includes the following steps:

[0033] Step 1: Load the image data with missing modalities and the corresponding modal category annotation data:

[0034] In the database, each folder is named with a patient's ID and stores the modal images of that case along with the corresponding lesion mask. Modal images of cases from the modality-deficient image data are loaded as source domain modal images, and their corresponding modality category labels are recorded as source domain modality category annotations. The image data is then randomly shuffled to give the target domain modal images and their corresponding target domain modality category labels. All image data together form the modality-deficient image dataset, and its grayscale values are normalized so that each image serves as a single-channel modal image input.

[0035] Step 2: Construct the style feature mapping network M: This part takes the latent code z and the target / source domain modality category labels as input, and the style feature mapping network M outputs the modality style code for each available domain. The latent code is Gaussian noise randomly generated at input time. From the latent code z, the style feature mapping network learns the target / source domain modality category labels and generates a different modality style code for each target / source domain. The target / source domain modality style codes inject style information through the corresponding adaptive instance normalization (AdaIN) layers and guide the source domain generator $G_s$ (which establishes the mapping from the source domain to the target domain) / target domain generator $G_t$ to generate modal images of the specified style. This process is described in detail in Step 3.
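How a style code steers generation through AdaIN can be sketched for a single feature channel. This is a minimal illustration, assuming the style code has already been turned into a per-channel scale `gamma` and shift `beta` by a learned layer that is not shown; the function name and the numbers are hypothetical.

```python
import math

def adain(channel, gamma, beta, eps=1e-5):
    """Adaptive instance normalization of one channel (a flat list of values).

    The channel is normalized to zero mean and unit variance, then rescaled
    and shifted by style-dependent parameters gamma and beta.
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)
    return [gamma * (v - mean) / std + beta for v in channel]

# Hypothetical style parameters derived from a target domain style code.
styled = adain([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=10.0)
styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((v - styled_mean) ** 2 for v in styled) / len(styled))
```

The output keeps the relative ordering of the input values (content) while its mean and spread are set by the style parameters.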

[0036] Step 3: Construct the multi-domain generation and reconstruction network: This is the main network for multi-domain generation, in which style codes guide the generation process of the source domain generator $G_s$ and the target domain generator $G_t$ under cycle consistency. The source domain modal image is input to the source domain generator $G_s$, which, guided by the target domain style code output by the style feature mapping network M in Step 2, produces the generated target domain modal image. The generated target domain modal image is then input into the target domain generator $G_t$, which, guided by the source domain style code output by the style feature mapping network M, produces the generated source domain modal image.
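The source-to-target-to-source chain can be sketched with toy generators. This is an illustration only: real generators are neural networks, whereas here an image is a flat list, a "style" is simply its mean intensity, and `toy_generator` is a hypothetical stand-in used for both directions.

```python
def toy_generator(image, style):
    """Toy stand-in for a generator: re-center the image so its mean equals style."""
    mean = sum(image) / len(image)
    return [v - mean + style for v in image]

def cycle_reconstruct(x, target_style, source_style, g_s, g_t):
    """Source image -> generated target image -> reconstructed source image."""
    generated_target = g_s(x, target_style)                     # guided by target style
    reconstructed_source = g_t(generated_target, source_style)  # guided by source style
    return generated_target, reconstructed_source

def l1_distance(a, b):
    """Mean absolute difference, the usual form of a cycle consistency penalty."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

x = [1.0, 2.0, 3.0]
generated, reconstructed = cycle_reconstruct(
    x, target_style=10.0, source_style=2.0, g_s=toy_generator, g_t=toy_generator
)
cycle_loss = l1_distance(x, reconstructed)
```

A cycle loss near zero indicates that the content of the input survives the round trip even though its style was changed in between.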

[0037] Step 4: Construct the attention-guided contrastive learning network: The source / target domain attention feature extraction networks $A_s$ and $A_t$ extract attention features $F_s$ and $F_t$ from the source domain modal image and the generated target domain modal image, respectively. The entropy values of the source domain attention features $F_s$ are calculated and sorted from largest to smallest to determine the top N image patches of the source domain modal image to be compared. The image patches of the generated target domain modal image at the positions of these N source domain patches are then determined for the subsequent contrastive loss calculation.

[0038] Step 5: Construct the multi-domain image discrimination and style discrimination network:

[0039] The multi-domain image discrimination and style discrimination network consists of a style discrimination network S and a generative adversarial network (GAN) discriminator D. The style discrimination network S determines whether the modal category of each source / target domain modal image generated in the multi-domain reconstruction network matches that of the original source / target domain modal image. The GAN discriminator D determines whether the content of the generated source / target domain modal images is realistic. Specifically, the generated source / target domain modal images are input to the style discrimination network S to obtain the generated source / target domain modal style codes, and the original source / target domain modal category labels are input to the style feature mapping network M to obtain the original source / target domain modal style codes. The style reconstruction loss computed between them guides the multi-domain generative network based on attention-guided contrastive learning in style transfer between multiple domains. The input of the GAN discriminator D is similar to that of the style discrimination network, but the outputs differ: discriminator D outputs not a style code but features used to determine whether an image is real.

[0040] Step 6: Calculate the reconstruction loss, style reconstruction loss, and adversarial loss:

[0041] The style reconstruction loss $L_{sty}$ is calculated from the generated target domain modal style code obtained in Step 5 and the target domain style code obtained in Step 2, as shown in Formula 3. In addition, based on the output of the generative adversarial network discriminator in Step 5, an adversarial loss $L_{adv}$ is applied to make the generated images (generated source / target domain modal images) as similar as possible to the original images (original source / target domain modal images), as shown in Formula 4. Finally, the cycle consistency loss $L_{cyc}$ for image reconstruction is calculated, as shown in Formula 5.

[0042]

[0043] Here, the first style code is the one generated by the style feature mapping network M. The style discrimination network S extracts a second style code from the generated target domain modal image. The style reconstruction loss is the L1 norm of the difference between the style code from M and the style code that S extracts from the generated target domain modal image.

[0044]

[0045] Here, $D_y$ denotes the discriminator output corresponding to domain $y$; the discriminator learns to distinguish between the generated target domain modal image and the real source domain modal image.

[0046]

[0047] Here, the source style code is the style code of the input source domain modal image. The loss encourages the source domain generator $G_s$ to use the target style code to generate a target domain modal image of the specified style from the input image x, and the target domain generator $G_t$, given the generated target domain modal image and guided by the source style code, to reconstruct the source domain modal image, so that the original features of x are preserved while its style is changed. In the above formulas, E denotes the expectation over the data distribution of the variables used to calculate the loss.
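Formulas 3 to 5 are not reproduced in this text. For orientation, the corresponding losses in the published StarGANv2 model, rewritten here with the two generators $G_s$ and $G_t$ named in this document, take roughly the following form. The symbols $\tilde{y}$ (target domain), $\tilde{s}$ (target style code produced by M from the latent code $z$), and $\hat{s}$ (style code of the input source image) follow StarGANv2's notation and are assumptions about the patent's own notation.

```latex
\begin{aligned}
L_{sty} &= \mathbb{E}_{x,\tilde{y},z}\Big[\big\lVert \tilde{s} - S_{\tilde{y}}\big(G_s(x,\tilde{s})\big) \big\rVert_1\Big] \\
L_{adv} &= \mathbb{E}_{x,y}\big[\log D_y(x)\big]
         + \mathbb{E}_{x,\tilde{y},z}\Big[\log\Big(1 - D_{\tilde{y}}\big(G_s(x,\tilde{s})\big)\Big)\Big] \\
L_{cyc} &= \mathbb{E}_{x,y,\tilde{y},z}\Big[\big\lVert x - G_t\big(G_s(x,\tilde{s}),\hat{s}\big) \big\rVert_1\Big]
\end{aligned}
```

Read this way, $L_{sty}$ compares two style codes under an L1 norm, $L_{adv}$ is the usual real-versus-generated objective for the discriminator output of the relevant domain, and $L_{cyc}$ penalizes the difference between the input image and its reconstruction.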

[0048] Step 7: Calculate the attention-guided contrastive learning loss: Take the attention features $F_s$ of the source domain modal image obtained in Step 4 and the attention features $F_t$ of the generated target domain modal image at the corresponding locations, and select the N image patches with the largest entropy values in $F_s$ as the objects for calculating the contrastive loss. One image patch in the attention features of the source domain modal image is selected as the positive sample of the contrastive loss, and the remaining N-1 are used as negative samples. The contrastive loss is obtained by pulling the positive sample as close as possible and pushing the negative samples as far away as possible, as shown in Formula 2.

[0049] After training, the multi-domain generative adversarial network based on attention-guided contrastive learning is obtained.

[0050] Step 8: Input the source domain modal images and category labels into the multi-domain generative network based on attention-guided contrastive learning to obtain the generated missing modal image data, which, together with the source domain modal images in the modality-deficient image dataset from Step 1, form the segmentation database. After several rounds of training, the network produces the data that completes the modality-deficient dataset. The generated missing modal images (generated target domain modal images) are then combined with the original modal images (source domain modal images) to obtain the completed multimodal segmentation data.

[0051] Step 9: Construct a multimodal segmentation model: Adjust the input data based on the new multimodal segmentation data. Perform multi-channel fusion of the multimodal image data and input it into the lesion segmentation model, such as U-Net, to train the segmentation model.
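The multi-channel fusion in Step 9 can be sketched as stacking one 2D slice per modality into a multi-channel input. This is a minimal illustration using nested lists; the fixed channel order and the function name are assumptions, and the segmentation model itself is not shown.

```python
MODALITY_ORDER = ("flair", "t1", "t1ce", "t2")

def fuse_channels(case):
    """Stack one 2D slice per modality into a C x H x W nested list."""
    missing = [m for m in MODALITY_ORDER if m not in case]
    if missing:
        raise ValueError(f"case is not complete, missing: {missing}")
    shapes = {(len(case[m]), len(case[m][0])) for m in MODALITY_ORDER}
    if len(shapes) != 1:
        raise ValueError("all modality slices must have the same height and width")
    # A fixed channel order keeps training and inference inputs consistent.
    return [case[m] for m in MODALITY_ORDER]

# Toy 2 x 2 slices; in practice these are original or generated modal images.
case = {
    "flair": [[0.1, 0.2], [0.3, 0.4]],
    "t1": [[0.5, 0.6], [0.7, 0.8]],
    "t1ce": [[0.9, 1.0], [1.1, 1.2]],
    "t2": [[1.3, 1.4], [1.5, 1.6]],
}
fused = fuse_channels(case)
```

The resulting four-channel input is the kind of tensor a segmentation model such as U-Net would take, with the lesion mask as its training target.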

[0052] Example 1

[0053] This example demonstrates generative modality completion and segmentation of multimodal MRI images from a glioma database released in 2021.

[0054] The glioma database contains 340 cases, each with four modalities of MRI images: Flair, T1, T1ce, and T2. There are a total of 1360 3D images, occupying 1.15GB of space.

[0055] Constructing a multimodal missing dataset: The 2021 glioma database was loaded. This database includes image data for each patient across four modalities, along with the corresponding diagnostic results (whether the lesion is present or not). The database also includes ground truth annotations of the lesion regions, i.e., masks representing the lesion's location and shape. These masks are used for training and optimizing the subsequent lesion segmentation model. To reflect real-world medical scenarios, multiple modalities for each patient are randomly missing, but each patient must have at least one modality missing. This constructs a multimodal missing dataset in which the set of existing modalities varies from patient to patient. In the multimodal missing dataset, each folder is named with a patient's ID and stores one or more modal images of that case, along with the corresponding lesion mask.

[0056] The following details each step and the model parameter settings:

[0057] Step 1: Load the data required for the attention-based contrastive learning-based multi-domain generative network:

[0058] The data required for the multi-domain generative network based on attention contrastive learning consists of two parts: source domain modal images with their source domain modal category labels, and target domain modal images with their target domain modal category labels. The 3D .nii files must first be converted into 2D .png images that the network can process, and then the two parts are acquired in sequence. The specific processing method is as follows: (1) Read the modality missing dataset: use the os module in Python to traverse all .nii files in the folder where the data is stored, and save each file path together with its modality category label. Then obtain the ground truth segmentation mask that corresponds to each file according to the file name. The dataset has four modalities: flair, t1, t1ce, and t2, and each patient is missing a different subset of them. Taking the case stored in the folder patient1 under the path '/Brain/' as an example, first obtain the storage path of the case, i.e., '/Brain/patient1', and then find the modality .nii files under this path, here patient1_flair.nii and patient1_t1.nii. This patient is therefore missing the t1ce and t2 modalities. Each modality is placed in its corresponding modality path, for example '/Brain/patient1/t1/'. The corresponding segmentation mask path is '/Brain/patient1/patient1_seg.nii', which is used later for training the segmentation model.

[0059] After obtaining the paths of the modal images and segmentation masks corresponding to all cases, the pandas library was used to save them into a table.
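
The traversal in (1) might look like the sketch below. It uses only the os module and returns a list of row dictionaries that could be handed to pandas.DataFrame to build the table. The function name index_dataset and the column names are hypothetical, the file naming pattern is assumed from the patient1 example, and the label values 0 to 3 are the ones given for Flair, T1, T1ce, and T2 in (3).

```python
import os

MODALITY_LABELS = {"flair": 0, "t1": 1, "t1ce": 2, "t2": 3}


def index_dataset(root):
    """Collect one row per available modality file under root/<patient>/."""
    rows = []
    for patient_id in sorted(os.listdir(root)):
        patient_dir = os.path.join(root, patient_id)
        if not os.path.isdir(patient_dir):
            continue
        # Assumed naming: <patient>_seg.nii holds the lesion mask.
        mask_path = os.path.join(patient_dir, f"{patient_id}_seg.nii")
        for name in sorted(os.listdir(patient_dir)):
            stem, ext = os.path.splitext(name)
            if ext != ".nii" or stem.endswith("_seg"):
                continue
            # Assumed naming: <patient>_<modality>.nii
            modality = stem.rsplit("_", 1)[-1].lower()
            if modality not in MODALITY_LABELS:
                continue
            rows.append({
                "patient": patient_id,
                "modality": modality,
                "label": MODALITY_LABELS[modality],
                "image_path": os.path.join(patient_dir, name),
                "mask_path": mask_path,
            })
    return rows
```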

[0060] (2) Converting 3D modal image data to 2D data: First, load the table of modal image paths with pandas, and then use the NiBabel library to read the 3D modal image data. Next, perform grayscale standardization on the 3D data, slice it along the vertical axis, and save each slice as a PNG image named by its slice index, such as '/Brain/patient1/t1/patient1_34.png'. Finally, update the pandas table built in (1) with the slice paths; the updated table is denoted s_modal_list.
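
A minimal sketch of the standardize-then-slice step is given below. It starts from a volume that has already been read into a NumPy array (file reading and PNG writing are left out), assumes that "grayscale standardization" means zero-mean, unit-variance scaling, and assumes the vertical axis is the last array axis. Both function names are hypothetical.

```python
import numpy as np


def standardize_volume(volume):
    """Zero-mean, unit-variance scaling of a whole 3D volume (assumed meaning)."""
    volume = np.asarray(volume, dtype=np.float64)
    std = volume.std()
    if std == 0:
        return np.zeros_like(volume)
    return (volume - volume.mean()) / std


def volume_to_slices(volume):
    """Split an (H, W, D) volume into D slices of shape (H, W)."""
    # Assumption: the slicing axis is the last array axis.
    return [volume[:, :, k] for k in range(volume.shape[-1])]
```

Each returned slice would then be written to a file such as patient1_34.png, where 34 is the slice index k.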

[0061] (3) Acquisition of source domain modal category labels and of target domain modal images and category labels: First, obtain the source domain modal category labels, that is, attach the corresponding category label to the images of each modality. The category labels of the four modalities Flair, T1, T1ce, and T2 are 0, 1, 2, and 3, respectively. For example, the image '/Brain/patient1/t1/patient1_34.png' is labeled 1. The category label is saved in the updated table from (1), i.e., s_modal_list, as the row '/Brain/patient1/t1/patient1_34.png', '1'. Then the rows of s_modal_list are randomly shuffled to serve as the target domain modal images and category labels, denoted t_modal_list.
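
The labeling and shuffling in (3) can be sketched with plain lists of (path, label) rows in place of the pandas table. The helper names label_for_slice and build_lists are hypothetical, and the modality is assumed to be the name of the folder that contains the slice, as in the example path.

```python
import posixpath
import random

MODALITY_LABELS = {"flair": 0, "t1": 1, "t1ce": 2, "t2": 3}


def label_for_slice(png_path):
    """Read the modality from the parent folder name, e.g. .../t1/patient1_34.png."""
    modality = posixpath.basename(posixpath.dirname(png_path)).lower()
    return MODALITY_LABELS[modality]


def build_lists(png_paths, seed=0):
    """Return (s_modal_list, t_modal_list) as lists of (path, label) rows."""
    s_modal_list = [(p, label_for_slice(p)) for p in png_paths]
    # Whole rows are shuffled, so each image keeps its own label.
    t_modal_list = list(s_modal_list)
    random.Random(seed).shuffle(t_modal_list)
    return s_modal_list, t_modal_list
```

Because rows are shuffled as units, the target list pairs every source image with a randomly chosen target image whose label is still correct.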

[0062] (4) Modal image normalization: Modal image data loaded into the network needs to be normalized. Given the grayscale value range [a, b] of the image data, the width of the range is b - a. If the grayscale value of a pixel in the source domain modal image is x, then the normalized grayscale value of this pixel is... Each image is (240, 240) in size, and the processed image has shape (1, 240, 240), where 1 indicates a single-channel image.
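
The normalization formula itself is not reproduced in the text. The sketch below assumes ordinary min-max scaling, (x - a) / (b - a), with a and b taken per image; that choice is an assumption of this example, as is the function name min_max_normalize.

```python
import numpy as np


def min_max_normalize(image):
    """Scale a (240, 240) slice to [0, 1] and add a leading channel axis."""
    image = np.asarray(image, dtype=np.float32)
    a, b = image.min(), image.max()
    if b == a:
        # Constant image: avoid division by zero.
        return np.zeros((1, *image.shape), dtype=np.float32)
    # Assumed formula: (x - a) / (b - a)
    return ((image - a) / (b - a))[np.newaxis, ...]
```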

[0063] At this point, the data required for the attention-based contrastive learning-based multi-domain generative network has been prepared and loaded.

[0064] Step 2: Construct and train a multi-domain generative network based on attention contrastive learning.

[0065] The attention-guided contrastive learning-based multi-domain generative adversarial network mainly comprises four networks: a multi-domain generative reconstruction network, a style feature mapping network, an attention-guided contrastive learning network, and a multi-domain image and style identification network. These four networks are trained on the attention-guided contrastive learning-based multi-domain generative network by inputting source domain modal image data and modal category annotations, as well as target domain modal images and modal category annotations, to obtain a reliable target domain modal image corresponding to the source domain modal image. The specific implementation details of each module are as follows:

[0066] ① Style Feature Mapping Network

[0067] The network takes latent space codes and target / source domain modal category annotations as input and outputs target / source domain style codes. These style codes guide the source / target domain generators G_s / G_t to generate target / source domain modal images. The style feature mapping network M is shown in the left part of Figure 1 (the multi-domain generative adversarial network based on attention contrastive learning). Its structure is [L64_512, ML512_512_3, MML512(3)_64(1)_4], where L64_512 represents a fully connected layer-ReLU function structure with an input dimension of 64 and an output dimension of 512, ML512_512_3 represents three fully connected layer-ReLU function structures with 512 dimensions for both input and output, and MML512(3)_64(1)_4 represents four groups, each consisting of three fully connected layer-ReLU function structures with 512 dimensions for both input and output followed by one fully connected layer-ReLU function structure with 512 input dimensions and 64 output dimensions. Finally, the output of M is fused with the target domain modality category label to form the target domain modality style code, which is input into the source domain generator G_s to generate a target domain modal image that has the style (modality) of the target domain but the content of the source domain modal image.
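
One way to read the notation [L64_512, ML512_512_3, MML512(3)_64(1)_4] is as a shared trunk of four fully connected layers followed by four groups of four layers. The sketch below only lists the (input, output) sizes under that reading and counts weights plus biases; the function names are hypothetical, and treating the four groups as separate branches (for example one per modality) is an assumption.

```python
def mapping_network_dims(latent_dim=64, hidden_dim=512, style_dim=64, num_groups=4):
    """Return (shared, groups) as lists of (in_features, out_features)."""
    # L64_512 followed by ML512_512_3
    shared = [(latent_dim, hidden_dim)] + [(hidden_dim, hidden_dim)] * 3
    # MML512(3)_64(1)_4: three 512->512 layers and one 512->64 layer, four times
    group = [(hidden_dim, hidden_dim)] * 3 + [(hidden_dim, style_dim)]
    return shared, [list(group) for _ in range(num_groups)]


def count_parameters(shared, groups):
    """Weights plus biases of every fully connected layer."""
    layers = shared + [layer for group in groups for layer in group]
    return sum(n_in * n_out + n_out for n_in, n_out in layers)
```

Under these assumptions the mapping network has 20 fully connected layers in total and each group ends in a 64-dimensional style code.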

[0068] ② Multi-domain generative reconstruction network

[0069] The multi-domain generative reconstruction network feeds the source domain modal image into the source domain generator G_s to generate a target domain modal image, which is then passed through the target domain generator G_t to obtain the reconstructed source domain modal image (generated source domain modal image). A reconstruction loss constrains the generators and thereby improves their generation capability. As shown in the backbone of the attention-based contrastive learning-based multi-domain generative network in Figure 1, both the target domain generator G_t and the source domain generator G_s are guided by the style code s output by the style feature mapping network. The source domain generator G_s is guided by the target domain style code, which the style feature mapping network derives from the target domain modality category annotation; the target domain generator G_t is guided by the source domain style code, which the style feature mapping network derives from the source domain modality category annotation. The specific structure of the multi-domain generative reconstruction network is [c1s1_64, dR128, dR256, dR512, dR512, R512, R512, R512, AdaR512, AdaR512, uR512, uR256, uR128, u64, c1s1_3]. Here, c1s1_64 represents a convolutional layer with a kernel size of 1*1, a stride of 1, and 64 output channels. dR128 represents a residual block-average pooling layer module with an instance normalization layer, a kernel size of 3*3, a stride of 1, and 128 output channels. Similarly, dR256 and dR512 represent residual block-average pooling layer modules with the same kernel size and stride as dR128, but with 256 and 512 output channels respectively. R512 represents a residual block with a kernel size of 3*3, a stride of 1, and 512 output channels. AdaR512 represents a residual block with an adaptive normalization layer, a kernel size of 3*3, a stride of 1, and 512 output channels.
uR512 represents an upsampling residual block with an adaptive normalization layer, a kernel size of 3*3, a stride of 1/2, and 512 output channels. Similarly, uR256 and uR128 represent upsampling residual blocks with adaptive normalization layers and the same kernel size and stride as uR512, but with 256 and 128 output channels respectively. c1s1_3 is a convolutional layer with a kernel size of 1*1, a stride of 1, and 3 output channels. G_t and G_s have the same layer structure. The reconstruction loss is calculated as follows: during training, the source domain modal image x_s is input into G_s to generate the target domain modal image x′_t, which is then input into G_t to generate the source domain modal image x′_s. Constraining ||x′_s - x_s||_2 enables the model to learn the features common to the domains, thereby generating more realistic target modal images.
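
The cycle described above (source image to generated target image and back) and the resulting reconstruction term can be sketched with stand-in generators. The generators here are arbitrary callables, not the networks of the patent, the function names are hypothetical, and the norm is read as a plain L2 norm, which is one possible reading of the expression in the text.

```python
import numpy as np


def cycle_reconstruct(x_s, g_s, g_t, style_t, style_s):
    """Run x_s -> G_s -> x'_t -> G_t -> x'_s with the matching style codes."""
    x_t_gen = g_s(x_s, style_t)      # generated target domain image
    x_s_rec = g_t(x_t_gen, style_s)  # reconstructed source domain image
    return x_t_gen, x_s_rec


def reconstruction_loss(x_s, x_s_rec):
    """L2 norm of the difference between reconstruction and source image."""
    return float(np.linalg.norm(np.asarray(x_s_rec) - np.asarray(x_s)))
```

A loss of zero means G_t exactly undoes what G_s did to the image content, which is the behavior the constraint encourages.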

[0070] ③ Attention-guided contrastive learning network

[0071] To further improve the generation of modal image details, especially of tumor lesions, the generated target domain modal image x′_t output by the generator and the source domain modal image x_s are input into the attention feature extraction networks A_t and A_s, respectively. The structure of A_t is [CBAM_512, L512_256, L256_256, L256_256]. CBAM_512 is a convolutional block attention module (CBAM) with 512 channels. L512_256 is a fully connected layer-ReLU function structure with 512-dimensional input and 256-dimensional output. L256_256 is a fully connected structure with 256-dimensional input and output. The fully connected layer-ReLU function structure and the two fully connected structures constitute a multilayer perceptron (MLP). A_s and A_t have the same structure and share the parameters of the fully connected layers. Subsequently, the entropy values of the feature vectors obtained through A_s and A_t are calculated and sorted in descending order, and the 50 image patches with the highest entropy values out of 256 are selected as the data for calculating the contrastive loss. Specifically, with N denoting the number of selected patches, one image patch from the source domain modal image features is taken as the positive sample for the contrastive loss, and the remaining N-1 are taken as negative samples. The attention-guided contrastive learning loss is obtained by narrowing the distance between positive samples and widening the distance to negative samples.
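
The patch selection and contrastive loss can be illustrated as follows. The text does not say how the entropy of a feature vector is computed or give the exact loss formula here, so this sketch assumes the entropy of a softmax over each patch's feature vector and a standard InfoNCE loss on cosine similarities. The function names and the temperature value 0.07 are hypothetical.

```python
import numpy as np


def patch_entropy(features):
    """features: (P, C). Entropy of softmax(features) for each patch (assumed)."""
    z = features - features.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)


def select_top_patches(src_features, n=50):
    """Indices of the n source patches with the highest entropy, descending."""
    return np.argsort(-patch_entropy(src_features))[:n]


def patch_contrastive_loss(src_features, tgt_features, idx, tau=0.07):
    """InfoNCE over the selected positions (assumed form of the loss).

    For each selected position, the generated target patch is the query,
    the source patch at the same position is the positive, and the other
    N-1 selected source patches are the negatives.
    """
    q = tgt_features[idx]
    k = src_features[idx]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                       # (N, N), diagonal = positives
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

With 256 patches of 256-dimensional features, select_top_patches returns 50 indices chosen in the source domain, and the same indices are used to pick the patches in the generated target domain.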

[0072] ④ Multi-domain image identification and style identification network

[0073] The multi-domain image discriminator D has the same structure as the style discrimination network S, namely [c1s1_64, dR128, dR256, dR512, dR512, dR512, dR512, c4s1_512, L512_k]. Here, c1s1_64, dR128, dR256, and dR512 are the same as in the generator, but without instance normalization layers. c4s1_512 represents a convolutional layer with a kernel size of 4*4, a stride of 1, and 512 output channels. L512_k represents a fully connected layer-ReLU function structure with a 512-dimensional input and a k-dimensional output. k is 1 in the multi-domain image discriminator D because D determines whether the input is a generated image; k is 4 in the style discrimination network S because S determines the modality of the input. This enables the model to focus, at the image patch level, on the differences between modal images and on the differences between the generated image (the generated target domain modal image) and the original image (the source domain modal image).

[0074] Step 3: Obtain the complete multimodal dataset.

[0075] After training the attention-based contrastive learning-based multi-domain generative adversarial network, all training images and their corresponding target modalities are fed again into the trained style feature mapping network M and the source domain generator G_s to generate the missing modal image data. To avoid increasing the difficulty of model selection, four modalities are generated for each case by default. Then, the generated images for the modalities that are missing in each case of the modality missing database are selected and combined with the original database to form complete multimodal image data.

[0076] Step 4: Dataset partitioning.

[0077] The scikit-learn library was used to partition the complete multimodal image dataset into training, validation, and test sets in a 7:1:2 ratio. The training set was used to train the segmentation model, the validation set to select the optimal model, and the test set to evaluate the model's performance.
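
A 7:1:2 partition can be sketched with the standard library alone; scikit-learn's train_test_split applied twice would be the closer match to the text. The function name split_cases is hypothetical, and splitting by case ID rather than by individual slice is an assumption made here so that slices of one patient do not end up in different subsets.

```python
import random


def split_cases(case_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle case IDs and cut them into train / validation / test lists."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

For the 340 cases of the example this gives 238 training, 34 validation, and 68 test cases.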

[0078] Step 5: Multi-channel fusion of image data.

[0079] Similar to the data preprocessing in Step 1, the cases in the complete multimodal dataset are loaded, and the grayscale value range of the four modal images of a case, such as [x, y], is obtained. The values are then normalized to the range [0, 1]. The normalized images of the four modalities are stacked along the channel dimension to obtain a multi-channel image with a shape of [4, 240, 240].
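
The fusion step can be sketched as below, reading the text as using one joint grayscale range across the four modalities of a case. The function name fuse_case and the channel order (Flair, T1, T1ce, T2, following the label order 0 to 3) are assumptions of this example.

```python
import numpy as np

MODALITIES = ["flair", "t1", "t1ce", "t2"]


def fuse_case(modal_slices):
    """modal_slices: {modality: (240, 240) array}. Returns a (4, 240, 240) array."""
    stack = np.stack(
        [np.asarray(modal_slices[m], dtype=np.float32) for m in MODALITIES],
        axis=0,
    )
    # Joint range over all four modalities of the case, then scale to [0, 1].
    lo, hi = stack.min(), stack.max()
    if hi == lo:
        return np.zeros_like(stack)
    return (stack - lo) / (hi - lo)
```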

[0080] Step 6: Train the segmentation model.

[0081] Multi-channel images are input into the segmentation model, and the model is optimized based on the known tumor segmentation mask and binary cross-entropy loss.
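
For reference, the binary cross-entropy between a predicted probability map and a binary tumor mask can be computed as in this sketch. The function name and the clipping constant eps are choices of the example, not values from the text.

```python
import numpy as np


def binary_cross_entropy(pred, mask, eps=1e-7):
    """Mean binary cross-entropy between probabilities and a 0/1 mask."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    mask = np.asarray(mask, dtype=np.float64)
    return float(-np.mean(mask * np.log(pred) + (1.0 - mask) * np.log(1.0 - pred)))
```

A prediction of 0.5 everywhere gives a loss of ln 2 regardless of the mask, which is a convenient sanity check at the start of training.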

[0082] Step 7: Perform tumor segmentation on patients with modality loss

[0083] The image data of the existing modalities of a given case of the multimodal disease are loaded, the modality category annotations are obtained, and preprocessing yields a single-channel image of a single modality. This single-channel image and its modality category are used as source domain modality information and input into the trained multi-domain generative adversarial network based on attention-guided contrastive learning and its style feature mapping network to generate the images of the modalities that the case is missing. The generated images are then merged with the original existing modality images to form complete multimodal image data. Finally, the trained segmentation model is applied to obtain the final multimodal segmentation result for the completed modalities.
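
The completion logic for a single case might be organized as in the sketch below. The generate argument stands in for the trained mapping network and generator and is a hypothetical callable; choosing the first available modality as the source is also an assumption, since the text does not say which existing modality is used.

```python
MODALITIES = ["flair", "t1", "t1ce", "t2"]


def complete_case(available, generate):
    """available: {modality: image}. Returns the four images in label order.

    generate(source_image, source_label, target_label) is a stand-in for the
    trained generative network (hypothetical signature).
    """
    if not available:
        raise ValueError("at least one modality is required as the source")
    # Assumption: use the first available modality as the source domain.
    source_name = next(m for m in MODALITIES if m in available)
    source_label = MODALITIES.index(source_name)
    completed = dict(available)
    for target_label, name in enumerate(MODALITIES):
        if name not in completed:
            completed[name] = generate(available[source_name], source_label, target_label)
    return [completed[m] for m in MODALITIES]
```

The returned list keeps original images where they exist and fills only the missing positions, ready for the channel stacking of Step 5.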

[0084] This example performs a tumor segmentation task on glioma image data with missing modalities on a case-by-case basis. The segmentation Dice index was 0.7662, compared with 0.7221 for the same segmentation model on the data with missing modalities, an absolute increase of 0.0441 (4.41 percentage points). Modality completion therefore yields a clear improvement in segmentation.
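
For readers unfamiliar with the metric, the Dice index of two binary masks is twice the overlap divided by the sum of the mask sizes. The sketch below is a generic implementation, not the evaluation code behind the reported numbers; returning 1.0 for two empty masks is a convention chosen here.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |P and T| / (|P| + |T|) for binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, true).sum() / denom)
```

The reported gain is the plain difference of the two indices, 0.7662 - 0.7221 = 0.0441.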

[0085] This invention employs a multi-domain generative modality completion method. It innovatively utilizes multi-domain generative adversarial techniques to provide the complete multi-modal data required for multi-modal segmentation tasks, enabling the conversion between multiple modalities (multiple domains) using a single model. This avoids the problem of large models encountered in traditional multi-modal conversion. Simultaneously, the attention-guided contrastive learning network incorporated with contrastive loss allows for the complete generation of lesion regions during multi-domain generation, preserving the effective information needed for segmentation and stabilizing style information in modality conversion. This makes the invention more suitable for use in clinical CAD (Computer-Aided Diagnosis) systems. This invention maximizes the completion of missing modalities based on existing modal image information, improving the accuracy and effectiveness of multi-modal style representation.

[0086] Matters not described in this invention are applicable to the prior art.

Claims

1. A generative modal completion method for medical image segmentation tasks, characterized in that the method includes the following: Loading modality missing case data: obtain a modality missing database for a certain disease. This database includes randomly missing multimodal image data for each patient, as well as the ground truth annotations of the corresponding image lesion regions, which are masks of the lesion's location and shape. First, normalize and standardize the grayscale values of all modal image data of a given case, and record them as source domain modal images. Then, obtain the modal category annotations that correspond one-to-one with the source domain modal images, as source domain modal category annotations. The source domain modal images and source domain modal category annotations are collectively referred to as source domain information. Randomly shuffle the source domain information while ensuring that the image data and modal categories still maintain a one-to-one correspondence, to obtain target domain information, which consists of target domain modal images and target domain modal category annotations. Constructing a multi-domain generative network based on attention-guided contrastive learning: the entire network uses the StarGANv2 multi-domain generative model as its backbone and adds an attention-guided contrastive learning network and a contrastive learning loss on top of it. The StarGANv2 network includes a style feature mapping network M, a multi-domain generative reconstruction network, and a multi-domain image discrimination and style discrimination network. The style feature mapping network M is used to establish the mapping relationship between the modal category labels of the source domain and the target domain.
The multi-domain generative reconstruction network is used to learn the similarity and difference of modalities between the source domain and the target domain images using cycle consistency, and includes a source domain generator G_s that establishes the mapping from the source domain to the target domain and a target domain generator G_t that establishes the mapping from the target domain to the source domain. The multi-domain image discrimination and style discrimination network is used to identify whether an image is a generated image and whether the style of the generated image is consistent with the target style. The attention-guided contrastive learning network is used to constrain the generation details of the generator at the image patch level; it includes three parts: a source domain attention feature extraction network A_s, a target domain attention feature extraction network A_t, and a feature ranking module. The attention-guided contrastive learning network obtains image patch-level information, and the contrastive loss during training is calculated using this image patch-level information, thereby constraining the generated details. The multi-domain generative network based on attention-guided contrastive learning is used to obtain images of the missing modalities, thereby achieving modality completion. The source domain attention feature extraction network A_s and the target domain attention feature extraction network A_t have the same structure, including a spatial channel attention module (CBAM) and a multilayer perceptron (MLP) structure. The spatial channel attention module (CBAM) can extract the "interesting" features of the source / target domain modal images in both the channel and spatial dimensions for the generation task, which are called attention features.
The multilayer perceptron (MLP) can adjust the nonlinear mapping of the attention features to discover the features that are important to the network, so that the subsequent contrastive loss calculation is performed in a targeted manner. The feature ranking module can calculate the entropy values of the attention features output by the source domain attention feature extraction network, sort the calculated entropy values in descending order, find the top N most "valuable" entropy values, and obtain the image patches of the corresponding receptive fields in the source domain modal image. At the same time, the feature ranking module can calculate the entropy values of the attention features output by the target domain attention feature extraction network, and obtain the corresponding image patches in the target domain according to the positions of the N image patches with the most "valuable" entropy values selected in the source domain. Finally, the image patches at corresponding positions in the source domain and the generated target domain are used to calculate the contrastive loss.

2. The generative modal completion method for medical image segmentation tasks according to claim 1, characterized in that the contrastive loss is calculated as follows: any one image patch among the N selected source domain image patches is taken as a positive sample, and the remaining N-1 of the N selected source domain image patches are taken as negative samples, according to Formula 2. (Formula 2) The terms of Formula 2 are as follows: the image patches from the generated target domain modal image, of which there are [number missing] in total; the positive sample, which is an image patch in the source domain modal image; the negative samples, which are the remaining image patches in the source domain modal image; and τ, the temperature hyperparameter. By narrowing the distance between positive samples and widening the distance to negative samples, an attention-guided contrastive learning loss is obtained.
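
Formula 2 is not reproduced in this text. For orientation only, a standard InfoNCE-style loss that matches the verbal description (one positive, N-1 negatives, temperature τ) would have the form below. The symbols \hat{v}_i for the generated target domain patch, v_i^{+} for the source domain patch at the same position, and v_j^{-} for the other selected source domain patches are introduced here for illustration and are not taken from the patent.

```latex
\mathcal{L}_{\mathrm{con}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\left(\hat{v}_i \cdot v_i^{+} / \tau\right)}
              {\exp\left(\hat{v}_i \cdot v_i^{+} / \tau\right)
               + \sum_{j \neq i} \exp\left(\hat{v}_i \cdot v_j^{-} / \tau\right)}
```

Minimizing this expression raises the similarity of each generated patch to its positive and lowers its similarity to the N-1 negatives.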

3. The generative modal completion method for medical image segmentation tasks according to claim 1, characterized in that, After training a multi-domain generative network based on attention contrastive learning, all modal missing images are input into the network, which can output a corresponding modal completion image set. By fusing the generated modal images with the original missing images, a complete patient dataset can be obtained, which can be used for multimodal medical image segmentation tasks.

4. The generative modal completion method for medical image segmentation tasks according to claim 1, characterized in that the style feature mapping network M takes as input a latent code and the target domain / source domain modality category annotations and outputs the modal style code of the corresponding domain, where the latent code is Gaussian noise randomly generated at input time; the style feature mapping network can effectively learn the target domain / source domain modality category annotations and, together with the latent code, generate different target domain / source domain modal style codes; the style feature mapping network M has the structure [L64_512, ML512_512_3, MML512(3)_64(1)_4], where L64_512 represents a fully connected layer-ReLU function structure with an input dimension of 64 and an output dimension of 512, ML512_512_3 represents three fully connected layer-ReLU function structures with 512 dimensions for both input and output, and MML512(3)_64(1)_4 represents four groups, each consisting of three fully connected layer-ReLU function structures with 512 dimensions for both input and output and one fully connected layer-ReLU function structure with 512 input dimensions and 64 output dimensions.

5. The generative modal completion method for medical image segmentation tasks according to claim 1, characterized in that, in the multi-domain generative reconstruction network, G_s and G_t have the same layer structure, specifically [c1s1_64, dR128, dR256, dR512, dR512, R512, R512, R512, AdaR512, AdaR512, uR512, uR256, uR128, u64, c1s1_3]; where c1s1_64 represents a convolutional layer with a kernel size of 1*1, a stride of 1, and 64 output channels; dR128 represents a residual block-average pooling layer module with an instance normalization layer, a kernel size of 3*3, a stride of 1, and 128 output channels; similarly, dR256 and dR512 represent residual block-average pooling layer modules with the same kernel size and stride as dR128, but with 256 and 512 output channels respectively; R512 represents a residual block with a kernel size of 3*3, a stride of 1, and 512 output channels; AdaR512 represents a residual block with an adaptive normalization layer, a kernel size of 3*3, a stride of 1, and 512 output channels; uR512 represents an upsampling residual block with an adaptive normalization layer, a kernel size of 3*3, a stride of 1/2, and 512 output channels; similarly, uR256 and uR128 represent upsampling residual blocks with adaptive normalization layers and the same kernel size and stride as uR512, but with 256 and 128 output channels respectively; c1s1_3 represents a convolutional layer with a kernel size of 1*1, a stride of 1, and 3 output channels.

6. The generative modal completion method for medical image segmentation tasks according to claim 1, characterized in that the source domain attention feature extraction network A_s and the target domain attention feature extraction network A_t each have the structure [CBAM_512, L512_256, L256_256, L256_256]. CBAM_512 is a spatial channel attention module with 512 channels. L512_256 is a fully connected layer-ReLU function structure with 512-dimensional input and 256-dimensional output. L256_256 is a fully connected structure with 256-dimensional input and output. One fully connected layer-ReLU function structure and two fully connected structures constitute a multilayer perceptron.

7. The generative modal completion method for medical image segmentation tasks according to claim 5, characterized in that the multi-domain image discrimination and style discrimination network includes a multi-domain image discriminator D and a style discrimination network S, which have the same structure but different outputs. The structure is [c1s1_64, sR128, sR256, sR512, sR512, sR512, sR512, c4s1_512, L512_k], where sR128 represents a residual block-average pooling layer module with a kernel size of 3*3, a stride of 1, and 128 output channels. Similarly, sR256 and sR512 represent residual block-average pooling layer modules with the same kernel size and stride as sR128, but with 256 and 512 output channels respectively. c4s1_512 represents a convolutional layer with a kernel size of 4*4, a stride of 1, and 512 output channels. L512_k represents a fully connected layer-ReLU function structure with a 512-dimensional input and a k-dimensional output; k is 1 in the multi-domain image discriminator D and 4 in the style discrimination network S.