Medical image generation method and system based on feature decoupling compensation and dynamic contrast

By employing feature decoupling compensation and dynamic comparison methods, the problems of feature entanglement and model instability in multimodal medical image generation are solved, achieving high-fidelity and accurate image generation and improving the accuracy of diagnosis and segmentation tasks.

CN122244237A · Pending · Publication Date: 2026-06-19 · SHANDONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANDONG UNIV
Filing Date
2026-03-05
Publication Date
2026-06-19


Abstract

This invention proposes a medical image generation method and system based on feature decoupling compensation and dynamic comparison, belonging to the field of medical image generation technology. The method includes: acquiring a partially missing multimodal medical image to be processed; for the partially missing multimodal medical image, introducing a bidirectional contrastive decoupling loss for constraint, extracting cross-modal consistent common features and modality-specific unique features; constructing and updating a modality-specific feature memory using an exponential moving average strategy; mapping the common features to pseudo-text, and retrieving compensated missing unique features based on a cross-attention mechanism; performing multi-source fusion of the common features, the unique features of the non-missing modal, and the compensated missing modality-specific features, and inputting this fusion into a generator of a generative adversarial network to reconstruct and output a complete multimodal medical image; and optimizing the generator using a dynamic modality contrastive learning strategy to improve the image generation quality.

Description

Technical Field

[0001] This invention belongs to the field of medical image generation technology, and particularly relates to a medical image generation method and system based on feature decoupling compensation and dynamic comparison.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] In the field of medical image analysis, multimodal magnetic resonance imaging (MRI) is widely used in clinical auxiliary diagnosis. For example, in the diagnosis and target delineation of brain tumors, it is usually necessary to combine multiple modalities of images, such as fluid-attenuated inversion recovery (FLAIR) sequences, T1-weighted (T1) sequences, T1-weighted contrast-enhanced (T1ce) sequences, and T2-weighted (T2) sequences. Different modalities of images can provide complementary physiological and pathological structural features. However, in actual clinical data acquisition, due to limitations such as scan time, patient motion artifacts, or contrast agent injection restrictions, one or more modalities of images are often missing. The absence of key modalities (such as the T1ce modality) can seriously affect the accuracy of subsequent lesion segmentation and auxiliary diagnosis.

[0004] To address the problem of missing modalities, existing technologies typically employ deep learning-based image generation networks (such as Generative Adversarial Networks, GANs) to map and synthesize missing modalities using existing modal images. However, in practical applications, these existing technologies suffer from the following technical limitations: First, existing methods typically extract features from different modalities through simple channel concatenation or fusion, failing to effectively decouple shared anatomical features across modalities from pathological texture features specific to a single modality. Due to the high entanglement of features between different modalities, the network easily loses modality-specific texture details (e.g., tumor enhancement features in the T1ce modality) when reconstructing missing modalities, resulting in images lacking accurate clinicopathological information.

[0005] Furthermore, existing generative networks rely primarily on implicit reasoning from existing input data when predicting missing modalities, lacking a global storage and retrieval mechanism for high-quality, unique pathological features. When the target modality data is completely missing, the network struggles to generate accurate local pathological features based solely on shared features, resulting in low accuracy of missing feature compensation and high local distortion rates in the synthesized images.

[0006] Furthermore, in the network optimization process of multimodal generative models, existing techniques mostly rely on pixel-level reconstruction loss and conventional adversarial loss, lacking strict contrast constraints in the deep semantic feature space. This makes the model prone to mode collapse during training. The generated images may be relatively smooth in the shallow visual features, but there are significant modal gaps in the deep feature distribution compared to real and complete multimodal images, resulting in unstable generation quality.

Summary of the Invention

[0007] To overcome the shortcomings of the prior art, this invention provides a medical image generation method and system based on feature decoupling compensation and dynamic comparison, which can effectively decouple features, accurately compensate for missing features, and perform strict optimization constraints in the feature space to generate modal medical images, thereby improving the fidelity and clinical usability of missing modal generation.

[0008] To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions:
Firstly, a medical image generation method based on feature decoupling compensation and dynamic comparison is disclosed, including:
acquiring a locally missing multimodal medical image to be processed;
for the locally missing multimodal medical image, introducing a bidirectional contrastive decoupling loss as a constraint, and extracting common features that are consistent across modalities and unique features specific to each modality;
constructing and updating a modality-specific feature memory using an exponential moving average strategy;
mapping the common features to pseudo-text, and retrieving and compensating the missing unique features based on a cross-attention mechanism;
performing multi-source fusion of the common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modalities, and inputting the result into the generator of a generative adversarial network to reconstruct and output a complete multimodal medical image; and
optimizing the generator using a dynamic modality contrastive learning strategy to improve the quality of image generation.

[0009] As a further technical solution, a dual-branch feature extraction network is used to extract features. The dual-branch feature extraction network includes a modality-shared branch and a modality-specific branch. The modality-shared branch shares all network weight parameters across different input modalities and performs feature extraction through the same set of convolutional kernels, extracting high-level anatomical semantic information that is highly consistent across modalities and outputting a two-dimensional modality-shared feature map. The modality-specific branch consists of four parallel sub-encoders with independent network weights, each corresponding to a different modality. Each sub-encoder captures low-level visual features unique to its corresponding modality and outputs a two-dimensional feature map specific to that modality.

[0010] As a further technical solution, a bidirectional contrastive decoupling loss is introduced as a constraint. It specifically includes a common feature contrast loss and a specific feature contrast loss. The common feature contrast loss is used to shorten the Euclidean distance between common features of different modalities of the same patient in the feature space, and to widen the distance between common features of different patients. The specific feature contrast loss is used to narrow the distance between specific features of the same modality in different patients to form modality clusters, while forcibly widening the distance between specific features of different modalities in the same patient to achieve feature orthogonality.

[0011] As a further technical solution, an exponential moving average strategy is used to construct and update a modality-specific feature memory, specifically including: initializing a unique feature memory of fixed capacity for each input modality; in the initial training state, the prototype features in the unique feature memory can be initialized using a random normal distribution, or assigned values using the unique features extracted from the first training batch. During the model training phase, when one or more real modalities exist in the input sequence, the features extracted from the current batch of each such modality are smoothly and dynamically merged into the feature memory of the corresponding modality using an exponential moving average strategy.

[0012] As a further technical solution, the specific update steps are as follows: obtain the unique features extracted in the current iteration cycle, introduce a momentum hyperparameter to control the update smoothness, perform a weighted summation of the unique features and the prototype features stored in the unique feature memory in the previous iteration cycle, obtain the prototype features of the current iteration cycle, and overwrite and update them in the unique feature memory.

[0013] As a further technical solution, a cross-attention mechanism is used to retrieve and compensate for missing specific features, specifically including: First, query vector generation is performed: the extracted cross-modal shared features are input into a semantic inversion module composed of a multilayer perceptron, which maps high-level, abstract anatomical semantic features into a continuous latent space vector, i.e., a pseudo-text query vector. Secondly, key and value extraction is performed: from the modality-specific feature memory that is built and continuously updated, the stored modality prototype features are read and used as key matrices and value matrices respectively. Subsequently, cross-attention calculation based on mask gating is performed: the above pseudo-text query vector is used as a query matrix and input into the cross-attention module, and similarity calculation is performed with the key matrix. A modality missing mask is introduced for gating filtering, and the modality missing mask is directly applied to the internal process of similarity calculation to obtain the final two-dimensional compensation feature.

[0014] As a further technical solution, when reconstructing and outputting a complete multimodal medical image, the specific steps include: First, collect all available 2D feature map sources, including: extracted modality common features, extracted existing modality specific features, and output compensated missing modality specific features; Secondly, the three types of feature maps are concatenated and stitched together along the channel dimension. Due to the surge in the number of channels, the stitched high-dimensional feature map is then fed into the convolutional layer for dimensionality reduction and cross-channel information interaction processing, thereby unifying the number of channels and obtaining a deep fusion full-modal complete feature map. Finally, the fused full-modal complete feature map is input into a generator based on a generative adversarial network. This generator contains a series of two-dimensional deconvolution modules that recover the spatial resolution of the feature map layer by layer, and finally outputs four complete medical images with high-fidelity anatomical structure and pathological texture details.

[0015] As a further technical solution, adversarial loss, reconstruction loss and generation loss are introduced as joint constraints in the game training between the generator and the discriminator.

[0016] As a further technical solution, the generator is optimized using a dynamic modality contrastive learning strategy, specifically including:
Constructing positive samples: a pre-trained full-modal autoencoder is introduced, whose parameters are completely frozen at the current stage. A complete medical image slice containing all modalities, with none missing, is input into the frozen autoencoder, and its deep features are extracted as the reference benchmark, i.e., the positive sample features.
Constructing anchor points: the medical image to be processed, carrying the missing mask, is input into the generative network whose gradients are currently being updated, and the features output by a specific hidden layer are extracted as anchor point features.
Constructing negative samples: a historical version model that does not participate in direct gradient backpropagation is introduced. The network weights of this historical model are derived from the parameters of the current generative network through a momentum update strategy based on an exponential moving average. The same missing medical image to be processed is input into this historical version model, and its features are extracted as negative sample features.
Calculating the dynamic modal contrast loss: in the feature space, the similarity distances between the anchor point, the positive sample, and the negative sample are calculated.
During the end-to-end training of the multimodal medical image generation network, all the loss functions mentioned above (the dynamic modal contrast loss, adversarial loss, reconstruction loss, generation loss, and bidirectional contrastive decoupling loss) are weighted and fused to obtain the final total loss function used for network gradient backpropagation and parameter update.

[0017] Secondly, a medical image generation system based on feature decoupling compensation and dynamic comparison is disclosed, including: The feature extraction and decoupling module is configured to: acquire the locally missing multimodal medical image to be processed; for the locally missing multimodal medical image, introduce bidirectional contrastive decoupling loss for constraint, and extract cross-modal consistent common features and modality-specific features; The memory building module is configured to: build and update the modality-specific feature memory using an exponential moving average strategy; The feature retrieval and compensation module is configured to: map common features to pseudo-text, and retrieve and compensate for missing unique features based on the cross-attention mechanism; The multi-source fusion and generation module is configured to: perform multi-source fusion of the common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modalities, and input them into the generator of the generative adversarial network to reconstruct and output a complete multimodal medical image; The dynamic contrast optimization module is configured to optimize the generator using a dynamic modal contrast learning strategy to improve the image generation quality.

[0018] The above one or more technical solutions have the following beneficial effects: The technical solution of this invention introduces a feature decoupling mechanism into multimodal medical image processing. By constructing a unique feature memory library and combining it with mask-gated cross-attention aggregation of missing features, and finally using dynamic modality contrastive learning for optimization, it achieves accurate feature compensation and high-fidelity image synthesis in the case of local missing features in multimodal scenarios, thereby improving the accuracy and robustness of assisted diagnosis and segmentation tasks.

[0019] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0020] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0021] Figure 1 is a flowchart of the multimodal medical image generation method based on feature decoupling compensation and dynamic contrast learning in Embodiment 1 of the present invention;
Figure 2 is the overall network architecture diagram of the multimodal medical image generation method based on feature decoupling compensation and dynamic contrast learning in Embodiment 1 of the present invention;
Figure 3 is a schematic diagram of the feature retrieval compensation mechanism based on cross-attention in Embodiment 1 of the present invention;
Figure 4 is a schematic diagram of the dynamic modal contrast loss optimization strategy in Embodiment 1 of the present invention.

Detailed Implementation

[0022] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0023] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.

[0024] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0025] It should be noted beforehand that the feature extraction and generation network in this embodiment adopts a two-dimensional convolutional neural network (2D CNN) architecture. For three-dimensional medical image volume data, before inputting it into the network, it is first divided into two-dimensional image slices along a specific axis, such as axial, coronal, or sagittal. The network processes the data slice by slice. Accordingly, all the features circulating within the network are two-dimensional feature maps.
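As a minimal sketch of the slicing step, assuming a volume stored as nested Python lists indexed `volume[z][y][x]`: the function name `slice_volume` and the axis numbering are illustrative, and which axis corresponds to the axial, coronal, or sagittal direction depends on the scan orientation.

```python
from typing import List

Slice = List[List[float]]
Volume = List[Slice]  # indexed as volume[z][y][x]


def slice_volume(volume: Volume, axis: int) -> List[Slice]:
    """Split a 3D volume into a list of 2D slices along one axis (0, 1, or 2)."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:  # one slice per z index, each of shape (height, width)
        return [[row[:] for row in volume[z]] for z in range(depth)]
    if axis == 1:  # one slice per y index, each of shape (depth, width)
        return [[volume[z][y][:] for z in range(depth)] for y in range(height)]
    if axis == 2:  # one slice per x index, each of shape (depth, height)
        return [
            [[volume[z][y][x] for y in range(height)] for z in range(depth)]
            for x in range(width)
        ]
    raise ValueError("axis must be 0, 1, or 2")
```

Each returned slice can then be passed through the 2D network independently, which is the slice-by-slice processing described above.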

[0026] Before performing network forward propagation, this embodiment first defines and preprocesses the input data mathematically. Define the input set of locally missing multimodal medical images as $X = \{x_1, x_2, x_3, x_4\}$, whose elements correspond to the FLAIR, T1, T1ce, and T2 modalities, respectively. Simultaneously, the system receives a modality missing mask vector $M = [m_1, m_2, m_3, m_4]$ corresponding to the input images, where each $m_c$ is a binary indicator of whether the $c$-th modality is present or missing. To eliminate absolute grayscale differences caused by different MRI scanning devices, Z-score normalization preprocessing is performed on each valid input 2D slice:

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c} \tag{1}$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the pixels in the non-zero brain tissue region of the $c$-th modality image.
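The normalization step can be sketched as follows. This is an illustration rather than the patent's implementation: the function name `zscore_nonzero`, the small `eps` stabilizer, and the choice to leave background (zero) pixels at zero are assumptions.

```python
import math
from typing import List


def zscore_nonzero(slice_2d: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Z-score a 2D slice using the mean and std of its non-zero pixels only."""
    values = [p for row in slice_2d for p in row if p != 0]
    if not values:  # empty slice: nothing to normalize
        return [row[:] for row in slice_2d]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((p - mean) ** 2 for p in values) / len(values))
    # Assumption: background pixels stay at zero instead of being shifted.
    return [
        [(p - mean) / (std + eps) if p != 0 else 0.0 for p in row]
        for row in slice_2d
    ]
```

Computing the statistics over non-zero pixels only keeps the large black background of a brain MRI slice from dominating the mean and standard deviation.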

[0027] Embodiment 1: As shown in Figure 1, this embodiment discloses a multimodal medical image generation method based on feature decoupling compensation and dynamic contrastive learning, including:
Step S101: Obtain the locally missing multimodal medical image to be processed, process the input image with a feature extraction network containing modality-specific branches with independent weights and a modality-shared branch with shared weights, and introduce a bidirectional contrastive decoupling loss as a constraint to separate the cross-modality consistent common features from the modality-specific features.

[0028] Step S102: Construct and update the modality-specific feature memory using an exponential moving average strategy. Specifically, initialize an independent specific feature memory for each modality. During the model training phase, dynamically merge the extracted specific features into the corresponding modality's feature memory using an exponential moving average strategy.
Step S103: Map the common features to pseudo-text, and retrieve and compensate the missing specific features based on the cross-attention mechanism. The extracted common features are mapped to pseudo-text query vectors through a multilayer perceptron. Combined with the current input modality missing mask, cross-attention is computed between the pseudo-text query vectors and the specific feature memory corresponding to the missing modality, so as to dynamically extract the specific features of the missing modality.
Step S104: The common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modalities are concatenated along the channel dimension, and the fused features are obtained through dimensionality reduction. These fused features are then input into the generator of the generative adversarial network (GAN) to reconstruct and output a complete multimodal medical image.

[0029] Step S105: Optimize the generator using a dynamic modal contrastive learning strategy: Introduce a pre-trained, parameter-frozen full-modal autoencoder to extract full-modal standard features as positive samples, use the features reconstructed from the current generator network as anchor points, and use the features output from the historical version of the current model as negative samples to construct a dynamic modal contrastive loss to optimize the parameters of the generator network; By narrowing the distance between the current generator network and the pre-trained full-modal autoencoder in the feature space and widening the distance between it and the historical version of the momentum-updated model, the image generation quality is improved.
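The text does not give the exact form of the dynamic modal contrast loss at this point, so the following is only a sketch under stated assumptions: features are flat vectors, similarity is cosine similarity, the loss is an InfoNCE-style ratio with one positive and one negative, and the names `dynamic_contrast_loss`, `momentum_update`, `tau`, and `momentum` (with their default values) are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two flat feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def dynamic_contrast_loss(
    anchor: Sequence[float],    # features from the generator being trained
    positive: Sequence[float],  # features from the frozen full-modal autoencoder
    negative: Sequence[float],  # features from the momentum-updated historical model
    tau: float = 0.07,
) -> float:
    """Pull the anchor toward the positive and push it away from the negative."""
    pos = cosine(anchor, positive) / tau
    neg = cosine(anchor, negative) / tau
    top = max(pos, neg)  # subtract the max for a stable log-sum-exp
    return top + math.log(math.exp(pos - top) + math.exp(neg - top)) - pos


def momentum_update(
    history: Sequence[float], current: Sequence[float], momentum: float = 0.999
) -> List[float]:
    """EMA of parameters: the historical model trails the current generator."""
    return [momentum * h + (1.0 - momentum) * c for h, c in zip(history, current)]
```

In this reading, the loss is small when the anchor is closer to the full-modal reference than to its own historical version, which matches the stated goal of narrowing the first distance and widening the second.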

[0030] In one implementation example, regarding step S101 and in conjunction with the overall architecture diagram of the multimodal medical image generation network shown in Figure 2, the input images first enter the feature extraction and decoupling section. To thoroughly separate the anatomical structural information shared across modalities from the pathological texture information unique to a single modality, this step employs a two-branch feature extraction network: (1) Modality-shared branch: This branch shares all network weight parameters across different input modalities. Regardless of whether the input is a FLAIR or T1 image, the same set of convolutional kernels is used for feature extraction, thereby forcing the network to ignore the differences in grayscale and contrast between modalities, extracting high-level anatomical semantic information that is highly consistent across modalities, such as brain tissue and tumor contours, and outputting a two-dimensional modality-shared feature map.

[0031] (2) Modality-specific branch: This branch consists of four parallel sub-encoders with independent network weights, corresponding to the FLAIR, T1, T1ce and T2 modalities, respectively. The independent weight design allows each sub-encoder to focus on capturing the low-level visual features unique to the corresponding modality, such as the contrast agent-enhanced texture caused by blood-brain barrier disruption unique to the T1ce modality, and outputs a two-dimensional feature map unique to the corresponding modality.

[0032] To ensure complete orthogonal decoupling of the two types of features in the feature space, this embodiment introduces a bidirectional decoupling constraint, namely a bidirectional contrastive decoupling loss, during the feature extraction stage. This loss is composed of a common feature contrast loss $L_{com}$ and a specific feature contrast loss $L_{spe}$. On the one hand, the common feature contrast loss is constructed. Based on the InfoNCE loss framework, the Euclidean distance between common features of different modalities of the same patient is shortened in the feature space, while the distance between common features of different patients is widened, so that the modality-shared branch extracts only cross-modal anatomical structural information. The calculation formula is as follows:

$$L_{com} = -\log \frac{\exp\left(\mathrm{sim}(f_{i,a}^{com}, f_{i,b}^{com})/\tau\right)}{\exp\left(\mathrm{sim}(f_{i,a}^{com}, f_{i,b}^{com})/\tau\right) + \sum_{k=1, k \neq i}^{N} \exp\left(\mathrm{sim}(f_{i,a}^{com}, f_{k,b}^{com})/\tau\right)} \tag{2}$$

where $f_{i,a}^{com}$ and $f_{i,b}^{com}$ represent the common features of modality $a$ and modality $b$ of the same patient $i$; $f_{k,b}^{com}$ represents the common features of modality $b$ of another patient $k$ within the batch; $\mathrm{sim}(\cdot,\cdot)$ is the function computing the cosine similarity between features; $\tau$ is the temperature hyperparameter, determined through empirical settings and test set tuning, with an optimal range of 0.05 to 0.5 and specifically set to 0.07 in this embodiment, which adjusts how much attention the contrast loss gives to hard samples; and $N$ is the batch size.

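[0033] On the other hand, a specific feature contrast loss is constructed. The distance between specific features of the same modality in different patients is reduced to form modality clusters, while the distance between specific features of different modalities in the same patient is increased to achieve feature orthogonality. This ensures that the modality-specific branches extract only the pathological texture and contrast information specific to the corresponding modality. The calculation formula is as follows:

$$L_{spe} = -\log \frac{\exp\left(\mathrm{sim}(f_{i,a}^{spe}, f_{j,a}^{spe})/\tau\right)}{\exp\left(\mathrm{sim}(f_{i,a}^{spe}, f_{j,a}^{spe})/\tau\right) + \sum_{b \neq a} \exp\left(\mathrm{sim}(f_{i,a}^{spe}, f_{i,b}^{spe})/\tau\right)} \tag{3}$$

where $f_{i,a}^{spe}$ and $f_{j,a}^{spe}$ represent the specific features of the same modality $a$ in different patients $i$ and $j$; $f_{i,b}^{spe}$ represents the specific features of a different modality $b$ in the same patient $i$; and $\mathrm{sim}(\cdot,\cdot)$ and $\tau$ are the cosine similarity function and temperature hyperparameter used in the common feature contrast loss.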

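[0034] Ultimately, the bidirectional contrastive decoupling loss in the decoupling phase is the sum of the common feature contrast loss and the specific feature contrast loss, $L_{dec} = L_{com} + L_{spe}$.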
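The two contrast losses described above can be sketched as follows. This is an illustrative InfoNCE-style reading, not the patent's implementation: features are assumed to be flat vectors, all function names are hypothetical, the default temperature 0.07 is the value quoted in this embodiment, and the unweighted sum in `decoupling_loss` is an assumption.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two flat feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def _nce(pos: float, negs: List[float]) -> float:
    """-log(exp(pos) / (exp(pos) + sum(exp(negs)))), computed stably."""
    logits = [pos] + negs
    top = max(logits)
    return top + math.log(sum(math.exp(l - top) for l in logits)) - pos


def common_contrast_loss(
    feats_a: Sequence[Vector], feats_b: Sequence[Vector], tau: float = 0.07
) -> float:
    """feats_a[i], feats_b[i]: common features of patient i in modalities a and b."""
    n = len(feats_a)
    total = 0.0
    for i in range(n):
        pos = cosine(feats_a[i], feats_b[i]) / tau  # same patient, other modality
        negs = [cosine(feats_a[i], feats_b[k]) / tau for k in range(n) if k != i]
        total += _nce(pos, negs)
    return total / n


def specific_contrast_loss(spec: Sequence[Sequence[Vector]], tau: float = 0.07) -> float:
    """spec[i][c]: specific feature vector of patient i in modality c."""
    n, m = len(spec), len(spec[0])
    total, count = 0.0, 0
    for i in range(n):
        for a in range(m):
            # Negatives: other modalities of the same patient.
            negs = [cosine(spec[i][a], spec[i][b]) / tau for b in range(m) if b != a]
            for j in range(n):
                if j != i:  # positives: same modality, different patient
                    pos = cosine(spec[i][a], spec[j][a]) / tau
                    total += _nce(pos, negs)
                    count += 1
    return total / count if count else 0.0


def decoupling_loss(feats_a, feats_b, spec, tau: float = 0.07) -> float:
    """Assumed unweighted sum of the two contrast terms."""
    return common_contrast_loss(feats_a, feats_b, tau) + specific_contrast_loss(spec, tau)
```

The two terms pull in opposite directions by design: the first groups features by patient regardless of modality, the second groups them by modality regardless of patient.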

[0035] Through the above step S101, the sub-technical solution of this embodiment successfully decomposes complex medical image information into common features and unique features from a rigorous mathematical optimization level, laying the foundation for subsequent accurate compensation of missing features.

[0036] In one implementation example, regarding step S102: construct and update the modality-specific feature memory using an exponential moving average strategy.

[0037] In a multimodal medical image generation network, to persistently store the most representative and highest-frequency pathological texture features, i.e., feature prototypes, for each modality, this embodiment initializes an independent and fixed-capacity unique feature memory bank for each input modality (such as FLAIR, T1, T1ce, T2). During initial training, the prototype features in the unique feature memory bank can be initialized using a random normal distribution, or assigned values using the unique features extracted from the first training batch.

[0038] During model training, when one or more real modalities exist in the input sequence, the system extracts the unique features of the current batch of that modality and uses an exponential moving average (EMA) strategy to smoothly and dynamically update the corresponding modality's unique feature memory. This update mechanism ensures that the stored modality texture prototypes can absorb the latest feature distribution while retaining stable knowledge from the historical training process.

[0039] The specific update steps are as follows: obtain the unique features extracted in the current iteration cycle, introduce a momentum hyperparameter to control the update smoothness, perform a weighted summation of the unique features and the prototype features stored in the unique feature memory in the previous iteration cycle, obtain the prototype features of the current iteration cycle, and update them to the unique feature memory.

[0040] The calculation formula for the above dynamic update is as follows:

$$B_c^{(t)} = \alpha B_c^{(t-1)} + (1 - \alpha) f_c^{spe} \tag{4}$$

where $B_c^{(t)}$ denotes the prototype features of the current iteration cycle stored in the unique feature memory of modality $c$ after the $t$-th training iteration; $B_c^{(t-1)}$ denotes the prototype features of the previous iteration cycle, i.e., after the $(t-1)$-th iteration; and $f_c^{spe}$ denotes the currently extracted unique features of modality $c$. In a preferred implementation of this embodiment, the momentum hyperparameter $\alpha$ can be set between 0.9 and 0.999 to ensure the stability of the feature prototypes in the memory and avoid drastic fluctuations caused by noise in a single batch.
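A minimal sketch of the memory update follows, assuming one prototype vector per modality (the source speaks of a fixed-capacity memory, which may hold more) and initialization from the first observed batch, which is one of the two initialization options mentioned above. The class name `SpecificFeatureMemory` and its methods are hypothetical.

```python
from typing import Dict, List, Sequence


class SpecificFeatureMemory:
    """Per-modality prototype store updated by an exponential moving average."""

    def __init__(self, modalities: Sequence[str], momentum: float = 0.99) -> None:
        if not 0.0 <= momentum <= 1.0:
            raise ValueError("momentum must lie in [0, 1]")
        self.modalities = tuple(modalities)
        self.momentum = momentum
        self.prototypes: Dict[str, List[float]] = {}

    def update(self, modality: str, feature: Sequence[float]) -> List[float]:
        """Blend the current batch feature into the stored prototype."""
        if modality not in self.modalities:
            raise KeyError(modality)
        old = self.prototypes.get(modality)
        if old is None:
            # Assumption: the first real batch initializes the prototype.
            new = list(feature)
        else:
            a = self.momentum
            new = [a * p + (1.0 - a) * f for p, f in zip(old, feature)]
        self.prototypes[modality] = new
        return new
```

Only modalities actually present in a batch are updated, so the prototype of a frequently missing modality changes more slowly but is never overwritten with synthesized features.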

[0041] In one implementation example, regarding step S103 (mapping common features to pseudo-text, and retrieving and compensating the missing specific features based on a cross-attention mechanism), cross-attention is computed between the pseudo-text query vector and the specific feature memory corresponding to the missing modality, and the specific features of the missing modality are extracted dynamically by weighting. The specific steps are as follows: the pseudo-text query vector is used as the query matrix Query; the prototype features stored in the specific feature memory corresponding to the missing modality are mapped to the key matrix Key and the value matrix Value, respectively; the similarity weight between the query matrix Query and the key matrix Key is calculated, and the similarity weight is gated and filtered using the modality missing mask; the value matrix Value is weighted and summed using the filtered similarity weight to obtain the specific features of the missing modality used for compensation.

[0042] This is the core step by which the sub-technical solution of this embodiment achieves accurate compensation purely in feature space. With reference to Figure 2 and to Figure 3, the schematic diagram of feature retrieval compensation based on cross-attention: when a modality is missing from the input data, for example when the T1ce modality is missing as in Figure 2, the system no longer forces the network to randomly generate the missing texture from common features. Instead, it transforms the task into a dictionary-style feature retrieval process. The specific steps are as follows.
First, query vector generation is performed. The cross-modal shared features extracted in S101, denoted as $F_{com}$, are input into a semantic inversion module composed of a Multi-Layer Perceptron (MLP). This module maps high-level, abstract anatomical semantic features into a continuous latent space vector, i.e., a pseudo-text query vector Query, abbreviated as Q. The mathematical expression of its forward mapping process is as follows:

$$Q = \mathrm{MLP}(F_{com}) \tag{5}$$

Next, key and value extraction is performed. From the modality-specific feature memory built and continuously updated in S102, the stored prototype features of each modality are read and used as the key matrix Key, abbreviated as K, and the value matrix Value, abbreviated as V.

[0043] Subsequently, cross-attention calculation based on mask gating is performed. The aforementioned pseudo-text query vector is used as the query matrix Q and input into the cross-attention module for similarity calculation with the key matrix K. A crucial step in this process is the introduction of a modality missing mask for gating filtering. As shown by the dashed arrow in Figure 3, the modality missing mask is applied directly inside the similarity calculation.

[0044] Specifically, let the modality missing mask vector input in this embodiment be $M = [m_1, m_2, m_3, m_4]$, where each $m_c$ takes one of two values: one indicates that the $c$-th modality exists, and the other indicates that the modality is missing. To implement gated filtering, the system uses the mask $M$ to construct an attention penalty matrix $P$. The conversion logic is as follows: if the current target modality $c$ is not missing in the input data, the corresponding penalty term is $P_c = -\infty$; if the current target modality $c$ is a missing modality, the corresponding penalty term is $P_c = 0$.

[0045] After the attention penalty matrix $P$ is added, the final cross-attention retrieval and the weighted summation over the value matrix $V$ are computed as:

$$F_{comp} = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + P\right) V \tag{6}$$

where $d_k$ is the feature channel dimension of the key matrix, used to scale the dot-product result to prevent gradient vanishing, and $F_{comp}$ is the compensation feature, the final output of the attention mechanism.

[0046] In this step, adding negative infinity before the Softmax normalization prevents the network from attending to the features of modalities that are already present: their similarity weights are driven to exactly 0, so that retrieval focuses precisely on the prototype features of the missing modality (such as T1ce). The final output of this operation is a two-dimensional compensated feature map. It not only contains the pathological textures unique to the missing modality, such as tumor enhancement information, but its spatial distribution also closely matches the common anatomical structures of the current patient.
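The mask-gated retrieval of paragraphs [0043] to [0046] can be illustrated with a minimal sketch in plain Python. It assumes one prototype row per modality in the memory, a single query vector, and at least one missing modality; the function names, the modality ordering, and all numbers are hypothetical and are not taken from the embodiment.

```python
import math

MODALITIES = ["FLAIR", "T1", "T1ce", "T2"]  # illustrative ordering (assumption)

def penalty_from_mask(mask):
    """Present modality (1) -> -inf penalty; missing modality (0) -> 0."""
    return [-math.inf if m == 1 else 0.0 for m in mask]

def softmax(scores):
    """Stable softmax that maps -inf scores to a weight of exactly 0.

    Assumes at least one score is finite (at least one modality is missing).
    """
    finite_max = max(s for s in scores if s != -math.inf)
    exps = [0.0 if s == -math.inf else math.exp(s - finite_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_retrieval(query, keys, values, mask):
    """Softmax(q . K^T / sqrt(d_k) + P) V for a single query vector."""
    d_k = len(query)
    penalty = penalty_from_mask(mask)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) + p
        for key, p in zip(keys, penalty)
    ]
    weights = softmax(scores)
    d_v = len(values[0])
    feature = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, feature

# Toy memory: one key row and one value row per modality (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [7.0, 7.0], [1.0, 1.0]]
query = [0.2, 0.9]   # stands in for the pseudo-text query Q
mask = [1, 1, 0, 1]  # T1ce is missing

weights, compensated = masked_retrieval(query, keys, values, mask)
print(dict(zip(MODALITIES, weights)))
print(compensated)
```

With a single missing modality, all attention weight falls on its prototype, so the retrieved feature equals the stored T1ce value row; with several missing modalities the weight is shared among them according to their similarity to the query.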

[0047] In one implementation example, regarding step S104: the common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modality undergo multi-source fusion and are input into the generator of the generative adversarial network (GAN) to reconstruct and output a complete multimodal medical image.

[0048] After completing feature-level retrieval and compensation, the system needs to integrate the effective information from all sources to guide the final image generation. With reference to the multi-source feature fusion and output section on the right side of Figure 2, the fusion and generation steps are as follows. First, the system collects all currently available two-dimensional feature maps: the modality-common features extracted in S101; the modality-specific features of the existing modalities (such as FLAIR, T1, and T2) extracted in S101; and the compensated unique features of the missing modality (T1ce) output in S103.

[0049] Second, the three types of feature maps are concatenated along the channel dimension. Because concatenation sharply increases the number of channels, the concatenated high-dimensional feature map is fed into a 1×1 convolutional layer for dimensionality reduction and cross-channel information interaction. This restores the number of input channels required by the generator network while fusing the multi-source feature information, yielding a deeply fused, full-modality complete feature map:

$$F_{\mathrm{fuse}} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(F_{\mathrm{com}}, F_{\mathrm{spe}}, F_{\mathrm{comp}})\big) \tag{7}$$

where $F_{\mathrm{com}}$ denotes the common features, $F_{\mathrm{spe}}$ the specific features of the existing modalities, and $F_{\mathrm{comp}}$ the compensated features of the missing modality. Finally, the fused feature map $F_{\mathrm{fuse}}$ is fed into the generator G of the generative adversarial network (GAN). The generator contains a series of two-dimensional deconvolution modules that recover the spatial resolution of the feature maps layer by layer, and it finally outputs four complete medical images with high-fidelity anatomical structures and pathological texture details, namely the reconstructed FLAIR, T1, T1ce, and T2.
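The channel concatenation followed by a 1×1 convolution can be sketched as below. This is an illustrative example only: feature maps are nested lists indexed as [channel][row][column], and the channel counts, weights, and bias values are hypothetical rather than those of the embodiment.

```python
def const_map(channels, value, height=2, width=2):
    """Build a [channels][height][width] feature map filled with one value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def concat_channels(*feature_maps):
    """Concatenate [C_i][H][W] feature maps along the channel axis."""
    fused = []
    for fmap in feature_maps:
        fused.extend(fmap)
    return fused

def conv1x1(fmap, weight, bias):
    """1x1 convolution: a per-pixel linear map over channels.

    weight is [C_out][C_in], bias is [C_out]; spatial size is unchanged.
    """
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * fmap[i][y][x] for i in range(c_in))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weight))
    ]

common = const_map(2, 1.0)        # shared features
specific = const_map(3, 2.0)      # e.g. one channel each for FLAIR, T1, T2
compensated = const_map(1, 4.0)   # retrieved T1ce features

stacked = concat_channels(common, specific, compensated)  # 6 channels
weight = [
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # mixes every source
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # passes the compensated channel through
]
bias = [0.0, 1.0]
fused = conv1x1(stacked, weight, bias)  # back to 2 channels
print(len(stacked), len(fused), fused[0][0][0], fused[1][0][0])
```

The 1×1 kernel touches only the channel axis, which is why it can reduce six concatenated channels to the two expected by the next stage without altering the spatial layout.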

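[0050] In the adversarial training between the generator G and the discriminator D, this embodiment applies an adversarial loss $\mathcal{L}_{\mathrm{adv}}$, a reconstruction loss $\mathcal{L}_{\mathrm{rec}}$, and a generation loss $\mathcal{L}_{\mathrm{gen}}$ as joint constraints. To make the generated missing-modality images approximate the true distribution visually and texturally, a standard minimax adversarial loss is employed:

$$\min_G \max_D \mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{y}\big[\log D(y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big] \tag{8}$$

where x is the input image with missing modalities, y is the true complete image, and G(x) is the generated complete image.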
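As a small numerical illustration of the standard minimax objective, the sketch below estimates the value function from discriminator outputs on a batch. The function name and the probabilities are hypothetical; the discriminator is assumed to output values strictly between 0 and 1.

```python
import math

def adversarial_value(d_real, d_fake):
    """Batch estimate of E[log D(y)] + E[log(1 - D(G(x)))].

    d_real: discriminator outputs on real complete images.
    d_fake: discriminator outputs on generated images.
    The discriminator tries to raise this value; the generator tries to lower it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator separates real from generated images well.
confident = adversarial_value([0.9, 0.8], [0.1, 0.2])
# Generator fools the discriminator on the same real batch.
fooled = adversarial_value([0.9, 0.8], [0.6, 0.7])
# Discriminator cannot tell the two apart at all.
undecided = adversarial_value([0.5, 0.5], [0.5, 0.5])
print(confident, fooled, undecided)
```

The value drops when generated images are scored as more realistic, which is the direction in which the generator is trained.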

[0051] Meanwhile, to ensure strict fidelity at the pixel level, a reconstruction loss (using the L1 norm) is calculated for the modalities that are present in the input:

$$\mathcal{L}_{\mathrm{rec}} = \sum_{c=1}^{C} m_c \,\lVert G(x)_c - y_c \rVert_1 \tag{9}$$

where $m_c = 1$ if modality c is present in the input and $m_c = 0$ otherwise, and $G(x)_c$ and $y_c$ are the generated and real images of modality c. For the originally missing modalities, the generation loss $\mathcal{L}_{\mathrm{gen}}$ is calculated over the target region, as given by Equation (10).

In one implementation example, regarding step S105: the generator is optimized using a dynamic modality contrastive learning strategy, which improves image generation quality by narrowing the feature-space distance between the current generation network and the pre-trained full-modality autoencoder and widening its distance from the momentum-updated historical version of the model.

[0052] To further constrain the GAN generator so that the missing modalities it generates are not only visually realistic but also as close as possible to real complete multimodal data in the deep semantic feature space, this embodiment introduces a dynamic modality contrastive learning optimization module. Figure 4 illustrates the dynamic contrastive loss optimization. The specific contrastive training strategy is as follows: (1) Constructing the positive sample (P): a pre-trained full-modality autoencoder whose parameters are completely frozen at the current stage is introduced (shown with a padlock icon in Figure 4). A complete medical image slice containing all modalities (with none missing) is input into this frozen autoencoder, and its deep features are extracted as the reference benchmark, i.e., the positive sample features P.

[0053] (2) Constructing the anchor (A): the medical image to be processed, carrying the missing mask, is input into the generative network currently undergoing gradient updates (the network without a padlock icon in Figure 4), and the features output by a specific hidden layer of that network are extracted as the anchor features A.

[0054] (3) Constructing the negative sample (N): to provide challenging counterexamples, the system introduces a historical version of the model that does not participate in direct gradient backpropagation. Specifically, the historical version is a delayed-update network with exactly the same topology as the current generation network, including identical feature extraction and deconvolution generation modules. Its weights are not randomly initialized; they are derived from the parameters of the current generation network through a momentum update, with the transition smoothed by an exponential moving average (EMA) strategy. The same medical image with missing modalities is input into this historical version, and the features output by the corresponding hidden layer are extracted as the negative sample features N.
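The momentum update of the historical model can be sketched as follows. This is a minimal illustration in which network parameters are represented as a dictionary of flat float lists; the parameter name, the momentum value of 0.9, and the numbers are assumptions made for the example, since the embodiment does not state them here.

```python
def ema_update(history, current, momentum=0.9):
    """Return new historical weights: momentum * history + (1 - momentum) * current.

    No gradient is involved; the historical model only trails the current one.
    """
    return {
        name: [
            momentum * h + (1.0 - momentum) * c
            for h, c in zip(history[name], current[name])
        ]
        for name in history
    }

# The historical model starts as a copy of the current network, not at random.
current = {"encoder.weight": [0.0, 2.0]}
history = {name: list(vals) for name, vals in current.items()}

# Pretend two gradient steps moved the current network's weights.
current = {"encoder.weight": [1.0, 2.0]}
history = ema_update(history, current)   # first EMA step
first_step = history["encoder.weight"][0]
history = ema_update(history, current)   # second EMA step
second_step = history["encoder.weight"][0]
print(first_step, second_step, history["encoder.weight"][1])
```

Because the historical weights lag behind the current network, features extracted from it represent a slightly outdated generator, which is what makes them usable as hard negatives.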

[0055] (4) Calculating the dynamic modality contrastive loss: in the feature space, the similarities between the anchor, the positive sample, and the negative sample are calculated. To pull the anchor A closer to the positive sample P while pushing it away from the negative sample N, the dynamic modality contrastive loss $\mathcal{L}_{\mathrm{dmc}}$ is constructed:

$$\mathcal{L}_{\mathrm{dmc}} = -\log \frac{\exp\big(\mathrm{sim}(A, P)/\tau\big)}{\exp\big(\mathrm{sim}(A, P)/\tau\big) + \exp\big(\mathrm{sim}(A, N)/\tau\big)} \tag{11}$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity between features, and $\tau$ is the temperature hyperparameter that controls the discriminative sharpness of the contrastive learning. Its value is likewise set empirically, with a preferred range of 0.05 to 0.5.
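A push-pull loss of this kind can be sketched with one positive and one negative per anchor. The form below is a standard InfoNCE-style expression built from cosine similarity and a temperature, which is an assumption about the exact shape of Equation (11); the function names, feature vectors, and the choice of tau = 0.1 (inside the stated 0.05 to 0.5 range) are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dynamic_contrast_loss(anchor, positive, negative, tau=0.1):
    """-log( e^{sim(A,P)/tau} / (e^{sim(A,P)/tau} + e^{sim(A,N)/tau}) )."""
    pos = math.exp(cosine_similarity(anchor, positive) / tau)
    neg = math.exp(cosine_similarity(anchor, negative) / tau)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]          # features from the network being trained
frozen_feat = [2.0, 0.0]     # positive P: frozen full-modality autoencoder
history_feat = [0.0, 3.0]    # negative N: EMA historical model

good = dynamic_contrast_loss(anchor, frozen_feat, history_feat)
bad = dynamic_contrast_loss(anchor, history_feat, frozen_feat)
tie = dynamic_contrast_loss(anchor, frozen_feat, frozen_feat)
print(good, bad, tie)
```

The loss is small when the anchor aligns with the frozen autoencoder's features and large when it aligns with the historical model instead; when both are equally similar it equals log 2.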

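[0056] Overall optimization objective: during end-to-end training of the multimodal medical image generation network, the system combines all of the above loss functions in a weighted sum, and the resulting total loss $\mathcal{L}_{\mathrm{total}}$ is used for gradient backpropagation and parameter updates. It is defined as:

$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{dec}} + \lambda_2 \mathcal{L}_{\mathrm{adv}} + \lambda_3 \mathcal{L}_{\mathrm{rec}} + \lambda_4 \mathcal{L}_{\mathrm{gen}} + \lambda_5 \mathcal{L}_{\mathrm{dmc}} \tag{12}$$

where $\mathcal{L}_{\mathrm{dec}}$ is the bidirectional contrastive decoupling loss, $\mathcal{L}_{\mathrm{adv}}$ the adversarial loss, $\mathcal{L}_{\mathrm{rec}}$ the reconstruction loss, $\mathcal{L}_{\mathrm{gen}}$ the generation loss, and $\mathcal{L}_{\mathrm{dmc}}$ the dynamic modality contrastive loss, and $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5$ are the hyperparameter weights of the respective losses, used to balance the contributions of feature decoupling, adversarial generation, pixel fidelity, and the dynamic contrastive feature-space constraint during training.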
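How the mask splits the pixel-level terms and how the weighted sum is formed can be sketched as below. The example assumes that the generation loss also uses an L1 distance over the whole image of each missing modality, which the text does not confirm; the loss values for the decoupling, adversarial, and contrastive terms, the weights, and the tiny two-pixel "images" are all hypothetical.

```python
def l1_distance(pred, target):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def pixel_losses(generated, real, mask):
    """Split the L1 error by the modality mask.

    mask[c] == 1 (present in the input) contributes to the reconstruction loss;
    mask[c] == 0 (missing in the input) contributes to the generation loss.
    """
    rec = sum(l1_distance(g, r) for g, r, m in zip(generated, real, mask) if m == 1)
    gen = sum(l1_distance(g, r) for g, r, m in zip(generated, real, mask) if m == 0)
    return rec, gen

def total_loss(terms, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * terms[name] for name in terms)

# Four modalities (FLAIR, T1, T1ce, T2), each flattened to two pixels.
generated = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0], [1.0, 1.0]]
real = [[1.0, 2.0], [0.0, 1.0], [2.0, 1.0], [1.0, 1.0]]
mask = [1, 1, 0, 1]  # T1ce was missing from the input

rec, gen = pixel_losses(generated, real, mask)
terms = {"dec": 0.5, "adv": 2.0, "rec": rec, "gen": gen, "dmc": 0.25}
weights = {"dec": 1.0, "adv": 1.0, "rec": 10.0, "gen": 10.0, "dmc": 0.5}
loss = total_loss(terms, weights)
print(rec, gen, loss)
```

Keeping the terms in a dictionary makes it straightforward to retune individual weights while balancing the training objectives.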

[0057] This completes the description of all steps and the end-to-end optimization logic of the multimodal medical image generation method based on feature decoupling compensation and dynamic contrastive learning provided in Embodiment 1 of the present invention.

[0058] The method described in this embodiment is mainly used to solve the technical problem that in actual clinical scanning, due to reasons such as patient non-cooperation, scanning time limitations, or image artifacts, some key MRI modalities (such as the T1ce modality, which is extremely sensitive to the enhancement texture features inside the tumor) are missing, which in turn affects subsequent auxiliary diagnosis and segmentation tasks.

[0059] A multimodal medical image generation method and system based on feature decoupling compensation and dynamic contrastive learning is proposed. A bidirectional feature decoupling framework is proposed, which eliminates cross-modal feature entanglement by strictly separating common anatomical features from specific pathological features. In addition, a cross-attention feature retrieval mechanism based on mask gating is proposed, which accurately compensates for key missing modalities (such as T1ce) by consulting the memory bank in the pure feature space. Finally, a dynamic contrastive learning strategy including a historical momentum model is used to optimize the generator, measure and narrow the modal gap in the deep feature space, so as to achieve high-quality and high-fidelity multimodal medical image synthesis.

[0060] Specifically, this embodiment proposes a new feature decoupling and constraint paradigm. For the first time, a bidirectional contrastive decoupling loss is introduced into the multimodal generation task. It resolves the feature confusion caused by conventional channel concatenation, enforces the orthogonality between inter-modality consistency and intra-modality specificity, and effectively prevents the loss of key pathological texture information.

[0061] Specifically, in this embodiment, a masked feature retrieval mechanism is proposed. By constructing a continuously updated feature memory and performing gated cross-attention calculation, accurate feature-level compensation based on global prior knowledge is achieved, significantly reducing the local distortion rate.

[0062] Specifically, this embodiment proposes a dynamic modality contrastive learning optimization strategy. By introducing a pre-trained autoencoder and a delayed-update historical model to form a triplet (positive sample, anchor, negative sample), push-pull constraints are applied in the deep semantic space. This effectively overcomes the optimization bottleneck of traditional GANs, avoids mode collapse, and gives the generated images greater value for clinical diagnostic assistance.

[0063] This embodiment introduces feature decoupling and mask-gated cross-attention mechanism into multimodal image processing. By consulting the memory bank in the pure feature space, it accurately compensates for key missing modalities and uses dynamic contrastive learning to narrow the modal gap in the deep feature space, thereby achieving high-quality, high-fidelity multimodal medical image synthesis.

[0064] Embodiment 2: The purpose of this embodiment is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described method.

[0065] Embodiment 3: The purpose of this embodiment is to provide a computer-readable storage medium.

[0066] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the steps of the above method.

[0067] Embodiment 4: The purpose of this embodiment is to provide a medical image generation system based on feature decoupling compensation and dynamic contrast, including:
a feature extraction and decoupling module, configured to acquire the partially missing multimodal medical image to be processed and, for the partially missing multimodal medical image, introduce a bidirectional contrastive decoupling loss as a constraint and extract cross-modally consistent common features and modality-specific unique features;
a memory building module, configured to build and update the modality-specific feature memory using an exponential moving average strategy;
a feature retrieval and compensation module, configured to map the common features to pseudo-text and to retrieve and compensate for the missing unique features based on the cross-attention mechanism;
a multi-source fusion and generation module, configured to perform multi-source fusion of the common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modalities, and to input the result into the generator of the generative adversarial network to reconstruct and output a complete multimodal medical image; and
a dynamic contrast optimization module, configured to optimize the generator using a dynamic modality contrastive learning strategy to improve the image generation quality.

[0068] Embodiment 5: The purpose of this embodiment is to provide a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods and functions involved in any of the above embodiments.

[0069] The steps and methods involved in the apparatus of the above embodiments correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.

[0070] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.

[0071] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.

Claims

1. A medical image generation method based on feature decoupling compensation and dynamic contrast, characterized by comprising: acquiring a partially missing multimodal medical image to be processed; for the partially missing multimodal medical image, introducing a bidirectional contrastive decoupling loss as a constraint, and extracting cross-modally consistent common features and modality-specific unique features; constructing and updating a modality-specific feature memory using an exponential moving average strategy; mapping the common features to pseudo-text, and retrieving and compensating for the missing unique features based on a cross-attention mechanism; performing multi-source fusion of the common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modalities, and inputting the result into a generator of a generative adversarial network to reconstruct and output a complete multimodal medical image; and optimizing the generator using a dynamic modality contrastive learning strategy to improve the quality of image generation.

2. The medical image generation method based on feature decoupling compensation and dynamic contrast as described in claim 1, characterized in that features are extracted using a dual-branch feature extraction network comprising a modality-shared branch and a modality-specific branch; the modality-shared branch shares all network weight parameters across the different input modalities and performs feature extraction with the same set of convolutional kernels, extracting high-level anatomical semantic information that is highly consistent across modalities and outputting a two-dimensional modality-shared feature map; the modality-specific branch consists of four parallel sub-encoders with independent network weights, each corresponding to a different modality, and each sub-encoder captures the low-level visual features unique to its corresponding modality and outputs a two-dimensional feature map unique to that modality; preferably, the bidirectional contrastive decoupling loss introduced as a constraint specifically includes a common feature contrastive loss and a unique feature contrastive loss, wherein the common feature contrastive loss is used to shorten the Euclidean distance in the feature space between the common features of different modalities of the same patient and to widen the distance between the common features of different patients, and the unique feature contrastive loss is used to narrow the distance between the unique features of the same modality in different patients so as to form modality clusters, while forcibly widening the distance between the unique features of different modalities in the same patient so as to achieve feature orthogonality.

3. The medical image generation method based on feature decoupling compensation and dynamic contrast as described in claim 1, characterized in that constructing and updating the modality-specific feature memory using an exponential moving average strategy specifically includes: initializing a unique feature memory of fixed capacity for each input modality, wherein, in the initial training state, the prototype features in the unique feature memory are initialized from a random normal distribution or assigned the unique features extracted from the first training batch; and, during the model training phase, when one or more real modalities exist in the input sequence, smoothly and dynamically updating the feature memory of the corresponding modality with the features extracted from the current batch of that modality using the exponential moving average strategy; preferably, the specific update steps are as follows: obtaining the unique features extracted in the current iteration, introducing a momentum hyperparameter to control the update smoothness, performing a weighted summation of these unique features and the prototype features stored in the unique feature memory in the previous iteration to obtain the prototype features of the current iteration, and overwriting the unique feature memory with them.

4. The medical image generation method based on feature decoupling compensation and dynamic contrast as described in claim 1, characterized in that retrieving and compensating for the missing unique features based on the cross-attention mechanism specifically includes: first, performing query vector generation, in which the extracted cross-modal shared features are input into a semantic inversion module composed of a multilayer perceptron, which maps the high-level, abstract anatomical semantic features to a continuous latent-space vector, i.e., a pseudo-text query vector; second, performing key and value extraction, in which the stored modality prototype features are read from the modality-specific feature memory that is built and continuously updated, and are used as the key matrix and the value matrix respectively; and subsequently, performing mask-gated cross-attention calculation, in which the pseudo-text query vector is used as the query matrix and input into the cross-attention module for similarity calculation with the key matrix, a modality missing mask is introduced for gated filtering, and the modality missing mask is applied directly inside the similarity calculation to obtain the final two-dimensional compensation feature.

5. The medical image generation method based on feature decoupling compensation and dynamic contrast as described in claim 1, characterized in that reconstructing and outputting a complete multimodal medical image specifically includes: first, collecting all available two-dimensional feature maps, including the extracted modality-common features, the extracted specific features of the existing modalities, and the output compensated specific features of the missing modality; second, concatenating the three types of feature maps along the channel dimension and, because the number of channels increases sharply, feeding the concatenated high-dimensional feature map into a convolutional layer for dimensionality reduction and cross-channel information interaction, thereby unifying the number of channels and obtaining a deeply fused full-modality complete feature map; and finally, inputting the fused full-modality complete feature map into a generator based on a generative adversarial network, wherein the generator contains a series of two-dimensional deconvolution modules that recover the spatial resolution of the feature map layer by layer and finally outputs four complete medical images with high-fidelity anatomical structure and pathological texture details.

6. The medical image generation method based on feature decoupling compensation and dynamic contrast as described in claim 5, characterized in that, in the adversarial training between the generator and the discriminator, an adversarial loss, a reconstruction loss, and a generation loss are introduced as joint constraints; preferably, optimizing the generator using the dynamic modality contrastive learning strategy specifically includes: constructing positive samples, in which a pre-trained full-modality autoencoder whose parameters are completely frozen at the current stage is introduced, a complete medical image slice containing all modalities with none missing is input into the frozen autoencoder, and its deep features are extracted as the reference benchmark, i.e., the positive sample features; constructing anchors, in which the medical image to be processed, carrying the missing mask, is input into the generative network currently undergoing gradient updates, and the features output by a specific hidden layer are extracted as the anchor features; constructing negative samples, in which a historical version model that does not participate in direct gradient backpropagation is introduced, the network weights of the historical version model being derived from the parameters of the current generation network through a momentum update with an exponential moving average providing a smooth transition, the same medical image with missing modalities is input into the historical version model, and its features are extracted as the negative sample features; and calculating the dynamic modality contrastive loss, in which the similarity distances between the anchor, the positive sample, and the negative sample are calculated in the feature space; and, during the end-to-end training of the multimodal medical image generation network, all of the above loss functions, including the dynamic modality contrastive loss, the adversarial loss, the reconstruction loss, the generation loss, and the bidirectional contrastive decoupling loss, are combined in a weighted sum to obtain the final total loss function used for gradient backpropagation and parameter updates.

7. A medical image generation system based on feature decoupling compensation and dynamic contrast, characterized by comprising: a feature extraction and decoupling module, configured to acquire the partially missing multimodal medical image to be processed and, for the partially missing multimodal medical image, introduce a bidirectional contrastive decoupling loss as a constraint and extract cross-modally consistent common features and modality-specific unique features; a memory building module, configured to build and update the modality-specific feature memory using an exponential moving average strategy; a feature retrieval and compensation module, configured to map the common features to pseudo-text and to retrieve and compensate for the missing unique features based on the cross-attention mechanism; a multi-source fusion and generation module, configured to perform multi-source fusion of the common features, the unique features of the non-missing modalities, and the compensated unique features of the missing modalities, and to input the result into the generator of the generative adversarial network to reconstruct and output a complete multimodal medical image; and a dynamic contrast optimization module, configured to optimize the generator using a dynamic modality contrastive learning strategy to improve the image generation quality.

8. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 6.

9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method described in any one of claims 1 to 6.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it performs the steps of the method described in any one of claims 1 to 6.