Model training method, device and storage medium for medical image segmentation
By modulating the dynamic network structure and constructing intermediate domain samples, the problem of large cross-domain performance fluctuations in medical image segmentation is solved, improving the adaptability and stability of the model in multi-domain scenarios, and achieving higher overall performance and smaller inter-domain performance fluctuations.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANCHANG UNIV
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing medical image segmentation technologies suffer from large performance fluctuations and insufficient stability in multi-domain scenarios, in particular the feature mismatch and segmentation performance degradation caused by domain shift across different hospitals, equipment, and scanning protocols.
Labeled and unlabeled samples are acquired and subjected to weak and strong enhancement, and pseudo-labels and intermediate domain samples are generated. Dynamic network structure modulation is applied with the teacher network and student network: structural weights are generated from domain-specific statistical features, and multi-receptive-field dynamic network structure modulation is performed. Smoother intermediate domain samples are constructed to reduce pseudo-label noise, and appropriate receptive fields are selected adaptively.
The method improves the cross-domain adaptability and stability of medical image segmentation, reduces feature mismatch caused by domain shift, enhances generalization ability and segmentation consistency in multi-center, multi-domain unlabeled scenarios, and achieves more robust semi-supervised training.
Figure CN122244066A_ABST
Description
Technical Field
[0001] This application belongs to the field of artificial intelligence technology, specifically relating to a model training method, device, and storage medium for medical image segmentation.
Background Technology
[0002] With the development of deep learning, medical image segmentation has been widely used for the automatic segmentation of structures such as the fundus, prostate, and heart, providing an important basis for computer-aided diagnosis. Semi-supervised medical image segmentation is one such technology.
[0003] In traditional semi-supervised medical image segmentation techniques, it is often assumed that labeled and unlabeled data originate from the same distribution, and mechanisms such as consistency regularization and pseudo-labels are used to improve model performance. However, in actual clinical practice, medical images often come from different hospitals, different equipment, and different scanning protocols, resulting in significant "domain shift" and a substantial decrease in segmentation performance on new domains.
[0004] To mitigate domain shift, some methods use data augmentation, Mixup, and CutMix to construct "intermediate domain" samples in pixel space, or align feature distributions through domain adaptation and contrastive learning, or employ a fixed network structure and fixed parameter capacity to model samples from different domains uniformly. However, these techniques share at least the following shortcomings: (1) Pixel-space blending (such as Mixup and CutMix) mainly performs linear or block-level combination at the input layer, which makes it difficult to explicitly characterize multi-scale differences in the high-order statistical structure of different domains; global distribution alignment (such as common domain adaptation losses) usually emphasizes consistency of the overall distribution and easily ignores domain-specific statistical shifts in local tissue structures, boundary regions, and small target regions. Such methods are therefore prone to "surface alignment but structural mismatch" in complex medical image scenarios, resulting in insufficient stability of cross-domain segmentation.
[0005] (2) Because imaging equipment, scanning protocols, and lesion appearance differ significantly across multi-center medical images, a fixed structure cannot adaptively adjust the receptive field, the feature channel response, and the strength of high-level and low-level information fusion to the statistical characteristics of each domain, which easily leads to "overfitting in some domains and underfitting in others".
[0006] Therefore, existing static networks often exhibit large cross-domain performance fluctuations and insufficient stability in multi-domain semi-supervised scenarios.
Summary of the Invention
[0007] The purpose of this application is to provide a model training method, device, and storage medium for medical image segmentation, which can solve the problem of how to improve the adaptability, stability, and robustness of cross-domain segmentation of medical images.
[0008] To solve the above-mentioned technical problems, this application is implemented as follows:
In a first aspect, embodiments of this application provide a model training method for medical image segmentation, the method comprising:
A first labeled sample, a second labeled sample, a first unlabeled sample, and a second unlabeled sample are obtained; wherein, the first labeled sample and the first unlabeled sample are obtained by weakly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; and the second labeled sample and the second unlabeled sample are obtained by strongly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; wherein, the labeled sample from the source domain and the unlabeled sample from the target domain are both medical images;
Pseudo-labels are generated based on the first unlabeled sample, weakly enhanced bidirectional intermediate domain samples are generated based on the pseudo-labels, the first labeled sample, and the first unlabeled sample, and strongly enhanced bidirectional intermediate domain samples are generated based on the pseudo-labels, the second labeled sample, and the second unlabeled sample; wherein, the pseudo-labels are obtained from the confidence map output by the teacher network;
Based on the pseudo-label, the first labeled sample, the first unlabeled sample, and the weakly enhanced bidirectional intermediate domain sample, two fused labels corresponding to the weakly enhanced bidirectional intermediate domain sample are obtained, as well as the domain-specific statistical features corresponding to each sample; wherein, the fused labels are obtained by fusing the pseudo-label and the ground truth label corresponding to the first labeled sample;
The second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample are respectively input into the student network to obtain the bottleneck features corresponding to each sample output by the encoder of the student network;
Structural weights are generated based on the specific statistical features of each domain, and the corresponding bottleneck features are modulated using a dynamic network structure with multiple receptive fields based on the structural weights to obtain the modulated features;
Each of the modulated features is input into the decoder of the student network to obtain the image segmentation results;
The network parameters of the student network are updated based on the ground truth labels, the pseudo labels, the two fused labels, and the image segmentation results.
[0009] In a second aspect, embodiments of this application provide a model training apparatus for medical image segmentation, the apparatus comprising:
An acquisition module, used to acquire a first labeled sample, a second labeled sample, a first unlabeled sample, and a second unlabeled sample; wherein the first labeled sample and the first unlabeled sample are obtained by weakly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; the second labeled sample and the second unlabeled sample are obtained by strongly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; wherein the labeled sample from the source domain and the unlabeled sample from the target domain are both medical images.
[0010] The generation module is used to generate pseudo-labels based on the first unlabeled samples, generate weakly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the first labeled samples, and the first unlabeled samples, and generate strongly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the second labeled samples, and the second unlabeled samples; wherein, the pseudo-labels are obtained through the confidence map output by the teacher network.
[0011] The first determining module is used to obtain two fused labels corresponding to the weakly enhanced bidirectional intermediate domain samples, and domain-specific statistical features corresponding to each sample, based on the pseudo-labels, the first labeled samples, the first unlabeled samples, and the weakly enhanced bidirectional intermediate domain samples; wherein, the fused labels are obtained by fusing the pseudo-labels and the ground truth labels corresponding to the first labeled samples.
[0012] The second determining module is used to input the second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample into the student network respectively, and obtain the bottleneck features corresponding to each sample output by the encoder of the student network.
[0013] The modulation module is used to generate structural weights based on the specific statistical features of each domain, and to perform multi-receptive field dynamic network structure modulation on the corresponding bottleneck features based on each structural weight to obtain each modulated feature.
[0014] The input module is used to input each of the modulated features into the decoder of the student network to obtain each image segmentation result.
[0015] The first update module is used to update the network parameters of the student network based on the ground truth labels, the pseudo labels, the two fusion labels, and each of the image segmentation results.
[0016] In a third aspect, embodiments of this application provide a computer device including a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method described in the first aspect.
[0017] In a fourth aspect, embodiments of this application provide a computer-readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the steps of the method described in the first aspect.
[0018] In a fifth aspect, embodiments of this application also provide a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in the first aspect.
[0019] In this embodiment, dynamic network structure modulation driven by domain-specific statistical features enhances the adaptability and stability of cross-domain segmentation. Specifically, structural weights are generated based on the domain-specific statistical features of each domain, and dynamic network structure modulation with multiple receptive fields is applied to the corresponding bottleneck features based on each structural weight, resulting in modulated features. This allows the network's feature extraction path to adaptively adjust with domain changes. Therefore, statistical differences caused by different hospitals, equipment, or protocols can be directly converted into structural modulation signals, i.e., structural weights, eliminating the need for the network to rely on a fixed structure to "average" multi-domain differences. This reduces feature mismatch caused by domain shift, thereby improving generalization ability and segmentation consistency in multi-center, multi-domain unlabeled scenarios. Furthermore, the construction of intermediate domain samples and the synergy of dynamic network structure modulation enable more robust semi-supervised training under strong domain shifts. Specifically, on the one hand, the difference in input distribution between the source and target domains is reduced by using natural intermediate domain samples; on the other hand, the domain-conditional dynamic structure is used to reconstruct the structure and adjust the response amplitude of different domains at the feature level. 
The two form the following closed loop: smoother intermediate domains reduce pseudo-label noise, and the dynamic network further absorbs domain-specific differences and adaptively selects appropriate receptive fields, thereby jointly suppressing instability in cross-domain training, enabling the model to achieve higher overall performance and smaller inter-domain performance fluctuations under the condition of "a small number of labeled single domains + a large number of unlabeled multi-domains".
Attached Figure Description
[0020] Figure 1 is one of the flowcharts of a model training method for medical image segmentation provided in some embodiments of this application;
Figure 2 is another flowchart of the model training method for medical image segmentation provided in some embodiments of this application;
Figure 3 is a structural block diagram of a model training device for medical image segmentation provided in some embodiments of this application;
Figure 4 is an internal structural diagram of a computer device provided in some embodiments of this application.
Detailed Implementation
[0021] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0022] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such use of data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0023] For ease of understanding, the following explains technical terms that may be involved in the embodiments of this application:
Weak enhancement (weak augmentation): refers to applying slight, semantics-preserving transformations to samples, with the aim of enabling the model to learn local invariance and basic features.
[0024] Strong enhancement (strong augmentation): refers to applying drastic transformations that may substantially alter the appearance of the samples, with the aim of improving the robustness of the model so that it can maintain stable predictions under complex disturbances.
[0025] Bidirectional intermediate domain samples: These are manifold intermediate domain samples, specifically transitional samples obtained by spatially weighted fusion of source domain samples and other domain samples using a hybrid mask. They are used to construct continuous transition paths from the source domain to other domains in the semantic feature space.
[0026] Dynamic network structure modulation refers to generating structural weights based on the domain-specific statistical characteristics of the input samples and then weighting and fusing the results of multi-branch convolutions at the bottleneck layer of the student network.
[0027] Domain-specific statistical features: refers to statistical representations that reflect the differences in grayscale distribution, texture patterns, and artifact characteristics among different domains.
[0028] Gram matrix: refers to the channel correlation statistical matrix constructed from feature maps, used to characterize second-order statistical relationships.
[0029] Teacher network and student network: The teacher network is used to generate pseudo-labels, and its network parameters are obtained by updating the network parameters of the student network through exponential moving average (EMA); the student network is used for actual training and inference output.
[0030] Confidence mask: refers to a mask generated based on a confidence threshold, used to mask or downweight low-confidence pixels.
[0031] In one exemplary embodiment, this application proposes a model training method for medical image segmentation. The model training method for medical image segmentation provided by this application will be described in detail below with reference to the accompanying drawings and specific embodiments and application scenarios.
[0032] Referring to Figure 1, the method includes steps 102-114, wherein:
Step 102: Obtain a first labeled sample, a second labeled sample, a first unlabeled sample, and a second unlabeled sample; wherein, the first labeled sample and the first unlabeled sample are obtained by weakly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively, and the second labeled sample and the second unlabeled sample are obtained by strongly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; wherein, the labeled sample from the source domain and the unlabeled sample from the target domain are both medical images.
[0033] In some embodiments, during the model training phase, after receiving training data, training batches can be output from the training data, wherein the training data consists of a set of labeled samples in a first domain and a set of unlabeled samples in at least one other domain.
[0034] The labeled samples from the source domain can be denoted as $\mathcal{D}_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}$, and the unlabeled samples from the target domain can be denoted as $\mathcal{D}_u = \{x_j^u\}_{j=1}^{N_u}$, where $N_l$ and $N_u$ are the numbers of labeled and unlabeled samples, respectively.
[0035] During training, labeled and unlabeled samples can be jointly sampled to form the data required for the current iteration, $\mathcal{B} = \{(x^l, y^l), x^u\}$, which is then fed into the subsequent network and enhancement process as the starting point for subsequent calculations. Here, $x^l$ denotes the labeled samples, $y^l$ the ground truth labels, and $x^u$ the unlabeled samples.
[0036] Step 104: Generate pseudo-labels based on the first unlabeled samples, generate weakly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the first labeled samples, and the first unlabeled samples, and generate strongly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the second labeled samples, and the second unlabeled samples; wherein, the pseudo-labels are obtained through the confidence map output by the teacher network.
[0037] In some embodiments, generating pseudo-labels based on the first unlabeled samples, and generating weakly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the first labeled samples, and the first unlabeled samples, includes:
Input the first unlabeled sample into the teacher network to obtain the confidence map.
[0038] The pseudo-labels are generated based on the confidence map.
[0039] Input the first labeled sample and the first unlabeled sample to the multi-scale feature similarity encoder respectively to obtain the first multi-scale feature and the second multi-scale feature.
[0040] A fused similarity map is generated based on the first and second multi-scale features.
[0041] The hybrid mask is updated based on the pseudo-labels and the fusion similarity map.
[0042] Weakly enhanced bidirectional intermediate domain samples are generated based on the updated hybrid mask, the first labeled sample, and the first unlabeled sample.
[0043] It should be noted that the generation process of strongly enhanced bidirectional intermediate domain samples is basically the same as that of weakly enhanced bidirectional intermediate domain samples, and will not be repeated here.
[0044] In some embodiments, the teacher network serves as the supervised path, providing stable supervision signals in the unlabeled domain; the student network serves as the training path, used for end-to-end parameter optimization and final segmentation learning. The teacher network ($f_{\theta_{\text{tea}}}$) and the student network ($f_{\theta_{\text{stu}}}$) both adopt an encoder-decoder structure.
[0045] In some embodiments, the network parameters of the teacher network are updated from the network parameters of the student network using the EMA method, and the update rule is given by formula (1):
[0046] $\theta_{\text{tea}} \leftarrow \alpha\,\theta_{\text{tea}} + (1-\alpha)\,\theta_{\text{stu}}$ (1)
where $\theta_{\text{tea}}$ represents the network parameters of the teacher network; $\theta_{\text{stu}}$ represents the network parameters of the student network; and $\alpha$ represents the sliding coefficient.
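The EMA update of the teacher parameters can be illustrated with a minimal sketch. It assumes parameters are stored as a dictionary of flat float lists; the function name `ema_update`, the parameter name `alpha` (standing for the sliding coefficient), and the toy values are hypothetical and not taken from the application.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return teacher parameters after one EMA step.

    Each teacher value becomes alpha * teacher + (1 - alpha) * student.
    `alpha` plays the role of the sliding coefficient (assumed name).
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Toy parameters (hypothetical values, for illustration only).
teacher = {"conv.weight": [1.0, 2.0]}
student = {"conv.weight": [0.0, 4.0]}
updated = ema_update(teacher, student, alpha=0.9)
```

With a sliding coefficient close to 1, the teacher changes slowly, which is what lets it act as a stable source of pseudo-labels while the student is trained by gradient descent.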
[0047] In some embodiments, the confidence map is the probability output of the teacher network on the first unlabeled sample, $P = f_{\theta_{\text{tea}}}(x_w^u)$, where $x_w^u$ represents the first unlabeled sample. The pseudo-label is generated by $\hat{y}^u = \arg\max_c P_c$, where $\arg\max$ indicates taking the category with the maximum value.
[0048] In some embodiments, a confidence mask $M_{\text{conf}}$ is constructed based on a confidence threshold $\tau$. $M_{\text{conf}}$ is used to filter or downweight low-confidence pixels, thereby reducing the adverse effects of pseudo-label noise on student training.
[0049] In some embodiments, the confidence mask is derived from a probability matrix of the same size as the medical image. For example, if the image size is 224x224, the probability map size is also 224x224. The value at each pixel of the probability matrix (between 0 and 1) is the teacher network's estimate of whether that pixel is background or foreground. For example, if the value at coordinate (2, 3) in a 224x224 probability map is 0.02, the teacher network considers position (2, 3) in the original image to be background with high confidence, and this point can be retained in the confidence mask.
[0050] A value close to 0.5 indicates that the teacher network's estimate at that position is ambiguous: it is uncertain whether the position is foreground or background. In this case, the region is set to 0 in the confidence mask and is temporarily not used.
[0051] As training progresses, fewer and fewer points will be close to 0.5, and eventually all the points will be close to 1 or 0, at which point the training is almost complete.
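The pseudo-label and confidence-mask step can be sketched as follows. This is only an illustration under stated assumptions: the teacher output is taken to be a per-class probability map stored as nested lists, the mask is a hard 0/1 mask obtained by thresholding the maximum class probability, and the names `pseudo_label_and_mask` and `tau` are hypothetical.

```python
def pseudo_label_and_mask(prob_map, tau=0.9):
    """prob_map[c][i][j] is the teacher probability of class c at pixel (i, j).

    Returns (labels, mask): labels holds the arg-max class per pixel, and
    mask is 1 where the winning probability is at least tau, else 0.
    """
    n_cls = len(prob_map)
    h, w = len(prob_map[0]), len(prob_map[0][0])
    labels = [[0] * w for _ in range(h)]
    mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            probs = [prob_map[c][i][j] for c in range(n_cls)]
            best = max(range(n_cls), key=lambda c: probs[c])
            labels[i][j] = best
            mask[i][j] = 1 if probs[best] >= tau else 0
    return labels, mask


# Toy 2x2 foreground probabilities; background is the complement.
fg = [[0.98, 0.55], [0.02, 0.05]]
bg = [[1.0 - p for p in row] for row in fg]
labels, mask = pseudo_label_and_mask([bg, fg], tau=0.9)
```

In this toy input the pixel with foreground probability 0.55 is ambiguous, so it receives a pseudo-label but is excluded by the mask, while the pixel at 0.02 is kept as confident background.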
[0052] In some embodiments, the multi-scale feature similarity encoder extracts features at different scales from the first labeled sample and the first unlabeled sample, thereby obtaining first multi-scale features and second multi-scale features. It can be understood that the multi-scale feature similarity encoder includes multiple feature extraction layers. The first multi-scale feature $F_l^{(k)}$ and the second multi-scale feature $F_u^{(k)}$ of the $k$-th layer are, respectively, $F_l^{(k)} = E_k(x_w^l)$ and $F_u^{(k)} = E_k(x_w^u)$, where $x_w^l$ and $x_w^u$ are the first labeled sample and the first unlabeled sample, and $E_k$ indicates the $k$-th feature extraction layer.
[0053] In some embodiments, the cosine similarity between the first multi-scale feature and the second multi-scale feature is calculated at corresponding spatial locations for each corresponding scale, yielding a cosine similarity map at each scale. The similarity maps at different scales are upsampled to a uniform resolution and fused to form a fused similarity map $S$.
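A small sketch may help make the multi-scale similarity fusion concrete. The text does not state how the per-scale maps are upsampled or fused, so nearest-neighbour upsampling and plain averaging are assumptions here, as are all function names; features are represented as nested `[C][H][W]` lists.

```python
import math


def cosine_map(feat_a, feat_b, eps=1e-8):
    """Per-pixel cosine similarity between two [C][H][W] feature maps."""
    c, h, w = len(feat_a), len(feat_a[0]), len(feat_a[0][0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a = [feat_a[k][i][j] for k in range(c)]
            b = [feat_b[k][i][j] for k in range(c)]
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            out[i][j] = dot / (na * nb + eps)
    return out


def upsample_nearest(sim, size):
    """Nearest-neighbour resize of an [H][W] map to size x size."""
    h, w = len(sim), len(sim[0])
    return [[sim[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


def fuse_similarity(feats_a, feats_b, size):
    """Average the upsampled per-scale cosine maps (averaging is assumed)."""
    maps = [upsample_nearest(cosine_map(a, b), size)
            for a, b in zip(feats_a, feats_b)]
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(size)]
            for i in range(size)]


# Two scales: a 2-channel 2x2 map and a 2-channel 1x1 map (toy values).
fine_a = [[[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
fine_b = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 0.0]]]
coarse_a = [[[1.0]], [[0.0]]]
coarse_b = [[[1.0]], [[0.0]]]
fused = fuse_similarity([fine_a, coarse_a], [fine_b, coarse_b], size=2)
```

At pixel (0, 1) the fine-scale features are orthogonal while the coarse-scale features agree, so the fused value lands halfway between the two.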
[0054] In some embodiments, the hybrid mask is updated based on the following formula (2):
$M^{(t+1)}(p, c) = \mathrm{clip}\big(M^{(t)}(p, c) + \eta_m\,\Delta M^{(t)}(p, c)\big)$ (2)
where $p$ indicates the pixel position in a medical image; $c$ indicates the category index corresponding to the pseudo-label; $M^{(t+1)}(p, c)$ indicates the hybrid mask value at position $p$ for target category $c$ in the $(t+1)$-th training phase; $\mathrm{clip}(\cdot)$ represents a clipping function that truncates the input value and restricts it to the interval [0,1]; $M^{(t)}(p, c)$ indicates the hybrid mask value at position $p$ for target category $c$ in the $t$-th training phase; $\eta_m$ indicates the update step size of the hybrid mask; $\Delta M^{(t)}(p, c)$ indicates the amount of mask change at position $p$ for target category $c$ in the $t$-th training phase, which is computed from $S(p)$ and $\mu$; $S(p)$ indicates the value of the fused similarity map at position $p$; and $\mu$ represents the centering constant, used to adjust the baseline of the mask variation.
[0055] Here, $\mathrm{clip}(\cdot)$ limits the value range of the hybrid mask so as to achieve a smooth domain transition with low mixing in similar regions and high mixing in different regions.
[0056] Here, the subscript $m$ of $\eta_m$ indicates that the step size belongs to the update process of the hybrid mask, distinguishing it from other parameters (such as the learning rate).
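The clipped, step-wise update of the hybrid mask can be sketched as below. The exact expression for the mask change is not recoverable from the text, so this sketch assumes it is the centering constant minus the fused similarity value, which lowers the mask where the two samples are similar and raises it where they differ. The category index is dropped for brevity, and the names `update_mask`, `eta_m`, and `mu` are hypothetical.

```python
def clip01(value):
    """Restrict a value to the interval [0, 1]."""
    return min(1.0, max(0.0, value))


def update_mask(mask, sim, eta_m=0.5, mu=0.5):
    """One update of a per-pixel hybrid mask.

    ASSUMPTION: the mask change is (mu - sim); the source only says it
    depends on the fused similarity value and a centering constant.
    """
    return [[clip01(m + eta_m * (mu - s)) for m, s in zip(mrow, srow)]
            for mrow, srow in zip(mask, sim)]


mask = [[0.5, 0.5], [0.9, 0.1]]
sim = [[1.0, 0.0], [0.0, 1.0]]   # fused similarity values (toy)
new_mask = update_mask(mask, sim)
```

The clipping keeps every entry a valid mixing weight even after repeated updates, which is why the last two toy entries saturate at 1 and 0.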
[0057] The weakly enhanced bidirectional intermediate domain samples include the first weakly enhanced intermediate domain sample $x_w^{t \to s}$, transitioning from the target domain to the source domain, and the second weakly enhanced intermediate domain sample $x_w^{s \to t}$, transitioning from the source domain to the target domain. The weakly enhanced bidirectional intermediate domain samples are generated based on the following formulas (3) and (4):
$x_w^{t \to s} = x_w^u \odot M + x_w^l \odot (1 - M)$ (3)
$x_w^{s \to t} = x_w^l \odot M + x_w^u \odot (1 - M)$ (4)
where $x_w^u$ refers to the first unlabeled sample; $M$ refers to the updated hybrid mask; and $x_w^l$ refers to the first labeled sample.
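The bidirectional mixing can be illustrated with a short sketch. Which sample is weighted by the mask and which by its complement in each direction is an assumption of this sketch, as are the names `mix` and `bidirectional_mix`; images are single-channel nested lists for brevity.

```python
def mix(a, b, mask):
    """Pixel-wise a * M + b * (1 - M) for [H][W] nested lists."""
    return [[x * m + y * (1.0 - m) for x, y, m in zip(ra, rb, rm)]
            for ra, rb, rm in zip(a, b, mask)]


def bidirectional_mix(x_l, x_u, mask):
    """Build both intermediate samples from one labeled and one unlabeled image.

    ASSUMPTION: the target-to-source sample weights the unlabeled image
    by the mask, and the source-to-target sample swaps the two roles.
    """
    x_t2s = mix(x_u, x_l, mask)
    x_s2t = mix(x_l, x_u, mask)
    return x_t2s, x_s2t


x_l = [[10.0, 10.0]]   # labeled (source) image, toy values
x_u = [[0.0, 0.0]]     # unlabeled (target) image, toy values
mask = [[0.0, 0.25]]
x_t2s, x_s2t = bidirectional_mix(x_l, x_u, mask)
```

Because the two directions use complementary weights, the two intermediate samples always sum to the sum of the inputs at every pixel. The same `mix` helper could be applied to one-hot label maps, since the text states that the labels are fused with the same updated hybrid mask.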
[0058] In some embodiments, the initial value of the hybrid mask can be 0, 1, the centering constant, or the like.
[0059] In some embodiments, strongly enhanced bidirectional intermediate domain samples are generated based on the updated hybrid mask, the second labeled sample, and the second unlabeled sample. This generation process is essentially the same as that for weakly enhanced bidirectional intermediate domain samples, and will not be described again here.
[0060] In this embodiment, the construction of bidirectional intermediate domain samples is a manifold-guided natural intermediate domain construction, which can reduce unnatural boundary artifacts and improve the reliability and accuracy of pseudo-labels. Specifically, this embodiment utilizes a multi-scale feature similarity encoder to generate an adaptive hybrid mask and constructs bidirectional intermediate domain samples from the target domain to the source domain and from the source domain to the target domain. This allows the transition process to be controlled by "weak mixing of similar regions and strong mixing of different regions," avoiding boundary breaks and artifacts caused by pixel linear interpolation or hard stitching. Furthermore, because the intermediate domain samples better conform to the continuous transition of the semantic manifold, the pseudo-labels generated by the teacher network have less fluctuation and less noise, thereby improving the learning quality and cross-domain segmentation accuracy of the student network at fine contours.
[0061] Step 106: Based on the pseudo-label, the first labeled sample, the first unlabeled sample, and the weakly enhanced bidirectional intermediate domain sample, obtain two fused labels corresponding to the weakly enhanced bidirectional intermediate domain sample, as well as the domain-specific statistical features corresponding to each sample; wherein, the fused labels are obtained by fusing the pseudo-label and the ground truth label corresponding to the first labeled sample.
[0062] Among them, domain-specific statistical features are shallow features output by the encoder of the teacher network.
[0063] It should be noted that the teacher network provides stable supervision signals in the unlabeled domain, while the student network is used for end-to-end parameter optimization and final segmentation learning. Therefore, the pseudo-label is the first supervision signal corresponding to the second unlabeled sample, the two fused labels corresponding to the weakly enhanced bidirectional intermediate domain sample are the second supervision signal corresponding to the strongly enhanced bidirectional intermediate domain sample, and the domain-specific statistical features are the third supervision signal when the bottleneck features corresponding to the second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample are dynamically modulated. The second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample are used as input to the student network to obtain the corresponding predicted values, where the ground truth label is the fourth supervision signal for the predicted value corresponding to the second labeled sample.
[0064] In some embodiments, the ground truth label and the pseudo label are fused using the same updated hybrid mask to obtain the second supervisory signal. Specifically, the two fused labels include $y^{t \to s}$, corresponding to the first weakly enhanced intermediate domain sample, and $y^{s \to t}$, corresponding to the second weakly enhanced intermediate domain sample. They are obtained by the following formulas (5) and (6):
$y^{t \to s} = \hat{y}^u \odot M + y^l \odot (1 - M)$ (5)
$y^{s \to t} = y^l \odot M + \hat{y}^u \odot (1 - M)$ (6)
where $\hat{y}^u$ refers to the pseudo-label; $y^l$ indicates the ground truth label; and $M$ is the updated hybrid mask.
The step of obtaining domain-specific statistical features for each sample based on the first labeled sample, the first unlabeled sample, and the weakly enhanced bidirectional intermediate domain sample includes:
For any given sample, input the given sample into the teacher network to obtain the domain-specific statistical features output by the encoder of the teacher network; wherein, the given sample is the first labeled sample, the first unlabeled sample, or the weakly enhanced bidirectional intermediate domain sample.
[0065] Step 108: Input the second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample into the student network to obtain the bottleneck features corresponding to each sample output by the encoder of the student network.
[0066] That is, the second labeled sample, the second unlabeled sample, and each of the two strongly enhanced bidirectional intermediate domain samples correspond to one bottleneck feature, for a total of 4 bottleneck features.
[0067] Step 110: Generate structural weights based on the specific statistical features of each domain, and perform multi-receptive field dynamic network structure modulation on the corresponding bottleneck features based on each structural weight to obtain each modulated feature.
[0068] In some embodiments, generating structural weights based on the specific statistical features of each domain includes:
For any domain-specific statistical feature, adaptive pooling is performed on the domain-specific statistical feature, and a normalized Gram matrix is constructed. The normalized Gram matrix is generated based on the following formula (7):
$G_d = \dfrac{F_d F_d^{\top}}{C_d H W}$ (7)
where $d$ indicates the domain index; $G_d$ indicates the normalized Gram matrix corresponding to the $d$-th domain; $F_d$ indicates the feature matrix of the domain-specific statistical features of the $d$-th domain after adaptive pooling; $F_d^{\top}$ is the transpose of $F_d$; $C_d$ indicates the number of feature channels of the $d$-th domain after pooling; $H$ and $W$ represent the height and width of the feature map after adaptive pooling, respectively; and $C_d H W$ represents the total number of feature elements used for normalization.
[0069] For any normalized Gram matrix, the normalized Gram matrix is flattened and input into the routing network to obtain the structural weights.
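The path from pooled features to structural weights can be sketched as follows. The Gram matrix is normalized by the total number of feature elements, as described above. The routing network's architecture is not specified in the text, so a single linear layer followed by a softmax is assumed here; `normalized_gram`, `route`, and the toy weights are hypothetical.

```python
import math


def normalized_gram(feat):
    """feat has C rows, each a flattened list of H*W pooled activations.

    Returns the C x C channel-correlation matrix divided by C * H * W.
    """
    c, n = len(feat), len(feat[0])
    scale = c * n
    return [[sum(a * b for a, b in zip(feat[i], feat[j])) / scale
             for j in range(c)] for i in range(c)]


def route(gram, weights, bias):
    """Flatten the Gram matrix and map it to branch weights.

    ASSUMPTION: one linear layer plus softmax stands in for the routing
    network, whose real structure is not given in the text.
    """
    flat = [v for row in gram for v in row]
    logits = [sum(w * v for w, v in zip(wrow, flat)) + b
              for wrow, b in zip(weights, bias)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


feat = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0]]   # C=2, H*W=4 (toy)
gram = normalized_gram(feat)
w = route(gram,
          weights=[[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]],
          bias=[0.0, 0.0, 0.0])
```

The softmax makes the structural weights non-negative and sum to one, so they can be read as how much each receptive-field branch contributes for the current domain.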
[0070] In some embodiments, the step of performing multi-receptive-field dynamic network structure modulation on the corresponding bottleneck features based on each of the structural weights to obtain each modulated feature includes: For any structural weight, in the bottleneck layer of the student network, the corresponding bottleneck feature is input to each dilated convolutional branch to obtain multiple branch outputs; The outputs of each branch are weighted and fused based on any of the structural weights, and residual connections are made with the corresponding bottleneck features. Then, an activation operation is performed using ReLU to obtain the modulated features.
[0071] In some embodiments, the modulated features can be obtained by the following formula (8).
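[0072] $$\tilde{Z} = \mathrm{ReLU}\left(\sum_{k=1}^{K} w_k \, B_k(Z) + Z\right) \quad (8)$$ where $\tilde{Z}$ denotes the modulated feature; $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function; $K$ denotes the total number of branches; $B_k$ denotes the $k$-th branch structure; $B_k(Z)$ denotes the output of the $k$-th branch structure; $w_k$ denotes the component of the structural weight applied to the $k$-th branch; and $Z$ denotes the bottleneck feature.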
[0073] It can be understood that this embodiment implements closed-loop adaptation from "domain statistical differences" to "structural response reconstruction".
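The multi-receptive-field modulation described above (dilated branches, weighted fusion, residual connection, ReLU) can be sketched on a one-dimensional signal as follows. This is only an illustrative simplification: the application operates on bottleneck feature maps inside the student network, whereas this sketch assumes a 1-D feature, length-3 kernels with zero padding, and hypothetical helper names `dilated_conv1d` and `modulate`.

```python
def dilated_conv1d(x, kernel, dilation):
    """Zero-padded 1-D dilated convolution with a length-3 kernel."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - 1) * dilation      # taps at i - d, i, i + d
            if 0 <= j < n:
                acc += w * x[j]
        out.append(acc)
    return out


def modulate(x, kernels, dilations, weights):
    """Weighted fusion of dilated branches, plus residual, then ReLU."""
    branches = [dilated_conv1d(x, k, d) for k, d in zip(kernels, dilations)]
    fused = [sum(w * b[i] for w, b in zip(weights, branches))
             for i in range(len(x))]
    return [max(0.0, f + xi) for f, xi in zip(fused, x)]


# Toy example: two identity branches with dilation rates 1 and 2 and
# equal structural weights, so the fused signal equals the input.
bottleneck = [1.0, -2.0, 3.0, -4.0]
identity = [0.0, 1.0, 0.0]
modulated = modulate(bottleneck, [identity, identity], [1, 2], [0.5, 0.5])
```

Because the structural weights multiply whole branch outputs, changing them shifts the effective receptive field without changing any convolution kernel.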
[0074] Step 112: Input each of the modulated features into the decoder of the student network to obtain the image segmentation results.
[0075] It can be understood that the image segmentation result is the prediction result of the student network.
[0076] Step 114: Update the network parameters of the student network based on the ground truth label, the pseudo label, the two fusion labels, and each of the image segmentation results.
[0077] In some embodiments, the ground truth label, pseudo label, two fused labels, and each image segmentation result are used to calculate the loss; the loss includes labeled supervised loss and unsupervised loss. The labeled supervised loss is calculated using the ground truth label and the image segmentation result corresponding to the second labeled sample; the unsupervised loss is calculated using the pseudo label, two fused labels, and the corresponding image segmentation result, wherein the corresponding image segmentation result is the image segmentation result corresponding to the second unlabeled sample and two strongly enhanced bidirectional intermediate domain samples.
[0078] In some embodiments, updating the network parameters of the student network based on the ground truth label, the pseudo label, the two fused labels, and each of the image segmentation results includes: The total loss is calculated based on the following formula (9).
[0079] $$L_{total} = L_{sup} + \lambda \, L_{unsup} \quad (9)$$ where $L_{total}$ denotes the total loss; $L_{sup}$ denotes the labeled supervised loss, which is calculated using the ground truth labels and the corresponding image segmentation results; $\lambda$ denotes the dynamic weighting coefficient; and $L_{unsup}$ denotes the unsupervised loss, which is calculated using the pseudo-labels, the two fused labels, and the corresponding image segmentation results.
[0080] The network parameters of the student network are updated based on backpropagation of the total loss.
[0081] In some embodiments, the model training method for medical image segmentation further includes: based on the exponential moving average (EMA) method, updating the network parameters of the teacher network using the network parameters of the student network. The updated parameters are then written back to both the teacher and student networks for the next training iteration, until the termination condition is met.
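The exponential moving average update of the teacher network can be sketched as below. The parameters are represented here as a plain dictionary of scalars for readability, and the decay value 0.9 is a hypothetical choice; the application does not state a decay value.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher parameters after one EMA step toward the student.

    teacher, student: dicts mapping parameter names to float values.
    decay: assumed smoothing factor (hypothetical, not from the source).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher_params = {"w": 1.0, "b": 0.0}
student_params = {"w": 0.0, "b": 1.0}
teacher_params = ema_update(teacher_params, student_params, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason a teacher's pseudo-labels are more stable than the student's raw predictions.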
[0082] In some embodiments, during the inference phase, the image segmentation results are output by the student network or its moving average model (the teacher network updated by the network parameters of the student network).
[0083] In this embodiment, dynamic network structure modulation driven by domain-specific statistical features enhances the adaptability and stability of cross-domain segmentation. Specifically, structural weights are generated based on the domain-specific statistical features of each domain, and dynamic network structure modulation with multiple receptive fields is applied to the corresponding bottleneck features based on each structural weight, resulting in modulated features. This allows the network's feature extraction path to adaptively adjust with domain changes. Therefore, statistical differences caused by different hospitals, equipment, or protocols can be directly converted into structural modulation signals, i.e., structural weights, eliminating the need for the network to rely on a fixed structure to "average" multi-domain differences. This reduces feature mismatch caused by domain shifts, thereby improving generalization ability and segmentation consistency in multi-center, multi-domain unlabeled scenarios. Furthermore, the construction of intermediate domain samples and the synergy of dynamic network structure modulation enable more robust semi-supervised training under strong domain shifts. Specifically, on the one hand, the difference in input distribution between the source and target domains is reduced by using natural intermediate domain samples; on the other hand, the domain-conditional dynamic structure is used to reconstruct the structure and adjust the response amplitude of different domains at the feature level. 
The two form the following closed loop: smoother intermediate domains reduce pseudo-label noise, and the dynamic network further absorbs domain-specific differences and adaptively selects appropriate receptive fields, thereby jointly suppressing instability in cross-domain training, enabling the model to achieve higher overall performance and smaller inter-domain performance fluctuations under the condition of "a small number of labeled single domains + a large number of unlabeled multi-domains".
[0084] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0085] For ease of understanding, a specific embodiment is described below, as shown in Figure 2: Step 1: Obtain training data, which includes a labeled sample set from a first domain (the source domain) and an unlabeled sample set from at least one other domain (the target domain).
[0086] Step 2: Perform weak and strong augmentation on the labeled and unlabeled samples, respectively, to obtain the first labeled sample, the second labeled sample, the first unlabeled sample, and the second unlabeled sample.
[0087] Step 3: Input the first unlabeled sample into the teacher network to obtain a probability map, i.e., a confidence map, and generate pseudo-labels. Meanwhile, generate a confidence mask based on a confidence threshold.
[0088] Step 4: Input the labeled weakly enhanced sample and the unlabeled weakly enhanced sample separately into the multi-scale layers of the encoder to extract multi-scale features. Calculate the cosine similarity map at each scale, then upsample and fuse these maps to obtain the fused similarity map.
[0089] Step 5: Iteratively update the hybrid mask based on the fused similarity map, and use the hybrid mask to construct the bidirectional intermediate domain samples, including the two strongly enhanced bidirectional intermediate domain samples. At the same time, fuse the source-domain ground truth labels and the target-domain pseudo-labels using the same updated hybrid mask to obtain the two fused labels.
[0090] Step 6: Extract the domain-specific statistical features of the input samples, pool them, calculate the multi-level Gram matrix set, and generate the structural weights through the routing network.
[0091] Step 7: Dynamically modulate the bottleneck features in the bottleneck layer of the student network: the bottleneck features are processed by multiple convolutional branches with different dilation rates, the branch outputs are weighted and fused according to the structural weights, and a residual connection is applied to obtain the dynamically modulated bottleneck features, which are then passed to the decoder to produce the segmentation results.
[0092] Step 8: Calculate the loss and update the model: I. Calculate the labeled supervised loss for the output of the second labeled sample and the true label.
[0093] II. Calculate the unsupervised loss for the output of the second unlabeled sample and the strongly enhanced bidirectional intermediate domain sample and the pseudo-label under the confidence mask constraint.
[0094] III. The total loss is obtained by weighted summation of the labeled supervised loss and the unsupervised loss, and then backpropagated to update the student network parameters.
[0095] IV. Update teacher network parameters based on exponential moving average.
[0096] Repeat steps 2 through 8 until the termination condition is met; output the segmentation results of the student network (or its moving average model) during the inference phase.
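Step 3 of the embodiment above (pseudo-labels and a confidence mask derived from the teacher's probability map) can be sketched as follows. The sketch assumes that the pseudo-label of a pixel is the class with the highest probability and that the mask is binary; both are common choices but are assumptions here, as is the threshold value 0.9 and the function name `pseudo_label_and_mask`.

```python
def pseudo_label_and_mask(prob_map, threshold=0.9):
    """Derive per-pixel pseudo-labels and a binary confidence mask.

    prob_map: list of per-pixel class-probability lists from the teacher.
    threshold: assumed confidence threshold (hypothetical value).
    """
    labels, mask = [], []
    for probs in prob_map:
        confidence = max(probs)
        labels.append(probs.index(confidence))      # assumed argmax rule
        mask.append(1 if confidence >= threshold else 0)
    return labels, mask


# Three pixels, two classes.
teacher_probs = [[0.95, 0.05], [0.40, 0.60], [0.10, 0.90]]
pseudo_labels, confidence_mask = pseudo_label_and_mask(teacher_probs)
```

Pixels whose mask value is 0 keep their pseudo-label but can be excluded or down-weighted when the unsupervised loss is computed.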
[0097] Specifically, the iterative hybrid mask update in step 5 is as follows: the mask change amount is used as the update direction, and the mask value is restricted to the [0,1] range through the Clip operation, so as to reduce the blending intensity in similar regions and increase the blending intensity in dissimilar regions, thereby forming a smooth and natural domain transition.
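A clipped iterative mask update of this kind can be sketched as below. The exact expression for the mask change amount is not reproduced in this text, so the sketch assumes it equals the centering constant minus the fused similarity value, which lowers the mask where the two domains are similar and raises it where they differ. The step size, the centering constant, and the function names are hypothetical, and the per-category indexing is omitted.

```python
def clip01(value):
    """Truncate a value to the interval [0, 1]."""
    return min(1.0, max(0.0, value))


def update_mask(mask, similarity, step=0.1, center=0.5):
    """One clipped update of a flattened hybrid mask.

    Assumed change amount: (center - similarity), so high similarity
    decreases the mask value and low similarity increases it.
    """
    return [clip01(m + step * (center - s))
            for m, s in zip(mask, similarity)]


old_mask = [0.5, 0.5, 0.98]
fused_similarity = [1.0, 0.0, 0.0]
new_mask = update_mask(old_mask, fused_similarity)
```

The third pixel shows the role of clipping: the raw update would exceed 1, so the value is truncated to the upper bound.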
[0098] Specifically, the structural weight generation in step 6 involves: expanding the Gram matrix of each layer and inputting it into a multilayer perceptron to obtain branch weights; then concatenating the multilayer outputs and inputting them into a fusion network to obtain fusion weights, which are then normalized using Softmax to generate a weight vector that can be used for multi-branch structure selection and response amplitude adjustment.
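The structural weight generation described above can be sketched with small hand-written layers. Each multilayer perceptron is reduced to a single linear layer followed by ReLU, and the fusion network to a single linear layer; these simplifications, the parameter values, and the names `linear`, `relu`, `softmax`, and `route` are assumptions for illustration only.

```python
import math


def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gram_vectors, level_params, fusion_params):
    """Map expanded per-level Gram vectors to normalized branch weights."""
    level_outputs = []
    for vec, (w, b) in zip(gram_vectors, level_params):
        level_outputs.extend(relu(linear(vec, w, b)))   # per-level branch weights
    fusion_w, fusion_b = fusion_params
    return softmax(linear(level_outputs, fusion_w, fusion_b))


# Two feature levels, each with a 2-element expanded Gram vector; 3 branches.
gram_vectors = [[0.2, 0.1], [0.4, 0.3]]
identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
level_params = [identity_layer, identity_layer]
fusion_params = ([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]], [0.0, 0.0, 0.0])
structural_weights = route(gram_vectors, level_params, fusion_params)
```

Softmax guarantees that the weights are positive and sum to one, so they can be read directly as a soft selection over the dilated branches.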
[0099] Specifically, the unsupervised loss in step 8 involves calculating the cross-entropy loss and Dice loss for the student network's outputs on strongly enhanced samples and strongly enhanced bidirectional intermediate domain samples, respectively, with the corresponding pseudo-labels. Low-confidence pixel positions are masked or weighted using a confidence mask to reduce the negative impact of erroneous pseudo-labels on training.
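The confidence-masked cross-entropy and Dice terms can be sketched as follows. The sketch assumes a binary confidence mask, averages the cross-entropy over unmasked pixels only, and computes Dice for a single class; the function names and the small constant `eps` are hypothetical.

```python
import math


def masked_cross_entropy(probs, labels, mask, eps=1e-8):
    """Cross-entropy averaged over pixels whose mask value is non-zero."""
    total = sum(-math.log(p[y] + eps) * m
                for p, y, m in zip(probs, labels, mask))
    return total / max(sum(mask), 1)


def masked_dice_loss(probs, labels, mask, cls, eps=1e-8):
    """Soft Dice loss for one class, ignoring masked-out pixels."""
    inter = sum(p[cls] * m for p, y, m in zip(probs, labels, mask) if y == cls)
    pred = sum(p[cls] * m for p, m in zip(probs, mask))
    true = sum(m for y, m in zip(labels, mask) if y == cls)
    return 1.0 - (2.0 * inter + eps) / (pred + true + eps)


# Three pixels, two classes; the third pixel has low confidence.
student_probs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pseudo = [0, 1, 0]
conf_mask = [1, 1, 0]
ce = masked_cross_entropy(student_probs, pseudo, conf_mask)
dice = masked_dice_loss(student_probs, pseudo, conf_mask, cls=1)
```

With the third pixel masked out, both terms are close to zero because the student agrees with the pseudo-labels on every confident pixel.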
[0100] Based on the same inventive concept, this application also provides a model training apparatus for medical image segmentation to implement the model training method for medical image segmentation described above. The solution provided by this apparatus is similar to the implementation described in the above method. Therefore, the specific limitations of one or more embodiments of the model training apparatus for medical image segmentation provided below can be found in the limitations of the model training method for medical image segmentation described above, and will not be repeated here.
[0101] In one exemplary embodiment, as shown in Figure 3, a model training device for medical image segmentation is provided, comprising: an acquisition module 100, a generation module 200, a first determination module 300, a second determination module 400, a modulation module 500, an input module 600, and a first update module 700, wherein: The acquisition module 100 is used to acquire a first labeled sample, a second labeled sample, a first unlabeled sample, and a second unlabeled sample; wherein, the first labeled sample and the first unlabeled sample are obtained by weakly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively, and the second labeled sample and the second unlabeled sample are obtained by strongly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; wherein, the labeled sample from the source domain and the unlabeled sample from the target domain are both medical images.
[0102] The generation module 200 is used to generate pseudo-labels based on the first unlabeled samples, generate weakly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the first labeled samples, and the first unlabeled samples, and generate strongly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the second labeled samples, and the second unlabeled samples; wherein, the pseudo-labels are obtained through the confidence map output by the teacher network.
[0103] The first determining module 300 is used to obtain two fused labels corresponding to the weakly enhanced bidirectional intermediate domain samples, and domain-specific statistical features corresponding to each sample, based on the pseudo-labels, the first labeled samples, the first unlabeled samples, and the weakly enhanced bidirectional intermediate domain samples; wherein, the fused labels are obtained by fusing the pseudo-labels and the ground truth labels corresponding to the first labeled samples.
[0104] The second determining module 400 is used to input the second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample into the student network respectively, and obtain the bottleneck features corresponding to each sample output by the encoder of the student network.
[0105] The modulation module 500 is used to generate structural weights based on the specific statistical features of each domain, and to perform multi-receptive field dynamic network structure modulation on the corresponding bottleneck features based on each structural weight to obtain each modulated feature.
[0106] The input module 600 is used to input each of the modulated features into the decoder of the student network to obtain each image segmentation result.
[0107] The first update module 700 is used to update the network parameters of the student network based on the ground truth label, the pseudo label, the two fusion labels and each of the image segmentation results.
[0108] In one embodiment, the generation module 200 is specifically used for: Input the first unlabeled sample into the teacher network to obtain the confidence map.
[0109] The pseudo-labels are generated based on the confidence graph.
[0110] Input the first labeled sample and the first unlabeled sample to the multi-scale feature similarity encoder respectively to obtain the first multi-scale feature and the second multi-scale feature.
[0111] A fused similarity map is generated based on the first and second multi-scale features.
[0112] The hybrid mask is updated based on the pseudo-labels and the fusion similarity map.
[0113] Weakly enhanced bidirectional intermediate domain samples are generated based on the updated hybrid mask, the first labeled sample, and the first unlabeled sample.
[0114] In one embodiment, the hybrid mask is updated based on the following formula: $$M_c^{(t+1)}(p) = \mathrm{Clip}\left(M_c^{(t)}(p) + \eta \, \Delta M_c^{(t)}(p)\right)$$ where $p$ denotes a pixel position in the medical image; $c$ denotes the category index corresponding to the pseudo-label; $M_c^{(t+1)}(p)$ denotes the hybrid mask value at position $p$ for target category $c$ in the $(t+1)$-th training phase; $\mathrm{Clip}(\cdot)$ denotes a clipping function that truncates its input and restricts it to the interval [0,1]; $M_c^{(t)}(p)$ denotes the hybrid mask value at position $p$ for target category $c$ in the $t$-th training phase; $\eta$ denotes the update step size of the hybrid mask; and $\Delta M_c^{(t)}(p)$ denotes the mask change amount at position $p$ for target category $c$ in the $t$-th training phase, which is determined from $S(p)$, the value of the fused similarity map at position $p$, and $\tau$, the centering constant used to adjust the baseline of the mask change.
[0115] The weakly enhanced bidirectional intermediate domain samples include a first weakly enhanced intermediate domain sample transitioning from the target domain to the source domain and a second weakly enhanced intermediate domain sample transitioning from the source domain to the target domain. The weakly enhanced bidirectional intermediate domain samples are generated by a formula that combines the first unlabeled sample, the updated hybrid mask, and the first labeled sample.
[0116] In one embodiment, the two fused labels include a fused label corresponding to the first weakly enhanced intermediate domain sample and a fused label corresponding to the second weakly enhanced intermediate domain sample, each obtained by a formula that combines the pseudo-label and the ground truth label. The step of obtaining domain-specific statistical features for each sample based on the first labeled sample, the first unlabeled sample, and the weakly enhanced bidirectional intermediate domain samples includes: for any given sample, inputting the given sample into the teacher network to obtain the domain-specific statistical features output by the encoder of the teacher network; wherein, the given sample is the first labeled sample, the first unlabeled sample, or a weakly enhanced bidirectional intermediate domain sample.
[0117] In one embodiment, the modulation module 500 is specifically used for: for any domain-specific statistical feature, performing adaptive pooling on the domain-specific statistical feature and constructing a normalized Gram matrix. The normalized Gram matrix $G_d$ is generated based on the following formula: $$G_d = \frac{F_d F_d^{\top}}{C_d H W}$$ where $d$ denotes the domain index; $G_d$ denotes the normalized Gram matrix corresponding to the $d$-th domain; $F_d$ denotes the feature matrix of the domain-specific statistical features after adaptive pooling; $F_d^{\top}$ denotes the transpose of $F_d$; $C_d$ denotes the number of feature channels of the $d$-th domain after pooling; $H$ and $W$ denote the height and width of the feature map after adaptive pooling, respectively; and $C_d H W$ denotes the total number of feature elements used for normalization.
[0118] For any normalized Gram matrix, expand the normalized Gram matrix and input it into the routing network to obtain the structural weights.
[0119] In one embodiment, the modulation module 500 is specifically used for: For any structural weight, in the bottleneck layer of the student network, the corresponding bottleneck feature is input to each dilated convolutional branch to obtain multiple branch outputs; The outputs of each branch are weighted and fused based on any of the structural weights, and residual connections are made with the corresponding bottleneck features. Then, an activation operation is performed using ReLU to obtain the modulated features.
[0120] In one embodiment, the first update module 700 is specifically used for: calculating the total loss based on the following formula: $$L_{total} = L_{sup} + \lambda \, L_{unsup}$$ where $L_{total}$ denotes the total loss; $L_{sup}$ denotes the labeled supervised loss, which is calculated using the ground truth labels and the corresponding image segmentation results; $\lambda$ denotes the dynamic weighting coefficient; and $L_{unsup}$ denotes the unsupervised loss, which is calculated using the pseudo-labels, the two fused labels, and the corresponding image segmentation results.
[0121] The network parameters of the student network are updated based on backpropagation of the total loss.
[0122] In one embodiment, the model training apparatus for medical image segmentation further includes: The second update module is used to update the network parameters of the teacher network using the network parameters of the student network based on the exponential moving average (EMA) method.
[0123] The modules in the aforementioned model training device for medical image segmentation can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.
[0124] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 4. This computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides the environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communication with external terminals via a network connection. When the computer program is executed by the processor, it implements a model training method for medical image segmentation.
[0125] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0126] In one embodiment, a computer-readable storage medium is provided, on which a program or instructions are stored, which, when executed by a processor, implement the steps in the above-described method embodiments.
[0127] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0128] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.
[0129] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.
[0130] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.
[0131] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A model training method for medical image segmentation, characterized in that, The model training method for medical image segmentation includes: A first labeled sample, a second labeled sample, a first unlabeled sample, and a second unlabeled sample are obtained; wherein, the first labeled sample and the first unlabeled sample are obtained by weakly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; and the second labeled sample and the second unlabeled sample are obtained by strongly enhancing the labeled sample from the source domain and the unlabeled sample from the target domain, respectively; wherein, the labeled sample from the source domain and the unlabeled sample from the target domain are both medical images; Pseudo-labels are generated based on the first unlabeled sample, and weakly enhanced bidirectional intermediate domain samples are generated based on the pseudo-labels, the first labeled sample, and the first unlabeled sample. Strongly enhanced bidirectional intermediate domain samples are generated based on the pseudo-labels, the second labeled sample, and the second unlabeled sample. The pseudo-labels are obtained from the confidence map output by the teacher network. 
Based on the pseudo-label, the first labeled sample, the first unlabeled sample, and the weakly enhanced bidirectional intermediate domain sample, two fused labels corresponding to the weakly enhanced bidirectional intermediate domain sample are obtained, as well as the domain-specific statistical features corresponding to each sample; wherein, the fused labels are obtained by fusing the pseudo-label and the ground truth label corresponding to the first labeled sample; The second labeled sample, the second unlabeled sample, and the strongly enhanced bidirectional intermediate domain sample are respectively input into the student network to obtain the bottleneck features corresponding to each sample output by the encoder of the student network. Structural weights are generated based on the specific statistical features of each domain, and the corresponding bottleneck features are modulated using a dynamic network structure with multiple receptive fields based on the structural weights to obtain the modulated features. Each of the modulated features is input into the decoder of the student network to obtain the image segmentation results; The network parameters of the student network are updated based on the ground truth labels, the pseudo labels, the two fused labels, and the image segmentation results.
2. The model training method for medical image segmentation according to claim 1, characterized in that, The step of generating pseudo-labels based on the first unlabeled sample, and generating weakly enhanced bidirectional intermediate domain samples based on the pseudo-labels, the first labeled sample, and the first unlabeled sample, includes: Input the first unlabeled sample into the teacher network to obtain the confidence map; The pseudo-labels are generated based on the confidence graph; Input the first labeled sample and the first unlabeled sample to the multi-scale feature similarity encoder respectively to obtain the first multi-scale feature and the second multi-scale feature; A fused similarity map is generated based on the first and second multi-scale features; Update the hybrid mask based on the pseudo-labels and the fusion similarity map; Weakly enhanced bidirectional intermediate domain samples are generated based on the updated hybrid mask, the first labeled sample, and the first unlabeled sample.
3. The model training method for medical image segmentation according to claim 2, characterized in that, The hybrid mask is updated based on the following formula: $$M_c^{(t+1)}(p) = \mathrm{Clip}\left(M_c^{(t)}(p) + \eta \, \Delta M_c^{(t)}(p)\right)$$ where $p$ denotes a pixel position in the medical image; $c$ denotes the category index corresponding to the pseudo-label; $M_c^{(t+1)}(p)$ denotes the hybrid mask value at position $p$ for target category $c$ in the $(t+1)$-th training phase; $\mathrm{Clip}(\cdot)$ denotes a clipping function that truncates its input and restricts it to the interval [0,1]; $M_c^{(t)}(p)$ denotes the hybrid mask value at position $p$ for target category $c$ in the $t$-th training phase; $\eta$ denotes the update step size of the hybrid mask; and $\Delta M_c^{(t)}(p)$ denotes the mask change amount at position $p$ for target category $c$ in the $t$-th training phase, which is determined from $S(p)$, the value of the fused similarity map at position $p$, and $\tau$, the centering constant used to adjust the baseline of the mask change. The weakly enhanced bidirectional intermediate domain samples include a first weakly enhanced intermediate domain sample transitioning from the target domain to the source domain and a second weakly enhanced intermediate domain sample transitioning from the source domain to the target domain. The weakly enhanced bidirectional intermediate domain samples are generated by a formula that combines the first unlabeled sample, the updated hybrid mask, and the first labeled sample.
4. The model training method for medical image segmentation according to claim 3, characterized in that, The two fused labels include a fused label corresponding to the first weakly enhanced intermediate domain sample and a fused label corresponding to the second weakly enhanced intermediate domain sample, each obtained by a formula that combines the pseudo-label and the ground truth label; The step of obtaining domain-specific statistical features for each sample based on the first labeled sample, the first unlabeled sample, and the weakly enhanced bidirectional intermediate domain samples includes: For any given sample, input the given sample into the teacher network to obtain the domain-specific statistical features output by the encoder of the teacher network; wherein, the given sample is the first labeled sample, the first unlabeled sample, or a weakly enhanced bidirectional intermediate domain sample.
5. The model training method for medical image segmentation according to claim 1, characterized in that, The generation of structural weights based on the domain-specific statistical features includes: For any domain-specific statistical feature, adaptive pooling is performed on the domain-specific statistical feature, and a normalized Gram matrix is constructed; the normalized Gram matrix $G_d$ is generated based on the following formula: $$G_d = \frac{F_d F_d^{\top}}{C_d H W}$$ where $d$ denotes the domain index; $G_d$ denotes the normalized Gram matrix corresponding to the $d$-th domain; $F_d$ denotes the feature matrix of the domain-specific statistical features after adaptive pooling; $F_d^{\top}$ denotes the transpose of $F_d$; $C_d$ denotes the number of feature channels of the $d$-th domain after pooling; $H$ and $W$ denote the height and width of the feature map after adaptive pooling, respectively; and $C_d H W$ denotes the total number of feature elements used for normalization; For any normalized Gram matrix, expand the normalized Gram matrix and input it into the routing network to obtain the structural weights.
6. The model training method for medical image segmentation according to claim 5, characterized in that, The process involves modulating the corresponding bottleneck features using a multi-receptive-field dynamic network structure based on each of the structural weights to obtain the modulated features, including: For any structural weight, in the bottleneck layer of the student network, the corresponding bottleneck feature is input to each dilated convolutional branch to obtain multiple branch outputs; The outputs of each branch are weighted and fused based on any of the structural weights, and residual connections are made with the corresponding bottleneck features. Then, an activation operation is performed using ReLU to obtain the modulated features.
7. The model training method for medical image segmentation according to claim 1, characterized in that, The process of updating the network parameters of the student network based on the ground truth label, the pseudo label, the two fused labels, and each of the image segmentation results includes: The total loss is calculated based on the following formula: $$L_{total} = L_{sup} + \lambda \, L_{unsup}$$ where $L_{total}$ denotes the total loss; $L_{sup}$ denotes the labeled supervised loss, which is calculated using the ground truth labels and the corresponding image segmentation results; $\lambda$ denotes the dynamic weighting coefficient; and $L_{unsup}$ denotes the unsupervised loss, which is calculated using the pseudo-labels, the two fused labels, and the corresponding image segmentation results; The network parameters of the student network are updated based on backpropagation of the total loss.
8. The model training method for medical image segmentation according to claim 1, characterized in that, The model training method for medical image segmentation also includes: The network parameters of the teacher network are updated using the network parameters of the student network based on the exponential moving average (EMA) method.
9. A computer device, characterized in that, It includes a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the model training method for medical image segmentation as described in any one of claims 1-8.
10. A readable storage medium, characterized in that, The readable storage medium stores a program or instructions that, when executed by a processor, implement the steps of the model training method for medical image segmentation as described in any one of claims 1-8.