Retina cross-modality segmentation method and device based on domain-invariant contrastive learning

By proposing a retinal cross-modal segmentation method based on domain-invariant contrastive learning and uncertainty-aware self-ensemble framework, the problem of high dependence on labeled data in cross-modal learning of medical images is solved, achieving efficient and accurate OCT image segmentation and improving the cross-modal segmentation performance and generalization ability of the model.

CN118710893B · Active · Publication Date: 2026-06-26 · GUANGDONG ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY LAB (GUANGZHOU)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGDONG ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY LAB (GUANGZHOU)
Filing Date
2024-06-06
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing deep learning methods have limited effectiveness in cross-modal learning of medical images, especially in OCT image segmentation, and require a large amount of labeled data, which is costly and time-consuming.

Method used

A retinal cross-modal segmentation method based on domain-invariant contrastive learning (CSA-DoCL) and an uncertainty-aware self-ensembling mean teacher framework (UA-MT) is adopted. The method is trained using labeled CFP images and a small number of labeled OCT images, with the U-net model as the basic framework. Training the model with domain-invariant contrastive learning and the uncertainty-aware self-ensembling mean teacher framework reduces the dependence on labeled data and improves cross-modal segmentation performance.

Benefits of technology

It significantly reduces the need for labeled data, improves the accuracy of cross-modal segmentation and the generalization ability of the model, increases the Dice coefficient of OCT image segmentation and reduces the average symmetric surface distance, and enhances the segmentation performance of the model on unseen data.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a retinal cross-modal segmentation method and device based on domain-invariant contrastive learning, relates to the technical field of medical image segmentation, and comprises the following steps: acquiring a retinal image; establishing a retinal cross-modal segmentation model; and inputting the retinal image into the retinal cross-modal segmentation model to obtain a segmentation result of the retinal image, wherein the retinal cross-modal segmentation model uses a U-net model as its basic framework, is trained using image data of different modalities, and is obtained through domain-invariant contrastive learning and an uncertainty-aware self-ensembling mean teacher framework. The application can effectively adapt to new modalities and achieve accurate image segmentation with limited labeled data.

Description

Technical Field

[0001] This invention relates to the field of medical image segmentation technology, specifically to a retinal cross-modal segmentation method and apparatus based on domain-invariant contrastive learning.

Background Technology

[0002] Significant progress has been made in the field of medical image segmentation using deep learning techniques. Nevertheless, a number of challenges remain when applying these techniques to process optical coherence tomography (OCT) images. Particularly in the context of cross-modal learning—applying knowledge learned from one modality (such as color fundus photography, CFP) to another (such as OCT)—the performance of existing deep learning methods in adapting to new modalities still needs improvement.

[0003] Current methods, such as Domain Adversarial Networks (DANN) proposed by Tzeng et al., and the adversarial example generation techniques further developed by Ganin et al., while addressing the domain adaptation problem to some extent, still have limited effectiveness in handling the specific challenges of medical images, especially OCT images. These methods mainly focus on feature-level adaptation without fully utilizing the spatial and contextual information inherent in the medical images themselves.

[0004] Furthermore, existing technologies often require a large amount of labeled data when processing cross-modal data, which is a significant limitation in practical applications, as acquiring a large number of high-quality medical image annotations is both time-consuming and expensive. Although the semi-supervised learning method proposed by Zhang et al. attempts to reduce the dependence on labeled data, its effectiveness in cross-modal learning scenarios, especially in OCT image segmentation, remains limited.

Summary of the Invention

[0005] This invention provides a retinal cross-modal segmentation method and apparatus based on domain-invariant contrastive learning, which can effectively adapt to new modalities and achieve accurate image segmentation with limited labeled data.

[0006] To achieve the above objectives, the present invention provides the following technical solution:

[0007] In a first aspect, the present invention provides a method for retinal cross-modal segmentation, characterized by comprising the following steps:

[0008] Acquire retinal images;

[0009] Establish a retinal cross-modal segmentation model, and input the retinal image into the retinal cross-modal segmentation model to obtain the segmentation result of the retinal image. The retinal cross-modal segmentation model is based on the U-net model and is trained using image data from different modalities through domain-invariant contrastive learning and an uncertainty-aware self-ensembling mean teacher framework.

[0010] In a second aspect, the present invention also provides an electronic device, including a processor and a memory;

[0011] The memory is used to store programs;

[0012] The processor executes the program to implement the method described above.

[0013] In a third aspect, the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement the method described above.

[0014] In a fourth aspect, the present invention also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the aforementioned method.

[0015] Compared with the prior art, the advantages of this invention are as follows:

[0016] This invention addresses a key challenge in cross-modal data segmentation in the field of medical image segmentation through domain-invariant contrastive learning (CSA-DoCL), demonstrating significant advantages, particularly in the accurate layer segmentation of OCT images.

[0017] It also has the following advantages:

[0018] Significantly reduces reliance on large amounts of labeled data: By combining labeled CFP images with a small number of labeled OCT images for learning, this invention significantly reduces the need for large amounts of labeled data. This advantage directly solves the problem of high cost and time consumption in obtaining large amounts of high-quality labeled data in medical image segmentation tasks.

[0019] Effectively improves cross-modal segmentation performance: The Domain Invariant Contrastive Learning (DoCL) strategy enables the model to learn feature representations that are effective in both the source and target domains, thereby significantly improving the accuracy of cross-modal segmentation. This effect is validated through extensive experiments on different datasets. The model employing this invention demonstrates a higher Dice coefficient and a lower average symmetric surface distance (ASSD) compared to existing techniques on OCT image segmentation tasks.

[0020] Enhances the model's generalization ability: This invention further improves the model's generalization ability to unseen data through an uncertainty-aware self-ensembling mean teacher framework (UA-MT). This mechanism ensures stability and accuracy during fine-tuning in the target domain by automatically identifying and reducing the uncertainty of pseudo-labels, enabling the model to maintain good segmentation performance on new, unlabeled OCT images.

[0021] By comprehensively applying these technical features, this invention not only theoretically proposes a novel cross-modal medical image segmentation method, but also demonstrates its effectiveness and superiority through experimental data. Particularly in OCT image segmentation, this invention exhibits higher accuracy and better generalization ability compared to existing technologies, significantly improving the efficiency and accuracy of medical image segmentation, and is of great significance for promoting the development and application of medical image analysis technology.

Description of the Drawings

[0022] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart of a retinal cross-modal segmentation method according to an embodiment of the present invention;

[0024] Figure 2 is a flowchart of a retinal cross-modal segmentation method according to another embodiment of the present invention;

[0025] Figure 3 is a flowchart of the domain-invariant contrastive learning process in an embodiment of the present invention;

[0026] Figure 4 is a flowchart of the uncertainty-aware self-ensembling mean teacher framework in an embodiment of the present invention.

Detailed Implementation

[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0028] Example 1

[0029] It should be noted that the terms "comprising" and "having" and any variations thereof in the embodiments of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such processes, methods, products, or devices.

[0030] Referring to Figures 1 to 4, this embodiment provides a retinal cross-modal segmentation method, which specifically includes the following process:

[0031] Step 100: Acquire a retinal image and preprocess the retinal image.

[0032] In this step, the retinal image that requires cross-modal layer segmentation, such as an optical coherence tomography (OCT) image, is acquired, and the image is segmented using the retinal cross-modal segmentation model provided in this embodiment.

[0033] Step 200: Establish a retinal cross-modal segmentation model. Input the preprocessed retinal image into the retinal cross-modal segmentation model to obtain the segmentation result of the retinal image. The retinal cross-modal segmentation model is based on the U-net model and is trained using image data from different modalities through domain-invariant contrastive learning and an uncertainty-aware self-ensembling mean teacher framework.

[0034] In this step, the image data of different modalities include source domain data and target domain data. The source domain data (REFUGE2 dataset) contain color fundus photography (CFP) images, which are used to simulate the features of OCT images. To make the CFP images more similar to OCT images, they are processed using a linear polar coordinate transformation and grayscale conversion to obtain a layered retinal structure similar to that of OCT images. The target domain data (GOALS, Duke DME, and MS datasets) provide actual OCT images and their corresponding layer annotations, which are used to train the model to identify and segment the individual layers of the retina. This embodiment uses cross-modal learning between CFP images and OCT images to achieve efficient and accurate layer segmentation of OCT images.
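The source-domain preprocessing described above can be illustrated with a minimal NumPy sketch. It assumes an RGB input, BT.601 luma weights for the grayscale conversion, and nearest-neighbour sampling around a given center; the function names `to_grayscale` and `linear_polar`, the center, and the output size are hypothetical choices, not details taken from the patent.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale (BT.601 luma weights, an assumption)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def linear_polar(gray, center, max_radius, out_shape):
    """Unwrap a grayscale image around `center` into (angle, radius) coordinates.

    Rows of the output index the angle and columns index the radius,
    sampled with nearest-neighbour lookup.
    """
    n_angles, n_radii = out_shape
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    radii = np.linspace(0.0, max_radius, n_radii)
    theta_grid, radius_grid = np.meshgrid(thetas, radii, indexing="ij")
    cy, cx = center
    ys = np.rint(cy + radius_grid * np.sin(theta_grid))
    xs = np.rint(cx + radius_grid * np.cos(theta_grid))
    # Clamp sample positions to the image so out-of-range radii reuse edge pixels.
    ys = np.clip(ys, 0, gray.shape[0] - 1).astype(int)
    xs = np.clip(xs, 0, gray.shape[1] - 1).astype(int)
    return gray[ys, xs]
```

After this unwrapping, a circular structure around the chosen center becomes a band along the radius axis, which is the sense in which the transformed CFP image resembles the layered appearance of an OCT scan.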

[0035] The model is trained using domain-invariant contrastive learning and the uncertainty-aware self-ensembling mean teacher framework, in two stages. In the first stage, the model is pre-trained on CFP images, learning source domain features in a supervised manner. In the second stage, the CSA-DoCL strategy, combined with the UA-MT mechanism, performs cross-modal learning using a small number of labeled OCT images together with unlabeled OCT images, reducing the model's dependence on large amounts of labeled data. The pre-trained model is fine-tuned on the target domain (OCT images), with CSA-DoCL adapting it to the characteristics of the target domain data, and the UA-MT mechanism is then used for final optimization to improve the model's generalization to unseen data. The loss functions are set to achieve the expected training objective. Performance is evaluated on an independent test dataset of OCT images using metrics such as the Dice coefficient and the average symmetric surface distance (ASSD). Experiments were conducted on the GOALS, Duke DME, and MS datasets, which cover different retinal diseases and conditions and are representative and challenging. Evaluation metrics: model performance was measured using standard metrics such as the Dice coefficient and mean Euclidean distance (MED) and compared with existing semi-supervised and supervised learning methods. Experimental results: CSA-DoCL achieved significant performance improvements on each dataset, demonstrating the effectiveness of this approach for retinal OCT layer segmentation with limited labeled data.
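For reference, the two evaluation metrics named above can be computed for a pair of binary 2D masks roughly as follows. This is a sketch under simplifying assumptions: distances are in pixels, surfaces are extracted with a 4-connected neighbourhood, nearest distances are found by brute force, and both masks are assumed non-empty. The helper names are hypothetical.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks (1.0 means identical)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, truth).sum() / denom

def surface_points(mask):
    """Coordinates of foreground pixels with at least one 4-connected background neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~interior)

def assd(pred, truth):
    """Average symmetric surface distance in pixels (brute force, non-empty masks)."""
    a, b = surface_points(pred), surface_points(truth)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Sum nearest-surface distances in both directions, then average over all surface points.
    return (dists.min(axis=1).sum() + dists.min(axis=0).sum()) / (len(a) + len(b))
```

A higher Dice coefficient and a lower ASSD both indicate closer agreement between the predicted and annotated layer regions.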

[0036] Optionally, training is performed using domain-invariant contrastive learning, specifically including the following steps: obtaining a labeled dataset as the source domain and a labeled dataset as the target domain; using a mini-batch mean-variance method to obtain a batch normalization layer for the source domain style and a batch normalization layer for the target domain style based on the labeled dataset of the source domain and the labeled dataset of the target domain; obtaining source domain style features through linear layer mapping based on the labeled dataset of the source domain and the batch normalization layer of the source domain style; obtaining target domain style features through linear layer mapping based on the labeled dataset of the target domain and the batch normalization layer of the target domain style; and obtaining cross-domain consistent features through domain-invariant contrastive learning based on the source domain style features and the target domain style features.
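The first steps of this procedure, domain-specific batch normalization from mini-batch mean and variance followed by a linear-layer mapping, can be sketched as below. This is an illustration only: features are flattened to `(batch, features)`, the class `DomainBatchNorm` and function `project` are hypothetical names, and the L2 normalization of the projected features is an assumption suggested by the "normalized feature representation" wording rather than a confirmed detail.

```python
import numpy as np

class DomainBatchNorm:
    """Batch normalization with a separate scale and shift per domain (hypothetical helper)."""

    def __init__(self, num_features, eps=1e-5):
        self.eps = eps
        self.gamma = {d: np.ones(num_features) for d in ("source", "target")}
        self.beta = {d: np.zeros(num_features) for d in ("source", "target")}

    def __call__(self, x, domain):
        # x has shape (batch, features); statistics come from the current mini-batch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma[domain] * x_hat + self.beta[domain]

def project(x, weight, bias):
    """Linear-layer mapping followed by L2 normalization of each row (assumed)."""
    z = x @ weight + bias
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
```

In a trained network the per-domain scale and shift would be learned separately, so the same image passed through the source-style and target-style layers yields two differently styled feature vectors for the contrastive step.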

[0037] In the above embodiments, the loss function of domain-invariant contrastive learning includes: the loss from the source domain to the target domain and the loss from the target domain to the source domain.

[0038] The loss from the source domain to the target domain is composed of two negative pair losses and one positive pair loss:

[0039]

[0040] Loss from target domain to source domain:

[0041]

[0042] The loss function for domain-invariant contrastive learning is formed by combining the two types of losses mentioned above:

[0043]

[0044] where the symbols in the above formulas denote, respectively: the normalized feature representation generated by the source-domain-specific BN; the normalized feature representation generated by the target-domain-specific BN; the normalized feature representation of the target domain image generated by the source-domain-specific BN; and the normalized feature representation of the target domain image generated by the target-domain-specific BN.
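Because the loss formulas themselves are not legible in this text, the following is only one possible reading of "one positive pair loss and two negative pair losses" per direction, not the patent's actual formula. It assumes cosine similarity, treats the same image passed through the two domain-specific BN layers as the positive pair, and treats both views of the other domain's image as negatives. The names `z_ss`, `z_st`, `z_ts`, `z_tt` (image domain first, BN domain second) are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return (a * b).sum(axis=1)

def positive_pair_loss(anchor, positive):
    # Zero when the two views are perfectly aligned.
    return (1.0 - cosine(anchor, positive)).mean()

def negative_pair_loss(anchor, negative):
    # Penalizes only positive similarity to a negative sample.
    return np.maximum(0.0, cosine(anchor, negative)).mean()

def directional_loss(anchor, positive, negative_a, negative_b):
    """One positive-pair term plus two negative-pair terms (assumed additive form)."""
    return (positive_pair_loss(anchor, positive)
            + negative_pair_loss(anchor, negative_a)
            + negative_pair_loss(anchor, negative_b))

def docl_loss(z_ss, z_st, z_ts, z_tt):
    """Sum of the source-to-target and target-to-source directions."""
    source_to_target = directional_loss(z_ss, z_st, z_ts, z_tt)
    target_to_source = directional_loss(z_tt, z_ts, z_st, z_ss)
    return source_to_target + target_to_source
```

Other pairings and other contrastive forms (for example an InfoNCE-style softmax with a temperature) would fit the same verbal description equally well.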

[0045] Optionally, the model is trained using the uncertainty-aware self-ensembling mean teacher framework. The specific steps include: feeding the labeled dataset of the source domain and the labeled dataset of the target domain into the student model, and applying a supervised learning loss to prevent overfitting to unlabeled data.

[0046] In the above embodiments, this loss function can improve the model's ability to extract features and prevent overfitting on unlabeled data when the amount of unlabeled data is much larger than the amount of labeled data. The loss function for supervised learning loss includes: the DF loss between the predicted and true labels of the source domain data, and the DF loss between the predicted and true labels of the labeled data in the target domain.

[0047]

[0048] DF loss represents a combination of Dice loss and Focal loss:

[0049] $L_{DF} = L_{Dice} + L_{Focal}$;

[0050] where the symbols in the above formulas denote, respectively: the network's predicted output for CFP images in the source domain; the network's predicted output for OCT images in the target domain; and the corresponding labels of each. w1 and w2 are two parameters for adjusting class imbalance.
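As an illustration of the DF loss, the sketch below combines a soft Dice loss with a binary focal loss and sums the result over the source and target predictions. It assumes binary labels, per-pixel foreground probabilities, and a focal exponent `gamma=2.0`; the weights w1 and w2 are left out because their exact placement in the formula is not shown in this text. All function names are hypothetical.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss for probabilities and binary labels of the same shape."""
    intersection = (prob * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)

def focal_loss(prob, target, gamma=2.0, eps=1e-7):
    """Binary focal loss; gamma down-weights pixels that are already well classified."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)  # probability assigned to the true class
    return (-((1.0 - p_t) ** gamma) * np.log(p_t)).mean()

def df_loss(prob, target):
    """L_DF = L_Dice + L_Focal."""
    return dice_loss(prob, target) + focal_loss(prob, target)

def supervised_loss(pred_src, label_src, pred_tgt, label_tgt):
    """DF loss on labeled source (CFP) data plus DF loss on labeled target (OCT) data."""
    return df_loss(pred_src, label_src) + df_loss(pred_tgt, label_tgt)
```

Combining the two terms is a common way to handle thin structures: the Dice term measures region overlap while the focal term emphasizes hard pixels.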

[0051] Optionally, the model is trained using the uncertainty-aware self-ensembling mean teacher framework. The specific steps include: obtaining an unlabeled dataset as the target domain; feeding the unlabeled dataset of the target domain into the student model and the teacher model respectively after applying different augmentations; using the pseudo-labels output by the teacher model and applying an unsupervised consistency loss so that the teacher model supervises the student model at the pixel level; and updating the weights of the teacher model through an exponential moving average (EMA) mechanism.

[0052] In the above embodiments, an unsupervised consistency loss is used to enhance the model's generalization ability and make full use of the unlabeled data in the target domain. Specifically, the unlabeled data are augmented in different ways and then fed into the teacher model and the student model respectively, and the output of the teacher model is used to supervise the output of the student model. A pixel-level consistency loss is used so that the model captures pixel-level features. The loss function of the unsupervised consistency loss includes:

[0053]

[0054] Where $\mathbb{I}(\cdot)$ represents the indicator function; the two compared quantities are the teacher network's prediction and the student network's prediction at the j-th pixel; $u_j$ represents the estimated uncertainty of the j-th pixel; and $H$ represents the threshold used to select the most certain targets.
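A minimal sketch of an uncertainty-masked consistency loss built from the quantities defined above is given here. The squared-error distance, the normalization by the number of selected pixels, and the use of predictive entropy over several stochastic teacher passes as the uncertainty estimate are all assumptions (they follow common mean-teacher practice and are not stated in this text). The function names are hypothetical.

```python
import numpy as np

def predictive_entropy(prob_samples, eps=1e-8):
    """Per-pixel entropy of the mean of several stochastic teacher predictions.

    prob_samples has shape (n_samples, n_classes, H, W).
    """
    mean_prob = prob_samples.mean(axis=0)
    return -(mean_prob * np.log(mean_prob + eps)).sum(axis=0)

def masked_consistency_loss(teacher_prob, student_prob, uncertainty, threshold):
    """Mean squared teacher/student difference over pixels with uncertainty below threshold."""
    mask = (uncertainty < threshold).astype(np.float64)          # I(u_j < H)
    sq_diff = ((teacher_prob - student_prob) ** 2).sum(axis=0)   # per pixel, summed over classes
    return (mask * sq_diff).sum() / max(mask.sum(), 1.0)
```

Pixels where the teacher is uncertain are excluded from the loss, so unreliable pseudo-labels do not drive the student's update.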

[0055] In the above embodiments, updating the weights of the teacher model using the exponential moving average (EMA) mechanism includes:

[0056] $\theta'_{e} = \alpha\,\theta'_{e-1} + (1-\alpha)\,\theta_{e}$

[0057] Where $\alpha$ represents the exponential moving average (EMA) decay rate; $\theta'_{e-1}$ represents the teacher model parameters in the (e-1)-th training step; $\theta'_{e}$ represents the teacher model parameters in the e-th training step; and $\theta_{e}$ represents the student model parameters in the e-th training step.
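The EMA update of the teacher weights described above can be sketched as follows, assuming the parameters are held in dictionaries of NumPy arrays keyed by name. The function name `ema_update` and the dictionary layout are hypothetical.

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha):
    """Return new teacher parameters: alpha * teacher + (1 - alpha) * student."""
    return {
        name: alpha * teacher_params[name] + (1.0 - alpha) * student_params[name]
        for name in teacher_params
    }
```

With a decay rate close to 1, the teacher changes slowly and acts as a temporal average of recent student weights, which is what makes its pseudo-labels more stable than the student's own predictions.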

[0058] As can be seen, the innovation of this embodiment lies in combining cross-modal learning, domain-invariant contrastive learning, and an uncertainty-aware mechanism, effectively solving key challenges in cross-modal medical image segmentation and achieving high-performance segmentation with limited annotation resources. Furthermore, this embodiment lowers the technical threshold by optimizing the model training and tuning process, enabling even non-expert users to apply it easily, and provides an efficient and accurate solution for medical image analysis.

[0059] Example 2

[0060] Based on the same inventive concept, embodiments of the present invention also provide an electronic device, the electronic device including a processor and a memory, the memory storing at least one instruction, at least one program, code set or instruction set, the at least one instruction, the at least one program, the code set or instruction set being loaded and executed by the processor to implement the retinal cross-modal segmentation method as described above.

[0061] It is understood that the memory may include random access memory (RAM) or read-only memory. Optionally, the memory may include non-transitory computer-readable storage medium. The memory can be used to store instructions, programs, code, code sets, or instruction sets. The memory may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function, instructions for implementing the various method embodiments described above, etc.; the stored data area may store data created according to the use of the server, etc.

[0062] A processor may include one or more processing cores. The processor connects to various parts of the server via various interfaces and lines, executing instructions, programs, code sets, or instruction sets stored in memory, and accessing data stored in memory to perform various server functions and process data. Optionally, the processor may be implemented using at least one of the following hardware forms: Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor may integrate one or more of the following: Central Processing Unit (CPU) and Modem. The CPU primarily handles the operating system and applications; the modem handles wireless communication. It is understood that the modem may also be implemented as a separate chip without being integrated into the processor.

[0063] Since this electronic device is the electronic device corresponding to the retinal cross-modal segmentation method of the present invention, and the principle of solving the problem by this electronic device is similar to that of the method, the implementation of this electronic device can refer to the implementation process of the above method embodiments, and the repeated parts will not be described again.

[0064] Example 3

[0065] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium storing at least one instruction, at least one program, code set, or instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the retinal cross-modal segmentation method as described above.

[0066] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.

[0067] Since this storage medium is the storage medium corresponding to the retinal cross-modal segmentation method of the present invention, and the principle of the storage medium in solving the problem is similar to that of the method, the implementation of this storage medium can refer to the implementation process of the above method embodiments, and the repeated parts will not be described again.

[0068] Example 4

[0069] In some possible implementations, various aspects of the methods of the embodiments of the present invention can also be implemented as a program product comprising program code that, when run on a computer device, causes the computer device to perform the steps of the retinal cross-modal segmentation method according to the various exemplary embodiments of this application described above. The executable computer program code or "code" for performing the various embodiments can be written in high-level programming languages such as C, C++, C#, Smalltalk, Java, JavaScript, Visual Basic, Structured Query Language (e.g., Transact-SQL), Perl, or in various other programming languages.

[0070] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0071] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0072] The above embodiments are merely illustrative of the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly. They should not be construed as limiting the scope of protection of the present invention. All equivalent changes or modifications made based on the essence of the content of the present invention should be covered within the scope of protection of the present invention.

Claims

1. A method for cross-modal retinal segmentation, characterized in that it includes the following steps: acquiring retinal images; and establishing a retinal cross-modal segmentation model and inputting the retinal image into the retinal cross-modal segmentation model to obtain the segmentation result of the retinal image. The retinal cross-modal segmentation model is based on the U-net model and is trained using image data from different modalities through domain-invariant contrastive learning and an uncertainty-aware self-ensembling mean teacher framework. The specific steps of training through domain-invariant contrastive learning include: obtaining a labeled dataset as the source domain and a labeled dataset as the target domain; based on the labeled dataset of the source domain and the labeled dataset of the target domain, obtaining the batch normalization layer of the source domain style and the batch normalization layer of the target domain style by using the mini-batch mean-variance method; based on the labeled dataset of the source domain and the batch normalization layer of the source domain style, obtaining the source domain style features through linear layer mapping; based on the labeled dataset of the target domain and the batch normalization layer of the target domain style, obtaining the target domain style features through linear layer mapping; and based on the source domain style features and the target domain style features, obtaining cross-domain consistent features through domain-invariant contrastive learning. The loss function for domain-invariant contrastive learning includes the loss from the source domain to the target domain and the loss from the target domain to the source domain, and is formed by combining the two; wherein the symbols in the loss formulas denote, respectively: the normalized feature representation generated by the source-domain-specific batch normalization (BN); the normalized feature representation generated by the target-domain-specific BN; the normalized feature representation of the target domain image generated by the source-domain-specific BN; the normalized feature representation of the target domain image generated by the target-domain-specific BN; the positive pair loss; and the negative pair loss.

2. The retinal cross-modal segmentation method according to claim 1, characterized in that the training process using the uncertainty-aware self-ensembling mean teacher framework includes the following steps: the labeled dataset of the source domain and the labeled dataset of the target domain are fed into the student model and optimized using a supervised learning loss; and the unlabeled dataset of the target domain is input into the teacher model and optimized using an unsupervised consistency loss.

3. The retinal cross-modal segmentation method according to claim 2, characterized in that the loss function for supervised learning includes: the DF loss between the predicted and true labels of the source domain data, and the DF loss between the predicted and true labels of the labeled data in the target domain, where the DF loss is a combination of Dice loss and Focal loss; wherein the symbols in the loss formula denote, respectively: the network's predicted output for CFP images in the source domain; the network's predicted output for OCT images in the target domain; the corresponding labels of each; and two parameters that adjust for class imbalance.

4. The retinal cross-modal segmentation method according to claim 1, characterized in that the training process using the uncertainty-aware self-ensembling mean teacher framework includes the following steps: obtaining an unlabeled dataset as the target domain; feeding the unlabeled dataset of the target domain into the student model and the teacher model respectively after applying different augmentations; using the pseudo-labels output by the teacher model and applying an unsupervised consistency loss so that the teacher model provides pixel-level supervision to the student model; and updating the weights of the teacher model using an exponential moving average (EMA) mechanism.

5. The retinal cross-modal segmentation method according to claim 4, characterized in that the symbols in the loss function of the unsupervised consistency loss denote, respectively: the indicator function; the teacher network's prediction at the j-th pixel; the student network's prediction at the j-th pixel; the estimated uncertainty of the j-th pixel; and the threshold used to select the most certain targets.

6. The retinal cross-modal segmentation method according to claim 1, characterized in that obtaining data from the source domain also includes a preprocessing step, which specifically includes: performing a linear polar coordinate transformation and grayscale conversion on the source domain data to obtain a layered retinal structure similar to that of the target domain data.

7. An electronic device, characterized in that the electronic device includes a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the retinal cross-modal segmentation method according to any one of claims 1 to 6.

8. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the retinal cross-modal segmentation method according to any one of claims 1 to 6.