Cross-modality medical image segmentation method and device, electronic equipment and storage medium

By applying a multi-scale cross-modal interaction module and a cross-attention mechanism, the problem of coarse feature fusion in cross-modal medical image segmentation is solved, improving the accuracy of image segmentation and spatial localization capability. It is particularly suitable for the identification of lesions with irregular shapes and blurred boundaries.

CN122199542A · Pending · Publication Date: 2026-06-12 · NAT UNIV OF DEFENSE TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: NAT UNIV OF DEFENSE TECH
Filing Date: 2026-05-14
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, the fusion and alignment of multimodal features in cross-modal medical image segmentation methods are relatively coarse, making it difficult to capture the spatial location and boundary features of lesions, resulting in low accuracy of image segmentation results.

Method used

A multi-scale cross-modal interaction module is used for feature fusion. Image features undergo global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening. A cross-attention mechanism fuses each result with the text features to produce global, horizontal, and vertical fused features, which are then concatenated to improve spatial localization.

Benefits of technology

It improves the accuracy of medical image segmentation, produces a robust fused feature that combines global semantics with precise spatial localization, and enhances the ability to identify lesion areas.


Abstract

The application discloses a cross-modal medical image segmentation method and device, electronic equipment and storage medium, and belongs to the technical field of image processing. The method comprises the following steps: inputting an original medical image and an original text description into a feature encoder for encoding processing to obtain image features and text features; performing global flattening processing, horizontal post-pooling flattening processing and vertical post-pooling flattening processing on the image features to obtain global information, horizontal information and vertical information respectively; fusing the global information, the horizontal information and the vertical information with the text features respectively by using a cross attention mechanism to obtain global fusion features, horizontal fusion features and vertical fusion features; splicing the global fusion features, the horizontal fusion features and the vertical fusion features to obtain target fusion features; and inputting the target fusion features into a visual decoder for image segmentation processing to obtain an image segmentation result. The embodiment can improve the accuracy of the image segmentation result.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, specifically relating to a cross-modal medical image segmentation method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Medical image segmentation aims to delineate regions of interest (ROIs) within medical images for subsequent analysis. These segmented regions typically carry important medical semantics, such as lesions or organs, and support disease prediction, diagnosis, and analysis. Segmentation techniques that rely solely on image information are prone to insufficient semantic understanding, which degrades segmentation performance. Although cross-modal medical image segmentation techniques use multimodal information to enhance the model's semantic understanding, their multimodal feature fusion and alignment is often coarse.

[0003] Unlike natural images, medical images have low contrast, uniform texture, and blurred lesion boundaries, making accurate cross-modal alignment extremely challenging. Therefore, in cross-modal medical image segmentation, the fusion and alignment of multimodal features (visual and linguistic features) are often coarse, making it difficult to capture the spatial location and boundary features of lesions, resulting in low accuracy of image segmentation results.

Summary of the Invention

[0004] The technical problem to be solved by the present invention is that the multimodal feature fusion and alignment methods commonly used in the prior art are often relatively coarse, making it difficult to capture the spatial location and boundary features of lesions, resulting in low accuracy of image segmentation results. In order to solve the above problems, the present invention provides a cross-modal medical image segmentation method, device, electronic device and storage medium.

[0005] In a first aspect, embodiments of the present invention provide a cross-modal medical image segmentation method, comprising: inputting an original medical image and an original text description into a feature encoder for encoding to obtain image features corresponding to the original medical image and text features corresponding to the original text description, wherein the original text description is text data that explains the original medical image; inputting the image features and the text features into a multi-scale cross-modal interaction module for fusion to obtain target fusion features; and inputting the target fusion features into a visual decoder for image segmentation to obtain an image segmentation result. The step of inputting the image features and the text features into the multi-scale cross-modal interaction module for fusion to obtain the target fusion features comprises: performing global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the image features to obtain global information, horizontal information, and vertical information, respectively; fusing the global information, the horizontal information, and the vertical information with the text features, respectively, using a cross-attention mechanism to obtain global fusion features, horizontal fusion features, and vertical fusion features; and concatenating the global fusion features, the horizontal fusion features, and the vertical fusion features to obtain the target fusion features.

[0006] Optionally, the feature encoder includes a visual encoder and a text encoder. Inputting the original medical image and the original text description into the feature encoder for encoding to obtain the image features and the text features comprises: inputting the original medical image into the visual encoder for encoding to obtain the image features, and inputting the original text description into the text encoder for encoding to obtain the text features. The visual encoder includes a first convolutional layer and N sequentially connected downsampling convolutional blocks, where N is a positive integer. Inputting the original medical image into the visual encoder for encoding to obtain the image features comprises: inputting the original medical image into the first convolutional layer for convolution to obtain a first intermediate feature; and passing the first intermediate feature through the N sequentially connected downsampling convolutional blocks, with the output of the last downsampling convolutional block taken as the image features.

[0007] Optionally, the visual decoder includes a second convolutional layer and N sequentially connected coordinate attention combination modules. Inputting the target fusion features into the visual decoder for image segmentation to obtain the image segmentation result comprises: inputting the target fusion features into the first coordinate attention combination module to obtain the output of the first coordinate attention combination module; passing the output of the first coordinate attention combination module through the remaining sequentially connected coordinate attention combination modules to obtain a second intermediate feature, wherein the input to the i-th coordinate attention combination module includes the output of the (i-1)-th coordinate attention combination module and the output of the corresponding downsampling convolutional block, i being an integer greater than 1 and less than or equal to N; and inputting the second intermediate feature into the second convolutional layer to obtain the image segmentation result.

[0008] Optionally, the i-th coordinate attention combination module is used for: processing the output of the corresponding downsampling convolutional block through a coordinate attention mechanism to obtain a first processed feature; upsampling the output of the (i-1)-th coordinate attention combination module to obtain a second processed feature; concatenating the first processed feature and the second processed feature to obtain an intermediate concatenated feature; and performing convolution on the intermediate concatenated feature to obtain the output of the i-th coordinate attention combination module.

[0009] Optionally, before the original medical image and the original text description are input into the feature encoder for encoding to obtain the image features and the text features, the method further includes: obtaining a training dataset that includes multiple image-text sample pairs, each of which includes a sample medical image and a corresponding sample description text; for each image-text sample pair, taking the sample description text as a positive sample and selecting semantically related description texts from a corpus as negative samples to obtain a negative sample set, wherein the corpus includes multiple corpus texts that give standardized descriptions of the various situations that medical images may show; and training a medical image segmentation model based on the training dataset and the negative sample set, wherein the medical image segmentation model includes the feature encoder, the multi-scale cross-modal interaction module, and the visual decoder.

[0010] Optionally, taking the sample description text as a positive sample and selecting corpus texts from the corpus as negative samples to obtain the negative sample set comprises: inputting the sample medical image and the corresponding sample description text into a large language model to obtain generated text, wherein the generated text describes the areas in the sample medical image that may be infected in the next stage but are not described by the sample description text; determining a similarity weight for each corpus text based on the similarity between the corpus text and the generated text, and determining a relevance coefficient based on the similarity between the corpus text and the sample description text; calculating a comprehensive score for the corpus text based on the similarity weight and the relevance coefficient; and identifying the corpus texts whose comprehensive scores meet a preset condition as the negative samples corresponding to the sample description text, thereby obtaining the negative sample set.
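The scoring step in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the similarity weight and the relevance coefficient are combined, nor what the preset condition is, so the product and the top-k rule below are hypothetical choices, as are the function names and the toy embedding vectors.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_negatives(corpus_vecs, generated_vec, description_vec, top_k=2):
    """Return the indices of the top_k corpus texts ranked by comprehensive score."""
    scores = []
    for vec in corpus_vecs:
        similarity_weight = cosine(vec, generated_vec)    # corpus text vs. generated text
        relevance_coeff = cosine(vec, description_vec)    # corpus text vs. sample description
        # ASSUMPTION: the comprehensive score is the product of the two terms.
        scores.append(similarity_weight * relevance_coeff)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # ASSUMPTION: the "preset condition" is membership in the top_k scores.
    return ranked[:top_k]

# Toy 2-D embeddings standing in for encoded texts (hypothetical values).
corpus_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
generated_vec = np.array([1.0, 0.0])
description_vec = np.array([0.8, 0.6])
negatives = select_negatives(corpus_vecs, generated_vec, description_vec)
```

With these toy vectors the third corpus text is orthogonal to the generated text, so it scores zero and is not selected.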

[0011] Optionally, the training loss comprises an optimization loss, a cross-entropy loss, an in-batch contrastive loss, and a corpus contrastive loss (the formulas and their symbols are not reproduced in this text). The quantities used in these losses are: the number of pixels; the number of negative samples identified from the corpus; the number of categories; the number of samples in the current training batch; the predicted probability that a pixel belongs to a category; the ground-truth label mask; a Gaussian kernel calculation; the features of the description text of the i-th sample in the current batch; the features of the description text of the j-th sample in the current batch; the features of the i-th sample medical image in the current batch; the similarity calculation between the current sample and the negative samples identified from the corpus; the j-th negative sample identified from the corpus for the current sample; the features of the i-th generated text; and the temperature hyperparameter.
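For orientation, two of the named loss terms have widely used standard forms, sketched below. These are the textbook pixel-wise cross-entropy and an InfoNCE-style in-batch contrastive loss, not necessarily the exact formulas of this application; the optimization loss and the corpus contrastive loss are not sketched because their form cannot be recovered from the text. Function names, the temperature value 0.07, and the toy inputs are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def pixel_cross_entropy(logits, labels):
    """logits: (K, P) class scores per pixel; labels: (P,) integer class per pixel."""
    logp = log_softmax(logits, axis=0)
    return float(-logp[labels, np.arange(labels.size)].mean())

def batch_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """InfoNCE over a batch: image i matches text i; the other texts act as negatives."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) scaled cosine similarities
    logp = log_softmax(logits, axis=1)
    return float(-np.diag(logp).mean())

# Uniform predictions over 2 classes give a cross-entropy of ln 2.
ce = pixel_cross_entropy(np.zeros((2, 4)), np.array([0, 1, 0, 1]))

# Matched image/text pairs should score a lower loss than mismatched ones.
feats = np.eye(3)
aligned = batch_contrastive_loss(feats, feats)
shuffled = batch_contrastive_loss(feats, feats[[1, 2, 0]])
```

The temperature controls how sharply the contrastive term separates the matched pair from the negatives; smaller values penalize near-misses more strongly.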

[0012] In a second aspect, embodiments of the present invention also provide a cross-modal medical image segmentation apparatus, comprising: an encoding processing module, configured to input the original medical image and the original text description into the feature encoder for encoding to obtain the image features corresponding to the original medical image and the text features corresponding to the original text description, wherein the original text description is text data that explains the original medical image; a fusion processing module, configured to input the image features and the text features into the multi-scale cross-modal interaction module for fusion to obtain the target fusion features; and a segmentation processing module, configured to input the target fusion features into the visual decoder for image segmentation to obtain the image segmentation result. The fusion processing module includes: a processing unit, configured to perform global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the image features to obtain global information, horizontal information, and vertical information, respectively; a fusion unit, configured to fuse the global information, the horizontal information, and the vertical information with the text features, respectively, using a cross-attention mechanism to obtain global fusion features, horizontal fusion features, and vertical fusion features; and a splicing unit, configured to concatenate the global fusion features, the horizontal fusion features, and the vertical fusion features to obtain the target fusion features.

[0013] In a third aspect, embodiments of the present invention provide an electronic device, including: a memory, a processor, and a program stored in the memory and executable on the processor; the processor is configured to read the program in the memory to implement the steps of the cross-modal medical image segmentation method described in the first aspect.

[0014] In a fourth aspect, embodiments of the present invention provide a readable storage medium for storing a program which, when executed by a processor, implements the steps of the cross-modal medical image segmentation method described in the first aspect.

[0015] The beneficial effects of this invention are as follows: In the embodiments of this invention, when fusing image features and text features, the image features are subjected to global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening to obtain global information, horizontal information, and vertical information, respectively. A cross-attention mechanism is then used to fuse the global information, horizontal information, and vertical information with the text features, respectively, to obtain global fused features, horizontal fused features, and vertical fused features. These global fused features, horizontal fused features, and vertical fused features are then concatenated to obtain the target fused feature. By independently extracting and fusing global, horizontal, and vertical information, and establishing the interaction between visual and semantic features along the three dimensions of global, horizontal, and vertical, the spatial localization capability of text semantics can be improved. This addresses the key problems of weak spatial location perception and difficulty in fine-grained alignment in cross-modal tasks, ultimately generating a robust fused feature that combines global semantics with precise spatial localization capabilities, thereby improving the accuracy of medical image segmentation.

Attached Figure Description

[0016] Figure 1 is a flowchart of a cross-modal medical image segmentation method provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the structure of the visual encoder and visual decoder provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the structure of the coordinate attention combination module provided in an embodiment of the present invention; Figure 4 is a schematic diagram of the structure of the multi-scale cross-modal interaction module provided in an embodiment of the present invention; Figure 5 is a schematic diagram of the training framework for the medical image segmentation model provided in an embodiment of the present invention; Figure 6 is a schematic diagram of a cross-modal medical image segmentation apparatus provided in an embodiment of the present invention; Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0017] In the embodiments of this application, the term "and / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship. In the embodiments of this application, the term "multiple" refers to two or more, and other quantifiers are similar. The terms "first," "second," etc., in the specification of this application are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that such terms can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first" and "second" are usually of the same class, without limiting the number of objects. For example, the first object can be one or multiple.

[0018] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0019] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0020] This application provides a cross-modal medical image segmentation method, a segmentation model training method, and an apparatus, aiming to achieve finer-grained alignment between images and text, thereby strengthening the spatial correspondence between text semantics and lesion localization and improving the accuracy of image segmentation results.

[0021] Figure 1 is a flowchart of the cross-modal medical image segmentation method provided in an embodiment of the present invention. The method includes the following steps. Step 101: Input the original medical image and the original text description into the feature encoder for encoding to obtain the image features corresponding to the original medical image and the text features corresponding to the original text description, wherein the original text description is text data that interprets and explains the original medical image.

[0022] Step 102: Input the image features and the text features into the multi-scale cross-modal interaction module for fusion processing to obtain the target fused features.

[0023] Step 103: Input the target fusion features into the visual decoder for image segmentation processing to obtain the image segmentation result.

[0024] Step 102 includes: Step 1021: Perform global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the image features to obtain global information, horizontal information, and vertical information, respectively.

[0025] Step 1022: The global information, the horizontal information, and the vertical information are fused with the text features using a cross-attention mechanism to obtain global fusion features, horizontal fusion features, and vertical fusion features.

[0026] Step 1023: The global fusion feature, the horizontal fusion feature, and the vertical fusion feature are concatenated to obtain the target fusion feature.

[0027] In this embodiment, the original medical image serves as the image data to be analyzed, while the original text description is text data used for semantic interpretation and annotation of the image. The original medical image and its corresponding original text description constitute a multimodal input pair. Exemplarily, in some embodiments, the original medical image is a computed tomography (CT) image, and the corresponding original text description is a diagnostic report or description of imaging features written by a radiologist. In other embodiments, the original medical image is a magnetic resonance imaging (MRI) scan, and the corresponding original text description is a medical record, examination findings, or preliminary diagnostic opinion generated by a clinician. In some embodiments, the original text description may also be a patient's self-report, etc., and this is not specifically limited here.

[0028] The original medical image and original text description are input into the feature encoder and encoded separately to obtain image features and text features. The specific method of encoding the original medical image and original text description is not limited here.

[0029] In some embodiments, a unified modality-independent encoder is used as a feature encoder to encode the original medical image and the original text description, obtaining image features and text features. This approach eliminates architectural differences between modalities, enabling the model to efficiently extract visual features of the image and semantic features of the text using the same set of parameters, and achieving deep modality alignment in the latent space.

[0030] Optionally, in other embodiments, the feature encoder includes a visual encoder and a text encoder. The step of inputting the original medical image and original text description into the feature encoder for encoding processing to obtain image features corresponding to the original medical image and text features corresponding to the original text description includes: The original medical image is input into the visual encoder for encoding processing to obtain the image features, and the original text description is input into the text encoder for encoding processing to obtain the text features.

[0031] In this embodiment, different encoders process data from different modalities. A dedicated structure for each modality extracts that modality's high-dimensional information as fully as possible, avoiding the compromise in feature representation that arises when a single model must accommodate the characteristics of several modalities. This decoupled design also allows mature pre-trained weights to be loaded flexibly for each modality, which reduces training difficulty and improves the robustness and convergence speed of the model when the data distributions differ widely.

[0032] In the text encoding stage, the original text description is input into a pre-trained text encoder for encoding to obtain the corresponding text features. This process can be written as: $F_t = E_{\text{text}}(T)$, where $F_t$ denotes the text features, $T$ is the original text description, and $E_{\text{text}}$ denotes the text encoder.

[0033] It should be understood that the specific structure of the text encoder is not limited here. In some embodiments, the text encoder is a BERT series model pre-trained based on a masked language model, thereby leveraging its bidirectional context modeling capabilities to obtain richer syntactic structure and deep semantic information, making it suitable for complex natural language understanding tasks.

[0034] In other embodiments, the frozen text encoder of Contrastive Language-Image Pre-training (CLIP) is used as the backbone network and is combined with newly added adapter convolutional layers to form the final text encoder. Since the CLIP model has acquired general text representation capabilities through large-scale contrastive learning, this embodiment uses adapter convolutions to adapt it to medical text, which can improve the modeling of local semantic relationships and deep contextual information between medical terms.

[0035] Optionally, in some embodiments, the feature encoder includes a visual encoder and a text encoder, and the visual encoder includes a first convolutional layer and N sequentially connected downsampling convolutional blocks, where N is a positive integer. Inputting the original medical image into the visual encoder for encoding to obtain the image features comprises: inputting the original medical image into the first convolutional layer for convolution to obtain a first intermediate feature; and passing the first intermediate feature through the N sequentially connected downsampling convolutional blocks, with the output of the last downsampling convolutional block taken as the image features.

[0036] In this embodiment, the visual encoder encodes the input original medical image to generate visual features. The encoding comprises convolution and downsampling operations and can be written as: $F_0 = \mathrm{Conv}(I)$; $F_i = \mathrm{DownConv}_i(F_{i-1})$, $i = 1, \dots, N$; where $F_0$ is the first intermediate feature, $\mathrm{Conv}$ denotes the convolution operation, $\mathrm{DownConv}_i$ denotes the $i$-th downsampling convolutional block, $I$ is the original medical image, and $F_i$ is the output of the $i$-th downsampling convolutional block. The output of the last (i.e., the $N$-th) downsampling convolutional block, $F_N$, is taken as the final image feature $F_v$.

[0037] As a specific implementation, the first convolutional layer is a 3×3 convolution and N is 4. Using 3×3 convolutions and downsampling the feature map by a factor of 2 reduces the computational burden while gradually expanding the receptive field to capture global context, which improves the quality of the image features.
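The effect of repeated 2× downsampling on feature-map size can be checked with a few lines of arithmetic. The sketch below assumes that the 3×3 convolution is padded so that it preserves spatial size, that each of the N = 4 blocks halves the height and width, and that the channel count doubles per block; the 224×224 input and the 64 base channels are hypothetical values chosen for illustration, and none of these details is stated in the text.

```python
def encoder_feature_shapes(height, width, channels, num_blocks=4):
    """Return (C, H, W) after the first conv layer and after each downsampling block."""
    # ASSUMPTION: the padded 3x3 convolution keeps H and W unchanged.
    shapes = [(channels, height, width)]
    for _ in range(num_blocks):
        c, h, w = shapes[-1]
        # Each block halves the resolution; doubling the channels is an ASSUMPTION.
        shapes.append((c * 2, h // 2, w // 2))
    return shapes

shapes = encoder_feature_shapes(224, 224, 64)
for index, shape in enumerate(shapes):
    print(f"F_{index}: {shape}")
```

Under these assumptions the last block sees only 1/256 of the original spatial positions, which is where the saving in computation comes from.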

[0038] Optionally, in some embodiments, the visual decoder includes a second convolutional layer and N sequentially connected coordinate attention combination modules. Inputting the target fusion features into the visual decoder for image segmentation to obtain the image segmentation result comprises: inputting the target fusion features into the first coordinate attention combination module to obtain the output of the first coordinate attention combination module; passing the output of the first coordinate attention combination module through the remaining sequentially connected coordinate attention combination modules to obtain a second intermediate feature, wherein the input to the i-th coordinate attention combination module includes the output of the (i-1)-th coordinate attention combination module and the output of the corresponding downsampling convolutional block, i being an integer greater than 1 and less than or equal to N; and inputting the second intermediate feature into the second convolutional layer to obtain the image segmentation result.

[0039] In this embodiment, the visual encoder and visual decoder can be viewed as a single visual network model. This model employs a U-shaped architecture, which effectively extracts multi-scale features. Specifically, the visual decoder reconstructs the fused image and text features to generate a segmentation result.

[0040] As shown in Figure 2, the visual decoder consists of multiple sequentially connected Coordinate Attention Combined (CAC) modules. The first CAC module receives the target fusion features as input. Each subsequent CAC module receives two inputs: one from the downsampling convolutional block of the corresponding layer in the visual encoder (that is, the features produced when the original image is encoded by that downsampling convolutional block), and the other from the output of the previous CAC module.

[0041] Optionally, in some embodiments, the i-th coordinate attention combination module is used for: processing the output of the corresponding downsampling convolutional block through a coordinate attention mechanism to obtain a first processed feature; upsampling the output of the (i-1)-th coordinate attention combination module to obtain a second processed feature; concatenating the first processed feature and the second processed feature to obtain an intermediate concatenated feature; and performing convolution on the intermediate concatenated feature to obtain the output of the i-th coordinate attention combination module.

[0042] As a specific example, referring to Figure 3, for the i-th CAC module, the skip input obtained from the visual encoder (i.e., the output of the corresponding downsampling convolutional block) is processed through a coordinate attention mechanism to enhance the spatial attention representation in both the horizontal and vertical directions. At the same time, the output of the previous CAC module (i.e., the upsampling input) is upsampled so that its spatial resolution matches the features of the current encoder layer; it is then concatenated with the encoder stream along the channel dimension and finally convolved to obtain the output of the current CAC module. This process can be written as: $U_i = \mathrm{Up}(D_{i-1})$; $D_i = \mathrm{Conv}(\mathrm{Concat}(\mathrm{CA}(S_i), U_i))$; where $\mathrm{Up}$ denotes upsampling, $S_i$ is the output of the downsampling convolutional block corresponding to the $i$-th CAC module, $D_{i-1}$ is the output of the $(i-1)$-th CAC module, $\mathrm{Concat}$ denotes channel concatenation, $\mathrm{Conv}$ denotes the convolution operation, $\mathrm{CA}$ denotes the Coordinate Attention (CA) operation, and $D_i$ is the output of the $i$-th CAC module.
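The data flow through one CAC module (attend to the skip input, upsample the previous output, concatenate along channels, convolve) can be sketched in NumPy. This is a simplified illustration, not the application's implementation: the coordinate attention here gates the feature map with plain row and column averages and omits the learned transforms a full coordinate attention block would have; nearest-neighbour 2× upsampling and a 1×1 convolution with random weights are likewise assumptions, as are all names and tensor sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x):
    """Simplified coordinate attention on a (C, H, W) map (no learned weights)."""
    row_gate = sigmoid(x.mean(axis=2, keepdims=True))  # (C, H, 1): pooled along the width
    col_gate = sigmoid(x.mean(axis=1, keepdims=True))  # (C, 1, W): pooled along the height
    return x * row_gate * col_gate

def upsample2x(x):
    """Nearest-neighbour upsampling by 2 in both spatial dimensions (ASSUMPTION)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: (C_out, C_in) weight applied to a (C_in, H, W) map."""
    return np.einsum("oc,chw->ohw", weight, x)

def cac_module(skip, previous, weight):
    attended = coordinate_attention(skip)                    # first processed feature
    upsampled = upsample2x(previous)                         # second processed feature
    merged = np.concatenate([attended, upsampled], axis=0)   # channel concatenation
    return conv1x1(merged, weight)                           # ASSUMPTION: 1x1 kernel

rng = np.random.default_rng(0)
skip = rng.standard_normal((8, 16, 16))       # encoder skip input (hypothetical size)
previous = rng.standard_normal((16, 8, 8))    # output of the previous CAC module
weight = rng.standard_normal((8, 8 + 16)) * 0.1
out = cac_module(skip, previous, weight)
```

Because both gates lie between 0 and 1, the attention step can only attenuate skip features, which matches the stated role of filtering the encoder stream before fusion.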

[0043] The output of the last (i.e., the N-th) CAC module is denoted as the second intermediate feature and is input into the second convolutional layer to obtain the image segmentation result. As a specific example, the second intermediate feature is input into a 1×1 convolutional layer that maps it to the number of categories; the Softmax function then computes the probability of each pixel belonging to each category to generate the final segmentation mask, which is the image segmentation result.

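[0044] $M = \mathrm{Softmax}(\mathrm{Conv}_{1\times 1}(D_N))$.

[0045] Here, $M$ is the segmentation mask, $\mathrm{Softmax}$ denotes the softmax function, $\mathrm{Conv}_{1\times 1}$ denotes the 1×1 convolutional layer, and $D_N$ is the second intermediate feature.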
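The segmentation head described above (a 1×1 convolution to the number of categories followed by a per-pixel softmax) can be sketched as follows. The final argmax that turns probabilities into a label map is an assumption, since the text only says the probabilities are used to generate the mask; the weights, the two-class setting, and the tensor sizes are hypothetical.

```python
import numpy as np

def segmentation_head(features, weight, bias):
    """features: (C, H, W); weight: (K, C); bias: (K,). Returns probabilities and labels."""
    # 1x1 convolution mapping C channels to K category scores per pixel.
    logits = np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)              # softmax over categories
    # ASSUMPTION: the mask assigns each pixel its most probable category.
    return probs, probs.argmax(axis=0)

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))   # second intermediate feature (toy size)
weight = rng.standard_normal((2, 8))        # K = 2 categories, e.g. background / lesion
bias = np.zeros(2)
probs, mask = segmentation_head(features, weight, bias)
```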

[0046] In this embodiment, coordinate attention is introduced. By embedding positional information into channel attention, the location and extent of the region of interest in a medical image can be accurately located, making it particularly suitable for processing lesions with irregular shapes and blurred boundaries (such as tumors and nodules). In addition, applying coordinate attention (CA) to the skip-connection features before concatenation filters out background noise in the encoder features, so that only features containing key spatial information are fused into the decoder, improving the signal-to-noise ratio of feature fusion.

[0047] To improve the alignment accuracy between text semantic information and image spatial features, this application employs a multi-dimensional interaction strategy. As shown in Figure 4, the Multi-Dimensional Cross-modal Interaction (MDCMI) module receives the text features and the image features output by the feature encoder as input.

[0048] In some embodiments, to further enhance the semantic relevance of the visual features, a self-attention mechanism is applied to the image features to obtain enhanced image features. In this case, step 1021 includes: performing global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the enhanced image features to obtain the global information, horizontal information, and vertical information, respectively. In some embodiments, a multi-head self-attention mechanism is applied to the image features to obtain the enhanced image features.

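[0049] Specifically, the above process can be expressed as: $F_g = \mathrm{Flatten}(F)$; $F_h = \mathrm{Flatten}(\mathrm{Pool}_h(F))$; $F_v = \mathrm{Flatten}(\mathrm{Pool}_v(F))$; where $F_g$ is the global information, $F_h$ is the horizontal information, $F_v$ is the vertical information, and $F$ is the input image feature map, a four-dimensional tensor of size $B \times C \times H \times W$. $\mathrm{Flatten}(\cdot)$ denotes the flattening operation, $\mathrm{Pool}_h(\cdot)$ and $\mathrm{Pool}_v(\cdot)$ denote horizontal and vertical pooling, and $B$, $C$, $H$, and $W$ denote the batch size, number of channels, feature map length, and feature map width, respectively.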
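The three decompositions can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the batch dimension is omitted, the feature map is a nested list of shape C × H × W, and average pooling is assumed because the pooling type is not specified in the text. The function names are hypothetical.

```python
def global_flatten(fmap):
    """C x H x W -> C x (H*W): keep every spatial position."""
    return [[v for row in channel for v in row] for channel in fmap]

def horizontal_pool(fmap):
    """Average across the width, one value per row -> C x H.

    Row (Y-axis) positions are preserved; average pooling is an assumption.
    """
    return [[sum(row) / len(row) for row in channel] for channel in fmap]

def vertical_pool(fmap):
    """Average across the height, one value per column -> C x W.

    Column (X-axis) positions are preserved; average pooling is an assumption.
    """
    return [[sum(col) / len(col) for col in zip(*channel)] for channel in fmap]

# Toy feature map with C=1, H=2, W=3.
fmap = [[[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]]
g = global_flatten(fmap)
h = horizontal_pool(fmap)
v = vertical_pool(fmap)
```

After pooling, each channel is already a one-dimensional sequence, so the subsequent flattening only has to merge the remaining singleton axis.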

[0050] In this embodiment, image features are explicitly decomposed into global information, horizontal information, and vertical information. Horizontal pooling and vertical pooling preserve the precise positional dependencies along the Y-axis and X-axis, respectively, which significantly improves the model's ability to perceive fine-grained spatial structures. This is particularly beneficial for image-text matching or detection tasks that require precise localization.

[0051] Then, the cross-attention mechanism is used to fuse global information and text features to obtain global fused features. The cross-attention mechanism is used to fuse horizontal information and text features to obtain horizontal fused features. The cross-attention mechanism is used to fuse vertical information and text features to obtain vertical fused features.

[0052] In other embodiments, a multi-head cross-attention mechanism is used to fuse global information and text features to obtain global fused features, a multi-head cross-attention mechanism is used to fuse horizontal information and text features to obtain horizontal fused features, and a multi-head cross-attention mechanism is used to fuse vertical information and text features to obtain vertical fused features.

[0053] In this embodiment, a cross-attention mechanism is used for three independent fusion paths, and global fusion ensures the consistency of the overall semantics. Horizontal and vertical fusion allow text features to specifically "query" and "focus" on the feature responses of the image in specific directions. For example, when the text mentions "left side," the horizontal fusion branch can more sensitively capture the feature responses on the left side; when the text mentions "top," the vertical fusion branch plays a crucial role. This branch alignment mechanism reduces interference from irrelevant backgrounds, enabling high-precision matching of text and image semantics across all three dimensions.

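[0054] It should be understood that the attention mechanism is calculated as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$; where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension. The query, key, and value depend on the operation: when a self-attention mechanism is applied to the image features, all three are derived from the image features; when a cross-attention mechanism is used to fuse the global information, the horizontal information, or the vertical information with the text features, they are derived from the corresponding information and the text features.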
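For illustration, the following Python sketch implements scaled dot-product attention on small lists of vectors and applies it once as self-attention and once as cross-attention. It is a sketch under assumptions: learned projection matrices and multiple heads are omitted, and taking the image-side tokens as the query with the text tokens as key and value is an assumption of this example, not something stated in the text.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: n_q x d, K: n_k x d, V: n_k x d_v (lists of row vectors).
    Returns n_q x d_v.
    """
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, V))
                    for c in range(len(V[0]))])
    return out

# Toy tokens (values are illustrative only).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. flattened global information
text_tokens = [[1.0, 0.0], [0.0, 2.0]]               # text features

self_attended = attention(image_tokens, image_tokens, image_tokens)
# Cross-attention; query = image side is an assumption of this sketch.
fused = attention(image_tokens, text_tokens, text_tokens)
uniform = attention([[0.0, 0.0]], text_tokens, text_tokens)
```

The same `attention` call would be repeated for the horizontal and vertical information to obtain the other two fused features.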

[0055] Finally, the global fusion feature, the horizontal fusion feature, and the vertical fusion feature are concatenated to obtain the target fusion feature. As a specific embodiment, the global fusion feature, the horizontal fusion feature, and the vertical fusion feature are concatenated after being processed by convolution to obtain the target fusion feature.

[0056] In this embodiment, the three-way fusion features are concatenated to generate a target fusion feature, which possesses strong complementarity. It simultaneously encompasses the macroscopic global context and the microscopic horizontal and vertical structures. This aggregation of multi-granular features makes the target fusion feature both robust (resistant to interference) and discriminative (distinguishing subtle differences), enabling it to more comprehensively express the complex relationship between images and text.

[0057] In the multimodal feature interaction process of this embodiment, the extracted global, horizontal, and vertical information are respectively subjected to multi-head cross-attention mechanism operations with text features, thereby realizing the fusion of multi-dimensional image features and text features. Since the image features at this stage not only come from the global flattened representation, but also include features processed by horizontal and vertical pooling, cross-modal interaction in three dimensions is realized, enabling the text description to be more accurately spatially located in the image.

[0058] Optionally, in some embodiments, before step 101, the method further includes: obtaining a training dataset, which includes multiple image-text sample pairs, each comprising a sample medical image and a corresponding sample description text; for each image-text sample pair, taking the sample description text as a positive sample and obtaining the corpus texts corresponding to the sample description text from a corpus as negative samples, thereby obtaining a negative sample set, wherein the corpus includes multiple corpus texts that provide standardized descriptions of the various situations corresponding to medical images; and training the medical image segmentation model based on the training dataset and the negative sample set, wherein the medical image segmentation model includes the feature encoder, the multi-scale cross-modal interaction module, and the visual decoder.

[0059] Each sample medical image and the sample descriptive text used to explain the sample medical image constitute an image-text sample pair. Multiple image-text sample pairs are pre-collected in the training dataset. In this embodiment, the medical image segmentation model is trained using the training dataset and the negative sample set to obtain the medical image segmentation model. The feature encoder, multi-scale cross-modal interaction module, and visual decoder used in steps 101 to 103 above are the components of the trained medical image segmentation model.

[0060] It should be understood that the corpus is pre-built, and for a certain type of medical image, the corpus texts included in the corpus cover all possible medical-specific descriptions of that type of medical image. Specifically, in some embodiments, the corpus texts are texts describing the infection type, number of infected areas, and infection location of the medical image.

[0061] For example, taking medical images of lung infection types as an example, the lung infection description is pre-structured and decomposed into four semantic components: infection type, number of infected areas, left lung infection area, and right lung infection area. Specifically, infection type includes "bilateral lung infection" and "unilateral lung infection"; the number of infected areas is divided into "1 infected area", "2 infected areas", "3 infected areas", "4 infected areas", and "multiple infected areas"; left lung infection areas include "left upper lung", "left lower lung", "left middle lung", "left middle upper lung", "left middle lower lung", "left upper middle lower lung", and "entire left lung"; right lung infection areas include "right upper lung", "right lower lung", "right middle lung", "right middle upper lung", "right middle lower lung", "right upper middle lower lung", and "entire right lung". This structured design is based on the inherent left-right division of the lung anatomy and the empirical distribution of the number of infections observed in cases—most infections involve 1-4 different areas, while more extensive infections are classified as "multiple". By integrating the above four components, multiple corpus texts were obtained, and a comprehensive medical descriptive corpus covering various lung infection scenarios was constructed.
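The combination of the four components can be sketched with a Cartesian product, as below. This is only an illustration: the sentence template, and the rule that bilateral texts pair one left-lung region with one right-lung region while unilateral texts name a single region, are assumptions of this sketch rather than rules stated in the text.

```python
from itertools import product

COUNTS = ["1 infected area", "2 infected areas", "3 infected areas",
          "4 infected areas", "multiple infected areas"]
LEFT = ["left upper lung", "left lower lung", "left middle lung",
        "left middle upper lung", "left middle lower lung",
        "left upper middle lower lung", "entire left lung"]
# The right-lung list mirrors the left-lung list.
RIGHT = [s.replace("left", "right") for s in LEFT]

def build_corpus():
    """Enumerate corpus texts from the structured components."""
    corpus = []
    # Bilateral: one left-lung region together with one right-lung region.
    for count, left, right in product(COUNTS, LEFT, RIGHT):
        corpus.append(f"bilateral lung infection, {count}, {left} and {right}")
    # Unilateral: a single region on either side.
    for count, region in product(COUNTS, LEFT + RIGHT):
        corpus.append(f"unilateral lung infection, {count}, {region}")
    return corpus

corpus = build_corpus()
```

The sketch does not filter clinically inconsistent combinations (for example, a bilateral infection with a single infected area); a practical corpus would likely need such a check.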

[0062] For each sample description text, the corresponding corpus text is obtained from the corpus as its negative sample. The specific number of corpus texts obtained is not limited here. The obtained corpus texts are semantically similar to the sample description text but have potential for confusion, and thus can be used as hard negative samples. These selected hard negative samples are then used to challenge the matching of positive sample pairs in the contrastive learning process, thereby more effectively improving the performance of contrastive learning.

[0063] Optionally, in some embodiments, the sample description text is used as a positive sample, and the corpus text corresponding to the sample description text is obtained from the corpus as a negative sample, resulting in a negative sample set, including: The sample medical image and the corresponding sample description text are input into a large language model for processing to obtain generated text. The generated text is used to describe the areas in the sample medical image that may be infected in the next stage but are not described by the sample description text. The similarity weight of the corpus text is determined based on the similarity between the corpus text and the generated text, and the relevance coefficient is determined based on the similarity between the corpus text and the sample description text. The comprehensive score of the corpus text is calculated based on the similarity weight and the relevance coefficient. The corpus texts whose comprehensive scores meet the preset conditions are identified as negative samples corresponding to the sample description text, thus obtaining a negative sample set.

[0064] Taking any sample medical image and its sample description text as an example, both are input into a pre-trained large language model for processing. Using the multimodal reasoning and generation capabilities of the frozen large language model, text that is semantically consistent with the sample description text is generated from the sample medical image and the sample description text, yielding the generated text. As a specific example, the prompt is as follows: "Sample description text is... This refers to the confirmed infected areas. From a medical perspective, please predict the areas in the image that may be infected in the next stage but are not yet described. Please answer as follows: No additional information is needed. If using 'and', both the left and right lungs must be described."

[0065] Optionally, in some embodiments, determining the similarity weight of the corpus text based on the similarity between the corpus text and the generated text, and determining the relevance coefficient based on the similarity between the corpus text and the sample description text, includes: inputting the sample description text, the generated text, and the corpus text into the text encoder for encoding processing to obtain sample description text features, generated text features, and corpus text features, respectively; calculating, based on a Gaussian kernel, the distance between the generated text features and the corpus text features to obtain the similarity weight of the corpus text; and calculating, based on the Gaussian kernel, the distance between the sample description text features and the corpus text features to obtain the relevance coefficient of the corpus text.

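[0066] First, the sample description text $t_s$, the generated text $t_g$, and the corpus text $t_c$ are input into the text encoder for encoding to obtain the sample description text features $T_s$, the generated text features $T_g$, and the corpus text features $T_c$: $T_s = E_{\mathrm{text}}(t_s)$; $T_g = E_{\mathrm{text}}(t_g)$; $T_c = E_{\mathrm{text}}(t_c)$, where $E_{\mathrm{text}}(\cdot)$ denotes the text encoder.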

[0067] The similarity between the generated text and each corpus text is calculated from the distance between the generated text features and the corpus text features, and the similarity weight of each corpus text is obtained after normalization. Likewise, the similarity between the sample description text and each corpus text is calculated from the distance between the sample description text features and the corpus text features, which gives the relevance coefficient of each corpus text.

[0068] The calculated similarity weight and relevance coefficient are combined into a comprehensive score for each corpus text, and the corpus texts corresponding to the sample description text are selected according to this score. In some embodiments, corpus texts whose comprehensive score is greater than a threshold are identified as the corpus texts corresponding to the sample description text. In other embodiments, the corpus texts whose comprehensive scores rank in the top $N$ are identified as the corpus texts corresponding to the sample description text, where $N$ is a positive integer.

[0069] Specifically, this embodiment uses a Gaussian kernel to calculate similarity. Compared to linear or Laplacian kernels, the Gaussian kernel can flexibly capture nonlinear feature similarity through its adjustable bandwidth parameter. This characteristic is more in line with the semantic association hypothesis—the distance between features in the latent space should smoothly reflect their semantic similarity, thereby bringing positive samples closer to each other and negative samples further apart.

[0070] Feature similarity is calculated with the Gaussian kernel as follows: $K(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right)$; where $K(\cdot,\cdot)$ denotes the Gaussian kernel calculation, $x$ and $y$ denote two different features, $\sigma$ is the bandwidth, which controls the smoothness, and $\exp(\cdot)$ denotes the natural exponential function.

[0071] As a specific example, the distance between the generated text features $T_g$ and the features $T_c^{m}$ of the $m$-th corpus text is first calculated with the Gaussian kernel $K(\cdot,\cdot)$: $d_m = K(T_g, T_c^{m})$; the similarity weight of the $m$-th corpus text is then obtained by normalizing $d_m$. Meanwhile, the distance between the sample description text features $T_s$ and the features of the $m$-th corpus text is calculated with the Gaussian kernel to give the relevance coefficient: $r_m = K(T_s, T_c^{m})$. The comprehensive score of each corpus text is determined from its similarity weight and relevance coefficient, and the corpus texts with the top $N$ comprehensive scores are identified as the corpus texts corresponding to the sample description text, yielding the negative sample set for that sample description text. Traversing all image-text sample pairs yields the negative sample sets for all of them.
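The selection procedure can be sketched in Python as follows. This is a minimal sketch under assumptions: features are plain lists of floats, the Gaussian kernel uses the common $2\sigma^2$ denominator, the similarity weights are normalized by dividing by their sum, and the comprehensive score is the product of weight and relevance coefficient. The function names and toy feature values are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); sigma is the bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def select_hard_negatives(desc_feat, gen_feat, corpus_feats, top_n, sigma=1.0):
    """Return the indices of the top-N corpus texts and all scores."""
    raw = [gaussian_kernel(gen_feat, c, sigma) for c in corpus_feats]
    total = sum(raw)
    weights = [r / total for r in raw]  # sum-normalization is an assumption
    relevance = [gaussian_kernel(desc_feat, c, sigma) for c in corpus_feats]
    scores = [w * r for w, r in zip(weights, relevance)]
    order = sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)
    return order[:top_n], scores

# Toy features (illustrative values only).
desc = [0.0, 0.0]   # sample description text features
gen = [1.0, 0.0]    # generated text features
corpus_feats = [[0.5, 0.0], [5.0, 5.0], [0.0, 1.0]]
idx, scores = select_hard_negatives(desc, gen, corpus_feats, top_n=2)
```

In this toy example the corpus text far from both reference features receives a score close to zero and is not selected, which mirrors the intent of retrieving only semantically close, easily confused negatives.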

[0072] In this embodiment, the semantically ambiguous or "difficult" medical descriptions generated by the large language model are used to calculate semantic similarity weights against the external corpus texts. These weights serve as a reference for retrieving negative samples from the external corpus that are semantically similar to the original input medical description. After the relevance coefficient between each corpus text and the sample description text is calculated, it is multiplied by the semantic similarity weight of that corpus text, and the Top-N corpus texts with the highest resulting scores are used as challenging negative examples for contrastive learning. In other words, to achieve effective alignment between images and text, the algorithm uses a large language model to generate challenging text descriptions for positive samples, calculates the semantic similarity between these descriptions and the corpus samples, and selects samples that are semantically similar but potentially confusing as challenging negative examples. Incorporating these challenging negative sample pairs (where the currently matched text is the positive sample and the remaining selected samples are negative samples) enhances cross-modal discrimination in complex semantic scenarios.

[0073] The training objective includes an optimization loss, a cross-entropy loss, an in-batch contrastive loss, and a corpus contrastive loss. The optimization loss and the cross-entropy loss are both calculated pixel-wise from the following quantities: $N$, the number of pixels; $C$, the number of categories (for example, $C$ is set to 1); $p_{i,c}$, the predicted probability that pixel $i$ belongs to category $c$; and $y_{i,c}$, the actual label mask.
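As an illustration of how such a pixel-wise term is typically evaluated, the sketch below computes a conventional mean cross-entropy over pixels. It assumes one-hot labels and the standard form $-\frac{1}{N}\sum_i\sum_c y_{i,c}\log p_{i,c}$; the exact formula used in this embodiment may differ, and the function name is hypothetical.

```python
import math

def cross_entropy(probs, labels, eps=1e-12):
    """Mean pixel-wise cross-entropy.

    probs[i][c]:  predicted probability that pixel i belongs to class c.
    labels[i][c]: one-hot ground-truth mask entry for pixel i and class c.
    eps guards against log(0).
    """
    n = len(probs)
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        total -= sum(y * math.log(max(p, eps)) for p, y in zip(p_row, y_row))
    return total / n

# Two pixels, two classes (illustrative values only).
probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [[1, 0], [0, 1]]
loss = cross_entropy(probs, labels)
perfect = cross_entropy([[1.0, 0.0]], [[1, 0]])
```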

[0074] The in-batch contrastive loss $\mathcal{L}_{bc}$ can be expressed as: $\mathcal{L}_{bc} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(K(v_i, t_i)/\tau)}{\sum_{j=1}^{B}\exp(K(v_i, t_j)/\tau)}$; where $K(\cdot,\cdot)$ denotes the Gaussian kernel calculation, $v_i$ is the feature of the $i$-th sample medical image in the current batch, $t_i$ is the feature of the $i$-th sample description text in the current batch, $t_j$ is the feature of the $j$-th sample description text in the current batch, $\tau$ is the temperature hyperparameter, and $B$ is the number of samples in the current training batch.
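A possible reading of this loss is sketched below in Python. It assumes an InfoNCE-style form in which the Gaussian-kernel similarity between each image feature and every text feature in the batch is divided by the temperature and passed through a softmax, with the matching text as the target; this form, the default parameter values, and the function names are assumptions of the sketch.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def in_batch_contrastive_loss(img_feats, txt_feats, tau=0.1, sigma=1.0):
    """Mean over the batch of -log softmax_j(K(v_i, t_j) / tau) at j = i."""
    batch = len(img_feats)
    loss = 0.0
    for i in range(batch):
        logits = [gaussian_kernel(img_feats[i], t, sigma) / tau for t in txt_feats]
        m = max(logits)
        # log-sum-exp for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / batch

# Toy batch of two image-text pairs (illustrative values only).
images = [[0.0, 0.0], [3.0, 3.0]]
texts_aligned = [[0.0, 0.0], [3.0, 3.0]]
texts_swapped = [[3.0, 3.0], [0.0, 0.0]]
loss_aligned = in_batch_contrastive_loss(images, texts_aligned)
loss_swapped = in_batch_contrastive_loss(images, texts_swapped)
```

Correctly paired features give a loss near zero, while swapped pairs are penalized heavily, which is the behavior expected of a contrastive objective.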

[0075] Finally, the corpus contrastive loss used for training the network is calculated from the following quantities: the feature of the $i$-th sample medical image in the current batch, the feature of the $i$-th generated text, the $j$-th negative sample identified from the corpus, and the similarity calculated between the current sample and the negative samples identified from the corpus.

[0076] In summary, the total loss during training is calculated as a weighted combination of the above losses, in which the three weighting coefficients are preset and can be set and adjusted according to actual circumstances; no specific restrictions are imposed here.

[0077] The training process of the medical image segmentation model provided in the embodiments of the present invention is described below with reference to Figure 5. As shown in Figure 5, the medical image segmentation model provided in this embodiment includes a feature encoder (a text encoder and a visual encoder), a multi-scale cross-modal interaction module, and a visual decoder. In this embodiment, a dual-stream image-text selective and contrastive (DITSC) block is also provided.

[0078] During training, based on sample medical images and sample descriptive text, a frozen Large Language Model (LLM) is used to generate semantically challenging but clinically plausible text that captures potential lesion evolution regions. The generated text and sample descriptive text share semantic similarity, serving as a reference for selecting challenging negative samples. Multiple corpus texts, encompassing text descriptions of all possible infected regions, are formed and used as a sample pool for contrastive learning.

[0079] In the DITSC block, similarity weights and relevance coefficients are calculated separately between the corpus text, the generated text, and the sample description text. These weights and coefficients are then fused to obtain a comprehensive score for the corpus text. Based on this comprehensive score, the system identifies and retrieves challenging corpus text as negative samples for contrastive learning. Furthermore, the DITSC module performs contrastive learning at two levels: based on in-batch samples, and simultaneously utilizing large language model text to dynamically retrieve difficult negative samples from the corpus. This dual strategy promotes discriminative cross-modal matching and strengthens multimodal representation learning.

[0080] As a specific implementation, in terms of text encoding, this embodiment uses a pre-trained CLIP text encoder as the backbone network. Visual features are extracted and reconstructed through the U-shaped architecture of the CAC module. The MDCMI module employs a multi-head attention mechanism to enhance the global relationship between visual features and performs global, horizontal, and vertical fusion of image and text features, strengthening the localization of linguistic concepts in visual space, which helps to more accurately locate lesion-related text clues.

[0081] In this embodiment, multi-dimensional interactions across global, horizontal, and vertical dimensions enable more precise integration of image and text features, achieving more accurate attention allocation and clearer focus on the target region. This mechanism significantly enhances feature fusion and alignment, resulting in superior segmentation accuracy and performance in both subjective and objective evaluations. The CAC module enhances the network's attention to fine structural details, contributing to the generation of more accurate and reliable segmentation results. Through these methods, the medical image segmentation model can accurately capture fine-grained lesion details and exhibits robustness on datasets with different imaging features and lesion distributions.

[0082] Referring to Figure 6, this invention also provides a cross-modal medical image segmentation device 600, comprising: an encoding processing module 601, used to input the original medical image and the original text description into the feature encoder for encoding processing, so as to obtain the image features corresponding to the original medical image and the text features corresponding to the original text description, wherein the text description is text data that explains the original medical image; a fusion processing module 602, used to input the image features and the text features into the multi-scale cross-modal interaction module for fusion processing to obtain the target fusion features; and a segmentation processing module 603, used to input the target fusion features into the visual decoder for image segmentation processing to obtain the image segmentation result. The fusion processing module 602 includes: a processing unit 6021, used to perform global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the image features to obtain global information, horizontal information, and vertical information, respectively; a fusion unit 6022, used to fuse the global information, the horizontal information, and the vertical information with the text features respectively using a cross-attention mechanism to obtain global fusion features, horizontal fusion features, and vertical fusion features; and a splicing unit 6023, used to splice the global fusion features, the horizontal fusion features, and the vertical fusion features to obtain the target fusion features.

[0083] Optionally, the feature encoder includes a visual encoder and a text encoder, and the encoding processing module 601 is specifically used for: inputting the original medical image into the visual encoder for encoding processing to obtain the image features, and inputting the original text description into the text encoder for encoding processing to obtain the text features. The visual encoder includes a first convolutional layer and $N$ sequentially connected downsampled convolutional blocks, and inputting the original medical image into the visual encoder for encoding processing to obtain the image features includes: inputting the original medical image into the first convolutional layer for convolution processing to obtain the first intermediate feature; and inputting the first intermediate feature into the $N$ sequentially connected downsampled convolutional blocks for processing, and determining the image features from the output of the last downsampled convolutional block, where $N$ is a positive integer.

[0084] Optionally, the visual decoder includes a second convolutional layer and $N$ sequentially connected coordinated attention combination modules, and inputting the target fusion features into the visual decoder for image segmentation processing to obtain the image segmentation result includes: inputting the target fusion features into the first coordinated attention combination module for processing to obtain the output of the first coordinated attention combination module; inputting the output of the first coordinated attention combination module into the remaining sequentially connected coordinated attention combination modules for processing to obtain the second intermediate feature, wherein the input of the $i$-th coordinated attention combination module includes the output of the $(i-1)$-th coordinated attention combination module and the output of the corresponding downsampled convolutional block, $i$ being an integer greater than 1 and less than or equal to $N$; and inputting the second intermediate feature into the second convolutional layer to obtain the image segmentation result.

[0085] Optionally, the $i$-th coordinated attention combination module is used for: processing the output of the corresponding downsampled convolutional block through the coordinated attention mechanism to obtain the first processed feature; upsampling the output of the $(i-1)$-th coordinated attention combination module to obtain the second processed feature; concatenating the first processed feature and the second processed feature to obtain the intermediate concatenated feature; and performing convolution processing on the intermediate concatenated feature to obtain the output of the $i$-th coordinated attention combination module.

[0086] Optionally, before the original medical image and the original text description are input into the feature encoder for encoding processing to obtain the image features corresponding to the original medical image and the text features corresponding to the original text description, the method further includes: obtaining a training dataset, which includes multiple image-text sample pairs, each comprising a sample medical image and a corresponding sample description text; for each image-text sample pair, taking the sample description text as a positive sample and selecting semantically related description texts from a corpus as negative samples to obtain a negative sample set, wherein the corpus includes multiple corpus texts that provide standardized descriptions of the various situations corresponding to medical images; and training the medical image segmentation model based on the training dataset and the negative sample set, wherein the medical image segmentation model includes the feature encoder, the multi-scale cross-modal interaction module, and the visual decoder.

[0087] Optionally, the sample description text is used as a positive sample, and the corresponding corpus text is obtained from the corpus as a negative sample, resulting in a negative sample set, including: The sample medical image and the corresponding sample description text are input into a large language model for processing to obtain generated text. The generated text is used to describe the areas in the sample medical image that may be infected in the next stage but are not described by the sample description text. The similarity weight of the corpus text is determined based on the similarity between the corpus text and the generated text, and the relevance coefficient is determined based on the similarity between the corpus text and the sample description text. The comprehensive score of the corpus text is calculated based on the similarity weight and the relevance coefficient. The corpus texts whose comprehensive scores meet the preset conditions are identified as negative samples corresponding to the sample description text, thus obtaining a negative sample set.

[0088] Optionally, the training loss includes an optimization loss, a cross-entropy loss, an in-batch contrastive loss, and a corpus contrastive loss. In the optimization loss and the cross-entropy loss, the quantities involved are the number of pixels, the number of categories, the predicted probability that a given pixel belongs to a given category, and the actual label mask. In the in-batch contrastive loss and the corpus contrastive loss, the quantities involved are the Gaussian kernel calculation, the temperature hyperparameter, the number of samples in the current training batch, the features of the $i$-th sample medical image in the current batch, the features of the $i$-th and $j$-th sample description texts in the current batch, the features of the $i$-th generated text, the $j$-th negative sample identified from the corpus, the number of negative samples identified from the corpus, and the similarity calculated between the current sample and the negative samples identified from the corpus.

[0089] The cross-modal medical image segmentation device 600 provided in this application embodiment can execute the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.

[0090] It should be noted that the division of units in the embodiments of this application is illustrative and only represents one logical functional division. In actual implementation, other division methods may be used. Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated units described above can be implemented in hardware or as software functional units.

[0091] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a processor-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0092] As shown in Figure 7, this application provides an electronic device 700, including: a memory 702, a processor 701, and a program stored in the memory 702 and executable on the processor 701; the processor 701 is used to read the program in the memory 702 to implement the steps in the cross-modal medical image segmentation method as described above.

[0093] This application also provides a readable storage medium storing a program that, when executed by a processor, implements the various processes of the above-described cross-modal medical image segmentation method embodiments and achieves the same technical effect. To avoid repetition, it will not be described again here. The readable storage medium can be any available medium or data storage device that the processor can access, including but not limited to magnetic storage (such as floppy disks, hard disks, magnetic tapes, magneto-optical disks (MOs), etc.), optical storage (such as compact disks (CDs), digital video discs (DVDs), Blu-ray discs (BDs), high-definition universal discs (HVDs), etc.), and semiconductor storage (such as read-only memory (ROMs), erasable programmable read-only memory (EPROMs), electrically erasable programmable read-only memory (EEPROMs), non-volatile memory (NAND flash), solid-state drives (SSDs)).

[0094] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0095] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0096] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other modifications under the guidance of this application without departing from its spirit, and all of these modifications are within the scope of protection of this application.

Claims

1. A cross-modal medical image segmentation method, characterized by comprising: inputting an original medical image and an original text description into a feature encoder for encoding processing to obtain image features corresponding to the original medical image and text features corresponding to the original text description, wherein the original text description is text data that explains the original medical image; inputting the image features and the text features into a multi-scale cross-modal interaction module for fusion processing to obtain target fusion features; and inputting the target fusion features into a visual decoder for image segmentation processing to obtain an image segmentation result; wherein the step of inputting the image features and the text features into the multi-scale cross-modal interaction module for fusion processing to obtain the target fusion features comprises: performing global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the image features to obtain global information, horizontal information, and vertical information, respectively; fusing the global information, the horizontal information, and the vertical information with the text features, respectively, using a cross-attention mechanism to obtain global fusion features, horizontal fusion features, and vertical fusion features; and concatenating the global fusion features, the horizontal fusion features, and the vertical fusion features to obtain the target fusion features.
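The fusion step of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes average pooling, reads "horizontal pooling" as averaging along the width (one token per row) and "vertical pooling" as averaging along the height (one token per column), uses single-head cross-attention without learned projections, and concatenates along the token axis. The function names (`cross_attention`, `multi_scale_fusion`, and so on) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, text_tokens):
    """Single-head cross-attention with no learned projections (assumption).
    Image tokens act as queries; text tokens act as keys and values."""
    d = len(text_tokens[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_tokens]
        weights = softmax(scores)
        fused.append([sum(w * v[c] for w, v in zip(weights, text_tokens)) for c in range(d)])
    return fused

def global_tokens(feat):
    """Global flattening: feat[c][h][w] -> H*W tokens of dimension C."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[feat[c][h][w] for c in range(C)] for h in range(H) for w in range(W)]

def horizontal_tokens(feat):
    """Average over the width (assumed reading), giving H tokens of dimension C."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[sum(feat[c][h]) / W for c in range(C)] for h in range(H)]

def vertical_tokens(feat):
    """Average over the height (assumed reading), giving W tokens of dimension C."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[sum(feat[c][h][w] for h in range(H)) / H for c in range(C)] for w in range(W)]

def multi_scale_fusion(feat, text_tokens):
    """Fuse each view with the text features, then concatenate the three results."""
    g = cross_attention(global_tokens(feat), text_tokens)
    h = cross_attention(horizontal_tokens(feat), text_tokens)
    v = cross_attention(vertical_tokens(feat), text_tokens)
    return g + h + v  # token-axis concatenation (assumption)
```

For a feature map with C channels, height H, and width W, this sketch yields H*W + H + W fused tokens, each with the dimension of the text tokens, so the text tokens must share the channel dimension C.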

2. The cross-modal medical image segmentation method according to claim 1, characterized in that the feature encoder includes a visual encoder and a text encoder; the step of inputting the original medical image and the original text description into the feature encoder for encoding processing to obtain the image features corresponding to the original medical image and the text features corresponding to the original text description comprises: inputting the original medical image into the visual encoder for encoding processing to obtain the image features, and inputting the original text description into the text encoder for encoding processing to obtain the text features; the visual encoder includes a first convolutional layer and N sequentially connected downsampling convolutional blocks, and the step of inputting the original medical image into the visual encoder for encoding processing to obtain the image features comprises: inputting the original medical image into the first convolutional layer for convolution processing to obtain a first intermediate feature; and inputting the first intermediate feature into the N sequentially connected downsampling convolutional blocks for processing, and determining the image features from the output of the last downsampling convolutional block, where N is a positive integer.
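The data flow of the visual encoder in claim 2 can be sketched as follows. This is only an illustration of the structure (one convolution followed by N downsampling blocks, with the last block's output used as the image features). The 3x3 mean filter and the 2x2 stride-2 averaging are stand-ins chosen for brevity and are assumptions, as are the single-channel input and the names `conv3x3_mean`, `downsample_block`, and `visual_encoder`.

```python
def conv3x3_mean(img):
    """Stand-in for the first convolutional layer: 3x3 mean filter, zero padding."""
    H, W = len(img), len(img[0])
    out = []
    for h in range(H):
        row = []
        for w in range(W):
            acc = 0.0
            for dh in (-1, 0, 1):
                for dw in (-1, 0, 1):
                    hh, ww = h + dh, w + dw
                    if 0 <= hh < H and 0 <= ww < W:
                        acc += img[hh][ww]
            row.append(acc / 9.0)
        out.append(row)
    return out

def downsample_block(feat):
    """Stand-in for one downsampling convolutional block: 2x2 average, stride 2."""
    H, W = len(feat) // 2, len(feat[0]) // 2
    return [[(feat[2 * h][2 * w] + feat[2 * h][2 * w + 1]
              + feat[2 * h + 1][2 * w] + feat[2 * h + 1][2 * w + 1]) / 4.0
             for w in range(W)] for h in range(H)]

def visual_encoder(img, n_blocks):
    """Return the output of every downsampling block; the last one is the image features."""
    feat = conv3x3_mean(img)  # first intermediate feature
    block_outputs = []
    for _ in range(n_blocks):
        feat = downsample_block(feat)
        block_outputs.append(feat)
    return block_outputs
```

Keeping the output of every block, not only the last, is a convenience of this sketch: the decoder of claim 3 also consumes outputs of the downsampling blocks.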

3. The cross-modal medical image segmentation method according to claim 2, characterized in that the visual decoder includes a second convolutional layer and a plurality of sequentially connected coordinated attention combination modules, and the step of inputting the target fusion features into the visual decoder for image segmentation processing to obtain the image segmentation result comprises: inputting the target fusion features into the first coordinated attention combination module for processing to obtain the output of the first coordinated attention combination module; processing the output of the first coordinated attention combination module through the subsequent sequentially connected coordinated attention combination modules to obtain a second intermediate feature, wherein the input to the i-th coordinated attention combination module includes the output of the (i-1)-th coordinated attention combination module and the output of a corresponding downsampling convolutional block, i being an integer greater than 1 and less than or equal to N; and inputting the second intermediate feature into the second convolutional layer to obtain the image segmentation result.

4. The cross-modal medical image segmentation method according to claim 3, characterized in that the i-th coordinated attention combination module, i being an integer greater than 1 and less than or equal to N, is used for: processing, through a coordinated attention mechanism, the output of the corresponding downsampling convolutional block to obtain a first processed feature; upsampling the output of the (i-1)-th coordinated attention combination module to obtain a second processed feature; concatenating the first processed feature and the second processed feature to obtain an intermediate concatenated feature; and performing convolution processing on the intermediate concatenated feature to obtain the output of the i-th coordinated attention combination module.
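The four operations of the combination module in claim 4 (attention on the encoder output, upsampling of the previous module's output, concatenation, convolution) can be sketched as below. Everything concrete is an assumption made for illustration: the attention is reduced to row-wise and column-wise sigmoid gates with no learned layers, upsampling is nearest-neighbour by a factor of 2, concatenation is along the channel axis, and the convolution is 1x1. The names `coordinate_gate`, `upsample2x`, `conv1x1`, and `combine` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def coordinate_gate(skip):
    """Simplified direction-aware gating: row and column averages become
    sigmoid gates that rescale each position (stand-in, no learned layers)."""
    out = []
    for ch in skip:
        H, W = len(ch), len(ch[0])
        row_gate = [sigmoid(sum(ch[h]) / W) for h in range(H)]
        col_gate = [sigmoid(sum(ch[h][w] for h in range(H)) / H) for w in range(W)]
        out.append([[ch[h][w] * row_gate[h] * col_gate[w] for w in range(W)]
                    for h in range(H)])
    return out

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a C x H x W feature map."""
    return [[[ch[h // 2][w // 2] for w in range(2 * len(ch[0]))]
             for h in range(2 * len(ch))] for ch in feat]

def conv1x1(feat, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    H, W = len(feat[0]), len(feat[0][0])
    return [[[sum(wc * feat[c][h][w] for c, wc in enumerate(row)) for w in range(W)]
             for h in range(H)] for row in weights]

def combine(skip, prev, weights):
    """skip: output of a downsampling block; prev: output of the previous module."""
    first = coordinate_gate(skip)   # first processed feature
    second = upsample2x(prev)       # second processed feature
    concatenated = first + second   # channel-wise concatenation
    return conv1x1(concatenated, weights)
```

The sketch requires that `prev`, once upsampled, has the same height and width as `skip`, which is why the pairing between modules and downsampling blocks matters.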

5. The cross-modal medical image segmentation method according to claim 1, characterized in that, before inputting the original medical image and the original text description into the feature encoder for encoding processing to obtain the image features corresponding to the original medical image and the text features corresponding to the original text description, the method further comprises: obtaining a training dataset, which includes multiple image-text sample pairs, each of which includes a sample medical image and a corresponding sample description text; for each image-text sample pair, taking the sample description text as a positive sample and selecting semantically related description texts from a corpus as negative samples to obtain a negative sample set, wherein the corpus includes multiple corpus texts that provide standardized descriptions of the various situations corresponding to medical images; and training a medical image segmentation model based on the training dataset and the negative sample set, wherein the medical image segmentation model includes the feature encoder, the multi-scale cross-modal interaction module, and the visual decoder.

6. The cross-modal medical image segmentation method according to claim 5, characterized in that the step of taking the sample description text as a positive sample and selecting description texts from the corpus as negative samples to obtain a negative sample set comprises: inputting the sample medical image and the corresponding sample description text into a large language model for processing to obtain generated text, wherein the generated text describes areas in the sample medical image that may become infected in the next stage but are not described by the sample description text; determining a similarity weight of each corpus text based on the similarity between the corpus text and the generated text, and determining a relevance coefficient based on the similarity between the corpus text and the sample description text; calculating a comprehensive score of the corpus text based on the similarity weight and the relevance coefficient; and identifying the corpus texts whose comprehensive scores meet a preset condition as negative samples corresponding to the sample description text, thereby obtaining the negative sample set.
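The scoring procedure of claim 6 can be sketched on precomputed text embeddings. The claim does not state how the two quantities are combined or what the preset condition is, so this sketch makes several assumptions: cosine similarity serves as the similarity measure, the comprehensive score is the product of the similarity weight and the relevance coefficient, and the preset condition is "the top_k highest scores". The function names and the `top_k` parameter are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_negatives(corpus_vecs, generated_vec, description_vec, top_k=2):
    """Return indices of corpus texts chosen as negative samples.

    corpus_vecs:     embeddings of the corpus texts
    generated_vec:   embedding of the text generated by the large language model
    description_vec: embedding of the sample description text
    """
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        similarity_weight = cosine(vec, generated_vec)
        relevance = cosine(vec, description_vec)
        score = similarity_weight * relevance  # assumed combination rule
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]  # assumed preset condition: top_k
```

A threshold on the comprehensive score would be an equally plausible reading of "preset condition"; only the final selection line would change.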

7. The cross-modal medical image segmentation method according to claim 6, characterized in that the training loss includes an optimization loss, a cross-entropy loss, a batch contrastive loss, and a corpus contrastive loss; the quantities used in these losses are: the number of pixels; the number of negative samples identified from the corpus; the number of categories; the number of samples in the current training batch; the predicted probability that a pixel belongs to a category; the ground-truth label mask; a Gaussian kernel calculation; the features of the text describing the i-th sample in the current batch; the features of the text describing the j-th sample in the current batch; the features of the i-th sample medical image in the current batch; the similarity calculation between the current sample and the negative samples identified from the corpus; the j-th negative sample identified from the corpus; the features of the i-th generated text; and a temperature hyperparameter.
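Two of the loss terms named in claim 7 have widely used textbook forms, sketched below for orientation. These are not the formulas of the claim. The sketch assumes a pixel-wise cross-entropy averaged over pixels and an InfoNCE-style in-batch contrastive loss in which the i-th image is paired with the i-th text and the other texts in the batch act as negatives. The function names and the default value of `tau` are hypothetical, and the optimization loss, the Gaussian kernel term, and the corpus contrastive loss are not covered.

```python
import math

def pixel_cross_entropy(probs, labels):
    """probs[p][c]: predicted probability that pixel p belongs to category c;
    labels[p]: ground-truth category of pixel p. Averaged over pixels."""
    eps = 1e-12  # avoids log(0)
    n = len(labels)
    return -sum(math.log(probs[p][labels[p]] + eps) for p in range(n)) / n

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def batch_contrastive_loss(image_feats, text_feats, tau=0.07):
    """InfoNCE-style loss over a batch; tau is the temperature hyperparameter."""
    batch = len(image_feats)
    total = 0.0
    for i in range(batch):
        logits = [dot(image_feats[i], t) / tau for t in text_feats]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / batch
```

In this form the loss falls as each image feature becomes more similar to its own text feature than to the other texts in the batch, and a smaller `tau` sharpens that preference.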

8. A cross-modal medical image segmentation device, characterized by comprising: an encoding processing module, used to input an original medical image and an original text description into a feature encoder for encoding processing to obtain image features corresponding to the original medical image and text features corresponding to the original text description, wherein the original text description is text data that explains the original medical image; a fusion processing module, used to input the image features and the text features into a multi-scale cross-modal interaction module for fusion processing to obtain target fusion features; and a segmentation processing module, used to input the target fusion features into a visual decoder for image segmentation processing to obtain an image segmentation result; wherein the fusion processing module includes: a processing unit, used to perform global flattening, horizontal pooling followed by flattening, and vertical pooling followed by flattening on the image features to obtain global information, horizontal information, and vertical information, respectively; a fusion unit, used to fuse the global information, the horizontal information, and the vertical information with the text features, respectively, using a cross-attention mechanism to obtain global fusion features, horizontal fusion features, and vertical fusion features; and a splicing unit, used to concatenate the global fusion features, the horizontal fusion features, and the vertical fusion features to obtain the target fusion features.

9. An electronic device, comprising: a memory, a processor, and a program stored in the memory and executable on the processor; characterized in that the processor is configured to read the program from the memory to implement the steps of the cross-modal medical image segmentation method according to any one of claims 1 to 7.

10. A readable storage medium for storing a program, characterized in that, when the program is executed by a processor, it implements the steps of the cross-modal medical image segmentation method according to any one of claims 1 to 7.