A three-dimensional tumor medical image multi-modal deep feature representation method based on a large language model

By constructing a multimodal representation method for three-dimensional tumor images based on a large language model, and combining spatial and semantic prompts for coarse segmentation and fine feature extraction, the problem of insufficient spatial and semantic utilization in tumor image analysis in existing technologies is solved, and a fine characterization of tumor internal heterogeneity and an improvement in the stability of prediction models are achieved.

CN122243859A · Pending · Publication Date: 2026-06-19 · NANJING TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING TECH UNIV
Filing Date
2026-01-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to fully utilize spatial and semantic priors in tumor image analysis. Single-modal radiomics and deep learning methods fall short in preserving detailed information, while multimodal methods fail to effectively co-encode image content and textual semantics, making it difficult to characterize the heterogeneity within tumors.

Method used

A multimodal representation method for 3D tumor medical images based on a large language model is constructed. Through an interactive volume data segmentation network and a feature extraction network, combined with spatial and semantic cues, coarse segmentation and fine feature extraction are performed. Multimodal feature fusion is carried out using a sliding window to form a patient-level feature representation.

Benefits of technology

It improves the ability to characterize tumor morphology and heterogeneity, enhances the recognition of tumor boundaries and small lesions, improves the stability and discrimination ability of the prediction model, and enhances the ability to discriminate features of key lesion areas.



Abstract

This invention proposes a multimodal deep feature representation method for 3D tumor medical images based on a large language model, belonging to the fields of medical image processing and artificial intelligence. The invention adopts a two-stage scheme of "coarse segmentation + fine feature extraction" to improve the multimodal representation capability for 3D tumors. In the coarse segmentation stage, 3D CT / MRI volume data, spatial cues (boxes / points) generated from the tumor delineation, and semantic cues from lesion text are input into an interactive segmentation network to obtain a tumor prediction mask, which is used to crop the ROI volume data. In the fine feature extraction stage, the ROI is divided into 3D sliding windows. Local spatial cues are constructed for each window, and the semantic cues are inherited. Window-level multimodal features are extracted through image, spatial, and semantic encoders and a fusion encoder. These features are weighted and aggregated according to the proportion of tumor voxels within each window to form patient-level global features, which are input into the prediction model to output patient status indicators. This invention takes into account both the overall morphology and local heterogeneity of the tumor, improving prediction robustness.

Description

Technical Field

[0001] This invention belongs to the field of medical image processing and artificial intelligence technology, and in particular relates to a method for representing multimodal deep features of three-dimensional tumor medical images based on a large language model.

Background Technology

[0002] Currently, radiomics and deep learning methods have been extensively studied in the quantitative analysis of images for diseases such as tumors. Traditional radiomics methods typically extract handcrafted features such as shape, grayscale, and texture within regions of interest (ROIs) delineated by physicians to construct classification or prognostic models. These methods rely on single ROI delineations by experts, making feature stability susceptible to changes in delineation boundaries and volume. Furthermore, the extracted statistical features primarily reflect grayscale distribution patterns, limiting the utilization of anatomical priors and semantic information about lesions, thus restricting the ability to characterize the complex morphology and biological heterogeneity of tumors.

[0003] With the development of convolutional neural networks and transformer networks, end-to-end deep learning-based methods have been proposed to replace some manual feature extraction processes. These methods typically take the entire region of interest (ROI) or the entire 3D medical image as input, extracting high-level features through multi-layer convolution, downsampling, and pooling operations, and directly outputting classification or prognostic prediction results. However, in order to expand the receptive field, these networks often repeatedly downsample in the spatial dimension, causing the detailed information of small lesions, complex boundaries, and focal infiltrating areas to be gradually weakened during feature transfer. This results in insufficient sensitivity to small-volume lesions and boundary regions, making it difficult to fully reflect local morphological changes in tumors.

[0004] To comprehensively utilize multi-source information, existing research has attempted to fuse image features with clinical indicators and genetic data in a multimodal manner. A common approach is to extract features from image and non-image data separately, then concatenate or perform simple linear combinations along the feature dimensions before inputting them into a classifier or survival model. In recent years, large-scale pre-trained models and large language models have been introduced into the field of medical imaging to align semantic information such as text reports and disease names with image features. However, current work primarily focuses on image-text retrieval, report generation, or coarse-grained classification. Collaborative modeling between spatial and semantic cues and 3D image volume data remains insufficient, and a systematic feature extraction framework that jointly encodes image content, spatial location, and text semantics within a unified representation space for specific tumor regions is still lacking.

[0005] In summary, existing technologies, including single-modal radiomics and conventional deep learning methods, struggle to fully utilize spatial and semantic priors. End-to-end networks also fall short in preserving spatial details. Furthermore, multimodal methods based on large language models primarily focus on image-text alignment and high-level semantic tasks, failing to develop refined multimodal representation schemes for 3D tumor regions. Therefore, it is necessary to propose a multimodal representation method for medical images that, based on reliable 3D coarse segmentation, combines spatial and semantic cues and leverages the deep feature extraction capabilities of large language models to improve the ability of patient-level representations to characterize tumor morphology and heterogeneity.

Summary of the Invention

[0006] The purpose of this invention is to address the problems of insufficient single-modal feature representation, inadequate utilization of multi-source information, and difficulty in characterizing tumor heterogeneity in existing medical image analysis, and to provide a novel multimodal representation method for medical images. This invention proposes a processing flow divided into a coarse segmentation stage and a fine feature extraction stage. First, a reliable tumor region is automatically obtained from the 3D medical image. Then, fine-grained feature extraction is performed within this region using a sliding window. Furthermore, an image encoder, spatial encoder, and semantic encoder are jointly used to model 3D image volume data, spatial cue information, and semantic cue information, fusing multimodal features to form a patient-level feature representation that reflects the overall morphology and internal heterogeneity of the tumor, thereby improving the stability and discriminative ability of subsequent prediction models. Specifically, the interactive volume data segmentation network and the interactive volume data feature extraction network are constructed based on a pre-trained structure related to a large language model, and are used to perform multimodal deep feature extraction in conjunction with textual semantic cues.

[0007] To achieve the above objectives, this invention provides a method for multimodal deep feature representation of three-dimensional tumor medical images based on a large language model. In one embodiment of this invention, the method includes a coarse segmentation stage and a fine feature extraction stage, as detailed below:

[0008] First, acquire the 3D medical image volumetric data of the object to be analyzed and the corresponding tumor delineation file. Preferably, the 3D medical image is CT or MRI volumetric data, and the tumor delineation file is the voxel-level segmentation result obtained by a clinician in 3D space. Based on the delineation file, spatial cue information characterizing the tumor location is automatically generated. The spatial cue information includes at least a bounding box around the tumor region and / or point cues located inside and outside the tumor. Simultaneously, semantic cue information about the target lesion or related anatomical structures input by the user is received. The semantic cue information can be a textual cue giving a disease name, a location name, or a combination thereof.

[0009] Secondly, in the coarse segmentation stage, the 3D medical image volume data, the spatial cue information, and the semantic cue information are input into the interactive volume data segmentation network. The interactive volume data segmentation network includes an image encoder, a spatial encoder, a semantic encoder, a fusion encoder, and a mask decoder. The image encoder extracts image features from the 3D medical image volume data; the spatial encoder encodes bounding box cues and point cues; the semantic encoder encodes semantic cue information; the fusion encoder fuses the above encoding results and provides a mask embedding to the mask decoder; the mask decoder outputs a predicted mask for the target region. Based on this predicted mask, spatial cropping is performed on the original 3D volume data to obtain ROI volume data centered on the tumor region.

[0010] Then, in the fine feature extraction stage, three-dimensional sliding window partitioning is performed on the ROI volume data. In one embodiment, preset window sizes and step sizes are set along the three directions of three-dimensional space, and the ROI volume data is scanned regularly to obtain multiple overlapping or non-overlapping sliding window volume data. For each sliding window, window-level spatial cue information, such as a local bounding box cue, local point cues, or a local mask, is generated based on the local region of the window in the prediction mask; at the same time, the aforementioned semantic cue information is used as window-level semantic cue information, thereby forming a cue input that corresponds one-to-one with the sliding window volume data.

[0011] Next, the volume data of each sliding window, along with its window-level spatial and semantic cues, is input into the interactive volume data feature extraction network. Preferably, the interactive volume data feature extraction network shares at least part of the image encoder, spatial encoder, and semantic encoder structure with the aforementioned interactive volume data segmentation network to maintain representation space consistency between the coarse segmentation stage and the fine feature extraction stage. For each sliding window, the image encoder outputs an image embedding representation, the spatial encoder outputs a spatial embedding representation, and the semantic encoder outputs a semantic embedding representation. These three types of embedding representations are input into the fusion encoder, which performs feature interaction and weighted fusion through concatenation and attention mechanisms to obtain the multimodal feature representation of the corresponding sliding window.

[0012] After obtaining the multimodal feature representations of all sliding windows for the same patient, window weights are further calculated based on the prediction mask. Specifically, the number of voxels belonging to the prediction mask within each sliding window is counted and divided by the total number of voxels in the overall prediction mask. This proportion is used as the window weight of the corresponding sliding window, reflecting the contribution of that window to the overall tumor burden. These window weights are then used to perform a weighted sum or weighted average of the multimodal feature representations of all sliding windows to form the global multimodal feature representation of the patient.

[0013] Finally, the patient-level global multimodal feature representation is input into the prediction model to obtain prediction results related to the patient's state. The prediction model can be a deep neural network structure such as a multilayer perceptron. In some embodiments, before the representation is input into the prediction model, feature selection such as LASSO or recursive feature elimination can be performed on the global multimodal feature representation to reduce redundant features and improve prediction performance.

[0014] The present invention, by adopting the above technical solution, has the following beneficial effects:

[0015] 1. This invention introduces sliding window fine-grained modeling on the basis of three-dimensional coarse segmentation, and performs weighted fusion by combining the proportion of tumor voxels within the window, so that local features at different spatial locations can be distinguished in the patient-level representation, more fully reflecting the morphological and textural heterogeneity of the tumor, and overcoming the deficiency of traditional radiomics global statistical features in terms of insufficient sensitivity to local heterogeneous regions.

[0016] 2. In the feature extraction process, this invention utilizes both spatial and semantic cue information. The features output by the spatial encoder, semantic encoder and image encoder are fused together, and the location prior and disease semantic prior are explicitly injected into the multimodal representation. This is beneficial to enhance the feature discrimination ability of tumor boundaries, small lesions and focal infiltration areas, and improve the model's recognition effect on key lesion areas.

[0017] 3. This invention shares at least a portion of the encoder structure between the interactive volume data segmentation network and the interactive volume data feature extraction network, enabling the coarse segmentation stage and the fine feature extraction stage to be performed in a unified representation space. On the one hand, this ensures the accuracy of volume data segmentation, and on the other hand, it provides a semantically consistent embedding basis for subsequent sliding window multimodal features, reducing the performance loss caused by inconsistent feature distributions at different stages.

[0018] 4. By designing window weights based on the proportion of tumor voxels, this invention can suppress the influence of background windows and noise windows that contain almost no tumor when constructing patient-level global features, highlighting key areas with large tumor burdens. This allows the obtained global features to maintain spatial details while improving robustness and generalization ability, providing a relatively stable and highly interpretable input feature foundation for subsequent prognostic prediction or risk assessment models.

Attached Figure Description

[0019] Figure 1 is a schematic diagram of the overall technical process of the present invention;

[0020] Figure 2 is a schematic diagram of the interactive volume data segmentation network structure;

[0021] Figure 3 is a schematic diagram of the network structure for interactive volume data feature extraction;

[0022] Figure 4 is a schematic diagram of the network structure for interactive volume data feature extraction.

[0023] In Figure 1 and Figure 2, the term "interactive segmentation network" corresponds to the interactive volume data segmentation network described in this specification.

[0024] In Figure 1 and Figure 3, the term "interactive segmentation network" corresponds to the interactive volume data feature extraction network described in this specification.

Detailed Implementation

[0025] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. After reading the present invention, any modifications of the present invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.

[0026] S1: Obtain the 3D medical image volume data V of the object to be analyzed and the corresponding tumor delineation file G.

[0027] The three-dimensional medical image volume data is preferably CT or MRI volume data;

[0028] The tumor delineation file is preferably a voxel-level segmentation result obtained by clinicians in three-dimensional space. The delineation file can be a single-class binary annotation or a multi-class annotation.

[0029] In another embodiment, when the tumor delineation file G is not available, spatial prompts can be generated by user-interactive input of point prompts and / or box prompts, or by an automatic detection / coarse localization module generating initial spatial prompts.

[0030] S2: Perform preprocessing on the three-dimensional medical image volume data to obtain the normalized volume data $V_{\text{in}}$ used for inference and the scaled volume data $V_{\text{out}}$ used for coarse segmentation.

[0031] In a preferred embodiment, the preprocessing includes at least one or more of the following operations:

[0032] Unify the images and labels to the preset spatial orientation;

[0033] Perform foreground intensity normalization on the image: a set of candidate foreground voxels is screened out based on the image voxel mean, the intensity is clipped to the upper and lower quantiles of that set, and the result is standardized;

[0034] Transform the dimensions of the images and labels to fit the network input;

[0035] Perform min-max normalization on the image;

[0036] Perform spatial padding on the image and labels to fit the preset input size spatial_size;

[0037] Scale the image and labels to spatial_size to obtain the scaled version $V_{\text{out}}$.

[0038] In one specific embodiment, spatial_size is set to (32,256,256), and the image and label are aligned to this size through spatial padding and scaling.
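The foreground intensity normalization and min-max steps listed above can be sketched as follows. This is a minimal illustration only: it assumes the volume has been flattened to a list of intensities, and the quantile levels (0.5% and 99.5%), the use of foreground statistics for standardization, and all function names are assumptions rather than values given in this specification.

```python
from statistics import mean, pstdev


def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list, q in [0, 1]."""
    return sorted_values[round(q * (len(sorted_values) - 1))]


def foreground_normalize(voxels, lower_q=0.005, upper_q=0.995):
    """Clip intensities to foreground quantiles, then standardize.

    Candidate foreground voxels are those above the global mean.
    The quantile levels are assumed, not taken from the source.
    """
    threshold = mean(voxels)
    foreground = sorted(v for v in voxels if v > threshold)
    if not foreground:  # constant image: nothing to separate
        return [0.0] * len(voxels)
    low, high = quantile(foreground, lower_q), quantile(foreground, upper_q)
    mu = mean(foreground)
    sigma = max(pstdev(foreground), 1e-8)
    return [(min(max(v, low), high) - mu) / sigma for v in voxels]


def min_max_normalize(voxels):
    """Rescale intensities linearly to [0, 1]."""
    low, high = min(voxels), max(voxels)
    span = max(high - low, 1e-8)
    return [(v - low) / span for v in voxels]


intensities = [0, 0, 0, 0, 10, 20, 30]
normalized = min_max_normalize(foreground_normalize(intensities))
```

In this toy input the four zero voxels fall below the mean and are treated as background, so they are clipped up to the lowest foreground intensity before rescaling.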

[0039] S3: Automatically generate spatial prompt information $P_s$ based on the tumor delineation file G.

[0040] The spatial prompt information includes at least one of the following:

[0041] Box prompt: calculate the minimum 3D bounding box of the tumor foreground voxel set, obtaining $(X_{\min}, Y_{\min}, Z_{\min}, X_{\max}, Y_{\max}, Z_{\max})$;

[0042] Point prompt: select at least one positive point inside the tumor and optionally at least one negative point outside the tumor; when the number of points is insufficient, placeholder points and placeholder labels can be used to keep the input dimensions consistent.
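The box and point prompts of step S3 can be sketched as below. This is an illustrative sketch, assuming the delineation is available as a set of (x, y, z) foreground voxel coordinates; the label convention (1, 0, -1), the (0, 0, 0) placeholder point, and the function names are assumptions introduced for the example.

```python
import random


def box_prompt(foreground):
    """Minimum 3D bounding box of a non-empty set of (x, y, z) voxels."""
    if not foreground:
        raise ValueError("delineation contains no foreground voxels")
    xs, ys, zs = zip(*foreground)
    return (min(xs), min(ys), min(zs), max(xs), max(ys), max(zs))


def point_prompts(foreground, background, num_points=4, seed=0):
    """Sample positive and negative points, padded to a fixed length.

    Labels: 1 = inside tumor, 0 = outside tumor, -1 = placeholder.
    The label convention and the (0, 0, 0) placeholder are assumptions.
    """
    rng = random.Random(seed)
    inside, outside = sorted(foreground), sorted(background)
    n_pos = min(len(inside), max(1, num_points // 2))
    n_neg = min(len(outside), num_points - n_pos)
    points = rng.sample(inside, n_pos) + rng.sample(outside, n_neg)
    labels = [1] * n_pos + [0] * n_neg
    padding = num_points - len(points)
    return points + [(0, 0, 0)] * padding, labels + [-1] * padding


tumor = {(1, 2, 3), (4, 2, 5), (2, 6, 3)}
box = box_prompt(tumor)
points, labels = point_prompts(tumor, background=set(), num_points=4)
```

Because no background voxels are supplied in this example, the two missing negative points are filled with placeholders so that the prompt always has four entries.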

[0043] S4: Receive semantic prompt information $P_t$ input by the user regarding the target lesion or related anatomical structures. The semantic prompt information may be a text prompt giving the name of a disease, the name of a body part, or a combination thereof.

[0044] The semantic prompt information is input into a pre-trained text encoder to obtain the semantic embedding $E_t$.

[0045] In a preferred embodiment, the text encoder employs a CLIP text encoding structure and maps the output to a semantic embedding space (e.g., 768-dimensional) of the same dimension as the image features through a dimension alignment layer.

[0046] In one embodiment, the text encoder parameters are frozen to enhance the stability of semantic representation.

[0047] S5: The scaled volume data $V_{\text{out}}$, the spatial prompt information $P_s$, and the semantic prompt information $P_t$ are input together into the interactive volume data segmentation network N, which outputs the coarse segmentation prediction result $L_{\text{out}}$ (logits).

[0048] In a preferred embodiment, the interactive volume data segmentation network N includes:

[0049] An image encoder, used to extract image embeddings from the three-dimensional volume data;

[0050] A prompt encoder, used to encode the positions of point prompts and box prompts into prompt embeddings, with the semantic embeddings participating in the fusion as sparse prompt tokens;

[0051] A mask decoder, used for interactive decoding based on the image embeddings and prompt embeddings, and outputting segmentation logits.

[0052] In a preferred embodiment, the mask decoder uses a Transformer structure to achieve bidirectional interaction between image features and prompt tokens, and when a semantic embedding exists, it fuses the semantic embedding with the upsampled features based on similarity to enhance the ability of semantics to modulate segmentation.

[0053] S6: The coarse segmentation prediction result $L_{\text{out}}$ is mapped back to the spatial scale of $V_{\text{in}}$ via interpolation to obtain the global prediction result $L_{\text{global}}$, where $L_{\text{global}}$ is a global prediction in logits form.
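The mapping in step S6 can be illustrated with a small resampling sketch. The specification only says "interpolation" and does not name a scheme, so nearest-neighbour lookup is used here purely as an assumption to keep the example short; the nested-list volume layout and the function name are likewise hypothetical.

```python
def resize_nearest(volume, out_shape):
    """Resample a nested-list volume indexed [d][h][w] to out_shape.

    Nearest-neighbour lookup is used only to keep the sketch short.
    """
    in_shape = (len(volume), len(volume[0]), len(volume[0][0]))

    def source_index(i, n_out, n_in):
        # Map an output index to the input index covering the same position.
        return min(n_in - 1, (i * n_in) // n_out)

    index = [
        [source_index(i, n_out, n_in) for i in range(n_out)]
        for n_out, n_in in zip(out_shape, in_shape)
    ]
    return [
        [[volume[d][h][w] for w in index[2]] for h in index[1]]
        for d in index[0]
    ]


coarse_logits = [[[1.0, 2.0], [3.0, 4.0]]]  # toy logits with shape (1, 2, 2)
global_logits = resize_nearest(coarse_logits, (1, 4, 4))
```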

[0054] S7: Foreground extraction is performed on the global prediction result $L_{\text{global}}$ to obtain the ROI coordinates and ROI data. Specifically, a probability map is obtained from $L_{\text{global}}$ through a sigmoid function and then thresholded, with a preferred threshold of 0.5, to obtain the foreground voxel set. The 3D bounds of this set are then calculated, automatically yielding the ROI coordinates $(D_{\min}, H_{\min}, W_{\min}, D_{\max}, H_{\max}, W_{\max})$.

[0055] In a preferred embodiment, the ROI coordinates are expanded according to a preset spatial_size, with boundary correction, so that the size of the cropped ROI volume data is exactly equal to spatial_size. Cropping is then performed on $V_{\text{in}}$ to obtain the ROI volume data $V_{\text{roi}}$, and the binary global prediction result within the ROI, which is a binary prediction mask, is obtained by the same crop.
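The ROI extraction of step S7 can be sketched as follows. This is a minimal sketch, assuming the logits are held as a nested list indexed [d][h][w]; reading the boundary handling as "shift the fixed-size range back inside the volume" is an interpretation, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def foreground_voxels(logits, threshold=0.5):
    """Coordinates (d, h, w) whose sigmoid probability exceeds threshold."""
    return [
        (d, h, w)
        for d, plane in enumerate(logits)
        for h, row in enumerate(plane)
        for w, value in enumerate(row)
        if sigmoid(value) > threshold
    ]


def roi_coordinates(voxels):
    """(D_min, H_min, W_min, D_max, H_max, W_max) of a non-empty voxel list."""
    ds, hs, ws = zip(*voxels)
    return (min(ds), min(hs), min(ws), max(ds), max(hs), max(ws))


def fixed_size_range(low, high, size, limit):
    """Half-open range of length size centred on [low, high] along one axis.

    The range is shifted back inside [0, limit) at the borders; this
    assumes limit >= size, e.g. after padding in preprocessing.
    """
    start = (low + high + 1) // 2 - size // 2
    start = max(0, min(start, limit - size))
    return start, start + size


logits = [[[-2.0, 3.0], [0.0, -1.0]]]
roi = roi_coordinates(foreground_voxels(logits))
```

Applying `fixed_size_range` once per axis gives a crop whose extent equals the preset size even when the tumor lies near the edge of the volume.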

[0056] S8: After obtaining the ROI volume data $V_{\text{roi}}$ and the binary global prediction result within the ROI, construct the prior information R used to generate window-level spatial prompts during the sliding window stage.

[0057] In one embodiment, based on the spatial prompt information $P_s$ obtained in step S3, a prompt map $B_{\text{roi}}$ is generated in the ROI coordinate system, mapping the prompt to three-dimensional binary data:

[0058] When the spatial prompt is a box prompt, $B_{\text{roi}}$ is a three-dimensional binary cube whose value is 1 inside the bounding box and 0 outside it.

[0059] When the spatial prompt is a point prompt, $B_{\text{roi}}$ is a three-dimensional binary point map whose value is 1 at the point locations and 0 elsewhere.
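The two forms of the prompt map can be sketched as below. This is an illustrative sketch, assuming nested-list volumes indexed [d][h][w] and inclusive box coordinates already expressed in the ROI coordinate system; the function names are hypothetical.

```python
def zeros(shape):
    """All-zero nested-list volume indexed [d][h][w]."""
    depth, height, width = shape
    return [[[0] * width for _ in range(height)] for _ in range(depth)]


def box_prompt_map(shape, box):
    """Binary volume that is 1 inside the (inclusive) box and 0 outside."""
    d0, h0, w0, d1, h1, w1 = box
    volume = zeros(shape)
    for d in range(max(d0, 0), min(d1, shape[0] - 1) + 1):
        for h in range(max(h0, 0), min(h1, shape[1] - 1) + 1):
            for w in range(max(w0, 0), min(w1, shape[2] - 1) + 1):
                volume[d][h][w] = 1
    return volume


def point_prompt_map(shape, points):
    """Binary volume that is 1 at each in-bounds point and 0 elsewhere."""
    volume = zeros(shape)
    for d, h, w in points:
        if 0 <= d < shape[0] and 0 <= h < shape[1] and 0 <= w < shape[2]:
            volume[d][h][w] = 1
    return volume


def count_ones(volume):
    """Number of voxels set to 1."""
    return sum(v for plane in volume for row in plane for v in row)


box_map = box_prompt_map((2, 3, 3), (0, 1, 1, 1, 2, 2))
point_map = point_prompt_map((2, 3, 3), [(1, 1, 1), (5, 5, 5)])
```

Points or box corners that fall outside the ROI after the coordinate change are simply ignored or clipped in this sketch.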

[0060] Preferably, the prompt map and the prediction result within the ROI are combined to form the prior information R, which is used for the automatic generation and constraint of window-level spatial prompts in subsequent steps.

[0061] S9: Perform three-dimensional sliding window partitioning on the ROI volume data $V_{\text{roi}}$.

[0062] In one implementation, preset window sizes and step sizes are set along the three directions of three-dimensional space, and $V_{\text{roi}}$ is scanned regularly to obtain multiple overlapping or non-overlapping sliding window volume data. For each sliding window $W_i$, window-level spatial prompt information $P_{s,i}$ is generated based on the local region of the prior information R corresponding to that window, where $P_{s,i}$ includes at least local bounding box prompts and / or local point prompts. At the same time, the semantic prompt information $P_t$ obtained in step S4 is used as the window-level semantic prompt information $P_{t,i}$. This creates a prompt input $(W_i, P_{s,i}, P_{t,i})$ that corresponds one-to-one with the sliding window volume data.
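Generating a window-level box prompt can be sketched as follows. This is a simplified illustration: it derives the local box from the predicted mask voxels alone, whereas the specification builds the prompt from the prior information R; the coordinate conventions and the function name are assumptions.

```python
def local_box_prompt(mask_voxels, start, size):
    """Bounding box of the mask voxels inside one window, or None.

    mask_voxels: iterable of (d, h, w) foreground coordinates in the ROI.
    start, size: window origin and extent along (d, h, w).
    The returned box is inclusive and relative to the window origin.
    """
    inside = [
        tuple(c - s for c, s in zip(voxel, start))
        for voxel in mask_voxels
        if all(s <= c < s + n for c, s, n in zip(voxel, start, size))
    ]
    if not inside:
        return None  # background window: no spatial prompt available
    ds, hs, ws = zip(*inside)
    return (min(ds), min(hs), min(ws), max(ds), max(hs), max(ws))


mask = [(0, 0, 0), (1, 5, 5), (3, 6, 7)]
first = local_box_prompt(mask, start=(0, 4, 4), size=(2, 4, 4))
second = local_box_prompt(mask, start=(2, 4, 4), size=(2, 4, 4))
empty = local_box_prompt(mask, start=(2, 0, 0), size=(2, 4, 4))
```

A window that contains no mask voxels returns `None` here; how such windows are prompted in practice is not specified in the source.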

[0063] Next, the sliding window volume data $W_i$, together with its window-level spatial prompt $P_{s,i}$ and window-level semantic prompt $P_{t,i}$, is input into the interactive volume data feature extraction network $N_{\text{feat}}$, which outputs the multimodal feature representation $F_i$ corresponding to the sliding window.

[0064] Preferably, the interactive volume data feature extraction network $N_{\text{feat}}$ shares at least part of the image encoder, spatial prompt encoder, and semantic encoder structures with the interactive volume data segmentation network described in step S5, to maintain the consistency of the representation space between the coarse segmentation stage and the fine feature extraction stage.

[0065] For each sliding window, the image encoder outputs an image embedding representation, the spatial prompt encoder outputs a spatial embedding representation, and the semantic encoder outputs a semantic embedding representation. These embedding representations are input into the fusion encoder, which performs feature interaction and weighted fusion through an attention mechanism to obtain the multimodal feature representation of the sliding window, $F_i = N_{\text{feat}}(W_i, P_{s,i}, P_{t,i})$.

[0066] In one specific embodiment, the sliding window size is set to (32, 256, 256), the window overlap rate is set to 0.5, and the corresponding three-dimensional step size is (16, 128, 128).
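The relation between window size, overlap rate, and step size in this embodiment can be checked with a short sketch. The ROI shape used below is a hypothetical example, and appending a final window flush with the border to cover any remaining tail is an assumption, not something stated in the specification.

```python
def window_starts(length, size, stride):
    """Start indices along one axis, with a final window flush to the end."""
    if length <= size:
        return [0]
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] + size < length:  # cover the remaining tail
        starts.append(length - size)
    return starts


def sliding_windows(roi_shape, window_size, overlap=0.5):
    """Return (strides, window origins) for a 3D ROI."""
    strides = tuple(max(1, int(s * (1 - overlap))) for s in window_size)
    axes = [
        window_starts(n, s, step)
        for n, s, step in zip(roi_shape, window_size, strides)
    ]
    origins = [(d, h, w) for d in axes[0] for h in axes[1] for w in axes[2]]
    return strides, origins


# Hypothetical ROI of shape (64, 512, 512) with the window size from [0066].
strides, origins = sliding_windows((64, 512, 512), (32, 256, 256))
```

With an overlap rate of 0.5 the stride along each axis is half the window size, which reproduces the step size (16, 128, 128) given above.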

[0067] S10: After obtaining the multimodal feature representations of all sliding windows for the same case, calculate the window weight of each sliding window based on the prediction result within the ROI. The window features are then weighted and aggregated to obtain the case-level global multimodal feature representation $F_{\text{global}}$.

[0068] In one implementation, the number $n_i$ of voxels belonging to the predicted foreground mask within the coverage area of the i-th sliding window is counted, together with the total number $N_M$ of predicted foreground mask voxels within the ROI. The ratio of the two is used as the window weight:

[0069] $$a_i = \frac{n_i}{N_M + \varepsilon}$$

[0070] Where ε is a preset, extremely small positive number used to avoid division by zero and improve numerical stability. When $N_M = 0$, the window weight is set to a uniform weight $a_i = 1/N$ (N being the number of sliding windows), or a preset default weight is used, to ensure that the aggregation can still be executed. Subsequently, the window weights are used to perform a weighted average of all sliding window multimodal feature representations to obtain a case-level global multimodal feature representation, for example:

[0071] $$F_{\text{global}} = \frac{\sum_{i=1}^{N} a_i F_i}{\sum_{i=1}^{N} a_i}$$
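The window weighting and aggregation of step S10 can be sketched as below. This is a minimal sketch under stated assumptions: ε is placed in the denominator of the ratio, the weighted average is normalized by the sum of the weights, and features are plain lists of floats; the function names and the plain-mean fallback are hypothetical.

```python
def window_weights(counts, total, eps=1e-8):
    """a_i = n_i / (N_M + eps), or uniform 1/N when the mask is empty."""
    if total == 0:
        return [1.0 / len(counts)] * len(counts)
    return [n / (total + eps) for n in counts]


def aggregate(features, weights):
    """Weighted average of equally sized feature vectors."""
    norm = sum(weights)
    if norm == 0:  # no window overlaps the mask: fall back to a plain mean
        weights, norm = [1.0] * len(features), float(len(features))
    return [
        sum(w * f[k] for w, f in zip(weights, features)) / norm
        for k in range(len(features[0]))
    ]


counts = [30, 10, 0]  # n_i: mask voxels inside each window
weights = window_weights(counts, total=40)
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # toy F_i vectors
f_global = aggregate(features, weights)
```

The third window contains no tumor voxels, so its features do not contribute to the aggregated result, which matches the stated aim of suppressing background windows.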

[0072] The case-level global multimodal feature representation $F_{\text{global}}$ can be used for subsequent analysis tasks, including but not limited to tumor burden assessment, efficacy prediction, prognostic risk stratification, or comprehensive modeling that integrates structured clinical information.

[0073] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A method for representing multimodal deep features of three-dimensional tumor medical images based on a large language model, characterized in that it includes the following steps:
(1) Acquire three-dimensional medical image volume data and the corresponding tumor delineation file, and generate spatial prompt information based on the delineation file, the spatial prompt information including at least box prompts and / or point prompts; simultaneously receive text prompt information describing the target lesion or anatomical structure;
(2) Input the three-dimensional medical image volume data, the spatial prompt information and the text prompt information into the interactive volume data segmentation network to obtain the prediction mask of the target region, and obtain the ROI volume data by cropping the three-dimensional medical image volume data according to the prediction mask;
(3) Divide the ROI volume data into three-dimensional sliding windows to obtain multiple sliding window volume data, and generate window-level spatial prompt information based on the local area of each sliding window in the prediction mask, while using the text prompt information as window-level semantic prompt information;
(4) Input each sliding window volume data, together with its window-level spatial prompt information and window-level semantic prompt information, into the interactive volume data feature extraction network, and process them through the image encoder, spatial encoder and semantic encoder respectively to obtain the corresponding image embedding representation, spatial embedding representation and semantic embedding representation;
(5) Input the image embedding representation, the spatial embedding representation and the semantic embedding representation into the fusion encoder, fuse the embedding representations, and obtain a multimodal feature representation that corresponds one-to-one with each sliding window;
(6) Determine the window weight based on the proportion of the number of voxels belonging to the prediction mask in each sliding window to the total number of voxels in the overall prediction mask, or an equivalent normalization method, and perform weighted fusion of the multimodal feature representations of all sliding windows based on the window weights to generate a patient-level global feature representation;
(7) Input the patient-level global feature representation into the prediction model to obtain prediction results related to the patient's state.

2. The method according to claim 1, characterized in that the interactive volume data segmentation network in step (2) includes an image encoder, a spatial encoder, a semantic encoder, a fusion encoder, and a mask decoder. The image encoder is used to extract image features from the three-dimensional medical image volume data. The spatial encoder is used to encode the spatial prompt information. The semantic encoder is used to encode the text prompt information. The fusion encoder is used to fuse the above encoding results and provide a mask embedding to the mask decoder, which outputs the prediction mask.

3. The method according to claim 1 or 2, characterized in that the sliding window division in step (3) uses three-dimensional sub-blocks with a preset size and step size in the three spatial dimensions to scan the ROI volume data regularly, thereby obtaining overlapping or non-overlapping sliding window volume data, in order to adapt to different lesion volumes and shapes.

4. The method according to any one of claims 1 to 3, characterized in that the interactive volume data feature extraction network in step (4) shares at least a portion of the encoder structure with the interactive volume data segmentation network in step (2) to maintain a unified multimodal representation space between the segmentation stage and the feature extraction stage.

5. The method according to any one of claims 1 to 4, characterized in that the fusion encoder described in step (5) performs feature interaction and weighted fusion on the image embedding representation, spatial embedding representation and semantic embedding representation through concatenation, weighted summation and / or attention mechanisms to form the multimodal feature representation.

6. The method according to any one of claims 1 to 5, characterized in that the window weights mentioned in step (6) are normalized proportional weights, and the weighted fusion adopts a weighted summation or weighted average so that sliding windows containing a larger tumor burden receive higher weights in the patient-level global feature representation.

7. The method according to any one of claims 1 to 6, characterized in that the prediction model in step (7) is a deep neural network structure, which takes the patient-level global feature representation as input and outputs one-dimensional or multi-dimensional prediction indicators.

8. The method according to any one of claims 1 to 7, characterized in that, prior to step (7), a feature selection step is performed on the patient-level global feature representation, wherein the feature selection employs at least one method such as LASSO, recursive feature elimination, or ensemble-model-based feature importance, to reduce redundant features and improve prediction performance.