A method and system for cancer survival prediction

The method extracts global features from 3D MRI image data, fuses features from multiple views, and combines them with clinical data. This addresses the limited survival prediction accuracy caused by 2D slice analysis in existing technologies, improves accuracy and generalization ability, and supports personalized diagnosis and treatment.

CN122244620A · Pending · Publication Date: 2026-06-19 · XI AN JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XI AN JIAOTONG UNIV
Filing Date: 2026-04-07
Publication Date: 2026-06-19

Application Information

Patent Timeline

07 Apr 2026: Application
19 Jun 2026: Publication (CN122244620A)

IPC: G06V10/80; G16H50/30; G16H30/40; G06V10/42; G06V10/26; G06N3/045; G06V10/82; G06V20/64

AI Tagging

Application Domain

Health-index calculation; Biological models

AI Technical Summary

Technical Problem

Current medical imaging prognostic methods mainly rely on two-dimensional slice analysis, ignoring three-dimensional spatial structural information, resulting in insufficient accuracy and generalization ability in survival prediction.

Method used

The method takes 3D MRI image data and applies a global feature extraction model together with multi-view feature fusion, combined with clinical data. Spatial attention and channel attention modules extract features, and a cross-attention mechanism captures the spatial correlation and multi-view features of the 3D images. These components together form the survival prediction model.

Benefits of technology

It significantly improves the accuracy and generalization ability of cancer survival prediction, and provides support for more precise personalized diagnosis and treatment plans.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122244620A_ABST

Patent Text Reader

Abstract

This invention discloses a method and system for predicting cancer survival, relating to the field of medical image analysis technology. The method includes the following steps: acquiring three-dimensional MRI image data and clinical data from cancer patients; inputting the three-dimensional MRI image data into a global feature extraction model to obtain global spatial features; selecting slices of the largest tumor region from the three-dimensional MRI image data at different perspectives, and obtaining multi-view features based on multiple slices; fusing the global spatial features, multi-view features, and embedded clinical data to obtain multi-modal features; and outputting the final prediction result based on the multi-modal features. This invention fully leverages the complete structural information of three-dimensional medical images, combining the advantages of multi-view and multi-modal data, significantly improving the accuracy of cancer survival prediction.

Description

Technical Field

[0001] This invention relates to the field of medical image analysis technology, and in particular to a method and system for predicting cancer survival.

Background Technology

[0002] Medical imaging survival prediction refers to using medical imaging data combined with patient clinical information, and employing deep learning methods to construct survival prediction models to predict patients' survival risks. Its core research objective is to provide quantitative evidence for medical care, assist clinical decision-making, and improve patient treatment success rates. Medical imaging data includes CT (Computed Tomography), PET (Positron Emission Tomography), MRI (Magnetic Resonance Imaging), and WSI (Whole Slide Imaging), among others.

[0003] Existing prognostic stratification mainly relies on the TNM (Tumor, Node, Metastasis) staging system and pathological tumor regression grading. However, these methods have limitations in clinical application. In particular, TNM staging cannot fully reflect the clinical heterogeneity of patients, and pathological tumor regression grading can only be obtained after surgery, which limits its application in early treatment decisions.

[0004] Existing research on medical image prognostic prediction has therefore begun to use deep learning to mine image features and compensate for these shortcomings, but significant technical limitations remain. Current survival prediction studies mostly analyze a single two-dimensional image of the largest cross-section of the tumor, neglecting the inherent three-dimensional spatial structure of medical images. Since tumors are highly spatially heterogeneous three-dimensional entities, their texture variations, edge infiltration, and volumetric features along the Z-axis are crucial for prognostic assessment. Relying solely on two-dimensional slices loses inter-slice correlation information and global spatial features, making it difficult for models to comprehensively represent the biological characteristics of tumors and thus limiting the accuracy and generalization ability of survival prediction.

Summary of the Invention

[0005] Based on the shortcomings of the existing technology, the present invention provides a cancer survival prediction method and system, which solves the problem that the analysis of a single two-dimensional image of the largest cross section of the tumor in existing medical imaging prognosis methods limits the accuracy and generalization ability of survival prediction.

[0006] The present invention adopts the following technical solution: In a first aspect, the present invention provides a method for predicting cancer survival, comprising the following steps: Acquire three-dimensional MRI image data and clinical data from cancer patients. Input the three-dimensional MRI image data into a global feature extraction model to obtain global spatial features. The global feature extraction model includes a segmentation module, a projection module, and multiple cascaded feature extraction modules. Each feature extraction module includes a downsampling module, a spatial attention module, a channel attention module, and a fusion module. The segmentation module segments the three-dimensional MRI image data into multiple patches, and the projection module projects these patches to obtain an input feature map. The downsampling module downsamples the current input feature map. The spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix to obtain a spatial attention map. The channel attention module extracts features from the downsampled input feature map using the shared query matrix, the shared key matrix, and a channel value layer matrix to obtain a channel attention map. The fusion module fuses the spatial attention map and the channel attention map to obtain an output feature map. Select slices of the largest tumor region in the three-dimensional MRI image data from different views, and obtain multi-view features based on the multiple slices. Fuse the global spatial features, multi-view features, and clinical data to obtain multimodal features. Output the final prediction result based on the multimodal features.

[0007] Preferably, the spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix, specifically including the following steps: The shared query matrix, shared key matrix, and spatial value layer matrix are obtained based on different projection weights; the shared key matrix and the spatial value layer matrix are projected; the similarity is obtained by multiplying the shared query matrix by the transpose of the projected shared key matrix and then normalizing the result; and the similarity is multiplied by the projected spatial value layer matrix to yield the spatial attention feature map.

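[0008] Preferably, the spatial attention map is as follows: $\hat{X}_s = \mathrm{Softmax}\left(\frac{Q_{shared}\,\tilde{K}_{shared}^{T}}{\sqrt{d}}\right) \cdot \tilde{V}_{spatial}$; in the formula, $\hat{X}_s$ is the spatial attention map; $Q_{shared}$, $\tilde{K}_{shared}$, and $\tilde{V}_{spatial}$ represent the shared query matrix, the projected shared key matrix, and the projected spatial value layer matrix, respectively; $d$ is the dimension of each vector; and $T$ denotes the transpose.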

[0009] Preferably, the channel attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a channel value layer matrix, including the following steps: Multiply the transpose of the shared query matrix by the shared key matrix, and then normalize the result; multiply the channel value layer matrix by the normalized result to obtain the channel attention feature map.

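[0010] Preferably, the channel attention map is as follows: $\hat{X}_c = V_{channel} \cdot \mathrm{Softmax}\left(\frac{Q_{shared}^{T}\,K_{shared}}{\sqrt{d}}\right)$; in the formula, $\hat{X}_c$ is the channel attention map; $V_{channel}$, $Q_{shared}$, and $K_{shared}$ represent the channel value layer matrix, the shared query matrix, and the shared key matrix, respectively; $d$ is the dimension of each vector; and $T$ denotes the transpose.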

[0011] Preferably, the multiple viewpoints include sagittal, coronal, and axial directions; the acquisition of multi-view features based on multiple slices specifically includes the following steps: Multiple slices are subjected to feature extraction to obtain multiple two-dimensional features; Multiple two-dimensional features are aggregated using a cross-attention mechanism to obtain multi-view features.

[0012] Preferably, clinical data is embedded as clinical features into a vector space of the same dimension as the global spatial features and multi-view features, and then fused with the global spatial features and multi-view features.

[0013] Preferably, the multimodal features are input into the Cox proportional hazards model to obtain the final prediction result.

[0014] Preferably, the acquisition of three-dimensional MRI image data of cancer patients specifically includes the following steps: Acquire multimodal MRI images of cancer patients, including T1, T1c, and T2; rigidly register the images of all modalities to the T1c sequence; apply N4 bias field correction, resampling, and grayscale normalization to the registered multimodal MRI images; and stack the volume of interest data of the standardized multimodal MRI images to obtain the three-dimensional MRI image data.

[0015] In a second aspect, the present invention provides a cancer survival prediction system, comprising: The acquisition module is used to acquire three-dimensional MRI image data and clinical data from cancer patients. The input module is used to input the 3D MRI image data into the global feature extraction model to obtain global spatial features. The global feature extraction model includes a segmentation module, a projection module, and multiple cascaded feature extraction modules. Each feature extraction module includes a downsampling module, a spatial attention module, a channel attention module, and a fusion module. The segmentation module segments the 3D MRI image data into multiple patches, and the projection module projects the patches to obtain an input feature map. The downsampling module downsamples the current input feature map. The spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix to obtain a spatial attention map. The channel attention module extracts features from the downsampled input feature map using the shared query matrix, the shared key matrix, and a channel value layer matrix to obtain a channel attention map. The fusion module fuses the spatial attention map and the channel attention map to obtain an output feature map. The aggregation module is used to select slices of the largest tumor region in the 3D MRI image data from different views and obtain multi-view features based on the multiple slices. The fusion module is used to fuse the global spatial features, multi-view features, and clinical data to obtain multimodal features; based on the multimodal features, the final prediction result is output.

[0016] Compared with the prior art, at least one of the technical solutions adopted by the present invention can achieve the following beneficial effects: This invention segments 3D MRI image data into multiple patches, projects them, and inputs them into multiple cascaded feature extraction modules. The spatial attention module and the channel attention module share a query matrix and a key matrix, effectively capturing joint features of the spatial and channel dimensions while reducing the computational complexity of self-attention. This fully exploits the complete structural information of 3D medical images, solving the loss of inter-slice correlation information and global spatial features caused by reducing the data to 2D slices in existing technologies. At the same time, by selecting slices of the largest tumor region from different views and aggregating multiple 2D features using a cross-attention mechanism, it simulates the multidimensional diagnostic reasoning of doctors, capturing spatial correlations between views and compensating for the shortcomings of a single view. Finally, the global spatial features, multi-view features, and clinical data are fused in a multimodal manner, significantly improving the accuracy of cancer survival prediction.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of a cancer survival prediction method according to the present invention; Figure 2 is a flowchart of the efficient pairwise attention module of the present invention; Figure 3 is a structural diagram of the efficient pairwise attention module of the present invention; Figure 4 is a model architecture diagram of the cancer survival prediction method based on pairwise attention multi-view features according to an embodiment of the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] This invention proposes a cancer survival prediction method, specifically a cancer survival prediction method based on pairwise attention multi-view features. It models and analyzes multimodal data from a clinical perspective, improving the accuracy of survival prediction in real clinical nasopharyngeal carcinoma patient data, providing clinical feasibility and interpretability, and offering a scalable technical route for doctors to accurately design personalized treatment plans. By designing modules for 3D image feature extraction, multi-view feature extraction, and clinical information fusion, and by designing an efficient pairwise attention module in 3D image feature extraction, a survival prediction model is built based on a CNN+Transformer architecture, improving the accuracy of survival prediction for nasopharyngeal carcinoma patients and providing strong technical support for clinical diagnosis and prognosis.

[0021] In previous survival prediction tasks, researchers typically selected a slice of the largest tumor region from MRI as input to the deep learning model for prognostic prediction, reducing the 3D image task to a 2D plane. This approach ignores the overall 3D structure of the image and loses crucial global spatial information. To address this limitation, this invention proposes a prediction method that integrates 3D global features and multi-view features. First, to capture the global feature representations discarded by traditional 2D methods, a Transformer-based feature extraction network is built that introduces efficient pairwise attention blocks. Shared keys and queries are used to encode the spatial and channel dimensions, effectively learning rich 3D features. Second, to compensate for the shortcomings of a single view and obtain feature representations from different views, a cross-attention mechanism is designed to compute features across multiple views, capturing the spatial correlation between views and highlighting the feature information most strongly related to survival prediction. Furthermore, combining these features with clinical information improves the model's survival prediction accuracy. Referring to Figures 1-4, the present invention specifically includes the following steps:

[0022] S1: Receive the patient's three-dimensional medical imaging data and clinical data.

[0023] Three-dimensional medical imaging data includes MRI, and clinical data includes electronic health record (EHR) data.

[0024] This invention uses nasopharyngeal carcinoma patients as an example. The medical image acquisition platform should adopt standardized imaging protocols and a unified scanning process to acquire the MRI data of the nasopharyngeal carcinoma patient. The MRI images of the nasopharyngeal carcinoma patient include T1, T1c, and T2. The input to the three-dimensional image feature extraction module is $X \in \mathbb{R}^{H \times W \times D \times C}$, where $H$, $W$, $D$, and $C$ represent the height, width, depth, and number of channels of an MRI image, respectively.

[0025] The acquired multimodal MRI data underwent standardized preprocessing. First, images from all modalities were rigidly registered to the T1c sequence to ensure spatial alignment. Second, the N4 bias field correction algorithm was applied to eliminate magnetic field inhomogeneities. Third, all images were resampled to an isotropic voxel resolution of 1 mm × 1 mm × 1 mm. Finally, the images were grayscale normalized, for example, using the Z-score method, to ensure that the mean grayscale value of each image was 0 and the standard deviation was 1, in order to eliminate differences introduced by different scanning devices. Simultaneously, a senior radiologist delineated the tumor region as the volume of interest (VOI) on the T1c images, and this mask was applied to other modalities.

[0026] The VOI data of the three modalities T1, T1c, and T2 are stacked along the channel dimension to form a four-dimensional tensor of size $H \times W \times D \times 3$, which is used as input.
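The normalization and stacking steps can be illustrated with a minimal NumPy sketch. It assumes registration, N4 correction, and resampling have already been done, and it uses synthetic arrays in place of real VOI crops; the function names `zscore` and `stack_modalities` and the array sizes are hypothetical, not taken from the patent.

```python
import numpy as np


def zscore(volume: np.ndarray) -> np.ndarray:
    """Scale one volume to zero mean and unit standard deviation."""
    volume = volume.astype(np.float64)
    std = volume.std()
    if std == 0.0:
        return np.zeros_like(volume)  # constant image: nothing to scale
    return (volume - volume.mean()) / std


def stack_modalities(t1: np.ndarray, t1c: np.ndarray, t2: np.ndarray) -> np.ndarray:
    """Stack three (H, W, D) volumes into one (H, W, D, 3) tensor."""
    if not (t1.shape == t1c.shape == t2.shape):
        raise ValueError("modalities must be registered to the same grid")
    return np.stack([zscore(t1), zscore(t1c), zscore(t2)], axis=-1)


# Synthetic stand-ins for registered, resampled VOI crops (hypothetical sizes).
rng = np.random.default_rng(0)
t1, t1c, t2 = (rng.normal(100.0, 20.0, size=(8, 8, 4)) for _ in range(3))
x = stack_modalities(t1, t1c, t2)
print(x.shape)
```

Each modality is normalized separately before stacking, so intensity scale differences between sequences do not carry into the channel dimension.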

[0027] The patient's electronic health record should be complete, with no missing values, and include gender, age, T stage, N stage, test2ebv, follow-up time (months), and whether a death event has occurred.

[0028] S2: Extract global spatial features from 3D medical image data.

[0029] Image features (global spatial features) are extracted through an encoder whose structure is similar to the encoder on the left side of the U-Net network, except that it adopts a hybrid CNN-Transformer structure. In addition, an efficient pairwise attention module is designed to capture spatial and channel features.

[0030] The input image is segmented into a series of non-overlapping small 3D cubes (i.e., patches). $P$ denotes the size of the patch, i.e., $P = (P_h, P_w, P_d)$. The total number of patches after splitting is $N = \frac{H}{P_h} \times \frac{W}{P_w} \times \frac{D}{P_d}$, which is the length of the input sequence: the height of the input image divided by the height of the patch, times the width of the input image divided by the width of the patch, times the depth of the input image divided by the depth of the patch. These patches are then projected onto the channel dimension, generating a feature map with $\frac{H}{P_h} \times \frac{W}{P_w} \times \frac{D}{P_d}$ spatial positions per channel.
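As a worked example of the patch arithmetic, the sketch below computes the patch grid and sequence length for one image size. The image size and patch size are hypothetical values chosen for illustration; the patent does not state them.

```python
def patch_grid(image_shape, patch_shape):
    """Return the number of patches per axis and the total sequence length."""
    for size, patch in zip(image_shape, patch_shape):
        if size % patch != 0:
            raise ValueError("each image dimension must be divisible by the patch size")
    grid = tuple(size // patch for size, patch in zip(image_shape, patch_shape))
    total = grid[0] * grid[1] * grid[2]
    return grid, total


# Hypothetical (H, W, D) = (128, 128, 64) volume with (4, 4, 2) patches.
grid, n_patches = patch_grid((128, 128, 64), (4, 4, 2))
print(grid, n_patches)
```

Because the patches do not overlap, the sequence length is simply the product of the per-axis patch counts.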

[0031] The encoder of the designed 3D image CNN+Transformer architecture extracts features in four stages. In each stage, the resolution is halved by a non-overlapping convolutional downsampling layer, which is followed by the designed efficient pairwise attention module (EPA, Efficient Paired Attention).

[0032] Each EPA module contains two attention submodules: a spatial attention submodule and a channel attention submodule. They encode information in both spatial and channel dimensions through a shared key-queries mechanism, thereby efficiently learning rich spatial-channel feature representations. The self-attention operation used in most CNN+Transformer hybrid architectures has a complexity quadratic with the number of tokens. This incurs high computational costs in survival prediction tasks, and this problem becomes even more pronounced when using window attention and convolutional components in hybrid designs. Therefore, by projecting the spatial matrix of keys and values into a low-dimensional space, spatial attention information can be learned efficiently. Effectively combining the interactions in the spatial dimension with the dependencies between channel features yields rich contextual spatial-channel feature representations, thereby improving survival prediction accuracy.

[0033] The downsampled input feature map $X$ is fed into the channel attention submodule and the spatial attention submodule of the efficient pairwise attention module. The weights of the query (Q) and key (K) linear layers are shared between the two attention submodules, and each attention submodule uses a different value (V) layer. The two attention submodules are computed as follows:

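[0034] $\hat{X}_s = \mathrm{SAM}(Q_{shared}, K_{shared}, V_{spatial})$; $\hat{X}_c = \mathrm{CAM}(Q_{shared}, K_{shared}, V_{channel})$; where $\hat{X}_s$, $\hat{X}_c$, $\mathrm{SAM}$, and $\mathrm{CAM}$ represent the spatial attention map, the channel attention map, the spatial attention submodule, and the channel attention submodule, respectively, and $Q_{shared}$, $K_{shared}$, $V_{spatial}$, and $V_{channel}$ are the shared query matrix, shared key matrix, spatial value layer matrix, and channel value layer matrix, respectively.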

[0035] (1) Spatial attention submodule.

[0036] The spatial attention submodule learns spatial information efficiently by reducing the complexity from $O(n^2)$ to $O(np)$, where $n$ is the number of tokens, $p$ is the dimension of the projected vector, and $p \ll n$. The normalized original feature map $X$, of shape $n \times C$ (with $C$ the channel dimension of the feature map), is passed through three linear layers to compute the $Q_{shared}$, $K_{shared}$, and $V_{spatial}$ projections:

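[0037] $Q_{shared} = X W^{Q}$; $K_{shared} = X W^{K}$; $V_{spatial} = X W^{V_s}$; where each output has dimension $n \times C$, and $W^{Q}$, $W^{K}$, and $W^{V_s}$ are the projection weights of $Q_{shared}$, $K_{shared}$, and $V_{spatial}$, respectively.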

[0038] Three steps are then performed. First, the shared key layer $K_{shared}$ and the spatial value layer $V_{spatial}$ are projected from shape $n \times C$ onto low-dimensional matrices of shape $p \times C$. Second, the shared query layer $Q_{shared}$ is multiplied by the transpose of the projected $K_{shared}$ layer, and Softmax is applied to obtain the spatial attention matrix, which measures the similarity between each feature and the other spatial features. Third, this similarity is multiplied by the projected $V_{spatial}$ layer, yielding a spatial attention feature map of shape $n \times C$. The formula for calculating the spatial attention feature map is as follows:

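[0039] $\hat{X}_s = \mathrm{Softmax}\left(\frac{Q_{shared}\,\tilde{K}_{shared}^{T}}{\sqrt{d}}\right) \cdot \tilde{V}_{spatial}$; where $Q_{shared}$, $\tilde{K}_{shared}$, and $\tilde{V}_{spatial}$ represent the shared query matrix, the projected shared key matrix, and the projected spatial value layer matrix, respectively; $d$ is the dimension of each vector; and $T$ denotes the transpose.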
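The spatial attention steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the patent's implementation: a single random matrix `proj` of shape (p, n) stands in for the learned token projection and is reused for both keys and values, `d` is taken to be the channel width, and all sizes and weights are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def spatial_attention(x, w_q, w_k, w_v, proj):
    """x: (n, C) tokens; w_*: (C, C) weights; proj: (p, n) token projection."""
    q = x @ w_q              # shared query, (n, C)
    k = proj @ (x @ w_k)     # projected shared key, (p, C)
    v = proj @ (x @ w_v)     # projected spatial value, (p, C)
    d = q.shape[-1]          # assumed: d is the per-token vector width
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (n, p) similarity
    return attn @ v, attn    # (n, C) spatial attention map, plus the weights


rng = np.random.default_rng(0)
n, p, c = 64, 8, 16          # hypothetical sizes with p much smaller than n
x = rng.normal(size=(n, c))
w_q, w_k, w_v = (rng.normal(size=(c, c)) / np.sqrt(c) for _ in range(3))
proj = rng.normal(size=(p, n)) / np.sqrt(n)
x_s, attn = spatial_attention(x, w_q, w_k, w_v, proj)
print(x_s.shape, attn.shape)
```

The attention matrix has shape (n, p) rather than (n, n), which is where the reduction from quadratic to linear cost in the number of tokens comes from.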

[0040] (2) Channel attention submodule.

[0041] Dependencies between feature channels are captured by performing a dot product, in the channel dimension, between the channel value layer matrix $V_{channel}$ and the channel attention matrix. The same $Q_{shared}$ and $K_{shared}$ as in the spatial attention submodule are used, and the channel value layer $V_{channel}$ is computed through a linear layer to learn complementary features, where $V_{channel} = X W^{V_c}$ (of dimension $n \times C$) and $W^{V_c}$ is the projection weight of the channel value layer.

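[0042] The formula for calculating the channel attention feature map is as follows: $\hat{X}_c = V_{channel} \cdot \mathrm{Softmax}\left(\frac{Q_{shared}^{T}\,K_{shared}}{\sqrt{d}}\right)$; where $V_{channel}$, $Q_{shared}$, and $K_{shared}$ represent the channel value layer, the shared query matrix, and the shared key matrix, respectively.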
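A minimal NumPy sketch of the channel attention computation follows. It is illustrative only: the weights are random, the sizes are hypothetical, and `d` is taken to be the channel width, which is an assumption since the source only describes it as the dimension of each vector.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def channel_attention(x, w_q, w_k, w_vc):
    """x: (n, C) tokens; w_*: (C, C) weights. Returns an (n, C) map."""
    q = x @ w_q              # shared query, (n, C)
    k = x @ w_k              # shared key, (n, C)
    v = x @ w_vc             # channel value layer, (n, C)
    d = q.shape[-1]          # assumed scaling dimension
    attn = softmax(q.T @ k / np.sqrt(d), axis=-1)   # (C, C) channel similarity
    return v @ attn, attn    # channel attention map, plus the weights


rng = np.random.default_rng(0)
n, c = 64, 16                # hypothetical token count and channel width
x = rng.normal(size=(n, c))
w_q, w_k, w_vc = (rng.normal(size=(c, c)) / np.sqrt(c) for _ in range(3))
x_c, attn = channel_attention(x, w_q, w_k, w_vc)
print(x_c.shape, attn.shape)
```

Here the attention matrix is (C, C), so its cost depends on the channel width rather than on the number of tokens.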

[0043] Finally, the outputs of the two attention submodules are summed and fused, and the fused result is transformed by convolutional modules to obtain rich feature representations. The final output $\hat{X}$ of the efficient pairwise attention module is calculated as follows:

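[0044] $\hat{X} = \mathrm{Conv}_1\left(\mathrm{Conv}_3\left(\hat{X}_s + \hat{X}_c\right)\right)$; where $\hat{X}_s$ and $\hat{X}_c$ represent the spatial attention map and the channel attention map, respectively, and $\mathrm{Conv}_1$ and $\mathrm{Conv}_3$ are the 1×1×1 and 3×3×3 convolutional modules, respectively.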

[0045] S3: Extract multi-view features from multiple two-dimensional slice planes (such as sagittal, coronal and axial views) of three-dimensional medical image data.

[0046] Multi-view feature extraction was performed by selecting the most representative slice from sagittal, coronal, and axial perspectives. The strategy employed was to select a slice representing the location of the largest tumor region in the sagittal, coronal, and axial directions based on the tumor region label. Essentially, this characterizes key spatial information about the tumor in terms of anterior-posterior continuity, vertical expansion, lateral positioning, and lateral invasion range.

[0047] The aim is to extract complementary, finer-grained local features from multiple standard 2D slice planes of 3D images to compensate for the potential limitations of 3D convolution in capturing subtle textures. As a preferred implementation, slices representing the largest tumor regions in each sagittal, coronal, and axial direction are selected for feature extraction. For each slice, a pre-trained 2D convolutional neural network, such as ResNet-50 or EfficientNet, is used as the feature extractor. The weights of this 2D-CNN can be pre-trained on large-scale natural image datasets (such as ImageNet) and fine-tuned for medical image data. For a slice within each viewpoint, the extracted features are aggregated using a cross-attention mechanism between viewpoints to generate a compact feature vector. Finally, these three vectors are concatenated or added to form the final multi-view feature representation.

[0048] The multi-view feature extraction module is implemented in two steps. Because the dataset includes tumor region labels, the most representative slice for each view is chosen as the slice containing the largest tumor region in that direction, and subsequent feature extraction is performed on the selected slices.
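The slice selection strategy can be sketched with NumPy: for each axis, count the labeled tumor voxels in every slice and keep the slice with the largest count. The mapping of array axes to sagittal, coronal, and axial views is an assumption that depends on image orientation, and the volume and mask below are synthetic.

```python
import numpy as np

VIEW_NAMES = ("sagittal", "coronal", "axial")  # assumed axis order (x, y, z)


def largest_tumor_slices(volume: np.ndarray, mask: np.ndarray) -> dict:
    """For each axis, return (index, 2D slice) where the mask area is largest."""
    slices = {}
    for axis, name in enumerate(VIEW_NAMES):
        other_axes = tuple(a for a in range(3) if a != axis)
        areas = mask.sum(axis=other_axes)      # tumor voxels per slice
        index = int(areas.argmax())            # first maximum on ties
        slices[name] = (index, np.take(volume, index, axis=axis))
    return slices


# Synthetic single-channel volume and tumor mask (hypothetical sizes).
volume = np.random.default_rng(0).normal(size=(10, 12, 14))
mask = np.zeros(volume.shape, dtype=bool)
mask[2:5, 3:9, 4:6] = True
mask[3, 3:9, 4:8] = True     # widen the tumor on one sagittal slice
slices = largest_tumor_slices(volume, mask)
print({name: idx for name, (idx, _) in slices.items()})
```

For a multi-channel input the same indices could be applied to each modality, since the mask is shared across modalities after registration.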

[0049] Each image from the three views is split into patches and embedded into a 768-dimensional vector space. Cross-attention is computed pairwise between views, giving six independent attention layers, one for each cross-path between two views. A separate positional encoding is learned for each image size to preserve spatial information. The resulting patch features are max-pooled, concatenated, and passed through a fully connected layer, giving a 1024-dimensional output that serves as the final multi-view feature.
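The pairwise cross-attention over three views can be sketched as below. Three views give six ordered (query view, context view) pairs, each with its own weights. This is a simplified illustration: positional encodings are omitted, weights are random, and small hypothetical dimensions replace the 768 and 1024 stated in the text.

```python
import itertools
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Queries come from one view, keys and values from another."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v                      # (n_query, dim)


def multi_view_features(views, weights, w_out):
    """Run all ordered view pairs, max-pool tokens, concatenate, project."""
    pooled = []
    for pair in itertools.permutations(sorted(views), 2):   # 6 ordered pairs
        query_view, context_view = pair
        out = cross_attention(views[query_view], views[context_view], *weights[pair])
        pooled.append(out.max(axis=0))   # max pooling over patch tokens
    return np.concatenate(pooled) @ w_out


rng = np.random.default_rng(0)
dim, out_dim = 16, 32                    # the text uses 768 and 1024
views = {
    "sagittal": rng.normal(size=(6, dim)),
    "coronal": rng.normal(size=(8, dim)),
    "axial": rng.normal(size=(10, dim)),
}
weights = {
    pair: tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    for pair in itertools.permutations(sorted(views), 2)
}
w_out = rng.normal(size=(len(weights) * dim, out_dim))
feature = multi_view_features(views, weights, w_out)
print(len(weights), feature.shape)
```

Views may contribute different numbers of patch tokens, because max pooling reduces each pair's output to one vector before concatenation.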

[0050] S4: Embed clinical data into the same dimension as global spatial features and multi-view features.

[0051] Patient-related electronic health record information is embedded as clinical features into a 768-dimensional vector space to facilitate fusion with global spatial features and multi-view features.

[0052] Numerical clinical features are embedded into the same dimension as the global spatial features and multi-view features. After one-hot encoding, the processed clinical features are input into a multilayer perceptron (MLP) consisting of several fully connected layers and activation functions. The MLP projects the sparse, high-dimensional raw clinical data into a low-dimensional, dense embedding space, and the dimension of its output embedding vector matches that of the image feature vectors to facilitate subsequent fusion.
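A minimal sketch of the clinical embedding follows. The record layout, the number of N stage classes, the scaling of age and EBV DNA, the hidden width of 64, and the random weights are all assumptions made for illustration; only the one-hot encoding, the MLP, and the 768-dimensional output come from the text.

```python
import numpy as np


def one_hot(index: int, num_classes: int) -> np.ndarray:
    """Return a one-hot vector with a 1.0 at the given index."""
    vec = np.zeros(num_classes)
    vec[index] = 1.0
    return vec


def encode_clinical(record: dict) -> np.ndarray:
    """Turn one hypothetical patient record into a flat feature vector."""
    return np.concatenate([
        one_hot(record["sex"], 2),            # 0 = female, 1 = male
        one_hot(record["t_stage"] - 1, 4),    # T1..T4
        one_hot(record["n_stage"], 4),        # assumed N0..N3
        np.array([record["age"] / 100.0,      # assumed scaling
                  np.log10(record["ebv_dna"] + 1.0)]),
    ])


def mlp_embed(features, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
record = {"sex": 1, "t_stage": 3, "n_stage": 2, "age": 52, "ebv_dna": 4200.0}  # made-up values
features = encode_clinical(record)
w1, b1 = rng.normal(size=(features.size, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.normal(size=(64, 768)) * 0.1, np.zeros(768)
embedding = mlp_embed(features, w1, b1, w2, b2)
print(features.shape, embedding.shape)
```

Matching the output width to the image feature width lets the clinical vector be concatenated with the image features without further reshaping.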

[0053] S5: The global spatial features, multi-view features, and embedded clinical data are fused to obtain multimodal features.

[0054] The global spatial features, multi-view features, and clinical features are concatenated.

[0055] S6: Based on the fused multimodal features, output the final survival risk prediction result.

[0056] The fused multimodal features are input into the Cox proportional hazards (CoxPH) model, which outputs the final survival risk prediction result, i.e., the patient's predicted risk score. This score can be used to stratify patients into high-risk and low-risk groups, providing auxiliary decision support for clinicians developing personalized treatment plans.
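The text does not state how the Cox model is fitted. One common choice for networks that output a risk score is to minimize the negative log partial likelihood, sketched here in plain Python under that assumption. Ties are handled in the Breslow style, and the toy follow-up data are made up.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log partial likelihood, averaged over observed events."""
    total, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only appear in risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set = sum(math.exp(r) for r, t in zip(risk_scores, times) if t >= t_i)
        total += risk_scores[i] - math.log(risk_set)
        n_events += 1
    return -total / n_events if n_events else 0.0


# Toy data: follow-up in months, 1 = death observed, 0 = censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
good = cox_neg_log_partial_likelihood([2.0, 1.0, 0.5, 0.0], times, events)
bad = cox_neg_log_partial_likelihood([0.0, 0.5, 1.0, 2.0], times, events)
flat = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0, 0.0], times, events)
print(good < bad)
```

Scores that rank earlier deaths as higher risk give a lower loss than scores that reverse the ranking, which is the property a training procedure would exploit.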

[0057] MRI and electronic health record data were collected from 500 nasopharyngeal carcinoma patients in real clinical practice, and the performance of the survival prediction model was comprehensively evaluated using ROC curves, the C-index, and Kaplan-Meier survival analysis.
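For reference, the C-index mentioned above can be computed as in this plain Python sketch of Harrell's concordance index. The patent does not specify which C-index variant was used, so this form is an assumption, and the toy data are made up.

```python
def concordance_index(risk_scores, times, events):
    """Fraction of comparable pairs where the higher risk has the shorter time."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
c_good = concordance_index([2.0, 1.0, 0.5, 0.0], times, events)
c_bad = concordance_index([0.0, 0.5, 1.0, 2.0], times, events)
c_flat = concordance_index([0.0, 0.0, 0.0, 0.0], times, events)
print(c_good, c_bad, c_flat)
```

A value of 0.5 corresponds to uninformative scores and 1.0 to a perfect ranking of patients by risk.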

[0058] For example, patients can be grouped based on the predicted median risk as a threshold, and the prognostic differences between the two groups can be visualized and statistically validated using Kaplan-Meier survival curves.
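The median split and the Kaplan-Meier curve can be sketched in plain Python as follows. The statistical comparison between groups (for example a log-rank test) is left out, and the scores and follow-up data are made up for illustration.

```python
import statistics


def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the median risk score."""
    threshold = statistics.median(risk_scores)
    return ["high" if r > threshold else "low" for r in risk_scores]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


groups = split_by_median([0.1, 0.9, 0.4, 0.7])
curve = kaplan_meier([5, 10, 15, 20], [1, 1, 0, 1])
print(groups)
print(curve)
```

In practice one curve would be estimated per risk group, using only the patients assigned to that group.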

[0059] Based on the same concept, the present invention also provides a cancer survival prediction system, including a data acquisition module, an input module, an aggregation module, and a fusion module.

[0060] The acquisition module is used to acquire three-dimensional MRI image data and clinical data from cancer patients.

[0061] The input module is used to input 3D MRI image data into the global feature extraction model to obtain global spatial features. The global feature extraction model includes a segmentation module, a projection module, and multiple cascaded feature extraction modules. Each feature extraction module includes a downsampling module, a spatial attention module, a channel attention module, and a fusion module. The segmentation module segments the 3D MRI image data into multiple patches, and the projection module projects multiple patches to obtain an input feature map. The downsampling module downsamples the current input feature map. The spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix to obtain a spatial attention map. The channel attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a channel value layer matrix to obtain a channel attention map. The fusion module fuses the spatial attention map and the channel attention map to obtain an output feature map.

[0062] The aggregation module is used to select slices of the largest tumor region in 3D MRI image data from different perspectives and obtain multi-view features based on multiple slices.

[0063] The fusion module is used to fuse global spatial features, multi-view features, and clinical data to obtain multimodal features; the final prediction result is output based on the multimodal features.

[0064] Example: The medical imaging input module receives three-dimensional medical imaging data (such as MRI), and the clinical data input module receives patients' electronic health record (EHR) data.

[0065] MRI images of nasopharyngeal carcinoma patients were acquired, with each patient receiving sequence data in three modalities: T1-weighted imaging (T1), T1-contrast-enhanced imaging (T1c), and T2-weighted imaging (T2).

[0066] The acquired multimodal MRI data underwent standardized preprocessing. First, images from all modalities were rigidly registered to the T1c sequence to ensure spatial alignment. Second, the N4 bias field correction algorithm was applied to correct the intensity non-uniformity caused by magnetic field inhomogeneity. Third, all images were resampled to an isotropic voxel resolution of 1 mm × 1 mm × 1 mm. Finally, grayscale normalization was applied, for example using the Z-score method, so that each image had a mean gray value of 0 and a standard deviation of 1, reducing differences introduced by different scanning devices. In addition, a senior radiologist delineated the tumor region as the volume of interest (VOI) on the T1c images, and this mask was applied to the other modalities.
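
The Z-score step described above can be sketched as follows. This is a minimal illustration on a synthetic volume, not the implementation of the embodiment; the function name `zscore_normalize`, the optional `mask` argument, and the small `eps` term are assumptions introduced here.

```python
import numpy as np

def zscore_normalize(volume, mask=None, eps=1e-8):
    """Rescale gray values to zero mean and unit standard deviation.

    If a boolean mask is given, the statistics are computed over the masked
    voxels only (an assumption; the text does not say which voxels are used).
    """
    volume = np.asarray(volume, dtype=np.float64)
    region = volume[mask] if mask is not None else volume
    mean, std = region.mean(), region.std()
    # eps guards against division by zero for a constant image.
    return (volume - mean) / (std + eps)

rng = np.random.default_rng(0)
t1c = rng.normal(loc=300.0, scale=50.0, size=(8, 8, 8))  # synthetic stand-in volume
normalized = zscore_normalize(t1c)
```

After this step, volumes from scanners with different intensity scales share a common range, which is the purpose stated in the paragraph above.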

[0067] Clinical information corresponding to the patient images is obtained, including demographic characteristics (such as sex and age) and clinicopathological characteristics (such as T stage, N stage, and serum EBV DNA copy number). Categorical variables are numerically coded (e.g., sex: 0 represents female, 1 represents male; T stage: 1-4 represent T1-T4, respectively), while continuous variables (such as age and EBV DNA value) are used directly or after normalization.
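
A minimal sketch of this coding scheme is given below. The sex and T-stage codes follow the paragraph above; the record field names, the N-stage coding (N0-N3 mapped to 0-3), the age range used for scaling, and the log transform of the EBV DNA value are assumptions made only for illustration.

```python
import math

def encode_clinical(record, age_range=(0.0, 100.0)):
    """Turn one patient's clinical record into a list of numbers."""
    sex_code = {"female": 0, "male": 1}[record["sex"]]   # coding from the text
    t_stage = int(record["t_stage"].lstrip("T"))          # T1-T4 -> 1-4
    n_stage = int(record["n_stage"].lstrip("N"))          # assumed: N0-N3 -> 0-3
    lo, hi = age_range
    age = (record["age"] - lo) / (hi - lo)                # assumed min-max scaling
    ebv = math.log10(record["ebv_dna"] + 1.0)             # assumed log scaling
    return [sex_code, t_stage, n_stage, age, ebv]

example = {"sex": "male", "age": 50, "t_stage": "T3", "n_stage": "N2", "ebv_dna": 999.0}
encoded = encode_clinical(example)
```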

[0068] The prepared dataset (including image VOI, clinical data, survival time, and survival status) was randomly divided into a training set, a test set, and a validation set in a ratio of 6:3:1. The division was performed at the patient level, so that no patient appears in more than one set.
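
One way to carry out such a patient-level split is sketched below. The function name `split_patients`, the fixed random seed, and the patient identifiers are hypothetical; only the 6:3:1 ratio and the non-overlap requirement come from the text.

```python
import random

def split_patients(patient_ids, ratios=(6, 3, 1), seed=0):
    """Randomly split unique patient IDs into train, test, and validation sets."""
    ids = sorted(set(patient_ids))           # one entry per patient
    random.Random(seed).shuffle(ids)         # reproducible shuffle
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_test = len(ids) * ratios[1] // total
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]             # remainder goes to validation
    return train, test, val

patient_ids = [f"P{i:03d}" for i in range(100)]  # hypothetical identifiers
train_ids, test_ids, val_ids = split_patients(patient_ids)
```

Splitting on patient identifiers rather than on individual images keeps all data from one patient inside a single set.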

[0069] The feature extraction branch is used to extract global features containing spatial context information from the preprocessed 3D image VOI. The specific workflow is as follows:

[0070] The VOI data of the three modalities T1, T1c, and T2 are stacked along the channel dimension to form a four-dimensional tensor of size [size missing], which is used as input.
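
The stacking operation can be illustrated as below. The VOI shape and the channel-first layout are assumptions chosen for the sketch, since the tensor size is not given in the text, and the arrays are synthetic stand-ins for the preprocessed modalities.

```python
import numpy as np

# Hypothetical VOI shape (depth, height, width); the text does not give the actual size.
voi_shape = (16, 32, 32)
rng = np.random.default_rng(0)

# Synthetic stand-ins for the co-registered, preprocessed VOI of each modality.
t1 = rng.normal(size=voi_shape)
t1c = rng.normal(size=voi_shape)
t2 = rng.normal(size=voi_shape)

# Stack along a new leading channel axis: (channels, depth, height, width).
stacked = np.stack([t1, t1c, t2], axis=0)
```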

[0071] 3D feature extraction: Feature extraction is performed using an encoder structure based on CNN+Transformer.

[0072] Patching and projection: The input four-dimensional tensor is divided into a series of non-overlapping three-dimensional cubes (patches), each of size P×P×P. These patches are then projected to the specified channel dimension to generate the feature map.
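
A minimal sketch of the patching and projection step follows. It assumes a channel-first input whose spatial sizes are divisible by P, and it uses a random matrix in place of the learned projection; the function name `patchify_3d`, the patch size 4, and the 16 output channels are illustrative choices only.

```python
import numpy as np

def patchify_3d(x, p):
    """Split a (C, D, H, W) tensor into non-overlapping p*p*p patches.

    Returns an array of shape (num_patches, C * p**3), one flattened patch per row.
    """
    c, d, h, w = x.shape
    assert d % p == 0 and h % p == 0 and w % p == 0, "sizes must be divisible by p"
    x = x.reshape(c, d // p, p, h // p, p, w // p, p)
    x = x.transpose(1, 3, 5, 0, 2, 4, 6)   # (D/p, H/p, W/p, C, p, p, p)
    return x.reshape(-1, c * p ** 3)

rng = np.random.default_rng(0)
volume = rng.normal(size=(3, 8, 8, 8))      # synthetic 3-modality input
patches = patchify_3d(volume, p=4)          # 2*2*2 = 8 patches of 3*4*4*4 values

# Random stand-in for the learned linear projection to 16 channels.
weight = rng.normal(scale=0.02, size=(patches.shape[1], 16))
tokens = patches @ weight                    # one 16-channel vector per patch
```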

[0073] Staged encoding: The encoder consists of four stages. Each stage first reduces the resolution of the feature map by a factor of two through a non-overlapping convolutional downsampling layer.

[0074] Efficient Paired Attention (EPA) module: After downsampling, the feature map is fed into the Efficient Paired Attention (EPA) module. This module contains two attention sub-modules, one spatial and one channel. Using shared query and key matrices, it encodes information in both the spatial and channel dimensions and learns rich contextual spatial-channel feature representations. In the spatial sub-module, the key and value matrices are projected into a lower-dimensional space, which reduces the computational complexity of the self-attention operation.
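
The paired attention described above can be sketched in a few lines. This is an illustration under stated assumptions, not the module of the embodiment: the weights are random rather than learned, softmax is used as the normalization, the scores are scaled by the square root of the channel count, the two branches are fused by simple addition, and the names `paired_attention` and `proj` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def paired_attention(x, w_q, w_k, w_vs, w_vc, proj):
    """x: (n, c) tokens; w_*: (c, c) weights; proj: (n, p) token projection, p < n."""
    n, c = x.shape
    q, k = x @ w_q, x @ w_k                    # query and key shared by both branches
    v_spatial, v_channel = x @ w_vs, x @ w_vc  # branch-specific value matrices
    # Spatial branch: project keys and values from n tokens down to p tokens.
    k_proj = proj.T @ k                        # (p, c)
    v_proj = proj.T @ v_spatial                # (p, c)
    spatial = softmax(q @ k_proj.T / np.sqrt(c)) @ v_proj   # (n, c), cost O(n*p)
    # Channel branch: attention between channels, a (c, c) map.
    channel = v_channel @ softmax(q.T @ k / np.sqrt(c))     # (n, c)
    return spatial + channel                   # assumed fusion: element-wise sum

rng = np.random.default_rng(0)
n, c, p = 64, 16, 8
x = rng.normal(size=(n, c))
w_q, w_k, w_vs, w_vc = (rng.normal(scale=0.1, size=(c, c)) for _ in range(4))
proj = rng.normal(scale=0.1, size=(n, p))
out = paired_attention(x, w_q, w_k, w_vs, w_vc, proj)
```

With the projection, the spatial attention map has shape (n, p) instead of (n, n), which is where the reduction in computational cost comes from.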

[0075] The multi-view feature extraction branch is used to extract multi-view features from multiple two-dimensional slice planes (such as sagittal, coronal, and axial) of 3D images. The specific workflow is as follows:

[0076] Representative slice selection: In each of the sagittal, coronal, and axial directions, the slice with the largest tumor area according to the tumor region mask is selected as the representative image for that view.
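
Slice selection from a binary tumor mask can be sketched as follows. The function name `largest_tumor_slices` is hypothetical, and which array axis corresponds to the sagittal, coronal, or axial direction depends on the image orientation, so the axes are referred to only by index here.

```python
import numpy as np

def largest_tumor_slices(mask):
    """Return, for each of the three axes, the slice index with the largest tumor area.

    mask: (D, H, W) binary array marking tumor voxels.
    """
    mask = np.asarray(mask).astype(bool)
    areas_axis0 = mask.sum(axis=(1, 2))   # tumor area of every slice along axis 0
    areas_axis1 = mask.sum(axis=(0, 2))
    areas_axis2 = mask.sum(axis=(0, 1))
    return int(areas_axis0.argmax()), int(areas_axis1.argmax()), int(areas_axis2.argmax())

# Small synthetic mask: a 4x4 patch on slice 2 plus two extra voxels on slice 3.
mask = np.zeros((6, 6, 6), dtype=bool)
mask[2, 1:5, 1:5] = True
mask[3, 2:4, 3] = True
indices = largest_tumor_slices(mask)
```

When several slices tie for the largest area, `argmax` returns the first one; the text does not specify a tie-breaking rule.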

[0077] Two-dimensional feature extraction: For each selected two-dimensional slice, a pre-trained two-dimensional convolutional neural network (such as ResNet-50 or EfficientNet) is used as a feature extractor.

[0078] Cross-attention fusion: Each slice image from three perspectives is segmented into patches and embedded into a 768-dimensional vector space, while learning unique positional encodings for images of different sizes to preserve spatial information.

[0079] Cross-attention is performed on the features of these three perspectives (combined in pairs, forming a total of 6 cross paths) to capture the spatial correlation between different perspectives. The computed features are then aggregated using max pooling and concatenated. Finally, the concatenated features are input into a fully connected layer to generate a 1024-dimensional vector as the final multi-view feature.
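
The six cross paths, max pooling, concatenation, and final fully connected layer can be sketched as below. The 768-dimensional tokens and 1024-dimensional output follow the text; everything else is an assumption for illustration: the tokens and weights are random, the learned query, key, and value projections and positional encodings are omitted, and the names `cross_attention` and `fuse_views` are hypothetical.

```python
import itertools
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens):
    """Tokens of one view attend to the tokens of another view."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    return softmax(scores) @ context_tokens

def fuse_views(views, w_out):
    """views: dict of view name -> (num_tokens, dim) array; w_out: (6*dim, out_dim)."""
    pooled = []
    # Ordered pairs of three views give the 6 cross paths.
    for a, b in itertools.permutations(sorted(views), 2):
        attended = cross_attention(views[a], views[b])
        pooled.append(attended.max(axis=0))    # max pooling over tokens
    return np.concatenate(pooled) @ w_out       # fully connected layer (no bias)

rng = np.random.default_rng(0)
dim = 768
views = {                                        # token counts differ per view
    "axial": rng.normal(size=(9, dim)),
    "coronal": rng.normal(size=(6, dim)),
    "sagittal": rng.normal(size=(4, dim)),
}
w_out = rng.normal(scale=0.01, size=(6 * dim, 1024))
multi_view_feature = fuse_views(views, w_out)
```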

[0080] Numerical clinical features are embedded into an embedding space that matches the dimensions of the global spatial features and multi-view features. The specific workflow is as follows:

[0081] Feature embedding: Numerically encoded clinical features (e.g., one-hot encoded) are input into a multilayer perceptron (MLP). This MLP consists of several fully connected layers and activation functions, and projects the raw clinical data into a low-dimensional, dense embedding space. The dimension of the output vector matches that of the image feature vector for subsequent fusion.
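
A minimal sketch of such an embedding network is shown below. The two-layer structure, the ReLU activation, the hidden size of 64, and the random weights are assumptions; the output size of 1024 is borrowed from the multi-view feature dimension mentioned earlier and may differ in practice.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random stand-ins for the learned weights of a two-layer MLP."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def embed_clinical(features, params):
    """Fully connected layer, activation, then a second fully connected layer."""
    hidden = relu(features @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]

clinical = np.array([1.0, 3.0, 2.0, 0.5, 3.0])   # hypothetical encoded record
params = init_mlp(in_dim=5, hidden_dim=64, out_dim=1024)
clinical_embedding = embed_clinical(clinical, params)
```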

[0082] The multimodal feature fusion module is used to fuse features from the 3D feature extraction branch, the multi-view feature extraction branch, and the clinical data input module. The specific workflow is as follows:

[0083] Feature concatenation: The three-dimensional global features output by S21, the multi-view features output by S22, and the clinical information embedded features output by S23 are concatenated to form a unified multimodal feature vector.

[0084] The survival risk prediction module outputs the final survival risk prediction result based on the fused multimodal features. The specific workflow is as follows:

[0085] Risk prediction and assessment: The multimodal feature vector concatenated in S31 is input into the Cox proportional hazards (CoxPH) model. The model outputs a final survival risk prediction score, and patients can be stratified by risk (e.g., high risk, low risk) based on this score. The predictive performance of the model is evaluated using indicators such as ROC curves, the C-index, and Kaplan-Meier survival analysis.
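
Two of the quantities mentioned above can be written out directly. The sketch below computes Harrell's C-index and the negative log partial likelihood that a Cox proportional hazards model minimizes, using a simple formulation without special handling of tied event times. The function names and the toy data are illustrative assumptions, and the risk scores are given by hand rather than produced by a fitted model.

```python
import math

def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose risk ordering matches survival."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue                        # a pair needs an observed event first
        for j in range(len(times)):
            if times[i] < times[j]:         # patient i failed before patient j
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def cox_neg_log_partial_likelihood(times, events, risks):
    """Cox partial likelihood loss for log-risk scores (ties not treated specially)."""
    loss = 0.0
    for i in range(len(times)):
        if events[i]:
            at_risk = [risks[j] for j in range(len(times)) if times[j] >= times[i]]
            loss -= risks[i] - math.log(sum(math.exp(r) for r in at_risk))
    return loss

times = [2.0, 4.0, 6.0, 8.0]     # toy survival times
events = [1, 1, 0, 1]            # 1 = event observed, 0 = censored
risks = [2.0, 1.5, 1.0, 0.2]     # higher score = higher predicted risk
c_index = concordance_index(times, events, risks)
loss = cox_neg_log_partial_likelihood(times, events, risks)
```

A C-index of 1 means every comparable pair is ordered correctly, and 0.5 corresponds to random ordering.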

[0086] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.

[0087] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

Claims

1. A method for predicting cancer survival, characterized in that it includes the following steps: collecting three-dimensional MRI image data and clinical data from cancer patients; inputting the three-dimensional MRI image data into a global feature extraction model to obtain global spatial features, wherein the global feature extraction model includes a segmentation module, a projection module, and multiple cascaded feature extraction modules; each feature extraction module includes a downsampling module, a spatial attention module, a channel attention module, and a fusion module; the segmentation module segments the three-dimensional MRI image data into multiple patches, and the projection module projects these patches to obtain an input feature map; the downsampling module downsamples the current input feature map; the spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix to obtain a spatial attention map; the channel attention module extracts features from the downsampled input feature map using the shared query matrix, the shared key matrix, and a channel value layer matrix to obtain a channel attention map; and the fusion module fuses the spatial attention map and the channel attention map to obtain an output feature map; selecting, from different perspectives, the slices with the largest tumor region in the three-dimensional MRI image data, and obtaining multi-view features based on the multiple slices; fusing the global spatial features, the multi-view features, and the clinical data to obtain multimodal features; and outputting the final prediction result based on the multimodal features.

2. The cancer survival prediction method as described in claim 1, characterized in that the spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix, specifically including the following steps: obtaining the shared query matrix, the shared key matrix, and the spatial value layer matrix based on different projection weights; projecting the shared key matrix and the spatial value layer matrix; obtaining the similarity by multiplying the shared query matrix by the transpose of the projected shared key matrix and then normalizing the result; and multiplying the similarity by the projected spatial value layer matrix to obtain the spatial attention map.

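3. The cancer survival prediction method as described in claim 2, characterized in that the spatial attention map is given by:

$$\hat{X}_{s} = \mathrm{Softmax}\left(\frac{Q_{\mathrm{shared}}\,\tilde{K}_{\mathrm{shared}}^{T}}{\sqrt{d}}\right)\tilde{V}_{\mathrm{spatial}}$$

In the formula, $\hat{X}_{s}$ is the spatial attention map; $Q_{\mathrm{shared}}$, $\tilde{K}_{\mathrm{shared}}$, and $\tilde{V}_{\mathrm{spatial}}$ represent the shared query matrix, the projected shared key matrix, and the projected spatial value layer matrix, respectively; $d$ is the dimension of each vector; and $T$ denotes the transpose.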

4. The cancer survival prediction method as described in claim 1, characterized in that the channel attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a channel value layer matrix, including the following steps: computing the dot product of the transpose of the shared query matrix and the shared key matrix, and normalizing the result; and multiplying the normalized result by the channel value layer matrix to obtain the channel attention map.

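5. The cancer survival prediction method as described in claim 4, characterized in that the channel attention map is given by:

$$\hat{X}_{c} = V_{\mathrm{channel}} \cdot \mathrm{Softmax}\left(\frac{Q_{\mathrm{shared}}^{T}\,K_{\mathrm{shared}}}{\sqrt{d}}\right)$$

In the formula, $\hat{X}_{c}$ is the channel attention map; $V_{\mathrm{channel}}$, $Q_{\mathrm{shared}}$, and $K_{\mathrm{shared}}$ represent the channel value layer matrix, the shared query matrix, and the shared key matrix, respectively; $d$ is the dimension of each vector; and $T$ denotes the transpose.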

6. The cancer survival prediction method as described in claim 1, characterized in that the multiple perspectives include the sagittal, coronal, and axial directions, and obtaining the multi-view features based on the multiple slices specifically includes the following steps: performing feature extraction on the multiple slices to obtain multiple two-dimensional features; and aggregating the multiple two-dimensional features using a cross-attention mechanism to obtain the multi-view features.

7. The cancer survival prediction method as described in claim 1, characterized in that the clinical data is embedded as clinical features into a vector space of the same dimension as the global spatial features and the multi-view features, and then fused with the global spatial features and the multi-view features.

8. The cancer survival prediction method as described in claim 1, characterized in that the multimodal features are input into a Cox proportional hazards model to obtain the final prediction result.

9. The cancer survival prediction method as described in claim 1, characterized in that the acquisition of three-dimensional MRI image data from cancer patients specifically includes the following steps: acquiring multimodal MRI images of cancer patients, including T1, T1c, and T2; rigidly registering the images of all modalities to the T1c sequence; performing N4 bias field correction, resampling, and grayscale normalization on the registered multimodal MRI images; and stacking the volume of interest data of the standardized multimodal MRI images to obtain the three-dimensional MRI image data.

10. A cancer survival prediction system, characterized in that it includes: an acquisition module, used to acquire three-dimensional MRI image data and clinical data from cancer patients; an input module, used to input the three-dimensional MRI image data into a global feature extraction model to obtain global spatial features, wherein the global feature extraction model includes a segmentation module, a projection module, and multiple cascaded feature extraction modules; each feature extraction module includes a downsampling module, a spatial attention module, a channel attention module, and a fusion module; the segmentation module segments the three-dimensional MRI image data into multiple patches, and the projection module projects these patches to obtain an input feature map; the downsampling module downsamples the current input feature map; the spatial attention module extracts features from the downsampled input feature map using a shared query matrix, a shared key matrix, and a spatial value layer matrix to obtain a spatial attention map; the channel attention module extracts features from the downsampled input feature map using the shared query matrix, the shared key matrix, and a channel value layer matrix to obtain a channel attention map; and the fusion module fuses the spatial attention map and the channel attention map to obtain an output feature map; an aggregation module, used to select, from different perspectives, the slices with the largest tumor region in the three-dimensional MRI image data and to obtain multi-view features based on the multiple slices; and a fusion module, used to fuse the global spatial features, the multi-view features, and the clinical data to obtain multimodal features, the final prediction result being output based on the multimodal features.