Multimodal survival prediction method and device based on a Mamba model with wavelet attention
By constructing a multimodal dataset and introducing a wavelet attention mechanism, combined with the state space module of the Mamba model, the problems of fusion timing lag and state evolution fragmentation in cross-modal interaction mechanisms are solved, achieving higher accuracy and robustness in multimodal survival prediction.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: XI AN JIAOTONG UNIV
- Filing Date: 2026-03-09
- Publication Date: 2026-06-19
AI Technical Summary
In existing multimodal survival prediction methods based on the Mamba model, the cross-modal interaction mechanism suffers from fusion timing lag and state evolution fragmentation, which limits the prediction accuracy and robustness.
By constructing a multimodal dataset and introducing a wavelet attention mechanism, combined with the state space module of the Mamba model, real-time interaction of cross-modal information during state evolution is achieved. Wavelet attention is used to perform multi-scale decomposition and feature update of cross-modal features, and a fusion state space model is constructed.
It significantly improves the accuracy and robustness of multimodal survival prediction, fully leverages the modeling advantages of the Mamba model for long-term dependence, and provides a more reliable basis for clinical decision-making.
[Figure CN122245750A_ABST: abstract drawing]
Description
Technical Field
[0001] This application relates to the field of medical multimodal data fusion technology, and in particular to a multimodal survival prediction method and apparatus that integrates wavelet attention with a Mamba model.
Background Technology
[0002] The core objective of survival prediction is to quantify the time frame from diagnosis or treatment to critical clinical events such as death or tumor recurrence, thereby enabling patient risk stratification and providing a scientific basis for the development and adjustment of individualized treatment strategies. Cancer, with its high incidence and mortality rates, poses a significant public health challenge. Therefore, accurately predicting a patient's survival probability at a specific time point is crucial for optimizing clinical decisions (such as treatment selection), rationally allocating medical resources (such as prioritizing intensive care), and improving patient outcomes.
[0003] In recent years, multimodal data fusion has become an important direction for improving cancer survival prediction. In particular, the joint analysis of histopathological images (which provide information on the spatial morphological heterogeneity of tumors) and genomic maps (which reveal molecular variations and the activity of key signaling pathways) has shown significant potential: it captures both the macroscopic morphology and the microscopic molecular features of tumors, providing a more comprehensive biological basis for survival prediction. The Mamba model, which is good at handling long-range dependencies, has performed well in multiple fields (such as natural language processing and computer vision), and it efficiently captures the dynamic evolution patterns in sequences through state transition equations. Inspired by this, researchers have introduced Mamba into multimodal survival prediction tasks. Typical implementations include combining Mamba with cross-attention fusion (for example, encoding the features of the two modalities with Mamba and then fusing them with cross-attention), or designing an independent Mamba branch for each modality (for example, processing image and genomic features separately with Mamba and then fusing them at the output layer).
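As an illustration of the "state transition equations" mentioned above, the following is a minimal NumPy sketch of a selective state-space recurrence in the general style of Mamba. It is not taken from this application: the function name, the parameter names (`W_B`, `W_C`, `W_delta`), the diagonal state matrix, and the softplus step size are all assumptions chosen to keep the example small.

```python
import numpy as np

def selective_ssm_scan(x, A, W_B, W_C, W_delta):
    """Toy selective state-space recurrence over a sequence.

    x: (T, D) input sequence; A: (N,) diagonal state matrix (negative entries).
    W_B, W_C: (D, N) projections; W_delta: (D,) step-size projection.
    All names here are illustrative assumptions.
    """
    T, D = x.shape
    N = A.shape[0]
    h = np.zeros((D, N))               # one N-dimensional hidden state per channel
    y = np.zeros((T, D))
    for t in range(T):
        # Input-dependent ("selective") parameters.
        delta = np.log1p(np.exp(x[t] * W_delta))     # softplus step size, shape (D,)
        B_t = x[t] @ W_B                             # shape (N,)
        C_t = x[t] @ W_C                             # shape (N,)
        # Discretization: A_bar = exp(delta * A); B_bar uses the simplified delta * B form.
        A_bar = np.exp(delta[:, None] * A[None, :])  # shape (D, N)
        B_bar = delta[:, None] * B_t[None, :]        # shape (D, N)
        # Recursive hidden-state update, then readout.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C_t
    return y
```

The hidden state `h` is what carries long-range information from one step to the next; in an "extract first, then fuse" design, each modality would run such a recurrence on its own, with no access to the other modality's state.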
[0004] Although multimodal survival prediction methods based on the Mamba model have made some progress, their cross-modal interaction mechanisms still have significant limitations. Existing methods generally adopt an "extract first, then fuse" strategy: single-modal features are first extracted independently by the SSM (state space model) module of the Mamba model, and multimodal fusion is then performed outside the SSM module. The core problem with this strategy is that it ignores the state evolution process within the SSM module, which is precisely the core advantage of the Mamba model over traditional models (such as LSTM and Transformer): implicit modeling of long-range dependencies through recursive updates of hidden states. Specifically, this manifests in two ways. First, the fusion timing is delayed: multimodal information interacts only after single-modal feature extraction is complete, so cross-modal associations cannot be used dynamically during early state evolution, and prognostic cues contained in early interactions may be lost. Second, state evolution is fragmented: the state of a single-modal SSM is updated only from the historical information of its own modality, without explicitly considering the synchronous influence of the other modality. This isolates the state evolution processes of the different modalities, makes it difficult to exploit the complementary value of multimodal data, and ultimately limits the prediction accuracy and robustness of the Mamba model.
Summary of the Invention
[0005] This application provides a multimodal survival prediction method and apparatus that integrates wavelet attention into a Mamba model. This solves the problems of delayed fusion timing and fragmented state evolution in existing Mamba model-based multimodal survival prediction methods, which in turn limit the prediction accuracy and robustness of the Mamba model.
[0006] In a first aspect, embodiments of this application provide a multimodal survival prediction method incorporating a Mamba model with wavelet attention, comprising: standardizing histopathological images and genomic maps of multiple patients to obtain structured pathological fully connected maps and gene embeddings to construct a multimodal dataset; incorporating wavelet attention into a Mamba model to construct a multimodal survival prediction model, and training the multimodal survival prediction model using the multimodal dataset; wherein the multimodal survival prediction model includes a representation learning module, a data fusion module, a survival prediction module, and a loss function module; the representation learning module is used to perform deep feature extraction on the pathological fully connected maps and gene embeddings to mine key features therein, obtaining pathological image representations and genomic data representations; the data fusion module is used to mine the feature update rules and cross-modal complementary associations of the pathological image representation and the genomic data representation through the wavelet attention mechanism and the selective mechanism of the Mamba model to obtain the fused representation; the survival prediction module is used to predict the survival risk from the fused representation through a fully connected neural network and output the survival probability of the corresponding patient at different time points; the loss function module is used to calculate the loss between the predicted survival probability and the clinical data to optimize the model parameters of the multimodal survival prediction model; and preprocessing the histopathological image and genomic map of the patient to be predicted into a pathological fully connected map and gene embedding, and inputting them into the trained multimodal survival prediction model to obtain the survival probability of the patient to be predicted at different time points.
[0007] In conjunction with the first aspect, in one possible implementation, the standardization of histopathological images and genomic maps from multiple patients to obtain a structured pathological fully connected map and gene embeddings includes: cropping each histopathological image in the multimodal dataset into multiple pixel blocks according to a set size; extracting the first-dimensional feature embeddings of the multiple pixel blocks corresponding to the histopathological images; constructing an initial fully connected map based on the feature embeddings of the pixel blocks in the histopathological images as nodes; calculating topological relationships based on the spatial location and feature similarity of the nodes, and sorting the nodes in the initial fully connected map based on the topological relationships to obtain the pathological fully connected map; dividing the genomic map into multiple gene sequences according to biological function; mapping each type of gene sequence to an initial gene embedding in the first dimension through linear transformation; and normalizing each of the initial gene embeddings in the first dimension to obtain the gene embeddings.
[0008] In conjunction with the first aspect, in one possible implementation, the representation learning module includes a pathological feature extraction unit and a gene feature extraction unit; the pathological feature extraction unit is used to capture the topological relationships between pixel blocks in the pathological fully connected graph and learn its global context information in both forward and reverse directions to obtain a pathological image representation; the gene feature extraction unit is used to perform representation learning on the gene embedding to mine key information therein and obtain the genomic data representation.
[0009] In conjunction with the first aspect, in one possible implementation, the data fusion module includes a parameter update unit, a cross-modal fusion unit, and a state space fusion unit; the parameter update unit is used to: extract pathological update parameters and genomic update parameters within their respective modalities from the pathological image representation and the genomic data representation using the selective mechanism of the Mamba model; the cross-modal fusion unit is used to: convert the pathological update parameters / genomic update parameters into high-frequency and low-frequency components using a wavelet attention mechanism, and calculate from them the gene cross-modal parameters / pathological cross-modal parameters for the genomic update parameters / pathological update parameters; and use learnable weights to perform weighted fusion of the pathological update parameters with the pathological cross-modal parameters, and of the genomic update parameters with the gene cross-modal parameters, to obtain pathological cross-modal update parameters and genomic cross-modal update parameters; the state space fusion unit is used to: construct a pathological image fusion state space model and a genomic fusion state space model; apply the pathological cross-modal update parameters / genomic cross-modal update parameters to the pathological image fusion state space model / genomic fusion state space model to perform real-time fusion at the state space level to obtain pathological fusion features and gene fusion features; and concatenate the pathological fusion features and the gene fusion features to obtain the fusion representation.
[0010] In conjunction with the first aspect, in one possible implementation, the survival prediction module includes a first fully connected layer and a second fully connected layer; the first fully connected layer is used to map the high-dimensional fusion representation into low-dimensional features; the second fully connected layer is used to convert the low-dimensional features into a probability distribution through a Softmax function, and then obtain the survival function value through cumulative multiplication to represent the survival probability at different time points.
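The two-layer survival head described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the ReLU activation, the weight names, and the reading of "cumulative multiplication" as a running product of (1 - p_k) over the Softmax outputs are all assumptions.

```python
import numpy as np

def survival_head(fused, W1, b1, W2, b2):
    """Map a fused representation to survival probabilities per time bin."""
    # First fully connected layer: high-dimensional -> low-dimensional features.
    hidden = np.maximum(fused @ W1 + b1, 0.0)     # ReLU is an assumption
    # Second fully connected layer: one logit per discrete time bin.
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())           # numerically stable Softmax
    probs = exp / exp.sum()
    # Assumed reading of "cumulative multiplication": S(t_k) = prod_{j<=k} (1 - p_j).
    survival = np.cumprod(1.0 - probs)
    return probs, survival
```

Under this reading the survival curve is non-increasing by construction, which matches the expected behaviour of a survival function evaluated at successive time points.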
[0011] In conjunction with the first aspect, in one possible implementation, the loss function module includes a loss calculation unit and a parameter optimization unit; the loss calculation unit is used to calculate the loss between the survival probability and clinical data based on the survival probability received from the survival prediction module using a loss function; the parameter optimization unit is used to use the Adam optimization algorithm to iteratively update the module parameters of the multimodal survival prediction model through backpropagation to minimize the loss between the survival probability and clinical data.
[0012] In conjunction with the first aspect, in one possible implementation, the loss function is as follows: [formula missing]. In the formula, the loss is calculated between the predicted survival probability and the clinical data; N represents the total number of samples in the multimodal dataset; the censoring indicator variable of the i-th sample takes one value when the sample is uncensored and another when it is censored; the remaining terms are the predicted value of the survival function for the i-th sample, three predicted probabilities that the i-th sample survives up to given time points (the time arguments are missing), and the observed survival time of the i-th sample.
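Because the formula itself is not legible in this text, the following is only a sketch of one widely used discrete-time censored negative log-likelihood, offered to make the roles of the censoring indicator and the survival function concrete. It should not be read as the loss claimed here; the function names, the 0/1 censoring convention, and the use of per-bin hazards are assumptions.

```python
import math

def nll_survival_loss(hazards, event_bin, censored, eps=1e-7):
    """Discrete-time censored negative log-likelihood for one sample.

    hazards: per-bin conditional event probabilities h_1..h_K in (0, 1).
    event_bin: 0-based index of the bin containing the observed time.
    censored: 1 if the sample is censored, 0 if the event was observed
              (this convention is an assumption).
    """
    # surv[k] is the probability of surviving the first k bins; surv[0] = 1.
    surv = [1.0]
    for h in hazards:
        surv.append(surv[-1] * (1.0 - h))
    s_prev = surv[event_bin]         # survived up to the start of the bin
    s_curr = surv[event_bin + 1]     # survived through the end of the bin
    if censored:
        # Censored: only survival through the observed bin is known.
        return -math.log(s_curr + eps)
    # Uncensored: survived up to the bin, then the event occurred within it.
    return -(math.log(s_prev + eps) + math.log(hazards[event_bin] + eps))

def batch_loss(samples):
    """Average the per-sample loss over (hazards, event_bin, censored) tuples."""
    return sum(nll_survival_loss(*s) for s in samples) / len(samples)
```

Averaging over the N samples, as in `batch_loss`, is one common normalization; a sum would serve equally well for optimization with a suitably scaled learning rate.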
[0013] In a second aspect, embodiments of this application provide a multimodal survival prediction device that integrates a wavelet attention-based Mamba model, comprising: a dataset construction module for standardizing histopathological images and genomic maps of multiple patients to obtain structured pathological fully connected maps and gene embeddings, thereby constructing a multimodal dataset; a model construction and training module for integrating wavelet attention into a Mamba model to construct a multimodal survival prediction model, and training the multimodal survival prediction model using the multimodal dataset; wherein the multimodal survival prediction model includes a representation learning module, a data fusion module, a survival prediction module, and a loss function module; the representation learning module is used to perform deep feature extraction on the pathological fully connected map and the gene embeddings to mine key features and obtain a pathological image representation and a genomic data representation; the data fusion module is used to mine the feature update patterns and cross-modal complementary associations of the pathological image representation and the genomic data representation through the wavelet attention mechanism and the selective mechanism of the Mamba model to obtain a fused representation; the survival prediction module is used to predict the survival risk from the fused representation through a fully connected neural network and output the corresponding patient's survival probability at different time points; the loss function module is used to calculate the loss between the predicted survival probability and clinical data to optimize the model parameters of the multimodal survival prediction model; and a prediction module for preprocessing the histopathological image and genomic map of the patient to be predicted into a pathological fully connected map and gene embeddings, and inputting them into the trained multimodal survival prediction model to obtain the patient's survival probability at different time points.
[0014] In a third aspect, embodiments of this application provide an apparatus comprising: a processor; and a memory for storing processor-executable instructions; wherein, when the processor executes the executable instructions, it implements the method as described in the first aspect or any possible implementation of the first aspect.
[0015] In a fourth aspect, embodiments of this application provide a non-volatile computer-readable storage medium that stores a computer program or instructions which, when executed, cause the method described in the first aspect or any possible implementation of the first aspect to be implemented.
[0016] One or more technical solutions provided in the embodiments of this application have at least the following technical effects or advantages: This application's embodiments, by constructing a multimodal dataset, can accurately capture the spatial morphological heterogeneity of histopathological images and the molecular variation features of genomic maps, laying a high-quality data foundation for the early dynamic mining of subsequent cross-modal associations. By coupling the wavelet attention mechanism with the state space module of the Mamba model, real-time interaction of cross-modal information during the state evolution process within the SSM module is realized, effectively compensating for the shortcomings of the traditional "extract first, then fuse" strategy. This solves the problems of delayed fusion timing and fragmented state evolution in existing Mamba model-based multimodal survival prediction methods, which limit the prediction accuracy and robustness of the Mamba model. The method in this application not only fully leverages the modeling advantages of the Mamba model for long-range dependencies but also utilizes the multi-scale decomposition capability of wavelet attention for cross-modal features, enhancing the efficiency of mining complementary information between different modalities. Ultimately, it significantly improves the accuracy and robustness of multimodal survival prediction, providing more reliable quantitative evidence for clinical decision-making.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a flowchart of the multimodal survival prediction method for the Mamba model incorporating wavelet attention, provided in an embodiment of this application;
Figure 2 is a schematic diagram of the structure of the multimodal survival prediction device for the Mamba model fused with wavelet attention, provided in an embodiment of this application;
Figure 3 is a schematic diagram of the structure of the fused state-space model (FSSM) provided in an embodiment of this application;
Figure 4 is a schematic diagram of the wavelet cross-attention (WCA) provided in an embodiment of this application;
Figure 5 is an example diagram comparing the effects of different cross-attention methods with the method of this application, provided in an embodiment of this application.
Detailed Implementation
[0019] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0020] The following description of some technologies involved in the embodiments of this application is provided to aid understanding and should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. Similarly, for clarity and brevity, some descriptions of well-known functions and structures are omitted in the following description.
[0021] The problems of delayed fusion timing and fragmented state evolution in existing Mamba model multimodal survival prediction methods are essentially a waste of the core advantage of the Mamba model (capturing long-range dependencies through the internal state evolution of the SSM module). This directly leads to the failure to fully realize the synergistic effect of multimodal information, ultimately causing a series of problems that affect the performance and clinical value of the Mamba model.
[0022] Specifically, the delayed fusion timing (extraction before fusion) means that cross-modal interaction occurs only after single-modal features have been extracted. Early association clues in multimodal data (such as the direct correspondence between a certain cell morphology in a histopathological image and the corresponding driver mutation in a genomic map) cannot be dynamically captured in the early stages of the SSM module's state evolution. Single-modal SSM states are updated only from their own historical information (for example, histopathological images only consider spatial sequences, and genomic maps only consider variant sequences), without actively absorbing synchronous information from the other modality. As a result, cross-modal complementarity is not integrated during state evolution. Late fusion (such as concatenation or addition) is merely a simple stacking of features rather than semantic collaboration, so the features from the two modalities lack semantic association and, after fusion, remain two independent prognostic signals. Furthermore, the Mamba model cannot distinguish between strong single-modal signals and strong bimodal collaborative signals, which reduces accuracy in identifying high-risk patients.
[0023] The fragmented state evolution negates the core advantage of the Mamba model (efficiently capturing dynamic dependencies in long sequences through recursive updates of the hidden states in the SSM module). Furthermore, the hidden states of a single modality only encode historical information from that modality, failing to incorporate the synchronous states of another modality (e.g., pathway activation states in genome maps cannot affect the spatial heterogeneity of histopathological images). This results in the inability to uniformly model long-range dependencies across modalities. Because single-modal states are not calibrated through another modality, noise is easily amplified, making the Mamba model more sensitive to intramodal noise (e.g., staining differences in histopathological images and sequencing errors in genome maps), thus leading to poor robustness.
[0024] Figure 1 is a flowchart of the multimodal survival prediction method for the Mamba model incorporating wavelet attention, provided in an embodiment of this application, including steps 101 to 103. Figure 1 shows merely one execution order in the embodiments of this application and does not represent the only execution order of the multimodal survival prediction method for the Mamba model fused with wavelet attention. Provided that the final result can be achieved, the steps shown in Figure 1 can be performed in parallel or in reverse order.
[0025] Step 101: Standardize the histopathological images and genomic maps of multiple patients to obtain structured pathological fully connected graphs and gene embeddings to construct a multimodal dataset. In this embodiment, each histopathological image in the multimodal dataset is cropped into multiple pixel blocks according to a set size; the first-dimensional feature embeddings of the multiple pixel blocks corresponding to the histopathological images are extracted; using the pixel blocks in the histopathological images as nodes, an initial fully connected graph is constructed based on their feature embeddings; topological relationships are calculated based on the spatial location and feature similarity of the nodes, and the nodes in the initial fully connected graph are sorted based on the topological relationships to obtain the pathological fully connected graph; the genomic map is divided into multiple gene sequences according to biological function; each type of gene sequence is mapped to the initial gene embeddings of the first dimension through linear transformation; the initial gene embeddings of the first dimension are normalized to obtain the gene embeddings.
[0026] Specifically, from the official website of TCGA (The Cancer Genome Atlas, an authoritative platform jointly maintained by the National Cancer Institute and the National Human Genome Research Institute), five high-incidence cancer datasets with complete data were selected: LUAD (lung adenocarcinoma), BLCA (bladder urothelial carcinoma), BRCA (breast cancer), GBMLGG (glioblastoma and low-grade glioma), and UCEC (endometrial cancer). The selection criterion was that a dataset must simultaneously contain three core types of information: histopathological images, genomic maps, and clinical data. Histopathological images, genomic maps, and clinical data for multiple patients were then downloaded. Here, histopathological images refer to images stained with hematoxylin and eosin (H&E), a staining method that clearly presents key pathological features such as cell morphology and tissue structure, and is the standard staining protocol for pathological diagnosis. Genomic maps include molecular-level data such as RNA sequencing, copy number variations, single nucleotide variations, and DNA methylation, covering core gene features related to cancer development and progression. Clinical data include clinical indicators directly related to survival prediction, such as gender, age, survival time, censoring status (whether an endpoint event such as death was observed), and cancer grade (pathological grading). The one-to-one correspondence between histopathological images, genomic maps, and clinical data is then confirmed through each patient's unique identifier to avoid data association errors.
[0027] Furthermore, the histopathological images are standardized using the CLAM model. High-resolution histopathological images (typically thousands × thousands of pixels) are uniformly cropped into multiple pixel blocks of 512 × 512 pixels (i.e., the set size). The number of pixel blocks is determined by the original resolution of the histopathological image (e.g., a 2048 × 2048 pixel histopathological image is cropped into 16 pixel blocks). Using the feature extractor built into the CLAM model (based on a deep learning network), 1024-dimensional (i.e., the first dimension) feature embeddings are extracted from each pixel block, thus transforming pixel-level image information into high-dimensional semantic vectors while preserving the core features of the pathological regions. All pixel blocks of a single histopathological image (with their extracted 1024-dimensional feature embeddings) are treated as nodes in a graph network. An initial fully connected graph is constructed, with each node establishing edge connections with all other nodes to capture the spatial and semantic relationships between different regions of the histopathological image. The topological relationship is calculated from the spatial coordinates and feature similarity of the nodes (pixel blocks), and the nodes in the initial fully connected graph are ordered accordingly, ensuring that the subsequent GCN (graph convolutional network) can aggregate node features in a reasonable order and preserve the spatial structure information of the histopathological image.
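The cropping and graph construction steps can be sketched as below. This is a simplified illustration rather than the CLAM pipeline itself: the way spatial proximity and feature similarity are combined (`alpha`), the inverse-distance proximity, and the ordering of nodes by total edge weight are all assumptions, since the text does not specify how the topological relationship is computed.

```python
import numpy as np

def crop_patches(image, size=512):
    """Split an (H, W, C) image into non-overlapping size x size pixel blocks."""
    H, W = image.shape[:2]
    patches, coords = [], []
    for r in range(0, H - size + 1, size):
        for c in range(0, W - size + 1, size):
            patches.append(image[r:r + size, c:c + size])
            coords.append((r // size, c // size))    # grid position of the block
    return patches, np.array(coords, dtype=float)

def order_nodes(features, coords, alpha=0.5):
    """Build dense edge weights for a fully connected graph and order its nodes.

    Edge weight = alpha * cosine similarity + (1 - alpha) * spatial proximity.
    Both the weighting and the ordering rule are illustrative assumptions.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos_sim = unit @ unit.T
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    proximity = 1.0 / (1.0 + dist)
    weights = alpha * cos_sim + (1.0 - alpha) * proximity
    np.fill_diagonal(weights, 0.0)                   # no self-loops
    # Nodes with the largest total edge weight come first.
    order = np.argsort(-weights.sum(axis=1), kind="stable")
    return weights, order
```

For a 2048 × 2048 image, `crop_patches` yields a 4 × 4 grid, i.e. the 16 pixel blocks given in the example above; in practice the per-block features would come from the feature extractor rather than from raw pixels.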
[0028] The genomic map is standardized and divided into six core gene sequences according to biological function: tumor suppressor genes, tumorigenesis genes, protein kinase genes, cell differentiation genes, transcription genes, and cytokines and growth genes. This grouping is based on key signaling pathways in cancer development and highlights core gene features related to cancer prognosis. The gene sequences of each category are input into a fully connected layer, and a linear transformation is used to uniformly map gene features of different dimensions to 1024-dimensional initial gene embeddings. The first dimension here is consistent with the embedding dimension of the pixel blocks in the histopathological image, eliminating dimensional barriers for subsequent multimodal fusion. The mapped 1024-dimensional initial gene embeddings are then Z-score standardized to eliminate scale differences between the initial gene embeddings and prevent any single initial gene embedding from dominating model learning because of an excessively large numerical scale.
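A minimal sketch of the projection and normalization step follows. It assumes one linear layer per functional group and a Z-score computed within each projected embedding; the function and argument names are hypothetical, and the axis over which the statistics are taken is an assumption, since the text does not state it.

```python
import numpy as np

def embed_gene_groups(groups, weights, biases, eps=1e-8):
    """Project variable-length gene groups to a shared dimension, then z-score each.

    groups: list of 1-D arrays, one per functional category (lengths may differ).
    weights, biases: one (d_k, dim) matrix and one (dim,) vector per category.
    """
    embeddings = []
    for x, W, b in zip(groups, weights, biases):
        e = x @ W + b                              # linear map: (d_k,) -> (dim,)
        e = (e - e.mean()) / (e.std() + eps)       # z-score within the embedding
        embeddings.append(e)
    return np.stack(embeddings)                    # shape (num_groups, dim)
```

With six functional categories and `dim = 1024`, the result is a 6 × 1024 array whose rows share the embedding dimension of the pixel-block features.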
[0029] Step 102: Integrate wavelet attention into the Mamba model to construct a multimodal survival prediction model, and train the multimodal survival prediction model using a multimodal dataset. The multimodal survival prediction model includes a representation learning module, a data fusion module, a survival prediction module, and a loss function module. The representation learning module performs deep feature extraction on the pathological fully connected graph and gene embeddings to uncover key features, obtaining pathological image representations and genomic data representations. The data fusion module uses wavelet attention and the selective mechanism of the Mamba model to mine the feature update patterns and cross-modal complementary associations of the pathological image representations and genomic data representations, obtaining a fused representation. The survival prediction module uses a fully connected neural network to predict the survival risk of the fused representation, outputting the survival probability of the corresponding patient at different time points. The loss function module calculates the loss between the predicted survival probability and the clinical data to optimize the model parameters of the multimodal survival prediction model.
[0030] In this embodiment, the representation learning module includes a pathological feature extraction unit and a gene feature extraction unit. The pathological feature extraction unit is used to capture the topological relationships between pixel blocks in the pathological fully connected graph and learn its global context information in both forward and reverse directions to obtain a pathological image representation. The gene feature extraction unit is used to perform representation learning on gene embeddings to mine key information and obtain a genomic data representation.
[0031] Specifically, the representation learning module first inputs the pathological fully connected map into the GCN and then into the bidirectional Mamba network for representation learning. At the same time, it inputs the gene embedding into the bidirectional Mamba network for representation learning. In this way it learns from each patient's histopathological image and genomic map (in practice, the pathological fully connected map and gene embedding) and extracts their key information, namely the pathological image representation and the genomic data representation.
[0032] Furthermore, the representation learning module is represented as: $F_p = \mathrm{BiMamba}(\mathrm{GCN}(X_p))$, $F_g = \mathrm{BiMamba}(X_g)$.
[0033] Here, $F_p$ and $F_g$ represent the pathological image representation and the genomic data representation learned from the pathological fully connected graph and gene embeddings of each patient, each with a feature dimension of 1024. $X_p$ and $X_g$ represent the pathological fully connected graph and the gene embeddings. GCN stands for Graph Convolutional Neural Network. For example, a GCN consists of one input graph convolutional layer, one intermediate graph convolutional layer, and one output graph convolutional layer. It aggregates the features of each node (pixel block) with the features of its neighboring nodes through graph convolution operations, capturing the local correlations and global structural information of different regions in the histopathological image, and outputting an intermediate representation that enhances structural features. BiMamba represents a bidirectional Mamba model. Based on the classic Mamba model, it adapts to the processing requirements of reverse sequences by transposing convolutional weights and state-space model parameters, creating two independent CUDA streams (a set of sequentially executed CUDA operation commands) to process forward and reverse sequence data. It can thus capture global contextual information from both directions simultaneously, ultimately outputting a 1024-dimensional pathological image representation.
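The pathology branch described above (three graph convolutional layers followed by a bidirectional sequence model) can be sketched as follows. This is a toy stand-in, not the BiMamba implementation: the symmetric normalization in `gcn_layer`, the exponential-decay recurrence used in place of a Mamba block, and the summation of the two directions are assumptions, and no CUDA streams are involved.

```python
import numpy as np

def gcn_layer(H, adj, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = adj + np.eye(adj.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

def scan(x, decay):
    """Toy causal recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay=0.9):
    """Scan the sequence forward and backward, then sum the two passes."""
    forward = scan(x, decay)
    backward = scan(x[::-1], decay)[::-1]            # scan reversed input, restore order
    return forward + backward

def represent_pathology(node_feats, adj, layer_weights):
    """Apply the input, intermediate, and output graph-conv layers, then scan both ways."""
    H = node_feats
    for W in layer_weights:
        H = gcn_layer(H, adj, W)
    return bidirectional_scan(H)
```

The gene branch would skip the graph layers and pass the gene embeddings straight to `bidirectional_scan`, mirroring the description above.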
[0034] The 1024-dimensional gene embeddings are input into the BiMamba model, whose linear-complexity long-sequence modeling efficiently captures the temporal and functional associations among gene sequences. Through its selective mechanism, the BiMamba model focuses on core gene features related to cancer survival, suppresses irrelevant noise, and outputs the 1024-dimensional genomic data representation.
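The idea of reading a sequence in both directions can be sketched with a much simpler recurrence than a full Mamba block. In the sketch below, a decayed running sum stands in for the state space scan, the reverse pass is obtained by flipping the sequence, and the two passes are summed; all three choices are assumptions for illustration and do not reproduce the BiMamba model or its CUDA streams.

```python
import numpy as np

def linear_scan(x, decay):
    """Causal recurrence h_t = decay * h_{t-1} + x_t along the sequence axis."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay_fwd, decay_bwd):
    """Combine a forward pass with a pass over the reversed sequence."""
    forward = linear_scan(x, decay_fwd)
    backward = linear_scan(x[::-1], decay_bwd)[::-1]  # flip, scan, flip back
    return forward + backward

x = np.array([[1.0], [0.0], [0.0], [2.0]])
y = bidirectional_scan(x, 0.5, 0.5)
print(y[:, 0])  # [2.25  1.    1.25  4.125]
```

Every output position now depends on inputs both before and after it, which a single causal scan cannot provide.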
[0035] In this embodiment, the data fusion module includes a parameter update unit, a cross-modal fusion unit, and a state space fusion unit. The parameter update unit uses the selective mechanism of the Mamba model to extract intra-modal pathological update parameters and genomic update parameters from the pathological image representation and the genomic data representation, respectively. The cross-modal fusion unit uses a wavelet attention mechanism to decompose the pathological update parameters and the genomic update parameters into high-frequency and low-frequency components, and computes the gene cross-modal parameters and the pathological cross-modal parameters, respectively, from those components; it then uses learnable weights to fuse the pathological update parameters with the pathological cross-modal parameters, and the genomic update parameters with the gene cross-modal parameters, obtaining the pathological cross-modal update parameters and the genomic cross-modal update parameters. The state space fusion unit constructs a pathological image fusion state space model and a genomic fusion state space model; applies the pathological cross-modal update parameters and the genomic cross-modal update parameters to the respective models to perform real-time fusion at the state space level, obtaining pathological fusion features and gene fusion features; and concatenates the pathological fusion features and the gene fusion features to obtain the fused representation.
[0036] Specifically, the data fusion module is expressed as: $Z_p = \mathrm{HFM}(F_p; \hat{\theta}_p)$, $Z_g = \mathrm{GFM}(F_g; \hat{\theta}_g)$, $F = \mathrm{Concat}(Z_p, Z_g)$.
[0037] In the formulas, $F$ denotes the fused representation after concatenation, and Concat denotes the concatenation operation; $Z_p$ and $Z_g$ denote the pathological fusion features and the gene fusion features; $\hat{\theta}_p$ and $\hat{\theta}_g$ denote the pathological cross-modal update parameters and the genomic cross-modal update parameters; $F_p$ and $F_g$ denote the pathological image representation and the genomic data representation learned from each patient's pathological fully connected graph and gene embeddings. HFM and GFM are the histopathology fusion Mamba (HFM) model and the gene fusion Mamba (GFM) model, respectively, both constructed on the basis of the Mamba model.
[0038] The HFM and GFM models use the selection mechanism of Mamba and the wavelet cross-attention mechanism to determine, for the histopathological image and genomic data respectively, the intra-modal update parameters and the cross-modal parameters derived from the other modality. These parameters are combined by adaptive summation and used as the state update parameters of the SSM module, which yields the HFSSM and GFSSM models. Replacing the SSM module in the Mamba model with the HFSSM and GFSSM models yields the HFM and GFM models, which constitute the pathological image fusion state space model and the genomic fusion state space model.
[0039] Furthermore, based on the selection mechanism of the Mamba model, the intra-modal update parameters for the histopathological image and the genomic map are designed as follows: $B_p = \mathrm{Linear}_B(F_p)$, $C_p = \mathrm{Linear}_C(F_p)$, $\Delta_p = \tau(\mathrm{Broadcast}_\Delta(\mathrm{Linear}_\Delta(F_p)))$, $B_g = \mathrm{Linear}_B(F_g)$, $C_g = \mathrm{Linear}_C(F_g)$, $\Delta_g = \tau(\mathrm{Broadcast}_\Delta(\mathrm{Linear}_\Delta(F_g)))$.
[0040] In the formulas, $B_p$, $C_p$, $\Delta_p$ denote the intra-modal pathological update parameters of the histopathological image, and $B_g$, $C_g$, $\Delta_g$ denote the intra-modal genomic update parameters of the genomic map. $\mathrm{Linear}_B(x)$ and $\mathrm{Linear}_C(x)$ denote the linear transformations that produce $B$ and $C$, respectively, from the input $x$ ($F_p$ or $F_g$). $\mathrm{Broadcast}_\Delta(\mathrm{Linear}_\Delta(x))$ means that the linear transformation $\mathrm{Linear}_\Delta$ is first applied to the input $x$, and the broadcast operation $\mathrm{Broadcast}_\Delta$ then extends the result to a new shape or dimension. $\tau$ denotes the softplus activation function, and $x$ ($F_p$ or $F_g$) denotes the pathological image representation or the genomic data representation before it is input into the state space model.
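The input-dependent parameters described above (two linear maps for B and C, and a linear map followed by a broadcast and softplus for the step size) can be sketched as follows. The weight shapes, the single-scalar step size per position, and the function and variable names are hypothetical; the sketch only illustrates the selection mechanism and is not the application's code.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))

def selective_params(x, w_b, w_c, w_delta):
    """x: (L, D) sequence -> B: (L, N), C: (L, N), delta: (L, D)."""
    b = x @ w_b                                      # linear map producing B
    c = x @ w_c                                      # linear map producing C
    raw = x @ w_delta                                # linear map producing one scalar per position: (L, 1)
    delta = softplus(np.broadcast_to(raw, x.shape))  # broadcast to D channels, then softplus
    return b, c, delta

rng = np.random.default_rng(0)
seq_len, dim, state = 5, 8, 4                        # toy sizes
x = rng.normal(size=(seq_len, dim))
b, c, delta = selective_params(
    x,
    rng.normal(size=(dim, state)),
    rng.normal(size=(dim, state)),
    rng.normal(size=(dim, 1)),
)
print(b.shape, c.shape, delta.shape)  # (5, 4) (5, 4) (5, 8)
```

Because all three outputs are functions of the input, each sequence position gets its own update parameters, and the softplus keeps the step size positive.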
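[0041] As shown in Figure 4, the cross-modal parameters for the histopathological image and the genomic map, based on the wavelet cross-attention mechanism, are designed as follows: $\tilde{B}_p = \mathrm{Softmax}\left(\frac{B_p (B_g^{High})^{\top}}{\sqrt{d}}\right) B_g^{Low}$, $\tilde{C}_p = \mathrm{Softmax}\left(\frac{C_p (C_g^{High})^{\top}}{\sqrt{d}}\right) C_g^{Low}$, $\tilde{\Delta}_p = \mathrm{Softmax}\left(\frac{\Delta_p (\Delta_g^{High})^{\top}}{\sqrt{d}}\right) \Delta_g^{Low}$; $\tilde{B}_g = \mathrm{Softmax}\left(\frac{B_g (B_p^{High})^{\top}}{\sqrt{d}}\right) B_p^{Low}$, $\tilde{C}_g = \mathrm{Softmax}\left(\frac{C_g (C_p^{High})^{\top}}{\sqrt{d}}\right) C_p^{Low}$, $\tilde{\Delta}_g = \mathrm{Softmax}\left(\frac{\Delta_g (\Delta_p^{High})^{\top}}{\sqrt{d}}\right) \Delta_p^{Low}$. In the formulas, $\tilde{B}_p$, $\tilde{C}_p$, $\tilde{\Delta}_p$ denote the pathological cross-modal parameters of the histopathological image, and $\tilde{B}_g$, $\tilde{C}_g$, $\tilde{\Delta}_g$ denote the gene cross-modal parameters of the genomic map; $B$, $C$, $\Delta$ with subscript $p$ or $g$ denote the intra-modal pathological or genomic update parameters. The superscripts High and Low indicate the high-frequency and low-frequency components obtained through wavelet decomposition. $\top$ denotes the matrix transpose operation, $d$ denotes the dimension of the feature vectors used to calculate the attention weights, and Softmax denotes the Softmax function.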
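A minimal sketch of wavelet cross-attention is given below, assuming a single-level Haar decomposition along the sequence axis, the current modality's parameter as the query, and equal feature dimensions in both modalities. The text states only that the high-frequency component serves as the key and the low-frequency component as the value; the wavelet family, the query choice, and all names are assumptions.

```python
import numpy as np

def haar_decompose(x):
    """Single-level Haar transform along axis 0 (sequence length must be even)."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-frequency (approximation) component
    high = (even - odd) / np.sqrt(2.0)   # high-frequency (detail) component
    return low, high

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def wavelet_cross_attention(query, other):
    """Attend from one modality's parameter to the other's wavelet components."""
    low, high = haar_decompose(other)
    d = query.shape[-1]
    weights = softmax(query @ high.T / np.sqrt(d))  # key = high-frequency component
    return weights @ low                            # value = low-frequency component

rng = np.random.default_rng(0)
path_b = rng.normal(size=(6, 4))   # hypothetical pathological update parameter
gene_b = rng.normal(size=(8, 4))   # hypothetical genomic update parameter
cross = wavelet_cross_attention(path_b, gene_b)
print(cross.shape)  # (6, 4)
```

The output has the same shape as the query, so it can be combined directly with the intra-modal parameter of the current modality.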
[0042] The cross-modal update parameters for the histopathological image and the genomic data are obtained by adaptive summation: for each of $B$, $C$, and $\Delta$, the intra-modal update parameter is combined with the corresponding cross-modal parameter in a weighted sum. This yields the pathological cross-modal update parameters $\hat{B}_p$, $\hat{C}_p$, $\hat{\Delta}_p$ and the genomic cross-modal update parameters $\hat{B}_g$, $\hat{C}_g$, $\hat{\Delta}_g$. The weights are three learnable parameters that take values in the interval (0, 1), which gives the weighted fusion greater adaptability.
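One way to keep a learnable fusion weight inside (0, 1) is to pass an unconstrained parameter through a sigmoid. The sketch below assumes a convex combination of the intra-modal and cross-modal parameters; the text specifies only a weighted sum with learnable weights in (0, 1), so both the convex form and the sigmoid parameterization are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_fuse(intra, cross, raw_weight):
    """Blend intra-modal and cross-modal parameters with a weight in (0, 1)."""
    lam = sigmoid(raw_weight)             # maps any real raw_weight into (0, 1)
    return (1.0 - lam) * intra + lam * cross

intra = np.array([[1.0, 2.0], [3.0, 4.0]])   # hypothetical intra-modal parameter
cross = np.array([[3.0, 2.0], [1.0, 0.0]])   # hypothetical cross-modal parameter
fused = adaptive_fuse(intra, cross, raw_weight=0.0)
print(fused)  # raw_weight = 0 gives an equal blend: [[2. 2.] [2. 2.]]
```

During training, `raw_weight` would be updated by the optimizer like any other parameter, so the model can learn how much cross-modal information to admit for each of B, C, and the step size.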
[0043] As shown in Figure 3, the pathological image fusion state space model (HFSSM) is as follows: $h_t^p = \bar{A}_p h_{t-1}^p + \bar{B}_p x_t^p$, $y_t^p = \hat{C}_p h_t^p$. In the formulas, $h_t^p$ denotes the state vector of the pathological image fusion state space model at time step $t$, i.e., the pathological fusion feature. $\bar{A}_p = \exp(\hat{\Delta}_p A_p)$ denotes the state transition matrix of the pathological image fusion state space model, where exp denotes the exponential function and $A_p$ is initialized with the HiPPO matrix. $\bar{B}_p = (\hat{\Delta}_p A_p)^{-1}(\exp(\hat{\Delta}_p A_p) - I)\,\hat{\Delta}_p \hat{B}_p$ denotes the input mapping matrix of the pathological image fusion state space model, where $I$ denotes the identity matrix. $x_t^p$ denotes the pathological image representation before it is input into the pathological image fusion state space model, and $y_t^p$ denotes the output vector of the model at time step $t$. $\hat{B}_p$, $\hat{C}_p$, $\hat{\Delta}_p$ denote the pathological cross-modal update parameters.
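[0044] As shown in Figure 3, the genomic fusion state space model (GFSSM) is as follows: $h_t^g = \bar{A}_g h_{t-1}^g + \bar{B}_g x_t^g$, $y_t^g = \hat{C}_g h_t^g$. In the formulas, $h_t^g$ denotes the state vector of the genomic fusion state space model at time step $t$, i.e., the gene fusion feature. $\bar{A}_g = \exp(\hat{\Delta}_g A_g)$ denotes the state transition matrix of the genomic fusion state space model, where exp denotes the exponential function and $A_g$ is initialized with the HiPPO matrix. $\bar{B}_g = (\hat{\Delta}_g A_g)^{-1}(\exp(\hat{\Delta}_g A_g) - I)\,\hat{\Delta}_g \hat{B}_g$ denotes the input mapping matrix of the genomic fusion state space model, where $I$ denotes the identity matrix. $x_t^g$ denotes the genomic data representation before it is input into the genomic fusion state space model, and $y_t^g$ denotes the output vector of the model at time step $t$. $\hat{B}_g$, $\hat{C}_g$, $\hat{\Delta}_g$ denote the genomic cross-modal update parameters.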
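The state update used by both fusion state space models can be sketched as a discretized recurrence. The sketch below assumes a diagonal state matrix (so the matrix exponential and inverse reduce to elementwise operations), zero-order-hold discretization, and a negative real diagonal as a stand-in for the HiPPO-based initialization; the function name and shapes are hypothetical. In the fusion models, the `delta`, `b`, and `c` arguments would be the cross-modal update parameters.

```python
import numpy as np

def fused_ssm_scan(x, delta, a, b, c):
    """Discretized SSM scan with a diagonal state matrix.

    x: (L, D) input, delta: (L, D) step sizes, a: (D, N) diagonal state matrix,
    b: (L, N) input parameters, c: (L, N) output parameters -> y: (L, D).
    """
    seq_len, dim = x.shape
    state = np.zeros_like(a)                        # one N-dimensional state per channel
    y = np.empty((seq_len, dim))
    for t in range(seq_len):
        a_bar = np.exp(delta[t][:, None] * a)       # exp(delta * A)
        b_bar = (a_bar - 1.0) / a * b[t][None, :]   # (delta A)^-1 (exp(delta A) - I) delta B, diagonal case
        state = a_bar * state + b_bar * x[t][:, None]
        y[t] = state @ c[t]                         # read out the state with C
    return y

rng = np.random.default_rng(0)
seq_len, dim, n_state = 6, 3, 4                     # toy sizes
a = -np.tile(np.arange(1.0, n_state + 1), (dim, 1)) # negative real diagonal, stand-in for HiPPO-based init
x = rng.normal(size=(seq_len, dim))
delta = np.full((seq_len, dim), 0.1)
b = rng.normal(size=(seq_len, n_state))
c = rng.normal(size=(seq_len, n_state))
y = fused_ssm_scan(x, delta, a, b, c)
print(y.shape)  # (6, 3)
```

Because the parameters enter the recurrence at every time step, information from the other modality influences the state as it evolves rather than only after the scan has finished.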
[0045] Specifically, the data fusion module obtains the pathological image fusion state space model and the genomic fusion state space model by modifying the state space model (SSM). The histopathological and genomic representations are input into their respective fusion state space models, and the resulting features (i.e., the pathological fusion features and gene fusion features) are then concatenated. This addresses the problems of fusion timing lag and state evolution fragmentation and achieves real-time fusion of the two modalities' features at the state space level.
[0046] In this embodiment, the survival prediction module includes a first fully connected layer and a second fully connected layer; the first fully connected layer is used to map the high-dimensional fusion representation into low-dimensional features; the second fully connected layer is used to convert the low-dimensional features into a probability distribution through the Softmax function, and then obtain the survival function value through cumulative product to represent the survival probability at different time points.
[0047] Specifically, the survival prediction module is expressed as follows: $z = W_1 F + b_1$, $S = \mathrm{cumprod}(\mathrm{Softmax}(W_2 z + b_2))$. In the formulas, $z$ denotes the low-dimensional features obtained by mapping the fused representation; $F$ denotes the fused representation, with 2048 dimensions; $S$ denotes the predicted survival probability, with 4 dimensions; $W_1$ and $W_2$ denote the weight parameters to be learned; $b_1$ and $b_2$ denote the bias parameters to be learned; Softmax is the function that transforms a vector into a probability distribution; and cumprod is the cumulative product function. The survival prediction module thus uses a two-layer fully connected neural network to predict survival from the fused representation of the pathological images and the genome.
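The two-layer prediction head can be sketched as follows. The 2048-dimensional input and 4 output time intervals come from the text; the hidden width of 256, the random weights, and the absence of an activation between the layers are assumptions. The text does not say whether the cumulative product is taken over the Softmax probabilities themselves or over their complements; the sketch applies it to the probabilities directly, as literally stated.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def survival_head(fused, w1, b1, w2, b2):
    """Map a fused representation to survival values at successive time points."""
    low = fused @ w1 + b1              # first fully connected layer: high-dim -> low-dim
    probs = softmax(low @ w2 + b2)     # second fully connected layer followed by Softmax
    return np.cumprod(probs)           # cumulative product over the time intervals

rng = np.random.default_rng(0)
fused_dim, hidden_dim, num_bins = 2048, 256, 4   # 2048 and 4 from the text; 256 is hypothetical
fused = rng.normal(size=fused_dim)
surv = survival_head(
    fused,
    rng.normal(scale=0.01, size=(fused_dim, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(scale=0.01, size=(hidden_dim, num_bins)), np.zeros(num_bins),
)
print(surv.shape)  # (4,)
```

Since every factor lies in (0, 1), the cumulative product never increases from one time point to the next, which is the property a survival function requires.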
[0048] In this embodiment, the loss function module includes a loss calculation unit and a parameter optimization unit. The loss calculation unit calculates the loss between the survival probability received from the survival prediction module and the clinical data. The parameter optimization unit uses the Adam optimizer to iteratively update the module parameters of the multimodal survival prediction model through backpropagation, so as to minimize the loss between the survival probability and the clinical data.
[0049] Specifically, the loss function module is used to train the parameters of the representation learning module, data fusion module, and survival prediction module. Based on the calculated loss, the model parameters of the multimodal survival prediction model are optimized through the backpropagation algorithm, so that the predicted survival probability is theoretically close to the survival status of the corresponding clinical data. In this embodiment, the Adam optimization algorithm (an optimization algorithm applied in deep learning models) is used to improve the prediction accuracy of the multimodal survival prediction model by iteratively updating the model parameters to minimize the loss. After multiple iterations, the optimal prediction model, i.e., the trained multimodal survival prediction model, is obtained.
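The iterative parameter update performed by the Adam optimizer can be illustrated on a toy problem. The sketch below implements the standard Adam update rule with bias correction and uses it to minimize a one-dimensional quadratic; the toy loss, learning rate, and step count are hypothetical and unrelated to the application's training setup.

```python
import numpy as np

def adam_step(param, grad, m, v, step, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** step)             # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** step)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = 0.0, 0.0, 0.0
for step in range(1, 501):
    grad = 2.0 * (w - 3.0)                        # gradient of the toy loss (w - 3)^2
    w, m, v = adam_step(w, grad, m, v, step)
print(abs(w - 3.0) < 0.05)  # True
```

In the model, `grad` would come from backpropagating the survival loss through the prediction, fusion, and representation learning modules.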
[0050] The loss function measures the loss between the predicted survival probability and the clinical data. It is defined over the following quantities: $N$, the total number of samples in the multimodal dataset; the censoring indicator variable of the $i$-th sample, which takes one value when the event was not censored (an observed event, such as death) and another when it was censored (the event was not observed); the predicted value of the survival function for the $i$-th sample, which gives the predicted probability that the $i$-th sample survives up to a given time point; and the observed survival time of the $i$-th sample.
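For orientation, one common discrete-time negative log-likelihood for censored survival data is sketched below. It is not taken from this application, whose exact formula is not reproduced here: the convention that an indicator of 1 means censored, the use of discrete time bins, and all names are assumptions.

```python
import numpy as np

def discrete_survival_nll(surv, time_bin, censored, eps=1e-7):
    """surv: (N, T) predicted S_i(t); time_bin: (N,) bin index; censored: (N,) 0 or 1."""
    n = surv.shape[0]
    idx = np.arange(n)
    s_pad = np.concatenate([np.ones((n, 1)), surv], axis=1)  # prepend S_i = 1 before the first bin
    s_prev = s_pad[idx, time_bin]                            # survival up to the previous bin
    s_curr = s_pad[idx, time_bin + 1]                        # survival up to the observed bin
    event_ll = np.log(np.clip(s_prev - s_curr, eps, None))   # uncensored: event fell inside the bin
    cens_ll = np.log(np.clip(s_curr, eps, None))             # censored: survived at least this long
    return -np.mean(np.where(censored == 1, cens_ll, event_ll))

surv = np.array([[0.9, 0.5, 0.2, 0.1]])
loss_event = discrete_survival_nll(surv, np.array([1]), np.array([0]))
loss_cens = discrete_survival_nll(surv, np.array([1]), np.array([1]))
print(round(float(loss_event), 4), round(float(loss_cens), 4))  # 0.9163 0.6931
```

An uncensored sample is penalized when the model places little probability on the interval in which the event occurred, while a censored sample is penalized only when the model predicts a low chance of surviving to the last observation.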
[0051] Step 103: Preprocess the histopathological map and genomic map of the patient to be predicted into a pathological fully connected map and gene embedding, and input them into the trained multimodal survival prediction model to obtain the survival probability of the patient at different time points.
[0052] In this embodiment, to verify the effectiveness of the multimodal survival prediction model, five publicly available TCGA datasets (BLCA, BRCA, GBMLGG, LUAD, UCEC) were selected, and the model was compared with gene modality methods (SNN, SNNTrans), image modality methods (Patch-GCN, CLAM, TransMIL, GraphLSurv, GraphMamba, MambaMIL), and multimodal methods (MCAT, PORPOISE, MOTCat, CMTA, MoME, CCL, LD-CVAE, SurMoE, SAMamba). To further demonstrate the effectiveness of the designed fusion state space model (FSSM, i.e., the pathological image fusion state space model and the genomic fusion state space model) and of the wavelet cross-attention mechanism in multimodal data fusion, a series of ablation experiments was conducted and visualizations are provided. The C-index (concordance index) was selected as the evaluation metric; in survival analysis, it measures the model's ability to correctly rank pairs of individuals by predicted survival time, and a higher value indicates more accurate prediction. The mean and standard deviation were calculated using 5-fold cross-validation, as shown in Tables 1 and 2 below.
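The C-index used as the evaluation metric can be computed as in the sketch below, which follows Harrell's definition: a pair of patients is comparable when the one with the shorter time had an observed event, and the pair is concordant when that patient also received the higher risk score. The scalar risk score and the toy data are hypothetical.

```python
import numpy as np

def concordance_index(times, risks, events):
    """Harrell's C-index: fraction of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue                       # the earlier time of a pair must be an observed event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0      # higher risk assigned to the earlier event
                elif risks[i] == risks[j]:
                    concordant += 0.5      # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = np.array([2.0, 4.0, 6.0, 8.0])
events = np.array([1, 0, 1, 1])            # 0 marks a censored patient
risks = np.array([0.9, 0.3, 0.1, 0.5])
c_index = concordance_index(times, risks, events)
print(c_index)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Under 5-fold cross-validation, the metric would be computed once per held-out fold and summarized by the mean and standard deviation of the five values.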
[0053] Table 1. Comparison of C-index results for different methods on five publicly available TCGA datasets.
[0054] From the C-index perspective, multimodal methods generally outperform single-modal methods because they integrate rich information from both genomic and histopathological data. Among the multimodal methods, this application achieved the best results on four of the five datasets and ranked second only on the GBMLGG dataset, demonstrating its general effectiveness. Furthermore, compared to SAMamba, another multimodal method based on the Mamba model, the multimodal survival prediction method of this application, which integrates wavelet attention with the Mamba model, achieves better results. This indicates that the proposed FSSM (the pathological image fusion state space model and the genomic fusion state space model) can better integrate information from both modalities, thereby generating features better suited to the task. In the table, Modality denotes the data modality: g. denotes the gene modality of the genomic map, and h. denotes the image modality of the histopathological image. SNN denotes a survival prediction method based on genomic data that uses a self-normalizing neural network; SNNTrans denotes a survival prediction method based on genomic data combined with a Transformer architecture; Patch-GCN denotes a survival prediction method based on pathological images that processes whole slide images (WSI) with a graph convolutional network; CLAM denotes a multiple instance learning survival prediction method based on pathological images that employs an attention mechanism; TransMIL denotes a survival prediction method based on pathological images that uses a Transformer for multiple instance learning on whole slide image classification; GraphLSurv denotes a scalable survival prediction network based on pathological images with adaptive sparse structure learning; GraphMamba denotes a survival prediction method based on pathological images that combines a graph structure with the Mamba architecture; MambaMIL denotes a survival prediction method based on pathological images that uses a state space model for multiple instance learning on whole slide image classification; MCAT denotes a multimodal survival prediction method that integrates pathological images and genomic data based on a co-attention Transformer; PORPOISE denotes a multimodal deep learning method for pan-cancer integrative histology-genomic analysis based on pathological images and genomic data; MOTCat denotes a survival prediction method based on pathological images and genomic data that combines multimodal optimal transport with a co-attention Transformer; CMTA denotes a cross-modal translation and alignment survival prediction method that fuses pathological images and genomic data; MoME denotes a method that fuses images and genomic data and performs multimodal cancer survival analysis through a mixture of experts; CCL denotes a multimodal cancer survival analysis method based on pathological images and genomic data that uses cohort-individual collaborative learning; LD-CVAE denotes a robust multimodal survival prediction method based on pathological images and genomic data that uses latent differential conditional variational autoencoders; SurMoE denotes a survival prediction method that fuses images and genomic data and performs multimodal cancer survival analysis through a mixture of experts; SAMamba denotes a multimodal survival analysis method that fuses pathological images and genomic data and incorporates state space models; and Ours denotes the multimodal survival prediction model of this application.
[0055] Table 2. Comparison of C-index results for different cross-attention mechanisms and adaptive fusion parameters.
[0056] From the C-index perspective, the wavelet cross-attention mechanism and the adaptive fusion parameter design proposed in this application are reasonable and achieve good survival prediction results. Specifically, in the wavelet cross-attention mechanism, high-frequency information is used as the key vector and low-frequency information as the value vector. This application considers three variants: low-frequency information as the key (KL), high-frequency information as the value (VH), and low-frequency information as the key together with high-frequency information as the value (KLVH). In addition, traditional cross-attention (CA), which performs no frequency decomposition, is included as a reference. The results are shown in Table 2. All alternative cross-attention strategies lead to a significant performance degradation, indicating that, on the one hand, decomposing frequency information is important for resolving the heterogeneity between modalities and, on the other hand, the frequency information must be assigned to the key and value roles appropriately. In the table, Methods denotes the method used, and "w/o" indicates that the corresponding learnable parameters are removed.
[0057] As shown in Figure 5, the cross-modal fusion parameters in the FSSM obtained through the different cross-attention mechanisms are further visualized. The response of the wavelet cross-attention method is generally clearer and more pronounced than that of the other strategies. This indicates that the proposed wavelet-based cross-attention mechanism can retrieve more useful complementary information from the other modality for feature extraction in the current modality.
[0058] While this application provides the method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps listed in this embodiment is merely one possible execution order among many and does not represent the only execution order. In actual device or client product execution, the methods shown in this embodiment or the accompanying drawings can be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment).
[0059] As shown in Figure 2, this application also provides a multimodal survival prediction device 200 that integrates wavelet attention with a Mamba model. The device includes a dataset construction module 201, a model construction and training module 202, and a prediction module 203, as detailed below.
[0060] The dataset construction module 201 is used to standardize the histopathological maps and genomic maps of multiple patients to obtain structured pathological fully connected maps and gene embeddings, in order to construct a multimodal dataset.
[0061] The model building and training module 202 is used to incorporate wavelet attention into the Mamba model to construct a multimodal survival prediction model and train the multimodal survival prediction model using a multimodal dataset. The multimodal survival prediction model includes a representation learning module, a data fusion module, a survival prediction module, and a loss function module. The representation learning module is used to perform deep feature extraction on the pathological fully connected graph and gene embeddings to mine key features and obtain pathological image representations and genomic data representations. The data fusion module is used to mine the feature update patterns and cross-modal complementary associations of the pathological image representations and genomic data representations through wavelet attention and the selective mechanism of the Mamba model, obtaining a fused representation. The survival prediction module is used to predict the survival risk of the fused representation through a fully connected neural network, outputting the survival probability of the corresponding patient at different time points. The loss function module is used to calculate the loss between the predicted survival probability and the clinical data to optimize the model parameters of the multimodal survival prediction model.
[0062] The prediction module 203 is used to preprocess the histopathological map and genomic map of the patient to be predicted into a pathological fully connected map and gene embedding, and input them into the trained multimodal survival prediction model to obtain the survival probability of the patient to be predicted at different time points.
[0063] Some modules in the apparatus described in this application can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, etc., that perform a specific task or implement a specific abstract data type. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0064] The apparatus or module described in the above embodiments can be implemented by a computer chip or physical entity, or by a product with a certain function. For ease of description, the above apparatus is described by dividing it into various modules according to their functions. When implementing the embodiments of this application, the functions of each module can be implemented in one or more software and / or hardware. Of course, a module that implements a certain function can also be implemented by combining multiple sub-modules or sub-units.
[0065] The methods, apparatus, or modules described in this application can be implemented in a computer-readable program code manner. The controller can be implemented in any suitable manner, such as a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code manner, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included within it for implementing various functions can also be considered as structures within the hardware component. Alternatively, the device used to implement various functions can be viewed as either a software module that implements the method or a structure within a hardware component.
[0066] This application also provides an apparatus, the apparatus comprising: a processor; a memory for storing processor-executable instructions; wherein, when the processor executes the executable instructions, it implements the method described in this application.
[0067] This application also provides a non-volatile computer-readable storage medium storing a computer program or instructions thereon, which, when executed, enables the method described in this application embodiment to be implemented.
[0068] Furthermore, in the various embodiments of the present invention, each functional module can be integrated into a processing module, or each module can exist independently, or two or more modules can be integrated into a single module.
[0069] The aforementioned storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Cache, Hard Disk Drive (HDD), or Memory Card. The memory can be used to store computer program instructions.
[0070] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary hardware. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product, or it can be embodied in the process of data migration. The computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.
[0071] The various embodiments described in this specification are presented in a progressive manner. Similar or identical parts between embodiments can be referred to interchangeably. Each embodiment focuses on its differences from other embodiments. All or part of this application can be used in numerous general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices, etc.
[0072] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit this application. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of this application.
Claims
1. A multimodal survival prediction method for a Mamba model incorporating wavelet attention, characterized in that it comprises: standardizing the histopathological images and genomic maps of multiple patients to obtain structured pathological fully connected graphs and gene embeddings, so as to construct a multimodal dataset; constructing a multimodal survival prediction model by incorporating wavelet attention into a Mamba model, and training the multimodal survival prediction model using the multimodal dataset, wherein the multimodal survival prediction model includes a representation learning module, a data fusion module, a survival prediction module, and a loss function module; the representation learning module performs deep feature extraction on the pathological fully connected graph and the gene embeddings to mine key features, obtaining a pathological image representation and a genomic data representation; the data fusion module uses wavelet attention and the selective mechanism of the Mamba model to mine the feature update patterns and cross-modal complementary associations of the pathological image representation and the genomic data representation, obtaining a fused representation; the survival prediction module uses a fully connected neural network to predict survival risk from the fused representation, outputting the survival probability of the corresponding patient at different time points; and the loss function module calculates the loss between the predicted survival probability and the clinical data to optimize the model parameters of the multimodal survival prediction model; and preprocessing the histopathological image and genomic map of a patient to be predicted into a pathological fully connected graph and gene embeddings, and inputting them into the trained multimodal survival prediction model to obtain the survival probability of the patient to be predicted at different time points.
2. The method according to claim 1, characterized in that standardizing the histopathological images and genomic maps of multiple patients to obtain structured pathological fully connected graphs and gene embeddings includes: cropping each histopathological image in the multimodal dataset into multiple pixel blocks of a set size; extracting feature embeddings of a first dimension from the pixel blocks of the histopathological image; constructing an initial fully connected graph from the feature embeddings, with the pixel blocks of the histopathological image as nodes; calculating topological relationships based on the spatial locations and feature similarity of the nodes, and sorting the nodes of the initial fully connected graph according to the topological relationships to obtain the pathological fully connected graph; dividing the genomic map into multiple types of gene sequences according to biological function; mapping each type of gene sequence to an initial gene embedding of the first dimension through a linear transformation; and normalizing the initial gene embeddings of the first dimension to obtain the gene embeddings.
3. The method according to claim 1, characterized in that the representation learning module includes a pathological feature extraction unit and a gene feature extraction unit; the pathological feature extraction unit is used to capture the topological relationships among the pixel blocks in the pathological fully connected graph and to learn their global context information in both the forward and reverse directions, obtaining the pathological image representation; and the gene feature extraction unit is used to perform representation learning on the gene embeddings to mine key information, obtaining the genomic data representation.
4. The method according to claim 1, characterized in that the data fusion module includes a parameter update unit, a cross-modal fusion unit, and a state space fusion unit; the parameter update unit is used to extract intra-modal pathological update parameters and genomic update parameters from the pathological image representation and the genomic data representation, respectively, through the selective mechanism of the Mamba model; the cross-modal fusion unit is used to decompose the pathological update parameters and the genomic update parameters into high-frequency and low-frequency components through a wavelet attention mechanism, and to calculate the gene cross-modal parameters and the pathological cross-modal parameters, respectively, from those components; and to perform weighted fusion, through learnable weights, of the pathological update parameters with the pathological cross-modal parameters and of the genomic update parameters with the gene cross-modal parameters, obtaining the pathological cross-modal update parameters and the genomic cross-modal update parameters; the state space fusion unit is used to construct a pathological image fusion state space model and a genomic fusion state space model; to apply the pathological cross-modal update parameters and the genomic cross-modal update parameters to the pathological image fusion state space model and the genomic fusion state space model, respectively, to perform real-time fusion at the state space level, obtaining pathological fusion features and gene fusion features; and to concatenate the pathological fusion features with the gene fusion features to obtain the fused representation.
5. The method according to claim 1, characterized in that the survival prediction module includes a first fully connected layer and a second fully connected layer; the first fully connected layer is used to map the high-dimensional fused representation into low-dimensional features; the second fully connected layer is used to convert the low-dimensional features into a probability distribution through the Softmax function, from which the survival function values, representing the survival probability at different time points, are obtained through a cumulative product.
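The Softmax and cumulative-product step of claim 5 can be sketched as follows. The claim does not state exactly which quantities enter the product, so this minimal sketch assumes that each Softmax output p_k is treated as the event probability of the k-th time interval and that the survival value is S(t_k) = (1 - p_1)(1 - p_2)...(1 - p_k). The function names and the four-interval example are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def survival_curve(logits):
    """Map per-interval logits to survival function values.

    Assumption: softmax outputs are per-interval event probabilities and
    the survival value at interval k is the running product of (1 - p_j)
    for all j up to and including k.
    """
    probs = softmax(logits)
    surv = []
    running = 1.0
    for p in probs:
        running *= 1.0 - p
        surv.append(running)
    return surv


if __name__ == "__main__":
    # Four time intervals with equal logits give p_k = 0.25 for each interval.
    print(survival_curve([0.0, 0.0, 0.0, 0.0]))
```

Because every factor lies between 0 and 1, the resulting curve never increases over time, which is the property a survival function needs.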
6. The method according to claim 1, characterized in that the loss function module includes a loss calculation unit and a parameter optimization unit; the loss calculation unit is used to calculate, using a loss function, the loss between the survival probability received from the survival prediction module and the clinical data; the parameter optimization unit is used to iteratively update the module parameters of the multimodal survival prediction model through backpropagation using the Adam optimization algorithm, so as to minimize the loss between the survival probability and the clinical data.
7. The method according to claim 6, characterized in that the loss function is as follows: ; in the formula, the symbols denote, respectively: the loss function, that is, the loss between the survival probability and the clinical data; N, the total number of samples in the multimodal dataset; the censoring indicator variable of the i-th sample, one value of which indicates that the sample is uncensored and the other that it is censored; the predicted value of the survival function for the i-th sample; the predicted survival probability of the i-th sample at each time point appearing in the formula; and the observed survival time of the i-th sample.
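Claim 7 defines the loss in terms of a censoring indicator, predicted survival probabilities, and the observed survival time. As a rough illustration of how such quantities typically combine, and not as a reconstruction of the claimed formula, the sketch below implements a standard discrete-time censored negative log-likelihood. The assumptions are: time is split into indexed intervals; an uncensored sample contributes the probability that the event falls in its observed interval, S(k-1) - S(k); a censored sample contributes the probability of surviving through its observed interval, S(k). The function name and argument layout are hypothetical.

```python
import math


def censored_nll(surv_curves, time_bins, censored, eps=1e-7):
    """Mean discrete-time negative log-likelihood with right censoring.

    surv_curves: per-sample lists of survival values S(0), S(1), ...
    time_bins:   per-sample index k of the observed time interval
    censored:    per-sample flag, True if the sample is censored
    """
    total = 0.0
    for surv, k, c in zip(surv_curves, time_bins, censored):
        s_prev = 1.0 if k == 0 else surv[k - 1]
        s_k = surv[k]
        if c:
            # Censored: only known to have survived through interval k.
            likelihood = s_k
        else:
            # Uncensored: event occurred within interval k.
            likelihood = s_prev - s_k
        total += -math.log(max(likelihood, eps))
    return total / len(surv_curves)


if __name__ == "__main__":
    curves = [[0.75, 0.5], [0.75, 0.5]]
    print(censored_nll(curves, time_bins=[1, 1], censored=[True, False]))
```

The `eps` floor only guards against taking the logarithm of zero; it is a numerical convenience, not part of the likelihood.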
8. A multimodal survival prediction apparatus using a Mamba model incorporating wavelet attention, for implementing the method according to any one of claims 1 to 7, characterized in that it comprises: a dataset construction module, used to standardize the histopathological maps and genomic maps of multiple patients to obtain structured pathological fully connected graphs and gene embeddings, so as to construct a multimodal dataset; a model building and training module, used to incorporate wavelet attention into a Mamba model to construct a multimodal survival prediction model, and to train the multimodal survival prediction model using the multimodal dataset; wherein the multimodal survival prediction model includes a representation learning module, a data fusion module, a survival prediction module, and a loss function module; the representation learning module performs deep feature extraction on the pathological fully connected graph and the gene embedding to mine key features, obtaining a pathological image representation and a genomic data representation; the data fusion module uses wavelet attention and the selective mechanism of the Mamba model to mine the feature update patterns and cross-modal complementary associations of the pathological image representation and the genomic data representation, obtaining a fused representation; the survival prediction module uses a fully connected neural network to predict survival risk from the fused representation, outputting the survival probability of the corresponding patient at different time points; the loss function module calculates the loss between the predicted survival probability and the clinical data to optimize the model parameters of the multimodal survival prediction model;
and a prediction module, used to preprocess the histopathological map and genomic map of the patient to be predicted into a pathological fully connected graph and a gene embedding, and to input them into the trained multimodal survival prediction model to obtain the survival probability of the patient to be predicted at different time points.
9. An apparatus for performing a multimodal survival prediction method using a Mamba model incorporating wavelet attention, characterized in that it comprises: a processor; and a memory for storing processor-executable instructions; wherein the processor, when executing the executable instructions, implements the method according to any one of claims 1 to 7.
10. A non-volatile computer-readable storage medium, characterized in that it stores computer programs or instructions which, when executed, cause the method according to any one of claims 1 to 7 to be implemented.