Spatial gene expression enhancement method based on bidirectional attention multi-modal fusion
By generating high-dimensional images and neighborhood-aware gene representation sequences, and utilizing pre-trained models and polar coordinate interpolation strategies, the problem of insufficient multimodal information fusion in existing technologies is solved, achieving high-precision spatial gene expression reconstruction and visualization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- YUNNAN UNIV
- Filing Date
- 2026-03-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies struggle to collaboratively model local neighborhood and global spatial dependencies, suffer from insufficient multimodal information fusion, and have limited model robustness and generalization capabilities, failing to meet the demands for high-precision and highly continuous spatial gene expression reconstruction.
By acquiring the target gene expression matrix, high-dimensional image representation sequences and neighborhood-aware gene representation sequences are generated. A pre-trained spatial gene prediction model is used for prediction, and multiple gene expression prediction values are integrated by combining a polar coordinate spatial interpolation strategy to achieve the generation of a high-resolution gene expression matrix.
It achieves high-precision recovery and visualization of spatial gene expression, improves the robustness and generalization ability of the model, solves the stability problem in sparse sampling scenarios, and reduces detection costs and cycle time.
Figure CN122201436A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of spatial transcriptomics technology, and in particular to a spatial gene expression enhancement method based on bidirectional attention multimodal fusion.
Background Technology
[0002] In multicellular organisms, the spatial distribution of gene expression is highly correlated with cell location, intercellular interactions, and tissue functional zoning. Spatial transcriptomics can obtain gene expression and spatial coordinate information while preserving tissue morphology, providing crucial support for understanding biological issues such as tissue microenvironment and tumor heterogeneity. Current mainstream spatial transcriptomics platforms mostly use sparse array detection with sampling points as the basic unit. This not only easily masks cell-scale expression heterogeneity but also results in incomplete tissue coverage and loss of key microenvironmental information. Simply increasing experimental resolution significantly increases costs and detection cycles. Therefore, enhancing spatial gene expression through computational methods has become a core demand in the field.
[0003] Existing spatial gene expression enhancement techniques still have significant limitations. Image-driven methods largely rely on convolutional neural networks for feature extraction, which limits their ability to model long-distance spatial dependencies and makes them susceptible to pathological slide quality and staining differences. Methods relying solely on spatial coordinates cannot fully characterize the structure of high-dimensional transcriptome data, resulting in insufficient prediction accuracy. While some multimodal methods incorporating Transformers possess global modeling capabilities, they are insufficient in depicting local neighborhood relationships, and the fusion of information between modalities is relatively shallow. Pure expression modeling methods that do not rely on images ignore tissue morphology and spatial topological constraints, leading to poor stability in sparse sampling scenarios. Overall, existing technologies struggle to collaboratively model local neighborhoods and global spatial dependencies, suffer from insufficient multimodal information fusion, and have limited model robustness and generalization ability, failing to meet the demands for high-precision and highly continuous spatial gene expression reconstruction.
Summary of the Invention
[0004] To overcome the shortcomings of existing technologies, the present invention provides a spatial gene expression enhancement method based on bidirectional attention multimodal fusion, with the aim of achieving high-precision recovery and visualization of spatial gene expression.
[0005] The first aspect of this invention provides a spatial gene expression enhancement method based on bidirectional attention multimodal fusion, comprising: acquiring a target gene expression matrix, the target gene expression matrix including multiple sampling points; generating a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix; acquiring a pre-trained spatial gene prediction model, inputting the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, and obtaining a first gene expression prediction value corresponding to each sampling point; generating multiple interpolation points based on the multiple sampling points using a polar coordinate spatial interpolation strategy; predicting a second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model; and integrating the multiple first gene expression prediction values and the multiple second gene expression prediction values to obtain a high-resolution gene expression matrix.
[0006] Optionally, in a first implementation of the first aspect of the present invention, obtaining the target gene expression matrix includes: obtaining an original gene expression matrix; standardizing the original gene expression matrix using a standardization algorithm to obtain a standardized gene expression matrix; performing a logarithmic transformation on the standardized gene expression matrix using a logarithmic transformation algorithm to obtain a transformed gene expression matrix; obtaining a preset sorting rule; sorting multiple genes in the transformed gene expression matrix based on the sorting rule; and selecting a preset number of target genes from the multiple genes to construct the target gene expression matrix.
[0007] Optionally, in a second implementation of the first aspect of the present invention, the step of generating a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix includes: acquiring a first local image patch corresponding to each sampling point in a whole-section histological image; acquiring a pre-trained pathological image feature extraction model, inputting each first local image patch into the pathological image feature extraction model for feature encoding to obtain multiple high-dimensional image representation vectors; integrating the multiple high-dimensional image representation vectors to obtain the first high-dimensional image representation sequence; and performing neighborhood information injection processing on the target gene expression matrix to obtain the first neighborhood-aware gene expression representation sequence.
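As an illustration of the patch acquisition step described above, the following is a minimal sketch that crops a square patch centered on each sampling point. The function name `crop_patches`, the default patch size of 224 pixels, and the zero padding at image borders are assumptions made for this example; the source only states that a fixed-size local patch is taken around each sampling point.

```python
import numpy as np

def crop_patches(image, coords, patch_size=224):
    """Crop a square patch centered on each (x, y) pixel coordinate.

    image:  (H, W, C) array holding the whole-section histological image.
    coords: (n, 2) pixel coordinates given as (x, y), all inside the image.
    Patches that cross the image border are zero-padded (an assumption).
    """
    image = np.asarray(image)
    half = patch_size // 2
    # Pad so that every window of size patch_size stays inside the array.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    patches = []
    for x, y in np.asarray(coords).round().astype(int):
        # Original pixel (y, x) sits at (y + half, x + half) after padding,
        # so the window starting at (y, x) is centered on the sampling point.
        patches.append(padded[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)
```

Each returned patch would then be passed to the pre-trained pathological image feature extraction model, which is not sketched here because the source does not name it.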
[0008] Optionally, in a third implementation of the first aspect of the present invention, the step of performing neighborhood information injection processing on the target gene expression matrix to obtain the first neighborhood-aware gene expression representation sequence includes: obtaining first spatial coordinates corresponding to each sampling point based on the target gene expression matrix, and dividing all sampling points into multiple observation sampling points and multiple prediction sampling points based on the first spatial coordinates; determining a group of neighboring sampling points corresponding to each prediction sampling point based on the multiple observation sampling points; calculating a gene expression mean based on the true gene expression values of each neighboring sampling point group, and assigning the gene expression mean to the corresponding prediction sampling point as a pre-fill, thereby determining the initial expression value of each prediction sampling point; integrating the true gene expression values of all observation sampling points and the initial expression values of all prediction sampling points to obtain a pre-filled expression matrix; determining, based on the first spatial coordinates of the multiple sampling points, the K nearest neighbor sampling points corresponding to each sampling point, and establishing spatial neighborhood connections between each sampling point and its K nearest neighbor sampling points to construct a nearest neighbor undirected graph; and, on the nearest neighbor undirected graph, using a graph attention network to perform attention-weighted aggregation processing on the pre-filled expression matrix to obtain the first neighborhood-aware gene expression representation sequence.
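The pre-filling and graph construction steps of this implementation can be sketched as follows. This is an illustrative sketch only: the function names, the default `k=4`, the brute-force distance computation, and the rule that an edge exists when either endpoint selects the other are assumptions, not details given in the source.

```python
import numpy as np

def knn_indices(query, reference, k):
    """Indices of the k nearest reference points for each query point."""
    d2 = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1, kind="stable")[:, :k]

def prefill_expression(coords, expr, observed_mask, k=4):
    """Keep observed values; fill prediction spots with a neighbor mean."""
    coords = np.asarray(coords, dtype=float)
    filled = np.array(expr, dtype=float, copy=True)
    observed_mask = np.asarray(observed_mask, dtype=bool)
    obs = np.flatnonzero(observed_mask)
    pred = np.flatnonzero(~observed_mask)
    if pred.size:
        # Neighbors are searched among observation sampling points only.
        nbrs = knn_indices(coords[pred], coords[obs], min(k, obs.size))
        filled[pred] = filled[obs][nbrs].mean(axis=1)
    return filled

def knn_undirected_adjacency(coords, k=4):
    """Boolean adjacency matrix of the K-nearest-neighbor undirected graph."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a spot is not its own neighbor
    nbrs = np.argsort(d2, axis=1, kind="stable")[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), nbrs.shape[1]), nbrs.ravel()] = True
    return adj | adj.T  # symmetrize so the graph is undirected
```

The brute-force distance matrix is quadratic in the number of sampling points; for large slices a spatial index such as a k-d tree would be a natural substitute.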
[0009] Optionally, in a fourth implementation of the first aspect of the present invention, the spatial gene prediction model includes a cross-attention module, a feature fusion module, a Transformer encoder module, and a gene expression prediction module. The cross-attention module, the feature fusion module, the Transformer encoder module, and the gene expression prediction module are connected sequentially. The cross-attention module includes a first attention submodule and a second attention submodule, which are connected to each other. The step of inputting the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction to obtain a first gene expression prediction value corresponding to each sampling point includes: based on the first attention submodule, using the first high-dimensional image representation sequence as the query and the first neighborhood-aware gene expression representation sequence as the key and value, performing a first cross-attention calculation to obtain a first interactive feature sequence; based on the second attention submodule, using the first neighborhood-aware gene expression representation sequence as the query and the first high-dimensional image representation sequence as the key and value, performing a second cross-attention calculation to obtain a second interactive feature sequence; inputting the first interactive feature sequence and the second interactive feature sequence into the feature fusion module for concatenation to obtain a multimodal fusion feature sequence; inputting the multimodal fusion feature sequence into the Transformer encoder module for global context enhancement to obtain a global context-enhanced feature sequence; and inputting the global context-enhanced feature sequence into the gene expression prediction module for prediction to obtain a first gene expression prediction value corresponding to each sampling point.
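The two cross-attention directions and the concatenation can be illustrated with a minimal single-head sketch. Scaled dot-product attention, the single head, and the parameter names `img_to_expr` and `expr_to_img` are assumptions chosen for brevity; the source does not specify the number of heads or the projection sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from query_seq to context_seq."""
    q = query_seq @ w_q      # (n, d)
    k = context_seq @ w_k    # (n, d)
    v = context_seq @ w_v    # (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def bidirectional_cross_attention(img_seq, expr_seq, params):
    # First submodule: image features query the expression features.
    first = cross_attention(img_seq, expr_seq, *params["img_to_expr"])
    # Second submodule: expression features query the image features.
    second = cross_attention(expr_seq, img_seq, *params["expr_to_img"])
    # Both outputs have one row per sampling point, so they can be
    # concatenated along the feature axis.
    return np.concatenate([first, second], axis=-1)
```

With image features of shape `(n, d_img)` and expression features of shape `(n, d_expr)`, the result has shape `(n, 2 * d)`, where `d` is the shared projection width.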
[0010] Optionally, in a fifth implementation of the first aspect of the present invention, the step of using a polar coordinate system spatial interpolation strategy to generate multiple interpolation points based on multiple sampling points includes: calculating the Euclidean distance between two adjacent sampling points based on the first spatial coordinates corresponding to each sampling point; determining the sampling distance of the interpolation point relative to the sampling point based on the Euclidean distance; and generating multiple interpolation points at the sampling distance with each sampling point as the center and in a direction of uniform circumferential distribution.
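A minimal sketch of the polar coordinate interpolation step is given below. The number of directions, the ratio between the sampling distance and the spot spacing, and the use of the mean nearest-neighbor distance as the spacing are assumptions for illustration; the source only states that the sampling distance is derived from the distance between adjacent sampling points and that directions are uniformly distributed around the circle.

```python
import numpy as np

def polar_interpolation_points(coords, n_directions=6, distance_ratio=0.5):
    """Generate interpolation points on a circle around every sampling point.

    Returns (points, radius): points has shape (n * n_directions, 2).
    """
    coords = np.asarray(coords, dtype=float)
    # Euclidean distance from each spot to its nearest neighbor.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    spacing = d.min(axis=1).mean()
    radius = distance_ratio * spacing
    # Uniformly spaced polar angles over the full circumference.
    angles = 2.0 * np.pi * np.arange(n_directions) / n_directions
    offsets = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    points = coords[:, None, :] + offsets[None, :, :]
    return points.reshape(-1, 2), radius
```

With a ratio of 0.5, points generated from two adjacent spots can coincide at the midpoint between them, so a caller may want to merge near-duplicate points before prediction.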
[0011] Optionally, in a sixth implementation of the first aspect of the present invention, the step of predicting the second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model includes: obtaining a second local image patch and a second spatial coordinate corresponding to each interpolation point; converting the second local image patch into a second high-dimensional image representation sequence, and generating a second neighborhood-aware gene expression representation sequence based on the second spatial coordinate; inputting the second high-dimensional image representation sequence and the second neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, thereby obtaining the second gene expression prediction value corresponding to each interpolation point.
[0012] A second aspect of the present invention provides a spatial gene expression enhancement device based on bidirectional attention multimodal fusion, comprising: a matrix acquisition module for acquiring a target gene expression matrix, the target gene expression matrix including multiple sampling points; a sequence generation module for generating a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix; a first prediction module for acquiring a pre-trained spatial gene prediction model, inputting the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, and obtaining a first gene expression prediction value corresponding to each sampling point; an interpolation point generation module for generating multiple interpolation points based on the multiple sampling points using a polar coordinate system spatial interpolation strategy; a second prediction module for predicting a second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model; and an integration module for integrating multiple first gene expression prediction values and multiple second gene expression prediction values to obtain a high-resolution gene expression matrix.
[0013] A third aspect of the present invention provides a spatial gene expression enhancement device based on bidirectional attention multimodal fusion, the spatial gene expression enhancement device based on bidirectional attention multimodal fusion comprising: a memory and at least one processor, the memory storing instructions; at least one processor calling the instructions in the memory to cause the spatial gene expression enhancement device based on bidirectional attention multimodal fusion to perform the various steps of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion described above.
[0014] A fourth aspect of the present invention provides a computer-readable storage medium storing instructions that, when executed by a processor, implement the steps of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion described in any of the preceding claims.
[0015] In the technical solution of this invention, a target gene expression matrix is first obtained, which includes multiple sampling points. Based on the target gene expression matrix, a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point are generated. Then, a pre-trained spatial gene prediction model is obtained. The first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence are input into the spatial gene prediction model for prediction to obtain a first gene expression prediction value corresponding to each sampling point. A polar coordinate spatial interpolation strategy is adopted to generate multiple interpolation points based on multiple sampling points. Based on the spatial gene prediction model, a second gene expression prediction value corresponding to each interpolation point is predicted. Finally, multiple first gene expression prediction values and multiple second gene expression prediction values are integrated to obtain a high-resolution gene expression matrix, aiming to achieve high-precision recovery and visualization of spatial gene expression.
Attached Figure Description
[0016] Figure 1 is a flowchart of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion provided in an embodiment of the present invention; Figure 2 is a flowchart of the algorithm model of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion provided in an embodiment of the present invention; Figure 3 shows the gene visualization results on a human prefrontal cortex dataset provided in an embodiment of the present invention; Figure 4 shows the gene visualization results on a high-resolution mouse brain dataset provided in an embodiment of the present invention; Figure 5 shows the enrichment analysis results of reconstructed gene predictions on a breast cancer dataset provided in an embodiment of the present invention; Figure 6 is a schematic diagram of the spatial gene expression enhancement device based on bidirectional attention multimodal fusion provided in an embodiment of the present invention; Figure 7 is a schematic structural diagram of a spatial gene expression enhancement device based on bidirectional attention multimodal fusion provided in an embodiment of the present invention.
Detailed Implementation
[0017] This invention provides a spatial gene expression enhancement method based on bidirectional attention multimodal fusion. In this invention, the terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0018] For ease of understanding, the specific process of the embodiments of the present invention is described below. Referring to Figure 1, one embodiment of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion in this invention includes: 101. Obtain the target gene expression matrix, wherein the target gene expression matrix includes multiple sampling points; In this embodiment, the target gene expression matrix is obtained from multi-source spatial transcriptome data. Twelve spatial transcriptome datasets from multiple platforms, comprising 22 tissue slices, are integrated, covering diverse tissue types including human prefrontal cortex, cervical cancer, colon cancer, breast cancer, squamous cell carcinoma, liver cancer, and mouse brain and placenta. The datasets involve multiple mainstream detection platforms such as 10xVisium, Visium, ST, STARmap, and Xenium, providing a rich and heterogeneous data source for subsequent modeling. The raw gene expression data and corresponding tissue images are preprocessed, and the raw gene expression matrix is standardized and logarithmically transformed to eliminate scale differences in gene expression levels between different sampling points, ensuring the comparability of expression data between different samples and avoiding interference from expression level fluctuations in subsequent analysis and modeling. Subsequently, the genes are sorted from highest to lowest degree of variation, and the top 300 highly variable genes are selected as the target gene set. This reduces data dimensionality while preserving core biological variation information, thus decreasing computational complexity and focusing on genes with key regulatory significance, providing more targeted input for subsequent model training. Finally, the processed data is integrated into a target gene expression matrix with uniform dimensions. The matrix dimensions are determined by the total number of sampling points across all samples and the number of highly variable genes. This matrix serves as the supervision label for subsequent model training and prediction, providing a standardized and high-quality input foundation for the spatial gene expression enhancement task.
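The preprocessing described in step 101 can be sketched as follows. Library-size normalization with a scale factor of 10,000, `log1p` as the logarithmic transformation, and plain variance as the measure of variation are assumptions made for this example; the source names the steps and the count of 300 genes but not the specific algorithms.

```python
import numpy as np

def build_target_matrix(raw_counts, n_top_genes=300, scale=1e4):
    """Normalize, log-transform, and keep the most variable genes.

    raw_counts: (n_spots, n_genes) array of non-negative counts.
    Returns (target_matrix, kept_gene_indices).
    """
    raw_counts = np.asarray(raw_counts, dtype=float)
    # Library-size normalization (an assumed choice of standardization).
    totals = raw_counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty spots
    normalized = raw_counts / totals * scale
    # Logarithmic transformation; log1p keeps zero counts at zero.
    logged = np.log1p(normalized)
    # Sort genes from highest to lowest variance across spots.
    variances = logged.var(axis=0)
    order = np.argsort(-variances, kind="stable")
    kept = order[:n_top_genes]
    return logged[:, kept], kept
```

The returned matrix has one row per sampling point and one column per selected gene, matching the dimensions described for the target gene expression matrix.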
[0019] 102. Based on the target gene expression matrix, generate a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point; In this embodiment, during the image feature generation stage, for each tissue slice, a histological stained image aligned with the coordinates of the sampling points is acquired. Using the spatial coordinates of each sampling point as the center, a fixed-size local image patch is cropped from the histological image, so that each sampling point is associated with the corresponding local morphological context. The local image patch is input into a pre-trained pathological image feature extraction model for feature encoding to obtain a high-dimensional image representation vector corresponding to each sampling point. After integrating all high-dimensional image representation vectors, a first high-dimensional image representation sequence with regular dimensions is formed, providing visual modality input for subsequent multimodal fusion. In the neighborhood-aware expression feature generation stage, all sampling points are first divided into observation sampling points and prediction sampling points. A pre-filled expression matrix is constructed. The true gene expression value of the observation sampling points is retained. The gene expression mean of the prediction sampling points is calculated based on the K nearest observation neighbors in their coordinate space and approximate filling is completed. The expression values of all sampling points are integrated to obtain the pre-filled expression matrix. Then, a nearest neighbor undirected graph is constructed based on the spatial coordinates of the sampling points. Each sampling point is connected to its K nearest neighbor sampling points in a spatial neighborhood. 
On this graph structure, a graph attention network is used to perform attention-weighted aggregation of the pre-filled expression features to capture the expression context information of the local spatial neighborhood. Finally, the first neighborhood-aware gene expression representation sequence is obtained. By mining morphological, structural, and textural information from histological images through a pre-trained pathological image feature extraction model, tissue morphology constraints are provided for gene expression prediction. At the same time, spatial nearest neighbor graphs and graph attention networks are used to finely characterize local neighborhood expression patterns and integrate spatial topological relationships between sampling points, which makes up for the limitations of relying solely on images or expression data. The parallel generation and normalization of multimodal features lay a solid foundation for subsequent cross-attention fusion and global spatial dependency modeling, which not only ensures the spatial consistency and biological interpretability of the data, but also improves the efficiency and stability of model training, and helps to achieve more accurate spatial gene expression enhancement based on bidirectional attention multimodal fusion.
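The attention-weighted aggregation on the nearest neighbor graph can be illustrated with a single-head layer in the style of the standard graph attention network formulation. The inclusion of self-loops, the LeakyReLU slope of 0.2, and the omission of an output nonlinearity are assumptions; the source does not state which graph attention variant or how many layers are used.

```python
import numpy as np

def gat_layer(features, adj, weight, attn_vec, negative_slope=0.2):
    """One single-head graph attention aggregation step.

    features: (n, d_in) pre-filled expression features.
    adj:      (n, n) boolean adjacency of the undirected neighbor graph.
    weight:   (d_in, d_out) shared linear projection.
    attn_vec: (2 * d_out,) attention vector applied to [h_i || h_j].
    """
    h = features @ weight                       # (n, d_out)
    d_out = h.shape[1]
    src = h @ attn_vec[:d_out]                  # term from the center node
    dst = h @ attn_vec[d_out:]                  # term from the neighbor
    logits = src[:, None] + dst[None, :]        # e_ij before activation
    logits = np.where(logits > 0, logits, negative_slope * logits)  # LeakyReLU
    # Attend only to graph neighbors plus the node itself.
    mask = np.asarray(adj, dtype=bool) | np.eye(len(h), dtype=bool)
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ h                          # attention-weighted aggregation
```

Each output row is a convex combination of the projected features of a sampling point and its graph neighbors, which is how local expression context enters the representation.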
[0020] 103. Obtain a pre-trained spatial gene prediction model, input the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, and obtain the first gene expression prediction value corresponding to each sampling point; In this embodiment, the spatial gene prediction model is composed of a cross-attention module, a feature fusion module, a Transformer encoder module, and a gene expression prediction module connected sequentially. The cross-attention module includes a first attention sub-module and a second attention sub-module, which are connected in parallel to form a bidirectional interactive structural basis. During the model training phase, the true gene expression values in the preprocessed target gene expression matrix are used as supervision labels. The first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence corresponding to the sampling points are used as inputs. Parameter optimization is completed by minimizing the loss function between the predicted value and the true expression value, and the learnable parameters of each module are updated synchronously, enabling the model to gradually learn the accurate mapping relationship between multimodal inputs and gene expression profiles. In the prediction phase, the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence are first input into the cross-attention module. 
Based on the first attention submodule, cross-attention calculation is performed using the first high-dimensional image representation sequence as the query and the first neighborhood-aware gene expression representation sequence as the key and value, resulting in the first interactive feature sequence, realizing the transfer and constraint of image modality information to the neighborhood expression modality. Based on the second attention submodule, cross-attention calculation is performed using the first neighborhood-aware gene expression representation sequence as the query and the first high-dimensional image representation sequence as the key and value, resulting in the second interactive feature sequence, completing the reverse interaction and complementarity from the neighborhood expression modality to the image modality. Subsequently, the two types of interactive feature sequences are input into the feature fusion module for concatenation and refinement through a feedforward network to obtain a multimodal fusion feature sequence, realizing the deep integration and nonlinear transformation of the two types of modality features. Then, the multimodal fusion feature sequence is input into the Transformer encoder module to model the global spatial dependency relationship between all sampling points, complete the global context enhancement processing, and obtain the global context-enhanced feature sequence, capturing the long-distance spatial expression pattern at the organizational level. Finally, the global context-enhanced feature sequence is input into the gene expression prediction module, and the first gene expression prediction value corresponding to each sampling point is obtained through multilayer perceptron mapping. 
Deep interaction and bidirectional constraints between the image and the neighborhood expression modalities are achieved through bidirectional cross-attention, improving the efficiency and complementarity of multimodal information utilization. The Transformer encoder module efficiently models long-distance spatial dependencies, making up for the shortcomings of traditional structures in global dependency modeling.
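The stages that follow cross-attention (fusion, global context enhancement, and prediction) can be sketched as a forward pass. This is a sketch under stated assumptions: one post-norm encoder layer with single-head self-attention, ReLU activations, layer normalization without learned scale or shift, and the parameter names in the dictionary `p` are all hypothetical, since the source does not give layer counts, widths, or activation functions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over all sampling points."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def predict_expression(first_seq, second_seq, p):
    # Feature fusion: concatenation refined by a feedforward network.
    fused = np.concatenate([first_seq, second_seq], axis=-1)
    fused = relu(fused @ p["fuse_w1"]) @ p["fuse_w2"]
    # Transformer encoder layer: self-attention and feedforward sublayers,
    # each with a residual connection followed by layer normalization.
    x = layer_norm(fused + self_attention(fused, p["w_q"], p["w_k"], p["w_v"]))
    x = layer_norm(x + relu(x @ p["ffn_w1"]) @ p["ffn_w2"])
    # Gene expression prediction module: a two-layer perceptron.
    return relu(x @ p["head_w1"]) @ p["head_w2"]
```

In training, the output of `predict_expression` would be compared with the true expression values through a loss function; the source does not name the loss, so none is shown here.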
[0021] 104. Using a polar coordinate system spatial interpolation strategy, multiple interpolation points are generated based on the multiple sampling points; In this embodiment, the interpolation points are generated to fill the gaps in tissue coverage left by the sparse sampling of the original spatial transcriptome data, thereby improving the resolution of the spatial gene expression map. Based on the first spatial coordinates corresponding to each sampling point, the Euclidean distance between two adjacent sampling points is calculated. This distance is used as a benchmark to determine the sampling distance of the interpolation point relative to the original sampling point, ensuring that the spacing between the interpolation point and the original sampling point matches the original sampling density, avoiding both information redundancy and cross-boundary interference. Centered on each sampling point, the complete circumference is uniformly divided into several directions according to the polar coordinate modeling rules, and multiple interpolation points are generated at the preset sampling distance. This ensures that the interpolation points are symmetrically and uniformly distributed around the original sampling point. These interpolation points can be regarded as theoretical high-resolution observation points that fill the gaps between the original sampling points and alleviate the loss of key microenvironment information caused by sparse sampling. This strategy improves the resolution and coverage of spatial gene expression maps while preserving the original tissue structure and spatial expression patterns. 
The uniformly distributed interpolation points ensure the spatial continuity and structural consistency of the enhanced expression map, avoiding distribution disorder or information distortion during the interpolation process. At the same time, high-density expression mapping can be achieved without additional improvement in experimental resolution, which greatly reduces detection costs and cycle time, and provides more complete and fine-grained data support for subsequent high-precision spatial gene expression analysis and biological function interpretation.
[0022] 105. Based on the spatial gene prediction model, predict the second gene expression prediction value corresponding to each interpolation point; In this embodiment, after generating interpolation points, a second local image patch and second spatial coordinates corresponding to each interpolation point are first obtained. The second local image patch is extracted from the local region corresponding to the interpolation point in the whole-section histological image and carries the tissue morphology information at that location. The second spatial coordinates specify the spatial location of the interpolation point in the tissue section. Subsequently, the second local image patch is input into the pre-trained pathological image feature extraction model for feature encoding to obtain a second high-dimensional image representation sequence. Simultaneously, neighborhood information injection processing is performed based on the second spatial coordinates to generate a second neighborhood-aware gene expression representation sequence. Next, the second high-dimensional image representation sequence and the second neighborhood-aware gene expression representation sequence are input into the spatial gene prediction model. The model sequentially completes bidirectional cross-attention interaction, multimodal feature fusion, and global context enhancement processing through its internal modules. Finally, the gene expression prediction module maps the result to obtain the second gene expression prediction value corresponding to each interpolation point. By introducing histological image features and spatial neighborhood constraints, the gene expression prediction of interpolation points closely matches the tissue morphology and spatial topology, avoiding spatial distortion and biological bias that may be caused by pure numerical interpolation. 
While effectively improving the resolution of spatial gene expression maps, it completely preserves the spatial distribution characteristics of the original tissue structure, enhances the coverage and continuity of the expression map, and provides more fine-grained and complete data support for subsequent high-precision spatial transcriptomics analysis.
[0023] 106. Integrate the multiple first gene expression prediction values and the multiple second gene expression prediction values to obtain a high-resolution gene expression matrix.
[0024] In this embodiment, multiple first gene expression prediction values and multiple second gene expression prediction values are integrated to obtain a high-resolution gene expression matrix. Based on a unified spatial coordinate system, the prediction results of the original sampling points and interpolation points are precisely aligned and orderly fused in the tissue space. Through the standardized splicing and spatial consistency calibration of the two types of prediction values, a gene expression data set with complete coverage and dense spatial distribution is formed. While preserving the true expression characteristics of the original sampling points, the spatial gaps caused by sparse sampling are filled by the prediction results of interpolation points, eliminating data discontinuities and regional missing problems. Finally, a high-resolution gene expression matrix with both completeness and fine granularity is constructed. This process aims to achieve high-precision restoration and visualization of spatial gene expression. By organically fusing multi-source prediction results, it accurately restores the true spatial distribution pattern of gene expression within tissues, weakens the information bias caused by sparse sampling and experimental limitations, improves the detail depiction and spatial continuity of expression maps, and clearly presents complex gene expression spatial patterns. At the same time, it enhances the biological reliability and structural rationality of expression data, provides solid support for in-depth mining and intuitive display of spatial transcriptome data, effectively meets the practical application needs of high-precision spatial expression reconstruction and visualization analysis, and helps to efficiently analyze key biological issues such as tissue microenvironment and cellular heterogeneity.
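The integration step described above can be illustrated with a minimal sketch. It assumes that both sets of predictions are stored as lists of per-point expression vectors alongside their coordinates in a shared coordinate system, and that, if an interpolation point happens to coincide with an original sampling point, the prediction for the original sampling point is kept. The function name `integrate_predictions` and this tie rule are hypothetical and are not taken from the embodiment.

```python
def integrate_predictions(sample_coords, sample_pred, interp_coords, interp_pred):
    """Merge predictions for original sampling points and interpolation points.

    Returns (coords, matrix): coordinates sorted in a unified coordinate
    system and the matching rows of the dense expression matrix.
    """
    merged = {}
    # Insert interpolation-point predictions first ...
    for coord, vec in zip(interp_coords, interp_pred):
        merged[coord] = list(vec)
    # ... so that original sampling points take precedence on coincident
    # coordinates (an assumption made for this sketch).
    for coord, vec in zip(sample_coords, sample_pred):
        merged[coord] = list(vec)
    coords = sorted(merged)
    matrix = [merged[c] for c in coords]
    return coords, matrix


# Toy example: two sampling points, two interpolation points, one gene.
dense_coords, dense_matrix = integrate_predictions(
    sample_coords=[(0, 0), (2, 0)],
    sample_pred=[[1.0], [3.0]],
    interp_coords=[(1, 0), (2, 0)],
    interp_pred=[[2.0], [9.0]],
)
```

In this toy case the point at (1, 0) fills the gap between the two sampling points, and the coincident interpolation point at (2, 0) is discarded in favor of the sampling point.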
[0025] As shown in Figure 2, the spatial gene expression enhancement method based on bidirectional attention multimodal fusion in this embodiment of the invention relies on a standardized multimodal interaction and spatial enhancement framework. In the neighborhood-aware feature generation stage, an initial feature map is constructed using the relationship between the original gene expression and the position coordinates of each sampling point. Subsequently, a graph attention neural network is used to aggregate neighborhood information to obtain the neighborhood-aware gene expression representation sequence of each sampling point. In the model training stage, the histological image encoding results corresponding to the original sparse expression points are combined with the neighborhood features. The deep fusion of multimodal information is completed through a cross-attention module, the global spatial dependencies between all sampling points are then modeled through a Transformer encoder module, and the predicted gene expression values for the known sampling points are output. The loss function is calculated under the supervision of the actual gene expression values, and the model parameters are iteratively optimized. In the testing or interpolation prediction stage, the trained model is generalized to unmeasured locations or interpolation points. Local histological image features and neighborhood-aware gene expression representation sequences at the corresponding locations are extracted and input into the model to obtain the gene expression prediction results at those locations. 
Finally, the first gene expression prediction value of the original sampling point and the second gene expression prediction value of the interpolation point are integrated to obtain the enhanced high-resolution gene expression map, thus fully realizing the enhancement process from sparse spatial transcriptome data to high-resolution spatial gene expression.
[0026] Please see Figure 3, Figure 4, and Figure 5. To further verify the performance and biological value of this method on different datasets and tissue types, multiple sets of experimental verification and visualization analysis were conducted. As shown in Figure 3, on slice 151507 of the human prefrontal cortex dataset, the original gene map was first downsampled and then restored using this model. The marker genes CALM2, CALM3, GFAP, ENC1, and CKB were selected for visualization. The performance of this method was compared with existing methods such as STAGE, DeepSpaCE, ST-Net, HisToSGE, SpaViT, and DIST, visually demonstrating the gene expression results generated by the model. As shown in Figure 4, simulated sampling points were constructed on mouse brain slices from the high-resolution Visium HD dataset, and this model was used to predict gene expression in unmeasured regions. The marker genes Cryab, Rgs9, TMeff2, Ptprn, Slc17a7, and Hpca were selected for visualization. The performance of this method was likewise compared with that of STAGE, DeepSpaCE, ST-Net, HisToSGE, SpaViT, and DIST, validating the model's predictive ability on high-resolution platform data. Figure 5 demonstrates the effectiveness of the SpaBiT method in improving the clarity of spatial gene expression and uncovering biologically significant pathways: A. SpaBiT was used to reconstruct the original gene expression data of breast cancer slides, and the differentially expressed genes after reconstruction were identified and subjected to enrichment analysis; B and D. Most of the top ten biological pathways identified from the gene expression data after SpaBiT reconstruction are related to the disease; C and E. The biological pathways identified from the gene expression data after SpaBiT reconstruction are markedly more significant than those from the original data, further highlighting the advantages of this method in uncovering biological functions.
[0027] In this embodiment of the invention, obtaining the target gene expression matrix includes: obtaining an original gene expression matrix; standardizing the original gene expression matrix using a standardization algorithm to obtain a standardized gene expression matrix; performing a logarithmic transformation on the standardized gene expression matrix using a logarithmic transformation algorithm to obtain a transformed gene expression matrix; obtaining a preset sorting rule; sorting multiple genes in the transformed gene expression matrix based on the sorting rule; and selecting a preset number of target genes from the multiple genes to construct the target gene expression matrix.
[0028] In this embodiment, the process of obtaining the target gene expression matrix is based on multi-source spatial transcriptome data. Twelve spatial transcriptome datasets from multiple platforms, comprising twenty-two tissue slices, are integrated, and the original gene expression matrix and corresponding tissue images are acquired simultaneously. First, the original gene expression matrix is standardized using a standardization algorithm, specifically the Z-score standardization algorithm. This transforms the expression value of each gene into a distribution with a mean of 0 and a variance of 1, eliminating scale differences in gene expression levels between different sampling points and ensuring a unified numerical benchmark for expression data from different samples, thus guaranteeing the comparability of cross-sample data. Next, a logarithmic transformation algorithm is used to perform a logarithmic transformation on the standardized gene expression matrix, specifically the natural logarithm transformation algorithm. This performs a logarithmic transformation on each expression value, avoiding numerical anomalies caused by taking the logarithm of zero values while compressing the influence of extremely high expression values. This makes the gene expression distribution more closely match the distribution assumptions for subsequent model training, optimizing the statistical characteristics and stability of the data.
[0029] Subsequently, a preset sorting rule is obtained that sorts genes from highest to lowest variability. The variability can be calculated using the coefficient of variation or the variance of gene expression. The multiple genes in the transformed gene expression matrix are then sorted, and the top 300 highly variable genes are selected as the target gene set, which is used to construct the target gene expression matrix. The processed data are then unified into the form $Y \in \mathbb{R}^{n \times G}$, where $\mathbb{R}$ represents the set of real numbers, $n$ represents the total number of sampling points in all samples, and $G = 300$ represents the number of highly variable genes. This matrix serves as the supervisory label for model training, providing a standardized input basis for subsequent prediction tasks. The combined processing of standardization and natural logarithm transformation effectively eliminates differences in expression scales and interference from extreme values, optimizes data distribution characteristics, and improves the stability and efficiency of model training. Highly variable genes are selected based on the coefficient of variation or variance, which reduces the data dimensions while retaining core biological variation information. This reduces computational complexity, focuses on genes with key regulatory significance, and avoids redundant information interfering with model training.
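The gene selection step can be sketched as follows. This is only an illustration under stated assumptions: the expression matrix is held as a list of rows (one row per sampling point, one column per gene), variability is measured by the population variance (the coefficient of variation mentioned above would be an equally valid choice), and the function name `select_highly_variable_genes` is hypothetical. The standardization and logarithmic transformation steps are not reproduced here.

```python
from statistics import pvariance


def select_highly_variable_genes(matrix, gene_names, top_k=300):
    """Rank genes by variance across sampling points and keep the top_k.

    matrix: list of rows, one per sampling point, one column per gene.
    Returns (selected_gene_names, reduced_matrix).
    """
    n_genes = len(gene_names)
    # Variance of each gene (column) across all sampling points.
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    # Sort gene indices from highest to lowest variability.
    order = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    keep = order[:top_k]
    selected = [gene_names[g] for g in keep]
    reduced = [[row[g] for g in keep] for row in matrix]
    return selected, reduced


# Toy example: 4 sampling points, 3 genes, keep the 2 most variable genes.
expr = [
    [0.0, 1.0, 5.0],
    [0.1, 1.0, 0.0],
    [0.0, 1.0, 9.0],
    [0.1, 1.0, 2.0],
]
genes = ["geneA", "geneB", "geneC"]
top_genes, target = select_highly_variable_genes(expr, genes, top_k=2)
```

With `top_k=300` the reduced matrix has the $n \times 300$ shape described above, provided the input contains at least 300 genes.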
[0030] In this embodiment of the invention, generating a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix includes: acquiring a first local image patch corresponding to each sampling point in a whole-section histological image; acquiring a pre-trained pathological image feature extraction model, inputting each first local image patch into the pathological image feature extraction model for feature encoding to obtain multiple high-dimensional image representation vectors; integrating the multiple high-dimensional image representation vectors to obtain the first high-dimensional image representation sequence; and performing neighborhood information injection processing on the target gene expression matrix to obtain the first neighborhood-aware gene expression representation sequence.
[0031] In this embodiment, during the image modality processing stage, for each tissue slice, a whole-section histological image aligned with the coordinates of the sampling points is acquired. Using the spatial coordinates of each sampling point as the center, a first local image patch of a fixed size is cropped from the histological image, thus associating each sampling point with its corresponding local morphological context. The image patch corresponding to the $i$-th sampling point is denoted as $P_i \in \mathbb{R}^{H \times W \times 3}$, $i = 1, \dots, n$, where $n$ indicates the total number of sampling points in the slice; $W$ and $H$ represent the width and height of the image patch, respectively, and can be set to 224 to fit the input size of the pre-trained pathological image feature extraction model; and 3 represents the three RGB channels of the image, which carry the tissue morphology information of the local region. The first local image patch is then input into the pre-trained pathological image feature extraction model, such as the UNI model, for feature encoding, obtaining a high-dimensional image representation vector corresponding to each sampling point: $f_i = \mathrm{Enc}(P_i)$, $f_i \in \mathbb{R}^{D}$, where $f_i$ indicates the image feature vector of the $i$-th sampling point; $\mathrm{Enc}(\cdot)$ is the feature extraction model pre-trained on large-scale pathological image datasets, which possesses powerful visual feature mining capabilities; and $D$ is the feature dimension, which can be set to 1024 and encapsulates the texture, structure, and morphological features of the local region corresponding to the sampling point. The high-dimensional image representation vectors of all sampling points are integrated to obtain a dimensionally regular first high-dimensional image representation sequence, providing visual modal input for subsequent multimodal fusion. In the expression modality processing stage, neighborhood information injection is performed based on the target gene expression matrix. 
By constructing a spatial neighborhood graph and using a graph attention network to aggregate local expression context, a first neighborhood-aware gene expression representation sequence is obtained. This sequence characterizes the gene expression association pattern of each sampling point within its spatial neighborhood. By mining morphological information from histological images through a pre-trained pathological image feature extraction model, tissue morphology constraints are provided for gene expression prediction. At the same time, neighborhood information is injected to finely characterize the spatial topology and expression associations between sampling points, which makes up for the limitations of relying solely on images or expression data. The parallel generation and normalization of the two types of sequences lay a solid foundation for the cross-attention interaction and global spatial dependency modeling of the subsequent model. This not only ensures the spatial consistency and biological interpretability of the data, but also improves the efficiency and stability of model training, and helps to achieve more accurate spatial gene expression enhancement based on bidirectional attention multimodal fusion.
[0032] In this embodiment of the invention, the step of injecting neighborhood information into the target gene expression matrix to obtain the first neighborhood-aware gene expression characterization sequence includes: obtaining the first spatial coordinates corresponding to each sampling point based on the target gene expression matrix, and dividing all sampling points into multiple observation sampling points and prediction sampling points based on the first spatial coordinates; determining a group of neighbor sampling points corresponding to each prediction sampling point based on the multiple observation sampling points; calculating the mean gene expression value based on the true gene expression value of the neighbor sampling point group, and pre-filling the corresponding prediction sampling points based on the mean gene expression value to determine the initial expression value of each prediction sampling point; integrating the true gene expression values of all observation sampling points and the initial expression values of all prediction sampling points to obtain a pre-filled expression matrix; determining the K nearest neighbor sampling points corresponding to each sampling point according to the first spatial coordinates of the multiple sampling points, and establishing a spatial neighborhood connection between each sampling point and its K nearest neighbor sampling points to construct a nearest neighbor undirected graph; and performing attention-weighted aggregation processing on the pre-filled expression matrix using a graph attention network on the nearest neighbor undirected graph to obtain the first neighborhood-aware gene expression characterization sequence.
[0033] In this embodiment, the first spatial coordinates corresponding to each sampling point are first obtained based on the target gene expression matrix. Then, all sampling points are divided into observation sampling points and prediction sampling points based on the first spatial coordinates. The observation sampling points constitute the training set, and the prediction sampling points constitute the prediction set. The union of the two covers all sampling points, and the intersection is empty, ensuring the logical independence of training and prediction.
[0034] Next, based on the multiple observation sampling points, a group of neighbor sampling points corresponding to each sampling point to be predicted is determined, that is, the K nearest neighbor sampling points of the sampling point to be predicted in the observation set. The mean gene expression is calculated based on the true gene expression values of the neighbor sampling point group, and the corresponding sampling point to be predicted is pre-filled with it to determine the initial expression value. The specific formula is as follows: $\tilde{y}_p = \frac{1}{K} \sum_{j \in \mathcal{N}_K(p)} y_j$, where $\tilde{y}_p$ indicates the pre-filled representation vector of the sampling point to be predicted $p$, $K$ indicates the number of neighbor sampling points, $\mathcal{N}_K(p)$ indicates the index set of the nearest neighbor sampling points of the sampling point to be predicted $p$ in the observation set, and $y_j$ indicates the true gene expression vector of the observation sampling point $j$. Mean pre-filling assigns an initial expression context to the sampling points to be predicted, avoiding modeling bias caused by missing expression values at those sampling points. Then, the true gene expression values of all observation sampling points are integrated with the initial expression values of all sampling points to be predicted to obtain a pre-filled expression matrix $\tilde{Y} \in \mathbb{R}^{n \times G}$, where $n$ represents the total number of sampling points and $G$ indicates the number of target genes. The observation sampling points directly retain their true expression, $\tilde{y}_o = y_o$, where $\tilde{y}_o$ indicates the pre-filled representation vector of the observation sampling point $o$ and $y_o$ indicates its true gene expression vector. Then, based on the first spatial coordinates of the multiple sampling points, the K nearest neighbor sampling points corresponding to each sampling point are determined, spatial neighborhood connections are established, and a nearest-neighbor undirected graph is constructed. 
On this nearest-neighbor undirected graph, a graph attention network is used to perform attention-weighted aggregation on the pre-filled expression matrix. The specific formula is as follows: $e_i = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W \tilde{y}_j\right)$, where $e_i$ indicates the neighborhood-aware gene expression representation vector of sampling point $i$; $\mathcal{N}(i)$ indicates the neighborhood set of sampling point $i$; $W$ represents a learnable linear transformation matrix used to perform dimensionality transformation on the pre-filled representation vectors of neighbor sampling points; $\alpha_{ij}$ is the attention coefficient, obtained through adaptive learning by the model, which quantifies the strength of the expression association between sampling point $i$ and its neighbor $j$; $\tilde{y}_j$ indicates the pre-filled representation vector of sampling point $j$; and $\sigma(\cdot)$ is a nonlinear activation function, such as ReLU or GELU. Attention-weighted aggregation of the expression features of neighbor sampling points characterizes the expression context of the local spatial neighborhood. Finally, the neighborhood features of all sampling points are integrated to obtain the first neighborhood-aware gene expression representation sequence. This process completes the initial expression information of the sampling points to be predicted through the pre-filling operation, avoiding interference from missing data in modeling. The construction of the nearest-neighbor undirected graph explicitly encodes the spatial topological relationships between sampling points. The weighted aggregation mechanism of the graph attention network can adaptively learn the contribution weights of different neighbors to the current sampling point, accurately capturing the expression patterns of the local spatial neighborhood. 
This preserves the true expression features of the observed sample points while finely characterizing the local expression context of the sample points to be predicted, providing input rich in spatial neighborhood information for subsequent multimodal fusion and gene expression prediction. This improves the model's ability to model local spatial heterogeneity, enhances the spatial continuity and biological interpretability of the prediction results, and the regular graph structure design ensures the model's generalization ability and robustness under different sampling densities and tissue types.
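The pre-filling and graph construction steps described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it covers only the mean pre-filling and the construction of the K-nearest-neighbor undirected graph, and it leaves out the learned attention coefficients of the graph attention network. The function names and the tie-breaking rule (equidistant neighbors are taken in index order) are assumptions made for the example.

```python
import math


def k_nearest(idx, coords, candidates, k):
    """Indices of the k candidates closest (Euclidean) to point idx."""
    others = [j for j in candidates if j != idx]
    # Stable sort: equidistant candidates keep their index order.
    others.sort(key=lambda j: math.dist(coords[idx], coords[j]))
    return others[:k]


def prefill(expr, coords, observed, k):
    """Keep observed rows; fill each remaining row with the mean
    expression of its k nearest observed neighbors."""
    obs = sorted(observed)
    filled = []
    for i in range(len(coords)):
        if i in observed:
            filled.append(list(expr[i]))
        else:
            nbrs = k_nearest(i, coords, obs, k)
            n_genes = len(expr[nbrs[0]])
            filled.append([sum(expr[j][g] for j in nbrs) / len(nbrs)
                           for g in range(n_genes)])
    return filled


def knn_undirected_edges(coords, k):
    """Undirected edge set linking every point to its k nearest points."""
    edges = set()
    everyone = range(len(coords))
    for i in everyone:
        for j in k_nearest(i, coords, everyone, k):
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy example: four points on a line; point 1 has no measured expression.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
expr = [[2.0, 4.0], None, [6.0, 8.0], [10.0, 0.0]]
filled = prefill(expr, coords, observed={0, 2, 3}, k=2)
edges = knn_undirected_edges(coords, k=1)
```

Here point 1 receives the mean of its two nearest observed neighbors (points 0 and 2), while the observed points keep their true expression vectors unchanged.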
[0035] In this embodiment, the spatial gene prediction model includes a cross-attention module, a feature fusion module, a Transformer encoder module, and a gene expression prediction module, which are sequentially connected. The cross-attention module includes a first attention submodule and a second attention submodule, which are connected. The step of inputting the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction to obtain a first gene expression prediction value corresponding to each sampling point includes: based on the first attention submodule, using the first high-dimensional image representation sequence as the query and the first neighborhood-aware gene expression representation sequence as the key and value, performing a first cross-attention calculation to obtain a first interactive feature sequence; based on the second attention submodule, using the first neighborhood-aware gene expression representation sequence as the query and the first high-dimensional image representation sequence as the key and value, performing a second cross-attention calculation to obtain a second interactive feature sequence; inputting the first interactive feature sequence and the second interactive feature sequence into the feature fusion module for concatenation to obtain a multimodal fusion feature sequence; and inputting the multimodal fusion feature sequence into the Transformer encoder module for global context enhancement to obtain a global context-enhanced feature sequence. 
The global context-enhanced feature sequence is input into the gene expression prediction module for prediction to obtain a first gene expression prediction value corresponding to each sampling point.
[0036] In this embodiment, the spatial gene prediction model is composed of a cross-attention module, a feature fusion module, a Transformer encoder module, and a gene expression prediction module connected sequentially. The cross-attention module includes a first attention submodule and a second attention submodule, which are connected in parallel to form the structural basis for bidirectional interaction. After the two types of feature sequences are obtained, they are organized into regular sequence forms, specifically: $F = [f_1, f_2, \dots, f_n]$, $E = [e_1, e_2, \dots, e_n]$, where $F$ represents the first high-dimensional image representation sequence, $f_i$ indicates the image feature vector of the $i$-th sampling point, $n$ indicates the total number of sampling points, $E$ represents the first neighborhood-aware gene expression representation sequence, and $e_i$ indicates the neighborhood-aware gene expression representation vector of the $i$-th sampling point.
[0037] The basic form of cross-attention is defined as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$, where $Q$ represents the query matrix, $K$ represents the key matrix, $V$ represents the value matrix, and $d_k$ is the attention head dimension, which is used to scale the attention scores to avoid numerical saturation. Based on the first attention submodule, using the first high-dimensional image representation sequence as the query and the first neighborhood-aware gene expression representation sequence as the key and value, the first cross-attention calculation is performed to obtain the first interactive feature sequence. This enables the transfer of image modality information to, and its constraint on, the neighborhood representation modality. Based on the second attention submodule, using the first neighborhood-aware gene expression representation sequence as the query and the first high-dimensional image representation sequence as the key and value, a second cross-attention calculation is performed to obtain the second interactive feature sequence. This enables the reverse interaction and complementarity between the neighborhood representation modality and the image modality.
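The two cross-attention directions can be illustrated with a minimal single-head sketch of scaled dot-product attention. It assumes that both sequences have already been projected to a common feature dimension, and it omits the learned query, key, and value projection matrices and any multi-head structure that a trained model would normally contain. The variable names `image_seq` and `expr_seq` and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(weights[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out


# Toy sequences for three sampling points with feature dimension 2.
image_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expr_seq = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# First direction: image features query the neighborhood-aware features.
first_interactive = attention(image_seq, expr_seq, expr_seq)
# Second direction: neighborhood-aware features query the image features.
second_interactive = attention(expr_seq, image_seq, image_seq)
```

Because all keys in the first direction are identical in this toy input, the attention weights are uniform and each output row equals the mean of the value vectors, which gives a simple way to check the computation by hand.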
[0038] Then, the first interactive feature sequence $Z^{(1)}$ and the second interactive feature sequence $Z^{(2)}$ are input into the feature fusion module. For each sampling point, the vectors at the corresponding position in $Z^{(1)}$ and $Z^{(2)}$ are concatenated and then passed through a feedforward network to obtain the fused feature. The specific formula is as follows: $u_i = \mathrm{FFN}\left(\left[z^{(1)}_i \,\Vert\, z^{(2)}_i\right]\right)$, where $u_i$ indicates the multimodal fusion feature vector of sampling point $i$, $\mathrm{FFN}(\cdot)$ represents a feedforward neural network, $z^{(1)}_i$ represents the $i$-th row vector of the first interactive feature sequence, $z^{(2)}_i$ represents the $i$-th row vector of the second interactive feature sequence, and $[\cdot \,\Vert\, \cdot]$ represents the vector concatenation operation. The fused features of all sampling points are gathered to obtain the multimodal fusion feature sequence $U = [u_1, u_2, \dots, u_n]$.
[0039] Next, the multimodal fusion feature sequence is input into the Transformer encoder module to model the global spatial dependencies between distant sampling points, resulting in a global context-enhanced feature sequence, where the global enhancement representation of each sampling point $i$ is denoted as $g_i$. Then, the global context-enhanced feature sequence is input into the gene expression prediction module, and the first gene expression prediction value corresponding to each sampling point is obtained through multilayer perceptron mapping. The specific formula is as follows: $\hat{y}_i = \mathrm{MLP}(g_i)$, $\hat{y}_i \in \mathbb{R}^{G}$, where $\hat{y}_i$ indicates the first gene expression prediction value of sampling point $i$, $\mathrm{MLP}(\cdot)$ is a multilayer perceptron, and $G$ indicates the number of target genes.
[0040] During the model training phase, the mean squared error loss function is used as the training objective to supervise the difference between the predicted and true expression of the observation sampling points. The loss function is defined as follows: $\mathcal{L} = \frac{1}{|\mathcal{O}|} \sum_{i \in \mathcal{O}} \lVert \hat{y}_i - y_i \rVert_2^2$, where $\mathcal{L}$ represents the total loss value, $|\mathcal{O}|$ represents the size of the set of observation sampling points $\mathcal{O}$, $\hat{y}_i$ indicates the first gene expression prediction value of sampling point $i$, $y_i$ indicates the true gene expression vector of sampling point $i$, and $\lVert \cdot \rVert_2^2$ is the square of the L2 norm, used to measure the difference between the predicted and true values. By minimizing this loss function, the model learns the mapping relationship between multimodal inputs and spatial gene expression patterns.
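The training objective can be illustrated with a small numerical sketch that evaluates the loss over the observation sampling points only. The function name `observed_mse_loss` and the toy values are hypothetical, and gradient computation and parameter updates are outside the scope of this example.

```python
def observed_mse_loss(pred, truth, observed):
    """Mean, over the observed sampling points, of the squared L2 distance
    between the predicted and true expression vectors."""
    total = 0.0
    for i in observed:
        total += sum((p - t) ** 2 for p, t in zip(pred[i], truth[i]))
    return total / len(observed)


# Toy example: three sampling points, two genes; point 1 is not observed
# and therefore does not contribute to the loss.
pred = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
truth = [[1.0, 0.0], [9.0, 9.0], [2.0, 1.0]]
loss = observed_mse_loss(pred, truth, observed=[0, 2])
```

Point 0 contributes a squared distance of 4 and point 2 contributes 1, so the loss over the two observed points is 2.5.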
[0041] The model evaluation phase uses three types of indicators to quantify predictive performance. The first is the Pearson correlation coefficient (PCC), which measures the linear correlation between predicted and actual values. The calculation formula is as follows: $\mathrm{PCC} = \frac{\mathrm{Cov}(y, \hat{y})}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(\hat{y})}}$, where $\mathrm{Cov}(\cdot, \cdot)$ represents the covariance function, $\mathrm{Var}(\cdot)$ represents the variance function, $y$ represents the observed true gene expression value, and $\hat{y}$ represents the predicted gene expression value. The closer the PCC value is to 1, the stronger the linear correlation between the predicted result and the actual result, and the better the model performance.
[0042] The second is the mean squared error (MSE), which measures the average of the squared differences between the predicted and actual values. The formula is: $\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$, where $m$ represents the total number of samples, $y_i$ indicates the true gene expression value of the $i$-th sample, and $\hat{y}_i$ indicates the predicted gene expression value of the $i$-th sample. The smaller the MSE value, the higher the overall prediction accuracy of the model.
[0043] The third is the mean absolute error (MAE), which is used to assess the average magnitude of the error between the predicted and actual values. The calculation formula is as follows: $\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \lvert y_i - \hat{y}_i \rvert$, where $m$ represents the total number of samples, $y_i$ indicates the true gene expression value of the $i$-th sample, and $\hat{y}_i$ indicates the predicted gene expression value of the $i$-th sample. The smaller the MAE value, the closer the model prediction is to the true value on average, and the better the performance.
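The three evaluation indicators can be computed as in the following minimal sketch, which treats the true and predicted values as flat lists of equal length. The function names and toy values are illustrative only. Note that the PCC is undefined when either list has zero variance, a case this sketch does not handle.

```python
import math


def pcc(y_true, y_pred):
    """Pearson correlation coefficient: Cov / sqrt(Var * Var)."""
    m = len(y_true)
    mean_t = sum(y_true) / m
    mean_p = sum(y_pred) / m
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / m
    var_t = sum((t - mean_t) ** 2 for t in y_true) / m
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / m
    return cov / math.sqrt(var_t * var_p)


def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: predictions are exactly twice the true values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 4.0, 6.0, 8.0]
scores = {"pcc": pcc(y_true, y_pred),
          "mse": mse(y_true, y_pred),
          "mae": mae(y_true, y_pred)}
```

The toy case shows why the indicators are used together: the predictions are perfectly linearly correlated with the true values (PCC of 1), yet the MSE of 7.5 and MAE of 2.5 reveal a systematic error in magnitude.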
[0044] By employing bidirectional cross-attention, deep interaction and bidirectional constraints between image modalities and neighborhood representation modalities are achieved, improving the efficiency and complementarity of multimodal information utilization. The Transformer encoder module efficiently models long-distance spatial dependencies, compensating for the shortcomings of traditional structures in global dependency modeling. The well-organized module design ensures the model's scalability and interpretability. The training process combines real-expression supervision with the mean squared error loss function, enabling the model to maintain high prediction accuracy and spatial consistency even in sparse sampling scenarios. Multi-dimensional evaluation metrics comprehensively quantify model performance from aspects such as correlation and error magnitude, providing objective basis for model optimization and iteration. At the same time, it improves the model's generalization ability and robustness on different tissue types and datasets, laying a reliable foundation for subsequent high-resolution gene expression matrix construction.
[0045] In this embodiment of the invention, the step of using a polar coordinate spatial interpolation strategy to generate multiple interpolation points based on multiple sampling points includes: calculating the Euclidean distance between two adjacent sampling points based on the first spatial coordinates corresponding to each sampling point; determining the sampling distance of the interpolation point relative to the sampling point based on the Euclidean distance; and generating multiple interpolation points at the sampling distance, with each sampling point as the center, along directions uniformly distributed around the circumference.
[0046] In this embodiment, a polar coordinate spatial interpolation strategy is employed to generate multiple interpolation points based on multiple sampling points. This aims to fill in the gaps in the tissue left by the sparse sampling of the original spatial transcriptome, thereby improving the resolution of the spatial gene expression map. First, based on the first spatial coordinates corresponding to each sampling point, the Euclidean distance between two adjacent sampling points is calculated and denoted as $d$. This serves as a benchmark for subsequent interpolation distances, ensuring that the interpolation point spacing matches the original sampling density, thus avoiding information redundancy and preventing out-of-bounds interference. Based on this Euclidean distance, the sampling distance between the interpolation points and the sampling points is determined. A polar coordinate system is then used to model the spatial position of the interpolation points. With the current sampling point $(x_c, y_c)$ as the pole, the $k$-th interpolation point is located at $(x_k, y_k) = (x_c + r\cos\theta_k,\; y_c + r\sin\theta_k)$, where $d$ represents the Euclidean distance between two adjacent sampling points; $r$ represents the radial distance between the interpolation point and the center point (i.e., the current sampling point), whose value is determined from $d$ so that the distance between the interpolation point and the center sampling point matches the original sampling distribution; $\theta_k$ indicates the angular direction of the interpolation point, with a value of $\theta_k = \frac{2\pi k}{N-1}$; the angle index $k$ ranges from 0 to $N-2$, giving $N-1$ possible values; and $N$ is a parameter controlling the number of interpolation directions. This parameter evenly divides the complete circumference into $N-1$ directions, ensuring that the interpolation points are symmetrically and uniformly distributed around the sampling point. 
Subsequently, with each sampling point as the center, multiple interpolation points are generated at a sampling distance r along the uniformly distributed directions of the circumference. These interpolation points can be considered as theoretically high-resolution observation points, which can completely fill the blank areas in the original sampling point distribution and alleviate the problem of missing key microenvironment information caused by sparse sampling. This strategy achieves precise quantification and uniform distribution of interpolation point spatial locations through polar coordinate system modeling. While preserving the original tissue structure and spatial expression patterns, it significantly improves the resolution and coverage of spatial gene expression maps, avoiding spatial distortion or distribution disorder during the interpolation process. At the same time, it can achieve high-density expression mapping without additional experimental resolution enhancement, greatly reducing detection costs and cycle time. It provides more complete and finer-grained data support for subsequent high-precision spatial gene expression analysis and biological function interpretation, enhances the continuity and interpretability of expression maps, and helps to efficiently analyze key biological issues such as tissue microenvironment and cellular heterogeneity.
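The interpolation step described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names are hypothetical, the radial distance is assumed to be a fixed fraction of the adjacent-point distance (`radius_factor`, since the text does not give the value of r), and the adjacent-point distance is assumed to be the smallest pairwise distance between sampling points.

```python
import math


def adjacent_distance(coords):
    """Spacing between adjacent sampling points.

    ASSUMPTION: taken as the smallest pairwise Euclidean distance.
    """
    return min(
        math.dist(a, b)
        for i, a in enumerate(coords)
        for b in coords[i + 1:]
    )


def polar_interpolation_points(center, d, n_param, radius_factor=0.5):
    """Generate N - 1 interpolation points around one sampling point.

    center        : (x, y) of the current sampling point
    d             : Euclidean distance between two adjacent sampling points
    n_param       : the parameter N; directions are indexed k = 0 .. N-2
    radius_factor : ASSUMPTION, r = radius_factor * d (r is not specified
                    in the text)
    """
    if n_param < 2:
        raise ValueError("N must be at least 2 to obtain one direction")
    x_c, y_c = center
    r = radius_factor * d
    points = []
    for k in range(n_param - 1):
        theta = 2.0 * math.pi * k / (n_param - 1)  # uniform angular step
        points.append((x_c + r * math.cos(theta), y_c + r * math.sin(theta)))
    return points


# Example: three spots on a 100-unit grid, six directions (N = 7).
spots = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
spacing = adjacent_distance(spots)
interpolated = polar_interpolation_points(spots[0], spacing, n_param=7)
```

The conversion from polar to Cartesian coordinates keeps the interpolation points in the same coordinate system as the sampling points, so image patches can later be cropped at those positions.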
[0047] In this embodiment of the invention, predicting the second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model includes: obtaining a second local image patch and a second spatial coordinate corresponding to each interpolation point; converting the second local image patch into a second high-dimensional image representation sequence, and generating a second neighborhood-aware gene expression representation sequence based on the second spatial coordinate; inputting the second high-dimensional image representation sequence and the second neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, thereby obtaining the second gene expression prediction value corresponding to each interpolation point.
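Because an interpolation point has no measured expression, the neighborhood-aware representation has to start from an estimate. The sketch below illustrates one plausible reading of the neighborhood information injection: a point to be predicted is pre-filled with the mean expression of its nearest observed points, and a K-nearest-neighbor undirected graph is built over all points. The function names and the `None` convention for unobserved points are hypothetical, at least one observed point is assumed to exist, and the graph attention aggregation that follows in the method is not shown.

```python
import math


def k_nearest(index, coords, k, candidates=None):
    """Indices of the k candidates closest to coords[index], excluding itself."""
    pool = candidates if candidates is not None else range(len(coords))
    others = [j for j in pool if j != index]
    return sorted(others, key=lambda j: math.dist(coords[index], coords[j]))[:k]


def prefill_expression(coords, expression, k):
    """Pre-fill points to be predicted with the mean of nearby observed points.

    expression[i] is a list of gene values for an observed point and None
    for a point to be predicted (hypothetical convention).
    """
    observed = [i for i, values in enumerate(expression) if values is not None]
    filled = []
    for i, values in enumerate(expression):
        if values is not None:
            filled.append(list(values))
            continue
        group = k_nearest(i, coords, k, observed)
        neighbor_rows = [expression[j] for j in group]
        filled.append([sum(col) / len(group) for col in zip(*neighbor_rows)])
    return filled


def knn_undirected_graph(coords, k):
    """Edge list (i, j) with i < j linking each point to its k nearest points."""
    edges = set()
    for i in range(len(coords)):
        for j in k_nearest(i, coords, k):
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)
```

Storing each edge once with the smaller index first makes the graph undirected even when the nearest-neighbor relation is not symmetric.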
[0048] In this embodiment, after generating interpolation points, a second local image patch and second spatial coordinates corresponding to each interpolation point are first obtained. The second local image patch is extracted from the local region corresponding to the interpolation point in the whole-slice histological image, carrying the tissue morphology information of that location. The second spatial coordinates clarify the spatial location of the interpolation point in the tissue slice, sharing a unified spatial coordinate system with the original sampling points, ensuring spatial alignment and logical consistency in subsequent modeling. Subsequently, the second local image patch is input into a pre-trained pathological image feature extraction model for feature encoding to obtain a second high-dimensional image representation sequence. This sequence encapsulates the texture, structure, and morphological features of the local region corresponding to the interpolation point. Simultaneously, based on the second spatial coordinates, neighborhood information injection processing is performed to generate a second neighborhood-aware gene expression representation sequence. This sequence characterizes the gene expression association pattern of the interpolation point in the spatial neighborhood. The generation methods of the two types of sequences are completely consistent with the processing flow of the original sampling points, ensuring the homology and comparability of the feature inputs of the interpolation points and the original sampling points. Next, the second high-dimensional image representation sequence and the second neighborhood-aware gene expression representation sequence are input into the spatial gene prediction model. 
The model reuses the parameters optimized for the original sampling points, and sequentially achieves deep interaction between the image modality and the neighborhood expression modality through a bidirectional cross-attention module. A feature fusion module integrates multimodal information, and a Transformer encoder module models global spatial dependencies. Finally, the gene expression prediction module maps the predicted second gene expression value corresponding to each interpolation point. This process, while preserving the spatial distribution characteristics of the original tissue structure, generalizes the trained model to the interpolation point scenario. By introducing histological image features and spatial neighborhood constraints, the gene expression prediction at the interpolation points closely matches the tissue morphology and spatial topology, avoiding spatial distortion and biological bias that may be caused by pure numerical interpolation. This effectively improves the resolution and coverage of the spatial gene expression map, enhances the continuity and integrity of the expression map, and provides finer-grained and more reliable data support for subsequent high-precision spatial transcriptomics analysis. Simultaneously, it verifies the model's generalization ability and robustness at unknown observation points, contributing to the high-precision recovery and visualization of spatial gene expression.
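The bidirectional interaction described above can be illustrated with a dependency-free sketch. It is a simplification under stated assumptions: single-head scaled dot-product attention, learned query/key/value projections omitted (so both sequences must share one feature dimension), and fusion by feature-wise concatenation. The function names are hypothetical, and the Transformer encoder and prediction head that follow in the model are not shown.

```python
import math


def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key_value):
    """Each query row attends over all rows of key_value.

    ASSUMPTION: identity projections instead of learned W_Q, W_K, W_V.
    """
    dim = len(query[0])
    transposed = [list(col) for col in zip(*key_value)]
    scores = matmul(query, transposed)  # Q K^T
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, key_value)


def bidirectional_fusion(image_seq, gene_seq):
    """Run both attention directions and concatenate per sampling point."""
    image_queries_gene = cross_attention(image_seq, gene_seq)
    gene_queries_image = cross_attention(gene_seq, image_seq)
    return [a + b for a, b in zip(image_queries_gene, gene_queries_image)]
```

With one token per point in each modality, the fused sequence keeps the number of points and doubles the feature width, which is the shape a subsequent encoder would consume.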
[0049] The spatial gene expression enhancement method based on bidirectional attention multimodal fusion in the embodiments of the present invention has been described above. The spatial gene expression enhancement device based on bidirectional attention multimodal fusion in the embodiments of the present invention is described below. Referring to Figure 6, one embodiment of the spatial gene expression enhancement device based on bidirectional attention multimodal fusion in this invention includes: a matrix acquisition module 601, used to acquire the target gene expression matrix, wherein the target gene expression matrix includes multiple sampling points; a sequence generation module 602, used to generate a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix; a first prediction module 603, used to acquire a pre-trained spatial gene prediction model, input the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, and obtain the first gene expression prediction value corresponding to each sampling point; an interpolation point generation module 604, used to generate multiple interpolation points based on multiple sampling points using a polar coordinate system spatial interpolation strategy; a second prediction module 605, used to predict the second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model; and an integration module 606, used to integrate multiple first gene expression prediction values and multiple second gene expression prediction values to obtain a high-resolution gene expression matrix.
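The role of the integration module can be sketched as a merge of the two sets of predictions into one matrix indexed by spatial position. The text does not specify how coinciding positions are handled, so averaging them is an assumption made here, as are the function name and the `(coordinate, values)` input layout.

```python
from collections import defaultdict


def integrate_predictions(first, second, ndigits=6):
    """Merge sampling-point and interpolation-point predictions.

    first, second : lists of ((x, y), [expression value per gene])
    ndigits       : coordinates are rounded to this precision before merging
    ASSUMPTION: predictions that share a position are averaged.
    Returns (coords, matrix) with rows sorted by coordinate.
    """
    buckets = defaultdict(list)
    for (x, y), values in list(first) + list(second):
        buckets[(round(x, ndigits), round(y, ndigits))].append(values)
    coords = sorted(buckets)
    matrix = [
        [sum(col) / len(col) for col in zip(*buckets[c])]
        for c in coords
    ]
    return coords, matrix
```

Keeping the coordinate list alongside the matrix lets the result be rendered directly as a dense expression map.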
[0050] Based on the same ideas as the methods in the above embodiments, the apparatus provided in this application can implement the methods in the above embodiments.
[0051] Figure 6 above describes the spatial gene expression enhancement device based on bidirectional attention multimodal fusion in the embodiments of the present invention in detail from the perspective of modular functional entities. The device is described in detail below from the perspective of hardware processing.
[0052] Figure 7 is a schematic diagram of a spatial gene expression enhancement device 700 based on bidirectional attention multimodal fusion, provided by an embodiment of the present invention. The spatial gene expression enhancement device 700 based on bidirectional attention multimodal fusion can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) for storing application programs 733 or data 732. The memory 720 and the storage media 730 can be transient or persistent storage. The program stored in the storage media 730 may include one or more modules (not shown in the figure), each module including a series of instruction operations for the spatial gene expression enhancement device 700 based on bidirectional attention multimodal fusion. Furthermore, the processor 710 may be configured to communicate with the storage media 730 and to execute, on the spatial gene expression enhancement device 700 based on bidirectional attention multimodal fusion, the series of instruction operations in the storage media 730, so as to implement the steps of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion provided in the above-described method embodiments.
[0053] The spatial gene expression enhancement device 700 based on bidirectional attention multimodal fusion may further include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will understand that the device structure illustrated in Figure 7 does not constitute a limitation on spatial gene expression enhancement devices based on bidirectional attention multimodal fusion. Such a device may include more or fewer components than illustrated, combine certain components, or have different component arrangements.
[0054] The present invention also provides a computer-readable storage medium, which can be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform steps of a spatial gene expression enhancement method based on bidirectional attention multimodal fusion.
[0055] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the system, device, or unit described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0056] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0057] Finally, it should be noted that the above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A spatial gene expression enhancement method based on bidirectional attention multimodal fusion, characterized in that it comprises: obtaining a target gene expression matrix, wherein the target gene expression matrix includes multiple sampling points; generating, based on the target gene expression matrix, a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point; obtaining a pre-trained spatial gene prediction model, inputting the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, and obtaining a first gene expression prediction value corresponding to each sampling point; generating multiple interpolation points based on the multiple sampling points using a polar coordinate system spatial interpolation strategy; predicting, based on the spatial gene prediction model, a second gene expression prediction value corresponding to each interpolation point; and integrating the multiple first gene expression prediction values and the multiple second gene expression prediction values to obtain a high-resolution gene expression matrix.
2. The spatial gene expression enhancement method based on bidirectional attention multimodal fusion according to claim 1, characterized in that the step of obtaining the target gene expression matrix includes: obtaining an original gene expression matrix, and standardizing the original gene expression matrix using a standardization algorithm to obtain a standardized gene expression matrix; logarithmically transforming the standardized gene expression matrix using a logarithmic transformation algorithm to obtain a transformed gene expression matrix; obtaining a preset sorting rule, and sorting multiple genes in the transformed gene expression matrix based on the sorting rule; and selecting a preset number of target genes from the multiple genes to construct the target gene expression matrix.
3. The spatial gene expression enhancement method based on bidirectional attention multimodal fusion according to claim 1, characterized in that the step of generating a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix includes: obtaining the first local image patch corresponding to each sampling point in the whole-slice histological image; obtaining a pre-trained pathological image feature extraction model, and inputting each first local image patch into the pathological image feature extraction model for feature encoding to obtain multiple high-dimensional image representation vectors; integrating the multiple high-dimensional image representation vectors to obtain the first high-dimensional image representation sequence; and performing neighborhood information injection processing on the target gene expression matrix to obtain the first neighborhood-aware gene expression representation sequence.
4. The spatial gene expression enhancement method based on bidirectional attention multimodal fusion according to claim 3, characterized in that the step of performing neighborhood information injection processing on the target gene expression matrix to obtain the first neighborhood-aware gene expression representation sequence includes: obtaining, based on the target gene expression matrix, the first spatial coordinates corresponding to each sampling point, and dividing all sampling points into multiple observed sampling points and multiple sampling points to be predicted based on the first spatial coordinates; determining, based on the multiple observed sampling points, a neighbor sampling point group corresponding to each sampling point to be predicted; calculating a mean gene expression value based on the true gene expression values of the neighbor sampling point group, and pre-filling the corresponding sampling point to be predicted based on the mean gene expression value to determine the initial expression value of each sampling point to be predicted; integrating the true gene expression values of all observed sampling points and the initial expression values of all sampling points to be predicted to obtain a pre-filled expression matrix; determining, based on the first spatial coordinates of the multiple sampling points, the K nearest neighbor sampling points corresponding to each sampling point, and establishing a spatial neighborhood connection between each sampling point and its K nearest neighbor sampling points to construct a nearest neighbor undirected graph; and performing, on the nearest neighbor undirected graph, attention-weighted aggregation on the pre-filled expression matrix using a graph attention network to obtain the first neighborhood-aware gene expression representation sequence.
5. The spatial gene expression enhancement method based on bidirectional attention multimodal fusion according to claim 1, characterized in that the spatial gene prediction model includes a cross-attention module, a feature fusion module, a Transformer encoder module, and a gene expression prediction module, which are sequentially connected; the cross-attention module includes a first attention submodule and a second attention submodule, which are connected; and the step of inputting the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction to obtain a first gene expression prediction value corresponding to each sampling point includes: performing, based on the first attention submodule, a first cross-attention calculation using the first high-dimensional image representation sequence as the query and the first neighborhood-aware gene expression representation sequence as the key and value, to obtain a first interactive feature sequence; performing, based on the second attention submodule, a second cross-attention calculation using the first neighborhood-aware gene expression representation sequence as the query and the first high-dimensional image representation sequence as the key and value, to obtain a second interactive feature sequence; inputting the first interactive feature sequence and the second interactive feature sequence into the feature fusion module for concatenation to obtain a multimodal fusion feature sequence; inputting the multimodal fusion feature sequence into the Transformer encoder module for global context enhancement to obtain a global context-enhanced feature sequence; and inputting the global context-enhanced feature sequence into the gene expression prediction module for prediction to obtain the first gene expression prediction value corresponding to each sampling point.
6. The spatial gene expression enhancement method based on bidirectional attention multimodal fusion according to claim 4, characterized in that the step of generating multiple interpolation points based on multiple sampling points using a polar coordinate system spatial interpolation strategy includes: calculating the Euclidean distance between two adjacent sampling points based on the first spatial coordinates corresponding to each sampling point; determining the sampling distance of the interpolation point relative to the sampling point based on the Euclidean distance; and generating multiple interpolation points at the sampling distance, with each sampling point as the center, in directions uniformly distributed around the circumference.
7. The spatial gene expression enhancement method based on bidirectional attention multimodal fusion according to claim 1, characterized in that the step of predicting the second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model includes: obtaining a second local image patch and second spatial coordinates corresponding to each interpolation point; converting the second local image patch into a second high-dimensional image representation sequence, and generating a second neighborhood-aware gene expression representation sequence based on the second spatial coordinates; and inputting the second high-dimensional image representation sequence and the second neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction to obtain the second gene expression prediction value corresponding to each interpolation point.
8. A spatial gene expression enhancement device based on bidirectional attention multimodal fusion, characterized in that it comprises: a matrix acquisition module, used to acquire the target gene expression matrix, wherein the target gene expression matrix includes multiple sampling points; a sequence generation module, used to generate a first high-dimensional image representation sequence and a first neighborhood-aware gene expression representation sequence corresponding to each sampling point based on the target gene expression matrix; a first prediction module, used to acquire a pre-trained spatial gene prediction model, input the first high-dimensional image representation sequence and the first neighborhood-aware gene expression representation sequence into the spatial gene prediction model for prediction, and obtain the first gene expression prediction value corresponding to each sampling point; an interpolation point generation module, used to generate multiple interpolation points based on multiple sampling points using a polar coordinate system spatial interpolation strategy; a second prediction module, used to predict the second gene expression prediction value corresponding to each interpolation point based on the spatial gene prediction model; and an integration module, used to integrate multiple first gene expression prediction values and multiple second gene expression prediction values to obtain a high-resolution gene expression matrix.
9. A spatial gene expression enhancement device based on bidirectional attention multimodal fusion, characterized in that the device includes a memory and at least one processor, wherein the memory stores instructions; and the at least one processor invokes the instructions in the memory to cause the device to perform the steps of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion as described in any one of claims 1-7.
10. A computer-readable storage medium storing instructions thereon, characterized in that, when the instructions are executed by a processor, they implement the steps of the spatial gene expression enhancement method based on bidirectional attention multimodal fusion as described in any one of claims 1-7.
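For illustration only, the preprocessing recited in claim 2 can be sketched as below. The claims leave the standardization algorithm and the sorting rule open, so this sketch assumes library-size normalization to a fixed total (`target_sum`), a `log(1 + x)` transform, and ranking genes by variance across sampling points. The function name and parameters are hypothetical.

```python
import math


def build_target_matrix(counts, gene_names, n_top, target_sum=1e4):
    """Normalize, log-transform, rank genes, and keep the top n_top genes.

    counts     : rows are sampling points, columns are genes (raw counts)
    gene_names : one name per column
    ASSUMPTIONS: library-size normalization to target_sum, log1p transform,
                 variance across sampling points as the sorting rule.
    Returns (selected gene names, target gene expression matrix).
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in row])

    logged = [[math.log1p(value) for value in row] for row in normalized]

    def variance(column):
        mean = sum(column) / len(column)
        return sum((value - mean) ** 2 for value in column) / len(column)

    columns = list(zip(*logged))
    order = sorted(
        range(len(gene_names)),
        key=lambda g: variance(columns[g]),
        reverse=True,
    )
    keep = order[:n_top]
    selected = [gene_names[g] for g in keep]
    matrix = [[row[g] for g in keep] for row in logged]
    return selected, matrix
```

Any other preset sorting rule, such as mean expression or a dispersion statistic, could replace the variance key without changing the rest of the flow.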