Spatial gene expression prediction method based on multi-scale feature extraction

By combining multi-scale tissue morphology features with spatial coordinate information, and by using multi-scale feature extraction together with contrastive learning training, the method addresses the insufficient accuracy and biological interpretability of existing spatial gene expression prediction models, achieving higher prediction accuracy and better model generalization.

CN122201437A · Pending · Publication Date: 2026-06-12 · YUNNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
YUNNAN UNIV
Filing Date
2026-03-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing spatial gene expression prediction models fail to capture deep biological information such as gene function and pathological background, and they neglect spatial dependence and tissue section heterogeneity, which lowers prediction accuracy. They also lack cross-modal feature alignment and multi-scale information integration.

Method used

Multi-scale tissue morphological features are combined with spatial coordinate information. A pre-trained feature extraction model performs multi-scale feature extraction and fusion to obtain image feature vectors, a sequencing point encoder generates an expression embedding matrix, and a target prediction model is obtained through contrastive learning training and then used for prediction.

Benefits of technology

It achieves accurate prediction of spatial gene expression, improves prediction accuracy and biological interpretability, and enhances the model's generalization ability and cross-sample adaptability.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to the technical field of spatial transcriptomics, and in particular to a spatial gene expression prediction method based on multi-scale feature extraction. The method first acquires a whole slide image that comprises a plurality of sequencing points, together with a gene expression matrix and a spatial coordinate matrix corresponding to the sequencing points. It pre-processes the whole slide image to obtain a plurality of image patches, and calls a pre-trained feature extraction model to perform multi-scale feature extraction and fusion on the image patches, obtaining an image feature vector. A sequencing point encoder generates an expression embedding matrix from the spatial coordinate matrix and the gene expression matrix. An initial prediction model is then constructed and trained by contrastive learning on the image feature vector and the expression embedding matrix to obtain a target prediction model. Finally, a to-be-tested image feature vector is acquired and input into the target prediction model, which outputs a predicted gene expression value, so as to realize accurate prediction of spatial gene expression.

Description

Technical Field

[0001] This invention relates to the field of spatial transcriptomics technology, and in particular to a spatial gene expression prediction method based on multi-scale feature extraction.

Background Technology

[0002] Spatial transcriptomics, by preserving spatial information in tissue sections, has advanced gene expression research, enabling the spatial localization of gene expression patterns and deepening our understanding of cell-cell interactions, disease mechanisms, and tissue development. Particularly in areas such as tumor heterogeneity and neurodegenerative diseases, it has revealed new perspectives on region-specific gene expression patterns, driving innovation in clinical diagnosis, prediction, and treatment. However, the high cost of equipment and complex experimental procedures limit its widespread clinical application.

[0003] Hematoxylin-eosin stained whole-section images, as common pathological materials, are ideal for predicting spatial gene expression due to their low cost and easy accessibility, and their widespread application provides rich image data for clinical and pathological research. In recent years, researchers have attempted to predict spatial gene expression from images using deep learning models, achieving preliminary results, but many challenges remain in improving prediction accuracy and biological interpretability.

[0004] Existing spatial gene expression prediction models generally rely on low-level visual features of images, making it difficult to capture deep biological information such as gene function and pathological background. Furthermore, they fail to adequately consider spatial dependence and tissue section heterogeneity, neglecting the crucial role of spatial relationships between adjacent sections, which lowers prediction accuracy. Some models use convolutional neural networks to extract features but cannot effectively model spatial context; some introduce vision transformers to enhance spatial consistency but still fail to integrate expression correlations between adjacent regions; and some use graph neural networks to model spatial topology but are limited in single-cell expression resolution. Contrastive learning has subsequently been introduced into related modeling tasks, strengthening modal correlations through dual encoders or joint modeling, but it focuses primarily on modal alignment and does not adequately consider the functional structure of expression data or differences between tissues, which limits biological interpretability and generalization ability. Existing methods therefore have significant limitations in multi-scale information integration and cross-modal feature alignment: they focus only on local image features, have difficulty capturing hierarchical representations of tissue structures, generalize poorly across samples, and do not fully use the intrinsic relationship between gene expression and spatial coordinates.

Summary of the Invention

[0005] To overcome the shortcomings of existing technologies, the present invention provides a spatial gene expression prediction method based on multi-scale feature extraction, which aims to achieve accurate prediction of spatial gene expression by combining multi-scale tissue morphology features with spatial coordinate information.

[0006] The first aspect of this invention provides a spatial gene expression prediction method based on multi-scale feature extraction, comprising: acquiring a whole-slice image, the whole-slice image including multiple sequencing points and gene expression matrices and spatial coordinate matrices corresponding to the multiple sequencing points; preprocessing the whole-slice image to obtain multiple image patches; calling a pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain image feature vectors; using a sequencing point encoder to generate an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix; constructing an initial prediction model, and performing contrastive learning training on the initial prediction model based on the image feature vectors and the expression embedding matrix to obtain a target prediction model; acquiring a feature vector of a test image, and inputting the feature vector of the test image into the target prediction model for prediction to obtain predicted gene expression values.

[0007] A second aspect of the present invention provides a spatial gene expression prediction device based on multi-scale feature extraction, comprising: an image acquisition module for acquiring a whole-slice image, the whole-slice image including multiple sequencing points and gene expression matrices and spatial coordinate matrices corresponding to the multiple sequencing points; an image preprocessing module for preprocessing the whole-slice image to obtain multiple image patches; a feature extraction and fusion module for calling a pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain image feature vectors; an embedding matrix generation module for generating an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix using a sequencing point encoder; a learning and contrastive training module for constructing an initial prediction model and performing learning and contrastive training on the initial prediction model based on the image feature vectors and the expression embedding matrix to obtain a target prediction model; and a prediction module for acquiring a feature vector of a test image and inputting the feature vector of the test image into the target prediction model for prediction to obtain predicted gene expression values.

[0008] A third aspect of the present invention provides a spatial gene expression prediction device based on multi-scale feature extraction, the spatial gene expression prediction device based on multi-scale feature extraction comprising: a memory and at least one processor, the memory storing instructions; at least one processor calling the instructions in the memory to cause the spatial gene expression prediction device based on multi-scale feature extraction to perform each step of the spatial gene expression prediction method based on multi-scale feature extraction described above.

[0009] A fourth aspect of the present invention provides a computer-readable storage medium storing instructions that, when executed by a processor, implement the steps of the spatial gene expression prediction method based on multi-scale feature extraction described in any of the preceding claims.

[0010] In the technical solution of this invention, a whole-slice image is first acquired, which includes multiple sequencing points and gene expression matrices and spatial coordinate matrices corresponding to the sequencing points. Then, the whole-slice image is preprocessed to obtain multiple image patches. A pre-trained feature extraction model is called to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain image feature vectors. Next, a sequencing point encoder is used to generate an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix. An initial prediction model is then constructed and trained by contrastive learning based on the image feature vectors and the expression embedding matrix to obtain a target prediction model. Finally, the feature vector of the image to be tested is obtained and input into the target prediction model for prediction to obtain the predicted gene expression value. The aim is to achieve accurate prediction of spatial gene expression by combining multi-scale tissue morphology features and spatial coordinate information.

Attached Figure Description

[0011] Figure 1: A flowchart of the spatial gene expression prediction method based on multi-scale feature extraction provided in an embodiment of the present invention;
Figure 2: A flowchart of the algorithm model for the spatial gene expression prediction method based on multi-scale feature extraction provided in an embodiment of the present invention;
Figure 3: The gene visualization results on the Her2ST, cSCC, and Alex datasets provided in an embodiment of the present invention;
Figure 4: The gene visualization results on the DLPFC dataset provided in an embodiment of the present invention;
Figure 5: The results of gene correlation analysis on four datasets provided in an embodiment of the present invention;
Figure 6: The spatial domain recognition results on the DLPFC dataset provided in an embodiment of the present invention;
Figure 7: The tumor region detection results on the Alex dataset provided in an embodiment of the present invention;
Figure 8: The results of biological process analysis on tumor regions in the Alex dataset provided in an embodiment of the present invention;
Figure 9: A schematic diagram of the structure of a spatial gene expression prediction device based on multi-scale feature extraction provided in an embodiment of the present invention;
Figure 10: A schematic diagram of the structure of a spatial gene expression prediction device based on multi-scale feature extraction provided in an embodiment of the present invention.

Detailed Implementation

[0012] This invention provides a spatial gene expression prediction method based on multi-scale feature extraction. In this invention, the terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0013] For ease of understanding, the specific process of the embodiments of the present invention is described below. Referring to Figure 1, one embodiment of the spatial gene expression prediction method based on multi-scale feature extraction in this invention includes: 101. Obtain a whole-slice image, wherein the whole-slice image includes multiple sequencing points and gene expression matrices and spatial coordinate matrices corresponding to the multiple sequencing points; In this embodiment, whole-slice images are acquired from publicly available multi-slice spatial transcriptomics (ST) datasets. Each acquired whole-slice image must contain multiple sequencing points and is accompanied by a gene expression matrix and a spatial coordinate matrix that correspond one-to-one with the sequencing points. This organization follows the conventions of publicly available multi-slice ST datasets.

[0014] For example, in this embodiment, four types of publicly available multi-slice ST datasets serve as core data support. Two types of human breast cancer datasets, cutaneous squamous cell carcinoma datasets, and human dorsolateral prefrontal cortex datasets all provide sample and data guarantees for the effective acquisition of whole-slice images. Among them, the HER2+ breast cancer dataset contains 36 samples from 8 patients, and after screening, 32 whole-slice images from 7 patients are retained. Each whole-slice image contains no less than 360 sequencing points, each corresponding to a 100μm scale. The corresponding gene expression matrix and spatial coordinate matrix are precisely matched with the number and spatial distribution of these sequencing points. The Alex breast cancer dataset consists of 6 whole-slice images from 4 patients, covering a total of 24,714 sequencing points, each corresponding to a 50μm scale. The gene expression matrix and spatial coordinate matrix accompanying the whole-slice images synchronously cover the gene expression information and spatial location information of all sequencing points. The cutaneous squamous cell carcinoma dataset contains 12 whole-slice images from 4 patients, totaling 8671 sequencing points, with each sequencing point corresponding to a 100μm scale. The gene expression matrix accompanying the whole-slice images of this dataset contains expression data for 17047 genes, and the spatial coordinate matrix accurately records the spatial distribution location of each sequencing point. The human dorsolateral prefrontal cortex dataset contains 12 whole-slice images from 3 individuals, covering 74631 sequencing points, with each sequencing point corresponding to a 55μm scale. Its accompanying gene expression matrix contains expression data for 33536 genes, and the spatial coordinate matrix fully reflects the spatial coordinate information of each sequencing point. 
The differences in the number of sequencing points, scale, and gene expression dimensions of different datasets make the information coverage of whole-slice images more diverse. This not only ensures the model's generalization and adaptation ability under different tissue types and sequencing point scales, but also enhances the model's ability to learn gene expression spatial localization and expression patterns through rich gene expression dimensions and spatial coordinate information. This effectively improves the model's prediction accuracy and biological interpretability, and solves the problem of insufficient generalization ability caused by the single data dimension and insufficient sample coverage of existing methods.

[0015] 102. Preprocess the whole-slice image to obtain multiple image patches; In this embodiment, the preprocessing first uses image processing tools to resize the whole-slice image, obtaining an adjusted image suitable for subsequent processing. This addresses the large size of the whole-slice image, avoiding the inefficiency and insufficient feature extraction that occur when image size limits are exceeded. Subsequently, based on the spatial distribution of sequencing points recorded in the spatial coordinate matrix, the adjusted image is divided into multiple image patches using image processing tools. Each image patch corresponds one-to-one with a sequencing point in the whole-slice image, so the spatial locations of image patches and sequencing points match precisely. This bridges the information gap between the global view of the whole slice and the local information at the patch level, enabling the image feature extraction process to consider both the global background information provided by the whole slice and the local morphological details at the patch level, thus improving the comprehensiveness and richness of image features.
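
The patch-cutting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: spot coordinates are assumed to be pixel centres in the resized image, the 224×224 patch size is borrowed from the test-image description later in the text, the image is a plain nested list rather than a real image object, and `patch_box`, `crop_patch`, and `patches_for_spots` are hypothetical helper names.

```python
def patch_box(cx, cy, width, height, size=224):
    """Return (left, top, right, bottom) of a size x size box centred on a spot.

    Near the image border the box is shifted rather than shrunk, so every
    patch keeps the same shape (assumes the image is at least size pixels
    wide and high).
    """
    half = size // 2
    left = min(max(cx - half, 0), width - size)
    top = min(max(cy - half, 0), height - size)
    return left, top, left + size, top + size


def crop_patch(image, box):
    """Crop a row-major nested-list image with a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def patches_for_spots(image, spot_centres, size=224):
    """Cut one patch per sequencing point, in the order of the coordinate matrix."""
    height, width = len(image), len(image[0])
    return [crop_patch(image, patch_box(cx, cy, width, height, size))
            for cx, cy in spot_centres]
```

Because the output list follows the order of the coordinate matrix, patch `i` and row `i` of the gene expression matrix refer to the same sequencing point, which is the one-to-one correspondence the later training step relies on.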

[0016] 103. Call the pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on multiple image patches to obtain image feature vectors; In this embodiment, the pre-trained feature extraction model is specifically the UNI model. This model, based on the ViT-L/16 architecture, is designed specifically for histopathological images and can extract a 1×1024 image patch feature vector for each image patch. The model includes an extraction backbone module, a neighborhood fusion module, a global aggregation module, a dimension mapping module, and a multi-scale attention fusion module. First, the extraction backbone module performs basic feature extraction. Then, the neighborhood fusion module selects the eight nearest-neighbor image patch feature vectors for each target image patch feature vector and fuses these nine vectors (target and neighbor features) using a linear multi-head transformer to generate initial neighborhood-level context-enhanced features, thereby capturing local spatial correlation information. Next, the global aggregation module performs global aggregation analysis on all image patch feature vectors to form initial global-level tissue structure features, which represent the overall tissue structure and spatial layout. The dimension mapping module then uses a multilayer perceptron to uniformly map the sequencing point-level image patch feature vectors, the initial neighborhood-level context-enhanced features, and the initial global-level tissue structure features to the highly variable gene dimension used for gene expression prediction. This yields sequencing point-level target features, target neighborhood-level context-enhanced features, and target global-level tissue structure features, achieving dimensional alignment between the features at different scales and the prediction task.
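
The neighbor selection used by the neighborhood fusion module might look like the sketch below. The text does not say whether "nearest" is measured in slide space or in feature space; this sketch assumes Euclidean distance between spot coordinates. The function names are hypothetical, and the transformer that fuses the nine vectors is not shown.

```python
import math


def nearest_neighbors(coords, k=8):
    """For each spot, return the indices of its k spatially closest other spots.

    coords: list of (x, y) spot positions. Assumption: "nearest" means
    Euclidean distance in slide coordinates; ties are broken by index.
    """
    result = []
    for i, (xi, yi) in enumerate(coords):
        dists = [(math.hypot(xi - xj, yi - yj), j)
                 for j, (xj, yj) in enumerate(coords) if j != i]
        dists.sort()
        result.append([j for _, j in dists[:k]])
    return result


def neighborhood_stack(features, coords, k=8):
    """Group each target feature with its k neighbor features (k + 1 vectors)."""
    neighbors = nearest_neighbors(coords, k)
    return [[features[i]] + [features[j] for j in idx]
            for i, idx in enumerate(neighbors)]
```

Each entry of `neighborhood_stack` holds the target vector first and its neighbors after it, which is the nine-vector group that the fusion step would consume. The pairwise search is quadratic in the number of spots; a spatial index would be the usual replacement on large slides.
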
Finally, the multi-scale attention fusion module employs a multi-head cross-attention mechanism, using the sequencing point-level target features as queries and performing cross-attention calculations with the target neighborhood-level context-enhanced features and the target global-level tissue structure features respectively. The attention results at the neighborhood and global scales are merged to generate an image feature vector that integrates local details, neighborhood context, and global structural information. Using a model pre-trained on histopathological images keeps feature extraction specific to the domain. Through hierarchical extraction and fusion of multi-scale features, the method preserves the morphological details of individual patches while strengthening the modeling of local spatial relationships and global tissue background, integrating information at different levels. This provides context-rich image features for subsequent spatial gene expression prediction, improving the accuracy and biological interpretability of the prediction results.

104. Using a sequencing point encoder, generate an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix; In this embodiment, the sequencing point encoder consists of a coordinate encoding module, a multi-head self-attention module, and a projection module connected sequentially. First, the coordinate encoding module processes the horizontal and vertical coordinates of the spatial coordinate matrix. The horizontal coordinates are converted into a one-hot encoded matrix adapted to the coordinate range of the tissue slices. Then, a learnable linear layer maps this matrix to a horizontal coordinate feature matrix with the same dimensions as the gene expression matrix.
The same one-hot encoding and linear mapping process is performed on the vertical coordinates, resulting in a dimension-matched vertical coordinate feature matrix. This aligns the spatial location information with the gene expression data, laying the foundation for subsequent feature fusion. Subsequently, the multi-head self-attention module combines the gene expression matrix, the horizontal coordinate feature matrix, and the vertical coordinate feature matrix element-wise. Through a multi-head self-attention mechanism, global dependency modeling is performed to learn the intrinsic relationship between gene expression features and spatial location features, generating an intermediate fusion feature matrix that effectively integrates gene expression information with the spatial context information of the sequencing points. Finally, the projection module maps the intermediate fusion feature matrix to the target dimension using a multilayer perceptron to obtain the expression embedding matrix, completing the final integration of gene expression data and spatial location information. The final generated expression embedding matrix not only preserves the biological information of gene expression but also embeds the spatial distribution features of sequencing points. This provides high-quality expression features rich in contextual information for subsequent cross-modal contrastive learning and gene expression prediction tasks, effectively improving the model's ability to learn and predict spatial gene expression patterns and enhancing the biological interpretability of the prediction results.
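
The coordinate encoding and element-wise combination can be illustrated with the sketch below. It rests on several assumptions that the text does not confirm: coordinates are integer grid indices in `[0, grid_size)`, the linear layers have no bias, "combines element-wise" is read as addition, and randomly initialised weights stand in for learned ones. The self-attention and projection modules are omitted, and all function names are hypothetical.

```python
import random


def one_hot(index, size):
    """One-hot row vector of length size with a 1 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear(vec, weight):
    """Apply a bias-free linear layer: vec (n_in) times weight (n_in x n_out)."""
    n_out = len(weight[0])
    return [sum(v * row[o] for v, row in zip(vec, weight)) for o in range(n_out)]


def init_weight(n_in, n_out, rng):
    """Randomly initialised weights standing in for a learnable layer."""
    return [[rng.gauss(0.0, 0.02) for _ in range(n_out)] for _ in range(n_in)]


def encode_spots(expr, coords, grid_size, w_x, w_y):
    """Add coordinate features to each spot's expression vector element-wise."""
    fused = []
    for genes, (x, y) in zip(expr, coords):
        fx = linear(one_hot(x, grid_size), w_x)  # horizontal coordinate feature
        fy = linear(one_hot(y, grid_size), w_y)  # vertical coordinate feature
        fused.append([g + a + b for g, a, b in zip(genes, fx, fy)])
    return fused
```

Multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, so the one-hot plus linear-layer construction behaves like a learned lookup table with one embedding per coordinate value.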

[0017] 105. Construct an initial prediction model, and perform contrastive learning training on the initial prediction model based on the image feature vector and the expression embedding matrix to obtain the target prediction model; In this embodiment, the cosine similarity between the image feature vector and the expression embedding matrix is first calculated to obtain a similarity matrix. This matrix quantifies the similarity between the image patch embeddings and the sequencing point embeddings, providing the basis for subsequent loss calculation. Then, a preset target matrix that labels positive and negative sample pairs is obtained. This matrix takes the form of an identity matrix: diagonal elements label image patches and sequencing points at the same position as positive sample pairs, with a value of 1, and off-diagonal elements label image patches and sequencing points at different positions as negative sample pairs, with a value of 0, which defines the supervision signal for model training. Next, the total contrastive loss is calculated from the similarity matrix and the target matrix as a weighted average of the image loss and the sequencing point loss. The image loss constrains the similarity of each image patch embedding to the sequencing point embeddings, and the sequencing point loss constrains the similarity of each sequencing point embedding to the image patch embeddings. The relative weights of the two losses are controlled by hyperparameters, so that the total loss measures the model's ability to distinguish positive from negative sample pairs in both directions. Subsequently, based on the total contrastive loss, the backpropagation algorithm is used to iteratively update the weight parameters of the initial prediction model.
This allows the model to continuously adjust its parameters, maximizing the similarity between image features and gene expression embeddings at the same location while minimizing the similarity between samples at different locations, gradually constructing an aligned multimodal embedding space. When a preset iteration stopping condition is met, the target prediction model with completed training is output. Through intermodal contrastive learning, the embedding spaces of image features and gene expression features are effectively aligned, strengthening the feature association between samples at the same location and suppressing feature confusion between samples at different locations. This significantly improves the model's ability to fuse image information and gene expression data, providing more consistent and discriminative multimodal features for subsequent gene expression prediction. This not only enhances the model's generalization ability and prediction accuracy but also makes the learned features more biologically interpretable, better capturing the intrinsic relationship between tissue morphology and gene expression.
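
A minimal sketch of the loss computation follows. The text specifies the cosine similarity matrix, the identity target matrix, and a weighted average of an image loss and a sequencing point loss, but not the loss function itself. This sketch assumes softmax cross-entropy over temperature-scaled similarities, as in common dual-encoder contrastive setups; `alpha` and `temperature` are hypothetical hyperparameter names, and the gradient update is not shown.

```python
import math


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard zero vectors
        return [x / norm for x in v]
    ua, ub = [unit(v) for v in a], [unit(v) for v in b]
    return [[sum(x * y for x, y in zip(r, c)) for c in ub] for r in ua]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy of each row against an identity (diagonal) target."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, spot_emb, alpha=0.5, temperature=0.1):
    """Weighted average of image-to-spot and spot-to-image losses."""
    sim = cosine_similarity_matrix(image_emb, spot_emb)
    logits = [[s / temperature for s in row] for row in sim]
    image_loss = cross_entropy_diagonal(logits)  # rows: image patches
    spot_loss = cross_entropy_diagonal([list(c) for c in zip(*logits)])  # rows: spots
    return alpha * image_loss + (1.0 - alpha) * spot_loss
```

With this formulation the loss is small when each image patch is most similar to the sequencing point at its own position and large when the pairing is scrambled, which matches the stated goal of pulling same-position pairs together and pushing different-position pairs apart.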

[0018] After obtaining the target prediction model, three indicators (PCC, MSE, and ARI) can be used to comprehensively evaluate model performance and to quantify the accuracy and biological plausibility of the prediction results.

PCC, or Pearson correlation coefficient, is a statistical indicator used to assess the strength of the linear relationship between two variables. It is commonly used to measure the correlation between predicted gene expression values and actual gene expression values. The closer the PCC value is to 1, the stronger the linear correlation between the predicted and actual results, and the better the model performance. The formula is:

$$\mathrm{PCC} = \frac{\mathrm{Cov}(Y, \hat{Y})}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(\hat{Y})}}$$

where $\mathrm{Cov}(\cdot,\cdot)$ is the covariance, which measures the degree of linear association between two variables; $\mathrm{Var}(\cdot)$ is the variance, which measures the dispersion of a single variable; $Y$ is the observed true gene expression value; and $\hat{Y}$ is the gene expression value predicted by the model. This indicator reflects how consistently the predicted and actual values follow the same overall trend.

MSE, or Mean Squared Error, is a commonly used evaluation metric that measures the average of the squared differences between the predicted and actual values. Because the error is squared, the metric is more sensitive to larger deviations, which makes it suitable for evaluating the overall prediction accuracy of the model. The smaller the MSE value, the smaller the deviation between the model's prediction and the actual value, and the better the performance. Its formula is:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $N$ is the total number of samples, $y_i$ is the true gene expression value of the $i$-th sample, and $\hat{y}_i$ is the predicted gene expression value of the $i$-th sample. Averaging the squared prediction errors over all samples quantifies the model's overall prediction error level.

ARI, or Adjusted Rand Index, is used to evaluate the model's clustering performance by measuring the consistency between the clustering results and the true labels. A higher ARI value indicates a better match between the clustering results and the true category labels. The formula is:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - E\left[\sum_{ij}\binom{n_{ij}}{2}\right]}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - E\left[\sum_{ij}\binom{n_{ij}}{2}\right]}, \qquad E\left[\sum_{ij}\binom{n_{ij}}{2}\right] = \frac{\sum_{i}\binom{a_i}{2}\,\sum_{j}\binom{b_j}{2}}{\binom{n}{2}}$$

where $n_{ij}$ is the number of samples that belong to both true category $i$ and predicted category $j$, $a_i$ is the total number of samples in true category $i$, $b_j$ is the total number of samples in predicted category $j$, $n$ is the total number of samples, and $E[\cdot]$ is the expected number of such sample pairs under random clustering assignment. By correcting for the influence of random clustering, this indicator more accurately reflects the true effectiveness of the model's clustering results.

Together, the three indicators evaluate model performance along three dimensions (linear correlation, prediction accuracy, and clustering consistency), providing multi-dimensional references for model optimization and practical applications.
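
The three indicators can be computed with the short sketch below, written against the standard definitions of PCC, MSE, and ARI using only the Python standard library. The function names are illustrative, and degenerate inputs (a constant vector for PCC, or a single cluster on both sides for ARI) are not guarded against and would raise a division error.

```python
import math
from collections import Counter


def pcc(y_true, y_pred):
    """Pearson correlation: Cov(Y, Y_hat) / sqrt(Var(Y) * Var(Y_hat))."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    # The 1/n factors of covariance and variance cancel, so sums are enough.
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def ari(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency counts n_ij, a_i, and b_j."""
    n = len(labels_true)
    sum_ij = sum(math.comb(c, 2)
                 for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / math.comb(n, 2)  # pairs expected by chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

ARI depends only on which samples are grouped together, not on the label values, so a clustering that swaps the names of two categories still scores 1.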

[0019] 106. Obtain the feature vector of the image to be tested, input the feature vector of the image to be tested into the target prediction model for prediction, and obtain the predicted gene expression value.

[0020] In this embodiment, to obtain the feature vector of the image to be tested, the test image is first divided into multiple 224×224 pixel image patches according to the sequencing point locations. For each patch, features at three scales (sequencing point level, neighborhood level, and global level) are extracted. These three types of features are then fused into a feature vector of the image to be tested using a multi-scale fusion network and mapped to the gene dimension. The feature vector of the image to be tested is input into a target prediction model for prediction to obtain the predicted gene expression value. The target prediction model consists of a similar feature retrieval module, a weight calculation module, and a weighted aggregation module connected in sequence. First, the similar feature retrieval module receives the feature vector of the image to be tested, calculates its cosine similarity with the expression embedding matrix, and determines multiple sequencing point reference features most similar to the feature vector of the image to be tested based on the similarity, thus completing the screening of similar samples. Subsequently, the weight calculation module receives the feature vector of the image to be tested and the selected multiple sequencing point reference features, calculates the Euclidean distance between them, and assigns weights to each sequencing point reference feature based on the Euclidean distance. The smaller the feature distance, the higher the weight assigned, thereby quantifying the correlation between each reference feature and the sample to be tested. Finally, the weighted aggregation module performs a weighted summation operation based on the gene expression values corresponding to the reference features of each sequencing point and the assigned weights, obtaining the final predicted gene expression values.
Multi-scale feature extraction ensures the comprehensiveness of the features of the image under test, similar feature retrieval accurately locates associated reference samples, and distance weighting strengthens the contribution of similar samples. The final weighted aggregation prediction method not only conforms to the feature association logic between samples but also fully utilizes the effective information from the training phase, effectively improving the accuracy and biological interpretability of spatial gene expression prediction and enhancing the model's generalization ability in different tissue scenarios.
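The three prediction modules described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function name `predict_expression`, the choice of `k`, the inverse-distance weighting rule, and the toy array sizes are all assumptions made for illustration, since the text only states that a smaller distance yields a higher weight.

```python
import numpy as np

def predict_expression(query, ref_embeddings, ref_expression, k=3, eps=1e-8):
    """Predict expression for one test patch from its k most similar reference spots."""
    # Retrieval: cosine similarity between the query and every reference embedding.
    q = query / (np.linalg.norm(query) + eps)
    r = ref_embeddings / (np.linalg.norm(ref_embeddings, axis=1, keepdims=True) + eps)
    sims = r @ q
    top_k = np.argsort(-sims)[:k]

    # Weighting: Euclidean distance to each retrieved reference; closer means heavier.
    dists = np.linalg.norm(ref_embeddings[top_k] - query, axis=1)
    weights = 1.0 / (dists + eps)          # assumed rule: inverse distance
    weights = weights / weights.sum()      # normalise so the weights sum to 1

    # Aggregation: weighted sum of the expression values of the retrieved spots.
    return weights @ ref_expression[top_k], top_k, weights

# Toy data (hypothetical sizes): 20 reference spots, 8-dim embeddings, 5 genes.
rng = np.random.default_rng(0)
ref_embeddings = rng.normal(size=(20, 8))
ref_expression = rng.random(size=(20, 5))
query = ref_embeddings[4] + 0.01 * rng.normal(size=8)   # a query close to spot 4
pred, top_k, weights = predict_expression(query, ref_embeddings, ref_expression, k=3)
```

Because the query is a slightly perturbed copy of reference spot 4, that spot is retrieved first and receives the largest weight, so the prediction is dominated by its expression profile.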

[0021] After the above model is built and trained, the overall process logic can be further clarified with reference to Figure 2. In the training phase, features are extracted from image patches using the multi-scale fusion network, and gene expression data is fused with spatial coordinate data using the multi-head self-attention mechanism of the sequencing point encoder. Next, a contrastive learning method is used to optimize the similarity between image features and gene expression features, maximizing the similarity of positive sample pairs (an image patch and a sequencing point at the same location) while minimizing the similarity of negative sample pairs (an image patch and a sequencing point at different locations). In the multi-scale fusion network, features are extracted for each image patch at three scales: sequencing point level, neighborhood level, and global level. The extraction backbone module, the neighborhood fusion module, and the global aggregation module learn the features at each respective scale, and the multi-scale attention fusion module fuses them, ultimately obtaining an image feature vector that integrates local details, neighborhood context, and global structural information. In the prediction phase, the feature vector of the new test image is extracted through the multi-scale fusion network and matched against the expression embedding matrix obtained in the training phase. The reference features of the k closest sequencing points are queried, their weights are calculated, and the corresponding gene expression values are weighted and aggregated to predict the gene expression value of the target sequencing point.
Finally, in the downstream analysis stage, the predicted gene expression results are further analyzed and interpreted, including spatial gene expression pattern visualization, gene correlation analysis, spatial domain identification, tumor region detection, and functional enrichment analysis, to explore the biological significance and clinical application value of the predicted results.

[0022] After the model training and downstream analysis workflow were constructed, the model performance was comprehensively validated through multiple sets of cross-dataset experiments. The experimental results are shown in Figures 3 to 8. First, referring to Figure 3, gene expression visualization comparison experiments were conducted on three datasets (Her2ST, cSCC, and Alex) to visually demonstrate the gene expression generated by the models. Specifically, on the B1 and C1 slices of the Her2ST dataset, the marker gene GNAS was visualized, and the performance of HisCMCL was compared with that of representative methods. On the cSCC dataset, the marker gene RPL13 was visualized using the same comparison method as for the Her2ST dataset. On the Alex dataset, the marker gene AZGP1 was visualized, again with the same comparison method. Each panel displays the actual gene expression values and the prediction results of each representative method. These three genes are known tissue-specific marker genes, so they directly reflect the ability of different methods to restore the expression patterns of key genes.

[0023] Further, referring to Figure 4, visual validation of marker gene expression pattern prediction was performed on the DLPFC dataset. The results show the performance of the present method and seven baseline methods on five different tissue sections, evaluating the model's generalization ability and expression prediction accuracy across tissue sections.

[0024] Next, referring to Figure 5, the model's ability to reproduce gene association patterns was verified through gene co-expression analysis. A heatmap displays the Pearson correlation coefficient (PCC) among the top 50 highly expressed genes, with values ranging from -1 (dark blue) to 1 (dark red). This analysis compares the correlation patterns of the gene expression observed in the four datasets with those of the prediction results of seven methods, to verify whether the model preserves the true gene co-expression relationships.
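As an illustration of the co-expression analysis, the following sketch computes the PCC matrix that such a heatmap would display. It assumes that "highly expressed" means highest mean expression across sequencing points and that the expression matrix has sequencing points as rows and genes as columns; the function name and the random toy data are hypothetical.

```python
import numpy as np

def top_gene_correlation(expr, n_top=50):
    """Pearson correlation matrix among the n_top genes with the highest mean expression."""
    # Rank genes (columns) by mean expression over all spots (rows).
    order = np.argsort(-expr.mean(axis=0))[:n_top]
    # np.corrcoef treats each row as one variable, so genes are passed as rows.
    return np.corrcoef(expr[:, order].T), order

# Toy expression matrix (hypothetical): 100 spots x 80 genes.
rng = np.random.default_rng(0)
expr = rng.random((100, 80))
corr, order = top_gene_correlation(expr, n_top=50)
```

Computing this matrix once for the observed expression and once for each method's predictions gives the pairs of heatmaps that the comparison relies on.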

[0025] To evaluate the model's performance in spatial domain recognition tasks, referring to Figure 6, the spatial domain recognition accuracy of the original data and of the generated predicted data was compared on the DLPFC dataset. First, the spatial domains were visualized on slice 151673 using the K-means clustering method, and the recognition accuracy of different methods was compared. Second, the same visualization and performance comparison were repeated using the GraphST clustering method. Together, the two clustering algorithms verify the ability of the gene expression data generated by the model to identify and reconstruct spatial tissue structures.

[0026] For the clinically relevant tumor region detection task, referring to Figure 7, tumor region detection experiments were conducted on the Alex dataset based on the predicted gene expression profiles. Sequencing points were classified into cancerous tissue regions and normal tissue regions according to the tissue regions annotated by pathologists. The adjusted Rand index (ARI) was used to evaluate the tumor region detection performance of different methods and to verify the application potential of the model in identifying disease-related tissue regions.
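For reference, the adjusted Rand index used as the evaluation metric can be computed from pair counts as in the sketch below. This is the standard ARI definition written in plain Python, not code from the application; it assumes at least two sequencing points, and the example labels are invented.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same spots."""
    n = len(labels_true)
    # Contingency counts: how many spots share each (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                      # degenerate case: nothing to adjust
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical annotations: 1 = cancerous region, 0 = normal region.
pathologist = [1, 1, 0, 0]
same_partition = [0, 0, 1, 1]      # identical grouping, label names swapped
crossed = [0, 1, 0, 1]             # grouping that cuts across the annotation
ari_same = adjusted_rand_index(pathologist, same_partition)
ari_crossed = adjusted_rand_index(pathologist, crossed)
```

The index depends only on how spots are grouped, not on the label names, so a clustering that reproduces the pathologist's regions scores 1 even if its cluster identifiers differ.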

[0027] Finally, referring to Figure 8, GO:BP (biological process) and KEGG (signaling pathway) enrichment analyses were performed on differentially expressed genes in tumor regions of the Alex dataset. First, the complete workflow for identifying differentially expressed genes and conducting enrichment analysis based on the gene expression predicted by the present method is shown. Second, the p-values of the top 10 significant GO:BP terms identified in the tumor region and the corresponding enrichment analysis heatmaps are presented. The impact of the related differentially expressed genes on these pathways was analyzed by calculating the log fold change, with the pathways highlighted in red being the core focus. The p-values of the top 10 significant KEGG pathways identified in the tumor region and the corresponding enrichment analysis heatmaps are presented in the same way, validating the biological plausibility and clinical value of the model's predictions from a functional pathway perspective.

[0028] In this embodiment of the invention, the preprocessing of the whole slide image to obtain multiple image patches includes: using an image processing tool to adjust the size of the whole slide image to obtain an adjusted image; and using the image processing tool to divide the adjusted image into multiple image patches according to the spatial coordinate matrix.

[0029] In this embodiment, an image processing tool is first used to resize the whole slide image, resulting in an adjusted image. This addresses the large size of the whole slide image, avoids insufficient capture of morphological features caused by exceeding image size limits, and provides a suitable basis for subsequent image patching. The image processing tool then divides the adjusted image into multiple image patches based on the spatial coordinate matrix. Each patch is 224×224 pixels and corresponds to one target sequencing point. Because the partitioning relies on the spatial coordinate matrix, the spatial location of each local image unit is precisely matched to a sequencing point, which bridges the gap between the global view of the whole slide image and the local information at the patch level. The whole slide image is decomposed into independently processable local units, preserving its global background information while focusing on local morphological features. Resizing improves the efficiency and stability of subsequent operations, and coordinate-based patching solves the problem that large whole slide images are difficult to process directly. Together, these steps provide high-quality local image input for subsequent image feature extraction and spatial gene expression prediction, enhance the model's ability to learn the spatial association between tissue morphology and sequencing points, and improve the accuracy and biological interpretability of subsequent prediction results.
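The coordinate-based patching step can be sketched as follows. The sketch assumes that each patch is centred on its sequencing point, that coordinates are given as (x, y) pixel positions in the adjusted image, and that border patches are zero-padded; none of these details is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def extract_patches(image, coords, size=224):
    """Cut one size x size patch centred on each (x, y) spot coordinate."""
    half = size // 2
    # Zero-pad height and width so that patches near the border keep the full size.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    patches = []
    for x, y in coords:
        # After padding, original pixel (y, x) sits at (y + half, x + half),
        # so a window starting at (y, x) in the padded image is centred on it.
        top, left = int(y), int(x)
        patches.append(padded[top:top + size, left:left + size])
    return np.stack(patches)

# Toy image (hypothetical): 500 x 600 RGB with one bright pixel at row 100, column 200.
image = np.zeros((500, 600, 3), dtype=np.uint8)
image[100, 200] = 255
coords = [(200, 100), (0, 0), (599, 499)]
patches = extract_patches(image, coords)
```

Each row of the spatial coordinate matrix yields exactly one patch, which keeps the one-to-one correspondence between patches and sequencing points that the later contrastive training relies on.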

[0030] In this embodiment of the invention, the feature extraction model includes an extraction backbone module, a neighborhood fusion module, a global aggregation module, a dimension mapping module, and a multi-scale attention fusion module. The extraction backbone module is connected to the neighborhood fusion module and the global aggregation module, respectively. The neighborhood fusion module and the global aggregation module are connected to the dimension mapping module, respectively. The dimension mapping module is connected to the multi-scale attention fusion module. The step of calling the pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on multiple image patches to obtain image feature vectors includes: performing feature extraction on each image patch based on the extraction backbone module to obtain multiple image patch feature vectors; based on the neighborhood fusion module, determining multiple target image patch feature vectors from the image patch feature vectors, obtaining the feature vectors of the k nearest neighbor image patches for each target image patch feature vector, and performing linear multi-head transformation fusion on each target image patch feature vector and its k nearest neighbor image patch feature vectors to obtain multiple initial neighborhood-level context enhancement features; based on the global aggregation module, performing global aggregation on all image patch feature vectors to obtain initial global-level organizational structure features;
based on the dimension mapping module, mapping each image patch feature vector, each initial neighborhood-level context enhancement feature, and the initial global-level organizational structure feature, respectively, to the highly variable gene dimension used for gene expression prediction, to obtain multiple sequencing point-level target features, multiple target neighborhood-level context enhancement features, and target global-level organizational structure features; and, based on the multi-scale attention fusion module, performing multi-head cross-attention merging calculation on the multiple sequencing point-level target features, the multiple target neighborhood-level context enhancement features, and the target global-level organizational structure features to obtain the image feature vector.

[0031] In this embodiment, the extraction backbone module first performs feature extraction. The specific formula is as follows: $f_i = F(x_i)$, where $f_i$ denotes the feature vector of the $i$-th image patch, $F(\cdot)$ denotes the pre-trained feature extractor, and $x_i$ denotes the $i$-th image patch. The feature vector $f_i$ has a dimension of 1 row and 1024 columns, that is, $f_i \in \mathbb{R}^{1 \times 1024}$. Using a pre-trained extractor ensures specialized and effective basic feature extraction.

[0032] Subsequently, the neighborhood fusion module selects the feature vectors of the eight nearest neighbor image patches for each target image patch feature vector. The target feature and its eight nearest neighbor features (nine neighborhood patches in total) are then fused using a linear multi-head transformer to obtain the initial neighborhood-level context enhancement features. The specific formula is as follows: $n_i = \mathrm{MLP}\left(\sum_{j=1}^{M} w_j f_j\right)$, with $n_i \in \mathbb{R}^{1 \times G}$, where $n_i$ denotes the neighborhood-level context enhancement feature of the $i$-th target image patch, $\mathrm{MLP}$ denotes a multilayer perceptron, $M$ denotes the number of patches in the neighborhood (value 9), $w_j$ denotes the weight of the $j$-th neighborhood patch, $f_j$ denotes the feature vector of the $j$-th neighborhood patch, and $G$ denotes the number of highly variable genes used for expression prediction. This captures local spatial association information and enhances the contextual adaptability of the features.
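A simplified sketch of the neighborhood step is given below. It only illustrates neighbor selection and weighted combination: it assumes neighbors are chosen by spatial distance between sequencing point coordinates, and it replaces the learned linear multi-head transformer and multilayer perceptron with a fixed softmax-over-negative-distance weighting, which is an assumption made purely for illustration.

```python
import numpy as np

def neighborhood_features(features, coords, n_neighbors=8):
    """Combine each patch feature with those of its spatially nearest patches."""
    # Pairwise Euclidean distances between spot coordinates, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # The nearest entry of each row is the patch itself (distance 0), so taking
    # n_neighbors + 1 indices gives the target patch plus its neighbours.
    idx = np.argsort(dist, axis=1)[:, : n_neighbors + 1]
    # Assumed weighting: softmax over negative distance, so closer patches count more.
    d = np.take_along_axis(dist, idx, axis=1)
    w = np.exp(-d)
    w = w / w.sum(axis=1, keepdims=True)
    # Weighted sum over the nine neighbourhood patches, shape (N, D).
    fused = (w[:, :, None] * features[idx]).sum(axis=1)
    return fused, idx

# Toy layout (hypothetical): 25 spots on a 5 x 5 grid with 16-dim patch features.
coords = np.array([(r, c) for r in range(5) for c in range(5)], dtype=float)
rng = np.random.default_rng(0)
features = rng.normal(size=(25, 16))
fused, idx = neighborhood_features(features, coords)
```

With grid-index coordinates the weights decay smoothly with distance; with raw pixel coordinates the distances would need rescaling before the exponential, otherwise the target patch would receive almost all of the weight.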

[0033] Next, the global aggregation module performs global aggregation on all image patch feature vectors to form the initial global-level organizational structure features, which present the overall organizational structure and spatial layout and provide global background information for subsequent fusion. Then, the dimension mapping module uses a multilayer perceptron to uniformly map the sequencing point-level image patch feature vectors, the initial neighborhood-level context enhancement features, and the initial global-level organizational structure features to the highly variable gene dimension used for gene expression prediction. This yields multiple sequencing point-level target features, multiple target neighborhood-level context enhancement features, and target global-level organizational structure features, aligning the dimensions of features at different scales with the prediction task. Taking the mapping of sequencing point-level image patch feature vectors as an example, the specific formula is: $s_i = \mathrm{MLP}(f_i)$, with $s_i \in \mathbb{R}^{1 \times G}$, where $s_i$ denotes the $i$-th sequencing point-level target feature, $\mathrm{MLP}$ denotes a multilayer perceptron, $f_i$ denotes the feature vector of the $i$-th image patch, and $G$ denotes the number of highly variable genes.

[0034] Finally, the multi-scale attention fusion module employs a multi-head cross-attention mechanism, using the sequencing point-level target features as queries and performing cross-attention calculations with the target neighborhood-level context enhancement features and the target global-level tissue structure features respectively, to obtain the image feature vector. Relying on the pre-trained UNI model ensures specialized and reliable basic feature extraction. Through hierarchical extraction and fusion of multi-scale features, the method preserves the morphological details of individual patches while strengthening the modeling of local spatial relationships and global tissue background, bridging the gap between the global view of the whole slide image and the local information at the patch level. Dimension mapping aligns the dimensions of features at different scales with the prediction task. The multi-head cross-attention mechanism allows the image features to carry local details, neighborhood context, and global structural information simultaneously, providing context-rich image features for subsequent spatial gene expression prediction. This improves the accuracy and biological interpretability of the prediction results and enhances the model's generalization ability in different tissue scenarios.

[0035] In this embodiment of the invention, the multi-scale attention fusion module includes a first fusion submodule, a second fusion submodule, and a merging submodule, which are sequentially connected. The step of performing multi-head cross-attention merging calculation based on the multi-scale attention fusion module, according to multiple sequencing point-level target features, multiple target neighborhood-level context enhancement features, and target global-level organizational structure features, to obtain the image feature vector includes: based on the first fusion submodule, performing multi-head cross-attention calculation using the multiple sequencing point-level target features as queries and the multiple target neighborhood-level context enhancement features as keys and values to obtain a neighborhood attention result; based on the second fusion submodule, performing multi-head cross-attention calculation using the multiple sequencing point-level target features as queries and the target global-level organizational structure features as keys and values to obtain a global attention result; and inputting the neighborhood attention result and the global attention result into the merging submodule for merging processing to obtain the image feature vector.

[0036] In this embodiment, the first fusion submodule first uses the multiple sequencing point-level target features as queries and the multiple target neighborhood-level context enhancement features as keys and values, and performs multi-head cross-attention calculation to obtain the neighborhood attention result. The specific formula is as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$, where $Q$ denotes the query vector, $K$ denotes the key vector, and $V$ denotes the value vector; $QK^{\top}$ is the product of the query vector and the transpose of the key vector, used to measure the correlation between different features; $d$ denotes the feature dimension, used to scale the correlation values to avoid gradient vanishing; and $\mathrm{softmax}$ denotes the activation function that transforms the correlation values into attention weights. The neighborhood attention result is obtained by weighting the value vector with the attention weights: $A_{nbr} = \mathrm{Attention}(Q, K_{nbr}, V_{nbr})$, where $K_{nbr}$ denotes the key vector corresponding to the target neighborhood-level context enhancement features and $V_{nbr}$ denotes the corresponding value vector. The neighborhood attention result represents the association information between the target sequencing point features and the local context features of the neighborhood. Based on the second fusion submodule, the multiple sequencing point-level target features are used as queries and the target global-level organizational structure features are used as keys and values, and the same cross-attention formula yields the global attention result: $A_{glb} = \mathrm{Attention}(Q, K_{glb}, V_{glb})$, where $K_{glb}$ denotes the key vector corresponding to the target global-level organizational structure features and $V_{glb}$ denotes the corresponding value vector.
The global attention result represents the correlation information between the target sequencing point features and the global tissue structure features. The neighborhood attention result and the global attention result are input into the merging submodule for merging processing to obtain the image feature vector. The specific formula is as follows: $v = \mathrm{Merge}(A_{nbr}, A_{glb})$, where $v$ denotes the image feature vector, $A_{nbr}$ denotes the neighborhood attention result, $A_{glb}$ denotes the global attention result, and $\mathrm{Merge}$ denotes the merge function used to integrate the neighborhood and global attention results; the final image feature vector $v$ has a dimension of 1 row. This approach integrates local context and global structural information. Through parallel cross-attention modules, the sequencing point-level features simultaneously absorb local spatial correlations from the neighborhood and global tissue layout information, bridging the gap between the global view of the whole slide image and patch-level local information. This preserves the morphological details of individual patches while enhancing the spatial context adaptability of the features. The dimension-aligned fusion results provide image features rich in multi-level biological information for subsequent spatial gene expression prediction, improving the model's ability to learn the correlation between tissue morphology and gene expression, enhancing the accuracy and biological interpretability of prediction results, and improving the model's generalization ability in different tissue scenarios.
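The two parallel cross-attention branches and the merging step can be sketched with single-head scaled dot-product attention. The sketch omits the learned query, key, and value projections and the multiple heads, and it assumes that the merge function is a simple average and that the global feature is a single vector; all of these are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return weights @ v, weights

# Toy features (hypothetical sizes): 6 sequencing points, 4 highly variable genes.
rng = np.random.default_rng(0)
n_spots, n_genes = 6, 4
spot_feats = rng.normal(size=(n_spots, n_genes))    # queries
nbr_feats = rng.normal(size=(n_spots, n_genes))     # neighbourhood keys and values
global_feat = rng.normal(size=(1, n_genes))         # global keys and values

a_nbr, w_nbr = cross_attention(spot_feats, nbr_feats, nbr_feats)
a_glb, w_glb = cross_attention(spot_feats, global_feat, global_feat)
image_feats = 0.5 * (a_nbr + a_glb)                 # assumed merge: simple average
```

In this single-vector setting every query attends to the one global key with weight 1, so the global branch passes the global feature through unchanged; learned projections would be needed for that branch to vary per sequencing point.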

[0037] In this embodiment of the invention, the sequencing point encoder includes a coordinate encoding module, a multi-head self-attention module, and a projection module, which are sequentially connected. The step of generating an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix using the sequencing point encoder includes: performing one-hot encoding and dimension mapping on the horizontal and vertical coordinates of the spatial coordinate matrix using the coordinate encoding module to obtain a horizontal coordinate feature matrix and a vertical coordinate feature matrix; performing global dependency modeling on the gene expression matrix, the horizontal coordinate feature matrix, and the vertical coordinate feature matrix using the multi-head self-attention module to obtain an intermediate fusion feature matrix; and inputting the intermediate fusion feature matrix into the projection module for dimension mapping to obtain the expression embedding matrix.

[0038] In this embodiment, the sequencing point encoder includes a coordinate encoding module, a multi-head self-attention module, and a projection module. These three modules are connected sequentially and work together to fuse the gene expression matrix and the spatial coordinate matrix, generating an expression embedding matrix. Based on the coordinate encoding module, one-hot encoding and dimension mapping are performed on the horizontal and vertical coordinates of the spatial coordinate matrix to obtain the horizontal coordinate feature matrix and the vertical coordinate feature matrix. Specifically, the horizontal coordinates are first converted into a one-hot encoding matrix of size $N \times x_{\max}$, where $N$ denotes the number of sequencing points and $x_{\max}$ denotes the maximum value of the x-coordinate in all tissue slices. This matrix is then mapped by a learnable linear layer to the horizontal coordinate feature matrix $C_x \in \mathbb{R}^{N \times G}$, where $N$ denotes the number of sequencing points and $G$ denotes the number of highly variable genes used for gene expression prediction, so that its dimensions are consistent with those of the gene expression matrix. The vertical coordinates are processed in the same way to obtain a dimension-matched vertical coordinate feature matrix $C_y$. This aligns spatial location information with the dimensions of the gene expression data, laying the foundation for subsequent feature fusion.
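The coordinate encoding step can be illustrated as follows. The sketch assumes zero-based integer grid coordinates, so the one-hot width is taken as the maximum coordinate plus one, and it uses a random matrix `W_x` as a stand-in for the learnable linear layer; both are assumptions for illustration.

```python
import numpy as np

def one_hot(indices, depth):
    """Return a (len(indices), depth) matrix with a single 1 per row."""
    out = np.zeros((len(indices), depth))
    out[np.arange(len(indices)), indices] = 1.0
    return out

# Toy x-coordinates (hypothetical) for N = 4 sequencing points.
x_coords = np.array([0, 3, 1, 3])
depth = int(x_coords.max()) + 1          # assumed width: max coordinate + 1 (zero-based)
n_genes = 5                              # number of highly variable genes, G

rng = np.random.default_rng(0)
W_x = rng.normal(size=(depth, n_genes))  # stand-in for the learnable linear layer
C_x = one_hot(x_coords, depth) @ W_x     # horizontal coordinate feature matrix, N x G
```

Multiplying a one-hot row by the weight matrix simply selects one row of that matrix, so the layer acts as a learned embedding per coordinate value, and sequencing points sharing an x-coordinate receive identical horizontal features.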

[0039] Based on the multi-head self-attention module, global dependency modeling is performed on the gene expression matrix, the horizontal coordinate feature matrix, and the vertical coordinate feature matrix to obtain the intermediate fusion feature matrix. The specific formula is as follows: $H = \mathrm{MHSA}(E + C_x + C_y)$, where $h_i$, the $i$-th row of $H$, denotes the intermediate fusion feature of the $i$-th sequencing point; $\mathrm{MHSA}$ denotes the multi-head self-attention mechanism used to learn the global dependencies between gene expression features and spatial location features; $E$ denotes the gene expression matrix; $C_x$ denotes the horizontal coordinate feature matrix; and $C_y$ denotes the vertical coordinate feature matrix. The three matrices are first added element by element and then integrated through the multi-head self-attention mechanism to generate an intermediate fusion feature matrix that combines gene expression information and spatial context information.

[0040] The intermediate fusion feature matrix is input into the projection module for dimension mapping to obtain the expression embedding matrix. The specific formula is as follows: $z_i = \mathrm{MLP}(h_i)$, where $z_i$ denotes the expression embedding feature of the $i$-th sequencing point and $\mathrm{MLP}$ denotes a multilayer perceptron used to map the intermediate fusion features to the target dimension, ultimately generating an expression embedding matrix $Z$ of dimension $N \times G$, consistent with the dimension of the gene expression matrix. This approach achieves deep fusion of gene expression data and spatial location information. The coordinate encoding module provides a structured representation of spatial location information, the multi-head self-attention mechanism strengthens the modeling of the association between gene expression and spatial location, and the projection module generates an expression embedding matrix that combines the biological information of gene expression with the spatial distribution of the sequencing points. The result provides expression features rich in spatial context for subsequent cross-modal contrastive learning and gene expression prediction, enhancing the model's ability to learn and predict spatial gene expression patterns and strengthening the biological interpretability of the prediction results.

[0041] In this embodiment of the invention, the step of performing contrastive learning training on the initial prediction model based on the image feature vector and the expression embedding matrix to obtain the target prediction model includes: calculating the cosine similarity between the image feature vector and the expression embedding matrix to obtain a similarity matrix; obtaining a preset target matrix for labeling positive and negative sample pairs; calculating the total contrastive loss based on the similarity matrix and the target matrix; iteratively updating the weight parameters of the initial prediction model using a backpropagation algorithm based on the total contrastive loss; and outputting the target prediction model when a preset iteration stopping condition is met.

[0042] In this embodiment, the cosine similarity between the image feature vector and the expression embedding matrix is first calculated to obtain the similarity matrix. The specific formula is as follows: $\mathrm{sim}(v_i, z_j) = \frac{v_i z_j^{\top}}{\lVert v_i \rVert_2 \, \lVert z_j \rVert_2}$, where $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity function, used to quantify the directional similarity between image feature vectors and expression embeddings; $v_i$ denotes the image feature vector corresponding to the $i$-th image patch, i.e., the image embedding; $z_j$ denotes the expression embedding corresponding to the $j$-th sequencing point, i.e., the sequencing point embedding; $\top$ denotes transpose; and $\lVert v_i \rVert_2 \, \lVert z_j \rVert_2$ is the product of the L2 norms of the two vectors, a normalization that prevents differences in vector magnitude from interfering with the similarity result. The resulting similarity matrix quantifies the similarity between image features and sequencing point expression embeddings, providing the quantitative basis for subsequent loss calculation.

[0043] Then, a preset target matrix for labeling positive and negative sample pairs is obtained, and the total contrastive loss is calculated based on the similarity matrix and the target matrix. The target matrix is constructed in identity matrix form: $T_{ij} = 1$ if $i = j$ and $T_{ij} = 0$ otherwise, where $T_{ij}$ denotes the element in the $i$-th row and $j$-th column, labeling the sample pair formed by the $i$-th image patch $x_i$ and the $j$-th sequencing point $y_j$. A positive sample pair consists of an image patch and a sequencing point at the same location, corresponding to $T_{ij} = 1$; a negative sample pair consists of an image patch and a sequencing point at different locations, corresponding to $T_{ij} = 0$. This matrix provides explicit supervision during training. Based on the target matrix and the similarity matrix, two loss components are calculated: the image loss $L_{img}$ and the sequencing point loss $L_{spot}$. Both are computed from the target matrix elements $T_{ij}$ and the cosine similarity $\mathrm{sim}(v_i, z_j)$ between the image feature vector $v_i$ of the $i$-th image patch and the expression embedding $z_j$ of the $j$-th sequencing point, with a logarithmic function converting similarity into a loss value that quantifies the deviation between the predicted similarity and the true sample label. Finally, the total contrastive loss is calculated as a weighted average: $L_{total} = \lambda L_{img} + (1 - \lambda) L_{spot}$, where $L_{total}$ denotes the total contrastive loss, $L_{img}$ denotes the image loss, and $L_{spot}$ denotes the sequencing point loss.
The hyperparameter $\lambda$, ranging from 0 to 1, controls the relative weights of the image loss and the sequencing point loss, regulating the alignment between the image modality and the gene expression modality. The total contrastive loss reflects the model's ability to distinguish between positive and negative sample pairs and serves as the objective for parameter optimization.
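A sketch of the loss calculation is given below. The exact component formulas are not legible in this text, so the sketch substitutes a common choice, softmax cross-entropy against the identity target taken over rows for the image loss and over columns for the sequencing point loss, and combines the two with a weight `lam`. The function names, the absence of a temperature parameter, and this particular loss form are assumptions.

```python
import numpy as np

def cosine_similarity_matrix(img, spot, eps=1e-8):
    """Pairwise cosine similarity between image embeddings and spot embeddings."""
    img_n = img / (np.linalg.norm(img, axis=1, keepdims=True) + eps)
    spot_n = spot / (np.linalg.norm(spot, axis=1, keepdims=True) + eps)
    return img_n @ spot_n.T

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(sim, lam=0.5):
    """Weighted sum of a row-wise (image) and a column-wise (spot) cross-entropy."""
    n = sim.shape[0]
    target = np.eye(n)                         # identity target: matched pairs are positives
    loss_img = -(target * log_softmax(sim, axis=1)).sum() / n
    loss_spot = -(target * log_softmax(sim, axis=0)).sum() / n
    return lam * loss_img + (1.0 - lam) * loss_spot, loss_img, loss_spot

# Toy embeddings (hypothetical): 4 matched pairs versus the same pairs shifted by one.
emb = np.eye(4)
total_aligned, img_aligned, spot_aligned = contrastive_loss(cosine_similarity_matrix(emb, emb))
total_shifted, _, _ = contrastive_loss(cosine_similarity_matrix(emb, np.roll(emb, 1, axis=0)))
```

When each image embedding is most similar to the sequencing point at its own location, the loss is lower than when the pairing is shifted, which is the behaviour that the identity target matrix is meant to reward.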

[0044] Based on the total contrastive loss, the backpropagation algorithm is used to iteratively update the weight parameters of the initial prediction model. The backpropagation algorithm takes the total contrastive loss as the optimization objective, calculates the gradient layer by layer from the output layer to the input layer, and updates the model parameters according to the gradient descent principle. The model thereby gradually adjusts its feature mapping, maximizing the similarity between image features and expression embeddings at the same location while minimizing the similarity between samples at different locations, and progressively constructs an aligned multimodal embedding space. When a preset iteration stopping condition is met, the target prediction model is output. The iteration stopping condition can be set as the total contrastive loss converging to a preset threshold, the number of iterations reaching a preset upper limit, or the performance on the validation set no longer improving. At this point, the trained target prediction model has stable multimodal feature extraction and alignment capability.
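The three alternative stopping conditions can be expressed as a small check called once per iteration. The function name, the default thresholds, and the `patience` window used to decide that validation performance is no longer improving are hypothetical; the text does not give concrete values.

```python
def should_stop(step, loss, val_history, max_steps=1000, loss_tol=1e-3, patience=5):
    """Return the reason for stopping, or None if training should continue."""
    # Condition 1: the total contrastive loss has converged to the preset threshold.
    if loss <= loss_tol:
        return "loss_converged"
    # Condition 2: the number of iterations has reached the preset upper limit.
    if step >= max_steps:
        return "max_steps"
    # Condition 3: the last `patience` validation scores did not beat the earlier best
    # (higher scores are assumed to be better).
    if len(val_history) > patience:
        if max(val_history[-patience:]) <= max(val_history[:-patience]):
            return "no_improvement"
    return None

# Hypothetical validation scores: an early best followed by five flat evaluations.
reason = should_stop(step=10, loss=0.5, val_history=[0.5, 0.4, 0.4, 0.4, 0.4, 0.4])
```

Returning the reason rather than a bare boolean makes it easy to log which of the three conditions ended training.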

[0045] A contrastive learning mechanism achieves deep alignment between image features and gene expression embeddings. Cosine similarity quantification accurately captures intermodal feature similarity, while the target matrix provides clear positive and negative sample supervision signals for training. A bidirectional loss function and hyperparameter tuning ensure the balance and effectiveness of intermodal feature alignment, and backpropagation parameter optimization ensures the convergence and efficiency of model training. The resulting target prediction model effectively extracts high-quality image morphological features and gene expression spatial features, providing unified and aligned multimodal feature support for subsequent spatial gene expression prediction tasks. This significantly improves the accuracy and biological interpretability of prediction results, while enhancing the model's generalization ability across different tissue samples and sequencing scales, providing a reliable model foundation for spatial transcriptomics research.

[0046] In this embodiment of the invention, the target prediction model includes a similar feature retrieval module, a weight calculation module, and a weighted aggregation module, which are sequentially connected. The step of inputting the feature vector of the image to be tested into the target prediction model for prediction to obtain the predicted gene expression value includes: inputting the feature vector of the image to be tested into the similar feature retrieval module, calculating the cosine similarity between the feature vector of the image to be tested and the expression embedding matrix, and determining multiple sequencing point reference features most similar to the feature vector of the image to be tested based on the cosine similarity; inputting the feature vector of the image to be tested and the multiple sequencing point reference features into the weight calculation module, calculating the Euclidean distance between the feature vector of the image to be tested and each sequencing point reference feature, and assigning weights to each sequencing point reference feature based on the Euclidean distance; and, based on the weighted aggregation module, performing a weighted summation based on the gene expression value corresponding to each sequencing point reference feature and the assigned weights to obtain the predicted gene expression value.

[0047] In this embodiment, the feature vector of the image to be tested is input into the target prediction model for prediction to obtain the predicted gene expression value. The target prediction model consists of a similar feature retrieval module, a weight calculation module, and a weighted aggregation module connected in sequence. To obtain the feature vector of the image to be tested, the test image is first divided into 224×224 pixel image patches according to the sequencing point positions, and features are extracted at three scales: the sequencing point level, the neighborhood level, and the global level. These features are fused by the multi-scale fusion network and mapped to the gene dimension.
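The patch division step can be sketched as follows. This is a minimal sketch that assumes each 224×224 patch is centered on its sequencing point and that coordinates are given in pixels as (x, y); the function name `extract_patches` and the reflect padding at image borders are illustrative choices, not details from the text.

```python
import numpy as np


def extract_patches(image: np.ndarray, coords: np.ndarray, size: int = 224) -> np.ndarray:
    """Crop one size x size patch per sequencing point.

    image:  (H, W, C) array holding the slide image.
    coords: (n, 2) array of (x, y) pixel coordinates of the sequencing points.
    """
    half = size // 2
    # Assumption: reflect-pad so that points near the border still give full patches.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    patches = []
    for x, y in coords.astype(int):
        # Pixel (y, x) of the original image sits at (y + half, x + half) in the
        # padded image, so this window places the point at patch index (half, half).
        patches.append(padded[y:y + size, x:x + size])
    return np.stack(patches)
```

The resulting array of shape `(n, 224, 224, C)` would then be passed to the feature extraction model, one patch per sequencing point.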

[0048] The feature vector of the image to be tested is input into the similar feature retrieval module, and the cosine similarity between it and the expression embedding matrix is calculated. Based on the cosine similarity, the multiple sequencing point reference features most similar to the feature vector of the image to be tested are determined. By quantifying the similarity between features, this step selects the reference samples most closely associated with the sample to be tested, providing a reliable reference for subsequent prediction.

The feature vector of the image to be tested and the multiple sequencing point reference features are input into the weight calculation module, and the Euclidean distance between the feature vector of the image to be tested and each sequencing point reference feature is calculated. The specific formula is as follows:

$$d(x, r) = \sqrt{\sum_{m=1}^{D} (x_m - r_m)^2}$$

where $x$ represents the feature vector of the test image patch, i.e., the feature vector of the image to be tested; $r$ represents a feature vector obtained during the training process; $D$ is the dimension of the feature space; and $x_m$ and $r_m$ represent the values of the components of $x$ and $r$ in the $m$-th feature dimension. The formula measures the spatial distance between two features as the square root of the sum of the squared differences in each dimension; a smaller distance indicates higher feature similarity.

Weights are assigned to the reference feature of each sequencing point based on the Euclidean distance. The specific formula is as follows:

$$w_i = \frac{d(x, r_i)^{-1}}{\sum_{j \in N} d(x, r_j)^{-1}}$$

where $w_i$ represents the weight between the feature vector $x$ of the image to be tested and the feature vector $r_i$ of the $i$-th retrieved reference sequencing point; $d(x, r_i)$ represents the Euclidean distance between them; $r_j$ represents the $j$-th feature vector in the neighborhood set $N$; and $N$ is the set of features that together constitute the similar neighborhood, that is, the set corresponding to the top $k$ similar reference features retrieved earlier using cosine similarity. The denominator sums the $-1$ power of the Euclidean distance between $x$ and every feature in the neighborhood set and serves as the normalization term; dividing the $-1$ power of each individual distance by this sum converts it into a proportion, so that the weights of all reference features sum to 1. Weights are thus assigned according to the principle that the smaller the distance, the higher the weight, strengthening the contribution of similar reference samples to the prediction result.

Based on the weighted aggregation module, the predicted gene expression value is obtained by weighted summation of the gene expression value corresponding to the reference feature of each sequencing point and the assigned weight. The specific formula is as follows:

$$\hat{y} = \sum_{i=1}^{k} w_i \, y_i$$

where $\hat{y}$ represents the predicted gene expression value; $w_i$ represents the weight of the feature of the $i$-th reference sequencing point; $y_i$ represents the gene expression value of the $i$-th reference sequencing point; and $k$ is a pre-defined hyperparameter giving the number of reference sequencing points most similar to the feature vector of the image to be tested that are selected during the similar feature retrieval stage, i.e., the total number of similar reference samples participating in the final weighted aggregation. The top $k$ most relevant sequencing point features are obtained through cosine similarity retrieval and used as the reference for prediction. The weighted summation integrates the gene expression information of the similar reference samples according to their weights to generate the final prediction result.
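The three-stage prediction described in this paragraph (cosine similarity retrieval, distance-based weighting, weighted aggregation) can be sketched in NumPy as below. The sketch assumes the weights are inverse Euclidean distances normalized to sum to 1, which is one reading of the "smaller distance, higher weight" rule; the function name `predict_expression` and the small constant `eps` that guards against division by zero are additions for illustration only.

```python
import numpy as np


def predict_expression(query, ref_embeddings, ref_expression, k=3, eps=1e-8):
    """Predict gene expression for one query feature vector.

    query:          (d,) feature vector of the image to be tested.
    ref_embeddings: (n, d) reference embeddings of the sequencing points.
    ref_expression: (n, g) gene expression values of the same sequencing points.
    """
    # 1) Retrieval: cosine similarity to every reference, keep the top k.
    q = query / np.linalg.norm(query)
    r = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    top_k = np.argsort(-(r @ q))[:k]

    # 2) Weighting: inverse Euclidean distance, normalized so the weights sum to 1.
    dists = np.linalg.norm(ref_embeddings[top_k] - query, axis=1)
    inv = 1.0 / (dists + eps)  # eps is an assumed safeguard for zero distance
    weights = inv / inv.sum()

    # 3) Aggregation: weighted sum of the neighbors' expression values.
    return weights @ ref_expression[top_k], top_k, weights
```

For example, with two retrieved neighbors at distances 0.2 and 0.8, the inverse distances are 5 and 1.25, giving weights 0.8 and 0.2, so the closer reference dominates the prediction.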

[0049] This prediction process relies on multi-scale fusion of the target feature vector, taking into account local image morphological details, neighborhood spatial context, and global tissue structure information. The similarity feature retrieval module filters relevant reference samples to avoid interference from irrelevant samples. The weight calculation module quantifies similarity and assigns differentiated weights through Euclidean distance, highlighting the core role of similar samples. The weighted aggregation module efficiently integrates the gene expression information of reference samples, making full use of the effective data accumulated during the training phase and conforming to the feature association logic between samples. This effectively improves the accuracy and biological interpretability of spatial gene expression prediction, enhances the model's generalization ability under different tissue types and sequencing scales, and provides reliable predictive support for subsequent pathological analysis and clinical applications.

[0050] The spatial gene expression prediction method based on multi-scale feature extraction in the embodiments of the present invention has been described above. The spatial gene expression prediction device based on multi-scale feature extraction in the embodiments of the present invention is described below. Referring to Figure 9, one embodiment of the spatial gene expression prediction device based on multi-scale feature extraction in this invention includes: an image acquisition module 901, used to acquire a whole slide image, the whole slide image including multiple sequencing points and a gene expression matrix and a spatial coordinate matrix corresponding to the multiple sequencing points; an image preprocessing module 902, used to preprocess the whole slide image to obtain multiple image patches; a feature extraction and fusion module 903, used to call a pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain an image feature vector; an embedding matrix generation module 904, used to generate an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix using a sequencing point encoder; a learning and contrastive training module 905, used to construct an initial prediction model and to perform learning and contrastive training on the initial prediction model based on the image feature vector and the expression embedding matrix to obtain a target prediction model; and a prediction module 906, used to acquire the feature vector of the image to be tested and input it into the target prediction model for prediction to obtain the predicted gene expression value.

[0051] Based on the same ideas as the methods in the above embodiments, the apparatus provided in this application can implement the methods in the above embodiments.

[0052] Figure 9 above describes the spatial gene expression prediction device based on multi-scale feature extraction in the embodiments of the present invention in detail from the perspective of modular functional entities. The device is described in detail below from the perspective of hardware processing.

[0053] Figure 10 is a schematic diagram of a spatial gene expression prediction device based on multi-scale feature extraction according to an embodiment of the present invention. The spatial gene expression prediction device 1000 based on multi-scale feature extraction can vary considerably depending on configuration or performance. It may include one or more central processing units (CPUs) 1010 (e.g., one or more processors), a memory 1020, and one or more storage media 1030 (e.g., one or more mass storage devices) for storing application programs 1033 or data 1032. The memory 1020 and the storage media 1030 can provide temporary or persistent storage. The program stored in the storage media 1030 may include one or more modules (not shown in the diagram), and each module may include a series of instruction operations for the spatial gene expression prediction device 1000. Furthermore, the processor 1010 can be configured to communicate with the storage medium 1030 and to execute, on the spatial gene expression prediction device 1000, the series of instruction operations in the storage medium 1030, so as to implement the steps of the spatial gene expression prediction method based on multi-scale feature extraction provided in the above-described method embodiments.

[0054] The spatial gene expression prediction device 1000 based on multi-scale feature extraction may further include one or more power supplies 1040, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1060, and/or one or more operating systems 1031, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will understand that the device structure illustrated in Figure 10 does not constitute a limitation on spatial gene expression prediction devices based on multi-scale feature extraction. Such a device may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components.

[0055] The present invention also provides a computer-readable storage medium, which can be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the steps of the spatial gene expression prediction method based on multi-scale feature extraction.

[0056] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the system, device, or unit described above correspond to the processes in the foregoing method embodiments and are not repeated here.

[0057] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0058] Finally, it should be noted that the above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A spatial gene expression prediction method based on multi-scale feature extraction, characterized in that it comprises: acquiring a whole slide image, the whole slide image including multiple sequencing points and a gene expression matrix and a spatial coordinate matrix corresponding to the multiple sequencing points; preprocessing the whole slide image to obtain multiple image patches; invoking a pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain an image feature vector; generating an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix using a sequencing point encoder; constructing an initial prediction model, and performing learning and contrastive training on the initial prediction model based on the image feature vector and the expression embedding matrix to obtain a target prediction model; and acquiring a feature vector of an image to be tested, and inputting the feature vector of the image to be tested into the target prediction model for prediction to obtain a predicted gene expression value.

2. The spatial gene expression prediction method based on multi-scale feature extraction according to claim 1, characterized in that preprocessing the whole slide image to obtain multiple image patches includes: adjusting the size of the whole slide image using an image processing tool to obtain an adjusted image; and using the image processing tool to divide the adjusted image into multiple image patches according to the spatial coordinate matrix.

3. The spatial gene expression prediction method based on multi-scale feature extraction according to claim 1, characterized in that the feature extraction model includes a backbone extraction module, a neighborhood fusion module, a global aggregation module, a dimension mapping module, and a multi-scale attention fusion module; the backbone extraction module is connected to the neighborhood fusion module and the global aggregation module, respectively; the neighborhood fusion module and the global aggregation module are each connected to the dimension mapping module; and the dimension mapping module is connected to the multi-scale attention fusion module; and invoking the pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain an image feature vector includes: based on the backbone extraction module, performing feature extraction on each of the image patches to obtain multiple image patch feature vectors; based on the neighborhood fusion module, determining multiple target image patch feature vectors from the image patch feature vectors, obtaining, for each target image patch feature vector, its k nearest neighbor image patch feature vectors, and performing linear multi-head transformation fusion processing on each target image patch feature vector and its k nearest neighbor image patch feature vectors to obtain multiple initial neighborhood-level context enhancement features; based on the global aggregation module, globally aggregating all the image patch feature vectors to obtain an initial global-level organizational structure feature; based on the dimension mapping module, mapping each image patch feature vector, each initial neighborhood-level context enhancement feature, and the initial global-level organizational structure feature to the highly variable gene dimension used for gene expression prediction, to obtain multiple sequencing point-level target features, multiple target neighborhood-level context enhancement features, and a target global-level organizational structure feature; and based on the multi-scale attention fusion module, performing multi-head cross-attention merging calculation on the multiple sequencing point-level target features, the multiple target neighborhood-level context enhancement features, and the target global-level organizational structure feature to obtain the image feature vector.

4. The spatial gene expression prediction method based on multi-scale feature extraction according to claim 3, characterized in that the multi-scale attention fusion module includes a first fusion submodule, a second fusion submodule, and a merging submodule, which are sequentially connected; and performing, based on the multi-scale attention fusion module, multi-head cross-attention merging calculation on the multiple sequencing point-level target features, the multiple target neighborhood-level context enhancement features, and the target global-level organizational structure feature to obtain the image feature vector includes: based on the first fusion submodule, using the multiple sequencing point-level target features as queries and the multiple target neighborhood-level context enhancement features as keys and values to perform multi-head cross-attention calculation and obtain a neighborhood attention result; based on the second fusion submodule, using the multiple sequencing point-level target features as queries and the target global-level organizational structure feature as keys and values to perform multi-head cross-attention calculation and obtain a global attention result; and inputting the neighborhood attention result and the global attention result into the merging submodule for merging processing to obtain the image feature vector.

5. The spatial gene expression prediction method based on multi-scale feature extraction according to claim 1, characterized in that the sequencing point encoder includes a coordinate encoding module, a multi-head self-attention module, and a projection module, which are connected in sequence; and generating an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix using the sequencing point encoder includes: based on the coordinate encoding module, performing one-hot encoding and dimension mapping on the horizontal and vertical coordinates in the spatial coordinate matrix to obtain a horizontal coordinate feature matrix and a vertical coordinate feature matrix; based on the multi-head self-attention module, performing global dependency modeling on the gene expression matrix, the horizontal coordinate feature matrix, and the vertical coordinate feature matrix to obtain an intermediate fusion feature matrix; and inputting the intermediate fusion feature matrix into the projection module for dimension mapping to obtain the expression embedding matrix.

6. The spatial gene expression prediction method based on multi-scale feature extraction according to claim 1, characterized in that performing learning and contrastive training on the initial prediction model based on the image feature vector and the expression embedding matrix to obtain the target prediction model includes: calculating the cosine similarity between the image feature vector and the expression embedding matrix to obtain a similarity matrix; obtaining a preset target matrix for labeling positive and negative sample pairs, and calculating the total contrastive loss based on the similarity matrix and the target matrix; based on the total contrastive loss, iteratively updating the weight parameters of the initial prediction model using the backpropagation algorithm; and when a preset iteration stopping condition is met, outputting the target prediction model.

7. The spatial gene expression prediction method based on multi-scale feature extraction according to claim 1, characterized in that the target prediction model includes a similar feature retrieval module, a weight calculation module, and a weighted aggregation module, which are sequentially connected; and inputting the feature vector of the image to be tested into the target prediction model for prediction to obtain the predicted gene expression value includes: inputting the feature vector of the image to be tested into the similar feature retrieval module, calculating the cosine similarity between the feature vector of the image to be tested and the expression embedding matrix, and determining, based on the cosine similarity, multiple sequencing point reference features most similar to the feature vector of the image to be tested; inputting the feature vector of the image to be tested and the multiple sequencing point reference features into the weight calculation module, calculating the Euclidean distance between the feature vector of the image to be tested and each sequencing point reference feature, and assigning a weight to each sequencing point reference feature based on the Euclidean distance; and based on the weighted aggregation module, performing a weighted summation of the gene expression value corresponding to each sequencing point reference feature and the assigned weight to obtain the predicted gene expression value.

8. A spatial gene expression prediction device based on multi-scale feature extraction, characterized in that it comprises: an image acquisition module, used to acquire a whole slide image, the whole slide image including multiple sequencing points and a gene expression matrix and a spatial coordinate matrix corresponding to the multiple sequencing points; an image preprocessing module, used to preprocess the whole slide image to obtain multiple image patches; a feature extraction and fusion module, used to call a pre-trained feature extraction model to perform multi-scale feature extraction and fusion processing based on the multiple image patches to obtain an image feature vector; an embedding matrix generation module, used to generate an expression embedding matrix based on the spatial coordinate matrix and the gene expression matrix using a sequencing point encoder; a learning and contrastive training module, used to construct an initial prediction model and to perform learning and contrastive training on the initial prediction model based on the image feature vector and the expression embedding matrix to obtain a target prediction model; and a prediction module, used to acquire the feature vector of the image to be tested and input it into the target prediction model for prediction to obtain the predicted gene expression value.

9. A spatial gene expression prediction device based on multi-scale feature extraction, characterized in that the device comprises a memory and at least one processor, wherein the memory stores instructions, and the at least one processor invokes the instructions in the memory to cause the device to perform the steps of the spatial gene expression prediction method based on multi-scale feature extraction as described in any one of claims 1-7.

10. A computer-readable storage medium storing instructions thereon, characterized in that, when the instructions are executed by a processor, they implement the steps of the spatial gene expression prediction method based on multi-scale feature extraction as described in any one of claims 1-7.