Machine learning enabled gene based prediction of intercellular interactions

WO2026122789A1 · PCT designated stage · Publication Date: 2026-06-11 · GENENTECH INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GENENTECH INC
Filing Date
2025-12-04
Publication Date
2026-06-11


Abstract

A training dataset is generated to include a plurality of training multicellular gene profiles, each training multicellular gene profile being associated with a sample including a plurality of cells and including, for each cell, an expression level of each gene of a plurality of genes. A multicellular gene interaction computation model is trained by applying the multicellular gene interaction computation model to generate, for each gene in a training multicellular gene profile, a contextual gene embedding that captures a co-regulatory relationship between the gene and one or more other genes in the training multicellular gene profile. The trained multicellular gene interaction computation model is applied in one or more applications including, for example, receptor-ligand analysis, multicellular gene network inference, disease-specific gene identification, cell population-level perturbation effect prediction, recovery of spatial colocalization patterns, and/or the like.

Description

Attorney Ref.: 14786-076-228 (103963-228076) / P39709-WO-1

MACHINE LEARNING ENABLED GENE BASED PREDICTION OF INTERCELLULAR INTERACTIONS

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims priority to U.S. Provisional Application No. 63/728,220, entitled “MACHINE LEARNING ENABLED GENE BASED PREDICTION OF INTERCELLULAR INTERACTIONS” and filed on December 5, 2024, the disclosure of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

[0001] The subject matter described herein relates generally to gene co-regulation and more specifically to machine-learning enabled techniques for analyzing co-regulated genes in a multicellular environment.

INTRODUCTION

[0002] Regulation of gene expression, or gene regulation, refers to a wide range of cellular mechanisms for controlling when, where, and how much specific genes (or segments of deoxyribonucleic acid (DNA) in a cell) are expressed. Gene regulation modulates the production of gene products, which are the biochemical materials resulting from the expression of those genes. Ribonucleic acid (RNA) is one example of a gene product whose primary purpose is to carry genetic information (e.g., as messenger RNA (mRNA) molecules) from a cell’s DNA to the ribosomes, where the RNA sequences are translated into strings of amino acid residues called proteins. Proteins, another example of a gene product, have numerous functions including, for example, cellular structural support, DNA replication, metabolic catalysis (e.g., enzyme catalysis), immune response, intra- and intercellular signaling (e.g., through hormones), cell cycle management, and/or the like. Gene regulation plays a crucial role in an organism’s physical development and response to environmental stresses. Malfunctions in gene regulation can lead to many disorders including, for example, cancer, autoimmunity and inflammation, developmental defects, diabetes, cardiovascular diseases, and/or the like.

SUMMARY

[0003] Systems, methods, and articles of manufacture, including computer program products, are provided for machine-learning enabled analysis of co-regulated genes in a multicellular environment.

[0004] In one aspect, there is provided a system for machine-learning enabled analysis of co-regulated genes in a multicellular environment. The system may include at least one data processor and at least one memory. The at least one memory may store instructions that result in operations when executed by the at least one data processor. The operations may include: generating a training dataset to include a plurality of training multicellular gene profiles, each training multicellular gene profile of the plurality of training multicellular gene profiles being associated with a sample including a plurality of cells, and each training multicellular gene profile including, for each cell of the plurality of cells, an expression level of each gene of a plurality of genes; training, based at least on the training dataset, a multicellular gene interaction computation model, wherein the training of the multicellular gene interaction computation model includes applying the multicellular gene interaction computation model to generate, for each gene in a training multicellular gene profile, a contextual gene embedding that captures a co-regulatory relationship between the gene and one or more other genes in the training multicellular gene profile; and applying the trained multicellular gene interaction computation model in one or more applications.

[0005] In another aspect, there is provided a computer-implemented method for machine-learning enabled analysis of co-regulated genes in a multicellular environment. The method may include: generating a training dataset to include a plurality of training multicellular gene profiles, each training multicellular gene profile of the plurality of training multicellular gene profiles being associated with a sample including a plurality of cells, and each training multicellular gene profile including, for each cell of the plurality of cells, an expression level of each gene of a plurality of genes; training, based at least on the training dataset, a multicellular gene interaction computation model, wherein the training of the multicellular gene interaction computation model includes applying the multicellular gene interaction computation model to generate, for each gene in a training multicellular gene profile, a contextual gene embedding that captures a co-regulatory relationship between the gene and one or more other genes in the training multicellular gene profile; and applying the trained multicellular gene interaction computation model in one or more applications.

[0006] In another aspect, there is provided a computer program product for machine-learning enabled analysis of co-regulated genes in a multicellular environment. The computer program product may include a non-transitory computer readable medium storing instructions that result in operations when executed by at least one data processor. The operations may include: generating a training dataset to include a plurality of training multicellular gene profiles, each training multicellular gene profile of the plurality of training multicellular gene profiles being associated with a sample including a plurality of cells, and each training multicellular gene profile including, for each cell of the plurality of cells, an expression level of each gene of a plurality of genes; training, based at least on the training dataset, a multicellular gene interaction computation model, wherein the training of the multicellular gene interaction computation model includes applying the multicellular gene interaction computation model to generate, for each gene in a training multicellular gene profile, a contextual gene embedding that captures a co-regulatory relationship between the gene and one or more other genes in the training multicellular gene profile; and applying the trained multicellular gene interaction computation model in one or more applications.

[0007] In some variations, one or more features disclosed herein including the following features can optionally be included in any feasible combination.

[0008] In some variations, the multicellular gene interaction computation model includes an embedding layer trained to generate, for each gene in the training multicellular gene profile, a gene embedding used to generate the contextual gene embedding of the gene.

[0009] In some variations, the embedding layer is trained to generate two or more similar gene embeddings for two or more co-regulated genes.

[0010] In some variations, each input unit of the embedding layer is fixed to a certain cell type by at least imposing a corresponding structure in each training multicellular gene profile in the training dataset.

[0011] In some variations, each training multicellular gene profile comprises a matrix having a plurality of sets of adjacent rows, and each set of adjacent rows in the matrix is assigned to cells of a different cell type.
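To illustrate the row layout described in paragraphs [0010] and [0011], the following is a minimal sketch, not the applicant's implementation. The function name `build_profile`, the parameter `rows_per_type`, the cell type labels, and the zero-padding of cell types with too few cells are all assumptions made for illustration.

```python
import numpy as np

def build_profile(expr, cell_types, type_order, rows_per_type):
    """Arrange cells so that each cell type occupies one block of adjacent rows.

    expr: (n_cells, n_genes) expression counts; cell_types: one label per cell.
    Cell types with fewer than rows_per_type cells are zero-padded (assumption).
    """
    labels = np.asarray(cell_types)
    n_genes = expr.shape[1]
    blocks = []
    for t in type_order:
        rows = expr[labels == t][:rows_per_type]
        pad = np.zeros((rows_per_type - rows.shape[0], n_genes), dtype=expr.dtype)
        blocks.append(np.vstack([rows, pad]))
    return np.vstack(blocks)

# Four cells, three genes; hypothetical cell type labels "T" and "B".
expr = np.array([[5, 0, 1], [2, 3, 0], [0, 7, 4], [1, 1, 1]])
cell_types = ["T", "B", "T", "B"]
profile = build_profile(expr, cell_types, type_order=["B", "T"], rows_per_type=2)
```

With a fixed `type_order`, the same row positions always hold the same cell type across samples, which is one way each input unit could be tied to a certain cell type.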

[0012] In some variations, the embedding layer includes a neural network trained to generate, for each gene in the multicellular gene profile, a gene expression embedding corresponding to an expression level of the gene across the plurality of cells in a corresponding sample.

[0013] In some variations, the embedding layer includes a gene identity embedding layer trained to generate, for each gene in the multicellular gene profile, a gene identity embedding that is independent of the expression level of the gene across the plurality of cells in the corresponding sample.

[0014] In some variations, the gene embedding of each gene comprises a combination of the gene expression embedding and the gene identity embedding.
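As a rough sketch of paragraphs [0012] to [0014], the example below builds a gene embedding from an expression-dependent part and an identity part. The source states only that the two are combined; summation, the single linear projection `W_expr`, the `log1p` normalisation, and the random (untrained) parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, d = 6, 4, 8

# Hypothetical, randomly initialised parameters; a trained model would learn these.
W_expr = rng.normal(size=(n_cells, d))       # projects a gene's expression across cells
E_identity = rng.normal(size=(n_genes, d))   # one row per gene, independent of expression

def gene_embeddings(profile):
    """profile: (n_cells, n_genes) expression matrix -> (n_genes, d) gene embeddings."""
    x = np.log1p(profile.T)        # (n_genes, n_cells); log1p is an assumed normalisation
    expr_emb = x @ W_expr          # gene expression embedding (varies with expression)
    return expr_emb + E_identity   # combined by summation (assumption)

profile = rng.poisson(2.0, size=(n_cells, n_genes))
emb = gene_embeddings(profile)
```

With an all-zero profile the expression part vanishes and only the identity embedding remains, which shows that the identity part does not depend on expression level.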

[0015] In some variations, the multicellular gene interaction computation model includes a transformer block trained to generate, based at least on the gene embedding of each gene in the training multicellular gene profile, the contextual gene embedding of each gene in the training multicellular gene profile.

[0016] In some variations, the one or more downstream applications include identifying, based at least on a gene embedding and/or a contextual gene embedding of each gene in an input multicellular gene profile, one or more gene sets. Each gene set of the one or more gene sets includes a plurality of co-regulated genes.

[0017] In some variations, the one or more gene sets are identified by at least identifying, based at least on a gene embedding and/or a contextual gene embedding of each gene in an input multicellular gene profile, one or more clusters of genes. Each cluster of genes corresponds to a gene set comprising a plurality of genes having a functional relationship in one or more common biological processes.

[0018] In some variations, the one or more downstream applications include identifying, based at least on a gene embedding and/or a contextual gene embedding of each gene in an input multicellular gene profile, one or more pairs of genes involved in receptor-ligand interactions.

[0019] In some variations, the one or more pairs of genes are identified by at least computing, for each pair of genes in the input multicellular gene profile, a similarity metric quantifying a similarity between a corresponding pair of gene embeddings or contextual gene embeddings.

[0020] In some variations, a pair of genes is identified as being involved in receptor-ligand interactions based at least on the similarity metric of the pair of genes satisfying one or more thresholds.
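The pairwise scoring of paragraphs [0019] and [0020] can be sketched as follows. The source does not name the similarity metric or the threshold; cosine similarity, the value 0.9, the function name `candidate_pairs`, and the gene labels are assumptions for illustration.

```python
import numpy as np

def candidate_pairs(emb, genes, threshold):
    """Return gene pairs whose embeddings have cosine similarity >= threshold."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T                         # (n_genes, n_genes) cosine similarities
    i, j = np.triu_indices(len(genes), k=1)     # each unordered pair once
    keep = sim[i, j] >= threshold
    return [(genes[a], genes[b], float(sim[a, b])) for a, b in zip(i[keep], j[keep])]

# Toy 2-dimensional embeddings with hypothetical gene labels.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
genes = ["LIGAND_A", "RECEPTOR_B", "GENE_C"]
pairs = candidate_pairs(emb, genes, threshold=0.9)
```

Here only the first two genes point in nearly the same direction, so they form the single candidate pair. In practice the embeddings would come from the trained model rather than being written by hand.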

[0021] In some variations, the one or more downstream applications further include identifying, based at least on a gene embedding and/or a contextual gene embedding of each gene in the input multicellular gene profile, one or more gene sets, each gene set comprising a plurality of genes having a functional relationship in one or more common biological processes, identifying a gene set including at least one pair of genes involved in receptor-ligand interactions, and analyzing one or more other genes included in the gene set.

[0022] In some variations, the one or more downstream applications include determining, based at least on a gene embedding or a contextual gene embedding of each gene in an input multicellular gene profile, two or more spatially co-located genes in the input multicellular gene profile.

[0023] In some variations, a spatial co-localization of two or more genes is determined based at least on a similarity metric between two or more corresponding gene embeddings or contextual gene embeddings.

[0024] In some variations, an additional training dataset is generated to include a plurality of spatially defined training multicellular gene profiles. Each spatially defined training multicellular gene profile of the plurality of spatially defined training multicellular gene profiles is associated with a sample containing cells within a certain radius. The trained multicellular gene interaction computation model is finetuned based at least on the additional training dataset. The finetuned multicellular gene interaction computation model is applied to determine the gene embedding or contextual gene embedding of each gene in the input multicellular gene profile.

[0025] In some variations, the one or more downstream applications include identifying one or more condition-specific genes.

[0026] In some variations, the identifying of the one or more condition-specific genes includes identifying, based at least on contextual gene embeddings output by the trained multicellular gene interaction computation model, gene neighborhoods in multicellular gene profiles associated with different conditions, and identifying, based at least on a similarity between the gene neighborhoods in the multicellular gene profiles of different conditions, the one or more condition-specific genes.

[0027] In some variations, a gene neighborhood of a gene includes one or more other genes whose contextual gene embedding exhibits a threshold similarity to a contextual gene embedding of the gene.

[0028] In some variations, the identifying of the one or more condition-specific genes includes applying the trained multicellular gene interaction computation model to generate, for a multicellular gene profile associated with a condition, a first set of contextual gene embeddings for a first plurality of genes comprising the multicellular gene profile of the condition, determining, based at least on the first set of contextual gene embeddings, a first gene neighborhood of a gene present in the multicellular gene profile of the condition, applying the trained multicellular gene interaction computation model to generate, for a multicellular gene profile associated with a different condition, a second set of contextual gene embeddings for a second plurality of genes comprising the multicellular gene profile of the different condition, determining, based at least on the second set of contextual gene embeddings, a second gene neighborhood of the gene present in the multicellular gene profile of the different condition, and identifying, based at least on a similarity between the first gene neighborhood and the second gene neighborhood, the gene as being condition-specific.

[0029] In some variations, the gene is identified as being condition-specific based at least on a similarity metric between the first gene neighborhood and the second gene neighborhood satisfying one or more thresholds.
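The neighborhood comparison of paragraphs [0026] to [0029] might look like the sketch below. The source leaves the similarity metrics and thresholds open; cosine similarity for neighborhoods, Jaccard overlap between neighborhoods, the cutoffs 0.9 and 0.75, and the rule that low overlap flags a gene are assumptions, as are the function names and gene labels.

```python
import numpy as np

def neighborhood(emb, genes, gene, threshold):
    """Genes whose contextual embedding has cosine similarity >= threshold to `gene`."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[genes.index(gene)]
    return {g for g, s in zip(genes, sims) if g != gene and s >= threshold}

def jaccard(a, b):
    """Overlap of two gene sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

genes = ["G1", "G2", "G3", "G4"]
# Toy contextual embeddings of the same genes under two conditions.
healthy = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
disease = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

n_healthy = neighborhood(healthy, genes, "G1", threshold=0.9)
n_disease = neighborhood(disease, genes, "G1", threshold=0.9)
overlap = jaccard(n_healthy, n_disease)
is_condition_specific = overlap < 0.75   # assumed decision rule
```

In this toy case G3 leaves the neighborhood of G1 under the second condition, so the overlap drops to one half and G1 is flagged.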

[0030] In some variations, the condition comprises one disease or disease stage, and wherein the different condition comprises a different disease or different disease stage.

[0031] In some variations, the condition comprises a disease state, and wherein the different condition comprises a non-disease state.

[0032] In some variations, the one or more downstream applications include determining a perturbation effect of a gene on one or more other genes in a same multicellular environment.

[0033] In some variations, a gene expression level across a plurality of different cell types in the same multicellular environment is determined based at least on the perturbation effect of the gene.

[0034] In some variations, the perturbation effect of the gene is determined by at least applying the trained multicellular gene interaction computation model to generate, for a baseline multicellular gene profile including the gene in a baseline state, a first set of contextual gene embeddings for a plurality of genes comprising the baseline multicellular gene profile; applying the trained multicellular gene interaction computation model to generate, for a perturbed multicellular gene profile including the gene in a perturbed state, a second set of contextual gene embeddings for a same plurality of genes comprising the perturbed multicellular gene profile; and determining, based at least on a difference between the first set of contextual gene embeddings and the second set of contextual gene embeddings, the perturbation effect of the gene.
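The baseline-versus-perturbed comparison of paragraph [0034] can be sketched with a stand-in model. `toy_model` is not the disclosed architecture; it is a hypothetical placeholder that lets every gene influence every other gene. Modelling the perturbation as a knockout (expression set to zero) and scoring each gene by the L2 norm of its embedding shift are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, d = 5, 4, 8
W = rng.normal(size=(n_cells, d))

def toy_model(profile):
    """Stand-in for the trained model (assumption): each gene's embedding is its
    projected expression plus the mean over all genes, a crude 'context' term."""
    x = np.log1p(profile.T) @ W                 # (n_genes, d)
    return x + x.mean(axis=0, keepdims=True)

def perturbation_effect(model, baseline, gene_idx):
    """Per-gene shift in contextual embedding after an in silico knockout."""
    perturbed = baseline.copy()
    perturbed[:, gene_idx] = 0                  # perturbed state: gene silenced
    delta = model(perturbed) - model(baseline)  # second set minus first set
    return np.linalg.norm(delta, axis=1)        # one score per gene

baseline = rng.poisson(3.0, size=(n_cells, n_genes)) + 1
effect = perturbation_effect(toy_model, baseline, gene_idx=0)
```

The scores for genes other than the perturbed one indicate how far the perturbation propagates through the shared context. With this stand-in they are all equal, whereas a trained model would be expected to distribute the effect unevenly.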

[0035] In some variations, an indication specific training dataset including a plurality of indication specific multicellular gene profiles is generated. The trained multicellular gene interaction computation model is finetuned based at least on the indication specific training dataset. The finetuned multicellular gene interaction computation model is applied to determine the first set of contextual gene embeddings and the second set of contextual gene embeddings.

[0036] In some variations, the multicellular gene interaction computation model is trained to recover a masked identity and/or a masked expression level of one or more genes in each training multicellular gene profile.

[0037] In some variations, the training of the multicellular gene interaction computation model includes adjusting one or more parameters of the multicellular gene interaction computation model such that the contextual gene embedding generated by the multicellular gene interaction computation model enables a recovery of the masked identity and/or the masked expression level of the one or more genes in each training multicellular gene profile.
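The masked-recovery objective of paragraphs [0036] and [0037] is sketched below for the expression-level case only. The masking fraction, the sentinel value -1.0, the mean squared error loss, and the function names are assumptions; the source does not specify how masking or the loss is implemented.

```python
import numpy as np

def mask_genes(profile, mask_fraction, rng):
    """Hide the expression of a random subset of genes (columns)."""
    n_genes = profile.shape[1]
    n_mask = max(1, int(round(mask_fraction * n_genes)))
    masked_idx = rng.choice(n_genes, size=n_mask, replace=False)
    masked = profile.astype(float).copy()
    masked[:, masked_idx] = -1.0     # sentinel meaning 'masked' (assumption)
    return masked, masked_idx

def masked_mse(predicted, target, masked_idx):
    """Reconstruction loss evaluated only on the masked genes."""
    diff = predicted[:, masked_idx] - target[:, masked_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
profile = rng.poisson(2.0, size=(6, 10))          # 6 cells, 10 genes
masked, idx = mask_genes(profile, mask_fraction=0.2, rng=rng)

loss_perfect = masked_mse(profile.astype(float), profile, idx)  # ideal recovery
loss_masked = masked_mse(masked, profile, idx)                  # no recovery at all
```

Training would adjust the model parameters to push the loss on masked genes from the `loss_masked` end toward the `loss_perfect` end. A masked-identity objective would typically use a classification loss over gene identities instead.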

[0038] Implementations of the current subject matter can include, but are not limited to, methods consistent with the descriptions provided herein as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a non-transitory computer-readable or machine-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including, for example, a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), a direct connection between one or more of the multiple computing systems, etc.

[0039] The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to co-regulated genes in a multicellular environment, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.

DESCRIPTION OF DRAWINGS

[0040] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

[0041] The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,

[0042] FIG. 1A depicts a system diagram illustrating an example of a multicellular gene analysis system, in accordance with some example embodiments;

[0043] FIG. 1B depicts a schematic diagram illustrating a flow of data through a multicellular gene interaction computation model, in accordance with some example embodiments;

[0044] FIG. 1C depicts a schematic diagram illustrating a flow of data through an embedding layer, in accordance with some example embodiments;

[0045] FIG. 1D depicts a schematic diagram illustrating training objectives for a multicellular gene interaction computation model, in accordance with some example embodiments;

[0046] FIG. 2A depicts a flowchart illustrating an example of a process for machine learning enabled gene based prediction of intercellular interactions, in accordance with some example embodiments;

[0047] FIG. 2B depicts a flowchart illustrating an example of a process for generating contextual gene embeddings, in accordance with some example embodiments;

[0048] FIG. 3 depicts an analysis of zero-shot gene sets identified by an example of a multicellular gene interaction computation model, in accordance with some example embodiments;

[0049] FIG. 4 depicts receptor-ligand analysis using an example of a multicellular gene interaction computation model, in accordance with some example embodiments;

[0050] FIG. 5 depicts an analysis of multicellular context-aware gene embeddings, in accordance with some example embodiments;

[0051] FIG. 6 depicts spatial patterns across datasets recovered by the multicellular gene interaction computation model, in accordance with some example embodiments;

[0052] FIG. 7 depicts in silico genetic perturbation effect estimation at cell population level, in accordance with some example embodiments; and

[0053] FIG. 8 depicts a block diagram illustrating an example of a computing system, in accordance with some example embodiments.

[0054] When practical, similar reference numbers denote similar structures, features, or elements.

DETAILED DESCRIPTION

[0055] Gene regulation and cell communication are closely related processes that influence one another in complex ways. For example, gene regulation, which controls the expression of certain genes, can modulate when, where, and how much proteins serving as signaling molecules (e.g., hormones) are produced. The interaction between different cells in a tissue, including the interaction between the proteins expressed by those cells, may be crucial for orchestrating an immune response and maintaining tissue homeostasis. This interplay can play a key role in the context of various diseases, such as cancer, where pathological conditions are the result of aberrant gene or protein interactions. As such, deciphering which genes mediate the communication between cells remains a major challenge that thwarts the development of novel therapeutics. Single-cell sequencing techniques (e.g., single-cell RNA sequencing) and spatial transcriptomics, while capable of revealing the patterns of individual cells within their tissue environment, provide merely a snapshot of how the cells communicate and function together within a tissue sample. The higher-level regulatory mechanisms that may be present in a multicellular environment remain difficult to decipher due to the high context-dependency and inherent heterogeneity of cell populations within multicellular environments. For instance, the expression of a first gene by a first cell and the production of a first protein resulting therefrom may be regulated by a second protein produced by the expression of a second gene by a second cell. As such, the presence (or absence) of the second cell in a tissue as well as the timing, location, and level of expression of the second gene may impact the expression of the first gene by the first cell as well as, in some cases, the expression of a third gene by a third cell regulated by the first protein produced by the expression of the first gene. The context-dependent interactions between the first gene, the second gene, and in some cases, the third gene, elude conventional computational techniques.

[0056] Conventional computational techniques model individual cells while ignoring the intra- and intercellular interactions that are present in a sample. As such, conventional computational techniques fail to capture context-dependent interactions between genes in a multicellular environment and thus fall short of revealing the mechanisms governing multicellular interactions, which are key to better understanding disease mechanisms and the impact (e.g., safety, efficacy, and/or the like) of treatments in multicellular environments. For example, some computational methods rely on existing knowledge, such as known receptor-ligand pairs, the spatial organization of cells within tissues (e.g., spatial transcriptomics), and/or the like. Those computational techniques that are less dependent on existing knowledge require a large number of cells per sample, which scales even higher given the greater complexity of multicellular environments, to ensure robust performance. Large-scale pretrained models are limited to modelling intercellular interactions across cell states or interactions between cells, rather than the genes mediating these interactions.

[0057] Various embodiments of the present disclosure overcome the limitations of conventional computational techniques by at least modeling the contextual interactions between genes in a multicellular environment. For instance, in some example embodiments, a multicellular gene interaction computation model may be trained to generate, for each gene present in a sample, a gene embedding that captures the context-dependency of the gene. As used herein, the term “context” refers to the different cell types present in a multicellular environment as well as the genes expressed by these cells. For instance, the role of a first gene expressed by a first cell in a multicellular environment, including that of a first protein produced by the expression of the first gene, may be dependent on a second cell present in the same multicellular environment at least because the expression of the first gene may be regulated by a second protein produced by the second gene expressed by the second cell. Accordingly, the multicellular gene interaction computation model may generate, for the first gene, a context-aware gene embedding capable of adapting to the expression of other genes, such as the second gene, in the same multicellular environment.

[0058] As noted, cells within the same environment interact by expressing genes. At least some of the proteins encoded by the genes serve as signaling molecules and receptors that enable cellular communication and activity coordination. The genes involved in cell-cell communication include both genes that interact in a signaling pathway at the intracellular level as well as genes that orchestrate communication between cells. This signaling is highly context-dependent, such that deciphering the true function of any gene involved in these signaling pathways requires examining the population of cells in the environment rather than studying each individual cell in isolation. As described in more detail below, various embodiments of the multicellular gene interaction computation model described herein are trained to capture the context of a multicellular environment by learning the relationship between genes across diverse cell states. The contextual gene embeddings generated by the multicellular gene interaction computation model convey the context of each individual gene. Such contextual gene embeddings may be applied toward a variety of applications including, for example, receptor-ligand analysis, multicellular gene network inference, disease-specific gene identification, cell population-level perturbation effect prediction, and/or the like.

[0059] FIG. 1A depicts a system diagram illustrating an example of a multicellular gene analysis system 100, in accordance with some example embodiments. As shown in FIG. 1A, the multicellular gene analysis system 100 may include a gene analysis engine 110, a task engine 120, and a client device 130. In the example of the multicellular gene analysis system 100 shown in FIG. 1A, the gene analysis engine 110, the task engine 120, and the client device 130 may be communicatively coupled via a network 140. It should be appreciated that the client device 130 may be a processor-based device including, for example, a workstation, a desktop computer, a laptop computer, a smartphone, a tablet computer, a wearable apparatus, and/or the like. The network 140 may be a wired network and/or a wireless network including, for example, a local area network (LAN), a virtual local area network (VLAN), a wide area network (WAN), a public land mobile network (PLMN), the Internet, and/or the like.

[0060] In some example embodiments, the gene analysis engine 110 may include a multicellular gene interaction computation model 115 trained to operate on an input multicellular gene profile 150 of a sample including, for example, a set of single-cell profiles (e.g., single-cell RNA sequencing data) of cells from the sample or neighboring cells in spatial transcriptomics. In some cases, the input multicellular gene profile 150 may include, for each cell of an N quantity of cells present in the sample, an expression level of each of an N quantity of genes (e.g., an RNA sequence count corresponding to the number of reads that map to the gene in RNA sequencing).

[0061] In some example embodiments, the multicellular gene interaction computation model 115 may include one or more of an embedding layer 162 and a transformer block 164 (e.g., including one or more transformer layers). In some cases, the embedding layer 162 may be trained to generate, for each of the N genes present in the input multicellular gene profile 150, a gene embedding. For example, the embedding layer 162 may generate an N quantity of gene embeddings including, for example, a first gene embedding 163a, a second gene embedding 163b, and/or the like. Each one of the N quantity of gene embeddings may correspond to one of the N quantity of genes. As described in more detail below, each one of the N gene embeddings may combine a gene expression embedding (e.g., a gene expression latent vector) whose value(s) varies with the expression level (e.g., the RNA sequence count) of the corresponding gene and a gene identity embedding that is independent of the expression level (e.g., the RNA sequence count) of the gene. Moreover, in some cases, the transformer block 164 may operate on the N gene embeddings in order to generate, for each of the N genes in the input multicellular gene profile 150, a contextual gene embedding. For instance, the transformer block 164 may generate an N quantity of contextual gene embeddings, such as a first contextual gene embedding 165a and a second contextual gene embedding 165b. One or more of the resulting N contextual gene embeddings, such as the first contextual gene embedding 165a and the second contextual gene embedding 165b, may be used in one or more applications including, for example, receptor-ligand analysis, multicellular gene network inference, disease-specific gene identification, cell population-level perturbation effect prediction, and/or the like.

[0062] To further illustrate, FIG. 1B depicts a schematic diagram illustrating a flow of data through the multicellular gene interaction computation model 115. As shown in FIG. 1B, the multicellular gene interaction computation model 115 may ingest, as input, the input multicellular gene profile 150. In the example shown in FIG. 1B, the input multicellular gene profile 150 may correspond to single-cell sequencing data (e.g., single-cell RNA sequencing data) for cells from a sample (e.g., containing cells from a single donor and batch). The input multicellular gene profile 150 may include, for each of an N quantity of cells in the sample, an expression level (e.g., an RNA sequence count) of each of an N quantity of genes. As described in more detail below, the multicellular gene interaction computation model 115 may generate, for each of the N quantity of genes in the input multicellular gene profile 150, a contextual gene embedding that captures the context of the gene. For example, FIG. 1B shows that the N quantity of contextual gene embeddings that are generated by the multicellular gene interaction computation model 115 includes the first contextual gene embedding 165a and the second contextual gene embedding 165b. In some cases, the context of each individual gene in the sample may include the N - 1 other genes from the same cell or different cells that are also present in the sample. The multicellular gene interaction computation model 115 may generate, for each of the N genes in the sample, a contextual gene embedding that captures the interaction between the gene and the other N - 1 genes in the same sample, including the co-regulatory mechanisms that may exist between two or more genes.

[0063] In some example embodiments, the multicellular gene interaction computation model 115 may treat each of the N quantity of genes in the input multicellular gene profile 150 as an individual token. As noted, the multicellular gene interaction computation model 115 may ingest, as input, the expression level (e.g., RNA sequence count) of each of the N genes across N distinct cell types before computing a contextual gene embedding for each of the N genes. In some cases, each contextual gene embedding may be computed, by the transformer block 164, based on a corresponding gene embedding generated by the embedding layer 162. For example, in some cases, the embedding layer 162 may generate, for each of the N genes in the input multicellular gene profile 150, a gene embedding that combines a gene expression embedding (e.g., a gene expression latent vector) and a gene identity embedding. In some cases, the value(s) of the gene expression embedding (e.g., the gene expression latent vector) may vary with the expression level (e.g., the RNA sequence count) of the gene whereas the gene identity embedding is independent of the expression level (e.g., the RNA sequence count) of the gene. In some cases, the transformer block 164 may ingest, as input, the N gene embeddings corresponding to the N quantity of genes in order to generate, for each of the N quantity of genes, a contextual gene embedding that is dependent upon the other N - 1 quantity of genes in the same sample.
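
By way of a non-limiting illustration, the dependence of each contextual gene embedding on the other genes in the sample may be sketched with a single-head, scaled dot-product self-attention step over gene tokens. The sketch is a deliberate simplification and not the transformer block 164 itself: it assumes identity query, key, and value projections, omits learned weights, multiple heads, and feed-forward sublayers, and the function name `self_attention` and the toy embeddings are hypothetical.

```python
import math

def self_attention(embeddings):
    """Single-head scaled dot-product self-attention with identity projections
    (an assumption made for brevity; a trained block would use learned ones)."""
    d = len(embeddings[0])
    contextual = []
    for q in embeddings:
        # Attention scores of this gene (query) against every gene (keys).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in embeddings]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Contextual embedding: weighted average of all gene embeddings (values).
        contextual.append(
            [sum(w * v[j] for w, v in zip(weights, embeddings)) for j in range(d)]
        )
    return contextual

# Toy example with N = 3 gene tokens and embedding dimension D = 2.
gene_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(gene_embeddings)

# Changing only the third gene's embedding alters the first gene's
# contextual embedding, because every token attends to every other token.
perturbed = self_attention([[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]])
```

In this sketch, each output vector mixes information from all N tokens, which is the property the contextual gene embeddings rely on.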

[0064] To further illustrate, FIG. 1C depicts a schematic diagram illustrating the flow of data through the embedding layer 162. As shown in FIG. 1C, gene i (of the N quantity of genes) in the input multicellular gene profile 150 may be associated with an expression level (e.g., RNA sequence count) for each of the N quantity of cells present in the input multicellular gene profile 150. The expression level of gene i across the N quantity of cells captures its context-dependent activity within the sample. In some cases, the embedding layer 162 may include a neural network 166 (e.g., a feed forward neural network (FFNN) and / or the like) trained to generate, based at least on the expression level (e.g., RNA sequence count) of gene i across the N quantity of cells, a gene expression embedding 171. In some cases, the gene expression embedding 171 may be a gene expression latent vector whose value(s) is dependent on the expression levels (e.g., normalized RNA sequence counts) of gene i in the sample. In some cases, the embedding layer 162 may also include a gene identity embedding layer 168 trained to generate a gene identity embedding 173 that is independent of the expression levels (e.g., RNA sequence counts) of gene i in the sample. That is, while the gene expression embedding 171 of gene i may vary (e.g., depending on the RNA sequence counts) across different samples, the value(s) of the gene identity embedding 173 remains the same. In some cases, the gene expression embedding 171 (e.g., the gene expression latent vector) and the gene identity embedding 173 may be combined (e.g., concatenated, summed, and / or the like) to form a gene embedding 165. In some cases, the gene embedding 165 may be conditional on the expression level (e.g., normalized RNA sequence count) of gene i.

[0065] Referring again to FIG. 1C, in some cases, the neural network 166 may be shared across all genes (e.g., all N genes) in order to avoid overfitting. That is, the neural network 166 may be trained to generate embeddings for all N genes instead of being trained to specialize in one or more genes. In some cases, the input units of the neural network 166 may be fixed to specific cell types during pretraining based on the cell type composition. For example, FIG. 1C shows that the same rows of the input are reserved for the expression level of gene i from cells of the same cell type across all samples. Doing so may reserve a first subset of input units for a first cell type (e.g., CD8 T cells), a second subset of input units for a second cell type (e.g., macrophages), up to an N-th subset of input units for an N-th cell type. Allocating the input units of the neural network 166 in this manner implicitly accounts for cell type specific gene expression and helps prevent the neural network 166 from learning average gene expression loadings without regard for cell type. As described in more detail below, the neural network 166 and the gene identity embedding layer 168 may be pretrained from scratch to generate, for co-regulated genes, gene embeddings that are proximally located in the embedding (or latent) space.

[0066] FIG. 1D depicts a schematic diagram illustrating training objectives for the multicellular gene interaction computation model 115, in accordance with some example embodiments. In some example embodiments, the multicellular gene interaction computation model 115 may undergo pretraining to learn gene co-regulation across different cells by predicting masked gene identity and masked gene expression levels (e.g., RNA sequence counts) based on context (e.g., other genes present in the sample). For example, in some cases, the multicellular gene interaction computation model 115 may be trained on a training corpus that contains cells from different donors across a variety of different tissues and diseases (e.g., over 4.2 thousand samples containing 14 million cells of more than 300 different cell types from more than 3 thousand donors across 10 different tissues and 13 different diseases). Moreover, the identity and / or expression levels of a random selection of genes (e.g., 15%) in each multicellular gene profile used for training may be masked. In some cases, the multicellular gene interaction computation model 115 may be pretrained to learn meaningful latent representations of gene expression by at least combining the masked learning objective with a reconstruction loss associated with reconstruction of the masked gene expressions such that the expression levels of a gene can be recovered from the contextual gene embedding generated by the multicellular gene interaction computation model 115 for the gene.

[0067] FIG. 1D depicts an example of a training multicellular gene expression profile 176 with the gene expression levels of an N quantity of genes (e.g., Gene 1, Gene 2, ..., Gene N) across an N quantity of cells (e.g., Cell 1, Cell 2, ..., Cell N). In the example shown in FIG. 1D, the identity and / or expression levels of Gene 1 and Gene 2 of the training multicellular gene expression profile 176 are masked. In some cases, the pretraining of the multicellular gene interaction computation model 115 may include adjusting one or more parameters (e.g., weights, biases, and / or the like) such that the contextual gene embedding 175 generated by the multicellular gene interaction computation model 115 for the training multicellular gene profile 170 enables the recovery (e.g., by a classifier 170) of the identity and / or the recovery (e.g., by a decoder 175) of the gene expression levels of Gene 1 and Gene 2.

[0068] FIG. 2A depicts a flowchart illustrating an example of a process 200 for machine learning enabled gene based prediction of intercellular interactions, in accordance with some example embodiments. Referring to FIGS. 1A and 2A, the process 200 may be performed by the gene analysis engine 110 to train and apply the multicellular gene interaction computation model 115. As described in more detail below, the multicellular gene interaction computation model 115 may be trained to generate, for each gene present in the input multicellular gene profile 150 of a sample, a contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) that accounts for the context of the gene. For example, the role of a first gene expressed by a first cell in a multicellular environment, including that of a first protein produced by the expression of the first gene, may be dependent on a second cell present in the same multicellular environment at least because the expression of the first gene may be regulated by a second protein produced by the second gene expressed by the second cell. Unlike conventional computational techniques that focus on individual cells while ignoring intra- and intercellular interactions, the multicellular gene interaction computation model 115 may be trained to learn the relationship between genes across diverse cell states. Accordingly, the contextual gene embeddings generated by the multicellular gene interaction computation model 115 may be advantageous at least because the contextual gene embeddings capture the context-dependent activity of genes in a multicellular environment, such as one populated by the cells and genes present in the input multicellular gene profile 150, thereby providing a more accurate representation of each gene.
In some cases, the advantages associated with various example embodiments of the contextual gene embeddings described herein may further extend to improving the accuracy and computational efficiency of downstream tasks leveraging the contextual gene embeddings including, for example, receptor-ligand analysis, multicellular gene network inference, disease-specific gene identification, cell population-level perturbation effect prediction, and / or the like.

[0069] At 202, a training dataset is generated to include a plurality of training multicellular gene profiles. In some example embodiments, the training corpus for training a multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include a set of training multicellular gene profiles (e.g., the training multicellular gene profile 170). In some cases, each training multicellular gene profile may be associated with a sample containing multiple cells of one or more cell types (e.g., Cell 1, 2, ..., N). Moreover, in some cases, the training multicellular gene profile (e.g., the training multicellular gene profile 170) of a sample may include, for each gene present in the sample, an expression level (e.g., RNA sequence count) of the gene. In some cases, the training corpus may include training multicellular gene profiles (e.g., the training multicellular gene profile 170) of samples from multiple different donors. In some cases, the samples may include different types of tissues collected from a variety of different organs. Furthermore, in some cases, the samples may be collected from donors with different diseases or different states of diseases. For example, in some cases, the training corpus may include the training multicellular gene profiles (e.g., the training multicellular gene profile 170) of more than 4.2 thousand samples from over 3 thousand donors. Collectively, the training multicellular gene profiles may encompass over 14 million cells of more than 300 different cell types. The more than 4.2 thousand samples may originate from over 10 different types of tissues including, for example, blood, breast, lung, brain, lung parenchyma, respiratory airway, liver, omentum (stomach), skin, and / or the like.
The more than 4.2 thousand samples may be associated with over 13 different diseases and / or disease states including, for example, normal (healthy), COVID-19, glioblastoma, malignant ovarian serous tumor, systemic lupus erythematosus, pulmonary fibrosis, and / or the like.

[0070] At 204, a multicellular gene interaction computation model is trained based at least on the training dataset. In some example embodiments, the training of a multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115 in FIG. 1A) may include applying the multicellular gene interaction computation model to generate, for each gene in a training multicellular gene profile, a contextual gene embedding that captures a co-regulatory relationship between the gene and one or more other genes in the training multicellular gene profile. In some cases, a multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may undergo pretraining in order to learn the relationship between genes across diverse cell states. In some cases, the pretraining of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include applying the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) to predict masked gene identities and / or masked gene expression levels (e.g., RNA sequence counts) in each training multicellular gene profile (e.g., the training multicellular gene profile 170) in the training corpus. For example, a training multicellular gene profile (e.g., the training multicellular gene profile 170) may include the expression level of different genes from cells of various different cell types present in a sample. The identity and / or the expression levels of a random selection (e.g., 15%) of these genes may be masked.
The pretraining of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include adjusting one or more parameters (e.g., weights, biases, and / or the like) of the multicellular gene interaction computation model such that the contextual gene embeddings generated for each gene with the masked identity and / or the masked expression levels enable the recovery (e.g., by a classifier) of the identity of the gene and / or the recovery (e.g., by a decoder) of the gene expression levels of the gene. In some cases, despite the identity and / or the expression levels of a gene in a training multicellular gene profile (e.g., the training multicellular gene profile 170) being masked, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be trained to generate, based on the other (unmasked) genes present in the training multicellular gene profile, a meaningful contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) that can be used to recover the identity as well as the expression levels of the gene.

[0071] To further illustrate, consider an example in which 15% of the genes in each training multicellular gene profile are masked. It should be appreciated that this 15% masking rate may be achieved through a combination of various masking types that includes, for example, the following:
• 70% are masked gene identity embeddings, wherein for gene $g_i$ in an expression matrix $X \in \mathbb{R}^{N \times M}$ having an N number of genes and an M number of cells, its gene identity embedding is replaced with a special token embedding [MASK].
• 10% are masked gene expression embeddings, achieved by replacing the gene expression embedding with a [MASK] vector.
• 10% are swapped gene identity embeddings, where the true gene identity embedding is swapped with a random gene identity embedding.
• 10% where neither the gene identity nor the gene expression is masked, leaving the embedding unchanged.
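
As a minimal, non-limiting sketch of the masking scheme listed above, the following selects 15% of the genes in a profile and assigns each selected gene one of the four masking types in the stated 70 / 10 / 10 / 10 proportions. Drawing the type independently for each selected gene (rather than enforcing exact counts) is an assumption, and the names `assign_masks`, `MASK_TYPES`, and the profile size of 2000 genes are hypothetical.

```python
import random

MASK_RATE = 0.15  # fraction of genes selected per profile (from the text)
# Masking types and their shares of the selected genes (from the text).
MASK_TYPES = ["mask_identity", "mask_expression", "swap_identity", "unchanged"]
MASK_WEIGHTS = [0.70, 0.10, 0.10, 0.10]

def assign_masks(num_genes, rng):
    """Return {gene_index: masking_type} for a random 15% of the genes."""
    num_selected = max(1, round(MASK_RATE * num_genes))
    selected = rng.sample(range(num_genes), num_selected)
    # Assumption: each selected gene draws its masking type independently.
    types = rng.choices(MASK_TYPES, weights=MASK_WEIGHTS, k=num_selected)
    return dict(zip(selected, types))

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
masks = assign_masks(num_genes=2000, rng=rng)
```

Genes absent from `masks` are left untouched; genes tagged `unchanged` are still treated as prediction targets even though their embeddings are not altered.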

[0072] By varying the masking types - including masking gene identities, gene expressions, swapping gene identities, and leaving some unchanged - the multicellular gene interaction computation model is encouraged to leverage both gene identity and gene expression information. Doing so encourages the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) to capture complex dependencies among genes and understand the impact of gene expression patterns across different cells, thereby enhancing its ability to model gene interactions in a multicellular environment. In other words, training the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) on a combination of masked and swapped gene identities and gene expression is advantageous at least because the resulting multicellular gene interaction computation model may be more robust and capable of leveraging both gene identity and gene expression levels when generating contextual gene embeddings, such as the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like.

[0073] In some cases, the gene probabilities for each masked gene may be computed by passing the output of the transformer block $h_i \in \mathbb{R}^{D}$ of a masked gene $g_i$ through a classifier (e.g., the classifier 170). This classifier (e.g., the classifier 170) may be implemented as a weight matrix $W \in \mathbb{R}^{D \times G}$, where D is the embedding dimension and G is the total number of genes in the pretraining corpus. This weight matrix W may be learned over the course of pretraining and used for analysis on downstream tasks. The masked modeling loss $\mathcal{L}_{mask}$ may be defined as follows:
$$\mathcal{L}_{mask} = -\sum_{i \in \mathcal{M}} \log P(g_i \mid E_{masked})$$
wherein $\mathcal{M}$ is the set of positions of masked gene identities, $g_i$ is the true gene identity at position i, and $E_{masked}$ represents the input embedding with applied masking.
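
For illustration only, the classifier and the masked modeling loss described above may be sketched as a softmax over the logits $h_i^{\top} W$ followed by a negative log-likelihood summed over the masked positions. The use of a softmax, the absence of a bias term, the summation without normalization, and the function name `masked_cross_entropy` are assumptions of this sketch.

```python
import math

def masked_cross_entropy(hidden, W, true_ids, masked_positions):
    """Sum of -log P(true gene identity) over the masked positions.

    hidden: list of D-dimensional transformer outputs h_i, one per position.
    W: D x G weight matrix given as a list of D rows, each of length G.
    """
    num_classes = len(W[0])
    loss = 0.0
    for i in masked_positions:
        h = hidden[i]
        # Logits over all G genes: h_i^T W.
        logits = [sum(h[d] * W[d][g] for d in range(len(h))) for g in range(num_classes)]
        # Log-sum-exp computed stably by subtracting the largest logit.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss -= logits[true_ids[i]] - log_norm
    return loss

# Toy example: D = 2, G = 3, two positions, only position 0 is masked.
hidden = [[2.0, 0.0], [0.0, 2.0]]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
loss = masked_cross_entropy(hidden, W, true_ids=[0, 1], masked_positions=[0])
```

Only positions listed in `masked_positions` contribute to the loss, mirroring the restriction of the sum to masked gene identities.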

[0074] For the gene expression reconstructive objective, the last hidden state from the transformer block (e.g., the transformer block 164) may be used to reconstruct the input normalized gene expression for each gene. This may be performed for all genes in a sample, including genes with masked gene identity as well as those with masked gene expression. To reconstruct the gene expression, a decoder (e.g., the decoder 175) implemented as a 2-layer multi-layer perceptron (MLP), $\mathrm{MLP}_r$, may be trained to operate on a D-dimensional vector to project it into an M-dimensional vector equal to the input gene expression vector as follows:
$$\hat{x}_{g_i} = \mathrm{MLP}_r(h_i)$$
where $h_i \in \mathbb{R}^{D}$ is the hidden representation of gene $g_i$ from the transformer block and $\hat{x}_{g_i}$ is the reconstructed gene expression vector.

[0075] In some cases, the reconstruction loss $\mathcal{L}_{recon}$ may be defined, for example, as the mean squared error (MSE) between the reconstructed and true gene expression vectors:
$$\mathcal{L}_{recon} = \frac{1}{L} \sum_{i=1}^{L} \lVert \hat{x}_{g_i} - x_{g_i} \rVert_2^2$$
wherein L is the total number of genes in the input sequence.

[0076] The final loss function includes the two aforementioned losses, weighted by $\alpha$ (masked cross-entropy loss) and $1 - \alpha$ (reconstruction loss):
$$\mathcal{L} = \alpha \mathcal{L}_{mask} + (1 - \alpha) \mathcal{L}_{recon}$$

[0077] It should be appreciated that $\alpha$ may be a large value due to the reconstruction loss $\mathcal{L}_{recon}$ having significantly larger absolute values. By optimizing this combined loss function, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) learns to predict masked gene identities and accurately reconstruct gene expression profiles, capturing intricate patterns and dependencies within the multicellular context. This masked learning objective is important to learning gene-gene interactions. However, gene expression reconstruction improves the stability of training and the successful use of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) for downstream tasks, such as in silico perturbation effect prediction at cell population level.
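
To make the weighting concrete, the reconstruction loss and the combined objective may be sketched as follows. Averaging the squared error over the L genes only (and not additionally over the M cells), the value of 0.9 for the weight, and the function names `reconstruction_mse` and `combined_loss` are assumptions of this sketch rather than values stated herein.

```python
def reconstruction_mse(reconstructed, true):
    """Mean over L genes of the squared error between M-dimensional vectors."""
    num_genes = len(true)
    squared_error = sum(
        sum((r - t) ** 2 for r, t in zip(rec_vec, true_vec))
        for rec_vec, true_vec in zip(reconstructed, true)
    )
    return squared_error / num_genes

def combined_loss(mask_loss, recon_loss, alpha):
    """Weighted sum: alpha * L_mask + (1 - alpha) * L_recon."""
    return alpha * mask_loss + (1.0 - alpha) * recon_loss

# Toy example with L = 2 genes and M = 2 cells.
true = [[1.0, 2.0], [0.0, 1.0]]
reconstructed = [[1.0, 1.0], [0.0, 3.0]]
recon = reconstruction_mse(reconstructed, true)  # (1 + 4) / 2
# alpha = 0.9 is a hypothetical "large" weight on the masked loss.
total = combined_loss(mask_loss=0.5, recon_loss=recon, alpha=0.9)
```

With a weight close to 1, the larger-magnitude reconstruction term is scaled down so that it does not dominate the masked modeling term.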

[0078] At 206, the multicellular gene interaction computation model is applied in one or more downstream applications. In some example embodiments, once pretrained, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be applied towards a variety of downstream applications including, for example, receptor-ligand analysis, multicellular gene network inference, disease-specific gene identification, cell population-level perturbation effect prediction, and / or the like. For example, in some cases, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be applied to generate, for each gene that is present in an input multicellular gene profile (e.g., the input multicellular gene profile 150), a contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) that captures the co-regulatory relationships that may exist between different genes in the input multicellular gene profile (e.g., the input multicellular gene profile 150). In some cases, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include an embedding layer (e.g., the embedding layer 162) that generates, for each individual gene in the input multicellular gene profile (e.g., the input multicellular gene profile 150), a gene embedding (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) that combines (e.g., concatenates, sums, and / or the like) a gene expression embedding that varies in accordance with the expression levels (e.g., RNA sequence counts) of the gene in the input multicellular gene profile (e.g., the input multicellular gene profile 150) and a gene identity embedding that is independent of the expression levels of the gene.
In some cases, the embedding layer (e.g., the embedding layer 162) may be pretrained to generate similar gene embeddings (e.g., embeddings that are proximally located in the embedding (or latent) space) for co-regulated genes. In some cases, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may also include a transformer block (e.g., the transformer block 164) that generates, based at least on the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) of the genes in the input multicellular gene profile (e.g., the input multicellular gene profile 150), a contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) for each gene that captures the co-regulatory relationships that may exist between the genes.
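
As one hedged example of how proximity in the embedding (or latent) space might be examined, genes may be ranked by the cosine similarity of their embeddings. Cosine similarity is only one possible proximity measure and is an assumption of this sketch; the gene labels, the embedding values, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_genes(query, embeddings, k=2):
    """Rank the other genes by cosine similarity to the query gene."""
    scores = {
        gene: cosine_similarity(embeddings[query], vec)
        for gene, vec in embeddings.items()
        if gene != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical 3-dimensional embeddings for made-up gene labels.
embeddings = {
    "GENE_A": [1.0, 0.9, 0.0],
    "GENE_B": [0.9, 1.0, 0.1],
    "GENE_C": [0.0, 0.1, 1.0],
}
neighbors = nearest_genes("GENE_A", embeddings, k=1)
```

Under this measure, genes whose embeddings point in nearly the same direction are returned first as candidate co-regulated genes.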

[0079] As described in more detail below, in some cases, the pretrained multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be finetuned for certain applications. For example, in some cases, the pretrained multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be finetuned on spatial transcriptomics data such that the contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) capture the spatial co-localization of genes, such as those expressed by cells that interact through adhesion. Compared to conventional approaches, the contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated by the pretrained multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115), with finetuning or without (e.g., zero-shot), are advantageous in their ability to enable the recovery of the spatial colocalization patterns that are present between genes. Alternatively, the pretrained multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be deployed in zero-shot applications where the pretrained multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) is applied without further training (or finetuning) on any application specific training datasets.
Moreover, in some cases, either the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) generated by the embedding layer (e.g., the embedding layer 162) of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) or the contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated by the transformer block (e.g., the transformer block 164) may be used in the downstream applications.

[0080] FIG. 2B depicts a flowchart illustrating an example of a process 250 for generating contextual gene embeddings, in accordance with some example embodiments. Referring to FIGS. 1A and 2A-B, the process 250 may be performed by the multicellular gene interaction computation model 115 as part of operation 204 and / or 206 of the process 200 shown in FIG. 2A. As described in more detail below, the multicellular gene interaction computation model 115 may be applied to generate, for each gene in the input multicellular gene profile 150, a contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) that captures the co-regulatory relationships that exist between the genes in the input multicellular gene profile 150. The contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) of a gene is representative of the context-dependency of the role played by the gene in a multicellular environment in which the gene may regulate as well as be regulated by other genes. The contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated by the multicellular gene interaction computation model 115 may be more informative for downstream applications that rely on insights into cell-cell communication pathways.

[0081] At 252, an input multicellular gene profile including a plurality of genes is received. In some example embodiments, the input multicellular gene profile (e.g., the input multicellular gene profile 150) may include, for each gene of the plurality of genes (e.g., genes 1, 2, ..., N), an expression level (e.g., RNA sequence count) across the different cells present in a sample associated with the input multicellular gene profile (e.g., the input multicellular gene profile 150). In some cases, the sample may be a tissue sample that contains multiple cells of multiple different cell types. Accordingly, in some cases, the input multicellular gene profile (e.g., the input multicellular gene profile 150) ingested by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be structured as a matrix, with each column of the matrix corresponding to a gene and each row of the matrix corresponding to an individual cell. In some cases, the matrix may be further structured such that blocks of adjacent rows within the matrix are allocated to different cell types. For example, a first set of rows in the matrix may be occupied by the gene expression levels of cells having a first cell type while a second set of rows in the matrix may be occupied by the gene expression levels of cells having a second cell type. As described in more detail below, the input multicellular gene profile (e.g., the input multicellular gene profile 150) may be structured in this manner because the input units of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) operating on the input multicellular gene profile (e.g., the input multicellular gene profile 150) are fixed, for example, during training, to operate on specific cell types.
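
The row layout described above may be sketched, under stated assumptions, as a stable reordering of cell rows so that each cell type occupies a contiguous block in a fixed order. The function name `order_cells_by_type`, the fixed `type_order` list, and the toy counts are hypothetical; how cell types with differing numbers of cells are mapped to a fixed number of input units is not addressed by this sketch.

```python
def order_cells_by_type(expression_rows, cell_types, type_order):
    """Reorder cell rows so each cell type occupies a contiguous block.

    expression_rows: one row per cell, one column per gene.
    cell_types: the cell type label of each row.
    type_order: a fixed ordering of cell types shared across samples.
    """
    rank = {cell_type: k for k, cell_type in enumerate(type_order)}
    # sorted() is stable, so cells keep their relative order within a type.
    order = sorted(range(len(cell_types)), key=lambda r: rank[cell_types[r]])
    rows = [expression_rows[r] for r in order]
    labels = [cell_types[r] for r in order]
    return rows, labels

type_order = ["CD8 T cell", "macrophage"]  # hypothetical fixed ordering
cell_types = ["macrophage", "CD8 T cell", "macrophage", "CD8 T cell"]
expression_rows = [[5, 0], [1, 2], [4, 1], [0, 3]]  # 4 cells x 2 genes, toy counts
rows, labels = order_cells_by_type(expression_rows, cell_types, type_order)
```

After reordering, the first block of rows holds the first cell type and the second block holds the second cell type, so that a given row range always refers to the same cell type.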

[0082] At 254, a gene embedding that combines a gene expression embedding and a gene identity embedding is generated for each gene of the plurality of genes. In some example embodiments, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include an embedding layer (e.g., the embedding layer 162) with a neural network (e.g., the neural network 166 such as a feed forward neural network (FFNN)) trained to operate on the expression levels of a gene across every cell in the sample associated with the input multicellular gene profile (e.g., the input multicellular gene profile 150) to generate a gene expression embedding (e.g., the gene expression embedding 171). The gene expression embedding (e.g., the gene expression embedding 171) of a gene may vary depending on the expression levels of the gene across the different cells in the sample. In some cases, the embedding layer (e.g., the embedding layer 162) may further include a gene identity embedding layer (e.g., the gene identity embedding layer 168) trained to generate a gene identity embedding (e.g., the gene identity embedding 173). The gene identity embedding (e.g., the gene identity embedding 173) of a gene is fixed and does not vary based on the expression levels of the gene across the different cells in the sample. In some cases, the embedding layer (e.g., the embedding layer 162) may generate, for each gene, a gene embedding (e.g., the gene embedding 165) that combines the gene expression embedding (e.g., the gene expression embedding 171) and the gene identity embedding (e.g., the gene identity embedding 173) of the gene. For example, in some cases, the gene embedding (e.g., the gene embedding 165) of a gene may be a sum (or a concatenation) of its gene expression embedding (e.g., the gene expression embedding 171) and gene identity embedding (e.g., the gene identity embedding 173).
In some cases, the embedding layer (e.g., the embedding layer 162) may be pretrained to generate gene embeddings (e.g., the gene embedding 165) representative of the co-regulatory relationships that may exist betweenAttorney Ref.: 14786-076-228 (103963-228076) / P39709-WO-1 two or more genes. For instance, the embedding layer (e.g., the embedding layer 162) may generate similar gene embeddings, which are proximally located in an embedding (or latent) space, for two or more co-regulated genes.
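By way of illustration only, the combination of a gene expression embedding and a gene identity embedding may be sketched as below. This is a minimal sketch under stated assumptions: the sum variant is used, the weights are random rather than pretrained, and the dimensions `D` and `M`, the hidden width of 8, the two-gene vocabulary, and all function names are hypothetical.

```python
import random

D = 4   # embedding dimension (assumed for illustration)
M = 6   # number of cells in the sample (assumed for illustration)
rng = random.Random(0)

def rand_matrix(rows, cols):
    # Random stand-in for trained parameters.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(weights, bias, x):
    # Computes W x + b, one output per row of W.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

# Two-layer perceptron shared across all genes (not gene-specific).
W1, b1 = rand_matrix(8, M), [0.0] * 8
W2, b2 = rand_matrix(D, 8), [0.0] * D

def expression_embedding(x):
    hidden = [max(0.0, h) for h in linear(W1, b1, x)]  # ReLU activation
    return linear(W2, b2, hidden)

# Gene identity embedding: one fixed D-dimensional vector per gene token.
vocab = {"CD14": 0, "TLR4": 1}
identity_table = rand_matrix(len(vocab), D)

def gene_embedding(gene, x):
    # Sum variant: expression embedding plus identity embedding.
    gex = expression_embedding(x)
    ident = identity_table[vocab[gene]]
    return [a + b for a, b in zip(gex, ident)]

x_silent = [0.0] * M                       # gene not expressed in any cell
x_active = [2.0, 1.5, 0.0, 0.0, 3.0, 0.5]  # gene expressed in some cells
emb_silent = gene_embedding("CD14", x_silent)
emb_active = gene_embedding("CD14", x_active)
```

In this sketch the identity term is the same for `emb_silent` and `emb_active`, while the expression term depends on the per-cell expression vector.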

[0083] To further illustrate, given an expression matrix $X \in \mathbb{R}^{N \times M}$, wherein $N$ is the number of genes and $M$ is the number of cells, each element $X_{ij} \in \mathbb{R}_{+}$ represents the normalized expression value (after log1p transformation) of gene $g_i$ in cell $c_j$. For each gene $g_i$, its expression vector $x_{g_i}$ may be defined as the $i$-th row of the expression matrix $X$ as follows:
$$x_{g_i} = [X_{i1}, X_{i2}, \ldots, X_{iM}] \in \mathbb{R}^{M}$$
In some cases, the gene expression vector $x_{g_i}$ may represent the expression of gene $g_i$ across all $M$ cells in the sample.

[0084] The embedding layer (e.g., the embedding layer 162) of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include a neural network (e.g., the neural network 166), such as a feed-forward neural network (FFNN) $\mathrm{MLP}_e$ that is a 2-layer multilayer perceptron (MLP) that transforms the gene expression vector across all cells into a latent vector of dimension $D$ as follows:
$$\mathrm{GEXEmb}_{g_i} = \mathrm{MLP}_e(x_{g_i})$$
The feed-forward neural network (FFNN) $\mathrm{MLP}_e$ is, in some cases, not gene-specific, meaning that it is shared across all genes, in order to prevent overfitting. Notably, the input units of the feed-forward neural network $\mathrm{MLP}_e$ of dimension $M$ are associated with certain cell types during training. This is achieved by partitioning the entries of the input vector $x_{g_i}$ according to a set of cell types $\{t_1, t_2, \ldots, t_T\}$, where $T < M$, as shown below, wherein $x_{g_i}^{t_k}$ contains the expression values of gene $g_i$ in cells of cell type $t_k$ (e.g., CD4 T cells):
$$x_{g_i} = [x_{g_i}^{t_1}, x_{g_i}^{t_2}, \ldots, x_{g_i}^{t_T}]$$
Fixing input units to certain cell types is advantageous at least because doing so allows the feed-forward neural network (FFNN) $\mathrm{MLP}_e$ to learn cell type-specific patterns, instead of just learning average gene loadings, and is therefore crucial for the learning of meaningful cell-cell interactions.

[0085] In some cases, it is also possible to use a pseudo-bulked input vector as input to the feed-forward neural network (FFNN) $\mathrm{MLP}_e$. In this case, the expression values of each gene may be aggregated across cells of the same cell type, resulting in a reduced expression vector that captures the total expression of each gene per cell type. Formally, for each gene $g_i$, the corresponding pseudo-bulked expression vector may be denoted as $p_{g_i} \in \mathbb{R}^{T}$, where $T$ is the number of cell types and $T < M$. Each element $p_{g_i}^{t_k}$ in $p_{g_i}$ represents the sum of normalized expression values of gene $g_i$ across all cells of the cell type $t_k$ as indicated below, where $C_{t_k}$ is the set of cell indices corresponding to cell type $t_k$:
$$p_{g_i}^{t_k} = \sum_{j \in C_{t_k}} X_{ij}$$
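As a non-limiting illustration, pseudo-bulking the expression vector of a single gene by cell type may be sketched as follows. The function name `pseudo_bulk`, the cell type labels, and the toy expression values are assumptions made for this sketch.

```python
def pseudo_bulk(x_gene, cell_types, type_order):
    """Sum one gene's per-cell expression values within each cell type.

    x_gene     : expression of the gene in each of the M cells
    cell_types : cell type label of each of the M cells
    type_order : the T cell types, in the fixed order used for the input units
    Returns a length-T vector of per-cell-type totals.
    """
    totals = {t: 0.0 for t in type_order}
    for value, label in zip(x_gene, cell_types):
        totals[label] += value
    return [totals[t] for t in type_order]

# Toy example: M = 5 cells, T = 3 cell types.
x_gene = [1.0, 0.5, 2.0, 0.0, 1.5]
cell_types = ["CD4 T", "CD4 T", "B", "B", "NK"]
p_gene = pseudo_bulk(x_gene, cell_types, ["CD4 T", "B", "NK"])  # [1.5, 2.0, 1.5]
```

The output length equals the number of cell types rather than the number of cells, which is the reduction in input size that pseudo-bulking provides.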

[0086] In this pseudo-bulked approach, the input dimension of the feed-forward neural network $\mathrm{MLP}_e$ is $T$ instead of $M$. Thus, pseudo-bulking the input vector reduces the input size and focuses the feed-forward neural network $\mathrm{MLP}_e$ on aggregated expression patterns per cell type. It should be appreciated that the performance of the feed-forward neural network $\mathrm{MLP}_e$ is competitive whether the feed-forward neural network $\mathrm{MLP}_e$ operates on the original per-cell expression vector $x_{g_i} \in \mathbb{R}^{M}$ or the pseudo-bulked expression vector $p_{g_i} \in \mathbb{R}^{T}$. However, when the feed-forward neural network $\mathrm{MLP}_e$ is trained to treat all cells as a single group, without any implicit cell type labels, the resulting feed-forward neural network $\mathrm{MLP}_e$ underperformed significantly. Moreover, while cell type labels may be implicit in the partitioning and aggregation of expression vectors, it may also be possible to use any mutually exclusive cell labels, such as clusters derived from unsupervised methods (e.g., Leiden clustering).

[0087] Referring again to the embedding layer (e.g., the embedding layer 162), another component of the embedding layer (e.g., the embedding layer 162) is the gene identity embedding layer (e.g., the gene identity embedding layer 168), which is denoted here as $\mathrm{emb}_g$. Here, given a predefined set of genes $G$, each gene $g_i$ may be assigned a unique identifier to form a vocabulary of tokens used by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115). Doing so is advantageous because it allows harmonization across the different sets of genes measured in different studies, rather than being limited to the intersection of genes therebetween. In some cases, these identifiers are used to transform a gene into a multidimensional vector of dimension $D$. The gene identity embedding layer $\mathrm{emb}_g$, which also undergoes pretraining, generates an output that is the same size as the output of the feed-forward neural network $\mathrm{MLP}_e$.

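[0088] Combining the above, for a gene $g_i$ with expression vector $x_{g_i} \in \mathbb{R}^{M}$, the multicellular embedding may be defined, in the case of a sum, as:
$$\mathrm{Embedding}_{g_i} = \mathrm{MLP}_e(x_{g_i}) + \mathrm{emb}_g(g_i)$$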

[0089] Across a set of genes in a multicellular sample, the input to the transformer block is:
$$\mathrm{Input} = [\mathrm{Embedding}_{g_1}, \mathrm{Embedding}_{g_2}, \ldots, \mathrm{Embedding}_{g_L}]$$
wherein $L$ denotes a predefined maximum input length. In some cases, setting up the input of the transformer block in this manner may be advantageous because it accounts for multiple cells in the environment and the interaction therebetween. In some cases, setting up the input in this manner may also consider context-dependent expression of genes in the sample. Furthermore, in some cases, setting up the input of the transformer block in the foregoing manner may represent gene function independent of expression across samples.

[0090] At 256, a contextual gene embedding for each gene of the plurality of genes is generated based at least on a plurality of gene embeddings of the plurality of genes in the multicellular gene profile. In some example embodiments, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may include a transformer block (e.g., the transformer block 164) having one or more transformer layers. In some cases, the transformer block (e.g., the transformer block 164) may ingest, as input, the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) generated by the embedding layer (e.g., the embedding layer 162). For example, in some cases, the transformer block (e.g., the transformer block 164) may operate on the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) of every gene present in the input multicellular gene profile (e.g., the input multicellular gene profile 150) in order to generate, for each gene, a contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) that captures the co-regulatory relationships that exist between genes. The contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) of a gene may dynamically adapt to the context of the gene, which is determined by the expression of other genes present in the same multicellular environment. 
For example, the role of a first gene expressed by a first cell in a multicellular environment, including that of a first protein produced by the expression of the first gene, may be dependent on a second cell present in the same multicellular environment at least because the expression of the first gene may be regulated by a second protein produced by the second gene expressed by the second cell. In some cases, the role of a third gene may be further contingent upon the expression of the first gene, thus forming a co-regulatory relationship between the first gene, the second gene, and the third gene. Accordingly, the contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) for the third gene may vary depending on the expression levels of the first gene as well as the second gene in the multicellular environment. Furthermore, because the contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) of the third gene accounts for this co-regulatory relationship, the contextual gene embedding may better inform downstream applications that seek to uncover the regulatory mechanisms within the multicellular environment.

[0091] To further illustrate, consider again the expression matrix $X \in \mathbb{R}^{N \times M}$. The multicellular gene embedding for each gene $g_i$ may be defined (here, as a sum) as follows:
$$\mathrm{Embedding}_{g_i} = \mathrm{MLP}_e(x_{g_i}) + \mathrm{emb}_g(g_i)$$
wherein $x_{g_i} \in \mathbb{R}^{M}$ is the expression vector of gene $g_i$ across all $M$ cells in the sample. These embeddings may be collected into a sequence as shown below:
$$E = [\mathrm{Embedding}_{g_1}, \mathrm{Embedding}_{g_2}, \ldots, \mathrm{Embedding}_{g_L}]$$
wherein $L$ is the quantity of genes considered (up to a predefined maximum input length).

[0092] In some example embodiments, the transformer block (e.g., the transformer block 164) may compute the contextual representation (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) of each gene $g_i$, denoted as $H = [h_1, h_2, \ldots, h_L]$, by passing the embedding sequence $E$ through multiple transformer layers:
$$H = \mathrm{Transformer}(E)$$
In some cases, the efficiency of the self-attention mechanism may be improved, for example, by using Flash Attention. In some cases, Flash Attention may optimize the transfer of data between processor (e.g., graphics processing unit (GPU)) memory and caches. In some cases, Flash Attention may use tiling and recomputation to avoid storing the large intermediate attention matrices in memory, thereby expediting training and inference by at least reducing memory access (e.g., for reads, writes, and / or the like) and enabling the transformer block to operate on longer contexts. Furthermore, in some cases, the embedding sequence $E$ passed to the transformer block may be delimited with special tokens [CLS] and [SEP] to mark the beginning and end of the sequence.
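To give a rough sense of how a contextual representation of one gene can depend on the embeddings of the other genes in the sequence, the following is a deliberately simplified sketch of scaled dot-product self-attention. It is not the disclosed transformer block: it assumes a single head, uses the embeddings directly as queries, keys, and values (no learned projections), and omits feed-forward sublayers, normalization, special tokens, and Flash Attention. All names and values are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(E):
    """Single-head scaled dot-product self-attention with identity projections.

    E : list of L gene embeddings, each a list of d floats
    Returns (H, weights): H[i] is the attention-weighted average of all
    embeddings for gene i, and weights[i] are the attention weights used.
    """
    d = len(E[0])
    H, weights = [], []
    for query in E:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in E]
        attn = softmax(scores)
        weights.append(attn)
        H.append([sum(a * value[c] for a, value in zip(attn, E)) for c in range(d)])
    return H, weights

# Toy sequence of three gene embeddings; the first two are similar.
E = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
H, weights = self_attention(E)
```

In this toy input, the first gene places more attention weight on the second gene, whose embedding is similar, than on the third.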

[0093] Multicellular Gene Network Inference

[0094] Many diseases, such as cancer, work by disrupting the existing signaling pathways regulating cell growth and death. These pathways may include co-regulated genes where the expression of a first gene produces a first protein that activates or suppresses the expression of a second gene. As such, understanding the genes involved in these signaling pathways may enable the identification of specific genes that can be targeted to modulate these interactions. However, conventional computational methods for extracting co-regulated genes are focused on intracellular interactions rather than intercellular interactions. In contrast, various implementations of the multicellular gene interaction computation model described herein are trained to learn the co-regulatory relationships that exist between different genes in the same multicellular environment.

[0095] In some example embodiments, the multicellular gene interaction computation model may be trained to learn the interactions between genes in a multicellular environment. As noted, in some cases, this training may include pretraining the multicellular gene interaction computation model with a masked learning objective using training multicellular gene profiles in which the identity and / or gene expression levels of one or more constituent genes are masked. In some cases, this training may include adjusting one or more parameters (e.g., weights, biases, and / or the like) of the multicellular gene interaction computation model such that the multicellular gene interaction computation model is able to generate, from the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) of each gene in a training multicellular gene profile (e.g., the training multicellular gene profile 170), contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) that enable the recovery of masked gene identities and / or gene expression levels in the multicellular gene profile. In some cases, the pretraining of the multicellular gene interaction computation model includes pretraining an embedding layer of the multicellular gene interaction computation model to generate similar gene embeddings, which are proximally located in an embedding (or latent) space, for co-regulated genes in the multicellular environment. Accordingly, these gene embeddings or, in some cases, the contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated therefrom, may be applied to identify co-regulated genes. 
For example, in some cases, a similarity matrix may be computed with the gene embeddings and used to cluster the genes into distinct, mutually exclusive gene sets. Each gene set may represent a group of genes with similar gene embeddings (or latent representations), which are indicative of a functional relationship in one or more common biological processes. The multicellular gene interaction computation model described herein generates gene embeddings that can be used to recover a number of gene sets that play a key role in intracellular signaling. Advantageously, the multicellular gene interaction computation model is able to do so in a zero-shot setting, with no finetuning necessary.
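For illustration only, deriving mutually exclusive gene sets from pairwise embedding similarity may be sketched as follows. Note the substitution: instead of Leiden clustering, this sketch links genes whose cosine similarity meets a threshold and takes the connected components, which is a much simpler stand-in that likewise yields non-overlapping sets. The gene names, vectors, and threshold are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def gene_sets(embeddings, threshold):
    """Group genes into disjoint sets via connected components.

    embeddings : {gene: vector}
    threshold  : minimum cosine similarity for two genes to be linked
    """
    genes = list(embeddings)
    parent = {g: g for g in genes}  # union-find forest

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    return sorted(groups.values(), key=lambda s: sorted(s))

# Hypothetical embeddings: two pairs of near-parallel vectors.
embeddings = {
    "GENE_A": [1.0, 0.1],
    "GENE_B": [0.9, 0.2],
    "GENE_C": [0.0, 1.0],
    "GENE_D": [0.1, 0.9],
}
sets = gene_sets(embeddings, threshold=0.9)
```

A resolution-based method such as Leiden clustering would replace the fixed threshold with a tunable resolution parameter, but the input (a gene-by-gene similarity structure) is the same.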

[0096] In some example embodiments, depending on the clustering technique, the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) or contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) can be clustered into overlapping or non-overlapping gene sets. For example, with some clustering techniques, such as Leiden clustering, the resulting gene sets are non-overlapping or mutually exclusive, meaning that a single gene cannot simultaneously belong to multiple gene sets. This mutual exclusivity may be a limitation, as one gene can be, in some cases, involved in multiple biological processes. Accordingly, in some cases, a clustering technique that yields overlapping or non-mutually exclusive gene sets may be applied instead. For instance, percolation clustering allows a single gene to be assigned to multiple clusters, such that the resulting gene sets can include genes that belong to more than one gene set. In non-mutually exclusive gene sets, receptor and ligand genes are significantly more likely to belong to multiple clusters. Moreover, genes belonging to the highest number of gene sets include genes (e.g., IFNG, FZD9, OPRK1, and TYRP1) known to participate in more than one biological process.

[0097] FIG. 3 depicts an analysis of zero-shot gene sets identified by the multicellular gene interaction computation model, in accordance with some example embodiments. Referring to FIG. 3, (a) shows examples of zero-shot gene sets identified by the multicellular gene interaction computation model at a clustering resolution (e.g., Leiden resolution) of 50 together with pathway enrichment in the corresponding gene set. FIG. 3(b) shows the normalized expression of a set of genes identified by the multicellular gene interaction computation model at a clustering resolution (e.g., Leiden resolution) of 50 across a set of cell types in a lung cancer atlas. FIG. 3(c) shows the enrichment of zero-shot gene sets extracted using the multicellular gene interaction computation model and alternative methods in cell-cell communication related pathways across clustering resolutions (e.g., Leiden resolutions). In FIG. 3(c), the genes selected for analysis were highly variable genes from the tumor microenvironment atlas. The coexpression was computed using the aggregated gene expression at the sample level using the entire pretraining corpus. FIG. 3(d) shows the enrichment of receptor-ligand pairs across Leiden resolutions using various methods. The enrichment was calculated by counting the number of receptor-ligand pairs present in the gene set and normalizing for the expected number of receptor-ligand pairs given the size of the gene set. The box represents the interquartile range (IQR), the horizontal line marks the median, and the whiskers extend to the farthest data points within 1.5 times the IQR. FIG. 3(e) shows the normalized expression of a selected gene set involved in Wnt signaling in lung cancer across a set of cell types. The cell types are limited to those in which the gene set is highly expressed.

[0098] As noted, in some example embodiments, the multicellular gene interaction computation model is trained to predict masked genes, including masked gene identities and / or masked gene expression levels, given the identities and / or expression levels of other genes in the multicellular sample. While this masked learning objective explicitly encourages the multicellular gene interaction computation model to learn the intercellular interactions, it does not prevent the multicellular gene interaction computation model from learning intracellular ones. The extent to which the multicellular gene interaction computation model captures inter- and intra-cellular interactions may be evaluated based on whether the discovered gene sets are enriched in pairs of genes related to inter- and intra-cellular communication. For example, where the multicellular gene interaction computation model learns intercellular interactions, the model should produce gene sets enriched in receptor-ligand pairs. Conversely, the gene sets generated when the multicellular gene interaction computation model learns intracellular interactions should be enriched in transcription factor (TF)-target gene pairs, as these interactions happen at the intracellular level. Accordingly, three sets of gene pairs, (1) receptor-ligand pairs, (2) TF-target gene pairs, and (3) protein-protein interaction (PPI) pairs, were extracted using LIANA, CollecTRI, and OmniPath, respectively. The first set represents gene pairs specific to intercellular interactions, the second represents intracellular ones, and the third contains both, with the majority being intracellular interactions. The gene sets identified by the multicellular gene interaction computation model showed the highest enrichment in receptor-ligand pairs across Leiden resolutions, achieving a mean fold change enrichment of 5.62 versus 3.1 for the second-best performing method, CellPLM (FIG. 3(d)). 
Furthermore, the multicellular gene interaction computation model also performed best on capturing PPI pairs, and performed similarly to scGPT while underperforming CellPLM on TF-target gene pairs. These results show that the multicellular gene interaction computation model is able to learn intercellular as well as intracellular interactions. Finally, the performance of CellPLM on intracellular interactions suggests that its pretraining objective induces CellPLM to learn intracellular mechanisms, which in turn shows the benefit of learning gene-gene interactions across cells.
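As one possible reading of the enrichment calculation described for FIG. 3(d), the fold enrichment of known gene pairs in a gene set may be sketched as below. The null model is an assumption of this sketch rather than something stated in the disclosure: the expected count is taken to be the number of known pairs that a random gene set of the same size, drawn from the same background, would contain on average. Gene names and pairs are placeholders.

```python
from itertools import combinations
from math import comb

def pair_enrichment(gene_set, background, known_pairs):
    """Observed / expected number of known pairs inside `gene_set`.

    gene_set    : genes in the cluster being evaluated
    background  : all genes that were clustered
    known_pairs : iterable of (gene, gene) pairs, e.g. receptor-ligand pairs
    Expected count assumes a uniformly random gene set of the same size.
    """
    bg = set(background)
    known = {frozenset(p) for p in known_pairs if set(p) <= bg and len(set(p)) == 2}
    observed = sum(
        1 for pair in combinations(sorted(gene_set), 2) if frozenset(pair) in known
    )
    expected = len(known) * comb(len(gene_set), 2) / comb(len(bg), 2)
    return observed / expected if expected else float("nan")

background = [f"G{i}" for i in range(10)]           # 10 placeholder genes
known_pairs = [("G0", "G1"), ("G2", "G3"), ("G4", "G5")]
fold = pair_enrichment({"G0", "G1", "G2", "G3"}, background, known_pairs)
# observed = 2 pairs; expected = 3 * C(4,2) / C(10,2) = 0.4; fold = 5.0
```

A fold value above 1 indicates that the gene set contains more known pairs than expected for its size under this assumed null.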

[0099] Receptor-Ligand Analysis

[0100] Receptors and ligands are two essential components of cellular communication, enabling cells to respond to external signals and coordinate complex biological processes. By binding to specific ligands, receptors activate signaling pathways that influence cell behavior. An estimated 50% of all drugs target receptors, underscoring the salience of receptors as targets in therapeutic interventions. For example, in immunotherapy, drugs that target the PD-1 / PD-L1 receptor-ligand interaction can enhance the immune system’s ability to recognize and destroy cancer cells by blocking the inhibitory signals (e.g., the binding of PD-L1 from tumor cells to PD-1 on the surface of T-cells) that suppress T-cell activity.

[0101] Capturing the relationship between receptors and ligands from sequencing data (e.g., single cell RNA sequencing data) or spatial transcriptomics data is challenging due to low expression levels and high context-dependency. Furthermore, many different ligands can bind to the same receptor while many different receptors can bind with the same ligand, further complicating the relationship between ligands and receptors. Nevertheless, various implementations of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) described herein are trained to capture multicellular context and can model non-linear interactions. Accordingly, in some example embodiments, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be deployed to identify one or more pairs of genes involved in receptor-ligand interactions in an unsupervised fashion. For example, in some cases, the gene embedding (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) or the contextual gene embedding (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) of each gene in an input multicellular gene profile (e.g., the input multicellular gene profile 150) may be compared to the gene embedding or the contextual gene embedding of every other gene in the multicellular gene profile. The similarity between two gene embeddings, such as the first gene embedding 163a and the second gene embedding 163b, or between two contextual gene embeddings, such as the first contextual gene embedding 165a and the second contextual gene embedding 165b, may be quantified by a similarity metric. For instance, in some cases, a pairwise cosine similarity may be computed for each pair of genes in the multicellular gene profile. 
Two genes may be identified as being involved in receptor-ligand interactions if, for example, the similarity metric (e.g., cosine similarity) quantifying the similarity between the corresponding gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) or contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) satisfies one or more thresholds.
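By way of example, scoring every pair of genes by cosine similarity and retaining the pairs that satisfy a threshold may be sketched as follows. The embedding vectors are toy values chosen for illustration and do not come from any trained model; the threshold of 0.9 and the function names are likewise assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def candidate_pairs(embeddings, threshold):
    """Return (gene_a, gene_b, similarity) for pairs at or above `threshold`,
    sorted from most to least similar."""
    scored = []
    for a, b in combinations(sorted(embeddings), 2):
        s = cosine(embeddings[a], embeddings[b])
        if s >= threshold:
            scored.append((a, b, s))
    return sorted(scored, key=lambda t: t[2], reverse=True)

# Toy (contextual) gene embeddings; values are illustrative only.
embeddings = {
    "CXCR4":  [1.0, 0.0, 0.2],
    "CXCL12": [0.9, 0.1, 0.3],
    "CD14":   [0.0, 1.0, 0.0],
    "TLR4":   [0.1, 0.9, 0.1],
}
pairs = candidate_pairs(embeddings, threshold=0.9)
```

In practice, the retained pairs could be further filtered (for example, against a curated receptor-ligand resource) before being treated as candidate interactions.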

[0102] As noted, receptor-ligand interactions are highly context specific. Accordingly, the performance of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) may be dependent on the size of the context (e.g., the total number of genes in the multicellular gene profile). Performance (e.g., as quantified by the area under the receiver operating characteristic (AUROC) curve) is shown to increase significantly with an increase in context length, thus demonstrating the importance of considering the multicellular context for receptor-ligand identification.

[0103] In some example embodiments, identifying genes involved in receptor-ligand interactions beyond the receptors and ligands themselves may also be crucial as these genes often encode essential components that regulate or facilitate the signaling pathways. Understanding these genes can reveal novel therapeutic targets for diseases caused by dysregulated cell communication. Accordingly, in some example embodiments, one or more gene sets containing at least one pair of genes involved in receptor-ligand interactions may be identified and the other genes in the same gene set may be analyzed. For example, as noted, one or more gene sets may be identified by clustering the gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) or contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115). Furthermore, the one or more pairs of genes involved in receptor-ligand interactions may be identified based on a similarity metric (e.g., cosine similarity metric) quantifying the similarity between the corresponding pairs of gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) or contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like). In some cases, a gene set containing at least one pair of genes involved in receptor-ligand interactions may be identified such that the other genes in the gene set may undergo further analysis.

[0104] FIG. 4 depicts receptor-ligand analysis using the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115), in accordance with some example embodiments. FIG. 4(a) shows the performance on the receptor-ligand prediction task using pretrained gene embeddings. Co-expression was computed using normalized expression aggregated at the sample level using the training corpus. The true receptor-ligand pairs were taken from the LIANA consensus receptor-ligand set, while the negatives (non-receptor-ligand pairs) were randomly sampled. FIG. 4(b) shows receptor-ligand prediction performance across a set of diverse receptors specific to different cell types. FIG. 4(c) shows a comparison of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) pretrained with inputs having different context sizes (e.g., number of genes), with the same training regime and same dataset used in FIG. 4(a). FIG. 4(d) shows an example of a zero-shot gene set with a receptor (CD14)-ligand (TLR4) pair. FIG. 4(e) shows the most important cell types for TLR-CD14 and CXCR4-CXCL12 receptor-ligand pairs using integrated gradients across validation samples on the lung cancer atlas. Cell type importance scores were computed by applying a softmax function to aggregated absolute attribution scores across samples. The error bars represent the 95% confidence interval and cDC stands for conventional dendritic cells. FIG. 4(f) shows the relative change in cosine similarity (top) versus expression (bottom) in healthy versus disease samples with SLE across a set of known receptor-ligand pairs with the highest change according to cosine similarity. FIG. 4(g) shows structures of protein complexes predicted with AlphaFold3 and predicted to interact with the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115). 
The pTM (predicted Template Modeling) score shown is a measure of how well different parts of a protein fold together, while ipTM (interchain pTM) is the adaptation of pTM for assessing the structural confidence of protein-protein interfaces. Higher values are better for both.
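The cell type importance scores described for FIG. 4(e) may be approximated by the following minimal sketch. It assumes that per-sample attribution scores (e.g., from integrated gradients) have already been computed for each cell type, and that "aggregated" means summed across samples, which is an assumption of this sketch; the cell type names and values are hypothetical.

```python
import math

def cell_type_importance(attributions):
    """Softmax over summed absolute attribution scores.

    attributions : {cell_type: [attribution score per sample]}
    Returns {cell_type: importance}, with importances summing to 1.
    """
    # Aggregate absolute attributions across samples (sum is assumed here).
    totals = {t: sum(abs(a) for a in vals) for t, vals in attributions.items()}
    peak = max(totals.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - peak) for t, v in totals.items()}
    norm = sum(exps.values())
    return {t: e / norm for t, e in exps.items()}

# Hypothetical attribution scores for one receptor-ligand pair, two samples.
attributions = {
    "Monocyte": [0.9, -1.1],
    "cDC": [0.4, 0.6],
    "B cell": [0.1, -0.1],
}
importance = cell_type_importance(attributions)
top_cell_type = max(importance, key=importance.get)
```

Taking absolute values before aggregation means that strongly negative and strongly positive attributions both count toward a cell type's importance.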

[0105] Many of the receptors on the human cell surface are orphan receptors without any known ligands. This highlights the need to develop methods for identifying novel receptor-ligand complexes. In some example embodiments, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) can be applied to discover gene pairs capable of forming a protein complex. This capability is shown with cytokines, which are key mediators in immune responses. The cosine similarity between contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) is computed, the top scoring pairs are extracted and filtered for pairs of secreted and membrane proteins, and a structural computation model (e.g., AlphaFold3) is applied to compute the binding scores of the top 60 candidates. Out of 60, 5 pairs showed binding scores (measured by ipTM) of more than 0.5. The multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) was able to recover IL1B and IL1R2, which have been shown to bind, validating this approach (FIG. 4(g)). Moreover, top scoring pairs from those identified by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) included many known interaction pairs, including CSF1R and IL18.

[0106] Disease-Specific Gene Identification

[0107] The role and impact of a gene are specific to its environment, often affecting multiple cells in the tissue. For example, a gene that promotes inflammation during an infection may cause harmful effects on the other cells if overactivated in healthy tissue. Various implementations of the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) described herein, which ingests as input the expression of multiple genes across cells, capture the interplay between genes and cells across thousands of diverse samples, thus allowing the model to consider the context of an entire sample rather than that of an individual cell. Advantageously, the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) is able to produce highly contextual embeddings. In some cases, the context sensitivity of the embeddings output by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) can be demonstrated by measuring the change in the neighborhood of individual genes across samples. Moreover, the stability (or variability) of gene neighborhoods can be leveraged to identify condition-specific genes, such as genes that are associated with specific diseases, stages of disease, and / or the like. The intuition behind this is that if the gene neighborhood remains stable across samples, then the gene consistently interacts with the same set of other genes, indicating a context-independent role in cellular processes. In contrast, if the gene neighborhood varies widely between samples and conditions, then the gene is likely to be highly context-specific.

[0108] FIG. 5 depicts an analysis of multicellular context-aware gene embeddings, in accordance with some example embodiments. FIG. 5(a) depicts a pairwise plot illustrating, for each gene, the mean overlap between gene neighborhoods across samples in the same tissue versus in different tissues. The overlap was calculated by using the 20 nearest neighbors for each gene in each sample as measured by cosine similarity between gene embeddings. FIG. 5(b) shows genes with high condition-specificity across tissues and diseases (lupus and COVID-19). The condition-specificity was calculated by dividing the mean overlap between gene neighborhoods in the same condition by the mean overlap between gene neighborhoods across conditions. FIG. 5(c) shows the expression of highly tissue-specific genes across different tissues in the immune across organs dataset. FIG. 5(d) shows the expression of disease-specific genes for lupus (SLE, top) and COVID-19 (bottom). Many genes have been shown to be associated with the disease but do not exhibit a significant change in expression between healthy and disease samples. FIG. 5(e) shows the overlap between the top 20 genes discovered by the multicellular gene interaction computation model (e.g., the multicellular gene interaction computation model 115) and by differential expression. Overlap = 40%, N = 20. FIG. 5(f) shows the uniform manifold approximation and projections (UMAPs) of sample-level embeddings across tissues on the entire pretraining corpus (left), tissues on the immune across organs dataset (middle), and disease (right). The sample embeddings were extracted by averaging all of the contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) in the sample.
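The condition-specificity calculation described for FIG. 5(b) may be sketched as follows for a single gene. The sketch assumes that the gene's neighborhood in each sample (e.g., its nearest neighbors by cosine similarity) has already been computed, and it uses Jaccard similarity as the overlap measure; the sample names, neighbor labels, and the neighborhood size of 3 (rather than 20) are illustrative assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    # Overlap between two neighbor sets: |intersection| / |union|.
    return len(a & b) / len(a | b)

def condition_specificity(neighborhoods, conditions):
    """Mean within-condition overlap divided by mean across-condition overlap.

    neighborhoods : {sample: set of neighbor genes for the gene of interest}
    conditions    : {sample: condition label}
    """
    within, across = [], []
    for (s1, n1), (s2, n2) in combinations(neighborhoods.items(), 2):
        bucket = within if conditions[s1] == conditions[s2] else across
        bucket.append(jaccard(n1, n2))
    if not within or not across:
        raise ValueError("need sample pairs both within and across conditions")
    mean_within = sum(within) / len(within)
    mean_across = sum(across) / len(across)
    return float("inf") if mean_across == 0 else mean_within / mean_across

# Hypothetical neighborhoods of one gene in four samples.
neighborhoods = {
    "healthy_1": {"A", "B", "C"},
    "healthy_2": {"A", "B", "D"},
    "disease_1": {"X", "Y", "A"},
    "disease_2": {"X", "Y", "B"},
}
conditions = {
    "healthy_1": "healthy", "healthy_2": "healthy",
    "disease_1": "disease", "disease_2": "disease",
}
score = condition_specificity(neighborhoods, conditions)
# within-condition mean = 0.5, across-condition mean = 0.2, ratio = 2.5
```

A ratio well above 1 indicates that the gene's neighborhood is stable within a condition but changes between conditions, which is the signature of a condition-specific gene under this measure.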

[0109] The interactions between genes are intrinsically linked to the repertoire of genes expressed within the cellular environment. The zero-shot gene sets retrieved from pretrained gene embeddings, which can be used for zero-shot gene set analysis, also capture the aggregated interactions across samples. In some cases, sample-specific gene embeddings, which are extracted from the output of the last hidden layer of the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115), are used for context-specific interactions. In some cases, these sample-specific gene embeddings are used to extract gene sets, for example, in the manner described earlier. Context-aware gene sets showed significant enrichment in sample-specific gene sets. For instance, the gene sets from glioblastoma samples showed enrichment in p53, NF-κB, and stress response pathways, which reflects the processes driving tumor development and progression. By contrast, the gene sets from healthy PBMC samples are enriched in pathways related to the normal functioning of the immune system, such as B-cell differentiation, T-cell receptor signaling, and natural killer cell-mediated cytotoxicity.

[0110] Differential expression (DEX) is one tool in genomics used to identify genes that show statistically significant differences in expression levels between two or more conditions. However, DEX often fails to recover lowly expressed genes that are associated with the condition, and it considers the change in expression of only a single gene at a time rather than looking at a network of genes, where a subtle change in the expression of one or more genes can lead to significant alterations in biological pathways or cellular functions. The contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) generated by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) have the advantage of allowing the change across conditions to be measured based not solely on the expression of a gene of interest, but also on its neighborhood in its multicellular environment. Leveraging this, condition-specific genes can be identified by collecting the neighborhoods of each gene across samples and calculating the similarity (e.g., Jaccard similarity) between gene neighborhoods across different conditions, as well as within the same condition. The pairwise plots of the two across tissues (FIG. 5(a)) and diseases show a trade-off between similarity (e.g., Jaccard similarity) across conditions and within a condition, with the majority of the genes showing low overlap in both. The fold change in similarity (e.g., Jaccard similarity) can also be calculated within the same condition relative to the complement of similarity (e.g., Jaccard similarity) across conditions. This results in a set of highly condition-specific genes. Genes which exhibit high tissue-, lupus-, and COVID-19-specificity are visualized (FIG. 5(b)) to show large overlap within a condition, while simultaneously maintaining low overlap across conditions. Furthermore, the expression of condition-specific genes is also shown to be variable. For example, tissue-specific genes from the immune system across organs dataset, which include FIGN, CLDN1, GSTA1 and more, show variable expression patterns across tissues, validating their tissue-specificity. Moreover, the recovered genes have low expression levels (normalized expression mean = 0.04, std = 0.2), demonstrating that the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) is robust to low expression levels. The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) was also able to recover disease-specific genes for lupus and COVID-19 that have been shown to be associated with the corresponding diseases. These genes include IFI44, MX1, and EPSTI1 for SLE and OAS1, CISH, and CD74 for COVID-19. Importantly, numerous genes with evidence linking them to a disease have either low or similar expression levels between healthy and disease cells, making it challenging for DEX to extract such genes. To assess the overlap between DEX and the condition-specific genes identified by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115), the top 20 genes from each were extracted using the lupus study and plotted to show the overlap therebetween (FIG. 5(e)). The two show considerable overlap at 40%, which validates the approach. At the same time, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) also discovered genes, such as the OAS2 and FTH1 genes associated with lupus, that elude DEX.

[0111] Clustering is a notable technique in single-cell applications, enabling the identification of distinct populations within a heterogeneous sample. Existing methods have mostly focused on clustering at the cell level, grouping cells according to their cell states. By contrast, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) operates on multicellular samples as a whole to uncover salient variations at the sample level rather than the cellular level. Sample-level clustering can then be used, for example, to reveal patient cohorts more likely to respond to certain therapies. For instance, upon pretraining, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) may be applied to determine gene embeddings (e.g., the first gene embedding 163a, the second gene embedding 163b, and / or the like) or contextual gene embeddings (e.g., the first contextual gene embedding 165a, the second contextual gene embedding 165b, and / or the like) for the genes in each sample before a sample embedding is determined by taking an average of the gene embeddings. The samples may then be clustered based on their respective sample embeddings into clusters that correspond to the tissue from which each sample originated (FIG. 5(f)). To show that samples do not cluster by tissue merely because of the technical effects of a dataset, clustering was performed on two datasets that each include samples from multiple tissues (FIG. 5(f)). The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) performed well on both datasets, achieving over 0.6 Normalized Mutual Information (NMI), which is a value between 0 and 1.0, with a higher value being better. The sample embeddings also allowed samples to be clustered according to disease. This effect is shown with lupus and COVID-19 datasets. Finally, the sample-level embeddings were benchmarked against an expression-based method, PCA, which is created by aggregating the expression across samples, showing that the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) outperforms the baseline on 3 out of 4 datasets.
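By way of illustration only, the averaging of gene embeddings into a sample embedding and the scoring of a clustering with NMI may be sketched as follows. The clustering algorithm itself is not specified in this passage and is therefore left out; the sketch only scores cluster assignments that are assumed to be given. The arithmetic-mean normalization of the mutual information is one common NMI variant and is an assumption of this example, as are the function names.

```python
import math
from collections import Counter


def sample_embedding(gene_embeddings):
    """Average a sample's gene embeddings (a list of equal-length vectors)."""
    n = len(gene_embeddings)
    return [sum(column) / n for column in zip(*gene_embeddings)]


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, cluster_labels):
    """NMI between reference labels (e.g., tissue) and cluster assignments.

    Uses the arithmetic mean of the two entropies as the normalizer,
    which is an assumption of this sketch.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    count_true = Counter(true_labels)
    count_cluster = Counter(cluster_labels)
    mutual_information = sum(
        (c / n) * math.log((c / n) / ((count_true[t] / n) * (count_cluster[k] / n)))
        for (t, k), c in joint.items()
    )
    normalizer = (entropy(true_labels) + entropy(cluster_labels)) / 2
    return mutual_information / normalizer if normalizer else 1.0
```

A score of 1.0 means the clusters reproduce the reference labels exactly, and a score near 0 means the cluster assignments carry no information about the labels.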

[0112] Recovery of Spatial Colocalization Patterns

[0113] Identifying genes that interact within spatial niches is essential for unravelling the complexities of cellular context and tissue architecture and for elucidating cell-cell interactions. By using hundreds of cells as input and modeling interactions between genes across cells, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) considers cellular as well as tissue context. To investigate whether the multicellular gene interaction computation model learns meaningful spatial patterns using only sequencing data (e.g., single cell RNA sequencing data), a comparative analysis between the gene embeddings generated by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) and spatial transcriptomics data is performed.

[0114] FIG. 6 depicts spatial patterns across datasets recovered by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115), in accordance with some example embodiments. FIG. 6(a) shows the spatial colocalization of the AREG and VEGFA genes in the Xenium human lung cancer dataset, as recovered by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) in a zero-shot fashion by assessing the genes with the highest cosine similarity in gene embeddings for the AREG gene. The spatial colocalization was measured by computing bivariate Moran’s I (a measure of spatial autocorrelation) between the two genes. The color gradient in FIG. 6(a) corresponds to the value of the bivariate Moran’s I between the two genes. FIG. 6(b) shows the spatial colocalization of the SPP1 and FTH1 genes in the Visium glioblastoma dataset, recovered by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) in a zero-shot fashion by assessing the genes with the highest cosine similarity in gene embeddings for the SPP1 gene. The spatial colocalization was measured by computing bivariate Moran’s I between the two genes. The color gradient in FIG. 6(b) corresponds to the value of the bivariate Moran’s I between the two genes. FIG. 6(c) shows the correlation between spatial colocalization and the cosine similarity computed using gene embeddings from the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) across different percentiles of absolute spatial colocalization (the higher the percentile, the higher the absolute spatial colocalization). The correlation was computed by randomly sampling 40,000 pairs of genes from each of the Xenium lung cancer and Xenium breast cancer datasets, computing spatial colocalization using bivariate Moran’s I, and computing cosine similarity using gene embeddings from the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115). FIG. 6(d) depicts a visualization of spatial samples used for finetuning the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) on spatial transcriptomics data. A spatial sample is defined by taking the cells within a predefined radius of a cell of interest. In FIG. 6(d), each cell cluster is represented by a different color while the dotted circles indicate the spatial neighborhood samples that are present across the cell clusters. FIG. 6(e) shows the spatial colocalization of the POSTN and LUM genes in the Xenium breast cancer dataset, as recovered by the multicellular gene interaction computation model finetuned on the dataset by assessing the genes with the highest cosine similarity in the gene embedding space. The color gradient in FIG. 6(e) corresponds to the value of the bivariate Moran’s I between the two genes. FIG. 6(f) shows the interaction matrix between cell types computed using the cosine similarity between gene embeddings of marker genes generated by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115). The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) used here was finetuned on the Xenium human breast cancer dataset and the marker genes were annotated using the same dataset. The interaction score, which is visually indicated by the color gradient in FIG. 6(f), was computed by calculating the log2 fold change of the gene pair’s cosine similarity relative to its average cosine similarity across all genes, and then taking the average across gene pairs for each cell type-cell type pair.
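By way of illustration only, one common formulation of bivariate Moran’s I may be sketched as follows. Each gene's expression is standardized across spatial locations, and the statistic is the weight-normalized sum of products between one gene's value at a location and the other gene's value at neighboring locations. The binary radius-based weight matrix, the use of population standard deviation, and the function names are assumptions made for this example; the description does not state which weighting scheme was used.

```python
import math


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("expression is constant across locations")
    return [(v - mean) / sd for v in values]


def radius_weights(coords, radius):
    """Binary spatial weights: 1.0 for distinct locations within `radius`."""
    n = len(coords)
    return [
        [1.0 if i != j and math.dist(coords[i], coords[j]) <= radius else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def bivariate_morans_i(x, y, weights):
    """Bivariate Moran's I between expression vectors x and y.

    Computed as sum_ij w_ij * z_x[i] * z_y[j] / sum_ij w_ij.
    """
    zx, zy = zscores(x), zscores(y)
    total_weight = sum(sum(row) for row in weights)
    if total_weight == 0:
        raise ValueError("no neighboring locations under these weights")
    numerator = sum(
        w * zx[i] * zy[j]
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    )
    return numerator / total_weight
```

Positive values indicate that high expression of one gene tends to sit next to high expression of the other, which is the sense of spatial colocalization used in FIG. 6(a), FIG. 6(b), and FIG. 6(e).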

[0115] Whether the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) is able to recover genes with high spatial colocalization was determined by extracting gene pairs with high cosine similarity, for example, in the lung cancer atlas and glioblastoma study, and visualizing these genes using the corresponding spatial datasets, Xenium lung cancer and Visium glioblastoma. Importantly, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) was not trained on either of these datasets, so the evaluation is fully zero-shot. The high scoring genes have high spatial colocalization as measured by bivariate Moran’s I and visualized by their expression on the spatial profiling slide (FIG. 6(a) and (b)). Furthermore, the Pearson correlation between the cosine similarity of the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) and spatial colocalization increases when restricting the gene pairs to the ones with higher spatial colocalization, as measured by percentile value (FIG. 6(c)). This shows that while the multicellular gene interaction computation model can accurately capture, from sequencing data (e.g., single cell RNA sequencing data), the highly colocalized gene pairs where the signal is strong, the task is harder for the majority of the gene pairs, where the spatial colocalization is close to 0.

[0116] The interactions between cells in a sample are particularly important in spatial niches, where physically proximal cells are more likely to communicate. To account for this, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) can be finetuned on spatial transcriptomics data by defining the sample as a set of cells within a defined radius (FIG. 6(d)). For example, the Xenium human breast cancer dataset was used to finetune the pretrained multicellular gene interaction computation model (e.g., the multicellular gene computation model 115). The finetuning was performed using the same training objective (e.g., masked learning objective) as in pretraining.
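By way of illustration only, the construction of spatial samples as sets of cells within a defined radius of a cell of interest may be sketched as follows. The coordinate layout, the choice to include the center cell in its own sample, and the function names are assumptions made for this example. A brute-force distance scan is used for clarity; a spatial index would likely be preferable for datasets with many cells.

```python
import math


def spatial_sample(center, coords, radius):
    """Indices of all cells within `radius` of the cell at index `center`.

    coords: list of (x, y) positions, one per cell (hypothetical layout).
    The center cell itself is included in the returned sample.
    """
    origin = coords[center]
    return [i for i, position in enumerate(coords)
            if math.dist(origin, position) <= radius]


def build_spatial_samples(coords, radius, centers=None):
    """Map each chosen center cell to the indices of its spatial sample."""
    if centers is None:
        centers = range(len(coords))
    return {center: spatial_sample(center, coords, radius) for center in centers}
```

Each resulting list of cell indices can then play the role that a multicellular sample plays during pretraining, so that the same training objective applies unchanged.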

[0117] The finetuned model shows significantly higher correlation between the cosine similarity of the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) and spatial colocalization (FIG. 6(c)) and identified highly colocalized gene pairs (FIG. 6(e)), demonstrating how the multicellular gene interaction computation model benefits from having direct access to the spatial neighborhood of the cells. One of the key use cases for spatial data is identifying the interactions between different cell types in the tissue. Therefore, the finetuned multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) can be applied to identify interactions between marker genes of distinct cell types, such as tumor and T cells in Xenium breast cancer data. The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) was able to identify a number of interacting genes, including ones belonging to different cell types. The results were aggregated across cell types using the cosine similarity between marker genes and used to compute an interaction matrix. Here, the assumption was that a higher similarity in latent space between marker genes of cell type A and cell type B implies an interaction between the two. The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) was shown to accurately capture the relationship between cell types, demonstrating high interaction scores between immune cell types such as T and B cells, macrophages and dendritic cells, as well as tumor, epithelial, and stromal cells (FIG. 6(f)). These experiments show that the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) can be applied to spatial data as well as be successfully finetuned on a dataset of interest.

[0118] Cell Population-Level Perturbation Effect Prediction

[0119] Approximately 30% of drugs fail in clinical trials due to safety issues. One of the key aspects for evaluating the safety of a therapeutic is how it affects not only target cells, but also other cells in the same microenvironment. For example, a cancer drug should ideally kill fewer non-tumor cells than tumor cells. Existing perturbation methods focus on evaluating the effect of a perturbation at a cell line level, ignoring the microenvironment of target cells. Various implementations of the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) described herein consider the entire microenvironment and thus allow for in silico genetic perturbations by leveraging the gene embeddings generated therewith. Accordingly, in some example embodiments, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) may be applied to estimate the effect of perturbation at the cell population level.

[0120] FIG. 7 shows in silico genetic perturbation effect estimation at the cell population level, in accordance with some example embodiments. FIG. 7(a) shows the in silico gene knock-out effect on a set of true interacting genes and control genes. The control genes were randomly selected from a pool of available genes. The knock-out effect is quantified by the maximum log-fold change in a gene’s logits. The predictions were min-max normalized within a gene. The box represents the interquartile range (IQR), the horizontal line marks the median, and the whiskers extend to the farthest data points within 1.5 times the IQR. FIG. 7(b) shows the overlap between genes with the highest effect from the multicellular gene interaction computation model (e.g., genes leading to the highest reduction in the normalized tumor expression) and from viability readouts following CRISPR knock-outs from DepMap data across lung cancer cell lines. FIG. 7(c) shows the zero-shot predicted normalized effect on tumor expression using the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) versus the observed effect on tumor across lung cancer cell lines from DepMap data. The predicted effect from the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) was computed by calculating the change in expression in the sample between perturbed versus control and normalizing it by control expression. FIG. 7(d) shows the zero-shot predicted normalized effect on non-tumor cells versus on tumor cells. The predicted effect was calculated in the same way as in FIG. 7(c). FIG. 7(e) shows the lung cancer cell lines with the highest Pearson correlation between the viability readouts following CRISPR knock-outs from DepMap and the predicted normalized effect from the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115).

[0121] At the outset, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) is shown to be capable of capturing the relationships between genes upon a genetic perturbation. For example, a set of gene-gene relationships known to play a key role in cell-cell communication is curated. Perturbation effect prediction was performed across samples using the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) finetuned on the lung cancer atlas. In silico gene knock-out was performed by removing genes from the sample and comparing the result to the baseline model with the knocked-out gene present. The effect on other genes was quantified by calculating the maximum log2 fold change between the perturbed and baseline logits. The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) accurately learned the dependencies between genes, outperforming the control baseline across genes (FIG. 7(a)). The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) also performs well across genes on other metrics, such as area under the receiver operating characteristic (AUROC) curve, further validating the results. Predicting the effect of gene perturbation on other genes in a multicellular setting opens an avenue for evaluating target specificity and safety.

[0122] Currently, perturbation approaches focus on predicting the effect of the intervention on the target cell type, for example, by developing tumor cell line models and using them to quantify the impact of a perturbation. While these state-of-the-art approaches can be used to assess feasible targets lethal to tumor cells and understand the intracellular mechanisms, they do not take into account other cells in the environment. The multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) operates on cells across different cell types, which allows evaluation of gene perturbation (e.g., including knock-out or activation) to consider the impact on an entire cell population. That is, in some cases, the gene expression level across different cell types in the same multicellular environment may be determined based on the perturbation effect of the gene on other genes. To quantify the impact of perturbing a gene on a population level, the reconstructed gene expression is used to compare the predicted normalized expression, aggregated at the cell type level, with and without the gene knock-out. This approach was evaluated using the DepMap consortium data, which includes viability read-outs across distinct cell lines specifying the impact of CRISPR gene knock-out on the tumor population. Doing so allows the quantification of the effect of gene knock-out on the tumor population as a whole (e.g., the higher the effect, the more tumor cells it kills). Notably, the DepMap data is on the cell line level, which is a significant limitation because the model works on the cell population level.
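By way of illustration only, the population-level effect described above, in which the change in predicted expression between the perturbed and control conditions is normalized by the control expression and aggregated at the cell type level, may be sketched as follows. The sketch assumes that a single predicted expression value per cell is already available for both conditions; how those values are obtained from the model is outside the scope of this example. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def normalized_effect(control, perturbed, cell_types):
    """Per-cell-type normalized effect of a knock-out.

    control, perturbed: predicted expression per cell, without and with
        the knock-out (hypothetical per-cell summaries).
    cell_types: cell type label per cell, parallel to the other lists.
    Returns {cell_type: (perturbed_total - control_total) / control_total}.
    """
    control_total = defaultdict(float)
    perturbed_total = defaultdict(float)
    for c, p, label in zip(control, perturbed, cell_types):
        control_total[label] += c
        perturbed_total[label] += p
    return {
        label: (perturbed_total[label] - control_total[label]) / control_total[label]
        for label in control_total
        if control_total[label] != 0  # effect is undefined for zero control expression
    }
```

Under this convention, a strongly negative value for tumor cells together with a value near zero for non-tumor cells corresponds to a knock-out that is predicted to reduce the tumor population while sparing other cells in the microenvironment.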

[0123] In some cases, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) can be finetuned, such as on the lung cancer atlas, prior to in silico gene knock-out prediction across samples (e.g., with lung cancer). For the lung cancer atlas, CRISPR gene knock-out viability read-outs from relevant lung cancer cell lines were retrieved from DepMap. When the genes with the highest effect (e.g., ones which reduce the tumor population the most) were compared, the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) demonstrated considerable overlap (30%) with the genes from DepMap (FIG. 7(b)). To confirm that the effect is not solely due to removing the expression of the knock-out gene, the overlap between the genes with the highest effect coming from the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) and the genes with the highest expression levels in the tumor was determined. The resulting overlap equaled 50%, which is substantial but also expected as the predictions have been made in purely zero-shot fashion. A pairwise comparison between DepMap and the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) is shown in FIG. 7(c). On the SW 1573 cell line, the Pearson correlation between the two equaled 0.33 (p-value < 1 × 10^-10). Moreover, when investigating the genes with the highest effect in both the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) and DepMap, a large number of genes were found to encode ribosomal proteins (FIG. 7(c)). This result makes intuitive sense because knocking out a ribosomal protein is more likely to cause cell death, regardless of cell type. To account for this, the predicted effect on non-tumor cells was determined and plotted against the effect on tumor cells. This allows a look at genes that kill fewer non-tumor cells (low effect on non-tumor cells) in comparison to the tumor cells (high effect on tumor cells). This is desirable as it would potentially allow the discovery of safer targets. Such targets discovered by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) were significantly less enriched in ribosomal proteins, validating this approach.

[0124] A key aspect of identifying a promising therapeutic target is validating it by selecting an experimental model that is physiologically relevant and translatable to real-world biological systems. For example, using an appropriate cell line reduces the risk of misleading results that could arise from cells that do not adequately represent real-world conditions. Cell lines from the DepMap data, which includes hundreds of cell lines, with the highest correlation to the predicted effect determined by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) were identified by calculating the Pearson correlation between the predicted effect determined by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115) on genes and the viability read-outs. Doing so selects the cell lines that best mimic the real-world setting, as modeled by the multicellular gene interaction computation model (e.g., the multicellular gene computation model 115). The top cell lines for human lung cancer are shown in FIG. 7(e).
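By way of illustration only, ranking cell lines by the Pearson correlation between the predicted per-gene effect and the per-gene viability read-outs may be sketched as follows. The dictionary layouts, the minimum of three shared genes, and the cell line names used in the usage note are hypothetical and chosen for this example only; they do not reflect the actual DepMap data format.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant sequence
    return covariance / math.sqrt(var_x * var_y)


def rank_cell_lines(predicted, viability_by_line, min_genes=3):
    """Rank cell lines by correlation with the predicted per-gene effect.

    predicted: {gene: predicted normalized effect}.
    viability_by_line: {cell_line: {gene: viability read-out}}.
    Only genes present in both inputs are compared for each cell line.
    """
    ranking = []
    for line, readouts in viability_by_line.items():
        genes = sorted(set(predicted) & set(readouts))
        if len(genes) < min_genes:
            continue
        r = pearson([predicted[g] for g in genes], [readouts[g] for g in genes])
        if r is not None:
            ranking.append((line, r))
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```

For instance, calling `rank_cell_lines` with predictions for genes `g1` to `g3` and read-outs for two placeholder lines `LINE_A` and `LINE_B` returns the two lines ordered from most to least correlated, and the top entries would correspond to the kind of selection shown in FIG. 7(e).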

[0125] FIG. 8 depicts a block diagram illustrating an example of a computing system 800, in accordance with some example embodiments. Referring to FIGS. 1-8, the computing system 800 may be used to implement the gene analysis engine 110, the task engine 120, the client device 130, and / or any components therein.

[0126] As shown in FIG. 8, the computing system 800 can include a processor 810, a memory 820, a storage device 830, and input / output devices 840. The processor 810, the memory 820, the storage device 830, and the input / output devices 840 can be interconnected via a system bus 850. The processor 810 is capable of processing instructions for execution within the computing system 800. Such executed instructions can implement one or more components of, for example, the gene analysis engine 110, the task engine 120, the client device 130, and / or the like. In some example embodiments, the processor 810 can be a single-threaded processor. Alternatively, the processor 810 can be a multi-threaded processor. The processor 810 is capable of processing instructions stored in the memory 820 and / or on the storage device 830 to display graphical information for a user interface provided via the input / output device 840.

[0127] The memory 820 is a computer-readable medium, such as a volatile or non-volatile medium, that stores information within the computing system 800. The memory 820 can store data structures representing configuration object databases, for example. The storage device 830 is capable of providing persistent storage for the computing system 800. The storage device 830 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage means. The input / output device 840 provides input / output operations for the computing system 800. In some example embodiments, the input / output device 840 includes a keyboard and / or pointing device. In various implementations, the input / output device 840 includes a display unit for displaying graphical user interfaces.

[0128] According to some example embodiments, the input / output device 840 can provide input / output operations for a network device. For example, the input / output device 840 can include Ethernet ports or other networking ports to communicate with one or more wired and / or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).

[0129] In some example embodiments, the computing system 800 can be used to execute various interactive computer software applications that can be used for organization, analysis and / or storage of data in various formats. Alternatively, the computing system 800 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and / or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities or can be standalone computing products and / or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input / output device 840. The user interface can be generated and presented to a user by the computing system 800 (e.g., on a computer screen monitor, etc.).

[0130] One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and / or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and / or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0131] These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and / or object-oriented programming language, and / or in assembly / machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and / or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and / or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and / or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.

[0132] To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT), a liquid crystal display (LCD), a light emitting diode (LED) monitor, or an organic light emitting diode (OLED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.

[0133] In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and / or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features. For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and / or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and / or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” Use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.

[0134] The subject matter described herein can be embodied in systems, apparatus, methods, and / or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and / or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and / or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and / or described herein do not necessarily require the particular order shown, or sequential order, to achieve desired results. Other implementations may be within the scope of the following claims.

Claims

What is claimed is:

1. A computer-implemented method, comprising: generating a training dataset to include a plurality of training multicellular gene profiles, each training multicellular gene profile of the plurality of training multicellular gene profiles being associated with a sample including a plurality of cells, and each training multicellular gene profile including, for each cell of the plurality of cells, an expression level of each gene of a plurality of genes; training, based at least on the training dataset, a multicellular gene interaction computation model, wherein the training of the multicellular gene interaction computation model includes applying the multicellular gene interaction computation model to generate, for each gene in a training multicellular gene profile, a contextual gene embedding that captures a co-regulatory relationship between the gene and one or more other genes in the training multicellular gene profile; and applying the trained multicellular gene interaction computation model in one or more applications.

2. The method of claim 1, wherein the multicellular gene interaction computation model includes an embedding layer trained to generate, for each gene in the training multicellular gene profile, a gene embedding used to generate the contextual gene embedding of the gene.

3. The method of claim 2, wherein the embedding layer is trained to generate two or more similar gene embeddings for two or more co-regulated genes.

4. The method of any of claims 2 to 3, wherein each input unit of the embedding layer is fixed to a certain cell type by at least imposing a corresponding structure in each training multicellular gene profile in the training dataset.

5. The method of claim 4, wherein each training multicellular gene profile comprises a matrix having a plurality of sets of adjacent rows, and wherein each set of adjacent rows in the matrix are assigned to cells of a different cell type.
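
The row structure recited in claims 4 and 5 can be illustrated with a minimal sketch. It assumes each cell arrives as a (cell type, expression vector) pair and that a fixed cell-type order is agreed in advance; the function name `build_profile_matrix`, the input format, and the toy values are hypothetical and are not taken from the application.

```python
from itertools import groupby


def build_profile_matrix(cells, type_order):
    """Arrange cells so each cell type occupies one block of adjacent rows.

    cells: list of (cell_type, expression_vector) tuples (assumed input format).
    type_order: fixed ordering of cell types, so that a given row block maps
    to the same cell type in every sample.
    """
    rank = {cell_type: i for i, cell_type in enumerate(type_order)}
    # sorted() is stable, so cells keep their relative order within a type.
    ordered = sorted(cells, key=lambda cell: rank[cell[0]])
    row_types = [cell_type for cell_type, _ in ordered]
    matrix = [list(vector) for _, vector in ordered]
    return row_types, matrix


# Toy sample: four cells, three genes per cell (illustrative values only).
cells = [
    ("T cell", [5.0, 0.0, 1.0]),
    ("B cell", [0.0, 3.0, 2.0]),
    ("T cell", [4.0, 1.0, 0.0]),
    ("myeloid", [1.0, 0.0, 6.0]),
]
row_types, matrix = build_profile_matrix(cells, ["B cell", "T cell", "myeloid"])

# Size of each contiguous row block, in row order.
blocks = [(cell_type, len(list(group))) for cell_type, group in groupby(row_types)]
```

Fixing the block order across samples is one way to make a row position imply a cell type; the application does not state how the ordering is chosen.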

6. The method of any of claims 2 to 5, wherein the embedding layer includes a neural network trained to generate, for each gene in the multicellular gene profile, a gene expression embedding corresponding to an expression level of the gene across the plurality of cells in a corresponding sample.

7. The method of claim 6, wherein the embedding layer includes a gene identity embedding layer trained to generate, for each gene in the multicellular gene profile, a gene identity embedding that is independent of the expression level of the gene across the plurality of cells in the corresponding sample.

8. The method of claim 7, wherein the gene embedding of each gene comprises a combination of the gene expression embedding and the gene identity embedding.
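
As a rough illustration of claims 6 to 8, the sketch below forms a gene embedding from an expression-dependent part and an identity-dependent part. Several choices here are assumptions made only for illustration: a single linear projection stands in for the trained neural network, the identity embedding is a lookup table, and the "combination" is taken to be an element-wise sum (the claims do not say which combination is used). All names and values are hypothetical.

```python
def expression_embedding(expr_across_cells, weights):
    """Project one gene's expression levels (one value per cell) to d dimensions.

    weights: d rows, each holding one weight per cell. This linear map is a
    stand-in for the trained neural network of claim 6.
    """
    return [sum(w * x for w, x in zip(row, expr_across_cells)) for row in weights]


def gene_embedding(gene, expr_across_cells, identity_table, weights):
    """Combine both embeddings by element-wise sum (an assumed choice)."""
    expr_emb = expression_embedding(expr_across_cells, weights)
    id_emb = identity_table[gene]  # independent of the expression levels
    return [e + i for e, i in zip(expr_emb, id_emb)]


# Toy parameters: d = 2 embedding dimensions, three cells in the sample.
identity_table = {"GENE_A": [0.5, -1.0], "GENE_B": [0.25, 2.0]}
weights = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]

emb = gene_embedding("GENE_A", [1.0, 2.0, 3.0], identity_table, weights)
```

Concatenation would be an equally plausible reading of "combination"; the sum is used here only because it keeps the embedding dimension unchanged.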

9. The method of any of claims 2 to 8, wherein the multicellular gene interaction computation model includes a transformer block trained to generate, based at least on the gene embedding of each gene in the training multicellular gene profile, the contextual gene embedding of each gene in the training multicellular gene profile.

10. The method of any of claims 2 to 9, wherein the one or more downstream applications include identifying, based at least on a gene embedding and / or a contextual gene embedding of each gene in an input multicellular gene profile, one or more gene sets, and wherein each gene set of the one or more gene sets includes a plurality of co-regulated genes.

11. The method of claim 10, wherein the one or more gene sets are identified by at least identifying, based at least on the gene embedding and / or the contextual gene embedding of each gene in the input multicellular gene profile, one or more clusters of genes, and wherein each cluster of genes corresponds to a gene set comprising a plurality of genes having a functional relationship in one or more common biological processes.

12. The method of any of claims 2 to 11, wherein the one or more downstream applications include identifying, based at least on a gene embedding and / or a contextual gene embedding of each gene in an input multicellular gene profile, one or more pairs of genes involved in receptor-ligand interactions.

13. The method of claim 12, wherein the one or more pairs of genes are identified by at least computing, for each pair of genes in the input multicellular gene profile, a similarity metric quantifying a similarity between a corresponding pair of gene embeddings or contextual gene embeddings.

14. The method of claim 13, wherein a pair of genes is identified as being involved in receptor-ligand interactions based at least on the similarity metric of the pair of genes satisfying one or more thresholds.
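
The pairwise scoring and thresholding of claims 13 and 14 can be sketched as follows. Cosine similarity and a single lower threshold are assumptions; the claims recite only "a similarity metric" and "one or more thresholds". The function names, gene labels, and embedding values are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings, threshold):
    """Return gene pairs whose embedding similarity meets the threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(embeddings), 2):
        score = cosine(embeddings[g1], embeddings[g2])
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs


# Toy embeddings (illustrative values, not model output).
embeddings = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [2.0, 0.0],
    "GENE_C": [0.0, 1.0],
}
pairs = candidate_pairs(embeddings, threshold=0.9)
```

Scoring every pair is quadratic in the number of genes, so a practical pipeline would likely restrict the comparison to a curated list of receptor and ligand genes; that restriction is not part of this sketch.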

15. The method of any of claims 12 to 14, wherein the one or more downstream applications further include identifying, based at least on a gene embedding and / or a contextual gene embedding of each gene in the input multicellular gene profile, one or more gene sets, each gene set comprising a plurality of genes having a functional relationship in one or more common biological processes, identifying a gene set including at least one pair of genes involved in receptor-ligand interactions, and analyzing one or more other genes included in the gene set.

16. The method of any of claims 2 to 15, wherein the one or more downstream applications include determining, based at least on a gene embedding or a contextual gene embedding of each gene in an input multicellular gene profile, two or more spatially co-located genes in the input multicellular gene profile.

17. The method of claim 16, wherein a spatial co-localization of two or more genes is determined based at least on a similarity metric between two or more corresponding gene embeddings or contextual gene embeddings.

18. The method of any of claims 16 to 17, further comprising: generating an additional training dataset to include a plurality of spatially defined training multicellular gene profiles, each spatially defined training multicellular gene profile of the plurality of spatially defined training multicellular gene profiles being associated with a sample containing cells within a certain radius; finetuning, based at least on the additional training dataset, the trained multicellular gene interaction computation model; and applying the finetuned multicellular gene interaction computation model to determine the gene embedding or contextual gene embedding of each gene in the input multicellular gene profile.

19. The method of any of claims 1 to 18, wherein the one or more downstream applications include identifying one or more condition-specific genes.

20. The method of claim 19, wherein the identifying the one or more condition-specific genes includes identifying, based at least on contextual gene embeddings output by the trained multicellular gene interaction computation model, gene neighborhoods in multicellular gene profiles associated with different conditions, and identifying, based at least on a similarity between the gene neighborhoods in the multicellular gene profiles of different conditions, the one or more condition-specific genes.

21. The method of claim 20, wherein a gene neighborhood of a gene includes one or more other genes whose contextual gene embedding exhibits a threshold similarity to a contextual gene embedding of the gene.

22. The method of any of claims 19 to 21, wherein the identifying one or more condition-specific genes includes applying the trained multicellular gene interaction computation model to generate, for a multicellular gene profile associated with a condition, a first set of contextual gene embeddings for a first plurality of genes comprising the multicellular gene profile of the condition, determining, based at least on the first set of contextual gene embeddings, a first gene neighborhood of a gene present in the multicellular gene profile of the condition, applying the trained multicellular gene interaction computation model to generate, for a multicellular gene profile associated with a different condition, a second set of contextual gene embeddings for a second plurality of genes comprising the multicellular gene profile of the different condition, determining, based at least on the second set of contextual gene embeddings, a second gene neighborhood of the gene present in the multicellular gene profile of the different condition, and identifying, based at least on a similarity between the first gene neighborhood and the second gene neighborhood, the gene as being condition-specific.

23. The method of claim 22, wherein the gene is identified as being condition-specific based at least on a similarity metric between the first gene neighborhood and the second gene neighborhood satisfying one or more thresholds.
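
The neighborhood comparison of claims 20 to 23 can be sketched under stated assumptions. Here a gene neighborhood is the set of genes whose embedding has cosine similarity of at least `min_sim` to the query gene, neighborhoods are compared by Jaccard overlap, and a gene is flagged when the overlap is at or below `max_overlap`. The claims do not name the metrics or the direction of the threshold, so all three choices, along with the function names and toy embeddings, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighborhood(gene, embeddings, min_sim):
    """Genes whose contextual embedding is at least min_sim similar to `gene`."""
    ref = embeddings[gene]
    return {
        other
        for other, emb in embeddings.items()
        if other != gene and cosine(ref, emb) >= min_sim
    }


def jaccard(a, b):
    """Overlap of two sets; two empty neighborhoods count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_condition_specific(gene, emb_cond, emb_other, min_sim, max_overlap):
    """Flag a gene whose neighborhood changes between the two conditions."""
    first = neighborhood(gene, emb_cond, min_sim)
    second = neighborhood(gene, emb_other, min_sim)
    return jaccard(first, second) <= max_overlap


# Toy contextual embeddings for the same three genes under two conditions.
disease = {"G1": [1.0, 0.0], "G2": [1.0, 0.1], "G3": [0.0, 1.0]}
healthy = {"G1": [1.0, 0.0], "G2": [0.0, 1.0], "G3": [0.9, 0.1]}

flag = is_condition_specific("G1", disease, healthy, min_sim=0.9, max_overlap=0.5)
```

In this toy data the neighborhood of `G1` is `{"G2"}` in one condition and `{"G3"}` in the other, so the overlap is zero and the gene is flagged.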

24. The method of any of claims 22 to 23, wherein the condition comprises one disease or disease stage, and wherein the different condition comprises a different disease or different disease stage.

25. The method of any of claims 22 to 24, wherein the condition comprises a disease state, and wherein the different condition comprises a non-disease state.

26. The method of any of claims 1 to 25, wherein the one or more downstream applications include determining a perturbation effect of a gene on one or more other genes in a same multicellular environment.

27. The method of claim 26, further comprising: determining, based at least on the perturbation effect of the gene, a gene expression level across a plurality of different cell types in the same multicellular environment.

28. The method of any of claims 26 to 27, wherein the perturbation effect of the gene is determined by at least applying the trained multicellular gene interaction computation model to generate, for a baseline multicellular gene profile including the gene in a baseline state, a first set of contextual gene embeddings for a plurality of genes comprising the baseline multicellular gene profile; applying the trained multicellular gene interaction computation model to generate, for a perturbed multicellular gene profile including the gene in a perturbed state, a second set of contextual gene embeddings for a same plurality of genes comprising the perturbed multicellular gene profile; and determining, based at least on a difference between the first set of contextual gene embeddings and the second set of contextual gene embeddings, the perturbation effect of the gene.
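
The baseline-versus-perturbed comparison recited for the perturbation effect can be sketched as below. The sketch takes the two sets of contextual gene embeddings as given (the model call itself is omitted) and assumes that the "difference" is a per-gene Euclidean distance; the claim does not specify the difference measure. The function name and the toy values are hypothetical.

```python
import math


def perturbation_effect(baseline, perturbed):
    """Per-gene Euclidean distance between baseline and perturbed embeddings.

    Both arguments map gene -> contextual embedding and must cover the same
    plurality of genes.
    """
    if baseline.keys() != perturbed.keys():
        raise ValueError("both embedding sets must cover the same genes")
    return {gene: math.dist(baseline[gene], perturbed[gene]) for gene in baseline}


# Toy embeddings: G1 is unchanged, G2 shifts strongly, G3 shifts slightly.
baseline = {"G1": [1.0, 0.0], "G2": [0.0, 1.0], "G3": [1.0, 1.0]}
perturbed = {"G1": [1.0, 0.0], "G2": [3.0, 5.0], "G3": [1.0, 2.0]}

effect = perturbation_effect(baseline, perturbed)
# Genes ordered from most to least affected.
ranked = sorted(effect, key=effect.get, reverse=True)
```

Ranking genes by distance gives one simple readout of which genes respond most to the perturbation; a cosine-based shift would be an equally reasonable assumption.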

29. The method of any of claims 26 to 28, further comprising: generating an indication specific training dataset including a plurality of indication specific multicellular gene profiles; finetuning, based at least on the indication specific training dataset, the trained multicellular gene interaction computation model; and applying the finetuned multicellular gene interaction computation model to determine the first set of contextual gene embeddings and the second set of contextual gene embeddings.

30. The method of any of claims 1 to 29, wherein the multicellular gene interaction computation model is trained to recover a masked identity and / or a masked expression level of one or more genes in each training multicellular gene profile.
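
A minimal sketch of the masked-recovery objective follows, limited to masked expression levels. It assumes that whole genes are masked across all cells, that the mask is a placeholder value, and that recovery is scored by mean squared error on the masked entries only; none of these details is stated in the claims. The names `mask_genes`, `recovery_loss`, and `MASK` are hypothetical.

```python
import random

MASK = None  # hypothetical placeholder for a hidden expression value


def mask_genes(profile, mask_fraction, rng):
    """Hide the expression of a random subset of genes across all cells.

    profile: dict mapping gene -> list of expression levels (one per cell).
    Returns (masked_profile, targets); targets holds the hidden values that
    the model would be trained to recover.
    """
    genes = sorted(profile)
    n_mask = max(1, round(mask_fraction * len(genes)))
    chosen = set(rng.sample(genes, n_mask))
    masked = {
        gene: [MASK] * len(levels) if gene in chosen else list(levels)
        for gene, levels in profile.items()
    }
    targets = {gene: list(profile[gene]) for gene in chosen}
    return masked, targets


def recovery_loss(predictions, targets):
    """Mean squared error computed over the masked entries only."""
    errors = [
        (p - t) ** 2
        for gene in targets
        for p, t in zip(predictions[gene], targets[gene])
    ]
    return sum(errors) / len(errors)


# Toy profile: four genes measured in two cells (illustrative values).
profile = {"G1": [1.0, 2.0], "G2": [0.0, 4.0], "G3": [3.0, 3.0], "G4": [5.0, 1.0]}
masked, targets = mask_genes(profile, mask_fraction=0.5, rng=random.Random(0))
```

During training, the model's predictions for the masked genes would be passed to `recovery_loss` and the parameters adjusted to reduce it; the optimizer and model are outside the scope of this sketch.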

31. The method of claim 30, wherein the training of the multicellular gene interaction computation model includes adjusting one or more parameters of the multicellular gene interaction computation model such that the contextual gene embedding generated by the multicellular gene interaction computation model enables a recovery of the masked identity and / or the masked expression level of the one or more genes in each training multicellular gene profile.

32. A system, comprising: at least one data processor; and at least one memory storing instructions, which when executed by the at least one data processor, result in operations comprising the method of any of claims 1 to 31.

33. A non-transitory computer readable medium storing instructions, which when executed by at least one data processor, result in operations comprising the method of any of claims 1 to 31.