Method for analysis and query-based interrogation of biological sequence data

The method decomposes biological sequence data into non-overlapping portions for a two-tiered index, addressing computational intensity in spatial and single-cell RNA sequencing, enabling fast, flexible, and user-friendly data exploration and classification.

WO2026132228A1PCT designated stage Publication Date: 2026-06-25MAX DELBRUECK CENT FUER MOLEKULARE MEDIZIN

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
MAX DELBRUECK CENT FUER MOLEKULARE MEDIZIN
Filing Date
2025-12-18
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Current methods for analyzing high-resolution spatial and single-cell RNA sequencing data are computationally intensive and require alignment to reference genomes, limiting real-time exploration and scalability, especially for large datasets.

Method used

A computer-implemented method that decomposes biological sequence data into non-overlapping portions, creating a two-tiered index for each portion, enabling direct querying and visualization without reliance on reference genomes, using k-mers for efficient lookup.

Benefits of technology

Enables ultrarapid analysis and flexible exploration of complex sequencing data, allowing for reference-free classification of cell types and spatial domains, and integration with large language models for automated biological insight generation, significantly reducing processing time.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure EP2025087998_25062026_PF_FP_ABST
    Figure EP2025087998_25062026_PF_FP_ABST
Patent Text Reader

Abstract

The invention relates to a computer-implemented method for creating a searchable indexed data structure for biological sequence data. The invention further relates to searching in said data structure for any given biological sequence data or associated information. The invention relates to a computer-implemented method comprising: a. providing biological sequence input data, b. decomposing the biological input data into non-overlapping portions, c. indexing each portion of the biological input data with at least one associated information of said portion comprised within or provided with the biological input data, thereby creating a two-tiered index for each portion. The method may further comprise d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding the metadata comprised within or provided with the biological input data. The method may further comprise e. providing at least one output comprising at least one biological sequence information, biological sequence identifier and / or metadata of one or more portions of the biological sequence input data corresponding to the query. The invention further relates to the use of the method of the invention for identifying tissue and / or cell localized expression of at least one nucleic acid, such as a gene, and / or protein of interest. The method further relates to a data structure and a computer-readable medium comprising a data structure, such as a database, constructed according to the invention.
Need to check novelty before this filing date? Find Prior Art

Description

[0001] METHOD FOR ANALYSIS AND QUERY-BASED INTERROGATION OF BIOLOGICAL SEQUENCE DATA

[0002] DESCRIPTION

[0003] The present invention lies in the field of molecular biology and genomics, particularly in the field of computer-implemented analysis of biological sequence information.

[0004] The invention relates to a computer-implemented method for creating a searchable indexed data structure for biological sequence data. The invention relates to a computer-implemented method, comprising a. providing biological sequence input data, b. decomposing the sequence input data into non-overlapping portions, and c. indexing each portion of the sequence data with at least one associated information of said portion, comprised within or provided with the biological input data, thereby creating a two-tiered index for each portion.

[0005] In embodiments, the method further comprises d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding the metadata comprised within or provided with the biological sequence input data. In embodiments, the method further comprises e. providing at least one output, comprising at least one sequence, sequence identifier information and / or metadata of one or more portions of the biological sequence input data corresponding to the query, preferably by a graphical user interface.

[0006] The invention further relates to the use of the method of the invention for creating a searchable indexed data structure for biological sequence data. The invention further relates to using the method of the invention for searching in said data structure for any given biological sequence data or associated information.

[0007] The invention further relates to the use of the method of the invention for identifying tissue and / or cell localized expression of at least one nucleic acid, such as a gene, and / or protein of interest. The invention further relates to the use of the method of the invention for identifying expression of a gene and / or protein of interest with respect to tissue architecture, individual cells, cell-cell interactions, and / or the spatial distribution of gene expression.

[0008] The invention further relates to the use of the method of the invention for the analysis, diagnosis, prognosis, treatment guidance and / or monitoring of a medical condition, comprising analyzing biological sequence input data obtained from a sample of a subject suspected of or diagnosed with a disease.

[0009] The invention further relates to a computer-implemented method for constructing a data structure, such as the indexed data structure obtained via the method described herein, such as an indexed database of biological sequence data. The invention further relates to a data structure, such as an indexed database of biological sequence data, and a computer-readable medium comprising the data structure, constructed according to the method of the invention, comprising a two-tiered index and optionally the biological sequence input data. BACKGROUND OF THE INVENTION

[0010] Over the past decade, advances in spatial and single-cell RNA sequencing technologies have revolutionized our understanding of tissue architecture, cell-cell interactions, and the spatial dynamics of gene expression (Liu et al. 2024; Baysoy et al. 2023). High-resolution spatial transcriptomics methods in particular, such as Open-ST, Stereo-seq, Seq-Scope, and Slide-seq v2, generate remarkably rich datasets that capture gene expression at subcellular resolution across intact tissue sections (Chen et al. 2022; Schott et al. 2024; Cho et al. 2021 ; Stickels et al. 2021 ; Vickovic et al. 2019).

[0011] However, traditional analysis of these datasets presents significant computational challenges. Current approaches rely heavily on mapping reads to reference genomes followed by gene quantification (Melsted et al. 2021 ; Sztanka-Toth et al. 2022; He et al. 2022; Kaminow et al. 2021 ; Xi et al. 2022). This process is computationally intensive and must be repeated whenever exploring new features - for instance, previously unannotated sequences or specific splice variants. With current high-resolution technologies routinely generating over 1 billion reads across tens of millions of spatial locations in a single experiment (Schott et al. 2024), exploratory analysis of the sequence space becomes computationally prohibitive. The fastest available alignment algorithms require several hours of processing time for each new analysis. For example, standard single cell sequencing analysis pipelines remain reference-centric: they process sequencing reads by aligning them to a user-provided reference. Applying these approaches at ‘atlas’ scale requires downloading and reprocessing petabytes of raw files, which is beyond the computational resources available to most laboratories.

[0012] Previous efforts have attempted to overcome the need to repeatedly process raw reads when quantifying new sequences. One common approach consists of splitting raw reads into k-mers (all contiguous substrings of length k derived from a sequence) and building a searchable index that maps k-mers to metadata, such as sample or species identifiers. This approach is sufficient for estimating the presence of genes in samples (Solomon and Kingsford 2016). In practice, the k-mer lookup leverages efficient versions of hash tables, minimal perfect hash functions, Bloom filters, or a combination of these (Zhao et al. 2024; Bradley et al. 2019; Holley and Melsted 2020; Marchet et al. 2021 ; Yu et al. 2018; Karasikov et al. 2020; Marchet et al. 2020; Pandey et al. 2018). Upon k-mer lookup, these data structures lead to pointers to metadata, which can be stored and accessed efficiently as sparse matrices, bitmaps, delta lists, or colored de-Bruijn graphs. Methods like SPLASH or SPLASH2 (Chaung et al., 2023; Kokot et al., 2024) perform k- mer-based analysis of such label-rich data. Rather than building searchable indexes, these excel at splice junction or mutation detection, by detecting pairs of anchors that are significantly different between predefined groups. However, these tools are not optimized for the unique characteristics of single-cell or spatial data, where billions of k-mers map to millions or billions of spatial coordinates. For these reasons, no framework for sequence analysis and interrogation has been deployed across a large corpus of single-cell or spatial sequencing data, yet.

[0013] For instance, a typical Open-ST dataset (3x4 mm2physical size) covers hundreds of millions of unique k-mers over tens of millions of spatial locations. Unlike assigning labels to hundreds of large genomes (Bradley et al. 2019; Karasikov et al. 2020; Zhao et al. 2024), single-cell and spatial sequencing data would involve indexing many short reads, where each location or cell ID would contain only a couple hundreds or thousands of k-mers. In other words, even if the total number of sequenced bases is similar, the ratio of reads to metadata from these omics experiments is widely different to the bulk setting. To date, no fast k-mer lookup approach has been adapted for spatial or single-cell omics, due to several challenges: incompatibility with data formats, poor scalability to large numbers of spatial locations, and the lack of integrated tools for direct querying and visualization.

[0014] Until today, analysis of single-cell and spatial transcriptomics sequencing experiments heavily relies on reference genomes and computationally intensive algorithms, limiting discovery to annotated references and hindering real-time, interactive exploration of the sequence space.

[0015] SUMMARY OF THE INVENTION

[0016] In light of the prior art the technical problem underlying the present invention is to provide improved or alternative methods and means for an analysis of spatial or single-cell sequencing data that are less computationally intensive and enable improved scalability.

[0017] Another aim of the present invention is to provide improved or alternative methods and means for the analysis of spatial or single-cell omics data enabling a fast analysis, direct querying and visualization of the data.

[0018] This problem is solved by the features of the independent claims. Preferred embodiments of the present invention are provided by the dependent claims.

[0019] The present invention is therefore directed to methods for the analysis and interrogation of biological sequence information, preferably for the analysis and interrogation of large data sets comprising biological sequence information and further associated information.

[0020] The invention relates in one aspect to a computer-implemented method, comprising a. providing biological sequence input data, b. decomposing the sequence input data into non-overlapping portions, c. indexing each portion of the sequence data with at least one associated information of said portion comprised within or provided with the biological sequence input data, thereby creating a two-tiered index for each portion.

[0021] In embodiments, the method is for creating a searchable indexed data structure for biological sequence data.

[0022] In embodiments, the method is for searching in a data structure for any given biological sequence data and / or associated information.

[0023] In embodiments, step b. decomposing the biological input data into non-overlapping portions, comprises decomposing the biological input data into data strings of length k (k-mers). Preferably, the biological input data is decomposed into subsequent / consecutive, non-overlapping data strings of length k (k-mers), preferably wherein only the last data string of length k (k-mer) into which input data, e.g., sequence data, is decomposed is allowed to overlap with the second last k-mer, if the entire input sequence comprises a sum of elements that cannot be divided by k, such that any sequence comprised within biological input data may be entirely covered and thereby represented by the k-mers it is decomposed into. In a non-limiting example, the input sequence ATCGTCGA may be decomposed into k-mers of length 3 as follows: ATC-GTC-CGA, and into k-mers of length 4: ATCG-TCGA, etc. In other words, in embodiments a respective individual input sequence comprised within the biological input data may be decomposed (e.g., in step b.) into a set of consecutive k-mers of equal length, wherein each k-mer constitutes a “subsequence” (sequence portion or fragment) of said individual sequence. Hence, the terms "consecutive" or "non-overlapping" k-mers or portions preferably do not exclude that single k- mers of the sum of k-mers generated from an individual input sequence overlap with their neighboring k-mer(s), e.g., the second last and the last k-mer of an input sequence, such that different individual input sequences of varying length may be decomposed into k-mers of equal length, while each of the elements of a sequence is represented by / within said sum of k-mers.

[0024] In preferred embodiments, the input data comprises biological sequence data, preferably comprising a plurality of individual biological sequences (s,), such as nucleic acid and / or amino acid sequences. In preferred embodiments, the biological sequence input data is from or derived from a sequencing, transcriptomics, spatial sequencing, or spatial transcriptomics analysis, preferably of a biological sample.

[0025] In embodiments, at least a portion of the plurality ofz, or each individual sequence(s) comprised within the biological sequence input data is associated with additional information (“associated information”), such as, without limitation, metadata (e.g., information regarding the origin of said biological input data), and / or sequence characteristics (e.g., sequence (read) identifiers and / or (unique) barcodes).

[0026] In preferred embodiments, the method further comprises d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding the metadata comprised within or provided with the biological sequence input data.

[0027] In embodiments, the method further comprises interrogating the two-tiered index with regard to I based on a query, e.g., comprising comparing, in cases where the query comprises biological sequence information, said biological sequence information to the two-tiered index, preferably thereby obtaining from the index at least information regarding the presence, abundancy, location and / or other associated information of said query biological sequence information within the biological input data.

[0028] In embodiments where a query comprises biological sequence information having a greater sequence length than k-mers comprised within the two-tiered index, said biological sequence information may be decomposed, e.g., using a sliding window, into consecutive, non-overlapping k-mers (having the same length as k-mers within the two-tiered index), which are then used to interrogate the two-tiered index according to the invention.

[0029] In embodiments, the query received in d. is processed according to the present method, comprising interrogating the generated two-tiered index, based on the query.

[0030] In an exemplary, non-limiting example, the present method enables, in case a query comprises biological sequence information, obtaining information on the expression, single-cell origin, spatial localization and / or spatial expression of said sequence information within the biological input data, e.g., a sample analyzed by spatial sequencing and / or single cell sequencing. In some embodiments the query comprises in addition to, or instead of sequence information languagebased prompts, which may be processed by the present metho using an LLM, thereby facilitating interrogating the two tiered index (and optionally also further resources) based on the language based prompt.

[0031] In embodiments, the present method comprises interrogating one or more online or local databases to ‘translate’ user queries into searchable biological sequence information.

[0032] In some embodiments, when a query comprises a gene symbol or Ensembl ID, these are mapped to genomic coordinates on standard reference genomes (e.g., currently hg38 for human, mm10 for mouse), optionally with default repeat masking. In some embodiments, when a query relates to the expression of a protein, gene or mRNA, full transcript sequences may be obtained from online resources, such as Ensembl.

[0033] In some embodiments, when a query relates to the coverage (e.g., abundancy, presence) of a certain genomic region within the biological sequence input data, the query may comprise genomic coordinates, wherein the present method uses the provided genomic coordinates to define a ‘region of interest’ and returns k-mer matches for each sequence position within the genomic coordinates, such that as output a visualization of the read distribution along genes and / or a written description of the coverage may be provided.

[0034] In embodiments, indexing each portion in c. comprises the construction of at least one index lookup table assigning each portion to the respective two-tiered index generated in c.

[0035] In embodiments, the present method enables a reference-free framework for ultrarapid analysis of biochemical sequence data, such as e.g., nucleic acid sequence data, preferably comprising spatial and / or single-cell data. In some embodiments, this analysis is available directly from raw reads. In embodiments, the present method employs k-mers for efficient lookup, akin to the indexing of a search engine, achieving at least 10x faster indexing and 10Ox faster querying than existing tools, while accurately representing gene expression from a sample, and preferably from spatial sequencing data obtained from diverse tissues and using various sequencing technologies.

[0036] In the examples provided below, the inventors demonstrate the advantages of two novel approaches: reference-free classification of cell types and spatial domains, and integration with large language models for automated biological insight generation. In their exemplary experiments using head-and-neck tumor samples, the inventors demonstrate how the method enables de novo annotation of tissue from a spatial sequencing analysis, automatically identifying tumor mass and immune cell interfaces, without relying on reference genomes. The present method therefore enables significantly faster interrogation of complex sequencing data, such as the data obtained from a spatial transcriptomics or spatial proteomics analysis.

[0037] As is shown in more detail below, the present method provides a framework that converts biochemical sequence raw data, such as single-cell and spatial sequencing raw data, into searchable objects, enabling ultrafast querying of arbitrary sequences and direct visualization in their spatial context.

[0038] In embodiments, the present method employs a simple k-mer matching approach to bypass alignment to a reference, thereby enabling the processing of massive datasets quickly, for example in minutes rather than hours. For instance, in one experimental example, the present method processed a large 7 TB high-resolution spatial atlas (Wu et al. 2024) in less than 1 hour on a standard 32-core workstation. Using diverse datasets, the inventors have validated that the results of the ’pseudoquantification’ of the invention, performed in preferred embodiments, is qualitatively comparable to well- established pipelines based on alignment and quantification. The inventors were further able to show that the present approach is able to accurately determine gene expression levels and distribution across diverse tissues. Embodiments of the present method enable novel analysis paradigms, such as reference-free classification of cell types and spatial domains, and integration with large language models for automated biological insight generation. The inventors propose the present method as an important building block for search engines and artificial intelligence applications, e.g., in single-cell and spatial sequencing, enabling real-time exploration of the full sequence space in health and disease.

[0039] Notably, since the present approach allows direct interrogation of raw data, it enables several analyses that would otherwise require to reprocessing of the data with accordingly modified pipelines. In standard frameworks, the processed objects are “static” and do not offer flexibility for exploration of entities not previously observed or quantified, such as pathogens or specific intron / exon combinations. Furthermore, the method described herein extends beyond basic sequence matching. In embodiments, it transforms complex sequence data, such as from nucleic acid sequencing, spatial sequencing or single-cell transcriptomics data analysis, into a user- friendly "search engine", by integrating sequence databases with large language models (LLMs). For example, users may directly explore biological questions, without annotated reference sequences. In embodiments, the present method considers a nucleotide sequence as the fundamental unit of a query, although a query may comprise also gene identifiers and / or natural language prompts, which, however, are then resolved by the present method into the underlying nucleotide sequence representation, to then be used for interrogating (or comparison to) the two- tired index.

[0040] The present invention therefore provides ultrarapid, flexible, and user-friendly tools for exploring biochemical sequence data, such as gene expression or spatial and single-cell data. The method makes raw sequence reads searchable in a vastly reduced time. It further enables the integration of diverse information sources into a biological sequence data set, and may be employed as a component of future search engines, analysis platforms and machine learning applications in biological sequence analysis.

[0041] In embodiments, the biological sequence input data comprises genomic, transcriptomic and / or proteomic data. The invention may be employed using any given biological sequence data, independent of which biological molecule is being measured and interrogated. In preferred embodiments, sequence data obtained from nucleic acid sequencing experiments is provided as input data. In other embodiments, protein sequence data, for example as may be obtained using mass spectrometry based proteomics, may be employed.

[0042] In embodiments, the biological sequence input data preferably comprises “raw” (unprocessed) biological sequence data, preferably nucleic acid sequence data, such as genomic or transcriptomic data.

[0043] In embodiments, the biological sequence data comprises genomic data comprising DNA- sequencing data. In embodiments, the biological sequence data comprises transcriptomic data comprising RNA-sequencing data. In embodiments, the biological sequence data comprises proteomic data comprising amino acid sequence data. In embodiments, the biological sequence input data (directly) originates or is obtained from a biological sample to be analysed by the present method.

[0044] In embodiments, the present method enables extracting and storing the sequence information obtained from a sample into the two-tiered index according to the invention, thereby enabling direct analysis and interrogation of said sequence information, preferably without alignment to a reference sequence.

[0045] In embodiments, the biological sequence input data originates from a sample comprising at least one cell and / or the contents of at least one cell. In embodiments, the biological sequence input data is obtained from intact tissue, from a cell, and / or from a sub-compartment or region of a biological cell. For example, the sequence data may be obtained from a spatial transcriptomics analysis. As is known to a skilled person, spatial transcriptomics is a method that captures positional context of transcriptional activity within cells or tissues. Defining the spatial distribution of gene expression, typically through detection and sequencing of mRNA molecules, allows for the detection of cellular heterogeneity in various tissues and cells, such as tumors or immune cells, as well as determining the subcellular distribution of transcripts under various conditions.

[0046] In embodiments, the at least one associated information comprises metadata and / or sequence identifier information.

[0047] In embodiments, the at least one associated information comprises metadata, comprising information on the spatial position and / or (single) cell, from which said biological sequence data (a respective k-mer) was obtained (or detected), and / or sequence identifier information.

[0048] In embodiments, the associated information comprises metadata, which is comprised within, or is provided with, the biological sequence input data. In embodiments, the metadata comprises information regarding the origin or location of said biological sequence input data. In embodiments, the metadata comprises information on the spatial position and / or cell from which said biological sequence data (comprising respective k-mers) was obtained. In embodiments, the metadata comprises information on the spatial position of a respective sequence 5 comprised within the input data, preferably spatial coordinates. In embodiments, the metadata comprises information on the sequencing reads, number of sequencing reads and / or location of sequencing reads obtained for any given portion of the sequence data. In embodiments, the metadata includes quality metrics such as sequencing quality scores.

[0049] In embodiments, the metadata may also comprise information regarding the analysis context, such as study information, tissue context, experimental condition(s), disease context, and / or developmental stage.

[0050] In embodiments, the spatial position regarding the origin or location of said biological sequence input data was obtained by microscopy, preferably by spatial sequencing comprising a microscopic analysis of the sample.

[0051] In embodiments, the at least one associated information comprises metadata, such as spatial coordinates, and sequence identifiers, such as unique molecular identifiers (UMIs) or barcodes.

[0052] In specific embodiments the method comprises an automated standardization pipeline based on an ontology mapping tool, e.g., text2term (a specialized biomedical ontology mapping tool), which preferably employs semantic similarity algorithms to map free-text metadata terms to established ontology standards, to harmonize heterogeneous metadata from diverse sources. In some specific embodiments the automated standardization pipeline first preprocesses metadata by standardizing null values and normalizing common terminology variants (preferably wherein multivalue fields containing multiple biological conditions are decomposed and mapped individually to ensure comprehensive coverage), then applies TF-IDF-based semantic matching with hierarchical fallback strategies to map terms to domain-appropriate ontologies (e.g., diseases to an disease ontology, such as MONDO, anatomical structures to an anatomical ontology, such as UBERON, phenotypic traits to a phenotype ontology, such as PATO, and / or developmental stages to age-related ontologies). In embodiments, subsequently, to enable system-level biological analysis, organs may be additionally mapped to their hierarchical parent anatomical systems through ontology traversal (e.g., within an anatomical ontology). Preferably such multilayered semantic harmonization approach ensures that functionally equivalent terms from different repositories are mapped to identical standardized identifiers, enabling robust cross-study aggregation and systematic biological queries while preserving both granular and hierarchical biological relationships.

[0053] In embodiments, the at least one associated information comprises sequence identifier information. In embodiments, sequence identifier information is comprised within, or is provided with, the biological input data. In embodiments, the sequence identifier information comprises sequence identifiers such as unique molecular identifiers (UMIs) or barcodes. In embodiments, the sequence identifier information comprises information regarding the sequencing read by which respective sequence 5 was obtained. In embodiments, the sequence identifier information comprises cellular barcodes that identify the spatial capture location. In embodiments, the sequence identifier information comprises sample indices that identify different experimental conditions or replicates. In embodiments, the sequence identifier information includes technologyspecific barcodes such as spatial barcodes, cell barcodes, or feature barcodes.

[0054] In embodiments, the information regarding the origin of said biological input data comprises sequence information of nucleic acid labels (e.g., barcodes, unique molecular identifiers (UMIs) etc.), spatial coordinates and / or chemical information of biochemical labels (e.g., amino acid labels, isotopes or chemical modifications). The skilled person is familiar with different means and methods of labelling individual nucleic acid or amino acid sequences.

[0055] In embodiments, the method further comprises e. providing at least one output comprising at least one sequence, sequence identifier information and / or metadata of one or more portions of the biological input data corresponding to the query.

[0056] In embodiments, the e. providing at least one output comprises providing at least one associated information, e.g., sequence identifier(s) and / or metadata of one or more portions of the biological input data corresponding to the query. For example, the query may comprise sequence data, and the associated information (meta data) may be presented as an output. In embodiments, the associated information output, e.g., sequence identifier(s) and / or metadata, is provided (to the user) by a graphical user interface (GUI), a screen, a file, a print-out or any other output format.

[0057] In embodiments, the e. providing at least one output comprises providing at least one biological sequence, such as a portion as described herein, for example where the query is directed to the associated information, e.g., sequence identifier(s) and / or metadata, corresponding to the query. In embodiments, the sequence output is provided (to the user) by a graphical user interface (GUI), a screen, a file, a print-out or any other output format. In embodiments, the present method comprises an automated ‘standardization pipeline’ that maps all metadata comprised within the input data to Human Cell Atlas ontology standards. In embodiments such pipeline uses state-of-the-art language models to interpret free-text descriptions and map them to controlled vocabularies for e.g., anatomical structures, cell types, developmental stages, and / or diseases, etc. This harmonization facilitates meaningful crossstudy queries, for example, users querying for "lung epithelial cells" retrieve results whether the original metadata said "pulmonary epithelium", "lung tissue", or "respiratory epithelial cells".

[0058] In embodiments, enabling aggregation of query results at the cell type level, the present method comprises a de novo annotation step comprising a two-stage approach for indexed cells.

[0059] In embodiments, the two-stage approach comprises in a first stage a marker-based scoring. In embodiments, the marker-based scoring is based on curated marker gene sets for cell types of interest, e.g., major / relevant cell types of the human body (e.g., following cell ontology classifications) or of a respective organ of interest, that are compiled with respective cell types, preferably defined by at least 5, or by 10-20, literature-validated markers. In embodiments, for each cell, enrichment scores are calculated for all cell type signatures, preferably using the precomputed gene expression matrices. In embodiments, a clustering is performed using standard methods (e.g., PCA, k-nearest neighbors, Leiden algorithm) such that differentially expressed genes are identified for each cluster, wherein respective clusters are assigned, cell types based, on the differential expression of marker genes, considering one or both, the number of expressed markers and / or their specificity to the cluster.

[0060] In embodiments, the two-stage approach comprises in a second stage, a supplementation of the annotations generated in the first stage using a language model (LM), which additionally integrates biological context. In embodiments, this integration by the LM comprises generating cell type assignments using the (top) differentially expressed genes of each cluster generated in the first stage, and the sample metadata (e.g., tissue origin, developmental stage and / or disease state, etc.). Preferably, to ensure biological accuracy, the LM also evaluates whether each annotation generated in the first stage is plausible given the tissue context, preferably thereby flagging unlikely assignments for correction. In embodiments, this two-stage approach generates cell type annotations that follow controlled or standard vocabularies.

[0061] In embodiments, the biological sequence input data is obtained by single cell sequencing, spatial nucleic acid sequencing (e.g., in FASTQ, SAM, BAM or CRAM format), microscopic and / or mass spectrometry analysis of a biological sample. In embodiments, nucleic acid sequencing data may be obtained by nucleic acid, namely DNA and / or RNA sequencing. In embodiments, the nucleic acid sequencing is or comprises single-cell sequencing, such as single cell DNA (genomic) and / or RNA (e.g., transcriptomic) sequencing. In embodiments, nucleic acid sequencing data may be obtained by spatial sequencing comprising DNA and / or RNA sequencing and a microscopic or other visual (image) analysis.

[0062] In embodiments, as biological input (sequence) data (raw sequence data), raw sequence reads are provided (e.g. as FASTA / FASTA.GZ, FASTQ / FASTQ.GZ or FAST5 / uBAM / SAM / BAM / CRAM files). In embodiments, across technologies, these data contain biological sequence data, such as nucleic acid or amino acid sequences, as well as metadata, such as cell barcodes, spatial coordinates and, optionally, sequence identifiers such as unique molecular identifiers (UMIs). In embodiments, individual datasets are processed independently before being merged into the unified index.

[0063] In embodiments, FASTQ files of raw sequence reads are parsed in parallel, optionally using optimized routines (including SIMD instructions for pattern matching), preferably such that cell barcodes and / or sequences are extracted at rates up to or even exceeding 2 million reads per second. In embodiments, for each dataset, reads are streamed through memory-efficient buffers, extracting k-mers and their associated barcodes.

[0064] In embodiments the present method rapidly converts the sequence input data into searchable objects (indexed data) without requiring reference sequences (see e.g., Fig. 1A).

[0065] The individual sequences (biological sequence data, as input) can in principle be any length that may be obtained from or derived from a sequencing experiment. A skilled person is aware of sequencing technologies and can generate, use and, if necessary, modify the biological sequence information in the present invention accordingly.

[0066] In embodiments the present method processes biological sequence input data, such as (raw) sequencing reads, in batches and transmits intermediate indices to a storage after processing a certain amount / number of processed data (e.g., a certain number of reads, such as every 50-200 million, or every 100 million reads). An advantage of such embodiments is that runtime per sample scales with unique sequences or barcodes comprised within the biological sequence input data, such as (raw) sequencing reads, rather than sequencing depth and / or and indexing speed approaches the theoretical limit of gzip decompression - currently one of the unavoidable bottlenecks when processing raw sequencing data. In a non-limiting example, the inventors validated indexing performance of an embodiments of the present method on a Stereo-seq mouse liver atlas containing ~61 billion reads across twenty sections, each with over 3 billion reads and hundreds of millions of spatial barcodes. Surprisingly, the present method enabled a complete indexing in 24 CPU-hours with peak memory under 8 GB, compared to 190 hours for kallisto and 724 hours for STAR (see, e. Example 3).

[0067] In embodiments, the individual sequences of the biological sequence input data have a length of -20-5,000,000 nt, -20-4,000,000 nt, -20-1 ,000,000 nt, -20-500,000 nt, -20-300,000 nt, -20- 100,000 nt, -20-50,000 nt or -20-10,000 nt in case of nucleic acid sequences. In other embodiments, the individual sequences comprised within the biological sequence input data comprise or consist of -20-5,000 nt, or -50-2,500 nt or -100-2,000 nt. In other embodiments, the individual sequences may have a length of -500-50,000 nt, or -1000-40,000 nt, or -5,000-30,000 nt.

[0068] In embodiments, the individual sequences comprised within the biological sequence input data are obtained by long read sequencing, such as, e.g., PacBio (SMRT) Sequencing, Oxford Nanopore Sequencing, or 10x Genomics Chromium sequencing. In other embodiments, the individual sequences comprised within the biological sequence input data are obtained by singlecell long-read Nanopore sequencing. In embodiments, for example when using sequences from Oxford nanopore sequencing, the individual sequences may have a length of up to about 5 megabases (Mb), or 4 Mb, 3 Mb, 2 Mb, or 1 Mb. In other embodiments, the sequences may exhibit a length of -100-500,000 nt, or -500-300,000 nt, or -1 ,000-200,000 nt, or -5,000-100,000 nt, or -10,000-100,000 nt. In embodiments, the individual sequences comprised within the biological sequence input data were obtained by short-read sequencing, such as, e.g., Illumina sequencing. In such embodiments, the individual sequences may have a length of ~20-500 nt or ~20-400 nt, or ~20- 300 nt, or ~20-200 nt.

[0069] In other embodiments, the individual sequences of the biological sequence input data have a length of ~2-200 aa, ~5-100 aa, or ~5-50 aa, in case of amino acid sequences.

[0070] In embodiments, the at least one associated information of the portion comprises a sequence identifier, preferably an identifier of a nucleic acid sequencing read or an amino acid sequence. In embodiments, the at least one associated information of said portion comprises at least one associated information of the respective individual biological sequence (si) from which the portion is derived. Hence, in embodiments, the at least one associated information (e.g., denoted as I = {ilti2, . . . , in}) of the respective individual biological sequence si) may be an identifier of the nucleic acid sequencing read or amino acid sequence corresponding to said biological sequence (Si).

[0071] In embodiments, step b. decomposing the biological input data into non-overlapping portions, comprises decomposing the biological input data into data strings of length k (k-mers). In embodiments, a data string of length k (k-mer) refers to the number of consecutive elements, such as nucleic acids or amino acids, in a respective sequence. For example, the nucleic acid sequence “ACGAT” would be a k-mer of length k, wherein k is 5.

[0072] In embodiments, a k-mer preferably has a length between 5-30 nucleotides / nucleic acts (nt) or amino acids (aa). In embodiments, a k-mer has a length between 5-50, between 10-40, between 15-35 or between 20-35. In embodiments, a k-mer has a length of 5, 6, 7, 8, 9, 10, 11 , 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 , 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 , 32, 33, 34, 35, 36, 37, 38, 39, 40, 41 , 42, 43, 44, 45, 46, 47, 48, 49, 50 nt or aa, or of 20, 24, 30. In some specific embodiments, a k-mer has a length of ~24 or of between 20-30 nt.

[0073] In embodiments, the biological sequence input data is decomposed into non-overlapping portions. In embodiments, the biological sequence input data comprises a plurality (or multitude) of single sequences, such as sequencing reads or similar. In embodiments, decomposing the biological sequence input data comprises decomposing a plurality of single sequences comprised in the biological input data into non-overlapping portions (preferably non-overlapping consecutive sequence portions), preferably into data strings of length k (k-mers).

[0074] In embodiments, at least a fraction, and preferably the majority (more than half), of the biological sequence input data is decomposed into non-overlapping portions. In embodiments, the sequence of the majority of portions (k-mers) of a respective single sequence comprised in the biological sequence input data does not overlap with its adjacent (neighboring; up- and downstream) portions of said respective input sequence (e.g., sequencing read). In other words, in embodiments, at least a fraction, preferably the majority, of the (sequence) portions (k-mers) comprised within a respective single sequence do not overlap with their adjacent (neighboring; up- and downstream) (sequence) portions. In simple terms, in embodiments, single sequences comprised within the biological input data may be decomposed (“split”) into consecutive, nonoverlapping (sequence) portions.

[0075] However, in embodiments, at least a fraction of each sequence portion (k-mer) derived from / comprised within the biological sequence input data has the same length k. In embodiments, respective single (individual) sequences comprised within the biological input data may be split (decomposed) into non-overlapping portions (k-mers) of equal length, wherein only the last, the first or only one of the portions comprises a sequence overlap with one or both of its adjacent (neighboring) sequence portions, such that single sequences of varying length may be decomposed into portions (k-mers) of equal length. In preferred embodiments, essentially all portions exhibit non-overlapping sequences, potentially with the exception of the last and / or the first portion of a sequence.

[0076] The approach of decomposing input data into non-overlapping portions has the advantage that the present method and the constructed two-tiered index require significantly less data storage, processing time and computing capacity, as no alignment of a large plurality of overlapping sequence portions (k-mers) has to be processed, when processing a query and / or accessing the indexed data.

[0077] Another reduction of computing capacity and processing time may be achieved by constructing a two-tiered index comprising only portions (k-mers) of equal length. In embodiments, where input sequences of varying length are provided, it may be necessary to allow only few or one or more, preferably max. 3, adjacent (sequence) portions of a respective single input sequence to overlap (with one or both neighboring (up- and / or downstream) portions). Hence, in embodiments, the biological input sequence data is decomposed into non-overlapping portions in b., wherein one or more (or a fraction or each) single sequence comprised within the biological input data is allowed to comprise / to be decomposed into one or more (preferably at max. 3) portions overlapping with its neighboring portions. In embodiments, one or more (or a fraction or each) single sequence comprised within the biological sequence input data is decomposed in b. into a multitude / plurality of non-overlapping portions and 1 , 2 or 3 overlapping portions, wherein each portion has the same length k. In this context, the term “single sequence” refers to an “individual sequence” comprised within an input dataset, such that an input dataset may comprise a plurality of “single” or “individual” sequences.

[0078] In view of prior art methods, a further advantage of the present method of extracting portions (k- mers) from input sequences (5) and storing the same in a two-tiered index is that no packed data structures are used, as the embodiments described herein aim to optimize for indexing and query speed, not for memory overhead. Moreover, the generation of (majorly) non-overlapping portions (k-mers) reduces computational overhead and storage requirements.

[0079] Another advantage of the present invention over the prior art, is that extracted portions (k-mers) are directly spatially indexed (e.g. with regard to their cellular and / or spatial origin), unlike an indexing approach on a sample-level, as is performed in most existing methods.

[0080] In embodiments, the present method enables ultrarapid indexing of biochemical sequence data, such as spatial or single-cell sequencing or amino acid sequence data.

[0081] In embodiments, indexing each portion in c. comprises the construction of at least one index lookup table assigning each portion to the respective (two-tiered) index generated in c.

[0082] In embodiments indexing each portion in c. preferably comprises adding or assigning a respective k-mer in the first array (indices array) of the two-tiered index via / through the second array (pointer array) to at least one entry in the third array ((meta)data array). In embodiments, indexing each portion (k-mer) in c. comprises linking each portion comprised within the first array via the second array (pointer array) to the third array ((meta)data array, e.g., comprising (spatial and / or cellular origin) data), based on the information comprised within the input data, thereby preferably mapping each portion (k-mer) to its origin, e.g., spatial location(s) and / or single cell, preferably also storing start and end positions within the third array ((mata)data array).

[0083] In embodiments, a plurality of each individual sequence (5) comprised within the biological sequence input data (e.g., nucleic acid sequencing read) is decomposed (split / cut up) into nonoverlapping portions (k-mers), preferably wherein each portion (k-mer) is encoded upon indexing as a 64-bit integer, e.g., using 2-bit representation.

[0084] In embodiments, each k-mer is encoded as a 64-bit integer using 2-bit nucleotide encoding / representation (e.g., A=00, C=01 , G=10, T=11).

[0085] In embodiments, the cellular origin of a sequence is tracked through a composite identifier that encodes both the cell barcode and dataset source.

[0086] In embodiments, lower bits are allocated for dataset identification. In embodiments, upper bits are allocated for cell-specific barcodes. As non-limiting example, in embodiments 8 bits may be used for dataset ID and / or 24 bits may be used for cell barcode allowing indexing up to 256 datasets with ~16 million cells each. In embodiments, such hierarchical encoding enables queries to return both the specific cell and its source dataset.

[0087] In preferred embodiments, the present two-tiered index is constructed from a collection of n input sequences S = s2, . . . , sn] comprised within the biological input data, wherein in case of nucleic acid sequences each stis a string over the alphabet X = {A, C, G, T} for DNA sequences, and for RNA sequences over the alphabet X = {A, C, G, U}, wherein A = adenine, T = thymine T, C = cytosine, G = guanine and U = uracil.

[0088] In case of amino acid sequences each stis a string over the alphabet s =

[0089] {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, 7}, wherein Alanine = A, Arginine = R, Asparagine = N, Aspartic acid = D, Cysteine = C, Glutamine = Q, Glutamic acid = E, Glycine = G, Histidine = H, Isoleucine = I, Leucine = L, Lysine = K, Methionine = M, Phenylalanine = F, Proline = P, Serine = S, Threonine = T, Tryptophan = W, Tyrosine = Y, and Valine = V. In embodiments, also further amino acid variants may be comprised within the biological input data and therefore also in 5Zof the index.

[0090] In embodiments, where stcomes from short-read sequencing (e.g., Illumina sequencing), the typical length of each string will be ~32-200 nucleotides (nt). In embodiments, where stcomes from long-read sequencing (e.g., OxfordNanopore sequencing), the typical length of each string will be ~200-2500 nt. In preferred embodiments, each sequence stis associated with additional information, such as, e.g., metadata (e.g., information regarding the origin of said biological input data), and / or sequence identifiers (e.g., sequence (read) identifiers and / or barcodes). In embodiments this associated information (e.g., metadata and / or identifiers) is denoted as / = {ilti2,..., in}- In embodiments, (it is assumed that) this information is encoded as an integer value or a tuple of values (e.g., in the case of 2D coordinates of spatial nucleic acid sequencing data).

[0091] In embodiments, at least a fraction, or preferably essentially each individual sequence (5) comprised within the provided biological input data is decomposed (e.g., in step b. of the present method) into portions (portions / strings / sequences of length k (k-mers)), preferably into nonoverlapping portions of equal length, wherein an exception (an overlap) may be allowed for at least one k-mer, e.g., the last k-mer of a respective sequence. In other words, in embodiments, e.g., for a given integer k > 8, k-mers from each individual sequence (5) are generated (the sequence (5) is decomposed into k-mers I portions), preferably using a non-overlapping approach, with a special case for, e.g., at least one k-mer, such as the last k-mer.

[0092] In embodiments, given a string s of length |s| = I over an alphabet , the set of k-mers K(s) is defined as:

[0093] / (s) = {s(i, i + / <) | i = jk; 0 < j < [ / / / cj] U {s(l — k, l) | I mod k =# 0} where s(i ’) denotes the substring of s starting at index i (inclusive) and optionally ending at index j (exclusive). In embodiments, this formulation generates non-overlapping k-mers for most of a respective sequence 5, with a potential overlapping k-mer, e.g., at the end, to ensure complete coverage of sequence 5 by k-mers of equal length. In embodiments, the total number of k-mers generated for a sequence is \l / k], In embodiments this approximation enables perfect retrieval from a sequence q queried against the final index, if q G S.

[0094] In embodiments, each k-mer is then encoded as a 64-bit integer using a function f. In an exemplary embodiment of DNA sequences, naive 2-bit encoding uses A=00, C=01 , G=10, T=11 .

[0095] In embodiments, in step c. each portion of the biological input data is indexed with at least one associated information of said portion. Preferably said associated information comprises information with regard to the corresponding individual sequence 5 comprised within the biological input data from which said portion is derived (e.g., in b.), such as a read- or other sequenc(ing) identifier.

[0096] In embodiments, the metadata comprised within or provided with the biological input data comprises information regarding the origin of said biological input data. In embodiments, the metadata belonging to an individual biological sequence (.s-,) may be denoted as I =

[0097] In embodiments, e.g., in step c. for each sequence stand its associated information i (identifier(s) and / or metadata), a set of encoded portions (k-mers) paired with the identifier is created: E(sitii) = {f(x), it | x is a kmer in s . In embodiments the complete set of encoded k- mers with their associated identifiers is then formed as the union of these sets, E = u"=1E(siti ). In embodiments the set E is afterwards lexicographically sorted, based on the encoded k-mer values, resulting in a sorted list L.

[0098] In embodiments, the k-mers are lexicographically sorted and deduplicated, preferably keeping track of all cells where each appears.

[0099] In embodiments, dataset-specific indices are merged hierarchically, such that smaller indices combine into progressively larger ones, e.g., using k-way merge algorithms that require minimal memory. An advantage of this approach is the ability of processing larger datasets which size exceed available RAM, while maintaining near-linear scaling with data size.

[0100] In embodiments, indices exceeding available memory, may be processed using a hierarchical page-based structure, preferably wherein, during construction, sequences are processed in chunks requiring only 0(C) memory per chunk, then merged using k-way merge with O(kB) memory where B is the buffer size. In embodiments, for the final index, when the number of unique k-mers N exceeds RAM capacity, the sequence array may be organized into pages of P k- mers (e.g., default 1024). In such embodiments, the pages form a tree structure with L = [log_P N] levels, where each non-leaf node stores the maximum k-mer value of its child pages.

[0101] Advantageously, this approach bounds memory usage to O(P log_P N) integers - for example, such that indexing 10A12 unique k-mers requires only ~40 MB of memory with P=1024 (see, e.g., Example 3).

[0102] In embodiments, the final merged index is stored in a compressed binary format optimized for disk-based retrieval.

[0103] In embodiments, arrays are partitioned into fixed blocks (512 entries) that are independently compressed, e.g., using delta-encoding, optionally followed by a compression step, such as blosc compression, thereby preferably achieving a reduction, e.g., 10-20x reduction, compared to raw FASTQ files.

[0104] In embodiments, the hierarchical organization enables efficient searches. In a non-limiting example, locating a k-mer requires 0(log_P N) page accesses, typically 3-4 disk reads for billionscale indices, wherein a small LRU cache maintains frequently accessed pages in memory.

[0105] In embodiments indexing the obtained portions (k-mers) comprises the sorting of the (unique) portion (m-mer) list L of encoded portions (k-mers) paired with their associated information (z), such as spatial or single cell identifiers. In embodiments the list L is subsequently transformed into three arrays: an indices array, a pointer arrays from indices to data, and a data array.

[0106] In embodiments, the arrays are preferably stored in a page-aligned, hierarchical data structure, preferably a custom page-aligned, hierarchical data structure. In embodiments, the arrays are stored in an HDF5 file structure or in any page-aligned array system with a dedicated page cache. This method has advantages over prior art approaches that this compressed row storagelike approach enables both efficient on-disk storage and efficient retrieval, particularly in view of computation capacities required for data retrieval and query.

[0107] In embodiments, the two-tiered index is constructed in c. to comprise three arrays, namely a first array (indices array) comprising the portions (k-mers) extracted / generated in b., a second array (pointer array) linking / connecting the first with the third array, and a third array ((meta)data array) comprising the at least one associated information of each portion (k-mer). The second array, which may also termed pointer(s) array comprises “pointers” that link / connect the portions (k- mers) stored in the first array with their respective associated information of each portion (k-mer), e.g., metadata or sequence identifies etc., that is stored in the third array.

[0108] In embodiments, the data may then be stored using chunked storage by dividing the index into manageable chunks for large datasets und optionally merging the chunks into a single final chunk, if possible.

[0109] In embodiments, a two-tiered index of the biological sequence input data is generated, as described in the following.

[0110] In embodiments, indexing each portion of the biological input data with at least one associated information, e.g., sequence identifier(s) and / or metadata (ij) of (associated with) said portion (k- mer) comprises creating a set of encoded portions (k-mers) paired with the identifier, preferably defined by E(sitii) = {f(x), i | x is a kmer in s , preferably wherein a complete set of encoded portions (k-mers) with their associated at least one associated information, e.g., sequence identifier(s) and / or metadata (ij) is then formed as the union of these sets, preferably defined as E = u"=1E(siti ), wherein E is optionally sorted (e.g., lexicographically) afterwards resulting in a list L.

[0111] In embodiments, during the sorting process, duplicate entries for identical portions (k-mers) are preferably removed while all associated identifiers are preserved.

[0112] In embodiments, the at least one information i associated with a sequence st(“associated information”) and its corresponding portions (k-mers) comprises at least one sequence identifier and / or metadata (ij) of said sequence si twhich was preferably comprised within the biological input data.

[0113] In embodiments, the two-tiered index is constructed or formatted to comprise one or more data arrays. In embodiments, the two-tiered index, preferably the sorted two-tiered index (L), is transformed to comprise more than one, preferably three data arrays.

[0114] In embodiments, the arrays are stored in an HDF5 file structure, or in any custom page-aligned array system with a dedicated page cache, allowing for efficient disk storage and retrieval of information relating to the biological input data.

[0115] In embodiments, the two-tiered index is constructed to comprise three data arrays. In embodiments, the two-tiered index is constructed to comprise three data arrays, wherein the first data array (indices array) comprises a plurality or the sum of portions (k-mers) obtained in step b. (obtained by decomposing individual sequences 5 comprised within the input data), preferably a plurality or the sum of unique encoded portions (k-mers), and the second data array (pointers array) comprises the start and end indices in the third array (data array) for the identifiers associated with each portion (k-mer) of the first array (indices array), and the third array ((meta)data array) comprises a concatenation of associated information of each portion (k-mer), preferably all unsigned integer identifier sets for the portions (k-mers) comprised within the first array (indices array).

[0116] In embodiments, a first data array (indices array) comprises the sum of unique (encoded) portions (k-mers) obtained by decomposing individual sequences 5 comprised within the input data. In embodiments a second data array (pointers array) comprises the start and end indices in the third array (data array) for the identifiers associated with each portion (k-mer) of the first array (indices array). In embodiments a third array (metadata array) comprises a concatenation of associated information of each portion (k-mer).

[0117] In embodiments, the indices array contains the unique encoded portions (k-mers). In embodiments the pointer(s) array stores the start and end indices in the data or metadata array for the identifiers associated with each portion (k-mer). In embodiments, the data array is a concatenation of all, e.g., 32-bit, unsigned integer identifier sets for the portions (k-mers). In specific embodiments this storage supports a (preferred) maximum of ~4 billion spatial units or cells per indexed experiment.

[0118] In embodiments, in particular an efficient disk (storage device I hard drive) storage of the two- tiered index is enabled by transformation of L into preferably three arrays: indices, pointer(s) and (meta)data arrays, similarly to the principle of compressed row storage matrices. Therefore, one advantage of the present method is the extension and improvement of the concept of k-mer indexing by integrating spatial and / or single-cell annotation directly into an index structure. In the present index structure, a compressed annotation matrix associates k-mers with spatial locations or cell barcodes, and in an optimized fashion for spatial data.

[0119] In embodiments, the present method enables directly querying an indexed dataset on the basis of k-mers and retrieves associated spatial information without the need to traverse a graph structure. The hash-based or tree-based approaches of embodiments of the method enables rapid lookups, which themselves facilitate the herein described “pseudoquantification”. The pseudoquantification technique according to the invention introduces a concept not used in existing k-mer indexing methods. While it leverages a sliding window approach reminiscent of k- mer counting techniques, it incorporates a flexible presence threshold and a background model.

[0120] In embodiments, the present method enables real-time analysis, for example, when querying sequence(s), the present method comprises tiling the query sequence into k-mers and, for example, counts the exact matches per sequenced cell. An advantage regarding speed and storage capacity id that in such embodiments raw match totals, also termed pseudocounts, emerge directly from k-mer matches without ‘classical’ sequence alignment.

[0121] For example, the inventors observed in a non-limiting example testing an embodiments of the present invention, that search latency of the present method depended only on disk access, e.g., in that case 70 ms for a single k-mer, 0.9 s for a 1 kb transcript, and 45 minutes for the entire Ensembl catalog on a single CPU core. Other methods performed worse, e.g., MetaGraph or BLAST would require orders of magnitude more time, among other reasons, because of requiring loading the entire index into memory before each query. Meanwhile, traditional sequence analysis pipelines would need complete read realignment for every new search query.

[0122] In embodiments, the present method comprises the querying of the generated two-tiered index, based on a user query (comprising biological sequence information) and without the need for sequence alignment, thereby enabling a fast interrogation of the information stored within the two-tiered index and an (approximate) quantification (detection and / or analysis) of a query sequence’ presence in the biological input data.

[0123] In embodiments, the pseudoquantification step refers to or comprises the querying of the generated two-tiered index, based on a user query and advantageously without the need for sequence alignment. However, at the same time, pseudoquantification enables a fast approximate quantification of sequence presence (in the input sequence data).

[0124] During embodiments of said pseudoquantification process, the approach exploits the two-tiered index for “indirectly” querying and quantification of sequence abundancies in the biological input data.

[0125] In embodiments, during pseudoquantification for each query sequence, a set of sliding sequences is generated from the query. In embodiments, each sliding sequence starts at a different position in the query, from 1 st to k-th, and extends to the end of the query, as long as the resulting sequence is at least k nucleotides long, wherein k is the k-mer size. In preferred embodiments the k-mer length is equal for all k-mers within the index, e.g., 20, 21 , 22, 23, 24, 25, 26, 27, 28, 29, 30 etc. nt long. In embodiments, a query can comprise an exact sequence, e.g., from a 24-mer splice probe to a full viral genome, wherein the output may in embodiments comprise an indication which and / or how many cells of the spatial or single cell sequencing contain said input sequence.

[0126] In embodiments, a query can comprise a genomic interval (e.g., specified by genomic coordinates), wherein the output may in embodiments comprise an indication of (live) coverage (genomic) sequence tracks, e.g., colored by lineage or tissue.

[0127] In embodiments, a query can comprise natural language entries, such as, e.g., "cells expressing high levels of mitochondrial genes in tumor samples", wherein an integrated language model translates the entry into structured search request(s) and the method provides a respective output, e.g., a list of respective gene-identifiers, and / or read counts, and / or cell types and / or cell numbers / counts and / or an output (list, graph and / or picture) indicating cell locations within a tissue in case of spatial sequencing.

[0128] In embodiments, the query may relate to certain splice variants, genotypes and / or isoforms of a gene, circular RNAs (circRNA), miRNAs, or even viral sequences (e.g., of viruses integrated into a host genome or present in a sample), as well as sequences of potential contaminants, e.g., of cell cultures, such as Mycoplasma.

[0129] In embodiments, the query may relate to or comprise a request or inquiry regarding predicting cell-cell similarities based on the biological sequence input data stored (decomposed) within the two-tiered index. In embodiments, processing of such queries may comprise collapsing sequence k-mers into in silico ‘buckets’ of sequence similarity to reduce cell-by-feature matrices in size enabling tractable dimensionality reduction.

[0130] In embodiments, a query may interrogate the two-tiered index along different dimensions, e.g., an inquiry comprising / regarding a specific feature, such as a raw sequence, a specific gene, protein, and / or mRNA, and / or a specific pathway, an inquiry comprising / regarding a specific biological and / or technical context, such as a specific tissue, disease or disease state, and / or protocol, an inquiry regarding the representation of a biological sequence within the input data, e.g., baselevel coverage and / or its abundance therein, and / or an inquiry comprising / regarding the presence or abundancy of an a biological sequence or gene, protein or mRNA in individual cells, whole libraries, and / or aggregated groups.

[0131] In embodiments, the step of pseudoquantification comprises decomposing (e.g., using a sliding window) a query sequence into a set of ‘sliding sequences’ (or ‘sliding window’ or ‘sequence window1) wherein each sliding sequence is decomposed afterwards into consecutive, nonoverlapping data strings of length k (k-mers), and said k-mers are then compared with (looked up within) the two-tiered index, wherein the comparison of each k-merwith the index (the interrogation of the index) enables the identification (assigning) of respective metadata, e.g., the cellular (single cell) and / or spatial origin within a sample, of each k-mer and thereby the identification (assigning) of respective metadata to each query sequence based on the two-tiered index. In embodiments, for each sliding sequence (sequence window) in the query, a match score is computed for each window, preferably by dividing the number of matching k-mers in the window by the total number of possible k-mers in the window, e.g., wherein the total number of possible k-mers is the window size minus k-mer size plus one. Preferably, a match between a sequence and a certain index entry is considered significant, if a matching score is (greater or equal) > threshold T (preferably 0.65). Thereby, this ‘sliding-window look-up approach’ enables fast approximation of sequence presence without full alignment, preferably tolerating mismatches and gaps.

[0132] In embodiments, the present method identifies query sequences by decomposing them into k- mers and searching the index for matches. In embodiments, queries longer than k are processed using a sliding window approach to ensure all possible k-mer alignments are considered. In a non-limiting example, given a query sequence q and window size w (e.g., default 48), all windows of length w are extracted and the fraction of matching k-mers in each window is evaluated.

[0133] In embodiments, the match score for a window W and cell ID is defined as the ratio of matching k-mers to total possible k-mers: score (W, ID) = |M(W, ID)| / (w-k+1), where M(W, ID) represents k- mers in W that appear in cell ID.

[0134] In embodiments, false positives from repetitive sequences may be reduced by , excluding k-mers appearing in many genes across the reference genome from scoring.

[0135] In embodiments, users may mask low-complexity regions using standard tools like dustmasker.

[0136] In embodiments, windows with scores exceeding threshold T (e.g., set to default 0.65) may be considered significant matches.

[0137] A preferred advantage of the present method is that it is able to tolerate sequencing errors and mutations better than prior art approaches, as a single nucleotide change would only affect one k- mer, while the detection (pseudoquantification) is driven by the remaining k-mers.

[0138] In embodiments, the (complete) query process aggregates matches between query and index across all sliding windows, preferably then returning results, e.g., cells or genes, ranked by the number of significant windows.

[0139] In a non-limiting example of a use case (e.g., w=48, k=24, T=0.65), the method was able to reliably detect sequences with up to ~15% divergence while maintaining low false positive rates. In embodiments query optimization may include batching k-mer lookups to minimize disk access and / or exploiting the sorted index structure for cache efficiency (see, Example 3).

[0140] An advantage the pseudoquantification is the capture of all possible k-mer offsets, resulting in improved sensitivity.

[0141] In embodiments, during pseudoquantification for each sliding sequence, k-mers are generated and looked-up (searched) in the two-tiered index, preferably using either in-memory hash table (e.g., for smaller datasets) or hierarchical search on the indices array (e.g., for very large datasets). This k-mer matching step enables a flexible lookup strategy balancing speed and memory usage.

[0142] In embodiments, subsequently for each sliding window in the query, a match score is computed for each window. In embodiments this score is calculated by dividing the number of matching k- mers in the window by the total number of possible k-mers in the window. In embodiments, the total number of possible k-mers is the window size minus k-mer size plus one.

[0143] In embodiments, a match is considered significant, if a matching score is (greater or equal) > threshold T (preferably 0.65). In embodiments the threshold T of the match score is 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, preferably 0.65. This sliding-window look-up approach during the pseudoquantification enables fast approximation of sequence presence without full alignment, preferably tolerating mismatches and gaps.

[0144] In embodiments, the final step of pseudoquantification comprises aggregating significant matches across all windows. Preferably an output is generated, comprising a list of spatial locations and / or single cell identifiers and / or numbers (counts) with their respective match counts.

[0145] In some embodiments a background model is used (incorporated), which is based on k-mer frequencies in reference transcriptomes to distinguish between informative and ubiquitous k- mers.

[0146] In embodiments, to account for the repetitiveness of some genomic sequences, the number of false positives may be reduced and the overall quality of pseudoquantification may be further improved by implementing a reference-specific ‘background’ model based on k-mer frequencies. In embodiments, alternatively or additionally, low-complexity k-mers masked by standard methods such as dustmasked may be excluded from the pseudoquantification.

[0147] In embodiments, a set U = {u1,u2,...,un} of genomic sequences, e.g., from a reference database such as GENCODE or RefSeq, is defined. In embodiments, afterwards for each sequence ui tthe set of k-mers K ui~) is generated using the same k-mer generation process as in the main index construction. In embodiments, two-step counting mechanism is employed, when multiple sequences are associated with a same feature, e.g., multiple alternative transcripts under the same gene ID g, the occurrence of each unique k-mer within its sequences is counted, creating a set Cg= {( / c, 1) | k G K(ug~)}. In embodiments, these counts are aggregated across all genes, forming a global count set C = {( / c, |£g | k G Cfl}|) | k G K(ug).

[0148] In embodiments, without loss of generality, in the case of transcript or UTR sequences for the same gene, the use of a background model gives equal weight to each preventing bias towards genes with longer or multiple isoforms. In embodiments, the resulting background model is stored as a map where keys are the encoded k-mers and values are the counts of genes in which each k-mer appears. In embodiments, these counts are then normalized, either by dividing by the total number of genes or by employing a rank-based normalization, to facilitate interpretation and threshold setting. In embodiments, during the inventive querying process, each k-mer k from the query sequence is checked against this background model. In embodiments, let f(k) be the normalized frequency of k in the background model. In embodiments, a complexity threshold 9 is defined. In embodiments, k-mers where ( / c) > 9 are either excluded from the matching process or given reduced weight in the scoring algorithm, determined by a function w( ( / c)) that maps the background frequency to a weight.

[0149] In embodiments, it may be preferred to improve memory efficiency for handling large datasets, by employing a chunked storage strategy. In embodiments, the k-mers from consecutive subsets of input sequences are collected, sorted based on their encoded values and added to a persistent storage structure. In embodiments this enables generating various chunks (of indices, pointers, data), which can be merged into a single chunk. In embodiments, this chunked storage approach can be implemented using different data structures, such as HDF5 files or custom page-aligned array systems with a dedicated page cache. In embodiments, such storage structures contain single or multiple datasets storing the encoded k-mers, pointer arrays for identifier locations, and / or arrays of associated identifiers. In embodiments, also further metadata comprising the k- mer size and number of chunks, and / or page size may be used. In embodiments, the present method allows efficient identification of occurrences of query sequences within the (two-tiered) indexed input dataset.

[0150] In embodiments, step d. comprises receiving at least one query from a user, comprising biological sequence information. In embodiments, step d. comprises receiving at least one query from a user, comprising language-based prompts. In embodiments, step d. comprises receiving at least one query from a user, comprising queries regarding the metadata comprised within or provided with the biological input data.

[0151] In embodiments, a set of query sequences may be defined as Q = {qltq2, ..., qn}. In embodiments, for each query sequence q of length |q| and a window size w, a set of sliding sequences S(q) = {q i, / ) | 0 < i < k, \q(i, / ) > k}, is generated, where k is the k-mer size used in the index construction. Thereby, all possible k-mer offsets within the query sequence are captured.

[0152] In embodiments, for each sliding sequence s G S(Q), the set of k-mers K(s) is generated, preferably as defined herein, e.g. according to the aforementioned k-mer generation process. In embodiments the associated identifiers are retrieved using the pointer and / or data arrays, preferably by finding the corresponding location of a query k-mer via either a hash table built upon loading the index to memory, or a hierarchical tree search in the indices array. In embodiments, the array has the same dimensionality as the k-mer index array. In embodiments, the array contains the offsets of 1 :1 mapping k-mers to the data array, which is stored consecutively for only the locations represented in a k-mer.

[0153] In embodiments, a sliding window analysis is employed to evaluate the significance of matches.

[0154] In embodiments, for each window W of size w in q, defined as W = q(i, i + w) for 0 < i < |q| - w], a match score is computed for every sequence identifier found in the previous step.

[0155] In embodiments, the match score for an identifier ID in window l / l / is defined as: where M(W, ID) is the set of k-mers in W that match k-mers associated with the identifier ID in the generated index, and (w - k + 1) is the total number of k-mers in the window.

[0156] In embodiments, a match is considered significant, if its score equals or exceeds a predefined threshold T (e.g., in embodiments a threshold may be 0.65, or any threshold-value disclosed herein above).

[0157] In embodiments, the querying process aggregates significant matches across all windows, maintaining a count of the number of windows in which each identifier appeared as a significant match.

[0158] Embodiments of the present approach advantageously allow for efficient identification of sequences similar to the query, tolerating some mismatches or gaps.

[0159] In embodiments, the biological sequence input data and the query in d. comprise nucleic acid and / or amino acid sequence information, and the query is processed by (i) decomposing the query sequences into sequence portions of length k (k-mers), preferably using a sliding-window approach, (ii) matching each k-mer to the two-tiered index(ed portions) of biological input data, (iii) optionally comparing the k-mers to a database comprising the sequences of ubiquitous k- mers.

[0160] In embodiments, the method further comprises employing a large language model (LLM) component to query / interrogate the two-tiered index of the biological input data based on the language-based prompts provided in d. The use of such large language model (LLM) advantageously enables the enhancement of query processing.

[0161] In embodiments, Large Language Models (LLMs) may be used to enhance and optimize query processing and enable autonomous data analysis. In embodiments the LLM acts as an interface, connecting element or translational layer between queries and the created two-tiered index structure comprising / storing the data of the analyzed datasets.

[0162] In embodiments, the language model may be augmented with a Retrieval-Augmented Generation (RAG) database or framework.

[0163] In embodiments, a RAG (Retrieval-Augmented Generation) framework using, e.g., the LangChain or Llamalndex frameworks, is used where the LLM acts as an interface between user queries and the sequence database.

[0164] In embodiments, such an approach is compatible with open-source models such as Llama 3, BLOOM, Mistral, or other transformer-based models that can be fine-tuned on biological literature and database schemas.

[0165] In some embodiments natural language queries may be processed with the support of an integrated large language model (LLM), such as Llama 3.1 via Ollama, as a translational layer between user prompts and analyzed datasets, e.g., gene datasets. In some embodiments, the language model may be augmented with a Retrieval-Augmented Generation (RAG) database, e.g., built from MSigDB, which preferably includes curated gene sets from one or more online pathway database, such as Reactome, GO, and / or other pathway databases.

[0166] In a non-limiting example of such embodiments, a user’s query regarding "inflammatory response genes in macrophages" is processed by the present method comprising retrieving relevant gene sets from an online database, such as MSigDB, whereby the language model may be used to refine the selection based on the specific query context. Preferably, the obtained gene sets are subsequently converted to sequences and searched against the two-tiered index according to the invention like any other query. An advantage of such embodiments would be that it provides a convenient interface for exploring biological concepts without requiring users to manually compile gene lists.

[0167] In embodiments, when a natural language query is provided by a user, the LLM processes these queries by consulting a vector store (e.g., using FAISS or ChromaDB) containing one or more of indexed biological literature, gene nomenclature, and database schemas to generate appropriate sequence-based searches. As a non-limiting example, when a user asks "find cell types expressing genes with RBFOX2 binding motifs in cortical regions," the LLM identifies the RBFOX2 consensus motif, retrieves relevant genes or, more generally, sequence contexts with this motif, and formulates structured queries combining sequence and spatial information in the two-tiered index generated according to the invention. In embodiments, a second component is used that enables autonomous analysis of sequence data in its biological context. In embodiments, after retrieving results from the two-tiered index, the LLM analyzes patterns using a chain-of-thought prompting strategy implemented in, e.g., LangChain. In embodiments, the LLM thereby considers multiple data modalities selected from sequence information, spatial coordinates, cell types, and / or expression levels. In embodiments of complex analyses, the system employs an agent architecture (e.g., using LangChain's Agent framework) capable of decomposing queries into subtasks, and executing them sequentially, and synthesize results. In embodiments the LLM generates natural language descriptions of identified patterns, which are stored in a vector database (FAISS) and indexed alongside the original data. In embodiments a feedback loop is created thereby such that the LLM's analyses become searchable content. In embodiments this enables sophisticated queries that combine sequencelevel and higher-level biological insights. For example, such sophisticated queries combining sequence-level and higher-level biological insights may be "identify cell types showing coexpression of genes with similar splice site strength near inflammatory regions."

[0168] In embodiments, the present method thereby combines two types of information retrieval when processing a user query. Namely, in such embodiments it performs at first exact sequence searches using an efficient k-mer index, which provides fast access to biological sequence data and its associated metadata (spatial coordinates, cell types, expression levels etc.). Second, it retrieves in embodiments relevant biological context from a knowledge base containing information about genes, pathways, and cellular relationships. Preferably, this knowledge base is then implemented using a vector database that allows quick searches for related biological concepts.

[0169] In embodiments, the present method comprises a. providing biological sequence input data, preferably originating from a sample comprising at least one cell and / or the contents of at least one cell and / or preferably comprising genomic, transcriptomic or proteomic data, obtained from a single-cell sequencing analysis, spatial sequencing analysis, or mass spectrometry analysis, b. decomposing the sequence input data into non-overlapping portions, and c. indexing each portion of the sequence data with at least one associated information of said portion comprised within or provided with the biological sequence input data, thereby creating a two-tiered index for each portion, preferably wherein the associated information comprises metadata, comprising information on the spatial position and / or cell from which said biological sequence data was obtained, and / or sequence identifier information, wherein the sequence input data in b. is decomposed into a set of consecutive, nonoverlapping data strings of length k (k-mers), and wherein the two-tiered index is constructed in c. to comprise three arrays, namely a first array (indices array) comprising the k-mers generated in b., a second array (pointer array) linking the first with the third array, and a third array (data array) comprising the at least one associated information of each k-mer, wherein the method further comprises d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding metadata comprised within or provided with the biological sequence input data, wherein, if the query comprises biological sequence information, said sequence information is decomposed into a set of consecutive, non-overlapping k- mers, that are preferably to be queried I compared against (used to interrogate) the two tiered-index.

[0170] In embodiments, the present method comprises a. providing biological sequence input data, comprising spatial and / or single cell nucleic acid sequencing data, b. decomposing the sequence input data into non-overlapping portions, preferably comprising decomposing the biological input data into consecutive, non-overlapping data strings of length k (k-mers), preferably wherein only the last data string of length k (k-mer), e.g., covering the C- terminus of a respective decomposed input sequence, is allowed to overlap with the second last / the adjacent / the preceding k-mer, if the entire input sequence comprises a sum of elements that cannot be divided by k, c. indexing each portion of the sequence data with at least one associated information of said portion comprised within or provided with the biological sequence input data, thereby creating a two-tiered index for each portion, preferably wherein the at least one information ( / )) associated with a sequence (si) (“associated information”) and its corresponding portions (k-mers) comprises at least one sequence identifier and / or metadata of said sequence, which was preferably comprised within the biological input data, preferably wherein for each input sequence and its associated information, a set of encoded portions (k-mers) paired with the identifier is created and lexicographically sorted resulting in a sorted list L, wherein preferably indexing the obtained portions (k-mers) comprises the sorting of the portion list (k-mer list) L of encoded portions (k-mers) paired with their associated information ( / ), wherein the list L is subsequently transformed into three arrays: an indices array, a pointer arrays from indices to data, and a data array, preferably wherein the two-tiered index is constructed, to comprise three arrays: a first array (indices array) comprising the portions (k-mers) extracted in b., a second array (pointer array) linking the information disclosed in the first array (k-mers) with the information disclosed in the third array, and a third array ((meta)data array) comprising the at least one associated information of each portion (k-mer), preferably wherein indexing each portion (k-mer) comprises adding or assigning a respective k-mer in the first array (indices array) via the second array (pointer array) to at least one entry in the third array ((meta)data array), d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding metadata comprised within or provided with the biological sequence input data, preferably wherein the processing of the query received in d. comprises interrogating the generated two-tiered index, based on the query, e. preferably employing a large language model (LLM) component to query / interrogate the two-tiered index of the biological input data based on the language-based prompts provided in d. (pseudoquantification), preferably wherein, if a biological sequence is provided during step d., for each query sequence, a set of sliding sequences is generated from the query sequence, preferably wherein each sliding sequence starts at a different position in the query, from 1 st to k-th, and extends to the end of the query, as long as the resulting sequence is at least k nucleotides long, wherein k is the k-mer size, preferably during pseudoquantification for each sliding sequence, k-mers are generated and looked-up in (searched / compared to) the two-tiered index, preferably using either inmemory hash table (e.g., for smaller datasets) or hierarchical search on the indices array (e.g., for very large datasets), preferably subsequently for each sliding window in the query, a match score is computed for each window, preferably by dividing the number of matching k-mers in the window by the total number of possible k-mers in the window, preferably the final step of pseudoquantification comprises aggregating significant matches across all windows, preferably an output is generated, comprising a list of spatial locations and / or single cell identifiers and / or numbers (counts) with their respective match counts.

[0171] As a non-limiting example, when searching for "genes involved in T cell development," embodiments of the present method may retrieve relevant sequences and understand at the same time, which cell types and developmental stages to look for.

[0172] In embodiments, the present method may ensure biological accuracy through a two-stage process. In embodiments, rather than training the language model on biological knowledge, a comprehensive knowledge base of biological relationships is maintained that is queried during the retrieval phase. In embodiments, when processing queries like "genes active in stem cells", the system may retrieve relevant biological context (stem cell markers, expected expression patterns etc.) from this knowledge base. In embodiments, the retrieved information guides a database search, for example, by first identifying appropriate stem cell markers before querying the sequence index. In embodiments, when analyzing results, the language model subsequently uses the retrieved biological context to validate and explain findings. For example, when examining gene expression patterns, it references retrieved literature about typical expression in specific cell types. In embodiments, this “RAG approach” means the language model focuses on reasoning about retrieved information rather than needing to learn biology through fine-tuning, making the system more maintainable and updateable as biological knowledge expands. In embodiments, the method further comprises e. contacting at least one data resource, preferably at least one database, comprising biological information corresponding to the respective biological input data provided in a. and the user query in d., and retrieving at least one information with regard to at least one portion of biological input data.

[0173] In embodiments, the contacting at least one data resource (e.), preferably at least one database, comprising biological information (e.g., public repositories, such as, without limitation thereto, GEO, Human Cell Atlas, GEO, SRA, ENA, ArrayExpress, CNGB and / or HCA Data Portal, etc.) may comprise obtaining data (e.g., using a crawler algorithm) from at least one data resource to expand the data stored in the two-tiered index. In embodiments the obtained data are asynchronously processed into a separate instance of the two-tiered index and preferably progressively merged into the ‘main’ two-tiered index previously constructed (based on biological sequence input data).

[0174] More specifically, the present method creates a data structure (two-tiered index) optimized for querying the input data, e.g., single-cell and spatial sequencing data. At its core, it uses in embodiments a k-mer based approach, where sequences, such as sequencing reads, are decomposed into substrings of length k that are indexed and mapped to corresponding spatial coordinates (or other (single) cell-specific identifiers), creating a compact representation of the dataset (Fig. 1A). This enables a two-tiered indexing system, akin to web search engines - first a k-mer lookup table, then a spatial (or any other label) index mapping k-mers to their locations in the tissue (or to individual cells).

[0175] When comparing the size of the two-tiered index generated according to embodiment of the present method to their corresponding raw data across different technologies and organisms the inventors surprisingly found that a two-tiered index generated according to the present invention achieves 5-1 Ox compression compared to the original data while preserving complete sequence information (Fig. 1 B). This compression is achieved without resorting to probabilistic methods, ensuring that exact sequence matches remain precise.

[0176] In embodiments, the indexing procedure is performed only once, wherein subsequent queries preferably only take advantage of the index itself and require only milliseconds, if a single sequence is queried (Fig. 1C). When comparing performance across methods in an experiment, the present method surprisingly showed a significant speed advantage, wherein, e.g., query times scaled primarily with sequence length rather than dataset size. In such an example, the dataset of the Human Cell Atlas was processed according to an embodiment of the present method, wherein a query of 1 ,000 bases across the entire Human Cell Atlas took 0.8 seconds, compared to hours that would be required by alignment-based approaches. Notably, the inventors hypothesize that these measurements likely still underestimate the true performance advantage of the present method.

[0177] When compared against established methods, including MetaGraph, BLAST, kallisto, and STAR, the present method achieved in an experiment across all analyzed datasets up to 1 ,000x faster querying, while memory usage remained constant across query sizes, indicating that computational complexity did not represent a performance bottleneck.

[0178] Embodiments of the present method enable a scalability to billions of labels, as the method’s design allows it to scale to billions of spatial coordinates or single-cell identifiers. In embodiments this may be achieved through the use of 32-bit unsigned integers for identifiers, supporting up to ~4 billion spatial units or cells per indexed experiment. In embodiments this may also be achieved through the use of chunked storage strategy, allowing efficient handling of large datasets that may not fit in memory.

[0179] Another advantage of embodiments of the present method is that mainly non-overlapping k-mers are generated, preferably with the only exception of, e.g., the last k-mer of each sequence / read (such that a sequence 5 may be covered entirely by k-mers of equal length). This approach is unlike other methods that use overlapping k-mers. Embodiments of the present method significantly reduces the number of k-mers generated, thereby improving both indexing speed and storage efficiency without compromising accuracy.

[0180] A further improvement of certain embodiments of the present method over the prior art, is the efficient on-disk storage and retrieval. The use of HDF5 file structure or custom page-aligned array systems with a dedicated page cache, in embodiments with a compressed row storage-like approach (indices, pointers, data arrays), allows for efficient disk storage and quick retrieval of k- mer information, which is particularly important for handling large spatial transcriptomics datasets.

[0181] An advantage of the novel approach of pseudoquantification according to the present invention used when querying the input data is the use of a sliding window approach with a flexible window size, allowing for tolerance of mismatches and gaps. The match score calculation used in embodiments also provides a fast approximation of sequence presence without full alignment, akin to some scores used in the information retrieval field. Lastly, the use of a threshold for determining significant matches in embodiments allows for fine-tuning of sensitivity vs. specificity.

[0182] In embodiments, where spatial sequencing or single-cell sequencing data is analyzed, the direct spatial indexing provided by constructing the two-tiered index enables, unlike many existing methods, which typically index at the sample level, the present method directly enables indexing spatial coordinates or cell identifiers, thereby enabling rapid spatial queries, which has great advantages for spatial transcriptomics analysis.

[0183] In embodiments, the present invention enables a high flexibility in query preparation by employing the sliding window approach and thereby “sliding” sequences for each query the method ensures that all possible k-mer offsets are captured, improving sensitivity without impacting processing speed.

[0184] In embodiments, the present method may be used for analyzing and / or interrogating biochemical sequence data, such as spatial or single-cell sequencing or amino acid sequence data.

[0185] Preferably such use is facilitated by decomposing and indexing the (raw / input) biochemical sequence data according to the present invention. In embodiments, the indexed biochemical sequence data may be stored in a two-tiered index, generated according to the methods described herein, thereby preferably enabling the interrogation and / or analysis of the sequence data, e.g., through user queries.

[0186] In embodiments, the present method may be used for indexing biochemical sequence data, such as spatial or single-cell sequencing or amino acid sequence data, thereby preferably enabling the interrogation and / or analysis of the sequence data, e.g., through user queries. In embodiments, the indexed biochemical sequence data may be stored in a two-tiered index, generated according to the methods described herein. In one aspect the invention relates to the use of the method according to the invention for identifying tissue and / or cell localized expression of at least one gene and / or protein of interest, comprising: analyzing biological sequence input data obtained from a sample of a subject according to steps a.-c., a.-d. or a.-e. of the method according to the present invention, wherein the biological sequence input data comprises single-cell nucleic acid sequencing data and / or spatial sequencing data, providing a user query with regard to at least one associated information of the at least one gene and / or protein of interest, preferably comprising a nucleic acid and / or amino acid sequence, a gene or protein identifier, a biochemical pathway identifier or specific disease, in which at least one gene and / or protein of interest is considered to be involved, processing the user query according to steps d. and / or e. of the present method, analyzing the two-tiered index generated in step c. of the method of the present invention, thereby identifying the cell and / or tissue specific expression of the of at least one gene and / or protein of interest, optionally providing at least one output comprising the tissue and / or cell localized expression of the at least one gene and / or protein of interest, preferably by a graphical user interface.

[0187] In another aspect, the invention relates to the use of the method according to the invention for the analysis, diagnosis, prognosis, treatment guidance and / or monitoring of a medical condition, comprising analyzing biological sequence input data obtained from a sample of a subject suspected or diagnosed with a disease, according to steps a.-c., a.-d. or a.-e. of the method according to the invention.

[0188] As can be determined from the details provided herein, the method of the invention allows characterization of complex tissues or cells, which may be obtained from patients or subjects suspected of having a medical condition, including, without limitation, histological sections, or other tissue or cell material susceptible to interrogation using nucleic acid sequencing or other biological sequencing methods. By performing the inventive method, thereby creating a searchable indexed data structure for biological sequence data, and enabling searching in said data structure for any given biological sequence data or associated information, detailed biological information can be derived in a quick and reproducible manner, that may well provide insights into the medical condition of any given individual.

[0189] In another aspect, the invention relates to a computer-implemented method for constructing a data structure, such as a database of biological input data, comprising: a. providing biological sequence input data that has been experimentally determined, b. decomposing the biological sequence input data into non-overlapping portions, c. indexing each portion with at least one associated information of the portion comprised within or provided with the biological input data, thereby creating a two- tiered index for each portion, d. optionally providing reference biological input data, e. generating a data structure of the biological sequence input data that was decomposed and indexed according to steps b. to c.

[0190] In embodiments, in the computer-implemented method according to the invention, the data structure is an indexed database of biological sequence input data, wherein the sequence input data in b. is decomposed into a set of consecutive, non-overlapping data strings of length k (k- mers), and wherein the two-tiered index is constructed in c. to comprise three arrays, namely a first array (indices array) comprising the k-mers generated in b., a second array (pointer array) linking the first with the third array, and a third array (data array) comprising the at least one associated information of each k-mer.

[0191] In embodiments, in the computer-implemented method according to the invention the biological sequence input data originates from a sample comprising at least one cell and / or the contents of at least one cell.

[0192] In embodiments, in the computer-implemented method according to the invention the biological sequence input data comprises genomic, transcriptomic or proteomic data, obtained from a single-cell sequencing analysis, spatial sequencing analysis, or mass spectrometry analysis.

[0193] In another aspect, the invention relates to a computer-readable medium comprising a data structure, such as a database, constructed according to the method disclosed herein.

[0194] In embodiments, the data structure comprises the two-tiered index obtained in step c. of the method and optionally the biological sequence input data provided in step a. of said method.

[0195] In another aspect, the invention relates to a computer-readable medium comprising a data structure, comprising an indexed database of biological sequence data, the database comprising decomposed biological sequence data, wherein said decomposed data comprises biological sequences split or partitioned into (preferably adjacent) non-overlapping sequence portions, wherein the database comprises a two-tiered index comprising the non-overlapping portions and additional associated information, such as metadata and / or sequence identifier information.

[0196] In another aspect, the invention relates to a data structure, preferably present on a computer- readable medium, comprising an indexed database of biological sequence data, the database comprising: biological sequence data decomposed into non-overlapping portions, wherein the database comprises a two-tiered index for each portion, wherein each portion is indexed with associated information comprised within or provided with the biological input data.

[0197] In another aspect, the invention relates to a data structure, preferably present on a computer- readable medium, comprising an indexed database of biological sequence data, the database comprising biological sequence data decomposed into non-overlapping data strings of length k (k-mers), wherein the database comprises a two-tiered index, wherein each k-mer is associated with additional metadata and optionally additional sequence identifier information. The data structure of the invention is therefore an independent aspect of the invention that enables the benefits described herein, in particular the fast interrogation of complex sequence data.

[0198] In another aspect, the invention relates to a system for creating a searchable indexed data structure for biological sequence data, comprising the features of the inventive method and / or data structure as described herein. In embodiments, the system may comprise a sequencing device, for example a nucleic acid sequencing device or a mass spectrometer, which are as such known in the art, and their methods are described briefly by way of example herein.

[0199] In another aspect, the invention relates to a software for creating a searchable indexed data structure for biological sequence data, comprising in essence the features of the inventive method and / or data structure as described herein. The software may be present in a computer readable medium, or computing device, which is configured to carry out the method described herein. The invention therefore also comprises physical computing entities configured for carrying out the method of the invention, configured to comprise (having stored thereon) and / or execute the software, or data structure, as described herein.

[0200] The various aspects and embodiments of the invention are defined, and unified, by the inventive features as described above. For example, the generation of non-overlapping indexed biological sequence portions, each portion (k-mer) being indexed with at least one associated information for said portion, such as metadata, thereby creating a two-tiered index for each portion, leads to a novel data structure associated with the advantages described herein. The data structure generated by the methods of the invention enable unexpected and beneficial properties that could not have been derived from existing prior art approaches. Each aspect and embodiment of the invention is considered to be disclosed in the context of each other embodiment and aspect. Each embodiment, optional or preferred feature of the invention that is disclosed or described in the context of one aspect of the invention is herewith also disclosed in the context of the other aspects of the invention described herein. All features disclosed in the context of the computer- implemented method according to the invention also relate to, and are herewith also disclosed in the context of, the uses of said method, the data structures, databases, systems, software and other computer program products, and methods for database generation, and vice versa.

[0201] DETAILED DESCRIPTION OF THE INVENTION

[0202] All cited documents of the patent and non-patent literature are hereby incorporated by reference in their entirety.

[0203] The present invention is directed to a computer-implemented method comprising providing biological sequence input data, decomposing the biological sequence input data into nonoverlapping portions, and indexing each portion of the biological sequence data with at least one associated information of said portion, comprised within or provided with the biological input data, thereby creating a two-tiered index for each portion

[0204] As used herein, the terms “biological sequence input data”, “biological sequence data”, “biological input data”, “input data” and “sequence data”, may be used interchangeably.

[0205] In embodiments, a “portion” or “fraction” of a sequence may herein be also referred to as “k-mer”. In embodiments such a k-mer may be defined as a sequence (string) of length k, wherein preferably each element or integer of said string of length k refers to a nucleotide or amino acid of the corresponding biological sequence. In other words, a k-mer may correspond to I constitute a consecutive sequence of nucleotides or amino acids of length k. In preferred embodiments a respective individual sequence 5 comprised within the biological input data may be decomposed (e.g., in step b.) into a set of consecutive k-mers of equal length, wherein each k-mer constitutes a “sub-sequence” (sequence portion or fragment) of said individual sequence (5). In this context, the terms “consecutive” or “non-overlapping” k-mers or portions preferably do not exclude that single k-mers of the sum of k-mers generated from an individual sequence 5 overlap with their neighboring k-mer(s), such that different individual sequences 5 of varying length may be decomposed into k-mers of equal length.

[0206] Herein, a “sequence” may be any biological sequence comprising biological sequence information, such as, without limitation, nucleic acid sequences, e.g., DNA and / or Resequences, or amino acid sequences. Such sequences are available as physical entities (as biological molecules) or as computer implemented representations of the biological molecules, as used in the invention herein.

[0207] Commonly a “sequencing read” or brief “read” refers to a sequence read (detection of a single, consecutive biological sequence of length x), obtained by a sequencing machine or system, such as a nucleic acid sequencer, mass spectrometer or sequencing-by-synthesis amino acid sequencer. The skilled person is familiar with suitable variants of machines and systems for the provision and determination of biological sequence reads (biological sequence information). DNA and RNA sequencing technologies can be broadly categorized into short-read and long-read sequencing platforms. Short-read sequencing methods generate high-accuracy reads of 30-300 base pairs. Long-read sequencing platforms can generate reads of several kilobases to megabases, though typically with lower throughput and accuracy compared to short-read methods. DNA / RNA sequencers examples: short-read (Illumina - preferred, BGI, Ultima Genomics), long-read (Pacbio, Oxford Nanopore).

[0208] In the context of computational sequence analysis, the term “indexing” generally describes the process of organizing and structuring sequence data in a way that enables efficient storage, searching, and retrieval operations. This involves creating data structures that store sequence information along with additional metadata and reference points, similar to how a book index allows quick location of specific content. In bioinformatics, indexing is particularly crucial for facilitating quick sequence comparisons, alignments, and searches.

[0209] Spatially resolved gene expression, spatial sequencing, or spatial transcriptomics, is typically considered a quantitative or semiquantitative readout of either whole transcriptome or targeted gene expression mapped to specific locations in a tissue section or a cell, and is an established method to understand cellular composition, activity, disease, and various other biological functions in the native tissue context, or in the sub-cellular context. Spatial transcriptomics can be achieved through next-generation sequencing-based approaches, where mRNA is mapped in tissues at the whole transcriptome level and then sequenced ex vivo. Spatial transcriptomics may be complemented through imaging-based approaches, often referred to as microscopy-based spatial methods, where mRNAs are imaged in situ, or spatial gene expression can be combined with a histological stain or immunofluorescence detection, to complement insights into tissue complexity.

[0210] Spatial transcriptomics technologies are assessed by key metrics, such as resolution of the spatial biology, gene plexity (the number of genes detected), and gene sensitivity (the linear dynamic range of accurate transcript detection). In embodiments, a local database system is used that interfaces with online genomic resources. In embodiments, an approach according to the invention primarily utilizes the Ensembl database as the source of reference sequences (Harrison et al. 2024).

[0211] In embodiments, complete transcript or 3’UTR sequences may be downloaded for a specified species (e.g., Homo sapiens or Mus musculus) from the Ensembl FTP server. In embodiments, these sequences are then indexed and stored in a local SQLite database, which allows for rapid querying and retrieval of sequence information without the need for repeated online access.

[0212] In embodiments, for handling user queries a parsing system may be used that is able to interpret various input formats. In embodiments, this may include direct DNA sequences, FASTA- formatted sequences, gene symbols, and / or Ensembl IDs. In embodiments, the parser may also recognize special commands prefixed with "gene:" or "ensembl:" to specify the type of identifier being used. Additional parameters can be included in embodiments to refine the query, such as specifying the species or selecting a particular region of the sequence using the "split" parameter. In embodiments, sequences are parsed from the local database. In embodiments, for queries that cannot be resolved locally, the system falls back to real-time API calls to the Ensembl REST server.

[0213] In embodiments, when spatial transcriptomics data is indexed, the querying of sequences and other features, as well as, and the visualization of the spatial distribution of matches across the indexed dataset is facilitated, via a web-based visualization interface.

[0214] In embodiments, this is implemented using the Flask framework in Python. In embodiments, upon initialization, the server loads the index generated according to the present invention and associated metadata. In embodiments, a subset of the dataset is precomputed and stored in memory as a background distribution of counts per spatial location, which allows to visually delineate the tissue structure. In embodiments, the interface provides a text input element that serves as a search bar that supports two types of queries: (1) standard sentence queries, where users can input a nucleotide sequence or a specific gene ID, or (2) natural language queries, which are processed using a language model to extract relevant gene sequences. In embodiments, for each query, the server uses the index generated according to the present invention to identify matching locations and intensities.

[0215] Analogously, spatial proteomics, or spatial mass spectrometry, is a method that allows for the visualization and quantitative or semi-quantitative analysis of the distribution of targeted or untargeted molecules directly in tissue sections. Advanced mass spectrometry instruments and ionization methods, i.e., matrix-assisted laser desorption ionization (MALDI) and electrospray ionization (ESI), are well established and may be employed. The speed, sensitivity, and molecular specificity of mass spectrometers enable the direct and simultaneous detection of hundreds to thousands of molecules from various spatially distributed locations from within a tissue section.

[0216] The present invention is, in preferred embodiments, characterized as a method, system, data structure or database, referring to a computer implemented invention, in which software, algorithms or other computer code is configured to carry out the method is presented.

[0217] The various aspects of the invention may in some embodiments comprise one or more conventional computing devices having a processor, an input device such as a keyboard, mouse, or microphone, and memory, such as a hard drive and volatile or nonvolatile memory, and computer code (software) for the functioning of the invention. The computers may also comprise a programmable printed circuit board, microcontroller, or other device for receiving and processing data signals such as those received from local controllers.

[0218] The various aspects of the invention may comprise one or more conventional computing devices that are pre-loaded with the required computer code or software, or it may comprise custom- designed hardware. The system may comprise multiple computing devices which perform the steps of the invention. In certain embodiments, a plurality of clients such as desktop, laptop, mobile device, or tablet computers can be connected to a server such that, for example, multiple users can employ the inventive software from various locations. The computer system and other aspects of the disclosure may also be networked with other computers over a local area network (LAN) connection or via an Internet connection. The system and other aspects of the disclosure may also comprise a backup system which retains a copy of the data obtained by the invention. The data connections may be conducted or configured via any suitable means for data transmission, such as over a local area network (LAN) connection or via an Internet connection, either wired or wireless.

[0219] A client or user computer can have its own processor, input means and memory, or it may be a dumb terminal which does not have its own independent processing capabilities, but relies on the computational resources of another computer, such as a server, to which it is connected or networked. The components of the computer system and other aspects of the disclosure may be conventional, or be custom-configured for each particular implementation. The computer system may run on any particular architecture, for example, personal / microcomputer, minicomputer, or mainframe systems. Exemplary operating systems include Apple Mac OS X and iOS, Microsoft Windows, and UNIX / Linux; SPARC, POWER and Itanium-based systems; and z / Architecture. The computer code to perform the invention may be written in any programming language or model-based development environment, such as but not limited to C / C++, C#, Objective-C, Python, Cython, Java, Basic / VisualBasic, MATLAB, Simulink, StateFlow, Lab View, or assembler. The computer code may comprise subroutines which are written in a proprietary computer language which is specific to the manufacturer of a circuit board, controller, or other computer hardware component used in conjunction with the invention.

[0220] In certain embodiments of the invention, a human monitor may be present to oversee the processes and to resolve any errors or faults. Nevertheless, in preferred embodiments, the method is automated or semi-automated, and the monitor will not substantially participate in the method, and therefore will not routinely need to interfere directly with computer-implemented aspects of the invention.

[0221] In embodiments, the various aspects of the invention are computer implemented. With reference to the disclosure above, the invention relates to a technical field (here molecular biology, genomics and biological sequencing), is concerned with a technical problem (fast interrogation of complex biological sequencing data, medical and diagnostic improvements via genomic / proteomic / sequencing approaches) and has technical features, with reference to the features of the claims. The features recited in the claims and throughout the application represent technical features, as they provide technical improvements inside and / or outside the technical working of a computer, they represent, simulate, model, manipulate and / or interact with real-world entities, and / or involve technical considerations, thus imbuing the features with technical character and contributing to the solution of technical problems. The data structure disclosed herein contributes to the technical character of the invention as it has an intended technical use (here making a searchable indexed data structure comprising biological sequence data for interrogation) and it causes a technical effect (faster searching, in combination with the potential wealth of biological information to be derived from sequencing applications, in particular complex spatial sequencing applications) when used according to this intended technical use.

[0222] It will be understood that particular embodiments described herein are shown by way of illustration and not as limitations of the invention. The principal features of this invention can be employed in various embodiments without departing from the scope of the invention. Those skilled in the art will recognize or be able to ascertain, using most routine study, numerous equivalents to the specific procedures described herein. Such equivalents are considered to be within the scope of this invention and are covered by the claims. All publications and patent applications mentioned in the specification indicate the skill level of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.

[0223] The use of the word "a" or "an" when used in conjunction with the term "comprising" in the claims and / or the specification may mean "one," but it is also consistent with the meaning of "one or more," "at least one," and "one or more than one." The use of the term "or" in the claims is used to mean "and / or" unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive. However, the disclosure supports a definition of only alternatives and "and / or." Throughout this application, where relevant, the term "about" or indicates that a value includes the inherent variation of error or common variation in the art.

[0224] FIGURES

[0225] The invention is further described by the following figures. These are not intended to limit the scope of the invention, but represent preferred embodiments of aspects of the invention provided for greater illustration of the invention described herein.

[0226] Figure 1 : A comprehensive framework for ultrarapid analysis of spatial transcriptomics data (A) Schematic overview of the present method's indexing architecture, illustrating the two- tiered indexing architecture. (B) Indexing speed across methods and dataset sizes: across technologies: index size vs raw data for Open-ST (n=5), Stereo-seq (n=20), and 10x Genomics Visium and Chromium v3 (n=2) datasets. (C) Query speed comparison of different methods by dataset and query complexity.

[0227] Figure 2: Reference-free reconstruction of gene expression preserves biological signals.

[0228] (A) Pseudoquantification workflow illustrating sliding window approach and background model.

[0229] (B) Cell-cell correlation of counts between the present method and the original quantification. (C) Gene-gene correlation of UMI counts. (D) Correlation of gene expression quantification per cell, between the present method and the original dataset. (E) Quantification of concordant and nonconcordant genes when performing differential expression (DE) analysis between clusters found with Malva and spacemake. The concordant genes between Malva and spacemake are differentially expressed (orange, log2FC > 1 in both cases), or non-differential (light green, log2FC < 1). The non-concordant genes are split into two groups, those with log2FC < -1 in one method and log2FC > 1 in the other (orange), and those with opposite direction (brown). (F) Spatial clustering results: align-and-count with spacemake (left), Malva (center). (G) Cell type annotations and correspondences between clusters shown in (F). (H) Separability of original cluster annotation in Malva-derived embedding space for a single-cell atlas (E13.5 mouse midbrain). (I) Expression of top marker genes for the original cell classes, on Malva and the original quantification. (J) Correlation of developmental trajectory in mouse embryonic brain atlas between Malva and original gene expression quantification. In A, B and C, blue intensity represents the kernel density estimation (KDE) of data points, revealing density in areas of overplotting; gray points represent outliers (below 5th density percentile).

[0230] Figure 3: Direct detection of complex RNA features. (A) Query interface architecture showing sequence, gene and concept translation workflow. (B) User interface for systematic RNA feature search on the Malva index from sequences, gene IDs or natural language prompts. (C) Cell-type specificity analysis of CDRIas in a single-cell adult mouse brain atlas. (D) Representative reads for CDR1 as found with Malva, and its alignment to the CDR1 as sequence with BLAST.

[0231] Figure 4: De novo sequence assembly with cluster-specific k-mer subsets, (a) Schematic of the workflow for identifying marker k-mers from Malva-derived clusters. Cluster-enriched k-mers are extracted and used either for transcript quantification (gene-based mode) or for de novo assembly (k-mer-based mode), (b) Comparison of sequence-based versus gene-based clustering across >30M healthy human cells, as shown in Fig. 10b. Latent structure agreement (AUC) and Adjusted Mutual Information (AMI) quantify how well sequence-derived embeddings preserve cluster structure relative to gene counts. Lower agreement in single-nucleus data reflects low-complexity intronic sequences.

[0232] Figure 5: The unified ‘Malva’ Index according to the present invention, (a) Public data sources are screened for scRNA-seq and Spatial Transcriptomics experiments. All metadata is collected, harmonized and linked to the Malva Index containing the k-mers indexed alongside cell (or spatial) barcodes, (b) Composition of the Malva Index (v2024.11). (c) Evolution of data availability across time, per tissue and technology type (spatial vs. single-cell), (d) Word cloud of diseases represented in the Malva Index.

[0233] Figure 6: The Malva platform according to the present invention enables sequence-level queries beyond current data portals, (a) The present method returned results in 0.6 seconds, and confirmed the expected cell-type specificity of CDR1 as: 95% of positive cells were excitatory neurons, with lower detection in inhibitory interneurons, and very low detection across other tissues, (b) Using the present method, Add2 3 -UTR variants were queried across mouse embryonic development in seconds. The results revealed that distal UTR usage concentrates in developing neural tissue while proximal forms dominate in the developing liver, with Malva coverage profiles closely matching those generated by STAR alignment, suggesting that the apparent 3’UTR lengthening of this gene across embryonic development is due to a shift in cell population proportions, (c) The present method recapitulates isoform switches that remain hidden in conventional gene expression analysis. Exon usage at the Ptprc locus in Tabula Muris 4. Coverage-like queries reveal cell-type-specific inclusion of exon A (CD45RA isoform in B cells) versus exclusion (CD45RO-like in other immune cells). Rows represent cell types, with dot size and color reflecting exon inclusion, (d) The present method analysis explains 3’UTR usage patterns in its spatial context, i.e., 3'UTR length regulation of Add2 during spatiotemporal mouse embryogenesis. The plot shows the coverage profiles across stages E9.5-E15.5 show proximal and distal polyadenylation site usage; ‘Malva’-derived profiles (top) closely match STAR alignments (bottom).

[0234] Figure 7: The present invention enables sequence queries beyond predefined gene annotations (a) Query performance across different sequence types and lengths, ranging from short sequences (24 bp) to large genomic regions (~2 Mbp). Bubble size reflects the number of cells queried (100,000 to 100 million), (b) Dot plot showing the percentage of SARS-CoV-2- positive cells per cell type and sample (dot size), with expression values z-score normalized within each cell type to highlight relative enrichment, (c) Coverage-like analysis of Mycoplasma genome across positive samples from Index shows predominant localization to 23S rRNA loci, (d) Cell-type-specific differential exon usage across the Tabula Muris atlas. Representative examples highlight known regulatory switches, such as Ptprc exon inclusion in B cells and alternative splicing of Cdca2 in neuronal populations, (e) Tissue-specific detection of the circular RNA CDRIas across the Human Cell Atlas and Tabula muris, compared with expression of marker genes SLC17A7 (excitatory neurons) and DLX2 (inhibitory neurons).

[0235] Figure 8: General overview over the Malva platform according to the present invention. Conventional single-cell portals lack sequence-level resolution, and existing sequence search tools lack single-cell resolution. The present invention enables integrating both, providing realtime results at atlas scale. Researchers can query the two-tiered Index using raw nucleotide sequences, gene symbols, or natural language descriptions. The present invention enables resolving these inputs to their underlying sequences and returns all matching cells together with rich metadata at both the cell and sample level (e.g., cell types, disease status, sex, age, study identifiers). In embodiments, raw data from several public sequence archives may be continuously downloaded and used to build a two-tiered index, that users can query via a public interface (API or web browser-based), along three axes: what to search (sequences, gene IDs, or natural language queries), how to search (expression or coverage) and where to search (cell types, samples, organs, diseases). These queries are processed by the engine and preferably returned as standardized outputs that essentially contain which cells are positive for the query, the quantification of the queried features, and the metadata related to the cells (e.g., cell type labels) and the samples these cells come from (e.g., disease, tissue, sex, etc.). Results can be fed into computational systems for automated analysis (e.g., Neural Networks), or analyzed interactively, for example as expression distributions across cell types (A), coverage profiles along the queried sequence queries (B), or spatial maps when coordinates are available (C).

[0236] Figure 9: Non-limiting example queries (e.g., according d. of the present method) and nonlimiting examples of the output the present method may return.

[0237] Figure 10: Sequence-based transcriptomic analysis without predefined gene annotations.

[0238] (a) Conceptual view of the single-cell information funnel. Biological (“physical”) cells are converted to “digital” cells via sequencing, which can be represented either as cell-by-gene or cell-by-sequence matrices. These representations enable reconstruction of latent spaces underlying cell types and trajectories, (b) Local Structure Agreement (Area Under the Curve, AUC) comparing sequence-based (k-mer bucket) versus gene-count Principal Component embeddings across >30 million healthy human cells. Values >0.5 indicate better-than-random preservation of local neighborhoods, (c) Example comparisons of clustering based on gene features versus sequence features in human samples with. Panels show cases with low (NMI = 0.04) and high (NMI = 0.65) concordance, corresponding to tissues highlighted in (b). (d) Sequence-based clustering of non-model organisms (Hoilungia and Trichoplax). Clustering structure based on k-mer features closely resembles gene-based annotations, (e) Spatial clustering of a head-and-neck tumor dataset using gene-count-based (left) versus sequencebased (Malva, right) embeddings. KS1 and KS2 are two k-mer-specific clusters, (f) Clusterspecific de novo assembly of marker sequences. Contigs assembled from discriminative k-mers reveal both annotated human features (genes and intergenic regions) and unaligned sequences. Dot plot shows the fraction of cells per cluster expressing each assembled contig (dot size) and the average log-normalized expression rescaled per contig so that the highest mean expression across clusters is set to 1 . (g) Representative marker sequences assembled from KS1 / KS2 clusters. Examples include a truncated 3'UTR isoform of KLK10 (human) and bacterial 23S rRNA sequences from Gemella morbillorum.

[0239] EXAMPLES

[0240] The invention is further described by the following examples. These are not intended to limit the scope of the invention, but represent preferred embodiments of the invention, provided for greater illustration of the invention described herein.

[0241] Example 1 - Malva algorithms and data structures

[0242] A generalized representation of the inventive approach is described below, by way of example:

[0243] Indexing and pseudoquantification algorithm:

[0244] In the first stage, the index construction transforms raw sequencing data (e.g. from Spatial Transcriptomics) into a highly optimized, searchable structure that allows for rapid querying of sequence information across spatial coordinates.

[0245] 1 . Input Preparation: Parse paired FASTQ files containing sequencing reads and associated spatial barcodes.

[0246] 2. Spatial Coordinate Mapping: a. Create a mapping of spatial barcodes to unique 32-bit integer identifiers. b. Supports up to ~4 billion spatial units or cells per experiment.

[0247] Direct spatial indexing was employed, unlike sample-level indexing in most existing methods.

[0248] 3. K-mer Generation: a. For each read, generate non-overlapping k-mers and encode each k-mer as a 64- bit integer using 2-bit representation. By default, a k-mer size of 24 was used. b. By default, packed data structures were not used, unlike other approaches, because we aim to optimize for indexing speed, not for memory overhead. These optimizations, e.g., hierarchical and Elias delta encoding of the sorted k-mer indices, can be applied after indexing.

[0249] Non-overlapping k-mer generation reduces computational overhead and storage requirements.

[0250] 4. K-mer Index Construction: a. Create a list L of encoded k-mers paired with their spatial identifiers (metadata). b. Transform L into three arrays: indices, pointers from indices to data, and data. c. Store these arrays (preferably in an HDF5 file structure or custom page-aligned array systems with a dedicated page cache)

[0251] Preferred use of compressed row storage-like approach for efficient on-disk storage and retrieval.

[0252] 5. Preferred Chunked Storage: a. Divide the index into manageable chunks for large datasets b. Merge chunks into a single final chunk if possible

[0253] In the second stage, pseudoquantification enables approximate quantification of sequence presence without the need for alignment. Malva is different to other k-mer indices in the sense that it can be exploited to query / quantify long sequences (e.g., a whole gene) against an index of very short sequences, e.g., Illumina reads associated with single labels, as is the case for ST.

[0254] 1. Query Preparation: a. For each query sequence, generate a set of sliding sequences from the query. Each sliding sequence starts at a different position in the query, from 1st to k-th, and extends to the end of the query, as long as the resulting sequence is at least k nucleotides long, k is the k-mer size, which is 24 by default (must be the same as during index construction).

[0255] This ensures capture of all possible k-mer offsets, thereby improving sensitivity.

[0256] 2. K-mer Matching: a. For each sliding sequence, generate k-mers and look up each in the index using either in-memory lookup table (for smaller datasets) or hierarchical binary search on the indices array (for very large datasets).

[0257] This flexible lookup strategy balances speed and memory usage.

[0258] 3. Sliding Window Analysis: a. For each sliding window in the query, compute a match score for each window. This score is calculated by dividing the number of matching k-mers in the window by the total number of possible k-mers in the window. The total number of possible k-mers is the window size minus k-mer size plus one. b. Then, consider a match significant if score > threshold T (default 0.65, can vary)

[0259] This enables a fast approximation of sequence presence without full alignment, tolerating mismatches and gaps.

[0260] 4. Result Aggregation: a. Aggregate significant matches across all windows. b. Output: List of spatial locations with their respective match counts.

[0261] 5. Optional use of a background model.

[0262] Malva may incorporate a background model based on k-mer frequencies in reference transcriptomes to distinguish between informative and ubiquitous k-mers.

[0263] Data structure during indexing and pseudoquantification: During indexing, the data structure is preferably organized as follows:

[0264] 1 . K-mer Index: This is a sorted array of 64-bit integers representing encoded k-mers. Each k-mer is associated with a pointer in the Spatial Pointers array. Since these are stored as lexicographically sorted, these can be arranged hierarchically, similar to a B+ tree, and the leaf level can be compressed using coded deltas, in a page-aware manner.

[0265] 2. Spatial Pointers: This array contains pointers to lists of spatial IDs. Each pointer corresponds to a k-mer in the K-mer Index and points to a list of spatial locations where that k-mer is found.

[0266] 3. Spatial Metadata: This structure maps spatial IDs to their actual coordinates in the tissue.

[0267] During index construction, the data structure is populated as follows:

[0268] 1 . K-mers are generated from reads and encoded as 64-bit integers.

[0269] 2. These k-mers are sorted and deduplicated to form the K-mer Index.

[0270] 3. For each k-mer, a list of spatial IDs is created and stored, with pointers in the Spatial Pointers array.

[0271] 4. Spatial coordinates are stored separately in the Spatial Metadata structure.

[0272] During pseudoquantification, the index is queried as follows:

[0273] 1 . The query sequence is broken into sliding windows.

[0274] 2. K-mers from each window are looked up in the K-mer Index (using hierarchical search or hash table).

[0275] 3. For matching k-mers, the Spatial Pointers are followed to retrieve lists of spatial IDs.

[0276] 4. These spatial IDs are used to compute match scores for each window.

[0277] 5. Significant matches are aggregated across windows.

[0278] This structure allows for efficient k-mer lookups and retrieval of spatial information without the need for graph traversal.

[0279] Comparison to state of the art:

[0280] Focusing on the data structure and indexing procedure, Malva is a hybrid approach that combines elements from both color-aggregative and k-mer aggregative methods, while introducing novel features to spatial transcriptomics data.

[0281] Like color-aggregative methods such as Vari (Muggli et al., 2017) or Mantis (Pandey et al., 2018), Malva constructs a global index of k-mers. However, instead of associating k-mers with colors representing datasets, Malva directly maps k-mers to spatial coordinates or cell barcodes. This is more conceptually similar to the k-mer indices used in the colored de Bruijn graph approaches, but tailored for single-cell and spatial information. For instance, methods like BOSS (Boucher et al., 2015), BIGSI (Bradley et al., 2019), Mantis (Pandey et al., 2018), Bifrost (Holley and Melsted, 2020), MetaGraph (Karasikov et al., 2024), kmtricks (Lemane et al., 2024), and REINDEER (Marchet et al., 2021) have introduced k-mer indexing strategies optimized for large-scale sequence collections. While these methods excel at indexing and querying across a number of bulk samples, their data structures are optimized for dataset-level annotations rather than finegrained cell-level or spatial annotations required for single-cell and spatial transcriptomics, where the number of labels may scale to billions per experiment. Malva does not maintain the connections between k-mers, therefore, it does not require the construction of a graph. Instead, it uses just the hash-based (or hierarchical search-based) k-mer lookup, more similar to a web search index. One concept behind Malva is that graph structure (i.e. , a de Bruijn graph) is not necessary for accurate quantification of sequence presence across single cells or spatial locations - this enables the core feature, which is to associate k-mers with tissue or cell information (see above “Data structure during indexing and pseudoquantification”). This assumption, by itself, does not guarantee accurate quantification for data with a few thousand base pairs per label (as is the case for spatial transcriptomics), unless a proper method for quantification is implemented (i.e., we term pseudoquantification).

[0282] Therefore, Malva extends the concept of k-mer indexing by integrating spatial annotation directly into the index structure. This is achieved through a compressed annotation matrix that associates k-mers with spatial locations or cell barcodes, conceptually similar to the color matrix in VARI (Muggli et al., 2017), but optimized for spatial data, which represents the case where a one label is assigned to a few reads covering thousands of base pairs, very different from the common case of associating one label per hundreds of thousands or millions of base pairs.

[0283] Malva queries k-mers and retrieves associated spatial information directly, without the need to traverse a graph structure. This enables rapid lookups, optionally directly on disk, critical for the scalability of the pseudoquantification method.

[0284] Malva's pseudoquantification algorithm introduces a novel quantification approach for k-mer indexing methods, leveraging the observation that in short-read single-cell data, the mapping between sequencing reads and labels tends to be close to 1 :1. It uses a sliding window approach for k-mer counting in combination with a flexible presence threshold and a background model specifically designed for sparse single-cell and spatial data.

[0285] Summary of preferred features, embodiments and advantages:

[0286] Scalability to billions of labels: Malva's design allows it to scale to billions of spatial coordinates or cell identifiers. This is achieved through the use of 32-bit unsigned integers for identifiers, supporting up to ~4 billion spatial units or cells per indexed experiment; and, through the use of chunked storage strategy, allowing efficient handling of large datasets that may not fit in memory.

[0287] Non-overlapping k-mer generation: Unlike other methods that use overlapping k-mers, Malva generates non-overlapping k-mers (typically except for the last one). This significantly reduces the number of k-mers generated, improving both indexing speed and storage efficiency without compromising accuracy.

[0288] Efficient on-disk storage and retrieval: The preferred use of HDF5 file structure or custom page-aligned array systems with a dedicated page cache with a compressed row storagelike approach (indices, pointers, data arrays) allows for efficient disk storage and quick retrieval of k-mer information. While compressed row storage approaches are well- established, Malva's implementation is specifically tailored for the access patterns common in single-cell and spatial queries.

[0289] Pseudoquantification approach: Malva's pseudoquantification method uses sliding windows with a flexible window size, allowing for tolerance of mismatches and gaps. The match score calculation provides a fast approximation of sequence presence without full alignment, akin to some scores used in the information retrieval field. Lastly, the use of a threshold for determining significant matches allows for fine-tuning of sensitivity vs. specificity.

[0290] Direct spatial indexing: Unlike many existing methods that typically index at the sample level, malva directly indexes spatial coordinates or cell identifiers. This enables rapid spatial queries, which is crucial for spatial transcriptomics analysis.

[0291] Flexibility in query preparation: The generation of sliding sequences for each query.

[0292] Example 2 - Malva: Ultrarapid and reference-free search engine for single-cell and spatial sequencing data

[0293] A detailed representation of the inventive approach is described below, by way of example:

[0294] Methods

[0295] Construction of the index generated according to the present method

[0296] The index generated according to the present method is constructed in embodiments from a collection of n input sequences S = where each stis a string over the alphabet S =

[0297] {A, C, G, T}. For embodiments of the present method, the inventors assume stcomes from shortread sequencing, thus the typical length of each string will be ~32-200 nucleotides. Each sequence stis associated with additional information, typically an identifier such as a cell barcode. The inventors denoted this associated information as I = {i1, i2,..., in}. The inventors assumed this information is already encoded as an integer value or a tuple of values (e.g., in the case of 2D coordinates).

[0298] For a given integer k > 8, the inventors generates all k-mers from each sequence using a nonoverlapping approach, with a special case for the last k-mer. More specifically, given a string s of length |s| = I over an alphabet , the inventors defined the set of k-mers K(s~) as:

[0299] / (s) = {s(i, i + / <) | i = jk; 0 < j < [ / / / cj] U {s(l — k,l) \ l mod k =# 0} where s(i ’) denotes the substring of s starting at index i (inclusive) and ending at index j (exclusive). This formulation generates non-overlapping k-mers for most of the sequence, with a potential overlapping k-mer at the end to ensure complete coverage. The total number of k-mers generated for a sequence is \l / k], This approximation enables perfect retrieval from a sequence q queried against the final index, if q G S.

[0300] Each k-mer is then encoded as a 64-bit integer using a function f. Here, the inventors used naive 2-bit encoding (A=00, C=01 , G=10, T=11). For each sequence stand its associated information iitthe inventors created a set of encoded k-mers paired with the identifier: E(siti ) = { / (%), ij | x is a kmer in s . The complete set of encoded k-mers with their associated identifiers is then formed as the union of these sets, E = u”=1E(siti ).

[0301] Next, the inventors sorted the set E lexicographically based on the encoded k-mer values, resulting in a sorted list L. During the sorting process, the inventors removed duplicate entries for identical k-mers while preserving all associated identifiers. To store this information on disk, the inventors transformed L into three arrays: indices, pointers and data, similarly to the principle of compressed row storage matrices. The indices array contains the unique encoded k-mers. The pointers array stores the start and end indices in the data array for the identifiers associated with each k-mer. The data array is a concatenation of all 32-bit unsigned integer identifier sets for the k-mers - supports a maximum of ~4 billion spatial units or cells per indexed experiment.

[0302] These arrays are then stored in an HDF5 file structure or in custom page-aligned array systems with a dedicated page cache, which allows for efficient disk storage and retrieval of k-mer information.

[0303] To improve memory efficiency when handling large datasets, the inventors employed a chunked storage strategy. The k-mers from a consecutive subsets of input sequences are collected, sorted based on their encoded values and added to the HDF5 file or page-aligned array system. This will generate various chunks of (indices, pointers, data), which can be merged into a single chunk. The final HDF5 file or page-aligned array contains a single dataset storing the encoded k- mers, pointer arrays for identifier locations, and arrays of associated identifiers. Additionally, the inventors stored metadata including the k-mer size and number of chunks used.

[0304] Hierarchical memory management

[0305] The inventors implemented a B+-tree inspired hierarchical structure with configurable page size P (default 1024). The inventors construct s = \logPN] levels where level I contains [W / P entries for each N unique k-mers. Each non-leaf node stores P k-mers representing maximal values in child pages. The hierarchical structure reduces random disk accesses through three mechanisms: page-aligned reads matching storage system block size (typically 4KB), predictive prefetching based on query patterns, and an LRU cache maintaining frequently accessed pages. This organization achieves a query time complexity of 0(log(PNf) for exact matches. Memory usage is bounded by O(P logPN)), enabling efficient operation even with datasets orders of magnitude larger than available RAM.

[0306] Querying and Seguence Pseudoguantification

[0307] The core functionality of the present method is to efficiently identify occurrences of query sequences within the indexed dataset. Given a set of query sequences Q = {qltq2, ..., qn}, the inventors first prepared these for analysis. For each query sequence q of length |q| and a window size w, the inventors generated a set of sliding sequences S(q) = {q(i, I) | 0 < i < k, \q i, I) > k}, where k is the k-mer size used in the index construction. This approach ensures that the inventors captured all possible k-mer offsets within the query sequence.

[0308] For each sliding sequence s G S( ), the inventors generated the set of k-mers K(s) as defined in the k-mer generation process. The inventors then retrieved the associated identifiers using the indptrand data arrays by finding the corresponding location of a query k-mer via either a hash table built upon loading the index to memory, or a coarse binary search in the indices array.

[0309] The inventors then employed a sliding window analysis to evaluate the significance of matches. For each window W of size w in q, defined as W = q(i, i + w) for 0 < i < |q| - w], the inventors computed a match score for every sequence identifier found in the previous step. The match score for an identifier id in window W is defined as: where M(W, 1D) is the set of k-mers in W that match k-mers associated with the identifier ID in our index, and (w - k + 1) is the total number of k-mers in the window. A match is considered significant if its score equals or exceeds a predefined threshold T (default 0.65).

[0310] Search Algorithm Optimization

[0311] The querying process aggregates significant matches across all windows, maintaining a count of the number of windows in which each identifier appeared as a significant match. This approach allows for efficient identification of sequences similar to the query, tolerating some mismatches or gaps.

[0312] Moreover, the search algorithm employs several optimizations, like grouping multiple k-mer lookups to amortize disk access, pre-sorting of queries to maximize cache utilization, and early termination by aborting window processing when score cannot exceed threshold.

[0313] Query error analysis

[0314] Formally, the worst-case scenario for false negatives can be a query sequence q of length I, where I > w and w is the window size. Let M = [L / k\ be the number of non-overlapping k-mers in q. The worst case for false negatives would occur when there are exactly M errors in q, with each error positioned in a different non-overlapping k-mer. In this scenario, if the inventors used a threshold T = 1, requiring a perfect match within a window, the method would fail to detect any significant matches despite the query being highly similar to a sequence in our index. This illustrates the trade-off between sensitivity and specificity in our matching algorithm, and underscores the importance of choosing appropriate values for T and w based on the specific requirements of the analysis.

[0315] Background Model Generation

[0316] To account for the repetitiveness of genomic sequences, the number of false positives is reduced and improves the overall quality of pseudoquantification, the inventors implemented referencespecific background models based on k-mer frequencies. Alternatively, the present method can exclude low-complexity k-mers masked by standard methods such as dustmasker.

[0317] The inventors began with a set U = {ultu2, . . . , un) of genomic sequences, e.g., from a reference database such as GENCODE or RefSeq. For each sequence iq, the inventors generated the set of k-mers K(iq) using the same k-mer generation process as in the main index construction. The inventors then employed a two-step counting mechanism. When multiple sequences are associated to a same feature, e.g., multiple alternative transcripts under the same gene ID g, the occurrence of each unique k-mer within its sequences were counted, creating a set Cg= {( / <, 1) | k G K(ugy}. The inventors then aggregated these counts across all genes, forming a global count set C = {(k, |£g | k G Cfl}|) | k G K(ug)}.

[0318] Without loss of generality, in the case of transcript or UTR sequences for a same gene, this approach gives equal weight to each preventing bias towards genes with longer or multiple isoforms. The resulting background model is stored as a map where keys are the encoded k- mers and values are the counts of genes in which each k-mer appears. The inventors then normalized these counts, either by dividing by the total number of genes or by employing a rankbased normalization, to facilitate interpretation and threshold setting. During the present method querying process, each k-mer k from the query sequence is checked against this background model. Let ( / c) be the normalized frequency of k in the background model. The inventors defined a complexity threshold 9. K-mers where ( / c) > 9 are either excluded from the matching process or given reduced weight in the scoring algorithm, determined that maps the background frequency to a weight. ses

[0319] The inventors developed a local database system that interfaces with online genomic resources. Our approach primarily utilizes the Ensembl database as the source of reference sequences (Harrison et al. 2024).

[0320] The inventors downloaded complete transcript or 3’UTR sequences for a specified species (e.g., Homo sapiens or Mus musculus) from the Ensembl FTP server. These sequences are then indexed and stored in a local SQLite database, which allows for rapid querying and retrieval of sequence information without the need for repeated online access.

[0321] To handle user queries, the inventors developed a parsing system that can interpret various input formats. This includes direct DNA sequences, FASTA-formatted sequences, gene symbols, and Ensembl IDs. The parser also recognizes special commands prefixed with "gene:" or "ensembl:" to specify the type of identifier being used. Additional parameters can be included to refine the query, such as specifying the species or selecting a particular region of the sequence using the "split" parameter. By default, sequences are parsed from the local database. For queries that cannot be resolved locally, the system falls back to real-time API calls to the Ensembl REST server.

[0322] Querying with Large Language Models

[0323] The inventors integrated the open-source Llama 3.1 model through the Ollama interface into the present workflow (Dubey et al. 2024). The inventors designed a specific prompt template to generate Python lists of genes relevant to user queries, in a structured format for subsequent analysis steps.

[0324] To provide biological context, the inventors constructed a Retrieval-Augmented Generation (RAG) database using MSigDB, the Molecular Signatures Database (Liberzon et al. 2015; Castanza et al. 2023), managed with ChromaDB. The inventors created this database by encoding the textual descriptions from MSigDB using the mxbai-embed-large text embedding model. The resulting embedding vectors were then added to the ChromaDB database along with their corresponding lists of genes and the name of the database of origin as metadata. This enables the language model to retrieve curated biological information based on the context of the user's query.

[0325] User queries are first converted to text embeddings compatible with the RAG database, to retrieving metadata, i.e., all possible lists of annotated genes relevant to the query. These, together with the user query, are fed into Llama, which generates a curated list of relevant genes. The expanded gene list is then used to fetch coding sequences from our local genomic database, which are queried against the malva index to obtain spatial expression data.

[0326] To highlight cells or regions with specific phenotypes, the inventors implemented a scoring system that considers both positive and negative expression patterns. The inventors calculate scores based on the pseudoquantification values from gene lists, with the final score for each spatial location computed as the normalized difference between these positive and negative expression levels.

[0327] Interactive exploration of spatial data

[0328] When spatial transcriptomics data is indexed, the inventors facilitate the querying of sequences and other features, and the visualization of the spatial distribution of matches across the indexed dataset, via a web-based visualization interface.

[0329] This was implemented using the Flask framework in Python. Upon initialization, the server loads the malva index and associated metadata. The visualization system employs a tile-based architecture that maintains two visualization channels: one for background expression displayed in red, and another for query results shown in green. Each spatial dataset is processed through datashaderwith dynamic resolution adjustment based on zoom levels, generating 256x256 pixel tiles. To enable efficient spatial querying, the system maintains spatial indices using scipy.spatial.cKDTree for both background and query data. During tile rendering, the system first calculates bounds based on the dataset's coordinate limits and current zoom level. Points within these bounds are retrieved using the spatial index and rasterized through datashader, with intensity normalization based on global value ranges. The system then applies Gaussian smoothing to enhance visualization clarity, using different smoothing parameters based on zoom level (o=1 .0 for overview, o=1 .2 for detailed views). Background and query channels are combined into RGBA images where the alpha channel indicates data presence.

[0330] Then, the interface provides a text input element that serves as a search bar that supports two types of queries: (1) standard sentence queries, where users can input a nucleotide sequence or a specific gene ID, or (2) natural language queries, which are processed using a language model to extract relevant gene sequences. For each query, the server uses the malva index to identify matching locations and intensities.

[0331] Datasets

[0332] The inventors showcase the present invention’s (‘malva’s’) ability to handle ST and scRNA-seq datasets, at different scales, with the following datasets. In particular, Open-ST E13.5 mouse embryo head (GSM7990097), human primary head and neck squamous cell carcinoma (GSM7990099), Slide-seq v2 E9.5 mouse embryo (GSM5915059), Stereo-seq CIRSTA atlas (downloaded from CNGB) (Schott et al. 2024; Kumar et al. 2023; Wu et al. 2024; La Manno et al. 2021 ; Joglekar et al. 2021 ; Boileau et al. 2022; McKellar et al. 2023). The Human Cell Atlas (HCA) data was downloaded from their portal, by selecting samples from “normal” (non-diseased) tissue, with open access, and for the technologies 10x Genomics Chromium 3’ v1 , v2 and v3, as of October 2024. The Human Adult Brain Atlas (HABA) (Siletti et al., 2023) was downloaded from NeMO. These datasets were chosen to represent a range of experimental designs, tissue types, and data sizes, with the CIRSTA atlas, HABA and HCA being particularly large at approximately 7TB, 20TB and 50TB of raw data.

[0333] For each technology, the inventors downloaded pairs of raw FASTQ files from their respective repositories. Then, the inventors standardized the barcode-to-identifier information into standard CSV files containing the mapping between barcode and identifier (e.g., unique number for a cell in scRNA-seq data, or 2D spatial location in ST data). In the present method, the inventors provide different run-modes that automatically detect the read structure for different technologies, including Open-ST, Seq-scope, Stereo-seq (STOmics), Slide-seq, Visium (non-HD), and 10x Genomics Chromium. Thus, FASTQ files can be processed seamlessly. For all spatial datasets, the inventors indexed their spatial locations with the maximum resolution supported by each technology, i.e., it is possible to query sequences at 0.6 pm spots in Open-ST data.

[0334] Benchmarking

[0335] To assess the computational efficiency of the present invention (‘malva’), the inventors measured the time required to build the Malva index for each dataset. The inventors also analyzed how the indexing time scales with the size of the input data (Fig. 1 B). All benchmarks were performed on a quad-socket server with 256 cores (4x Intel(R) Xeon(R) Platinum 8280 CPU, each with 38.5MB L3 cache), 4TB of RAM, and an HDD-based ZFS file system with 1 PB storage, running Ubuntu 22.04.

[0336] The inventors evaluated the query performance of the present invention (‘malva’) against other indexing-based methods (MetaGraph and BLAST) using subsets of genes ranging from 1 to all sequences in the cDNA references. For each method, the inventors measured the time required to quantify gene expression across all cells or spatial locations in the datasets. This analysis was performed for all datasets used in the computational complexity analysis.

[0337] To assess the accuracy of gene pseudoquantification with the present invention (‘malva’), the inventors compared its results to those obtained from spacemake, leveraging STAR for alignment and Dropseq-tools for quantification. For each dataset, the inventors computed the counts per cell (or spatial location) for each gene using all tools. The inventors then plotted these counts against each other, with each point representing a gene, and calculated the Pearson correlation coefficient (Fig. 2B).

[0338] The inventors performed differential expression analysis between cell clusters identified in the Open-ST dataset (Fig. 2E). Clusters were defined using the original publication's annotations. For each cluster comparison, the inventors identified differentially expressed genes using both malva and Spacemake. The inventors used scanpy's rank_genes_groups function with the Wilcoxon rank-sum test for differential expression analysis. The inventors then compared the Iog2 fold changes and adjusted p-values obtained from both methods to assess the concordance of differential expression results.

[0339] Gene-based spatial clustering

[0340] The inventors performed spatial clustering of cell types based on pseudoquantification from prebuilt 3’UTR+ncRNA references for the suitable species, masking the k-mers with abundance 9 > 1 in the background model computed from all 3’UTR+ncRNA sequences. Then, the inventors obtained a cell-by-gene matrix that was loaded into scanpy 1.10.2 for data processing and analysis (Wolf et al. 2018). The inventors normalized data using sc.pp.normalize_total and applied logl p transformation. The inventors then identified highly variable features using sc.pp.highly_variable_genes and performed principal component analysis with sc.pp.pca. For clustering, the inventors computed a neighborhood graph using sc.pp. neighbors and applied the Leiden algorithm (sc.tl.leiden) for community detection. To compare the clustering results of the present method with traditional methods, the inventors performed the same clustering analysis using gene expression data processed by spacemake (Sztanka-Toth et al. 2022). The inventors then computed the Adjusted Rand Index between the two clustering results to quantify their similarity.

[0341] Reference-free analysis of spatial transcriptomes

[0342] The inventors constructed a matrix where rows represent spatial units (e.g., spots or cells) and columns represent k-mers. This matrix is then processed using scanpy 1 .10.2 following a standard single-cell analysis workflow. First, the inventors perform normalization of k-mer counts to account for differences in sequencing depth across spatial units. Highly variable k-mers (HVMs) are then selected based on their dispersion relative to the mean expression with sc.pp.highly_variable_genes using the ‘seurat’ flavor. The inventors apply principal component analysis (PCA) to the HVM matrix to reduce dimensionality and capture the main sources of variation in the data. Using the top 30 principal components, the inventors used sc.tl. neighbors to construct a nearest neighbor graph with default parameters. Finally, the inventors detected clusters of spatial units with similar k-mer profiles using the Leiden algorithm with resolution 1 on the nearest neighbor graph.

[0343] Following clustering, the inventors identified differentially present k-mers between distinct clusters using the t-test method implemented in scanpy. The inventors considered k-mers with a Benjamini-Hochberg adjusted p-value < 0.05 and Iog2 fold change > 1 as significantly differentially present.

[0344] Using the differentially present 24-mers (selected based on Iog2 fold change > 0.25 and adjusted p-value < 0.05), the inventors performed de novo transcriptome assembly using SPAdes 4.0 (Bankevich et al. 2012). The assembly was performed using the RNA-seq pipeline (--rna flag) with a k-mer size of 17 bp (-k 17), 32 GB of RAM allocation (-m 32), and 8 processing threads (-t 8). Only the assembly stage was executed (--only-assembler flag), taking the differential k-mers as input sequences. This targeted assembly focuses on reconstructing transcripts that are most relevant to the observed spatial patterns. Assembled transcripts are visualized using Bandage to inspect their structure and connectivity (Wick et al. 2015). For validation, the inventors employed BLAT search on the UCSC browser against known reference databases to identify similarities with annotated transcripts or to discover potentially novel transcripts (Kent 2002).

[0345] Results:

[0346] Ultrarapid indexing of spatial and single-cell seguencing data

[0347] For any given single-cell or spatial dataset, the inventors assume raw reads are provided (e.g. as FASTQ or SAM / BAM / CRAM). Across technologies, these contain genomic sequences, as well as cell barcodes and, optionally, unique molecular identifiers (UMIs). The inventors developed a tool that rapidly converts these into searchable objects without requiring reference sequences (Fig. 1A).

[0348] More specifically, Malva creates a data structure optimized for querying single-cell and spatial data. At its core, it uses a k-mer based approach, where sequencing reads are decomposed into substrings of length k that are indexed and mapped to corresponding spatial coordinates (or other cell-specific identifiers), creating a compact representation of the dataset (Fig. 1A, Methods). This enables a two-tiered indexing system, akin to web search engines - first a k-mer lookup table, then a spatial (or any other label) index mapping k-mers to their locations in the tissue.

[0349] To assess storage efficiency, the inventors compared the size of indexes generated according to the present invention to their corresponding raw data across different technologies and organisms. The index achieves 5-1 Ox compression compared to the original data while preserving complete sequence information. For instance, a typical Open-ST dataset (~1TB raw data) requires only around 100GB of storage, while the complete Human Cell Atlas index (50TB of raw data) requires only 4.2TB of storage. This compression is achieved without resorting to probabilistic methods, ensuring that exact sequence matches remain precise (Methods).

[0350] To benchmark the present method's efficiency, the inventors measured performance on datasets from different technologies across spatial or single-cell resolutions, species, tissue types, and number of cells (Methods). The present method indexes an Open-ST dataset containing ~1 billion reads across ~30 million locations in approximately 20 minutes using a single CPU core from a 32-core workstation with solid-state storage, as well as larger datasets consisting of dozens of sections at high sequencing depths (20 Stereo-seq sections at ~1-3 billion reads each) in just over 1 hour using all available cores. In both cases, memory usage is below 10 GB per used CPU core, depending on the chunk size during indexing. These represent a significant speedup over methods like kallisto (pseudoalignment) or spacemaker (STAR-based mapping), reducing overall data processing time from days to minutes (Fig. 1 B).

[0351] More importantly, the indexing procedure is performed only once - subsequent queries take place within milliseconds per query sequence (Fig. 1C). When comparing performance across methods, the present method maintained a significant speed advantage, with query times scaling primarily with query size rather than dataset size. For instance, a typical query of ~1 ,000 bases (i.e. , one gene) across spatial transcriptomics technologies, compared to hours required by alignment-based approaches. Notably, these measurements likely underestimate the present method's true performance advantage. Our benchmarks, conducted on conventional hard disk drives due to the massive scale of some datasets (e.g., 50TB of raw reads from a subset of the Human Cell Atlas), are predominantly limited by storage throughput (~2GB / s on our ZFS filesystem). This suggests that in settings with faster solid-state storage infrastructure, the present method's performance gains could be even more pronounced.

[0352] To systematically evaluate query performance across different scenarios, the inventors performed a comprehensive grid search varying the number of cells (4,000-10M), sequencing depth (10M- 3B reads), and query complexity. The inventors compared the present method against established methods including MetaGraph, BLAST, kallisto, and STAR. Across datasets, the present method maintained up to 1 ,000x faster querying, though this advantage diminishes when processing >1 ,000,000 sequences simultaneously, where the performance difference narrows to 2-3x. Memory usage remained constant across query sizes, indicating that data structure access, rather than computational complexity, represents the primary performance bottleneck. More specifically, small queries on indices stored in solid-state storage, or cached in memory, showed at least 10-fold lower latency due to fast random access.

[0353] Accurate reconstruction of gene expression patterns without genome alignment Upon indexing of raw reads, users can provide arbitrary sequences and compute fast approximations of their abundance at the indexed locations, without the need for full read alignment. The inventors term this process pseudoquantification (Fig. 2A).

[0354] Pseudoquantification relies on three key aspects: (1) sliding a window of length w (approximately matching the experimental read length) across query sequences to generate k-mers, (2) applying a presence threshold T to determine significant matches overlapping between locations and the sliding window, and (3) the use of a background model to account for common genomic sequences. Parameter sensitivity analysis revealed robust performance (R>0.8) across a range of settings, with good speed-accuracy balance at / c=24, T=0.65, and w set to read length.

[0355] To validate this approach, the inventors compared expression profiles from the present method to those from standard alignment-based tools. Cell-cell correlations between the present method and standard quantification showed high concordance across datasets (Fig. 2B-D), with genelevel correlations particularly strong for cell-type specific markers. The accuracy decreased (R<0.5) primarily in two scenarios: sequences with non-masked low complexity regions spanning >30% of their length, and areas of extremely high local sequence similarity, affecting <5% of typical queries. The inventors then assessed whether these approximate expression profiles capture biologically meaningful patterns. In a spatially-resolved HNSCC tumor (Open-ST dataset), unsupervised clustering based on quantification according to the present invention accurately identified known anatomical structures, showing high concordance with traditional methods (Fig. 2E). Key marker genes showed near-identical spatial patterns between approaches (Fig. 2F), indicating faithful reconstruction of tissue architecture.

[0356] The inventors extended our validation to a single-cell embryonic mouse brain atlas, where accurate quantification of developmental dynamics is crucial. The present method preserved both marker gene detection (Fig. 2G) and developmental trajectories (Fig. 2H), with particularly strong agreement for cell-type-specific markers. Differential expression analysis between major cell types revealed >80% overlap in differentially regulated genes (|log2FC|>1 , FDR<0.05) between methods.

[0357] Direct querying of complex RNA features in spatial and single-cell context

[0358] Having validated the present method's accuracy, the inventors explored its capability to detect complex RNA features. The query interface allows users to provide nucleotide sequences, gene identifiers, or natural language prompts, all ultimately translated to searchable sequences (Fig. 3A). This system enables interactive and rapid detection of diverse RNA features, e.g., proteincoding genes, probes for splice variants, circRNAs, and viral sequences, via direct sequence input, standardized sequence identifiers, or natural language prompts (Fig. 3B-C).

[0359] To demonstrate this capability, the inventors compiled reference sets from circBase (circular RNA back-splice junctions) and systematically analyzed their presence in embryonic and adolescent mouse brain atlas data. Despite these sequences not being specifically enriched in standard library preparation protocols, the inventors detected multiple circRNAs, including cell-type specific expression of CDRIas in excitatory neuronal populations (Fig. 3C), validating the present method's ability to mine standard sequencing data for complex RNA biology.

[0360] To facilitate exploration of spatial patterns, the inventors integrated a large language model engine that translates biological concepts into gene lists. For instance, the prompt "Show immune cells in the stroma", returns gene lists that can ultimately be translated into sequences searchable against an index according to the present invention. This enables rapid assessment of gene programs, pathways or, more generally, any phenotype that can be represented as the presence of specific sequences in cells or spatial domains. The inventors ensure that gene lists returned by the LLM are both relevant to user queries and meaningful according to prior knowledge in curated databases by using a Retrieval-Augmented Generation paradigm (Gao et al. 2023); that is, the inventors feed databases to the LLM engine, so it can report references for the generated answers.

[0361] More specifically, the inventors implemented a two-step validation process. First, the inventors used a separate LLM instance to generate 1 ,000 biological descriptions of known spatial patterns from published datasets. The inventors then compared the spatial distribution of genes identified by our query LLM against these reference patterns. The model achieved >90% accuracy in identifying correct spatial patterns across 1 ,000 test queries, with errors primarily occurring between closely related cell types. This enables rapid assessment of gene programs and pathways through natural language queries.

[0362] Taken together, the present method can accurately interpret both sequencing and natural language user queries to interrogate the indexed data.

[0363] De novo identification of tissue architecture from sequence patterns

[0364] The k-mer index at the core of the present method enables direct, reference-free analysis of spatial transcriptomes. Unlike traditional approaches that rely on mapping to annotated references, the present method generates cell-by-mer or spot-by-mer matrices that can be analyzed without prior knowledge. This capability is particularly valuable for discovering novel transcripts in complex tissues such as tumors, where standard references may miss important biological variation.

[0365] To demonstrate this approach, the inventors analyzed a Open-ST head-and-neck squamous cell carcinoma (HNSCC) dataset. Clustering analysis of k-mer profiles revealed the major spatial domains and tumor subpopulations identified by traditional methods. These domains, consistently detected across all samples, showed distinct transcriptional signatures enriched for alternatively processed transcripts. De novo transcript reconstruction from spatially co-occurring k-mers identified previously unannotated sequences; among these, the inventors found tumor-specific variants, including novel 3' UTR isoforms.

[0366] Encouraged by these results, the inventors explored the potential of reference-free analysis in cross-species comparisons. By comparing k-mer profiles between mouse and human brain samples, the inventors were able to identify regions of conserved gene expression without relying on genome alignments or explicit orthology mapping. This approach revealed similarities in spatial gene expression patterns across species, showcasing the power of sequence-based analysis to bridge evolutionary gaps and highlight conserved functional domains in complex tissues.

[0367] In summary, the present method can perform reference-free analysis. This feature is particularly valuable for studying non-model organisms or for discovering novel transcripts in entities such as tumors, where the presence of non-annotated transcript variants might be highly relevant for their biology.

[0368] Discussion The fields of single-cell and spatial sequencing have been constrained by reference-bound analyses, limiting our ability to explore the full complexity of gene expression in tissues. Here the inventors introduce the present method, which addresses this gap by enabling reference-free exploration of sequence space through a searchable k-mer index built directly from raw data. This approach offers several key advantages over existing methods.

[0369] First, the present method's speed enables interactive exploration of massive datasets. Our benchmarks demonstrate that a full dataset can be processed in minutes rather than hours or days, with subsequent queries completed in milliseconds. This computational efficiency opens new possibilities for hypothesis generation and testing - researchers can rapidly explore expression patterns of novel transcripts, splice variants, or gene programs without the computational overhead of traditional alignment-based approaches.

[0370] Second, the present invention reduces the dependence on reference genomes. This capability is particularly valuable for studying non-model organisms or for discovering novel transcripts in entities such as tumors, where traditional references may be incomplete or inaccurate. Our successful reconstruction of known cell types and spatial domains in HNSCC tissue without using reference information demonstrates the practical utility of this approach.

[0371] The integration of natural language queries represents a third key innovation. Our results show that language models can effectively translate biological concepts into queryable sequences. The accuracy of these translations will depend to some extent on the quality and completeness of the training data, and capturing more complex biological relationships will improve with deep domain expertise. However, the framework the inventors established can be incrementally improved as language models and biological databases evolve.

[0372] The present method is particularly valuable for rapid, large-scale analysis of single-cell and spatial sequencing data. In research settings, it enables real-time exploration of gene expression patterns across tissues and organisms. The ability to identify conserved expression domains without relying on sequence alignment could provide new insights into functional evolution. For clinical applications, rapid processing and flexible querying will facilitate real-time analysis of patient samples.

[0373] Example 3

[0374] RNA encodes cell-type-specific information through splicing, isoforms, and unannotated transcripts. Single-cell and spatial transcriptomics technologies can now measure most of this diversity, generating hundreds of millions of cellular profiles that grow at petabyte scale annually. Yet researchers cannot interrogate specific sequences across these vast datasets - standard pipelines discard sequence information, retaining only gene or isoform counts, and accessing raw data requires downloading and processing thousands of large files. The present inventors employed the present method (‘Malva’), in the present example for direct sequence searches of splice junctions, viral genomes, novel transcripts across millions of cells in seconds. They demonstrated this through the public Malva Index containing ~60 million cells from thousands of experiments. Beyond search, the present method enables reference-free discovery: researchers can identify cell types based on their complete sequence repertoire and assemble novel transcripts that distinguish cell populations. Advantageously, the present method transforms single-cell atlases from static gene counts into dynamic sequence databases. Methods

[0375] Data sources and

[0376] The presently tested embodiment of the method and index generated according to the invention (‘Malva’) incorporated public single-cell and spatial transcriptomic datasets from major repositories (GEO, SRA, ArrayExpress, CNGB, Human Cell Atlas). Currently spanning ~60 million cells from thousands of experiments, the index includes diverse technologies (10x Chromium, Visium, Open-ST, Stereo-seq, Smart-seq) across multiple species and conditions. The present method ran on-premise infrastructure with query limits to ensure consistent performance for all users. As the index grows, we prioritize datasets that expand biological diversity - new tissues, species, disease states - over redundant samples.

[0377] The present method maintained local databases to translate user queries into searchable sequences. In the tested embodiments, when users query gene symbols or Ensembl IDs, these are mapped to genomic coordinates on standard reference genomes (hg38 for human, mm10 for mouse) with default repeat masking. For expression queries, the inventors retrieved full transcript sequences from Ensembl. For coverage queries that visualize read distribution along genes, they used the genomic coordinates to define the region of interest and return k-mer matches at each position. Several specialized databases support specific biological queries. For example, the inventors integrated circRNA sequences from circBase, formatting back-splice junctions as 56bp probes optimized for our k-mer search parameters (w=48, T=1 , see “Malva Index Architecture”). For mutation analysis, they incorporated ClinVar variants, generating mutant and wildtype sequence pairs to identify cells carrying specific mutations. Viral genomes from NCBI RefSeq enabled contamination screening and infection studies. All reference sequences were preprocessed and stored locally to enable rapid query resolution without external database calls.

[0378] To support natural language queries, they integrated Llama 3.1 via Ollama as a translation layer between user prompts and gene sets. The language model was augmented with a Retrieval- Augmented Generation (RAG) database built from MSigDB, which includes curated gene sets from Reactome, GO, and other pathway databases. When users submit queries like "inflammatory response genes in macrophages", the system retrieves relevant gene sets from MSigDB and uses the language model to refine the selection based on the specific query context. This approach provides a convenient interface for exploring biological concepts without requiring users to manually compile gene lists. The resulting gene sets are converted to sequences and searched against the Malva index like any other query.

[0379] Pre-computed gene expression matrices

[0380] To accelerate common gene-level queries, the inventors pre-computed and stored gene expression matrices for frequently accessed gene sets. For each species in the ‘Malva Index’ (index generated according to the present method), they compiled comprehensive transcript catalogs from Ensembl, including all protein-coding transcripts, non-coding RNAs, and 3' UTR sequences. They then queried the ‘Malva Index’ with these reference sequences to generate pseudocount matrices for every indexed dataset.

[0381] The pre-computation process aggregated k-mer counts across all isoforms of each gene, producing a cell-by-gene matrix analogous to standard single-cell expression matrices. These matrices were stored in a distributed database indexed by gene symbol, Ensembl ID, and dataset identifier. When users query common genes through the API, the system retrieves pre-computed values directly from the database rather than searching the k-mer index, reducing query latency from seconds to milliseconds. The database was updated incrementally as new datasets are added to the index, ensuring that gene-level queries always reflect the complete corpus of indexed data. For custom non-standard gene symbols, the system falls back to real-time k-mer queries, maintaining flexibility while optimizing performance for common use cases.

[0382] Metadata standardization

[0383] Critical to the ability of the tested embodiments of the present method to aggregate results across studies was the harmonization of heterogeneous metadata from diverse sources (HCA Data Portal, GEO, SRA, ArrayExpress, CNGB). Raw metadata arrives in various formats with inconsistent naming conventions — the same tissue might be labeled "brain", "cerebral cortex", or "frontal lobe" across different studies. Similarly, developmental stages, disease states, and technology descriptors vary widely between repositories.

[0384] The inventors implemented an automated standardization pipeline that maps all metadata to Human Cell Atlas ontology standards. The pipeline uses state-of-the-art language models to interpret free-text descriptions and map them to controlled vocabularies for anatomical structures, cell types, developmental stages, and diseases. For example, diverse technology descriptions like "10x 3' v2", "10X Genomics Chromium Single Cell 3' v2", and "single cell 3prime v2" are all mapped to a standardized "sc_10x_chromium_3p_v2" identifier. This harmonization enables meaningful cross-study queries - users searching for "lung epithelial cells" retrieve results whether the original metadata said "pulmonary epithelium", "lung tissue", or "respiratory epithelial cells".

[0385] Cell type classification

[0386] To enable aggregation of query results at the cell type level, the inventors implemented an automated annotation pipeline for all indexed cells. Author-provided cell type labels are often heterogeneous across studies or entirely absent in many submissions. To address this, they performed de novo annotation using a two-stage approach.

[0387] First, they apply a marker-based scoring system and compiled curated marker gene sets for major cell types following Cell Ontology classifications, with each cell type defined by 10-20 literature-validated markers. For each cell, they calculate enrichment scores for all cell type signatures using the pre-computed gene expression matrices. The inventors then performed clustering using standard methods (PCA, k-nearest neighbors, Leiden algorithm) and identified differentially expressed genes for each cluster. Clusters are assigned cell types based on the differential expression of marker genes, considering both the number of expressed markers and their specificity to the cluster.

[0388] Second, the inventors refined these annotations using a language model that integrates biological context, as previously described. The model receives the top differentially expressed genes for each cluster along with sample metadata (tissue origin, developmental stage, disease state) and produces cell type assignments. To ensure biological accuracy, the model evaluates whether each annotation is plausible given the tissue context, flagging unlikely assignments for correction. This pipeline produces cell type annotations that follow controlled vocabularies. Malva Index architecture

[0389] The index according to the present invention, single-cell sequencing data is preferably organized as a searchable collection of k-mers (fixed-length nucleotide sequences) linked to their cellular origins. Each sequencing read was decomposed into non-overlapping k-mers of length k (default k=24), with a final overlapping k-mer to ensure complete sequence coverage. Specifically, for a sequence of length I, the inventors extracted k-mers at positions 0, k, 2k, ..., plus one k-mer starting at position l-k, if needed. This approach balanced computational efficiency with the ability to detect any k-length sequence within the data.

[0390] Each k-mer was encoded as a 64-bit integer using 2-bit nucleotide encoding (A=00, C=01 , G=10, T=11), enabling efficient storage and comparison. The cellular origin was tracked through a composite identifier that encodes both the cell barcode and dataset source. They allocated the lower bits for dataset identification and upper bits for cell-specific barcodes - for example, using 8 bits for dataset ID and 24 bits for cell barcode allows indexing up to 256 datasets with ~16 million cells each. This hierarchical encoding enabled queries to return both the specific cell and its source dataset. The index used a two-tiered structure optimized for sequence search conceptually similar to inverted indices for web search. The sequence array (first array) stored all unique k-mers found across datasets, sorted numerically. The location array (also termed ‘data array’, or third array) contained the composite cell-dataset identifiers where each k-mer appears. These were linked via the pointer array (second array) mapping each k-mer to its locations, storing start and end positions within the location array. This design enabled rapid lookup: finding all cells containing a specific k-mer requires just one binary search in the sequence array followed by a single read from the location array.

[0391] Individual datasets were processed independently before being merged into the unified index. FASTQ files were parsed in parallel using optimized routines (including SIMD instructions for pattern matching), extracting cell barcodes and sequences at rates exceeding 2 million reads per second. For each dataset, reads were streamed through memory-efficient buffers, extracting k- mers and their associated barcodes. The k-mers were lexicographically sorted and deduplicated, keeping track of all cells where each appears. These dataset-specific indices were then merged hierarchically: smaller indices combine into progressively larger ones using k-way merge algorithms that require minimal memory. This approach enabled processing of datasets far larger than available RAM while maintaining near-linear scaling with data size.

[0392] To handle indices exceeding available memory, the inventors implemented a hierarchical page- based structure. During construction, sequences were processed in chunks requiring only 0(C) memory per chunk, then merged using k-way merge with O(kB) memory where B is the buffer size. For the final index, when the number of unique k-mers N exceeds RAM capacity, they organized the sequence array into pages of P k-mers (default 1024). These pages form a tree structure with L = [log_P N] levels, where each non-leaf node stores the maximum k-mer value of its child pages. This design bounds memory usage to O(P log_P N) integers - for example, indexing 10A12 unique k-mers requires only ~40 MB of memory with P=1024.

[0393] The final merged index was stored in a compressed binary format optimized for disk-based retrieval. Arrays were partitioned into fixed blocks (512 entries) that were independently compressed using delta-encoding followed by blosc compression, achieving 10-20x reduction compared to raw FASTQ files. The hierarchical organization enabled efficient searches: locating a k-mer requires 0(log_P N) page accesses, typically 3-4 disk reads for billion-scale indices. A small LRU cache maintained frequently accessed pages in memory.

[0394] Sequence querying and pseudoquantification

[0395] The present method identified query sequences by decomposing them into k-mers and searching the index for matches. For queries longer than k, the inventors employed a sliding window approach to ensure all possible k-mer alignments are considered. Given a query sequence q and window size w (default 48), they extracted all windows of length w and evaluate the fraction of matching k-mers in each window. The match score for a window W and cell ID was defined as the ratio of matching k-mers to total possible k-mers: score(W, ID) = |M(W, I D)| / (w-k+1 ), where M(W, ID) represented k-mers in W that appear in cell ID. To reduce false positives from repetitive sequences, k-mers appearing in many genes across the reference genome can be excluded from scoring. Additionally, users can mask low-complexity regions using standard tools like dustmasker. Windows with scores exceeding threshold T (default 0.65) are considered significant matches.

[0396] This approach tolerates sequencing errors and mutations — a single nucleotide change affects only one k-mer, allowing the remaining k-mers to drive detection. The complete query process aggregated matches across all windows, returning cells ranked by the number of significant windows. For typical use cases (w=48, k=24, T=0.65), the method reliably detected sequences with up to ~15% divergence while maintaining low false positive rates. Query optimization includes batching k-mer lookups to minimize disk access and exploiting the sorted index structure for cache efficiency.

[0397] For coverage analysis, the inventors preserved spatial resolution along transcripts by reporting k- mer matches for each sliding window independently. They applied a sliding average (default window: 100bp) to smooth the coverage profile, producing visualizations comparable to those from alignment-based tools. Importantly, they masked highly repetitive k-mers during coverage calculation to prevent artifactual peaks - without masking, certain positions can show 100-1000x spikes due to sequences shared across many genes. This positional information enabled detection of alternative polyadenylation sites, splice junction usage, among others.

[0398] Benchmarking

[0399] The inventors evaluated in this example the performance of an embodiment of the present method (‘Malva’) against established tools for both indexing speed and query accuracy. For indexing benchmarks, they compared Malva with kallisto-bustools, MetaGraph (indexing only), BLAST+ (indexing only) and a STAR-based workflow, on datasets ranging from single samples to the 7TB CIRSTA mouse atlas. For sequence search capabilities, they compared against MetaGraph and BLAST+. All benchmarks used 96 threads on a dual-socket server with Intel Xeon Platinum processors (3.5 GHz, 256 cores total) and 4TB RAM, with resource monitoring via snakemake. They selected representative datasets spanning different scales and technologies: Open-ST tumor sections (subcellular resolution), Stereo-seq CIRSTA atlas (extreme scale), 10x developmental brain (single-cell), and Visium tissues (spatial gene expression). These datasets tested Malva's ability to handle varying spot densities (0.6 pm in Open-ST to 55 pm in Visium), sequencing depths, and data volumes. For each technology, the present method (‘Malva’) automatically detected read structures and processed FASTQ files without manual configuration.

[0400] From a user's perspective, the present method can transform how researchers explore transcriptomes. Instead of downloading datasets and running pipelines, users can instantly query along four dimensions: what feature they seek (a raw sequence, gene, or pathway), where to search (filters like tissue, disease state, or protocol), how to measure it (base-level coverage or simple abundance), and at what scale (individual cells, whole libraries, or aggregated groups). These capabilities are demonstrated through three complementary search modes.

[0401] First, researchers can provide an exact sequence - e.g., from a 24-mer splice probe to a full viral genome - and immediately retrieve which cells contain it. They can specify a genomic interval and view live coverage tracks colored by lineage or tissue. Or they can type natural language queries like "cells expressing high levels of mitochondrial genes in tumor samples", which an integrated language model translates into structured searches.

[0402] Validation of biological signal preservation

[0403] The inventors then evaluated whether Malva's k-mer-based pseudocounts produce biologically meaningful results comparable to traditional pipelines. Across multiple datasets (Visium, Open- ST, Stereo-seq, 10x Chromium), we compared Malva pseudocounts to standard quantification methods (spacemake with STAR alignment, kallisto, or deposited count matrices). For each dataset, they computed per-cell gene counts and calculated Pearson correlations between methods, with each point representing one gene's expression across cells, or one cell’s total counts across genes.

[0404] To assess whether pseudocounts support downstream analyses, they performed clustering on all benchmarked datasets using species-specific Ensemble transcript sequences after masking low- complexity and repetitive k-mers. The inventors applied standard scanpy processing (normalization, log transformation, highly variable gene selection, PCA, nearest neighbor graph construction, and Leiden clustering) and compared cluster assignments between Malva and traditional pipelines using Normalized Mutual Information.

[0405] For the Open-ST tumor dataset, they additionally validated differential expression capabilities by comparing tumor versus stromal cells using the original study's annotations. They applied Wilcoxon rank-sum tests to both Malva and spacemake counts, assessing concordance through Iog2 fold changes and adjusted p-values.

[0406] Reference-free analysis using k-mer composition

[0407] Traditional single-cell analysis relies on mapping reads to reference genomes and counting annotated genes. However, this approach cannot capture the full diversity of RNA sequences, particularly in non-model organisms or when studying unannotated transcripts. The inventors developed the present a reference-free analysis pipeline that clusters cells based on their complete k-mer repertoire, revealing relationships invisible to gene-based methods.

[0408] Bucketing approach for scalable k-mer analysis

[0409] Direct analysis of k-mer composition faces a fundamental challenge: even modest datasets contain billions of unique k-mers, making cell-by-k-mer matrices computationally intractable. To address this, the inventors implemented a bucketing strategy that groups similar k-mers while preserving sequence relationships. The bucketing algorithm uses a MinHash-based approach to assign k-mers to a fixed number of buckets based on their w-mer composition, where w < k. For each k-mer, we extract all possible contiguous w-mers (typically w = 16 for k = 24) and compute two independent hash functions (MurmurHash3 and xxHash64) for each w-mer. The inventors retained the minimum hash value from each function across all w-mers in the k-mer, then combine these minima into a composite signature. The final bucket assignment is determined by taking the modulo of this signature with the predefined number of buckets (default 100,000). This approach promotes that k-mers with similar w-mer composition - and thus likely similar biological origin - are assigned to the same or nearby buckets. The resulting cell-by-bucket matrix reduces dimensionality from billions of unique k-mers to, e.g., 100,000 buckets while preserving sequence relationships. Importantly, bucket assignments are deterministic and consistent across datasets, enabling cross-dataset and cross-species comparisons. We process k-mers in chunks and accumulate bucket counts per cell using sparse matrix operations, allowing analysis of datasets with hundreds of millions of cells.

[0410] Cell clustering and marker discovery

[0411] Using the afore cell-by-bucket matrix, the inventors followed a standard single-cell analysis workflow implemented in scanpy. They normalized bucket counts to account for sequencing depth differences, then select highly variable buckets based on their dispersion relative to mean expression. After dimensionality reduction with PCA (typically retaining 30 components), a nearest neighbor graph was constructed and the Leiden algorithm was applied to identify cell clusters with similar k-mer profiles. Clustering similarity between k-mer profiles and gene-count inputs were quantified using the Normalized Mutual Information metric: where Yrand Y2are two clusterings to be compared, H is the entropy function and I ^; T2) = is the mutual information function.

[0412] To understand the biological basis of these clusters, they identified differentially abundant buckets between groups using statistical tests (t-test with Benjamini-Hochberg correction, adjusted p-value < 0.05 and Iog2 fold change > 1). The inventors then retrieved the k-mers assigned to these differential buckets and performed targeted de novo assembly using SPAdes in RNA mode. This approach reconstructs full-length transcripts from cluster-specific k-mers, which can be annotated using BLAT searches against reference databases. The combination of bucketing and targeted assembly enabled to discover cell-type-specific isoforms, novel transcripts, and even contaminating sequences that would be missed by reference-based approaches.

[0413] To determine genomic origin and interpretability, contigs were annotated in two passes. First, contigs were aligned to the appropriate host references using minimap2 (v2.x) with spliced presets for genome mapping and ungapped mapping to transcriptomes, against GRCh38. For transcript-level alignment we mapped to Ensembl cDNA / IncRNA catalogs, for alignments with high mapping quality. Based on these, contigs were classified as (i) genic if overlapping annotated transcripts, (ii) intergenic if mapping uniquely outside annotated features, or (iii) unmapped if no confident hit was found. For loci with multiple contigs, per-cell coverage tracks were generated and compared across the locus to distinguish isoform-specific versus locus-wide signal. Second, contigs without confident host assignment (or with rRNA signatures) were analyzed using Kraken2 (v2.1.5) with a standard RefSeq PlusPF database (bacteria, archaea, viral, human, UniVec / adapter sequences). Subsequently, the expression was recomputed using the annotated and unannotated assembled contigs, and compared cell-wise totals and featurewise abundances. This reference-free pipeline proved particularly valuable for analyzing organisms with poor genomic resources (such as Placozoa) and for discovering tumor-specific isoforms in spatial transcriptomics data, as demonstrated in the present results.

[0414] Detection of circular RNAs

[0415] To interrogate the expression of circular RNA CDRIas across the two-tiered index, a 36-nucleotide probe centered on its back-splice junction was designed. Therefore, the probe sequence spanned 18 nucleotides upstream of the splice donor site and 18 nucleotides downstream of the splice acceptor site, capturing the unique junction sequence that distinguishes circular from linear transcripts. We used w=24, T=0.65 for querying the two-tiered index.

[0416] Results

[0417] The present method enables real-time sequence search with single-cell resolution

[0418] When developing the present method for indexing potentially billions of cells while maintaining fast query times posed several challenges, the inventors first surveyed existing k-mer indices, but these were designed for hundreds of bulk samples, not millions of single cells. Their computational complexity scaled poorly with sample count, causing prohibitive memory usage or query times when extended to millions of cellular labels. Thus, they needed to develop a method that enables (i) indexing fast enough to keep pace with new data releases, with (ii) query performance that scales to billions of cellular (or spatial) labels and (iii) in a way that query results are comparable to gene-count, state-of-the-art pipelines. The inventors indeed validated the present method’s performance in one experiment through small-scale benchmarking before deploying it on large-scale public repositories.

[0419] First, they validated indexing performance on a Stereo-seq mouse liver atlas containing ~61 billion reads across twenty sections, each with over 3 billion reads and hundreds of millions of spatial barcodes. The present method’s (termed Malva) completed indexing in 24 CPU-hours with peak memory under 8 GB, compared to 190 hours for kallisto and 724 hours for STAR. This efficiency comes from the method’s streaming architecture, which processes reads in batches and flushes intermediate indices to disk every 100 million reads. Importantly, runtime per sample scales with unique barcodes rather than sequencing depth, and indexing speed approaches the theoretical limit of gzip decompression - one of the unavoidable bottlenecks when processing raw sequencing data.

[0420] Second, the inventors validated that query performance enabled real-time analysis. When searching for a sequence, Malva tiles it into k-mers and counts exact matches per cell. These raw match totals, termed pseudocounts, emerge directly from k-mer matches without alignment. Search latency depends only on disk access: 70 ms for a single k-mer, 0.9 s for a 1 kb transcript, and 45 minutes for the entire Ensembl catalog on a single CPU core. Other methods performed worse - MetaGraph or BLAST would require orders of magnitude more time, among other reasons, because of requiring to load the entire index into memory before each query. Meanwhile, traditional pipelines would need complete read realignment for every new search query.

[0421] Third, they validated quantification accuracy. They tested Malva on an Open-ST human tumor section with 1 .8 billion reads and ~30,000 cells (Schott et al. 2024). After masking low-complexity regions and filtering repetitive k-mers, pseudocounts for 19,634 transcripts correlated strongly with conventional UMI counts (per-cell Pearson r = 0.92, per-gene r = 0.90). Differential expression between tumor and immune clusters recovered 91% of genes identified by standard pipelines. The remaining discrepancies occurred primarily at low-complexity loci where k-mer methods inherently face ambiguity. Moreover, across multiple datasets - from Visium brain sections to Slide-seq experiments - the inventors observed consistent patterns. Spatial datasets showed stronger diagonal correlation with traditional counts, while single-cell datasets often exhibited systematic offsets due to higher sequencing saturation. These offsets can be corrected through dataset-specific normalization based on total unique k-mers, similar to approaches used in bulk RNA-seq reanalysis (Bessiere et al., 2024).

[0422] Pseudocount matrices from the present method proved suitable for standard analyses. Louvain clustering of tumor sections achieved an Adjusted Rand Index of 0.89 compared to traditional pipelines. In mouse brain atlases, ‘Malva’-derived pseudotime trajectories correlated with kallisto results at r = 0.93. These validations demonstrate that Malva supports both discrete cell-type classification and continuous developmental analysis without reference alignment.

[0423] Screening for viral sequences across all samples

[0424] The inventors systematically searched for viral genomes across all indexed samples. Beyond expected hits - retroviruses and PhiX spike-ins from short-read sequencing protocols - they discovered surprising patterns. HIV sequences appeared in ostensibly healthy blood donor samples, while SARS-CoV-2 RNA concentrated in specific lung epithelial subsets from COVID-19 patients (). Negative controls (unrelated viruses) showed no enrichment, confirming specificity. This viral survey, completed in minutes across millions of cells, demonstrates how Malva can retrospectively screen archived data for emerging pathogens or contamination.

[0425] Mapping isoform usage without predefined annotations

[0426] Standard atlases often collapse all isoforms into single gene counts, obscuring critical regulatory switches. The present method enables computing coverage profiles directly from k-mers to reveal cell-type-specific isoform usage without predefined transcript models. Analyzing the Tabula Muris atlas, 1 ,138 genes were identified with cell-type-specific splicing patterns (Fig. 7d). Ptprc (CD45) showed differences: B cells predominantly included exon 3 (CD45RA isoform) while other immune cells excluded it (CD45RO), recapitulating the canonical immune activation switch. For instance, when searched for CDRIas, a conserved circular RNA highly expressed in the brain. Malva confirmed the expected cell-type specificity of CDRIas at single-cell resolution (Fig. 7e). This pattern emerged without splice-aware alignment or isoform annotation - just from k-mer coverage along the gene body.

[0427] Tracking UTR variants in spatial context

[0428] Most single-cell and spatial transcriptomics datasets are highly 3' biased, which limits the ability to discover or score isoform usage, but holds great potential for analyzing 3' UTR choice. This is a fundamental process regulating transcript and protein localization and stability, yet atlases report only total gene expression. For example, previous bulk and single-cell studies identified Add2 as exhibiting cell-type and tissue-specific 3’UTR lengthening across organogenesis, but its spatial distribution remained unexplored - no atlas-scale database tracks UTR variants spatially, and investigating this would typically require downloading terabytes of raw spatial data for custom analysis. Using the present method, Add2 3 -UTR variants were queried across mouse embryonic development in seconds. The results revealed that distal UTR usage concentrates in developing neural tissue while proximal forms dominate in the developing liver, with the present method’s (Malva’s) coverage profiles closely matching those generated by STAR alignment (Fig. 6d), suggesting that the apparent 3’UTR lengthening of this gene across embryonic development is due to a shift in cell population proportions. The concordance between Malva and STAR coverage profiles validates that k-mer-based quantification accurately captures transcript structure without alignment.

[0429] Querying circular RNAs across cell atlases

[0430] Circular RNAs, produced by back-splicing, regulate gene expression through miRNA sponging and other mechanisms but remain invisible in gene expression matrices provided by most current atlases. Researchers must currently rely on specialized circular RNA databases that cover limited samples. Using present method, CDR1 as, a conserved circular RNA that acts as a miR-7 sponge and is highly expressed in brain, was queried across all ~60 million cells of the Index. The present method returned results in 0.6 seconds, and confirmed the expected cell-type specificity of CDRIas (Fig. 7e): 95% of positive cells were excitatory neurons, with lower detection in inhibitory interneurons, and very low detection across other tissues (Fig. 7e). This specificity, previously known from bulk studies, emerges clearly from droplet-based data not designed to capture circRNAs - demonstrating how the present method can interrogate biology hidden in existing datasets.

[0431] These queries - impossible with current atlas platforms - demonstrate how Malva bridges critical gaps in transcriptome analysis. Rather than being limited to predefined gene sets, researchers can now interrogate any RNA sequence across millions of cells. This transforms how we explore published data: from static gene counts to dynamic sequence discovery. k-mer signatures faithfully recover human cell lineages

[0432] Traditional cell classification relies on counting annotated genes, but cells can also be defined by their complete RNA sequence repertoire - including unannotated transcripts, novel isoforms, and non-coding elements. In single-cell analysis, cells are typically embedded in low-dimensional manifolds based on gene expression similarity. However, this same principle can be applied to sequence space: cells with similar biological states should share k-mer patterns even without gene annotations. While the tow-tiered index generated according to the present method is suitable for sequence search (finding cells containing specific k-mers), the same data structure can be transposed to enable analysis - creating a matrix where cells are compared based on their complete k-mer repertoire. The present method implements this sequence-based view by grouping k-mers into similarity buckets, where sequences differing by just one or two nucleotides cluster together. This creates a cell-by-bucket matrix that captures sequence relationships, allowing computation of pairwise cell similarities directly from RNA sequences (Fig. 7a).

[0433] This approach was applied to a large subset of the Human Cell Atlas stored in the two-tiered index. Without using any gene annotations, cells organized into recognizable lineages with an average normalized mutual information (NMI) of 0.52 when compared to gene-based clustering. Performance was lowest in tissues dominated by single-nucleus data (adipose, brain, eye), likely due to reduced sequence diversity from (intronic) nuclear RNA (Fig. 7b). Importantly, clustering concordance remained stable regardless of cell numbers, demonstrating that bucketed k-mer spectra reliably capture cell relationships (Fig. 7c). While the reduced dimensionality makes it computationally feasible to apply standard single-cell algorithms, the non-gene nature of these features may require adapted methods for integration and batch correction.

[0434] Tumor-specific sequences define spatial organization

[0435] Previous studies showed that k-mer signatures can predict cancer prognosis and reveal tumorspecific biology invisible to gene-based approaches. The inventors wondered whether sequencebased clustering could provide similar insights at spatial resolution. Analyzing the Open-ST head- and-neck tumor through k-mer buckets partitioned the tissue into tumor core, stromal rim, immune infiltrate, and keratin pearls - purely from sequence similarity (Fig. 7e). While annotated genes revealed the same compartments (ARI = 0.87), the sequence approach enabled discovery of compartment-specific isoforms. By identifying bucket-specific k-mers and assembling them into contigs, they discovered sequences unique to each tissue region. The tumor core expressed a novel KRT6A isoform with an unannotated terminal exon and a SPRR1 A variant retaining an internal exon (Fig. 7f). The invasive front specifically expressed a truncated KLK10 3'-UTR, potentially altering mRNA stability at the tumor margin. Across all regions, we assembled 4,255 contigs >1 OOnt absent from current annotations. While bulk k-mer studies leverage multiple samples to correlate sequences with clinical outcomes, this proof-of-concept demonstrates that spatial k-mer analysis can reveal region-specific isoforms within individual tumors.

[0436] Screening for arbitrary seguences across the two-tiered index

[0437] Upon querying a nucleotide sequence, the present method returns which cells express it, regardless of whether it’s part of a predefined reference (e.g., human genome) or not. To demonstrate this, a small panel of viral genomes, common lab contaminants, endogenous retroviruses and experimental vectors were compiled, and queried against the two-tiered index to identify cells / samples containing these. First, the sensitivity was assessed via positive controls, for instance, PhiX sequencing spike-in, routinely added during library preparation, was detected across most datasets. Similarly, signal for Mycoplasma (well-documented contaminant in cell culture) was detected in ~0.5-2% of samples, mostly at the 23S rRNA locus (Fig. 7c). This is consistent with prior reports of Mycoplasma contamination in RNA-seq datasets. Similarly, human endogenous retroviral signal (specifically, HERV-K) was detected, as expected, in embryonic samples where these elements are known to be active. Interestingly, they observed tissuespecific expression in adult samples, e.g., healthy kidney and heart, cell lines, and diseases including cancer and neurological conditions, supporting reports of HERV-K reactivation in adult tissues and pathological states. The inventors also tested the retrieval of cells carrying specific synthetic constructs. In particular, a lentiviral vector was queried used for clonal tracing in a rhabdomyosarcoma study. Top hits returned by the method as output, across the whole vector body, corresponded to cells from said study.

[0438] Then, specificity via negative controls was assessed: signals for pathogenic viruses that were not represented in the two-tiered index used were not detected (e.g., SARS-CoV-2, Hepatitis C, Measles, West Nile, and Zika). The inventors only observed some hits of SARS-CoV-2-positive cells from lung samples labeled as ‘healthy’ in the respective dataset. Interestingly, these correspond to a cell line experiment (derived from healthy donors) where cells were infected in vitro, highlighting how the present method is even able to overcome annotation biases.

[0439] Screening for germline variation across single-cell atlases

[0440] The inventors next used the present method and a two-tiered index constructed accordingly to locate cells with specific mutation signatures and designed sequence probes targeting 296,985 common germline SNPs from the 1000 Genomes Project, focusing on biallelic variants in proteincoding regions with minor allele frequency >1 %; this ensured they focused on variants that reached equilibrium rather than rare, deleterious mutations.

[0441] Population-based allele frequencies correlated strongly to the frequencies determined by the present method, across healthy and disease samples. Thus, single-cell RNA data in Index recapitulated the underlying population genetic structure despite heterogeneous sequencing technologies and depths. They additionally observed that common coding variants in their panel behaved neutrally regardless of disease status, as there was no deviation between observed and expected derived allele frequencies. Lastly, they validated that allele frequencies at the single-cell level remained around 50% irrespective of population frequency, supporting how most common coding SNPs segregate at intermediate frequencies in outbred populations, and individuals are frequently heterozygous. Altogether, variant querying recapitulated population genetic structure and aligned with expectations that variants reaching >1% frequency in coding regions represent evolutionarily stable, predominantly neutral alleles, also at a single-cell level.

[0442] Quantifying isoform usage without mapping

[0443] Gene-centric atlases collapse many isoforms into single counts, obscuring cell-type specific regulatory switches. The inventors leveraged Tabula Muris (profiled at full-length-coverage), to systematically compare expression across loci as a proxy for isoform usage and identified 1 ,138 loci with cell-type-specific differential coverage patterns: for example, Ptprc (CD45) in B cells predominantly included exon A (CD45RA isoform) while other immune cells mostly excluded it (CD45RO), recapitulating this well-known isoform variation across lymphocyte subsets (Fig. 6c). This pattern emerging solely from coverage-like analysis supports the present method’s ability to interrogate exon / isoform usage.

[0444] Moreover, the inventors focused on isoforms at the 3’UTR: these are fundamental for regulation of transcript and protein localization and stability and, more importantly, common single cell and spatial sequencing data is often 3' biased, therefore suitable for exploring this landscape. As an example, the inventors interrogated the 3’UTR of Add2, known to have tissue- and stage-specific patterns in the mouse embryo, against a high-resolution spatial atlas of mouse organogenesis. Malva validated how distal 3’UTR has the strongest signal in the developing neural tissue while the proximal form accumulates at the developing liver. Malva’s coverage-like profiles closely matched those generated by STAR alignment, confirming that the bulk-level lengthening of Add2 across developmental stages is likely a shift in cell type proportions (neuronal vs. liver progenitors). circular RNAs across cell atlases Circular RNAs, produced by back-splicing, remain invisible in gene expression matrices provided by most current atlases. Researchers currently rely on specialized circular RNA databases that cover limited samples, often not covering single-cell nor spatial data. The present method enables interactive analysis of circRNA expression by searching for probes specific to their backsplice junctions. For instance, they searched for CDRIas, a conserved circular RNA highly expressed in the brain. The analysis using the present method confirmed the expected cell-type specificity of CDRIas at single-cell resolution: 95% of positive cells were excitatory neurons, with lower detection in inhibitory interneurons, and very low detection across other tissues. This specificity, previously known from bulk studies, emerges from droplet-based data not specifically designed to capture circRNAs. Altogether, these analyses demonstrate how, rather than limited to predefined gene sets, the present method allows researchers to interrogate millions of cells across diverse functional layers of RNA biology, e.g., expression, mutations and isoforms, via a single sequence search interface.

[0445] Predicting cell-cell similarity from sequence composition alone

[0446] Interrogating biological sequence data using the present method allows retrieval of cells containing specific biological signals. However, this does not enable cell-cell comparison, which is critical for single-cell clustering, trajectory inference, or marker discovery. These operations are carried in a reduced (latent) space computed from cell-by-feature matrices, commonly, cell-by- genes.

[0447] The present method enables generating a two-tiered index encoding billions of sequences into a large sequence-by-ce\\ matrix. By transposing it, the inventors could perform sequence-based, cell-cell comparison. In practice, these ce\\-by-sequence matrices are too large to handle (millions of cells times billions of features). Thus, sequence k-mers were collapse into buckets of sequence similarity. This ensures that cell-by-feature matrices are small enough such that dimensionality reduction is tractable.

[0448] Latent space recovery across modalities and species

[0449] The inventors then tested this idea on 30+ million cells from 5,000+ healthy samples and found that the PCs derived from cell-by-bucket matrices were comparable to those derived from cell-by- gene matrices (Fig. 10b). This was quantified via two metrics, (i) Neighborhood Preservation and (ii) Clustering similarity. Across samples and tissue types, cells neighboring in cell-by-gene space also did in cell-by-bucket space, and were assigned to the same clusters (Fig. 10c, Fig. 4a). This approach becomes particularly useful in settings where high-quality reference genomes are lacking, sparse or incomplete: for example, this generalized to non-model organisms such as the phylum Placozoa (Fig. 10d), showcasing the utility of sequence-based representations for large- scale biodiversity studies. It could be noticed, however, that samples coming from single-nuclei technologies yielded consistently lower values across both metrics. This was attributed to the overrepresentation of low-complexity intronic (nuclei-enriched) RNA (Fig. 4b), which is equivalent to what low sequence diversity (e.g., using 4-mers and below) yields.

[0450] Identification of sequence-specific clusters not explainable by gene-based analysis

[0451] Cell clusters are often characterized by their marker genes. However, in bucket-based clustering, features (groups of similar sequences) are not directly explainable. The inventors assessed cell- by-bucket representations on a spatial transcriptomics sample (high-resolution profiling of a human tumor), where tissue architecture provides an orthogonal readout for interpretability; that is, the degree (or lack of) of spatial organisation can inform biological relevance. Indeed, genebased and bucket-based representations captured similar major tissue compartments (tumor and stromal), plus two k-mer specific (KS) clusters (KS1 and KS2) with distinct spatial organization, one within and other outside the tissue boundaries (Fig. 10e). The inventors hypothesized that there might be sequence signatures specific to these two clusters, whose signal is absent from the original reference-based gene representation.

[0452] Assembly of cluster-specific sequences identifies isoforms and microbial features

[0453] Across all bucket-based clusters, the inventors extracted the specific collections of k-mers and used them as the input for de novo assembly. This yielded thousands of contiguous unique sequences (contigs), that were termed marker sequences. Upon mapping these to the human genome, ~70% were annotated to known genes (mostly within 1 kb of the TES), ~29% to intergenic regions, and ~1% remained unmapped. Also, by narrowing the k-mer search space to those informative of cluster identity, downstream assembly (i.e., with SPAdes) operated more efficiently (~1 hour vs. ~27 hours wall time).

[0454] The inventors then examined the cluster-specific enrichment of all assembled contigs. All clusters (except KS1 and KS2) were enriched in contigs mapping to human genes. Some contigs mapped to the same gene bodies but across different locations, and a subset of these were significantly detected as markers (Fig. 10f). One example is KLK10_676, a contig mapping to KLK10, a gene coding for a tumor-enriched serine protease. This was the only KLK10 contig enriched at the Keratin Pearl (Fig. 10g) structures. Coverage-like profiles showed that this feature may correspond to an unannotated 3'UTR end; in fact, a polyA tract at its 3’ further supports it as a TES. On the other hand, KS1 and KS2 were indeed enriched in intergenic or unknown contigs (Fig. 10f). They first identified that KS2 was enriched in sequences mapping to human rRNA (Fig. 10g), often removed during reference preprocessing - and thus missing from the original cell-by- gene matrix. Meanwhile, the KS1 cluster was enriched in contigs compatible with bacterial 23S rRNA from several species known to occur in the oral mucosa (Fig. 10g). This confirmed that this cluster separated from a genuine microbial signal that is only captured at the sequence-based analysis.

[0455] In summary, the present method achieved faster indexing than existing methods while maintaining a low memory footprint, fast query times for individual sequences, and high concordance with traditional pipelines for gene quantification and downstream analyses.

[0456] REFERENCES

[0457] Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19: 455-477.

[0458] Baysoy A, Bai Z, Satija R, Fan R. 2023. The technological landscape and applications of single-cell multi-omics. Nat Rev Mol Cell Biol 24: 695-713.

[0459] Boileau E, Li X, Naarmann-de Vries IS, Becker C, Casper R, Altmuller J, Leuschner F, Dieterich C. 2022. Full-Length Spatial Transcriptomics Reveals the Unexplored Isoform Diversity of the Myocardium Post-ML Front Genet 13: 912572.

[0460] Bradley P, den Bakker HC, Rocha EPC, McVean G, Iqbal Z. 2019. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol 37: 152-159.

[0461] Castanza AS, Recla JM, Eby D, Thorvaldsdottir H, Bult CJ, Mesirov JP. 2023. Extending support for mouse data in the Molecular Signatures Database (MSigDB). Nat Methods 20: 1619-1620.

[0462] Chen A, Liao S, Cheng M, Ma K, Wu L, Lai Y, Qiu X, Yang J, Xu J, Hao S, et al. 2022. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185: 1777- 1792.e21.

[0463] Cho C-S, Xi J, Si Y, Park S-R, Hsu J-E, Kim M, Jun G, Kang HM, Lee JH. 2021 . Microscopic examination of spatial transcriptome using Seq-Scope. Cell 184: 3559-3572. e22.

[0464] Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, Mathur A, Schelten A, Yang A, Fan A, et al. 2024. The Llama 3 Herd of Models. arXiv. doi: 10.48550 / arxiv.2407.21783.

[0465] Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, Wang H. 2024. Retrieval-augmented generation for large language models: A survey. arXiv. doi: 10.48550 / arxiv.2312.10997.

[0466] Harrison PW, Amode MR, Austine-Orimoloye O, Azov AG, Barba M, Barnes I, Becker A, Bennett R, Berry A, Bhai J, et al. 2024. Ensembl 2024. Nucleic Acids Res 52: D891-D899.

[0467] He D, Zakeri M, Sarkar H, Soneson C, Srivastava A, Patro R. 2022. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19: 316-322.

[0468] Heumos L, Schaar AC, Lance C, Litinetskaya A, Drost F, Zappia L, Lucken MD, Strobl DC, Henao J, Curion F, et al. 2023. Best practices for single-cell analysis across modalities. Nat Rev Genet 24: 550- 572.

[0469] Holley G, Melsted P. 2020. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 21 : 249.

[0470] Joglekar A, Prjibelski A, Mahfouz A, Collier P, Lin S, Schlusche AK, Marrocco J, Williams SR, Haase B, Hayes A, et al. 2021 . A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain. Nat Commun 12: 463.

[0471] Kaminow B, Yunusov D, Dobin A. 2021. STARsolo: accurate, fast and versatile mapping / quantification of single-cell and single-nucleus RNA-seq data. BioRxiv. doi: 10.1101 / 2021.05.05.442755. Karasikov M, Mustafa H, Danciu D, Barber C, Zimmermann M, Ratsch G, Kahles A. 2020. MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale. BioRxiv. doi: 10.1101 / 2020.10.01.322164.

[0472] Kent WJ. 2002. BLAT-the BLAST-like alignment tool. Genome Res 12: 656-664.

[0473] La Manno G, Siletti K, Furlan A, Gyllborg D, Vinsland E, Mossi Albiach A, Mattsson Langseth C, Khven I, Lederer AR, Dratva LM, et al. 2021. Molecular architecture of the developing mouse brain. Nature 596: 92-96.

[0474] Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P. 2015. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst 1 : 417-425.

[0475] Liu L, Chen A, Li Y, Mulder J, Heyn H, Xu X. 2024. Spatiotemporal omics for biology and medicine. Cell 187: 4488-4519.

[0476] Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. 2021. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 31 : 1-12.

[0477] Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. 2020. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics 36: i177— i185.

[0478] McKellar DW, Mantri M, Hinchman MM, Parker JSL, Sethupathy P, Cosgrove BD, De Vlaminck I. 2023. Spatial mapping of the total transcriptome by in situ polyadenylation. Nat Biotechnol 41 : 513— 520.

[0479] Melsted P, Booeshaghi AS, Liu L, Gao F, Lu L, Min KHJ, da Veiga Beltrame E, Hjorleifsson KE, Gehring J, Pachter L. 2021. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol 39: 813-818.

[0480] Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. 2018. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index. Cell Syst 7: 201-207. e4.

[0481] Schott M, Leon-Perinan D, Splendiani E, Strenger L, Licha JR, Pentimalli TM, Schallenberg S, Alles J, Samut Tagliaferro S, Boltengagen A, et al. 2024. Open-ST: High-resolution spatial transcriptomics in 3D. Cell 187: 3953-3972.e26.

[0482] Solomon B, Kingsford C. 2016. Fast search of thousands of short-read sequencing experiments. Nat Biotechnol 34: 300-302.

[0483] Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, Arlotta P, Macosko EZ, Chen F. 2021. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat Biotechnol 39: 313-319.

[0484] Sztanka-Toth TR, Jens M, Karaiskos N, Rajewsky N. 2022. Spacemake: processing and analysis of large-scale spatial transcriptomics data. Gigascience 11 .

[0485] Vickovic S, Eraslan G, Salmen F, Klughammer J, Stenbeck L, Schapiro D, Aijo T, Bonneau R, Bergenstrahle L, Navarro JF, et al. 2019. High-definition spatial transcriptomics for in situ tissue profiling. Nat Methods 16: 987-990.

[0486] Wick RR, Schultz MB, Zobel J, Holt KE. 2015. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31 : 3350-3352. Wolf FA, Angerer P, Theis FJ. 2018. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19: 15.

[0487] Wu B, Shentu X, Nan H, Guo P, Hao S, Xu J, Shangguan S, Cui L, Cen J, Deng Q, et al. 2024. A spatiotemporal atlas of cholestatic injury and repair in mice. Nat Genet 56: 938-952. Xi J, Lee JH, Kang HM, Jun G. 2022. STtools: A Comprehensive Software Pipeline for Ultra-high Resolution Spatial Transcriptomics Data. Bioinformatics Advances 2.

[0488] Yu Y, Liu J, Liu X, Zhang Y, Magner E, Lehnert E, Qian C, Liu J. 2018. SeqOthello: querying RNA-seq experiments at scale. Genome Biol 19: 167.

[0489] Zhao J, Both JP, Rodriguez-R LM, Konstantinidis KT. 2024. GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res. doi: 10.1093 / nar / gkae609.

[0490] Zormpas E, Queen R, Comber A, Cockell SJ. 2023. Mapping the transcriptome: Realizing the full potential of spatial data analysis. Cell 186: 5677-5689.

Claims

CLAIMS1 . A computer-implemented method, comprising a. providing biological sequence input data, b. decomposing the sequence input data into non-overlapping portions, and c. indexing each portion of the sequence data with at least one associated information of said portion comprised within or provided with the biological sequence input data, thereby creating a two-tiered index for each portion.

2. The method according to claim 1 , wherein the at least one associated information comprises metadata and / or sequence identifier information.

3. The method according to claim 1 or 2, wherein the method further comprises d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding metadata comprised within or provided with the biological sequence input data.

4. The method according to claim 3, wherein the method further comprises e. providing at least one output comprising at least one sequence, sequence identifier information and / or metadata of one or more portions of the biological input data corresponding to the query.

5. The method according to any one of the preceding claims, wherein the biological sequence input data comprises genomic, transcriptomic or proteomic data.

6. The method according to any one of the preceding claims, wherein the biological sequence input data is obtained by single cell or spatial nucleic acid sequencing.

7. The method according to any one of the preceding claims, wherein b. decomposing the biological input data into portions, comprises decomposing the biological input data into consecutive data strings of length k (k-mers).

8. The method according to any one of the preceding claims, wherein indexing each portion in c. comprises the construction of at least one index look-up table assigning each portion to the respective (two-tiered) index generated in c.

9. The method according to any one of the preceding claims, wherein the two-tiered index is constructed in c. to comprise three arrays, namely a first array (indices array) comprising the portions (k-mers) generated in b., a second array (pointer array) linking the first with the third array, and a third array (data array) comprising the at least one associated information of each portion (k-mer).

10. The method according to any one of claims 2-9, wherein the metadata comprised within or provided with the biological input data comprises information regarding the origin or location of said biological input data, such as a spatial position and / or cell from which said biological input data was obtained.

11. The method according to claim 10, wherein the information regarding the origin of said biological input data comprises sequence information of nucleic acid labels (e.g., barcodes, UMIs), spatial coordinates and / or chemical information of biochemical labels (e.g., amino acid labels, isotopes or chemical modifications).

12. The method according to any one of claims 3-11 , wherein the method further comprises employing a large language model (LLM) component to query / interrogate the two-tiered index of the biological input data based on the language-based prompts provided in d.

13. The method according to any one of claims 3-12, wherein the biological input data and the query in d. comprise nucleic acid and / or amino acid sequence information, and the query is processed by (i) decomposing the query sequences into sequence portions of length k (k-mers), preferably using a sliding-window approach, (ii) matching each k-mer to the two-tiered index(ed portions) of biological input data, (iii) optionally comparing the k- mers to a database comprising the sequences of ubiquitous k-mers.

14. The method according to any one of the preceding claims, wherein the sequence input data in b. is decomposed into a set of consecutive, nonoverlapping data strings of length k (k-mers), and wherein the two-tiered index is constructed in c. to comprise three arrays, namely a first array (indices array) comprising the k-mers generated in b., a second array (pointer array) linking the first with the third array, and a third array (data array) comprising the at least one associated information of each k-mer, wherein the at least one associated information comprises metadata, comprising information on the spatial position and / or (single) cell, from which a respective k-mer was obtained, and / or sequence identifier information, and wherein the method further comprises d. receiving at least one query from a user, wherein the query comprises biological sequence information, language-based prompts and / or queries regarding metadata comprised within or provided with the biological sequence input data, wherein, if the query comprises biological sequence information, said sequence information is decomposed into a set of consecutive, non-overlapping k- mers.

15. The method according to the preceding claim, wherein the processing of the query received in d. comprises interrogating the generated two-tiered index, based on the query.

16. The method according to claim 14 or 15, wherein the biological sequence input data originates from a sample comprising at least one cell and / or the contents of at least one cell.

17. The method according to any one of claims 14 - 16, wherein the biological sequence input data comprises genomic, transcriptomic or proteomic data, obtained from a single-cell sequencing analysis, spatial sequencing analysis, or mass spectrometry analysis.

18. Use of the method according to any one of the preceding claims for identifying tissue and / or cell localized expression of at least one gene and / or protein of interest, comprising: analyzing biological sequence input data obtained from a sample according to steps a.-c., a.-d. or a.-e. of the method according to any one of claims 1-17, wherein the biological sequence input data comprises single-cell nucleic acid sequencing data and / or spatial sequencing data, providing a user query with regard to at least one associated information of the at least one gene and / or protein of interest, preferably comprising a nucleic acid and / or amino acid sequence, a gene or protein identifier, a biochemical pathway identifier or specific disease in which at least one gene and / or protein of interest is considered to be involved, processing the user query according to steps d. and / or e., analyzing the two-tiered index generated in step c. of claim 1 , thereby identifying the cell and / or tissue specific expression of the of at least one gene and / or protein of interest, optionally providing at least one output comprising the tissue and / or cell localized expression of the at least one gene and / or protein of interest, preferably by a graphical user interface.

19. Use of the method according to any one of claims 1 -17 for the analysis, diagnosis, prognosis, treatment guidance and / or monitoring of a medical condition, comprising: analyzing biological sequence input data obtained from a sample of a subject suspected or diagnosed with a disease according to steps a.-c., a.-d. or a.-e. of the method according to any one of claims 1-17.

20. A computer-implemented method for constructing a data structure, such as an indexed database of biological sequence input data, comprising: a. providing biological sequence input data that has been experimentally determined, b. decomposing the biological sequence input data into non-overlapping portions, c. indexing each portion with at least one associated information of the portion comprised within or provided with the biological input data, thereby creating a two- tiered index for each portion, d. optionally providing reference biological input data, and e. generating a data structure of the biological sequence input data that was decomposed and indexed according to steps b. to c.

21. The computer-implemented method according to the preceding claim, wherein the data structure is an indexed database of biological sequence input data,wherein the sequence input data in b. is decomposed into a set of consecutive, nonoverlapping data strings of length k (k-mers), and wherein the two-tiered index is constructed in c. to comprise three arrays, namely a first array (indices array) comprising the k-mers generated in b., a second array (pointer array) linking the first with the third array, and a third array (data array) comprising the at least one associated information of each k-mer.

22. The computer-implemented method according to claim 20 or 21 , wherein the biological sequence input data originates from a sample comprising at least one cell and / or the contents of at least one cell, and / or wherein the biological sequence input data comprises genomic, transcriptomic or proteomic data, obtained from a single-cell sequencing analysis, spatial sequencing analysis, or mass spectrometry analysis.

23. A computer-readable medium comprising a data structure, constructed according to the method of the preceding claim, comprising an indexed database of biological sequence data, the database comprising: biological sequence data decomposed into non-overlapping portions, wherein the database comprises a two-tiered index for each portion, wherein each portion is indexed with associated information comprised within or provided with the biological input data.