Memory allocation architecture

The sparse, zero-page memory-mapped allocation method optimizes disk space and RAM usage for large datasets, enabling efficient processing of terabyte-scale data without swap space, enhancing CPU utilization and task performance.

WO2026132142A1 · PCT designated stage · Publication Date: 2026-06-25 · ILLUMINA INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: ILLUMINA INC
Filing Date: 2025-12-17
Publication Date: 2026-06-25


Abstract

A computer-implemented method for providing a process with access to data stored in a file comprises: receiving a request, from a process, to access at least a portion of the data; creating an empty sparse file and associated metadata; creating a memory-mapped sparse file in a virtual memory of the process and allocating memory for the process using a pointer to a starting position of the memory-mapped sparse file; copying the at least a portion of the data into the sparse file; and providing the process with access to the data based upon the memory-mapped sparse file. The zero content of the memory-mapped sparse file is mapped to a system zero page. By using the memory-mapped sparse file as a container for the data and the system zero page, data may be more efficiently allocated by the computer system.

Description

[0001] M&C PM365233US


[0003] MEMORY ALLOCATION ARCHITECTURE

[0004] Field

[0005] The present disclosure relates to systems and methods for processing files comprising large amounts of data. In particular, the present disclosure relates to the processing of files comprising large amounts of data generated using next generation sequencing technologies.

[0006] Background

[0007] Data generated using high throughput methods or multimodality imaging technologies may be in the order of terabytes. For example, a spatial transcriptomics dataset generated using next generation sequencing technologies may comprise a table of values with around 50,000 columns, each column corresponding to a respective gene, and at least 1,000,000 rows, each row corresponding to a respective cell observation. When such a dataset is represented using a typical numeric element binary representation size of eight bytes (a “double”), the representation has a size of 0.4 terabytes to 0.8 terabytes when decompressed. It is not atypical for spatial transcriptomics datasets to comprise 7,000,000 cell locations, corresponding to a size of around 2.8 terabytes.
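The sizes quoted above follow from multiplying rows by columns by the element size. The short Python sketch below reproduces that arithmetic; the function name and the use of decimal terabytes (10^12 bytes) are assumptions made for illustration.

```python
BYTES_PER_DOUBLE = 8   # one 64-bit floating-point element
TB = 10**12            # decimal terabyte (assumption: the text uses decimal units)


def dense_size_tb(rows: int, cols: int, elem_bytes: int = BYTES_PER_DOUBLE) -> float:
    """Size in terabytes of a dense rows x cols table of fixed-size elements."""
    return rows * cols * elem_bytes / TB


small = dense_size_tb(1_000_000, 50_000)   # 1,000,000 cells x 50,000 genes
large = dense_size_tb(7_000_000, 50_000)   # 7,000,000 cells x 50,000 genes
print(small, large)  # 0.4 2.8
```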

[0008] Omics datasets are typically processed as part of an analysis pipeline to determine biological information. For example, a spatial transcriptomics dataset may undergo processing steps such as filtering, normalization, clustering, and dimensionality reduction in order to determine features such as gene activity and relationships between cells. It can be challenging to process files that contain data that has a size in the order of terabytes when decompressed because such data typically exceeds the RAM space available on a computer, which is typically in the order of gigabytes.

[0009] A sparse file is a file that uses file system space more efficiently. A file is stored in blocks, some of which may be filled with bytes and some of which may be empty. With a sparse file, the empty blocks of the file are not stored on disk and are instead represented by metadata. When a sparse file is read, the file system converts the metadata representing the empty blocks into blocks filled with zero bytes.
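As an illustration of the behaviour described above, the following Python sketch creates an empty file with a logical size and reads from the middle of it. It assumes a file system that supports sparse files; the file name and the 16 MiB size are arbitrary choices for the example. On a file system without sparse-file support the same code still runs, but the zeros are physically stored.

```python
import os
import tempfile

LOGICAL_SIZE = 16 * 1024 * 1024  # 16 MiB logical size (illustrative)


def create_sparse_file(path: str, logical_size: int) -> None:
    """Create an empty file and set its logical size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(logical_size)  # extends the file; no data blocks are written


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.bin")
    create_sparse_file(path, LOGICAL_SIZE)

    st = os.stat(path)
    logical = st.st_size
    # On Linux, st_blocks counts allocated 512-byte units; it is absent on Windows.
    allocated = getattr(st, "st_blocks", 0) * 512

    # Reading an unwritten region returns zero bytes supplied by the file system.
    with open(path, "rb") as f:
        f.seek(LOGICAL_SIZE // 2)
        sample = f.read(4096)

    print(f"logical={logical} bytes, allocated={allocated} bytes")
```

On a sparse-capable file system the allocated figure is expected to be far smaller than the logical size, although the exact value depends on the file system.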

[0010] 37888576-1 M&C PM365233US


[0012] When a process is running on a computer, it is typically necessary for the process to access various addresses in the physical memory of the RAM and on the disk. A process may use virtual addresses that are located in a virtual memory of the process in order to access the contents of these various, fragmented, physical addresses. The virtual addresses appear as a contiguous address space in the virtual memory of the process. Memory mapping is performed by a system call (for example, mmap) that maps a portion of a file, or an entire file, to a virtual address space of a process.
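A minimal sketch of memory mapping, using Python's standard mmap module as a stand-in for the underlying system call. The file name and contents are assumptions for the example; once mapped, the file is addressed like a contiguous byte array.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"header")                 # a few bytes of real content
        f.truncate(mmap.PAGESIZE * 4)      # extend the file to four pages

    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ gives a read-only mapping.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            mapped_len = len(mm)
            first = mm[:6]      # content written above
            tail = mm[-8:]      # never written, reads as zeros
```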

[0013] Summary of the Invention

[0014] The embodiments described herein provide systems and methods for reading and processing files comprising a large amount of data in the order of terabytes or greater that minimise the allocation of space on disk. The sparse, “zero page”, memory mapped allocation method described herein addresses the problem of large data such as single cell or spatial transcriptomics data that require space for processing that can typically exceed the RAM space available.

[0015] According to a first aspect, there is provided a computer-implemented method for providing a process with access to data stored in a file. The method comprises receiving a request, from a process, to access at least a portion of the data, the at least a portion of the data comprising non-zero content and zero content; creating an empty sparse file and associated metadata; creating a memory-mapped sparse file in a virtual memory of the process and allocating memory for the process using a pointer to a starting position of the memory-mapped sparse file; copying the non-zero content of the at least a portion of the data into the sparse file and updating metadata to represent the zero content of the at least a portion of the data; and providing the process with access to the non-zero content of the at least a portion of the data and zero content of the at least a portion of the data based upon the memory-mapped sparse file and a zero page.

[0016] Memory is allocated for the requested data according to the size of a memory-mapped sparse file. Since the sparse file is initially empty, memory of size zero bytes is initially allocated. Only the non-zero content of the at least a portion of the data is loaded into the sparse file. As a result, the disk space occupied by the sparse file is smaller than the at least a portion of the data if the at least a portion of the data comprises some zero content. The embodiments described herein are particularly advantageous when the data is numerically sparse, i.e. when the data comprises a majority of zero values. Even if the at least a portion of the data does not comprise any zeros, there are memory allocation improvements according to the embodiments described herein due to the benefits of using a memory-mapped file, as will now be described.

[0019] The usage of memory mapping, mmap, for allocation provides for processing space which is present in RAM at the discretion of the operating system. An operating system may experience memory pressure when a system component or process needs RAM but there is little or none available because the RAM space is being used by other system components or processes. By memory mapping a file, only contents of the file that are actively in use by the process are required to be loaded into the RAM. It is at the discretion of the operating system as to how much of the memory-mapped file that is not currently in use should remain in the RAM. Advantageously, in scenarios where the process only requires access to the memory-mapped sparse file in order to read it, the zero content of the sparse file may be read from the system zero page without loading any content into the RAM.

[0020] The empty sparse file may be created with a logical size which is greater than the size of the at least a portion of the data that the process requires access to. This may be advantageous in scenarios where the processed data has a larger size than the data before it is processed. This may be the case, for example, when constructing a graph based on data such as a spatial transcriptomics or single cell transcriptomics dataset. Before processing the data, it is unknown how many edges will be in the computed graph. The sparse file may be created with a logical file size that is large enough to store a graph with the maximum number of possible edges in the graph. This ensures that the sparse file is sufficiently large to store any eventual graph computed based on the data. When writing the graph, only the edges of the eventual graph may be stored in the sparse file.
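The sizing rule described above can be sketched as follows. The sketch assumes an undirected graph without self-loops, so the maximum number of edges for n nodes is n(n-1)/2, and a hypothetical 16-byte edge record; neither assumption comes from the source.

```python
import os
import tempfile

EDGE_RECORD_BYTES = 16  # assumption: two 4-byte node ids plus one 8-byte weight


def max_edges(n_nodes: int) -> int:
    """Upper bound on edges in an undirected graph without self-loops."""
    return n_nodes * (n_nodes - 1) // 2


def graph_container_size(n_nodes: int, record_bytes: int = EDGE_RECORD_BYTES) -> int:
    """Logical file size large enough to hold every possible edge."""
    return max_edges(n_nodes) * record_bytes


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph.bin")
    size = graph_container_size(2_000)
    with open(path, "wb") as f:
        f.truncate(size)            # logical size only; no edges written yet
    logical = os.path.getsize(path)
```

Because only the edges actually written occupy space on a sparse-capable file system, over-provisioning the logical size in this way is inexpensive.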

[0021] Optionally, the memory-mapped sparse file is created by performing a memory-mapping system call on the sparse file.

[0022] Optionally, providing the process with access to the non-zero content and the zero content based upon the memory-mapped sparse file comprises: loading one or more pages corresponding to the non-zero content of the sparse file into a RAM based upon the memory-mapped sparse file; and providing access to zeros of the zero page according to the zero content of the memory-mapped sparse file.

[0025] Optionally, the method further comprises reading, with the process, some, or all, of the non-zero content of the memory-mapped sparse file and some, or all, of the zero content of the memory-mapped sparse file.

[0026] The read content of the memory-mapped sparse file corresponds to the content of the at least a portion of the data.

[0027] Optionally, the method further comprises performing, with the process, a calculation or function based upon the read non-zero content and read zero content.

[0028] Some calculations or functions, such as filtering, can be performed on the data without the necessity of copying the zero content of the data into the RAM.

[0029] Optionally, the method further comprises outputting, with the process, the results of the calculation or function.

[0030] Optionally, the method further comprises attempting, by the process, to modify or copy some, or all, of the zero content of the memory-mapped sparse file; allocating one or more pages in the RAM for the zero content of the memory-mapped sparse file that is attempted to be modified or copied; and filling the one or more allocated pages with zeros according to the memory-mapped sparse file.

[0031] Space is allocated for the zero contents of the memory-mapped sparse file only upon attempts to copy or modify, i.e. write to, the zero content that is mapped to the zero page. This is a type of copy-on-write technique.
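The write path can be illustrated with a small Python sketch. It maps an empty sparse file, reads a zero region, writes four bytes into one page, and confirms that a neighbouring page still reads as zeros. The page index and byte values are arbitrary; how the kernel backs untouched zero regions (for example with a shared zero page) depends on the operating system and file system and is not observable from this code.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
N_PAGES = 1024

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.bin")
    with open(path, "wb") as f:
        f.truncate(PAGE * N_PAGES)            # empty sparse container

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            before = mm[10 * PAGE : 10 * PAGE + 4]               # read only: zeros
            mm[10 * PAGE : 10 * PAGE + 4] = b"\x01\x02\x03\x04"  # write to page 10
            mm.flush()                                           # push changes to the file

    with open(path, "rb") as f:
        f.seek(10 * PAGE)
        after = f.read(4)        # the written bytes
        f.seek(11 * PAGE)
        untouched = f.read(4)    # neighbouring page was never written
```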

[0032] Optionally, the method further comprises processing, with the process, the contents of one or more pages corresponding to some, or all, of the non-zero content of the memory-mapped sparse file and some, or all, of the zero content of the memory-mapped sparse file.


[0035] Optionally, the method further comprises copying the content of the one or more pages corresponding to some, or all, of the non-zero content of the memory-mapped sparse file and the one or more pages corresponding to some, or all, of the zero content of the memory-mapped sparse file; and / or modifying the content of the one or more pages corresponding to some, or all, of the non-zero content of the memory-mapped sparse file and the one or more pages corresponding to some, or all, of the zero content of the memory-mapped sparse file.

[0036] Optionally, the method further comprises copying any non-zero content of the modified content into the sparse file on the disk and updating the metadata according to any zero content of the modified content. The sparse file may then be written to disk.

[0037] The modified content is written to disk lazily, i.e. at some point after the write rather than immediately.

[0038] Optionally, the method further comprises deleting the sparse file. Optionally, the deletion of the sparse file is the final step of a first iteration of providing a process with access to data stored in a file, and the method further comprises performing a second iteration of providing a process with access to data stored in a file.

[0039] Optionally, the file comprises compressed data, and copying the non-zero content of the at least a portion of the data into the sparse file comprises copying a decompressed version of the non-zero content of the at least a portion of the data into the sparse file.

[0040] Optionally, the data has a size equal to or greater than a free RAM space.

[0041] Typically, a computer system would be under memory pressure when trying to access data that has a size equal to or greater than the free RAM space, i.e. the RAM space available. Advantageously, the methods described herein allow access to a file with a size equal to or greater than a free RAM space without having to use a swap space.

[0042] Optionally, the data comprises spatial transcriptomics data or single cell transcriptomics data. Optionally, the calculation comprises one or more of filtering, clustering and / or ANOVA. Optionally, the modifying comprises one or more of normalization or graph construction. Optionally, the function comprises annotating, cell typing or other typing, or cell segmentation.

[0045] According to a second aspect there is provided a carrier medium carrying computer readable program code configured to cause a computer to carry out a method according to the first aspect.

[0046] According to a third aspect there is provided a computer apparatus for providing a process with access to data, the apparatus comprising: a memory storing processor readable instructions; and a processor configured to read and execute instructions stored in the memory; wherein said processor readable instructions comprise instructions controlling the processor to carry out a method according to the first aspect.

[0047] The computer apparatus may comprise high speed storage disks with disk speeds greater than 2 GB per second for storing the file and sparse file. These disk speeds are optimal for supporting unimpeded multithreaded processing using the memory allocation methods described herein. However, the methods disclosed herein may also be performed using disks with lower disk speeds of around, for example, 2 Gb. These disk speeds can be provided by an SSD used as a cache on regular disk storage or by using pure SSD storage solutions. These disks can be provided at a fraction of the cost of a RAM with a processing space large enough to process terabytes of data.

[0048] Brief Description of Figures

[0049] Arrangements of the present disclosure will be understood and appreciated more fully from the following detailed description, made by way of example only and taken in conjunction with drawings in which:

[0050] Figure 1 illustrates a process requesting access to data stored in a file;

[0051] Figure 2 illustrates a computer system for performing the embodiments disclosed herein; Figure 3 illustrates an improved method for providing a process with access to data stored in a file;

[0052] Figure 4 illustrates a block diagram of an improved method for providing a process with access to data stored in a file; and


[0055] Figure 5 illustrates a block diagram of an improved method for providing a process with access to data stored in a file in order to modify the data.

[0056] Description

[0057] The present disclosure provides an improved method and system for processing large amounts of data, such as single cell data or spatial transcriptomics data. The method and system described herein allow large amounts of data to be efficiently read and / or processed even in scenarios where the size of the data exceeds the RAM space available. This is achieved by using an empty sparse file as a container for the data. Zero content of the data in the sparse file may be read using the system zero page, which removes the requirement for loading the zero content into the RAM when it is only being read.

[0058] With reference to Figure 1, a file 101 located on a disk 190 stores data 103. The data 103 comprises non-zero content and zero content. Space is allocated on the disk for both the non-zero content and the zero content of the file.

[0059] The data 103 may comprise biological data generated through high throughput technologies. The biological data may comprise genomic, transcriptomic, proteomic, or metabolomic data. The biological data may be generated using next generation sequencing methods. The data may comprise a plurality of observations for a plurality of respective variables. The data 103 may be represented as a matrix. In one embodiment, the data 103 comprises spatial transcriptomics data comprising a plurality of observations and a plurality of variables, each of the observations corresponding to respective cell locations or cells, and each of the variables corresponding to numbers of expressed genes or transcripts. In one embodiment, the data 103 comprises single cell data comprising a plurality of observations and a plurality of variables, each of the observations corresponding to respective cell locations or cells, and each of the variables corresponding to numbers of expressed genes or transcripts. The data 103 may comprise any other type of data comprising a plurality of values, such as imaging data, multimodal data or health records data.

[0060] The data 103 may be sparse. Data sparsity is the condition where a relatively large proportion of entries within a dataset are set to zero. This may occur in a spatial transcriptomics or single cell transcriptomics dataset when zero reads of a given transcript are detected for a cell or location. As a result of the data being sparse, the file 101 may have many blocks of allocated memory comprising zero bytes.

[0063] The file 101 may be in any suitable format for storing data such as CSV, TSV, HDF5, MTX, or the like. The file 101 may be in a compressed format, such as ZIP or GZIP. The size of the data stored in the file 101 when decompressed may be gigabytes (GB), terabytes (TB), or petabytes (PB) in size. The size of the data 103 stored in the file 101 when decompressed may exceed the size of the RAM.

[0064] A process 102 requires access to the file 101 in order to read at least a portion of the data 103 contained within the file 101 and / or to process at least a portion of the data 103 contained within the file 101.

[0065] The process 102 may be any type of software or application for the reading and / or processing of data. The process 102 may be bioinformatics software for reading and processing biological data. The process 102 may be part of a transcriptomics pipeline and be configured to perform one or more of the following tasks: QA / QC; filtering; normalization; dimensionality reduction, such as one or more of PCA, UMAP, t-SNE and SVD; clustering, such as one or more of k-means clustering, graph-based clustering (Louvain, Leiden) and hierarchical clustering; statistical analysis, such as one or more of differential expression, correlation, descriptive statistics, computing biomarkers, Cox regression and Kaplan-Meier; differential expression analysis, such as one or more of ANOVA, Hurdle model, DESeq2, Limma-trend, Welch's ANOVA, Kruskal-Wallis, Poisson regression, negative binomial regression and Gene Specific Analysis; or a classification analysis such as cell annotation, cell typing or cell segmentation.

[0066] In scenarios where the decompressed data 103 stored in the file 101 has a size in the region of TB and / or exceeds the size of the RAM, it can be challenging to allocate memory for processing the file 101. In these scenarios, the file 101 may be stored on the disk in a compressed format to save storage space on the disk. According to methods known in the art, upon the request of the process 102 to access the file 101, the compressed data may be read to a numeric binary representation container laid out as a vector of columns (column major) or a vector of rows (row major) if the data corresponds to a matrix. This numeric binary representation container is allocated as a single memory allocation whose allocation size is the number of columns times the number of rows times the size of an element in the numeric binary representation. One possible way to process decompressed data of this size is to use a computer system with a RAM of a sufficient size to store the decompressed data. However, systems with RAM of a sufficient size are generally not available or are prohibitively expensive.

[0069] Figure 2 illustrates an example computing system 200 for performing embodiments of the present disclosure. The computing system 200 comprises a central processing unit (CPU) 210 coupled to a disk 290 and accessing a physical memory such as a random access memory (RAM) 235. The disk 290 stores file 201, which corresponds to file 101. The disk 290 may be one or more disks. The disk 290 may be a hard disk and / or a solid state drive (SSD) disk. The disk 290 may have write speeds greater than 2 GB per second. Usual procedures for the loading of software into memory and the storage of data in the disk 290 apply. The CPU 210 also accesses, via bus 240, a communications interface 250 that is configured to receive data from and output data to an external system (e.g. an external network or a user input device or output device, such as a keyboard, mouse, display screen and / or touch-interface). The communications interface 250 may be a single component or may be divided into a separate input interface and a separate output interface.

[0070] The CPU 210 is configured to implement the methodology described herein based on executable software stored within the disk 290. The software can be embedded in original equipment, or can be provided, as a whole or in part, after manufacture. For instance, the software can be introduced, as a whole, as a computer program product, which may be in the form of a download, or may be introduced via a computer program storage medium, such as an optical disk or connectable memory drive (such as a Universal Serial Bus flash drive). Alternatively, modifications to an existing controller can be made by an update, or plug-in, to provide features of the above described embodiment. Whilst only a single CPU 210 is described, in practice, the computing system 200 may comprise multiple CPUs 210 for implementing the methodology described herein.

[0071] With reference to Figure 3, an overview of the methodology 300 that the CPU 210 is configured to implement will now be described.


[0074] At 310, the CPU receives a request from a process 302 to access a file to read and / or process at least a portion of the data 303 stored in file 301. The process 302 corresponds to process 102, the file 301 corresponds to files 101 and 201, and the data 303 corresponds to data 103.

[0075] The process 302 may be operating on the same computer system as the CPU 210 or it may be operating on another computer system which is in communication with computer system 200 over a wired or wireless network.

[0076] At 320, an empty sparse file and associated metadata are created. The metadata comprises information representing the zero content, or empty blocks, of the sparse file. The empty blocks of the sparse file are unallocated.

[0077] At 330, the sparse file is memory-mapped using a memory mapping system call, i.e. a memory-mapped sparse file is created. Memory is allocated for the process using a pointer to the memory-mapped sparse file.

[0078] At 340, the non-zero content of the at least a portion of the data is copied into the sparse file and the metadata is updated to represent the zero content of the at least a portion of the data.

[0079] At 350, the process is provided with access to the non-zero content of the at least a portion of the data and with access to the zero content of the at least a portion of the data based upon the memory-mapped sparse file.

[0080] Optionally, at 360, the process reads some, or all, of the non-zero content of the at least a portion of the data and some, or all, of the zero content of the at least a portion of the data.

[0081] Optionally, at 370, the process processes some, or all, of the non-zero content of the at least a portion of the data and some, or all, of the zero content of the at least a portion of the data.
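Steps 320 to 360 can be sketched end to end in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of the disclosure: the data is a row-major matrix of 8-byte little-endian doubles, the non-zero content arrives as hypothetical (row, column, value) triples, and the function names build_container and read_value are invented for the example. The file system maintains the sparse-file metadata implicitly.

```python
import mmap
import os
import struct
import tempfile

ELEM = 8  # bytes per double


def build_container(path, n_rows, n_cols, nonzero_entries):
    """Create an empty sparse file, memory-map it and copy non-zero values in."""
    size = n_rows * n_cols * ELEM
    with open(path, "wb") as f:                 # step 320: empty file, logical size only
        f.truncate(size)
    f = open(path, "r+b")
    mm = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_WRITE)   # step 330
    for row, col, value in nonzero_entries:     # step 340: non-zero content only
        struct.pack_into("<d", mm, (row * n_cols + col) * ELEM, value)
    return f, mm


def read_value(mm, n_cols, row, col):
    """Steps 350/360: read one element through the mapping."""
    return struct.unpack_from("<d", mm, (row * n_cols + col) * ELEM)[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.bin")
    entries = [(0, 3, 2.5), (999, 49, 7.0)]     # illustrative non-zero values
    f, mm = build_container(path, 1000, 50, entries)
    try:
        value_a = read_value(mm, 50, 0, 3)      # written value
        value_b = read_value(mm, 50, 500, 10)   # never written, reads as 0.0
        value_c = read_value(mm, 50, 999, 49)   # written value
    finally:
        mm.close()
        f.close()
```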


[0084] Whilst the above method 300 and following examples with reference to Figure 4 and Figure 5 begin with step 310, in which a CPU receives a request from a process to access a file, it will be understood that method 300 may also be applied in situations in which the CPU does not receive such a request. This may occur in situations where a process is configured to create new content, as opposed to performing calculations or modifications based on existing data. In these embodiments, the method 300 starts at step 320. At step 340, no non-zero content is added to the sparse file and there is no update to the metadata since there is no file comprising data.

[0085] Figures 4A to 4C schematically represent operations to perform the method 300 of Figure 3.

[0086] Starting with Figure 4A, at 410, a process 402 requests access to a file 401 stored on a disk 490 to read and / or process at least a portion of the data stored in file 401. The process 402 corresponds to process 102 and the file 401 corresponds to files 201 and 101.

[0087] At 420, an empty sparse file 407 and associated metadata 404 are created on disk 490. The sparse file 407 may be created on the same disk as the file 401 or on a different disk; disk 490 is referenced for simplicity. The sparse file 407 has a logical file size. The logical file size may be a predetermined size. The predetermined size may be selected to be equal to or greater than the size of the at least a portion of the decompressed data stored in the file 401. The predetermined size may be selected to be equal to, or 1x, 2x, 4x, ... 10x, etc., greater than the size of the at least a portion of the decompressed data stored in the file 401.

[0088] The metadata 404 stores information comprising the logical size of the sparse file, information about which blocks of the sparse file are allocated, and information about which blocks of the sparse file are unallocated, or sparse. The metadata 404 occupies some space on the disk 490, but the occupied space is negligible in size. The metadata 404 is stored in the file system containing the sparse file 407. The form that the metadata 404 is stored in depends on the file system where the sparse file is stored. The file system of the disk 490 organizes and manages the files stored on the disk 490. If the sparse file 407 is created with a predetermined logical file size, the sparse file 407 will appear as the predetermined logical file size to the file system even though the sparse file 407, being empty, occupies no space on the disk 490.

[0091] Turning to Figure 4B, at 430, the sparse file 407 is memory mapped using a memory mapping system call. The memory mapping system call creates a memory-mapped sparse file 405 in the virtual memory of the process 402. The memory-mapped sparse file 405 is a region in the virtual memory of the process 402 that is assigned a direct byte-for-byte correlation with the non-zero content of the sparse file 407 and the zero content represented by the metadata 404. The virtual addresses of the memory-mapped sparse file 405 are mapped to the physical address spaces in the RAM 435. During creation of the memory-mapped sparse file 405, the metadata 404 is used to determine the range of the virtual memory to be mapped as well as the form of the mapping. The metadata 404 determines which portions of the memory-mapped sparse file 405 (i.e. those portions corresponding to zero content) may be mapped to the zero page, and which portions of the memory-mapped sparse file 405 (i.e. those portions corresponding to non-zero content) may be mapped to physical page addresses in the RAM. The zero page is a page filled with zeros and it is located at the beginning of the address space in the RAM 435. Details of the mapping are stored in a page table. A page table is a data structure that stores the mappings, or translations, between virtual addresses and physical addresses in the RAM 435.

[0092] Memory is allocated for the process 402 for the at least a portion of the data using a pointer to the memory-mapped sparse file 405. Specifically, the pointer stores the memory address of the starting position of the memory-mapped sparse file 405 in the virtual memory of the process. The pointer may be returned by the memory mapping system call.
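Python does not expose raw pointers, but a typed memoryview over the mapping plays a similar role: it starts at the first byte of the mapped region, and elements are located by offset from that start. The sketch below assumes a row-major matrix of native 8-byte doubles; the dimensions are arbitrary.

```python
import mmap
import os
import tempfile

N_ROWS, N_COLS = 256, 512
ELEM = 8  # bytes per double

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.bin")
    with open(path, "wb") as f:
        f.truncate(N_ROWS * N_COLS * ELEM)      # empty sparse container

    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
        raw = memoryview(mm)        # byte view starting at the mapping's first byte
        view = raw.cast("d")        # reinterpret the same memory as doubles

        n_elements = len(view)
        view[3 * N_COLS + 7] = 1.5          # element (row 3, column 7), row-major
        got = view[3 * N_COLS + 7]
        zero = view[0]                      # untouched element reads as 0.0

        # Views must be released before the mapping can be closed.
        view.release()
        raw.release()
        mm.close()
```

In C, the equivalent would be the address returned by the memory mapping call, cast to a pointer to double.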

[0093] Turning to Figure 4C, at 440, the non-zero content of the at least a portion of the data stored in the file 401 is copied into the sparse file 407 and the metadata 404 is updated to describe the zero content of the at least a portion of the data. The sparse file 407 acts as a numeric binary representation container into which the non-zero content is copied and / or in which it is processed. Space on the disk 490 is allocated for the non-zero content of the sparse file 407 only. No space is allocated on the disk 490 for the zero content of the sparse file 407 which is represented by the metadata 404. In other words, the zero content of the at least a portion of the data represented by the metadata 404 is not initialized.

[0096] The non-zero content of the at least a portion of the data may be copied into the sparse file 407 in decompressed form. If the sparse file 407 is created to have a logical file size greater than the decompressed at least a portion of the data, the logical size of the sparse file will be larger than the actual allocated space on the disk 490 for the sparse file. This may also be the case when the sparse file 407 has been created to have a logical file size that is equal to or smaller than the decompressed at least a portion of the data, since space is only allocated on disk 490 for the non-zero content of the sparse file.
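One way to copy only the non-zero content is to walk the decompressed source in fixed-size blocks and skip any block that consists entirely of zero bytes, leaving a hole in the destination. The sketch below is an assumption about how this might be done rather than the method of the disclosure; the 4096-byte block size and the function name copy_nonzero_blocks are illustrative. The file system records the holes in its own metadata.

```python
import io
import os
import tempfile

BLOCK = 4096  # copy granularity (assumption)


def copy_nonzero_blocks(src, dst_path, logical_size, block=BLOCK):
    """Copy src into a sparse file, writing only blocks with a non-zero byte."""
    zero_block = bytes(block)
    written = 0
    with open(dst_path, "wb") as dst:
        dst.truncate(logical_size)          # empty container with logical size
        offset = 0
        while True:
            chunk = src.read(block)
            if not chunk:
                break
            if chunk != zero_block[: len(chunk)]:
                dst.seek(offset)            # skip over holes
                dst.write(chunk)
                written += 1
            offset += len(chunk)
    return written


source_bytes = bytes(BLOCK * 3) + b"\x01" * BLOCK + bytes(BLOCK * 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.bin")
    written = copy_nonzero_blocks(io.BytesIO(source_bytes), path, len(source_bytes))
    with open(path, "rb") as f:
        round_trip = f.read()
```

Only one of the eight blocks is written, yet reading the container back yields the full original content, zeros included.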

[0097] At 450, the process 402 is provided with access to the at least a portion of the data based upon the memory-mapped sparse file 405.

[0098] In further detail, the process first requests access to one or more parts of the memory-mapped sparse file 405.

[0099] If the process only requires access to the one or more parts of the memory-mapped sparse file 405 in order to read the contents of the one or more parts of the memory-mapped sparse file 405, memory is only allocated on the RAM 435 for the non-zero content of the one or more parts of the memory-mapped sparse file 405. The zero content of the one or more parts of the memory-mapped sparse file 405 is read from the zero page. The physical address of the zero content is the address of the zero page.

[0100] The non-zero content of the one or more parts of the memory-mapped sparse file 405 is read by demand paging. The page table is read in order to identify the physical addresses of the non-zero content of the one or more parts of the memory-mapped sparse file 405. If the page table does not hold any physical addresses for the non-zero content, a page fault occurs. Pages of the non-zero content of the sparse file 407 are then paged into the RAM 435. The contents of the pages paged into the RAM 435 correspond to the contents of one or more parts of the memory-mapped sparse file 405 that are non-zero. The page table is updated to reflect the updated physical addresses on the RAM 435 for the non-zero content.


[0103] At 460, the process 402 then reads the one or more parts of the memory-mapped sparse file 405 from the pages in the RAM 435 and the zero page 406.

[0104] The process 402 may perform one or more calculations or functions based on the read content. For example, the process 402 may perform calculations such as statistical testing techniques comprising one or more of ANOVA, filtering, clustering based on the read content. As a further example, the process 402 may perform functions such as annotating, cell typing or other typing, or cell segmentation. The results of the calculation may be output by the process and optionally stored on the disk 490.

[0105] When the process 402 no longer requires access to the read content, the sparse file 407 and associated metadata 404 may be deleted.

[0106] Figure 5 schematically represents operations for providing a process 502 with access to data when the process requires the data in order to process it. Each of the features 501-507 and 590 corresponds to the respective feature 401-407 and 490 in Figure 4.

[0107] At 550, the process requests access to the one or more parts of the memory-mapped sparse file 505 in order to copy or modify the contents of the one or more parts of the memory-mapped sparse file 505.

[0108] The page table is read in order to identify the physical addresses that are mapped to the zero content and the non-zero content of the one or more parts of the memory-mapped sparse file 505.

[0109] If any of the physical addresses of the non-zero content of the one or more parts of the memory-mapped sparse file 505 do not correspond to a page in the RAM 535, a page fault occurs. The corresponding parts of the sparse file 507 are then paged into the RAM 535 and the page table is updated in the same manner as described with reference to Figure 4C.

[0110] By default, the address of the zero page is used as the address for the zero content of the memory-mapped sparse file 505, unless some of the physical addresses for the zero content of the one or more parts of the memory-mapped sparse file 505 already correspond to a page in the RAM 535.

[0113] Optionally, at 560, the process attempts to modify the contents of the one or more parts of the memory-mapped sparse file 505 according to a processing function. The processing function may comprise one of the following functions: addition, subtraction, multiplication, or division, one or more times and in any order.

[0114] The contents of the non-zero content of the one or more parts of the memory-mapped sparse file are modified by applying the processing function to the contents of the corresponding pages in the RAM. Any pages in the RAM that comprise zero content of the one or more parts of the memory-mapped sparse file are also modified by applying the processing function to the contents of those pages.

[0115] An attempt to modify (i.e. write to) any of the contents of the one or more parts of the memory-mapped sparse file 505 that are mapped to the zero page triggers a page fault, since the system zero page is read-only. One or more pages are allocated in the RAM 535 of a sufficient size to hold the zero content that is to be modified and is not already paged into the RAM. The allocated pages are then filled with zero content by copying from the zero page. The page table is updated with the new physical addresses for any of the zero content of the one or more parts of the memory-mapped sparse file 505 that has been paged into RAM. The contents of the newly allocated, zero-filled pages are then modified by applying the processing function to the contents.
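By way of illustration only, the modification path may be sketched as follows, assuming a POSIX-like system, a table stored as native eight-byte doubles, and a hypothetical helper `add_constant` that applies an addition as the processing function. The page fault and page allocation are carried out by the operating system and are not visible in the script; the sketch only shows that writing through the mapping to a range that previously held no data yields the expected values.

```python
import mmap
import os
import struct
import tempfile


def add_constant(path, first, count, delta):
    """Add `delta` to `count` float64 values starting at element `first`."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        # View the mapping as an array of native doubles; both views are
        # released before the mapping is closed.
        with memoryview(mm) as raw, raw.cast("d") as values:
            for i in range(first, first + count):
                # The first store to a page that has only ever read as zeros
                # is the point at which the OS must supply a writable page.
                values[i] += delta
        mm.flush()  # push modified pages back to the file


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.truncate(16 * mmap.PAGESIZE)  # all-zero sparse file
    add_constant(path, first=0, count=4, delta=1.0)
    with open(path, "rb") as f:
        result = struct.unpack("5d", f.read(5 * 8))
```

The first four elements are modified, while the fifth is left untouched and still reads as zero.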

[0116] The results of the processing may be output by the process. The non-zero results of the processing may be stored in the sparse file 503 and the metadata 504 may be updated to reflect the zero content of results. The memory-mapped sparse file 505 is updated according to the new non-zero and zero content. The results of the processing may be saved to the disk 590.

[0117] Optionally, at 570, the process attempts to copy the contents of the one or more parts of the memory-mapped sparse file 505. In order to copy data, the data must be in the RAM. Pages corresponding to the non-zero contents of the one or more parts of the memory-mapped sparse file 505 are already in the RAM, and there may also be pages in the RAM corresponding to the zero contents of the one or more parts of the memory-mapped sparse file 505. For any of the zero content that is not already in the RAM, one or more pages are allocated in the RAM 535 of a sufficient size to hold the zero content that is not already paged into the RAM and is to be copied. The allocated pages are then filled with zero content by copying from the zero page. The page table is updated with the new physical addresses for any of the zero content of the one or more parts of the memory-mapped sparse file 505 that has been paged into RAM. The pages in the RAM corresponding to the non-zero and zero content of the memory-mapped sparse file are then copied to RAM 535 or to disk 590.

[0120] When process 502 no longer requires access to the memory-mapped sparse file 505, the sparse file 503 and associated metadata 504 may be deleted.

[0121] After deletion of the sparse file, the process may request access to the same file 501, or another file, to process a second at least a portion of the data stored in the file, and the methodology repeats according to methods 300, 400 and 500.

[0122] Examples of the improved method 300, 400, 500 will now be described.

[0123] Example 1 - Filtering of spatial transcriptomics data

[0124] In one example, the file comprises a spatial transcriptomic data set comprising a table of values where each column of the table corresponds to a given feature gene and each row of the table corresponds to a given observation of a cell or location within a cell. Each element of the table is the numeric representation of the number of transcripts for that gene feature at that given cell’s location.

[0125] A standard, necessary step in analyzing spatial transcriptomics data is to filter out columns (feature genes) or rows (observation cell locations) which have only zero or some low number of transcript counts, or a low aggregate count by column or by row.

[0126] In order for the process to filter the spatial transcriptomics data, the process first requests access to all of the spatial transcriptomics data in the file. An empty sparse file and associated metadata is created. The sparse file is then memory-mapped and memory for the process is allocated using a pointer to the memory-mapped sparse file. The non-zero content of the spatial transcriptomics data is copied into the sparse file and the metadata is updated to represent the zero content of the spatial transcriptomics data.
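By way of illustration only, the creation and population of the sparse file may be sketched as follows. The sketch assumes a row-major table of native eight-byte doubles, a hypothetical helper `load_nonzero`, and a source of (row, column, value) entries that stands in for the data file. Tracking of the unwritten ranges is left to the filesystem in this sketch, which is an assumption about how the metadata is maintained.

```python
import mmap
import os
import struct
import tempfile

ITEM = 8  # bytes per float64 ("double")


def load_nonzero(path, n_rows, n_cols, entries):
    """Create a sparse file sized for the full table; write only non-zero cells."""
    with open(path, "wb") as f:
        # Sets the logical size only; on filesystems with sparse-file support
        # no data blocks are allocated for the untouched ranges.
        f.truncate(n_rows * n_cols * ITEM)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        for row, col, value in entries:
            if value != 0.0:
                offset = (row * n_cols + col) * ITEM  # row-major layout (assumed)
                struct.pack_into("d", mm, offset, value)
        mm.flush()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.bin")  # hypothetical file name
    cells = [(0, 3, 2.0), (5, 5, 0.0), (999, 49, 5.0)]
    load_nonzero(path, n_rows=1000, n_cols=50, entries=cells)
    logical_size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(3 * ITEM)
        first = struct.unpack("d", f.read(ITEM))[0]
        f.seek((999 * 50 + 49) * ITEM)
        last = struct.unpack("d", f.read(ITEM))[0]
```

The zero-valued entry is skipped, so only the two non-zero cells are ever written through the mapping.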

[0129] The process then requests to read the memory-mapped sparse file. The non-zero content of the memory-mapped sparse file is paged into the RAM. The zero page and the paged-in pages are read by the process to determine which columns, or rows, should be filtered out according to a criterion. For example, columns corresponding to genes that have a number of entries equal to zero above a threshold may be filtered out. As another example, rows corresponding to cell locations that have a number of zero entries above a second threshold may be filtered out. The columns or rows that are to be kept may be represented as an index. The index may then be output by the process and saved to disk.
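By way of illustration only, the column filtering criterion may be sketched as follows, assuming a row-major table of native eight-byte doubles mapped read-only. The helper `columns_to_keep` and the threshold parameter `max_zeros` are hypothetical names; a column is kept when its number of zero entries does not exceed the threshold.

```python
import mmap
import os
import struct
import tempfile


def columns_to_keep(path, n_rows, n_cols, max_zeros):
    """Return indices of columns whose zero count does not exceed `max_zeros`."""
    zeros = [0] * n_cols
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as raw, raw.cast("d") as table:
                for r in range(n_rows):
                    # Copy one row out of the mapping; the temporary slice view
                    # is dropped at once so the mapping can be closed cleanly.
                    row = table[r * n_cols:(r + 1) * n_cols].tolist()
                    for c, value in enumerate(row):
                        if value == 0.0:
                            zeros[c] += 1
    return [c for c in range(n_cols) if zeros[c] <= max_zeros]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.bin")  # hypothetical file name
    rows = [(1.0, 0.0, 0.0),
            (2.0, 0.0, 0.0),
            (3.0, 7.0, 0.0),
            (4.0, 0.0, 0.0)]
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack("3d", *row))
    keep_strict = columns_to_keep(path, n_rows=4, n_cols=3, max_zeros=2)
    keep_loose = columns_to_keep(path, n_rows=4, n_cols=3, max_zeros=3)
```

The returned list plays the role of the index of columns to be kept; a row filter could be written in the same way by counting zeros per row.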

[0130] In this example, calculations are performed on the spatial transcriptomics data without the necessity of paging any zero content into the RAM. Since a sparse file is being used to contain the spatial transcriptomics data, the required disk space starts at zero and then scales up to the amount of non-zero pages of content written via the pointer to the virtual address space of the process. The amount of RAM used will depend on various factors, including the access pattern of the process in the virtual address space, and will dynamically vary from no RAM used for pages of content up to the amount of free RAM. Once the operating system has written the non-zero page contents to the sparse file, the operating system is able to stop using RAM for that page of content if those page contents are no longer needed by the process.
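By way of illustration only, the difference between the logical size of a sparse file and the disk space allocated to it may be inspected as follows. The sketch assumes a platform on which `os.stat` reports `st_blocks` in units of 512 bytes (POSIX-like systems); the helper `sizes` and the file name are hypothetical. The allocated figure depends on the filesystem, so no particular value is assumed.

```python
import os
import tempfile


def sizes(path):
    """Return (logical size, allocated bytes or None if not reported)."""
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)  # not available on every platform
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.truncate(64 << 20)        # 64 MiB logical size, nothing written
    logical_before, allocated_before = sizes(path)
    with open(path, "r+b") as f:
        f.seek(32 << 20)
        f.write(b"\x01" * 4096)     # 4096 bytes of non-zero content
    logical_after, allocated_after = sizes(path)
```

On a filesystem with sparse-file support, the allocated size would be expected to stay far below the logical size both before and after the write, while the logical size is unchanged.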

[0131] In another example, spatial transcriptomics data is processed to normalize the data. A necessary (but not typically sufficient) way to normalize the data is to add one to every element and then take the log value of each element.

[0132] In order for the process to normalize the spatial transcriptomics data, the process first requests access to all of the spatial transcriptomics data in the file. An empty sparse file and associated metadata is created. The sparse file is then memory-mapped and memory for the process is allocated using a pointer to the memory-mapped sparse file. The non-zero content of the spatial transcriptomics data is copied into the sparse file and the metadata is updated to represent the zero content of the spatial transcriptomics data. The process then requests access to the memory-mapped sparse file to process it. The non-zero content of the memory-mapped sparse file is paged into the RAM. The zero content of the memory-mapped sparse file is mapped to the content of the zero page. The process attempts to modify every entry of the accessed data by adding a value of one to each entry and then log transforming each entry. In response to the attempt by the process to modify the zero content mapped to the zero page, a page fault is triggered and pages are allocated in the RAM for the zero content. The allocated pages are filled with zeros. The values in all of the pages in the RAM corresponding to the spatial transcriptomics data are then modified by adding 1 to every value and taking the log transform. The resultant normalized dataset that is stored in the RAM is then saved to the disk. The sparse file is then deleted.
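By way of illustration only, the normalization step may be sketched as follows, assuming native eight-byte doubles and the natural logarithm (the base of the logarithm is an assumption of this sketch). The helper `normalize_log1p` is a hypothetical name. Since log(0 + 1) = 0, zero entries keep the value zero, but every element is still written back in this sketch, so every page of the mapping is modified.

```python
import math
import mmap
import os
import struct
import tempfile


def normalize_log1p(path):
    """Replace every float64 x in the mapped file with log(1 + x)."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        with memoryview(mm) as raw, raw.cast("d") as table:
            for i in range(len(table)):
                # Every element is stored back, including zeros, so every
                # page of the mapping ends up being written.
                table[i] = math.log1p(table[i])
        mm.flush()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.truncate(1024 * 8)                  # 1024 zero-valued doubles
        f.seek(10 * 8)
        f.write(struct.pack("2d", 1.0, 3.0))  # two non-zero counts
    normalize_log1p(path)
    with open(path, "rb") as f:
        out = struct.unpack("1024d", f.read())
```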

[0135] Benchmarking

[0136] The advantages of the methodology described herein in comparison to standard memory allocation using system swap to disk will now be described with reference to the following examples.

[0137] System swap to disk is a memory allocation technique used by a computer system when it runs low on RAM. A swap space is a data storage area that is used as an extension of the computer’s RAM. The operating system moves data that is being less frequently used from the RAM to the swap space to free up space in the RAM.

[0138] According to a first example, the embodiments described herein are applied to enable a process to perform a processing function on a computer system according to method steps 320-350 and 370. The processing function creates a data structure of size 1126GB filled with values of 1. The computer system has 1TB RAM, 64 CPUs, and 64TB high speed storage for scratch and swap space. There is no step 310 because the process does not require access to any data. At 320, an empty sparse file with logical file size 1126GB and associated metadata is created on the disk. At 330, the empty sparse file is memory-mapped and memory is allocated for the process using a pointer to the memory-mapped sparse file. At 340, since there is no data file, no non-zero content is copied into the sparse file and there is no change to the metadata. At 350, the process accesses the memory-mapped sparse file. At 370, the process fills all the contents of the memory-mapped sparse file with a value of 1 using the C++ std::fill command running in parallel on a plurality of CPUs. At the end of the processing, the sparse file is deleted. By performing steps 320-350 and 370, the process is able to perform processing of the sparse file that has a logical size greater than the RAM without running out of space on the RAM or using the swap space.
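By way of illustration only, a scaled-down, single-threaded analogue of the first example may be sketched as follows. It is not the benchmarked program: the benchmark uses C++ std::fill in parallel on a 1126GB structure, whereas this sketch fills a few megabytes sequentially. The element type (single bytes with value 1), the chunk size and the helper `fill_with_ones` are assumptions.

```python
import mmap
import os
import tempfile

CHUNK = 1 << 20  # fill 1 MiB at a time (arbitrary choice)


def fill_with_ones(path, logical_size):
    """Create a sparse file of `logical_size` bytes and set every byte to 1."""
    with open(path, "wb") as f:
        f.truncate(logical_size)  # empty sparse file of the full logical size
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        ones = b"\x01" * CHUNK
        for start in range(0, logical_size, CHUNK):
            end = min(start + CHUNK, logical_size)
            mm[start:end] = ones[:end - start]  # write through the mapping
        mm.flush()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fill.bin")  # hypothetical file name
    fill_with_ones(path, 3 * CHUNK + 123)
    with open(path, "rb") as f:
        content = f.read()
```

Because the mapping is backed by the file rather than by swap, pages that have been written can be flushed to the file and released by the operating system as the fill proceeds.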

[0141] According to a second example, the same processing function is performed on a computer system configured to overallocate memory using swap. The computer system has 1TB RAM, 64 CPUs, and 64TB high speed storage for scratch and swap space. Since the processing function results in the creation of a data structure of the size of 1126GB, which is greater than the space available on the RAM, the computer system is under memory pressure and the swap space must be utilised. A swap file on the disk is created. Pages of the swap file are paged into the RAM and filled with values of 1 using the C++ std::fill command running in parallel on a plurality of CPUs and then moved into the swap space when they are no longer needed. At the end of the processing, a swap file resides on the disk with a size of 1126GB and filled with values of 1.

[0142] Table 1 shows the execution time and CPU usage from example 1 and example 2 using the Linux “time” command.

[0143] It can be seen that, using the improved memory allocation technique according to the methods described herein, the overall execution time of the process is 37 minutes and 35 seconds and the computer system was able to use approximately 47 CPUs. This can be compared to when the same processing is performed with memory allocation using system swap, which has an overall execution time of 52 minutes and 39 seconds and in which only approximately 4 CPUs were used. The use of multiple CPUs is beneficial for CPU-intensive operations, such as the processing of large amounts of data, since the load of such high performance computing tasks can be distributed over many CPUs. Advantageously, by using the methodology described herein, faster processing time and better CPU utilization can be achieved in comparison to when system swap to disk is used.
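By way of illustration only, the relative improvement implied by the two reported execution times can be worked out as follows. Only the two durations stated above are used; the variable names are hypothetical.

```python
def seconds(minutes, secs):
    """Convert a minutes-and-seconds duration to seconds."""
    return minutes * 60 + secs


sparse_mmap = seconds(37, 35)  # 2255 s, memory-mapped sparse file
system_swap = seconds(52, 39)  # 3159 s, system swap to disk

speedup = system_swap / sparse_mmap
time_saved = 1 - sparse_mmap / system_swap

print(f"speed-up: {speedup:.2f}x, time saved: {time_saved:.1%}")
# prints "speed-up: 1.40x, time saved: 28.6%"
```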

[0144] While certain arrangements have been described, the arrangements have been presented by way of example only, and are not intended to limit the scope of protection.

[0147] The inventive concepts described herein may be implemented in a variety of other forms. In addition, various omissions, substitutions and changes to the specific implementations described herein may be made without departing from the scope of protection defined in the following claims.

Claims

1. A computer-implemented method for providing a process with access to data stored in a file, the method comprising: receiving a request, from a process, to access at least a portion of the data, the at least a portion of the data comprising non-zero content and zero content; creating an empty sparse file and associated metadata; creating a memory-mapped sparse file in a virtual memory of the process and allocating memory for the process using a pointer to a starting position of the memory-mapped sparse file; copying the non-zero content of the at least a portion of the data into the sparse file and updating metadata to represent the zero content of the at least a portion of the data; and providing the process with access to the non-zero content of the at least a portion of the data and zero content of the at least a portion of the data based upon the memory-mapped sparse file and a zero page.

2. The method according to claim 1, wherein the memory-mapped sparse file is created by performing a memory-mapping system call on the sparse file.

3. The method according to claim 1 or 2, wherein providing the process with access to the non-zero content and the zero content based upon the memory-mapped sparse file and the zero page comprises: loading one or more pages corresponding to the non-zero content of the sparse file into a RAM based upon the memory-mapped sparse file; and providing access to zeros of the zero page according to the zero content of the memory-mapped sparse file.

4. The method according to any preceding claim, further comprising: reading, with the process, some, or all, of the non-zero content of the memory-mapped sparse file and some, or all, of the zero content of the memory-mapped sparse file.

5. The method according to claim 4, further comprising: performing, with the process, a calculation or function based upon the read non-zero content and read zero content.

6. The method according to claim 5, further comprising: outputting, with the process, the results of the calculation or function.

7. The method according to any preceding claim when dependent on claim 3, the method further comprising: attempting, with the process, to modify or copy some, or all, of the zero content of the memory-mapped sparse file; allocating one or more pages in the RAM for the some, or all, of the zero content of the memory-mapped sparse file that is attempted by the process to be modified or copied; and filling the one or more allocated pages with zeros according to the memory-mapped sparse file.

8. The method according to claim 7, further comprising: processing, with the process, the contents of the one or more pages corresponding to some, or all, of the non-zero content of the memory-mapped sparse file and some, or all, of the zero content of the memory-mapped sparse file.

9. The method according to claim 8, wherein the processing comprises: copying the content of the one or more pages corresponding to some, or all, of the non-zero content of the memory-mapped sparse file and the one or more pages corresponding to some, or all, of the zero content of the memory-mapped sparse file; and / or modifying the content of the one or more pages corresponding to some, or all, of the non-zero content of the memory-mapped sparse file and the one or more pages corresponding to some, or all, of the zero content of the memory-mapped sparse file.

10. The method according to claim 9, further comprising: copying any non-zero content of the modified content into the sparse file on the disk and updating the metadata according to any zero content of the modified content.

11. The method according to any preceding claim, further comprising: deleting the sparse file.

12. The method according to claim 11, wherein the deletion of the sparse file is the final step of a first iteration of providing a process with access to data stored in a file, the method further comprising performing a second iteration of providing a process with access to data stored in a file.

13. The method according to any preceding claim, wherein the file comprises compressed data, and wherein loading / copying non-zero content of the at least a portion of the omics dataset into the sparse file comprises: loading / copying a decompressed version of non-zero content of the at least a portion of the data into the sparse file.

14. The method according to any preceding claim, wherein the data stored in the file has a size equal to or greater than a free RAM space.

15. The method according to any preceding claim, wherein the data is spatial transcriptomics data or single cell transcriptomics data, wherein the calculation comprises filtering, clustering and / or ANOVA, wherein the function comprises annotating, cell typing or other typing, or cell segmentation, and wherein the modifying comprises normalization or graph construction.

16. A carrier medium carrying computer readable program code configured to cause a computer to carry out a method according to any one of claims 1 to 15.

17. A computer apparatus for providing a process with data, the apparatus comprising: a memory storing processor readable instructions; and a processor configured to read and execute instructions stored in the memory; wherein said processor readable instructions comprise instructions controlling the processor to carry out a method according to any one of claims 1 to 15.