Data compression and decompression methods, apparatus, and storage system
By using alignment terms from BAM data to perform reference compression on FastQ data, the poor compression performance of existing technologies is addressed, achieving higher compression ratios and faster compression speeds.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: HUAWEI TECH CO LTD
- Filing Date: 2025-09-22
- Publication Date: 2026-07-02
AI Technical Summary
Existing technologies have poor compression performance when performing reference compression on FastQ data and fail to meet the reduction performance requirements of commercial storage; the compression ratio is also not high.
The alignment term corresponding to each sequence in the FastQ data is determined from the BAM data, and the alignment results in those alignment terms are used to perform reference compression on the FastQ data. This omits the reference genome alignment process and enables joint compression of the FastQ data and the BAM data.
It improves the compression ratio and compression performance of FastQ data, meeting the reduction performance requirements of commercial storage.
Description
Data compression, decompression methods, apparatus and storage systems
[0001] This application claims priority to Russian patent application No. 2024139330, filed on December 25, 2024, entitled "Method, Apparatus and Storage System for Data Compression and Decompression", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of data compression, and more specifically, to a method, apparatus and storage system for data compression and decompression.
Background Technology
[0003] In the era of rapid development in precision medicine, gene sequencing, one of the popular technologies for personalized diagnosis and treatment, is producing an ever-expanding volume of data. The stock of human genome sequencing data is projected to exceed 120 exabytes (EB) by 2025. Guidelines from the American College of Medical Genetics and Genomics (ACMG) recommend that fast sequence quality (FastQ) data and binary alignment / map (BAM) data be retained for at least two years, with most requiring retention for decades or even indefinitely. There is therefore a very substantial market for gene data storage, with next-generation sequencing (NGS) accounting for over 90% of the entire sequencing market.
[0004] Currently, data compression ratio is a major concern for customers: a higher compression ratio means lower storage costs. In next-generation sequencing data storage, FastQ and BAM data constitute the largest proportion. Dedicated compression algorithms for FastQ data generally fall into two categories: reference-free (direct) compression and reference compression. Reference compression of the sequence content in FastQ data can improve the compression ratio, thus lowering storage costs for customers. However, reference compression of FastQ data requires determining the precise location of the sequence content on the reference genome through alignment, which results in poor compression performance. For example, single-threaded compression performance is typically less than 10 MB/s, failing to meet the reduction performance requirements of commercial storage.
[0005] Therefore, how to improve data compression performance while also increasing the data compression ratio is a problem that needs to be solved.
Summary of the Invention
[0006] This application provides a data compression method that can improve both the data compression ratio and the data compression performance.
[0007] In a first aspect, a data compression method is provided, comprising: acquiring FastQ data and BAM data, wherein the FastQ data includes multiple sequences, and the BAM data includes multiple alignment terms corresponding to the multiple sequences, the multiple alignment terms being obtained by aligning the multiple sequences with a reference genome respectively; determining a first alignment term corresponding to a first sequence among the multiple sequences, the first alignment term being one of the multiple alignment terms, the first alignment term including the alignment result of the first sequence, the alignment result of the first sequence including the position information of the first sequence in the reference genome; and performing reference compression on the first sequence based on the alignment result of the first sequence included in the first alignment term to obtain compressed data of the FastQ data.
[0008] In the above technical solution, during the reference compression of FastQ data, the corresponding alignment terms for each sequence in the FastQ data can be determined, and the alignment results contained in each alignment term in the BAM data can be directly used to perform reference compression on the sequences in the FastQ data. This avoids the alignment operation with the reference genome during reference compression of FastQ data, thereby improving the compression performance of reference compression. In this way, reference compression can be used to improve the compression ratio of FastQ data, while also omitting the reference alignment process with the reference genome, thus improving the compression performance of FastQ data.
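The idea in this paragraph can be illustrated with a minimal Python sketch. It assumes a hypothetical, simplified `AlignmentTerm` holding only a read identifier and a 0-based reference position, a reference genome held as a plain string, and an ungapped forward-strand alignment. The function names and the `(position, length, mismatches)` encoding are illustrative choices, not the encoding defined by this application; real BAM alignment terms also carry details such as insertions, deletions and strand, which the sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class AlignmentTerm:
    """Hypothetical, simplified stand-in for one alignment term in BAM data."""
    read_id: str
    ref_pos: int  # assumed: 0-based start of the read in the reference


def reference_compress(seq, term, reference):
    """Encode seq relative to the reference, using the position already
    stored in the alignment term instead of running an aligner."""
    window = reference[term.ref_pos:term.ref_pos + len(seq)]
    if len(window) != len(seq):
        raise ValueError("read extends past the end of the reference")
    # Keep only the bases that differ from the reference window.
    mismatches = [
        (i, base)
        for i, (base, ref_base) in enumerate(zip(seq, window))
        if base != ref_base
    ]
    return (term.ref_pos, len(seq), mismatches)


def reference_decompress(encoded, reference):
    """Rebuild the original sequence from the reference and the mismatches."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for i, base in mismatches:
        bases[i] = base
    return "".join(bases)
```

Because the position is read from the alignment term, the sketch performs no search against the reference, which is the step the above technical solution omits.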
[0009] In conjunction with the first aspect, in some implementations of the first aspect, the first sequence is aligned with the reference genome to obtain the first alignment term, and the correspondence between the first alignment term and the first sequence is determined in that process; the first alignment term corresponding to the first sequence is then determined based on the correspondence.
[0010] In the above technical solution, while each sequence in the FastQ data is aligned against the reference genome to obtain the corresponding alignment term in the BAM data, the correspondence between the alignment terms in the BAM data and the sequences in the FastQ data can be recorded as a by-product. Because the correspondence is recorded during the process of obtaining the alignment terms, recording it consumes almost no additional resources.
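A toy Python sketch of recording the correspondence as a by-product of alignment follows. The exact-match `str.find` call merely stands in for a real aligner, and the dictionary layout and the names `align_and_record` and `correspondence` are assumptions made for illustration only.

```python
def align_and_record(fastq_seqs, reference):
    """Produce one alignment term per sequence and, in the same loop,
    record which FastQ sequence each alignment term came from."""
    terms = []
    correspondence = {}  # FastQ index -> index of its alignment term
    for fastq_index, seq in enumerate(fastq_seqs):
        # Toy stand-in for an aligner: exact substring search (-1 if absent).
        ref_pos = reference.find(seq)
        correspondence[fastq_index] = len(terms)
        terms.append({"seq": seq, "pos": ref_pos})
    return terms, correspondence
```

The only extra work per sequence is one dictionary assignment, which is in line with the statement that the recording costs almost nothing.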
[0011] In conjunction with the first aspect, in some implementations of the first aspect, each sequence among the multiple sequences includes at least one of the following: the content of the sequence, the identifier of the sequence, and the quality score sequence corresponding to the sequence; each alignment term among the multiple alignment terms includes at least one of the following: the content of the sequence corresponding to the alignment term, the identifier of the sequence corresponding to the alignment term, and the quality score sequence corresponding to the sequence corresponding to the alignment term; and the first alignment term corresponding to the first sequence is determined based on a comparison between the content of the first sequence and the sequence content included in the multiple alignment terms; or based on a comparison between the identifier of the first sequence and the sequence identifiers included in the multiple alignment terms; or based on a comparison between the quality score sequence corresponding to the first sequence and the quality score sequences included in the multiple alignment terms.
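The three matching options in this paragraph (by sequence content, by identifier, or by quality score sequence) can be sketched with a single lookup routine in Python. Records are modelled here as plain dictionaries with the hypothetical keys `"id"`, `"seq"` and `"qual"`; this layout and the function name `match_by_key` are assumptions. The sketch also ignores the fact that BAM stores reverse-strand reads reverse-complemented, so matching by content or quality would need extra handling in practice.

```python
def match_by_key(fastq_records, alignment_terms, key):
    """For each FastQ record, return the index of the alignment term whose
    value under `key` ("id", "seq" or "qual") is equal, or None if none is."""
    # Index the alignment terms once so each lookup is a dictionary access.
    index = {}
    for term_index, term in enumerate(alignment_terms):
        index.setdefault(term[key], []).append(term_index)
    matches = []
    for record in fastq_records:
        candidates = index.get(record[key], [])
        # Consume candidates in order so duplicate values map one-to-one.
        matches.append(candidates.pop(0) if candidates else None)
    return matches
```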
[0012] In conjunction with the first aspect, in some implementations of the first aspect, the method further includes: obtaining joint compressed data of the FastQ data and the BAM data, the joint compressed data including the compressed data of the FastQ data and multiple compressed alignment terms corresponding to the multiple sequences, wherein the first compressed alignment term corresponding to the first sequence includes the position information of the first sequence in the reference genome and does not include the first sequence.
[0013] In the above technical solution, the BAM data and the compressed data of the FastQ data can also be jointly compressed to remove the parts that overlap between them, thereby further improving the data compression ratio.
[0014] In conjunction with the first aspect, in some implementations of the first aspect, when each alignment term among the multiple alignment terms includes the sequence corresponding to that alignment term, obtaining the joint compressed data of the FastQ data and the BAM data includes: removing the first sequence from the first alignment term to obtain the first compressed alignment term, thereby obtaining the joint compressed data of the FastQ data and the BAM data.
[0015] In conjunction with the first aspect, in some implementations of the first aspect, the correspondence between the first alignment term and the first sequence is recorded in the joint compressed data, and the correspondence is used to decompress the first compressed alignment term in the joint compressed data.
[0016] In the above technical solution, the correspondence can also be recorded in the joint compressed data of the BAM data and the FastQ data, so as to facilitate decompression of the joint compressed data.
[0017] In conjunction with the first aspect, in some implementations of the first aspect, the correspondence is recorded in the first compressed alignment term of the joint compressed data.
[0018] In conjunction with the first aspect, in some implementations of the first aspect, the correspondence between the first alignment term and the first sequence includes the position information of the first sequence in the FastQ data.
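The joint compression described in paragraphs [0012] to [0018] can be sketched as follows. The dictionary layout, the field name `fastq_index` and the function name `joint_compress` are hypothetical; the sketch only shows the duplicated sequence being dropped from each alignment term and the position of the sequence in the FastQ data being stored in its place as the correspondence.

```python
def joint_compress(alignment_terms, matches):
    """Strip the duplicated sequence from each matched alignment term and
    record the index of the corresponding FastQ sequence instead.

    matches[i] is the index of the alignment term for FastQ sequence i,
    or None if that sequence has no alignment term.
    """
    # Copy the terms so the caller's data is left unchanged.
    compressed_terms = [dict(term) for term in alignment_terms]
    for fastq_index, term_index in enumerate(matches):
        if term_index is None:
            continue
        term = compressed_terms[term_index]
        term.pop("seq", None)              # the sequence is kept only on the FastQ side
        term["fastq_index"] = fastq_index  # correspondence used for decompression
    return compressed_terms
```

Storing the correspondence inside the compressed alignment term itself, as in [0017], means no separate mapping table is needed at decompression time.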
[0019] In conjunction with the first aspect, in some implementations of the first aspect, the first sequence is replaced with the alignment result of the first sequence included in the first alignment term to obtain the compressed data of the FastQ data.
[0020] In a second aspect, a data compression method is provided, comprising: acquiring fast sequence quality (FastQ) data and binary alignment / map (BAM) data, wherein the FastQ data includes multiple sequences, and the BAM data includes multiple alignment terms corresponding to the multiple sequences, the multiple alignment terms being obtained by aligning the multiple sequences with a reference genome respectively; determining a first alignment term corresponding to a first sequence among the multiple sequences, the first alignment term being one of the multiple alignment terms, the first alignment term including the alignment result of the first sequence, the alignment result of the first sequence including the position information of the first sequence in the reference genome; and removing the parts duplicated between the first sequence and the first alignment term to obtain joint compressed data of the FastQ data and the BAM data.
[0021] In the above technical solution, BAM data and FastQ data can be jointly compressed to remove the overlapping parts between BAM data and FastQ data, thereby improving the data compression rate.
[0022] In conjunction with the second aspect, in some implementations of the second aspect, the plurality of comparison items include the plurality of sequences and the position information of the plurality of sequences in the reference genome, and the first sequence is removed to obtain the joint compressed data.
[0023] In conjunction with the second aspect, in some implementations of the second aspect, the method further includes: recording the correspondence between the first comparison item and the first sequence in the joint compressed data, the correspondence being used to decompress the first sequence.
[0024] In the above technical solution, the correspondence can also be recorded in the joint compressed data of BAM data and FastQ data, so as to facilitate the decompression of the first sequence in FastQ data.
[0025] In conjunction with the second aspect, in some implementations of the second aspect, the correspondence is recorded in the first comparison item of the jointly compressed data.
[0026] In conjunction with the second aspect, in some implementations of the second aspect, the plurality of comparison terms include the plurality of sequences and the position information of the plurality of sequences in the reference genome, and the first sequence is removed from the first comparison term to obtain the first compressed comparison term to obtain the joint compressed data, wherein the first compressed comparison term includes the position information of the first sequence in the reference genome but does not include the first sequence.
[0027] In conjunction with the second aspect, in some implementations of the second aspect, the method further includes: recording the correspondence between the first comparison item and the first sequence in the joint compressed data, the correspondence being used to decompress the first compressed comparison item in the joint compressed data.
[0028] In conjunction with the second aspect, in some implementations of the second aspect, the correspondence is recorded in the first compressed comparison item of the joint compressed data.
[0029] In conjunction with the second aspect, in some implementations of the second aspect, the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
[0030] In conjunction with the second aspect, in some implementations of the second aspect, the first sequence is compared with the reference genome to obtain the first comparison term, so as to determine the correspondence between the first comparison term and the first sequence; and the first comparison term corresponding to the first sequence is determined based on the correspondence.
[0031] In conjunction with the second aspect, in some implementations of the second aspect, the comparison item among the plurality of comparison items includes at least one of the following information: the sequence corresponding to the comparison item, the identifier of the sequence corresponding to the comparison item, and the quality score sequence corresponding to the sequence corresponding to the comparison item; determining the first comparison item corresponding to the first sequence based on the comparison between the first sequence and the sequences included in the comparison items among the plurality of comparison items; or determining the first comparison item corresponding to the first sequence based on the comparison between the identifier of the first sequence and the identifiers of the sequences included in the comparison items among the plurality of comparison items; or determining the first comparison item corresponding to the first sequence based on the comparison between the quality score sequence corresponding to the first sequence and the quality score sequences included in the comparison items among the plurality of comparison items.
[0032] In conjunction with the second aspect, in some implementations of the second aspect, the method further includes replacing the first sequence with the comparison result of the first sequence included in the first comparison item.
[0033] In a third aspect, a data decompression method is provided, comprising: acquiring joint compressed data of fast sequence quality (FastQ) data and binary alignment / map (BAM) data, wherein the joint compressed data includes multiple sequences, the position information of the multiple sequences in a reference genome, and the correspondence between the multiple sequences and multiple alignment terms in the BAM data; and decompressing the joint compressed data according to the correspondence between the multiple sequences and the multiple alignment terms to obtain the FastQ data and / or the BAM data.
[0034] In the above technical solution, the joint compressed data of FastQ data and BAM data can be decompressed according to the correspondence between multiple sequences and multiple comparison items recorded in the joint compressed data to obtain the FastQ data and / or the BAM data.
[0035] In conjunction with the third aspect, in some implementations of the third aspect, the jointly compressed data includes multiple comparison items of the BAM data and the correspondence, wherein the first comparison item among the multiple comparison items includes the first sequence among the multiple sequences and the position information of the first sequence in the reference genome, and the first sequence is obtained from the jointly compressed data according to the correspondence between the first sequence and the first comparison item to obtain the FastQ data.
[0036] In conjunction with the third aspect, in some implementations of the third aspect, the correspondence is recorded in the first comparison item of the jointly compressed data.
[0037] In conjunction with the third aspect, in some implementations of the third aspect, the joint compressed data includes the FastQ data, multiple compressed comparison terms corresponding to the multiple sequences in the FastQ data, and the correspondence, wherein the first compressed comparison term corresponding to the first sequence includes the position information of the first sequence in the reference genome and does not include the first sequence; and, according to the correspondence between the first sequence among the multiple sequences and the first comparison term, the first sequence is filled into the first compressed comparison term to obtain the first comparison term, so as to obtain the BAM data.
[0038] In conjunction with the third aspect, in some implementations of the third aspect, the correspondence is recorded in the first compressed comparison term of the joint compressed data.
[0039] In conjunction with the third aspect, in some implementations of the third aspect, the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
[0040] In conjunction with the third aspect, in some implementations of the third aspect, the sequence content included in the plurality of sequences is represented by the position information of the sequence content in the reference genome, and the method further includes: obtaining the sequence content of the first sequence from the reference genome based on the position information of the sequence content of the first sequence of the plurality of sequences in the reference genome.
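The two decompression steps in paragraphs [0037] to [0040] can be sketched in Python under simplifying assumptions: compressed comparison terms are dictionaries carrying a hypothetical `fastq_index` field as the recorded correspondence, the reference genome is a plain string, and a sequence stored by position matches the reference exactly (the `length` argument is an assumption of this sketch, since the text mentions only position information).

```python
def restore_alignment_terms(compressed_terms, fastq_seqs):
    """Fill each sequence back into its compressed comparison term, using
    the recorded position of the sequence in the FastQ data."""
    restored = []
    for term in compressed_terms:
        full = dict(term)  # copy so the compressed input is left unchanged
        full["seq"] = fastq_seqs[full.pop("fastq_index")]
        restored.append(full)
    return restored


def sequence_from_reference(reference, ref_pos, length):
    """Recover sequence content that was stored only as a reference position."""
    return reference[ref_pos:ref_pos + length]
```

When the FastQ sequences were themselves reference compressed, `sequence_from_reference` would be applied first, and its output then passed to `restore_alignment_terms`.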
[0041] In a fourth aspect, a data compression apparatus is provided, comprising an acquisition module, a determination module, and a compression module. The acquisition module is used to acquire fast sequence quality (FastQ) data and binary alignment / map (BAM) data, wherein the FastQ data includes multiple sequences, and the BAM data includes multiple alignment terms corresponding to the multiple sequences, the multiple alignment terms being obtained by aligning the multiple sequences with a reference genome respectively. The determination module is used to determine a first alignment term corresponding to a first sequence among the multiple sequences, the first alignment term being one of the multiple alignment terms and including the alignment result of the first sequence, the alignment result of the first sequence including the position information of the first sequence in the reference genome. The compression module is used to perform reference compression on the first sequence based on the alignment result of the first sequence included in the first alignment term to obtain compressed data of the FastQ data.
[0042] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the determining module is specifically used to: compare the first sequence with the reference genome to obtain the first comparison term, so as to determine the correspondence between the first comparison term and the first sequence; and determine the first comparison term corresponding to the first sequence based on the correspondence.
[0043] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the sequence among the plurality of sequences includes at least one of the following information: the content of the sequence, the identifier of the sequence, and the corresponding quality score sequence of the sequence; the comparison item among the plurality of comparison items includes at least one of the following information: the content of the sequence corresponding to the comparison item, the identifier of the sequence corresponding to the comparison item, and the quality score sequence corresponding to the sequence corresponding to the comparison item; the determining module is specifically used to: determine the first comparison item corresponding to the first sequence based on the comparison between the content of the first sequence and the content of the sequences included in the comparison items among the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the identifier of the first sequence and the identifier of the sequences included in the comparison items among the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the corresponding quality score sequence of the first sequence and the quality score sequence included in the comparison items among the plurality of comparison items.
[0044] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the acquisition module is further used to acquire the joint compressed data of the FastQ data and the BAM data. The joint compressed data includes the compressed data of the FastQ data and multiple compressed comparison terms corresponding to the multiple sequences. The first compressed comparison term corresponding to the first sequence includes the position information of the first sequence in the reference genome, but does not include the first sequence.
[0045] In conjunction with the fourth aspect, in some implementations of the fourth aspect, where the comparison item among the plurality of comparison items includes the sequence corresponding to the comparison item, the compression module is further configured to: remove the first sequence from the first comparison item to obtain the first compressed comparison item, so as to obtain the joint compressed data of the FastQ data and the BAM data.
[0046] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the apparatus further includes a recording module for recording the correspondence between the first comparison item and the first sequence in the joint compressed data, the correspondence being used to decompress the first compressed comparison item in the joint compressed data.
[0047] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the correspondence is recorded in the first compressed comparison item of the joint compressed data.
[0048] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the correspondence between the first comparison item and the first sequence includes: the position information of the first sequence in the FastQ data.
[0049] In conjunction with the fourth aspect, in some implementations of the fourth aspect, the compression module is specifically used to: replace the first sequence with the comparison result of the first sequence included in the first comparison item to obtain compressed data of the FastQ data.
[0050] In a fifth aspect, a data compression apparatus is provided, comprising an acquisition module, a determination module, and a compression module. The acquisition module is used to acquire fast sequence quality (FastQ) data and binary alignment / map (BAM) data, wherein the FastQ data includes multiple sequences, and the BAM data includes multiple alignment terms corresponding to the multiple sequences, the multiple alignment terms being obtained by aligning the multiple sequences with a reference genome respectively. The determination module is used to determine a first alignment term corresponding to a first sequence among the multiple sequences, the first alignment term being one of the multiple alignment terms and including the alignment result of the first sequence, the alignment result of the first sequence including the position information of the first sequence in the reference genome. The compression module is used to remove the parts duplicated between the first sequence and the first alignment term to obtain joint compressed data of the FastQ data and the BAM data.
[0051] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the compression module is specifically used to: remove the first sequence from the plurality of comparison items, which include the plurality of sequences and their respective position information in the reference genome, and obtain the joint compressed data.
[0052] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the apparatus further includes: a recording module for recording the correspondence between the first comparison item and the first sequence in the joint compressed data, the correspondence being used to decompress the first sequence.
[0053] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the correspondence is recorded in the first comparison item of the jointly compressed data.
[0054] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the plurality of comparison items include the plurality of sequences and the position information of the plurality of sequences in the reference genome, and the compression module is specifically used to: remove the first sequence from the first comparison item to obtain the first compressed comparison item, so as to obtain the joint compressed data, wherein the first compressed comparison item includes the position information of the first sequence in the reference genome, but does not include the first sequence.
[0055] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the recording module is used to record the correspondence between the first comparison item and the first sequence in the joint compressed data, and the correspondence is used to decompress the first compressed comparison item in the joint compressed data.
[0056] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the correspondence is recorded in the first compressed comparison item of the joint compressed data.
[0057] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
[0058] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the determining module is specifically used to: compare the first sequence with the reference genome to obtain the first comparison term, so as to determine the correspondence between the first comparison term and the first sequence; and determine the first comparison term corresponding to the first sequence based on the correspondence.
[0059] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the comparison item among the plurality of comparison items includes at least one of the following information: the sequence corresponding to the comparison item, the identifier of the sequence corresponding to the comparison item, and the quality score sequence corresponding to the sequence corresponding to the comparison item; the determining module is specifically used to: determine the first comparison item corresponding to the first sequence based on the comparison between the first sequence and the sequences included in the comparison items among the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the identifier of the first sequence and the identifiers of the sequences included in the comparison items among the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the quality score sequence corresponding to the first sequence and the quality score sequences included in the comparison items among the plurality of comparison items.
[0060] In conjunction with the fifth aspect, in some implementations of the fifth aspect, the compression module is further configured to: replace the first sequence with the comparison result of the first sequence included in the first comparison item.
[0061] In a sixth aspect, a data decompression apparatus is provided, comprising an acquisition module and a decompression module. The acquisition module is used to acquire joint compressed data of fast sequence quality (FastQ) data and binary alignment / map (BAM) data, the joint compressed data including multiple sequences, the position information of the multiple sequences in a reference genome, and the correspondence between the multiple sequences and multiple alignment terms in the BAM data. The decompression module is used to decompress the joint compressed data according to the correspondence between the multiple sequences and the multiple alignment terms to obtain the FastQ data and / or the BAM data.
[0062] In conjunction with the sixth aspect, in some implementations of the sixth aspect, the jointly compressed data includes multiple comparison items of the BAM data and the correspondence, wherein the first comparison item among the multiple comparison items includes the first sequence among the multiple sequences and the position information of the first sequence in the reference genome, and the decompression module is specifically used to: obtain the first sequence from the jointly compressed data according to the correspondence between the first sequence and the first comparison item to obtain the FastQ data.
[0063] In conjunction with the sixth aspect, in some implementations of the sixth aspect, the correspondence is recorded in the first comparison item of the jointly compressed data.
[0064] In conjunction with the sixth aspect, in some implementations of the sixth aspect, the joint compressed data includes the FastQ data, multiple compressed comparison terms corresponding to multiple sequences in the FastQ data and the correspondence, the first compressed comparison term corresponding to the first sequence includes the position information of the first sequence in the reference genome and does not include the first sequence, and according to the correspondence between the first sequence in the multiple sequences and the first comparison term, the decompression module is specifically used to: fill the first sequence into the first compressed comparison term to obtain the first comparison term, so as to obtain the BAM data.
[0065] In conjunction with the sixth aspect, in some implementations of the sixth aspect, the correspondence is recorded in the first compressed comparison item of the joint compressed data.
[0066] In conjunction with the sixth aspect, in some implementations of the sixth aspect, the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
[0067] In conjunction with the sixth aspect, in some implementations of the sixth aspect, the sequence content included in the plurality of sequences is represented by the position information of the sequence content in the reference genome, and the decompression module is further configured to: obtain the sequence content of the first sequence from the reference genome based on the position information of the sequence content of the first sequence of the plurality of sequences in the reference genome.
[0068] It should be understood that for the beneficial effects of the fourth aspect and its various implementations, please refer to the first aspect and its various implementations; for the beneficial effects of the fifth aspect and its various implementations, please refer to the second aspect and its various implementations; and for the beneficial effects of the sixth aspect and its various implementations, please refer to the third aspect and its various implementations. These will not be elaborated upon here.
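As a non-limiting illustration of the decompression behavior described for the sixth aspect, the following minimal Python sketch restores the sequence content of a read from its position in the reference genome and fills that sequence into a compressed comparison item. The toy reference string, the dictionary layout, and the names `restore_sequence` and `fill_alignment_item` are assumptions made for illustration only and are not defined by this application; the sketch also assumes the read matches the reference exactly.

```python
from typing import Dict, Tuple

REFERENCE = "TTGACCGTAAGC"  # hypothetical toy reference genome


def restore_sequence(reference: str, position: Tuple[int, int]) -> str:
    """Recover sequence content from (start, length) in the reference genome."""
    start, length = position
    if start < 0 or start + length > len(reference):
        raise ValueError("position lies outside the reference genome")
    return reference[start:start + length]


def fill_alignment_item(compressed_item: Dict[str, object],
                        sequence: str) -> Dict[str, object]:
    """Return a copy of the compressed comparison item with the sequence filled in."""
    restored = dict(compressed_item)  # leave the compressed item untouched
    restored["sequence"] = sequence
    return restored


# A compressed comparison item keeps the position information but not the sequence.
compressed_item = {"read_id": "read1", "ref_start": 4, "ref_length": 5}
sequence = restore_sequence(REFERENCE, (4, 5))
alignment1 = fill_alignment_item(compressed_item, sequence)
```

In this sketch the correspondence is trivial because only one read and one comparison item exist; with multiple reads, the recorded correspondence would select which sequence is filled into which item.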
[0069] A seventh aspect provides a storage system including a controller and a memory, the controller being configured to execute a method in the first aspect or any possible implementation thereof, or to execute a method in the second aspect or any possible implementation thereof, to store compressed data in the memory; or the controller being configured to execute a method in the third aspect or any possible implementation thereof, to retrieve compressed data from the memory.
[0070] In an eighth aspect, a computing device is provided, the computing device including a processor and a memory; the processor is configured to execute instructions stored in the memory such that the computing device performs a method of the first aspect or any possible implementation thereof, or performs a method of the second aspect or any possible implementation thereof, or performs a method of the third aspect or any possible implementation thereof.
[0071] Optionally, the processor can be a general-purpose processor, which can be implemented in hardware or software. When implemented in hardware, the processor can be a logic circuit, integrated circuit, etc.; when implemented in software, the processor can be a general-purpose processor that reads software code stored in memory. This memory can be integrated into the processor or located outside the processor and exist independently.
[0072] As an example, the computing device may be a controller in the storage system, or it may be a computing-capable device connected to the storage system; this application does not specifically limit this. The computing-capable device may be an application server connected to the storage system, or it may be a third-party computing device connected to the storage system.
[0073] In a ninth aspect, a computer program product comprising instructions is provided, which, when executed by a computing device, causes the computing device to perform a method as described in the first aspect and any implementation thereof, or a method as described in the second aspect and any implementation thereof, or a method as described in the third aspect and any implementation thereof.
[0074] In a tenth aspect, a computer-readable storage medium is provided, including computer program instructions that, when executed by a computing device, enable the computing device to perform a method as described in the first aspect and any implementation thereof, or to perform a method as described in the second aspect and any implementation thereof, or to perform a method as described in the third aspect and any implementation thereof.
[0075] As examples, such computer-readable storage media include, but are not limited to, one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard disk drive.
[0076] Alternatively, as one implementation method, the aforementioned storage medium can specifically be a non-volatile storage medium.
Brief Description of the Drawings
[0077] Figure 1 is a schematic flowchart of a data compression method provided in an embodiment of this application.
[0078] Figure 2 is a schematic diagram of a data compression process provided in an embodiment of this application.
[0079] Figure 3 is a schematic diagram of another data compression process provided in an embodiment of this application.
[0080] Figure 4 is a schematic flowchart of another data compression method provided in an embodiment of this application.
[0081] Figure 5 is a schematic diagram of another data compression process provided in an embodiment of this application.
[0082] Figure 6 is a schematic diagram of another data compression process provided in an embodiment of this application.
[0083] Figure 7 is a schematic flowchart of a data decompression method provided in an embodiment of this application.
[0084] Figure 8 is a schematic block diagram of a data compression apparatus 800 provided in an embodiment of this application.
[0085] Figure 9 is a schematic block diagram of a data decompression apparatus 900 provided in an embodiment of this application.
[0086] Figure 10 is a schematic diagram of a system architecture provided in an embodiment of this application.
[0087] Figure 11 is a schematic diagram of a centralized storage system applicable to an embodiment of this application.
[0088] Figure 12 is a schematic diagram of a distributed storage system provided in an embodiment of this application.
Detailed Description
[0089] The technical solutions in this application will now be described with reference to the accompanying drawings.
[0090] This application will present various aspects, embodiments, or features relating to systems comprising multiple devices, components, modules, etc. It should be understood and appreciated that individual systems may include additional devices, components, modules, etc., and / or may not include all devices, components, modules, etc. discussed in conjunction with the accompanying drawings. Furthermore, combinations of these approaches are also possible.
[0091] Furthermore, in the embodiments of this application, the words "exemplary," "for example," etc., are used to indicate that they are examples, illustrations, or descriptions. Any embodiment or design scheme described as "exemplary" in this application should not be construed as being more preferred or advantageous than other embodiments or design schemes. Specifically, the use of the term "exemplary" is intended to present the concept in a concrete manner.
[0092] In the embodiments of this application, "corresponding" and "correspond to" can sometimes be used interchangeably. It should be noted that when the distinction is not emphasized, their intended meanings are consistent.
[0093] The business scenarios described in the embodiments of this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. As those skilled in the art will know, with the evolution of network architecture and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
[0094] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0095] In this application, "at least one" means one or more, and "multiple" means two or more. "And / or" describes the relationship between related objects and indicates that three relationships can exist. For example, A and / or B can mean: A alone, both A and B, or B alone, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.
[0096] For ease of description, the basic concepts involved in the embodiments of this application will be described in detail below.
[0097] 1. Base
[0098] A base is a biochemical term, short for "basic group," and can be understood as the basic unit that makes up deoxyribonucleic acid (DNA). There are four types of bases in a DNA sequence: adenine (A), guanine (G), cytosine (C), and thymine (T). A base can be represented by one of these four characters; therefore, a gene sequence can be understood as a string (or character sequence) composed of these four characters. Sequencing can be understood as measuring the base sequence information contained in a gene fragment.
[0099] 2. Short read sequence
[0100] A read, also known as a short read, refers to the sequence information of a DNA fragment obtained using sequencing technology. These DNA fragments are relatively short in length, hence the name "short read." For example, in next-generation sequencing (NGS), a short read typically contains 30-200 bases (i.e., 30-200 characters). Considering actual measurement errors, bases that cannot be effectively measured are represented by "N" in the short read.
[0101] 3. Fast sequence quality (FastQ) file
[0102] FastQ files are typically used to store raw data output from next-generation sequencers. They are plain text files with the extension "*.fastq" or "*.fq". A FastQ file can be understood as a collection of short reads obtained from measurements. The FastQ file contains an identifier (ID) (or name) for each read, the sequence string contained within each read (a string of 30-200 characters), and a quality score sequence (of the same length as the sequence string contained within each read, used to describe the measurement confidence score for each base contained in the corresponding sequence string).
[0103] FastQ files are typically compressed using Gzip, a common lossless compression tool, to obtain FastQ.gz files (with the extension "*.fastq.gz" or "*.fq.gz"). Users typically save these as the original sequencing files. FastQ.gz files and FastQ files differ only in their storage format; their actual information is identical. Therefore, they are considered equivalent in this application's description.
[0104] Furthermore, single-end sequencing generates a single FastQ.gz file from the original sample, while paired-end sequencing generates a pair of FastQ.gz files. These FastQ.gz files will serve as input information for subsequent gene analysis procedures. Paired-end sequencing is typically the more common choice.
[0105] 4. Reference genome
[0106] Except for metagenomic data, a single FastQ dataset or a pair of FastQ datasets generally corresponds to a definite species classification (common examples include humans, mice, and japonica rice). A specific species corresponds to a reference genome, similar to a standard genome, which contains relatively fixed sequences.
[0107] 5. Read mapping
[0108] Alignment refers to the process of comparing sequenced reads with a reference genome using specific algorithms and software to determine the precise location of these reads within the genome. From a string perspective, reads can be understood as substrings, and the reference genome as a reference string; alignment is the process of performing a fuzzy match between the substring and the reference string.
[0109] 6. Sequence alignment / map (SAM) file
[0110] SAM files are a file format used to store the results of aligning sequencing reads with a reference genome. In other words, a SAM file is a text file obtained by aligning a FastQ file or a pair of FastQ files (which can be understood as a set of reads) with a reference genome; the result may additionally undergo post-processing such as sorting or deduplication. SAM files have the extension "*.sam".
[0111] For example, after a read in a FastQ file is mapped, a comparison item (also called an alignment item) is generated. An alignment item is a single line containing specific information about the read (e.g., the read's ID, the sequence string within the read, and the quality score sequence) and the alignment result for that read. A SAM file can be understood as a collection of multiple alignment items. For the read mapping process, the input is one FastQ file or a pair of FastQ files, and the output is a SAM file.
[0112] 7. Binary alignment / map (BAM) files
[0113] Because SAM files are in text format, they occupy a large amount of storage space and are slow to process. Therefore, in practical applications, SAM files are usually converted to BAM files. BAM files are compressed versions of SAM files in binary format, with the extension "*.bam". BAM files occupy less storage space and can be processed faster, while retaining all the information from the SAM file.
[0114] Parsing a BAM file generally requires first decompressing it to restore it to a SAM file. For example, "a.sam" and "a.bam" actually contain equivalent information, only their storage formats are different. Therefore, in the description of this application, BAM files and SAM files are considered equivalent.
[0115] Generally, the BAM file generated by the read mapping process is called the primary BAM file. Following standard gene analysis procedures, the primary BAM file undergoes post-processing such as sorting and duplicate marking to obtain the secondary BAM file. For example, all alignment items in the primary BAM file are globally sorted according to certain rules (e.g., the positions of the reads in the alignment items relative to the reference genome) to obtain the secondary BAM file. Typically, users do not save the primary BAM file but instead save the secondary BAM file.
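As a non-limiting illustration of the global sorting step, the following minimal Python sketch orders alignment items by their position relative to the reference genome. The `AlignmentItem` fields are hypothetical, the ordering of reference names is assumed to be plain lexicographic order, and duplicate marking is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlignmentItem:
    read_id: str
    ref_name: str   # name of the reference subsequence (hypothetical field)
    ref_start: int  # start position of the read on that reference subsequence


def sort_by_reference_position(items: List[AlignmentItem]) -> List[AlignmentItem]:
    """Return the items ordered by reference name, then by start position."""
    return sorted(items, key=lambda item: (item.ref_name, item.ref_start))


primary = [
    AlignmentItem("read1", "chr2", 40),
    AlignmentItem("read2", "chr1", 900),
    AlignmentItem("read3", "chr1", 15),
]
secondary = sort_by_reference_position(primary)
```

Because the sort reorders the items, the order of alignment items in the secondary BAM file generally no longer matches the order of reads in the FastQ file.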
[0116] 8. Compression ratio
[0117] In this application, compression ratio refers to the quotient of the original data size divided by the compressed data size, and only lossless compression is discussed. A higher compression ratio results in a higher data reduction rate and better algorithm performance. The joint compression discussed in this application calculates its compression ratio as (FastQ.gz file size + BAM file size) / compressed single file size.
[0118] 9. Compression performance
[0119] In this application, compression performance refers to the processing speed of the compression algorithm during compression or decompression, typically expressed in MB / s. Compression performance equals the original file size divided by the compression time, and decompression performance equals the decompressed file size divided by the decompression time. Higher compression performance indicates a faster algorithm and better results.
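As a non-limiting illustration of the two metrics defined above, the following minimal Python sketch computes the joint compression ratio and the compression performance. The function names and the file sizes are hypothetical, and 1 MB is assumed to be 10^6 bytes, which this application does not specify.

```python
def joint_compression_ratio(fastq_gz_bytes: int, bam_bytes: int,
                            joint_bytes: int) -> float:
    """(FastQ.gz file size + BAM file size) / compressed single file size."""
    if joint_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return (fastq_gz_bytes + bam_bytes) / joint_bytes


def throughput_mb_per_s(size_bytes: int, seconds: float) -> float:
    """Processed size divided by elapsed time, in MB / s (1 MB = 10**6 bytes)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return size_bytes / 1_000_000 / seconds


# Hypothetical sizes: a 4 GB FastQ.gz file and a 6 GB BAM file jointly
# compressed into a single 5 GB file.
ratio = joint_compression_ratio(4_000_000_000, 6_000_000_000, 5_000_000_000)

# Hypothetical timing: 500 MB of original data compressed in 50 seconds.
speed = throughput_mb_per_s(500_000_000, 50.0)
```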
[0120] 10. FastQ Reference Compression
[0121] FastQ data compression algorithms are generally divided into two main categories: no-reference (direct) compression and reference compression. Reference compression applies only to the sequence content within a read; the ID and quality score sequence of a read in FastQ data are compressed without a reference.
[0122] It should be understood that, generally, for a given species, there is a corresponding reference genome. When performing reference compression on the sequence content of a read, the process is similar to the read mapping described above: the sequence content of the read is compared with the reference genome to determine its precise location on the reference genome, and the sequence content of the read is replaced with its location on the reference genome (e.g., byte offset and length). Reference compression can improve the compression ratio by at least 50% compared to reference-free compression.
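As a non-limiting illustration of replacing sequence content with its location on the reference genome, the following minimal Python sketch uses an exact substring search as a stand-in for read mapping. A real aligner performs fuzzy matching and records differences from the reference; the exact-match assumption, the toy reference string, and the function names are hypothetical.

```python
from typing import Optional, Tuple

Position = Tuple[int, int]  # (offset in the reference genome, length)


def reference_compress(sequence: str, reference: str) -> Optional[Position]:
    """Replace sequence content by its location, or return None if not found."""
    offset = reference.find(sequence)  # exact match only, for illustration
    if offset < 0:
        return None
    return offset, len(sequence)


def reference_decompress(position: Position, reference: str) -> str:
    """Recover the sequence content from its location in the reference genome."""
    offset, length = position
    return reference[offset:offset + length]


reference = "TTGACCGTAAGC"  # hypothetical toy reference genome
position = reference_compress("CCGTA", reference)
restored = reference_decompress(position, reference)
missing = reference_compress("GGGGG", reference)
```

The search over the reference is the costly part of this procedure, since it has to be repeated for every read.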
[0123] In the era of rapid development in precision medicine, gene sequencing, as one of the popular technologies favored for personalized diagnosis and treatment, is experiencing an ever-expanding data volume. The stock of human genome sequencing data is projected to exceed 120 exabytes (EB) by 2025. Guidelines from the American College of Medical Genetics and Genomics (ACMGG) recommend that FastQ and BAM files be retained for at least two years, with most requiring retention for decades or even indefinitely. Therefore, there is a very substantial market for gene data storage, with next-generation sequencing (NGS) accounting for over 90% of the entire sequencing market.
[0124] Currently, the data compression ratio is a major concern for customers: a higher compression ratio results in lower storage costs and is therefore more attractive to them. In the storage of next-generation sequencing data, FastQ data and BAM data account for the largest proportion. In existing technologies, FastQ data and BAM data are each deeply compressed (also known as recompressed) with their own commercial lossless compression algorithms, and these algorithms are independent of each other. Recompression of FastQ data uses reference compression, which achieves a higher compression ratio and thus lower storage costs for customers. However, reference compression of FastQ data requires an alignment process, resulting in poor compression performance. For example, single-threaded compression performance is typically less than 10 MB / s, which fails to meet the reduction performance requirements of commercial storage.
[0125] In view of this, embodiments of this application provide a data compression method that can improve the compression ratio and compression performance of FastQ data.
[0126] The following describes in detail a data compression method provided by an embodiment of this application, with reference to Figure 1. It should be understood that the examples in Figure 1 are merely to help those skilled in the art understand the embodiments of this application, and are not intended to limit the embodiments to the specific values or scenarios illustrated in Figure 1. Those skilled in the art can obviously make various equivalent modifications or variations based on the examples given below in Figure 1, and such modifications and variations also fall within the scope of the embodiments of this application.
[0127] Figure 1 is a schematic flowchart of a data compression method provided in an embodiment of this application. As shown in Figure 1, the method may include steps 110-130, which will be described in detail below.
[0128] Step 110: Obtain FastQ data and BAM data.
[0129] In this embodiment of the application, FastQ data and BAM data are first obtained.
[0130] There are multiple ways to obtain FastQ data and BAM data, and this application does not limit the specific implementation. The following examples illustrate how FastQ data and BAM data are obtained in different scenarios.
[0131] Implementation method 1 (scenario 1): Receive FastQ data and BAM data.
[0132] In scenario 1 above, FastQ data and BAM data can be received directly.
[0133] For example, the FastQ data described above may include multiple sequences. Specifically, each sequence may include the sequence ID (or name), the sequence content, and the corresponding quality score sequence.
[0134] It should be understood that the aforementioned FastQ data can refer to plain text FastQ data, or it can be compressed data in FastQ.gz format obtained by losslessly compressing the plain text FastQ data. This application does not specifically limit this. For example, current gene sequencers from various manufacturers have built-in Gzip compression functions or provide related software, so the original plain text FastQ data can be losslessly compressed using the universal lossless compression format Gzip. Typically, this universal Gzip tool can be used to decompress the FastQ.gz format compressed data into plain text FastQ data.
[0135] The embodiments of this application do not specifically limit the number of sequences included in the FastQ data. For ease of description, Figure 2 illustrates an example of FastQ data including two sequences.
[0136] For example, as shown in Figure 2, assume the received FastQ data includes two reads, read1 and read2. Read1 includes its ID, sequence content, and corresponding quality score sequence. Read2 includes its ID, sequence content, and corresponding quality score sequence.
[0137] For example, the BAM data mentioned above may include multiple comparison items (also called alignment items) corresponding to the multiple sequences mentioned above. Each alignment item includes the aligned sequence (including the ID of the aligned sequence, the sequence content of the aligned sequence, and the quality score sequence corresponding to the aligned sequence) and the alignment result of the aligned sequence.
[0138] It should be understood that the BAM data mentioned above corresponds to the FastQ data, and the number of comparison terms included in the BAM data is the same as the number of sequences included in the FastQ data.
[0139] For ease of description, Figure 2 uses FastQ data containing two sequences and BAM data containing two alignment items as an example.
[0140] For example, as shown in Figure 2, assume that the received BAM data includes two alignment items, namely alignment1 and alignment2.
[0141] For example, alignment1 above includes read1 and the first alignment result of read1, wherein read1 includes the ID of read1, the sequence content of read1, and the quality score sequence corresponding to read1. The first alignment result corresponding to read1 includes the second alignment result of read1 and other alignment information of read1, wherein the second alignment result of read1 includes the position information of the sequence content of read1 in the reference genome (e.g., start position and length), and the other alignment information of read1 includes, but is not limited to: the confidence level of the second alignment result of read1, the name of the reference subsequence of the paired (or next) read of read1, and the position information of the paired (or next) read of read1 in the reference genome.
[0142] Optionally, the location information of the sequence content in read1 in the reference genome may also include at least one of the following: the editing information of read1 (the difference between the sequence content of read1 and the reference genome), the marker position of the second comparison result of read1, and the name of the reference genome.
[0143] For example, alignment2 above includes read2 and the first alignment result of read2, wherein read2 includes the ID of read2, the sequence content of read2, and the quality score sequence corresponding to read2. The first alignment result corresponding to read2 includes the second alignment result of read2 and other alignment information of read2, wherein the second alignment result of read2 includes the position information of the sequence content of read2 in the reference genome (e.g., start position and length), and the other alignment information of read2 includes, but is not limited to: the confidence level of the second alignment result of read2, the name of the reference subsequence of the paired (or next) read of read2, and the position information of the paired (or next) read of read2 in the reference genome.
[0144] Optionally, the location information of the sequence content in read2 in the reference genome may also include at least one of the following: the editing information of read2 (the difference between the read2 sequence content and the reference genome), the marker position of the second comparison result of read2, and the name of the reference genome.
[0145] In this embodiment of the application, the second comparison result described above can also be referred to as the sequence comparison result.
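As a non-limiting illustration of the nesting described above (an alignment item contains the read and its first alignment result, which in turn contains the second alignment result and other alignment information), the following minimal Python sketch models these items as data classes. All class names, field names, and example values are hypothetical and do not reflect the actual BAM record layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Read:
    read_id: str
    sequence: str
    quality: str


@dataclass(frozen=True)
class SequenceAlignmentResult:
    """The "second alignment result": position of the read in the reference."""
    ref_start: int
    ref_length: int
    edits: Optional[str] = None     # differences from the reference (optional)
    marker: Optional[int] = None    # marker position (optional)
    ref_name: Optional[str] = None  # name of the reference genome (optional)


@dataclass(frozen=True)
class AlignmentResult:
    """The "first alignment result": second result plus other information."""
    sequence_result: SequenceAlignmentResult
    confidence: Optional[int] = None      # confidence of the second result
    mate_ref_name: Optional[str] = None   # reference subsequence of paired read
    mate_ref_start: Optional[int] = None  # position of the paired read


@dataclass(frozen=True)
class AlignmentItem:
    read: Read
    result: AlignmentResult


read1 = Read("read1", "CCGTA", "IIIII")
alignment1 = AlignmentItem(
    read=read1,
    result=AlignmentResult(
        sequence_result=SequenceAlignmentResult(ref_start=4, ref_length=5),
        confidence=60,
    ),
)
```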
[0146] Implementation Method 2 (Scenario 2): Receive FastQ data. In this embodiment of the application, BAM data is generated based on the FastQ data.
[0147] In the above implementation method 2, FastQ data can be received, and an alignment operation can be performed on each sequence contained in the received FastQ data to obtain BAM data.
[0148] For example, each sequence in the received FastQ data can be aligned with the corresponding reference genome, and the alignment terms for each sequence can be obtained. These alignment terms constitute the BAM data mentioned above.
[0149] For example, taking the received FastQ data as including two reads (read1 and read2), as shown in Figure 3, the generated BAM data includes two alignments, alignment1 and alignment2. The specific implementation process of generating BAM data from FastQ data is described in detail below.
[0150] For example, an alignment operation is performed on read1 in the FastQ data to generate alignment1 in the BAM data. Alignment1 includes read1 and the first alignment result of read1, where read1 includes the ID of read1, the sequence content of read1, and the quality score sequence corresponding to read1.
[0151] For example, an alignment operation is performed on read2 in the FastQ data to generate alignment2 in the BAM data. Alignment2 includes read2 and the first alignment result of read2, where read2 includes the ID of read2, the sequence content of read2, and the quality score sequence corresponding to read2.
[0152] Step 120: Determine the first comparison term corresponding to the first sequence among multiple sequences.
[0153] In this embodiment of the application, after obtaining FastQ data and BAM data, the first comparison term corresponding to the first sequence among multiple sequences of FastQ data can be determined.
[0154] It should be understood that the aforementioned first comparison item is one of multiple comparison items in the BAM data, and this first comparison item includes the comparison result of the first sequence. The comparison result of the first sequence includes the location information of the first sequence in the reference genome.
[0155] There are multiple ways to determine the first comparison term corresponding to the first sequence among multiple sequences of FastQ data, and the embodiments of this application do not specifically limit this.
[0156] As an example, the following uses implementation method 1 above as an example to illustrate different implementation methods for determining the first comparison term corresponding to the first sequence among multiple sequences of FastQ data.
[0157] Example 1: The identifier of the first sequence in the FastQ data can be compared with the identifier of the sequence contained in each of the multiple comparison items in the BAM data. Based on the comparison between the identifier of the first sequence and the identifier of the sequence contained in each of the multiple comparison items in the BAM data, it can be determined that the comparison item corresponding to the first sequence is the first comparison item among the multiple comparison items in the BAM data.
[0158] For example, taking read1 in FastQ data as the first sequence, and alignment1-alignment2 as the multiple comparison items in BAM data, we can compare the ID of read1 with the identifiers of reads included in alignment1-alignment2 respectively. Assuming that read1 includes the ID of read1, and alignment1 also includes the ID of read1, alignment1 can be determined as the first comparison item corresponding to read1 in FastQ data.
[0159] Example 2: The sequence content of the first sequence in the FastQ data can be compared with the sequence content of the sequence contained in each of the multiple comparison items in the BAM data. Based on the comparison results of the sequence content of the first sequence with the sequence content of the sequence contained in each of the multiple comparison items in the BAM data, it can be determined that the comparison item corresponding to the first sequence is the first comparison item among the multiple comparison items in the BAM data.
[0160] For example, taking the first sequence as read1 in FastQ data, and the multiple comparison items in BAM data including alignment1-alignment2, we can compare the sequence content of read1 with the sequence content of read included in alignment1-alignment2 respectively. Assuming that read1 includes the sequence content of read1, and alignment1 also includes the sequence content of read1, alignment1 can be determined as the first comparison item corresponding to read1 in FastQ data.
[0161] Example 3: The quality score sequence corresponding to the first sequence in the FastQ data can be compared with the quality score sequences corresponding to the sequences contained in each of the multiple comparison items in the BAM data. Based on the comparison between the quality score sequence corresponding to the first sequence and the quality score sequences corresponding to the sequences contained in each of the multiple comparison items in the BAM data, it can be determined that the comparison item corresponding to the first sequence is the first comparison item among the multiple comparison items in the BAM data.
[0162] For example, taking the first sequence as read1 in the FastQ data, and the multiple comparison items in the BAM data including alignment1-alignment2, we can compare the quality score sequence corresponding to read1 with the quality score sequences corresponding to the reads included in alignment1-alignment2 respectively. Assuming that read1 includes the quality score sequence corresponding to read1, and alignment1 also includes the quality score sequence corresponding to read1, alignment1 can be determined as the first comparison item corresponding to read1 in the FastQ data.
[0163] It should be noted that any one of Examples 1-3 above can be executed individually, or at least two of them can be executed. This application does not specifically limit this.
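As a non-limiting illustration of Examples 1-3, the following minimal Python sketch finds the comparison item for a read by comparing the ID, the sequence content, the quality score sequence, or any combination of them. The data classes, the `fields` parameter, and the first-match policy are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Read:
    read_id: str
    sequence: str
    quality: str


@dataclass(frozen=True)
class AlignmentItem:
    read: Read
    ref_start: int
    ref_length: int


# Maps a selectable field name to the attribute compared on both sides.
FIELDS = {"id": "read_id", "sequence": "sequence", "quality": "quality"}


def find_alignment_item(read: Read, items: List[AlignmentItem],
                        fields: Sequence[str] = ("id",)) -> Optional[AlignmentItem]:
    """Return the first item whose read agrees with `read` on all chosen fields."""
    for item in items:
        if all(getattr(read, FIELDS[f]) == getattr(item.read, FIELDS[f])
               for f in fields):
            return item
    return None


read1 = Read("read1", "CCGTA", "IIIII")
alignment1 = AlignmentItem(Read("read1", "CCGTA", "IIIII"), 4, 5)
alignment2 = AlignmentItem(Read("read2", "AAGC", "FFFF"), 8, 4)
items = [alignment2, alignment1]

by_id = find_alignment_item(read1, items)                                  # Example 1
by_content = find_alignment_item(read1, items, ("sequence", "quality"))    # Examples 2 and 3
unknown = find_alignment_item(Read("read9", "TTTT", "!!!!"), items)
```

Combining several fields reduces the chance of an ambiguous match when, for instance, two reads happen to share the same sequence content.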
[0164] Another example, using the above implementation method 2 as an example, is given below to illustrate how to determine the first comparison term corresponding to the first sequence among multiple sequences of FastQ data.
[0165] In the above implementation method 2, during the process of generating BAM data from FastQ data, the correspondence between multiple sequences in FastQ data and multiple comparison items in BAM data can be determined, and the comparison item corresponding to the first sequence can be determined as the first comparison item among multiple comparison items based on the correspondence between multiple sequences and multiple comparison items.
[0166] In this embodiment of the application, during the process of aligning each sequence in the FastQ data based on the reference genome and obtaining the corresponding alignment item in the BAM data, the correspondence between the alignment item in the BAM data and the sequence in the FastQ data can also be recorded incidentally. Since the correspondence between the alignment item in the BAM data and the sequence in the FastQ data is recorded incidentally during the process of obtaining the alignment item in the BAM data, this operation of recording the correspondence can be implemented with almost no resource consumption.
[0167] For example, the following describes the specific implementation process of determining the correspondence between multiple sequences in FastQ data and multiple comparison items in BAM data, taking FastQ data including two reads (read1 and read2) and BAM data including two alignments (alignment1 and alignment2) as an example.
[0168] For example, in implementation 2 above, since alignment1 in the BAM data is obtained by performing a comparison operation on read1 in the FastQ data in this embodiment, it can be determined that read1 in the FastQ data corresponds to alignment1 in the BAM data. Similarly, in implementation 2 above, since alignment2 in the BAM data is obtained by performing a comparison operation on read2 in the FastQ data, it can be determined that read2 in the FastQ data corresponds to alignment2 in the BAM data.
[0169] For example, taking read1 from the FastQ data as the first sequence mentioned above, based on the established correspondence, it can be determined that read1 corresponds to alignment1 in the BAM data. That is, alignment1 can be identified as the first comparison item corresponding to read1 in the FastQ data.
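As an illustration only, recording the correspondence as a by-product of alignment can be sketched in Python as follows. This is not the implementation of this application: the `Read` and `Alignment` records, the function `align_and_record`, and the use of an exact substring search in place of a real aligner are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Read:
    read_id: str
    seq: str
    qual: str


@dataclass
class Alignment:
    read_id: str
    seq: str
    qual: str
    ref_pos: int  # 0-based position in the reference; -1 if not found


def align_and_record(reads, reference):
    """Align each read and record read index -> alignment index on the way."""
    alignments = []
    correspondence = {}
    for read_index, read in enumerate(reads):
        # Toy aligner (assumption): exact substring search, no mismatches.
        pos = reference.find(read.seq)
        alignments.append(Alignment(read.read_id, read.seq, read.qual, pos))
        # The correspondence is known at this point at no extra search cost.
        correspondence[read_index] = len(alignments) - 1
    return alignments, correspondence


reference = "ACGTACGTTTGACCA"
reads = [Read("read1", "ACGTT", "IIIII"), Read("read2", "GACCA", "IIIII")]
alignments, correspondence = align_and_record(reads, reference)
```

In this sketch, looking up `correspondence[0]` yields the index of alignment1 for read1 without any later search over the BAM data.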
[0170] Step 130: Based on the comparison results of the first sequence included in the first comparison item, perform reference compression on the first sequence to obtain compressed FastQ data.
[0171] In this embodiment of the application, after the comparison item corresponding to the first sequence is determined to be the first comparison item, the first sequence can be compressed according to the position information of the first sequence in the reference genome included in the first comparison item, to obtain compressed data of the FastQ data (also referred to as compressed FastQ data).
[0172] For example, the first sequence can be replaced with the position information of the first sequence in the reference genome included in the first comparison item, to obtain the compressed data of the FastQ data.
[0173] Specifically, the position information of the first sequence in the reference genome included in the first comparison item is used instead of the sequence content of the first sequence. For example, the first sequence in the compressed data of FastQ data includes the ID of the first sequence, the position information of the first sequence in the reference genome, and the quality score sequence corresponding to the first sequence.
[0174] For example, take the first sequence as read1 (including read1's ID, read1's sequence content, and read1's corresponding quality score sequence) and the first comparison item as alignment1 (including read1's ID, read1's sequence content, read1's corresponding quality score sequence, and read1's first comparison result, which in turn includes read1's second comparison result). As shown in Figure 2 or Figure 3, the second comparison result of read1 can replace the sequence content of read1 to obtain the compressed FastQ data. The compressed FastQ data includes the compressed read1, which includes read1's ID, read1's second comparison result, and read1's corresponding quality score sequence.
[0175] Similarly, as shown in Figure 2 or Figure 3, the compressed data of the FastQ data also includes a compressed read2, which includes the ID of read2, the second comparison result of read2, and the quality score sequence corresponding to read2.
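A minimal Python sketch of this replacement is given below for illustration. The dictionary layout, the field names `ref_pos`, `length`, and `mapq`, and the function `reference_compress` are hypothetical. The sketch also assumes that the read matches the reference exactly, so a position and a length are enough to stand in for the second comparison result; real comparison results generally carry more information.

```python
def reference_compress(read, alignment):
    """Replace the read's sequence content with position information
    taken from its alignment (exact match assumed)."""
    return {
        "id": read["id"],
        "ref_pos": alignment["ref_pos"],  # stands in for the second comparison result
        "length": len(read["seq"]),
        "qual": read["qual"],
    }


read1 = {"id": "read1", "seq": "ACGTT", "qual": "IIIII"}
# "mapq" is a hypothetical placeholder for the rest of the first comparison result.
alignment1 = {"id": "read1", "seq": "ACGTT", "qual": "IIIII",
              "ref_pos": 4, "mapq": 60}

compressed_read1 = reference_compress(read1, alignment1)
```

The compressed read keeps the ID and the quality score sequence and no longer stores the bases themselves.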
[0176] In the above technical solution, during the reference compression of FastQ data, the comparison item corresponding to each sequence in the FastQ data can be determined. The position in the reference genome of the sequence contained in each comparison item of the BAM data is then used directly for reference compression of the corresponding sequence in the FastQ data. In this way, reference compression improves the compression ratio of the FastQ data, while the alignment operation against the reference genome is omitted during compression, which improves the compression performance.
[0177] Optionally, in some embodiments, the compressed data of BAM data and the aforementioned FastQ data can be jointly compressed to obtain jointly compressed data of BAM data and FastQ data, thereby further improving the data compression rate.
[0178] In this embodiment of the application, the aforementioned combined compressed data includes compressed FastQ data and multiple compressed comparison entries corresponding to multiple sequences in the FastQ data. For example, the multiple sequences in the FastQ data include a first sequence, the multiple compressed comparison entries include a first compressed comparison entry, and the first compressed comparison entry corresponding to the first sequence includes the position information of the first sequence in the reference genome, but does not include the first sequence itself.
[0179] For example, taking implementation method 1 (scenario 1) as described above, since the BAM data is generated externally, the first comparison term in the BAM data not only contains the first sequence but also its position information in the reference genome. Therefore, the first sequence contained in the first comparison term can be removed to obtain the first compressed comparison term corresponding to the first sequence.
[0180] Another example is taken from implementation method 2 (scenario 2) above. Since the BAM data is generated from FastQ data in this application embodiment, the first comparison item of the generated BAM data may contain the first sequence and its position information in the reference genome, or it may contain the position information of the first sequence in the reference genome but not the first sequence. For example, if the first comparison item of the generated BAM data contains the first sequence and its position information in the reference genome, the first sequence contained in the first comparison item can be removed to obtain the first compressed comparison item corresponding to the first sequence. As another example, if the first comparison item of the generated BAM data contains the position information of the first sequence in the reference genome but does not contain the first sequence, this first comparison item is the first compressed comparison item mentioned above.
[0181] For example, consider compressed FastQ data including compressed read1 and compressed read2, and BAM data including alignment1 and alignment2. As shown in Figure 2 or Figure 3, in one example, the generated joint compressed data 1 of the BAM data and FastQ data includes compressed read1, compressed read2, compressed alignment1, and compressed alignment2. Specifically, compressed read1 includes the ID of read1, the second comparison result of read1, and the corresponding quality score sequence of read1; compressed read2 includes the ID of read2, the second comparison result of read2, and the corresponding quality score sequence of read2; compressed alignment1 includes the first comparison result of read1 (which does not include the second comparison result of read1); and compressed alignment2 includes the first comparison result of read2 (which does not include the second comparison result of read2). In another example, the generated joint compressed data 1 of the BAM data and FastQ data likewise includes compressed read1, compressed read2, compressed alignment1, and compressed alignment2, but the second comparison result is kept in the compressed alignments rather than in the compressed reads: compressed read1 includes the ID of read1 and the corresponding quality score sequence; compressed read2 includes the ID of read2 and the corresponding quality score sequence; compressed alignment1 includes the first comparison result of read1 (which contains the second comparison result of read1); and compressed alignment2 includes the first comparison result of read2 (which contains the second comparison result of read2).
[0182] In this embodiment of the application, the correspondence between the first comparison item and the first sequence can also be recorded in the joint compressed data of the BAM data and FastQ data. This correspondence is used to decompress the first compressed comparison item in the joint compressed data of BAM data and FastQ data.
[0183] For example, taking the first sequence as read1 and the first comparison item as alignment1, the correspondence between read1 and alignment1 can be recorded in the joint compressed data of BAM data and FastQ data.
[0184] This application does not specifically limit the location of the above correspondence in the joint compressed data. Several possible implementation methods are listed below.
[0185] In one possible implementation, the correspondence between the first comparison item and the first sequence can be recorded in the first compressed comparison item of the joint compressed data.
[0186] For example, the first compressed comparison term includes a required field and an optional field (or additional field). The position information of the first sequence in the reference genome can be recorded in the required field of the first compressed comparison term, and the correspondence between the first comparison term and the first sequence can be recorded as supplementary information in the optional field (or additional field) of the first compressed comparison term.
[0187] In the above implementation, the correspondence between the first comparison item and the first sequence may include the position information of the first sequence in the FastQ data.
[0188] For example, taking read1 from FastQ data as the first sequence, alignment1 from BAM data as the first comparison term, and compressed alignment1 (including the position information of read1 in the reference genome) as the first compressed comparison term, as shown in Figure 2 or Figure 3, the compressed alignment1 can also include the position information of read1 in FastQ data. For example, the offset of read1 in FastQ data can be recorded in the compressed alignment1.
[0189] Similarly, as shown in Figure 2 or Figure 3, the compressed alignment2 can also include the position information of read2 in the FastQ data. For example, the offset of read2 in the FastQ data can be recorded in the compressed alignment2.
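The following Python sketch illustrates one way the offset could be carried in an optional field of a compressed comparison item. The layout is an assumption made for the example: the keys `required`, `optional`, and `fastq_offset`, the helper names, and the use of byte offsets into plain four-line FastQ text are all hypothetical.

```python
def fastq_record_offsets(fastq_text):
    """Return the byte offset of each four-line record in FastQ text."""
    offsets = []
    position = 0
    lines = fastq_text.encode("ascii").splitlines(keepends=True)
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:  # a record starts every four lines
            offsets.append(position)
        position += len(line)
    return offsets


def compress_alignment(alignment, fastq_offset):
    """Drop the read from the alignment and note where the read lives."""
    required = {key: value for key, value in alignment.items()
                if key not in ("id", "seq", "qual")}
    return {"required": required, "optional": {"fastq_offset": fastq_offset}}


fastq_text = "@read1\nACGTT\n+\nIIIII\n@read2\nGACCA\n+\nIIIII\n"
alignment1 = {"id": "read1", "seq": "ACGTT", "qual": "IIIII", "ref_pos": 4}
alignment2 = {"id": "read2", "seq": "GACCA", "qual": "IIIII", "ref_pos": 10}

offsets = fastq_record_offsets(fastq_text)
compressed_alignment1 = compress_alignment(alignment1, offsets[0])
compressed_alignment2 = compress_alignment(alignment2, offsets[1])
```

Here the required part holds the position in the reference genome, and the optional part holds the offset that links the compressed comparison item back to its read.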
[0190] In another possible implementation, the correspondence between the first comparison item and the first sequence can be recorded separately in the joint compressed data.
[0191] In this implementation, the correspondence between the first comparison item and the first sequence may include the position information of the first sequence in the FastQ data and the position information of the first comparison item in the BAM data.
[0192] For example, taking read1 in FastQ data as the first sequence and alignment1 in BAM data as the first comparison item, the correspondence between the first comparison item and the first sequence, recorded separately, includes: the position information of read1 in FastQ data and the position information of alignment1 in BAM data. For example, the position information of alignment1 in BAM data could be the offset of alignment1 in BAM data.
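For illustration, a separately recorded correspondence can be sketched in Python as a list of offset pairs. The serialization used here (one JSON line per record) is only a stand-in for the real FastQ and BAM encodings, and the names `serialize_with_offsets`, `record_at`, and the `joint` layout are hypothetical.

```python
import json


def serialize_with_offsets(records):
    """Serialize records one per line; return the bytes and each record's offset."""
    blob = b""
    offsets = []
    for record in records:
        offsets.append(len(blob))
        blob += (json.dumps(record, sort_keys=True) + "\n").encode("utf-8")
    return blob, offsets


def record_at(blob, offset):
    """Read back the record that starts at the given byte offset."""
    end = blob.index(b"\n", offset)
    return json.loads(blob[offset:end])


reads = [{"id": "read1", "seq": "ACGTT"}, {"id": "read2", "seq": "GACCA"}]
alignments = [{"id": "read1", "ref_pos": 4}, {"id": "read2", "ref_pos": 10}]

fastq_blob, fastq_offsets = serialize_with_offsets(reads)
bam_blob, bam_offsets = serialize_with_offsets(alignments)

# Correspondence kept apart from both data streams:
# (offset of the read in the FastQ data, offset of the alignment in the BAM data).
correspondence = list(zip(fastq_offsets, bam_offsets))
joint = {"fastq": fastq_blob, "bam": bam_blob, "correspondence": correspondence}
```

Each pair in `correspondence` lets a decompressor jump from a read to its comparison item, or back, without scanning either stream.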
[0193] Optionally, this application also provides a data compression method, which can jointly compress FastQ data and BAM data, removing duplicate parts in FastQ data and BAM data to improve the compression ratio of FastQ data and BAM data.
[0194] Figure 4 is a schematic flowchart of another data compression method provided in an embodiment of this application. As shown in Figure 4, the method may include steps 410-430, which will be described in detail below.
[0195] Step 410: Obtain FastQ data and BAM data.
[0196] This step is similar to step 110. For details, refer to the description of step 110, which is not repeated here.
[0197] Step 420: Determine the first comparison term corresponding to the first sequence among multiple sequences.
[0198] This step is similar to step 120. For details, refer to the description of step 120, which is not repeated here.
[0199] Step 430: Remove the duplicate parts between the first sequence and the first comparison item to obtain the joint compressed data of FastQ data and BAM data.
[0200] In this application, the duplicate parts between the first sequence and the first comparison item can be removed to obtain the joint compressed data of FastQ data and BAM data.
[0201] In one possible implementation, the FastQ data includes a first sequence, and the first comparison entry in the BAM data includes the first sequence and its location information in the reference genome. The first sequence in the FastQ data can be removed, while the first comparison entry in the BAM data is retained, resulting in jointly compressed data of the FastQ and BAM data. That is, in this implementation, the jointly compressed data includes multiple comparison entries from the BAM data but excludes multiple sequences from the FastQ data.
[0202] For example, as shown in Figure 5, consider FastQ data including read1 and read2, and BAM data including alignment1 (including read1 and the first comparison result of read1) and alignment2 (including read2 and the first comparison result of read2). For instance, since alignment1 includes read1, read1 can be removed from the FastQ data to obtain the joint compressed data of FastQ and BAM data. This joint compressed data includes alignment1 from the BAM data but excludes read1 from the FastQ data. Similarly, since alignment2 includes read2, read2 can be removed from the FastQ data to obtain the joint compressed data of FastQ and BAM data. This joint compressed data includes alignment2 from the BAM data but excludes read2 from the FastQ data.
[0203] In the above implementation, the correspondence between the first comparison item and the first sequence can also be recorded in the joint compressed data of FastQ data and BAM data, wherein the correspondence is used to decompress the first sequence.
[0204] For example, the correspondence between the first comparison item and the first sequence can be recorded in the first comparison item of the jointly compressed data. For instance, as shown in Figure 5, alignment1 may also include the position information of read1 in the FastQ data (the offset of read1 in the FastQ data). alignment2 may also include the position information of read2 in the FastQ data (the offset of read2 in the FastQ data).
[0205] It should be understood that after read1 is obtained from alignment1 included in the jointly compressed data shown in Figure 5, the position of read1 in the FastQ data can be determined based on the position information of read1 in the FastQ data that is recorded in the jointly compressed data. Similarly, after read2 is obtained from alignment2 included in the jointly compressed data shown in Figure 5, the position of read2 in the FastQ data can be determined based on the position information of read2 in the FastQ data that is recorded in the jointly compressed data. Thus, the FastQ data can be reconstructed based on the positions of read1 and read2 in the FastQ data. That is, the position information of read1 and read2 in the FastQ data, recorded in the jointly compressed data of Figure 5, is used to reconstruct the FastQ data.
[0206] Optionally, in some embodiments, the read1 included in alignment1 of the joint compressed data in Figure 5 can be further compressed. For example, the second comparison result of read1 can be used to replace the sequence content of read1 to achieve reference compression of read1. For example, since the first comparison result of read1 in alignment1 includes the second comparison result of read1, the sequence content of read1 can be removed.
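The reconstruction described for Figure 5 can be sketched in Python as follows. The sketch assumes, for illustration, that the comparison items are stored in an order different from the FastQ order (here, ordered by position in the reference genome), that each FastQ record has four lines with a bare `+` separator, and that the field and function names are hypothetical.

```python
def joint_compress_keep_bam(alignments, offset_by_id):
    """Keep every alignment, drop the FastQ reads, and record each read's offset."""
    joint = []
    for alignment in alignments:
        entry = dict(alignment)
        entry["fastq_offset"] = offset_by_id[alignment["id"]]
        joint.append(entry)
    return joint


def rebuild_fastq(joint):
    """Restore the FastQ text by placing each read at its recorded position."""
    ordered = sorted(joint, key=lambda entry: entry["fastq_offset"])
    return "".join(
        f"@{e['id']}\n{e['seq']}\n+\n{e['qual']}\n" for e in ordered
    )


fastq_text = "@read1\nGACCA\n+\nIIIII\n@read2\nACGTT\n+\nIIIII\n"
offset_by_id = {"read1": 0, "read2": 21}  # byte offsets in fastq_text

# Alignments ordered by reference position, so read2 comes first.
alignments = [
    {"id": "read2", "seq": "ACGTT", "qual": "IIIII", "ref_pos": 4},
    {"id": "read1", "seq": "GACCA", "qual": "IIIII", "ref_pos": 10},
]

joint = joint_compress_keep_bam(alignments, offset_by_id)
restored = rebuild_fastq(joint)
```

Without the recorded offsets, the reads recovered from the comparison items would come out in the BAM order rather than the original FastQ order.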
[0207] Similarly, the read2 included in alignment2 of the joint compressed data in Figure 5 can be further compressed. For example, the second comparison result of read2 can be used to replace the sequence content of read2 to achieve reference compression of read2. For example, since the first comparison result of read2 in alignment2 includes the second comparison result of read2, the sequence content of read2 can be removed.
[0208] In another possible implementation, the FastQ data includes a first sequence, and the first comparison item in the BAM data includes the first sequence and its location information in the reference genome. The first sequence can be removed from the first comparison item to obtain a first compressed comparison item, thus obtaining the jointly compressed data. This first compressed comparison item includes the location information of the first sequence in the reference genome but does not include the first sequence. That is, in this implementation, the jointly compressed data includes multiple sequences from the FastQ data and multiple compressed comparison items corresponding to multiple comparison items in the BAM data.
[0209] For example, as shown in Figure 6, consider FastQ data including read1 and read2, and BAM data including alignment1 (including read1 and the first comparison result of read1) and alignment2 (including read2 and the first comparison result of read2). For instance, read1 can be removed from alignment1 to obtain compressed alignment1, and read1 from the FastQ data can be retained to obtain the joint compressed data, which includes read1 from the FastQ data and compressed alignment1. The compressed alignment1 includes the first comparison result of read1 but does not include read1. Similarly, read2 can be removed from alignment2 to obtain compressed alignment2, and read2 from the FastQ data can be retained to obtain the joint compressed data, which includes read2 from the FastQ data and compressed alignment2. The compressed alignment2 includes the first comparison result of read2 but does not include read2.
[0210] In the above implementation, the correspondence between the first comparison item and the first sequence can also be recorded in the joint compressed data of FastQ data and BAM data, wherein the correspondence is used to decompress the first compressed comparison item in the joint compressed data.
[0211] For example, the correspondence between the first comparison item and the first sequence can be recorded in the first compressed comparison item of the joint compressed data.
[0212] For example, as shown in Figure 6, the compressed alignment1 includes not only the first comparison result of read1, but also the position information of read1 in the FastQ data (the offset of read1 in the FastQ data). The compressed alignment2 includes not only the first comparison result of read2, but also the position information of read2 in the FastQ data (the offset of read2 in the FastQ data).
[0213] It should be understood that after the compressed alignment1 is obtained from the joint compressed data shown in Figure 6, the corresponding read (e.g., read1) can be obtained from the joint compressed data based on the position information of read1 in the FastQ data included in the compressed alignment1, and the obtained read1 is filled into the compressed alignment1 to obtain alignment1 in the BAM data. Similarly, after the compressed alignment2 is obtained from the joint compressed data shown in Figure 6, the corresponding read (e.g., read2) can be obtained from the joint compressed data based on the position information of read2 in the FastQ data included in the compressed alignment2, and the obtained read2 is filled into the compressed alignment2 to obtain alignment2 in the BAM data. Therefore, the position information of read1 and read2 in the FastQ data recorded in the joint compressed data of Figure 6 is used to reconstruct the BAM data.
[0214] Optionally, in some embodiments, the read1 included in the joint compressed data of Figure 6 can be further compressed. For example, the second comparison result of read1 can be used to replace the sequence content of read1 to achieve reference compression of read1. For example, since the first comparison result of read1 in the compressed alignment1 includes the second comparison result of read1, the sequence content of read1 can be directly removed.
[0215] Similarly, the read2 included in the joint compressed data of Figure 6 can be further compressed. For example, the second comparison result of read2 can be used to replace the sequence content of read2 to achieve reference compression of read2. For example, since the first comparison result of read2 in the compressed alignment2 includes the second comparison result of read2, the sequence content of read2 can be directly removed.
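A minimal Python sketch of the Figure 6 variant is given below: the read is removed from the comparison item during compression and filled back in during decompression. The dictionary layout, the field names `ref_pos`, `mapq`, and `fastq_offset`, and the function names are assumptions made for the example.

```python
READ_FIELDS = ("id", "seq", "qual")


def strip_read(alignment, fastq_offset):
    """Remove the read from the alignment; keep the comparison result and the offset."""
    compressed = {key: value for key, value in alignment.items()
                  if key not in READ_FIELDS}
    compressed["fastq_offset"] = fastq_offset
    return compressed


def restore_alignment(compressed, read_by_offset):
    """Fetch the read by its recorded offset and fill it back into the alignment."""
    restored = dict(read_by_offset[compressed["fastq_offset"]])
    restored.update({key: value for key, value in compressed.items()
                     if key != "fastq_offset"})
    return restored


read1 = {"id": "read1", "seq": "ACGTT", "qual": "IIIII"}
# "mapq" is a hypothetical placeholder for the rest of the first comparison result.
alignment1 = {"id": "read1", "seq": "ACGTT", "qual": "IIIII",
              "ref_pos": 4, "mapq": 60}

compressed_alignment1 = strip_read(alignment1, fastq_offset=0)
# The joint data keeps the reads themselves, indexed here by their offsets.
read_by_offset = {0: read1}
rebuilt_alignment1 = restore_alignment(compressed_alignment1, read_by_offset)
```

The round trip returns the original comparison item, while the joint data stores the read only once.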
[0216] Figure 7 is a schematic flowchart of a data decompression method provided in an embodiment of this application. As shown in Figure 7, the method may include steps 710-720, which will be described in detail below.
[0217] Step 710: Obtain the combined compressed data of FastQ data and BAM data.
[0218] In this embodiment of the application, jointly compressed data of FastQ data and BAM data can be obtained. The jointly compressed data includes multiple sequences, the location information of the multiple sequences in the reference genome, and the correspondence between the multiple sequences and multiple comparison items in the BAM data.
[0219] For example, the above-mentioned jointly compressed data includes the following information: read1, the first comparison result of read1, read2, the first comparison result of read2, the correspondence between read1 and alignment1, and the correspondence between read2 and alignment2.
[0220] Step 720: Decompress the joint compressed data according to the correspondence between multiple sequences and multiple comparison terms to obtain FastQ data and / or BAM data.
[0221] In this embodiment of the application, after obtaining the joint compressed data of FastQ data and BAM data, the joint compressed data can be decompressed based on the correspondence between multiple sequences and multiple comparison items recorded in the joint compressed data, and FastQ data and / or BAM data can be obtained as needed.
[0222] As an example, the aforementioned joint compressed data is obtained by removing the FastQ data while retaining the BAM data. That is, the joint compressed data includes multiple comparison items from the BAM data and the correspondences between the multiple sequences and these comparison items. For instance, the first comparison item includes the first sequence among the multiple sequences and its location information in the reference genome.
[0223] For example, if BAM data is to be obtained from the joint compressed data, since the joint compressed data includes the BAM data, the joint compressed data can be decompressed directly to obtain the BAM data.
[0224] For example, if FastQ data is to be obtained from the joint compressed data, the multiple sequences in the FastQ data need to be obtained one by one according to the correspondence between the multiple sequences and the multiple comparison items included in the joint compressed data, so as to obtain the FastQ data.
[0225] For example, to obtain the first sequence in FastQ data, the correspondence between the first sequence and the first comparison item can be obtained from the first comparison item of the above-mentioned joint compressed data, and the first sequence can be obtained from the above-mentioned joint compressed data according to the correspondence to obtain FastQ data.
[0226] For example, taking the joint compressed data including alignment1 as an example, alignment1 includes the ID of read1, the sequence content of read1, the quality score sequence corresponding to read1, and the first comparison result of read1. Based on the correspondence between read1 and alignment1 stored in alignment1, the ID of read1, the sequence content of read1, and the quality score sequence corresponding to read1 in alignment1 are used as read1 in FastQ data.
[0227] Optionally, if alignment1 includes the ID of read1, the quality score sequence corresponding to read1, and the first alignment result of read1, but does not include the sequence content of read1, the second alignment result of read1 can be obtained from the first alignment result of read1. Based on the position information of the sequence content of read1 in the reference genome included in the second alignment result of read1, the sequence content of read1 can be obtained from the reference genome. The obtained sequence content of read1, along with the ID of read1 and the quality score sequence corresponding to read1 in alignment1, is used as read1 in the FastQ data.
[0228] Similarly, based on the correspondence between read2 and alignment2 stored in alignment2, the ID of read2, the sequence content of read2, and the quality score sequence corresponding to read2 included in alignment2 can be used as read2 in the FastQ data. That is, the obtained FastQ data includes the following sequences: read1 and read2.
[0229] Optionally, if alignment2 includes the ID of read2, the quality score sequence corresponding to read2, and the first alignment result of read2, but does not include the sequence content of read2, the second alignment result of read2 can be obtained from the first alignment result of read2. Based on the position information of the sequence content of read2 in the reference genome included in the second alignment result of read2, the sequence content of read2 can be obtained from the reference genome. The obtained sequence content of read2, along with the ID of read2 and the quality score sequence corresponding to read2 in alignment2, are used as read2 in the FastQ data.
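For illustration, recovering the sequence content from the reference genome can be sketched in Python as follows. The sketch assumes an exact match between the read and the reference, so that the start position alone, together with the length of the quality score sequence (which equals the read length in FastQ), is enough to cut the bases out of the reference. The field names and the functions `restore_read` and `to_fastq_record` are hypothetical.

```python
def restore_read(stored, reference):
    """Rebuild a read whose sequence content was replaced by a reference position."""
    start = stored["ref_pos"]
    # In FastQ, the quality score sequence has one score per base.
    length = len(stored["qual"])
    seq = reference[start:start + length]
    return {"id": stored["id"], "seq": seq, "qual": stored["qual"]}


def to_fastq_record(read):
    """Format a read as a four-line FastQ record with a bare '+' separator."""
    return f"@{read['id']}\n{read['seq']}\n+\n{read['qual']}\n"


reference = "ACGTACGTTTGACCA"
# alignment1 as stored without the sequence content of read1.
stored_alignment1 = {"id": "read1", "qual": "IIIII", "ref_pos": 4}

read1 = restore_read(stored_alignment1, reference)
record1 = to_fastq_record(read1)
```

A read that differs from the reference would additionally need its differences recorded. The sketch leaves that out.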
[0230] As another example, the aforementioned joint compressed data is obtained by removing the portions of the BAM data that duplicate the FastQ data. That is, the joint compressed data includes the FastQ data, multiple compressed comparison items corresponding to the multiple sequences in the FastQ data, and the aforementioned correspondences. For example, the first compressed comparison item corresponding to the first sequence among these multiple sequences includes the location information of the first sequence in the reference genome, but does not include the first sequence itself.
[0231] For example, if FastQ data is to be obtained from the joint compressed data, since the joint compressed data includes the FastQ data, the joint compressed data can be decompressed directly to obtain the FastQ data.
[0232] For example, if BAM data is to be obtained from the joint compressed data, the multiple comparison items in the BAM data need to be obtained one by one according to the correspondence between the multiple sequences and the multiple comparison items included in the joint compressed data, so as to obtain the BAM data.
[0233] For example, to obtain the first comparison item in BAM data, the correspondence between the first sequence and the first comparison item can be obtained from the first compressed comparison item of the above-mentioned joint compressed data, and the first sequence can be filled into the first compressed comparison item to obtain the first comparison item, so as to obtain BAM data.
[0234] For example, taking the combined compressed data as including read1 and compressed alignment1 (including the first comparison result of read1, but not including read1), according to the correspondence between read1 and alignment1 stored in compressed alignment1, read1 is filled into compressed alignment1 to obtain alignment1 in BAM data.
[0235] Optionally, if the above-mentioned combined compressed data's read1 includes read1's ID and the corresponding quality score sequence, but not read1's sequence content—that is, the obtained alignment1 includes read1's ID, the corresponding quality score sequence, and read1's first alignment result, but not read1's sequence content—then a second alignment result of read1 can be obtained from the first alignment result of read1 contained in the compressed alignment1. Based on the position information of read1's sequence content in the reference genome included in the second alignment result of read1, the sequence content of read1 can be obtained from the reference genome, and this obtained read1 sequence content can be filled into the above-mentioned alignment1 to obtain BAM data.
[0236] Similarly, based on the correspondence between read2 and alignment2 stored in compressed alignment2, read2 can be filled into compressed alignment2 to obtain alignment2 in the BAM data. That is, the obtained BAM data includes the following comparison items: alignment1 and alignment2.
[0237] Optionally, if the above-mentioned combined compressed data's read2 includes the read2 ID and the corresponding quality score sequence, but not the read2 sequence content—that is, the obtained alignment2 includes the read2 ID, the corresponding quality score sequence, and the first alignment result of read2, but not the read2 sequence content—then the second alignment result of read2 can be obtained from the first alignment result of read2 contained in the compressed alignment2. Based on the position information of the read2 sequence content in the reference genome included in the second alignment result of read2, the read2 sequence content is obtained from the reference genome, and the obtained read2 sequence content is filled into the above-mentioned alignment2 to obtain BAM data.
[0238] The methods provided by the embodiments of this application have been described in detail above with reference to Figures 1 to 7. The apparatus embodiments of this application are described in detail below with reference to Figures 8 to 12. It should be understood that the descriptions of the method embodiments correspond to the descriptions of the apparatus embodiments; therefore, for any parts not described in detail, refer to the preceding method embodiments.
[0239] Figure 8 is a schematic block diagram of a data compression apparatus 800 provided in an embodiment of this application. The apparatus 800 can be implemented by software, hardware, or a combination of both. The apparatus 800 provided in this embodiment can implement the method flow shown in Figure 1 or Figure 4 of this embodiment. The apparatus 800 includes: an acquisition module 810, a determination module 820, and a compression module 830.
[0240] In one example, the acquisition module 810 is used to acquire fast sequence quality (FastQ) data and binary alignment / map (BAM) data. The FastQ data includes multiple sequences, and the BAM data includes multiple comparison terms corresponding to the multiple sequences, which are obtained by comparing the multiple sequences with a reference genome. The determination module 820 is used to determine a first comparison term corresponding to a first sequence among the multiple sequences. The first comparison term is one of the multiple comparison terms and includes the comparison result of the first sequence, which includes the position information of the first sequence in the reference genome. The compression module 830 is used to perform reference compression on the first sequence based on the comparison result of the first sequence included in the first comparison term to obtain compressed data of the FastQ data.
[0241] Optionally, the determining module 820 is specifically used to: compare the first sequence with the reference genome to obtain the first comparison item, so as to determine the correspondence between the first comparison item and the first sequence; and determine the first comparison item corresponding to the first sequence based on the correspondence.
[0242] Optionally, the sequence in the plurality of sequences includes at least one of the following: sequence content, sequence identifier, and corresponding quality score sequence of the sequence; the comparison item in the plurality of comparison items includes at least one of the following: content of the sequence corresponding to the comparison item, identifier of the sequence corresponding to the comparison item, and quality score sequence corresponding to the sequence corresponding to the comparison item; the determining module 820 is specifically used to: determine the first comparison item corresponding to the first sequence based on the comparison between the content of the first sequence and the content of the sequences included in the comparison items of the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the identifier of the first sequence and the identifier of the sequences included in the comparison items of the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the corresponding quality score sequence of the first sequence and the quality score sequence included in the comparison items of the plurality of comparison items.
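As a sketch of the identifier-based option only, the matching can be written in Python as a dictionary lookup. The sketch assumes that each identifier appears in exactly one comparison item, which need not hold for real BAM data, and the function name `match_by_id` and the dictionary layout are hypothetical.

```python
def match_by_id(reads, alignments):
    """Map each read index to the index of the alignment with the same identifier."""
    # Assumption: identifiers are unique among the alignments.
    alignment_index_by_id = {
        alignment["id"]: index for index, alignment in enumerate(alignments)
    }
    return {
        read_index: alignment_index_by_id[read["id"]]
        for read_index, read in enumerate(reads)
        if read["id"] in alignment_index_by_id
    }


reads = [{"id": "read1", "seq": "GACCA"}, {"id": "read2", "seq": "ACGTT"}]
# The alignments are stored in a different order than the reads.
alignments = [{"id": "read2", "ref_pos": 4}, {"id": "read1", "ref_pos": 10}]

correspondence = match_by_id(reads, alignments)
```

Matching on sequence content or on the quality score sequence would follow the same pattern with a different lookup key.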
[0243] Optionally, the acquisition module 810 is further configured to acquire the joint compressed data of the FastQ data and the BAM data. The joint compressed data includes the compressed data of the FastQ data and multiple compressed comparison terms corresponding to the multiple sequences. The first compressed comparison term corresponding to the first sequence includes the position information of the first sequence in the reference genome, but does not include the first sequence.
[0244] Optionally, if one of the comparison items includes the sequence corresponding to the comparison item, the compression module 830 is further configured to: remove the first sequence from the first comparison item to obtain the first compressed comparison item, so as to obtain the joint compressed data of the FastQ data and the BAM data.
[0245] Optionally, the device 800 further includes a recording module for recording the correspondence between the first comparison item and the first sequence in the joint compressed data, the correspondence being used to decompress the first compression comparison item in the joint compressed data.
[0246] Optionally, this correspondence is recorded in the first compression comparison item of the joint compressed data.
[0247] Optionally, the correspondence between the first comparison item and the first sequence includes: the position information of the first sequence in the FastQ data.
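As a rough illustration of the preceding paragraphs, the sketch below removes the sequence from a comparison item and records the position of the sequence in the FastQ data inside the compressed comparison item. The field names (`seq`, `ref_pos`, `fastq_index`) and the function `strip_sequence` are hypothetical choices made for this example.

```python
def strip_sequence(item, fastq_index):
    """Build a compressed comparison item.

    The sequence is dropped, and the correspondence (here assumed to be the
    index of the read in the FastQ data) is stored in the item itself.
    """
    compressed = {k: v for k, v in item.items() if k != "seq"}
    compressed["fastq_index"] = fastq_index
    return compressed


item = {"id": "r1", "seq": "ACGTTAGACG", "ref_pos": 4}
compressed_item = strip_sequence(item, fastq_index=0)
print(compressed_item)  # {'id': 'r1', 'ref_pos': 4, 'fastq_index': 0}
```

The original item is left unmodified, so the same record can still be used for the reference compression of the FastQ sequence.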
[0248] Optionally, the compression module 830 is specifically used to: replace the first sequence with the comparison result of the first sequence included in the first comparison item to obtain compressed data of the FastQ data.
[0249] In another example, the acquisition module 810 is used to acquire FastQ data and Binary Alignment / Mapping (BAM) data. The FastQ data includes multiple sequences, and the BAM data includes multiple alignment terms corresponding to the multiple sequences. These alignment terms are obtained by comparing the multiple sequences with a reference genome. The determination module 820 is used to determine a first alignment term corresponding to a first sequence among the multiple sequences. The first alignment term is one of the multiple alignment terms and includes the alignment result of the first sequence. The alignment result of the first sequence includes the position information of the first sequence in the reference genome. The compression module 830 is used to remove the content duplicated between the first sequence and the first alignment term to obtain jointly compressed data of the FastQ data and the BAM data.
[0250] Optionally, the compression module 830 is specifically used to: remove the first sequence from the multiple comparison items, which include the multiple sequences and their corresponding position information in the reference genome, to obtain the joint compressed data.
[0251] Optionally, the device 800 further includes a recording module for recording the correspondence between the first comparison item and the first sequence in the joint compressed data, the correspondence being used to decompress the first sequence.
[0252] Optionally, this correspondence is recorded in the first comparison item of the joint compressed data.
[0253] Optionally, the plurality of comparison items include the plurality of sequences and the position information of the plurality of sequences in the reference genome, and the compression module 830 is specifically used to: remove the first sequence from the first comparison item to obtain the first compressed comparison item, so as to obtain the joint compressed data, wherein the first compressed comparison item includes the position information of the first sequence in the reference genome, but does not include the first sequence.
[0254] Optionally, the recording module is used to record the correspondence between the first comparison item and the first sequence in the joint compressed data, and the correspondence is used to decompress the first compressed comparison item in the joint compressed data.
[0255] Optionally, this correspondence is recorded in the first compression comparison item of the joint compressed data.
[0256] Optionally, the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
[0257] Optionally, the determining module 820 is specifically used to: compare the first sequence with the reference genome to obtain the first comparison item, so as to determine the correspondence between the first comparison item and the first sequence; and determine the first comparison item corresponding to the first sequence based on the correspondence.
[0258] Optionally, the comparison item among the plurality of comparison items includes at least one of the following: the sequence corresponding to the comparison item, the identifier of the sequence corresponding to the comparison item, and the quality score sequence corresponding to the sequence corresponding to the comparison item; the determining module 820 is specifically used to: determine the first comparison item corresponding to the first sequence based on the comparison between the first sequence and the sequences included in the comparison items among the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the identifier of the first sequence and the identifiers of the sequences included in the comparison items among the plurality of comparison items; or determine the first comparison item corresponding to the first sequence based on the comparison between the quality score sequence corresponding to the first sequence and the quality score sequences included in the comparison items among the plurality of comparison items.
[0259] Optionally, the compression module 830 is further configured to: replace the first sequence with the comparison result of the first sequence included in the first comparison item.
[0260] The aforementioned device 800 can be embodied in the form of functional modules. The term "module" here can be implemented in software and / or hardware, without specific limitations.
[0261] It should be noted that the device 800 provided in the above embodiments is only illustrated by the division of the above functional modules when executing the above methods. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For example, the acquisition module 810 can be used to execute any step in the above methods, the determination module 820 can be used to execute any step in the above methods, and the compression module 830 can be used to execute any step in the above methods. The steps implemented by the acquisition module 810, the determination module 820, and the compression module 830 can be specified as needed. By implementing different steps in the above methods through the acquisition module 810, the determination module 820, and the compression module 830, all the functions of the above device can be realized.
[0262] Figure 9 is a schematic block diagram of a data decompression apparatus 900 provided in an embodiment of this application. The apparatus 900 can be implemented by software, hardware, or a combination of both. The apparatus 900 provided in this embodiment can implement the method flow shown in Figure 7 of this embodiment. The apparatus 900 includes: an acquisition module 910 and a decompression module 920. The acquisition module 910 is used to acquire jointly compressed data of FastQ sequence quality data and Binary Alignment / Mapping (BAM) data. The jointly compressed data includes multiple sequences, their respective position information in a reference genome, and the correspondence between the multiple sequences and multiple alignment terms in the BAM data. The decompression module 920 is used to decompress the jointly compressed data according to the correspondence between the multiple sequences and the multiple alignment terms to obtain the FastQ data and / or the BAM data.
[0263] Optionally, the combined compressed data includes multiple comparison items of the BAM data and the correspondence, wherein the first comparison item among the multiple comparison items includes the first sequence among the multiple sequences and the position information of the first sequence in the reference genome, and the decompression module 920 is specifically used to: obtain the first sequence from the combined compressed data according to the correspondence between the first sequence and the first comparison item to obtain the FastQ data.
[0264] Optionally, this correspondence is recorded in the first comparison item of the joint compressed data.
[0265] Optionally, the combined compressed data includes the FastQ data, multiple compressed comparison terms corresponding to multiple sequences in the FastQ data, and the correspondence relationship. The first compressed comparison term corresponding to the first sequence includes the position information of the first sequence in the reference genome, but does not include the first sequence. Based on the correspondence relationship between the first sequence in the multiple sequences and the first comparison term, the decompression module 920 is specifically used to: fill the first sequence into the first compressed comparison term to obtain the first comparison term, so as to obtain the BAM data.
[0266] Optionally, this correspondence is recorded in the first compression comparison item of the joint compressed data.
[0267] Optionally, the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
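The decompression step in which the first sequence is filled back into the first compressed comparison item can be sketched as follows. The sketch assumes that the correspondence is stored in the compressed item as a hypothetical `fastq_index` field giving the position of the read in the FastQ data; the function name `restore_item` is likewise an assumption.

```python
def restore_item(compressed_item, fastq_reads):
    """Rebuild a full comparison item from its compressed form.

    The recorded FastQ position is used to look up the read, and the
    sequence is copied back into the item.
    """
    read = fastq_reads[compressed_item["fastq_index"]]
    item = {k: v for k, v in compressed_item.items() if k != "fastq_index"}
    item["seq"] = read["seq"]
    return item


fastq_reads = [{"id": "r1", "seq": "ACGTTAGACG", "qual": "IIIIIIIIII"}]
compressed_item = {"id": "r1", "ref_pos": 4, "fastq_index": 0}
restored = restore_item(compressed_item, fastq_reads)
print(restored)  # {'id': 'r1', 'ref_pos': 4, 'seq': 'ACGTTAGACG'}
```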
[0268] Optionally, the sequence content included in the plurality of sequences is represented by the position information of the sequence content in the reference genome, and the decompression module 920 is further configured to: obtain the sequence content of the first sequence from the reference genome according to the position information of the sequence content of the first sequence of the plurality of sequences in the reference genome.
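A minimal sketch of recovering sequence content from the reference genome is shown below. It assumes the sequence was stored as a 0-based start position, a length, and a list of substituted bases; this encoding and the name `reference_decompress` are illustrative assumptions rather than the format defined by this application.

```python
def reference_decompress(ref_pos, length, mismatches, reference):
    """Rebuild a read from its position in the reference.

    `mismatches` is a list of (offset, base) pairs for bases that differ
    from the reference window (substitutions only, by assumption).
    """
    bases = list(reference[ref_pos:ref_pos + length])
    if len(bases) != length:
        raise ValueError("position and length fall outside the reference")
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGC"
print(reference_decompress(4, 10, [(7, "A")], reference))  # ACGTTAGACG
```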
[0269] Furthermore, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments above, which will not be repeated here.
[0270] The aforementioned device 900 can be embodied in the form of functional modules. The term "module" here can be implemented in software and / or hardware, without specific limitations.
[0271] It should be noted that the device 900 provided in the above embodiments is only illustrated by the division of the above functional modules when executing the above methods. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For example, the acquisition module 910 can be used to execute any step in the above methods, and the decompression module 920 can be used to execute any step in the above methods. The steps implemented by the acquisition module 910 and the decompression module 920 can be specified as needed. By implementing different steps in the above methods through the acquisition module 910 and the decompression module 920, all the functions of the above device can be realized.
[0272] The modules of the various examples described in the embodiments of this application can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0273] Furthermore, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments above, which will not be repeated here.
[0274] Figure 10 shows a schematic diagram of a system architecture provided in an embodiment of this application. The system includes an application server 200 and a storage system 100.
[0275] In practical applications, users access data through applications. The computer running these applications is called "application server 200". Application server 200 can be a physical machine or a virtual machine. A physical application server 200 includes, but is not limited to, a desktop computer, a server, a laptop, or a mobile device. Application server 200 accesses storage system 100 via a switch to read and write data. The switch is an optional device; application server 200 can also communicate directly with storage system 100 via a network.
[0276] Storage system 100 includes one or more storage devices with specific storage functions, such as hard disks. The actual addresses of the storage space provided by the hard disks are not directly exposed to application server 200. Hard disks can be of any type, such as solid-state drives (SSDs), hard disk drives (HDDs), or other types of hard disks. Each hard disk is divided into several physical chunks, which are mapped to logical chunks to form a storage pool. This storage pool provides storage space upwards (e.g., to application server 200), and this storage space actually originates from the hard disks included in storage system 100.
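The mapping between logical chunks and physical chunks can be pictured with a small sketch. The class `StoragePool`, its methods, and the first-fit allocation order are assumptions made purely for illustration; the application does not specify how chunks are allocated.

```python
class StoragePool:
    """Hypothetical pool that maps logical chunks onto physical chunks."""

    def __init__(self, disks):
        # disks: mapping of disk name -> number of physical chunks on it
        self._free = [(name, chunk) for name, count in disks.items()
                      for chunk in range(count)]
        self._mapping = {}  # logical chunk id -> (disk name, physical chunk)

    def allocate(self):
        """Reserve one physical chunk and return its logical chunk id."""
        if not self._free:
            raise RuntimeError("storage pool exhausted")
        logical_id = len(self._mapping)
        self._mapping[logical_id] = self._free.pop(0)
        return logical_id

    def resolve(self, logical_id):
        """Translate a logical chunk id into its physical location."""
        return self._mapping[logical_id]


pool = StoragePool({"ssd0": 2, "hdd0": 1})
first = pool.allocate()
print(first, pool.resolve(first))  # 0 ('ssd0', 0)
```

The caller only ever sees logical chunk ids, which reflects the point that the actual addresses of the hard disks are not exposed to the application server.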
[0277] To minimize the storage space occupied by data, in one example, storage system 100 can perform the data compression method described above to compress user data and store the compressed data in memory; storage system 100 can also retrieve the stored compressed data from memory, perform the data decompression method described above to obtain the uncompressed data, and send the uncompressed data to the user. In another example, application server 200 can perform the data compression method described above to compress user data and store the compressed data in storage system 100; application server 200 can also retrieve the stored compressed data from storage system 100, perform the data decompression method described above to obtain the uncompressed data, and send the uncompressed data to the user.
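The compress-on-write and decompress-on-read flow can be sketched as below. Note that the sketch substitutes the general-purpose `zlib` codec from the Python standard library for the FastQ / BAM compression method of this application, only to show where compression and decompression sit in the data path. The class `CompressingStore` is hypothetical.

```python
import zlib


class CompressingStore:
    """Hypothetical store that keeps only compressed data.

    zlib is a stand-in here; it is not the compression method of this
    application.
    """

    def __init__(self):
        self._blocks = {}

    def put(self, key, data):
        """Compress user data and keep only the compressed form."""
        self._blocks[key] = zlib.compress(data)

    def get(self, key):
        """Decompress the stored data and return the original bytes."""
        return zlib.decompress(self._blocks[key])

    def stored_size(self, key):
        return len(self._blocks[key])


store = CompressingStore()
payload = b"ACGT" * 1000
store.put("sample.fastq", payload)
assert store.get("sample.fastq") == payload
print(len(payload), store.stored_size("sample.fastq"))
```

The same wrapper could run either in the storage system or in the application server, matching the two placements described above.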
[0278] The type of storage system 100 is not limited in this application embodiment. The storage system 100 can be a centralized storage system or a distributed storage system. Figures 11 and 12 below illustrate, by way of example, a possible centralized storage system and a possible distributed storage system, respectively.
[0279] It should be understood that, for ease of description, the following explanation will use the storage system 100 performing the above-described data compression or decompression method as an example.
[0280] As shown in Figure 11, this is a centralized storage system applicable to an embodiment of this application. The centralized storage system is characterized by a unified entry point through which all data from external devices must pass. This entry point is the engine 121 of the centralized storage system. The engine 121 is the most crucial component of the centralized storage system, where many advanced functions of the storage system are implemented.
[0281] As shown in Figure 11, there are one or more controllers in engine 121. Figure 11 illustrates the case where the engine includes controller 0.
[0282] Engine 121 also includes a front-end interface 123 and a back-end interface 124. The front-end interface 123 is used to communicate with the application server 200 to provide storage services to the application server 200. The back-end interface 124 is used to communicate with the hard disk 132 to expand the capacity of the storage system. Through the back-end interface 124, engine 121 can connect to more hard disks 132, thereby forming a very large storage resource pool.
[0283] In this embodiment, controller 0 can execute the data compression and / or data decompression methods mentioned in this embodiment. That is, controller 0 can compress user data to obtain compressed data, and store the compressed data in hard disk 132 within hard disk enclosure 140, thereby executing the aforementioned data compression method. Controller 0 can also retrieve the compressed data from hard disk 132 in hard disk enclosure 140, decompress the compressed data to obtain the uncompressed data, and send the obtained uncompressed data to the user, thereby executing the aforementioned data decompression method.
[0284] In terms of hardware, as shown in Figure 11, controller 0 includes at least processor 125 and memory 122. Processor 125 is a central processing unit (CPU) used to process data access requests from the application server, such as write data requests initiated by the user through the application server, or read data requests initiated by the user through the application server, and also to process requests generated internally by the storage system. For example, when processor 125 receives write data requests sent by application server 200 through front-end interface 123, it temporarily stores the data in these write data requests in memory 122. When the total amount of data in memory 122 reaches a certain threshold, processor 125 sends the data stored in memory 122 to hard disk 132 for persistent storage through back-end interface 124.
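The threshold-based write path described above can be sketched as a small buffer that accumulates write data and flushes it once a size limit is reached. The class `WriteBuffer`, the byte-count threshold, and the `persist` callback are assumptions for illustration; the application does not state how the threshold is defined.

```python
class WriteBuffer:
    """Hypothetical in-memory buffer that flushes to persistent storage."""

    def __init__(self, threshold_bytes, persist):
        self.threshold_bytes = threshold_bytes
        self.persist = persist  # callable that receives the flushed bytes
        self._pending = []
        self._size = 0

    def write(self, data):
        """Buffer the data; flush once the total reaches the threshold."""
        self._pending.append(data)
        self._size += len(data)
        if self._size >= self.threshold_bytes:
            self.flush()

    def flush(self):
        """Send all buffered data to persistent storage in one batch."""
        if self._pending:
            self.persist(b"".join(self._pending))
            self._pending = []
            self._size = 0


disk = []
buffer = WriteBuffer(threshold_bytes=8, persist=disk.append)
buffer.write(b"abcd")   # below the threshold, stays in memory
buffer.write(b"efgh")   # threshold reached, both writes are flushed together
print(disk)  # [b'abcdefgh']
```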
[0285] Memory 122 refers to internal memory that directly exchanges data with the processor. It can read and write data at any time and at high speed, serving as temporary data storage for the operating system or other running programs. Memory 122 can be random access memory (RAM) or read-only memory (ROM). For example, the RAM can be dynamic random access memory (DRAM) or storage class memory (SCM). DRAM is a semiconductor memory and, like most RAM, is a volatile memory device. Memory 122 can also be a dual in-line memory module (DIMM), that is, a module composed of DRAM, or a solid-state disk (SSD).
[0286] The memory 122 stores computer program instructions, and the processor 125 runs the computer program instructions in the memory 122 to execute the data compression method or data decompression method provided in the embodiments of this application.
[0287] Engine 121 may not have a hard drive slot; hard drive 132 needs to be placed in hard drive enclosure 140, and back-end interface 124 communicates with hard drive enclosure 140. Back-end interface 124 exists in the form of an adapter card within engine 121, and two or more back-end interfaces 124 can be used simultaneously on one engine 121 to connect multiple hard drive enclosures. Alternatively, the adapter card can be integrated onto the motherboard, in which case the adapter card communicates with processor 125 via the PCIe bus.
[0288] It should be noted that only one engine 121 is shown in Figure 11. However, in actual applications, the storage system may contain two or more engines 121, and redundancy or load balancing may be performed among multiple engines 121.
[0289] The hard disk enclosure 140 includes a control unit 131 and several hard disks 132. The control unit 131 can have various forms. In one case, the hard disk enclosure 140 is a smart enclosure, as shown in Figure 11, where the control unit 131 includes a CPU and memory. The CPU is used to perform address translation and data reading / writing operations. The memory is used to temporarily store data to be written to the hard disks 132, or to read data from the hard disks 132 to be sent to the controller. In another case, the control unit 131 is a programmable electronic component, such as a data processing unit (DPU). Optionally, the DPU can also be replaced by a graphics processing unit (GPU), an embedded neural network processing unit (NPU), or other processing chips.
[0290] Figure 12 illustrates a distributed storage system provided in an embodiment of this application. The distributed storage system includes a computing node cluster and a storage node cluster. The computing node cluster includes one or more computing nodes 210 (two computing nodes 210 are shown in Figure 12, but the system is not limited to two computing nodes 210), and the computing nodes 210 can communicate with each other. A computing node 210 is a computing device, such as a server, desktop computer, or a controller for a storage array.
[0291] Compute node 210 can handle data access requests from outside the distributed storage system, such as write data requests initiated by users through the application server for writing data, or read data requests initiated by users through the application server for reading data, and can also handle requests generated within the storage system.
[0292] In terms of hardware, as shown in Figure 12, the computing node 210 includes at least a processor 212, memory 213, and a network interface card (NIC) 214. The processor 212 is a central processing unit (CPU) or other processing chip with processing capabilities, used to process data access requests from outside the computing node 210 or requests generated internally within the computing node 210. For example, when the processor 212 receives a write data request from a user, it temporarily stores the data in the write data request in memory 213. When the total amount of data in memory 213 reaches a certain threshold, the processor 212 sends the data stored in memory 213 to the storage node 220 for persistent storage. Figure 12 shows only one CPU 212; in practical applications, there are often multiple CPUs 212, and each CPU 212 has one or more CPU cores. This embodiment does not limit the number of CPUs or the number of CPU cores.
[0293] The type of memory 213 is similar to that of memory 122 mentioned above, and the details can be found in the previous description, which will not be repeated here.
[0294] Network interface card 214 is used to communicate with storage node 220. For example, when the total amount of data in memory 213 reaches a certain threshold, compute node 210 can send a request to storage node 220 via network interface card 214 to persistently store the data. Additionally, compute node 210 may also include a bus for communication between components within compute node 210.
[0295] Functionally, since the primary function of the compute node 210 in Figure 12 is computing, it can utilize remote storage for persistent data storage. Therefore, it has less local storage than a conventional server, thus saving cost and space. However, this does not mean that the compute node 210 cannot have local storage. In actual implementation, the compute node 210 can also have a small amount of internal hard disk or external hard disk.
[0296] Any compute node 210 can access any storage node 220 in the storage node cluster via the network. The storage node cluster includes multiple storage nodes 220 (three storage nodes 220 are shown in Figure 12, but it is not limited to three storage nodes 220).
[0297] Storage node 220 can handle data access requests sent from compute node 210, for example by writing data to or reading data from hard disk 225.
[0298] In this embodiment, storage node 220 can execute the data compression method or data decompression method provided in this embodiment.
[0299] From a hardware perspective, a storage node 220 includes a processor 221, memory 222, a network interface card (NIC) 223, and multiple hard disks 225. The NIC 223 is used to communicate with the computing node 210. The hard disks 225 are used to store data and can be magnetic disks or other types of storage media, such as solid-state drives (SSDs) or shingled magnetic recording (SMR) hard disks. The processor 221 is used to write data to or read data from the hard disks 225 according to read / write data requests. During the read / write process, the processor 221 needs to convert the address carried in the read / write data request into an address that the hard disk can recognize. In this embodiment, the processor 221 serves as the main core unit of the storage node 220, executing the data compression or decompression method provided in this embodiment.
[0300] The type of processor 221 is similar to that of processor 212, as detailed in the foregoing description, and will not be repeated here.
[0301] In another implementation, some functions of the processor 221 can be offloaded to the network interface card 223 (NIC 223). In this case, NIC 223 is a smart NIC that may contain a CPU and memory. The CPU performs the address translation and read / write operations that would otherwise be performed by the processor 221. The memory in the NIC is used to temporarily store data to be written to the hard disk 225, or data read from the hard disk 225 to be sent to the compute node 210. The CPU can also be replaced by a programmable electronic component, such as a data processing unit (DPU). Optionally, the DPU can also be replaced by a GPU, NPU, or other processing chip. There is no ownership relationship between the NIC 223 and the hard disk 225 in the storage node 220; the NIC 223 can access any one of the hard disks 225 in the storage node 220.
[0302] It should be noted that Figures 11 and 12 are merely illustrative examples of possible hardware structures for storage systems. The data compression or decompression methods provided in this application embodiment are also applicable to other systems or devices with storage functions.
[0303] In this embodiment, a computer program product containing instructions is also provided. The computer program product may be a software or program product containing instructions capable of running on a computing device cluster or stored on any available medium. When run by the computing device cluster, it causes the computing device cluster to perform the methods provided above, or causes the computing device cluster to implement the functions of the apparatus provided above.
[0304] In this embodiment, a computer-readable storage medium is also provided. This computer-readable storage medium can be any available medium that a computing device can access, or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method provided above.
[0305] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0306] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0307] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0308] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of units is only a logical functional division, and in actual implementation there may be other division methods. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0309] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0310] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0311] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0312] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for data compression, characterized in that the method includes: acquiring fast sequence quality (FastQ) data and binary alignment / map (BAM) data, wherein the FastQ data includes multiple sequences, the BAM data includes multiple comparison terms corresponding to the multiple sequences, and the multiple comparison terms are obtained by comparing the multiple sequences with a reference genome respectively; determining a first comparison term corresponding to a first sequence among the multiple sequences, wherein the first comparison term is one of the multiple comparison terms, the first comparison term includes the comparison result of the first sequence, and the comparison result of the first sequence includes the position information of the first sequence in the reference genome; and performing reference compression on the first sequence based on the comparison result of the first sequence included in the first comparison term to obtain the compressed data of the FastQ data.
2. The method according to claim 1, characterized in that acquiring the BAM data includes: comparing the first sequence with the reference genome to obtain the first comparison item, thereby determining the correspondence between the first comparison item and the first sequence; and determining the first comparison item corresponding to the first sequence among the multiple sequences includes: determining the first comparison item corresponding to the first sequence based on the correspondence.
3. The method according to claim 1, characterized in that the multiple sequences include at least one of the following: sequence content, a sequence identifier, and a quality score sequence corresponding to the sequence; the multiple comparison items include at least one of the following: the content of the sequence corresponding to the comparison item, the identifier of the sequence corresponding to the comparison item, and the quality score sequence corresponding to the sequence corresponding to the comparison item; and determining the first comparison item corresponding to the first sequence among the multiple sequences includes: determining the first comparison item corresponding to the first sequence based on a comparison between the content of the first sequence and the content of the sequences included in the multiple comparison items; or determining the first comparison item corresponding to the first sequence based on a comparison between the identifier of the first sequence and the identifiers of the sequences included in the multiple comparison items; or determining the first comparison item corresponding to the first sequence based on a comparison between the quality score sequence corresponding to the first sequence and the quality score sequences included in the multiple comparison items.
4. The method according to any one of claims 1 to 3, characterized in that the method further includes: obtaining joint compressed data of the FastQ data and the BAM data, wherein the joint compressed data includes the compressed data of the FastQ data and multiple compressed comparison items corresponding to the multiple sequences, and a first compressed comparison item corresponding to the first sequence includes the position information of the first sequence in the reference genome but does not include the first sequence.
5. The method according to claim 4, characterized in that, when a comparison item among the multiple comparison items includes the sequence corresponding to that comparison item, obtaining the joint compressed data of the FastQ data and the BAM data includes: removing the first sequence from the first comparison item to obtain the first compressed comparison item, thereby obtaining the joint compressed data of the FastQ data and the BAM data.
6. The method according to claim 4 or 5, characterized in that obtaining the joint compressed data of the FastQ data and the BAM data includes: recording the correspondence between the first comparison item and the first sequence in the joint compressed data, wherein the correspondence is used to decompress the first compressed comparison item in the joint compressed data.
7. The method according to claim 6, characterized in that the correspondence is recorded in the first compressed comparison item of the joint compressed data.
8. The method according to claim 6 or 7, characterized in that the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
9. The method according to any one of claims 1 to 8, characterized in that performing reference compression on the first sequence based on the comparison result of the first sequence included in the first comparison item, to obtain compressed data of the FastQ data, includes: replacing the first sequence with the comparison result of the first sequence included in the first comparison item, to obtain the compressed data of the FastQ data.
10. A method for data decompression, characterized in that the method includes: obtaining joint compressed data of fast sequence quality (FastQ) data and binary alignment / map (BAM) data, wherein the joint compressed data includes multiple sequences, their respective position information in a reference genome, and the correspondence between the multiple sequences and multiple comparison items in the BAM data; and decompressing the joint compressed data based on the correspondence between the multiple sequences and the multiple comparison items, to obtain the FastQ data and / or the BAM data.
11. The method according to claim 10, characterized in that the joint compressed data includes the multiple comparison items of the BAM data and the correspondence, and a first comparison item among the multiple comparison items includes a first sequence among the multiple sequences and the position information of the first sequence in the reference genome; and decompressing the joint compressed data based on the correspondence between the multiple sequences and the multiple comparison items to obtain the FastQ data includes: obtaining the first sequence from the joint compressed data based on the correspondence between the first sequence and the first comparison item, to obtain the FastQ data.
12. The method according to claim 11, characterized in that the correspondence is recorded in the first comparison item of the joint compressed data.
13. The method according to claim 10, characterized in that the joint compressed data includes the FastQ data, multiple compressed comparison items corresponding to the multiple sequences in the FastQ data, and the correspondence, and a first compressed comparison item corresponding to a first sequence includes the position information of the first sequence in the reference genome but does not include the first sequence itself; and decompressing the joint compressed data based on the correspondence between the multiple sequences and the multiple comparison items to obtain the BAM data includes: filling the first sequence into the first compressed comparison item, based on the correspondence between the first sequence among the multiple sequences and a first comparison item, to obtain the first comparison item, thereby obtaining the BAM data.
14. The method according to claim 13, characterized in that the correspondence is recorded in the first compressed comparison item of the joint compressed data.
15. The method according to any one of claims 11 to 14, characterized in that the correspondence between the first comparison item and the first sequence includes the position information of the first sequence in the FastQ data.
16. The method according to any one of claims 11 to 15, characterized in that the sequence content included in the multiple sequences is represented by the position information of the sequence content in the reference genome, and the method further includes: obtaining the sequence content of the first sequence from the reference genome based on the position information of the sequence content of the first sequence in the reference genome.
17. A data compression apparatus, characterized in that it includes units / modules for performing the method according to any one of claims 1 to 9.
18. A data decompression apparatus, characterized in that it includes units / modules for performing the method according to any one of claims 10 to 16.
19. A storage system, characterized in that it includes a controller and a memory, the controller being configured to perform the method according to any one of claims 1 to 16, and to store compressed data in the memory or retrieve compressed data from the memory.
20. A computing device, characterized in that the computing device includes a processor and a memory, the processor being configured to execute instructions stored in the memory to cause the computing device to perform the method according to any one of claims 1 to 16.
21. A computer program product containing instructions, characterized in that, when the instructions are executed by a computing device, the computing device performs the method according to any one of claims 1 to 16.
22. A computer-readable storage medium, characterized in that it includes computer program instructions which, when executed by a computing device, cause the computing device to perform the method according to any one of claims 1 to 16.
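As a non-limiting illustration only, the joint compression and decompression flow recited in the claims above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation of this application: the names FastqRead, Alignment, CompressedRead, compress, restore_sequence and decompress are hypothetical; records are simplified in-memory stand-ins rather than real FastQ or BAM parsing; sequences are matched to comparison items by identifier (one of the options in claim 3); reads lie fully inside the reference; differences from the reference are limited to substitutions; and no entropy coding is applied.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FastqRead:
    identifier: str
    sequence: str
    quality: str


@dataclass
class Alignment:
    """Simplified stand-in for one BAM comparison item (hypothetical fields)."""
    identifier: str
    position: int                      # 0-based start of the read in the reference
    sequence: Optional[str]            # None once the sequence has been removed
    fastq_index: Optional[int] = None  # correspondence: index of the read in the FastQ data


@dataclass
class CompressedRead:
    identifier: str
    position: int
    length: int
    mismatches: List[Tuple[int, str]]  # (offset, base) where the read differs from the reference
    quality: str


def compress(reads, alignments, reference):
    """Replace each read sequence by its position and strip it from the BAM item."""
    by_id = {a.identifier: a for a in alignments}            # match on identifier
    index_by_id = {r.identifier: i for i, r in enumerate(reads)}
    packed_reads = []
    for read in reads:
        aln = by_id[read.identifier]
        window = reference[aln.position:aln.position + len(read.sequence)]
        mismatches = [(i, b) for i, (b, r) in enumerate(zip(read.sequence, window)) if b != r]
        packed_reads.append(CompressedRead(read.identifier, aln.position,
                                           len(read.sequence), mismatches, read.quality))
    # Keep BAM order; drop the sequence and record where the read sits in the FastQ data.
    packed_alignments = [Alignment(a.identifier, a.position, None, index_by_id[a.identifier])
                         for a in alignments]
    return packed_reads, packed_alignments


def restore_sequence(packed, reference):
    """Rebuild a read from the reference window plus its recorded substitutions."""
    bases = list(reference[packed.position:packed.position + packed.length])
    for offset, base in packed.mismatches:
        bases[offset] = base
    return "".join(bases)


def decompress(packed_reads, packed_alignments, reference):
    """Recover the FastQ reads, then fill the sequences back into the BAM items."""
    reads = [FastqRead(p.identifier, restore_sequence(p, reference), p.quality)
             for p in packed_reads]
    alignments = [Alignment(a.identifier, a.position, reads[a.fastq_index].sequence)
                  for a in packed_alignments]
    return reads, alignments


# Toy data: the BAM order differs from the FastQ order on purpose.
reference = "ACGTACGTTAGC"
reads = [FastqRead("r1", "GTAC", "IIII"), FastqRead("r2", "GTTTGC", "IIHHGG")]
alignments = [Alignment("r2", 6, "GTTTGC"), Alignment("r1", 2, "GTAC")]

packed_reads, packed_alignments = compress(reads, alignments, reference)
restored_reads, restored_alignments = decompress(packed_reads, packed_alignments, reference)
```

In this sketch the sequence of each read is stored only once, as a reference position plus substitutions, and each stripped comparison item carries the index of its read in the FastQ data, which is what allows both data sets to be rebuilt. A practical implementation would also need to handle insertions, deletions, clipping and unmapped reads.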