A distributed array storage and microbe-based high capacity error correction DNA storage technology (Bio-RAID)
By using distributed DNA storage technology based on microbial systems and disk array technology, combined with RAID5 and RAID6 error correction codes, the problems of file copying and high error rate in DNA storage are solved, achieving efficient and low-cost DNA storage and retrieval.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF MICROBIOLOGY CHINESE ACAD OF SCI
- Filing Date
- 2021-07-09
- Publication Date
- 2026-06-26
AI Technical Summary
Existing DNA storage technologies have the problem that each stored file requires separate DNA synthesis, making it impossible to replicate and distribute in batches. Furthermore, the DNA sequence synthesis, storage, and retrieval processes are prone to errors.
We employ distributed DNA storage technology based on microbial systems, combined with disk array technology, and use RAID5 and RAID6 error correction codes to form a Bio-RAID distributed DNA disk array. Through the distributed arrangement and error correction verification of index information units and content information units, we achieve high-tolerance error-correcting storage.
This enables efficient replication and batch distribution of DNA storage, reduces storage costs, and improves the accuracy of DNA sequence synthesis and reading.
Figure CN114927169B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of biotechnology, and in particular to a distributed array storage and high-tolerance error-correcting DNA storage method.
Background Technology
[0002] With the development of electronic and computer technologies, the world has fully entered the era of big data, and all activities of daily life involve data storage and processing. The total amount of global data currently stands at 3.52 × 10^22 bits and, at its consistently exponential growth rate, is projected to reach 3 × 10^24 bits by 2040. This growth will soon exceed the capacity of existing storage media. Magnetic storage (e.g., magnetic tape or hard disk drives), optical storage (e.g., Blu-ray), and solid-state storage (e.g., flash memory) can no longer meet the ever-increasing demand for storage capacity, and they also suffer from short effective storage lifetimes, data loss due to environmental factors, high energy consumption of production equipment, and environmental pollution.
[0003] Currently, mainstream data storage is typically achieved by altering the properties of materials: such as the electrical properties of flash memory and phase-change memory, the optical properties of Blu-ray discs, or the magnetic properties of hard disk drives and magnetic tapes. Although these technologies have made rapid progress, their storage density is nearing its limit.
[0004] In contrast, DNA storage uses as few atoms as possible to store a single bit of information, achieving high-density storage, making it more attractive.
[0005] Deoxyribonucleic acid (DNA) is a natural information storage medium, attracting attention due to its exceptionally secure storage capacity for the vast amounts of genetic information from almost all living organisms worldwide. As the densest and most stable known data storage medium, DNA possesses characteristics such as high storage density, long storage time, good parallel access, and strong compatibility. A complete DNA data storage process includes encoding digital information into DNA sequences, synthesizing actual DNA molecular sequences, organizing these DNA sequences into a database and storing them long-term, retrieving and selectively accessing DNA sequences, reading molecules (sequencing), converting them into data (decoding), and thus decoding the information back into original digital data.
[0006] DNA storage can be traced back to 1964 when Wiener and Neiman proposed the concept of molecular "genetic memory." In 1996, Davis first "wrote" a DNA sequence (5'-CCCCCCAACGCGCGCG-3') encoding a 35-bit black-and-white image into a cellular storage vector. In 1999, Bancroft et al. successfully recovered a 23-character message hidden in a DNA sequence. In 2000, Leier et al. implemented encryption and decryption methods applicable to DNA "barcodes." In 2001, Reif et al. at Duke University first constructed a small, randomly accessible DNA database. It wasn't until 2012, when Church at Harvard University stored a 50,000-word book in DNA, that the DNA storage boom began. With breakthroughs in DNA synthesis (data writing) and DNA sequencing (data reading), DNA storage has become a hot topic in next-generation storage technology. In 2018, the U.S. published the "Semiconductor Synthetic Biology Roadmap," predicting that DNA-based data storage would likely solve the challenges of massive data storage, data center scale, and energy consumption. In July 2019, Scientific American listed DNA storage as one of the top ten breakthrough technologies of the year. Currently, a DNA Storage Consortium has been formed, comprised of 16 organizations including Microsoft, Illumina, Twist Bioscience, and data storage giant Western Digital, dedicated to establishing commercially viable DNA storage systems to address the explosive growth of digital data.
[0007] However, current DNA storage technology is still based on in vitro synthesis and storage, which has certain drawbacks. For example, each stored file requires separate DNA synthesis, making batch distribution impossible; and errors are prone to occur during DNA sequence synthesis, storage, and retrieval.
Summary of the Invention
[0008] In view of this, and based on the ability of microorganisms to reproduce rapidly and transmit genetic information accurately, the present invention provides a DNA storage technology based on a microbial system. Combined with disk array technology, it further provides a distributed DNA storage array technology that has error correction capability and reduces the sequencing depth required to read the stored data.
[0009] To achieve the above-mentioned objectives, the present invention provides the following technical solution:
[0010] This invention provides a distributed array storage unit library, comprising one or more basic storage units. Each basic storage unit includes an index information unit and a content information unit to be stored; the number of index information units is one or more; the basic storage units composed of the index information units and the content information units to be stored are arranged in a distributed manner.
[0011] The index information unit and the content information unit to be stored are encoded as corresponding base sequences; the encoded basic storage unit sequence module is composed of four bases, A, T, C, and G, arranged randomly, and its length is 12-36 bases; the basic storage unit has no five consecutive A or T and no four consecutive C or G; the GC content of the basic storage unit sequence is between 40% and 60%; the basic storage unit sequence itself has no four or more consecutive reverse complementary sequences; the Hamming distance between the basic storage unit sequences is greater than 3; the basic storage unit sequence module has no overlap region, and the last 6 positions of the basic storage unit sequence module have no sequences that are homologous to or reverse complementary to the restriction enzyme site.
[0012] In some specific embodiments of the present invention, the index information unit includes stripe address information and in-strip coding unit address information.
[0013] In some specific embodiments of the present invention, row check information is also included. The row check uses RAID5, and the RAID5 row check information corresponds to the row information. The RAID5 row check mode tolerates the loss of 1 out of every 5 basic DNA fragments: if any one of the 5 information fragments (3 index information base sequences, 1 content information base sequence, and 1 row check base sequence) is lost, the metadata can still be recovered.
[0014] In some specific embodiments of the present invention, column check information is further included between storage units. The column check uses RAID6, and the RAID6 column check information corresponds to the basic storage units. It is calculated by checking again the base sequences of the five information segments (three index information segments, one content information segment, and the row check information) in each group of four basic DNA units. The base sequence of each pair of column check oligonucleotides corresponds to specific base sequences of the four data information segments. That is, even if any two of the six corresponding basic storage units are lost, the metadata can be recovered.
[0015] In some specific embodiments of the present invention, the index information unit, the content information unit to be stored, and the RAID5 row check unit include a connection sequence, which is an overlap area.
[0016] In some specific embodiments of the present invention, the invention further includes a head enzyme digestion base sequence, an overlap base sequence, and a tail enzyme digestion base sequence.
[0017] The present invention also provides a high-tolerance error-correcting Bio-RAID distributed DNA disk array, including the distributed array storage unit library.
[0018] The present invention also provides a storage medium, including the distributed array storage unit library or the high-tolerance error-correcting Bio-RAID distributed DNA disk array; the storage medium is microorganisms.
[0019] This invention also provides a method for constructing the distributed array storage unit library or the high-tolerance error-correcting Bio-RAID distributed DNA disk array system. The data information to be stored is divided into two parts: index information and content information units. The index information is divided into multiple index information units according to the actual storage content. The index information units and content information units are corresponding base sequences. The base sequence of each content information unit and the base sequences of multiple corresponding index information units form a basic DNA storage unit.
[0020] The present invention also provides a high-tolerance error-correcting DNA storage method, which stores the content to be stored in the distributed array storage unit library.
[0021] In some specific embodiments of the present invention, the high-tolerance error-correcting DNA storage method includes the following steps:
[0022] Step 1: Divide the data to be stored into several blocks that can be represented by content information units. Each content information unit, along with several index information units, forms an addressable basic DNA storage unit. The index information is divided into multiple index information units based on the actual storage content. The index information units and content information units are corresponding base sequences. Each content information unit's base sequence, along with its corresponding multiple index information unit base sequences, forms a basic DNA storage unit.
[0023] Step 2: Based on the sequence information of the DNA basic storage unit, write row and column checksums, and use row checksum RAID 5 and column checksum RAID 6 disk checksum technologies for encoding to obtain a high-tolerance error-correcting Bio-RAID distributed DNA disk array system that can accurately recover DNA storage unit information.
[0024] Based on the characteristics of microorganisms, such as rapid reproduction and precise transmission of genetic information, this invention provides a DNA storage technology based on a microbial system. At the same time, combined with disk array technology, it provides a distributed DNA storage array technology with error correction capabilities and the ability to reduce the sequencing depth of stored data.
Attached Figure Description
[0025] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.
[0026] Figure 1 shows a disk array of an exemplary DNA font library module according to the present invention;
[0027] Figure 2 is a schematic diagram of the structure of an exemplary basic unit for storing DNA information according to the present invention;
[0028] Figure 3 shows an exemplary in-situ addition of the pUC-19 vector to the font library array according to the present invention;
[0029] Figure 4 shows an exemplary homologous ligation of mixed basic-unit fragments of DNA storage information with the pUC-19 vector according to the present invention;
[0030] Figure 5 shows an exemplary plasmid system for the ancient Chinese poem "Ascending the Stork Tower" according to the present invention;
[0031] Figure 6 shows an exemplary plasmid transformation method for the ancient Chinese poem "Ascending the Stork Tower" according to the present invention;
[0032] Figure 7 shows the sequencing results for the "Ascending the Stork Tower" plasmid at a concentration of 0.5 ng/µL;
[0033] Figure 8 shows the sequencing results for the mixed "Ascending the Stork Tower" transformation plasmid at a concentration of 5 ng/µL;
[0034] Figure 9 shows the translation accuracy of 100 reads randomly sampled from the 0.5 ng/µL mixed plasmid concentration group;
[0035] Figure 10 shows the translation accuracy of 100 reads randomly sampled from the 5.0 ng/µL mixed plasmid concentration group.
Detailed Implementation
[0036] This invention discloses a distributed array storage unit library and a high-tolerance error-correcting DNA storage method. Those skilled in the art can draw on the content of this document and appropriately modify the process parameters to implement it. It should be noted in particular that all similar substitutions and modifications are obvious to those skilled in the art and are considered to be included in this invention. The methods and applications of this invention have been described through preferred embodiments, and those skilled in the art can modify, or appropriately change and combine, the methods and applications described herein without departing from the content, spirit, and scope of this invention in order to implement and apply its technology.
[0037] The objectives of this invention are twofold: first, to address the problem that current DNA storage technologies require the separate synthesis of DNA for each stored file, making batch distribution impossible; and second, to resolve the issue that current DNA storage technologies are prone to errors during DNA sequence synthesis, storage, and retrieval.
[0038] This invention, based on the characteristics of microorganisms—rapid reproduction and precise transmission of genetic information—and the error-correcting capabilities of disk arrays, presents a distributed DNA storage array technology with high error-correcting capabilities based on a microbial system. The specific details of the invention are as follows:
[0039] In a first aspect, the present invention provides a distributed DNA basic storage unit array technology. Distributed array storage is the core of the storage method of the present invention. The information to be stored in DNA is divided into basic storage units of the form "index + content", and the basic storage units are arranged in a distributed manner. The advantage is that the index fragments and content fragments of the basic storage units can be recombined at any time according to the content to be stored, which overcomes the non-reusability of conventionally synthesized storage DNA and greatly reduces the storage cost.
[0040] (1) The data to be stored is divided into fixed-size data units, represented by content information units. The content information units are then address-encoded according to their offset addresses in the original data to form index information units. Depending on the address range, each index information unit may include multiple index address units. The index information units and content information are encoded according to the following encoding rules to form corresponding base sequences. Each content fragment base sequence and its corresponding multiple index information unit base sequences constitute a basic DNA storage unit.
[0041] The basic storage unit module encoding rules are as follows: the basic storage unit sequence module is composed of four bases, A, T, C, and G, arranged randomly, with a length of 12-36 bp; the basic storage unit has no five consecutive A or T bases and no four consecutive C or G bases; the GC content of the basic storage unit sequence is between 40% and 60%; the basic storage unit sequence itself has no four or more consecutive reverse complementary sequences; the Hamming distance between the basic storage unit sequences is greater than 3; the basic storage unit sequence module has no overlap region, and the last 6 positions of the basic storage unit sequence module have no sequences that are homologous to or reverse complementary to the restriction enzyme site.
[0042] (2) The data represented by the index information unit and content address unit formed in step (1) is used to generate a check unit using the RAID5 mechanism, forming row check information. The index address unit, content address unit, and check unit constitute a basic DNA unit.
[0043] (3) Construct redundant stripes based on the RAID6 mechanism using the basic storage units consisting of index information units, content information units, and check information units from step (2). In the DNA storage system, multiple content information units are encoded into two check information units through erasure coding. The content information units and check information units form a data stripe. The index information unit covers the stripe address and the addresses of the content information units and check information units within the stripe. The column check generated based on the RAID6 mechanism only calculates the column check unit for the content address unit and the check information unit of each DNA storage unit. The index information unit, since it needs to represent address information, does not need to be checked and instead represents the address range normally. The index address unit and the corresponding content information form each DNA basic unit. Multiple DNA basic units constitute a DNA storage file library, and one or more DNA storage files constitute a DNA basic unit library that stores all the data to be stored.
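The module encoding rules listed under step (1) can be checked mechanically. The following Python sketch is one possible reading of those rules, not the patent's own implementation: the function names are hypothetical, "five consecutive A or T" is read as a homopolymer run, the reverse-complement rule is read as "no 4-mer whose reverse complement also occurs in the same module", and the default restriction sites are the EcoRI and XbaI sites given in Table 2 of the example.

```python
import re
from itertools import combinations

_COMPLEMENT = str.maketrans("ATCG", "TAGC")
# Restriction sites from Table 2 of the example (EcoRI, XbaI).
DEFAULT_SITES = ("GAATTC", "TCTAGA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def has_self_complement(seq: str, k: int = 4) -> bool:
    """Assumed reading: some k-mer's reverse complement also occurs in seq."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(kmer) in kmers for kmer in kmers)


def is_valid_module(seq: str, sites=DEFAULT_SITES) -> bool:
    """Check one module against the single-sequence rules."""
    if not re.fullmatch(r"[ATCG]{12,36}", seq):       # alphabet and length
        return False
    if re.search(r"A{5}|T{5}|C{4}|G{4}", seq):        # homopolymer runs
        return False
    if not 0.40 <= gc_content(seq) <= 0.60:           # GC content window
        return False
    if has_self_complement(seq):                      # self-complementarity
        return False
    tail = seq[-6:]                                   # last 6 positions
    return all(tail != s and tail != reverse_complement(s) for s in sites)


def library_is_distinct(modules, min_distance: int = 4) -> bool:
    """Pairwise Hamming distance greater than 3 (equal-length pairs only)."""
    return all(hamming(a, b) >= min_distance
               for a, b in combinations(modules, 2) if len(a) == len(b))
```

A candidate module would be accepted only if `is_valid_module` passes and adding it keeps `library_is_distinct` true for the whole library.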
[0044] According to the distributed DNA basic unit storage method of the present invention, the preferred data index information includes page number, row and column information.
[0045] A second aspect of this invention provides a high-tolerance error-correcting DNA storage technology, namely Bio-RAID technology. In the storage method of this invention, microbial RAID verification is the core and key element. It involves encoding based on the sequence information of the basic DNA storage unit using row and column verification methods, and employing row-verification RAID 5 and column-verification RAID 6 disk verification technologies. This ultimately forms a high-tolerance error-correcting Bio-RAID distributed DNA disk array system capable of accurately recovering DNA storage unit information.
[0046] Figure 1 shows an example of a Bio-RAID distributed DNA disk array system, which includes the following components:
[0047] This invention provides a RAID5 row check unit physical library within a DNA basic storage unit sequence library. The row check physical library consists of multiple row check fragment oligonucleotides, each with a DNA sequence corresponding to a specific data information. Each corresponding basic DNA fragment comprises three index information (index X, index Y, and index Z) base sequences, a content information base sequence, and a row check base sequence calculated using RAID5 disk technology. The RAID5 row check mode allows for the loss of one DNA fragment out of every five basic DNA fragments; that is, even if any one of the five information segments (the three index information base sequences, the content information base sequence, and the row check base sequence) is lost, the metadata can still be recovered.
[0048] Row parity information - RAID5 calculation process: XOR the three index information values and one content information value to generate a parity value as row parity information, which is used to determine whether there is an error in the current row.
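As a minimal sketch of this row check, the Python below XORs three index values and one content value into a parity value, verifies a row, and rebuilds a single lost value. It works on plain integers; how those values map to base sequences is not specified here, so the mapping is left out, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor


def row_parity(values):
    """XOR of the three index values and the content value."""
    return reduce(xor, values, 0)


def verify_row(values, parity):
    """True if the recomputed parity matches the stored parity."""
    return row_parity(values) == parity


def recover_missing(row):
    """Rebuild the single lost value (marked None) in a 5-value row.

    The row holds index X, index Y, index Z, content and row parity.
    Because the XOR of all five values is zero, the lost value equals
    the XOR of the four surviving ones.
    """
    missing = [i for i, value in enumerate(row) if value is None]
    if len(missing) != 1:
        raise ValueError("RAID5-style parity can rebuild exactly one value")
    repaired = list(row)
    repaired[missing[0]] = reduce(xor, (v for v in row if v is not None), 0)
    return repaired


# Illustrative values only: indexes (0, 0, 2) and an 8-bit content value.
indexes_and_content = [0, 0, 2, 0xA7]
parity = row_parity(indexes_and_content)
full_row = indexes_and_content + [parity]
```

The same XOR both detects an inconsistent row after sequencing (`verify_row`) and restores one lost segment (`recover_missing`), which is the one-in-five tolerance described above.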
[0049] Column parity information - RAID6 calculation process: Column parity information provides redundancy across multiple different storage units. It covers the content information segments and row parity information segments of the different storage units; the index information segments are excluded because they must represent the stripe information and the identifiers of the storage units within the stripe. In this embodiment, column parity segments are generated separately for the content information segments and for the row parity information segments, and they are written into the content information segment and row parity information segment positions of the column parity storage units. In this embodiment, column parity information is calculated based on a 2*4 Vandermonde matrix.
[0050] The calculation steps are as follows:
[0051] 1) The content information fragments in the four storage units are organized into a column vector in sequence;
[0052] 2) Perform matrix multiplication by multiplying the 2*4 Vandermonde matrix by the column vectors in step 1);
[0053] 2.a) In a 2x4 Vandermonde matrix, each row is multiplied one-to-one with each value in the column vector, and then an XOR operation is performed to obtain the XOR value; the multiplication process is as follows:
[0054] The RAID6 mechanism uses RS (Reed-Solomon) encoding, and its encoding calculation follows the operation rules of the Galois field $GF(2^w)$. Internally, operations in the Galois field are generally implemented using discrete logarithms and antilogarithms. To perform calculations over $GF(2^4)$, the primitive polynomial $P(x) = x^4 + x + 1$ is first used to obtain the discrete logarithm table of $GF(2^4)$, shown in Table 1:
[0055] Table 1. Discrete logarithm and discrete antilogarithm table over $GF(2^4)$
[0056]
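[0057] In Table 1, the discrete antilogarithm is $\mathrm{gfilog}[i] = 2^{i} \bmod P(x)$, that is, the element 2 (the polynomial $x$) raised to the power $i$ and reduced modulo the primitive polynomial, and the discrete logarithm is its inverse, $\mathrm{gflog}[\mathrm{gfilog}[i]] = i$. From these properties, multiplication and division over the Galois field are calculated as follows, with the exponents added or subtracted modulo $2^4 - 1 = 15$:
[0058] $A \times B = \mathrm{gfilog}[\mathrm{gflog}[A \times B]] = \mathrm{gfilog}[\mathrm{gflog}[A] + \mathrm{gflog}[B]]$
[0059] $A / B = \mathrm{gfilog}[\mathrm{gflog}[A / B]] = \mathrm{gfilog}[\mathrm{gflog}[A] - \mathrm{gflog}[B]]$
[0060] For example, $3 \times 7$ over $GF(2^4)$ is calculated as follows:
[0061] $3 \times 7 = \mathrm{gfilog}[\mathrm{gflog}[3 \times 7]] = \mathrm{gfilog}[\mathrm{gflog}[3] + \mathrm{gflog}[7]] = \mathrm{gfilog}[4 + 10] = \mathrm{gfilog}[14] = 9$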
[0062] 2.b) Repeat step 2.a) to obtain the XOR value of the second matrix multiplication based on the second row of the 2*4 Vandermonde matrix.
[0063] 3) Steps 2.a) and 2.b) yield two column validation information fragments.
[0064] 4) Calculate two column check information segments corresponding to the content information segments and row check information segments in the four storage units respectively, and fill them into the segments in the column check storage units respectively.
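The calculation steps above can be sketched in Python as follows. The logarithm tables are built from the primitive polynomial $x^4 + x + 1$ named in the text. The entries of the 2*4 Vandermonde matrix are not given in this section, so the matrix below (rows 1,1,1,1 and 1,2,3,4) is an assumption, as are the function names and the use of 4-bit symbols; longer segments would be processed symbol by symbol.

```python
PRIMITIVE_POLY = 0b10011          # P(x) = x^4 + x + 1
FIELD_SIZE = 16                   # GF(2^4)
ORDER = FIELD_SIZE - 1            # multiplicative group order, 15


def build_tables():
    """Build gflog (logarithm) and gfilog (antilogarithm) tables."""
    gflog = [None] * FIELD_SIZE   # gflog[0] is undefined
    gfilog = [0] * ORDER
    value = 1
    for i in range(ORDER):
        gfilog[i] = value
        gflog[value] = i
        value <<= 1               # multiply by x
        if value & FIELD_SIZE:    # reduce modulo P(x) on overflow
            value ^= PRIMITIVE_POLY
    return gflog, gfilog


GFLOG, GFILOG = build_tables()


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GFILOG[(GFLOG[a] + GFLOG[b]) % ORDER]


def gf_div(a, b):
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^4)")
    if a == 0:
        return 0
    return GFILOG[(GFLOG[a] - GFLOG[b]) % ORDER]


# Assumed 2x4 Vandermonde matrix; the source does not list its entries.
VANDERMONDE = [[1, 1, 1, 1],
               [1, 2, 3, 4]]


def column_checks(symbols):
    """Multiply the matrix by the 4-symbol column vector (sum = XOR)."""
    if len(symbols) != 4:
        raise ValueError("expected four data symbols per stripe")
    checks = []
    for row in VANDERMONDE:
        acc = 0
        for coeff, sym in zip(row, symbols):
            acc ^= gf_mul(coeff, sym)
        checks.append(acc)
    return checks


def recover_two(symbols, checks):
    """Rebuild two lost data symbols (None) from the two column checks.

    Relies on the first matrix row being all ones.
    """
    missing = [k for k, sym in enumerate(symbols) if sym is None]
    if len(missing) != 2:
        raise ValueError("this sketch handles exactly two lost data symbols")
    i, j = missing
    s0, s1 = checks
    for k, sym in enumerate(symbols):
        if sym is not None:       # strip known symbols from both checks
            s0 ^= gf_mul(VANDERMONDE[0][k], sym)
            s1 ^= gf_mul(VANDERMONDE[1][k], sym)
    a_i, a_j = VANDERMONDE[1][i], VANDERMONDE[1][j]
    d_i = gf_div(gf_mul(a_j, s0) ^ s1, a_i ^ a_j)
    repaired = list(symbols)
    repaired[i], repaired[j] = d_i, s0 ^ d_i
    return repaired
```

With these tables, `gf_mul(3, 7)` follows the same path as the worked example in the text (logarithms 4 and 10, antilogarithm of 14). The sketch only covers the loss of two data units; when a check unit is among the lost ones, it can simply be recomputed from the surviving data.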
[0065] Row check information is used to confirm whether the current DNA basic storage unit is faulty. During data writing, the check information is calculated from the three index information segments and the content information. After data reading (sequencing), the check information is calculated again from the three index information segments and the content information and compared with the check information stored in the basic storage unit. If the two do not match, the current DNA basic storage unit is considered faulty. The steps for calculating the RAID5 row check base sequence, from data writing to the formation of a DNA basic storage unit with row check information, are as follows:
[0066] 1) During the data writing process, the data is divided into data segments according to the number of bits corresponding to the basic storage unit of DNA, which is generally a multiple of 8 bits;
[0067] 2) The segmented data fragments are address-encoded according to their offsets, and the DNA basic storage units are organized in a RAID6 manner. In this embodiment, every 4 DNA basic storage units generate 2 parity storage units. The 4 content basic storage units and 2 parity storage units constitute a stripe and are numbered 0-5 in sequence within it. The stripes themselves are address-encoded in sequence. In this embodiment, 3 index information segments are used: 2 represent the stripe number and the last represents the number within the stripe.
[0068] 3) Within the basic DNA storage unit, there are 3 index information segments, 1 content information segment, and 1 check information segment. As shown in step 2), the 3 index information segments represent the stripe encoding, and the 1 content information segment represents the actual stored content (in this embodiment, 5 segments are used as an example; in actual production environments, the index information segments, content information segments, and check information segments are specifically divided according to the number of spliced segments and the addressing range). The check information segment is obtained by XORing the 3 index information segments and the 1 content information segment.
[0069] 4) In this embodiment, the data is divided into fixed-size data segments, grouped into stripes, and address-encoded, after which the row check information is generated. Within each stripe, every four basic storage units generate two check storage units based on the Vandermonde matrix. The RAID6 check segments are generated only for the content segments and row check segments; the index information segments must provide addressing and are therefore address-encoded according to the actual addresses. This invention provides a RAID6 column check unit physical library within the DNA basic storage unit sequence library. The column check physical library consists of oligonucleotides that separately store the various column check fragments. Two column check base sequences are calculated by checking again the base sequences of the five information segments (three index information segments, one content information segment, and the row check information) of every four DNA basic units. The base sequence of each pair of column check oligonucleotides corresponds to specific base sequences of the four data information segments. That is, even if any two of the six corresponding basic storage units are lost, the metadata can be recovered.
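The striping and addressing in steps 1) to 4) can be sketched as follows. This is an illustration under stated assumptions: the number of distinct values one index segment can represent is not given in this section, so `INDEX_RADIX = 16` is a hypothetical choice, and `Address`, `address_for`, and `layout` are hypothetical names. Check units are emitted as placeholders; their payloads would come from the row and column check calculations.

```python
from typing import NamedTuple

DATA_PER_STRIPE = 4      # content units per stripe (from the embodiment)
CHECKS_PER_STRIPE = 2    # column check units per stripe (from the embodiment)
INDEX_RADIX = 16         # hypothetical: values one index segment can hold


class Address(NamedTuple):
    index_x: int         # high part of the stripe number
    index_y: int         # low part of the stripe number
    index_z: int         # position 0-5 within the stripe


def address_for(stripe_number: int, position: int) -> Address:
    """Split a stripe number over two index segments, plus the position."""
    if not 0 <= position < DATA_PER_STRIPE + CHECKS_PER_STRIPE:
        raise ValueError("position within a stripe must be 0-5")
    if not 0 <= stripe_number < INDEX_RADIX ** 2:
        raise ValueError("stripe number does not fit in two index segments")
    high, low = divmod(stripe_number, INDEX_RADIX)
    return Address(high, low, position)


def layout(segments):
    """Group segments four per stripe and append two check placeholders."""
    units = []
    for start in range(0, len(segments), DATA_PER_STRIPE):
        stripe_number = start // DATA_PER_STRIPE
        group = segments[start:start + DATA_PER_STRIPE]
        for position, segment in enumerate(group):
            units.append((address_for(stripe_number, position),
                          "content", segment))
        for k in range(CHECKS_PER_STRIPE):
            units.append((address_for(stripe_number, DATA_PER_STRIPE + k),
                          "check", None))
    return units


# Eight one-byte segments give two stripes of six units each.
units = layout(list(b"ABCDEFGH"))
```

Keeping the in-stripe position in its own index segment means a reader can tell content units (0-3) from check units (4-5) directly from the address.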
[0070] For the five base-encoded parts formed from the information text, namely index X, index Y, index Z, the basic content storage unit, and the row check, the head restriction enzyme cleavage site sequence, the overlap base sequences, and the tail restriction enzyme cleavage site sequence are inserted in sequence among the five parts, finally forming a DNA basic storage unit. Multiple sets of DNA basic storage units constitute a high-tolerance error-correcting DNA storage library that holds all the stored data.
[0071] According to the high-tolerance error-correcting DNA storage method of the present invention, preferably, a linking sequence, i.e. an overlap region, is further included between the index fragment oligonucleotide, the content fragment oligonucleotide and the row check oligonucleotide in the basic DNA storage unit.
[0072] In the high-tolerance error-correcting DNA storage method of the present invention, preferably, the ligation sequence is an overlapping sequence or an enzyme-digested adapter sequence.
[0073] In a third aspect, this invention provides a multi-fragment DNA storage live array system, whose core is the living biological carrier. Building on the high-tolerance error-correcting DNA storage disk array, a highly redundant multi-fragment DNA storage live array is constructed using living biological carriers, which includes the following aspects:
[0074] a. The ligation method for distributed DNA basic unit fragments can be selected according to actual needs, either in situ ligation into a single plasmid or mixed ligation into multiple fragment plasmids. In situ ligation involves adding the appropriate vector and reaction solvent to a distributed arrangement matrix, where each DNA basic storage unit is a separate system. Mixed ligation involves adding the appropriate vector and reaction solvent to a mixed multi-fragment system, where multiple DNA basic units are separate systems.
[0075] b. Distributed DNA basic unit fragment plasmids are transformed into microbial vectors, with the most suitable microorganisms as vectors. The DNA basic unit fragments are stored in the microorganisms or made into dry powder for long-term preservation. The characteristics of microbial vectors or plasmid vectors can be used to amplify and preserve DNA basic unit fragments, thereby achieving efficient storage of data information.
[0076] c. The mixed DNA basic unit system exhibits a distributed matrix form throughout the entire process of ligation, transformation, amplification, and preservation.
[0077] The preferred ligation methods of the multi-fragment DNA storage live array system of the present invention include, but are not limited to, enzyme ligation and homologous recombination ligation.
[0078] In the multi-fragment DNA storage live array system of the present invention, preferred microbial carriers include, but are not limited to, Escherichia coli, yeast, and Bacillus.
[0079] In the multi-fragment DNA storage live array system of the present invention, preferred methods for transforming the basic DNA storage unit plasmids include, but are not limited to, heat-shock transformation and electrotransformation (electroporation).
[0080] In the multi-fragment DNA storage live array system of the present invention, preferred methods for amplifying the DNA basic storage unit fragments include, but are not limited to, microbial proliferation culture and polymerase chain reaction.
[0081] In the multi-fragment DNA storage live array system of the present invention, preferably, the distributed matrix form is present in any single step or in consecutive steps of the DNA basic unit storage process.
[0082] The raw materials and reagents used in the distributed array storage unit library and high-tolerance error-correcting DNA storage method provided by this invention are all commercially available.
[0083] The present invention will be further illustrated below with reference to the embodiments:
[0084] The following example uses a distributed DNA array to store all 24 Chinese characters and 4 punctuation marks of Wang Zhihuan's poem "Ascending the Stork Tower" as an embodiment of the present invention, and employs two methods for storage: in-situ vector ligation and mixed fragment ligation.
[0085] Example 1: Basic Storage Unit Module Design
[0086] According to the distributed DNA storage array method and coding principle described in the present invention, a file is generated by designing basic storage unit modules for the 24 Chinese characters and 4 punctuation marks of the whole poem "Ascending the Stork Tower". The base sequences of the target file's index information, content information, row check sequences, and column check sequences are shown in Table 1. For the basic unit file of "Ascending the Stork Tower", the restriction enzyme cleavage sites and overlap sequences at the head and tail of the target file were designed according to the coding principle and are shown in Table 2. The way the restriction enzyme cleavage sites, overlap regions, index sequences, content sequences, row checks, and column checks are connected and combined is shown in Figure 2.
[0087] Table 1: Target file index module, content sequence module, and check module
[0088]
[0089]
[0090]
[0091] Table 2 Target file restriction enzyme cleavage sites and Overlap sequences
[0092]
Name | Serial Number | Sequence
EcoRI | | GAATTC
overlap1 | SEQ ID No. 211 | ATAAGCCTCGAGTAG
overlap2 | SEQ ID No. 212 | TGATAGTACCAGAGC
overlap3 | SEQ ID No. 213 | TAGTCAACTAGCCTC
overlap4 | SEQ ID No. 214 | CTACATGTCCAGGCA
overlap5 | SEQ ID No. 215 | AGCTATACGGAGCAT
overlap6 | SEQ ID No. 216 | GCTTGTGACAGCATA
XbaI | SEQ ID No. 217 | TCTAGA
[0093] Example 2 Construction of distributed storage array
[0094] The distributed character library array includes two parts: a content information area and a check area. The check area includes row checks and column checks. The row check is RAID5, and the column check is RAID6. Every four basic storage units of the DNA storage information form one group, and every two checks in the check area form one group. The groups are arranged in this way in turn to form the character library array.
[0095] A character library array is designed for the 24 Chinese characters and 4 punctuation marks of the whole poem "Ascending the Stork Tower" according to the method described in the present invention. "Deng", "Guan", "Que", "Lou", "Column Check 1-1", and "Column Check 1-2" form the data block in the first row. The second data block, in the second row, is "Bai", "Ri", "Yi", "Shan", "Column Check 2-1", and "Column Check 2-2", and so on. Finally, a 6x7 character library array is constructed, with 6 information segments in each row and 7 rows in total: 28 characters in the content information area, 14 in the check area, and 42 coding sequences in all. The corresponding arrangement of the whole poem is shown in Figure 3. The method described in the present invention can construct different character library arrays according to different content information, so that distributed character library arrays can be designed for different DNA storage files.
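The row layout described above can be sketched in Python. This is a minimal illustration and not the encoding software of the invention: the function name `build_array`, the placeholder labels `U9` to `U28` standing in for the remaining units of the poem, and the extension of the "Column Check r-k" naming to rows 3 to 7 are assumptions. The check cells are labels only; no RAID6 arithmetic is performed.

```python
def build_array(units, data_per_row=4, checks_per_row=2):
    """Arrange content units into rows of `data_per_row` cells, each followed by
    `checks_per_row` column-check placeholders (labels only, no parity math)."""
    if len(units) % data_per_row != 0:
        raise ValueError("number of units must be a multiple of data_per_row")
    rows = []
    for start in range(0, len(units), data_per_row):
        row_no = start // data_per_row + 1
        data = list(units[start:start + data_per_row])
        checks = [f"Column Check {row_no}-{k}" for k in range(1, checks_per_row + 1)]
        rows.append(data + checks)
    return rows


# The first two rows use the romanized characters named in the text; the other
# 20 units are hypothetical placeholders for the rest of the poem.
units = ["Deng", "Guan", "Que", "Lou", "Bai", "Ri", "Yi", "Shan"]
units += [f"U{i}" for i in range(9, 29)]

array = build_array(units)
n_rows = len(array)
n_cols = len(array[0])
n_content = sum(1 for row in array for cell in row if not cell.startswith("Column Check"))
n_check = n_rows * n_cols - n_content
print(n_rows, n_cols, n_content, n_check)
```

With 28 content units this yields 7 rows of 6 cells, that is, 28 content cells and 14 check cells, matching the 42 coding sequences stated above.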
[0096] Example 3 Construction of plasmid vector library
[0097] Method 1: In-situ addition of vector for ligation
[0098] Based on the 6x7 character library array plate quantified using Nanodrop, an appropriate amount of pUC-19 vector and homologous recombinase is added to each well for ligation, as shown in Figure 3, which provides an exemplary method of in-situ addition of the pUC-19 vector to the high-capacity error-correcting DNA disk storage array. The specific PCR system for homologous ligation of the target fragment and vector and the homologous recombinase ligation method are described in sections 3.1 and 3.2.
[0099] Method 2: Mixed fragment ligation
[0100] Take 2 μl of the DNA basic storage unit fragments from each well of the 6x7 character array plate and mix them. Perform homologous PCR amplification on the mixed DNA basic storage unit target fragments. Take 2 μl of the resulting PCR product and add an appropriate amount of pUC-19 vector and homologous recombinase for ligation, as shown in Figure 4. The specific PCR system for homologous ligation of the target fragment and vector and the homologous recombinase ligation method are described in sections 3.1 and 3.2.
[0101] 3.1 Preparation of target fragments and vectors for homologous ligation
[0102] 3.1.1 Preparation of homologous target fragments of basic storage units:
[0103] Add 25 μl of 2×TSINGKE Master Mix, 1 μl of DNA basic storage unit plasmid, 1 μl of 10 μM forward primer bio-F, 1 μl of 10 μM reverse primer bio-R, and 23 μl of ddH2O to a total volume of 50 μl. Perform pre-denaturation at 94 °C for 4 min, followed by denaturation at 95 °C for 30 s, annealing at 55 °C for 30 s, extension at 72 °C for 20 s, repeating 30 cycles, and finally extending at 72 °C for 5 min.
[0104] 3.1.2 Preparation of pUC-19 homologous vector
[0105] Add 25 μl of 2×TSINGKE Master Mix, 1 μl of the basic storage unit plasmid, 1 μl of 10 μM forward primer pUC19-F, 1 μl of 10 μM reverse primer pUC19-R, and 23 μl of ddH2O to a total volume of 50 μl. Perform pre-denaturation at 94 °C for 4 min, followed by denaturation at 95 °C for 30 s, annealing at 55 °C for 30 s, extension at 72 °C for 90 s, repeating this cycle 30 times, and finally extending at 72 °C for 5 min.
[0106] Table 3 Primer sequences for homologous linking target fragments and vector preparation
[0107]
[0108] 3.2 Homologous recombinase ligation
[0109] The target fragment and vector prepared in the previous step were quantified using Nanodrop and mixed at the optimal molar ratio of vector to each insert fragment of 1:2. [Further details to be added]. -Uni Seamless Cloning and Assembly Kit, homologous ligation system: 2x Assembly Mix 5 μl, then add ddH2O to a final volume of 10 μl. Mix gently, incubate at 50°C for 15 min, and after the reaction, place the centrifuge tube on ice for 15 s.
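The 1:2 vector-to-insert molar ratio can be converted into masses from the Nanodrop readings. The sketch below assumes double-stranded DNA with the same average mass per base pair for vector and insert, so that the ratio depends only on fragment lengths. The function name and the example values (a 2,700 bp linearized vector, a 150 bp insert, 50 ng of vector) are hypothetical and are not taken from this description.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, insert_to_vector=2.0):
    """Mass of insert (ng) that gives the requested insert:vector molar ratio.

    For dsDNA, moles are proportional to mass / length, so
    insert_ng = ratio * vector_ng * insert_bp / vector_bp.
    """
    if vector_bp <= 0 or insert_bp <= 0:
        raise ValueError("fragment lengths must be positive")
    return insert_to_vector * vector_ng * insert_bp / vector_bp


# Hypothetical example values, not from the patent.
mass = insert_mass_ng(vector_ng=50.0, vector_bp=2700, insert_bp=150)
print(f"{mass:.2f} ng of insert per 50 ng of vector")
```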
[0110] Example 4: Construction of a Microbial Storage Array
[0111] 4.1 Mixed heat-shock transformation
[0112] Take 2 μl of the intact plasmid from each well of the 6x7 well plate and add it to a clean centrifuge tube. Mix the 42 plasmids thoroughly by pipetting up and down, as shown in Figure 4. The concentration was determined using Nanodrop, and the mixture was serially diluted with ddH2O to mixed plasmid concentrations of 0.5 ng / μl and 5 ng / μl for later use. 50 μl of JM109 competent cells were thawed on ice. After complete thawing, 2 μl of the mixed plasmid was added to the cells, and the tube was gently shaken to ensure thorough mixing and then placed on ice for 30 min. The cells were heat-shocked in a 42°C water bath for 90 s and immediately cooled on ice for 2 min.
[0113] 4.2 Cultivation and Amplification of Microbial Storage Arrays
[0114] 450 μl of LB medium was added to the heat-shock transformation system, which was then cultured at 37°C on a shaker at 200 rpm for 60 min. 50 μl of the recovered JM109 cells were added to 5 ml of LB medium containing ampicillin (Amp) and cultured overnight at 37°C on a shaker at 200 rpm. The remaining JM109 cells were spread evenly on a plate and cultured overnight at 37°C. The culture and amplification of the microbial storage array are shown in Figure 5 and Figure 6.
[0115] Figure 5 shows an exemplary diagram of the plasmid system containing the ancient Chinese poem "Ascending the Stork Tower".
[0116] Figure 6 shows an exemplary diagram of the transformation method for the mixed plasmids of the ancient poem "Ascending the Stork Tower".
[0117] The colonies on the plates cultured overnight were counted, and the numbers of transformants at 0.5 ng / μl and 5 ng / μl were statistically analyzed and compared.
[0118] Table 4. Statistical table of transformant numbers in JM109 at plasmid concentrations of 0.5 ng / μl and 5 ng / μl.
[0119]
Plasmid concentration (ng / μl) | Total | Transformants | Conversion rate | Conversion rate
0.5 | 1139 | 1993.25 | 1993250 | 2.00E+06
5 | 6305 | 11033.75 | 1103375 | 1.00E+06
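The conversion-rate values in Table 4 are consistent with a transformants-per-microgram calculation. The sketch below assumes that the DNA input is the 2 μl of diluted mixed plasmid added in section 4.1, and it takes the transformant totals (1993.25 and 11033.75) from Table 4 as given. The function name is illustrative, and the derivation of those totals from the plate counts is not reproduced here.

```python
def transformants_per_ug(transformants, conc_ng_per_ul, volume_ul=2.0):
    """Transformants per microgram of plasmid DNA added to the competent cells."""
    dna_ug = conc_ng_per_ul * volume_ul / 1000.0  # convert ng to ug
    return transformants / dna_ug


# Transformant totals as listed in Table 4.
low = transformants_per_ug(1993.25, 0.5)   # 0.5 ng / ul mixed plasmid
high = transformants_per_ug(11033.75, 5.0)  # 5 ng / ul mixed plasmid
print(f"{low:.0f} {high:.0f}")
```

Under this assumption, the tenfold higher DNA input gives roughly 5.5 times as many transformants, so the per-microgram value at 5 ng / μl is lower than at 0.5 ng / μl.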
[0120] 4.3. Obtaining DNA Storage Array Library
[0121] Microbial storage arrays were extracted from overnight LB medium using a plasmid miniprep kit. The target fragment was amplified by PCR, and 5 μl of the PCR product was subjected to agarose gel electrophoresis to verify the amplification effect. After verification, sequencing was performed to read the data.
[0122] (1) PCR amplification system of 0.5 ng / μl
[0123] Add 25 μl of 2×TSINGKE Master Mix, 1 μl of the mixed 0.5 ng / μl DNA basic storage unit plasmid, 1 μl of forward primer M13F(-47), 1 μl of reverse primer M13R(-48), and 23 μl of ddH2O to a total reaction volume of 50 μl. Perform pre-denaturation at 94 °C for 4 min, followed by denaturation at 95 °C for 30 s, annealing at 50 °C for 30 s, extension at 72 °C for 20 s, repeating 30 cycles, and finally extending at 72 °C for 5 min.
[0124] (2) Amplification system of 5 ng / μl
[0125] Add 25 μl of 2×TSINGKE Master Mix, 1 μl of the mixed 5 ng / μl DNA basic storage unit plasmid, 1 μl of forward primer M13F(-47), 1 μl of reverse primer M13R(-48), and 23 μl of ddH2O. Perform pre-denaturation at 94 °C for 4 min, followed by denaturation at 95 °C for 30 s, annealing at 50 °C for 30 s, extension at 72 °C for 20 s, repeating 30 cycles, and finally extending at 72 °C for 5 min.
[0126] Table 5: Primers for amplifying the target fragment in microbial mixed systems
[0127]
[0128]
[0129] Example 5 Information Reading
[0130] The constructed "Ascending the Stork Tower" poem library, comprising 42 DNA basic storage unit fragments that include content information and check content, was extracted, purified with magnetic beads, and used to construct a next-generation sequencing library for decoding. Sequencing was performed on the Illumina NovaSeq PE250 platform, and the reads were decoded with decoding software. Decoding analysis using the method of this invention showed that the microbe-based high-tolerance error-correcting DNA storage array (Bio-RAID) technology, with the mixed "Ascending the Stork Tower" DNA basic storage units stored at plasmid concentrations of 0.5 ng / μl and 5 ng / μl, achieves storage and retrieval of the entire information sequence. The specific results are shown in Figure 7 and Figure 8.
[0131] Decoding was performed on 100 reads with different sample sizes, randomly selected from the mixed plasmid samples at 0.5 ng / μl and 5 ng / μl. The results showed that the likelihood of data recovery increased with the number of sample reads. When the number of samples reached 500, the data recovery rate with RAID technology reached 99.04% and 99.75%, respectively. With 100 sample reads, the data recovery rate with RAID row and column parity checking was 14.21%–16.29% higher than that without parity checking. This demonstrates that with a small number of sample reads, the data recovery rate with RAID row and column parity checking is higher than that without parity checking. The analysis results are shown in Figure 9 and Figure 10. The high-tolerance error-correcting DNA live-cell storage array technology provided by this patent, namely Bio-RAID technology, can effectively reduce the required sequencing depth, accurately recover stored data, and provide live cells as efficient replication vectors.
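To illustrate why a row check allows a unit to be recovered when one segment is missing from the sampled reads, the sketch below applies single-parity (RAID5-style) XOR to base sequences. It is an illustration under stated assumptions, not the decoding software of the invention: the 2-bit mapping A=0, C=1, G=2, T=3, the position-wise XOR, and the four 12-base example segments are all hypothetical. The actual check sequences must also satisfy the sequence constraints of the invention, which this sketch ignores.

```python
from functools import reduce

BASES = "ACGT"  # assumed 2-bit mapping: A=0, C=1, G=2, T=3


def xor_seqs(a, b):
    """Position-wise XOR of two equal-length base sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return "".join(BASES[BASES.index(x) ^ BASES.index(y)] for x, y in zip(a, b))


def parity(segments):
    """Single-parity check sequence over a list of segments."""
    return reduce(xor_seqs, segments)


def recover(segments_with_gap, check):
    """Rebuild the one missing segment (marked None) from the rest plus the check."""
    missing = [i for i, s in enumerate(segments_with_gap) if s is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing segment")
    present = [s for s in segments_with_gap if s is not None]
    return missing[0], parity(present + [check])


# Hypothetical segments: three index units and one content unit.
segments = ["ACGTACGTACGT", "TTGACCATGGCA", "GATCGATCGATC", "CAGTTGCAAGCT"]
check = parity(segments)

damaged = list(segments)
damaged[3] = None  # pretend the content segment was absent from the reads
index, rebuilt = recover(damaged, check)
print(index, rebuilt == segments[3])
```

Because any one of the five segments can be rebuilt from the other four, a read sample that misses a single segment of a unit can still be decoded, which is in line with the higher recovery rate reported at small read counts.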
[0132] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A DNA storage method based on a distributed array storage unit library, characterized in that, The distributed array storage unit library includes multiple basic storage units; The basic storage unit includes 3 index information units, 1 content information unit to be stored, and 1 row verification information unit; The index information unit and the content information unit to be stored are encoded as corresponding base sequences; the base sequence is composed of a random arrangement of four bases: A, T, C, and G, and its length is 12-36 bases; the basic storage unit has no five consecutive A or T and no four consecutive C or G; the GC content of the basic storage unit sequence is between 40% and 60%; the basic storage unit sequence itself has no four or more consecutive reverse complementary sequences; the Hamming distance between the basic storage unit sequences is greater than 3; the basic storage unit sequence module has no overlap region, and the last 6 positions of the basic storage unit sequence module have no sequences that are homologous to or reverse complementary to the restriction enzyme site; The content information unit to be stored is used to store content information to be stored; the content information to be stored may be text, images, or audio / video. The index information unit includes page number information, row information, and column information; the basic storage unit composed of the index information unit and the content information unit to be stored is arranged in a distributed manner. The row check information unit is generated based on the RAID5 mechanism. This unit is used to confirm whether the current basic storage unit is faulty. During data writing, check information is calculated based on three index information units and the content to be stored. After data reading, check information is calculated again based on the three index information units and the content to be stored. 
The calculated check information is then compared with the check information in the basic storage unit. If the comparison is inconsistent, the current basic storage unit is considered faulty and deemed to have failed. The RAID5 row check mode allows the loss of one DNA fragment out of every five basic DNA fragments. That is, the loss of any one of the five information units—three index information base sequences, one content information base sequence, and the row check base sequence—can still recover the metadata. The distributed array storage unit library also includes two column verification information units between basic storage units; The column check information unit is calculated based on the content information unit to be stored and the row check information unit and generated based on the RAID6 mechanism. The RAID6 mechanism calculates two column check base sequences based on the content information and row check information base sequences in each of the four basic DNA storage units. Each pair of column check base sequences corresponds to a specific base sequence of the four data information. That is, if any two of the six corresponding basic storage units are lost, the metadata can be recovered.
2. The DNA storage method based on a distributed array storage unit library as described in claim 1, characterized in that, The index information unit, the content information unit to be stored, and the row verification information unit include a connection sequence, which is an overlap area.
3. The DNA storage method based on a distributed array storage unit library as described in claim 1, characterized in that, The index information unit, the content information unit to be stored, and the row verification information unit include a head enzyme digestion base sequence, an overlap base sequence, and a tail enzyme digestion base sequence.
4. The DNA storage method based on a distributed array storage unit library as described in any one of claims 1-3, characterized in that, The storage medium for the distributed array storage unit library is microorganisms.
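Several of the sequence constraints recited in claim 1 can be checked programmatically. The sketch below is one possible reading, not the design tool of the invention: it interprets "no five consecutive A or T and no four consecutive C or G" as a limit on single-base runs, treats "Hamming distance greater than 3" as at least 4 mismatches between equal-length sequences, and omits the reverse-complement and restriction-site rules. All function names are hypothetical, and the two overlap sequences from Table 2 are used only as convenient sample inputs.

```python
import re
from itertools import combinations


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_unit_rules(seq):
    """Check one sequence against the length, run-length, and GC rules."""
    if not re.fullmatch(r"[ACGT]{12,36}", seq):
        return False  # wrong alphabet or length outside 12-36 bases
    if re.search(r"A{5}|T{5}|C{4}|G{4}", seq):
        return False  # forbidden single-base run (assumed interpretation)
    return 0.40 <= gc_content(seq) <= 0.60


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def passes_library_rules(seqs, min_distance=4):
    """Every sequence passes the unit rules and every pair differs in more than 3 positions."""
    if not all(passes_unit_rules(s) for s in seqs):
        return False
    return all(hamming(a, b) >= min_distance for a, b in combinations(seqs, 2))


# Sample inputs: overlap1 and overlap2 from Table 2.
overlap1 = "ATAAGCCTCGAGTAG"
overlap2 = "TGATAGTACCAGAGC"
print(passes_library_rules([overlap1, overlap2]), hamming(overlap1, overlap2))
```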