A DNA information encoding and decoding method based on modulation

The DNA information encoding and decoding method based on modulation solves the limitations of high-density coding and PCR primer libraries in DNA information storage, realizes efficient information storage and retrieval, meets the constraints of GC content and homopolymer distribution, and improves coding density and PCR amplification efficiency.

CN116597901B · Active · Publication Date: 2026-06-23 · TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
Filing Date
2023-05-22
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing DNA information storage technologies are limited by high-density coding methods and high-capacity PCR primer libraries, which makes information retrieval difficult when the data scale increases. Furthermore, the G and C content and homopolymer distribution in the DNA strand do not meet the constraints, affecting the accuracy of sequencing and synthesis.

Method used

A modulation-based DNA information encoding and decoding method is adopted. Through information segmentation, verification, sequence transformation, modulation optimization, generation of marker sequences and addressing sequences, modulation primers and addressing sequences are constructed to meet the constraints of GC content and homopolymer distribution, and unique marker sequences and addressing sequences are generated.

Benefits of technology

It achieves a coding density of 1.9 bits/nt, close to the theoretical limit, while generating specific addressing sequences that support efficient PCR amplification and information retrieval, reduce DNA strand redundancy, and simplify the PCR primer generation process.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a DNA information encoding and decoding method based on modulation, wherein the encoding method comprises information blocking, information checking, sequence conversion, modulation optimization, generation of a mark sequence, generation of an addressing sequence and output; in the encoding, a special modulation primer is used to optimize a base sequence, so that the base sequence meets a constraint condition, and a mark sequence is constructed; in the decoding, the correct original information can be recovered. The modulation primer, the index sequence and the primer sequence are combined, so that the addressing function is preserved, and the PCR primer can be used, thereby greatly reducing information redundancy. The encoding and decoding scheme can be used for encoding and decoding of any computer-stored digital information, and the encoding density is close to the theoretical limit. The number of specific addressing sequences matching the size of the base sequence is generated, so that the modulation, addressing, PCR amplification and information retrieval functions are met.

Description

Technical Field

[0001] This invention relates to the field of computer information encoding and decoding technology, and in particular to a DNA information encoding and decoding method based on modulation.

Background Technology

[0002] With the development of internet technology, artificial intelligence, cloud computing, and other technologies, the world has entered the information age. Correspondingly, data is exploding, with everyone constantly generating and processing it. According to statistics, the accumulated data has already exceeded 20 zettabytes and is growing at a rate of 50% annually, leading to a huge information storage gap in the near future. To address this problem, attention has turned to molecular storage. Biomolecules with specific structures and sequences, such as DNA and polypeptides, have become a research hotspot. Compared to optoelectronic storage media like flash memory, these molecular storage media offer smaller storage volumes and higher storage densities. Among various molecular storage media, DNA, with its high stability, ability to be stored long-term under simple conditions, and ease of replication, holds the most promise as a next-generation storage medium.

[0003] DNA information storage uses the sequential arrangement of the four deoxyribonucleotides "A," "T," "C," and "G" to store information. The general process is to encode computer information into DNA sequence information, store this sequence in a text file, and then synthesize DNA according to the encoded sequence read from that file. Compared to traditional storage methods that use only the two states "0" and "1," DNA storage is quaternary, so each symbol carries more information. Furthermore, the nanoscale structure of DNA gives it a volumetric storage density several orders of magnitude higher than that of traditional optoelectronic storage media such as flash memory. Quaternary encoding represents the theoretical storage density of DNA. In actual encoding, however, limitations in DNA sequencing and synthesis technologies impose constraints. There are two main conditions: 1. The G and C content of the strand should be between 40% and 60%; 2. The converted DNA strand should contain as few homopolymers as possible, meaning that no single deoxyribonucleotide may appear four or more times consecutively. These constraints ensure that the encoded DNA strand can be correctly synthesized and sequenced, but they also limit the encoding density of DNA information storage. Therefore, to improve the practical value of DNA information storage, it is necessary to develop a high-density encoding method that meets these constraints. Furthermore, as the scale of DNA information storage data expands, information retrieval has also become a factor limiting its expansion. Currently, the mainstream large-scale data retrieval method is based on PCR amplification. Data shows that this method can retrieve the desired target file from 200MB of stored information. However, PCR amplification requires the construction of a specific PCR primer library; the primers must be specific both with respect to one another and with respect to the stored information.

[0004] In summary, creating high-density coding schemes and high-capacity PCR primer libraries under constraints is essential for the practical application of DNA information storage.

[0005] It should be noted that the information disclosed in the background section above is only for understanding the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0006] The main objective of this invention is to effectively solve the aforementioned problems faced by existing DNA information storage and to provide a DNA information encoding and decoding method based on modulation.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] A method for encoding DNA information based on modulation includes the following steps:

[0009] S1. Information Blocking: The file information to be encoded is divided into multiple information blocks, and an index number is assigned to each information block.

[0010] S2. Information Verification: Add verification information to each information block;

[0011] S3. Sequence Conversion: Convert the byte stream data in each information block into binary information, and then into a base sequence;

[0012] S4. Modulation optimization: Construct modulation primers, determine whether the base sequence meets the constraints, and optimize the base sequence that does not meet the constraints using the modulation primers.

[0013] S5. Generate a flag sequence: Generate a unique flag sequence for each information block based on the modulation process information;

[0014] S6. Generate addressing sequence: Generate the addressing sequence for each information block based on the modulation primer and index number of each information block;

[0015] S7. Output: Connect each information block with its marker sequence and addressing sequence to assemble a complete DNA chain, and output it to a text file for storage.

[0016] Furthermore:

[0017] Step S3 includes converting the information in the information block into a base sequence form using a quaternary encoding rule, where “00”, “01”, “10”, and “11” correspond to the bases “A”, “T”, “C”, and “G”, respectively.

[0018] Step S4 includes: constructing GC modulation primers and homopolymer modulation primers for each information block. The modulation primers are 4-base sequences with 50% GC content and no homopolymer. The GC content and homopolymer of the information sequence are made to meet the constraints by modulation of the modulation primers.

[0019] The following modulation strategy is adopted: if the base sequence meets the constraints, no modulation is performed; if there are problems with the GC content or homopolymer of the base sequence, the corresponding modulation primer is used for modulation; if there are problems with both the GC content and homopolymer of the base sequence, two modulation primers are used for modulation simultaneously.

[0020] Preferably, for sequences whose GC content does not meet the requirements, the entire sequence is directly modulated; for sequences with excessively long homopolymers, the position containing the homopolymer is located, and only the homopolymer at that position is locally modulated.

[0021] Preferably, the modulation result with the shortest flag sequence is selected from the sequences that meet the constraints after modulation optimization as the constraint sequence;

[0022] Preferably, the modulation optimization process is a 2-bit XOR operation: the bases of the base sequence are divided into groups, each group is aligned with a modulation primer, and each base is XORed with the corresponding primer base. The result of the operation is the modulation result.

[0023] Step S5 includes: generating a unique identifier sequence for each information block to record modulation process information. The modulation process information recorded in the identifier sequence includes three parts: whether GC modulation primers were used for overall modulation; the specific location of homopolymer modulation; and the number of times homopolymer modulation was performed.

[0024] Preferably, 1 bit of binary information is used to indicate whether GC primers are used for overall modulation, the specific position of homopolymer modulation is represented by 8 bits of binary number, and the number of homopolymer modulations is stored as 7 bits of binary information, which are combined with the preceding 1 bit of GC modulation information to form 8 bits of binary information.

[0025] Preferably, the marker sequence is converted into a base sequence using a quaternary encoding rule, and then the base sequence is modulated using the GC modulation primer and/or the homopolymer modulation primer.

[0026] Step S6 includes: incorporating the index number and modulation primer into the addressing sequence, and generating the addressing sequence for each information block using a structured addressing sequence design method;

[0027] The structured addressing sequence design method combines a marker, a balancing sequence, the modulation primer, and the index number into an addressing sequence. The marker is a homopolymer used to ensure the specificity of the addressing sequence, and the balancing sequence is a randomly generated multi-base sequence used to ensure GC content balance. An addressing sequence is placed at each end of each DNA strand, referred to as the pre-addressing sequence and the post-addressing sequence, respectively. The pre-addressing sequence stores the GC modulation primer, and the post-addressing sequence stores the homopolymer modulation primer.

[0028] Step S7 includes: combining the modulated and optimized constraint sequence with its preceding and following primers and marker sequences to form a complete DNA strand structure for storing information, and outputting it to a text file for storage.

[0029] A modulation-based DNA information decoding method for decoding a DNA strand encoded using the aforementioned DNA information encoding method includes the following steps:

[0030] T1: Addressing Sequence Reconstruction: Read the addressing sequence of the DNA strand and reconstruct the modulation primers and index numbers;

[0031] T2: Demodulation: Decode the marker sequence, and restore the base sequence to an unmodulated sequence based on the marker sequence information and modulation primers;

[0032] T3: Sequence Conversion: Convert the unmodulated sequence into a binary sequence and then restore it to byte stream data;

[0033] T4: Information Verification: Verify and correct stored information based on verification algorithms and redundant information;

[0034] T5: Information Reassembly: Restore the correct stored information to the original information based on the index number.

[0035] Furthermore:

[0036] Step T1 includes: determining the location of the addressing sequence from the read DNA sequence, and disassembling the modulation primer and index sequence according to the addressing sequence structure;

[0037] Step T2 includes: first, finding the marker sequence from the DNA sequence, decoding the modulation process information, and then combining it with the modulation primers parsed from the primers to reverse modulate the DNA sequence to obtain the unmodulated storage information sequence;

[0038] Step T3 includes: first converting the unmodulated storage information sequence from a base sequence to a binary sequence, and then converting it into byte stream data through radix conversion;

[0039] Step T4 includes: the demodulated and converted byte stream data contains both storage information and verification information; a verification algorithm is used to verify the information and correct any errors in it;

[0040] Step T5 includes: sorting the decoded and verified information according to its index number using a sorting algorithm to restore the original stored information, and finally outputting its original stored file.

[0041] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the DNA information encoding method and / or the DNA information decoding method.

[0042] A control device includes a processor and a storage medium for storing a computer program; wherein the processor is configured to implement the DNA information encoding method and / or the DNA information decoding method when executing the computer program.

[0043] The present invention has the following beneficial effects:

[0044] This invention proposes a DNA information encoding and decoding method based on modulation. During encoding, the method optimizes the base sequence using specially constructed modulation primers to ensure the sequence meets constraints, while simultaneously constructing a marker sequence to recover the correct original information during decoding. Furthermore, the structured addressing sequence design method in this invention combines modulation primers, index sequences, and primer sequences, preserving their addressing function while also serving as PCR primers, significantly reducing information redundancy. Compared to existing technologies, the encoding and decoding scheme in this invention can encode and decode any digital information stored in a computer, achieving an encoding density close to the theoretical limit, reaching 1.9 bits / nt. Simultaneously, the structured addressing sequence design method can generate a specific number of addressing sequences matching the base sequence size, satisfying the needs of multiple functions such as modulation, addressing, PCR amplification, and information retrieval.

[0045] Using the modulation-based DNA information encoding and decoding method and the structured addressing sequence design method of this invention, computer digital information can be encoded and decoded, and converted into DNA base sequences that meet the needs of synthesis and sequencing with high-density coding.

[0046] In a preferred embodiment, by changing primers and screening DNA strand quality, digital information can be encoded with a very high coding density without adding excessive redundancy. Simultaneously, for subsequent PCR-based information retrieval operations, a structured addressing sequence design method is used to generate a large number of specific addressing sequences. This reduces information redundancy on the DNA strand while providing addressing and PCR primer functions, simplifying the PCR primer generation process.

Description of the Drawings

[0047] Figure 1 is the encoding flowchart of an embodiment of the present invention.

[0048] Figure 2 illustrates the quaternary encoding rule of an embodiment of the present invention.

[0049] Figure 3 is a table showing the modulation results of an embodiment of the present invention.

[0050] Figure 4 is a schematic diagram of the modulation process of an embodiment of the present invention.

[0051] Figure 5 is the modulation primer table of an embodiment of the present invention.

[0052] Figure 6 illustrates the generation of a flag sequence in an embodiment of the present invention.

[0053] Figure 7 is a schematic diagram of the pre-addressing sequence structure of an embodiment of the present invention.

[0054] Figure 8 is a schematic diagram of the post-addressing sequence structure of an embodiment of the present invention.

[0055] Figure 9 is an addressing sequence table of an embodiment of the present invention.

[0056] Figure 10 is a DNA strand structure diagram of an embodiment of the present invention.

[0057] Figure 11 is a DNA strand base sequence table of an embodiment of the present invention, obtained by converting a TXT text file containing Chapter 1 of the Tao Te Ching into DNA strand base sequences.

[0058] Figure 12 is a decoding flowchart of an embodiment of the present invention.

Detailed Implementation

[0059] The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope and application of the present invention.

[0060] It should be noted that when a component is referred to as "fixed to" or "set on" another component, it can be directly on or indirectly on that other component. When a component is referred to as "connected to" another component, it can be directly connected to or indirectly connected to that other component. Furthermore, a connection can be used for fixing, coupling, or communication.

[0061] It should be understood that the terms "length", "width", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the present invention.

[0062] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of the present invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0063] Referring to Figure 1, this invention provides a DNA information encoding method based on modulation, comprising the following steps:

[0064] S1. Information Blocking: The file information to be encoded is divided into multiple information blocks, and an index number is assigned to each information block.

[0065] S2. Information Verification: Add verification information to each information block;

[0066] S3. Sequence Conversion: Convert the byte stream data in each information block into binary information, and then into a base sequence;

[0067] S4. Modulation optimization: Construct modulation primers, determine whether the base sequence meets the constraints, and optimize the base sequence that does not meet the constraints using the modulation primers.

[0068] S5. Generate a flag sequence: Generate a unique flag sequence for each information block based on the modulation process information;

[0069] S6. Generate addressing sequence: Generate the addressing sequence for each information block based on the modulation primer and index number of each information block;

[0070] S7. Output: Connect each information block with its marker sequence and addressing sequence to assemble a complete DNA chain, and output it to a text file for storage.

[0071] Referring to Figure 12, this invention also provides a DNA information decoding method based on modulation, used to decode a DNA strand encoded using the DNA information encoding method described above, comprising the following steps:

[0072] T1: Addressing Sequence Reconstruction: Read the addressing sequence of the DNA strand and reconstruct the modulation primers and index numbers;

[0073] T2: Demodulation: Decode the marker sequence, and restore the base sequence to an unmodulated sequence based on the marker sequence information and modulation primers;

[0074] T3: Sequence Conversion: Convert the unmodulated sequence into a binary sequence and then restore it to byte stream data;

[0075] T4: Information Verification: Verify and correct stored information based on verification algorithms and redundant information;

[0076] T5: Information Reassembly: Restore the correct stored information to the original information based on the index number.
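
Steps T2 and T3 can be illustrated with a minimal Python sketch. It covers only whole-sequence demodulation with a single primer and assumes the quaternary rule 00→A, 01→T, 10→C, 11→G; the function names `demodulate` and `bases_to_bytes` are hypothetical and do not come from the patent. Because XOR is its own inverse, demodulation applies the same operation as modulation.

```python
# Hypothetical helper names; mapping follows the quaternary rule in the text.
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def demodulate(seq: str, primer: str) -> str:
    """Undo whole-sequence XOR modulation by XORing with the same primer."""
    out = []
    for i, base in enumerate(seq):
        primer_base = primer[i % len(primer)]  # primer repeats per group
        value = int(BASE_TO_BITS[base], 2) ^ int(BASE_TO_BITS[primer_base], 2)
        out.append(BITS_TO_BASE[f"{value:02b}"])
    return "".join(out)


def bases_to_bytes(seq: str) -> bytes:
    """Convert an unmodulated base sequence back to byte stream data."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    if len(bits) % 8:
        raise ValueError("base sequence does not hold a whole number of bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


recovered = bases_to_bytes(demodulate("TCCG", "ACTG"))
```

Locally modulated groups would additionally need the positions recorded in the marker sequence, which this sketch leaves out.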

[0077] In some embodiments, a DNA information encoding method based on modulation includes the following specific steps:

[0078] Step 1: Information segmentation, read the file, and segment the file data into blocks according to the index order;

[0079] Step 2: Information verification, adding a certain amount of logical redundancy information using the RS verification algorithm;

[0080] Step 3: Sequence conversion. First, convert the byte stream data into a binary sequence such as "01001100", and then into a base sequence such as "TAGA".

[0081] Step 4: Modulation optimization. Construct modulation primers, determine whether the base sequence meets the constraints, and optimize the base sequence that does not meet the constraints using modulation primers.

[0082] Step 5: Generate a flag sequence. Based on the modulation optimization process, record the modulation process information and generate a flag sequence.

[0083] Step 6: Generate addressing sequences. Based on the modulation primers and index numbers of each information block, design the addressing sequence for each information block using structured primer design methods.

[0084] Step 7: Output. Ligate the base sequence with the marker sequence and the preceding and following PCR primers to assemble a complete DNA strand, and output it to a save file for storage.

[0085] In step 1 (information segmentation), the computer file to be stored is read in binary form and converted into byte stream data.

[0086] In step 1, because of limitations in DNA synthesis methods, the length of a single DNA strand is generally around 200-300 nt. The read data therefore needs to be segmented into information blocks, each of a size that a single DNA strand can store.

[0087] In step 1, an index number is also assigned to each information block according to the block order. The index number format, such as 8-bit or 16-bit binary information, can be chosen according to design requirements; an n-bit index number can distinguish at most 2^n information blocks. To store a large amount of content, the index sequence length can be extended so that more information blocks can be indexed.
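
The segmentation and indexing of step 1 might look like the following sketch. The function name `split_into_blocks`, the `block_size` parameter, and the choice of zero-based index numbers are illustrative assumptions; the patent does not fix a block size or an index origin.

```python
from typing import List, Tuple


def split_into_blocks(data: bytes, block_size: int,
                      index_bits: int = 16) -> List[Tuple[int, bytes]]:
    """Split a byte stream into blocks and pair each with an index number.

    block_size is the payload (in bytes) that one DNA strand is assumed
    to carry; index_bits is the width of the index number.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if len(blocks) > 2 ** index_bits:
        raise ValueError("too many blocks for the chosen index width")
    return list(enumerate(blocks))  # index numbers follow block order


blocks = split_into_blocks(b"hello DNA storage", block_size=5, index_bits=8)
```

Joining the payloads in index order reproduces the original byte stream, which is what the reassembly step of decoding relies on.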

[0088] In step 2, information verification adds redundant information to each information block using the RS (Reed-Solomon) check algorithm. The number of byte errors that can be corrected is half the number of redundant bytes added. The RS check algorithm is mainly used to correct substitution errors arising in the storage process; insertion and deletion errors can also be converted into substitution errors for correction. Alternatively, an LDPC check algorithm, a cyclic redundancy check algorithm, or other verification methods can be used in this step. They differ only in the amount of redundant information added and in error correction capability, and all serve the same information verification function.
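
The stated relationship between redundancy and correction capability can be written compactly. The symbols are notation introduced here, not taken from the patent: k is the number of payload bytes in a block, n is the block length after the RS redundancy is added, and t is the number of erroneous bytes that can be corrected.

```latex
t = \left\lfloor \frac{n - k}{2} \right\rfloor
```

For example, adding 4 redundant bytes to a block (n - k = 4) allows up to 2 erroneous bytes in that block to be corrected.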

[0089] In step 2, the redundant information added by the RS verification algorithm is still byte stream data. Each information block is combined with the added verification information to form the information block before encoding.

[0090] In step 3 (sequence conversion), the information blocks that have completed segmentation and verification still exist as byte stream data. Through radix conversion, each byte is converted to binary and padded to a uniform 8-bit form, yielding a binary sequence.

[0091] In step 3, the binary information block is then converted into a base sequence using a quaternary encoding rule, in which "00", "01", "10", and "11" correspond to "A", "T", "C", and "G" respectively.
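
As a minimal sketch of step 3, the conversion from bytes to bits to bases under the stated rule (00→A, 01→T, 10→C, 11→G) could be written as follows. The function names are hypothetical.

```python
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}


def bytes_to_bits(data: bytes) -> str:
    """Render each byte as exactly 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bases(bits: str) -> str:
    """Map every pair of bits to one base."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bytes_to_bases(data: bytes) -> str:
    return bits_to_bases(bytes_to_bits(data))


bases = bytes_to_bases(b"L")  # the byte 0x4C is "01001100" in binary
```

Each byte becomes four bases, which is where the 2 bits per nucleotide upper bound of plain quaternary encoding comes from.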

[0092] In step 4, the modulation primer used for modulation optimization is a 4-base sequence with 50% GC content and no homopolymer, generated by random combination followed by screening. The length of the modulation primers can vary depending on the modulation process design.
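
The screening of candidate modulation primers could be sketched as below. Two assumptions are made here that the patent does not state: "no homopolymer" is interpreted as "no two adjacent identical bases", and candidates are enumerated exhaustively rather than drawn at random, which keeps the example deterministic. The function names are hypothetical.

```python
from itertools import product
from typing import List


def is_valid_primer(primer: str) -> bool:
    """Check 50% GC content and (assumed) absence of adjacent repeats."""
    gc_count = sum(base in "GC" for base in primer)
    no_adjacent_repeat = all(a != b for a, b in zip(primer, primer[1:]))
    return 2 * gc_count == len(primer) and no_adjacent_repeat


def candidate_primers(length: int = 4) -> List[str]:
    """Enumerate all sequences of the given length that pass screening."""
    all_sequences = ("".join(p) for p in product("ATCG", repeat=length))
    return [seq for seq in all_sequences if is_valid_primer(seq)]


primers = candidate_primers()
```

A random generator followed by the same `is_valid_primer` filter would match the wording of the text more closely.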

[0093] In step 4, modulation optimization requires generating two modulation primers for each information block. Based on their function, these primers are divided into GC modulation primers and homopolymer modulation primers.

[0094] In step 4, modulation optimization employs different modulation strategies for different base sequences, specifically in three cases: 1. If the base sequence meets the constraints, modulation is not required; 2. If there are issues with the GC content or homopolymer of the base sequence, the corresponding modulation primer is used for modulation; 3. If both the GC content and homopolymer of the base sequence are problematic, both modulation primers are used simultaneously.

[0095] In step 4, modulation optimization is a 2-bit XOR operation. The base sequence is divided into groups of four bases, and each group is aligned with a modulation primer. Each base is then XORed with the primer base at the same position, and the result is the modulation result. For example, modulating the base sequence "AAAACTAG…" with the homopolymer modulation primer "ACTG" yields "ACTGCGTA…". Alternatively, other group sizes, such as groups of eight bases, can be used during modulation optimization. This only requires adjusting the length of the modulation primers accordingly; the modulation process itself still uses the 2-bit XOR operation to obtain the modulated sequence.
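
The 2-bit XOR in the example above can be reproduced with a short sketch, assuming the quaternary rule A=00, T=01, C=10, G=11. The function name `modulate` is hypothetical.

```python
BASE_TO_BITS = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def modulate(seq: str, primer: str) -> str:
    """XOR every base with the primer base at the same offset in its group."""
    out = []
    for i, base in enumerate(seq):
        primer_base = primer[i % len(primer)]  # primer repeats for each group
        out.append(BITS_TO_BASE[BASE_TO_BITS[base] ^ BASE_TO_BITS[primer_base]])
    return "".join(out)


modulated = modulate("AAAACTAG", "ACTG")
```

Applying `modulate` twice with the same primer returns the original sequence, which is why the decoder only needs the primer and the record of where modulation was applied.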

[0096] In step 4, modulation optimization involves directly applying the above modulation process to the entire sequence for sequences whose GC content does not meet the requirements; this is a holistic modulation.

[0097] In step 4, modulation optimization involves locating the position containing the homopolymer in a sequence with excessively long homopolymers and modulating only the homopolymer at that position, which is local modulation.
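
Local modulation might be sketched as follows. The sketch assumes groups of four bases numbered from 1 (matching the "4th group" example used for the marker sequence), treats a run of four or more identical bases as a homopolymer, and modulates every group that such a run overlaps; the patent does not spell out how runs that straddle a group boundary are handled. All names are hypothetical.

```python
import re
from typing import List, Tuple

BASE_TO_BITS = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def xor_group(group: str, primer: str) -> str:
    """2-bit XOR of one group with the primer (zip tolerates a short tail)."""
    return "".join(BITS_TO_BASE[BASE_TO_BITS[b] ^ BASE_TO_BITS[p]]
                   for b, p in zip(group, primer))


def homopolymer_groups(seq: str, group_len: int = 4, min_run: int = 4) -> List[int]:
    """Return 1-based numbers of the groups overlapped by a long run."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    hit = set()
    for match in pattern.finditer(seq):
        first = match.start() // group_len
        last = (match.end() - 1) // group_len
        hit.update(range(first + 1, last + 2))
    return sorted(hit)


def modulate_locally(seq: str, primer: str) -> Tuple[str, List[int]]:
    """Modulate only the affected groups and report their positions."""
    group_len = len(primer)
    positions = homopolymer_groups(seq, group_len)
    groups = [seq[i:i + group_len] for i in range(0, len(seq), group_len)]
    for pos in positions:
        groups[pos - 1] = xor_group(groups[pos - 1], primer)
    return "".join(groups), positions


result, positions = modulate_locally("AAAACTAG", "ACTG")
```

Modulating a group can in principle create a new run at a group boundary, so the result still has to be checked against the constraints afterwards.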

[0098] In step 4, after modulation optimization, the quality of the optimized base sequence is evaluated to determine whether it meets the constraints. If it does not, new modulation primers are generated and modulation optimization is performed again until the constraints are met. Once the constraints are met, the modulation result with the shortest marker sequence is selected and saved as the constraint sequence, together with the two modulation primers used.
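
The quality evaluation can be sketched as a simple predicate. The thresholds are the ones given in the background section of this document (GC content between 40% and 60%, no base repeated four or more times in a row); the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated base."""
    best = run = 0
    previous = None
    for base in seq:
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


def meets_constraints(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                      max_run: int = 3) -> bool:
    """True when both the GC content and homopolymer constraints hold."""
    return (bool(seq)
            and gc_min <= gc_content(seq) <= gc_max
            and longest_run(seq) <= max_run)


ok_before = meets_constraints("AAAACTAG")
ok_after = meets_constraints("ACTGCGTA")
```

In an encoder loop, this predicate would decide whether the current pair of modulation primers is kept or a new pair is tried.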

[0099] Step 5 generates a flag sequence. The modulation optimization process in step 4 requires generating a flag sequence to record modulation process information.

[0100] Step 5 generates a marker sequence. The modulation process information recorded in the marker sequence mainly includes three parts: 1. Whether GC primers are used for overall modulation; 2. The specific location of homopolymer modulation; 3. The number of times homopolymer modulation is performed.

[0101] Step 5 generates a marker sequence, using 1 bit of binary information to indicate whether GC primers were used for overall modulation; "0" represents not used, and "1" represents used. Homopolymer modulation information is directly represented by its position in 8 bits. For example, if homopolymer modulation occurred at the 4th group of 4 bases, the position information would be "00000100". The number of homopolymer modulations is converted into 7 bits of binary information and stored, combined with the preceding 1 bit of GC modulation information to form 8 bits of binary information. During the generation of the marker sequence, the structure of each part can be adjusted according to actual usage, such as changing the length of the homopolymer modulation information or the length of the number of homopolymer modulations.
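
The bit layout described above can be sketched as follows. The ordering is an assumption: one header byte (1 GC bit followed by the 7-bit modulation count) and then one 8-bit position per homopolymer modulation. The patent's Figure 6 may arrange the fields differently, and the function names are hypothetical.

```python
from typing import List, Tuple


def build_flag_bits(gc_modulated: bool, positions: List[int]) -> str:
    """Pack the modulation record into a bit string (assumed field order)."""
    if len(positions) > 127:
        raise ValueError("modulation count does not fit in 7 bits")
    if any(not 0 <= pos <= 255 for pos in positions):
        raise ValueError("position does not fit in 8 bits")
    header = ("1" if gc_modulated else "0") + f"{len(positions):07b}"
    return header + "".join(f"{pos:08b}" for pos in positions)


def parse_flag_bits(bits: str) -> Tuple[bool, List[int]]:
    """Recover the GC flag and the list of modulated group positions."""
    gc_modulated = bits[0] == "1"
    count = int(bits[1:8], 2)
    positions = [int(bits[8 + 8 * i:16 + 8 * i], 2) for i in range(count)]
    return gc_modulated, positions


flag = build_flag_bits(True, [4])
```

The resulting bit string is then converted to bases and modulated like any other sequence, as the next paragraph describes.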

[0102] In step 5, the marker sequence, which at this point consists entirely of binary information, is converted into a base sequence using the quaternary encoding rule. This base sequence is then modulated using the GC modulation primer or the homopolymer modulation primer to obtain the final marker sequence.

[0103] Step 6 generates the addressing sequence. After the previous five steps, in addition to the constraint sequence and the flag sequence, the index number of the information block and the two modulation primers are also very important. Now, the index number and the modulation primers are incorporated into the addressing sequence, and the addressing sequence is generated and saved using the structured addressing sequence design method.

[0104] Step 6 generates the addressing sequence. The structured addressing sequence design method divides the addressing sequence into four parts: marker, balancing sequence, modulation primer, and index number. The marker is a 4-base homopolymer ("AAAA", "TTTT", "CCCC", or "GGGG") used to ensure the specificity of the addressing sequence. The marker can also be adjusted according to actual usage; for example, a 5-base homopolymer can be used. The balancing sequence is a randomly generated multi-base sequence used to ensure the GC content balance of the entire PCR primer. Its specific length is determined based on the overall length of the PCR primer. The modulation primer and index number are information saved from previous steps. Combining these four parts yields the addressing sequence.

[0105] Step 6 generates addressing sequences. One addressing sequence is placed at each end of each DNA strand, referred to as the pre-addressing sequence and the post-addressing sequence, respectively. The pre-addressing and post-addressing sequences are generated in the same way, using the structured addressing sequence generation method described above. However, there are differences: the pre-addressing sequence stores GC modulation primers, while the post-addressing sequence stores homopolymer modulation primers. Furthermore, the structure between the two can be adjusted according to specific requirements.

[0106] Step 7, output, combines the modulated constraint sequence with the marker sequence and the addressing sequences to form a complete DNA strand that stores the information. All encoded DNA strand sequences are then written to a file.

[0107] In the modulation-based DNA encoding method proposed in this invention, the stored information is first converted into a base sequence using a quaternary encoding rule. Then, a modulation primer sequence that meets the requirements is constructed. The converted base sequence is then modulated and optimized using the modulation primer until it meets the constraints. Simultaneously, the modulation primer is saved, a marker sequence is generated, and information segmentation and verification are completed. This encoding method can convert the original information into a base sequence that meets the constraints with a very high coding density, and simultaneously generate corresponding addressing sequences to meet the needs of subsequent operations such as PCR amplification.

[0108] In other embodiments, a DNA information decoding method based on modulation includes the following steps:

[0109] Step 1: Addressing sequence reconstruction. Read the DNA strand base sequence, separate the addressing sequences from it, and obtain the modulation primers and index number.

[0110] Step 2: Modulation restoration. First, read the marker sequence to obtain the modulation process information, and then decode the base sequence into an unmodulated sequence.

[0111] Step 3: Sequence conversion. The decoded information block is first converted from a base sequence to a binary sequence, and then restored to byte stream data.

[0112] Step 4: Information verification. Based on the RS verification algorithm, the stored information and redundant information are verified and corrected.

[0113] Step 5: Information Reorganization. Sort the information blocks according to the index number to restore the original information.

[0114] In the addressing sequence reconstruction, after the base sequence of the DNA strand is read, the base sequences at its two ends are the pre-addressing and post-addressing sequences. They are extracted from each DNA strand according to the addressing sequence length specified in the design.

[0115] In the addressing sequence reconstruction, based on the structured addressing sequence design method, the index number of each DNA strand and two modulation primers can be obtained from the addressing sequence.
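
As a hedged illustration of this disassembly, the sketch below slices a 20 nt addressing sequence into its four fields. The field lengths (4, 8, 4, 4 nt) are those of the embodiment; the field order, the `Address` type, and the function `parse_address` are assumptions made for illustration, since the actual order is defined in the structure diagrams and differs between the pre- and post-addressing sequences.

```python
from typing import NamedTuple


class Address(NamedTuple):
    marker: str   # homopolymer tag, e.g. "AAAA"
    balance: str  # GC balancing sequence
    primer: str   # modulation primer
    index: str    # index number, as bases


# Assumed field order and lengths in nt; a real design would define one
# layout for the pre-addressing sequence and one for the post-addressing one.
LAYOUT = (("marker", 4), ("balance", 8), ("primer", 4), ("index", 4))


def parse_address(seq: str) -> Address:
    """Slice an addressing sequence into its four fields."""
    expected = sum(n for _, n in LAYOUT)
    if len(seq) != expected:
        raise ValueError(f"expected {expected} nt, got {len(seq)}")
    fields, pos = {}, 0
    for name, n in LAYOUT:
        fields[name] = seq[pos:pos + n]
        pos += n
    return Address(**fields)


addr = parse_address("AAAA" + "TCGATCGA" + "ATCG" + "AAAT")
print(addr.primer, addr.index)  # ATCG AAAT
```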

[0116] In the modulation and restoration process, the DNA base sequence after removing the addressing sequence is divided into two parts: an information sequence and a marker sequence.

[0117] In the modulation restoration process, the marker sequence is first restored using GC modulation primers, and then restored to binary information to obtain modulation process information such as homopolymer modulation position information.
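
The marker restoration can be sketched as follows, under the same assumed layout as in encoding: a header byte (1-bit GC flag plus 7-bit count) followed by one 8-bit position per homopolymer modulation. The names `modulate` and `parse_marker` are hypothetical, and the sketch assumes the marker was modulated with the GC modulation primer only.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}  # 00, 01, 10, 11
BASES = "ATCG"


def modulate(seq: str, primer: str) -> str:
    """XOR each base with the primer base at the same offset (primer repeats)."""
    return "".join(
        BASES[CODE[b] ^ CODE[primer[i % len(primer)]]] for i, b in enumerate(seq)
    )


def parse_marker(marker: str, gc_primer: str):
    """Return (gc_used, positions) from a GC-primer-modulated marker."""
    plain = modulate(marker, gc_primer)  # XOR is its own inverse
    bits = "".join(format(CODE[b], "02b") for b in plain)
    gc_used = bits[0] == "1"
    count = int(bits[1:8], 2)
    positions = [int(bits[8 + 8 * k:16 + 8 * k], 2) for k in range(count)]
    return gc_used, positions


print(parse_marker("ATCG", "ATCG"))      # (False, [])
print(parse_marker("CTCCATGG", "ATCG"))  # (True, [4])
```

In the first call the marker equals the primer itself, which is what an all-zero marker ("AAAA") becomes after modulation.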

[0118] In the modulation restoration process, based on the modulation process information obtained from the marker sequence and the two modulation primers, the information sequence is re-modulated at the same position to restore the information sequence to the form of an unmodulated base sequence.
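
Because modulation is a per-base XOR, applying the same primer at the same position a second time restores the original bases. A minimal sketch of this restoration is given below; the function names and the 1-based numbering of 4-base groups (consistent with the position examples in the description) are assumptions of this illustration.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}  # 00, 01, 10, 11
BASES = "ATCG"


def modulate(seq: str, primer: str) -> str:
    """XOR each base with the primer base at the same offset (primer repeats)."""
    return "".join(
        BASES[CODE[b] ^ CODE[primer[i % len(primer)]]] for i, b in enumerate(seq)
    )


def restore(seq, gc_used, positions, gc_primer, hp_primer, group=4):
    """Undo modulation by applying the same XOR again at the same places."""
    if gc_used:
        seq = modulate(seq, gc_primer)  # overall GC modulation
    bases = list(seq)
    for p in positions:                 # 1-based group numbers (assumed)
        start = (p - 1) * group
        bases[start:start + group] = modulate(seq[start:start + group], hp_primer)
    return "".join(bases)


# "AAAAAGCC" had its first group modulated with "TCGA" during encoding.
print(restore("TCGAAGCC", False, [1], "ATCG", "TCGA"))  # AAAAAGCC
```

Since XOR is commutative, the order in which the overall and local modulations are undone does not affect the result.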

[0119] In the sequence conversion, the base sequence after modulation and restoration is converted into a binary sequence by using a quaternary encoding rule in reverse.

[0120] In the sequence conversion, the binary sequence is further converted into byte stream data through number system conversion.

[0121] In the information verification, the byte stream data is divided into two parts: stored information and verification redundancy information. The RS check algorithm is used to detect and correct errors in the stored information. Other verification methods, such as LDPC and cyclic redundancy check, can also be used. The verification algorithm is selected to match the one used during encoding.
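
The two conversion steps can be sketched as below, using the quaternary rule of the description in reverse. The function name `bases_to_bytes` is hypothetical.

```python
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}


def bases_to_bytes(seq: str) -> bytes:
    """Convert bases to a bit string, then pack every 8 bits into one byte."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4 nt")
    bits = "".join(BASE_TO_BITS[b] for b in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(bases_to_bytes("TATA"))  # b'D'  (01000100 = 0x44)
```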

[0122] In the information reconstruction process, based on the index number obtained from the address sequence decomposition, sorting algorithms such as bubble sort and binary sort are used to combine the corrected stored information together to restore the correct and complete original information, which is then output as a corresponding computer file for storage.
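
A minimal sketch of the reorganization step follows. It uses Python's built-in `sorted` rather than the bubble sort named above, which gives the same ordering; the function name `reassemble` and the assumption that index numbers run contiguously from 1 are illustrative only.

```python
def reassemble(blocks: dict[int, bytes]) -> bytes:
    """Concatenate corrected payloads in ascending index order."""
    # Assumed convention: indices are 1, 2, ..., N with none missing.
    missing = set(range(1, len(blocks) + 1)) - set(blocks)
    if missing:
        raise ValueError(f"missing information blocks: {sorted(missing)}")
    return b"".join(blocks[i] for i in sorted(blocks))


# Blocks may arrive in any order after sequencing.
print(reassemble({2: b"lo", 1: b"hel"}))  # b'hello'
```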

[0123] The following describes specific embodiments of the present invention.

[0124] For example, in this embodiment, the content of Chapter 1 of the Tao Te Ching is used as the encoding object, which is: "01. The Tao that can be told is not the eternal Tao. The name that can be named is not the eternal name. Nameless, it is the origin of heaven and earth. Named, it is the mother of all things. Therefore, one should always be without desire in order to observe its mystery. One should always have desire in order to observe its manifestations. These two are the same in origin but different in name, and are both called profound. Profound and yet more profound, the gateway to all mysteries." A total of 216 bytes of data.

[0125] Figure 1 shows the flowchart of the encoding scheme, using the content of Chapter 1 of the *Tao Te Ching* as the encoding object. First, the TXT text file containing the content of Chapter 1 of the *Tao Te Ching* is read. It is then divided into blocks according to the storage design and indexed. Verification is applied to each information block, and verification redundancy is added. Each information block with added redundancy is first converted to a binary sequence and then to a base sequence using a quaternary encoding rule. Modulation primers are randomly generated, and the quality of each base sequence is evaluated. Base sequences that do not meet the GC content requirements or contain excessively long homopolymers are modulated using the corresponding modulation primers until they meet the constraints. Simultaneously, a marker sequence is generated based on the modulation process. Finally, the structured addressing sequence design method is used to generate the pre- and post-addressing sequences for each information block, which are then combined into a complete DNA strand structure and output to a TXT text file for storage.

[0126] In the above encoding process, the selected information block size is 60 bytes. This is because the length of a single DNA strand is 200-300 nt. By selecting an information block of 60 bytes, the final encoded DNA strand length is about 308 nt, which can maximize the storage capacity of a single DNA strand and reduce the overhead of index numbers.

[0127] In the above encoding process, a total of 216 bytes of data are used. Based on the designed information block size, this data can be divided into four information blocks, three of which are 60 bytes each and the last one is 36 bytes. In this embodiment, the four information blocks are assigned index numbers "1, 2, 3, 4" according to the block division order. These four index numbers are then converted into base sequences according to the sequence conversion steps, resulting in four 4nt index numbers.
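
The block division described above can be sketched as follows. The function name `split_blocks` is hypothetical, and the zero-filled input only stands in for the 216 bytes of text.

```python
def split_blocks(data: bytes, block_size: int = 60) -> list[tuple[int, bytes]]:
    """Split data into fixed-size blocks and number them from 1."""
    return [
        (n + 1, data[i:i + block_size])
        for n, i in enumerate(range(0, len(data), block_size))
    ]


blocks = split_blocks(bytes(216))
print([(idx, len(b)) for idx, b in blocks])
# [(1, 60), (2, 60), (3, 60), (4, 36)]
```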

[0128] In this embodiment, the RS check algorithm is used for information verification. Six check bytes are appended to each information block, which is 10% check redundancy for a 60-byte block. After the check redundancy is added, each information block is therefore 66 bytes, except the last, which is 42 bytes. This allows up to 3 erroneous bytes within each block to be corrected.
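
These figures follow from the standard property of Reed-Solomon codes, assuming byte-sized symbols: a code with k data symbols and n - k check symbols corrects up to t symbol errors at unknown positions. The symbols n, k, and t are introduced here only for illustration.

```latex
\[
t = \left\lfloor \frac{n - k}{2} \right\rfloor = \left\lfloor \frac{6}{2} \right\rfloor = 3
\]
\[
k = 60:\quad n = 66,\quad \frac{n - k}{k} = \frac{6}{60} = 10\%
\qquad
k = 36:\quad n = 42,\quad \frac{n - k}{k} = \frac{6}{36} \approx 16.7\%
\]
```

The 10% redundancy therefore applies to the full 60-byte blocks; the shorter last block carries proportionally more redundancy.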

[0129] In the above encoding process, the first step of sequence conversion is number system conversion. Using the correspondence between hexadecimal and binary, each byte of the byte stream data is converted into an 8-bit binary sequence such as "01000100". Then, using a quaternary-like encoding rule, "00", "01", "10", and "11" are mapped to "A", "T", "C", and "G", respectively. As detailed in Figure 2, the information blocks are converted into base sequences, and the lengths of the four base sequences are 264 nt, 264 nt, 264 nt, and 168 nt, respectively.
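
A minimal sketch of this conversion is given below; the function name `bytes_to_bases` is hypothetical. Each byte yields 4 nt, so a 66-byte block yields 264 nt and a 42-byte block yields 168 nt.

```python
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}


def bytes_to_bases(data: bytes) -> str:
    """Write each byte as 8 bits, then map each pair of bits to one base."""
    bits = "".join(format(byte, "08b") for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(bytes_to_bases(b"\x44"))          # TATA  (01 00 01 00)
print(len(bytes_to_bases(bytes(66))))   # 264
print(len(bytes_to_bases(bytes(42))))   # 168
```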

[0130] In the above encoding process, modulation primers are generated randomly. However, each generated modulation primer must have 50% GC content and no long homopolymers, so that the primer itself fully satisfies the constraints.
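
One way to realize such randomized generation is rejection sampling, sketched below. The function names are hypothetical, and "no homopolymer" is interpreted here, as an assumption, to mean that no two adjacent bases of the 4-base primer are identical.

```python
import random

BASES = "ATCG"


def is_valid_primer(p: str) -> bool:
    """4 bases, exactly 50% GC, and no two adjacent identical bases (assumed rule)."""
    gc = sum(b in "GC" for b in p)
    no_repeat = all(a != b for a, b in zip(p, p[1:]))
    return len(p) == 4 and gc == 2 and no_repeat


def random_primer(rng: random.Random) -> str:
    """Draw random 4-base sequences until one satisfies the constraints."""
    while True:
        p = "".join(rng.choice(BASES) for _ in range(4))
        if is_valid_primer(p):
            return p


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(is_valid_primer(random_primer(rng)))  # True
```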

[0131] In the above encoding process, modulation is a two-bit XOR operation on each base. The result can be obtained either by performing the XOR directly or by looking it up in the modulation result table shown in Figure 3. For example, if the stored information is "AGCC" and the modulation primer is "ATCG", the first base of the information, "A", is modulated with the first base of the primer, "A". "00" XOR "00" is "00", which is "A". The modulation result table gives the same result.
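
The XOR modulation of this example can be sketched as follows, using the mapping A=00, T=01, C=10, G=11 from the description. The function name `modulate` is hypothetical.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}  # 00, 01, 10, 11
BASES = "ATCG"                            # inverse mapping by position


def modulate(seq: str, primer: str) -> str:
    """XOR each base with the primer base at the same offset (primer repeats)."""
    return "".join(
        BASES[CODE[b] ^ CODE[primer[i % len(primer)]]] for i, b in enumerate(seq)
    )


print(modulate("AGCC", "ATCG"))  # ACAT
print(modulate("ACAT", "ATCG"))  # AGCC  (modulating again restores the input)
```

Working through the remaining bases by hand: G (11) XOR T (01) = 10 (C), C (10) XOR C (10) = 00 (A), and C (10) XOR G (11) = 01 (T).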

[0132] In the above encoding process, the modulation optimization process is divided into two steps, as shown in the schematic diagram of the modulation process in Figure 4. First, the GC content of the base sequence is checked against the requirements. If it does not meet them, the GC modulation primer is used to modulate the whole sequence. Then, the base sequence is screened for homopolymers. If homopolymers are found, the homopolymer modulation primer is used to modulate the positions where they appear.

[0133] In the above encoding process, modulation optimization is a multi-cycle process. Different modulation primers are tried in turn until the base sequence meets the constraints. Among the results, the one with the fewest homopolymer modulations, that is, the shortest marker sequence, is selected as the final constraint sequence. The modulation primers used in the embodiment are shown in Figure 5.
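
The two-step optimization and the selection among primer candidates can be sketched as below. This is an illustration under stated assumptions, not the procedure of the embodiment: the GC range (40% to 60%), the maximum allowed run length (3), the rule of modulating every 4-base group that overlaps a long run, and all function names are hypothetical.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}
BASES = "ATCG"
GC_RANGE = (0.4, 0.6)  # assumed thresholds
MAX_RUN = 3            # assumed longest allowed homopolymer run
GROUP = 4


def modulate(seq, primer):
    return "".join(
        BASES[CODE[b] ^ CODE[primer[i % len(primer)]]] for i, b in enumerate(seq)
    )


def gc_ok(seq):
    gc = sum(b in "GC" for b in seq) / len(seq)
    return GC_RANGE[0] <= gc <= GC_RANGE[1]


def runs_too_long(seq):
    """Yield (start, end) of each run longer than MAX_RUN."""
    start = 0
    for i in range(1, len(seq) + 1):
        if i == len(seq) or seq[i] != seq[start]:
            if i - start > MAX_RUN:
                yield start, i
            start = i


def optimize(seq, gc_primer, hp_primer):
    """Return (sequence, gc_used, positions), or None if constraints are not met."""
    gc_used = not gc_ok(seq)
    if gc_used:                                   # step 1: overall GC modulation
        seq = modulate(seq, gc_primer)
    bases, positions = list(seq), []
    for start, end in runs_too_long(seq):         # step 2: local modulation
        for g in range(start // GROUP, (end - 1) // GROUP + 1):
            if g + 1 not in positions:
                lo = g * GROUP
                bases[lo:lo + GROUP] = modulate(seq[lo:lo + GROUP], hp_primer)
                positions.append(g + 1)           # 1-based group number
    result = "".join(bases)
    if gc_ok(result) and not list(runs_too_long(result)):
        return result, gc_used, positions
    return None


def best_of(seq, primer_pairs):
    """Try each (gc_primer, hp_primer) pair; keep the fewest homopolymer modulations."""
    results = []
    for pair in primer_pairs:
        r = optimize(seq, *pair)
        if r is not None:
            results.append((len(r[2]), r, pair))
    return min(results, key=lambda x: x[0], default=None)


print(optimize("AAAACGTGTCTC", "ATCG", "TCGA"))
# ('TCGACGTGTCTC', False, [1])
```

A primer pair for which `optimize` returns `None` is simply discarded, which corresponds to changing primers and trying again.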

[0134] In the above encoding process, the marker sequence is generated as shown in Figure 6. First, 1 bit of binary information indicates whether GC primers were used for overall modulation; "0" means not used, and "1" means used. Each homopolymer modulation is recorded directly by its position as an 8-bit binary number. For example, if homopolymer modulation occurred on the first 4-base group, the position information is "00000001". The number of homopolymer modulations is stored as 7 bits of binary information and combined with the preceding 1 bit of GC modulation information to form 8 bits of binary information. Then, the marker sequence, which consists entirely of binary information, is converted into a base sequence using the quaternary-like encoding rule. Finally, the base sequence is modulated in one step using the GC modulation primer to obtain the final marker sequence. Figure 6 illustrates the generation of a marker sequence for a case without GC modulation but with two homopolymer modulations.

[0135] In the above encoding process, the modulation optimization process has already selected the optimal modulation primers. In the example, with the four selected sets of modulation primers, no homopolymer modulation was needed for any of the four information blocks, so each marker sequence has the shortest possible length of 4 nt. In addition, the first three base sequences were not subjected to GC content modulation; only the last one was.

[0136] In this embodiment, the designed addressing sequence length is 20 nt: a 4 nt marker, an 8 nt GC balancing sequence, a 4 nt modulation primer, and a 4 nt index number. The marker is a randomly chosen homopolymer of length 4, and the GC balancing sequence is also randomly generated so that the GC content of the entire addressing sequence is within the required range. The modulation primer is the one used during the modulation process, and the index number is the base sequence obtained by converting the index assigned during information segmentation and modulating it with the modulation primer.
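
The assembly of one 20 nt addressing sequence can be sketched as follows. The component order, the GC range of 40% to 60%, the retry limit, and the names `gc_fraction` and `build_address` are assumptions of this illustration; the sketch also does not check the balancing sequence for homopolymers.

```python
import random


def gc_fraction(seq: str) -> float:
    return sum(b in "GC" for b in seq) / len(seq)


def build_address(primer, index_nt, rng, marker_base="A",
                  balance_len=8, gc_range=(0.4, 0.6), tries=1000):
    """Assemble marker + balancing sequence + primer + index (assumed order)."""
    marker = marker_base * 4
    for _ in range(tries):
        # Draw a random balancing sequence and keep it only if the whole
        # addressing sequence lands inside the assumed GC range.
        balance = "".join(rng.choice("ATCG") for _ in range(balance_len))
        addr = marker + balance + primer + index_nt
        if gc_range[0] <= gc_fraction(addr) <= gc_range[1]:
            return addr
    raise RuntimeError("no suitable balancing sequence found")


# index_nt is passed in already converted to bases.
addr = build_address("ATCG", "AAAT", random.Random(1))
print(len(addr), addr[:4], addr[12:16], addr[16:])  # 20 AAAA ATCG AAAT
```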

[0137] In the above encoding process, the addressing sequence is divided into pre-addressing and post-addressing sequences. The pre-addressing and post-addressing sequences on the same DNA strand are generated in the same way, but the order of their components differs, as shown in the primer structure diagrams in Figure 7 and Figure 8. Furthermore, the modulation primer in the pre-addressing sequence is the GC modulation primer, and the modulation primer in the post-addressing sequence is the homopolymer modulation primer. The four sets of addressing sequences generated in this embodiment are shown in Figure 9.

[0138] In the above encoding, the complete DNA strand structure includes the base sequence, the marker sequence, and the pre- and post-addressing sequences, as shown in Figure 10.

[0139] In the above encoding, each part of the information block is assembled into a complete DNA strand. In the example, the first three DNA strands consist of a 264nt base sequence, a 4nt marker sequence, a 20nt pre-addressing sequence, and a 20nt post-addressing sequence, totaling 308nt. The last strand consists of a 168nt base sequence, a 4nt marker sequence, a 20nt pre-addressing sequence, and a 20nt post-addressing sequence, totaling 212nt.
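
The strand lengths can be checked with a short calculation. Each byte maps to 4 nt because one base carries 2 bits; the symbols below are introduced only for this illustration, with k the information block size in bytes.

```latex
\[
L = 4\,(k + 6) + L_{\text{marker}} + 2\,L_{\text{addr}}
\]
\[
k = 60:\quad L = 4 \cdot 66 + 4 + 2 \cdot 20 = 264 + 4 + 40 = 308\ \text{nt}
\]
\[
k = 36:\quad L = 4 \cdot 42 + 4 + 2 \cdot 20 = 168 + 4 + 40 = 212\ \text{nt}
\]
```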

[0140] After the above encoding process, the 216-byte TXT text file containing information from Chapter 1 of the *Tao Te Ching* was converted into four DNA strands, with lengths of 308 nt, 308 nt, 308 nt, and 212 nt respectively. The specific base sequence of each DNA strand is shown in Figure 11.

[0141] As shown in Figure 12, the decoding process corresponds to the encoding process. In the DNA strand, the addressing sequence length is 20 nt, so the pre- and post-addressing sequences can be obtained from the DNA strand structure. Then, based on the structural design of the addressing sequences, the two modulation primers and the index number of the DNA strand are obtained. Next, the base sequence and marker sequence are separated from the DNA strand. Based on the modulation process information in the marker sequence, the base sequence is modulated again to obtain the unmodulated base sequence, which is then converted back to byte stream data. Finally, verification and error correction are performed on the information from each DNA strand. The corrected information is assembled into the original stored information according to the index number and output as a TXT text file, completing the decoding process.

[0142] The decoding process described above, which involves disassembling the DNA strand and addressing sequence, requires differentiation based on the DNA strand structure and addressing sequence structure designed for actual use. Similarly, the reconstruction of the marker sequence information must also be based on the structure of the marker sequence.

[0143] In the above decoding process, the sequence conversion is again divided into two steps: the first converts the base sequence into a binary sequence, and the second converts the binary sequence into byte stream data through number system conversion.

[0144] In the decoding process described above, the RS checksum algorithm is used for information verification, and the DNA strand sorting is performed using the bubble sort algorithm. After information verification and recombination, the DNA strand information is assembled into the correct and complete original information, which is then output as a TXT text file, representing the information of the first chapter of the Tao Te Ching.

[0145] In summary, the DNA information encoding and decoding method based on modulation provided by this invention can optimize the base sequence during encoding using specially constructed modulation primers, ensuring the base sequence meets constraints, and simultaneously construct a marker sequence, allowing the correct original information to be recovered during decoding. Furthermore, the structured addressing sequence design method in this invention combines modulation primers, index sequences, and primer sequences, preserving its own addressing function while also serving as a PCR primer, significantly reducing information redundancy. Compared to existing technologies, the encoding and decoding scheme in this invention can encode and decode any digital information stored in a computer, and its encoding density approaches the theoretical limit, reaching 1.9 bits / nt. Simultaneously, the structured addressing sequence design method can generate a specific number of addressing sequences matching the base sequence size, satisfying the needs of multiple functions such as modulation, addressing, PCR amplification, and information retrieval.

[0146] Using the modulation-based DNA information encoding and decoding method and the structured addressing sequence design method of this invention, computer digital information can be encoded and decoded, and converted into DNA base sequences that meet the needs of synthesis and sequencing with high-density coding.

[0147] In a preferred embodiment, by changing primers and screening DNA strand quality, digital information can be encoded with a very high coding density without adding excessive redundancy. Simultaneously, for subsequent PCR-based information retrieval operations, a structured addressing sequence design method is used to generate a large number of specific addressing sequences. This reduces information redundancy on the DNA strand while providing addressing and PCR primer functions, simplifying the PCR primer generation process.

[0148] This invention also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the DNA information encoding method and / or the DNA information decoding method described above.

[0149] This invention also provides a control device, including a processor and a storage medium for storing a computer program; wherein the processor is used to implement the DNA information encoding method and / or the DNA information decoding method when executing the computer program.

[0150] This invention also provides a processor that executes a computer program, at least performing the methods described above.

[0151] The storage medium can be implemented by any type of volatile or non-volatile storage device, or a combination thereof. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which serves as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.

[0152] In the several embodiments provided in this application, it should be understood that the disclosed systems and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.

[0153] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.

[0154] In addition, in the various embodiments of the present invention, each functional unit can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0155] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0156] Alternatively, if the integrated units of this invention are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this invention, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0157] The methods disclosed in the several method embodiments provided in this application can be arbitrarily combined without conflict to obtain new method embodiments.

[0158] The features disclosed in the several product embodiments provided in this application can be arbitrarily combined without conflict to obtain new product embodiments.

[0159] The features disclosed in the several method or device embodiments provided in this application can be arbitrarily combined without conflict to obtain new method or device embodiments.

Claims

1. A DNA information encoding method based on modulation, characterized in that it includes the following steps:

S1. Information blocking: the file information to be encoded is divided into multiple information blocks, and an index number is assigned to each information block;

S2. Information verification: verification information is added to each information block;

S3. Sequence conversion: the byte stream data in each information block is converted into binary information and then into a base sequence;

S4. Modulation optimization: modulation primers are constructed, it is determined whether the base sequence meets the constraints, and a base sequence that does not meet the constraints is optimized using the modulation primers;

Step S4 includes: constructing a GC modulation primer and a homopolymer modulation primer for each information block, the modulation primers being 4-base sequences with 50% GC content and no homopolymer; the GC content and homopolymers of the information sequence are made to meet the constraints through modulation with the modulation primers. The following modulation strategy is adopted: if the base sequence meets the constraints, no modulation is performed; if there is a problem with either the GC content or the homopolymers of the base sequence, the corresponding modulation primer is used for modulation; if there are problems with both the GC content and the homopolymers of the base sequence, both modulation primers are used for modulation. For a sequence whose GC content does not meet the requirements, the entire sequence is modulated directly; for a sequence with excessively long homopolymers, the position containing the homopolymer is located, and only the homopolymer at that position is locally modulated. From the sequences that meet the constraints after modulation optimization, the modulation result with the shortest marker sequence is selected as the constraint sequence. The modulation optimization process is a 2-bit XOR operation: the bases of the base sequence are grouped sequentially, each group of bases corresponds to a modulation primer, the XOR operation is performed on each base, and the result is the modulation result;

S5. Marker sequence generation: a unique marker sequence is generated for each information block based on the modulation process information;

Step S5 includes: generating a unique marker sequence for each information block to record the modulation process information. The modulation process information recorded in the marker sequence includes three parts: whether the GC modulation primer was used for overall modulation; the specific positions of homopolymer modulation; and the number of homopolymer modulations. One bit of binary information indicates whether the GC primer was used for overall modulation. The specific position of each homopolymer modulation is represented by eight bits of binary information, and the number of homopolymer modulations is stored as seven bits of binary information, which is combined with the preceding one bit of GC modulation information to form eight bits of binary information. The marker sequence is converted into a base sequence using a quaternary encoding rule, and the base sequence is then modulated using the GC modulation primer and/or the homopolymer modulation primer;

S6. Addressing sequence generation: the addressing sequence for each information block is generated based on the modulation primers and index number of that information block;

Step S6 includes: incorporating the index number and modulation primer into the addressing sequence, and generating the addressing sequence for each information block using a structured addressing sequence design method. The structured addressing sequence design method combines a marker, a balancing sequence, the modulation primer, and the index number into an addressing sequence. The marker is a homopolymer used to ensure the specificity of the addressing sequence, and the balancing sequence is a randomly generated multi-base sequence used to ensure GC content balance. An addressing sequence is placed at each end of each DNA strand, referred to as the pre-addressing sequence and the post-addressing sequence, respectively. The pre-addressing sequence stores the GC modulation primer, and the post-addressing sequence stores the homopolymer modulation primer;

S7. Output: each information block is connected with its marker sequence and addressing sequences to assemble a complete DNA strand, which is output to a text file for storage.

2. The DNA information encoding method based on modulation as described in claim 1, characterized in that step S3 includes converting the information in the information block into a base sequence using a quaternary encoding rule, where "00", "01", "10", and "11" correspond to the bases "A", "T", "C", and "G", respectively.

3. The DNA information encoding method based on modulation as described in claim 1, characterized in that step S7 includes: combining the modulated and optimized constraint sequence with its preceding and following primers and its marker sequence to form a complete DNA strand structure for storing information, and outputting it to a text file for storage.

4. A DNA information decoding method based on modulation, used to decode a DNA strand encoded using the DNA information encoding method according to any one of claims 1 to 3, characterized in that it includes the following steps:

T1. Addressing sequence reconstruction: read the addressing sequences of the DNA strand and reconstruct the modulation primers and index number;

T2. Modulation restoration: demodulate the marker sequence, and decode the base sequence into an unmodulated sequence based on the marker sequence information and the modulation primers;

T3. Sequence conversion: convert the unmodulated sequence into a binary sequence and then restore it to byte stream data;

T4. Information verification: verify and correct the stored information based on the verification algorithm and the redundant information;

T5. Information reassembly: restore the corrected stored information to the original information based on the index number.

5. The DNA information decoding method based on modulation as described in claim 4, characterized in that:

Step T1 includes: determining the location of the addressing sequences in the read DNA sequence, and extracting the modulation primers and index sequence according to the addressing sequence structure;

Step T2 includes: first finding the marker sequence in the DNA sequence and decoding the modulation process information, and then, using the modulation primers parsed from the addressing sequences, reverse modulating the DNA sequence to obtain the unmodulated stored information sequence;

Step T3 includes: first converting the unmodulated stored information sequence from a base sequence to a binary sequence, and then converting it into byte stream data through number system conversion;

Step T4 includes: for the byte stream data obtained after demodulation and conversion, which contains stored information and verification information, using a verification algorithm to verify the information and correct any errors in it;

Step T5 includes: sorting the decoded and verified information according to its index number using a sorting algorithm to restore the original stored information, and finally outputting the original stored file.

6. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the DNA information encoding method as described in any one of claims 1 to 3 and/or the DNA information decoding method as described in any one of claims 4 to 5.

7. A control device, comprising a processor and a storage medium for storing a computer program; wherein, when the processor executes the computer program, it implements the DNA information encoding method as described in any one of claims 1 to 3 and/or the DNA information decoding method as described in any one of claims 4 to 5.