A polypeptide storage method, system and related device for encoding information in an ultrahigh base system

By employing probabilistic combinatorial coding and pooling synthesis techniques, an ultra-high radix coding system was designed, which solved the problem of limited peptide coding density and achieved efficient information storage for peptides, reaching a coding density of 15 bits/amino acid.

CN116405036B · Active · Publication Date: 2026-06-26 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2023-03-06
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing bioinformatics storage technologies, the coding density of peptides is capped by the limit of traditional coding theory, making it impossible to achieve the ideal coding density. This is especially true of protein storage: more types of amino acid motifs are available than nucleotides in DNA storage, but this potential coding density is not fully exploited.

Method used

By employing a probabilistic combination coding method and determining coding parameters such as polypeptide chain length, probability combination bit length, and total number of amino acids, an ultra-high base coding system is designed. Using automatic indexing and pooling synthesis technology, binary sequences are converted into polypeptide sequences, achieving ultra-high information density polypeptide storage.

Benefits of technology

Breaking through the data density limit of traditional bioinformatics coding theory, it achieves ultra-high information density for peptide coding, improves the actual storage efficiency of amino acid storage space, and reaches a logical storage density of 15 bits/amino acid.



Abstract

The application provides a polypeptide storage method, system and device for super-high radix coded information, and the method comprises the following steps: step 1, determining coding parameters (polypeptide chain length l, probability combination bit number k, and amino acids to be used and the total number n thereof); step 2, obtaining a file to be coded, and converting the file into a binary code stream; step 3, dividing the binary code stream into independent binary source blocks according to the coding parameters, and establishing a probability combination coding mapping relationship according to the coding parameters; step 4, generating an index of the independent binary source blocks based on the coding parameters, and obtaining a complete independent binary sequence; step 5, converting the binary source blocks into polypeptide sequences according to the mapping relationship and storing the polypeptide sequences; and step 6, sequencing the polypeptide sequences and statistically analyzing the sequencing data to decode the original data. The application breaks through the data density limit of traditional biological information coding theory, log₂N bits / unit (N is the number of units), and realizes polypeptide coding with super-high information density. The application also provides a polypeptide storage system and electronic equipment for super-high radix coded information, which have the above beneficial effects.

Description

Technical Field

[0001] This invention belongs to the field of data encoding, specifically relating to a method, system, and related apparatus for storing polypeptide information in ultra-high base encoding.

Background Technology

[0002] With the convergence of information technology and the economy and society, and the advancement of social networks and cloud computing technologies, data has grown rapidly. The global data sphere has experienced exponential growth in recent years, posing significant challenges to data storage. Currently, the total amount of digital information globally is 4.4 ZB, and it is projected to reach 175 trillion GB by 2025. To address this surge in data volume, the mainstream approach is to build large data centers. These centers primarily rely on optical and semiconductor media for storing, retrieving, accessing, and replicating massive amounts of data. However, limited by the storage capacity of existing storage devices, large data centers not only occupy a large area but also consume significant amounts of energy, making it impossible to meet the exponentially growing demand for information storage. Therefore, developing a new generation of high-density, low-energy-consumption information storage media is of paramount importance.

[0003] In recent years, bioinformatics storage has received increasing attention, and related technologies have developed rapidly. Representative examples include DNA and proteins (peptides). However, current DNA data storage is limited to four natural nucleotides (A, T, C, G) and four artificial nucleotides (P, Z, S, B). This small set of basic storage units restricts coding density. Protein data storage, on the other hand, has more storage units (20 natural amino acids and various artificial amino acids). The greater variety of units permits a higher information coding density, and amino acids have smaller molecular weights than nucleotides, allowing for higher storage density at the same mass. Furthermore, peptides are more stable than DNA. Studies have shown that even millions of years later, peptides or proteins can still be detected and sequenced, while DNA has degraded. Researchers have designed octal and hexadecimal encoding and decoding systems based on amino acid motifs, but their encoding density is still bounded by the traditional information encoding density limit of log₂N bits / motif (octal corresponds to log₂8 = 3 bits / motif, hexadecimal corresponds to log₂16 = 4 bits / motif). Moreover, due to limitations imposed by biochemical properties such as peptide chain length, the actual information encoding density is often lower than log₂N bits / motif, making it impossible to achieve the ideal encoding density.

[0004] Peptide libraries in peptide synthesis refer to a large collection of small peptides with specific lengths and different sequences, including the arrangement and combination of various (or most) amino acid sequences in short peptides of this length. Currently, they are mainly used in drug design and protein-protein interaction research. The synthesis of a peptide library often involves a lot of information redundancy, which can seriously waste data value during mass spectrometry analysis. Combining peptide library technology with mass spectrometry data to improve the value of the encoded information density is of great significance.

[0005] Therefore, how to combine relevant knowledge to improve the coding density of bio-information storage is a key technical problem that needs to be solved by those skilled in the art.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a method, system, computer-readable storage medium, and electronic device for storing polypeptide information with ultra-high base encoding. Based on probabilistic combinatorial encoding, this invention can break through the data density limit (log₂N bits / unit) of traditional bioinformatics coding theory and achieve polypeptide encoding with ultra-high information density ($\log_2 C_n^k$ bits / unit).

[0007] To address the above technical problems, this invention provides a method for storing polypeptide information with ultra-high base encoding, comprising the following steps:

[0008] 1) Obtain the file to be encoded and convert it into a binary bitstream; determine the encoding parameters, including the polypeptide chain length l, the number of probability combination bits k, the amino acids to be used and their total number n;

[0009] 2) Divide the binary bitstream into independent binary source blocks according to the encoding parameters, and establish a probabilistic combination encoding mapping relationship according to the encoding parameters;

[0010] 3) Based on the encoding parameters, an index for each independent binary source block is automatically generated to obtain a complete independent binary sequence;

[0011] 4) Convert independent binary sequences into polypeptide sequences according to the probability combination encoding mapping relationship, and synthesize and store polypeptide sequences in separate pools.

[0012] As a preferred embodiment of the present invention, step 1) further includes a step of determining an ultra-high base number after determining the encoding parameters, as follows:

[0013] The unit information density p and p′ are calculated based on the probability combination bit number k in the encoding parameters and the total number of amino acids n, where the probability combination bit number k refers to the combination of selecting k amino acids from n amino acids, and the specific calculation formula is as follows:

[0014] $p = \log_2 C_n^k$

[0015] $p' = \lfloor p \rfloor$

[0016] Based on the above formula, the base y of the encoding scheme is calculated using the following formula:

[0017] $y = 2^{p'}$

[0018] where $C_n^k = \frac{n!}{k!\,(n-k)!}$ is the number of combinations of k items chosen from n, $\lfloor \cdot \rfloor$ denotes rounding down, and y is the ultra-high base.
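The density calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming p = log₂ C(n, k) and p′ = ⌊p⌋, which agrees with the 15.24 and 15 bits / unit figures quoted in this description for n = 20 and k = 6. The function name `unit_density` is hypothetical and does not come from the patent.

```python
import math

def unit_density(n: int, k: int) -> tuple[float, int, int]:
    """Return (p, p_floor, y) for combinations of k amino acids out of n.

    Hypothetical helper: the name and signature are illustrative only.
    """
    total = math.comb(n, k)            # C(n, k), number of amino acid combinations
    p = math.log2(total)               # theoretical bits per site
    p_floor = total.bit_length() - 1   # floor(log2(total)) computed with integers
    y = 2 ** p_floor                   # base (radix) of the coding scheme
    return p, p_floor, y

p, p_floor, y = unit_density(20, 6)
print(math.comb(20, 6), round(p, 2), p_floor, y)  # 38760 15.24 15 32768
```

Taking the floor through `bit_length()` rather than `math.floor(math.log2(...))` avoids floating-point edge cases when C(n, k) is an exact power of two.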

[0019] As a preferred embodiment of the present invention, in step 2), the binary code stream is divided into independent binary source blocks according to the encoding parameters, specifically as follows:

[0020] The number of information chains s and the number z of zeros to be padded at the end when the binary code stream cannot be divided evenly are determined from the binary code stream length b, the information density p′, and the peptide chain length l. The specific calculation formulas are as follows:

[0021] $s = \lceil b / (l \cdot p') \rceil$

[0022] $z = s \cdot l \cdot p' - b$

[0023] where $\lceil \cdot \rceil$ denotes rounding up;

[0024] After padding the binary code stream with zeros, the binary code stream is evenly divided into s independent binary source blocks based on the number of information chains s.
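As an illustration of the padding and partitioning step, the sketch below assumes that each chain carries l · p′ payload bits, so that s = ⌈b / (l · p′)⌉ and z = s · l · p′ − b. The function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(bits: str, l: int, p_floor: int) -> tuple[list[str], int]:
    """Pad a bit string with trailing zeros and cut it into s equal source blocks.

    Hypothetical helper; returns (blocks, z) where z is the number of padded zeros.
    """
    block_len = l * p_floor                 # payload bits carried by one chain
    s = -(-len(bits) // block_len)          # ceiling division: number of chains
    z = s * block_len - len(bits)           # zeros appended to the end
    padded = bits + "0" * z
    blocks = [padded[i:i + block_len] for i in range(0, len(padded), block_len)]
    return blocks, z

# A 176-bit stream with l = 12 and p' = 15 fits in one 180-bit block.
blocks, z = split_into_blocks("1" * 176, l=12, p_floor=15)
print(len(blocks), z, len(blocks[0]))  # 1 4 180
```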

[0025] As a preferred embodiment of the present invention, in step 2), establishing a probabilistic combination coding mapping relationship based on the coding parameters specifically involves:

[0026] Based on the selected amino acids, their total number n, and the number of probability combination bits k, all combinations of k amino acids are enumerated to obtain the amino acid combination codes. The total number of amino acid combinations is $C_n^k$. The combinations are numbered sequentially with decimal numbers starting from 0, and the first y combinations are selected to establish the probabilistic combination coding mapping relationship, in which one decimal number corresponds to one amino acid combination and y is the ultra-high base.
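A possible way to build such a mapping table is sketched below. It assumes the 20 one-letter codes listed in the embodiment, taken in the order they are listed, and it assumes that combinations are enumerated in lexicographic order over that listing. That ordering is an assumption, although it does reproduce the first two combinations named in the embodiment (KAVSHP, KAVSHO). The names `AMINO_ACIDS` and `build_mapping` are hypothetical.

```python
import math
from itertools import combinations, islice

# The 20 amino acid codes from the embodiment, in the order they are listed there.
AMINO_ACIDS = "KAVSHPONQLFYBTDECWRM"

def build_mapping(alphabet: str, k: int) -> list[str]:
    """Return the first y = 2**p' combinations; the list position is the decimal number."""
    total = math.comb(len(alphabet), k)
    y = 2 ** (total.bit_length() - 1)       # largest power of two not exceeding C(n, k)
    # combinations() yields tuples in lexicographic order of the input positions.
    return ["".join(c) for c in islice(combinations(alphabet, k), y)]

table = build_mapping(AMINO_ACIDS, 6)
print(len(table), table[0], table[1])  # 32768 KAVSHP KAVSHO
```

The remaining C(n, k) − y combinations are simply left unused, which is why the practical density is p′ rather than p.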

[0027] As a preferred embodiment of the present invention, generating an index for an independent binary source block based on the encoding parameters in step 3) specifically involves numbering the independent binary source blocks sequentially starting from 0 and converting each number into binary form. The index is then padded with leading zeros so that its number of bits is an integer multiple of p′, ensuring that the index can be mapped to amino acid combination codes.
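The index generation can be illustrated as follows. This sketch assumes that all indexes share one common width, namely the smallest multiple of p′ that can hold the largest index. The function name `make_indexes` is hypothetical.

```python
def make_indexes(s: int, p_floor: int) -> list[str]:
    """Return zero-padded binary indexes 0..s-1 whose width is a multiple of p_floor."""
    raw_bits = max(1, (s - 1).bit_length())      # bits needed for the largest index
    width = -(-raw_bits // p_floor) * p_floor    # round up to a multiple of p'
    return [format(i, f"0{width}b") for i in range(s)]

# Four chains with p' = 15: each index occupies exactly one amino acid combination.
for index in make_indexes(4, 15):
    print(index)
```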

[0028] As a preferred embodiment of the present invention, step 4) of converting independent binary sequences into polypeptide sequences according to a probability combination encoding mapping relationship specifically involves:

[0029] Divide the independent binary sequence into segments of p′ bits;

[0030] Convert each p′-bit binary segment into a decimal number;

[0031] Map the decimal numbers to amino acid combination codes according to the probability combination mapping relationship;

[0032] Obtain the complete polypeptide sequence, then synthesize the polypeptide sequences in separate pools and store them.
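The conversion steps above can be sketched on a deliberately small toy system so that the result is easy to check by hand. The toy parameters (a 5-letter alphabet, k = 2, hence C(5, 2) = 10, p′ = 3 and y = 8) are assumptions chosen for illustration only, as are the names `toy_table` and `encode_sequence`.

```python
from itertools import combinations, islice

def toy_table(alphabet: str, k: int, p_floor: int) -> list[str]:
    """First 2**p_floor combinations of k letters, numbered by list position."""
    return ["".join(c) for c in islice(combinations(alphabet, k), 2 ** p_floor)]

def encode_sequence(bits: str, table: list[str], p_floor: int) -> list[str]:
    """Cut bits into p'-bit segments and map each segment to a combination."""
    if len(bits) % p_floor:
        raise ValueError("bit length must be a multiple of p'")
    segments = [bits[i:i + p_floor] for i in range(0, len(bits), p_floor)]
    return [table[int(segment, 2)] for segment in segments]  # binary -> decimal -> combo

table = toy_table("KAVSH", k=2, p_floor=3)   # KA, KV, KS, KH, AV, AS, AH, VS
print(encode_sequence("000111010", table, 3))  # ['KA', 'VS', 'KS']
```

Each element of the returned list is the combination assigned to one site of the chain, not an amino acid sequence.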

[0033] As a preferred embodiment of the present invention, the step of synthesizing polypeptide sequences by pooling is as follows:

[0034] 4.1) Dissolve the resin, which serves as the substrate for polypeptide sequence synthesis, uniformly into k polypeptide synthesis tubes;

[0035] 4.2) Add one amino acid from the amino acid combination of the current site to each polypeptide synthesis tube for reaction, and k polypeptide synthesis tubes correspond to k amino acids in the amino acid combination of the current site;

[0036] 4.3) After the reaction is complete, combine the solutions from the k polypeptide synthesis tubes;

[0037] 4.4) The combined solution from step 4.3) is then evenly distributed and dissolved into k new polypeptide synthesis tubes, and steps 4.2)-4.4) are repeated to synthesize the next site to be synthesized, until the synthesis of the entire polypeptide sequence is completed.
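The effect of steps 4.1) to 4.4) can be modelled in a highly idealized way: each round splits the pooled resin into k tubes, couples one amino acid per tube, and pools the tubes again. The sketch below assumes perfect splitting and complete coupling, ignores the chemical direction of chain growth, and only tracks which sequences end up in the pool. The function name `split_pool` is hypothetical.

```python
def split_pool(sites: list[str]) -> set[str]:
    """Idealized split-and-pool model: return every sequence present after synthesis.

    sites[i] is the amino acid combination assigned to site i.
    """
    pool = {""}                       # bare resin before the first coupling
    for combo in sites:
        # Split into len(combo) tubes, add one amino acid per tube, then recombine.
        pool = {chain + aa for chain in pool for aa in combo}
    return pool

mixture = split_pool(["KA", "VS"])
print(sorted(mixture))  # ['AS', 'AV', 'KS', 'KV']
```

Under these assumptions a chain of l sites with k amino acids per site yields a mixture of k^l distinct sequences, and every amino acid of a site's combination appears at that site with equal frequency.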

[0038] As a preferred embodiment of the present invention, after step 4), the method further includes: performing sequencing analysis on the polypeptide sequence fragment to verify whether the amino acid sequence of the polypeptide fragment is consistent with the encoded standard amino acid sequence.

[0039] Preferably, the process of sequencing the polypeptide sequence fragment to verify whether the amino acid sequence of the polypeptide sequence is consistent with the encoded standard amino acid sequence specifically includes: sequencing the polypeptide sequence by tandem mass spectrometry de novo sequencing to obtain mass spectrometry data, and statistically analyzing the amino acid probability values at each site of the polypeptide sequence based on the mass spectrometry data to verify whether the amino acid sequence of the polypeptide sequence is consistent with the encoded standard amino acid sequence.

[0040] The present invention also provides a polypeptide storage system for ultra-high radix encoded information implementing the aforementioned method, comprising:

[0041] The acquisition module is used to acquire encoding parameters and the binary code stream of the file to be encoded. The encoding parameters include polypeptide chain length l, probability combination bit number k, amino acids to be used and their total number n.

[0042] The density calculation module calculates the unit information density p and p′ based on the probability combination bit k and the number of amino acids n in the encoding parameters;

[0043] The decomposition module is used to divide the binary bitstream into independent binary source blocks according to the encoding parameters;

[0044] The mapping calculation module is used to establish a probabilistic combination coding mapping relationship based on the coding parameters;

[0045] The index module automatically generates independent binary source block indices based on the encoding parameters, starting from 0, and converts them into binary form. It pads the binary number of the index with leading zeros to make the number of bits a multiple of p', ensuring that the index can be mapped to amino acid combination encoding.

[0046] The mapping module converts independent binary source blocks into polypeptide sequences according to mapping relationships.

[0047] The synthesis module uses a pooled synthesis method to synthesize and store polypeptide sequences.

[0048] The present invention also provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the polypeptide storage method for ultra-high base encoded information.

[0049] The present invention also provides an electronic device, characterized in that it includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the polypeptide storage method for ultra-high base encoded information when it calls the computer program in the memory.

[0050] This invention provides a method for storing polypeptide information with ultra-high base encoding, comprising: determining encoding parameters (polypeptide chain length l, number of probability combination bits k, amino acids to be used and their total number n); acquiring the file to be encoded and converting the file into a binary code stream; dividing the binary code stream into independent binary source blocks according to the encoding parameters, and simultaneously establishing a probability combination encoding mapping relationship based on the encoding parameters; generating an index for the independent binary source blocks using automatic indexing based on the encoding parameters to obtain complete independent binary sequences; converting the independent binary sequences into polypeptide sequences according to the mapping relationship, synthesizing the polypeptide sequences by pooling, then obtaining polypeptide data through de novo sequencing, and finally statistically analyzing the amino acid probability values at each site to verify whether the amino acid sequence of the polypeptide sequence is consistent with the encoded standard amino acid sequence.

[0051] This invention breaks through the data density limit (log₂N bits / unit) of traditional bioinformatics coding theory and achieves polypeptide encoding with ultra-high information density ($\log_2 C_N^k$ bits / unit). For example, when the number of units N = 20 and the number of probability combination bits k = 6, the theoretical data density limit of traditional bioinformatics coding is log₂N = log₂20 ≈ 4.32 bits / unit, while this method can reach a limiting information storage density of $\log_2 C_{20}^{6} = \log_2 38760 \approx 15.24$ bits / unit (15 bits / unit in practice because of rounding down). This method can therefore greatly improve the logical storage density of peptide information storage and improve the actual storage efficiency of amino acid storage space.

[0052] The present invention also provides a polypeptide storage system, a computer-readable storage medium, and an electronic device with ultra-high base encoded information, which have the above-mentioned beneficial effects, and will not be repeated here.

Attached Figure Description

[0053] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0054] Figure 1 is a flowchart of a method for storing polypeptide information with ultra-high base encoding provided in an embodiment of the present invention;

[0055] Figure 2 is a schematic diagram of polypeptide data encoding with the 20 amino acids (natural and artificial amino acids coexisting) used in the present invention;

[0056] Figure 3 is a mapping table of a polypeptide storage method with ultra-high base encoding information provided in an embodiment of the present invention;

[0057] Figure 4 is a schematic diagram of a method for storing polypeptide information with ultra-high base encoding provided in an embodiment of the present invention;

[0058] Figure 5 is a schematic diagram of the amino acid encoding process in a polypeptide storage method with ultra-high base encoding information provided in an embodiment of the present invention;

[0059] Figure 6 is a schematic diagram of the combined polypeptide sequence obtained during the encoding process in a polypeptide storage method for ultra-high base encoded information provided in an embodiment of the present invention;

[0060] Figure 7 shows a pooling synthesizer and a synthesized polypeptide in a polypeptide storage method for ultra-high base encoding information provided in an embodiment of the present invention;

[0061] Figure 8 is a polypeptide sequencing result diagram in a polypeptide storage method with ultra-high base encoding information provided in an embodiment of the present invention;

[0062] Figure 9 is a graph showing the probability statistical analysis results of a polypeptide storage method for ultra-high base encoded information provided in an embodiment of the present invention;

[0063] Figure 10 is a density map for different values of n and k in a method for storing polypeptide information with ultra-high base encoding provided in an embodiment of the present invention;

[0064] Figure 11 is a comparison diagram of coding density in a polypeptide storage method with ultra-high base coding information provided in an embodiment of the present invention;

[0065] Figure 12 is a schematic diagram of a polypeptide storage system for ultra-high base encoded information provided in an embodiment of the present invention.

Detailed Implementation

[0066] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0067] The purpose of this invention is to overcome the shortcomings of existing technologies. By combining polypeptide pooling synthesis technology with the characteristics of mass spectrometry data, a probabilistic combinatorial coding algorithm is designed that breaks through the data density limit of traditional bioinformatics coding theory (log₂N bits / unit, where N is the number of amino acids used in traditional methods), thereby achieving polypeptide encoding with ultra-high information density ($\log_2 C_n^k$ bits / unit). In this embodiment, n = 20 and k = 6 are used, achieving an information density of 15 bits / amino acid. The scheme successfully synthesizes peptide chains and retrieves the data by mass spectrometry, significantly improving the logical storage density of peptide information and the actual storage efficiency of amino acid storage space. This provides an efficient encoding scheme for peptide information storage technology. Based on this, the present invention develops a complete prototype system for storing ultra-high base encoded peptide information. The system integrates functions such as binary data acquisition, storage density calculation, binary sequence partitioning, probability combination mapping calculation, automatic index design, and combination amino acid mapping. A text segment was then encoded using this system, and the original text was recovered through chemical peptide synthesis, mass spectrometry sequencing, and sequencing data analysis.

[0068] Please refer to Figure 1, which is a flowchart of a polypeptide storage method using ultra-high base information encoding provided in an embodiment of the present invention. The method includes:

[0069] 1) Determine the encoding parameters, obtain the file to be encoded, and convert the file into a binary bitstream; calculate the unit information density p′ based on the probability combination bit number k and the total number of amino acids n in the encoding parameters;

[0070] 2) Divide the binary bitstream into independent binary source blocks according to the encoding parameters; establish a probabilistic combination encoding mapping relationship based on the encoding parameters;

[0071] 3) Based on the encoding parameters, an index for each independent binary source block is automatically generated to obtain a complete independent binary sequence;

[0072] 4) Convert the independent binary sequences into peptide sequences according to the mapping relationship, synthesize peptide sequences by pooling and store them.

[0073] In step 1), the polypeptide chain length l, the number of probability combination bits k, the amino acids to be used, and their total number n are first determined. These parameters determine the processing in the subsequent steps. l is the length of the final encoded polypeptide chain; it is generally limited by the ability to chemically synthesize polypeptide chains, so the encoder sets this parameter according to the available synthesis capability. The number of probability combination bits k, together with the total number of amino acids n, determines the subsequent amino acid mapping: k is the size of each combination, i.e., k amino acids are chosen out of n, so the total number of possible combinations is $C_n^k$. The total number of combinations determines the final unit information density and hence the base $y = 2^{p'}$ of this embodiment. The formulas for calculating the unit information density p and p′ are as follows:

[0074] Specific formulas:

[0075] $p = \log_2 C_n^k$

[0076] $p' = \lfloor p \rfloor$

[0077] Based on the above formula, the base y of the encoding scheme is calculated using the following formula:

[0078] $y = 2^{p'}$

[0079] where $C_n^k = \frac{n!}{k!\,(n-k)!}$ is the number of combinations of k items chosen from n, $\lfloor \cdot \rfloor$ denotes rounding down, and y is the ultra-high base.

[0080] While performing the above steps, obtain the file to be encoded and convert it into a binary bitstream. If the file to be encoded is directly binary data, no further conversion is needed before proceeding with the subsequent processes.
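For a text file, the conversion into a binary bitstream might look like the sketch below. It assumes 8 bits per byte with the most significant bit first and ASCII text, which is consistent with the 176-bit length reported in this embodiment for the 22-character example text. The function name `to_bitstream` is hypothetical.

```python
def to_bitstream(data: bytes) -> str:
    """Return the bytes as a string of '0'/'1' characters, 8 bits per byte, MSB first."""
    return "".join(f"{byte:08b}" for byte in data)

bits = to_bitstream("Never Cry, Always Try!".encode("ascii"))
print(len(bits))   # 176
print(bits[:16])   # 0100111001100101  ('N' then 'e')
```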

[0081] Since a mapping from binary data to amino acids needs to be established subsequently, after obtaining the encoding parameters it is necessary to calculate the unit information density from the number of probability combination bits and the total number of amino acids. The unit information density determines that every p′ bits of binary data are mapped to one amino acid combination, and therefore that the number of bits of the binary index produced by the automatic indexing module must be an integer multiple of p′. In this embodiment, n = 20, k = 6, and p′ = 15.

[0082] In step 2), the obtained binary code stream of the file to be encoded is divided into independent binary source blocks according to the above parameters (l, p′). Since the length b of the binary code stream is not necessarily an integer multiple of l · p′, it needs to be padded with zeros. The number of zeros to be padded is $z = s \cdot l \cdot p' - b$, where s is the number of information chains. Padding with zeros ensures that the length of the polypeptide chain obtained after the subsequent mapping steps remains consistent with the preset length.

[0083] In step 2), a probabilistic combination coding mapping relationship is established based on the coding parameters. From the selected n amino acids, every combination of k amino acids forms one combined amino acid, and the combined amino acids are numbered with consecutive decimal numbers (from 0 to $C_n^k - 1$). The first $y = 2^{p'}$ combinations are taken (in this embodiment, $y = 2^{15} = 32768$, i.e., an ultra-high base of 32768), and the consecutive decimal numbers are further converted to binary data of length p′ bits, thus obtaining the combined amino acid mapping table (Figure 3). The significance of a combined amino acid is that the amino acid at a fixed site of the polypeptide chain is not determined: it can be any amino acid of the combination mapped to that site, and each amino acid of the combination occurs at that site with equal probability. For example, consider an encoded amino acid chain whose length (number of sites) is 12: (KOLFDW, VPQFEM, AVSOQW, KASLFE, KVQBEC, AVHNCW, KASHQR, SFBTEC, ASLYTR, AOQLCW, KVHPLW, KVOFTR). At the first site the amino acid combination is KOLFDW (note that KOLFDW and the like are not amino acid sequences), so the amino acid at the first site can be any one of K, O, L, F, D, or W.

[0084] In step 3), due to the limited length of the peptide chain, when encoding multiple chains, a reasonable index is needed to integrate the information. The index is the numbering information added to the front of the peptide chain. Based on the encoding parameters, an index for an independent binary source block is generated starting from 0 and converted into binary form. The number of binary bits of the index is changed to an integer multiple of p' by padding with leading zeros to ensure that the index can be mapped to the amino acid combination encoding.

[0085] In step 4), the independent binary sequences are converted into polypeptide sequences according to the probability combination encoding mapping relationship, specifically as follows:

[0086] Divide the independent binary sequence into segments of p′ bits;

[0087] Convert each p′-bit binary segment into a decimal number;

[0088] Map the decimal numbers to amino acid combination codes according to the probability combination mapping relationship;

[0089] Obtain the complete polypeptide sequence, then synthesize the polypeptide sequences in separate pools and store them.

[0090] In step 4), the steps for synthesizing the polypeptide sequence in separate pools are as follows:

[0091] 4.1) Dissolve the resin, which serves as the substrate for polypeptide sequence synthesis, uniformly into k polypeptide synthesis tubes;

[0092] 4.2) Add one amino acid from the amino acid combination of the current site to each polypeptide synthesis tube for reaction, and k polypeptide synthesis tubes correspond to k amino acids in the amino acid combination of the current site;

[0093] 4.3) After the reaction is complete, combine the solutions from the k polypeptide synthesis tubes;

[0094] 4.4) The combined solution from step 4.3) is then evenly distributed and dissolved into k new peptide synthesis tubes, and steps 4.2)-4.4) are repeated to synthesize the next site until the entire peptide sequence is synthesized. The peptide sequence, composed of combined amino acids, is not a fixed sequence; therefore, a chemical pooling synthesis method is required for storage, ensuring that the amino acids at each site are synthesized in equal portions. Afterward, mass spectrometry sequencing is used to obtain all the data, and probabilistic statistical analysis is used to identify, at each site, the k amino acids with the highest probability values, thereby reconstructing the original data.
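The statistical decoding described above can be illustrated on a small toy system. The sketch assumes that sequencing yields a list of equal-length reads, that the k most frequent amino acids at each site recover that site's combination, and that combinations are numbered in lexicographic order over the alphabet as listed. The toy parameters (5-letter alphabet, k = 2, p′ = 3) and the names `build_lookup` and `decode_reads` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations, islice, product

def build_lookup(alphabet: str, k: int) -> tuple[dict[str, int], int]:
    """Map each of the first 2**p' combinations to its decimal number; also return p'."""
    p_floor = math.comb(len(alphabet), k).bit_length() - 1
    table = islice(combinations(alphabet, k), 2 ** p_floor)
    return {"".join(c): i for i, c in enumerate(table)}, p_floor

def decode_reads(reads: list[str], alphabet: str, k: int) -> str:
    """Recover the bit string from sequenced reads by per-site frequency analysis."""
    lookup, p_floor = build_lookup(alphabet, k)
    bits = []
    for site in zip(*reads):                         # residues observed at one site
        top = {aa for aa, _ in Counter(site).most_common(k)}
        combo = "".join(aa for aa in alphabet if aa in top)  # canonical ordering
        bits.append(format(lookup[combo], f"0{p_floor}b"))
    return "".join(bits)

# Ideal pool for the sites (KA, VS, KS) plus one erroneous read.
reads = ["".join(r) for r in product("KA", "VS", "KS")] + ["HVK"]
print(decode_reads(reads, "KAVSH", 2))  # 000111010
```

A combination outside the first 2^p′ entries raises a `KeyError` in this sketch; a practical decoder would need an error-handling policy for that case, which the description does not specify.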

[0095] This invention breaks through the traditional theoretical data density limit of bioinformatics encoding (log₂20 ≈ 4.32 bits / amino acid) and achieves an ultra-high information density of 15 bits / amino acid. This greatly improves the logical storage density of peptide information storage, increases the actual storage efficiency of amino acid storage space, provides an efficient encoding scheme for peptide information storage technology, provides a meaningful reference for its application in practical digital storage, and has the potential to be expanded compared with other storage media.

[0096] The present invention will be further described below with reference to the accompanying drawings:

[0097] Figure 2 of the present invention is a schematic diagram of polypeptide data encoding with the 20 amino acids (natural and artificial amino acids coexisting) selected in the examples.

[0098] The 20 amino acids used in this example are representative amino acids. In actual use, any number of these amino acids can be selected, or the range can be expanded to include more amino acids. The number of bits in the probability combination can also be determined according to the actual situation. The coding system designed based on the parameters selected in this example (n=20, k=6) can achieve a maximum density of 15 bits / amino acid.

[0099] Figure 3 illustrates the mapping relationship of the probability combination encoding.

[0100] According to the parameters in the example, six amino acids are selected from the 20 amino acids (K, A, V, S, H, P, O, N, Q, L, F, Y, B, T, D, E, C, W, R, M) to form a combination (resulting in a total of $C_{20}^{6} = 38760$ combinations), so that KAVSHP, KAVSHO, and the other amino acid combinations are obtained sequentially. The first $2^{15}$ combinations are taken (15 is p′) and numbered sequentially with decimal numbers starting from 0. Since the final result is a mapping from binary data to combined amino acids, the decimal numbers also need to be converted to binary numbers with a length of p′ bits, which establishes a complete mapping relationship between binary data and combined amino acids.
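The mapping described above can be sketched in a few lines. This is an illustrative reconstruction, not the patent's implementation: it assumes the combinations are enumerated in lexicographic order of the amino acid list as written (which reproduces KAVSHP and KAVSHO as the first two entries), and the names `AMINO_ACIDS`, `K`, `p_floor`, and `table` are hypothetical.

```python
from itertools import combinations
from math import comb

AMINO_ACIDS = "KAVSHPONQLFYBTDECWRM"   # order as listed in the example
K = 6                                  # amino acids per combination

total = comb(len(AMINO_ACIDS), K)      # number of 6-element combinations
p_floor = total.bit_length() - 1       # floor(log2(total)), exact for integers

# Map each p'-bit binary string to one amino acid combination,
# keeping only the first 2**p' combinations.
table = {}
for number, combo in enumerate(combinations(AMINO_ACIDS, K)):
    if number >= 2 ** p_floor:
        break
    table[format(number, f"0{p_floor}b")] = "".join(combo)

print(total, p_floor, len(table))
print(table["000000000000000"], table["000000000000001"])
```

A dictionary is used here only for clarity; the combinations beyond the first 2^p′ are simply left unused.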

[0101] To enable those skilled in the art to better understand the technical solution of this method, this embodiment is explained in more detail below using a scenario embodiment with reference to Figures 4 and 5. Figure 4 is a schematic diagram illustrating a method for storing polypeptide information with ultra-high base encoding, provided in an embodiment of the present invention. Figure 5 is a schematic diagram of the amino acid encoding process in a polypeptide storage method with ultra-high base encoding information provided in an embodiment of the present invention.

[0102] In practice, the encoding parameters are first obtained: peptide chain length l = 12, probability combination bit length k = 6, total number of amino acids n = 20, and the file to be encoded, "Never Cry, Always Try!". Then the binary code stream of the file to be encoded is obtained: "01001110011001010111011001100101011100100010000001000011011100100111100100101100001000000100000101101100011101110110000101111001011100110010000001010100011100100111100100100001", so the binary code stream length is b = 176. Next, the density is calculated from the encoding parameters as p′ = 15, which gives the base y = $2^{p'}$ = 32768. Each segment of binary data has a length of l·p′ = 180 bits, so the binary code stream is divided accordingly and padded with 0s: the number of chains is s = ⌈176 / 180⌉ = 1, and the number of padding zeros is z = 1 × 180 − 176 = 4. Therefore, four zeros are appended to the original binary data. The data after padding with zeros is "010011100110010101110110011001010111001000100000010000110111001001111001001011000010000001000001011011000111011101100001011110010111001100100000010101000111001001111001001000010000". Since this embodiment has only one chain, the index address is omitted. After the density is calculated, a mapping table is calculated based on the 20 amino acids selected in this embodiment and the probability combination bit length of 6. If there are multiple chains (taking 4 chains as an example), the indexes are 0, 1, 2, 3 in sequence, which are 0, 1, 10, 11 in binary. Leading zeros are then added so that the index length is an integer multiple of the information density p′, resulting in the final indexes 000000000000000, 000000000000001, 000000000000010, 000000000000011. In this way, the indexes can be mapped to amino acid combination codes according to the mapping relationship.
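The segmentation, zero padding, and index generation in this paragraph can be sketched as follows. The formulas for the chain count and padding (s = ⌈b / (l·p′)⌉, z = s·l·p′ − b) are inferred from the worked numbers in the example rather than quoted from the text, all indexes are assumed to share one common width, and the function names `split_stream` and `make_indexes` are hypothetical.

```python
from math import ceil


def split_stream(bits, l, p_floor):
    """Pad a bit string with trailing zeros and cut it into chains of l * p' bits."""
    block = l * p_floor                       # bits carried by one peptide chain
    s = ceil(len(bits) / block)               # number of chains
    z = s * block - len(bits)                 # zeros appended at the end
    padded = bits + "0" * z
    blocks = [padded[i * block:(i + 1) * block] for i in range(s)]
    return s, z, blocks


def make_indexes(s, p_floor):
    """Number chains from 0 and left-pad each index to a multiple of p' bits."""
    raw = [format(i, "b") for i in range(s)]
    width = ceil(max(len(r) for r in raw) / p_floor) * p_floor
    return [r.zfill(width) for r in raw]


s, z, blocks = split_stream("1" * 176, l=12, p_floor=15)
print(s, z, len(blocks[0]))
print(make_indexes(4, p_floor=15))
```

Padding the index to a multiple of p′ bits means it occupies a whole number of sites, so it can be passed through the same mapping table as the payload.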

[0103] Specifically, acquiring the binary code stream and density calculation can be performed simultaneously, as can mapping relationship calculation and automatic index generation. However, density calculation must be performed before mapping relationship calculation and automatic index generation.

[0104] Then, the binary sequence is divided into segments of p′ bits, and each segment is mapped to a combination of amino acids, resulting in a final combined polypeptide sequence of length 12: "KOLFDW,VPQFEM,AVSOQW,KASLFE,KVQBEC,AVHNCW,KASHQR,SFBTEC,ASLYTR,AOQLCW,KVHPLW,KVOFTR", as shown in Figure 5. It should be noted that the polypeptide sequence composed of combined amino acids is not a fixed sequence. During storage, a chemical pooling synthesis method is required, ensuring that each amino acid at each site is synthesized in equal portions. Then, mass spectrometry sequencing technology is used to obtain all the data. This process is shown in Figure 4.
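The conversion from p′-bit segments to amino acid combinations can also be done without building the full table, by computing the combination for a given number directly. The sketch below makes several assumptions that are not stated in the text: the file is treated as 8-bit ASCII text, combinations are ordered lexicographically by the amino acid list as written, and the names `unrank` and `encode` are hypothetical.

```python
from math import comb

AMINO_ACIDS = "KAVSHPONQLFYBTDECWRM"   # order as listed in the example
K, P_FLOOR, L = 6, 15, 12              # combination size, bits per site, chain length


def unrank(number, n=len(AMINO_ACIDS), k=K):
    """Return the combination with the given number in lexicographic order."""
    combo, start = [], 0
    for remaining in range(k, 0, -1):
        for idx in range(start, n):
            # Combinations whose next element is AMINO_ACIDS[idx]
            count = comb(n - idx - 1, remaining - 1)
            if number < count:
                combo.append(AMINO_ACIDS[idx])
                start = idx + 1
                break
            number -= count
    return "".join(combo)


def encode(text):
    """Encode ASCII text as a list of amino acid combinations, one per site."""
    bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))
    bits += "0" * (-len(bits) % (L * P_FLOOR))      # pad to a whole chain
    return [unrank(int(bits[i:i + P_FLOOR], 2))
            for i in range(0, len(bits), P_FLOOR)]


sites = encode("Never Cry, Always Try!")
print(",".join(sites))
```

Under these assumptions the first, second, and last sites come out as KOLFDW, VPQFEM, and KVOFTR, in agreement with the sequence given above.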

[0105] To make the specific synthesis and sequencing process clearer to those skilled in the art, this embodiment is further explained with reference to Figures 6-9. Figure 6 is a schematic diagram of the combined polypeptide sequence to be synthesized, obtained from the above encoding process. Figure 7 is a schematic diagram of the pooling synthesizer and the synthesized peptide. Figure 8 is a schematic diagram of the peptide sequencing results. Figure 9 is a schematic diagram of the probability and statistical analysis results.

[0106] Multiple experiments have verified that fixing amino acid A at the end of the polypeptide chain can assist in decoding, as shown in Figure 6. Therefore, the combined polypeptide chains synthesized in this embodiment are "KOLFDW, VPQFEM, AVSOQW, KASLFE, KVQBEC, AVHNCW, KASHQR, SFBTEC, ASLYTR, AOQLCW, KVHPLW, KVOFTR, A". The specific synthesis steps are as follows:

[0107] (1) Peptides were synthesized on resin at a scale of 0.12 mmol. The resin was first swollen in dimethylformamide and dichloromethane (volume ratio 1:1) and then washed three times with dimethylformamide in a 50 mL peptide solid-phase synthesis tube.

[0108] (2) Amino acid coupling. 1.2 mmol of amino acid and 1.18 mmol of ester were dissolved in 8 mL of dimethylformamide, and then 408 μL of diisopropylethylamine was added. The well-mixed mixture was then added to the resin and reacted with shaking at room temperature for 30 min. The mixture was then washed three times with dimethylformamide, three times with dichloromethane, and then three times with dimethylformamide. Subsequently, 10 mL of 20% (v / v) piperidine was used to remove the protection twice for 5 min at room temperature with shaking, followed by three washes with dimethylformamide, three times with dichloromethane, and then three times with dimethylformamide.

[0109] (3) Polypeptide pooling synthesis. The pooling synthesizer is shown in Figure 7(a) and (b). 0.12 mmol of resin was dissolved in 6 mL of dimethylformamide and evenly distributed into six 10 mL peptide solid-phase synthesis tubes. The dimethylformamide in the six tubes was then removed by vacuum drying. In each synthesis tube, 0.2 mmol of amino acid and 0.19 mmol of ester were dissolved in 1 mL of dimethylformamide, followed by the addition of 68 μL of diisopropylethylamine. The well-mixed mixture was then added to the resin and reacted with shaking at room temperature for 30 min. The resin was then washed three times with dimethylformamide, three times with dichloromethane, and then three times with dimethylformamide again. The resin from the six tubes was then combined into a 50 mL peptide solid-phase synthesis tube and deprotected twice for 5 min at room temperature with shaking in 10 mL of 20% (v / v) piperidine in dimethylformamide. The resin was then washed three times with dimethylformamide, three times with dichloromethane, and then three times with dimethylformamide again. The preceding steps were then repeated for each subsequent site.

[0110] (4) After solid-phase synthesis, thoroughly wash the resin with dichloromethane and vacuum dry. Place 0.02 mmol of resin into a 50 mL peptide solid-phase synthesis tube, add cleavage reagent (2.5% (v / v) water, 2.5% (v / v) ethanedithiol, 2.5% (v / v) thioanisole, and 2.5% (v / v) triisopropylsilane in pure trifluoroacetic acid), and react at 30°C for 2 hours. Blow away most of the cleavage reagent with nitrogen, and wash the resulting peptide-containing solution twice with cold ether (pre-cooled in an ice-water mixture). Allow the resulting gel-like solid to air dry at room temperature. The synthesized peptide is shown in Figure 7(c).

[0111] The synthesized peptide was sent to a mass spectrometer, and a total of 21,294 data points were obtained after sequencing analysis. Figure 8 shows three sets of mass spectrometry results. For Figure 8(a), the obtained data is “K, V, S, F, V, V, H, T, Y, A, K, K, A”. Except for the last A, which is the fixed amino acid, the first 12 amino acids correspond to the sites of the combined polypeptide sequence. For example, the first “K” in the result belongs to the first combination “KOLFDW” of the combined amino acid sequence, and the second “V” belongs to the second combination “VPQFEM”. The amino acids at each site can then be analyzed by probability statistics.

[0112] After acquiring the large amount of mass spectrometry data mentioned above, probabilistic statistical analysis was used to identify the top 6 amino acids with the highest probability values at each site, thereby reconstructing the original data, as shown in Figure 9.
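The statistical decoding step can be sketched as a per-site frequency count followed by the inverse of the combination mapping. This is a simplified illustration, not the patent's analysis pipeline: it assumes the sequencing reads are already aligned so that position i of every read corresponds to site i, it uses raw counts as the probability estimate, and the names `top_k_per_site` and `rank` as well as the synthetic reads are hypothetical.

```python
from collections import Counter
from math import comb

AMINO_ACIDS = "KAVSHPONQLFYBTDECWRM"   # order as listed in the example
K, P_FLOOR = 6, 15


def top_k_per_site(reads, sites, k=K):
    """For each site, return the k most frequent amino acids in list order."""
    result = []
    for pos in range(sites):
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        top = [aa for aa, _ in counts.most_common(k)]
        result.append("".join(sorted(top, key=AMINO_ACIDS.index)))
    return result


def rank(combo, n=len(AMINO_ACIDS)):
    """Lexicographic number of a combination (inverse of the mapping table)."""
    number, prev, k = 0, -1, len(combo)
    for i, aa in enumerate(combo):
        idx = AMINO_ACIDS.index(aa)
        for skipped in range(prev + 1, idx):
            number += comb(n - skipped - 1, k - i - 1)
        prev = idx
    return number


# Synthetic two-site reads: 50 copies of each true residue plus two noisy reads.
reads = [a + b for a, b in zip("KOLFDW" * 50, "VPQFEM" * 50)] + ["AA", "SS"]
decoded = top_k_per_site(reads, sites=2)
bits = "".join(format(rank(c), f"0{P_FLOOR}b") for c in decoded)
print(decoded, bits)
```

Taking only the k most frequent residues lets low-frequency sequencing errors fall out of the result, provided the true residues at a site clearly dominate the counts.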

[0113] To further clarify for those skilled in the art the ultra-high information coding density achieved by this method, this embodiment is explained in more detail with reference to Figures 10 and 11. Figure 10 is a density map for different n and k values in a method for storing polypeptide information with ultra-high base encoding provided in an embodiment of the present invention. Figure 11 is a comparison diagram of coding density in a polypeptide storage method with ultra-high base coding information provided in an embodiment of the present invention.

[0114] This invention designs a polypeptide storage system with ultra-high radix encoding based on probabilistic combinatorial coding. The system overcomes the shortcomings of existing technologies by combining polypeptide pooling synthesis technology with the characteristics of mass spectrometry data. The designed probabilistic combinatorial coding algorithm breaks through the data density limit of traditional bioinformatics coding theory (log₂N bits / unit) and achieves polypeptide encoding with an ultra-high information density of $\lfloor \log_2 C_n^k \rfloor$ bits / unit. As shown in Figure 10, different information coding densities can be obtained for different n and k values. Calculations show that, with n ≤ 20 (i.e., using no more than 20 amino acids), the system can achieve a maximum information coding density of 17 bits / amino acid (rounded down). In this case, the parameters are n = 20 and k = 10. Considering the difficulty of peptide chain synthesis and the accuracy of mass spectrometry sequencing, this embodiment uses parameters of n = 20 and k = 6, resulting in an information coding density of 15 bits / amino acid (rounded down), as shown in Figure 11.
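The dependence of the density on n and k can be checked numerically. The sketch below assumes the density is the rounded-down base-2 logarithm of the number of k-element combinations, as in the example; the function name `density` is hypothetical.

```python
from math import comb


def density(n, k):
    """Bits per site: floor(log2(C(n, k))), computed exactly on integers."""
    return comb(n, k).bit_length() - 1


# Density for every k with n = 20 amino acids.
row = {k: density(20, k) for k in range(1, 21)}
print(row)

# Largest density reachable with at most 20 amino acids.
best = max(density(n, k) for n in range(1, 21) for k in range(1, n + 1))
print(best)
```

Using `int.bit_length` avoids floating-point rounding when the number of combinations is close to a power of two.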

[0115] The foregoing has provided a detailed description of an embodiment of a polypeptide storage method for ultra-high base encoded information provided by the present invention. This application also provides a corresponding encoding system. The following introduces a polypeptide storage system for ultra-high base encoded information provided by the embodiments of the present invention. The system described below and the polypeptide storage method described above can be cross-referenced.

[0116] Referring to Figure 12, which is a schematic diagram of a polypeptide storage system with ultra-high base encoding provided in an embodiment of the present invention, this application also provides a polypeptide storage system with ultra-high base encoding, comprising:

[0117] Acquisition module S1210: Determines encoding parameters, acquires the file to be encoded, and converts the file into a binary bitstream;

[0118] Density calculation module S1220: Calculates unit information density p based on the probability combination bit number k in the encoding parameters and the total number of amino acids n;

[0119] Decomposition module S1230: Divides the binary bitstream into independent binary source blocks according to the encoding parameters;

[0120] Mapping calculation module S1240: Establishes a probabilistic combination coding mapping relationship based on coding parameters;

[0121] Index module S1250: Based on the encoding parameters, it automatically generates an index for independent binary source blocks to obtain a complete independent binary sequence;

[0122] Mapping module S1260: Converts independent binary sequences into polypeptide sequences according to the mapping relationship and stores the polypeptide sequences.

[0123] Synthesis module S1270: Synthesizes and stores polypeptide sequences using a pooled synthesis method.

[0124] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed, can perform the steps provided in the above embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0125] This application also provides an electronic device that may include a memory and a processor. The memory stores a computer program, and when the processor calls the computer program in the memory, it can implement the steps provided in the above embodiments. Of course, the electronic device may also include various network interfaces, power supplies, and other components.

[0126] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. As the system provided in the embodiments corresponds to the method provided in the embodiments, the description is relatively simple; relevant parts can be found in the method section.

[0127] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are only for the purpose of helping to understand the method and core ideas of this application. It should be noted that those skilled in the art can make several improvements and modifications to this application without departing from the principles of this application, and these improvements and modifications also fall within the protection scope of the claims of this application.

[0128] It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

Claims

1. A method for storing polypeptide information in ultra-high base encoding, characterized in that it includes the following steps:
1) Obtain the file to be encoded and convert it into a binary bitstream; determine the encoding parameters, including the polypeptide chain length l, the probability combination bit number k, and the amino acids to be used and their total number n. After the encoding parameters are determined, the method further includes a step of determining the ultra-high base number, as follows: calculate the unit information density p based on the probability combination bit number k and the total number of amino acids n in the encoding parameters, where the probability combination bit number k refers to selecting k amino acids from n amino acids to form a combination. The specific calculation formulas are as follows: $p = \log_2 C_n^k$; $p' = \lfloor p \rfloor$. Based on the above formulas, the base y of the encoding scheme is calculated using the following formula: $y = 2^{p'}$; where $C_n^k$ is the combination formula, $\lfloor \cdot \rfloor$ is the floor function used for rounding down, and y represents the ultra-high base number.
2) Divide the binary bitstream into independent binary source blocks according to the encoding parameters, and establish a probabilistic combination encoding mapping relationship according to the encoding parameters. Specifically, the binary bitstream is divided into independent binary source blocks as follows: based on the binary code stream length b, the information density p′, and the polypeptide chain length l, determine the number of information chains s and the number of zeros z appended to the end of the binary code stream when it cannot be divided evenly. The specific calculation formulas are as follows: $s = \lceil b / (l \cdot p') \rceil$; $z = s \cdot l \cdot p' - b$; where $\lceil \cdot \rceil$ is the symbol for rounding up. After the binary code stream is padded with zeros, it is evenly divided into s independent binary source blocks based on the number of information chains s. The probabilistic combination coding mapping relationship is established based on the coding parameters as follows: based on the selected amino acids, the total number of amino acids n, and the probability combination bit number k, combinations are enumerated to obtain the amino acid combination codes. The total number of amino acid combinations obtained is $C_n^k$. The amino acid combinations are numbered sequentially with decimal numbers starting from 0, and the first y combinations are selected to establish the probabilistic combination coding mapping relationship, that is, one decimal number corresponds to one amino acid combination, where y is the ultra-high base number and $y \le C_n^k$.
3) Based on the encoding parameters, automatically generate an index for each independent binary source block to obtain a complete independent binary sequence. Specifically: using automatic numbering based on the encoding parameters, generate an index for each independent binary source block starting from 0 and convert it to binary form; then pad the index with leading zeros so that its number of binary bits is an integer multiple of p′, ensuring that the index can be mapped to amino acid combination codes.
4) Convert the independent binary sequences into polypeptide sequences according to the probability combination encoding mapping relationship, and synthesize and store the polypeptide sequences in pools. The conversion is specifically as follows: segment each independent binary sequence into groups of p′ bits; convert each segmented p′-bit binary sequence to a decimal number; map the decimal numbers to amino acid combination codes according to the probability combination mapping relationship; and obtain the complete polypeptide sequence. The polypeptide sequences are then synthesized and stored in pools. The steps for synthesizing polypeptide sequences by pooling are as follows: 4-1) the resin used as the substrate for polypeptide sequence synthesis is uniformly distributed into k polypeptide synthesis tubes; 4-2) one amino acid from the amino acid combination of the current site is added to each polypeptide synthesis tube for reaction, the k polypeptide synthesis tubes corresponding to the k amino acids in the amino acid combination of the current site; 4-3) after the reaction is complete, the solutions from the k polypeptide synthesis tubes are combined; 4-4) the combined solution from step 4-3) is evenly distributed into k new polypeptide synthesis tubes, and steps 4-2) to 4-4) are repeated to synthesize the next site, until the synthesis of the entire polypeptide sequence is completed.

2. The method according to claim 1, characterized in that, following step 4), the method further includes: sequencing the polypeptide sequence to verify whether the amino acid sequence of the polypeptide is consistent with the encoded standard amino acid sequence.

3. The method according to claim 2, characterized in that sequencing the polypeptide sequence to verify whether its amino acid sequence is consistent with the encoded standard amino acid sequence specifically includes: sequencing the polypeptide sequence by de novo sequencing using tandem mass spectrometry to obtain mass spectrometry data; and statistically analyzing the amino acid probability values at each site of the polypeptide sequence based on the mass spectrometry data to verify whether the amino acid sequence of the polypeptide sequence is consistent with the encoded standard amino acid sequence.

4. A polypeptide storage system for ultra-high base encoded information implementing the method of claim 1, characterized in that it includes: an acquisition module, used to acquire the encoding parameters and the binary code stream of the file to be encoded, the encoding parameters including the polypeptide chain length l, the probability combination bit number k, and the amino acids to be used and their total number n; a density calculation module, used to calculate the unit information density p based on the probability combination bit number k and the total number of amino acids n in the encoding parameters; a decomposition module, used to divide the binary bitstream into independent binary source blocks according to the encoding parameters; a mapping calculation module, used to establish a probabilistic combination coding mapping relationship based on the encoding parameters; an index module, used to automatically generate indexes of the independent binary source blocks starting from 0 based on the encoding parameters, convert them into binary form, and pad the binary number of each index with leading zeros so that its number of bits is a multiple of p′, ensuring that the index can be mapped to amino acid combination codes; a mapping module, used to convert the independent binary source blocks into polypeptide sequences according to the mapping relationship; and a synthesis module, used to synthesize and store the polypeptide sequences using a pooled synthesis method.

5. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the polypeptide storage method for ultra-high base encoded information as described in any one of claims 1-4.

6. An electronic device, characterized in that it includes a memory and a processor, wherein the memory stores a computer program, and the processor, when calling the computer program in the memory, implements the steps of the polypeptide storage method for ultra-high base encoded information as described in any one of claims 1-4.