A method for improving the quality of DNA code sets using dual match and mismatch constraints

By optimizing the DNA coding set with an improved arithmetic optimization algorithm and screening it with double matching and mismatch constraints, the method addresses the insufficient quantity and low quality of coding sets and enables more efficient DNA storage.

CN115662515B · Active · Publication Date: 2026-06-19 · DALIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: DALIAN UNIV
Filing Date: 2022-11-07
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing DNA storage technologies suffer from insufficient coding set quantity and low quality, resulting in high error rates during sequencing and failing to meet data storage requirements.

Method used

An improved arithmetic optimization algorithm is used, which combines double matching constraints and mismatch constraints, to optimize the DNA candidate set through iterative optimization and screening, thereby improving the quantity and quality of the coding set.

Benefits of technology

It significantly increased the number of DNA coding sets, improved the coding rate, effectively reduced the error rate during sequencing, and improved the stability and efficiency of DNA storage.

Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for improving the quality of DNA coding sets using double-matching constraints and mismatch constraints. The method includes: randomly constructing an initial DNA candidate set; iteratively optimizing the initial DNA candidate set using an improved arithmetic optimization algorithm; obtaining DNA sequences that meet traditional constraints; and further screening the DNA sequences that meet the constraints using double-matching constraints and mismatch constraints to obtain the DNA coding set. This invention can obtain a large number of high-quality DNA storage sets, effectively improving the DNA coding rate and reducing errors that occur during DNA sequence storage.

Description

Technical Field

[0001] This invention relates to the field of DNA storage technology, and more specifically to a DNA encoding method utilizing double matching constraints and mismatch constraints.

Background Technology

[0002] With the advancement of information technology, data is growing explosively, and traditional storage media can no longer meet current storage demands, so new storage media are urgently needed. DNA, with its large storage capacity and high durability, is one of the most promising media for solving the data storage problem. Designing a large number of coding sequences is crucial for DNA storage, because the size of the coding set directly affects the coding rate: the more sequences the set contains, the higher the coding rate. Furthermore, with a larger coding set, shorter sequences can achieve the same coding performance as longer ones.

[0003] On the other hand, low-quality DNA codes increase errors during sequencing, which is a crucial process in DNA storage and directly impacts storage efficiency. Therefore, designing high-quality DNA coding sets is of great significance.

Summary of the Invention

[0004] The purpose of this invention is to provide a DNA encoding method that utilizes double matching constraints and mismatch constraints, which can obtain a larger and higher quality DNA encoding set, reduce non-specific hybridization reactions, and effectively reduce the error rate in DNA storage.

[0005] To achieve the above objectives, the technical solution of this application is: a method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints, comprising:

[0006] Randomly construct an initial DNA candidate set;

[0007] The initial DNA candidate set is iteratively optimized using an improved arithmetic optimization algorithm;

[0008] Obtain DNA sequences that meet traditional constraints;

[0009] The DNA sequences that meet the constraints are further screened using double matching constraints and mismatch constraints to obtain the DNA coding set.

[0010] Furthermore, an improved arithmetic optimization algorithm is used to iteratively optimize the initial DNA candidate set. Specifically, the acceleration function MOA and the mathematical optimizer probability MOP of the arithmetic optimization algorithm are modified by using random perturbation parameters h and g generated by elementary functions to increase the number of DNA coding sets. The specific formula is as follows:

[0011]

[0012]

[0013] Where min is the minimum value of the acceleration function MOA, and max is the maximum value of the acceleration function MOA, which can be set to 0.2 and 1 respectively; t is the current iteration number of the algorithm, and T is the maximum iteration number of the algorithm; α is a sensitivity parameter, which can be set to 5.

[0014] Furthermore, the random perturbation parameters h and g are obtained as follows:

[0015] h=|m2×acos(m3)| (3)

[0016] g=|m1×acos(m3)| (4)

[0017] Where m1 and m2 are set to 1.2 and 0.65 respectively, and m3 is a random number between 0 and 1.
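As a minimal sketch of formulas (3) and (4), the perturbation parameters can be drawn as below. Formulas (1) and (2) are not reproduced in this text, so the sketch does not show how h and g enter MOA and MOP. The functions `baseline_moa` and `baseline_mop` are hypothetical names for the standard, unperturbed arithmetic optimization algorithm schedules, included only as a reference point, with the min, max, and α values quoted above.

```python
import math
import random

M1, M2 = 1.2, 0.65  # values given in the text for m1 and m2


def perturbation(rng=random):
    """Return (h, g) per formulas (3) and (4); m3 is uniform in [0, 1)."""
    m3 = rng.random()
    h = abs(M2 * math.acos(m3))
    g = abs(M1 * math.acos(m3))
    return h, g


def baseline_moa(t, T, lo=0.2, hi=1.0):
    """Standard (unperturbed) AOA accelerator: rises linearly from lo to hi."""
    return lo + t * (hi - lo) / T


def baseline_mop(t, T, alpha=5):
    """Standard (unperturbed) AOA optimizer probability: falls from 1 to 0."""
    return 1.0 - (t ** (1.0 / alpha)) / (T ** (1.0 / alpha))


# One draw with a seeded generator so the run is repeatable.
h, g = perturbation(random.Random(0))
```

Because acos maps [0, 1] onto [0, π/2], h stays within [0, 0.65 × π/2] and g within [0, 1.2 × π/2], and both share the same random draw m3.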

[0018] Furthermore, an adaptive weight w1 is added when the arithmetic optimization algorithm performs division and subtraction operations, and an adaptive weight w2 is added when it performs multiplication and addition operations; this increases the number of DNA coding sequences obtained in the DNA coding set. The DNA sequence position update formulas are as follows:

[0019]

[0020]

[0021] Wherein, parameter V_j = (UB_j - LB_j) × μ + LB_j; LB_j and UB_j represent the minimum and maximum values at the j-th position, respectively; μ is a parameter that adjusts the search for DNA sequences that meet the criteria, and can be set to 0.5; x_{i,j}(t+1) represents the j-th position of the i-th DNA sequence in this iteration; best(x_j) represents the j-th position of the optimal DNA sequence in the current iteration; ε is an integer; r2 and r3 are random numbers.
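The weighted update formulas are not reproduced in this text, so the following is only a sketch of the standard arithmetic optimization algorithm update built around V_j. The placement of w1 and w2 as multipliers on best(x_j) is an assumption made for illustration, as is treating `eps` as a small constant that guards against division by zero. The function name `aoa_update` and the random number r1, which selects between exploration and exploitation, are also illustrative.

```python
def aoa_update(best_j, moa, mop, ub_j, lb_j, r1, r2, r3,
               mu=0.5, eps=1e-12, w1=1.0, w2=1.0):
    """One position update of the standard AOA, with hypothetical weights.

    best_j     : j-th position of the best sequence so far
    moa, mop   : current accelerator and optimizer-probability values
    r1, r2, r3 : random numbers in [0, 1)
    w1, w2     : adaptive weights; multiplying best_j is an ASSUMPTION
    """
    v_j = (ub_j - lb_j) * mu + lb_j
    if r1 > moa:                                   # exploration phase
        if r2 > 0.5:
            return w1 * best_j / (mop + eps) * v_j  # division operator
        return w2 * best_j * mop * v_j              # multiplication operator
    if r3 > 0.5:                                    # exploitation phase
        return w1 * best_j - mop * v_j              # subtraction operator
    return w2 * best_j + mop * v_j                  # addition operator


# Example with bounds [0, 3], e.g. if positions index the four bases.
new_pos = aoa_update(2.0, moa=0.5, mop=0.5, ub_j=3, lb_j=0,
                     r1=0.1, r2=0.0, r3=0.9)
```

With w1 = w2 = 1 the function reduces to the unweighted baseline, which makes it easy to compare against any weighting scheme plugged in later.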

[0022] Furthermore, the adaptive weights w1 and w2 are obtained by the following formula:

[0023]

[0024]

[0025] Here, rand is a random number ranging from 0 to 1; S is the current DNA sequence.

[0026] Furthermore, the double matching constraint is specifically as follows: every two consecutive bases are taken as a combination, and it is determined whether, among all combinations, more than three are complementary or identical. If so, two different bases from one of these combinations are selected and exchanged. If the bases in that combination are complementary, a non-complementary base from outside the combination is selected as the replacement.

[0027] Furthermore, the formula for exchanging or replacing the bases is as follows:

[0028]

[0029] Where x is a DNA sequence, n represents the total number of bases in this sequence, x' is a subsequence of x, count is the number of subsequences in x that are identical or complementary to x', and x' = (x_j, x_{j+1}), j ∈ [1, n-1].
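The counting step behind the double matching constraint can be sketched as follows. Formula (9) is not reproduced in this text, so the sketch rests on several assumptions: "complementary" is read as the position-wise Watson-Crick complement of the two-base combination, each combination counts itself, and a sequence passes when no count exceeds three. The function names are illustrative, and the exchange or replacement repair step is not shown.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pair_match_counts(seq):
    """For each two-base window x' = (x_j, x_{j+1}), count the windows in
    seq that are identical to x' or equal to its position-wise complement."""
    pairs = [seq[j:j + 2] for j in range(len(seq) - 1)]
    counts = []
    for p in pairs:
        comp = COMPLEMENT[p[0]] + COMPLEMENT[p[1]]
        counts.append(sum(1 for q in pairs if q == p or q == comp))
    return counts


def satisfies_double_matching(seq, limit=3):
    """True when no window has more than `limit` identical/complementary matches."""
    return max(pair_match_counts(seq), default=0) <= limit


ok = satisfies_double_matching("ACGTACGT")      # repeats stay within the limit
bad = satisfies_double_matching("ATATATAT")     # AT/TA windows all match each other
```

In the second call every window is either AT or its complement TA, so each count equals the number of windows and the sequence is rejected.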

[0030] Furthermore, the mismatch constraint specifically refers to the probability of a mismatch occurring at the 3' end during DNA sequence amplification and the efficiency of mismatch initiation, represented by MPL. The larger the number, the higher the probability of a mismatch and the higher the efficiency of mismatch initiation. In sequence S, when the last base is G / C, the rating is 1; when the last base is T, the rating is 2; and when the last base is A, the rating is 3. The mismatch constraint level is set to 1.

[0031] Furthermore, the mismatch constraint is expressed by the following formula:

[0032] S = S_1 S_2 … S_{i-1} S_i

[0033]
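The rating described in [0030] can be written as a small lookup on the last base of S. This is a sketch only: the names `MPL_RATING`, `mpl`, and `satisfies_mismatch` are illustrative, and reading "the mismatch constraint level is set to 1" as "keep sequences whose rating does not exceed 1" is an assumption.

```python
# Rating of the 3'-end (last) base, as listed in the text.
MPL_RATING = {"G": 1, "C": 1, "T": 2, "A": 3}


def mpl(seq):
    """Return the mismatch rating of a sequence from its last base."""
    return MPL_RATING[seq[-1]]


def satisfies_mismatch(seq, level=1):
    """ASSUMPTION: a sequence passes when its rating is at most `level`."""
    return mpl(seq) <= level


candidates = ["ACGTGC", "ACGTGT", "ACGTGA", "TCAGTG"]
kept = [s for s in candidates if satisfies_mismatch(s)]
```

With the level set to 1, only sequences ending in G or C remain in `kept`.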

[0034] The advantages of the above technical solutions adopted in this invention compared with the prior art are as follows:

[0035] 1. The improved arithmetic optimization algorithm IOA adds random perturbations to important parameters in each iteration and uses a position update formula with added adaptive weights to obtain more candidate DNA sequences, which in turn helps to obtain the globally optimal candidate DNA sequence.

[0036] 2. By leveraging the superiority of the improved arithmetic optimization algorithm IOA and combining it with traditional constraint design of DNA coding sets, the number of DNA coding sets was greatly increased, the coding rate was improved, and the same coding performance was achieved with shorter sequences.

[0037] 3. Double matching constraints and mismatch constraints effectively solve the problems of self-complementary reactions in sequences and unreasonable base distribution at the 3' end of sequences, thus effectively improving the quality of DNA coding sets.

Description of the Drawings

[0038] Figure 1 is a flowchart of the method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints.

Detailed Implementation

[0039] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit the application; that is, the described embodiments are only a part of the embodiments of this application, and not all of them.

[0040] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0041] To demonstrate the effectiveness of the method proposed in this invention, the DNA coding set obtained through IOA was used for verification, employing double matching constraints and mismatch constraints.

[0042] Example 1

[0043] As shown in Figure 1, this embodiment provides a method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints. The steps are as follows:

[0044] Step 1: Obtain the DNA candidate set that satisfies the Hamming distance constraint;

[0045] Step 2: Add adaptive weights and elementary function perturbation strategies to the DNA candidate set and update the DNA candidate set;

[0046] Step 3: Obtain DNA sequences that meet the traditional constraints (GC content constraint, Hamming distance constraint, and no-runlength constraint, also rendered as the full or complete discontinuity constraint) from the updated DNA candidate set;

[0047] Step 4: By using double matching constraints and mismatch constraints, the DNA sequences are restricted and screened to obtain a DNA coding set with better performance;

[0048] Step 5: Determine whether the maximum number of iterations has been reached. If so, retain and output the DNA coding set that satisfies the combinatorial constraints; otherwise, return to Step 2.
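The traditional constraints named in Step 3 can be sketched as simple predicates plus a greedy selection. This is a minimal illustration, not the optimization procedure of the embodiment: the GC content constraint is assumed to require exactly n // 2 G or C bases, the no-runlength constraint is read as forbidding identical adjacent bases, and candidates are enumerated exhaustively rather than searched by the improved algorithm. All function names are illustrative.

```python
from itertools import product


def gc_ok(seq):
    """ASSUMPTION: GC content constraint means exactly n // 2 bases are G or C."""
    return sum(b in "GC" for b in seq) == len(seq) // 2


def no_runlength_ok(seq):
    """No two adjacent bases are identical."""
    return all(a != b for a, b in zip(seq, seq[1:]))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_code_set(candidates, d):
    """Keep each candidate that meets the GC and no-runlength constraints and
    is at Hamming distance >= d from every sequence already kept."""
    chosen = []
    for s in candidates:
        if gc_ok(s) and no_runlength_ok(s) and all(hamming(s, c) >= d for c in chosen):
            chosen.append(s)
    return chosen


# Small example: all length-4 sequences, minimum distance 3.
all_seqs = ["".join(p) for p in product("ACGT", repeat=4)]
codes = greedy_code_set(all_seqs, d=3)
```

The size of `codes` gives one lower bound on the size of a constrained set for n = 4 and d = 3; a better search order or optimizer can only match or improve on what this greedy pass finds.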

[0049] This invention proposes a method to improve the quality of DNA coding sets using double matching constraints and mismatch constraints. An improved arithmetic optimization algorithm searches the initial DNA candidate set, and DNA sequences are first obtained under the combination of traditional constraints, yielding a larger DNA storage set. This set is then screened using the double matching and mismatch constraints to obtain a higher-quality DNA coding set. The method was tested in MATLAB under Windows 10 on an Intel(R) 2.6 GHz CPU with 8.0 GB of memory. The experimental results show that the method of this example outperforms the other algorithms compared.

[0050] Method comparison:

[0051] To verify the superiority of the algorithm for DNA coding set design, the results obtained by IOA were compared with those obtained by the Altruistic algorithm and the NOL-HHO algorithm. A^{GC,NL}(n,d) represents the set of DNA sequences that satisfy the GC content constraint, the Hamming distance constraint, and the no-runlength (complete discontinuity) constraint. Here, n and d represent the sequence length and the Hamming distance, respectively. The comparison results are shown in Table 1. The best lower bound for each coding set is bolded.

[0052] Table 1. Lower bounds of A^{GC,NL}(n,d) obtained by different algorithms

[0053]

[0054]

[0055] Comparative analysis:

[0056] As shown in Table 1, the lower bounds of the DNA coding sets obtained using IOA are the largest in every case, which indicates that IOA can obtain large coding sets in DNA coding set design.

[0057] To evaluate the effectiveness of the double matching and mismatch constraints, and to demonstrate that the quality of DNA sequences constructed under the new combined constraints is improved, A^{GC,NL}(n,d) and A^{GC,NL,DP,MP}(n,d) were compared in terms of melting temperature variance and number of hairpin structures. A^{GC,NL,DP,MP}(n,d) represents the DNA coding set that satisfies the GC content constraint, Hamming distance constraint, no-runlength (full discontinuity) constraint, double matching constraint, and mismatch constraint. The comparison results are shown in Tables 2, 3, and 4. The best results are all bolded.

[0058] Table 2. Comparison of variances of melting temperatures

[0059]

[0060] Table 3. Comparison of the number of hairpin structures

[0061]

[0062]

[0063] Table 4. Comparison of the proportion of hairpin structures

[0064]

[0065] Comparative analysis:

[0066] The evaluation of hairpin structures and melting temperatures in the tables shows that DNA storage sets built under the double matching and mismatch constraints exhibit greater physical and thermodynamic stability. Sequences with fewer hairpin structures can effectively avoid non-specific hybridization, while stable melting temperatures make storage within the DNA strand more stable. This indicates that the double matching and mismatch constraints can yield high-quality coding sets in DNA coding set design.

[0067] In summary, the method proposed in this invention for improving the quality of DNA coding sets by utilizing double matching constraints and mismatch constraints can obtain a larger quantity and better quality of DNA coding sets.

[0068] The foregoing description of specific exemplary embodiments of the invention is for illustrative and explanatory purposes. These descriptions are not intended to limit the invention to the precise forms disclosed, and it will be apparent that many changes and variations can be made in accordance with the foregoing teachings. The exemplary embodiments were chosen and described in order to explain the specific principles of the invention and its practical application, thereby enabling those skilled in the art to implement and utilize various different exemplary embodiments of the invention, as well as various different choices and variations. The scope of the invention is intended to be defined by the claims and their equivalents.

Claims

1. A method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints, characterized in that it comprises: randomly constructing an initial DNA candidate set; iteratively optimizing the initial DNA candidate set using an improved arithmetic optimization algorithm; obtaining DNA sequences that meet traditional constraints; and further screening the DNA sequences that meet the constraints using double matching constraints and mismatch constraints to obtain the DNA coding set. The initial DNA candidate set is iteratively optimized using the improved arithmetic optimization algorithm, specifically: the values of the acceleration function MOA and the mathematical optimizer probability MOP of the arithmetic optimization algorithm are changed by random perturbation parameters h and g generated by elementary functions. The specific formulas are as follows: (1) (2) where min is the minimum value of the acceleration function MOA and max is the maximum value of the acceleration function MOA; t is the current iteration number of the algorithm and T is the maximum number of iterations of the algorithm; α is a sensitivity parameter. The double matching constraint is as follows: every two consecutive bases are taken as a combination, and it is determined whether more than three combinations are complementary or identical; if so, two different bases from one of these combinations are selected and exchanged; if the bases in that combination are complementary, a non-complementary base from outside the combination is selected as the replacement. The mismatch constraint specifically refers to the probability of a mismatch occurring at the 3' end during DNA sequence amplification and the efficiency of mismatch initiation, expressed as MPL; the larger the number, the higher the probability of a mismatch and the higher the efficiency of mismatch initiation. In sequence S, when the last base is G / C, the rating is 1; when the last base is T, the rating is 2; when the last base is A, the rating is 3; and the mismatch constraint level is set to 1.

2. The method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints according to claim 1, characterized in that the random perturbation parameters h and g are obtained as follows: h=|m2×acos(m3)| (3) g=|m1×acos(m3)| (4) where m1 and m2 are set to 1.2 and 0.65 respectively, and m3 is a random number ranging from 0 to 1.

3. The method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints according to claim 1, characterized in that an adaptive weight w1 is added when the arithmetic optimization algorithm executes the division and subtraction operators, and an adaptive weight w2 is added when it executes the multiplication and addition operators. The formulas for updating the position of the DNA sequence are as follows: (5) (6) where the parameter V_j = (UB_j - LB_j) × μ + LB_j; LB_j and UB_j represent the minimum and maximum values at the j-th position, respectively; μ is a parameter used to adjust the search for DNA sequences that meet the criteria; x_{i,j}(t+1) represents the j-th position of the i-th DNA sequence in this iteration; best(x_j) is the j-th position of the optimal DNA sequence in the current iteration; ε is an integer; r2 and r3 are random numbers.

4. The method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints according to claim 3, characterized in that the adaptive weights w1 and w2 are obtained from the following formulas: (7) (8) where rand is a random number ranging from 0 to 1, and S is the current DNA sequence.

5. The method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints according to claim 1, characterized in that the formula for exchanging or replacing the bases is as follows: (9) where x represents a DNA sequence, n represents the total number of bases in the sequence, x' is a subsequence of x, count is the number of subsequences in x that are identical or complementary to x', and x' = (x_j, x_{j+1}), j ∈ [1, n-1].

6. The method for improving the quality of DNA coding sets using double matching constraints and mismatch constraints according to claim 1, characterized in that the mismatch constraint is expressed by the following formula: (10).

Citation Information

Patent Citations

  • DNA storage code optimization method employing barnacle algorithm based on weights and mixed mutation strategy

    CN114067916A

  • Method for optimizing coding for deoxyribonucleic acid (DNA) storage based on dual-policy black widow optimization (BWO) algorithm

    GB202211537D0