Primer design method, system and computer storage medium based on minimum degeneracy
By employing a primer design method based on minimum degeneracy, the problem of imbalance between primer coverage and degeneracy in traditional pathogen detection is solved, achieving primer design with high coverage and low degeneracy, thereby improving the accuracy and efficiency of pathogen detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUGOBIOTECH BEIJING CO LTD
- Filing Date
- 2023-02-28
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, traditional pathogen detection methods cannot achieve a balance between primer coverage and degeneracy, resulting in a small detection range and non-specific amplification, which makes it difficult to effectively identify microorganisms, especially in metagenomic sequencing.
A primer design method based on minimum degeneracy is adopted. By dividing the nucleotide sequence into taxonomic units, selecting representative sequences to design a primer pair set, and selecting candidate primer pairs according to coverage, a primer pool is established. Considering the mismatch and coverage between primer pairs, a fault-tolerant design is adopted to balance coverage and degeneracy.
It achieves a balance between high coverage and low degeneracy in pathogen detection, improves the accuracy and efficiency of pathogen detection, reduces the occurrence of nonspecific amplification, and is suitable for the detection of a wide range of pathogens.
Abstract
Description
Technical Field
[0001] This application relates to the field of biomedical technology, and in particular to primer design methods, systems and computer storage media based on minimum degeneracy.
Background Technology
[0002] Pathogen infections pose a deadly threat to human health worldwide, and timely and accurate etiological diagnosis is crucial for clinical treatment and antibiotic use. Traditional diagnostic techniques include virus isolation, capsid culture, Gram staining, and antibody or rapid enzyme immunoassay. However, many microorganisms present in clinical samples cannot be cultured using standard techniques, and virus culture is very time-consuming. While antibody or rapid enzyme immunoassays are fast, their high specificity limits their detection range. Therefore, traditional techniques can only detect a small fraction of pathogens.
[0003] The advent of metagenomics and metatranscriptomics has enabled the exploration of the complex genomic composition of clinical samples. Metagenomic sequencing (mNGS), metatranscriptomic sequencing (mtNGS), and targeted sequencing (tNGS) have significantly improved the efficiency of pathogen identification in recent years and are becoming increasingly popular. However, mNGS and mtNGS face several obstacles in clinical application. First, obtaining positive pathogen reads requires millions of reads, because in mNGS or mtNGS of clinical samples over 99% of the reads typically derive from the human genome. Although a range of techniques exist to reduce human genome contamination by digesting and removing host DNA during DNA extraction, these procedures are relatively complex and can take several days. Second, nucleic acid contamination can be introduced at multiple stages from sample collection to sample processing. Third, the relatively high cost of mNGS or mtNGS is an obstacle to widespread clinical application in the short term. At the same time, the vast abundance and diversity of microbial genomes makes pathogen identification difficult with tNGS. Fortunately, tNGS combined with multiplex PCR can rapidly, accurately, and economically detect hundreds of known pathogenic microorganisms and their virulence and resistance genes, demonstrating strong application potential.
[0004] Primer design is crucial in tNGS. Traditional low-degeneracy primer design uses minimum degeneracy primer design with perfect matching (MD-DPD), and most design methods design primers only in one direction and calculate primer coverage, rather than primer pair coverage, which can lead to poor coverage. Given the genomic diversity of microorganisms, primer degeneracy will be very high in order to match as many genomes or genes as possible and obtain satisfactory coverage. However, excessively high degeneracy can lead to non-specific amplification. Therefore, using existing design methods to design primers for a broad spectrum of pathogens is inappropriate.
Summary of the Invention
[0005] This application provides a primer design method, system, and computer storage medium based on minimum degeneracy, in order to solve the problem that existing design methods in the prior art cannot achieve a balance between primer coverage and degeneracy.
[0006] In one aspect, embodiments of this application provide a primer design method based on minimum degeneracy, including:
[0007] Obtaining nucleotide sequences;
[0008] Dividing the nucleotide sequences into multiple taxonomic units;
[0009] Selecting a representative sequence from each taxonomic unit;
[0010] Designing a set of primer pairs for the representative sequence based on the base frequencies in the representative sequence;
[0011] Determining the coverage of primer pairs relative to other nucleotide sequences in the corresponding taxonomic unit, and selecting multiple primer pairs as candidate primer pairs based on the coverage of each primer pair;
[0012] Selecting multiple candidate primer pairs to create a primer pool.
[0013] In another aspect, embodiments of this application also provide a primer design system based on minimum degeneracy, including:
[0014] A sequence acquisition module, used to acquire nucleotide sequences;
[0015] A sequence classification module, used to divide the nucleotide sequences into multiple taxonomic units;
[0016] A sequence selection module, used to select a representative sequence from each taxonomic unit;
[0017] A primer design module, used to design a set of primer pairs for the representative sequence based on the base frequencies in the representative sequence;
[0018] A candidate primer selection module, used to determine the coverage of primer pairs relative to other nucleotide sequences in the corresponding taxonomic unit, and to select multiple primer pairs as candidate primer pairs based on the coverage of each primer pair;
[0019] A primer pool creation module, used to select multiple candidate primer pairs to create a primer pool.
[0020] In yet another aspect, embodiments of this application also provide a computer storage medium storing a plurality of computer instructions for causing a computer to execute the above-described method.
[0021] The primer design method, system, and computer storage medium based on minimum degeneracy in this application have the following advantages:
[0022] The primer design process considers primer pair mismatches and coverage, yielding primer pairs with extremely high coverage and low degeneracy. Furthermore, because a single primer cannot provide satisfactory coverage for some highly diverse viral genomes, the sequences are first classified and primer pairs are then designed for the sequences within each taxonomic unit.
Attached Figure Description
[0023] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a flowchart illustrating the primer design method based on minimum degeneracy provided in the embodiments of this application;
[0025] Figure 2 shows the total sequences captured by the primer pool provided in the embodiments of this application;
[0026] Figure 3 shows the proportion of sequences captured for each virus provided in the embodiments of this application.
Detailed Implementation
[0027] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0028] Figure 1 is a flowchart illustrating the primer design method based on minimum degeneracy provided in the embodiments of this application. The embodiments of this application provide a primer design method based on minimum degeneracy, including:
[0029] S100, obtain the nucleotide sequence.
[0030] For example, the obtained nucleotide sequence can be stored in a FASTA format file, which can be a CDS (Coding Sequence), gene, genome, or other type of sequence.
[0031] S110, divide the nucleotide sequences into multiple taxonomic units.
[0032] For example, the nucleotide sequences can be divided into multiple taxonomic units by consistency-based clustering, so that nucleotide sequences within the same taxonomic unit are highly consistent with one another.
[0033] In embodiments of this application, the nucleotide sequences are preprocessed before being divided into taxonomic units. This preprocessing specifically includes: replacing U with T in the nucleotide sequence file and replacing non-nucleotide characters with -; deleting nucleotide sequences shorter than a length threshold; and performing redundancy removal on the remaining nucleotide sequences. The length threshold can be the minimum length of the PCR product.
[0034] Furthermore, after the multiple taxonomic units are obtained, taxonomic units containing fewer nucleotide sequences than a set threshold are filtered out. The consistency of each filtered taxonomic unit is compared with the consistency of the other taxonomic units, and the filtered taxonomic unit is merged with another taxonomic unit based on the comparison results. For example, if the difference between the consistency of a filtered taxonomic unit and that of a certain taxonomic unit is less than a set difference threshold, the two taxonomic units can be merged.
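By way of a non-limiting illustration, the preprocessing of paragraph [0033] could be sketched in Python as follows. The function name `preprocess`, the `(name, sequence)` record layout, and the `min_length` parameter are hypothetical. Two assumptions are made that the application does not state: only A, C, G and T are treated as nucleotide characters (so ambiguity codes such as N also become -), and redundancy removal is reduced to dropping exact duplicates.

```python
import re


def preprocess(records, min_length):
    """Clean (name, sequence) records as outlined in paragraph [0033].

    Assumptions: only A/C/G/T count as nucleotide characters, and
    redundancy removal means dropping exact duplicate sequences.
    """
    seen = set()
    cleaned = []
    for name, seq in records:
        # Replace U with T, then mask every remaining non-ACGT character.
        seq = seq.upper().replace("U", "T")
        seq = re.sub(r"[^ACGT]", "-", seq)
        # Drop sequences shorter than the threshold (e.g. minimum PCR product length).
        if len(seq) < min_length:
            continue
        # Keep only the first occurrence of each distinct sequence.
        if seq in seen:
            continue
        seen.add(seq)
        cleaned.append((name, seq))
    return cleaned
```

A clustering-based redundancy filter could replace the exact-duplicate check without changing the rest of the sketch.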
[0035] S120, select a representative sequence from each taxonomic unit.
[0036] For example, each taxonomic unit contains a large number of nucleotide sequences. In this application, N (default 500) nucleotide sequences are randomly selected in each taxonomic unit, and then the Viterbi algorithm is used to select the most representative nucleotide sequence in the current taxonomic unit from these N nucleotide sequences. This nucleotide sequence is the representative sequence.
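The application does not give the details of the Viterbi-based selection, so the following Python sketch does not reproduce it. Instead it substitutes a plainly different and simpler technique, medoid selection over k-mer Jaccard distances, purely to illustrate the sample-then-select structure of step S120. The names `kmer_set`, `jaccard_distance` and `pick_representative`, and the parameters `k` and `seed`, are hypothetical.

```python
import random


def kmer_set(seq, k=4):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def pick_representative(sequences, n=500, k=4, seed=0):
    """Sample up to n sequences and return the medoid of the sample.

    Stand-in for the Viterbi-based selection named in the application:
    the medoid is the sampled sequence with the smallest total k-mer
    distance to all other sampled sequences.
    """
    rng = random.Random(seed)
    sample = list(sequences) if len(sequences) <= n else rng.sample(list(sequences), n)
    profiles = [kmer_set(s, k) for s in sample]

    def total_distance(i):
        return sum(jaccard_distance(profiles[i], p) for p in profiles)

    best = min(range(len(sample)), key=total_distance)
    return sample[best]
```

The pairwise comparison is quadratic in the sample size, which is tolerable at the default of 500 sequences per taxonomic unit.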
[0037] S130, based on the base frequencies in the representative sequence, design the corresponding primer pair set for the representative sequence.
[0038] For example, since a representative sequence is selected for each taxonomic unit, the number of representative sequences is relatively large. For each representative sequence, this application uses the Nearest Neighbor (NN) algorithm to calculate the frequency of each pair of adjacent bases in the representative sequence, such as AT, AC, AG, etc. Based on the calculated base frequencies, a corresponding primer pair set is designed for each representative sequence. This primer pair set is a collection containing multiple primer pairs.
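As a minimal sketch of the adjacent-base frequency calculation, the following Python function counts overlapping dinucleotides in a representative sequence. Reading "the frequency of each adjacent base" as overlapping dinucleotide frequencies is an interpretation, and the function name `dinucleotide_frequencies` is hypothetical. Pairs containing masked or ambiguous characters are skipped.

```python
from collections import Counter


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping pair of adjacent bases."""
    seq = seq.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    # Ignore pairs that contain anything other than A, C, G or T.
    counts = Counter(p for p in pairs if set(p) <= set("ACGT"))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

For the sequence ATAT the pairs are AT, TA and AT, so AT has frequency 2/3 and TA has frequency 1/3.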
[0039] In the embodiments of this application, after designing and obtaining the primer pair set, the primer pairs in the primer pair set are further screened. Specifically, the primer pairs can be screened based on PCR product length, GC content, hairpin, melting temperature, GC clamp, dimer detection, error coverage, etc.
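Some of the screening criteria listed above could be sketched in Python as follows. All thresholds are hypothetical defaults, not values from the application. The melting temperature uses the Wallace rule (2 °C per A or T, 4 °C per G or C), a rough estimate suited to short oligonucleotides, and the GC clamp is assumed to mean that the 3'-terminal base is G or C. Hairpin detection, dimer detection and the remaining criteria are omitted, and the primers are assumed to be non-degenerate.

```python
def gc_content(primer):
    """Fraction of G or C bases in the primer."""
    return sum(base in "GC" for base in primer) / len(primer)


def wallace_tm(primer):
    """Wallace-rule melting temperature estimate in degrees Celsius."""
    at = sum(base in "AT" for base in primer)
    gc = sum(base in "GC" for base in primer)
    return 2 * at + 4 * gc


def has_gc_clamp(primer):
    """Assumed definition: the 3'-terminal base is G or C."""
    return primer[-1] in "GC"


def passes_screen(forward, reverse, product_length,
                  gc_range=(0.4, 0.6), tm_range=(50, 65),
                  max_tm_diff=5, product_range=(100, 500)):
    """Return True if the primer pair passes every check in this sketch."""
    if not product_range[0] <= product_length <= product_range[1]:
        return False
    for primer in (forward, reverse):
        if not gc_range[0] <= gc_content(primer) <= gc_range[1]:
            return False
        if not tm_range[0] <= wallace_tm(primer) <= tm_range[1]:
            return False
        if not has_gc_clamp(primer):
            return False
    # The two primers should anneal at similar temperatures.
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff
```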
[0040] S140, determine the coverage of the primer pair set relative to all nucleotide sequences, and select multiple primer pairs as candidate primer pair sets based on the coverage of each primer pair set.
[0041] For example, although the primer pair set has undergone the above screening process, the number of primer pairs remaining in the primer pair set is still relatively large. For this large number of primer pairs, this application uses BWT (Burrows-Wheeler transform) alignment software to calculate the coverage of each primer pair in the primer pair set over the nucleotide sequences in all taxonomic units. The primer pairs in each primer pair set are then arranged in descending order of coverage, and the 5-10 primer pairs with the highest coverage are selected as candidate primer pairs. At this point, the primer pair set is called the candidate primer pair set.
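The coverage ranking could be illustrated with the following Python sketch. It does not use BWT alignment software; a brute-force mismatch scan is substituted so that the example stays self-contained, which is only practical for small inputs. The names `find_sites`, `pair_covers`, `coverage` and `top_candidates`, the default of one allowed mismatch, and the `max_product` limit are assumptions, and the primers are assumed to be non-degenerate.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(primer, seq, max_mismatches):
    """Start positions where the primer matches within the mismatch limit."""
    n = len(primer)
    return [i for i in range(len(seq) - n + 1)
            if hamming(primer, seq[i:i + n]) <= max_mismatches]


def pair_covers(forward, reverse, seq, max_mismatches=1, max_product=2000):
    """True if both primers bind in the right order and within max_product."""
    fwd_sites = find_sites(forward, seq, max_mismatches)
    rev_sites = find_sites(reverse_complement(reverse), seq, max_mismatches)
    return any(f < r and r + len(reverse) - f <= max_product
               for f in fwd_sites for r in rev_sites)


def coverage(pair, sequences, **kwargs):
    """Fraction of sequences amplifiable by the (forward, reverse) pair."""
    forward, reverse = pair
    hits = sum(pair_covers(forward, reverse, s, **kwargs) for s in sequences)
    return hits / len(sequences)


def top_candidates(pairs, sequences, k=5, **kwargs):
    """The k primer pairs with the highest coverage, best first."""
    return sorted(pairs, key=lambda p: coverage(p, sequences, **kwargs),
                  reverse=True)[:k]
```

Coverage here is computed per primer pair rather than per primer, in line with the criticism of single-direction coverage in the background section.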
[0042] After obtaining the candidate primer pair set, a loss function is used to filter candidate primer pairs with special structures. Specifically, the loss function of this application is as follows:
[0043]
[0044] Where L is the loss function, length is the length of the complementary region, GC content is the number of G or C bases in the complementary region, distance1 is the distance from the complementary region of primer 1 to the 3' end, and distance2 is the distance from the complementary region of primer 2 to the 3' end. Candidate primer pairs with a loss function below a certain threshold are retained, while other candidate primer pairs are deleted.
[0045] Furthermore, after the candidate primer pairs are filtered using the loss function, a specificity test is performed on the remaining candidate primer pairs. In this application, if any two candidate primer pairs in the candidate primer pair set simultaneously match the host, the matching involves no more than three mismatches, and the product length is within 2000, the candidate primer pair is considered to cause non-specific amplification and is excluded.
[0046] S150, select multiple candidate primer pairs from the candidate primer pair set to establish a primer pool.
[0047] For example, a greedy algorithm can be used to select candidate primer pairs that have passed the specificity test, and finally establish a primer pool.
[0048] Specifically, an empty primer pool is first established. Then, candidate primer pairs from each candidate primer pair set are added to the primer pool sequentially. A candidate primer pair from the first set can be added directly. Starting from the second set, interaction checks are performed: a candidate primer pair from the set being processed is added to the primer pool only if it does not interact with any candidate primer pair already in the pool. Once a candidate primer pair from a set has been added to the primer pool, the other candidate primer pairs in that set are not checked, and the process proceeds directly to the next set. Only when a candidate primer pair cannot be added to the primer pool because of an interaction does the check continue with the next candidate primer pair in the current set. This continues until a candidate primer pair from the set is added to the primer pool, or until all candidate primer pairs in the set have been rejected because of interactions.
[0049] Furthermore, if all candidate primer pairs in the current candidate primer pair set cannot be added to the primer pool due to interaction issues, the process returns to the previous candidate primer pair set and re-evaluates whether any candidate primer pairs from that set, excluding those already added to the primer pool (if any), can be added. If such a candidate primer pair exists, one candidate primer pair from the previous set is added to the primer pool, and the process moves to the next candidate primer pair set. This process is repeated until all candidate primer pairs have been evaluated, resulting in the final primer pool.
[0050] Furthermore, if no candidate primer pair is added to the primer pool from a set of candidate primer pairs, that set is filtered out. After establishing the primer pools, the pools are re-established based on the candidate primer pairs from the filtered sets. Using this method, the primer pool established from the unfiltered set of candidate primer pairs is designated as the first primer pool, and the primer pool established from the filtered set is designated as the second primer pool. This filtering and re-establishment process is repeated until every set of candidate primer pairs has at least one candidate primer pair added to its corresponding primer pool.
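The pool construction of paragraphs [0048] and [0050] could be sketched in Python as follows. This is a simplified illustration under stated assumptions: the backtracking step of paragraph [0049] is omitted, the interaction check is supplied by the caller as a hypothetical `interacts(pair_a, pair_b)` callable, and the names `build_pool` and `build_pools` are not taken from the application.

```python
def build_pool(candidate_sets, interacts):
    """Greedily take the first compatible pair from each candidate set.

    Returns the pool and the candidate sets that contributed no pair.
    """
    pool = []
    leftover = []
    for candidates in candidate_sets:
        for pair in candidates:
            # Accept the pair only if it is compatible with the whole pool.
            if all(not interacts(pair, chosen) for chosen in pool):
                pool.append(pair)
                break  # one pair per set; skip the rest of this set
        else:
            # Every pair in this set clashed with the pool.
            leftover.append(candidates)
    return pool, leftover


def build_pools(candidate_sets, interacts):
    """Repeat the greedy pass on the leftover sets until none remain."""
    pools = []
    remaining = list(candidate_sets)
    while remaining:
        pool, leftover = build_pool(remaining, interacts)
        if not pool:
            break  # only empty candidate sets are left
        pools.append(pool)
        remaining = leftover
    return pools
```

Because the first pair of the first non-empty set is always accepted into an empty pool, each pass places at least one set, so the loop terminates.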
[0051] Therefore, this application ultimately yields two primer pools, namely the first primer pool and the second primer pool. These two primer pools serve as two optimal solutions to the NP-complete problem, and these optimal solutions can be widely applied to the detection of broad-spectrum pathogens.
[0052] High-coverage primers are required for broad-spectrum pathogen targeting. Introducing degenerate bases during primer design can effectively increase primer coverage. However, high primer degeneracy can introduce nonsense primers (i.e., primers without the target sequence), potentially amplifying irrelevant sequences. Therefore, primer coverage and degeneracy are mutually constraining, a problem proven to be NP-complete in bioinformatics. Currently, broad-spectrum virus identification worldwide relies on high-coverage primers designed using the MD-DPD method. This application proposes a fault-tolerant minimum degeneracy primer pair design (MD-EDPD) method. This method can obtain high-coverage primers under fault-tolerant conditions without excessively high degeneracy, achieving a balance between coverage and degeneracy, and providing a new solution for broad-spectrum pathogen detection.
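To make the notion of degeneracy concrete, the following Python sketch computes the degeneracy of a primer written with standard IUPAC nucleotide codes, that is, the number of distinct non-degenerate primers it represents, and enumerates them. The function names `degeneracy` and `expand` are hypothetical.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    return prod(len(IUPAC[base]) for base in primer)


def expand(primer):
    """All non-degenerate sequences encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]
```

For example, the primer ARYN has degeneracy 1 × 2 × 2 × 4 = 16. Each additional degenerate position multiplies the number of primer variants in the reaction, which is why coverage gained through degeneracy comes at the cost of specificity.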
[0053] This application also provides a primer design system based on minimum degeneracy, the system comprising:
[0054] A sequence acquisition module, used to acquire nucleotide sequences;
[0055] A sequence classification module, used to divide the nucleotide sequences into multiple taxonomic units;
[0056] A sequence selection module, used to select a representative sequence from each taxonomic unit;
[0057] A primer design module, used to design a set of primer pairs for the representative sequence based on the base frequencies in the representative sequence;
[0058] A candidate primer selection module, used to determine the coverage of the primer pair set relative to all nucleotide sequences, and to select multiple primer pairs as candidate primer pair sets based on the coverage of each primer pair set;
[0059] A primer pool creation module, used to select multiple candidate primer pairs from the candidate primer pair sets to create a primer pool.
[0060] This application also provides a computer storage medium storing a plurality of computer instructions for causing a computer to execute the above-described method.
[0061] Experimental Description
[0062] To demonstrate the feasibility of the method proposed in this application, eight viruses were selected and the relevant sequences were downloaded from NCBI. Primer pairs were designed using the method described in this application, ultimately yielding 24 degenerate primer combinations (88-plex). Allowing for one mismatch, 91.6% (39424 / 43016) of the sequences were captured, and only one virus (rhinovirus) showed a coverage rate below 80%, as shown in Figures 2 and 3. The performance of the primer pool was then tested on a set of clinical samples, and all viruses were identified. Even when the clinical samples were mixed into one tube, the primer pool still detected the target viruses well.
[0063] In addition, this application compared its performance with that of degePrime on 1000 sequence pairs, and the coverage of the method in this application was much higher than that of degePrime.
[0064] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0065] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A primer design method based on minimum degeneracy, characterized in that it comprises: obtaining nucleotide sequences; dividing the nucleotide sequences into multiple taxonomic units using a consistency-based clustering method, such that the nucleotide sequences within the same taxonomic unit are consistent; selecting a representative sequence from each of the taxonomic units, by randomly selecting N nucleotide sequences from each taxonomic unit and using the Viterbi algorithm to select the most representative nucleotide sequence from these N nucleotide sequences, this nucleotide sequence being the representative sequence; designing a corresponding primer pair set for the representative sequence based on the base frequencies in the representative sequence, by calculating the frequency of occurrence of each pair of adjacent bases in the representative sequence using the nearest neighbor algorithm and designing a corresponding primer pair set for each representative sequence based on the calculated base frequencies; determining the coverage of the primer pair set relative to all nucleotide sequences, and selecting multiple primer pairs as candidate primer pair sets based on the coverage of each primer pair set; and selecting multiple candidate primer pairs from the candidate primer pair sets to establish a primer pool.
2. The primer design method based on minimum degeneracy according to claim 1, characterized in that after the nucleotide sequences are obtained, the nucleotide sequences are further preprocessed.
3. The primer design method based on minimum degeneracy according to claim 2, characterized in that the preprocessing of the nucleotide sequences includes: replacing "U" with "T" in the nucleotide sequence file, and replacing non-nucleotide characters with "-"; deleting nucleotide sequences whose length is less than the length threshold; and removing redundancy from the remaining nucleotide sequences.
4. The primer design method based on minimum degeneracy according to claim 1, characterized in that after the nucleotide sequences are divided into multiple taxonomic units, taxonomic units with a number of nucleotide sequences below a certain threshold are filtered out, the consistency of the filtered taxonomic units is compared with the consistency of other taxonomic units, and the filtered taxonomic units are merged with other taxonomic units based on the comparison results.
5. The primer design method based on minimum degeneracy according to claim 1, characterized in that after a corresponding primer pair set is designed for the representative sequence, the primer pairs in the designed set are further screened.
6. The primer design method based on minimum degeneracy according to claim 1, characterized in that after multiple candidate primer pair sets are obtained, a loss function is used to filter candidate primer pairs with special structures.
7. The primer design method based on minimum degeneracy according to claim 6, characterized in that after the candidate primer pairs are filtered using the loss function, a specificity test is performed on the filtered candidate primer pairs.
8. The primer design method based on minimum degeneracy according to claim 7, characterized in that a greedy algorithm is used to select candidate primer pairs that have passed the specificity test, and finally the primer pool is established.
9. A system applying the primer design method based on minimum degeneracy as described in claim 1, characterized in that it comprises: a sequence acquisition module, used to acquire nucleotide sequences; a sequence classification module, used to divide the nucleotide sequences into multiple taxonomic units; a sequence selection module, used to select a representative sequence from each of the taxonomic units; a primer design module, used to design a corresponding primer pair set for the representative sequence based on the base frequencies in the representative sequence; a candidate primer selection module, used to determine the coverage of the primer pair set relative to all nucleotide sequences, and to select multiple primer pairs as candidate primer pair sets based on the coverage of each primer pair set; and a primer pool creation module, used to select multiple candidate primer pairs from the candidate primer pair sets to create a primer pool.
10. A computer storage medium, characterized in that the computer storage medium stores a plurality of computer instructions, which are used to cause a computer to perform the method described in any one of claims 1-8.