A mutation site screening method based on amplicon deep sequencing and sequence counting
By using amplicon deep sequencing and sequence counting methods, the accuracy and efficiency issues of mutant screening in polyploid species have been solved, enabling high-throughput and low-cost screening of mutation sites.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUBEI FORESTRY SCI INST
- Filing Date
- 2025-06-10
- Publication Date
- 2026-06-30
AI Technical Summary
Existing mutant screening technologies rely on reference genomes, which makes accurate read alignment difficult in polyploid species and results in a high false positive rate. Furthermore, traditional threshold-setting methods have a false positive rate as high as 15%-20% in polyploids.
By employing amplicon deep sequencing and sequence counting in a closed-loop process of sample pooling, tag amplification, deep sequencing, frequency threshold screening, and targeted verification of individuals, experimental steps are simplified, detection throughput and sensitivity are improved, and the risk of cross-contamination is reduced.
It achieves high-throughput and high-efficiency mutation site localization, shortens the screening cycle and reduces experimental costs, and improves the accuracy and sensitivity of mutation detection.
Figure CN120683235B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of mutant screening technology, and in particular to a method for screening mutation sites based on amplicon deep sequencing and sequence counting.
Background Technology
[0002] Mutant screening technology is widely used in fields such as genetics, functional genomics, and crop breeding. Generally, mutant screening technology can be divided into forward and reverse techniques. Forward screening, also known as phenotypic screening, involves identifying phenotype-specific mutants through morphological observation and physiological and biochemical index determination. Therefore, it is often used for screening mutants with simple and easy-to-observe phenotypes, such as screening for dwarfing or leaf color variations in plants. Reverse screening, also known as genotyping, involves identifying mutants with specific genotypes by genotyping the genome or candidate genes. Reverse screening is more often used for screening mutants where phenotypic identification is difficult, or where the phenotype is easily affected by the environment or is difficult to detect, such as plant metabolic mutations and animal genetic diseases.
[0003] With the advent of high-throughput sequencing technology, especially the significant reduction in sequencing costs, whole-genome sequencing or target gene amplicon sequencing is increasingly used for mutant screening due to its high throughput, high sensitivity, and high accuracy. A typical experimental procedure is as follows:
[0004] 1) Individual genome resequencing (or target gene amplicon sequencing);
[0005] 2) Sequencing read quality control;
[0006] 3) Mapping of the quality-controlled reads back to the reference genome;
[0007] 4) Mutation detection.
[0008] Screening for mutants through either genome resequencing or amplicon sequencing therefore requires mapping the sequencing reads back to a reference genome before variants can be detected, which brings certain limitations. Firstly, the approach relies on a reference genome, which is unavailable for some species. Secondly, allele similarity in polyploid species often exceeds 95% (e.g., the FAD2 gene similarity in hexaploid kale exceeds 95%), so short sequencing reads frequently fail to align to the correct allele. Furthermore, traditional threshold-setting methods (such as a fixed frequency threshold of 0.01) result in a false positive rate as high as 15%-20% in polyploids.
Summary of the Invention
[0009] To overcome the shortcomings of existing technologies, the purpose of this invention is to provide a mutation site screening method based on amplicon deep sequencing and sequence counting, which has the advantages of high detection throughput, simple sequence analysis method, and sensitive and reliable mutation detection, and is suitable for screening gene mutants in different types of species.
[0010] To achieve the above objectives, the present invention provides the following solution:
[0011] A mutation site screening method based on amplicon deep sequencing and sequence counting includes:
[0012] Samples from multiple individuals to be screened are mixed in equal volumes, with 10–20 individuals per pool, to construct mixed sample pools, and genomic DNA is obtained for each mixed sample pool.
[0013] Primers were designed based on the target gene sequence and a tag sequence was added to the 5' end of the primers. Polymerase chain reaction amplification was performed using the genomic DNA as a template to obtain the target amplified fragment. The target amplified fragment was then subjected to deep sequencing using a high-throughput sequencing platform to obtain the raw sequencing data.
[0014] The raw sequencing data is cleaned, redundancy is removed, and screening is performed to obtain candidate sequences;
[0015] The candidate sequences whose expression frequencies fall within a set frequency range are aligned with the target gene sequence to identify whether mutation sites are present and to obtain the alignment results.
[0016] If the alignment results show the presence of mutation sites, then the individual samples in the corresponding mixed sample pool are amplified and sequenced to determine the specific individual samples containing the mutations.
[0017] Preferably, mixing samples from multiple individuals to be screened in equal volumes, with 10–20 individuals per pool, to construct mixed sample pools and obtaining genomic DNA for each mixed sample pool includes:
[0018] Multiple individuals to be screened are randomly selected from the sample population to be screened, and tissue materials of each individual to be screened are obtained.
[0019] The tissue material was lysed and purified to extract genomic DNA from each of the individuals to be screened.
[0020] According to the ratio of 10 to 20 individuals per group, weigh out equal amounts of genomic DNA solutions from each individual and mix them to construct several mixed sample pools;
[0021] The mixed sample pools are numbered and stored.
[0022] Preferably, primers are designed based on the target gene sequence, and a tag sequence is added to the 5' end of the primers. Polymerase chain reaction amplification is performed using the genomic DNA as a template to obtain the target amplified fragment. The target amplified fragment is then subjected to deep sequencing using a high-throughput sequencing platform to obtain raw sequencing data, including:
[0023] A pair of specific primers were designed based on the conserved region sequence of the target gene, and a tag sequence for distinguishing different mixed sample pools was added to the 5' end of each specific primer.
[0024] Using the genomic DNA in the mixed sample pool as a template, polymerase chain reaction (PCR) technology is used to amplify the target gene fragment to obtain the target amplified fragment; the amplification conditions are set according to the characteristics of the primers.
[0025] The target amplified fragment is purified, and the purified fragment is subjected to single-end or paired-end sequencing on the high-throughput sequencing platform to obtain the raw sequencing data.
[0026] Preferably, the characteristics of the primers include denaturation temperature, annealing temperature, and extension time.
[0027] Preferably, the raw sequencing data is cleaned, redundancy is removed, and screening operations are performed to obtain candidate sequences, including:
[0028] The raw sequencing data is subjected to quality control. Pre-set quality control tools are used to remove sequencing adapters and primer tag sequences, and chimera identification algorithms are used to remove reads containing chimeric structures to obtain effective reads after quality control.
[0029] A redundancy removal operation is performed on the valid reads to extract the set of non-repeating sequences, resulting in a set of unique sequences. The number of times each unique sequence appears in the sample pool is counted as the depth information of that unique sequence.
[0030] The expression frequency of each unique sequence in the mixed sample pool is calculated from the depth information of that unique sequence and the total depth information of all unique sequences;
[0031] Based on the ploidy of the individuals to be screened that constitute the mixed sample pool and the number of individuals mixed per pool (the pooling ratio), the lower limit and upper limit of frequency screening are calculated.
[0032] The unique sequence whose expression frequency falls within the frequency selection range is selected as the candidate sequence.
[0033] Preferably, the lower limit of the frequency screening is obtained by dividing 1 by the product of the ploidy and the number of individuals mixed per pool (see Formula 3 in the detailed description, which also includes a factor of 2).
[0034] Preferably, the upper limit of the frequency filtering is ten times the lower limit.
[0035] Preferably, aligning the candidate sequences whose expression frequencies fall within the set frequency range with the target gene sequence to identify whether mutation sites are present and to obtain the alignment results includes:
[0036] The set of candidate sequences and the corresponding target gene sequences are imported into a preset sequence alignment tool, and the target gene sequence is designated as the reference sequence.
[0037] Alignment parameters are set and sequence alignment is performed to generate alignment files of the candidate sequences against the reference sequence; the alignment parameters include the match score, the mismatch penalty, and the insertion/deletion (gap) penalty;
[0038] The alignment file is used to retrieve mismatches, insertions, or deletions at the alignment coordinates between the candidate sequence and the reference sequence. The alignment coordinates corresponding to the mismatches, insertions, or deletions are marked as difference sites, and the alignment results are recorded. The alignment results include the sequence position of the difference site, the difference type, and the depth information of the candidate sequence.
[0039] It is determined whether the expression frequency corresponding to the difference site is still within the frequency screening range. If so, the difference site is marked as a mutation site, and the alignment result containing a list of mutation sites is output.
[0040] Preferably, if the alignment results show the presence of mutation sites, then the individual samples in the corresponding mixed sample pool are amplified and sequenced to determine the specific individual samples containing the mutation, including:
[0041] For each individual to be screened in the mixed sample pool where the comparison results show the presence of mutation sites, samples are taken separately, and the genomic DNA of the individual to be screened is extracted individually.
[0042] Using the same primers carrying the aforementioned tag sequence, polymerase chain reaction is performed with the genomic DNA of each individual to be screened as the template to amplify the target gene fragment and obtain the amplification product.
[0043] The amplification products are purified and sequenced using first-generation (Sanger) sequencing to obtain the sequence of each individual to be screened.
[0044] The sequencing sequences of each individual to be screened are compared with the target gene reference sequence to check whether there are base differences consistent with the identified mutation sites;
[0045] Individuals identified as containing the mutation sites in the comparison results are marked as mutated individual samples, and an identification list of mutated individual samples is output.
[0046] According to specific embodiments provided by the present invention, the present invention discloses the following technical effects:
[0047] This invention integrates and simplifies the time-consuming single-plant library preparation and whole-genome alignment steps of traditional mutant screening through a closed-loop process of "pool construction - tag amplification - deep sequencing - frequency threshold screening - targeted verification of individuals": ① First, 10–20 individual samples are mixed in equal amounts per pool and genomic DNA is extracted uniformly, so that a large number of samples can be examined in a single sequencing run, significantly improving screening throughput; ② Targeted amplification is performed with tags embedded at the 5′ end of the primers, which ensures amplicon specificity and allows the different pools to be distinguished directly at the sequencing stage, effectively reducing the risk of cross-contamination; ③ Using upper and lower frequency limits, only unique sequences whose frequencies fall within the theoretical range are retained for alignment, significantly reducing redundant data, suppressing sequencing noise, and improving mutation detection sensitivity; ④ Pools with positive alignment results are then verified individual by individual, which identifies the true mutant individuals in one pass and avoids the waste of repeated sequencing. Overall, this method achieves accurate and efficient mutation site localization with the fewest experimental batches, shortening the screening cycle and reducing experimental costs.
Attached Figure Description
[0048] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0049] Figure 1 is a flowchart of the method provided in an embodiment of the present invention;
[0050] Figure 2 is a schematic diagram of the CaFAD2 amplicon reference sequence and primer positions provided in an embodiment of the present invention;
[0051] Figure 3 is a schematic diagram of mutant sites in sample pool P1 provided in an embodiment of the present invention;
[0052] Figure 4 is a schematic diagram of mutant sites in sample pool P6 provided in an embodiment of the present invention;
[0053] Figure 5 is a schematic diagram of the rdl gene mutation sites in sample pools Rdl-P1 and Rdl-P3 provided in an embodiment of the present invention.
Detailed Implementation
[0054] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0055] The purpose of this invention is to provide a mutation site screening method based on amplicon deep sequencing and sequence counting, which has the advantages of high detection throughput, simple sequence analysis method, and sensitive and reliable mutation detection, and is suitable for screening gene mutants in different types of species.
[0056] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0057] Figure 1 is a flowchart of the method provided in the embodiments of the present invention. As shown in Figure 1, this invention provides a method for screening mutation sites based on amplicon deep sequencing and sequence counting, comprising:
[0058] Step 100: Mix samples from multiple individuals to be screened in equal volumes, with 10–20 individuals per pool, to construct mixed sample pools, and obtain genomic DNA for each mixed sample pool;
[0059] Step 200: Design primers based on the target gene sequence and add tag sequences to the 5' end of the primers. Use genomic DNA as a template to perform polymerase chain reaction amplification to obtain the target amplified fragment. Then, use a high-throughput sequencing platform to perform deep sequencing on the target amplified fragment to obtain the raw sequencing data.
[0060] Step 300: Clean, remove redundancy, and filter the raw sequencing data to obtain candidate sequences;
[0061] Step 400: Align the candidate sequences whose expression levels are within a set frequency range with the target gene sequence to identify whether there are mutation sites and obtain the alignment results;
[0062] Step 500: If the alignment results show that there is a mutation site, then amplify and sequence the individual samples in the corresponding mixed sample pool to determine the specific individual sample containing the mutation.
[0063] This invention provides a high-throughput screening method for mutants of target genes based on amplicon deep sequencing combined with an analysis that does not require mapping reads back to a genome. The method first constructs pools (sample pools) from the sample population to increase throughput. Amplicons of the target gene are then obtained from each pool and deep sequenced. Redundancy is removed from the reads in each pool to obtain unique reads. The frequency of each unique read in the pool is calculated from its repetition count (depth). The theoretical frequency of mutant reads in the pool is used as a threshold for further screening. The small number of unique reads within the threshold range are compared with the target gene sequence. If any unique read contains a mutation site, the pool contains a mutant, and each individual in the pool can be further verified using first-generation sequencing to identify the specific mutant. Otherwise, the pool does not contain any mutant individuals.
[0064] Furthermore, the technical approach of this embodiment is as follows:
[0065] 1. Sample pooling
[0066] The sample population to be screened is pooled at 10-20 individuals per pool. Specifically, samples from 10-20 individuals are mixed in equal amounts to form a series of sample pools, and genomic DNA is then extracted for subsequent mutant screening.
[0067] 2. Amplicon sequencing
[0068] Specific amplification primers were designed based on the target gene sequence. A 6-base index sequence was added to the 5' end of each primer to distinguish different sample pools. PCR amplification was performed using sample pool DNA as a template. PCR amplification conditions were adjusted according to primer characteristics (such as annealing temperature and product length). The obtained amplicons were then subjected to deep sequencing on a next-generation sequencing platform, with a sequencing depth of at least 100× (100 reads) for each target gene allele.
[0069] The formula for calculating the number of alleles of the target gene is: Number of alleles = Sample ploidy * 2 * Pooling ratio (Formula 1). For example, after pooling a hexaploid species 10-fold, the number of alleles of the target gene in each pool is: 6 * 2 * 10 = 120. Therefore, the sequencing depth should be no less than 12,000 sequences.
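The arithmetic of Formula 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the figure of 100 reads per allele is taken from the ratio between the 12,000 sequences and the 120 alleles stated above.

```python
def allele_count(ploidy_k: int, pooling_ratio: int) -> int:
    """Formula 1: number of target-gene alleles in one pool."""
    return ploidy_k * 2 * pooling_ratio


def min_sequencing_depth(ploidy_k: int, pooling_ratio: int,
                         reads_per_allele: int = 100) -> int:
    """Minimum number of sequences per pool, assuming a fixed number of
    reads per allele (100 is implied by 12,000 / 120 in the text)."""
    return allele_count(ploidy_k, pooling_ratio) * reads_per_allele


# Hexaploid species, 10 individuals per pool
hexaploid_alleles = allele_count(6, 10)
hexaploid_min_depth = min_sequencing_depth(6, 10)
```

With a ploidy term of 6 and 10 individuals per pool, the two values correspond to the 120 alleles and 12,000 sequences given in the example above.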
[0070] 3. Amplicon sequence cleaning and redundancy removal
[0071] Sequencing data is split into the amplicon sequences of the different sample pools based on the tag sequences, and the amplicons of each pool are analyzed separately. The specific analysis includes the following steps: 1) Quality control: The raw reads are quality controlled using the fastp tool, removing sequencing adapters and tag sequences. Then, the uchime2_denovo tool in the VSEARCH software is used to remove reads containing chimeras, obtaining clean reads; 2) Unique sequence acquisition: The reads are dereplicated using the fastx_uniques tool in the VSEARCH software to obtain unique reads and the count of each unique read (parameters: --minuniquesize 1 --strand both --sizeout); 3) Unique depth calculation and filtering: A shell script is used to calculate the total depth of unique reads in each sample pool and the frequency of each unique read. The calculation formula is:
[0072] Frequency (P) = Nu / Nt (Formula 2)
[0073] Here, Nu represents the depth of each unique read, and Nt represents the total depth of all unique reads.
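As a rough illustration of Formula 2, the following Python sketch computes the frequency of each unique read from dereplicated FASTA headers. It is not the shell script used in the source. It assumes headers carry the depth as a `;size=N` annotation (the form written when `--sizeout` is used), and the function names and example records are hypothetical.

```python
import re
from typing import Dict, Iterable

SIZE_RE = re.compile(r";size=(\d+)")


def read_depths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Collect the depth (Nu) of each unique read from FASTA header lines."""
    depths: Dict[str, int] = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence lines are not needed for counting
        match = SIZE_RE.search(line)
        if match is None:
            raise ValueError(f"no size annotation in header: {line!r}")
        label = line[1:].split(";", 1)[0].strip()
        depths[label] = int(match.group(1))
    return depths


def frequencies(depths: Dict[str, int]) -> Dict[str, float]:
    """Formula 2: P = Nu / Nt, where Nt is the total depth of all unique reads."""
    total = sum(depths.values())
    if total == 0:
        raise ValueError("total depth is zero")
    return {label: nu / total for label, nu in depths.items()}


# Hypothetical dereplicated records (not data from the source)
example = [">uniq1;size=90", "ACGT", ">uniq2;size=8", "ACGA", ">uniq3;size=2", "ACGG"]
example_freqs = frequencies(read_depths(example))
```

In this toy input the total depth Nt is 100, so the three unique reads have frequencies 0.90, 0.08 and 0.02.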
[0074] 4. Filter unique reads
[0075] Set a frequency threshold; the formula for calculating the threshold is:
[0076] Pmin = 1 / (k × 2 × n) (Formula 3)
[0077] Here, k represents species ploidy and n represents the pooling ratio (the number of individuals mixed per pool).
[0078] Pmax = Pmin × 10 (Formula 4)
[0079] Unique reads are filtered, retaining only those with a frequency within the critical range for subsequent sequence alignment analysis, i.e., Pmin <= P <= Pmax. For example, in a 10-fold pool of a hexaploid species, Pmin = 1 / (6 × 2 × 10) ≈ 0.008 and Pmax ≈ 0.08, meaning only unique reads with a frequency in the range 0.008-0.08 are retained for subsequent analysis.
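Formulas 3 and 4 and the filtering rule Pmin <= P <= Pmax can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up frequencies. Note that the text rounds Pmin to 0.008, whereas the sketch keeps the unrounded value 1/120.

```python
from typing import Dict, Tuple


def frequency_window(ploidy_k: int, pooling_ratio: int,
                     factor: float = 10.0) -> Tuple[float, float]:
    """Formula 3 (Pmin = 1 / (k * 2 * n)) and Formula 4 (Pmax = Pmin * 10)."""
    p_min = 1.0 / (ploidy_k * 2 * pooling_ratio)
    return p_min, p_min * factor


def filter_by_frequency(freqs: Dict[str, float],
                        p_min: float, p_max: float) -> Dict[str, float]:
    """Keep only unique reads whose frequency lies within [Pmin, Pmax]."""
    return {label: p for label, p in freqs.items() if p_min <= p <= p_max}


p_min, p_max = frequency_window(6, 10)  # hexaploid, 10 individuals per pool

# Hypothetical frequencies: a dominant wild-type read, a candidate, and noise
toy_freqs = {"wild_type": 0.50, "candidate": 0.02, "noise": 0.001}
kept = filter_by_frequency(toy_freqs, p_min, p_max)
```

Only the read labelled `candidate` falls inside the window here: the dominant read exceeds Pmax and the low-frequency read falls below Pmin.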
[0080] 5. Unique read sequence alignment to screen candidate mutation sites
[0081] The unique reads were compared with the target gene reference sequence using MEGA11. If a mutation site was found in the unique read, it indicated that the sample pool contained a mutant with that mutation site. Then, Sanger sequencing was used to sequence the target gene of each individual in the pool to identify the specific mutant. If no mutation site was found in the unique read, it indicated that all samples in the pool were wild-type and no further identification was required.
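To make the comparison step concrete, the sketch below lists the positions at which an already aligned unique read differs from the reference. It is not a substitute for MEGA11: it assumes the two sequences have been aligned to equal length beforehand, and the function name and the short sequences are hypothetical.

```python
from typing import List, Tuple


def find_differences(reference: str, read: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, read base) for every column
    where two pre-aligned, equal-length sequences differ. A '-' character in
    either sequence marks an insertion or deletion at that column."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    for index, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != read_base:
            differences.append((index + 1, ref_base, read_base))
    return differences


# Hypothetical aligned sequences carrying a single C>T substitution
toy_reference = "ACGTCA"
toy_read = "ACGTTA"
toy_sites = find_differences(toy_reference, toy_read)
```

A pool whose retained unique reads all return an empty list would be treated as wild-type under the rule described above, while any non-empty list flags the pool for Sanger verification of its individuals.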
[0082] As an optional implementation method, the technical route for screening FAD2 gene mutants from the hexaploid species *Kalanchoe blossfeldiana* in this embodiment is as follows:
[0083] 1. Sample pooling
[0084] Sixty M2 plants were randomly selected from an EMS-induced mutant population of sea cabbage. After sowing, young leaves were collected and genomic DNA was extracted. DNA from every 10 plants was mixed in equal molar amounts to construct six 10-fold DNA pools, numbered P1-P6.
[0085] 2. Amplicon amplification and sequencing
[0086] Because sea cabbage is a hexaploid species, theoretically, a homozygous genome contains six FAD2 alleles. This invention, after comparing multiple FAD2 gene sequences, designed a pair of primers in the conserved region for PCR amplification. The primer-specific portion is...
[0087] 5'-GGGTCTGCCAAGGCTGTGTCC-3' (SEQ ID NO. 1);
[0088] 5'-CACGGTGCGTCCAAGAGGGT-3' (SEQ ID NO. 2); see Figure 2, in which the yellow portion indicates the primer positions. Each primer has a 6-base index sequence added to its 5' end for subsequent differentiation between different sample pools. PCR amplification was performed using DNA from the six sample pools as templates. The 20 μL PCR reaction system included 4 μL Phusion HF buffer (NEB, USA), 0.2 μL dNTPs (10 mM), 0.5 μL each of forward and reverse primers (10 μM), 0.1 μL Phusion DNA polymerase (2 U/μL), and 1 μL template DNA (100 ng). The PCR reaction conditions were: 98°C pre-denaturation for 30 s, followed by 30 cycles of 98°C for 10 s, 58°C for 10 s, and 72°C for 30 s, and a final extension at 72°C for 5 min. The amplified product, approximately 270 bp in length, was purified on an agarose gel and sequenced (100 bp single-end) on an Illumina HiSeq 2000 platform.
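The role of the 6-base index can be illustrated with a small demultiplexing sketch. The source does not list the index sequences or name a demultiplexing tool, so the indexes below are hypothetical placeholders, exact matching is an assumption, and the toy reads are built from the first bases of SEQ ID NO. 1 only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

INDEX_LENGTH = 6  # length of the index added to the 5' end of each primer


def demultiplex(reads: Iterable[str],
                index_to_pool: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each read to a pool by its first six bases (exact match only)
    and strip the index. Reads with an unknown index are discarded."""
    pools: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        index, insert = read[:INDEX_LENGTH], read[INDEX_LENGTH:]
        pool = index_to_pool.get(index)
        if pool is not None:
            pools[pool].append(insert)
    return dict(pools)


# Hypothetical index assignment (the real index sequences are not given)
toy_indexes = {"ACGTAC": "P1", "TGCATG": "P2"}
toy_reads = ["ACGTACGGGTCTGCC", "TGCATGGGGTCTGCC", "NNNNNNGGGTCTGCC"]
toy_pools = demultiplex(toy_reads, toy_indexes)
```

A production workflow would normally tolerate sequencing errors in the index and handle both read orientations; this sketch leaves both out for brevity.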
[0089] 3. Redundancy removal from amplicon sequences
[0090] Sequencing data was split into the amplicon sequences of the six sample pools based on the tag sequences, and the amplicons of each pool were analyzed separately. The specific analysis included the following steps: 1) Quality control: The raw reads were quality controlled using the fastp tool, removing sequencing adapters and tag sequences. Then, the uchime2_denovo tool in the VSEARCH software was used to remove reads containing chimeras, obtaining clean reads; 2) Unique sequence acquisition: The reads were dereplicated using the fastx_uniques tool in the VSEARCH software to obtain unique reads and the count of each unique read (parameters: --minuniquesize 1 --strand both --sizeout); 3) Unique depth calculation and filtering: The total depth of unique reads in each sample pool was calculated using a shell script, and the frequency of each unique read was then calculated according to Formula 2.
[0091] 4. Algorithm verification for the theoretical minimum frequency Pmin and maximum frequency Pmax
[0092] 4.1 Algorithm justification for the theoretical minimum frequency Pmin
[0093] According to Formula 3, Pmin = 1 / (k × 2 × n); for a 10-fold pool of a hexaploid species, Pmin = 1 / (6 × 2 × 10) ≈ 0.008
[0094] 4.2 Algorithm justification for the theoretical highest frequency Pmax
[0095] We assume that sequencing errors follow a Poisson distribution (common in high-throughput sequencing) and that the true mutation frequency should be significantly higher than the background error rate (typically, the Illumina sequencing error rate is ≤0.5%, and to minimize false positives, we assume a sequencing error rate of 1%).
[0096] Derivation of Pmax:
[0097] If the sequencing error rate is λ = 0.01, and the coverage is D = 100, the expected value of the erroneous reads is λD = 1. According to the Poisson distribution theory, the upper limit of the 99.9% confidence interval is:
[0098] Pnoise = (λ + 3 × √λ) / D = (0.01 + 3 × 0.1) / 100 = 0.0031
[0099] For a true mutation to be valid, its frequency must be significantly higher than the noise level, i.e., Pmax = C × Pmin ≫ Pnoise. For a hexaploid pool (Pmin = 0.008), when C = 10: Pmax = 0.08 ≫ 0.0031 (i.e., signal-to-noise ratio 0.08 / 0.0031 > 20); when C = 5: Pmax = 0.04 (i.e., signal-to-noise ratio 0.04 / 0.0031 ≈ 13). The standard signal-to-noise ratio requirement in next-generation sequencing is above 10, so C = 10 is optimal, at which point Pmax = 10 × Pmin.
[0100] 5. Filter unique reads
[0101] The critical frequency values for unique reads are calculated using Formulas 3 and 4. Unique reads are then filtered based on their frequency, retaining only those with frequencies within the critical range for subsequent sequence alignment analysis. In this embodiment, the calculated critical range is [0.008, 0.08]. The specific calculation process is as follows:
[0102] Pmin = 1 / (6 × 2 × 10) = 0.008
[0103] Pmax = 0.008 × 10 = 0.08
[0104] Unique reads with a frequency between 0.008 and 0.08 were retained for sequence comparison analysis. The depth of unique reads retained in each pool is shown in Table 1. Specifically, the minimum depth of unique reads retained after filtering in P1 is 2, and the maximum is 22; in P2, the minimum depth is 20, and the maximum is 208; in P3, the minimum depth is 90, and the maximum is 989; in P4, the minimum depth is 73, and the maximum is 736; in P5, the minimum depth is 21, and the maximum is 211; and in P6, the minimum depth is 62, and the maximum is 622.
[0105] Table 1. Filtering results of sequencing data from the six sample pools
[0106]
| Sample pool | Total depth of unique reads | Lower depth bound for filtering | Upper depth bound for filtering | Critical frequency range |
| P1 | 2332 | 2 | 23 | 0.008-0.08 |
| P2 | 20874 | 20 | 208 | 0.008-0.08 |
| P3 | 98998 | 90 | 988 | 0.008-0.08 |
| P4 | 73624 | 73 | 736 | 0.008-0.08 |
| P5 | 21115 | 21 | 211 | 0.008-0.08 |
| P6 | 62287 | 62 | 622 | 0.008-0.08 |
[0107] 6. Unique reads sequence alignment for screening mutation sites
[0108] The unique reads filtered from each sample pool were compared with the target gene sequences (CraFAD2-A, CraFAD2-B, CraFAD2-C1, CraFAD2-C2, CraFAD2-C3) of sea cabbage. Sample pools P1 and P6 each contained one gene mutation site, both located in gene CraFAD2-C3. The unique read carrying the mutation had a depth of 2 in P1 (Figure 3) and a depth of 118 in P6 (Figure 4). The mutation types are C>T and G>A, respectively, both of which are typical EMS-induced mutation types. Figure 3 shows the mutant site in sample pool P1 (mutation type C>T, unique read depth size=2). Figure 4 shows the mutant site in sample pool P6 (mutation type G>A, unique read depth size=118).
[0109] As another optional implementation, this embodiment screens for rdl gene mutations in Aedes aegypti mosquitoes from Cape Verde using targeted amplicon sequencing. The raw sequencing data used in this embodiment comes from the literature (Collins EL, Phelan JE, Hubner M, Spadar A, Campos M, Ward D, et al. (2022) A next generation targeted amplicon sequencing method to screen for insecticide resistance mutations in Aedes aegypti populations reveals a rdl mutation in mosquitoes from Cabo Verde. PLoS Negl Trop Dis 16(12): e0010935. https://doi.org/10.1371/journal.pntd.0010935), and the data can be downloaded from the ENA database (ERS12467832-ERS12467834).
[0110] The specific implementation plan is as follows:
[0111] 1. Experimental Materials
[0112] The original study examined 152 Aedes aegypti mosquito samples from Cape Verde. This protocol selected 60 Aedes aegypti samples from these for rdl gene locus amplicon analysis. Genomic DNA was extracted from each mosquito individually for PCR amplification.
[0113] 2. Amplicon amplification and sequencing
[0114] A pair of primers was designed based on the rdl gene reference sequence for PCR amplification:
[0115] 5'-CCAACCGATGTATCTTCTTC-3' (SEQ ID NO.3);
[0116] 5'-CTGGTTATTTGTACAAGTAGCA-3' (SEQ ID NO.4); the product size is 498 bp. PCR amplification was performed using DNA from each individual sample as the template. The 20 μL PCR reaction system included 4 μL Q5 high-fidelity PCR reagent (New England Biolabs, UK), 0.2 μL dNTPs (10 mM), 0.5 μL each of forward and reverse primers (10 μM), and 4 μL template DNA (100 ng). The PCR reaction conditions were: 98°C for 30 s, followed by 35 cycles of 98°C for 10 s, 60.4°C for 70 s, and 72°C for 90 s, and a final extension at 72°C for 5 min. PCR products were purified using AMPure XP magnetic beads (Beckman Coulter). All PCR products were then diluted and mixed at equal concentrations to prepare 25 μL pools with a concentration of 20 ng/μL. Each pool contained 20 amplicons (1 amplicon × 20 mosquitoes per pool). A second round of PCR was then performed to add Illumina adapters and tags, enabling parallel sequencing of multiple pools within the same Illumina sequencing run.
[0117] 3. Redundancy removal from amplicon sequences
[0118] Sequencing data was split into the amplicon sequences of the three sample pools based on the tag sequences, and the amplicons of each pool were analyzed separately. The specific analysis included the following steps: 1) Quality control: The raw reads were quality controlled using the fastp tool, removing sequencing adapters and tag sequences. Then, the uchime2_denovo tool in the VSEARCH software was used to remove reads containing chimeras, obtaining clean reads. The sequencing volume and theoretical depth for each sample pool are shown in Table 2; 2) Unique sequence acquisition: The reads were dereplicated using the fastx_uniques tool in the VSEARCH software to obtain unique reads and the count of each unique read (parameters: --minuniquesize 1 --strand both --sizeout); 3) Unique depth calculation and filtering: The total depth of unique reads in each sample pool was calculated using a shell script, and the frequency of each unique read was calculated according to Formula 2.
[0119] 4. Filter unique reads
[0120] The critical frequency values for unique reads are calculated using Formulas 3 and 4. Unique reads are then filtered, retaining only those with frequencies within the critical range for subsequent sequence alignment analysis. In this embodiment, the calculated critical range is [0.025, 0.25]. The specific calculation process is as follows:
[0121] Pmin = 1 / (1 × 2 × 20) = 0.025
[0122] Pmax = 0.025 × 10 = 0.25
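The two values above can be reproduced with a small sketch. The function and parameter names are hypothetical. The three factors in the denominator follow the numbers shown in this embodiment (1, 2, and 20 individuals per pool), and the upper bound is taken as ten times the lower bound.

```python
def frequency_bounds(ploidy_factor, pool_size, fold=10):
    """Return (p_min, p_max) for filtering unique reads by frequency.

    p_min = 1 / (ploidy_factor * 2 * pool_size)
    p_max = fold * p_min
    """
    p_min = 1 / (ploidy_factor * 2 * pool_size)
    p_max = fold * p_min
    return p_min, p_max


# Values from this embodiment: first factor 1, 20 individuals per pool.
p_min, p_max = frequency_bounds(ploidy_factor=1, pool_size=20)
print(round(p_min, 3), round(p_max, 3))
```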
[0123] Unique reads with a frequency between 0.025 and 0.25 were retained for sequence comparison analysis. The depths of the unique reads retained in each pool are shown in Table 2: the retained unique reads have a minimum depth of 16 and a maximum depth of 160 in Rdl-P1, a minimum of 4 and a maximum of 44 in Rdl-P2, and a minimum of 1 and a maximum of 4 in Rdl-P3.
[0124] Table 2. Filtering results of sequencing data from the three sample pools.
[0125]
| Sample pool | Total depth of unique reads | Lower depth bound of retained unique reads | Upper depth bound of retained unique reads | Critical range |
| --- | --- | --- | --- | --- |
| Rdl-P1 | 643 | 16 | 160 | 0.025-0.25 |
| Rdl-P2 | 179 | 4 | 44 | 0.025-0.25 |
| Rdl-P3 | 13 | 1 | 4 | 0.025-0.25 |
[0126] 5. Unique reads sequence alignment for screening mutation sites
[0127] The unique reads retained from each sample pool were compared with the reference sequence (rdl_reference) of the target gene rdl. The same mutation site, a G>T substitution, was found in two sample pools (Rdl-P1 and Rdl-P3). This mutation occurs at base 886 of the reference sequence and results in the amino acid substitution A296S; this mutation site has been confirmed in the literature (Collins EL, 2022). Specifically, the unique reads representing this mutation in Rdl-P1 had a depth of 117, while the unique reads representing this mutation in Rdl-P3 had depths of 4 and 1, respectively (Figure 5). Figure 5 shows the rdl gene mutation site (G>T) in sample pools Rdl-P1 and Rdl-P3.
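The comparison step and the link between base 886 and residue 296 can be sketched as below. This is an illustration under stated assumptions, not the alignment tool used in the embodiment: it handles only ungapped sequences of equal length (no insertions or deletions), and it assumes the reference numbering starts at the first base of the coding sequence, in frame. The function names and toy sequences are hypothetical.

```python
def point_differences(reference, candidate):
    """List (1-based position, reference base, candidate base) for each mismatch."""
    if len(reference) != len(candidate):
        raise ValueError("this sketch only handles ungapped sequences of equal length")
    diffs = []
    pairs = zip(reference.upper(), candidate.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        if ref_base != alt_base:
            diffs.append((position, ref_base, alt_base))
    return diffs


def codon_number(base_position):
    """1-based codon index containing a 1-based coding-sequence position."""
    return (base_position - 1) // 3 + 1


# Toy sequences for illustration: a G>T change at position 4 (GCT -> TCT, Ala -> Ser).
print(point_differences("ATGGCT", "ATGTCT"))
# Base 886 is the first base of codon 296, consistent with A296S.
print(codon_number(886))
```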
[0128] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0129] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A method for screening non-diagnostic mutation sites based on amplicon deep sequencing and sequence counting, characterized in that it includes: Multiple individuals to be screened are randomly selected from the sample population to be screened, and tissue material is obtained from each individual to be screened. The tissue material is lysed and purified to extract genomic DNA from each individual to be screened. Equal volumes of the genomic DNA solution of each individual are measured out and mixed in groups of ten to twenty individuals to construct several mixed sample pools. Each mixed sample pool is numbered and stored. Primers are designed based on the target gene sequence, and a tag sequence is added to the 5' end of the primers. Polymerase chain reaction amplification is performed using the genomic DNA as a template to obtain the target amplified fragment. The target amplified fragment is then subjected to deep sequencing on a high-throughput sequencing platform to obtain raw sequencing data. The raw sequencing data is cleaned, redundancy is removed, and screening is performed to obtain candidate sequences; The candidate sequences whose occurrence frequencies are within a set frequency range are compared with the target gene sequence to identify whether mutation sites are present, and comparison results are obtained. If the comparison results show that a mutation site is present, the individual samples in the corresponding mixed sample pool are amplified and sequenced to determine the specific individual samples containing the mutation. Cleaning the raw sequencing data, removing redundancy, and screening to obtain candidate sequences includes: The raw sequencing data is subjected to quality control. 
Preset quality control tools are used to remove sequencing adapters and primer tag sequences, and a chimera identification algorithm is used to remove reads containing chimeric structures, yielding valid reads after quality control. A redundancy removal operation is performed on the valid reads to extract the non-repeating sequences, yielding a set of unique sequences. The number of times each unique sequence appears in the sample pool is counted as the depth information of that unique sequence. The occurrence frequency of each unique sequence in the mixed sample pool is calculated from its depth information and the total depth information of all unique sequences; Based on the ploidy of the individuals to be screened that constitute the mixed sample pool and the number of individuals mixed in the pool, the lower limit and upper limit of frequency screening are calculated. The unique sequences whose occurrence frequency falls within the frequency screening range are selected as the candidate sequences; The lower limit of the frequency screening is one divided by the product of the ploidy number of the individuals to be screened, two, and the number of individuals mixed in the pool; The upper limit of the frequency screening is ten times the lower limit; Comparing the candidate sequences whose occurrence frequencies fall within the set frequency range with the target gene sequence to identify whether mutation sites are present, and obtaining the comparison results, includes: The set of candidate sequences and the corresponding target gene sequence are imported into a preset sequence alignment tool, and the target gene sequence is designated as the reference sequence. 
Alignment parameters are set and sequence alignment is performed to generate an alignment document of the candidate sequences against the reference sequence; the alignment parameters include match score, mismatch penalty, and insertion-deletion (gap) cost; The alignment document is searched for mismatches, insertions, or deletions at the alignment coordinates between each candidate sequence and the reference sequence, the alignment coordinates corresponding to the mismatches, insertions, or deletions are marked as difference sites, and the alignment results are recorded; the alignment results include: the sequence position of the difference site, the difference type, and the depth information of the candidate sequence; It is determined whether the occurrence frequency corresponding to the difference site is still within the frequency screening range. If so, the difference site is marked as a mutation site, and the alignment result containing a list of mutation sites is output.
2. The method for screening non-diagnostic mutation sites based on amplicon deep sequencing and sequence counting according to claim 1, characterized in that designing primers based on the target gene sequence, adding a tag sequence to the 5' end of the primers, performing polymerase chain reaction amplification using the genomic DNA as a template to obtain the target amplified fragment, and performing deep sequencing of the target amplified fragment on a high-throughput sequencing platform to obtain raw sequencing data includes: A pair of specific primers is designed based on the conserved region sequence of the target gene, and a tag sequence for distinguishing different mixed sample pools is added to the 5' end of each specific primer. Using the genomic DNA in the mixed sample pool as a template, the target gene fragment is amplified by polymerase chain reaction (PCR) to obtain the target amplified fragment; the amplification conditions are set according to the characteristics of the primers. The target amplified fragment is purified, and the purified fragment is subjected to single-end or paired-end sequencing on the high-throughput sequencing platform to obtain the raw sequencing data.
3. The method for screening non-diagnostic mutation sites based on amplicon deep sequencing and sequence counting according to claim 2, characterized in that, The characteristics of the primers include denaturation temperature, annealing temperature, and extension time.
4. The method for screening non-diagnostic mutation sites based on amplicon deep sequencing and sequence counting according to claim 1, characterized in that, if the comparison results show that a mutation site is present, amplifying and sequencing the individual samples in the corresponding mixed sample pool to determine the specific individual samples containing the mutation includes: For each individual to be screened in a mixed sample pool whose comparison results show the presence of mutation sites, a sample is taken separately, and the genomic DNA of that individual is extracted individually. Using the same primer system with the aforementioned tag sequence, polymerase chain reaction is performed with the genomic DNA of each individual to be screened as the template to amplify the target gene fragment and obtain the amplification product. The amplification products are purified and sequenced by first-generation (Sanger) sequencing to obtain the sequence of each individual to be screened. The sequence of each individual to be screened is compared with the target gene reference sequence to check whether it carries base differences consistent with the identified mutation sites; Individuals found in the comparison to contain the mutation sites are marked as mutated individual samples, and an identification list of mutated individual samples is output.