SNP site detection composition based on MLPA-NGS method for judicial authentication and application thereof
The SNP detection composition designed using the MLPA-NGS method solves the problems of high mutation rate, difficulty in controlling amplicon length, and insufficient detection quantity in existing SNP detection technologies, achieving high-throughput and accurate SNP detection, which is suitable for forensic individual identification and paternity testing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGYIN JIANHUI BIOTECHNOLOGY CO LTD
- Filing Date
- 2022-07-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing SNP detection technologies in forensic identification suffer from problems such as high mutation rate, long PCR amplicon length making amplification difficult, limited number of fluorescent groups, and insufficient detection quantity, making it difficult to meet the demand for high-throughput parallel detection.
A high-polymorphism SNP detection system for the Chinese population was designed using the MLPA-NGS method. SNP probe and primer compositions were used, and probe hybridization, ligation and PCR amplification were performed by MLPA-NGS technology. High-throughput SNP detection was then performed by Illumina next-generation sequencing.
It enables the detection of degraded samples with a length of about 50bp, reduces the risk of non-specific amplification, improves the flexibility of SNP site selection and detection throughput, and is suitable for individual identification and paternity testing, with a detection accuracy of 99.99%.
Figure CN115786455B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of forensic genetics technology, and in particular to a SNP site detection composition based on the MLPA-NGS method for forensic identification and its application.
Background Technology
[0002] PCR-STR multiplex amplification based on capillary electrophoresis (CE) is currently the main technique for forensic identification. Short tandem repeats (STRs), also known as microsatellite DNA, are a type of DNA polymorphic locus widely distributed in the human genome. They consist of a core sequence of 2–6 base pairs arranged in tandem repeats. STR loci are generally 100–300 bp in length. They exhibit high polymorphism due to differences in DNA fragment length or sequence between individuals, and are inherited following Mendelian co-dominant patterns during gene transmission. Due to their short fragment length, high amplification efficiency, and generally accurate typing, they are widely used in forensic individual identification and paternity testing. Currently, most forensic DNA databases revolve around STR loci. However, STRs also have certain drawbacks, including: the mutation rate of STR loci is too high, which sometimes causes problems for paternity identification; the PCR amplicon is long, which is not easy to amplify in the detection of degraded samples as templates; the number of STR loci is limited, making it difficult to identify complex kinship; and the number of fluorescent groups available in capillary electrophoresis technology is currently limited, which cannot achieve parallel detection of a large number of STR loci.
[0003] SNP (Single Nucleotide Polymorphism) refers to DNA sequence polymorphism caused by a single nucleotide variation at the genomic level. It is the most common type of heritable variation in humans, accounting for over 90% of all known polymorphisms. SNPs are widespread in the human genome and can be caused by a single base transition or transversion, or by base insertion or deletion. Compared to STRs, which have a spontaneous mutation rate of 10^-3 to 10^-5, SNPs have a relatively low spontaneous mutation rate (10^-8). If PCR is used, the amplification product of a single SNP site can be kept below 200 bp, which is beneficial for typing degraded samples; a large number of SNPs are biallelic, which also makes the analysis of typing results simple and easy to automate.
[0004] Before the maturity of NGS (Next Generation Sequencing), also known as high-throughput sequencing technology, the main technical platforms used for SNP genotyping included: Mini sequencing (the core of which is single base extension) and mass spectrometry analysis. However, these platforms could only perform parallel detection of dozens of genetic markers, which was insufficient to meet the quantitative requirements of SNP forensic identification.
[0005] After high-throughput sequencing technology matured, it became possible to detect up to several thousand SNPs in parallel, which has the following advantages: (1) saving test sample and detection time; (2) parallel detection of multiple samples; (3) designing primers to be as short as possible, improving the success rate of typing degraded samples. For this reason, Thermo Fisher Scientific in the United States launched two commercial SNP detection kits on the Ion Torrent PGM™ sequencing platform. One is the "Precision ID Identity Panel" for individual identification, which contains 124 SNP loci and is mainly for European populations; the other is the "Precision ID Ancestry Panel" for ancestral information analysis, which is not suitable for individual identification and kinship identification. Illumina in the United States launched the ForenSeq DNA Signature kit, which can detect STR and SNP genetic markers at the same time. It contains 95 SNP loci for individual identification, 24 loci for phenotypic identification, and 56 ancestral SNP loci. The "Detection Kit and Method for SNP Loci Based on High-Throughput Sequencing" developed by Li Chengtong, Zhang Suhua, Bian Yingnan, Liu Xiling, and others from the Institute of Forensic Science and Technology, Ministry of Justice, China, includes 273 SNP loci (234 located on autosomes, 9 on the Y chromosome, and 30 on the X chromosome). It can be used for triad paternity testing, duo paternity testing, grandparent-grandchild testing, sibling testing, and individual identification in forensic identification. The fragment library is concentrated below 200 bp, making it suitable for degraded forensic samples. The selected loci are targeted at the Chinese population, greatly improving the utilization value of SNP forensic genetic testing in the Chinese population.
Wu Liangjun's 2019 master's thesis, "Construction and Preliminary Exploration of a Targeted Capture and High-Throughput Sequencing System for 1245 SNPs and its Forensic Application," describes a high-throughput sequencing method for detecting multiple SNPs based on molecular inversion probe (MIP) technology. This technology can target more than 1,000 SNP sites, but only slightly more than half of the SNPs are detected in each sample, and the results vary greatly from sample to sample and from test to test, making it unsuitable for forensic identification.
[0006] However, the method of using multiplex PCR to amplify SNP sites and then performing high-throughput sequencing also has some drawbacks: (1) Since the amplification system contains a large number of primers with different sequences, non-specific amplification between primers is difficult to avoid, and the primer design method to reduce non-specific amplification is extremely complicated; the primer design to reduce non-specific amplification will inevitably reduce the flexibility of SNP site selection; (2) For the amplification of SNP sites, although the product length can be lower than 200bp, it cannot be too low, which has limited advantages for the detection of degraded samples.
[0007] Therefore, given the various shortcomings of existing SNP detection methods, it is still necessary to develop additional multiplex SNP detection methods.
Summary of the Invention
[0008] To overcome at least one problem in existing technologies, this invention provides a SNP detection composition based on the MLPA-NGS method for forensic identification and its application. It designs a high-polymorphism SNP detection system based on the MLPA-NGS method, targeting the Chinese population and covering the entire human genome, to meet individual identification and paternity testing needs, providing more technical detection methods for resolving complex and difficult cases. Compared with the library construction method of ultra-multiplex PCR, the probe design of MLPA-NGS technology basically does not need to consider non-specific amplification between primers, and since a pair of probes covers approximately 50 bp of template, it can be used to detect degraded samples of approximately 50 bp in length.
[0009] To achieve the above objectives, the present invention adopts the following technical solution:
[0010] The first aspect of this invention is to provide a SNP site detection composition for forensic identification based on the MLPA-NGS method, comprising an SNP probe composition and a primer composition; the SNP probe composition consists of SNP probe pairs designed for 66 SNP sites on autosomes, 51 SNP sites on the X chromosome, and 25 SNP sites on the Y chromosome; the primer composition consists of universal primers with index sequences for amplifying the ligated probes, each probe pair comprising a left probe and a right probe, the 5' end of the left probe and the 3' end of the right probe each containing a universal sequence, which serves as the binding sequence during amplification by the universal primer; wherein the sequences of the SNP probe pairs are shown in SEQ ID NO.1 to SEQ ID NO.284; the sequences of the universal primers are shown in SEQ ID NO.285 to SEQ ID NO.286, the universal sequence at the 5' end of the left probe is shown in SEQ ID NO.287, and the universal sequence at the 3' end of the right probe is shown in SEQ ID NO.288. Specifically, the overall sequence of the probes involved in the reaction is as follows: Left probe: 5'-universal sequence-template DNA binding sequence-3', Right probe: 5'-template DNA binding sequence-universal sequence-3'.
[0011] Specifically, each SNP locus has two genotypes, wild type and mutant type, and the SNP probe composition gives the SNP genotype once detection is complete; the index sequence is used to distinguish different samples; in each probe pair, a phosphate group is added to the 5' end of the right probe so that the two probes can be ligated.
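The probe layout described above can be sketched in a few lines. This is only an illustration: the universal and binding sequences below are invented placeholders (the real ones are SEQ ID NO.1 to SEQ ID NO.288), and the function names `build_probe_pair` and `ligate` are hypothetical.

```python
# Placeholder sequences for illustration only; not taken from the sequence listing.
UNIVERSAL_LEFT = "ACACTCTTTCCC"    # hypothetical stand-in for SEQ ID NO.287
UNIVERSAL_RIGHT = "AGATCGGAAGAG"   # hypothetical stand-in for SEQ ID NO.288

def build_probe_pair(left_binding, right_binding):
    """Assemble a left and a right probe from their template-binding sequences."""
    left_probe = UNIVERSAL_LEFT + left_binding      # 5'-universal-binding-3'
    right_probe = right_binding + UNIVERSAL_RIGHT   # 5'-(phosphate)binding-universal-3'
    return left_probe, right_probe

def ligate(left_probe, right_probe):
    """Join the 3' end of the left probe to the 5'-phosphorylated end of the right probe."""
    return left_probe + right_probe

# Two invented 25-base binding regions, so the pair covers 50 bases of template.
left, right = build_probe_pair("ACGTTGCAAGCTTAGGCTAACCGTA",
                               "TTGACCGGATATCGCTAGGCATCGA")
product = ligate(left, right)
```

The ligated product carries a universal sequence at each end, which is why a single pair of universal primers can amplify every ligated probe pair in the same reaction.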
[0012] Furthermore, in the above-mentioned detection composition, three SNP loci are selected on each autosome (based on GRCh38 / hg38), located in the beginning, middle, and end regions respectively, for a total of 66 loci. As a result, any two selected SNPs on the same chromosome are more than 1000 kb apart, so linkage analysis is not required when used for paternity testing. The SNPs selected on the X chromosome and Y chromosome are distributed in a nearly equidistant manner.
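The spacing requirement can be checked mechanically. The following is a minimal sketch, assuming the SNP positions are available as (chromosome, GRCh38 position) pairs and reading the spacing threshold as 1,000 kb; the coordinates are invented for illustration and the function name `min_spacing_ok` is hypothetical.

```python
from collections import defaultdict

MIN_SPACING_BP = 1_000_000  # assumption: the threshold is read as 1,000 kb

def min_spacing_ok(snps, min_bp=MIN_SPACING_BP):
    """Return True if every pair of SNPs on the same chromosome is more than min_bp apart."""
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()
        # After sorting, only neighbouring positions need to be compared.
        for a, b in zip(positions, positions[1:]):
            if b - a <= min_bp:
                return False
    return True

# Hypothetical coordinates, not taken from Table 1.
example = [("chr1", 2_500_000), ("chr1", 120_000_000), ("chr1", 245_000_000),
           ("chr2", 3_000_000), ("chr2", 3_400_000)]
chr1_only_ok = min_spacing_ok(example[:3])   # three well-separated loci
all_ok = min_spacing_ok(example)             # the two chr2 loci are only 400 kb apart
```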
[0013] Specifically, the probe pair sequences designed for 66 SNP sites on autosomes are shown in SEQ ID NO.1 to SEQ ID NO.132, the probe pair sequences designed for 51 SNP sites on the X chromosome are shown in SEQ ID NO.133 to SEQ ID NO.234, and the probe pair sequences designed for 25 SNP sites on the Y chromosome are shown in SEQ ID NO.235 to SEQ ID NO.284.
[0014] It is understood that in the selection and design of the above-mentioned SNP probes, if a new probe is designed by changing the SNP site, and it does not have a fundamental impact on the analysis of forensic identification, it should not be regarded as an essential change to the probes described in this invention.
[0015] Furthermore, in the above detection composition, each SNP probe pair has three characteristic sequences for forensic identification; each probe pair includes a left probe and a right probe, and the characteristic sequences are base sequences of predetermined lengths extracted from the left side of the left probe binding region with template DNA, the connection position between the left probe and the right probe, and the right side of the right probe binding region with template DNA; wherein, the sequence length of the characteristic sequence is 10 bases, and the characteristic sequences are separated by at least one base (e.g., 2 to 35 bases, specifically 3, 4, 5... 30 bases, etc.).
[0016] Specifically, when using the above-described detection composition for paternity testing, in order to analyze the number of reads for each SNP allele in the FastQ file, it is necessary to first design the characteristic sequences of the corresponding probes as described above. The characteristic sequence of each SNP probe pair is 10 bases in length. Each SNP probe pair can detect two alleles of that SNP. Each allele contains three characteristic sequences. The two sets of probe characteristic sequences for the two alleles differ only in the middle specific sequence by one base, which corresponds to the wild type and mutant type, respectively. The above constitutes the characteristic sequences for all detection sites. If a probe containing a degenerate base is considered as two sets of probes, then any set of characteristic sequences corresponds one-to-one with the corresponding probe.
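As an illustration of how the three characteristic sequences of a probe pair could be derived, here is a minimal sketch. It assumes that the SNP base is the last (3') base of the left probe binding region, which the text does not state explicitly, and it uses invented sequences; `characteristic_sequences` is a hypothetical helper name.

```python
FEATURE_LEN = 10  # characteristic sequence length given in the text

def characteristic_sequences(left_binding, right_binding, feature_len=FEATURE_LEN):
    """Return the (left, junction, right) characteristic sequences of one probe pair."""
    target = left_binding + right_binding
    junction = len(left_binding)
    start = junction - feature_len // 2          # middle window straddles the ligation site
    if start <= feature_len or start + feature_len >= len(target) - feature_len:
        raise ValueError("binding regions too short to keep the windows separated")
    first = target[:feature_len]
    middle = target[start:start + feature_len]
    last = target[-feature_len:]
    return first, middle, last

# Invented sequences; the allele base is appended at the 3' end of the left region.
left_stem = "ACGTTGCAAGCTTAGGCTAACCGT"
right_binding = "TTGACCGGATATCGCTAGGCATCGA"
wild = characteristic_sequences(left_stem + "A", right_binding)
mutant = characteristic_sequences(left_stem + "G", right_binding)
```

With these inputs the two alleles share the outer characteristic sequences and differ by a single base in the middle one, which matches the property described in the paragraph above.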
[0017] It is understandable that when designing characteristic sequences, the number, length, position, etc. of the sequences can be appropriately adjusted. Provided that it does not affect the analysis of amplification products, such adjustments should not be considered as essential changes to the characteristic sequences described in this invention.
[0018] A second aspect of the present invention is to provide a SNP site detection kit for forensic identification based on the MLPA-NGS method, comprising the SNP site detection composition as described in any of the first aspects of the present invention.
[0019] Furthermore, in the above-mentioned detection kit, the SNP probe composition is prepared into a probe working solution, wherein the concentration of each probe in the probe working solution is 0.2-20 fmol / μL. Specifically, appropriate amounts of all SNP probes are mixed to prepare a mixed solution with a concentration of 0.2-20 fmol / μL per probe. The concentration is optimized based on the test results to determine the optimal probe working solution, which is 2 fmol / μL.
[0020] Furthermore, in the above-mentioned detection kit, the concentration of each universal primer in the primer composition is 2-200 pMol. Specifically, a concentration of 10-50 pMol is preferred, and a concentration of 20 pMol is more preferred. Each primer is prepared separately without mixing. The upstream and downstream primers are combined differently only when amplifying the probe ligation products of different samples to obtain products with different indices that can distinguish between samples.
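Pairing upstream and downstream indexed primers differently for each sample can be sketched as follows. This is an illustrative assumption about bookkeeping only: the primer names (`U1`, `D1`, and so on), the sample IDs, and the function `assign_index_pairs` are hypothetical and do not come from the patent.

```python
from itertools import product

def assign_index_pairs(samples, upstream, downstream):
    """Give each sample a unique (upstream, downstream) indexed-primer combination."""
    pairs = list(product(upstream, downstream))
    if len(samples) > len(pairs):
        raise ValueError("not enough primer combinations for the number of samples")
    return dict(zip(samples, pairs))

# Hypothetical primer and sample names.
assignment = assign_index_pairs(
    samples=["S1", "S2", "S3", "S4", "S5"],
    upstream=["U1", "U2", "U3"],
    downstream=["D1", "D2"],
)
```

Three upstream and two downstream primers give six distinct combinations, so a small primer set can label a larger number of samples in one sequencing run.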
[0021] Furthermore, the detection kit also includes at least one of MLPA buffer, ligase, ligase buffer, PCR buffer, dNTP, and PCR enzyme.
[0022] Furthermore, the quantities of each component in the detection kit include: 5 μL of 50-250 ng DNA sample, 1.5 μL of MLPA buffer, 1.5 μL of probe working solution; 6 μL of ligase buffer, 1 μL of ligase; 5 μL of PCR buffer, 4 μL of dNTPs, 0.25 μL of PCR enzyme, and 1 μL each of upstream and downstream universal primers.
[0023] It is understood that appropriate changes may be made to the reagents used, their concentrations, and dosages in the above-mentioned test kits, and these changes are not considered essential alterations to the test kits themselves.
[0024] Based on the probe and primer sequences derived from Illumina next-generation sequencing adapter sequences, amplification itself is also a library preparation process before Illumina sequencing. Understandably, adjustments can be made to use other suitable high-throughput sequencing platforms, such as Roche / 454 sequencing, ABI SOLiD sequencing, Ion Torrent sequencing, and CG sequencing.
[0025] The sequence information involved in the above kit is shown in the table below:
[0026] Table 1 – Information on Probes, Primers, and Characteristic Sequences
[0035] A third aspect of the present invention is to provide a method of using an SNP site detection kit as described in any of the second aspects of the present invention, comprising the steps of: denaturing a DNA sample; hybridizing an SNP probe composition with a DNA sample; ligating the hybridization probes using a ligase and a ligase buffer; performing PCR amplification of the probe ligation product with a primer composition; and sequencing the PCR amplification product to obtain sequencing results.
[0036] Furthermore, the specific steps for using the above-mentioned test kit include:
[0037] Step S1, DNA denaturation and probe hybridization: Denature the DNA sample; mix MLPA buffer and probe working solution thoroughly and perform hybridization reaction. The reaction program is as follows: 95℃ for 2 min, 65℃ to 55℃, incubate for 1 hour for each degree drop, and then maintain at 54℃ for 3-10 hours to obtain the hybridization product.
[0038] Step S2: Prepare a ligase main solution containing ligase buffer, add the ligase to the ligase main solution and mix well, heat at 54°C for 1 minute, add to the hybridization product at a constant temperature of 54°C and mix well, continue incubation for 25 minutes, heat at 98°C for 5 minutes, and cool to 20°C and pause to achieve the ligation of the hybridization probe and obtain the ligation product.
[0039] Step S3: Perform PCR amplification reaction with the ligation product and PCR reaction solution. The PCR reaction solution includes PCR reaction buffer, dNTPs, universal upstream and downstream primers with indexes, and Taq enzyme. The reaction conditions are: 95℃ for 30s, 60℃ for 30s, 72℃ for 60s, 35 cycles; incubate at 72℃ for 20min, and finally incubate at 15℃ to obtain the PCR amplification product.
[0040] Step S4: Take an appropriate amount of sample from each PCR amplification product, mix them evenly, and send them to an NGS sequencer for sequencing to obtain sequencing results (specifically, a fastQ file).
[0041] In one specific implementation scheme, the above-mentioned detection kit can be used as follows: (1) On the first day, DNA denaturation and probe hybridization are performed: add 5 μL of DNA sample (50-250 ng) to a PCR tube, denature at 98°C for 5 minutes, and cool to 25°C; mix 1.5 μL of MLPA buffer (from MRC-Holland) with 1.5 μL of probe working solution, add to the sample tube, and mix thoroughly; (2) continue the thermal cycling program: 95°C for 2 minutes, then 65°C to 55°C, incubating for 1 hour for each degree of temperature drop, and then maintain at 54°C for 3-10 hours; (3) on the second day, prepare the ligase-65 main solution: each reaction contains 25 μL dH2O + 3 μL ligase buffer B + 3 μL ligase buffer A, to which 1 μL of ligase-65 enzyme is added and mixed evenly by gentle pipetting (buffers A and B and ligase-65 are all from MRC-Holland); heat the mixture in a PCR instrument at 54°C for 1 minute, add it to the PCR tube that is incubating at 54°C, mix well, and continue incubation for 25 minutes; (4) heat the above reaction at 98°C for 5 minutes, cool to 20°C and pause, and remove the PCR tube; (5) the PCR enzyme used in PCR amplification is HS Taq enzyme from Takara; the 50 μL PCR reaction solution includes the following components: 5 μL of reaction buffer, 4 μL of dNTPs, 1 μL each of the indexed universal upstream and downstream primers (20 pMol), 10 μL of ligation product, 0.25 μL of enzyme, and water to make up to 50 μL; the reaction conditions are: 35 cycles of 95°C for 30 s, 60°C for 30 s, and 72°C for 60 s; then incubate at 72°C for 20 min, and finally hold at 15°C; (6) take an appropriate amount of sample from each PCR amplification product, mix them evenly, freeze and store them, and send them to an Illumina NGS sequencer for sequencing to obtain the sequencing fastQ file.
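The volume arithmetic behind the ligase main solution and the 50 μL PCR reaction can be written out as a short worked example. The component volumes follow the kit composition given in this description (4 μL of dNTPs, 1 μL of each primer); the dictionary layout and the helper name `water_to_volume` are illustrative assumptions.

```python
# Per-reaction volumes in microlitres, as listed in the description.
PCR_MIX_UL = {
    "reaction buffer": 5.0,
    "dNTPs": 4.0,
    "upstream primer": 1.0,
    "downstream primer": 1.0,
    "ligation product": 10.0,
    "HS Taq enzyme": 0.25,
}
LIGASE_MIX_UL = {
    "dH2O": 25.0,
    "ligase buffer A": 3.0,
    "ligase buffer B": 3.0,
    "ligase-65": 1.0,
}
FINAL_PCR_VOLUME_UL = 50.0

def water_to_volume(components, final_volume):
    """Volume of water needed to bring the listed components up to final_volume."""
    used = sum(components.values())
    if used > final_volume:
        raise ValueError("components already exceed the final volume")
    return final_volume - used

pcr_water_ul = water_to_volume(PCR_MIX_UL, FINAL_PCR_VOLUME_UL)   # 50 - 21.25
ligase_mix_total_ul = sum(LIGASE_MIX_UL.values())                  # 25 + 3 + 3 + 1
```

Under these figures each PCR reaction needs 28.75 μL of water, and each ligation step adds 32 μL of ligase main solution to the hybridization product.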
[0042] It is understood that appropriate adjustments can be made to the reagents, temperature, incubation time, sequencing instruments, etc., in the above-described method of use. These adjustments should not be considered as essential changes to the method of use of the above-described test kit, provided that they do not affect the detection and analysis.
[0043] Furthermore, in the above-described method of use, each probe pair has a characteristic sequence for forensic identification, and the design of the relevant characteristic sequences and their sequence information are detailed in the aforementioned table.
[0044] Furthermore, the above-described method of use also includes a step of analyzing the results based on the sequencing results, performing at least one of the following analyses: analyzing the number of probe reads, determining SNP genotyping, judging the sequencing quality of the sample, judging the parentage of the core family lineage, and judging the parental origin of the X chromosome and each segment of the X chromosome.
[0045] Furthermore, in the above method of use, when used to analyze the number of probe reads, it includes the following steps: taking the combination of three characteristic sequences of each pair of probes as the text to be searched, taking each read in the fastQ file as the search object, and using the findall function of regular expressions in Python as the search function to count the number of reads containing each text to be searched; wherein, for each SNP, if the sum of the number of reads of the two alleles is less than 20, it is considered to be of unqualified quality and will not be analyzed; through the quality control SNP, for each sample, the wild-type reads are divided by the total reads of the SNP of that sample to obtain the genotyping value of the SNP of that sample, and the value range of the genotyping value is [0,1].
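The read-counting step can be sketched with Python's `re` module, which the paragraph above names. This is a minimal sketch rather than the patented implementation: joining the three characteristic sequences with `.+` is an assumption about how the "combination" is formed, the sequences and the in-memory FASTQ are synthetic, and the function names are hypothetical.

```python
import io
import re

MIN_READS = 20  # from the text: fewer than 20 reads in total fails quality control

def fastq_sequences(handle):
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def count_reads(reads, features):
    """Count reads that contain the three characteristic sequences in order."""
    first, middle, last = features
    pattern = re.compile(f"{first}.+{middle}.+{last}")  # assumed joining rule
    return sum(1 for read in reads if pattern.findall(read))

def genotyping_value(wild_reads, mutant_reads, min_reads=MIN_READS):
    """Wild-type reads divided by total reads, or None if the SNP fails quality control."""
    total = wild_reads + mutant_reads
    if total < min_reads:
        return None
    return wild_reads / total

# Synthetic example: 15 wild-type reads and 10 mutant reads for one SNP.
wild = ("ACGTTGCAAG", "CCGTATTGAC", "GGCATCGACT")
mutant = ("ACGTTGCAAG", "CCGTGTTGAC", "GGCATCGACT")

def make_read(features):
    return "AAAA" + features[0] + "CC" + features[1] + "GG" + features[2] + "TTTT"

records = [make_read(wild)] * 15 + [make_read(mutant)] * 10
fastq_text = "".join(f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n"
                     for i, seq in enumerate(records))

reads = list(fastq_sequences(io.StringIO(fastq_text)))
wild_count = count_reads(reads, wild)
mutant_count = count_reads(reads, mutant)
value = genotyping_value(wild_count, mutant_count)
```

Here the SNP passes quality control with 25 reads and gets a genotyping value of 15 / 25 = 0.6; a SNP with fewer than 20 reads in total would return `None` and be excluded from analysis.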
[0046] Furthermore, in the above-described method of use, when used to determine SNP genotyping, it includes the steps of: testing a population sample of healthy controls using the kit; for each SNP genotyping value, plotting a scatter plot of the genotyping values of each SNP genotype in the population sample; calculating the boundary points for distinguishing each SNP genotype; and genotyping the corresponding SNP of the sample to be tested based on the boundary points.
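One way to turn population genotyping values into boundary points is sketched below. The text does not say how the boundary points are calculated, so taking the midpoint between neighbouring clusters is purely an assumption made for illustration; the population values and function names are invented.

```python
def boundary_points(mutant_hom, heterozygous, wild_hom):
    """Midpoints between neighbouring genotype clusters (assumed rule, not from the text)."""
    low = (max(mutant_hom) + min(heterozygous)) / 2
    high = (max(heterozygous) + min(wild_hom)) / 2
    return low, high

def call_genotype(value, low, high):
    """Genotype a sample from its genotyping value (wild-type reads / total reads)."""
    if value < low:
        return "mutant homozygote"
    if value > high:
        return "wild-type homozygote"
    return "heterozygote"

# Invented genotyping values for one SNP in a control population.
low, high = boundary_points(
    mutant_hom=[0.01, 0.03, 0.02],
    heterozygous=[0.45, 0.52, 0.49],
    wild_hom=[0.97, 0.99, 0.98],
)
calls = [call_genotype(v, low, high) for v in (0.02, 0.50, 0.98)]
```

Because the genotyping value is the wild-type read fraction, values near 1 correspond to wild-type homozygotes, values near 0 to mutant homozygotes, and values near 0.5 to heterozygotes.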
[0047] Furthermore, in the above-described method of use, when used to determine the sequencing quality of a sample, it includes the following steps: after determining the SNP genotyping, the population genotyping value of the homozygous wild-type or homozygous mutant genotype of the same SNP across multiple normal samples can be expressed as the mean of the genotyping value plus or minus the standard deviation. The population genotyping value serves as a measurement characteristic of the SNP and is used to determine the detection quality of the sample; wherein, the larger the mean, the worse the sequencing quality.
[0048] It is understood that this invention is not limited to SNP genotyping of forensic identification test kits or their detection methods designed based on the MLPA-NGS principle. Other kits or self-developed methods designed based on the MLPA-NGS principle, which follow the same methods to establish a judgment framework for SNP analysis and to judge sample sequencing quality in order to solve the SNP genotyping problem, are not considered to be essential changes to the above-mentioned analysis methods.
[0049] Furthermore, in the above-described method of use, when used to determine the parent-child relationship of a core family, the steps include: after obtaining the genotyping results of the autosomal SNPs of the core family, confirming or denying the parent-child relationship of the core family based on the population frequency of each allele of each SNP and the analysis rules of the cumulative paternity index (CPI); wherein, each core family includes a father, mother, and child, and when the SNP loci identified after testing all conform to the family segregation pattern and the cumulative paternity index is greater than or equal to 10,000, the parent-child relationship of the core family is confirmed. The confirmation of the parent-child relationship of the core family is a prerequisite for determining the parental origin of the proband's X chromosome.
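The decision rule above can be illustrated with a short sketch. It uses the standard likelihood-ratio form of the paternity index for a trio (probability of the child's genotype given the mother and the alleged father, divided by the probability given the mother and a random man), ignores mutation, and assumes this is the rule the text refers to. The genotypes, allele frequencies, and function names are invented for illustration.

```python
from math import prod

CPI_THRESHOLD = 10_000  # threshold stated in the text

def transmit(genotype, allele):
    """Probability that a parent with this two-letter genotype passes on the allele."""
    return genotype.count(allele) / 2

def paternity_index(child, mother, father, freq):
    """Trio paternity index at one biallelic SNP, ignoring mutation."""
    a, b = child
    if a == b:
        x = transmit(mother, a) * transmit(father, a)
        y = transmit(mother, a) * freq[a]
    else:
        x = (transmit(mother, a) * transmit(father, b)
             + transmit(mother, b) * transmit(father, a))
        y = transmit(mother, a) * freq[b] + transmit(mother, b) * freq[a]
    return x / y if y > 0 else 0.0   # 0 marks a locus that breaks segregation

def confirm_trio(loci, threshold=CPI_THRESHOLD):
    """Return (confirmed, CPI) for a list of (child, mother, father, freq) loci."""
    indices = [paternity_index(*locus) for locus in loci]
    consistent = all(pi > 0 for pi in indices)
    cpi = prod(indices)
    return consistent and cpi >= threshold, cpi

# Invented locus: child AT, mother AA, alleged father TT, both alleles at frequency 0.5.
freq = {"A": 0.5, "T": 0.5}
informative = ("AT", "AA", "TT", freq)
confirmed, cpi = confirm_trio([informative] * 14)
```

Each such locus contributes a paternity index of 2, so 14 of them give a CPI of 2^14 = 16,384, which passes the threshold, while 13 would give 8,192 and fall short.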
[0050] It is understood that this invention is not limited to the confirmation of core family paternity in forensic identification kits designed based on the MLPA-NGS principle. Any other kits or self-developed methods designed based on the MLPA-NGS principle that utilize SNP typing to address the issue of confirming or denying core family paternity are not considered a fundamental change to the aforementioned analytical methods.
[0051] Furthermore, in the above-described method of use, when used to determine the parental origin of the X chromosome and its segments, the steps include: after confirming the parent-child relationship of the core family, for a certain X-SNP of the child, if the parents are hemizygous and homozygous respectively and are different from each other (e.g., the parents are AA and TT respectively), without considering SNP mutations, it can be determined that the child's two alleles can only come from the father and mother respectively, and the father and mother can only provide their own alleles at this SNP position; and after detecting the genotyping value of the child's X-SNP, based on the correspondence between the genotyping value and the dosage value, the relative dosage of the child's two alleles can be calculated, and the proportion of the X chromosome from the father and the mother can be determined (specifically, its approximate proportion).
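The informative-locus rule and the dosage calculation can be sketched as follows. The text does not give the exact correspondence between genotyping value and dosage, so this sketch assumes the genotyping value (wild-type read fraction) can be used directly as the wild-type allele fraction; the function names and example values are hypothetical.

```python
def is_informative(father_allele, mother_genotype):
    """True if the mother is homozygous and differs from the father's single X allele."""
    return (mother_genotype[0] == mother_genotype[1]
            and mother_genotype[0] != father_allele)

def paternal_fraction(child_value, father_allele, wild_allele):
    """Share of the child's signal at this X-SNP attributed to the paternal allele.

    Assumes the genotyping value equals the wild-type allele fraction.
    """
    return child_value if father_allele == wild_allele else 1.0 - child_value

# Invented example: father carries A, mother is TT, A is the wild-type allele.
informative = is_informative("A", "TT")
daughter_share = paternal_fraction(0.5, father_allele="A", wild_allele="A")
son_share = paternal_fraction(0.0, father_allele="A", wild_allele="A")
```

In this illustration a daughter with one paternal and one maternal X gives a paternal share of about one half, while a son, whose single X comes from the mother, gives a paternal share of zero.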
[0052] It is understood that this invention is not limited to forensic identification test kits designed based on the MLPA-NGS principle. Any other kits or self-developed methods designed based on the MLPA-NGS principle that use the above-described methods of this invention to determine the parental origin of the X chromosome and its segments are not considered to be a substantial change to the above-described analytical methods.
[0053] It is understood that this invention is not limited to kits designed based on the MLPA-NGS principle or self-developed methods for analyzing the parental origin of the X chromosome and its segments. In order to determine the parental origin of other chromosomes, any use of the above-mentioned methods of this invention is not considered as an essential change to the above-mentioned analytical methods.
[0054] Furthermore, in the above-mentioned method of use, the detection of SNP sites on the Y chromosome can not only confirm the sex of the proband, but also assist in the identification of kinship between fathers and sons, grandparents and grandchildren, uncles and nephews, and brothers.
[0055] The fourth aspect of the present invention is to provide an application of the SNP site detection composition as described in any of the first aspects of the present invention, or the SNP site detection kit as described in any of the second aspects of the present invention, specifically: using the kit to test the genomic DNA sample of the subject to be tested for MLPA-NGS detection.
[0056] Furthermore, in the above applications, at least one of the following results is obtained by MLPA-NGS detection: probe read count, SNP genotyping, sequencing quality of the sample, core family paternity, and parental origin of the X chromosome and its segments.
[0057] Furthermore, in the above applications, DNA sample preparation includes: collecting peripheral blood and preparing a DNA sample using a blood DNA extraction kit. It is understood that other forms of samples may also be used.
[0058] Understandably, the analysis of the above results may be for non-diagnostic purposes, used to obtain intermediate results of the correlation analysis.
[0059] To verify the accuracy of MLPA-NGS in SNP detection, this invention used Sanger sequencing and the Infinium Omni ZhongHua-8 SNP chip for verification; the verification results showed that the SNP calls were 99.99% concordant with the results of these reference methods.
[0060] Compared with the prior art, the present invention, by adopting the above technical solution, has the following beneficial effects:
[0061] This invention presents a SNP detection kit for forensic identification, based on MLPA-NGS technology. It follows the principles of MLPA (multiplex ligation-dependent probe amplification) for library construction and next-generation sequencing (NGS) for sequencing, enabling accurate analysis of SNPs on genes. This is a high-throughput MLPA method. Based on this MLPA-NGS method, SNPs can be detected with ultra-high multiplexing, thus enabling the development of SNP detection kits for individual identification and forensic identification.
[0062] This invention provides a kit and analytical method based on the MLPA-NGS principle for individual identification and forensic identification. The kit consists of probe working solution, MLPA buffer, high-temperature ligation solution, universal primers, and PCR reaction solution, and its operation is similar to MLPA technology. The fastQ file obtained after sequencing can be automatically analyzed. The analysis includes: SNP genotyping, forensic identification of core families (triads), parental origin of the X chromosome, and presence or absence of the Y chromosome. The kit uses next-generation sequencing for detection, providing a large amount of information, achieving the effect of SNP chips in X chromosome copy number detection, with high sensitivity, low reagent cost, simple operation, and low skill requirements for operators, facilitating large-scale deployment. Compared with SNP detection methods using ultra-multiplex PCR plus next-generation sequencing, it has the advantages of easier probe design and a template length that can be reduced to approximately 50 bp.
Description of the Drawings
[0063] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are for illustrative purposes only, and do not constitute an undue limitation of the invention. In the drawings:
[0064] Figure 1 is a schematic flowchart of the MLPA-NGS method in one embodiment of the present invention; wherein, A: the designed left and right probes and their binding sites with the target DNA; B: after template unwinding, the probes hybridize with the template; C: probe ligation under the action of ligase; D: amplification of the ligation product using adapter sequences from next-generation sequencing as primer pairs; E: amplification of products with the same ends but different middle sections; F: NGS sequencing; G: data analysis and statistics.
[0065] Figure 2 is a schematic diagram of the probe characteristics of the MLPA-NGS method for SNP detection in one embodiment of the present invention;
[0066] Figure 3 is a typical scatter plot of SNP genotyping value distribution in one embodiment of the present invention;
[0067] Figure 4 is a schematic diagram of the Sanger sequencing results of rs4608 in one embodiment of the present invention; wherein, the five samples used are p2866, p2801, p2864, p2845, and p2806 in sequence.
[0068] Figure 5 is a schematic diagram of the Sanger sequencing results of rs1054480 in one embodiment of the present invention; wherein, the five samples used are p2801, p2864, p2851, p2845, and p2806 in sequence.
[0069] Figure 6 is a schematic diagram of the results of autosomal SNP analysis of the parental origin of autosomes, using the HF01 sample as an example, in one embodiment of the present invention;
[0070] Figure 7 is a schematic diagram of the results of autosomal analysis of the parental origin of autosomes, using an HF53 sample as an example, in one embodiment of the present invention;
[0071] Figure 8 is a schematic diagram of the results of X-SNP analysis of the parental origin of the X chromosome for a female sample, using the child sample of HF01 as an example, in one embodiment of the present invention;
[0072] Figure 9 is a schematic diagram of the results of X-SNP analysis of the parental origin of the X chromosome for a male sample, using the child sample of HF02 as an example, in one embodiment of the present invention;
[0073] Figure 10 is a schematic diagram of the results of X-SNP analysis of the parental origin of the X chromosome, using a T8 sample as an example, in one embodiment of the present invention;
[0074] Figure 11 is a schematic diagram of the results of X-STR analysis of the parental origin of the X chromosome, using a T8 sample as an example, in one embodiment of the present invention;
[0075] Figure 12 is a schematic diagram of the results of X-SNP analysis of the parental origin of the X chromosome, using a T1 sample as an example, in one embodiment of the present invention;
[0076] Figure 13 is a schematic diagram of the results of X-STR analysis of the parental origin of the X chromosome, using a T1 sample as an example, in one embodiment of the present invention.
Detailed Implementation
[0077] The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention. Experimental methods in the following embodiments for which no specific conditions are given are generally carried out according to national standards. If there is no corresponding national standard, they are carried out according to general international standards, conventional conditions, or conditions recommended by the manufacturer. Experimental materials in the following embodiments whose source is not specified are all commercially available raw materials, and the equipment used in each step is conventional equipment. Unless otherwise stated, all parts are parts by weight, and all percentages are percentages by mass. Unless otherwise defined or stated, all professional and scientific terms used in the present invention have the same meaning as commonly understood by those skilled in the art. In addition, any methods and materials similar or equivalent to those described can be applied to the methods of the present invention.
[0078] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other. The present invention will be further described below with reference to the accompanying drawings and specific embodiments, but this is not intended to limit the scope of the invention.
[0079] The following embodiments illustrate the design concept, reagent kit preparation method, sample collection, testing method, and analysis method. The samples used include over 50 healthy core family samples (each core family includes a father, mother, and child) and 15 core family samples of patients with Turner syndrome (TS). The following embodiments will provide statistical results for some samples, specifically using three patients as examples for a more detailed explanation. The results of these three samples are for illustrative purposes only and should not be considered as limiting the scope of the invention. All reagents used in the following embodiments are commercially available, conventional products.
[0080] The principles of individual identification and forensic identification based on MLPA-NGS are briefly described in the following embodiments:
[0081] (1) Design of detection probes based on MLPA-NGS principle;
[0082] MLPA-NGS technology is a fusion of MLPA and NGS technologies, combining the high precision of MLPA in detecting predictable SNVs with the high throughput of NGS in detecting target fragments. It is a high-throughput MLPA technology. Classical MLPA amplification products are differentiated by capillary electrophoresis based on their length. However, MLPA-NGS introduces universal adapter sequences from NGS into the ligation fragments via PCR after MLPA probe ligation. The fusion products are then directly sequenced in high-throughput, transforming MLPA length detection into MLPA-NGS sequence detection. This overcomes the limitation of MLPA amplification fragment length diversity on the number of detections, significantly increasing throughput. MLPA can detect up to 60 sites at a time, while MLPA-NGS has been tested to detect over 2000 sites simultaneously. For individual identification and forensic identification, SNP probes are designed to detect the genotype of SNPs at the ligation sites where the probes hybridize on DNA. The probes are distributed on all autosomes, the X chromosome, and the Y chromosome, with the highest density on the X chromosome. The selected SNP to be detected is generally a single-base substitution SNP containing only two genotypes (referred to as wild type and mutant type). Two signals of the SNP are detected by designing degenerate bases on the probes that target the two genotypes respectively.
[0083] (2) Determine SNP genotypes using a kit designed based on the MLPA-NGS principle;
[0084] The signals of the two genotypes for each SNP are measured, with the raw data expressed as read counts. The proportion of reads for one genotype (wild type) relative to the total reads for both genotypes is then calculated to obtain a genotyping value, which lies in the range [0,1]. The magnitude of the genotyping value is used to determine whether the genotype is wild-type, heterozygous, or homozygous mutant. Before the genotyping value can be used to determine the genotype, genotype boundary values must be set. To do so, the genotyping values of a given SNP from a series of normal samples are presented as a scatter plot. In most cases, these values fall into three regions: the near-0 region, the near-1 region, and the intermediate region, with two blank areas of varying width between them. The midpoints of these two blank areas are taken as the boundaries of the three distribution regions. A new genotyping value to be tested can then be assigned to one of the three regions: depending on whether it falls into the near-0 region, the near-1 region, or the intermediate region, the SNP is called homozygous mutant, wild-type, or heterozygous, respectively. The accuracy of this method in determining genotype was verified through two experimental approaches. One was first-generation (Sanger) sequencing, in which some SNP sites and samples were randomly selected for sequencing and the results were compared with those of MLPA-NGS for consistency. The other was SNP chip technology, in which some samples were tested on an SNP chip and the SNPs covered by both technologies were compared for consistency.
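The region assignment described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the embodiments; the function names and the boundary values 0.25 and 0.75 are hypothetical.

```python
def gap_midpoint(lower_cluster_max, upper_cluster_min):
    """Boundary = midpoint of the blank area between two neighbouring clusters."""
    return (lower_cluster_max + upper_cluster_min) / 2


def call_genotype(value, low_boundary, high_boundary):
    """Assign a genotyping value in [0, 1] to one of the three regions."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("genotyping value must lie in [0, 1]")
    if value < low_boundary:
        return "homozygous mutant"  # near-0 region
    if value > high_boundary:
        return "wild type"          # near-1 region
    return "heterozygous"           # intermediate region


# Hypothetical boundaries for one SNP, for illustration only.
low, high = 0.25, 0.75
for value in (0.02, 0.51, 0.97):
    print(value, call_genotype(value, low, high))
```

Because the boundaries are set per SNP, a real analysis would look them up for each locus rather than using one fixed pair.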
[0085] (3) Use a kit designed based on the MLPA-NGS principle to determine the parentage of the core family;
[0086] Samples from both the subject and their parents must be tested simultaneously. Prior to this, it must be ensured that the parents being tested are the subject's biological parents. The subject and parents constitute a core family. Autosomal SNPs within this core family are analyzed to determine if they conform to genetic laws, allowing for paternity testing. Specifically, parentage is determined by calculating the cumulative paternity index (CPI) of the SNPs. For the core family to be tested, if all identified SNP loci conform to pedigree segregation and the CPI is greater than or equal to 10,000, then the parentage within the core family can be confirmed. Using a large number of core families with known relationships as the testing subjects, and MLPA-NGS as the testing method, the confirmation of the tested subjects can, in turn, indicate the reliability of this method for SNP detection.
[0087] (4) Use a kit designed based on the MLPA-NGS principle to determine the parental origin of the proband's X chromosome;
[0088] Once parentage is confirmed, the parental origin of the proband's X chromosome can be further determined. Normally, the father has one X chromosome, so each of his X-SNPs is hemizygous, while the mother has two X chromosomes, so each of her X-SNPs is wild-type, heterozygous, or homozygous mutant. Under normal circumstances, boys have one X chromosome, inherited from the mother, while girls have two X chromosomes, one from the father and one from the mother. However, individuals with certain genetic disorders, such as Turner syndrome, may have one, two, or even more X chromosomes (mosaicism). These X chromosomes can only come from the parents. For a given X-SNP, if the mother is not heterozygous and differs from the father (e.g., father AA, mother GG), the parental origin of the child's SNP can be clearly determined: if it is the same as the father's, it comes from the father; if it is the same as the mother's, it comes from the mother; if there are signals from both parents, it comes from both parents. The ratio of the signals from the two parents is the ratio of X chromosome origin from each parent. Generally, a significantly smaller signal from one parent than from the other suggests the presence of X chromosome mosaicism. By combining the ratios of parental origin over multiple consecutive X-SNPs, the parental origins and proportions of the X chromosome region in which those X-SNPs lie can be confirmed.
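The rule for an informative X-SNP can be illustrated with a short Python sketch. The function name, the input layout, and the read counts are hypothetical. The sketch uses raw read ratios, so it omits any correction between genotyping values and dose values.

```python
def x_origin_fractions(father_allele, mother_genotype, child_reads):
    """Return paternal/maternal signal fractions for one X-SNP, or None.

    father_allele   -- the father's single (hemizygous) allele, e.g. "A"
    mother_genotype -- the mother's two alleles, e.g. "GG"
    child_reads     -- dict mapping allele -> read count in the child
    """
    mother_alleles = set(mother_genotype)
    # Informative only if the mother is not heterozygous and differs from the father.
    if len(mother_alleles) != 1 or father_allele in mother_alleles:
        return None
    (mother_allele,) = mother_alleles
    paternal = child_reads.get(father_allele, 0)
    maternal = child_reads.get(mother_allele, 0)
    total = paternal + maternal
    if total == 0:
        return None
    return {"paternal": paternal / total, "maternal": maternal / total}


# Hypothetical child with a minor paternal signal at this SNP.
print(x_origin_fractions("A", "GG", {"A": 250, "G": 750}))
```

Averaging these fractions over consecutive informative X-SNPs would give a regional estimate, in line with the last sentence of the paragraph above.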
[0089] (5) Analyze the quality of test results using a kit designed based on the MLPA-NGS principle;
[0090] Autosomal SNP data can also be used to analyze the quality of test results. Analysis revealed that in high-quality samples, wild-type genotyping values are closer to 1 and homozygous mutant genotyping values are closer to 0, whereas in low-quality samples, or samples not tested strictly according to the experimental procedure, wild-type genotyping values are further from 1 and homozygous mutant genotyping values are further from 0. The population genotyping value of a given SNP across multiple normal samples of the same genotype (wild-type or homozygous mutant) can be expressed as the mean of the genotyping values plus or minus the standard deviation, and can be considered a measurement characteristic of the SNP. For each SNP in a new sample, the absolute value of the difference between its genotyping value and the population mean for the corresponding genotype, divided by the standard deviation, is called the "genotyping bias" and indicates how well the measurement conforms to that characteristic. The average genotyping bias of all non-heterozygous SNPs in a sample can be used to judge the test quality of that sample: the larger this average, the worse the test quality and the less reliable the result.
[0091] The above outlines the general principles of individual identification and forensic identification based on the MLPA-NGS method.
[0092] Example 1 – Design and Use of the Reagent Kit
[0093] This embodiment designs a detection kit for forensic identification based on the MLPA-NGS method and describes the operation method of the detection kit. The above design specifically includes:
[0094] (1) Site selection and probe design;
[0095] SNP probes are designed following the principles of MLPA-NGS technology. The MLPA-NGS process is shown in Figure 1 (this figure is from a previous study and is for illustrative purposes only; it is not intended to be limiting). When designing SNP probes, three SNP loci were selected on each autosome (based on GRCh38 / hg38), located in the beginning, middle, and end regions, for a total of 66 loci. Any two selected SNPs on the same chromosome are therefore more than 1000K apart, eliminating the need for linkage analysis in paternity testing. Fifty-one SNPs were selected on the X chromosome and 25 on the Y chromosome; these are distributed at nearly equal intervals. The SNPs are of the single-base substitution type, generally with only two alleles and a minor allele frequency (MAF) generally greater than 0.2; the MAF of SNPs on the X and Y chromosomes was not strictly constrained. The frequency of each SNP was obtained from the NCBI 1000 Genomes Database. Each SNP detection probe consists of two probes that bind to the template immediately adjacent to each other. The 3' end of the left probe contains a degenerate base corresponding to the two alleles of the SNP. The 5' end of the right probe carries a phosphorylated base for probe ligation. The left and right probes constitute a set of probes for detecting one SNP. The wild-type SNP site together with the 20 bp fragments on both sides is aligned against the human genome; fragments that are not unique are not selected. The mutant SNP site together with the 20 bp fragments on both sides is likewise aligned against the human genome; fragments that are present in the genome are not selected. In some cases, a mismatch is introduced at the third base from the 3' end of the left probe to improve the binding specificity during ligation. The mismatched sequence is also aligned against the genome with a 20 bp fragment; fragments that are present are not selected. The SNP probe sequences are shown in SEQ ID NO.1 to SEQ ID NO.284.
When SNP probes are used to detect SNPs, the pattern is as shown in Figure 2. For example, for an SNP with wild-type and mutant alleles A and G, respectively, the probes can be successfully ligated when the degenerate base on the probe hybridizes with the corresponding T or C on the template; if the base does not match the template, ligation fails or the ligation rate is low. This principle is used to detect SNPs.
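As a rough illustration of the degenerate-base principle, the Python sketch below expands a left probe whose 3' terminal base is an IUPAC two-base code into its two allele-specific forms and checks base pairing at the ligation site. The probe sequence and function names are hypothetical and are not taken from SEQ ID NO.1 to SEQ ID NO.284.

```python
# IUPAC codes that stand for exactly two bases.
IUPAC_TWO_BASE = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def expand_left_probe(left_probe):
    """Return the allele-specific probes encoded by a degenerate 3' terminal base."""
    last_base = left_probe[-1]
    if last_base not in IUPAC_TWO_BASE:
        return [left_probe]
    return [left_probe[:-1] + base for base in IUPAC_TWO_BASE[last_base]]


def pairs_with_template(probe_3prime_base, template_base):
    """True when the probe's 3' base is complementary to the template base."""
    return COMPLEMENT[probe_3prime_base] == template_base


# Hypothetical left probe ending in R (A or G), as in the A/G example above.
for probe in expand_left_probe("ACGTTGCAR"):
    print(probe, "pairs with template", COMPLEMENT[probe[-1]])
```

Only the probe form whose terminal base pairs with the template is expected to ligate efficiently, which is what turns the two forms into two allele-specific signals.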
[0096] In the SNP probes, a sequence is added to the 5' end of the left probe and the 3' end of the right probe to serve as the binding sequence for universal primer amplification. The corresponding sequences are shown in SEQ ID NO. 287–SEQ ID NO. 288. After SNP probe synthesis and phosphate addition, appropriate amounts of each probe are mixed to prepare a solution with a concentration of 0.2–20 fmol / μL per probe. The concentration is optimized based on test results (the optimal concentration is 2 fmol / μL) and used as the probe working solution. The binding sequence and primers are derived from adapter sequences from Illumina next-generation sequencing, and the amplification itself is also a library preparation process before Illumina sequencing.
[0097] (2) Design the analytical sequence (characteristic sequence) of the probe.
[0098] After adjacent probes complete ligation, amplification, and sequencing, sequencing results are obtained. These results are stored in a FastQ file, containing the sequence and related information for each tested fragment; these sequences are called reads. To analyze the number of reads amplified by the designed probes, the characteristic sequences (i.e., specific sequences) of the probes are searched in the FastQ file. Reads containing these specific sequences are categorized as belonging to that probe and counted. Each pair of SNP probes has three specific sequences: a 10-base segment extracted from the left portion of the left probe-template binding region, the left probe-right probe ligation site, and the right portion of the right probe-template binding region. These three sequences are separated by at least one base. The two groups of specific sequences for two alleles differ only by one base in the middle specific sequence; this base corresponds to the wild type and mutant type, respectively. This constitutes the specific sequences for all detection sites. If a probe containing a single SNP with a degenerate base is considered as two sets of probes, then any set of specific sequences corresponds one-to-one with the corresponding probe. For detailed information on the relevant characteristic sequences, please refer to "Table 1 – Probe, Primer and Characteristic Sequence Information".
[0099] To assign reads from a FastQ file to the designed probe pairs, the conventional analysis method aligns the sequencing reads to a reference sequence, calculates the depth of successfully aligned reads, and then extracts the variant sites and their frequencies. The analysis method designed in this embodiment differs from this conventional method but yields the same results. The conventional method requires bioinformatics professionals to use specific software and perform the analysis on a server, which is time-consuming and complex. The method designed in this embodiment instead selects three sequences from each probe pair, searches the FastQ file for them, and counts their occurrences. It is simpler to operate, has less demanding requirements, and needs neither specialized bioinformatics personnel nor expensive servers. It also excludes low-quality sequencing automatically: low-quality sequencing often introduces random sequencing errors, and reads containing such errors fail to match the specific sequences and therefore fall outside the search results. The search is implemented with regular expressions.
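The search can be sketched in Python with the standard `re` module. This is a simplified illustration under stated assumptions: the three signature sequences are invented, the gap between them is modeled as one or more bases, and reads are assumed to be in the same orientation as the signatures. The helper names are hypothetical.

```python
import io
import re

GAP = "[ACGTN]+"  # the three signature segments are separated by at least one base


def build_pattern(left, middle, right):
    """Compile a pattern requiring all three signatures, in order, in one read."""
    return re.compile(re.escape(left) + GAP + re.escape(middle) + GAP + re.escape(right))


def read_sequences(handle):
    """Yield the sequence line (second of every four lines) of each FastQ record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip()


def count_reads(handle, patterns):
    """Count the reads that contain each allele's signature pattern."""
    counts = {name: 0 for name in patterns}
    for sequence in read_sequences(handle):
        for name, pattern in patterns.items():
            if pattern.findall(sequence):
                counts[name] += 1
    return counts


# Hypothetical signatures: the two alleles differ by one base in the middle segment.
LEFT, RIGHT = "AAAACCCCGG", "GGTTAACCGT"
patterns = {
    "wild_type": build_pattern(LEFT, "TTGACATGCA", RIGHT),
    "mutant": build_pattern(LEFT, "TTGACGTGCA", RIGHT),
}


def fastq_record(name, sequence):
    return f"@{name}\n{sequence}\n+\n{'I' * len(sequence)}\n"


reads = [
    LEFT + "T" + "TTGACATGCA" + "AC" + RIGHT,
    "GG" + LEFT + "T" + "TTGACATGCA" + "AC" + RIGHT,
    LEFT + "T" + "TTGACGTGCA" + "AC" + RIGHT,
    LEFT + "T" + "TTGACTTGCA" + "AC" + RIGHT,  # random error: matches neither allele
]
fastq_text = "".join(fastq_record(f"read{i}", s) for i, s in enumerate(reads))
counts = count_reads(io.StringIO(fastq_text), patterns)
print(counts)
```

The fourth read shows the filtering effect described above: a single error inside a signature segment keeps the read from being counted for either allele.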
[0100] (3) Preparation of DNA samples;
[0101] This study was approved by the ethics committee. It used DNA from 53 healthy core families (father, mother, and child), 2 core families from the hematology department whose children had undergone bone marrow transplants, and 15 core families diagnosed with Turner syndrome. Informed consent was obtained from all family members. There were no blood relations between the families. 2 mL of peripheral blood was collected from each member, and DNA was prepared using a blood DNA extraction kit (TIANGEN). DNA concentration was determined by Nanodrop (Thermo Fisher Scientific). It is known that most Turner syndrome patients have abnormalities of the X chromosome. These samples are intended to assess how Turner syndrome or similar chromosomal abnormalities affect the use of SNPs for individual identification and forensic identification.
[0102] (4) MLPA-NGS experimental procedure;
[0103] On day one, DNA denaturation and probe hybridization were performed as follows: 5 μL of DNA sample (50–250 ng) was added to a PCR tube, denatured at 98°C for 5 minutes, and then cooled to 25°C. 1.5 μL of MLPA buffer (from MRC-Holland) was mixed with 1.5 μL of probe working solution, added to the sample tube, and mixed thoroughly. The thermal cycling program was then continued: 95°C for 2 minutes, then a stepwise decrease from 65°C to 55°C with a 1-hour incubation at each 1°C step, and finally a hold at 54°C for 3–10 hours. On day two, the ligase-65 master mix was prepared: each reaction contained 25 μL dH₂O, 3 μL ligase buffer B, and 3 μL ligase buffer A, followed by 1 μL of ligase-65 enzyme, and was mixed thoroughly by gentle pipetting. Buffers A and B and ligase-65 were all from MRC-Holland. The mixture was preheated in a PCR instrument at 54°C for 1 minute, added to the PCR tube being held at 54°C, mixed well, and incubated for a further 25 minutes. The reaction was then heated at 98°C for 5 minutes and cooled to 20°C, the program was paused, and the PCR tube was removed.
[0104] For PCR amplification, the PCR enzyme used was HS Taq from Takara (catalog number: R007Q). The 50 μL PCR reaction contained the following components: 5 μL reaction buffer, 4 μL dNTPs, 1 μL each of forward and reverse primers (20 pmol), 10 μL ligation product, 0.25 μL enzyme, and water to a final volume of 50 μL. The sequences of the forward and reverse primers are shown in SEQ ID NO. 285 and SEQ ID NO. 286, respectively. The poly(N) sequences in both primers are index sequences used to distinguish samples. The reaction conditions were: 35 cycles of 95°C for 30 s, 60°C for 30 s, and 72°C for 60 s; a final extension at 72°C for 20 min; and a hold at 15°C. An appropriate amount of each PCR amplification product was taken, and the aliquots were mixed thoroughly, frozen, stored, and sent to Nanjing Novogene Biotechnology Co., Ltd. for sequencing. Qubit 2.0 was used for preliminary quantification of the library concentration, an Agilent 2100 was used to check the integrity of the library DNA fragments and the insert size, and an Illumina high-throughput sequencer (such as HiSeq2500 / HiSeq4000 / HiSeqX / MiSeq) was used for paired-end 150 bp sequencing to obtain the FastQ file.
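The volume of water needed to bring each reaction to 50 μL follows from the listed components. A small Python sketch of that arithmetic (the variable names are illustrative):

```python
REACTION_VOLUME_UL = 50.0

# Component volumes per reaction, in microlitres, as listed above.
components_ul = {
    "reaction buffer": 5.0,
    "dNTPs": 4.0,
    "forward primer": 1.0,
    "reverse primer": 1.0,
    "ligation product": 10.0,
    "HS Taq enzyme": 0.25,
}

# Water makes up the remainder of the final volume.
water_ul = REACTION_VOLUME_UL - sum(components_ul.values())
print(f"water per reaction: {water_ul} uL")
```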
[0105] Example 2 – Analysis of Sequencing Results from the Reagent Kit
[0106] This embodiment analyzes the FastQ file obtained in Example 1, specifically including:
[0107] (1) Read counting method and genotyping value calculation;
[0108] FastQ files were obtained from a high-throughput sequencer, and a Python program was used for analysis. The combination of three analytical sequences for each site designed in the "(1) Site Selection and Probe Design, (2) Analysis Sequences of Designed Probes" section of Example 1 was used as the text to be searched. Each read in the FastQ file was used as the search object, and the `findall` function in Python was used as the search method to count the number of reads containing each text to be searched. For each SNP, if the sum of the number of reads for the two alleles was less than 20, it was considered unqualified and not analyzed.
[0109] For each SNP that passes this quality control, the number of wild-type reads in a sample is divided by the total number of reads for that SNP in the same sample to obtain the genotyping value of the SNP. The value range is [0,1].
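The read threshold and the genotyping value can be expressed as a short Python function. The function name and the example read counts are hypothetical; the threshold of 20 reads is the one stated above.

```python
def genotyping_value(wild_type_reads, mutant_reads, min_total_reads=20):
    """Return wild-type reads / total reads, or None if the SNP is unqualified."""
    total = wild_type_reads + mutant_reads
    if total < min_total_reads:
        return None  # fewer than 20 reads in total: not analyzed
    return wild_type_reads / total


print(genotyping_value(480, 20))  # mostly wild-type reads
print(genotyping_value(10, 9))    # 19 reads in total: unqualified
```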
[0110] (2) Construct a judgment framework for SNP typing and perform SNP typing quality control.
[0111] (A) SNP typing;
[0112] In order to accurately determine the genotype of each SNP based on the genotype values, this embodiment uses the genotype values of 73 healthy individuals from the core family of hyperthyroidism to create a scatter plot of the distribution of the genotype values of each SNP, in order to explore the distribution pattern of the SNPs.
[0113] In most cases, the scatter points of the genotyping values for each SNP fall into three groups: a near-1 group, a near-0 group, and an intermediate group. The intermediate points are generally clustered together and clearly separated from the near-0 and near-1 groups. The genotyping value interval [0,1] is scaled by a factor of 1000 and placed in a coordinate system to create a baseline of 1000 units, with an X-coordinate interval of [0,1000] and a Y-coordinate of 0. Each genotyping value is multiplied by 1000 and located on this baseline; the Y-coordinate of that unit and of each of the 10 units before and after it is then reduced by 1. After this processing, the genotyping values of the SNP generally form three grooves on the baseline, corresponding to the homozygous mutant, heterozygous, and wild-type genotypes, respectively. Between the homozygous mutant and heterozygous grooves, and between the heterozygous and wild-type grooves, a stretch of flat baseline with a Y-value of 0 remains, which may be long or short. The midpoint of each flat stretch is used as the cutoff point for distinguishing homozygous mutant from heterozygous, and heterozygous from wild type. The length of the flat stretch indicates the reliability of the cutoff point. For SNP loci on the Y chromosome, there are only two forms, wild type and mutant; there is no heterozygous type. The cutoff points divide the genotyping values of each SNP into three or two groups, corresponding to three or two genotypes, respectively. The cutoff point on the baseline divided by 1000 is the cutoff point for the genotyping value. The genotyping values of the resulting two or three groups are expressed as mean ± standard deviation (x ± s).
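The baseline-and-groove procedure can be sketched in Python as follows. The function name and sample values are hypothetical. The sketch returns every flat stretch that has a groove on both sides, which is an assumption about how fragmented clusters should be handled.

```python
def find_cutoffs(values, scale=1000, half_width=10):
    """Return (cutoff, flat_length) for each flat stretch between two grooves."""
    baseline = [0] * (scale + 1)
    for value in values:
        center = round(value * scale)
        # Lower the baseline at the value's unit and the 10 units on either side.
        for x in range(max(0, center - half_width), min(scale, center + half_width) + 1):
            baseline[x] -= 1

    # Collect runs of untouched baseline (y == 0).
    runs, start = [], None
    for x, y in enumerate(baseline):
        if y == 0 and start is None:
            start = x
        elif y != 0 and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, scale))

    # Keep only interior runs; runs touching 0 or 1000 lie outside the grooves.
    interior = [(a, b) for a, b in runs if a > 0 and b < scale]
    return [((a + b) / 2 / scale, b - a + 1) for a, b in interior]


# Hypothetical genotyping values for one SNP across nine samples.
values = [0.01, 0.02, 0.03, 0.48, 0.50, 0.52, 0.97, 0.98, 0.99]
for cutoff, length in find_cutoffs(values):
    print(f"cutoff {cutoff:.3f}, flat length {length}")
```

For a Y-chromosome SNP with only two clusters, the same function would be expected to return a single cutoff.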
[0114] There is a general correspondence between genotyping values and dose values. The scatter plot of genotyping values shows the relationship between genotyping values and dose values for wild-type doses of 0, 0.5, and 1. Beyond these three points, a general judgment is made based on the above correspondence. When the genotyping value is close to 0 or close to 1, the genotyping values corresponding to doses of 0 and 1 are used for correction; when the genotyping value is in the middle range, the genotyping value corresponding to a dose of 0.5 is used for correction. For most SNPs, the cutoff values for genotyping values in different intervals are set to 0.1 and 0.9.
[0115] To establish the genotyping framework for SNPs, 73 samples were used as the analysis subjects. Following the method described above, a scatter plot was created for the genotyping values of each SNP. From the scatter plot, the midpoint and length of each of the two blank regions (boundaries), the mean and standard deviation of the smallest values (genotyping values close to 0), and the mean and standard deviation of the largest values were calculated. A typical scatter plot of SNP genotyping values is shown in Figure 3, where the horizontal axis represents the genotyping value interval [0,1] scaled by a factor of 1000, the vertical axis represents the number of samples, the scatter points show the distribution of the wild-type, heterozygous, and homozygous mutant genotypes, and the two points on the 0 line of the vertical axis are the boundary points between homozygous mutant and heterozygous, and between heterozygous and wild type. The wild-type and mutant bases of this SNP (rs12040811) are C and T, respectively.
[0116] The table below contains the boundary midpoints, minimum mean, minimum standard deviation, maximum mean, and maximum standard deviation for each SNP. Because of space limitations, and because the corresponding statistics for the heterozygous genotype were not used, they are not included in this table. The data for rs12040811 shown in Figure 3 are included in the table below.
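[0117]–[0120] [Table: boundary midpoints, minimum mean, minimum standard deviation, maximum mean, and maximum standard deviation for each SNP]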
[0121] (B) Quality control of SNP typing;
[0122] To analyze the sequencing quality of samples, the concept of mean genotyping bias was defined. For a given SNP in an individual among the above samples, the genotype is first determined from the genotyping value and the cutoff points. If the genotype is not heterozygous, the mean genotyping value for the corresponding genotype of that SNP across the samples is subtracted from the individual's genotyping value. The absolute value of this difference is divided by the corresponding standard deviation (set to 0.003 if the standard deviation is less than 0.003). The result is defined as the genotyping bias for that individual's SNP. The mean genotyping bias is the average of all calculable genotyping biases over all SNPs measured in that individual. A scatter plot of the genotyping values for all SNPs in each sample was created. It was found that the larger the mean genotyping bias, the more diffuse the distribution of SNP genotyping values in the sample, and the lower the reliability of the genotyping. Therefore, healthy samples with excessively large mean genotyping biases were removed. The mean and standard deviation were recalculated from the remaining samples, and the mean genotyping bias of each sample was then calculated using the new mean and standard deviation.
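A minimal Python sketch of the genotyping bias and its per-sample mean is given below. The function names, the tuple layout, and the numeric values are hypothetical; the floor of 0.003 on the standard deviation is the one stated above.

```python
def genotyping_bias(value, population_mean, population_sd, sd_floor=0.003):
    """|value - population mean| divided by the SD, with the SD floored at 0.003."""
    return abs(value - population_mean) / max(population_sd, sd_floor)


def mean_genotyping_bias(calls):
    """Average bias over non-heterozygous SNPs.

    calls -- iterable of (value, genotype, population_mean, population_sd)
    """
    biases = [
        genotyping_bias(value, mean, sd)
        for value, genotype, mean, sd in calls
        if genotype != "heterozygous"
    ]
    return sum(biases) / len(biases) if biases else None


# Hypothetical sample with three SNP calls; the heterozygous call is skipped.
sample = [
    (0.970, "wild type", 0.980, 0.005),
    (0.004, "homozygous mutant", 0.001, 0.001),  # SD below the floor
    (0.510, "heterozygous", 0.500, 0.020),
]
print(mean_genotyping_bias(sample))
```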
[0123] (3) Calculate the cumulative paternity index based on the SNP of the family lineage;
[0124] For each SNP genotyping value in the subject sample, the subject is genotyped based on the aforementioned genotyping cutoff points. When children, fathers, and mothers of a core family undergo the same MLPA-NGS testing and SNP genotyping, the parentage can be determined by analyzing whether they conform to the allele segregation pattern. The likelihood ratio (LR) is used to represent the reliability of paternity testing results. LR is calculated as the ratio of likelihood values (L) based on two hypotheses (H0: the subject is the biological father of a child in a given pedigree; H1: the subject is unrelated). The overall LR is determined by calculating the cumulative paternity index (CPI) using an autosomal SNP panel.
[0125]
[0126] When performing CPI calculations, since the distance between SNPs on each chromosome is much greater than 1000K, there is no need to perform linkage analysis on adjacent SNPs.
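The two acceptance conditions stated earlier (every SNP conforms to pedigree segregation, and the CPI is at least 10,000) can be sketched in Python. The per-SNP PI values are taken as given inputs here, since their calculation is not reproduced. Genotypes are written as two-letter strings, and all names and numbers are illustrative.

```python
from math import prod


def mendelian_consistent(child, father, mother):
    """True if one child allele can come from the father and the other from the mother."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def cumulative_paternity_index(pi_values):
    """CPI is the product of the per-SNP paternity index (PI) values."""
    return prod(pi_values)


def parentage_confirmed(trios, pi_values, threshold=10_000):
    """Apply both conditions: segregation at every SNP and CPI >= threshold."""
    segregates = all(mendelian_consistent(*trio) for trio in trios)
    return segregates and cumulative_paternity_index(pi_values) >= threshold


# Hypothetical (child, father, mother) genotypes and PI values.
trios = [("AT", "AA", "TT"), ("CC", "CT", "CC"), ("AG", "AG", "GG")]
pi_values = [2.0] * 14  # placeholder values for illustration only
print(cumulative_paternity_index(pi_values), parentage_confirmed(trios, pi_values))
```

Multiplying the PI values directly assumes the SNPs are independent, which is what the spacing between loci described above is meant to justify.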
[0127] (4) Use X-SNP analysis to determine the parental origin of the X chromosome;
[0128] After determining the biological parents of a child through autosomal SNPs, X-SNPs can be used to further determine the parental origin and proportion of the X chromosome.
[0129] Regarding the two alleles of a child's SNP, there are three scenarios for determining parental origin. In the first scenario, the origin is indeterminate, for example when the genotypes of the child, father, and mother are (AT, AT, AT): either of the child's alleles could come from either parent, so no determination can be made. In the second scenario, the origin can be inferred, for example with (AT, AT, TT): because the child's A can only come from the father, the child's T must come from the mother, even though the father also carries T. In the third scenario, parental origin is strongly indicated, for example with (AT, AA, TT): disregarding SNP mutations, the child's A and T can only come from the father and mother, respectively, and the father and mother can only provide A and T at that SNP position, respectively. In Turner syndrome, the X chromosome status can vary (for example, a single chromosome from either the father or the mother, two chromosomes with one from each parent or both from the mother, or low-proportion mosaicism), so the third scenario was chosen as the basis for accurately determining the parental origin of the X chromosome. The main characteristic of the third scenario is that the SNP is hemizygous in the father and homozygous in the mother, with different alleles. In this case, after the child's SNP genotyping value has been measured, the relative dosage of the child's two alleles can be calculated from the correspondence between the genotyping value and the dosage value, which gives the approximate proportion of the X chromosome derived from the father and from the mother.
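The three scenarios can be told apart with a short Python sketch. The function name and scenario labels are hypothetical. Handling of cases beyond the three examples above, such as a homozygous child, is an assumption.

```python
def origin_scenario(child, father, mother):
    """Classify a (child, father, mother) genotype trio by how well origin is resolved."""
    # Third scenario: neither parent is heterozygous and their alleles differ.
    parents_fixed = len(set(father)) == 1 and len(set(mother)) == 1
    if parents_fixed and set(father) != set(mother):
        return "strong"

    # Otherwise enumerate the (paternal allele, maternal allele) assignments
    # that are compatible with the child's two alleles.
    a, b = child
    assignments = set()
    if a in father and b in mother:
        assignments.add((a, b))
    if b in father and a in mother:
        assignments.add((b, a))

    if len(assignments) == 1:
        return "inferable"      # second scenario
    if len(assignments) == 2:
        return "indeterminate"  # first scenario
    return "inconsistent"       # no assignment fits the trio


for trio in [("AT", "AT", "AT"), ("AT", "AT", "TT"), ("AT", "AA", "TT")]:
    print(trio, origin_scenario(*trio))
```

Only trios classified as "strong" would be carried forward to the dosage calculation described above.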
[0130] (5) Controlled experiment;
[0131] To verify the accuracy of MLPA-NGS in SNP detection, two control experiments were used. One experiment randomly selected 10 SNP loci, with 5 samples from each locus. These 5 samples contained three genotypes in MLPA-NGS analysis. Using the DNA from these samples as templates, primers were designed to amplify the SNPs, followed by Sanger sequencing. The consistency between the Sanger sequencing results and the MLPA-NGS analysis results was compared. The second control experiment used Illumina's Infinium Omni ZhongHua-8 SNP chip (service provider: Shanghai Hezhuo Medical Laboratory Co., Ltd.) to detect genome-wide chromosome copy number abnormalities. Three samples were used in this experiment: T11 (proband), B1026 (proband), and H53 (child). The test was performed strictly according to quality control standards, including DNA extraction, enzyme digestion, ligation, PCR, purification, fragmentation, labeling, hybridization, elution, scanning, and analysis. This method detected over 1.17 million SNP loci with an accuracy rate exceeding 99%.
[0132] The accuracy of MLPA-NGS in detecting X chromosome parental origin was verified using the X-STR method. This method is detailed in Ma Hongdu's master's thesis, "Establishment of a Fluorescent Multiplex Amplification System for Nine X-STR Loci and Its Genetic Polymorphism."
[0133] Example 3 – Validation and Application of the Analytical Method for the Reagent Kit
[0134] This embodiment uses certain samples to verify the analysis method in Example 2, and provides relevant specific application examples, which include:
[0135] (1) Statistics on SNP Reads;
[0136] After the query program of the analysis method in Example 2 has finished running, two counts are summarized in an Excel file: the number of times each characteristic sequence of each probe pair appears in the FASTQ file, and the number of times all three characteristic sequences of a probe pair appear together in a single read in the FASTQ file.
[0137] For example, the following are the read counts for several SNPs from one test of the T4 patient family. The three columns on the right show, from left to right, the reads for the child, father, and mother.
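The counting step described above can be sketched as follows. This is a minimal illustration rather than the query program of Example 2: the names `read_sequences` and `count_signatures` and the probe dictionary layout are assumptions, and the sketch counts reads that contain a sequence (not repeated occurrences within one read) using exact matching on one strand only.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line (second of every four lines) of FASTQ records."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def count_signatures(reads, probes):
    """Count reads containing each characteristic sequence and all three together.

    probes maps a probe-pair name to its three characteristic sequences
    (hypothetical layout: left, junction, right).
    """
    single = Counter()  # (probe name, sequence) -> reads containing it
    joint = Counter()   # probe name -> reads containing all three sequences
    for read in reads:
        for name, sigs in probes.items():
            hits = [sig in read for sig in sigs]
            for sig, hit in zip(sigs, hits):
                if hit:
                    single[(name, sig)] += 1
            if all(hits):
                joint[name] += 1
    return single, joint
```

A typical call would pass an open FASTQ file handle, for example `count_signatures(read_sequences(handle), probes)`, and then write the two counters to a spreadsheet.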
[0138]
[0139]
[0140] (2) SNP typing, validation and correlation analysis;
[0141] Each SNP is genotyped by comparing its calculated genotyping value with the cutoff points that separate the three genotypes of that SNP. The PI value for each SNP is then calculated from the genotyping results and the population frequencies of the alleles. If the total number of reads measured for an SNP is too low, that SNP is considered unqualified and is excluded from the calculation.
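As an illustration of the PI step, the sketch below uses the standard trio likelihood ratio, P(child genotype | mother, tested father) divided by P(child genotype | mother, random man); the patent may use a different formulation. The function name `snp_pi`, the `min_reads` parameter (the text gives no threshold), and the "unqualified" and "Err" return labels are assumptions.

```python
from itertools import product


def snp_pi(child, mother, father, freq, total_reads, min_reads):
    """Paternity index for one autosomal SNP of a trio.

    Genotypes are two-letter strings; freq maps allele -> population frequency.
    min_reads is a caller-supplied, hypothetical quality threshold.
    """
    if total_reads < min_reads:
        return "unqualified"
    target = sorted(child)
    # Numerator: each parent transmits either allele with probability 1/2.
    num = sum(0.25 for m, p in product(mother, father) if sorted(m + p) == target)
    # Denominator: the paternal allele is drawn from the population instead.
    den = sum(0.5 * f for m in mother for p, f in freq.items()
              if sorted(m + p) == target)
    if num == 0 or den == 0:
        return "Err"  # genotypes incompatible with Mendelian inheritance
    return num / den


# Child AT, mother TT, tested father AA, both allele frequencies 0.5
print(snp_pi("AT", "TT", "AA", {"A": 0.5, "T": 0.5}, total_reads=1000, min_reads=100))
```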
[0142] (A) SNP genotyping values, genotyping results, and PI values;
[0143] The table below shows the genotyping values, genotyping results, and PI values for several SNPs from a single test of the T4 patient family. Within each of the column groups 3-5 and 6-8, the columns represent, from left to right, the child, father, and mother. This table corresponds to the previous table for the T4 family. The CPI is obtained by multiplying the PI values of all SNPs within the same nuclear family.
[0144]
[0145] The table below shows the genotyping values, genotyping results, and PI values for several SNPs from a test performed on patient B1100. In the PI column, multiple results are "Err", indicating that the child's genotypes at multiple SNPs are incompatible with those of the two parents under the laws of inheritance, which suggests that the tested father is not the child's biological father. This result is consistent with the STR test results. For such patients, subsequent analysis of X-chromosome parental origin is inaccurate. Two cases from the hematology department, in which the child had undergone a bone marrow transplant, also showed numerous "Err" values in their PI calculations.
[0146]
[0147]
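The CPI and the number of excluded SNPs can be tallied from a list of per-SNP PI results, as in this minimal sketch. The function name and the convention that non-numeric entries such as "Err" are left out of the product and counted separately are assumptions; the text only states that the CPI is the product of the PI values.

```python
import math


def combined_paternity_index(pi_values):
    """Return (CPI, exclusion count) from per-SNP PI results.

    Numeric entries are multiplied; "Err" entries are counted as exclusions.
    """
    numeric = [v for v in pi_values if isinstance(v, (int, float))]
    exclusions = sum(1 for v in pi_values if v == "Err")
    return math.prod(numeric), exclusions


cpi, excluded = combined_paternity_index([2.0, 1.5, "Err", 4.0])
print(cpi, excluded)
```

A high exclusion count, as in the B1100 family, points to non-paternity (or a bone marrow transplant) regardless of the product over the remaining loci.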
[0148] (B) The calculation of SNP genotyping deviation values can be used for experimental quality control;
[0149] The three columns on the right of the table below show the genotyping deviation values for several SNPs in the T4 family. The average genotyping deviation value for all autosomal SNPs in each sample is the mean genotyping deviation value, representing the data quality of the sequencing for that sample.
[0150]
[0151] (C) Analysis of the number of SNPs excluded in paternity testing and the average genotyping deviation;
[0152] Analysis was performed on the SNP exclusion count and mean genotyping deviation of 53 healthy families, 2 families that had undergone bone marrow transplantation, and 15 families with TS patients. The results are as follows:
[0153]
[0154]
[0155]
[0156] Among these families, HF53 and HF55, in which the children had undergone bone marrow transplants, had SNP exclusion counts of 6 and 17 respectively, consistent with expectations. In the B1100 family, the SNP exclusion count was 11, indicating that the tested father was not the biological father; this was consistent with the STR testing results. In addition, three healthy families each had an SNP exclusion count of one. This example validated these three loci by sequencing. The results showed that in the HF42 family, rs2976399 has another SNP immediately to its left, which affects the accuracy of rs2976399 typing. For the other two families, Sanger sequencing indicated that the discrepancies likely arose in the next-generation sequencing. Therefore, when this method is used for paternity testing and only one or a very small number of SNP loci deviate from the inheritance pattern, sequencing errors cannot be completely ruled out, and the method has room for improvement.
[0157] (D) Validation experiments: Sanger sequencing and SNP microarray detection;
[0158] Apart from a few problematic loci, this embodiment conducted validation experiments on other loci. Two methods were used for validation: Sanger sequencing and SNP microarray detection.
[0159] For Sanger sequencing, some SNP loci were selected at random and samples were chosen deliberately so that all three genotypes were included. The locus-sample correspondence and the results are shown in the table below and in Figures 4-5. As shown in Figure 4, the Sanger sequencing results for the rs4608 locus in five samples were consistent with those obtained using the MLPA-NGS method. As shown in Figure 5, the Sanger sequencing results for the rs1054480 locus in five samples were likewise consistent with the MLPA-NGS results. These results demonstrate that the Sanger sequencing results are completely consistent with the MLPA-NGS results.
[0160]
SNP         Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
rs2976399   p2837     P4002     P4004     P4005     P4252
rs1652727   p2801     p2806     p2845     p2851     p2864
rs2281974   p2801     p2806     p2845     p2851     p2864
rs8451      p2801     p2845     p2851     p2864     p2863
rs2289759   p2801     p2806     p2845     p2851     p2864
rs4608      p2801     p2806     p2845     p2864     p2866
rs11353     p2801     p2806     p2845     p2851     p2864
rs1054480   p2801     p2806     p2845     p2851     p2864
rs2229137   p2801     p2806     p2845     p2851     p2865
rs2270672   p2806     p2845     p2851     p2864     p2863

SNP         Result 1  Result 2  Result 3  Result 4  Result 5
rs2976399   GA        AA        GG        AA        GA
rs1652727   CT        CC        TT        CT        TT
rs2281974   GA        GG        AA        GA        GA
rs8451      GG        GG        GG        GA        AA
rs2289759   GG        AG        GG        AG        AA
rs4608      TT        CT        CT        CT        CC
rs11353     TC        TC        TT        CC        TC
rs1054480   CC        CC        TT        CT        CT
rs2229137   AA        CC        CC        AA        AC
rs2270672   CC        CC        TT        TT        CT
[0161] When the Infinium Omni ZhongHua-8 SNP chip was used to analyze three samples (the T11 proband, the B1026 proband, and the H53 child), the chip result for rs5744944 in the T11 proband sample differed from the actual sequencing result, as mentioned earlier. For the remaining loci, the results were completely consistent wherever both techniques yielded a result.
[0162]
[0163]
[0164]
[0165] (3) SNP-based analysis and verification of the parental origin of chromosomes;
[0166] As mentioned above, there are three scenarios for determining the parental origin of chromosomes based on SNPs in a nuclear family.
[0167] (A) Parental origin of chromosomes was determined using autosomal SNPs; this embodiment uses the second and third scenarios.
[0168] Taking HF01 as an example, SNPs whose parental origin cannot be distinguished are displayed as blue bars. For SNPs whose parental origin can be distinguished, the paternal and maternal doses are calculated and shown in orange and gray, respectively. When the figure is rendered in black and white, it can be interpreted with reference to the legend. Each SNP that conforms to the inheritance pattern is marked with a solid five-pointed star, and each SNP that does not is marked with a hollow five-pointed star. An SNP that was not detected is marked with an asterisk (※). In Figure 6, for the SNPs whose parental origin could be distinguished, each parent contributed half of the dose, as expected.
[0169] Taking HF53 as an example, the child in this family had undergone a bone marrow transplant, and the bone marrow donor was the father. In Figure 7, numerous SNPs are marked with hollow stars, indicating a violation of the law of segregation. However, the SNPs that violate the law of segregation are all orange, signifying paternal origin, which suggests that this method also has some value in assessing bone marrow transplantation. Therefore, using family SNPs to determine parental origin is effective on autosomes.
[0170] (B) The same method was used to determine the parental origin of the X chromosome, employing the third of the three scenarios (at a given SNP, neither parent is heterozygous and the parents carry different alleles). The children of HF01 and HF02 are female and male, respectively (see Figures 8 and 9). In HF01, for all SNP loci at which parental origin could be distinguished, the doses from the two parents were similar; this was confirmed at 9 SNP loci, indicating that the subject's X chromosomes came half from each parent, consistent with the fact that females have two X chromosomes, one from each parent. In HF02, at all SNP loci at which parental origin could be distinguished (3 SNP loci), the subject's X chromosome came from the mother, consistent with the fact that a male's X chromosome comes from the mother. All 53 healthy nuclear families tested conformed to the sex of the child.
[0171] (C) The same method was used to analyze the parental origin of the X chromosome in TS patients;
[0172] Taking the T8 sample as an example (see Figure 10), it was confirmed at 13 SNP loci that the subject's X chromosomes originated from both parents, with a higher proportion from the father than from the mother. This indicates that the paternal X chromosome dose is significantly greater than the maternal X chromosome dose. The calculated proportion of the paternal X chromosome was 70.2%. These results are consistent with the X-STR testing results. Figure 11 shows the results for the STR locus GATA172D05 in the T8 proband and the proband's parents: the proband's two STR alleles came from the father and the mother, respectively, and the STRs carried by the two parents differ in length. The dose of the maternal STR in the proband is significantly lower than that of the paternal STR. This indicates that the proband's X chromosomes come from both parents, with a weaker signal from the maternal X chromosome. The karyotype of the T8 proband is 45,X[12]/46,XX[38], which is consistent with these findings.
[0173] Taking sample T1 as an example (see Figure 12), the same method was used to analyze the parental origin of the X chromosome in a TS patient. As can be seen from the 10 SNP loci in Figure 12, the X chromosome of the tested individual was inherited from both parents, with a significantly higher proportion from the father than from the mother. The proportion of the X chromosome from the father was calculated to be 89.7%. These results are consistent with the X-STR test results. Figure 13 shows the results for the STR locus DXS10146 in the T1 proband and the proband's parents (the STR peak indicated by the arrow in the proband originates from the mother): the proband's X chromosome comes from the father, but there is a weak signal from the mother's X chromosome. Such a weak signal might be dismissed as contamination or a nonspecific effect. However, in the analysis method of this embodiment, the maternal signal appears repeatedly at multiple SNP loci, while no such signal is present in healthy families, indicating that it is real. In karyotype analysis, the T1 proband is 45,X, which means that the X chromosome from the mother was missed entirely.
[0174] As can be seen from the above embodiments, (1) SNP set detection based on the MLPA-NGS method can be used for individual identification and forensic identification; (2) the SNP set can determine the parental origin and proportion of the X chromosome; (3) X-SNPs are difficult to use for individual identification and forensic identification, as they are easily affected by Turner syndrome, which has a certain incidence in the population.
[0175] The above description is merely a preferred embodiment of the present invention and does not limit the implementation and protection scope of the present invention. Those skilled in the art should realize that any equivalent substitutions and obvious changes made based on the description and illustrations of the present invention should be included within the protection scope of the present invention.
Claims
1. A SNP site detection composition based on MLPA-NGS method for judicial authentication, characterized in that the detection composition includes an SNP probe composition and a primer composition. The SNP probe composition consists of SNP probe pairs designed for 66 SNP loci on the autosome, 51 SNP loci on the X chromosome, and 25 SNP loci on the Y chromosome. The primer composition consists of universal primers with index sequences for amplifying the ligated probes. Each probe pair includes a left probe and a right probe. The 5' end of the left probe and the 3' end of the right probe each contain a universal sequence, which serves as the binding sequence during amplification by the universal primer. The sequences of the SNP probe pairs are shown in SEQ ID NO. 1 to SEQ ID NO. 284. The sequences of the universal primers are shown in SEQ ID NO. 285 to SEQ ID NO. 286. The universal sequence at the 5' end of the left probe is shown in SEQ ID NO. 287, and the universal sequence at the 3' end of the right probe is shown in SEQ ID NO. 288.
2. The SNP site detection composition of claim 1, wherein each SNP probe pair has three characteristic sequences for forensic identification. Each probe pair includes a left probe and a right probe. The characteristic sequences are base sequences of predetermined lengths extracted from the left side of the region where the left probe binds to the template DNA, the connection position between the left probe and the right probe, and the right side of the region where the right probe binds to the template DNA. The length of each characteristic sequence is 10 bases, and each characteristic sequence is separated from the others by at least one base.
3. A SNP site detection kit for forensic identification based on the MLPA-NGS method, characterized in that the detection kit contains the SNP site detection composition as described in claim 1 or 2.
4. The SNP site detection kit according to claim 3, characterized in that the SNP probe composition is prepared into a probe working solution, wherein the concentration of each probe in the probe working solution is 0.2~20 fmol / μL; and / or, in the primer composition, the concentration of each universal primer is 2~200 pMol.
5. The SNP site detection kit according to claim 3, characterized in that the detection kit also includes at least one of MLPA buffer, ligase, ligase buffer, PCR buffer, dNTP, and PCR enzyme.