A novel polygenic risk score (PRS) approach to predict autism and neurodevelopmental disorders
By combining a multi-gene risk scoring method with common and rare variants, the problem of untapped rare variants in existing technologies has been solved, enabling highly accurate prediction of the risk of autism and neurodevelopmental disorders and early identification of high-risk groups.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WELLMIND BIOMED TECH HLDG LTD
- Filing Date
- 2021-06-23
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, multigene risk scoring methods based on common variants have insufficient predictive accuracy in predicting autism and neurodevelopmental disorders. In particular, due to the failure to effectively utilize rare variants, the variance explained rate is low, making it difficult to identify high-risk individuals at an early stage.
A novel multi-gene risk scoring method for detecting rare variants using gene and DNA sequencing is employed. This method combines common and rare variants, calculates the genetic effect of each gene through burden testing, constructs a predictive model, and improves scoring accuracy.
It significantly improves the accuracy of predicting the risk of autism and neurodevelopmental disorders, enabling the early identification of high-risk individuals, especially potential patients under 18 months of age, which is not possible with traditional methods.
Smart Images

Figure CN114255870B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of gene detection and analysis technology, specifically involving a novel polygenic risk scoring (PRS) method for predicting autism and neurodevelopmental disorders. Background Technology
[0002] Autism Spectrum Disorder (ASD) is a relatively common disorder, with its prevalence rising to approximately 1-2% of the population over the past 20 years. (1) However, most cases are only diagnosed and intervened after the age of 3. In fact, the earlier infants and children at risk of ASD receive diagnosis and intensive treatment, the greater their chances of developing good social and vocational skills. To identify high-risk individuals for autism early, autism specialists use screening tools for further diagnosis. Many tools are commonly used to identify high-risk individuals for ASD, but most screening tools are based on behavioral questionnaires such as the ASQ, CSBS, PEDS, and MCHAT, or eye-tracking devices. (2) .
[0003] Patent applications WO2009150221 and US20110091899A1 disclose a method for detecting whether a subject has autism or is susceptible to autism. This method involves detecting risk alleles in the subject; the more risk alleles detected, the higher the risk of autism. ASD is a complex disease with a high degree of polygenicity. This method relies solely on detecting risk alleles in a pre-defined and limited number of genes (such as pitx, ATP2B2, SLC25A12, and EN2), thus limiting its ability to predict ASD risk. Methods that detect risk alleles in a pre-defined set of genes will only capture a portion of the risk alleles for ASD, reducing predictive accuracy.
[0004] International patent application WO2011076783A discloses a method for detecting the presence or susceptibility to hereditary neurological / psychiatric diseases by simultaneously detecting combinations of risk alleles in multiple genes and estimating a disease risk score for said combinations. This scoring method is essentially similar to the Polygenic Risk Score (PRS); however, the genetic score relies only on common variants (MAF > 0.1%), and its sensitivity of 26% is insufficient for use as a screening tool.
[0005] US Patent Application US20200032337A1 discloses a method and components for determining whether a subject has ASD and whether they are likely to develop ASD, classifying the subject into those with a specific ASD subtype. It detects all single nucleotide polymorphisms (SNPs) in the subject, identifies their presence and / or deletion states, and then compares them with the presence and / or deletion states of SNPs in at least one sample training set. Based on the results of a statistical algorithm, the subject is diagnosed as ASD positive or ASD negative.
[0006] Hong Kong patent application HK1205767A1 discloses a method for detecting ASD-related genes by measuring and analyzing gene expression through a case-control study. Based on the detected genes, an algorithm can infer the risk of autism. The methods listed in the patents above only describe how to assess the odd ratio (OR) of risk alleles detected from SNP microarrays. It is well known that SNP microarrays can only detect a finite subset of all risk alleles (i.e., common SNPs with a MAF > 0.1%), and limit the types and number of risk alleles detected. Furthermore, the above-mentioned risk allele-based detection methods are limited to a range of genes.
[0007] The polygenic risk score (PRS) is a well-known indicator used to estimate an individual's genetic predisposition or relative risk of developing a complex disease. Ideally, the PRS explains a significant percentage (R²) of phenotypic variance. 2 ) should be equal to the "SNP heritability" (h) of complex diseases. snp 2 ) (3) However, the variance (R²) explained by PRS 2 It has only about one-third to two-thirds of heritability. Increasing the sample size of genome-wide association studies (GWAS) may yield higher R values. 2 However, this is difficult to achieve due to high sampling costs and the difficulty in obtaining patient consent. A recent study included all rare variants from individual whole-genome sequencing (WGS) data (N=21,620) and detected over 98% heritability of height (N=21,620). (4) Therefore, PRS-based disease prediction methods are still insufficient for healthcare professionals to use in screening many complex diseases, such as neurodevelopmental disorders like ASD. (5)Such low predictive power is primarily due to the application of only common variants detected by GWAS to PRS calculations. Even though whole-exome sequencing (WES) technology has been widely used for many years, rare variants and a wealth of biological data have not yet been utilized to improve PRS predictive power. In the existing technology, there is an urgent need to develop a PRS scoring method based on rare variants detected by WES to quantify the relative risk of neurodevelopmental disorders such as ASD. This invention provides a PRS scoring method based on rare variants.
[0008] References:
[0009] 1. Baio J, Wiggins L, Christensen DL, et al. Prevalence of Autism SpectrumDisorder Among Children Aged 8 Years—Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2014. MMWR Surveill Summ 2018; 67 (No. SS-6): 1–23.
[0010] 2.Wan G, Kong
[0011] 3. Visscher PM, Hill WG, Wray NR. Heritability in the genomics era—concepts and misconceptions. Nat. Rev. Genet. 2008; 9:255–266
[0012] 4. Wainschtein P, Jain DP, Yengo L, et al. Recovery of trait heritability from whole genome sequence data. bioRxiv; 2019. DOI: 10.1101 / 588020
[0013] 5.Manolio,T.,Collins,F.,Cox,N.et al.Finding the missing heritabilityof complex diseases. Nature 461,747–753(2009)doi:10.1038 / nature08494
[0014] 6.Gaugler,T.,Klei,L.,Sanders,S.et al.Most genetic risk for autismresides with common variation.Nat Genet 46,881–885(2014).
[0015] 7.Guo,M.H.,Plummer,L.,Chan,Y.M.,Hirschhorn,J.N.&Lippincott,M.F.Burdentesting of rare variants identified through exome sequencing via publiclyavailable control data.Am.J.Hum.Genet.103,522–534(2018)
[0016] 8.Goldstein,D.B.,Allen,A.,Keebler,J.,Margulies,E.H.,Petrou,S.,Petrovski,S.,and Sunyaev,S.(2013).Sequencing studies in human genetics:design and interpretation.Nat.Rev.Genet.14,460–470.
[0017] 9.Jack Euesden,Cathryn M.Lewis,Paul F.O’Reilly,PRSice:Polygenic RiskScore software, Bioinformatics,Volume 31,Issue 9,1May 2015,Pages 1466–1468,https: / / doi.org / 10.1093 / bioinformatics / btu848
[0018] 10. Marees AT, de Kluiver H, Stringer S, et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 2018;27(2):e1608. doi:10.1002 / mpr.1608
[0019] 11. Li H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics(Oxford, England), 27(21), 2987–2993. https: / / doi.org / 10.1093 / bioinformatics / btr509
[0020] 12. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biology Jun 6;17(1):122. (2016) Summary of the Invention
[0021] This invention discloses a novel gene-based PRS scoring method based on genes and rare variants detected by DNA sequencing (including but not limited to NGS / 3GS), which is used to screen high-risk autism patients and represents a new approach to objectively identify high-risk populations. Since relying solely on common variants from GWAS is one of the biggest limitations of current PRS analysis workflows, this invention, in addition to using common variants from GWAS, also integrates rare variants detected by DNA sequencing (MAF < 0.1%) with human genetic variant information (such as MAF, consequences, pathogenicity, PolyPhen scores, etc.) from different variant databases (such as dbSNP, ClinVar, OMIM, GnomAD, etc.) into the gene-based PRS score. This can significantly increase the explained variance (R²). 2 It addresses the low variance interpretation of PRS estimated solely based on common variants (R). 2 This algorithm improves the accuracy of predicting the relative risk of a subject potentially developing related diseases, while also reducing the predictive power of common and rare variants in the PRS score. It effectively screens high-risk potential patients under 18 months of age, a capability currently unattainable with existing PRS scoring techniques.
[0022] This invention provides a PRS method for predicting autism and neurodevelopmental disorders, which improves the accuracy of predicting the relative risk of potential diseases in subjects by incorporating the genetic effects of rare variants (QV) of loss of function (LoF) with MAF < 0.1% and common variants with MAF > 0.1%.
[0023] ASD, or Autistic Spectrum Disorder, is a collective term for autistic spectrum disorders, along with other neurodevelopmental disorders.
[0024] PRS, Polygenic Risk Score.
[0025] MAF stands for Minor Allele Frequency.
[0026] QV, Qualified Variant, refers to a rare variant with a loss of function (LoF) of less than 0.1% MAF.
[0027] In some embodiments, the PRS method includes the steps of:
[0028] (1) For each gene in the VCF dataset (hereinafter referred to as the discovery dataset) used to construct the prediction model, a burden test was performed; the genetic effect between each gene and the disease was estimated according to the formula listed below, and the gene weight w was used as a reference. j The Log Odd Ratio is used to quantify these genetic effects, serving as the weight of each gene when calculating the gene-based PRS.
[0029] j is a number from 1 to M, where M is the number of genes;
[0030] R0 j R1 j These are the counts of the number of people belonging to the control groups BIN0 and BIN1, respectively.
[0031] S0 j S1 j These are the number of cases belonging to the BIN0 and BIN1 groups, respectively.
[0032] The BIN0 group includes individuals whose gene j detects one or more QVs;
[0033] Group BIN1 consists of individuals in which QV cannot be detected by gene j;
[0034] (2) Calculate the standard error SE and p-value of the weight (Log Odd Ratio) of each gene:
[0035]
[0036]
[0037] (3) Calculate the gene-based PRS scores of all individuals in the target VCF;
[0038] Let i be a number from 1 to N, N be the number of individuals, j be a number from 1 to M, and M be the number of genes. Calculate individual T using the following formula. i The obtained gene-based PRS score PRS Ti :
[0039]
[0040] When no individual T is detected i When gene j contains QV, b ji =0;
[0041] When individual T is detected i When gene j has more than one QV, b ji =1.
[0042] In some embodiments, the gene weight w j The burden test is calculated by detecting the number of people with loss-of-function (LoF) rare variants of each gene in the case group from the VCF dataset (discovery dataset) generated by WES sequencing, and then comparing it with the corresponding number of people in the control group. According to the method described in this invention, the weight of the genetic effect between each gene and the disease is estimated to construct a predictive model for the disease.
[0043] In some embodiments, the burden test procedure of the present invention is as follows:
[0044] (1) By coordinating the sequencing depth of the VCF datasets of the case group and the control group, only the variants that meet the sequencing depth conditions are retained. The variants must have more than 90% of the individuals with a sequencing depth of more than 10X. Based on this condition, the two VCF datasets are filtered using bedtools.
[0045] (2) By coordinating the variant quality values in the VCF datasets of the case group and the control group, and finding the relevant thresholds through QQPlot, the two VCF datasets were filtered.
[0046] (3) Annotate the variants in the VCF dataset using the Variant Effect Predictor (VEP);
[0047] (4) In the two VCF datasets, rare variants (QV) that cause protein truncation were screened out;
[0048] (5) Count the number of QV detected for each gene in each individual in the case group and the control group respectively;
[0049] (6) In the case group and the control group, each individual was assigned to the BIN0 or BIN1 group according to the number of QV detected by the gene, so as to obtain the values of R0, R1, S0, S1.
[0050] (7) Based on the formula of the technical weight of the present invention, the weight of the genetic effect between each gene and the disease is estimated, thereby constructing a predictive model for the disease.
[0051] In some embodiments, the relevant VEP variant consequences that lead to protein truncation include splice acceptors and donors, stop, gain, frameshift, stop loss, start loss, in-frame insertion and deletion, and misspelling.
[0052] In some embodiments, during the burden test, variant quality values in the case and control VCF datasets are reconciled, with a variant quality threshold of approximately 1.0.
[0053] In some embodiments, through supervised learning, gene-based PRS scores estimated based on rare variants and SNP-based PRS scores estimated based on common variants are combined to derive the variance interpretation (pseudo-R). 2 (and the PRS score, which has higher predictive accuracy)
[0054] In some embodiments, rare variants are detected by DNA sequencing, wherein the DNA sequencing includes NGS / 3GS.
[0055] In some embodiments, rare loss-of-function (LoF) variants with MAF < 0.1% are selected, and only variants with sequencing depths of 10X or higher in more than 90% of individuals are retained as target VCF files.
[0056] In some embodiments, a gene-based PRS process is used to calculate all individuals in the target VCF file:
[0057] (1) Filter the variants of the target VCF file and retain only the variants with a sequencing depth of more than 10X in more than 90% of the individuals;
[0058] (2) VEP annotation of the target VCF file to obtain the consequences of each variant;
[0059] (3) Using the variant consequences annotated by VEP, a Python program was used to screen for rare loss-of-function (LoF) variants (i.e. QV) for each individual and each gene.
[0060] (4) Based on whether each gene of each individual contains more than one QV, the relevant b is obtained. ji Value (0 or 1).
[0061] (5) Based on the prediction model and related gene weights w calculated from the VCF dataset (discovery dataset) j and the subject's b derived from the target VCF file ji The numerical value, the gene-based PRS score for an individual, can be calculated using the following formula:
[0062]
[0063] One of the innovations in this invention lies in assessing ASD risk using risk alleles detected by GWAS and DNA sequencing (including but not limited to NGS / 3GS). In addition to the snp-based PRS score assessed using common variants (MAF>1%) detected by GWAS, we also incorporate rare variants (MAF<0.1%) detected by NGS / 3GS using a novel algorithm to assess gene-based PRS scores, rather than relying solely on snp-based PRS scores based on alleles as in existing PRS scoring techniques. Furthermore, unlike similar patented inventions mentioned earlier, this invention does not pre-assume the nature and number of risk alleles. Since a wealth of biological information from public databases can be annotated onto genes carrying rare variants, this high-value information is integrated into the gene-based PRS scoring algorithm, thereby improving the accuracy of predicting potential disease incidence. This invention can use predictive models constructed from different phenotypes of a disease to calculate the PRS score of the subject's relevant phenotype, thereby classifying the subject into different subtypes and providing precise treatment for each individual, thus greatly improving the effectiveness of the intervention.
[0064] Furthermore, this invention introduces a stress testing workflow into the PRS analysis workflow. This workflow detects rare variants in genes through DNA sequencing, rather than common variants detected by GWAS. Moreover, because the PRS score is based on rare gene variants, we can annotate a large amount of different types of biological information onto rare variants and related genes, thereby improving the predictive accuracy of gene-based PRS. The Python program "gbprs.py" is used to calculate the gene-based PRS score for each individual in the target VCF file, based on their rare LoF variants and gene summary statistics; the relevant methods have been described in the relevant embodiments of the invention.
[0065] This invention is innovative:
[0066] 1. The genetic score is based on the polygenic risk score (PRS) of genes, rather than on the odd ratio or single nucleotide polymorphism risk score.
[0067] 2. Genetic risk scoring is performed not only on common variants but also on rare variants.
[0068] 3. Risk scores for related diseases are primarily based on risk alleles detected from SNP arrays and WES / WGS sequencing, rather than gene expression detected from DNA expression chips / RNA-seq in existing technologies.
[0069] 4. In gene-based PRS, the weight (risk burden) of each gene is calculated through burden testing, while in allele-based SNP-based PRS, the weight (odd ratio) of each risk allele is calculated through linear regression. Therefore, the risk score for developing the relevant disease calculated in this invention is a novel PRS that combines gene-based PRS and allele-based SNP-based PRS scores.
[0070] 5. The new combined PRS calculation method will include the genetic effects of all relevant disease risk alleles (common and rare variants) of the subject, rather than just a small subset of risk alleles in the predefined set of disease risk-related genes in existing technologies.
[0071] Advantages of this invention:
[0072] 1. In this invention, screening high-risk ASD patients based on gene-based polygenic risk scores (PRS) is a novel approach that can objectively identify high-risk individuals under 18 months of age when traditional behavioral screening is not feasible.
[0073] 2. A novel algorithm is employed to assess the risk of relevant diseases using the PRS score, which can simultaneously incorporate both common and rare variants of the subjects. The final score comprises two PRS scores: one assessed based on risk alleles identified in GWAS, and the other based on the risk burden of genes identified in burden tests. This combined PRS score significantly improves the variance explained by R0. 2 And prediction accuracy.
[0074] 3. For common variants, genotyping is performed using SNP arrays; for rare variants, genotyping is performed using whole exome sequencing (WES) or whole genome sequencing (WGS). The risk of a subject developing the relevant disease is measured by the PRS score calculated from the detected common and rare variants, rather than by assessing the risk of a subject developing the relevant disease from a limited set of genes or common variants associated with the relevant disease.
[0075] 4. DNA samples can be collected using a swab tool.
[0076] Detailed description
[0077] Unless otherwise stated, the terms used in the specification and claims of this invention have the following definitions.
[0078] Certain embodiments of the invention will now be described in detail, examples of which are illustrated by the accompanying flowcharts. The invention is intended to cover all alternatives, modifications, and equivalents, all of which are included within the scope of the invention as defined in the claims. Those skilled in the art will recognize that many similar or equivalent methods and materials described herein can be used to practice the invention. The invention is by no means limited to the methods and materials described herein. In the event that one or more of the incorporated documents, patents, and similar materials differ from or contradict this application (including, but not limited to, defined terminology, application of terminology, described techniques, etc.), this application shall prevail.
[0079] It should be further appreciated that certain features of the invention, for clarity, have been described in multiple independent embodiments, but may also be provided in combination in a single embodiment. Conversely, various features of the invention, for brevity, have been described in a single embodiment, but may also be provided individually or in any suitable sub-combination.
[0080] Unless otherwise stated, all technical terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art. All patents and publications related to this invention are incorporated herein by reference in their entirety.
[0081] Unless otherwise stated or there is a clear conflict in the context, the articles “a,” “an,” and “described” as used herein are intended to include “at least one” or “one or more.” Therefore, these articles as used herein refer to articles for one or more (i.e., at least one) objects. For example, “a component” refers to one or more components, meaning that more than one component may be considered for use or adoption in the implementation of the described embodiments.
[0082] "Approximately" or "about" is defined as close to as understood by one of ordinary skill in the art, and in a non-limiting embodiment, the term is defined as within 10%, preferably within 5%, more preferably within 1%, and most preferably within 0.5%.
[0083] As used in this specification and claims, the words “comprising,” “having,” “including,” or “containing” are inclusive or open-ended and do not exclude additional, unlisted elements or method steps. Attached Figure Description
[0084] Figure 1 It is a workflow that harmonizes variants in case and control VCF files to eliminate technical biases introduced by multiple sequencing platforms and variant calling pipelines;
[0085] Figure 2 The PC1 and PC2 components are plotted in the PCA analysis of the ASD case dataset;
[0086] Figure 3 It is the workflow for generating gene summary statistical information (establishing a predictive model);
[0087] Figure 4 It is a gene-based PRS calculation workflow;
[0088] Figure 5 It is a coordinated QQPlot of gene load testing results between cases (ASD) and controls (GnomAD) in the VCF dataset (discovery dataset);
[0089] Figure 6 It is the consequence of variants in ASD samples used in the VCF dataset (discovery dataset) and the PolyPhen2 score;
[0090] Figure 7 It is the BETA distribution of all genes with QV>1;
[0091] Figure 8 It is the distribution of the total number of genes (QV>1) in each sample;
[0092] Figure 9 It represents the true probability (Y-axis) of having ASD corresponding to the PRS percentile value (X-axis). Detailed Implementation
[0093] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without inventive effort are within the scope of protection of this invention. Furthermore, the technical features involved in the different embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.
[0094] Example 1: Construction of a gene-based PRS analysis system:
[0095] (1) In this invention, we use autism spectrum disorder (ASD) as a complex disease to validate the gene-based PRS analysis. ASD is a relatively common and highly heritable complex disease with a complex genetic structure involving common and rare variants. Among the affected subjects, 10-20% carry rare loss-of-function (LoF) mutations. (6) ASD is a good disease to demonstrate the workflow of this invention. The workflow uses the burden test method proposed by Guo, Michael H et al. to construct a predictive model.(7) This invention uses their developed TRAPD software package to perform burden testing between ASD and control groups, and uses a gene-based PRS analysis tool to calculate the gene-based PRS of the target data.
[0096] (2) Data quality control is a critical step in any WES data analysis workflow, especially in burden tests, which compare the number of variants between case and control subjects. First, control subjects' WES data come from multiple DNA sequencing platforms. Second, control subjects are not referred to for variants together with case subjects. Because sequencing coverage and quality metrics can vary significantly across different platforms, greatly impacting sensitivity to variant referencing, these issues can cause technical artifacts, leading to systematic bias in burden test results. These issues have been addressed through… Figure 1 The variant coordination workflow shown is resolved as follows (3) and (4):
[0097] (3) To solve the first problem, according to Figure 1 The workflow within the case and control VCF files reconciles the read depth for each variant. For each variant, the proportion of samples with read depths >10x is counted. If more than 90% of samples with read depths >10x are found in both the ASD and GnomADVCF files, the variant is retained. Variants in the case and control VCFs that meet the read depth filtering criteria are output as a BED file and intersected. This BED file is used to filter out non-coordinated variables in the discovery and target datasets for downstream processing.
[0098] (4) To address the second issue, we need to determine the variant quality score thresholds for the ASD and control datasets separately to avoid amplifying or shrinking the burden test results for synonymous rare variants (MAF < 0.1%) between the ASD and control datasets. For the ASD and control GnomAD VCF files, the variant results are annotated by the variant effect predictor (VEP). The burden test results are visualized using QQPlot. If the QQPlot's lambda... 95 A value close to 1.0 indicates that the threshold pair has been correctly identified without causing amplification (>1.0) or shrinkage (<1.0). This threshold pair will be used to filter variants in the dataset. In a preferred embodiment, the VQSRTranche variant quality score is used in both the ASD and the control GnomAD VCF files.
[0099] (5) After providing a high-quality and harmonized set of variants for the ASD and control datasets, we need to define criteria to filter variants of interest. The total number of these variants will be compared between the ASD and control groups using a burden test. Generally, rare variants with loss-of-function (LoF) mutations with MAF <0.1% are more likely to cause monogenic diseases. (8) These variants are referred to as “qualified variants” (QVs). The filtering criteria for becoming a rare PTV variant are listed in Table 1 if a variant labeled as a QV has a MAF below 0.1% and its VEP results belong to the protein truncation variant (PTV). The Python program “make_snp_file.py” reads an ASD or control VCF file as input and generates an output file recording all QVs for each gene. The working principle of this gene-based folding analysis method is illustrated by the examples shown in Tables 1A and 1B.
[0100] (6) Because it is impossible to integrate the genetic effects and biological information of rare variants into existing PRS software. (9) Or in the PRS analysis workflow (10) Therefore, a novel gene-based PRS analysis workflow is urgently needed. To achieve such a workflow, we must first overcome two main problems. First, similar to the conventional practice in PRS analysis based on GWAS results, the effect size of each gene in the discovery dataset must be mapped to the gene profile of individuals in the target dataset in order to calculate the risk score of individuals in the target dataset. Second, we need to estimate the effect size and p-value (i.e., the summation statistic) of each gene in the discovery dataset. How the present invention addresses these two problems is described in paragraphs (7) and (8) below, respectively.
[0101] (7) In SNP-based PRS analysis, genotype values 0, 1, and 2 are key to mapping the effect size (i.e., the risk of increased risk per risk allele) in the discovery dataset to the SNP profile of individuals in the target dataset. In our new workflow, we must find such a key to map effect sizes from the genotype-based discovery dataset to the genotype profile of individuals in the target dataset. A possible key could be BIN groups categorized by a preset threshold. As shown in Tables 1A to 1D, burden testing works by testing for rare LoF variants (i.e., QVs) with a threshold of 1 proposed by Guo, Michael H, et al. Genes with a non-zero number of QVs are classified as BIN1, while genes without QVs are classified as BIN0. QVs are selected according to the criteria listed in Table 5. As shown in Tables 1D and 1B, this key (burden 0 or 1) can be derived for the discovery and target datasets respectively to link effect sizes between the two datasets. The preset threshold "1" is used for dominant burden testing, while the preset threshold "2" is used for recessive burden testing. This means that in the recessive burden test, the burden is equal to 1 when the number of QVs for each gene is greater than or equal to 2.
[0102] Table 1A identifies variants (SNVi, i = 1-5) of individuals D1, D2, D3, ... Dn from WES in the dataset.
[0103] individual <![CDATA[SNV1]]> <![CDATA[SNV2]]> <![CDATA[SNV3]]> <![CDATA[SNV4]]> <![CDATA[SNV5]]> ASD impact <![CDATA[D1]]> A>T A>G C>G T>A G>C yes <![CDATA[D2]]> C>A G>T A>T A>C C>A no <![CDATA[D3]]> T>G A>C C>G A>T T>C no … … … … … … … <![CDATA[D N ]]> C>G G>A G>T A>G T>C yes
[0104] Table 1B, “Make_snp_file.py”, maps variants to genes.
[0105]
[0106]
[0107] Table 1C annotates variants and identifies eligible variants (shaded).
[0108]
[0109] Table 1D calculates the gene burden with a preset threshold of "1" (burden = 1 if QV count >= 1, otherwise burden = 0).
[0110] individual <![CDATA[Gene1]]> <![CDATA[Gene2]]> <![CDATA[Gene3]]> <![CDATA[Gene4]]> ASD impact <![CDATA[D1]]> 1 0 0 1 yes <![CDATA[D2]]> 0 1 0 1 no <![CDATA[D3]]> 0 0 1 0 no … … … … … … <![CDATA[D N ]]> 0 1 0 1 yes
[0111] (8) In our invention, due to the lengthy computation time required to estimate the relative odd ratio (OR) for over ten thousand genes, a simple counting method was used instead of Logit regression to estimate the OR. For each gene, a 2x2 contingency table was used, representing the number of case (S0, S1) and control (R0, R1) subjects in non-QV carrier (BIN0) and QV carrier (BIN1) groups, respectively (Table 3). The OR measures the relative risk increase between BIN0 (non-QV carrier) and BIN1 (QV carrier). As shown in Table 2A, for each gene in the discovery dataset, the effect size, SE, and p-value for each gene can be inferred using the formulas listed in Table 3. Tables 2A-2C illustrate how the gene-based PRS for estimating all individuals in the target dataset works.
[0112] Table 2A estimates the log(OR) and p-values using the formulas listed in Table 3 below.
[0113]
[0114]
[0115] Table 2B repeats the steps in Tables 1A-1D for all individuals (T1, T2, T3, etc.) in the target dataset.
[0116] individual <![CDATA[Gene1]]> <![CDATA[Gene2]]> <![CDATA[Gene3]]> <![CDATA[Gene4]]> <![CDATA[T1]]> 1 0 0 1 <![CDATA[T2]]> 0 1 0 1 <![CDATA[T3]]> 0 0 1 0 … … … … …
[0117] Table 2C calculates the gene-based PRS for all individuals in the target dataset.
[0118] individual <![CDATA[Gene1]]> <![CDATA[Gene2]]> <![CDATA[Gene3]]> <![CDATA[Gene4]]> PRS <![CDATA[T1]]> 1.3 0.0 0.0 0.2 0.5 <![CDATA[T2]]> 0.0 0.7 0.0 0.2 0.9 <![CDATA[T3]]> 0.0 0.0 -0.8 0.0 -0.8 … … … … …
[0119] Table 3 shows the formula derivations for OR, Beta, SE, and p-value.
[0120]
[0121] (9) After performing burden tests on the discovery datasets shown in Tables 1A-1D, we obtained the effect size BETA and p-value (one-end Fisher exact test) of ASD risk for each gene. Based on these burden test results, the gene-based PRS for all subjects in the target dataset can be calculated as described in (12).
[0122] (10) In the preferred embodiment, the workflow of the burden test is as follows: Figure 3 As shown. For all variants in the ASD VCF file, insertions and deletions were left-aligned and normalized, while multi-allelic MNVs were split into multiple independent allele variant SNVs by BCFtools. (11) .
[0123] (11) Data cleaning and variant harmonization were performed following the steps described in Sections (2), (3), and (4). To integrate biological knowledge, such as gene names, outcomes, pathogenicity, and other LoF scores, such as PolyPhen2, SIFT, and CADD, 3,575,127,300 normalized and segmented multiallelic variants were annotated using VEP in the best embodiment. (12) Based on the annotation results, the program "make_snp_file.py" identifies rare protein truncated variants that meet predetermined criteria as QVs. It writes the QVs found in each gene to an output file. The steps just mentioned are repeated in the control VCF file. Another script, "count_qv.py," counts the number of QVs for each gene for each subject in both the ASD and control VCF files. Finally, by running the R script "burden.R" implemented by the inventors, a gene summary statistics (Table 4) is generated according to the formulas listed in Table 3.
[0124] Table 4. Example of gene summary statistics for burden testing
[0125]
[0126] (12) Figure 4 This paper elucidates a novel method for calculating gene-based PRS scores. First, the target VCF file is annotated with VEPs to obtain results, LoF scores, and MAFs for each variant. Then, the Python program "make_snp_file.py" utilizes the annotation information to identify QVs in each gene. Based on the working principles shown in Tables 2A-2C, the inventors developed the script "gbprs.py" to calculate gene-based PRS scores for all subjects in the target VCF file.
[0127] Example 2: Predicting ASD using PRS
[0128] 1. The VCF dataset used in this embodiment comes from 9,630 family-based samples (3,655 ASD cases and 5,975 unaffected siblings and parents), compiled by the Autism Sequencing Consortium (ASC), an international panel of scientists working on ASD. Control data were downloaded from the Genome Aggregation Database (GnomAD), which aggregates 125,748 exomes from human DNA sequencing studies. Using these case and control datasets, a new workflow was employed to construct ASD predictive models, such as... Figure 1 As shown.
[0129] 2. From 3498 unrelated case samples, they were divided into 2800 and 698 subjects using 5-fold Monte Carlo cross-validation (i.e., the cv1-cv5 discovery dataset), serving as the discovery and target datasets, respectively. Samples extracted from GnomAD could only be used as the control discovery dataset in burden testing because the GnomAD VCF file only provides summary statistics for variants without individual genotypes. Paired IBD estimation was performed using PINK, retaining only paired samples with pi-hat < 0.25. Then, control samples from the target dataset were selected from 5975 non-ASD subjects who were unrelated to the corresponding case samples in the target dataset.
[0130] 3. Regarding the issue of bloodline, Figure 2 PCA analysis of the ASD dataset was plotted. No significant outliers were found in the ASD dataset. Furthermore, based on previous findings by Guo, Michael H, et al., the impact on burden test results is relatively small when the majority of samples in both ASD and GnomAD are of European ancestry.
[0131] 4. Following quality control and rare synonym reconciliation between cases and controls in the datasets described in (2), (3), and (4), the corresponding QQPlots are as follows: Figure 5 As shown. The minimum lambda achieved when the true probability of variation in the control dataset (GnomAD) is greater than 0.7 and the VQSRTranche of the case dataset (ASD) is equal to 100%. 95 The value is 3.2. This indicates a relatively high lambda. 95 The effect size (BETA) of ASD-related genes will be exaggerated in the gene pooling statistics. This would be a significant problem if the purpose of this invention is to identify statistically significant ASD-related genes. However, the purpose of this invention is to calculate the PRS scores of subjects in the target dataset based on gene pooling statistics. Therefore, the exaggeration of the effect size will only increase the mean and variance of the PRS scores and will not affect the validity and predictive accuracy of the PRS when used to distinguish between high-risk and low-risk ASD subjects. After reading deep coordination, the output .bed file consists of 2,445,163 sites.
[0132] 5. After coordinating and finding consistency between the variants of ASD in the dataset and those in the control subjects, variants in the case subjects of the dataset were identified through VEP annotation. The variant results and the final results of the PolyPhen scores are composed as follows: Figure 6As shown. In this example, we assume that those rare loss-of-function (LoF) variants (i.e., QV) have a greater adverse effect on the ASD phenotype, and therefore only these eligible variants are retained for burden testing. Table 5 lists the criteria for being a rare LoF variant.
[0133] Table 5 Screening criteria for rare LoF variants (qualified variants QV)
[0134]
[0135] 6. Using the program "make_snp_file.py", the total number of QVs identified in the ASD and control subjects in the dataset was found to be 151,708 in 13,595 genes and 148,264 in 13,994 genes, respectively. As described in Example 1 (11), these two files recording QVs by gene were input into another program "count_qv.py" to output the number of QVs (burden count) for each gene in the ASD and control samples, respectively. The statistical summary of the burden counts for cases and controls is shown in Table 6.
[0136] Table 6 summarizes the total number of eligible variants for each gene in the ASD and GnomAD samples found in the dataset.
[0137] average SD maximum Case (ASD) 3.76 6.23 323 (GnomAD) 123.17 172.25 2,450
[0138] 7. Gene summary statistics were generated by the program "Burden.R". Five-fold cross-validation (CV) revealed the effect size of the BETA for assessing the risk of ASD for all genes in the dataset with QV>1, as shown below. Figure 7 As shown. In the results of 5-fold cross-validation (CV), the mean BETAs were 0.660, 0.658, 0.662, 0.665, and 0.658, with SDs of 1.17, 1.18, 1.17, 1.16, and 1.16, respectively. Based on these data, the distribution of rare LoF variants in the ASD samples is very consistent, therefore the BETA bias estimated by 5-fold CV is small. Thus, the previous assumptions about ancestral composition are correct. These BETAs will be used as gene weights to calculate the gene-based PRS of subjects in the target dataset.
[0139] 8. To verify whether the gene-based PRS analysis method proposed in this invention can identify ASD-related genes with large effects (BETA), 914 genes associated with ASD as declared by the Simons Foundation Autism Research Initiative (SFARI) were used as positive controls.
[0140] 9. Based on the gene summary statistics generated in (7) above, genes with a genome-wide importance threshold (p value < 2.5e-6) were selected and matched with 914 ASD-related genes in SFARI. Table 7 shows the genes that meet the genome-wide importance criteria found in the SFARI gene list (i.e., the genes contained in the upper left group of the 2x2 contingency table).
[0141] Table 7. Genes of genome-wide importance found in the SFARI gene list.
[0142]
[0143]
[0144] 10. On average, 22 genome-wide important genes were found in the SFARI gene list of each CV dataset, while a total of 37 unique genome-wide important genes were found in the SFARI gene list of the 5-fold CV dataset (as shown in Table 7). To check whether this is a random event, we defined the null hypothesis H0 as the discovery of N ASD genes among 914 SFARI genes being a random event. A 2x2 contingency table was constructed for all 5-fold CV datasets, as shown in Table 8.
[0145] Table 8. Summary of PRS analysis results for the five CV findings and the target dataset.
[0146]
[0147] Table 9 2x2 Contingency Table
[0148]
[0149] 11. The hypothesis was verified using Fisher's exact test. Based on the p-values listed in Table 8, rejecting the null hypothesis H0 was statistically significant in all CV datasets (except CV4). In most cases, finding this number of genome-wide important genes in the SFARI gene list is not a random event. We can conclude that the gene-based PRS analysis method of this invention can identify well-known ASD genes.
[0150] 12. Based on the gene summary statistics described in 6 and 7 above, the gene-based PRS scores of the ASD and GnomAD samples in the target dataset were calculated using the user-friendly, self-developed program “gbprs.py”. This program performed all the steps mentioned in (6) to (12) of Example 1.
[0151] 13. In addition to PRS, the program "gbprs.py" can also generate an auxiliary output file – a gene burden map (*.gbm) – which contains a data frame with sample IDs and gene names as rows and columns, respectively. The value of the data frame is the gene burden count for each sample. In another embodiment, this data frame will be used to build a deep learning model to predict the relative risk of ASD.
[0152] 14. The data frame in the 5-fold CV target dataset is obtained through... Figure 8 The violin plot visualization shows the distribution of the number of genes (QV>1) in each sample of the target dataset.
[0153] 15. Sort the PRS of the samples in the target dataset and assign them to 10 bins on average. For each bin, calculate the probability that a sample in that bin is an ASD sample. For ease of explanation, Figure 9 The estimated PRS of the CV1 target dataset is used to create a percentile plot with the probability of ASD samples on the y-axis and the permutation of PRS on the x-axis.
[0154] 16. Through Figure 9 We can estimate the relative increase in ASD risk between target samples with the lowest 10% PRS and those with the highest 10% PRS. This increase in ASD risk from 0.51 to 0.86 represents an approximately 1.7-fold increase in the relative risk of having ASD between the samples in these two bins. This is an encouraging result because we can successfully distinguish samples with relatively high ASD risk from those with relatively low ASD risk based solely on the calculated target sample PRS.
[0155] 17. Repeat similar evaluations for other CV datasets. Table 8 summarizes all PRS analysis results for the five CV datasets, including ROC-AUC, PRS variance interpretation (pseudo-R). 2 The data also showed a relative increase in ASD risk from 10% to 90%. No significant differences were found among the five CV target datasets. PRS's predictive performance was very stable across these datasets.
[0156] 18. In this embodiment, the mean ROC-AUC of the target dataset's PRS was 63.1%. Although this is not very high, its relative ASD risk was increased by approximately 1.6-fold between the lowest 10% and the highest 90% of subjects with gene-based PRS, respectively. This 1.6-fold increase in relative ASD risk is sufficient to screen high-risk ASD subjects for early diagnosis and intervention.
[0157] 19. In a preferred embodiment, when the traditional SNP-based PRS is combined with the gene-based PRS of the present invention to establish a predictive model that merges the two types of PRS using machine learning methods such as SVM or XGBoost, the increase in ROC-AUC and relative ASD risk will be further enhanced.
[0158] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "other embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0159] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from the principles and spirit of the present invention, the scope of which is defined by the claims and their equivalents.
Claims
1. A PRS method for predicting autism and neurodevelopmental disorders, characterized in that, Improving the accuracy of predicting the relative risk of underlying diseases in subjects by incorporating the genetic effects of rare loss-of-function (LoF) variants (QV) with MAF < 0.1% and common variants with MAF > 0.1%; including the following steps: (1) For each gene in the VCF dataset used to construct the prediction model, a burden test was performed; the genetic effect between each gene and the disease was estimated according to the formulas listed below, and the gene weights were used to determine the genetic effect. w j The Log Odd Ratio is used to quantify these genetic effects, serving as the weight of each gene when calculating the gene-based PRS. j is a number from 1 to M, where M is the number of genes; R0 j R1 j These are the counts of the number of people belonging to the control groups BIN0 and BIN1, respectively. S0 j S1 j These are the number of cases belonging to the BIN0 and BIN1 groups, respectively. The BIN0 group includes individuals whose gene j detects one or more QVs; Group BIN1 consists of individuals in which QV cannot be detected by gene j; (2) Calculate the standard error SE and p-value of each gene weight (Log Odd Ratio): , ; (3) Calculate the gene-based PRS scores of all individuals in the target VCF; Let i be a number from 1 to N, N be the number of individuals, j be a number from 1 to M, and M be the number of genes. Calculate individual T using the following formula. i The obtained gene-based PRS score PRS Ti : ; When no individual T is detected i When gene j contains QV, ; When individual T is detected i When gene j contains more than one QV, ; The gene weight w j The burden test is calculated by detecting the number of people with loss-of-function (LoF) rare variants of each gene in the case group from the VCF dataset generated by WES sequencing, and then comparing it with the relevant number in the control group. The weight of the genetic effect between each gene and the disease is estimated according to the methods described in steps (1) to (3) above, so as to construct a predictive model for the disease. Through supervised learning, the gene-based PRS score estimated based on the rare variant and the SNP-based PRS score estimated based on the common variant are combined to obtain the pseudo-R. 2 The variance represented by the PRS score has higher predictive accuracy.
2. The PRS method according to claim 1, characterized in that, The process of the burden test is as follows: (1) By coordinating the sequencing depth of the VCF datasets of the case group and the control group, only the variants that meet the sequencing depth conditions are retained. The variants must have more than 90% of the individuals with a sequencing depth of more than 10X. Based on this condition, the two VCF datasets are filtered by bedtools. (2) By coordinating the variant quality values in the VCF datasets of the case group and the control group, and finding the relevant thresholds through QQPlot, the two VCF datasets were filtered. (3) Annotate variants in the VCF dataset using the Variant Effect Predictor (VEP); (4) In the two VCF datasets, rare variants (QV) that cause protein truncation were selected; (5) Count the number of QV detected for each gene in each individual in the case group and the control group respectively; (6) In the case group and the control group, each individual was assigned to the BIN0 or BIN1 group according to the number of QV detected by the gene, so as to obtain the values of R0, R1, S0, S1. (7) According to the formula listed in step (1) of claim 1, estimate the weight of the genetic effect between each gene and the disease, thereby constructing a predictive model for the disease.
3. The PRS method according to claim 2, characterized in that, Consequences of the VEP variants that lead to protein truncation include splice acceptors and donors, stop, gain, frameshift, stop loss, start loss, in-frame insertion and deletion, and misspelling.
4. The PRS method according to claim 2, characterized in that, In the burden test, variant quality values in the case and control VCF datasets were reconciled, with a variant quality threshold of 1.
0.
5. The PRS method according to claim 1, characterized in that, The rare variant (QV) was detected by DNA sequencing, which included NGS / 3GS.
6. The PRS method according to claim 1, characterized in that, We selected rare variants with a loss of function (LoF) of MAF < 0.1%, and only retained variants with a sequencing depth of 10X or higher in more than 90% of individuals as target VCF files.
7. The PRS method according to claim 1, characterized in that, The process for calculating gene-based PRS for all individuals in the target VCF file: (1) Filter the variants of the target VCF file and retain only the variants with a sequencing depth of more than 10X in more than 90% of the individuals; (2) VEP annotation of the target VCF file to obtain the consequences of each variant; (3) Using the variant consequences annotated by VEP, a Python program was used to screen for rare loss-of-function (LoF) variants as qualified variants QV for each individual and each gene; (4) Based on whether each gene of each individual contains more than one QV, the relevant results are obtained. The numerical value, the stated The value can be 0 or 1; (5) Based on the prediction model and related gene weights W calculated from the VCF dataset j and the subject's information derived from the target VCF file. The numerical value, the gene-based PRS score for an individual, can be calculated using the following formula: 。