A method and system for assessing the genetic risk of complex diseases using multiple genes.
By employing genome-wide association studies and multi-level site screening strategies, combined with machine learning algorithms, a multi-gene disease risk rating model for complex diseases was constructed. This model addresses the shortcomings of multi-gene genetic risk prediction models for complex diseases in European and American populations, enabling more accurate disease risk stratification and early screening.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XUKANG MEDICAL SCI & TECH (SUZHOU) CO LTD
- Filing Date
- 2022-07-28
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies in Europe and the United States have not adequately considered suitable computational biology algorithms for predicting the risk of polygenic inheritance of complex diseases in populations. This has resulted in poor population screening and stratification, as well as problems of ineffective screening and overdiagnosis.
By employing genome-wide or whole-exome association studies combined with a multi-level locus screening strategy, machine learning algorithms were used to construct a multi-gene disease risk rating model for complex diseases. Through stepwise logistic regression analysis and optimal model selection criteria, highly correlated SNP loci were screened out to establish a PRS model.
The resulting model showed good disease risk stratification in different populations, improved the accuracy of early screening for complex diseases, reduced ineffective screening and overdiagnosis, and is suitable for risk assessment of diseases with a polygenic pattern, such as endometriosis and polycystic ovary syndrome.
Abstract
Description
Technical Field
[0001] This invention relates to the fields of biotechnology and medicine, specifically to methods and systems for screening gene mutation sites associated with the risk of complex diseases, constructing multi-gene genetic risk rating models for complex diseases, and predicting the risk of disease onset.
Background Technology
[0002] Complex diseases (such as coronary heart disease, diabetes, endometriosis, polycystic ovary syndrome, breast cancer, prostate cancer, premature ovarian failure, and premature ovarian insufficiency) are caused by the combined effects of environmental and genetic factors. They exhibit significant genetic heterogeneity and phenotypic complexity, and have a high incidence rate, seriously affecting people's physical and mental health. Increasing research suggests that genetic factors play a crucial role in the pathogenesis of complex diseases. The genetic factors of human complex diseases can generally be divided into the following four categories:
[0003] 1) Monogenic patterns: Such as Mendelian genetic diseases, where a single gene mutation can cause the disease, exhibiting high penetrance. Because these mutations result in severe phenotypes, they are often eliminated during evolutionary history, making them generally rare, with a population frequency of <0.1% (the impact of the disease phenotype on reproduction and lifespan may vary). Currently, whole exome sequencing (WES) or whole genome sequencing (WGS) is often used to study the effects of these genetic factors on the disease.
[0004] 2) Oligogenic pattern: Two or more gene / locus variations act together to cause the disease / phenotype. These variations have moderate genetic effects, often exhibit incomplete penetrance, and usually have a population frequency between 0.1% and 5%.
[0005] 3) Polygenic pattern: Variations at multiple genes / loci (each with a minor effect) interact with the environment to cause the disease / phenotype. The variations involved are usually common variations with a population frequency greater than 5%. Currently, SNP genotyping arrays are frequently used together with genome-wide association studies (GWAS) to identify risk genes / loci associated with these diseases / phenotypes. A polygenic risk score model (PRS; also called a genome-wide polygenic score, GPS) is then used, combined with environmental factors, to assess the risk of developing the disease / phenotype. This pattern allows for meaningful screening: stratifying the population by disease risk, identifying high-risk individuals, and enabling early intervention or health management to effectively delay the onset and progression of disease phenotypes. Compared to the monogenic pattern, the polygenic pattern has a higher positive detection rate.
[0006] 4) Other types of variations: such as copy number variation (CNV), chromosomal structural abnormalities, chromosomal aneuploidy and numerical abnormalities, may all be genetic causes.
[0007] A polygenic risk score (PRS) is a score calculated using an individual's genome sequencing data to measure the risk of an individual developing a particular disease. Typically, a PRS is a weighted sum of trait-related alleles across multiple gene loci, usually weighted using the effect values of each relevant allele determined by genome-wide association analysis. Currently available PRS calculation methods include LDpred and PRSice, many of which involve using linkage disequilibrium (LD) rules for SNP screening. For example, the coord method in LDpred software integrates standardized summary statistics with an LD (linkage disequilibrium) reference file to obtain an integrated HDF5 file. LDpred software then uses this integrated file as input to calculate the weight value of each SNP (single nucleotide polymorphism); and then weights the genotype data of the individual whose risk needs to be predicted based on these weight values, thus obtaining the individual's polygenic risk score. The PRSice software involves identifying the most significantly relevant SNPs in each artificially defined LD region and using a series of P-value thresholds for analysis to screen SNP sites for inclusion in the model and construct the PRS model.
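The region-based selection followed by P-value thresholding described for PRSice can be illustrated with a minimal Python sketch. This is not the PRSice implementation: fixed-size genomic windows stand in for the artificially defined LD regions, and the record keys (`id`, `chrom`, `pos`, `p`), the window size, and the threshold series are all assumptions made for illustration.

```python
def top_snp_per_region(snps, region_size=250_000):
    """Keep the most significant SNP in each fixed-size window of each chromosome.

    Fixed windows are an assumed stand-in for LD regions; real tools
    derive regions from LD estimates in a reference panel.
    """
    best = {}
    for snp in snps:
        key = (snp["chrom"], snp["pos"] // region_size)
        if key not in best or snp["p"] < best[key]["p"]:
            best[key] = snp
    return list(best.values())


def threshold_sets(snps, thresholds=(5e-8, 1e-5, 1e-3, 0.05)):
    """For each P-value threshold, list the ids of the SNPs that pass it."""
    return {t: sorted(s["id"] for s in snps if s["p"] < t) for t in thresholds}


# Hypothetical association results (ids, positions and p-values are made up).
snps = [
    {"id": "rsA", "chrom": 1, "pos": 10_000, "p": 1e-9},
    {"id": "rsB", "chrom": 1, "pos": 120_000, "p": 1e-4},  # same window as rsA
    {"id": "rsC", "chrom": 1, "pos": 600_000, "p": 2e-4},
    {"id": "rsD", "chrom": 2, "pos": 50_000, "p": 0.03},
]
index_snps = top_snp_per_region(snps)
candidate_sets = threshold_sets(index_snps)
```

Each entry of `candidate_sets` is one candidate SNP set; a tool of this kind would then build and compare one PRS model per threshold.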
[0008] Current research on the polygenic genetic risk of complex diseases largely focuses on populations in Europe and America. PRS risk prediction in these populations has demonstrated the practical significance of PRS models in population disease risk stratification and early health intervention for at-risk individuals. However, the development and innovation of appropriate computational biology algorithms remain crucial in the discovery of polygenic risk loci, directly impacting the effectiveness of the final polygenic risk scoring model in population risk stratification.
[0009] Therefore, new PRS modeling methods are still needed in this field to facilitate population screening and stratification of polygenic genetic risks for complex diseases.
Invention Overview
[0011] The inventors have proposed a novel method for constructing polygenic disease risk rating models for complex diseases, based on genome-wide or whole-exome association studies and multi-level locus screening strategies, combined with machine learning algorithms. This method does not rely on conventional artificial linkage disequilibrium block partitioning and SNP screening. As shown in the examples, taking two different complex diseases (endometriosis and polycystic ovary syndrome) as examples, the polygenic disease risk rating models constructed using the method of this invention achieved significant population stratification effects and good predictive performance on the tested samples. Therefore, the models constructed using the method of this invention can generate risk assessment reports, improve early screening for complex diseases, effectively alert at-risk individuals to avoid or reduce the risk of complex diseases, and also help reduce ineffective screening and overdiagnosis and treatment in healthy individuals. Based on this, the inventors completed this invention.
[0012] Therefore, in a first aspect, the present invention provides a method for constructing a polygenic disease risk rating (PRS) model for complex diseases, and apparatus, systems, and computer program products for performing said method. In one embodiment, the method according to the invention includes:
[0013] (1) On the acquired training dataset, perform genome-wide or exome-wide association analysis between SNP sites and disease phenotypes, and select the SNP sites whose association p-value is less than the threshold PT;
[0014] (2) Group the SNP sites selected in step (1) by autosome, and use stepwise logistic regression analysis to perform fine screening within each group;
[0015] (3) Integrate the 22 groups of SNP sites obtained from the fine screening, and apply stepwise logistic regression analysis to perform a second screening of SNP sites;
[0016] (4) Construct a logistic regression model based on the risk SNP loci obtained from the second screening, and take the coefficient of each risk SNP locus in the regression model as its effect value;
[0017] (5) Establish the polygenic disease risk rating (PRS) model for complex diseases:
[0018] $$\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$$
[0019] where
[0020] $\mathrm{snp}_i$ represents the genotype of the i-th SNP locus in the sample, coded as 0, 1, or 2 for the homozygous non-risk genotype (0 risk alleles), the heterozygous genotype (1 risk allele), and the homozygous risk genotype (2 risk alleles), respectively;
[0021] $\beta_i$ is the effect value of the i-th SNP site determined in step (4);
[0022] n represents the total number of risk SNP sites.
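As an illustration of the PRS formula above, the following minimal Python sketch computes the weighted sum for one sample. The SNP identifiers and effect values are hypothetical placeholders, not values from Table 1 or Table 2.

```python
def polygenic_risk_score(genotypes, betas):
    """Return sum_i beta_i * snp_i, where snp_i is the risk-allele count (0, 1 or 2)."""
    score = 0.0
    for snp_id, beta in betas.items():
        dosage = genotypes[snp_id]
        if dosage not in (0, 1, 2):
            raise ValueError(f"{snp_id}: genotype must be 0, 1 or 2, got {dosage}")
        score += beta * dosage
    return score


# Hypothetical effect values (beta_i) and one individual's genotypes (snp_i).
betas = {"rs_example_1": 0.25, "rs_example_2": 0.5, "rs_example_3": 0.125}
sample = {"rs_example_1": 2, "rs_example_2": 1, "rs_example_3": 0}

prs = polygenic_risk_score(sample, betas)  # 0.25*2 + 0.5*1 + 0.125*0
```

The explicit genotype check reflects the 0 / 1 / 2 coding defined above; a sample with a missing or differently coded genotype would need handling that this sketch does not attempt.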
[0023] In a preferred embodiment, steps (2) and (3) employ the optimal model selection criterion as the screening and evaluation standard for SNP sites. In yet another preferred embodiment, in the stepwise logistic regression analysis of steps (2) and (3), the stepwise logistic regression analysis terminates when the removal and addition of variable SNP sites no longer leads to the generation of a better model.
[0024] In some preferred embodiments, the optimal model selection criterion is the Akaike information criterion (AIC). In such embodiments, preferably, the stepwise regression analysis in steps (2) and (3) uses the AIC statistic as the criterion, adding or removing variables so as to minimize the AIC, in order to obtain the optimal set of explanatory SNP variables.
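The AIC-guided stepwise procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the inventors' implementation: the logistic regression is fitted by plain gradient ascent in pure Python rather than by statistical software, the function names are hypothetical, and the toy data are constructed so that `snp_a` is associated with disease status while `snp_b` is not.

```python
import math


def fit_loglik(columns, y, iters=2000, lr=0.5):
    """Fit a logistic regression by gradient ascent and return its log-likelihood.

    `columns` is a list of genotype columns; an intercept is always included.
    """
    n = len(y)
    w = [0.0] * (len(columns) + 1)  # w[0] is the intercept

    def linear(i):
        return w[0] + sum(wj * col[i] for wj, col in zip(w[1:], columns))

    for _ in range(iters):
        grad = [0.0] * len(w)
        for i in range(n):
            err = y[i] - 1.0 / (1.0 + math.exp(-linear(i)))
            grad[0] += err
            for j, col in enumerate(columns, start=1):
                grad[j] += err * col[i]
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return sum(y[i] * linear(i) - math.log1p(math.exp(linear(i))) for i in range(n))


def aic(columns, y):
    # AIC = 2k - 2 ln L, with k = number of SNP coefficients plus the intercept.
    return 2 * (len(columns) + 1) - 2 * fit_loglik(columns, y)


def stepwise_aic(snps, y):
    """Add or remove one SNP at a time until no single move lowers the AIC."""
    selected, best = [], aic([], y)
    while True:
        moves = [selected + [s] for s in snps if s not in selected]
        moves += [[t for t in selected if t != s] for s in selected]
        if not moves:
            return selected, best
        scored = [(aic([snps[s] for s in m], y), m) for m in moves]
        score, move = min(scored, key=lambda t: t[0])
        if score >= best - 1e-9:
            return selected, best
        selected, best = move, score


# Toy data: 40 cases and 40 controls; snp_a is enriched in cases, snp_b is balanced.
y, snp_a, snp_b = [], [], []
for label, a, count in [(1, 1, 30), (1, 0, 10), (0, 1, 10), (0, 0, 30)]:
    for i in range(count):
        y.append(label)
        snp_a.append(a)
        snp_b.append(i % 2)

selected, best_aic = stepwise_aic({"snp_a": snp_a, "snp_b": snp_b}, y)
```

On this toy data the search keeps only `snp_a`, because adding the uninformative `snp_b` costs an extra parameter without improving the likelihood. The termination rule matches the one stated above: the search stops when neither adding nor removing a SNP yields a better model.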
[0025] In one embodiment, the disease is a complex human disease characterized by a polygenic pattern, such as cancer. In a preferred embodiment, the disease is endometriosis or polycystic ovary syndrome.
[0026] In another aspect, the present invention provides a PRS model for endometriosis that includes the risk SNP sites listed in Table 1, and its use for endometriosis risk stratification in a population or for predicting an individual's susceptibility to endometriosis.
[0027] In another aspect, the present invention provides a polycystic ovary syndrome (PCOS) PRS model comprising the risk SNP loci listed in Table 2, and its use in conducting PCOS risk stratification in a population or in predicting an individual's susceptibility to PCOS.
[0028] For complex diseases characterized by polygenic patterns, disease susceptibility is influenced by both genetic factors and environmental factors. The disease predictive ability of polygenic risk scores depends on, and reflects, the role of the genetic basis in disease development, i.e., the heritability of the disease; however, the PRS score alone cannot constitute a diagnosis of the disease. For individuals predicted to have high disease susceptibility based on the PRS model, on the one hand, further combining the individual's genetic background and environmental factors can be considered for disease prediction; on the other hand, the risk of disease occurrence can be reduced by providing appropriate medical advice and measures, such as a healthy lifestyle and regular testing.
[0029] Therefore, in another aspect, the present invention provides a disease risk management method, comprising the following steps:
[0030] (1) Using the PRS model established according to the method of the present invention, the individual's PRS score is obtained;
[0031] (2) Based on the baseline of disease incidence risk stratification in the population, predict an individual’s genetic susceptibility to the disease (i.e., the genetic predisposition to the disease).
[0032] (3) Optionally, based on an individual’s family genetic background and / or environmental and lifestyle survey, predict the individual’s risk of developing the disease over a period of time;
[0033] (4) Provide medical advice to high-risk individuals to manage their risk of developing the disease.
[0034] In another aspect, the present invention also provides a multi-gene locus joint disease risk analysis and assessment platform and its use in individual disease risk management. The platform includes: a sample preprocessing module; a multi-gene risk assessment module; a disease risk assessment module; a report presentation and user feedback module; and a data management and processing module.
[0035] The risk assessment method of this invention is not applicable to diagnosis, because the assessment conclusion cannot confirm whether a subject has a disease or indicate whether the patient will definitely develop a disease in the future. However, the risk assessment of this invention can improve the differentiation of at-risk populations, which is beneficial for achieving more accurate population-based disease risk avoidance.
Attached Figure Description
[0036] Figure 1 shows the flowchart of the multi-gene risk assessment method of the present invention.
[0037] Figure 2 shows the distribution of the PRS (polygenic risk score) in the Chinese population, including population density plots of PRS scores calculated using a 177-site model for the case and control groups of the Chinese population test set.
[0038] Figure 3 shows the incidence of endometriosis in the European population by PRS population percentile, including the distribution of endometriosis incidence along the PRS population percentile.
[0039] Figure 4 shows the incidence of polycystic ovary syndrome (PCOS) in the European population by PRS population percentile, including the distribution of PCOS incidence along the PRS population percentile.
Summary of the Invention
[0040] Definitions
[0041] When referring to a specific numerical value, the term "approximately" means the typical range of error for that value as known to those skilled in the art. In this document, it should be understood that this expression encompasses references to the specific numerical value itself. Therefore, for example, when referring to "approximately X," it also encompasses a specific reference to the specific numerical value X itself.
[0042] The terms polygenic risk rating, polygenic risk score, and PRS are used interchangeably herein and refer to a weighted sum of a plurality of single nucleotide polymorphisms (SNPs) associated with disease risk that are identified in a sample from an individual (e.g., an individual with disease or at risk of disease), wherein the disease effect value of each disease-risk-associated SNP (e.g., determined based on genome-wide or whole-exome association analysis or a constructed PRS model) serves as the weight of that SNP. PRS rating can be performed, for example, on the whole genome or on a predetermined set of loci (e.g., a predetermined set of SNPs). In some embodiments, the predetermined set of loci comprises a plurality of loci where one or more alleles are associated with increased disease risk, for example, loci that have one or more disease-risk-associated SNPs. In some embodiments, PRS is calculated on the predetermined set of SNPs. In some embodiments, the predetermined SNP set preferably includes at least 50, 150, 200, 250, 300, 350, or all of the SNPs selected from Table 1 for PRS rating of endometriosis. In other embodiments, the predetermined SNP set preferably includes at least 50, 150, 200, 250, 300, 350, or all of the SNPs selected from Table 2 for PRS rating of polycystic ovary syndrome.
[0043] In this document, the term "reference polygenic risk score" or "reference PRS" refers to a PRS value that is compared with the PRS of a test sample to, for example, make disease risk predictions and / or treatment decisions for the individual from whom the sample originated. For example, the reference PRS may be a PRS measured in a reference sample or reference population, and / or a predetermined value. In some embodiments, the reference PRS is a PRS cutoff value (also referred to herein as the population disease risk stratification baseline) that significantly distinguishes between diseased and unaffected individuals in a reference population based on the population distribution of disease incidence and PRS scores. In some embodiments, individuals with a PRS equal to or greater than the cutoff value have a higher genetic predisposition to disease compared to individuals with a PRS lower than the cutoff value. In other embodiments, individuals with a PRS equal to or less than the cutoff value have a lower genetic predisposition to disease compared to individuals with a PRS higher than the cutoff value. In some embodiments, the reference PRS is determined as, for example, a PRS value in the 1st to 10th percentiles of the reference population, such as the 5th percentile PRS value. In one implementation, the reference PRS is determined as, for example, a PRS value in the 95th to 100th percentiles of the reference population, such as the 95th percentile PRS value. In other implementations, the reference PRS is determined as, for example, the PRS value at any integer percentile from the 25th to the 99th percentile of the reference population (that is, the 25th, 26th, 27th, and so on up to the 98th or 99th percentile). In some implementations, the reference PRS is determined as the PRS value at the 50th percentile of the reference population, or the median / mean PRS value of the reference population. In some implementations, the reference population is preferably of the same ethnicity as the individual being tested.
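A percentile-based reference PRS of the kind defined in paragraph [0043] could be computed as in the minimal Python sketch below. The nearest-rank percentile definition, the function names, and the synthetic reference scores are assumptions made for illustration; the document does not specify how percentiles are to be computed.

```python
import math


def percentile_cutoff(reference_scores, percentile):
    """Return the PRS value at the given percentile (nearest-rank method, assumed)."""
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(reference_scores)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def at_or_above_cutoff(individual_prs, cutoff):
    """True if the individual's PRS is equal to or greater than the reference PRS."""
    return individual_prs >= cutoff


# Synthetic reference population: 100 PRS values 0.01, 0.02, ..., 1.00.
reference = [i / 100 for i in range(1, 101)]
cutoff_95 = percentile_cutoff(reference, 95)

high = at_or_above_cutoff(0.97, cutoff_95)
low = at_or_above_cutoff(0.40, cutoff_95)
```

Consistent with the definition above, an individual at or above the 95th-percentile cutoff would be treated as having a higher genetic predisposition than individuals below it; the choice of percentile remains a parameter of the embodiment.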
[0044] In this document, the term "penetrance" refers to the percentage of individuals with a given genotype who exhibit the expected phenotype.
[0045] In this document, the term "liability" refers to the risk of an individual developing a certain disease in polygenic inheritance, which is determined by the combined effects of genetic basis and environmental factors.
[0046] In this document, the term "susceptibility" or "genetic susceptibility" refers to the genetic factors that contribute to an individual's risk of developing a disease in polygenic inherited diseases, where several pathogenic genes with minor but cumulative effects constitute the genetic factors that cause an individual to develop a certain disease. The risk of an individual developing the disease determined by these genetic factors is called the individual's susceptibility.
[0047] In this document, the term "SNP (Single Nucleotide Polymorphism)" refers to a polymorphism caused by a single-nucleotide variation at a specific site in a chromosomal DNA sequence. The frequency of a SNP in a population is generally >1%. On average, there is one SNP every 300-1000 bp in the human genome. SNP data are currently available from several public databases, including, for example, http://cgap.ncbi.nih.gov/GAI; http://www.ncbi.nlm.nih.gov/SNP; and the Human SNP Database at http://hgbas.cgr.ki.sei or http://hgbase.interactiva.de/.
[0048] In this document, the term "SNP genotype" refers to the genotyping result of an SNP locus extracted from a sample using SNP genotyping technology, including the location of the SNP locus and the alleles located at that location. In one implementation, to construct a PRS model, a genome-wide SNP array is applied to extract genome-wide SNP genotype data from an individual sample. In another implementation, a whole-exome SNP array is applied to extract whole-exome-wide SNP genotype data from an individual sample.
[0049] In this document, the term "array" or "microarray" refers to a hybridizable array element, preferably a polynucleotide probe (e.g., an oligonucleotide), arranged in an ordered manner on a matrix. The matrix can be a solid matrix such as a glass slide or a semi-solid matrix such as a nitrocellulose membrane. As an example of an array, an SNP chip can be mentioned, in which the genotype of the SNP loci on the chip can be determined using the signal (usually a fluorescence signal) obtained after hybridization.
[0050] In this document, the term "clinical disease risk score" or "clinical risk score" refers to a risk assessment result obtained using any suitable clinical risk assessment procedure other than the PRS assessment described herein. Preferably, the clinical risk assessment does not involve genotyping at one or more gene loci. In some embodiments, the clinical risk assessment procedure may include obtaining one or more of the following information from the individual: age, the presence of clinical conditions in the individual that may be related to the occurrence of the disease, such as relevant medical history, family history of the disease or other related diseases (e.g., cancer), including the age of onset of first-degree relatives, previous tissue biopsy results, body mass index, alcohol consumption history, smoking history, exercise history, diet and / or ethnicity. In some embodiments, the clinical risk assessment procedure considers at least age, number of disease-related tissue biopsies, and relevant disease history of first-degree relatives.
[0051] In this document, the term "sample" refers to any sample derived from or originating from a human individual, such as bodily fluids (blood, saliva, urine, etc.), tissue biopsies, or tissues containing nucleic acids, especially DNA. Therefore, samples readily usable for SNP detection, such as tissue biopsies, feces, sputum, saliva, blood, and lymph, are suitable for this invention. Genomic DNA extraction and purification are preferably performed before the sample is used for SNP analysis.
[0052] The term "linkage disequilibrium" (LD) is used to describe the statistical association between two adjacent polymorphic genotypes. Typically, LD refers to the association between alleles at two loci in random gametes, assuming Hardy-Weinberg equilibrium (statistical independence) between the two gametes. LD can be measured using the correlation coefficient r². Two loci with an LD value of 1 are considered to be in perfect LD. Two loci with an LD value of 0 are said to be in linkage equilibrium. Those skilled in the art can readily identify SNPs in linkage disequilibrium with the SNPs of this invention, for example, by determining the logarithm of the odds (LOD) of the two loci. Therefore, in one embodiment, for the purposes of the PRS models of endometriosis and polycystic ovary syndrome generated by this invention, SNPs in linkage disequilibrium with the specific SNP loci of this invention (e.g., the SNP loci listed in Table 1 or Table 2) are also considered as alternatives.
Invention Details
[0053] Research has revealed that common and complex diseases such as endometriosis and polycystic ovary syndrome have a highly polygenic genetic basis, encompassing hundreds or thousands of genetic variations (or polymorphisms), each with a small individual effect on the formation of hereditary traits or diseases. The effects of these minor genes can accumulate to form significant disease phenotypic effects. Therefore, to obtain sufficient information to identify high-risk individuals, it is necessary to consider the genetic load conferred by the combination of disease-risk gene variations in an individual. A suitable approach is to assess an individual's genetic risk of disease using a polygenic risk score (PRS).
[0054] Through dedicated research, the inventors discovered that by applying a multi-level SNP risk site screening strategy, after initial screening of risky SNP sites at the whole-genome or whole-exome level using p-value thresholds, and combining this with specific two-stage screening (i.e., SNP screening at the autosomal monosomal level and SNP screening at the level of integration across all autosomes), based on stepwise logistic regression analysis and optimal model selection criteria, a high-performance set of SNP sites and related risk assessment models for disease risk prediction can be effectively obtained. As shown in the examples, the disease risk prediction model constructed using the method of this invention demonstrates good population disease risk stratification effects across different diseases and is suitable for individual disease risk prediction.
[0055] Therefore, the present invention provides a method for establishing a PRS model, comprising the following steps:
[0056] (1) Obtain training set samples for PRS model construction and extract SNP genotype features from the samples;
[0057] (2) Implement multi-level SNP risk site screening;
[0058] (3) Apply the selected SNP risk sites and construct a PRS model through machine learning.
[0059] In some embodiments, the method of the present invention further includes:
[0060] (4) Validate the constructed PRS model on the test set; and
[0061] (5) Optionally, a validated PRS model may be used to implement population disease risk stratification.
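The control flow of the multi-level SNP screening in step (2) can be outlined with the Python sketch below. It only shows how the three levels connect: the refinement step, which in the method is stepwise logistic regression, is passed in as a callable, and the stand-in used in the example (keep the first two ids alphabetically) is a placeholder with no statistical meaning. The function name, tuple layout, and data are hypothetical.

```python
from collections import defaultdict


def multilevel_screen(assoc, p_threshold, refine):
    """Three-level screening skeleton.

    assoc  : list of (snp_id, chromosome, p_value) from the association analysis
    refine : callable taking a list of SNP ids and returning the retained ids
             (stepwise logistic regression in the described method)
    """
    # Level 1: p-value threshold over the whole genome or exome, autosomes only.
    passed = [(snp, chrom) for snp, chrom, p in assoc
              if p < p_threshold and 1 <= chrom <= 22]

    # Level 2: refine separately within each autosome.
    by_chrom = defaultdict(list)
    for snp, chrom in passed:
        by_chrom[chrom].append(snp)
    per_chrom = {chrom: refine(ids) for chrom, ids in sorted(by_chrom.items())}

    # Level 3: merge the per-chromosome survivors and refine once more.
    merged = [snp for ids in per_chrom.values() for snp in ids]
    return refine(merged)


# Made-up association results; chromosome 23 is used to show a non-autosome being skipped.
assoc = [
    ("rs1", 1, 1e-6), ("rs2", 1, 0.2), ("rs3", 1, 1e-5),
    ("rs4", 1, 1e-4), ("rs5", 2, 1e-7), ("rs6", 23, 1e-9),
]
placeholder_refine = lambda ids: sorted(ids)[:2]  # NOT a statistical selection
risk_snps = multilevel_screen(assoc, p_threshold=1e-3, refine=placeholder_refine)
```

Keeping the refinement step as a parameter makes it possible to swap in any selection routine, for example an AIC-based stepwise regression, without changing the surrounding flow.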
[0062] The diseases applicable to this invention can be complex diseases characterized by a polygenic pattern, such as cancer, or they can be a disease subtype, such as a cancer subtype.
[0063] In another aspect, the present invention also provides a multigene risk assessment model for endometriosis and polycystic ovary syndrome, comprising SNP loci listed in Table 1 or Table 2. In yet another aspect, the present invention also provides an apparatus or system for constructing PRS models, population risk stratification, and disease risk rating.
[0064] The various aspects of the method of the present invention will be described in further detail below.
[0065] I. PRS Model Establishment Method
[0066] Training dataset for model building
[0067] In machine learning modeling, it is typically necessary to acquire input data for model building, as well as feature data extracted from the input data, based on the modeling objective. In this invention, the input data for building the PRS model includes at least: independent variables, such as SNP sites included in the model building and their corresponding feature values; and dependent variables: disease phenotypic classification features, such as whether or not one has the disease or a specific disease subtype.
[0068] In some implementations, in the PRS model establishment method of the present invention, a training set consisting of a plurality of known disease patients and healthy individuals can be selected, and SNP locus genotypic features and disease classification features are extracted from each individual in the training set as input data. The disease patients and healthy controls comprising the training set can be selected by those skilled in the art based on the specific disease type and other factors, according to appropriate inclusion criteria. Generally, it is preferred that the patient group and the healthy control group have a matched age distribution. Furthermore, the healthy controls included in the training set are preferably individuals clinically confirmed to be disease-free, and preferably have no family history of related diseases, especially no history of related diseases in their first-degree relatives. In addition, to maximize the predictive accuracy of PRS, it is also possible to consider that the patients and controls comprising the training set sample have the same characteristics as the target population or target individuals for risk prediction, such as belonging to the same ethnicity, and even having the same geographical origin.
[0069] Therefore, in one embodiment of the method according to the present invention, the training dataset for constructing the PRS model includes: SNP genotype feature data extracted from each individual sample in the training set and corresponding disease classification feature data, wherein the training set consists of a plurality of disease-affected individual samples and a plurality of healthy control individual samples. In some preferred embodiments, the SNP genotype features are SNP genotype features extracted from individual samples. In still some preferred embodiments, the extracted disease classification features are whether an individual has the disease. In yet another preferred embodiment, the disease-affected individuals and healthy control individuals constituting the training set belong to the same ethnicity or the same geographical region.
[0070] In association analysis, a large training set sample size helps reduce sampling error, improve the accuracy of effect size estimation, and identify genetic variation sites with small effects. Therefore, in one embodiment, the training set comprises at least 4,000 samples. Preferably, the training set consists of samples from at least 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 11,000, 12,000, 15,000, 20,000, 25,000, 30,000, or more individuals. In the training set, the ratio of patient individuals to healthy control individuals can be from 1:1 to 1:100, for example, approximately 1:1, approximately 1:5, approximately 1:10, approximately 1:20, approximately 1:30, approximately 1:40, approximately 1:50, approximately 1:60, approximately 1:70, approximately 1:80, approximately 1:90, or approximately 1:100. Alternatively, the ratio of the number of patient individuals to the number of healthy controls in the training set can be adjusted based on the incidence rate of the specific disease in the target population. For example, this ratio can be approximately equal to the incidence rate of the disease in the target population.
[0071] SNP genotyping
[0072] To obtain SNP genotypic characteristics, any SNP genotyping method known in the art can be used to genotype samples from individuals. Samples used for this purpose can be, for example, blood, serum, plasma, or tissue biopsies. Furthermore, synthetic nucleic acids, including, but not limited to, various forms of DNA, such as genomic DNA, cDNA, and mitochondrial DNA, can be extracted from the samples for genotyping analysis.
[0073] In some embodiments, the SNP genotyping analysis according to the present invention includes: extracting genomic DNA from samples (e.g., peripheral blood) from a plurality of selected patients with known disease and healthy individuals (e.g., training set individuals), sequencing the DNA, and detecting and comparing the genotype differences at each SNP locus between cases and healthy controls.
[0074] The SNP genotyping sequencing according to the present invention can be genome-wide. Furthermore, given that research has shown that coding regions often play a key role in the genetic phenotype of many human diseases, in some preferred aspects, the present invention also considers extracting SNP genotypes from the whole exome of training and / or test set samples. Exons are protein-coding regions, and exon SNPs herein also refer to coding variants. Exon SNP genotyping methods are known in the art, including, but not limited to, exon sequencing and exon SNP microarray detection.
[0075] In other respects, for certain specific diseases, disease-related risk loci or regions may already be available. In this case, if necessary, SNP genotyping can be performed on the samples at these available regions and / or loci, and the genotyping results can be used in the PRS model construction and / or PRS rating method of this invention.
[0076] Therefore, in one embodiment, the SNP sequencing according to the present invention can be genome-wide sequencing or whole-exome sequencing; or it can be sequencing performed on one or more specified chromosomes or specified chromosomal regions or SNP risk sites.
[0077] Any sample suitable for SNP genotyping can be used herein, including but not limited to, for example, blood, plasma, serum, tissue biopsy, body fluids, etc., collected from an individual. In one embodiment, the sample is whole blood, throat swab, plasma, serum, tissue biopsy, or a combination thereof. In some embodiments, the sample is a fresh or frozen sample. Nucleic acids, such as gDNA, can be extracted from these samples using any suitable cell genomic extraction technique known in the art. Nucleic acid extraction can also be performed using commercially available nucleic acid extraction kits (such as genomic extraction kits). The extracted nucleic acids (such as gDNA) may be purified, quantified, and fragmented as necessary prior to application to SNP genotyping (e.g., SNP microarray analysis).
[0078] Methods suitable for SNP sequencing are known in the art. For example, both SNP microarrays and NGS allow for the acquisition of SNP loci and their associated allele frequencies on the genome and its segments. Therefore, in principle, both technologies can be used in the method of this invention to provide raw SNP locus data for the genomic region to be analyzed. However, in some embodiments, SNP microarrays are preferred to rapidly determine the SNP genotype of the sample.
[0079] In some respects, peripheral blood genomic DNA can be extracted from the sample using conventional methods and sequenced using whole-genome or whole-exome sequencing, for example, using SNP chips such as the Affymetrix Axiom Array or the Illumina Human Exome Asian BeadChip. Based on the sequencing results, the genotypic differences between cases and healthy controls in the whole genome or whole exome in the test population (e.g., training set or test set) can be compared to identify possible disease-related SNP loci.
[0080] Before applying SNP genotyping results to model construction, the raw SNP genotype data needs to be analyzed and processed to transform it into features usable by the model. In one implementation, based on the determined actual SNP genotype and the risk-related SNP alleles determined by the OR value, the genotype feature value of an individual at that specific SNP locus is extracted, denoted as 0, 1, and 2, to describe whether the individual has a homozygous non-risk locus (containing 0 risk alleles), a heterozygous locus (containing 1 risk allele), and a homozygous risk locus (containing 2 risk alleles), respectively.
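The 0/1/2 encoding described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the two-letter genotype strings (e.g., "AG"), the function names, and the convention that the OR is reported for a designated effect allele are all assumptions made for the example.

```python
def risk_allele_from_or(effect_allele: str, other_allele: str, odds_ratio: float) -> str:
    """Pick the risk allele, assuming the OR is reported for `effect_allele`.

    If OR > 1 the effect allele raises risk; otherwise the other allele is
    treated as the risk allele (an assumption for this sketch).
    """
    return effect_allele if odds_ratio > 1.0 else other_allele


def encode_genotype(genotype: str, risk_allele: str) -> int:
    """Count copies of the risk allele in a diploid genotype such as 'AG'.

    Returns 0 (homozygous non-risk), 1 (heterozygous), or 2 (homozygous risk).
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype like 'AG', got {genotype!r}")
    risk = risk_allele.upper()
    return sum(1 for allele in genotype.upper() if allele == risk)
```

For example, with a risk allele of "G", the genotypes "AA", "AG", and "GG" are encoded as 0, 1, and 2, respectively.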
[0081] It should be understood that the input data (including SNP genotyping features and disease classification features) applicable to the method of the present invention can be provided in the form of a dataset, for example, a computer-readable dataset. Furthermore, it should also be understood that the acquisition method of these data and their original SNP genotyping data does not constitute any limitation on the present invention. For example, as described above, the input data can be obtained de novo from the target nucleic acid sample using any sequence information detection technology known in the art; alternatively, the user of the method of the present invention can directly provide a computer-readable medium containing the data or a data package generated on a commercial platform as input data for the establishment of the PRS model of the present invention.
[0082] Multi-level SNP risk site screening
[0083] In the method of this invention, the multi-level SNP risk site screening includes at least three steps: initial screening, fine screening, and final screening. During the initial screening, fine screening, and final screening processes, suitable statistical measures known to those skilled in the art can be used as screening criteria for SNP sites, including, but not limited to, chi-square test p-value, Pearson correlation coefficient, and / or information gain (KLIC). Alternatively, the model can be trained iteratively, selecting a subset of features for each iteration and scoring the retention or rejection of features based on the model's prediction performance; or the retention or rejection of features can be determined by ranking the weight coefficients of each SNP site in the obtained model.
[0084] 1) Initial screening
[0085] An association study is a population-level statistical analysis that examines genetic polymorphisms to obtain genotypes and then compares genotypes with phenotypes. An association study provides an assessment of the correlation between a genetic polymorphism and a phenotype, along with its significance (p-value). Therefore, based on the statistics of the association study (e.g., the odds ratio (OR) and / or significance (p-value)), SNP loci that may influence the phenotype can be identified.
[0086] Therefore, in one embodiment, the PRS model building method of the present invention includes: performing an association analysis of genetic variation and disease on a training dataset from a large number of known patients (cases) and healthy individuals (controls), and screening disease-related risk SNP loci based on the statistics of the association analysis.
[0087] In one implementation, based on the SNP locus characteristics obtained from genotyping and their corresponding disease classification characteristics (e.g., diseased vs. non-diseased), disease-related risk SNP loci are initially screened through association studies across the whole genome or whole exome.
[0088] In some implementations, the association analysis includes:
[0089] - Inputting the SNP locus features of the training set samples and their corresponding disease classification features, performing population-level association analysis, and determining the association between each SNP locus and the onset of the disease;
[0090] - Based on the statistics obtained from the association analysis (e.g., OR value, significance p-value), screening out the SNP loci most likely to affect the phenotype, thereby obtaining candidate disease-related risk SNP loci.
[0091] Many statistical methods for determining the correlation / association between phenotypes and biological markers such as SNPs are known and can be applied in this invention. Most of these association analyses can be carried out with computer software, including, but not limited to, PLINK, MAGMA, and GEMMA.
[0092] In one implementation, the population used for association analysis comprises at least 5,000 samples. In some implementations, the population comprises approximately a 1:1 to 1:100 ratio of diseased to healthy individuals; or diseased to healthy individuals matching the population disease incidence rate. In one implementation, the population comprises at least approximately 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, or 9,000 patient samples. In one implementation, the population comprises at least approximately 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, or 9,000 healthy individual samples.
[0093] In one implementation, a significance p-value threshold (PT) is set for screening SNP sites. The PT value can be selected based on the distribution of candidate risk SNP sites obtained from association analysis. In one implementation, the PT threshold is selected from: 1-9×10⁻³, 1-9×10⁻⁴, 1-9×10⁻⁵, and 1-9×10⁻⁶. In one implementation, genome-wide association analysis is used to identify candidate risk SNP sites, with the PT value set at 1-9×10⁻⁵; preferably, the PT value is set to 1×10⁻⁵. In one implementation, candidate risk SNP sites are obtained using whole-exome association analysis, and the PT value is set to be greater than 0.0001; preferably, the PT value is set to 0.001. In a preferred implementation, in the PRS modeling method of the present invention, SNP sites with a disease-related p-value of less than 0.001 obtained from association analysis are selected for subsequent model building steps.
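The initial screening can be illustrated with a small sketch. The text does not fix a particular test statistic, so the allelic chi-square test on a 2×2 allele-count table used here is only one common choice and is an assumption of this example, as are the function names and the input layout (0/1/2 genotype codes for cases and controls). In practice a dedicated tool such as PLINK would normally be used.

```python
import math
from typing import Dict, List, Sequence, Tuple


def allelic_association(case_codes: Sequence[int],
                        control_codes: Sequence[int]) -> Tuple[float, float]:
    """Return (odds ratio, p-value) for one SNP from 0/1/2 genotype codes."""
    a = sum(case_codes)                    # counted alleles in cases
    b = 2 * len(case_codes) - a            # other alleles in cases
    c = sum(control_codes)                 # counted alleles in controls
    d = 2 * len(control_codes) - c         # other alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                         # monomorphic SNP or empty group
        return math.nan, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return odds_ratio, p_value


def initial_screen(genotypes: Dict[str, Tuple[Sequence[int], Sequence[int]]],
                   pt: float = 1e-3) -> List[str]:
    """Keep SNPs whose association p-value is below the PT threshold."""
    kept = []
    for snp, (case_codes, control_codes) in genotypes.items():
        _, p_value = allelic_association(case_codes, control_codes)
        if p_value < pt:
            kept.append(snp)
    return kept
```

The default `pt=1e-3` mirrors the preferred p < 0.001 cut-off stated above; for genome-wide analysis a stricter value such as 1×10⁻⁵ would be passed instead.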
[0094] 2) Fine screening
[0095] In the method of this invention, the disease-related risk SNP loci initially screened will be divided into 22 groups according to autosomes; each group will be further screened using logistic regression model and stepwise regression analysis.
[0096] In one implementation, the stepwise regression analysis is a backward stepwise regression analysis. In another implementation, the stepwise regression analysis is a forward stepwise regression analysis.
[0097] In one embodiment, the fine screening includes:
[0098] (i) Group SNP sites by chromosome and fit a logistic regression model that includes all candidate risk SNP sites on a single chromosome.
[0099] (ii) In the backward stepwise regression analysis, SNPs are eliminated one at a time, in ascending order of their contribution to the dependent variable (i.e., disease occurrence), with a statistical test at each step, until no further SNP can be eliminated under the given criterion. Throughout the process SNPs are only considered for elimination; once an SNP has been eliminated, it is not reconsidered for inclusion in the regression model.
[0100] In a preferred embodiment, at each step of the backward stepwise regression analysis, the SNP locus with the smallest contribution to the dependent variable among all candidate SNP loci in the current model is determined according to the optimal model selection criterion, and the same criterion is used to decide whether to remove that SNP locus from the model.
[0101] If the model after removing the SNP site performs better than the model before the removal, the removal is carried out and the backward stepwise step is repeated; otherwise, the backward stepwise regression analysis is terminated and all SNP sites of the model before removal are returned as the disease-related risk SNP sites on that chromosome determined by fine screening.
[0102] 3) Rescreening (final screening)
[0103] To further optimize the set of SNP sites used for model construction, the 22 sets of disease-related risk SNP sites obtained from the fine screening can be integrated to construct a logistic regression model. Then, stepwise regression analysis can be applied to screen SNP sites again.
[0104] In one implementation, the stepwise regression analysis is a stepwise backward regression analysis. In another implementation, the stepwise regression analysis is a stepwise forward regression analysis.
[0105] In one implementation, the rescreening includes:
[0106] (i) For all candidate risk SNP sites obtained by fine screening, fit a logistic regression model between each SNP site and the dependent variable (i.e., disease occurrence), and select the best-performing model as the starting model for the subsequent forward stepwise regression analysis.
[0107] (ii) In the forward stepwise regression analysis, SNP sites are added to the model in descending order of their contribution to the dependent variable (i.e., disease occurrence), with a statistical test at each step, until no further SNP site can be added under the given criterion.
[0108] In a preferred embodiment, at each step of the forward stepwise regression analysis, the SNP site that contributes the most to the dependent variable among all remaining candidate SNP sites is determined according to the optimal model selection criterion; that is, including this SNP site in the model yields the best model performance relative to including any other candidate SNP site.
[0109] In another preferred embodiment, during the forward stepwise regression analysis, the optimal model selection criterion is used to decide whether to add an SNP site to the model. Preferably, if the model with the added SNP site performs better than the model before the addition, the addition is made and the forward stepwise step is repeated; otherwise, the forward stepwise regression analysis is terminated.
[0110] In the method of this invention, all SNP sites retained in the model at the termination of stepwise regression analysis are the set of SNP risk sites with optimal performance determined by rescreening.
[0111] 4) AIC evaluation criteria for SNP site screening
[0112] In the 1950s, American mathematicians S. Kullback and R. Leibler proposed using the expected value of the Boltzmann entropy relative to the probability function of the true model to measure the difference / distance between any candidate model and the "true model," also known as the KL distance. Around 1970, Japanese statistician Hirotugu Akaike derived the Akaike Information Criterion (AIC) based on the KL distance. Over time, for statistical models that use the Maximum Likelihood Estimation (MLE) method for model parameter estimation, AIC has been recognized as a criterion for comparing and determining the 'best' model. Although it is impossible to determine the absolute difference between any candidate model and the "true model," when comparing two or more candidate models, AIC can compare the relative differences between each candidate model and the "true model." The smaller the AIC value, the shorter the distance between the candidate model and the "true model" (in other words, the smaller the information loss).
[0113] Therefore, in some preferred embodiments of the present invention, the stepwise regression analysis in the fine screening and rescreening steps uses the AIC criterion: variables are removed from or added to the model so as to minimize the AIC value, and the set of SNPs giving the minimum AIC value is taken as the optimal set.
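For reference, the AIC is usually written in the following standard form. The text itself does not spell the formula out, so the notation here (k, m, and the maximized likelihood) is an assumption based on the textbook definition rather than a quotation.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Here \hat{L} is the maximized likelihood of the fitted model and k is the number of estimated parameters. For a logistic regression with m SNP sites and an intercept, k = m + 1, so adding an SNP lowers the AIC only if it raises the maximized log-likelihood by more than 1.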
[0114] In one embodiment of the method according to the present invention, the optimal model selection criterion is the AIC criterion, and the stepwise logistic regression backward analysis includes the following steps:
[0115] (i) For a set of candidate SNP sites (e.g., a set of candidate SNP sites on a single chromosome during fine screening; or a set of candidate SNP sites obtained by integrating 22 sets of fine screening SNP sites during rescreening), fit a logistic regression model containing all candidate SNP sites in the set as a candidate model for subsequent analysis and determine the AIC value of the model.
[0116] (ii) By deleting one SNP from the candidate model and calculating the AIC value of the model after deletion, the SNP that makes the smallest contribution to the dependent variable (i.e., disease occurrence) among all the SNPs constituting the candidate model is identified. That is, the deletion of this SNP results in the smallest model AIC value relative to the deletion of any of the other SNPs.
[0117] (iii) Compare the minimum model AIC value obtained from step (ii) with the AIC value of the candidate model from step (i), where:
[0118] If the AIC value decreases, the minimum contributing SNP site is removed from the candidate model to obtain a new candidate model, and steps (ii) to (iii) are repeated; or
[0119] If the AIC value no longer decreases, the stepwise regression backward analysis is terminated, and the set of SNP sites constituting the current candidate model is output as the disease-related risk SNP sites obtained through screening.
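Steps (i) to (iii) of the backward analysis can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: `aic_of` is a hypothetical callable that fits a logistic regression on the given SNP subset and returns its AIC (model fitting is deliberately left out), `toy_aic` is an artificial stand-in with made-up values used only to exercise the loop, and stopping once a single SNP remains is a choice of this sketch.

```python
from typing import Callable, List, Sequence

AicFn = Callable[[Sequence[str]], float]


def backward_stepwise(candidates: Sequence[str], aic_of: AicFn) -> List[str]:
    """Greedy backward elimination that stops when the AIC no longer decreases."""
    current = list(candidates)
    current_aic = aic_of(current)                       # step (i): full model
    while len(current) > 1:                             # keep at least one SNP (assumption)
        # Step (ii): try deleting each SNP; the weakest SNP is the one whose
        # deletion gives the smallest AIC.
        trials = [(aic_of([s for s in current if s != snp]), snp) for snp in current]
        best_aic, weakest = min(trials, key=lambda t: t[0])
        # Step (iii): accept the deletion only if the AIC decreases.
        if best_aic < current_aic:
            current.remove(weakest)
            current_aic = best_aic
        else:
            break
    return current


def toy_aic(subset: Sequence[str]) -> float:
    """Artificial AIC for illustration only; not derived from real data."""
    informative = {"snp_a", "snp_b"}                    # hypothetical SNP names
    deviance = 100.0 - 10.0 * len(set(subset) & informative)
    return 2.0 * (len(subset) + 1) + deviance           # 2k + deviance, k = SNPs + intercept
```

With the toy criterion, `backward_stepwise(["snp_a", "snp_b", "snp_c", "snp_d"], toy_aic)` drops the two uninformative SNPs and keeps `snp_a` and `snp_b`. In fine screening the function would be called once per chromosome group.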
[0120] In another embodiment of the method according to the present invention, the optimal model selection criterion is the AIC criterion, and the stepwise logistic regression forward analysis includes the following steps:
[0121] (i) For the set of candidate SNP sites (e.g., during fine screening, the set of candidate SNP sites on a single chromosome; or during rescreening, the set of candidate SNP sites obtained by integrating 22 sets of fine screening SNP sites), fit a logistic regression model of each SNP site in the set with the dependent variable (i.e., disease occurrence), calculate and compare the AIC values of each model, and select the SNP site with the smallest AIC value to establish a logistic regression model as a candidate model for subsequent analysis;
[0122] (ii) By introducing a new SNP site into the candidate model and calculating the AIC value of the model after adding the site, the SNP site that makes the greatest contribution to the dependent variable (i.e., disease occurrence) is selected from the remaining SNP sites. That is, the SNP site that leads to the smallest AIC value after adding it to the candidate model compared to adding any of the other SNP sites.
[0123] (iii) Compare this minimum AIC value with the AIC value of the candidate model before adding the new SNP site, where:
[0124] If the AIC value decreases, the SNP locus with the largest contribution is added to the candidate regression model to obtain a new candidate regression model, and steps (ii) to (iii) are repeated; or
[0125] If the AIC value no longer decreases, the stepwise regression forward analysis is terminated, and the current candidate model and the set of SNP sites constituting the model are output as the disease-related risk SNP sites identified through screening.
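The forward analysis in steps (i) to (iii) follows the same pattern in the opposite direction. As before, this is only a sketch: `aic_of` is a hypothetical callable that fits a logistic regression on a SNP subset and returns its AIC, and `toy_aic` is an artificial criterion with made-up values used only to show the control flow.

```python
from typing import Callable, List, Sequence

AicFn = Callable[[Sequence[str]], float]


def forward_stepwise(candidates: Sequence[str], aic_of: AicFn) -> List[str]:
    """Greedy forward selection that stops when the AIC no longer decreases."""
    remaining = list(candidates)
    if not remaining:
        return []
    # Step (i): start from the single-SNP model with the smallest AIC.
    first = min(remaining, key=lambda s: aic_of([s]))
    selected = [first]
    remaining.remove(first)
    current_aic = aic_of(selected)
    while remaining:
        # Step (ii): find the SNP whose addition gives the smallest AIC.
        best = min(remaining, key=lambda s: aic_of(selected + [s]))
        best_aic = aic_of(selected + [best])
        # Step (iii): accept the addition only if the AIC decreases.
        if best_aic < current_aic:
            selected.append(best)
            remaining.remove(best)
            current_aic = best_aic
        else:
            break
    return selected


def toy_aic(subset: Sequence[str]) -> float:
    """Artificial AIC for illustration only; not derived from real data."""
    informative = {"snp_a", "snp_b"}                    # hypothetical SNP names
    deviance = 100.0 - 10.0 * len(set(subset) & informative)
    return 2.0 * (len(subset) + 1) + deviance           # 2k + deviance, k = SNPs + intercept
```

With the toy criterion, `forward_stepwise(["snp_c", "snp_a", "snp_b", "snp_d"], toy_aic)` selects `snp_a`, then `snp_b`, and stops because adding either remaining SNP would raise the AIC.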
[0126] The iterative model training and SNP site screening described above can be performed using algorithms in the R language for stepwise regression analysis. However, as those skilled in the art will understand, other stepwise regression algorithms and optimal model selection criteria known in the art can also be used to perform the fine and secondary screening steps of the present invention in the PRS model establishment method.
[0127] As shown in the embodiments, by using two-stage stepwise logistic regression modeling based on the optimal model selection criterion, the method of the present invention effectively reduces the multicollinearity of SNP sites included in the model and obtains the set of explanatory variables (i.e., the optimal SNP set) that has the best explanatory performance for the dependent variable (i.e., the disease phenotype).
[0128] Establishment of a multi-gene genetic risk assessment model for diseases
[0129] After obtaining the optimal SNP set, a PRS model can be constructed using machine learning algorithms. Applicable machine learning algorithms include, but are not limited to, logistic regression, support vector machine, random forest, decision tree, and K-nearest neighbor algorithm.
[0130] In one implementation scheme, PRS model construction includes:
[0131] (c1) A logistic regression model is constructed using the set of risk SNP loci obtained from the secondary screening, and the effect value of each risk SNP locus on the disease is determined by the corresponding coefficients of each SNP locus in the regression model.
[0132] (c2) Establish a polygenic disease risk rating (PRS) model for the disease.
[0133] $$\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$$
[0134] where:
[0135] n is the total number of risk SNP sites included in the model;
[0136] $\mathrm{snp}_i$ is the genotype of the i-th SNP locus, taking the integer value 0, 1, or 2, representing a homozygous non-risk locus (0 risk alleles), a heterozygous locus (1 risk allele), or a homozygous risk locus (2 risk alleles), respectively;
[0137] $\beta_i$ is the effect value of the i-th SNP site determined in step (c1).
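The PRS formula is a weighted sum of risk-allele counts and can be computed as in the sketch below. The function name, the dictionary-based inputs, and the choice to raise an error for a missing or invalid genotype are assumptions of this example; the text does not state how missing genotypes are handled.

```python
from typing import Mapping


def polygenic_risk_score(genotypes: Mapping[str, int],
                         betas: Mapping[str, float]) -> float:
    """Compute PRS = sum over SNPs of beta_i * snp_i.

    `genotypes` maps SNP id -> risk-allele count (0, 1, or 2);
    `betas` maps SNP id -> logistic regression coefficient from step (c1).
    """
    score = 0.0
    for snp, beta in betas.items():
        if snp not in genotypes:
            raise KeyError(f"missing genotype for {snp}")
        count = genotypes[snp]
        if count not in (0, 1, 2):
            raise ValueError(f"genotype for {snp} must be 0, 1, or 2, got {count!r}")
        score += beta * count
    return score
```

For instance, with hypothetical coefficients 0.2, -0.1, and 0.5 and genotypes 2, 1, and 0, the score is 0.2×2 - 0.1×1 + 0.5×0 = 0.3.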
[0138] PRS model validation
[0139] In one embodiment, the PRS model building method of the present invention further includes validating the built model on a test dataset.
[0140] As shown in the embodiments, the PRS model constructed using the PRS modeling method of the present invention also exhibits good disease stratification performance on the test dataset; and the stratification effect of the PRS model remains significant even when the ratio of positive samples (patient samples) to negative samples (healthy control samples) in the test set is reduced to approximately the incidence rate. The PRS models for endometriosis and polycystic ovary syndrome established by the method of the present invention are shown in Table 1 and Table 2, respectively.
[0141] Therefore, in one aspect, the present invention also provides a PRS rating model:
[0142] $$\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$$
[0143] where:
[0144] n is the total number of risk SNP sites listed in Table 1 or Table 2;
[0145] $\mathrm{snp}_i$ is the genotype of the i-th SNP locus listed in Table 1 or Table 2, taking the integer value 0, 1, or 2, representing a homozygous non-risk locus (0 risk alleles), a heterozygous locus (1 risk allele), or a homozygous risk locus (2 risk alleles), respectively;
[0146] $\beta_i$ is the effect value of the i-th SNP locus. Using the risk SNP locus set in Table 1 or Table 2 of this invention, those skilled in the art can readily establish PRS models for endometriosis and polycystic ovary syndrome applicable to a target population, determine the disease effect value of each risk SNP locus in that population, and derive the corresponding disease risk stratification baseline for the population. Therefore, the PRS model of this invention is applicable to screening for endometriosis and polycystic ovary syndrome and to population risk stratification.
[0147] II. Population-based disease risk stratification methods
[0148] In another aspect, the present invention provides a population disease risk stratification method, which includes:
[0149] (i) Construct a training set consisting of disease patients and healthy controls in the target population, and establish a disease risk rating (PRS) model for the target population according to the PRS model construction method of the present invention.
[0150] (ii) On the test set, apply the PRS model from step (i) to calculate the PRS values for all samples in the test set;
[0151] (iii) Determine the PRS percentile distribution of the population and the disease incidence rate at each PRS percentile, thereby obtaining the disease risk stratification baseline for the population.
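Step (iii) can be sketched as follows: samples are ranked by PRS, split into equal-sized percentile bins, and the observed disease incidence is computed per bin. The bin count, the function name, and the output layout are assumptions made for this illustration; the text does not prescribe a particular binning.

```python
from typing import Dict, List, Sequence


def stratification_baseline(scores: Sequence[float],
                            labels: Sequence[int],
                            n_bins: int = 10) -> List[Dict[str, float]]:
    """Return per-bin PRS upper bound and disease incidence.

    `labels` holds 1 for patients and 0 for healthy controls, aligned with `scores`.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    baseline = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:                      # fewer samples than bins
            continue
        cases = sum(labels[i] for i in idx)
        baseline.append({
            "percentile_low": 100.0 * b / n_bins,
            "percentile_high": 100.0 * (b + 1) / n_bins,
            "max_score": scores[idx[-1]],
            "incidence": cases / len(idx),
        })
    return baseline
```

An individual's PRS can then be mapped to the bin whose score range contains it, and the bin's incidence read off as the baseline risk for that stratum.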
[0152] III. Disease Risk Prediction Methods
[0153] In another aspect, the present invention provides a method for identifying individuals with an increased risk of disease (or increased genetic susceptibility to disease), the method comprising: applying the PRS risk rating model of the present invention to determine a polygenic risk score (PRS) for a disease on a sample from the individual, wherein when an individual's PRS is higher than a reference PRS, the individual is identified as potentially having an increased risk of disease. In one embodiment, the disease is a complex disease characterized by a polygenic pattern, for example cancer, endometriosis, or polycystic ovary syndrome, particularly endometriosis or polycystic ovary syndrome.
[0154] In some implementations, the reference PRS is a pre-specified PRS. In some implementations, the reference PRS is a baseline for disease risk stratification in a reference population. In one implementation, the baseline is determined based on the correspondence between PRS percentiles and disease incidence rates in the reference population. In one implementation, the baseline is determined as the 90th, 95th, or 98th percentile PRS value in the reference population; or as the 10th, 5th, or 2nd percentile PRS value in the reference population. In some implementations, the baseline is determined as the median PRS value in the reference population.
[0155] In some implementations, the reference population is a group of individuals with the disease; or a group of individuals with the disease and healthy individuals, in which the ratio of individuals with the disease to healthy individuals is determined based on the prevalence of the disease in the population of the tested individual's place of origin and / or region. In some implementations, the disease is endometriosis or polycystic ovary syndrome, and the method is a risk prediction method for endometriosis or polycystic ovary syndrome. In some such implementations, the individual is an Asian woman, for example, a woman of Chinese descent. A woman of Chinese descent may, for example, have at least 75% Chinese ancestry, for example, but not limited to, having at least three paternal and / or maternal grandparents of Chinese descent.
[0156] In one implementation, an individual's risk of developing a disease is predicted based on PRS percentiles. In one implementation, when an individual's PRS value is equal to or less than the 5th percentile PRS value of a reference population, the individual is identified as having a lower genetic susceptibility to the disease or a lower probability of developing the disease.
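The percentile-based comparison against a reference population can be sketched as below. The 95th and 5th percentile cut-offs are two of the options named in the text; the function name, the returned labels, and the use of the "inclusive" percentile definition from Python's `statistics` module are assumptions of this example.

```python
import statistics
from typing import Sequence


def classify_by_percentile(score: float,
                           reference_scores: Sequence[float],
                           high_pct: int = 95,
                           low_pct: int = 5) -> str:
    """Compare an individual's PRS with percentile cut-offs of a reference population."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(reference_scores, n=100, method="inclusive")
    high_cut = cuts[high_pct - 1]
    low_cut = cuts[low_pct - 1]
    if score > high_cut:          # higher than the reference PRS
        return "increased"
    if score <= low_cut:          # equal to or less than the low percentile
        return "lower"
    return "average"
```

Other baselines mentioned in the text, such as the 90th or 98th percentile or the median, can be obtained by passing different `high_pct` and `low_pct` values.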
[0157] In another aspect, the present invention also provides a method for multi-gene risk assessment of endometriosis or polycystic ovary syndrome, comprising the following steps:
[0158] 1) Perform genomic or exome association analysis on multiple known patients with the disease (cases) and healthy individuals (controls), and apply the multi-level locus screening strategy to identify risk SNP loci associated with the disease;
[0159] 2) Construct a multi-gene risk assessment model for diseases using machine learning algorithms;
[0160] 3) Validate the model on a test set consisting of known disease patient samples and healthy individual samples, and construct a population-stratified baseline for disease incidence risk in the population;
[0161] 4) Perform chip sequencing on target samples from individuals to be tested and preprocess the sequencing data to obtain genotype data of disease-associated risk loci;
[0162] 5) Locate risk sites for the target sample and evaluate them using the multi-gene risk assessment model from step 2) above;
[0163] 6) Based on the basic information filled in by the user, combine the assessment results of step 5) with the population stratification baseline of step 3) to perform individual risk stratification, and optimize the preliminary assessment results based on user feedback.
[0164] In a preferred embodiment, the polygenic risk assessment model for endometriosis or polycystic ovary syndrome includes 177 or 173 SNP risk sites or linkage disequilibrium SNP sites listed in Table 1 or Table 2, respectively.
[0165] IV. Products of the Invention for Risk Model Building and Risk Assessment
[0166] In some aspects, the present invention provides products for implementing any of the foregoing PRS model building and risk assessment methods of the present invention. The products of the present invention include, but are not limited to, devices, apparatuses, and systems. The devices, apparatuses, and / or systems of the present invention may consist of a plurality of modules or components implementing any of the methods of the present invention. In some embodiments, a "module" is a software object or routine (e.g., as an independent thread) that can be executed centrally on a single computing system (e.g., a computer program, a tablet computer (PAD), one or more processors). In other embodiments, a program implementing the methods of the present invention may be stored on a non-transitory computer-readable medium, forming part of the devices, apparatuses, and / or systems of the present invention, wherein the computer-readable medium contains computer program logic or code portions for implementing the methods of the present invention. In one embodiment, a dataset for PRS model building or PRS scoring of the present invention is stored on a non-transitory computer-readable medium, such as a hard disk or CD-ROM, which may exist independently of or form part of the devices, apparatuses, and / or systems of the present invention. In one embodiment, the devices, apparatuses, and / or systems of the present invention may be configured to communicate with at least one remote device or server, for example, to receive information from a server via a communication network and to transmit information to a server via a network.
[0167] While in some implementations the modules and methods described herein are preferably implemented in software, implementation in hardware or a combination of software and hardware is also possible and conceivable to those skilled in the art.
[0168] In addition to the modules / components described above, the apparatus, device, and / or system of the present invention may also include other components, such as components for acquiring nucleic acid samples (e.g., a genomic DNA extraction kit) and components for extracting SNP information (e.g., an SNP chip or sequencing device). In one aspect, the present invention provides an apparatus or system for constructing a polygenic disease risk (PRS) rating model, comprising:
[0169] One or more datasets, wherein a training dataset is stored, the training dataset containing SNP genotypic feature data and corresponding disease classification feature data extracted from each individual sample in the training set, wherein the training set consists of a plurality of diseased individuals and a plurality of healthy control individuals.
[0170] One or more computer-executable processors configured to perform the method steps of the present invention, preferably, the processors configured to perform the following operations:
[0171] (a) Access the dataset;
[0172] (b) On the dataset, perform an association analysis between SNP sites and the disease;
[0173] (c) For SNP sites that meet the PT threshold, group them by autosome and apply stepwise logistic regression analysis within each group to screen for SNP sites that contribute significantly to the model;
[0174] (d) Integrate the 22 sets of SNP loci obtained from step (c) and apply stepwise logistic regression analysis to determine the set of SNP loci with the best performance;
[0175] (e) Construct a logistic regression model from the optimal set of SNP sites obtained in step (d) and determine the β coefficient of each SNP site in the model;
[0176] (f) Establish a disease PRS rating model according to the following PRS formula.
[0177] $$\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$$
[0178] where:
[0179] n is the total number of risk SNP sites included in the model;
[0180] $\mathrm{snp}_i$ is the genotype of the i-th SNP locus, taking the integer value 0, 1, or 2, representing a homozygous non-risk locus (0 risk alleles), a heterozygous locus (1 risk allele), or a homozygous risk locus (2 risk alleles), respectively;
[0181] $\beta_i$ is the β coefficient of the i-th SNP site determined in step (e).
[0182] In one implementation, the disease is a complex disease characterized by a polygenic pattern, for example cancer, endometriosis, or polycystic ovary syndrome.
[0183] In one aspect, the present invention provides an apparatus or system for predicting an individual's risk of disease, the system comprising:
[0184] - At least one computer-executable processor, said processor being coupled to an electronic storage device containing an electronic representation of a predictor, said predictor being a polygenic disease risk rating (PRS) model generated by the PRS construction method according to the present invention or by applying the PRS construction system according to the present invention.
[0185] The processor is configured to receive test data, which includes the genotypic characteristics, extracted from a sample of the individual being tested, of the risk SNP loci used by the PRS model.
[0186] The processor is further configured to evaluate the test data using the electronic representation of the predictor and, based on the evaluation, output a disease PRS risk score for the individual from whom the test data originated.
[0187] Preferably, the processor is further configured to combine the individual PRS score output by the predictor with the individual's clinical disease risk score to determine the individual's disease susceptibility.
[0188] In one aspect, the present invention provides a non-transitory computer-readable medium storing computer program instructions that are executed by a computer or computer system to implement the steps of the model building method, population disease risk stratification method, or disease polygenic risk rating method of the present invention according to any of the foregoing.
[0189] In another aspect, the present invention provides a multi-gene locus joint disease risk analysis and assessment platform, comprising:
[0190] (a) Sample preprocessing module: used to sequence samples from individuals (e.g., SNP microarray sequencing) and preprocess the sequencing data to obtain genotypic data of disease risk SNP loci;
[0191] (b) Polygenic risk assessment module: used to perform the method steps of the present invention to obtain a preliminary disease risk assessment report for an individual and transmit it to the disease risk assessment module;
[0192] (c) Disease risk assessment module: further evaluates the preliminary assessment results through data learning and optimization, based on the preliminary individual disease risk assessment report transmitted by the polygenic risk assessment module and combined with the data fed back by the data management and processing module;
[0193] (d) Report presentation and user feedback module: used to present assessment reports, collect user feedback, and transmit it to the data management and processing module;
[0194] (e) Data management and processing module: used to manage user data and provide feedback to the disease risk assessment module.
[0195] Preferably, the preprocessing includes quality control of the sequencing data. For example, in the case of microarray sequencing, quality control can be performed according to the default values set by the microarray manufacturer to ensure the production of valid genotyping data. Qualified samples are then further processed to extract disease risk loci. Alternatively, if the data is unqualified, sequencing is repeated.
[0196] Preferably, the preliminary assessment is performed by calculating an individual's PRS score using a polygenic disease risk rating (PRS) model constructed according to the method of the present invention.
[0197] More preferably, the user's basic information includes the user's ethnicity, geographic region, gender, age and lifestyle, and / or the user's clinically relevant disease risk score.
[0198] Preferably, the multi-gene risk assessment module obtains a preliminary assessment report through initial assessment and presents it in the report presentation and user feedback module; the disease risk assessment module obtains a post-optimized assessment report through subsequent optimization assessment and presents it in the report presentation and user feedback module.
[0199] The device or system of this invention can be used for individual disease risk management. Therefore, in one aspect, this invention also provides a disease risk management method, comprising the following steps:
[0200] (1) Using the PRS construction and scoring apparatus or system according to the present invention, an individual's PRS score is obtained;
[0201] (2) Compare the individual's PRS score with the disease risk stratification baseline of the individual's population to predict the individual's genetic susceptibility to the disease (i.e., the genetic predisposition to develop the disease);
[0202] (3) Optionally, based on the individual's family genetic background and/or environmental and lifestyle surveys, predict the individual's risk of developing the disease over a period of time;
[0203] (4) Optionally, provide medical advice to high-risk individuals to manage their risk of developing the disease.
[0204] In another aspect, the present invention also provides compositions suitable for PRS rating of endometriosis or polycystic ovary syndrome, wherein the compositions comprise (or consist of) reagents or combinations thereof for genotyping, on samples from an individual, at least 85%, 90%, 95%, or 100% of the SNP risk sites listed in Table 1 or Table 2 or their linkage disequilibrium sites. Preferably, the compositions are in the form of a microarray chip. More preferably, the compositions comprise primer and/or probe sets for detecting the genotype of the SNP sites. For example, primer pairs suitable for amplifying the SNP sites can be designed according to methods known in the art, for example using the publicly available primer design software "Oligo". Similarly, probes suitable for detecting nucleic acids containing SNPs (e.g., amplicons containing SNPs) can be designed according to methods known in the art. Primers and/or probes can be labeled as needed, for example radiolabeled or labeled in other suitable ways (e.g., with fluorescent labels), to allow rapid identification of different amplicons.
[0205] In one embodiment, the present invention also provides the use of the compositions according to the invention in the preparation of products for risk stratification of endometriosis or polycystic ovary syndrome in a population or for predicting an individual's susceptibility to endometriosis or polycystic ovary syndrome. The products may be in the form of kits, which may be used alone; or in combination with the devices or systems or computer program products of the present invention.
[0206] Table 1: Chromosomal locations, ID numbers, reference alleles (REF), and alleles of 177 endometriosis SNP risk loci
[0212] Table 2: Chromosomal locations, ID numbers, reference alleles (REF), and alleles of 173 SNP risk loci for polycystic ovary syndrome (PCOS).
[0217] Examples
[0218] Example 1: Multilevel screening of SNP risk sites for endometriosis
[0219] Training set sample selection
[0220] The UK Biobank is currently the world's largest human genetic cohort database, collecting 15 million samples from 500,000 volunteers aged 40-69 in the UK. Its aim is to study the association between genetic factors, environmental factors, lifestyle habits, and major human diseases. This genetic data is available from the UK Biobank for biomedical research worldwide. Hundreds of research projects have been launched around this database, reporting new findings on various diseases, including cancer, heart disease, diabetes, stroke, polycystic ovary syndrome, and endometriosis.
[0221] 8,236 endometriosis patients (cases) were selected from the UK Biobank, and 8,236 age-matched healthy women clinically confirmed to be free of endometriosis were selected as controls. Their corresponding whole-genome genotyping data were obtained. All cases and controls were European women. The genetic data were obtained by genotyping whole-genome DNA samples from the subjects on the Affymetrix Axiom Array microarray, a platform that includes 850,000 SNP loci. For the SNP composition of the microarray, see http://www.affymetrix.com/analysis/downloads/na34/genotyping/Axiom_UKB_WCSG.na34.annot.csv.zip. The microarray data underwent quality control: all INDELs and all SNPs on non-standard chromosomes were filtered out, and the following SNP sites were excluded: (1) SNP sites with a genotyping call rate of less than 99%; (2) SNP sites with a minor allele frequency (MAF) of less than 0.05; and (3) SNP sites that deviated significantly from Hardy-Weinberg equilibrium (HWE) in the control population (HWE p-value < $10^{-6}$).
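The three exclusion criteria above can be sketched as a per-SNP filter. This is a minimal illustration under stated assumptions: the function names and inputs are hypothetical, and the HWE p-value is approximated with a one-degree-of-freedom chi-square test, whereas genotyping tools such as PLINK typically apply an exact test. The source does not state which HWE test was used.

```python
import math


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no deviation to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(call_rate: float, maf: float, control_counts: tuple,
              min_call_rate: float = 0.99, min_maf: float = 0.05,
              min_hwe_p: float = 1e-6) -> bool:
    """Keep a SNP only if it survives all three exclusion criteria."""
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_chi2_pvalue(*control_counts) >= min_hwe_p)


# Hypothetical SNP: well genotyped, common, and in HWE among controls.
print(passes_qc(0.995, 0.30, (250, 500, 250)))  # True
```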
[0222] Initial screening of SNP sites
[0223] Using PLINK software, the assoc chi-square test was selected, and default parameters were used to perform a population-level analysis of the association between SNPs and endometriosis on genotype and phenotype data from 16,472 samples in the training set. This analysis yielded the risk alleles for each SNP locus and the corresponding chi-square p-values. Candidate risk SNP loci were selected with a p-value < 0.001 as the cutoff.
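For orientation, the kind of allelic chi-square test used in this initial screening can be sketched in a few lines of Python. This is not PLINK's implementation: the function names are hypothetical, the allele counts are invented, and the test shown is a plain Pearson chi-square on a 2×2 allele-count table with one degree of freedom and no continuity correction.

```python
import math


def allelic_chi2(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int):
    """Pearson chi-square (1 d.f.) on allele counts; returns (chi2, p_value).

    Counts are numbers of A and B alleles (two per genotyped individual)
    in cases and in controls.
    """
    rows = (case_a + case_b, ctrl_a + ctrl_b)
    cols = (case_a + ctrl_a, case_b + ctrl_b)
    total = rows[0] + rows[1]
    if 0 in rows or 0 in cols:
        return 0.0, 1.0  # degenerate table: nothing to test
    observed = ((case_a, case_b), (ctrl_a, ctrl_b))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def is_candidate(p_value: float, cutoff: float = 0.001) -> bool:
    """Apply the p < 0.001 cutoff used for candidate risk SNP loci."""
    return p_value < cutoff


# Invented counts: allele A is enriched in cases (60% vs 40%).
chi2, p = allelic_chi2(60, 40, 40, 60)
print(round(chi2, 6), is_candidate(p))  # chi2 is 8.0; p is about 0.0047, so not a candidate
```

Under this sketch, the risk allele of a locus would be the allele whose frequency is higher in cases than in controls.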
[0224] Fine screening of SNP sites
[0225] The SNP loci identified in the initial screening were divided into 22 groups based on autosomes. Stepwise regression analysis in R was used to perform SNP locus screening for each group.
[0226] In short, stepwise regression backward analysis includes the following steps:
[0227] (i) For the set of candidate SNP sites on a single chromosome c, fit a logistic regression model containing all candidate SNP sites in the set as a candidate model for subsequent analysis, and determine the AIC value of the model.
[0228] (ii) By deleting one SNP from the candidate model and calculating the AIC value of the model after deletion, the SNP that contributes the least to the dependent variable (occurrence of endometriosis) among all the SNPs constituting the candidate model is identified; that is, the deletion of this SNP results in the smallest model AIC value relative to the deletion of any of the other SNPs.
[0229] (iii) Compare the minimum model AIC value obtained from step (ii) with the AIC value of the candidate model from step (i), where:
[0230] If the AIC value decreases, the minimum contributing SNP site is removed from the candidate model to obtain a new candidate model, and steps (ii) to (iii) are repeated; or
[0231] If the AIC value no longer decreases, the stepwise regression backward analysis is terminated, and the set of SNP sites constituting the current candidate model is output as the endometriosis-related risk SNP sites of chromosome c that have been finely screened.
[0232] The endometriosis-related risk SNP sites identified by fine screening on the 22 autosomes are then pooled.
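The control flow of the per-chromosome backward analysis can be sketched as follows. This is a simplified illustration, not the R procedure used in this example: `aic_of` is a hypothetical callable standing in for "fit a logistic regression on this SNP subset and return its AIC", and `toy_aic` is an invented additive surrogate used only so the sketch runs without a model-fitting library.

```python
from typing import Callable, FrozenSet, Iterable, Mapping

AicFn = Callable[[FrozenSet[str]], float]


def backward_select(candidates: Iterable[str], aic_of: AicFn) -> FrozenSet[str]:
    """Drop one SNP at a time for as long as the best deletion lowers the AIC."""
    current = frozenset(candidates)
    current_aic = aic_of(current)  # step (i): model with all candidate SNPs
    while current:
        # Step (ii): try each single deletion and keep the one with the smallest AIC.
        best_aic, drop = min((aic_of(current - {snp}), snp) for snp in sorted(current))
        if best_aic < current_aic:  # step (iii): AIC decreased, accept the deletion
            current, current_aic = current - {drop}, best_aic
        else:  # AIC no longer decreases: stop
            break
    return current


def fine_screen(by_chromosome: Mapping[int, Iterable[str]], aic_of: AicFn) -> FrozenSet[str]:
    """Run backward selection per chromosome group and pool the retained SNPs."""
    kept: FrozenSet[str] = frozenset()
    for chrom in sorted(by_chromosome):
        kept |= backward_select(by_chromosome[chrom], aic_of)
    return kept


# Invented surrogate: each SNP lowers the deviance by a fixed gain, and AIC adds 2 per SNP.
GAIN = {"rs1": 10.0, "rs2": 0.5, "rs3": 5.0, "rs4": 1.0}


def toy_aic(subset: FrozenSet[str]) -> float:
    return 100.0 - sum(GAIN[s] for s in subset) + 2.0 * len(subset)


print(sorted(backward_select(GAIN, toy_aic)))  # ['rs1', 'rs3']
```

With the surrogate, a SNP is retained only when its gain exceeds the AIC penalty of 2, which is why rs2 and rs4 are dropped.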
[0233] SNP site rescreening
[0234] A logistic regression model was constructed using the SNP sites obtained through fine screening, and stepwise regression forward analysis in R language was applied to screen SNP sites.
[0235] In short, stepwise regression forward analysis includes the following steps:
[0236] (i) Logistic regression models were established between candidate SNP sites and the dependent variable (occurrence of endometriosis). The AIC values of each model were calculated and compared. The SNP site with the smallest AIC value was selected to establish a logistic regression model as a candidate model for subsequent analysis.
[0237] (ii) By introducing a new SNP site into the candidate model and calculating the AIC value of the model after adding the site, the SNP site that contributes the most to the dependent variable (occurrence of endometriosis) is selected from the remaining SNP sites; that is, the SNP site whose addition to the candidate model leads to the smallest AIC value compared with adding any of the other SNP sites.
[0238] (iii) Compare this minimum AIC value with the AIC value of the candidate model before adding the new SNP site, where:
[0239] If the AIC value decreases, the SNP locus with the largest contribution is added to the candidate regression model to obtain a new candidate regression model, and steps (ii) to (iii) are repeated; or
[0240] If the AIC value no longer decreases, the stepwise regression forward analysis is terminated, and the current candidate model and the set of SNP sites constituting the model are output as the endometriosis-related risk SNP sites determined by rescreening.
[0241] After secondary screening, 177 risk SNP loci were identified in this embodiment. Table 1 shows the chromosomal location and risk allele type of the 177 risk SNP loci.
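The forward analysis used for rescreening can be sketched in the same spirit. Again this is only an illustration of the control flow: `aic_of` is a hypothetical callable standing in for fitting a logistic regression on a SNP subset and returning its AIC, and `toy_aic` is an invented additive surrogate rather than a fitted model.

```python
from typing import Callable, FrozenSet, Iterable

AicFn = Callable[[FrozenSet[str]], float]


def forward_select(candidates: Iterable[str], aic_of: AicFn) -> FrozenSet[str]:
    """Add one SNP at a time for as long as the best addition lowers the AIC."""
    remaining = set(candidates)
    if not remaining:
        return frozenset()
    # Step (i): start from the single-SNP model with the smallest AIC.
    current_aic, first = min((aic_of(frozenset({s})), s) for s in sorted(remaining))
    selected = frozenset({first})
    remaining.discard(first)
    while remaining:
        # Step (ii): try each single addition and keep the one with the smallest AIC.
        best_aic, add = min((aic_of(selected | {s}), s) for s in sorted(remaining))
        if best_aic < current_aic:  # step (iii): AIC decreased, accept the addition
            selected, current_aic = selected | {add}, best_aic
            remaining.discard(add)
        else:  # AIC no longer decreases: stop
            break
    return selected


# Invented surrogate: each SNP lowers the deviance by a fixed gain, and AIC adds 2 per SNP.
GAIN = {"rs1": 10.0, "rs2": 0.5, "rs3": 5.0, "rs4": 1.0}


def toy_aic(subset: FrozenSet[str]) -> float:
    return 100.0 - sum(GAIN[s] for s in subset) + 2.0 * len(subset)


print(sorted(forward_select(GAIN, toy_aic)))  # ['rs1', 'rs3']
```

In a real analysis the SNP effects are not additive in this way, so forward and backward selection need not arrive at the same set.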
[0242] Example 2: Construction of a polygenic genetic risk assessment model for endometriosis
[0243] Using 177 risk SNP loci obtained from secondary screening, a 177-locus logistic regression model was constructed through machine learning. Based on the constructed logistic regression model, a 177-locus polygenic genetic risk assessment model for endometriosis was established according to the following formula:
[0244] $\mathrm{PRS(GPS)} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \ldots + \beta_n \times \mathrm{snp}_n$
[0245] where $\mathrm{snp}_n$ is the genotype of the n-th SNP locus in the sample, coded as 0, 1, or 2 for a homozygous non-risk, heterozygous, or homozygous risk genotype, respectively, and $\beta_n$ is the effect value of that SNP locus on endometriosis, i.e., the β coefficient corresponding to the SNP locus in the 177-locus logistic regression model.
[0246] Example 3: Validation of the 177-site model
[0247] Validate the model in test set 1
[0248] Test set sample selection
[0249] To examine the stratification effect of the model in a population, it was applied to an independent Chinese population test set. For this purpose, an independent Chinese population test set was constructed, which included 18 patients with endometriosis (cases) and 916 clinically confirmed healthy individuals without endometriosis (controls). All participants signed informed consent forms. This study was approved by the institution's ethics committee and conducted in accordance with the guidelines of the Declaration of Helsinki.
[0250] Genomic DNA was extracted from peripheral blood of the subjects using standard procedures. Genotyping was performed on the whole-genome DNA samples using the Affymetrix Axiom Array microarray. This microarray platform includes 850,000 SNP loci. For the SNP composition of the microarray, see http://www.affymetrix.com/analysis/downloads/na34/genotyping/Axiom_UKB_WCSG.na34.annot.csv.zip.
[0251] Genotyping was performed on the individuals that made up the test set to obtain their genotypes at 177 risk SNP loci in Examples 1 and 2.
[0252] Using the 177-locus polygenic genetic risk assessment model for endometriosis established in Example 2, the polygenic genetic risk (PRS) values for endometriosis were calculated for all test set samples. Based on the test set PRS values, a case-control density plot was drawn (Figure 2). In Figure 2, purple represents the 18 samples with endometriosis and yellow represents the 916 healthy samples. The mean risk value of the endometriosis samples was 5.11, and that of the healthy samples was 4.66. As can be seen from Figure 2, the polygenic genetic risk values for endometriosis of patients and of healthy individuals can be clearly distinguished.
[0253] Validate the model in test set 2
[0254] From the 8,236 endometriosis patients in the 16,472 samples in the training set, 1,441 samples were randomly sampled and mixed with the 8,167 healthy samples in the test set, for a total of 9,608 samples, forming a simulated test set simulating a 15% incidence rate of endometriosis in the population.
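The source does not state how the number of sampled cases was chosen, but the figure is consistent with solving k / (k + controls) = incidence for k. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def cases_for_incidence(n_controls: int, incidence: float) -> int:
    """Number of cases k for which k / (k + n_controls) is approximately `incidence`."""
    if not 0.0 < incidence < 1.0:
        raise ValueError("incidence must lie strictly between 0 and 1")
    return round(n_controls * incidence / (1.0 - incidence))


k = cases_for_incidence(8167, 0.15)
print(k, k + 8167)  # 1441 9608
```

With 8,167 controls and a 15% target incidence this gives 1,441 cases and 9,608 samples in total, matching the simulated test set described above.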
[0255] Using the 177-locus model, the PRS score of each individual in the simulated test set was calculated. Individuals in the simulated test set were then ranked from lowest to highest by PRS score to obtain the PRS percentile of each individual. Based on the PRS percentile, individuals were divided into 100 groups, each spanning one percentile. The incidence rate of each group was calculated, and a graph of endometriosis incidence against PRS was plotted (Figure 3).
[0256] As can be seen from Figure 3, for individuals in the lowest 5% of PRS percentiles, the proportion of endometriosis patients is less than 5%, while for individuals above the 95th PRS percentile, the proportion of endometriosis patients reaches about 40%.
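The ranking and grouping procedure described above can be sketched as follows. The function name and the toy data are hypothetical; ties in PRS are broken by input order, which is an assumption of this sketch rather than something the source specifies.

```python
from typing import List, Sequence


def incidence_by_percentile(scores: Sequence[float], is_case: Sequence[bool],
                            n_bins: int = 100) -> List[float]:
    """Rank individuals by PRS, split them into n_bins equal-rank groups
    (group 0 holds the lowest scores) and return the case fraction per group."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices from low to high PRS
    cases = [0] * n_bins
    totals = [0] * n_bins
    for rank, idx in enumerate(order):
        group = rank * n_bins // n
        totals[group] += 1
        cases[group] += 1 if is_case[idx] else 0
    return [c / t if t else float("nan") for c, t in zip(cases, totals)]


# Toy data: six individuals, three groups; the two highest scores are cases.
toy_scores = [5.0, 1.0, 4.0, 2.0, 3.0, 0.0]
toy_status = [True, False, True, False, False, False]
print(incidence_by_percentile(toy_scores, toy_status, n_bins=3))  # [0.0, 0.0, 1.0]
```

Calling the function with `n_bins=100` corresponds to the one-percentile groups used for Figure 3, and `n_bins=20` to groups of five percentiles.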
[0257] The foregoing results show that the model constructed using the PRS modeling method of this invention achieves pronounced risk stratification for endometriosis in different populations.
[0258] Example 4: Endometriosis Risk Prediction
[0259] To examine the predictive performance of the model, in this embodiment, the 177-site model is applied to calculate the PRS value and PRS percentile of the sample to predict the risk of endometriosis in the sample.
[0260] We collected samples from one known endometriosis patient (D1SZYS01211210FFET01SA01) and one known healthy individual (D1SZYS01211129FFET02SA01) and calculated their PRS values and PRS percentiles. The PRS value of the endometriosis patient sample D1SZYS01211210FFET01SA01 was 7.47, placing it at the 100th percentile according to population risk stratification. The PRS value of the healthy individual sample D1SZYS01211129FFET02SA01 was 3.99, placing it at the 21st percentile. These results demonstrate a highly effective risk stratification effect for endometriosis between the two groups.
[0261] Samples ZYH and HLQ with unknown endometriosis status were collected. Sample ZYH had a calculated polygenic risk assessment score of 6.52, placing her at the 98th percentile of the population's PRS, indicating a high risk of endometriosis. She was subsequently clinically diagnosed with endometriosis. Sample HLQ had a calculated polygenic risk assessment score of 4.28, placing her at the 33rd percentile of the population's PRS, indicating a low risk of endometriosis. She has not yet been diagnosed with endometriosis.
[0262] Example 5: Multilevel SNP risk site screening for polycystic ovary syndrome
[0263] To test the applicability of the PRS model construction method of the present invention, in this embodiment and subsequent embodiments, different complex diseases, such as polycystic ovary syndrome, were used to establish PRS models and the models were tested.
[0264] Training set sample selection
[0265] 875 patients with polycystic ovary syndrome (PCOS) were selected from the UK Biobank as cases, and 875 age-matched healthy women clinically confirmed to be free of PCOS were selected as controls. Their corresponding whole-genome genotyping data were obtained. All cases and controls were European women. As described in Example 1, the genetic data were obtained by genotyping whole-genome DNA samples from the subjects on an Affymetrix Axiom Array chip and underwent the same chip data quality control processing.
[0266] Initial screening of SNP sites
[0267] Using PLINK software, the assoc chi-square test was selected with default parameters to perform a population-level analysis of the association between SNPs and polycystic ovary syndrome (PCOS) on genotype and phenotype data from the 1,750 samples in the training set. This analysis yielded the risk allele of each SNP locus and the corresponding chi-square p-value. Candidate risk SNP loci were selected with a p-value < 0.001 as the cutoff.
[0268] Fine screening of SNP sites
[0269] The SNP loci identified in the initial screening were divided into 22 groups based on autosomes. Stepwise regression analysis in R was used to perform SNP locus screening for each group.
[0270] In short, stepwise regression backward analysis includes the following steps:
[0271] (i) For the set of candidate SNP sites on a single chromosome c, fit a logistic regression model containing all candidate SNP sites in the set as a candidate model for subsequent analysis, and determine the AIC value of the model.
[0272] (ii) By deleting one SNP from the candidate model and calculating the AIC value of the model after deletion, the SNP that contributes the least to the dependent variable (occurrence of polycystic ovary syndrome) among all the SNPs constituting the candidate model is identified. That is, the deletion of this SNP results in the smallest model AIC value relative to the deletion of any of the other SNPs.
[0273] (iii) Compare the minimum model AIC value obtained from step (ii) with the AIC value of the candidate model from step (i), where:
[0274] If the AIC value decreases, the minimum contributing SNP site is removed from the candidate model to obtain a new candidate model, and steps (ii) to (iii) are repeated; or
[0275] If the AIC value no longer decreases, the stepwise regression backward analysis is terminated, and the set of SNP sites constituting the current candidate model is output as the SNP sites related to polycystic ovary syndrome on chromosome c that have been determined through fine screening.
[0276] The SNP sites associated with polycystic ovary syndrome (PCOS) identified by fine screening on the 22 autosomes are then pooled.
[0277] SNP site rescreening
[0278] A logistic regression model was constructed using the SNP sites obtained through fine screening, and stepwise regression forward analysis in R language was applied to screen SNP sites.
[0279] In short, stepwise regression forward analysis includes the following steps:
[0280] (i) Logistic regression models were established between candidate SNP sites and the dependent variable (occurrence of polycystic ovary syndrome). The AIC values of each model were calculated and compared. The SNP site with the smallest AIC value was selected to establish a logistic regression model as a candidate model for subsequent analysis.
[0281] (ii) By introducing a new SNP site into the candidate model and calculating the AIC value of the model after adding the site, the SNP site that makes the greatest contribution to the dependent variable (occurrence of polycystic ovary syndrome) is selected from the remaining SNP sites. That is, the SNP site that leads to the smallest AIC value after adding it to the candidate model compared to adding any of the other SNP sites.
[0282] (iii) Compare this minimum AIC value with the AIC value of the candidate model before adding the new SNP site, where:
[0283] If the AIC value decreases, the SNP locus with the largest contribution is added to the candidate regression model to obtain a new candidate regression model, and steps (ii) to (iii) are repeated; or
[0284] If the AIC value no longer decreases, the stepwise regression forward analysis is terminated, and the current candidate model and the set of SNP sites constituting the model are output as the SNP sites related to polycystic ovary syndrome identified through rescreening.
[0285] After secondary screening, 173 risk SNP loci were identified in this embodiment. Table 2 shows the chromosomal location and risk allele type of the 173 risk SNP loci.
[0286] Example 6: Construction of a polygenic genetic risk assessment model for polycystic ovary syndrome
[0287] Using 173 risk SNP loci obtained from secondary screening, a 173-locus logistic regression model was constructed through machine learning. Based on the constructed logistic regression model, a 173-locus polycystic ovary syndrome (PCOS) polygenic genetic risk assessment model was established according to the following formula:
[0288] $\mathrm{PRS(GPS)} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \ldots + \beta_n \times \mathrm{snp}_n$
[0289] where $\mathrm{snp}_n$ is the genotype of the n-th SNP locus in the sample, coded as 0, 1, or 2 for a homozygous non-risk, heterozygous, or homozygous risk genotype, respectively, and $\beta_n$ is the effect value of that SNP locus on polycystic ovary syndrome, i.e., the β coefficient corresponding to the SNP locus in the 173-locus logistic regression model.
[0290] Example 7: Validation of the 173-site model
[0291] From the 875 polycystic ovary syndrome (PCOS) patients among the 1,750 samples of the training set described above, 91 samples were randomly drawn and mixed with 816 healthy samples from the test set, giving a total of 907 samples and forming a simulated test set that simulates a 10% incidence rate of PCOS in the general population. The 173-locus model was applied to calculate the PRS score of each individual in the simulated test set, and the individuals were ranked from low to high by PRS score to obtain each individual's PRS percentile. Based on the PRS percentile, individuals were divided into 20 groups, each spanning five percentiles. The incidence rate in each group was calculated, and a graph of PCOS incidence rate against PRS was plotted (Figure 4). As shown in Figure 4, for individuals in the lowest 5% of PRS percentiles, the proportion of PCOS patients is less than 1%, while for individuals above the 95th PRS percentile, the proportion of PCOS patients reaches approximately 60%. These results show that a model constructed for PCOS using the PRS modeling method of this invention also achieves pronounced population-level risk stratification.
[0292] Example 8: Risk Prediction for Polycystic Ovary Syndrome
[0293] To examine the predictive performance of the generated model, in this embodiment, a 173-site model is applied to calculate the PRS value and PRS percentile of the sample to predict the risk of polycystic ovary syndrome in the sample.
[0294] We collected one known polycystic ovary syndrome (PCOS) patient sample (XJ) and one known healthy sample (GUSS), and calculated their PRS values and PRS percentiles. The PRS value of the PCOS patient sample XJ was 12.34, at the 85th PRS percentile. The PRS value of the healthy sample GUSS was 6.94, at the 18th PRS percentile. The results indicate that PCOS risk stratification clearly separated the two samples.
[0295] Samples MR and YYY, with unknown PCOS status, were collected. Sample MR had a calculated polygenic risk assessment score of 11.89, placing her at the 80th percentile of the population's PRS, indicating a high risk of PCOS. She was subsequently clinically diagnosed with PCOS. Sample YYY had a calculated polygenic risk assessment score of 6.02, placing her at the 10th percentile of the population's PRS, indicating a low risk of PCOS. On clinical examination, she has not been diagnosed with PCOS to date.
[0296] Some embodiments of the present invention:
[0297] 1. A method for constructing a polygenic disease risk rating (PRS) model, the method comprising:
[0298] (a) Obtain a training dataset for PRS model construction, wherein the training dataset contains SNP genotype feature data and corresponding disease classification feature data extracted from each individual sample in the training set, wherein the training set consists of a plurality of diseased individual samples and a plurality of healthy control individual samples.
[0299] Preferably, the SNP genotype features are SNP genotype features extracted from the entire genome or entire exome of an individual sample;
[0300] Preferably, the disease classification feature is whether an individual has a disease.
[0301] (b) Implement multi-level SNP risk site screening;
[0302] (c) Apply the SNP risk site set selected in step (b) to construct a PRS model using machine learning algorithms.
[0303] The multi-level SNP risk site screening in step (b) includes:
[0304] (b1) Perform association analysis between SNP loci and the disease on the training dataset to determine the risk allele of each SNP locus and the significance p-value of its association with the disease;
[0305] (b2) For SNP sites whose p-values meet the PT threshold, group them by autosomal chromosome, and apply logistic regression modeling and stepwise regression analysis respectively to select SNP sites that contribute significantly to the model;
[0306] (b3) Integrate the 22 groups of SNP sites obtained from step (b2), establish a logistic regression model, and apply stepwise regression analysis to determine the set of SNP risk sites with the best performance for step (c).
[0307] 2. According to the method of implementation scheme 1, step (c) includes:
[0308] (c1) Using the set of risk SNP loci obtained in step (b), construct a logistic regression model, and determine the effect value of each SNP locus on the disease using the corresponding coefficients of each SNP locus in the regression model.
[0309] (c2) Establish a polygenic disease risk rating (PRS) model for the disease.
[0310] $\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \ldots + \beta_i \times \mathrm{snp}_i + \ldots + \beta_n \times \mathrm{snp}_n$
[0311] where:
[0312] $n$ is the total number of risk SNP sites included in the model;
[0313] $\mathrm{snp}_i$ is the genotype of the i-th SNP site, coded as an integer 0, 1, or 2, representing a homozygous non-risk genotype (0 risk alleles), a heterozygous genotype (1 risk allele), and a homozygous risk genotype (2 risk alleles), respectively;
[0314] $\beta_i$ is the effect value of the i-th SNP site determined in step (c1).
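[0315] 3. According to the method in Implementation Scheme 1-2, the PT threshold is selected from: $1 \times 10^{-3}$ to $9 \times 10^{-3}$, $1 \times 10^{-4}$ to $9 \times 10^{-4}$, $1 \times 10^{-5}$ to $9 \times 10^{-5}$, and $1 \times 10^{-6}$ to $9 \times 10^{-6}$; preferably $1 \times 10^{-3}$.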
[0316] 4. According to the method in Implementation Scheme 1-3, step (b1) uses the chi-square test to analyze the association between SNP sites and diseases.
[0317] 5. According to the method of implementation scheme 1-4, wherein the stepwise regression analysis in steps (b2) and (b3) is independently selected from stepwise regression forward analysis and stepwise regression backward analysis.
[0318] 6. According to the method of implementation scheme 1-5, wherein steps (b2) and (b3) adopt the optimal model selection criterion as the screening criterion for SNP sites and / or the termination condition for stepwise regression.
[0319] 7. According to the method in Implementation Scheme 1-6, the stepwise regression analysis of steps (b2) and (b3) is carried out using the AIC optimal model selection criterion.
[0320] 8. According to the method in Implementation Scheme 1-7, step (b2) applies stepwise regression backward analysis.
[0321] 9. According to the method in Implementation Scheme 1-8, step (b3) applies stepwise regression forward analysis.
[0322] 10. According to the methods in Implementation Scheme 1-9, in the training set, the ratio of the number of disease patients to the number of healthy controls is 1:1 to 1:100, or approximately equal to the disease incidence rate of the target population.
[0323] Preferably, the training set consists of samples from at least 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 11,000, 12,000, 15,000, 20,000, 25,000, 30,000 or more individuals; preferably, the samples are blood, serum, plasma or tissue biopsies from the individuals.
[0324] 11. The method according to embodiments 1-10, wherein the method further includes: validating the model established in step (c) on a test set (preferably a test set independent of the training set).
[0325] 12. According to the method in Implementation Scheme 1-11, the model is a logistic regression model containing approximately 150-600 risk SNP loci.
[0326] 13. A population disease risk stratification method, comprising:
[0327] (i) Construct a training set consisting of disease patients and healthy controls according to the methods in Implementation Scheme 1-12, and establish a population disease risk rating (PRS) model;
[0328] (ii) On the test set, apply the PRS model from step (i) to calculate the PRS values for all samples in the test set;
[0329] (iii) Determine the PRS percentile distribution of the population and the disease incidence rate at each PRS percentile, thereby obtaining the disease risk stratification baseline of the population.
[0330] 14. The method according to implementation scheme 13, wherein the population is the Chinese population.
[0331] 15. A method for risk rating of polygenic diseases, comprising:
[0332] (i) Establish a PRS model and a baseline for disease risk stratification of the population using the method of Implementation Scheme 13;
[0333] (ii) Determine the individual's genotype at the risk SNP loci and calculate the individual's PRS value. Compare the individual's PRS value with the disease risk stratification baseline of the population to predict the individual's disease susceptibility.
[0334] 16. The method according to embodiment 15, wherein the individual's disease risk is predicted based on PRS percentiles,
[0335] Preferably, if an individual's PRS is below the 5th percentile of the population's PRS, it indicates a relatively low susceptibility to disease relative to the population's average disease susceptibility; or
[0336] If an individual's PRS is higher than the 95th percentile of the population's PRS, it indicates that the individual has a relatively high susceptibility to disease relative to the average disease susceptibility of the population.
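The percentile comparison in embodiment 16 can be sketched as below. The function names are hypothetical, the percentile is taken here as the share of reference scores at or below the individual's score (one of several common definitions), and the "unclassified" label for the middle range is an assumption of this sketch, since the embodiment only defines the two tails.

```python
from bisect import bisect_right
from typing import Sequence


def prs_percentile(score: float, reference_scores: Sequence[float]) -> float:
    """Percentage of reference-population PRS values that are <= score."""
    ref = sorted(reference_scores)
    return 100.0 * bisect_right(ref, score) / len(ref)


def susceptibility(percentile: float, low: float = 5.0, high: float = 95.0) -> str:
    """Label the tails described in embodiment 16; the middle range is left unlabeled."""
    if percentile < low:
        return "relatively low"
    if percentile > high:
        return "relatively high"
    return "unclassified"


# Hypothetical reference population with PRS values 1, 2, ..., 100.
reference = list(range(1, 101))
pct = prs_percentile(98, reference)
print(pct, susceptibility(pct))  # 98.0 relatively high
```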
[0337] 17. The method according to embodiments 15-16, wherein the disease is endometriosis and the PRS model contains 177 SNP risk sites listed in Table 1; or the disease is polycystic ovary syndrome and the PRS model contains 173 SNP risk sites listed in Table 2.
[0338] 18. The method according to embodiments 15-17, wherein the method further includes: determining an individual's disease susceptibility by combining the individual's PRS with the individual's clinical disease risk score.
[0339] 19. The method according to embodiments 15-18, wherein the individual is an individual who has not yet shown symptoms of the disease.
[0340] 20. The method according to embodiments 15-18, wherein medical advice is given to individuals predicted to be highly susceptible to the disease, such as more frequent disease screening and / or preventive treatment.
[0341] 21. An apparatus or system for constructing a polygenic disease risk (PRS) rating model, comprising:
[0342] One or more datasets, wherein a training dataset is stored, the training dataset containing SNP genotypic feature data and corresponding disease classification feature data extracted from each individual sample in the training set, wherein the training set consists of a plurality of diseased individuals and a plurality of healthy control individuals.
[0343] One or more computer-executable processors configured to perform the method steps of any one of embodiments 1-12, preferably, the processors configured to perform the following operations:
[0344] (a) Access the dataset;
[0345] (b) On the dataset, perform an association analysis between SNP sites and the disease;
[0346] (c) For SNP sites that meet the PT threshold, group them by autosomal chromosome and apply stepwise logistic regression analysis to screen SNP sites that contribute significantly to the model.
[0347] (d) Integrate the 22 groups of SNP loci obtained from step (c) and apply stepwise logistic regression analysis to determine the set of SNP loci with the best performance;
[0348] (e) Construct a logistic regression model from the optimal set of SNP sites obtained in step (d) and determine the β coefficient of each SNP site in the model;
[0349] (f) Establish a disease PRS rating model according to the following PRS formula.
[0350] $\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$
[0351] where,
[0352] $n$ represents the total number of risk SNP sites included in the model;
[0353] $\mathrm{snp}_i$ is the genotype of the i-th SNP locus and takes the integer value 0, 1, or 2, representing respectively a homozygous non-risk locus (containing 0 risk alleles), a heterozygous locus (containing 1 risk allele), and a homozygous risk locus (containing 2 risk alleles);
[0354] $\beta_i$ is the β coefficient of the i-th SNP site determined in step (e).
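The PRS formula of operation (f) is a weighted sum of risk allele counts, which might be computed along the following lines. This is a sketch only: the function name, the SNP identifiers, and the β values are hypothetical placeholders, not loci or coefficients from Table 1 or Table 2.

```python
def polygenic_risk_score(betas, genotypes):
    """Compute PRS = sum over SNPs of beta_i * snp_i.

    betas     : dict mapping SNP id -> beta coefficient from the logistic regression model
    genotypes : dict mapping SNP id -> risk allele count (0, 1, or 2) for one individual
    """
    score = 0.0
    for snp_id, beta in betas.items():
        allele_count = genotypes[snp_id]  # raises KeyError if a model SNP was not genotyped
        if allele_count not in (0, 1, 2):
            raise ValueError(f"genotype for {snp_id} must be 0, 1, or 2")
        score += beta * allele_count
    return score


# Hypothetical three-SNP model and one individual's genotypes (illustrative only).
example_betas = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 0.1}
example_genotypes = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
print(polygenic_risk_score(example_betas, example_genotypes))  # 0.75
```

Raising an error on a missing genotype is one possible design choice; how missing genotypes should be handled is not specified here.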
[0355] 22. The apparatus or system according to embodiment 21, wherein the SNP genotype feature data is the SNP genotype feature of the whole genome or whole exome of the individuals in the training set; and the disease classification feature is the status of whether the individual suffers from the disease.
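Given genotype and disease-status data of this kind, the association analysis of operation (b) in embodiment 21 could be sketched as below. The sketch assumes an allelic 2×2 chi-square test with one degree of freedom and no continuity correction, and it treats "meeting the PT threshold" as p < PT with PT = 1×10⁻³; the exact form of the test, the function names, and the allele counts are all assumptions for illustration.

```python
import math


def allelic_chi_square_p(case_risk, case_other, ctrl_risk, ctrl_other):
    """P-value of a 2x2 chi-square test on allele counts (1 degree of freedom)."""
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: treated as no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def initial_screen(allele_counts, pt_threshold=1e-3):
    """Keep SNP ids whose association p-value is below the PT threshold."""
    return [
        snp_id
        for snp_id, counts in allele_counts.items()
        if allelic_chi_square_p(*counts) < pt_threshold
    ]


# Hypothetical allele counts: (case risk, case other, control risk, control other).
counts = {"snp_weak": (60, 40, 40, 60), "snp_strong": (70, 30, 30, 70)}
print(initial_screen(counts))  # ['snp_strong']
```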
[0356] 23. An apparatus or system for predicting an individual's risk of disease, the system comprising:
[0357] At least one computer-executable processor, the processor being coupled to an electronic storage device containing an electronic representation of a predictor, wherein the predictor is a polygenic disease risk rating (PRS) model generated by the method according to embodiments 1-10 or the system according to embodiment 21.
[0358] The processor is configured to receive test data, which includes genotypic characteristics of risk SNP loci extracted from samples of the individual being tested using the PRS model.
[0359] The processor is further configured to evaluate the test data using the electronic representation of the predictor and, based on the evaluation, output a disease PRS risk score for the individual from whom the test data originated.
[0360] Preferably, the processor is further configured to combine the individual PRS score output by the predictor with the individual's clinical disease risk score to determine the individual's disease susceptibility.
[0361] 24. The apparatus or system according to embodiment 23, wherein the disease is endometriosis and the PRS model includes 177 SNP risk sites listed in Table 1; or the disease is polycystic ovary syndrome and the PRS model includes 173 SNP risk sites listed in Table 2.
[0362] 25. A non-transitory computer-readable medium storing computer program instructions, which are executed by a computer or computer system to implement the steps of the model building method, population disease risk stratification method, or disease polygenic risk rating method according to any of the foregoing embodiments.
[0363] 26. A multi-gene locus joint disease risk analysis and assessment system, comprising:
[0364] (a) Sample preprocessing module: used to sequence samples from individuals (e.g., SNP microarray sequencing) and preprocess the sequencing data to obtain genotypic data of disease risk SNP loci;
[0365] (b) Polygenic Risk Assessment Module: Used to perform the method steps of any one of embodiments 15-20 to obtain a preliminary disease risk assessment report for an individual and transmit it to the Disease Risk Assessment Module;
[0366] (c) Disease Risk Assessment Module: Based on the preliminary individual disease risk assessment report transmitted by the multi-gene risk assessment module, and combined with the data feedback from the data management and processing module, the preliminary assessment results are subjected to subsequent data learning and optimization assessment.
[0367] (d) Report Presentation and User Feedback Module: Used to present evaluation reports, collect user feedback, and transmit it to the data management and processing module;
[0368] (e) Data Management and Processing Module: Used to manage user data and provide feedback to the Disease Risk Assessment Module.
[0369] 27. The assessment system according to embodiment 26, wherein the basic information of the user includes the user's ethnicity, geographic region, gender, age and lifestyle, and / or the user's clinical disease risk score.
[0370] 28. A disease risk management method, comprising the following steps:
[0371] (1) Using the device or system according to embodiment 23, obtain the individual's PRS score;
[0372] (2) Using the disease risk stratification baseline of the individual's population, predict the individual's genetic susceptibility to the disease (i.e., the genetic tendency to develop the disease);
[0373] (3) Optionally, based on the individual's family genetic background and / or environmental and lifestyle surveys, predict the individual's risk of developing the disease over a period of time;
[0374] (4) Optionally, provide medical advice to high-risk individuals to manage their risk of developing the disease.
[0375] 29. A composition for polygenic risk rating of endometriosis or polycystic ovary syndrome, wherein the composition comprises a reagent or combination of reagents for genotyping 177 SNP risk loci listed in Table 1 on a sample from an individual, or the composition comprises a reagent or combination of reagents for genotyping 173 SNP risk loci listed in Table 2 on a sample from an individual.
[0376] Preferably, the composition exists in the form of a microarray chip.
[0377] 30. Use of the composition according to embodiment 29 in the preparation of a product for risk stratification of endometriosis or polycystic ovary syndrome or for predicting an individual's susceptibility to endometriosis or polycystic ovary syndrome.
Claims
1. A method for constructing a polygenic disease risk rating (PRS) model, the method comprising: (a) Obtain a training dataset for PRS model construction, wherein the training dataset contains SNP genotype feature data and corresponding disease classification feature data extracted from each individual sample in the training set, wherein the training set consists of a plurality of diseased individual samples and a plurality of healthy control individual samples; the SNP genotype features are SNP genotype features extracted from the entire genome or entire exome of an individual sample; and the disease classification feature is whether an individual has the disease; (b) Implement multi-level SNP risk site screening; (c) Using the SNP risk site set selected in step (b), construct a PRS model through machine learning algorithms; wherein the multi-level SNP risk site screening in step (b) includes: (b1) Initial screening: perform association analysis between SNP loci and the disease on the training dataset to determine the risk allele of each SNP locus, as well as its association with the disease and the significance p-value, and screen SNP sites using a PT threshold of 1-9×10⁻³; (b2) Fine screening: for SNP loci whose p-values meet the PT threshold, group them by autosomal chromosome, and for each group apply logistic regression modeling and backward stepwise regression analysis to select SNP loci that contribute significantly to the model; (b3) Rescreening: integrate the 22 groups of SNP loci obtained from step (b2), establish a logistic regression model, and apply forward stepwise regression analysis to determine the set of SNP risk loci with the best performance for step (c).
2. The method according to claim 1, wherein step (c) includes: (c1) Using the set of risk SNP loci obtained in step (b), construct a logistic regression model, and determine the effect value of each SNP locus on the disease using the corresponding coefficient of each SNP locus in the regression model; (c2) Establish a polygenic disease risk rating (PRS) model for the disease: $\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$, where $n$ represents the total number of risk SNP sites included in the model; $\mathrm{snp}_i$ is the genotype of the i-th SNP locus and takes the integer value 0, 1, or 2, representing respectively a homozygous non-risk locus containing 0 risk alleles, a heterozygous locus containing 1 risk allele, and a homozygous risk locus containing 2 risk alleles; and $\beta_i$ is the effect value of the i-th SNP site determined in step (c1).
3. The method according to claim 1 or 2, wherein the PT threshold is 1×10⁻³.
4. The method according to claim 1 or 2, wherein step (b1) uses the chi-square test to analyze the association between SNP sites and the disease.
5. The method according to claim 1 or 2, wherein steps (b2) and (b3) employ an optimal model selection criterion as the screening standard for SNP sites and / or the termination condition for stepwise regression.
6. The method according to claim 1 or 2, wherein the stepwise regression analysis of steps (b2) and (b3) is carried out using the AIC optimal model selection criterion.
7. The method according to claim 1 or 2, wherein, in the training set, the ratio of the number of disease patients to the number of healthy controls is 1:1 to 1:100, or approximately equal to the disease incidence rate in the target population.
8. The method according to claim 7, wherein the training set consists of samples from at least 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 11,000, 12,000, 15,000, 20,000, 25,000, 30,000 or more individuals.
9. The method of claim 8, wherein the sample is blood, serum, plasma, or tissue biopsy from an individual.
10. The method according to claim 1 or 2, wherein the method further includes: validating the model established in step (c) on a test set.
11. The method according to claim 1 or 2, wherein the model is a logistic regression model containing 150-600 risk SNP loci.
12. A population disease risk stratification method, comprising: (i) Construct a training set consisting of disease patients and healthy controls according to any one of claims 1-11, and establish a population disease risk rating (PRS) model; (ii) On the test set, apply the PRS model from step (i) to calculate the PRS values for all samples in the test set; (iii) Determine the PRS percentile distribution and the disease incidence rate at each PRS percentile in the population, thereby obtaining the disease risk stratification baseline of the population.
13. The method according to claim 12, wherein the population is the Chinese population.
14. A method for risk rating of polygenic diseases, comprising: (i) Establish a PRS model and a disease risk stratification baseline for the population using the method of claim 12; (ii) Determine the individual's genotype at the risk SNP loci and calculate the individual's PRS value. Compare the individual's PRS value with the disease risk stratification baseline of the population to predict the individual's disease susceptibility.
15. The method of claim 14, wherein the individual's risk of developing the disease is predicted based on PRS percentiles.
16. The method of claim 15, wherein, if an individual's PRS is below the 5th percentile of the population, it indicates a relatively low susceptibility to disease relative to the population average; or if an individual's PRS is above the 95th percentile of the population, it indicates a relatively high susceptibility to disease relative to the population average.
17. The method according to claim 14 or 15, wherein the disease is endometriosis; or the disease is polycystic ovary syndrome.
18. The method according to claim 14 or 15, wherein the method further includes: determining an individual's disease susceptibility by combining the individual's PRS with the individual's clinical disease risk score.
19. The method of claim 14 or 15, wherein the individual is an individual who has not yet shown symptoms of disease.
20. The method according to claim 14 or 15, wherein medical advice is given to individuals predicted to be highly susceptible to the disease.
21. An apparatus or system for constructing a polygenic disease risk (PRS) rating model, comprising: one or more datasets, wherein a training dataset is stored, the training dataset containing SNP genotypic feature data extracted from each individual sample in the training set and corresponding disease classification feature data, wherein the training set consists of a plurality of diseased individuals and a plurality of healthy control individuals, wherein the SNP genotypic feature data are SNP genotypic features across the entire genome or entire exome of the individuals in the training set, and the disease classification feature is the status of whether an individual suffers from the disease; and one or more computer-executable processors, said processors being configured to perform the method steps of any one of claims 1-11, wherein said processors are configured to perform the following operations: (a) Access the dataset; (b) On the dataset, perform association analysis between SNP sites and the disease, and screen SNP sites using a PT threshold of 1-9×10⁻³; (c) For SNP sites that meet the PT threshold, group them by autosomal chromosome and apply stepwise logistic regression analysis to screen SNP sites that contribute significantly to the model; (d) Integrate the 22 groups of SNP loci obtained from step (c) and apply stepwise logistic regression analysis to determine the set of SNP loci with the best performance; (e) Using the optimal set of SNP sites from step (d), construct a logistic regression model and determine the β coefficient of each SNP site in the model; (f) Establish a disease PRS rating model according to the following PRS formula:
$\mathrm{PRS} = \beta_1 \times \mathrm{snp}_1 + \beta_2 \times \mathrm{snp}_2 + \dots + \beta_i \times \mathrm{snp}_i + \dots + \beta_n \times \mathrm{snp}_n$, where $n$ represents the total number of risk SNP sites included in the model; $\mathrm{snp}_i$ is the genotype of the i-th SNP locus and takes the integer value 0, 1, or 2, representing respectively a homozygous non-risk locus containing 0 risk alleles, a heterozygous locus containing 1 risk allele, and a homozygous risk locus containing 2 risk alleles; and $\beta_i$ is the β coefficient of the i-th SNP site determined in step (e).
22. A system for predicting an individual's risk of disease, the system comprising: at least one computer-executable processor, said processor being coupled to an electronic storage device, said storage device containing an electronic representation of a predictor, said predictor being a polygenic disease risk rating (PRS) model generated by the method of any one of claims 1-11 or by the apparatus or system of claim 21. The processor is configured to receive test data, which includes genotypic characteristics of risk SNP loci extracted from samples of the individual being tested using the PRS model. The processor is further configured to evaluate the test data using the electronic representation of the predictor and, based on the evaluation, output a disease PRS risk score for the individual from whom the test data originated.
23. The system of claim 22, wherein the processor is further configured to combine the individual PRS score output by the predictor with the individual's clinical disease risk score to determine the individual's disease susceptibility.
24. The system of claim 23, wherein the disease is endometriosis; or the disease is polycystic ovary syndrome.
25. A non-transitory computer-readable medium storing computer program instructions that are executed by a computer or computer system to implement the steps of the method for constructing a polygenic disease risk rating (PRS) model according to any one of claims 1-11, the population disease risk stratification method according to any one of claims 12-13, or the polygenic disease risk rating method according to any one of claims 14-20.
26. A multi-gene locus joint disease risk analysis and assessment system, comprising: (a) Sample preprocessing module: used to sequence samples from individuals and preprocess the sequencing data to obtain genotypic data of disease risk SNP loci; (b) Polygenic risk assessment module: used to perform the method steps of any one of claims 14-20 to obtain a preliminary disease risk assessment report for an individual and transmit it to the disease risk assessment module; (c) Disease Risk Assessment Module: Based on the preliminary individual disease risk assessment report transmitted by the multi-gene risk assessment module, and combined with the data feedback from the data management and processing module, the preliminary assessment results are further evaluated through data learning and optimization. (d) Report presentation and user feedback module: used to present evaluation reports, collect user feedback and transmit it to the data management and processing module; (e) Data Management and Processing Module: Used to manage user data and provide feedback to the Disease Risk Assessment Module.
27. The assessment system of claim 26, wherein the basic information of the user includes the user's ethnicity, geographic region, gender, age and lifestyle, and / or the user's clinical disease risk score.
28. A disease risk management method, comprising the following steps: (1) Using the system according to claim 22, obtain the individual's PRS score; (2) Based on the baseline of disease incidence risk stratification in the individual's population, predict the individual's genetic susceptibility to disease; (3) Optionally, based on an individual’s family genetic background and / or environmental and lifestyle surveys, predict an individual’s risk of developing a disease over a period of time; (4) Optionally, provide medical advice to high-risk individuals to manage their risk of developing diseases.