Computer-based method and apparatus for analyzing genetic data
The method addresses the challenge of subpopulation-specific genetic association prediction by iteratively determining causal variants and effect sizes, enhancing PRS accuracy across diverse ancestries.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- ゲノミクス リミテッド
- Filing Date
- 2021-11-26
- Publication Date
- 2026-06-26
Abstract
Description
[Technical Field]
[0001] The present invention relates to obtaining information about an organism by analyzing genetic and phenotypic data, and in particular to enabling an improved polygenic risk score (PRS) to be obtained for a target phenotype.
[Background Art]
[0002] A PRS is a quantitative summary of the contribution of an organism's inherited DNA to the phenotypes it may exhibit. A PRS may be calculated to include all DNA variants related (directly or indirectly) to the phenotype in question, or it may be built from components more strongly related to a particular aspect of the organism's biology (including cells, tissues, or other biological units, mechanisms, or processes). A PRS can be used directly to infer past, present, and future aspects of an organism's biology, or as one of a series of measurements or records about the organism.
[0003] The PRS (polygenic risk score) is gaining attention as a tool for disease prevention, stratification, and diagnosis. In terms of improving human health and healthcare management, PRS has a variety of practical applications, including but not limited to: predicting the risk of developing a disease or phenotype, predicting the age of phenotype onset, predicting disease severity, predicting disease subtypes, predicting response to treatment, selecting appropriate screening strategies for individuals, selecting appropriate pharmacotherapy interventions, and setting prior probabilities for other predictive algorithms.
[0004] A PRS can be applied directly as an input to artificial intelligence and machine learning approaches that make predictions or classifications from other high-dimensional input data (e.g., imaging). For example, it may be used to help train these algorithms to identify predictive measures based on non-genetic data. In addition to its usefulness in making predictive statements about individuals, PRS can also be used to identify cohorts of individuals, by calculating the PRS for a large number of individuals and then grouping them based on their PRS, for purposes including, but not limited to, the applications above.
[0005] PRS can also help select individuals for clinical trials to optimize trial design, for example, by selecting individuals more likely to develop the relevant disease or phenotype, thereby improving the evaluation of the effectiveness of new therapies. A PRS carries information not only about the individual for whom it is calculated but also about that individual's relatives (who share a portion of the individual's inherited DNA). Information about the influence of an individual's DNA on phenotype can be derived from any relevant assessment of the potential impact of carrying any particular combination of DNA variants.
[0006] The following focuses on the analysis of the wealth of recent information derived from genetic association studies (GAS). These studies systematically evaluate the potential contribution of DNA variants to the genetic basis of a phenotype.
[0007] Since the mid-2000s, GAS (typically genome-wide association studies: GWAS, or association studies focusing on a single variant, on variants in a region of the genome, or a GWAS limited to a specific region of the genome) have been conducted on thousands of (primarily human) phenotypes in millions of individuals, yielding billions of potential associations between genotypes and phenotypes. The resulting raw data is then often reduced to summary statistics. GAS summary statistics consist of, for each genetic variant (whether imputed or observed), the inferred effect size of the genetic variant on the GAS phenotype and the standard error of that inferred effect size. In other cases, individual-level data, consisting of the complete genetic profile of each individual in a study together with that individual's phenotype, may be directly available. However, individual-level data is usually less available because of privacy requirements for individual data.
[0008] A PRS combines the effects of numerous genetic variants, each typically having a small individual effect, to construct a collective predictor of the trait in question. A PRS can be calculated using the effect sizes of variants determined from a GWAS. Variants included in such a score can be either "causal variants" or "tag variants." Causal variants affect the trait directly (weakly, but directly), whereas tag variants have no direct effect on the phenotype themselves but are strongly correlated with other, unknown causal variants.
[0009] While strategies for constructing PRSs are expanding, a widely accepted general approach to constructing high-accuracy PRSs involves deconvolving signals across all relevant regions by investigating the combination of variants that best captures the underlying biological associations. The number of associations varies; many genomic regions contain a single potential association, while some contain multiple independent associations (in rare cases, as many as 10 have been reported).
[0010] The technical challenge in identifying the correct combination of variants responsible for all associations within a region is that these variants can be correlated with each other. The greater the correlation, the more samples are needed to disentangle the variants.
[0011] Some tools for constructing PRSs are designed to use summary statistics data. One approach, popularized by the LDpred software (Vilhjalmsson et al., 2015, https://github.com/bvilhjal/ldpred), iterates through multiple random selections of causal variants genome-wide based on a single GWAS and estimates the residual signal as variants are included in or excluded from the selection.
[0012] The strength of strategies based on summary statistics data lies in the fact that significantly larger sample sizes can be made available to the scientific community due to the absence of limitations in sharing individual-level data. For this reason, the majority of current PRS designs are based on these large summary statistics datasets.
[0013] However, all methods based on summary statistics handle correlated variants by referring to an external data source that represents the expected correlations between variants. The pattern of correlation between genetic variants is called linkage disequilibrium (LD). A limitation of relying on external datasets to represent LD patterns is that different subpopulations have distinct LD patterns. For example, individuals of European ancestry may have different LD patterns from individuals of Southeast Asian ancestry. Because the identity of the true causal variant is rarely known with certainty, these differences in LD can lead to differences in the predictive accuracy of a PRS across ancestries. In addition, the effect of a particular variant on the phenotype may differ among subpopulations. For example, a given causal genetic variant may have a greater effect on a given phenotype in males than in females, or a smaller effect in older individuals than in younger individuals. Therefore, inferences made for one subpopulation, or based on data from individuals in a mixture of subpopulations, are likely to be less accurate for a different subpopulation. For example, the datasets underlying PRS construction are often based on large cohorts of European ancestry. As a result, these scores often perform poorly for individuals of non-European ancestry.
[0014] Existing methods to address this challenge create PRSs using training datasets from the appropriate subpopulation. However, the amount of data available for a particular subpopulation can vary significantly. These methods therefore run into the problem that the available sample sizes are often much smaller, which limits their predictive power. Because of the reduced statistical power of smaller studies, attempts to calculate PRSs for a specific subpopulation with limited available data may yield less reliable results than simply using results from a different subpopulation with more available data. For example, the larger sample size of a cohort of European ancestry can often outweigh the bias associated with using a mismatched training set, so a PRS trained on European ancestry may in practice be the best available PRS for a non-European cohort, although this is not optimal in principle.
[Summary of the Invention]
[0015] The object of the present invention is to improve the analysis of genetic data for organisms and/or to enable more robust and/or more accurate PRSs to be obtained for individuals belonging to a specific subpopulation.
[0016] According to one aspect of the present invention, a computer-implemented method for analyzing genetic data about an organism is provided. The method includes: receiving a plurality of input units, each input unit including information about the association between a plurality of genetic variants in a target region of the organism's genome and a target phenotype of the organism; performing one or more iterations, each iteration including, for each of the plurality of genetic variants, determining, based on the plurality of input units, whether the genetic variant is causal for the target phenotype and, when the genetic variant is determined to be causal, determining, based on the plurality of input units and on information about the correlation between the plurality of genetic variants in the target region, a sampled effect size of the genetic variant on the target phenotype for each of the input units, wherein the sampled effect size of the genetic variant on the target phenotype is non-zero for all of the input units; and determining, for each genetic variant, a predicted effect size of the genetic variant on the target phenotype for each of the input units, based on an average, over at least a subset of the iterations, of the sampled effect size of the genetic variant for the input unit or of a posterior effect size of the genetic variant for the input unit calculated using the sampled effect sizes.
[0017] By determining which variants are causal using data from a plurality of input units, causal variants can be identified with higher reliability. On the other hand, by determining the predicted effect size separately for each input unit, the method can also consider the possibility of different effect sizes for different subpopulations. Thereby, the power to use a large dataset can be combined with the ability to generate conclusions for each subpopulation. By obtaining a more accurate predicted effect size, a more accurate PRS can be calculated as a result.
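As an illustration of the final averaging step, the following minimal Python sketch computes a predicted effect size per input unit for a single variant by averaging its sampled effect sizes over a subset of iterations. The function name, the `burn_in` parameter, and the convention of recording 0.0 for iterations in which the variant was not sampled as causal are assumptions made for this sketch, not details taken from the method itself.

```python
from statistics import fmean


def predicted_effect_sizes(samples, burn_in=0):
    """Average sampled effect sizes of ONE variant over the retained iterations.

    samples: list over iterations; each item is a list over input units with
             the sampled effect size for that unit (assumed to be 0.0 when the
             variant was not sampled as causal in that iteration).
    burn_in: hypothetical number of initial iterations to discard.
    Returns one predicted effect size per input unit.
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no iterations left after burn-in")
    n_units = len(kept[0])
    return [fmean(iteration[k] for iteration in kept) for k in range(n_units)]


# Four iterations, two input units (illustrative numbers only).
samples = [
    [0.50, 0.10],  # discarded as burn-in
    [0.20, 0.10],  # sampled as causal: non-zero for both units
    [0.00, 0.00],  # not sampled as causal in this iteration
    [0.40, 0.20],
]
predicted = predicted_effect_sizes(samples, burn_in=1)
```

Because the average is taken separately for each input unit, the two units can end up with different predicted effect sizes even though the causal decision in each iteration is shared between them.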
[0018] In some embodiments, determining whether a genetic variant is causal involves calculating the probability of the information from the plurality of input units assuming the genetic variant is causal and the probability of the information from the plurality of input units assuming the genetic variant is not causal, and probabilistically determining that the genetic variant is causal with a probability that depends on the ratio of the former probability to the latter. By using probabilistic sampling, the method can consider multiple different combinations of causal variants and identify the overall effect that best explains the observed data.
[0019] In some embodiments, the probability of the information from the plurality of input units assuming the genetic variant is causal depends on the proportion of the plurality of genetic variants expected to be causal, on the plurality of input units, and on the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units. In some embodiments, the probability of the information from the plurality of input units assuming the genetic variant is not causal depends on the proportion of the plurality of genetic variants expected to be causal and on the plurality of input units. These terms make it possible to incorporate existing information about the proportion of causal variants into the analysis and to allow the predicted effect sizes to vary between input units. If the variant is not causal, its effect size is zero, so the correlation between effect sizes is not relevant in that case.
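The probabilistic decision described above can be sketched in Python as follows. This is a minimal sketch under an assumed functional form: the variant is taken to be causal with probability R / (1 + R), where R is the ratio of the two probabilities. The text only states that the probability depends on the ratio, so this particular form, the function names, and the use of log-probabilities are illustrative assumptions.

```python
import math
import random


def causal_probability(log_p_causal, log_p_null):
    """Probability of sampling the variant as causal.

    log_p_causal / log_p_null: log-probability of the input data assuming the
    variant is / is not causal (assumed to already include any prior weight).
    Assumed form: R / (1 + R) with R = exp(log_p_causal - log_p_null).
    """
    diff = log_p_null - log_p_causal
    if diff > 700.0:  # avoid overflow in exp(); the probability is ~0 here
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def sample_causal(log_p_causal, log_p_null, rng):
    """Draw the causal indicator (True or False) with that probability."""
    return rng.random() < causal_probability(log_p_causal, log_p_null)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
is_causal = sample_causal(math.log(3.0), 0.0, rng)  # ratio 3, probability 0.75
```

Working with log-probabilities avoids numerical underflow when the probabilities of the data are very small, which is typical when many variants and input units are combined.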
[0020] In some embodiments, the proportion of a plurality of genetic variants expected to be causative is predetermined. In some embodiments, the correlation between the effect sizes of the genetic variants on the target phenotype for each of the input units is predetermined. By using predetermined values of the parameters, it is possible to incorporate existing knowledge into the method in a computationally efficient manner.
[0021] In some embodiments, the proportion of the plurality of genetic variants expected to be causal is updated in each iteration. In some embodiments, the correlation between the effect sizes of the genetic variant on the target phenotype for each input unit is updated in each iteration. By learning and updating the parameters in each iteration, the method can converge toward the true parameter values, which may provide more accurate results but may be more computationally intensive.
[0022] In some embodiments, the input units are determined from their respective populations, and the probability of information from multiple input units, assuming genetic variants are the cause, depends on one or more parameters that quantify the overlap in the populations between pairs of input units. Depending on the data used, some individuals may be present in multiple input units, which can distort the conclusions drawn. Adding parameters to account for this improves the accuracy of the resulting effect size.
[0023] In some embodiments, determining the sampled effect size of a genetic variant involves calculating a probability distribution of the effect size of the genetic variant for a target phenotype with respect to an input unit, and sampling effect size values for the input unit from the probability distribution. Using a probability distribution allows the method to sample multiple different effect sizes while encouraging the selection of values within the range considered most likely to be correct.
[0024] In some embodiments, the probability distribution is a multivariate normal distribution. Using a multivariate normal distribution provides a convenient method that allows for different sampled effect sizes for different input units.
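A minimal Python sketch of drawing one effect size per input unit from a multivariate normal distribution is shown below. It uses a small hand-written Cholesky factorization so that it depends only on the standard library. The mean vector and covariance matrix are illustrative numbers, and how those quantities would be derived from the input units is not shown and is outside this sketch.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - s)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def sample_effect_sizes(mean, cov, rng):
    """Draw one vector (one effect size per input unit) from N(mean, cov)."""
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]  # independent standard normals
    return [
        m + sum(lower[i][k] * z[k] for k in range(i + 1))
        for i, m in enumerate(mean)
    ]


# Two input units with strongly correlated effect sizes (illustrative values).
mean = [0.10, 0.10]
cov = [[0.040, 0.036],
       [0.036, 0.040]]  # standard deviation 0.2 each, correlation 0.9
draw = sample_effect_sizes(mean, cov, random.Random(0))
```

With a high correlation in the covariance matrix, the sampled effect sizes for the two units tend to be close to each other while still being allowed to differ.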
[0025] In some embodiments, the sampling of effect size values in each iteration depends on sampled effect sizes from one or more previous iterations. This type of dependency allows for efficient exploration of the space of values that can be sampled. In some embodiments, the sampling of effect size values is performed using a Monte Carlo-Gibbs sampler. This type of sampling algorithm is particularly suitable for this application.
[0026] In some embodiments, the probability distribution depends on the correlation between the effect sizes of genetic variants for the target phenotype for each of the input units. This allows for control over the likely range of difference in effect sizes between input units, thereby improving accuracy and computational efficiency.
[0027] In some embodiments, the correlation between the effect sizes of genetic variants on the target phenotype for each input unit is predetermined. By using predetermined parameter values, existing knowledge can be incorporated into the method in a computationally efficient manner.
[0028] In some embodiments, the correlation between the effect sizes of genetic variants on the target phenotype for each input unit is updated in each iteration. By learning and updating the parameters in each iteration, the method can converge toward the true parameter values, which may provide more accurate results but may be more computationally intensive.
[0029] In some embodiments, each of one or more iterations further comprises subtracting a weighted effect size for each genetic variant determined to be causal from information about the association between each other genetic variant in each input unit and the target phenotype, where the weighted effect size is a sampled effect size of the genetic variant to the target phenotype in the input unit, weighted by the respective correlation coefficients between the genetic variant and each other genetic variant, where the correlation coefficients are determined based on information about the associations between multiple genetic variants in the region of interest. Subtracting the effect of the variant determined to be causal from the associated variants ensures that multiple causal variants are not misidentified based on a single causal relationship. By using correlation coefficients per input unit, the method can account for the variability of genetic correlations among subpopulations.
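The subtraction step for a single input unit can be sketched as follows. The function and variable names are hypothetical, and the numbers are illustrative; the sketch assumes the association information is held as one marginal estimate per variant and the correlations as a square matrix for that input unit.

```python
def subtract_causal_effect(residuals, ld, i, beta_i):
    """Remove the signal explained by causal variant i from the other variants.

    residuals: current association estimate for each variant in ONE input unit.
    ld:        correlation coefficients between variants for that input unit,
               ld[i][j] being the correlation between variants i and j.
    i:         index of the variant determined to be causal.
    beta_i:    sampled effect size of variant i in this input unit.
    Returns a new list; the entry for variant i itself is left unchanged.
    """
    return [
        r if j == i else r - ld[i][j] * beta_i
        for j, r in enumerate(residuals)
    ]


residuals = [0.30, 0.25, 0.02]
ld = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.0],
    [0.1, 0.0, 1.0],
]
updated = subtract_causal_effect(residuals, ld, i=0, beta_i=0.30)
```

In this example, most of the apparent signal at the second variant (0.25) is explained by its correlation of 0.8 with the causal first variant, so its residual drops to about 0.01 and it is unlikely to be identified as a second causal variant.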
[0030] In some embodiments, the input units are determined from their respective populations, and the correlation coefficients between each genetic variant depend on the ancestry of the input unit populations. In some embodiments, at least one population of the input units includes individuals with a common ancestor, and the correlation coefficients are determined based on the correlations between genetic variants in the region of interest for individuals with a common ancestor. Using ancestry-based correlation coefficients is particularly useful because individuals with different ancestors often have different patterns of correlation between genetic variants.
[0031] In some embodiments, if at least one of the input units includes individuals with different ancestors, the correlation coefficient is determined based on the mean of the correlations between genetic variants in the region of interest for individuals with each of the different ancestors. Some input units may come from studies that are not stratified by ancestry. By using a mixed set of correlation coefficients, this data can still be incorporated into the method and the results can be improved.
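For an input unit of mixed ancestry, the averaging of correlation coefficients can be sketched as below. This minimal sketch uses an unweighted element-wise mean; whether and how the mean would be weighted (for example by the ancestry composition of the study) is not specified in the text, so the unweighted form is an assumption.

```python
def mean_ld(matrices):
    """Element-wise unweighted mean of equally sized correlation matrices."""
    if not matrices:
        raise ValueError("at least one matrix is required")
    n = len(matrices[0])
    count = len(matrices)
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(n)]
        for i in range(n)
    ]


# Correlations between two variants in two ancestries (illustrative values).
ld_ancestry_a = [[1.0, 0.8], [0.8, 1.0]]
ld_ancestry_b = [[1.0, 0.4], [0.4, 1.0]]
ld_mixed = mean_ld([ld_ancestry_a, ld_ancestry_b])
```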
[0032] In some embodiments, at least one population of the input units includes individuals having the same value for a trait. In some embodiments, at least one population of the input units includes individuals having different values for a trait. In some embodiments, the trait is one of sex, age, weight, molecular biomarker, or behavioral trait. Subpopulations may be defined based on traits, and input units based on data from individuals having those traits make it possible to draw conclusions about the difference in effect sizes between different subpopulations.
[0033] In some embodiments, performing one or more iterations includes performing a predetermined number of iterations. Performing a predetermined number of iterations can provide sufficient results for known types of problems while maintaining high computational efficiency.
[0034] In some embodiments, each of one or more iterations further includes a step of evaluating a convergence parameter, and performing one or more iterations includes performing the iterations until a predetermined condition for the convergence parameter is met. Calculating the convergence parameter may be advantageous when the appropriate number of iterations is unknown.
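One possible way to iterate until a convergence condition is met is sketched below. The convergence parameter used here, the change in the running mean of a monitored quantity between successive iterations, is a hypothetical choice for illustration, as are the names `step`, `tol`, `min_iter`, and `max_iter`; the text does not specify which convergence parameter is evaluated.

```python
def run_until_converged(step, tol=1e-4, min_iter=10, max_iter=10_000):
    """Call step(n) once per iteration until the running mean stabilizes.

    step:     callable returning the monitored quantity for iteration n
              (for example, a sampled effect size); hypothetical interface.
    tol:      threshold on the change in the running mean (assumed criterion).
    Returns (running_mean, iterations_performed).
    """
    running_mean = 0.0
    for n in range(1, max_iter + 1):
        value = step(n)
        new_mean = running_mean + (value - running_mean) / n
        if n >= min_iter and abs(new_mean - running_mean) < tol:
            return new_mean, n
        running_mean = new_mean
    return running_mean, max_iter


# A constant monitored quantity converges as soon as min_iter is reached.
result = run_until_converged(lambda n: 1.0)
```

The `max_iter` cap plays the role of the predetermined number of iterations when the condition is never met, so the two stopping strategies can be combined.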
[0035] In some embodiments, information about the association between multiple genetic variants and a target phenotype includes, for each of the multiple genetic variants, an estimate of the strength of the association between the genetic variant and the target phenotype, and the error in the estimate of the strength of the association. As mentioned above, using this type of summary statistical data has the advantage of having a large amount of data available.
[0036] In another embodiment, a method is provided for determining a polygenic risk score for a target phenotype for a target individual, comprising: receiving genetic information for a target region of the target individual's genome; receiving predicted effect sizes of a plurality of genetic variants in the target region on the target phenotype, determined using the method for analyzing genetic data; and determining the polygenic risk score for the target individual based on the genetic information and the predicted effect sizes. As described above, the calculation of a polygenic risk score is a particularly desirable use of the predicted effect sizes determined for genetic variants and can be used in a variety of clinical applications. In some embodiments, the input units received in the method for analyzing genetic data are determined from respective populations, and the polygenic risk score for the individual is determined using the predicted effect sizes for the input unit determined from the population most similar to the target individual. By using the predicted effect sizes for the input unit most appropriate for the individual, the accuracy of the polygenic risk score can be improved compared with one determined using general effect sizes derived from unstratified data.
[0037] According to another aspect of the present invention, an apparatus for analyzing genetic data of an organism is provided. The apparatus comprises: a receiving unit configured to receive a plurality of input units, each input unit containing information about the association between a plurality of genetic variants in a region of interest in the organism's genome and a target phenotype of the organism; and a data processing unit configured to: perform one or more iterations, each iteration including, for each of the plurality of genetic variants, determining, based on the plurality of input units, whether the genetic variant is causal for the target phenotype and, if it is determined to be causal, determining a sampled effect size of the genetic variant on the target phenotype for each input unit, based on the plurality of input units and on information about the correlation between the plurality of genetic variants in the region of interest, wherein the sampled effect size of the genetic variant on the target phenotype is non-zero for all input units; and determine, for each genetic variant, a predicted effect size of the genetic variant on the target phenotype for each input unit, based on an average, over at least a subset of the iterations, of the sampled effect size of the genetic variant for the input unit or of a posterior effect size of the genetic variant for the input unit calculated using the sampled effect size.
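The score calculation can be sketched as follows, assuming the common additive form in which the score is the sum over variants of the individual's allele dosage multiplied by the predicted effect size. The additive form, the dosage coding (0, 1, or 2 copies of the effect allele), the dictionary layout, and the unit labels are assumptions made for this sketch; choosing which input unit is most similar to the individual is taken as given.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of allele dosage times predicted effect size over all variants."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Predicted effect sizes per input unit (hypothetical labels and values).
effects_by_unit = {
    "unit_A": [0.12, -0.05, 0.30],
    "unit_B": [0.10, -0.02, 0.18],
}

# Target individual's dosages for the three variants in the target region.
dosages = [2, 1, 0]

# Use the effect sizes of the input unit whose population is most similar to
# the target individual (assumed here to be "unit_B").
score = polygenic_risk_score(dosages, effects_by_unit["unit_B"])
```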
[0038] The present invention may be embodied in a computer program that includes instructions for causing a computer to execute the method, or in a computer-readable medium that includes instructions for causing a computer to execute the method when executed by a computer.
[0039] Embodiments of the present invention will now be further described, by way of example only, with reference to the accompanying drawings.
[Brief Description of the Drawings]
[0040]
[Figure 1] A flowchart of a method for analyzing genetic data of an organism according to the present invention.
[Figure 2] A flowchart showing the steps of each iteration performed in the method shown in Figure 1.
[Figure 3] A flowchart of a method for determining a polygenic risk score according to the present invention.
[Figure 4] A graph showing the effect sizes estimated for two different subpopulations using conventional techniques for analyzing genetic data.
[Figure 5] A graph showing the effect sizes estimated for two different subpopulations using the method according to the present invention.
[Modes for Carrying Out the Invention]
[0041] Figure 1 shows a computer-based method for analyzing genetic data of an organism. Typically, the organism is a human, but the method may be applied to other organisms. Although the method refers to an "organism," this term does not necessarily refer to a specific individual organism; it may refer to a type of organism or to a group of organisms collectively.
[0042] The method includes step S10, which receives a plurality of input units 10. The input units 10 contain information about the association between a plurality of genetic variants in a target region of the organism's genome and the target phenotype of the organism. The target phenotype may include any physical, behavioral, or other phenotype that may be targeted. Genetic variants are typically single nucleotide polymorphisms, but may also include other types of genetic variations, such as insertions or deletions of a portion of the organism's genome.
[0043] Each input unit 10 may be derived from one or more genome-wide association studies (GWAS), and may therefore be referred to as a study or GWAS. Each input unit 10 contains information about the association between multiple genetic variants and target phenotypes for individuals, for example, a group of individuals involved in the corresponding GWAS.
[0044] At least a subset of the input units 10 are each determined from a population belonging to a particular subpopulation. For example, the population of at least one of the input units 10 may include individuals that share a common ancestry. Alternatively or additionally, the population of at least one of the input units 10 may include individuals that share the same value of a trait. The trait may be, for example, sex, age, weight, a molecular biomarker, or a behavioral trait such as whether or not an individual smokes. In the case of continuous traits such as age or weight, the trait values may be divided into arbitrary bins to form a discrete number of categories, and the individuals for whom data is available may be divided into the corresponding discrete groups to define the input units 10.
[0045] Since the bin definition is arbitrary and not dictated by biology, some embodiments of the method may include performing the steps of the method multiple times with different bin definitions (and correspondingly modified input units 10) and comparing the predictive power of the effect sizes produced with the different bin definitions. The effect sizes with the greatest predictive power may then be returned as the output of the method.
[0046] Not all of the input units 10 have to be determined from populations belonging to a particular subpopulation. For example, the population of at least one of the input units 10 may include individuals with different ancestries. Alternatively or additionally, the population of at least one of the input units 10 may include individuals with different values of a trait. By including one or more additional input units 10 from studies that have not been stratified by subpopulation, the method can make use of additional information from populations that cannot be separated into subpopulations. This may be because, for example, the underlying data did not contain information about specific traits of individuals in the study, making stratification impossible.
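The binning of a continuous trait into discrete groups can be sketched in Python as follows. The bin edges, the function name, and the convention that a value equal to an edge falls into the higher bin are illustrative assumptions; each resulting group would correspond to the population behind one input unit.

```python
from bisect import bisect_right
from collections import defaultdict


def assign_bins(values, edges):
    """Group individual indices by the bin their trait value falls into.

    values: trait value (for example, age) for each individual.
    edges:  sorted bin boundaries; bin 0 is below edges[0], bin k is between
            edges[k-1] (inclusive) and edges[k] (exclusive), and so on.
    Returns {bin_index: [indices of individuals in that bin]}.
    """
    groups = defaultdict(list)
    for index, value in enumerate(values):
        groups[bisect_right(edges, value)].append(index)
    return dict(groups)


ages = [34, 61, 47, 72, 50]
groups = assign_bins(ages, edges=[40, 60])  # bins: under 40, 40 to 59, 60 and over
```

Re-running the same function with different `edges` gives the alternative bin definitions that can then be compared for predictive power.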
[0047] In the embodiments described herein, information about the association between a plurality of genetic variants and a target phenotype includes, for each of the plurality of genetic variants, an estimate of the strength of the association between the genetic variant and the target phenotype, and the error of the estimate of the strength of the association. Thus, each input unit 10 includes, for each variant i numbered from 1 to n, an estimate of the strength of the association between variant i and the target phenotype.
[0048] [Equation image not reproduced]
[0049] [Equation image not reproduced]
[0050] Estimated strength of association in each input unit 10
[0051] [Equation image not reproduced]
[0052] It is desirable to determine the unknown true effect size $\beta_i$ (or strength of association) of any given variant i, adjusted for correlation with nearby variants. The problem of genetic prediction is to estimate the set of these true effect sizes $\beta_i$. All
[0053] [Equation image not reproduced]
[0054] In this method, the estimation of which variants are causal and of their corresponding effect sizes is achieved by exploring the space of possible $(X_i, \beta_i)$ in step S12 of performing one or more iterations. Details of this step are discussed further below. In some embodiments, performing one or more iterations includes performing a predetermined number of iterations. This can be advantageous when it is known approximately how many iterations are needed to obtain a high-precision result. In some embodiments, each of the one or more iterations further includes a step of evaluating a convergence parameter, and performing one or more iterations includes performing iterations until a predetermined condition for the convergence parameter is satisfied. This can be advantageous when it is unclear how many iterations are required to yield a high-precision result.
[0055] As mentioned above, currently available methodologies for analyzing genetic data (such as LDpred) consider one GWAS at a time and perform random sampling, for example Monte Carlo sampling, to determine which variants are causal. LDpred relies on the fact that the Bayesian computation can be solved for one study and one genetic variant. It then extends the methodology from one variant to multiple correlated variants using Gibbs sampling techniques. Specifically, for a given genetic variant, LDpred makes the following prior assumptions:
- With probability (1-p), the effect of the genetic variant on the phenotype is 0 (i.e., the variant is not causal).
- With probability p, the effect on the phenotype follows a normal distribution with mean 0 and variance $\sigma^2$ (i.e., the variant is causal and its effect size distribution is centered at 0).
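The two prior assumptions can be written compactly as a two-component mixture. In the notation below, $\beta_i$ is the true effect size of variant i, p and $\sigma^2$ are as in the text, and $\delta_0$ (a point mass at zero) is notation introduced here for illustration only:

```latex
\beta_i \sim
\begin{cases}
0 & \text{with probability } 1 - p,\\[2pt]
\mathcal{N}(0, \sigma^2) & \text{with probability } p,
\end{cases}
\qquad\text{equivalently}\qquad
\beta_i \sim (1 - p)\,\delta_0 + p\,\mathcal{N}(0, \sigma^2).
```

A smaller p corresponds to the assumption that fewer variants are causal, while $\sigma^2$ controls how large the effects of causal variants are expected to be.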
[0056] Summary statistics in the training GWAS for these assumptions and related phenotypes
[0057] [Equation image not reproduced]
[0058] [Equation image not reproduced]
[0059] However, this approach has limitations, particularly for smaller studies, which may yield poor results for some subpopulations. For example, studies of individuals of non-European ancestry are less common and typically smaller in scale than studies of individuals of European ancestry, and therefore yield less accurate predictions for individuals of non-European ancestry.
[0060] When multiple studies of the same target phenotype are considered, currently available methods combine the studies into a single meta-analysis and perform further processing, such as determining the PRS, on that meta-analysis. One example of a tool that examines the evidence for association between variants and target phenotypes across multiple studies is multi-trait analysis of GWAS (MTAG, Turley et al., 2018). MTAG combines a set of GWAS and generates a kind of meta-analysis that yields updated summary statistics for each input GWAS. These updated summary statistics can be fed into any standard PRS construction methodology, including LDpred (Craig et al., Nature Genetics, 2020). However, MTAG uses marginal effect sizes and standard errors without simultaneously considering LD information, meaning that this method does not fully exploit the richness of the available input datasets. Another existing approach for combining multiple studies is a single-variant Bayesian computation developed in a different context (Trochet et al., Genetic Epidemiology, 2019). In this method, the objective is not to predict effect sizes but to combine multiple studies to improve the power to detect genetic associations. Genetic variants are therefore examined individually, and there is no motivation to account for the correlation patterns between them.
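To make concrete what combining studies into a single meta-analysis involves, the following minimal Python sketch pools the per-study effect size estimates of one variant using fixed-effect inverse-variance weighting. The text does not say which meta-analysis method existing approaches use, so inverse-variance weighting is an assumed, commonly used choice, and the function name and numbers are illustrative.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance pooling of one variant across studies.

    betas:           estimated effect size of the variant in each study.
    standard_errors: standard error of each estimate.
    Returns (pooled_effect_size, pooled_standard_error).
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, math.sqrt(1.0 / total)


# Two studies of the same phenotype with different estimated effects.
pooled_beta, pooled_se = inverse_variance_meta([0.2, 0.4], [0.1, 0.1])
```

The pooled result is a single effect size per variant (about 0.3 here), so any genuine difference between the populations behind the two studies is averaged away rather than estimated.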
[0061] The limitations of existing approaches can also be illustrated by several use cases.
[0062] In the first scenario, for historical reasons, a well-powered GWAS exists for individuals of a first ancestry, typically European ancestry. A second, less well-powered study exists for the same target phenotype in a different ancestry. It is not easy to combine the well-powered study with the second study using existing methods. Firstly, because the correlation patterns between variants differ across ancestries, combining the two studies results in a composite study that is difficult to analyze. Secondly, genetic and environmental differences between the studies can lead to population-specific variants or to differences in effect sizes between these populations. Existing methods cannot account for this.
[0063] In the second scenario, predictive algorithms are generated that capture risk factors specific to a subset of the population. Current methods may not be able to make the most of the underlying genetic data. A "context-specific" PRS calculated using effect sizes specific to an individual's age, sex, ethnicity, or any other social determinant of health may be more accurate. For example, determinants of cardiovascular disease (CVD) differ by sex due to differences in BMI, blood pressure, alcohol intake, and exercise patterns.
[0064] Existing methods address this challenge by obtaining already stratified samples for studies of subpopulations and then deriving PRSs separately from these. In the CVD example above, current methods analyze GWAS separately for two sex-specific cohorts (male and female) and generate a PRS from each cohort. However, many genetic determinants are shared across sexes. A joint analysis of the male and female cohorts that takes sex differences into account and generates sex-specific PRSs would therefore be more appropriate for maximizing predictive power. Similarly, if the PRS of interest is for lung cancer in non-smokers, existing methods offer two analogous options: 1) using a large sample that includes smokers, or 2) using a smaller study consisting only of non-smokers.
[0065] However, the predictive power of the PRS also depends on the size of the underlying study, so restricting the study sample to a subset of the data is generally undesirable. In the smoking example, the first option uses a biased study (because a proportion of the participants are smokers, the PRS assigns larger effect sizes to addiction-related variants), while the second option is likely to be underpowered (since 80% of lung cancer patients are smokers). This creates a trade-off for subpopulation-specific PRSs.
[0066] These use cases are not mutually exclusive. For example, one might want to determine the PRS to predict clinical outcomes in a socially defined subset of a certain sex or ethnic group.
[0067] To overcome these limitations, this method allows for the combination of information from multiple studies when determining causal variants and their effect sizes, but importantly, it allows the determined effect sizes for each genetic variant to differ among the input units 10. This makes it possible to use the greater statistical power of larger studies together with data from smaller studies to improve estimation of which variant is causal in smaller studies, while at the same time determining different effect sizes for different subpopulations.
[0068] This involves extending the Bayesian computation of LDPred (Vilhjalmsson et al., 2015) from a single study to any number of studies in separate subpopulations, but on the same phenotype. Doing so establishes a link between the single-variant, multiple-study approach of Trochet et al. and the multi-variant, single-study approach of Vilhjalmsson et al. Understanding the relationship between these two methodological approaches enables the flexible integration of multiple studies and the creation of predictive algorithms based on multiple GWASs rather than a single study.
[0069] As shown in Figure 2, each iteration in step S12 of this method involves determining, for each of several genetic variants, whether the genetic variant is the cause of the target phenotype based on several input units 10. Existing methods consider genetic variants one by one, for example, in physical order or by random sampling, although other options are possible. However, for each variant, this method incorporates multiple studies rather than a single study, evaluating the probability of a model of the variant's causality and effect size for each of the input units 10 (for example, by Bayesian analysis, as will be discussed further below). Thus, this method determines whether each genetic variant is causative by analyzing all of the input units 10 together, rather than considering the input units 10 one by one at a time as in existing methods, or combining the input units 10 into a single meta-analysis.
[0070] If a genetic variant is determined to be the cause, a step is taken to determine the sampled effect size 12 of the genetic variant for the target phenotype for each of the input units 10, based on information about the multiple input units 10 and the correlations between the multiple genetic variants in the region of interest. Thus, in the exploration of the space of causative variants and combined effect sizes, if a variant is selected as the cause, different effect sizes are sampled for each study.
[0071] In the embodiment shown in Figure 1, determining whether or not a genetic variant is the cause includes the steps of: S120, which calculates the probability of information from multiple input units assuming that a genetic variant is the cause, and the probability of information from multiple input units assuming that a genetic variant is not the cause; and S122, which probabilistically determines that a genetic variant is the cause with a probability that depends on the ratio of the probability of information from multiple input units assuming that a genetic variant is the cause, and the probability of information from multiple input units assuming that a genetic variant is not the cause.
[0072] In step S120, the probability of information from multiple input units when it is assumed that a genetic variant is the cause may depend on the proportion of the multiple genetic variants expected to be the cause, the multiple input units 10, and the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units 10. The probability of information from multiple input units when it is assumed that a genetic variant is not the cause may depend on the proportion of the multiple genetic variants expected to be the cause and the multiple input units 10. The probabilities may be calculated using prior values.
[0073] For example, in one embodiment, two prior models are considered for any given variant. The null hypothesis is that the variant has an effect size of 0 for all input units 10, with probability (1-p). The alternative hypothesis is that, with probability p, the effect size of the genetic variant with respect to the input unit 10 follows a multivariate Gaussian distribution.
[0074] The parameter p is the proportion of several genetic variants that are expected to be causative. In some embodiments, the proportions of several genetic variants that are expected to be causative are predetermined. This may be more computationally efficient if estimates are available. In some embodiments, the proportions of several genetic variants that are expected to be causative are updated in each iteration. This allows the method to converge to the true value of p, potentially improving accuracy.
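One way the per-iteration update of p could be realized is a conjugate Beta draw based on how many variants were sampled as causal in the current iteration. The text does not state the update rule, so the Beta(a, b) prior, the function name `update_causal_fraction`, and its arguments are illustrative assumptions rather than the method as claimed.

```python
import numpy as np


def update_causal_fraction(is_causal, rng, a=1.0, b=1.0):
    """Draw an updated value of p given the current causal indicators.

    Assumption (not from the source): p has a Beta(a, b) prior, so its
    conditional posterior is Beta(a + n_causal, b + n_not_causal).
    """
    is_causal = np.asarray(is_causal, dtype=bool)
    n_causal = int(is_causal.sum())
    n_not_causal = is_causal.size - n_causal
    return rng.beta(a + n_causal, b + n_not_causal)


# Example: 5 of 100 variants were sampled as causal in this iteration.
rng = np.random.default_rng(0)
p_new = update_causal_fraction([True] * 5 + [False] * 95, rng)
```

Fixing p in advance simply skips this draw, which corresponds to the computationally cheaper embodiment described above.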
[0075] Under the null hypothesis, the sampled effect size is equal to 0 for all input units 10; that is, the sampled effect size β_i for variant i is equal to 0. The covariance matrix in this case is determined solely by the uncertainty in the parameter values (denoted SE_{i,j}, the standard error of the marginal effect size of variant i from input unit j). This uncertainty itself depends on the sample size of the study and is given in the summary statistics of the input units 10. More precisely:
[0076]
number
[0077] Under the alternative hypothesis, the sampled effect size β_i of variant i is non-zero and follows a multivariate Gaussian that has, in each dimension, a mean of 0 and an unknown variance.
[0078]
number
[0079]
number
[0080]
number
[0081] In another embodiment, the correlation between the effect sizes of genetic variants for the target phenotype for each of the input units 10 is updated in each iteration. This allows the method to converge to the true parameter values, potentially leading to more accurate results. Alternatively, a grid of correlation values can be considered, and the optimal value can be selected by maximizing predictive performance in a dataset of individual-level data with known outcomes. In the example given here, the correlation between effect sizes is a single parameter that is the same for all combinations of input units 10.
[0082] The correlation may instead be a correlation matrix, which allows the correlation to differ between different combinations of input units 10. For example, when the input units 10 correspond to bins of a continuous variable such as age, the correlation can be used to smooth between bins: information can be borrowed from adjacent age bins to improve the effect sizes and the corresponding PRS for any given bin. Since there is an a priori expectation that adjacent or nearby bins should have a higher genetic correlation than more distant bins, this can be taken into account by using different correlation values between different bins of the continuous variable.
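As a minimal sketch of a correlation matrix in which nearby bins are more strongly correlated than distant ones, the example below uses a geometric decay with bin distance. The decay form, the function name `bin_correlation_matrix`, and the parameter `rho` are assumptions chosen for illustration; the text only requires that correlation may differ between pairs of input units.

```python
import numpy as np


def bin_correlation_matrix(n_bins, rho):
    """Correlation between bins a and b set to rho ** |a - b|.

    Assumption: geometric decay with bin distance (hypothetical choice).
    For 0 <= rho < 1 the resulting matrix is positive definite.
    """
    idx = np.arange(n_bins)
    return rho ** np.abs(idx[:, None] - idx[None, :])


# Four age bins: adjacent bins correlate at 0.8, the outermost pair less.
R = bin_correlation_matrix(4, 0.8)
```

A single shared correlation parameter, as in the example of the preceding paragraph, corresponds instead to a matrix with the same off-diagonal value everywhere.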
[0083] Once these two prior models are defined, we can calculate the probabilities of information from multiple input units assuming that genetic variants are the cause, and the probabilities of information from multiple input units assuming that genetic variants are not the cause, and combine these with these prior models.
[0084] In one embodiment of step S122, the Bayes factor can be calculated for each variant i using the probabilities determined in step S120.
[0085]
number
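The Bayes factor equation itself is not reproduced in this text. As a hedged sketch, assume the marginal effect estimates of one variant across K input units are Gaussian with covariance diag(SE²) under the null model, and with that covariance plus a prior effect-size covariance under the causal model. The Bayes factor and the resulting probability of sampling the variant as causal could then be computed as follows. The function names and the argument `prior_cov` are hypothetical.

```python
import numpy as np


def log_mvn_zero_mean(x, cov):
    """Log density of a zero-mean multivariate normal evaluated at x."""
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)


def causal_probability(beta_hat, se, prior_cov, p):
    """Return (probability the variant is causal, log Bayes factor).

    beta_hat, se: length-K marginal effects and standard errors.
    prior_cov: K x K prior covariance of true effects (assumed known).
    p: prior proportion of causal variants.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    V = np.diag(np.asarray(se, dtype=float) ** 2)
    log_bf = (log_mvn_zero_mean(beta_hat, V + prior_cov)
              - log_mvn_zero_mean(beta_hat, V))
    log_odds = np.log(p) - np.log1p(-p) + log_bf
    # Numerically stable logistic transform of the posterior log odds.
    prob = np.exp(-np.logaddexp(0.0, -log_odds))
    return prob, log_bf


prior_cov = 0.01 * np.array([[1.0, 0.8], [0.8, 1.0]])
prob, log_bf = causal_probability([0.10, 0.09], [0.01, 0.03], prior_cov, p=0.01)
```

The variant would then be marked causal when a uniform random draw falls below `prob`, which matches the probabilistic decision described for step S122.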
[0086] We assume that the causative genetic variants are shared among the input units 10 (and their corresponding subpopulations), and that the effect sizes of these variants correlate but vary among the input units 10. In other words, a variant is either causative for all of the input units 10, or not causative for any of them. Therefore, if a genetic variant is determined to be causative, the sampled effect size 12 of the genetic variant for the target phenotype is determined to be non-zero for all of the input units 10.
[0087] When each input unit 10 is determined from a respective population, then, depending on the studies used to determine the input units 10, one potential challenge is sample overlap between studies. For example, a "sex-joint" study may be used to derive one input unit 10, which is then analyzed jointly with input units 10 derived from separate "male-only" and "female-only" studies. The sex-specific studies may be subsets of the larger sex-joint study, which may include additional samples for which sex information was not provided, or which may simply be the union of the two sex-specific studies. To account for this, in some embodiments, the probability of information from multiple input units, assuming the genetic variant is the cause, depends on one or more parameters that quantify the overlap in populations between each pair of input units 10.
[0088] For example, one way to account for this is to update the covariance matrix V_i shown above to the following:
[0089]
number
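The updated covariance matrix is not reproduced in this text. The sketch below shows one common way overlap can enter such a matrix, and it is an assumption rather than the formula of this specification: under the null model, two estimates of the same phenotype that share N_shared samples are taken to correlate at N_shared / sqrt(N_j N_k). All names are hypothetical.

```python
import numpy as np


def null_covariance_with_overlap(se, n, n_shared):
    """Covariance of marginal effect estimates for one variant.

    se: length-K standard errors; n: length-K study sample sizes.
    n_shared: K x K matrix of shared sample counts between studies.
    Assumption: overlap-induced correlation is n_shared / sqrt(n_j * n_k).
    """
    se = np.asarray(se, dtype=float)
    n = np.asarray(n, dtype=float)
    corr = np.asarray(n_shared, dtype=float) / np.sqrt(np.outer(n, n))
    np.fill_diagonal(corr, 1.0)
    return corr * np.outer(se, se)


# A sex-joint study that is the exact union of a male and a female study.
n_shared = [[20000, 10000, 10000],
            [10000, 10000, 0],
            [10000, 0, 10000]]
V = null_covariance_with_overlap([0.010, 0.014, 0.014],
                                 [20000, 10000, 10000], n_shared)
```

With no overlap the off-diagonal entries vanish and the matrix reduces to the diagonal form used under the null hypothesis. When one study is the exact union of the others, as in this example, the matrix becomes singular, so a practical implementation would need to regularize it or drop the redundant input unit.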
[0090] If it is determined that a genetic variant is the cause, the posterior mean and variance can be calculated for the combined effect size across all input units 10. The step of determining the sampled effect size 12 of the genetic variant includes the step S124 of calculating the probability distribution of the effect size of the genetic variant for the target phenotype for the input units 10, and the step S126 of sampling the effect size values for the input units 10 from the probability distribution.
[0091] In practice, it is impossible to completely explore the space of all possible causal variants and all possible corresponding effect sizes within a reasonable time; therefore, sampled effect sizes 12 are used. Accordingly, sampling techniques, such as Monte Carlo simulation, are used to explore the space of causal variants and their corresponding effect sizes. In some embodiments, sampling of effect size values in each iteration depends on sampled effect sizes 12 from one or more previous iterations. This can be used to guide the sampling technique to sufficiently explore the space of possible values. In some embodiments, sampling of effect size values is performed using a Monte Carlo Gibbs sampler.
[0092] In a preferred embodiment, the probability distribution is a multivariate normal distribution. The probability distribution may depend on the correlation between the effect sizes of the genetic variants for the target phenotype for each of the input units 10. As discussed above regarding probability, the correlation between the effect sizes of the genetic variants for the target phenotype for each of the input units 10 may be predetermined. Alternatively, the correlation between the effect sizes of the genetic variants for the target phenotype for each of the input units 10 may be updated in each iteration, thereby allowing the method to learn a suitable value for the correlation.
[0093] In certain embodiments, the posterior distribution of the effect size is a multivariate normal distribution whose posterior mean is given as follows.
[0094]
number
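Because the posterior formula is not reproduced in this text, the following is a sketch under standard conjugate-Gaussian assumptions: a zero-mean multivariate normal prior with covariance `prior_cov` on the true effects, and Gaussian estimation noise with covariance `V`. It computes the posterior mean and covariance across input units (step S124) and draws one sample (step S126). The function names are hypothetical.

```python
import numpy as np


def posterior_effect(beta_hat, V, prior_cov):
    """Posterior mean and covariance of the K effect sizes of one variant.

    Assumes beta ~ N(0, prior_cov) and beta_hat | beta ~ N(beta, V).
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    gain = prior_cov @ np.linalg.inv(prior_cov + V)
    mean = gain @ beta_hat
    cov = prior_cov - gain @ prior_cov
    return mean, 0.5 * (cov + cov.T)  # symmetrize against round-off


def sample_effect(beta_hat, V, prior_cov, rng):
    """Draw one effect-size vector (one value per input unit)."""
    mean, cov = posterior_effect(beta_hat, V, prior_cov)
    return rng.multivariate_normal(mean, cov)


prior_cov = 0.01 * np.array([[1.0, 0.9], [0.9, 1.0]])
V = np.diag(np.array([0.005, 0.05]) ** 2)  # study 1 is far better powered
mean, cov = posterior_effect([0.10, 0.0], V, prior_cov)
beta_sample = sample_effect([0.10, 0.0], V, prior_cov, np.random.default_rng(1))
```

In this toy input the second study observes no effect, yet its posterior mean is pulled toward the well-powered first study through the prior correlation, which illustrates how a smaller study can borrow strength while still receiving its own effect size.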
[0095] A key step in some embodiments of methods for analyzing genetic data for the purpose of calculating a PRS is controlling for the correlation between genetic variants. As mentioned above, the correlation between variants can result in some variants having a large marginal effect size even if they are not causal for the target phenotype.
[0096] To take this into consideration, in some embodiments, each of the one or more iterations further includes step S128 of, for each genetic variant determined to be causal, subtracting a weighted effect size from the information about the association between each other genetic variant and the target phenotype in each input unit 10. Thus, if genetic variant i is determined to be causal and its sampled effect size β_i is determined, the effect of that causal variant is subtracted from the surrounding correlated variants. The weighted effect size is the sampled effect size 12 of the genetic variant for the target phenotype for the input unit 10, weighted by the respective correlation coefficients between the genetic variant and each other genetic variant.
[0097] In certain embodiments, as a result, the following correction is applied to the marginal effect size of each other genetic variant j.
[0098]
number
[0099] In the above formula, β_i is the sampled effect size 12 for each variant currently determined to be the cause, and r_{i,j} is the correlation coefficient representing the correlation between each pair of variants i and j. The correlation coefficient is determined based on information about the correlation between multiple genetic variants in the region of interest, which can be estimated from a reference set of reference sequences. This correction formula applies when each genotyped variant X_i is normalized to have a variance of 1, and its associated marginal effect size
[0100]
number
[0101] The effect of this correction is that, when it is determined whether a particular variant is the cause, its marginal effect size has already been corrected using the above formula based on the sampled effect sizes of all variants that have been determined to be the cause in that iteration. Therefore, in such embodiments, the effect size β_i used in formulas (4) and (6) is actually the corrected effect size calculated using equation (7). A subtle point is that this subtraction step for a particular genetic variant depends on which of the other variants have been sampled as the cause at the time the subtraction is performed. Therefore, depending on the order in which the genetic variants are sampled, β_i may differ somewhat between iterations.
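The subtraction in step S128 can be sketched as follows, based on the verbal description above: once variant i is sampled as causal, its sampled effect, weighted by the correlation coefficient with each other variant, is removed from that variant's marginal effect in every input unit. The array layout (K input units by M variants, one LD matrix per input unit) and the function name are assumptions for illustration.

```python
import numpy as np


def subtract_causal_effect(beta_marg, ld_maps, i, beta_i):
    """Remove the LD-weighted effect of causal variant i from the others.

    beta_marg: (K, M) marginal effects, one row per input unit.
    ld_maps:   (K, M, M) correlation coefficients r[i, j] per input unit.
    beta_i:    (K,) sampled effect of variant i in each input unit.
    Returns a corrected copy; variant i's own entry is left untouched.
    """
    out = np.array(beta_marg, dtype=float)
    ld_maps = np.asarray(ld_maps, dtype=float)
    for k in range(out.shape[0]):
        r = ld_maps[k, i].copy()
        r[i] = 0.0  # do not subtract the variant from itself
        out[k] -= r * beta_i[k]
    return out


ld_a = np.array([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]])
ld_b = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.0], [0.0, 0.0, 1.0]])
beta_marg = np.array([[0.10, 0.08, 0.0], [0.05, 0.01, 0.0]])
corrected = subtract_causal_effect(beta_marg, np.stack([ld_a, ld_b]),
                                   i=0, beta_i=np.array([0.10, 0.05]))
```

In this toy input the second variant's marginal effect is entirely explained by its correlation with the first, so it is corrected to zero in both input units even though the two input units use different correlation coefficients.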
[0102] Importantly, in many cases, the correlation coefficients between genetic variants (the values r_{i,j} in the example above) cannot be calculated directly from the data itself and must instead be derived from a reference population, such as data generated by the 1000 Genomes Project consortium. This set of correlation coefficients may be called a linkage disequilibrium map (or LD map) and reflects the covariance structure between genetic variants. As mentioned above, these correlation coefficients may differ between subpopulations, for example, between different ancestries. In existing methods that analyze only a single study, these correlation coefficients are determined from a reference population LD map that matches the population from which the study originated.
[0103] However, a challenge in this method is handling the effect size subtraction step S128, which accounts for the correlation between genetic variants, in a way that is consistent with the ancestry-specific patterns of variant correlation. To overcome this challenge, this method may, where appropriate, handle multiple reference LD maps in parallel. Once a variant is determined to be causative, the subtraction step S128 is applied in an ancestry-specific manner. Therefore, if each input unit 10 is determined from a respective population, the correlation coefficient between a genetic variant and each other genetic variant depends on the ancestry of the population of the input unit 10. A one-to-one mapping may be used between the ancestry in which each study was conducted and its corresponding LD map (covariance structure).
[0104] For example, if at least one population among the input units 10 contains individuals with a common ancestry, the correlation coefficient is determined based on the correlation between genetic variants in the region of interest for individuals with that common ancestry.
[0105] In another example, multiple input units 10 are derived from a study containing individuals from a mixture of multiple ancestries. If at least one population of the input units 10 contains individuals with different ancestries, the correlation coefficient is determined based on the average correlation between genetic variants in the region of interest for individuals with each of the different ancestries. The method determines the LD map for the mixed input units 10 as the average of multiple "primary" LD maps, each of which captures the correlations between genetic variants in a well-defined reference ancestry.
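A minimal sketch of forming the LD map of a mixed-ancestry input unit as the average of primary LD maps is shown below. The text describes a plain average; the optional `weights` argument (for example, ancestry proportions in the study) is an added assumption, as is the function name.

```python
import numpy as np


def mixed_ld_map(primary_maps, weights=None):
    """Average several primary LD maps into one map for a mixed study.

    primary_maps: sequence of (M, M) correlation matrices, one per ancestry.
    weights: optional per-ancestry weights (hypothetical extension);
             None gives the unweighted average described in the text.
    """
    maps = np.asarray(primary_maps, dtype=float)
    return np.average(maps, axis=0, weights=weights)


ld_eur = np.array([[1.0, 0.8], [0.8, 1.0]])
ld_eas = np.array([[1.0, 0.2], [0.2, 1.0]])
ld_mixed = mixed_ld_map([ld_eur, ld_eas])
```

Averaging correlation matrices keeps the diagonal at 1 and the result symmetric, so the averaged map can be used in the subtraction step in the same way as a single-ancestry map.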
[0106] If the populations of the input units 10 share a common ancestry but differ in other characteristics such as sex, a single LD map for the common ancestry is sufficient, and it is therefore not necessary to handle multiple LD maps simultaneously.
[0107] Depending on the input data used, not all of the genetic variants may be present at an appreciable frequency in all ancestries. For example, some genetic variants may be found only in individuals of a specific ancestry. If this is the case, and a causal effect is assigned to one of these low-frequency variants, then this variant, which is not present in a given ancestry, may be considered uncorrelated with the other variants in that ancestry. Therefore, the correlation coefficients r_{i,j} between the low-frequency variant and all other variants can be set to zero.
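Setting the correlation coefficients of an absent variant to zero in one ancestry's LD map can be sketched as below. The boolean mask `present` and the function name are hypothetical; the diagonal is kept at 1 as the variant's self-correlation, which is a choice made here for illustration.

```python
import numpy as np


def zero_absent_variants(ld, present):
    """Return a copy of an LD map with absent variants decorrelated.

    ld: (M, M) correlation matrix for one ancestry.
    present: length-M booleans, False where the variant is not observed
             at an appreciable frequency in this ancestry.
    """
    out = np.array(ld, dtype=float)
    idx = np.flatnonzero(~np.asarray(present, dtype=bool))
    out[idx, :] = 0.0
    out[:, idx] = 0.0
    out[idx, idx] = 1.0  # keep self-correlation on the diagonal
    return out


ld = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]])
ld_clean = zero_absent_variants(ld, present=[True, False, True])
```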
[0108] Once the one or more iterations are completed, the method includes step S14 of determining, for each genetic variant, a predicted effect size 14 of the genetic variant for the target phenotype for each of the input units 10, based on the mean of the sampled effect sizes 12 of the genetic variant for each of the input units 10. Alternatively, the predicted effect size 14 may be based on the mean of the posterior effect sizes of the genetic variant for each of the input units, calculated using the sampled effect sizes 12. In either case, the mean is taken over at least a subset of the iterations. Any suitable method for averaging may be used. Using multiple iterations and averaging reduces the influence of the randomness of effect size sampling on the results. Once the set of causal variants and their effect sizes 14 is determined, the PRS can be readily determined from the effect sizes 14. In one embodiment, the mean of the sampled effect sizes may be a weighted mean, where the sampled effect size of each variant determined to be causal is weighted by the posterior probability that the variant is causal.
[0109] For example, the average effect size for variant i
[0110]
number
[0111]
number
[0112] Returning to the example of smoking and lung cancer, this method allows for the joint analysis of an input unit 10 derived from a large lung cancer GWAS (not stratified by smoking status) with an input unit 10 derived from a smaller lung cancer GWAS in non-smokers. This effectively yields two sets of predictive effect sizes 14 for the lung cancer phenotype in two subpopulations, namely non-smokers and the general population. For most genetic variants, the predictive effect size 14 is the same for both input units 10 corresponding to the two subpopulations. However, for addiction-related variants, the effect sizes for the input unit 10 from the smaller GWAS clearly indicate that these variants are not associated with lung cancer in non-smokers. This effectively achieves the aforementioned goal of obtaining a lung cancer PRS with the addiction-related variants removed.
[0113] Typically, this method performs best when the sizes of the populations from which the input units 10 are determined do not differ too greatly. For example, when two input units 10 derived from a smaller and a larger population are used, a significant improvement in performance is generally observed when the smaller population is at least approximately 20% of the size of the larger population.
[0114] In some embodiments, one or more sampled effect sizes 12 for each genetic variant may be discarded and not included in the mean used to obtain the predicted effect size 14. The number of discarded effect sizes may be predetermined or based on the values of the sampled effect sizes 12. The sampled effect sizes 12 to be discarded may be those from the initial iterations of the method, e.g., the first 10 iterations, the first 20 iterations, or some other predetermined number of iterations. These are often referred to as "burn-in" iterations and are typically discarded because sampling techniques such as the Monte Carlo Gibbs sampler require several iterations to converge to a useful sampling pattern.
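Discarding burn-in iterations and averaging the remaining sampled effect sizes can be sketched as follows. The array layout (iterations by input units by variants) and the function name are assumptions; the unweighted mean is only one of the averaging choices the text allows.

```python
import numpy as np


def predicted_effect_sizes(samples, burn_in):
    """Average sampled effect sizes over the post-burn-in iterations.

    samples: (T, K, M) array of sampled effects per iteration,
             input unit, and variant (zeros where not causal).
    burn_in: number of initial iterations to discard.
    """
    samples = np.asarray(samples, dtype=float)
    if not 0 <= burn_in < samples.shape[0]:
        raise ValueError("burn_in must leave at least one iteration")
    return samples[burn_in:].mean(axis=0)


# Toy input: 4 iterations, 2 input units, 3 variants; drop the first 2.
samples = np.arange(24, dtype=float).reshape(4, 2, 3)
predicted = predicted_effect_sizes(samples, burn_in=2)
```

Because a variant that is not sampled as causal contributes a zero in that iteration, the plain mean already shrinks the effect of variants that are only occasionally selected.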
[0115] Considering that it is generally desirable to determine the PRS, the present invention can also be used in a method for determining a polygenic risk score for a target phenotype in a target individual, as shown in Figure 3. The improved estimation of predictive effect sizes obtained using the above method enables more accurate determination of the PRS.
[0116] A method for determining the PRS includes step S20 of receiving genetic information 16 about a region of interest of the genome of a target individual. This may include information about the genetic variants carried by the individual in the region of interest (such as single nucleotide polymorphisms, deletions, or insertions).
[0117] The method further includes step S22 of receiving predictive effect sizes 14 for the target phenotype of multiple genetic variants in the region of interest, which are determined using the method for analyzing the genetic data described above.
[0118] The method further includes step S24 of determining a polygenic risk score 20 for the target individual based on the genetic information 16 and the predictive effect sizes 14.
[0119] In one embodiment, the input units 10 received in the method for analyzing genetic data are determined from respective populations, and the polygenic risk score 20 for an individual is determined using the predictive effect sizes 14 for the input unit 10 determined from the population most similar to the target individual. For example, if the effect sizes 14 are determined for two input units 10, one from a population with European ancestry and the other from a population with East Asian ancestry, and the individual is of East Asian ancestry, then the predictive effect sizes 14 for the East Asian input unit 10 are used to determine the PRS 20 for the individual.
[0120] In one embodiment, the PRS 20 is calculated as follows:
[0121]
number
[0122]
number
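The PRS formula is not reproduced in this text. A common form, assumed here for illustration, is the sum over variants of the individual's allele dosage multiplied by the predicted effect size, using the effect sizes of the input unit that best matches the individual. The dictionary keys, function name, and dosage coding (0, 1, or 2 copies) are hypothetical.

```python
import numpy as np


def polygenic_risk_score(dosages, effects_by_unit, unit):
    """Weighted sum of allele dosages using one input unit's effect sizes.

    dosages: (M,) or (N, M) allele counts for one or N individuals.
    effects_by_unit: mapping from input-unit label to (M,) effect sizes.
    unit: label of the input unit most similar to the individual(s).
    """
    beta = np.asarray(effects_by_unit[unit], dtype=float)
    g = np.asarray(dosages, dtype=float)
    if g.shape[-1] != beta.shape[0]:
        raise ValueError("dosages and effect sizes cover different variants")
    return g @ beta


effects = {"EUR": [0.10, 0.20, -0.05], "EAS": [0.05, 0.30, 0.00]}
prs_eas = polygenic_risk_score([0, 1, 2], effects, unit="EAS")
prs_eur = polygenic_risk_score([0, 1, 2], effects, unit="EUR")
```

The same genotype yields different scores under the two sets of effect sizes, which is why the choice of input unit for a given individual matters.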
[0123] The method for analyzing genetic data may be performed by an apparatus for analyzing genetic data about an organism, which is also shown in Figure 1. The apparatus comprises a receiving unit 200 configured to receive a plurality of input units 10, each input unit containing information about the association between a plurality of genetic variants in a region of interest in the organism's genome and the organism's target phenotype. The apparatus further comprises a data processing unit 210 configured to perform one or more iterations, each iteration comprising, for each of the plurality of genetic variants, determining whether the genetic variant is causative of the target phenotype based on the plurality of input units, and, if it is determined that the genetic variant is causative, determining a sampled effect size 12 of the genetic variant for the target phenotype for each of the input units 10, based on the plurality of input units 10 and information about the correlation between the plurality of genetic variants in the region of interest. The sampled effect size 12 of the genetic variant for the target phenotype is non-zero for all of the input units 10. The data processing unit 210 is further configured to determine, for each genetic variant, the predicted effect size 14 of the genetic variant for the target phenotype for each input unit 10, based on the mean, over at least a subset of the iterations, of the sampled effect sizes 12 of the genetic variant for each input unit 10, or of the posterior effect sizes of the genetic variant for each input unit 10 calculated using the sampled effect sizes 12.
[0124] The present invention may be embodied in a computer program that includes instructions causing a computer to perform a method for analyzing genetic data when the program is executed by a computer. The present invention may also be embodied in a computer-readable medium that includes instructions causing a computer to perform a method for analyzing genetic data when executed by a computer.
[0125] Results. Cross-ancestry. To demonstrate the effectiveness of this method in determining effect sizes for subpopulations of different ancestries, Figure 4 shows an example of effect sizes determined using a conventional method, and Figure 5 shows effect sizes determined using this method.
[0126] As can be seen from the difference in the number of cases (Table 1), well-powered summary statistics for breast cancer are available for individuals of European ancestry, whereas only a significantly smaller cohort exists for East Asian women. In addition, two well-powered cohorts, namely the UK Biobank for individuals of European ancestry (Bycroft et al.) and the multi-ethnic cohort (MEC) for individuals of East Asian ancestry, are available to evaluate effect sizes for various phenotypes.
[0127] Figures 4 and 5 show the estimated effect sizes of genetic variants on chromosome 19 for two input units determined from two breast cancer studies in individuals of East Asian ancestry (red) and individuals of European ancestry (black). Figure 4 shows the effect size when determined separately for the two input units using conventional methods. Figure 5 shows the effect size when determined jointly for the two input units using the present method.
[0128] When effect sizes are determined by analyzing each input unit separately (Figure 4), the genetic variant at the established locus ELL (enlarged inset in the lower panels of Figures 4 and 5) has a large weight for Europeans. However, the smaller sample size of the study in individuals of East Asian ancestry is not sufficient to detect this signal. When effect sizes are determined by analyzing the input units together (Figure 5), the combination of both studies provides sufficient statistical power for East Asians, and a large effect size is also obtained for East Asians at the established locus ELL.
[0129] Genome-wide joint analysis using this method improves predictive performance for both ancestries. In addition, joint analysis substantially improves the precision with which causal variants are localized. This can be seen in Figures 4 and 5. Large non-zero effect sizes for breast cancer in both the European and East Asian ancestries span significantly shorter genomic distances in the upper panel of Figure 5 (joint analysis) than in the upper panel of Figure 4 (separate analysis). This reflects the improved localization of causal variants obtained by combining data from multiple ancestries.
[0130] Table 1 shows the training population used to determine breast cancer PRS in women of European and East Asian ancestry.
[0131] [Table 1]
[0132] Using these cohorts, we evaluated the predictive ability of the different methods used to determine the predictive effect sizes for PRS calculation, namely LDPred, MTAG, and our method. The results are shown in Table 2, with bold indicating the best performance for each ancestry. Since breast cancer is a binary trait, the area under the curve (AUC) is used as a measure of predictive accuracy to quantify the separation of PRS between breast cancer cases and controls. The best-performing method was our method, which combines input units from studies of multiple ancestries and generates ancestry-specific versions of the PRS based on the effect sizes for each input unit.
[0133] [Table 2]
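For reference, the AUC used as the accuracy measure can be computed directly from PRS values as the probability that a randomly chosen case scores higher than a randomly chosen control, counting ties as one half. This pairwise sketch is illustrative only (it is not how the tabulated values were produced, which the text does not describe), and the function name is hypothetical.

```python
import numpy as np


def auc_from_scores(scores, is_case):
    """Area under the ROC curve via pairwise case/control comparison."""
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    cases, controls = scores[is_case], scores[~is_case]
    if cases.size == 0 or controls.size == 0:
        raise ValueError("need at least one case and one control")
    diff = cases[:, None] - controls[None, :]
    # A case ranked above a control counts 1, a tie counts 0.5.
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


auc = auc_from_scores([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
```

An AUC of 0.5 corresponds to a PRS that does not separate cases from controls, and 1.0 to perfect separation. The pairwise comparison uses memory proportional to the number of cases times the number of controls, so a rank-based formulation would be preferable for biobank-scale cohorts.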
[0134] Context-specific. As discussed above, this method can also be used to determine predictive effect sizes specific to subpopulations defined by other characteristics of individuals. Different strata of the population can be treated in the same way as different ancestries, and PRSs specific to these strata can also be calculated. In the example below, it is assumed that the studies used to determine the input units originate from a single population. Therefore, it is not necessary to consider different sets of correlation coefficients (i.e., LD maps representing the correlation structure between genetic variants) for each input unit. However, as mentioned above, there may be overlap in the samples of individuals across studies.
[0135] In this example, the predictive effect size of a genetic variant on BMI is determined for input units determined using the training dataset from the GIANT Consortium GWAS (152,893 men, 171,977 women, or 332,154 combined). The PRS obtained from the effect sizes is then applied to the evaluation dataset. Since BMI is a quantitative trait, the proportion of variance explained (r²) is used as a measure of prediction accuracy. A comparison is made between the PRSs and effect sizes generated using two approaches, namely:
- an approach using existing methods that combines both sexes into a single meta-analysis to generate a single PRS that is evaluated in both men and women; and
- the present method, which jointly analyzes a BMI study in men and a separate BMI study in women to generate different effect sizes and two distinct PRSs (one for each sex).
[0136] The results of this comparison are shown in Table 3. Bold text indicates the best-performing method for each of the two sexes.
[0137] [Table 3]
[0138] The BMI variance explained in men is higher when using the male effect sizes from the sex-stratified approach. Similarly, the BMI variance explained in women is higher when using the female effect sizes from the sex-stratified approach. In both cases, the combined-sex meta-analysis using existing methods performs less well. In addition, using either the male or the female effect sizes from the present method explains a higher proportion of BMI variance in the combined-sex evaluation set than the existing meta-analysis-based method.
[0139] References
Trochet H, Pirinen M, Band G, Jostins L, McVean G, Spencer C. Bayesian meta-analysis across genome-wide association studies of diverse phenotypes. Genetic Epidemiology 2019.
Turley P, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nature Genetics 2018.
Vilhjalmsson BJ, Yang J, Finucane HK, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet 2015.
Mostafavi H, Harpak A, Agarwal I, Conley D, Pritchard JK, Przeworski M. Variable prediction accuracy of polygenic scores within an ancestry group. eLife 2020.
Bycroft et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 2018.
LeBlanc M, Zuber V, Thompson WK, Andreassen OA, Schizophrenia and Bipolar Disorder Working Groups of the Psychiatric Genomics Consortium, Frigessi A, Andreassen BK. A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework. 2018.
Craig JE, et al. Multitrait analysis of glaucoma identifies new risk loci and enables polygenic prediction of disease susceptibility and progression. Nature Genetics 2020.
Furthermore, it should be noted that the following aspects are disclosed in this specification.
[Aspect 1] A computer-based method for analyzing genetic data about an organism, the method comprising:
receiving a plurality of input units, each input unit containing information about the association between a plurality of genetic variants in a target region of the organism's genome and a target phenotype of the organism;
performing one or more iterations, each iteration including, for each of the plurality of genetic variants: determining, based on the plurality of input units, whether the genetic variant is causal for the target phenotype; and, if the genetic variant is determined to be causal, determining a sampled effect size of the genetic variant on the target phenotype for each of the input units, based on the plurality of input units and on information about the correlation between the plurality of genetic variants in the target region, wherein the sampled effect size of the genetic variant on the target phenotype is non-zero for all of the input units; and
determining, for each genetic variant, a predicted effect size of the genetic variant on the target phenotype for each of the input units, based on an average of the posterior effect sizes of the genetic variant for the input unit calculated using the sampled effect sizes, or on an average, over at least a subset of the iterations, of the sampled effect sizes of the genetic variant for the input unit.
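The final averaging step of Aspect 1 can be illustrated as follows. This is a minimal sketch, not the implementation of the specification: the array layout, the function name `predicted_effect_sizes`, the `burn_in` parameter used to select a subset of iterations, and the convention that a variant not determined to be causal in an iteration is recorded with a sampled effect size of zero are all assumptions made for illustration.

```python
import numpy as np


def predicted_effect_sizes(samples, burn_in=0):
    """Average sampled effect sizes over the iterations kept after burn-in.

    samples has shape (iterations, variants, input_units). Assumption: an
    entry is zero when the variant was not deemed causal in that iteration.
    Returns an array of shape (variants, input_units).
    """
    kept = samples[burn_in:]
    return kept.mean(axis=0)


# Hypothetical samples: 4 iterations, 2 variants, 2 input units.
example_samples = np.array([
    [[0.5, 0.5], [0.0, 0.0]],
    [[0.1, 0.2], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.3, 0.4], [0.2, 0.1]],
])
predicted = predicted_effect_sizes(example_samples, burn_in=1)
print(predicted)
```

Each row of the result holds one variant's predicted effect sizes, with one column per input unit, so that a separate PRS can be built per input unit.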
[Aspect 2] The method according to aspect 1, wherein determining whether the genetic variant is causal includes calculating the probability of the information from the plurality of input units under the assumption that the genetic variant is causal, and the probability of the information from the plurality of input units under the assumption that the genetic variant is not causal, and probabilistically determining that the genetic variant is causal with a probability that depends on the ratio of these two probabilities.
[Aspect 3] The method according to aspect 2, wherein the probability of the information from the plurality of input units, under the assumption that the genetic variant is causal, depends on: the proportion of the plurality of genetic variants expected to be causal; the plurality of input units; and the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units.
[Aspect 4] The method according to aspect 2 or 3, wherein the probability of the information from the plurality of input units, under the assumption that the genetic variant is not causal, depends on: the proportion of the plurality of genetic variants expected to be causal; and the plurality of input units.
[Aspect 5] The method according to aspect 3 or 4, wherein the proportion of the plurality of genetic variants expected to be causal is predetermined.
[Aspect 6] The method according to any one of aspects 3 to 5, wherein the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is predetermined.
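The probabilistic determination described in Aspect 2 can be sketched as follows. This is one common way to turn a likelihood ratio and a prior proportion of causal variants into a selection probability, and it is an assumption here rather than the formula of the specification; the function names and the use of log-likelihoods are likewise illustrative.

```python
import math
import random


def causal_probability(log_lik_causal, log_lik_null, prior_causal):
    """Probability of calling a variant causal, from the likelihood ratio
    of the input data and an assumed prior proportion of causal variants."""
    log_odds = (math.log(prior_causal) - math.log1p(-prior_causal)
                + log_lik_causal - log_lik_null)
    # Numerically stable logistic transform of the log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def draw_causal_indicator(prob, rng):
    """Probabilistically decide whether the variant is treated as causal."""
    return rng.random() < prob


rng = random.Random(0)
# Hypothetical values: the data are 99 times more likely under the causal model.
prob = causal_probability(math.log(99.0), 0.0, prior_causal=0.01)
is_causal = draw_causal_indicator(prob, rng)
print(prob, is_causal)
```

With a prior proportion of 0.01, a likelihood ratio of 99 gives even odds, which shows how a small prior proportion demands strong evidence before a variant is likely to be selected.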
[Aspect 7] The method according to aspect 3, 4, or 6, wherein the proportion of the plurality of genetic variants expected to be causal is updated in each iteration.
[Aspect 8] The method according to any one of aspects 3 to 5 or 7, wherein the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is updated in each iteration.
[Aspect 9] The method according to any one of aspects 2 to 8, wherein the input units are determined from respective populations, and the probability of the information from the plurality of input units, under the assumption that the genetic variant is causal, depends on one or more parameters that quantify the overlap in the populations between each pair of input units.
[Aspect 10] The method according to any one of aspects 1 to 9, wherein determining the sampled effect size of the genetic variant includes calculating a probability distribution of the effect size of the genetic variant on the target phenotype for the input units, and sampling a value of the effect size for each input unit from the probability distribution.
[Aspect 11] The method according to aspect 10, wherein the probability distribution is a multivariate normal distribution.
[Aspect 12] The method according to aspect 10 or 11, wherein the sampling of the effect size value in each iteration depends on the sampled effect sizes from one or more previous iterations.
[Aspect 13] The method according to any one of aspects 10 to 12, wherein the sampling of the effect size value is performed using a Monte Carlo Gibbs sampler.
[Aspect 14] The method according to any one of aspects 10 to 13, wherein the probability distribution depends on the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units.
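The sampling step of Aspects 10, 11, and 14 can be illustrated with a standard conjugate normal model, in which the effect sizes of one variant across the input units share a multivariate normal prior whose covariance encodes the correlation between input units. This is a sketch under stated assumptions, not the distribution defined in the specification: the function name, the independent-noise model (no sample overlap between input units), and all numerical values are hypothetical.

```python
import numpy as np


def sample_effect_sizes(beta_hat, se, prior_cov, rng):
    """Draw one variant's effect sizes for all input units from a
    multivariate normal posterior.

    beta_hat, se: per-input-unit association estimates and standard errors.
    prior_cov: assumed prior covariance of the true effect sizes across
    input units (its off-diagonal terms carry the cross-unit correlation).
    """
    noise_precision = np.diag(1.0 / se**2)
    post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + noise_precision)
    post_cov = (post_cov + post_cov.T) / 2.0  # remove numerical asymmetry
    post_mean = post_cov @ (beta_hat / se**2)
    draw = rng.multivariate_normal(post_mean, post_cov)
    return post_mean, post_cov, draw


rng = np.random.default_rng(0)
# Hypothetical summary statistics for one variant in two input units.
beta_hat = np.array([0.10, 0.06])
se = np.array([0.02, 0.03])
prior_cov = 0.05**2 * np.array([[1.0, 0.8], [0.8, 1.0]])
post_mean, post_cov, draw = sample_effect_sizes(beta_hat, se, prior_cov, rng)
print(post_mean, draw)
```

Because the prior covariance couples the input units, the estimate from a well-powered input unit informs the sampled effect size in a less well-powered one. Within a Gibbs sampler, a draw of this kind would be repeated for each variant in each iteration.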
[Aspect 15] The method according to aspect 14, wherein the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is predetermined.
[Aspect 16] The method according to aspect 14, wherein the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is updated in each iteration.
[Aspect 17] The method according to any one of aspects 1 to 16, wherein each of the one or more iterations further includes, for each genetic variant determined to be causal, subtracting a weighted effect size from the information about the association between each other genetic variant and the target phenotype in each input unit, wherein the weighted effect size is the sampled effect size of the genetic variant on the target phenotype for the input unit, weighted by the respective correlation coefficient between the genetic variant and each other genetic variant, and the correlation coefficients are determined based on the information about the correlation between the plurality of genetic variants in the target region.
[Aspect 18] The method according to aspect 17, wherein the input units are determined from respective populations, and the correlation coefficient between the genetic variant and each other genetic variant depends on the ancestry of the population of the input unit.
[Aspect 19] The method according to aspect 18, wherein the population of at least one of the input units comprises individuals having a common ancestry, and the correlation coefficient is determined based on the correlation between genetic variants in the target region for the individuals having the common ancestry.
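The subtraction step of Aspect 17 can be sketched for a single input unit as follows. The function name, the array layout, and the example values are hypothetical; in line with Aspect 18, the LD matrix passed in would be the one matching the ancestry of the input unit's population.

```python
import numpy as np


def subtract_causal_contribution(marginal, ld, causal_idx, sampled_effect):
    """Remove a causal variant's LD-weighted effect from the association
    estimates of every other variant in one input unit.

    marginal: association estimates for all variants in the target region.
    ld: matrix of correlation coefficients between variants.
    """
    residual = marginal.copy()
    others = np.arange(marginal.shape[0]) != causal_idx
    residual[others] -= ld[others, causal_idx] * sampled_effect
    return residual


# Hypothetical three-variant region; variant 0 is deemed causal.
marginal = np.array([0.10, 0.08, 0.01])
ld = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
residual = subtract_causal_contribution(marginal, ld, causal_idx=0, sampled_effect=0.10)
print(residual)
```

In this example the apparent associations of variants 1 and 2 are fully explained by their correlation with variant 0, so their residual estimates fall to approximately zero.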
[Aspect 20] The method according to aspect 18 or 19, wherein the population of at least one of the input units includes individuals having different ancestries, and the correlation coefficient is determined based on the mean of the correlations between genetic variants in the target region for each of the different ancestries.
[Aspect 21] The method according to any one of aspects 1 to 20, wherein at least one of the input units includes individuals having the same value of a characteristic.
[Aspect 22] The method according to any one of aspects 1 to 21, wherein at least one of the input units includes individuals having different values of a characteristic.
[Aspect 23] The method according to aspect 21 or 22, wherein the characteristic is one of sex, age, weight, a molecular biomarker, or a behavioral characteristic.
[Aspect 24] The method according to any one of aspects 1 to 23, wherein performing one or more iterations includes performing a predetermined number of iterations.
[Aspect 25] The method according to any one of aspects 1 to 24, wherein each of the one or more iterations further includes evaluating a convergence parameter, and performing one or more iterations includes performing the iterations until a predetermined condition for the convergence parameter is met.
[Aspect 26] The method according to any one of aspects 1 to 25, wherein the information about the association between the plurality of genetic variants and the target phenotype includes, for each of the plurality of genetic variants, an estimate of the strength of the association between the genetic variant and the target phenotype, and an error of the estimate of the strength of the association.
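The averaging of correlations described in Aspect 20 can be illustrated as a mean of ancestry-specific LD matrices. This is a sketch only: weighting each matrix by the proportion of individuals of that ancestry is an assumption made here (equal proportions give the plain mean), and the function name and matrices are hypothetical.

```python
import numpy as np


def mixed_ancestry_ld(ld_by_ancestry, proportions):
    """Weighted mean of ancestry-specific LD matrices for an input unit
    whose population includes several ancestries."""
    weights = np.asarray(proportions, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(ld_by_ancestry)  # shape (ancestries, variants, variants)
    return np.tensordot(weights, stacked, axes=1)


# Hypothetical LD matrices for two ancestries over the same two variants.
ld_a = np.array([[1.0, 0.8], [0.8, 1.0]])
ld_b = np.array([[1.0, 0.2], [0.2, 1.0]])
ld_mixed = mixed_ancestry_ld([ld_a, ld_b], proportions=[1, 1])
print(ld_mixed)
```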
[Aspect 27] A method for determining a polygenic risk score for a target phenotype for a target individual, comprising: receiving genetic information for a target region of the genome of the target individual; receiving predicted effect sizes on the target phenotype for a plurality of genetic variants in the target region, determined using the method for analyzing genetic data according to any one of aspects 1 to 26; and determining the polygenic risk score based on the genetic information for the target individual and the predicted effect sizes.
[Aspect 28] The method according to aspect 27, wherein the input units received in the method for analyzing genetic data are determined from respective populations, and the polygenic risk score for the target individual is determined using the predicted effect sizes for the input unit determined from the population most similar to the target individual.
[Aspect 29] A device for analyzing genetic data about an organism, comprising:
a receiving unit configured to receive a plurality of input units, each input unit containing information about the association between a plurality of genetic variants in a target region of the organism's genome and a target phenotype of the organism; and
a data processing unit configured to:
perform one or more iterations, each iteration including, for each of the plurality of genetic variants: determining, based on the plurality of input units, whether the genetic variant is causal for the target phenotype; and, if the genetic variant is determined to be causal, determining a sampled effect size of the genetic variant on the target phenotype for each of the input units, based on the plurality of input units and on information about the correlation between the plurality of genetic variants in the target region, wherein the sampled effect size of the genetic variant on the target phenotype is non-zero for all of the input units; and
determine, for each genetic variant, a predicted effect size of the genetic variant on the target phenotype for each of the input units, based on an average of the posterior effect sizes of the genetic variant for the input unit calculated using the sampled effect sizes, or on an average, over at least a subset of the iterations, of the sampled effect sizes of the genetic variant for the input unit.
[Aspect 30] A computer program comprising instructions that, when executed by a computer, cause the computer to perform the method according to any one of aspects 1 to 28.
[Aspect 31] A computer-readable medium containing instructions that, when executed by a computer, cause the computer to perform the method according to any one of aspects 1 to 28.
[Explanation of symbols]
[0140]
10 Input units
12 Sampled effect sizes
14 Effect sizes
16 Individual genetic information
20 PRS
Claims
1. A computer-based method for analyzing genetic data about an organism, the method comprising:
receiving a plurality of input units, each input unit containing information about the association between a plurality of genetic variants in a target region of the organism's genome and a target phenotype of the organism;
performing one or more iterations, each iteration including, for each of the plurality of genetic variants: determining, based on the plurality of input units, whether the genetic variant is causal for the target phenotype; and, if the genetic variant is determined to be causal, determining a sampled effect size of the genetic variant on the target phenotype for each of the input units, based on the plurality of input units and on linkage disequilibrium (LD) correlations between the plurality of genetic variants in the target region, wherein the sampled effect size of the genetic variant on the target phenotype is non-zero for all of the input units; and
determining, for each genetic variant, a predicted effect size of the genetic variant on the target phenotype for each of the input units, based on an average of the posterior effect sizes of the genetic variant for the input unit calculated using the sampled effect sizes, or on an average, over at least a subset of the iterations, of the sampled effect sizes of the genetic variant for the input unit,
wherein determining the sampled effect size of the genetic variant includes calculating a probability distribution of the effect size of the genetic variant on the target phenotype for the input units, and sampling a value of the effect size for each input unit from the probability distribution,
the probability distribution depends on the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units, and
the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is updated in each iteration.
2. The method according to claim 1, wherein determining whether the genetic variant is causal includes calculating (i) the probability of the information from the plurality of input units under the assumption that the genetic variant is causal, and (ii) the probability of the information from the plurality of input units under the assumption that the genetic variant is not causal, and probabilistically determining that the genetic variant is causal with a probability that depends on the ratio of probability (i) to probability (ii).
3. The method according to claim 2, wherein either or both of the following apply:
(a) the probability of the information from the plurality of input units, under the assumption that the genetic variant is causal, depends on: the proportion of the plurality of genetic variants expected to be causal; the plurality of input units; and the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units; and
(b) the probability of the information from the plurality of input units, under the assumption that the genetic variant is not causal, depends on: the proportion of the plurality of genetic variants expected to be causal; and the plurality of input units.
4. The method according to claim 3, wherein (a) the proportion of the plurality of genetic variants expected to be causal is predetermined, or (b) the proportion of the plurality of genetic variants expected to be causal is updated in each iteration.
5. The method according to claim 3 or 4, wherein (a) the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is predetermined, or (b) the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is updated in each iteration.
6. The method according to any one of claims 2 to 5, wherein the input units are determined from respective populations, and the probability of the information from the plurality of input units, under the assumption that the genetic variant is causal, depends on one or more parameters that quantify the overlap in the populations between each pair of input units.
7. The method according to any one of claims 1 to 6, wherein one or more of the following apply:
(a) the probability distribution is a multivariate normal distribution;
(b) the sampling of the effect size value in each iteration depends on the sampled effect sizes from one or more previous iterations; and
(c) the sampling of the effect size value is performed using a Monte Carlo Gibbs sampler.
8. The method according to any one of claims 1 to 7, wherein each of the one or more iterations further includes, for each genetic variant determined to be causal, subtracting a weighted effect size from the information about the association between each other genetic variant and the target phenotype in each input unit, wherein the weighted effect size is the sampled effect size of the genetic variant on the target phenotype for the input unit, weighted by the respective correlation coefficient between the genetic variant and each other genetic variant, and the correlation coefficients are determined based on the information about the correlation between the plurality of genetic variants in the target region.
9. The method according to claim 8, wherein the input units are determined from respective populations, and the correlation coefficient between the genetic variant and each other genetic variant depends on the ancestry of the population of the input unit.
10. The method according to claim 9, wherein:
(a) the population of at least one of the input units includes individuals having a common ancestry, and the correlation coefficient is determined based on the correlation between genetic variants in the target region for the individuals having the common ancestry; and
(b) the population of at least one of the input units includes individuals having different ancestries, and the correlation coefficient is determined based on the mean of the correlations between genetic variants in the target region for each of the different ancestries.
11. The method according to any one of claims 6, 9, or 10, wherein:
(a) the population of at least one of the input units includes individuals having the same value of a characteristic;
(b) the population of at least one of the input units includes individuals having different values of a characteristic.
12. The method according to claim 11, wherein the characteristic is one of sex, age, weight, a molecular biomarker, or a behavioral characteristic.
13. The method according to any one of claims 1 to 12, wherein one or more of the following apply:
(a) performing one or more iterations includes performing a predetermined number of iterations;
(b) each of the one or more iterations further includes evaluating a convergence parameter, and performing one or more iterations includes performing the iterations until a predetermined condition for the convergence parameter is met; and
(c) the information about the association between the plurality of genetic variants and the target phenotype includes, for each of the plurality of genetic variants, an estimate of the strength of the association between the genetic variant and the target phenotype, and an error of the estimate of the strength of the association.
14. A method for determining a polygenic risk score for a target phenotype for a target individual, the method comprising: receiving genetic information for a target region of the genome of the target individual; receiving predicted effect sizes on the target phenotype for a plurality of genetic variants in the target region, determined using the method for analyzing genetic data according to any one of claims 1 to 13; and determining the polygenic risk score based on the genetic information for the target individual and the predicted effect sizes.
15. The method according to claim 14, wherein the input units received in the method for analyzing genetic data are determined from respective populations, and the polygenic risk score for the target individual is determined using the predicted effect sizes for the input unit determined from the population most similar to the target individual.
16. A device for analyzing genetic data about an organism, the device comprising:
a receiving unit configured to receive a plurality of input units, each input unit containing information about the association between a plurality of genetic variants in a target region of the organism's genome and a target phenotype of the organism; and
a data processing unit configured to:
perform one or more iterations, each iteration including, for each of the plurality of genetic variants: determining, based on the plurality of input units, whether the genetic variant is causal for the target phenotype; and, if the genetic variant is determined to be causal, determining a sampled effect size of the genetic variant on the target phenotype for each of the input units, based on the plurality of input units and on linkage disequilibrium (LD) correlations between the plurality of genetic variants in the target region, wherein the sampled effect size of the genetic variant on the target phenotype is non-zero for all of the input units; and
determine, for each genetic variant, a predicted effect size of the genetic variant on the target phenotype for each of the input units, based on an average of the posterior effect sizes of the genetic variant for the input unit calculated using the sampled effect sizes, or on an average, over at least a subset of the iterations, of the sampled effect sizes of the genetic variant for the input unit,
wherein determining the sampled effect size of the genetic variant includes calculating a probability distribution of the effect size of the genetic variant on the target phenotype for the input units, and sampling a value of the effect size for each input unit from the probability distribution,
the probability distribution depends on the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units, and
the correlation between the effect sizes of the genetic variant on the target phenotype for each of the input units is updated in each iteration.
17. A computer program or computer-readable medium comprising instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 15.
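The score determination of claims 14 and 15 can be illustrated as a weighted sum of an individual's allele dosages, using the predicted effect sizes for the input unit whose population best matches the individual. This is a minimal sketch: the function names, the variant identifiers, the population labels, and the lookup by label (standing in for whatever similarity measure is actually used) are all hypothetical.

```python
def select_effect_sizes(effects_by_population, target_population):
    """Pick the predicted effect sizes for the population judged most
    similar to the target individual (here simply looked up by label)."""
    return effects_by_population[target_population]


def polygenic_risk_score(dosages, effect_sizes):
    """Sum of allele dosage times predicted effect size over the variants
    present in both the individual's genotype and the effect-size table."""
    return sum(effect_sizes[variant] * dosage
               for variant, dosage in dosages.items()
               if variant in effect_sizes)


# Hypothetical predicted effect sizes for two input units.
effects_by_population = {
    "population_1": {"variant_a": 0.10, "variant_b": -0.05},
    "population_2": {"variant_a": 0.07, "variant_b": -0.02},
}
# Hypothetical genotype: number of effect alleles carried at each variant.
individual_dosages = {"variant_a": 2, "variant_b": 1}

chosen = select_effect_sizes(effects_by_population, "population_2")
score = polygenic_risk_score(individual_dosages, chosen)
print(score)
```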