A fast and accurate genomic prediction method and device based on a heritability model

By using the MAPS method based on a heritability model, K-means clustering and a multi-random mixture model are employed to optimize the heritability of SNPs, overcoming the shortcomings of the GBLUP and Bayesian methods and achieving rapid and accurate genome prediction.

CN120452535B · Active · Publication Date: 2026-06-26 · HUAZHONG AGRI UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUAZHONG AGRI UNIV
Filing Date: 2025-04-28
Publication Date: 2026-06-26

Application Information

Patent Timeline

28 Apr 2025: Application
26 Jun 2026: Publication (CN120452535B)

IPC: G16B20/20; G16B20/40; G16B40/00; G16B5/00

AI Tagging

Technology Topics

Data set; Algorithm

Technical Efficacy Phrases

Optimize model representation; Reduce computational complexity

AI Technical Summary

Technical Problem

In existing technologies, the GBLUP method suffers from low prediction accuracy because its model assumes that all markers contribute equally, while the Bayesian method suffers from low computational efficiency because of its complex parameter-solving process.

Method used

The MAPS method based on heritability models is adopted to construct trait-specific heritability models, optimize the genetic structure of complex traits, divide SNPs into multiple layers using the K-means clustering algorithm, and combine them with a multi-random mixture model for genome prediction.

Benefits of technology

It achieves fast and accurate genome prediction with low computational complexity, improving prediction performance and significantly enhancing prediction accuracy.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The application belongs to the field of animal and plant breeding prediction and discloses a fast and accurate genomic prediction method based on a heritability model. The method calculates the minor allele frequency and linkage disequilibrium score of each marker; trains an SNP heritability model to obtain an optimal SNP heritability model; selects an optimal number of layers; obtains, for each layer, the layer genotype data formed by the genotype data of all individuals of the to-be-predicted data set at that layer's SNPs; calculates the estimated heritability of the SNPs in each layer based on the optimal SNP heritability model and assigns it to the diagonal weight matrix of that layer; calculates the kinship matrix between individuals from each layer's genotype data; and fits a multi-random mixed model to obtain the genomic estimated breeding value of each phenotype for all individuals in the to-be-predicted data set. Because the method is based on a linear model framework, its computational complexity is low and its key steps can be processed in parallel, so computation is fast. By constructing a trait-specific SNP heritability model, it optimizes the model representation of the genetic structure of complex traits and thereby improves prediction performance.


Description

Technical Field

[0001] This invention belongs to the field of plant and animal breeding prediction, and specifically relates to a rapid and accurate genome prediction method and device based on a heritability model.

Background Technology

[0002] With the widespread application of high-density single nucleotide polymorphism (SNP) genotyping technology covering the entire genome, whole-genome prediction has gradually become the mainstream method for evaluating and predicting the genetic value of individuals in plant and animal breeding. Currently, the models used in genome-wide selection fall into two categories: direct methods and indirect methods. The representative direct method is GBLUP, which constructs a relationship matrix between individuals from genotype information within a mixed linear model framework. After the variance components are estimated by the maximum likelihood method, the model is solved to obtain the individual breeding values. Since the efficiency of solving the model depends only on the number of individuals, GBLUP has low computational complexity and high computational efficiency. However, its assumption that all markers contribute equally is overly simple, resulting in poor prediction accuracy. The representative indirect method is the Bayesian method, which first builds a model in the training population: based on Bayesian prior information, it uses an iterative Markov chain Monte Carlo (MCMC) process to obtain the marker effect values. Then, in the validation population, the marker effects are accumulated according to each individual's genotype to obtain the individual breeding value. The indirect method uses different prior assumptions to optimize the model representation of the genetic structure of the trait, and therefore has higher prediction accuracy than GBLUP. However, because its assumptions introduce many unknown parameters and the iterative parameter-solving process cannot be parallelized, its computational efficiency is poor.

Summary of the Invention

[0003] The technical problem this invention aims to solve is twofold: the low prediction accuracy of the traditional GBLUP method, caused by its assumption that all markers contribute equally, and the low computational efficiency of the Bayesian method, caused by its complex parameter-solving process. This invention proposes a fast and accurate genome prediction method based on a heritability model, termed MAPS (Marker Adjusted Prioritization and Stratification), which optimizes the model representation of the genetic structure of complex traits using a trait-specific heritability model.

[0004] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:

[0005] A rapid and accurate genome prediction method based on a heritability model includes the following steps:

[0006] Step S1: Obtain phenotype files, genotype files, and marker map files; construct a training set, which includes genotype and phenotype data for each individual; and calculate the minor allele frequencies and linkage disequilibrium scores of the markers.

[0007] Step S2: Construct an SNP heritability model and train the SNP heritability model to obtain the optimal SNP heritability model;

[0008] Step S3: Calculate the estimated heritability of all single nucleotide polymorphisms (SNPs) based on the optimal SNP heritability model and divide all SNPs into K layers. Perform the likelihood ratio test under K layers to obtain the corresponding significance P value. Traverse the K values within the set range and select the K value corresponding to the smallest significance P value as the optimal number of layers.

[0009] Step S4: Based on the optimal number of layers and the SNPs assigned to each layer in Step S3, obtain the layer genotype data of the dataset to be predicted, that is, the genotype data of all its individuals at the SNPs of the corresponding layer. Based on the optimal SNP heritability model determined in Step S2, calculate the estimated heritability of the SNPs in each layer and assign it to the diagonal weight matrix of the corresponding layer. Calculate the kinship matrix between individuals from the layer genotype data of each layer. Fit a multi-random mixed model to obtain the genomic estimated breeding values of each phenotype for all individuals in the dataset to be predicted.

[0010] As described above, in step S1, the genotype file uses 0, 1, and 2 to represent the three genotypes aa, Aa, and AA, respectively. The map file contains the SNP name, the number of the chromosome on which the SNP is located, and the physical position of the SNP on the chromosome.

[0011] The minor allele frequencies in step S1, as described above, are calculated based on the following steps:

[0012] Use R software to save the genotype file as a genotype matrix. In the genotype matrix, rows represent individuals, columns represent markers, and elements are genotypes. The elements of each row of the genotype matrix are the genotype data of the corresponding individual. Calculate the mean of each column of the genotype matrix, and take half of the column mean as the minor allele frequency (MAF) of the marker. If the minor allele frequency (MAF) exceeds 0.5, replace the corresponding minor allele frequency (MAF) with 1-MAF.
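The calculation in [0012] can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses Python rather than the R software named in the text, and the function name `minor_allele_frequencies` and the toy genotype matrix are made up for the example.

```python
def minor_allele_frequencies(genotypes):
    """Return the MAF of each marker (column) of a 0/1/2 genotype matrix.

    genotypes: list of rows, one row per individual, one column per marker.
    """
    n_individuals = len(genotypes)
    n_markers = len(genotypes[0])
    mafs = []
    for j in range(n_markers):
        column_mean = sum(row[j] for row in genotypes) / n_individuals
        freq = column_mean / 2.0      # half of the column mean
        if freq > 0.5:                # fold so that the MAF lies in [0, 0.5]
            freq = 1.0 - freq
        mafs.append(freq)
    return mafs


# Hypothetical toy data: 4 individuals x 3 markers.
toy_genotypes = [
    [0, 2, 1],
    [1, 2, 0],
    [0, 1, 1],
    [1, 2, 2],
]
toy_mafs = minor_allele_frequencies(toy_genotypes)
```

The second marker in the toy data has a raw frequency above 0.5, so it exercises the 1-MAF replacement described above.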

[0013] The linkage disequilibrium score in step S1 is calculated based on the following formula:

[0014]

[0015] where $\xi_j$ is the linkage disequilibrium score of the j-th single nucleotide polymorphism, $\mathrm{SNP}_j$; the summed term is the squared Pearson correlation coefficient between the genotype-matrix columns corresponding to $\mathrm{SNP}_j$ and $\mathrm{SNP}_f$; and $\mathrm{SNP}_f$ denotes each of the other single nucleotide polymorphisms located within 1 Mb upstream and downstream of $\mathrm{SNP}_j$ in the map file.
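A minimal sketch of this windowed score is given below, under stated assumptions: the formula itself is not reproduced in this text, so the code simply sums the squared Pearson correlations described in [0015]. It omits the sample-size correction mentioned in the embodiment, restricts neighbours to the same chromosome (an assumption), and uses hypothetical function names and toy data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def ld_scores(genotypes, chromosomes, positions, window_bp=1_000_000):
    """For each SNP j, sum r^2 with the other SNPs on the same chromosome
    located within window_bp upstream or downstream of SNP j."""
    n_markers = len(positions)
    columns = [[row[j] for row in genotypes] for j in range(n_markers)]
    scores = []
    for j in range(n_markers):
        total = 0.0
        for f in range(n_markers):
            if f == j or chromosomes[f] != chromosomes[j]:
                continue
            if abs(positions[f] - positions[j]) <= window_bp:
                total += pearson_r(columns[j], columns[f]) ** 2
        scores.append(total)
    return scores


# Hypothetical toy data: the first two SNPs are close together and perfectly
# correlated; the third lies more than 1 Mb away from both.
toy_genotypes = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [1, 1, 2]]
toy_scores = ld_scores(toy_genotypes, [1, 1, 1], [29039, 41138, 5_000_000])
```

The double loop is quadratic in the number of markers; for a dense panel one would sort by position and scan only the neighbouring window, which this sketch leaves out for brevity.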

[0016] As described above, the SNP heritability model in step S2 is based on the following formula:

[0017]

[0018] where the model relates the following quantities: the estimated heritability of the j-th single nucleotide polymorphism, $\mathrm{SNP}_j$; the marginal effect estimate of $\mathrm{SNP}_j$; $\xi_j$, the linkage disequilibrium score of $\mathrm{SNP}_j$; $p_j$, the minor allele frequency of $\mathrm{SNP}_j$; and $\alpha$, a model hyperparameter.

[0019] The marginal effect estimate in step S2, as described above, is obtained based on the following steps:

[0020] The kinship matrix G among all individuals was calculated using genotype files. Genome-wide association analysis was performed on the training set using a single random mixed linear model to obtain the log-likelihood value l0 and the marginal effect estimate of single nucleotide polymorphisms (SNPs).

[0021] As described above, training the SNP heritability model in step S2 includes:

[0022] Set the range of the model hyperparameter α, and set the model hyperparameter α to the initial value α0.

[0023] The estimated heritability of all single nucleotide polymorphisms (SNPs) was calculated, and the estimated heritability of all SNPs was assigned to a diagonal weight matrix W. A new kinship matrix M among all individuals was calculated using the genotype file. A single random mixed linear model was fitted to obtain the log-likelihood value l1. A chi-square statistic 2(l1-l0) was constructed, and a likelihood ratio test was performed to obtain the significance p-value.

[0024] The model hyperparameter α is updated according to the Brent method. The significance P-value is calculated repeatedly for each update until the significance P-value reaches a minimum value. The corresponding model hyperparameter α is the optimal model hyperparameter. The optimal SNP heritability model is constructed based on the optimal model hyperparameter.
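The likelihood ratio test used in this training loop can be sketched as follows. Several points are assumptions rather than statements from the patent: the degrees of freedom are not given in the text (the default of 1 is a placeholder), the log-likelihood values are invented, and a comparison over three candidate values of α stands in for Brent's method, which would propose the candidates itself.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2.0) / i
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: complementary error function plus a finite series.
    total = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    for i in range((df - 1) // 2):
        total += term
        term *= x / (2 * i + 3)
    return total


def lrt_p_value(loglik_alt, loglik_null, df=1):
    """P-value of the statistic 2 * (loglik_alt - loglik_null)."""
    statistic = 2.0 * (loglik_alt - loglik_null)
    return chi2_sf(statistic, df)


# Hypothetical values: l0 from the baseline model, l1 for each candidate alpha.
l0 = -1520.0
l1_by_alpha = {-1.0: -1517.9, -0.5: -1516.2, 0.0: -1518.4}
p_by_alpha = {a: lrt_p_value(l1, l0) for a, l1 in l1_by_alpha.items()}
best_alpha = min(p_by_alpha, key=p_by_alpha.get)
```

Because the P-value decreases monotonically as the statistic grows, minimizing the P-value at fixed degrees of freedom is equivalent to maximizing l1.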

[0025] The optimal number of layers in step S3, as described above, is determined based on the following steps:

[0026] The K-means clustering algorithm is used to cluster all single nucleotide polymorphisms (SNPs) with their estimated heritability as the feature, dividing all SNPs into K layers.

[0027] Assign the estimated heritability of each layer to the diagonal weight matrix of the corresponding layer, and calculate the new kinship matrix between individuals in the genotype data of each layer.

[0028] Using the new kinship matrices, a multi-random mixed model is fitted to obtain the log-likelihood value $l_K$. The chi-square statistic $2(l_K - l_0)$ is constructed, and a likelihood ratio test is performed to obtain the corresponding significance P-value.

[0029] K is traversed over its range of values, and the significance P-value is calculated for each value of K. The K with the smallest significance P-value is selected as the optimal number of layers.
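The stratification step in [0026] can be illustrated with a one-dimensional K-means on the estimated heritabilities. This is a sketch under assumptions: the patent does not specify the initialization or stopping rule, so the deterministic starting centers, the function name `kmeans_1d`, and the toy values are illustrative choices.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalars; returns (labels, centers).

    Centers start at evenly spaced positions of the sorted values so that
    the result is deterministic (an illustrative choice).
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every value.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical estimated heritabilities for six SNPs.
toy_h2 = [0.001, 0.002, 0.0015, 0.05, 0.06, 0.055]
toy_labels, toy_centers = kmeans_1d(toy_h2, k=2)

# SNP indices grouped by layer, ready for building per-layer genotype data.
layers = {c: [j for j, lab in enumerate(toy_labels) if lab == c] for c in range(2)}
```

Each layer's SNP indices would then select the genotype columns used to build that layer's weight matrix and kinship matrix.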

[0030] A computer device includes a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of the prediction method described above.

[0031] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the prediction method described above.

[0032] A computer program product includes a computer program that, when executed by a processor, implements the steps of the prediction method described above.

[0033] Compared with the prior art, the present invention has the following advantages and beneficial effects:

[0034] 1. This invention is based on a linear model framework, which has low computational complexity and allows key steps to be parallelized, resulting in fast computation. 2. This invention improves prediction performance by constructing a trait-specific SNP heritability model and optimizing the model representation of the genetic structure of complex traits.

Attached Figure Description

[0035] Figure 1 compares the prediction accuracy of GBLUP (genomic best linear unbiased prediction), BayesR (a Bayesian method), and the method (MAPS) described in Example 1 of this invention.

Detailed Implementation

[0036] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0037] Example 1:

[0038] A rapid and accurate genome prediction method based on a heritability model includes the following steps:

[0039] Step S1: Obtain the phenotype file, the genotype file, and the marker map file, where the markers are single nucleotide polymorphisms (SNPs). Obtain the phenotypic data of each individual from the phenotype file and the genotype data of each individual from the genotype file. Take the individuals that have both phenotypic and genotype data as the total population, and construct training and validation sets: randomly divide the total population into 5 groups, use the phenotypic and genotype data of 4 groups as the training set, and use the genotype data of the remaining group as the validation set. The dataset to be predicted, like the validation set, includes only the genotype data of its individuals; the phenotypic data of the individuals in the dataset to be predicted are unknown. Calculate the minor allele frequency (MAF) and linkage disequilibrium score (LDsc) of each marker from the map file and the genotype matrix generated from the genotype file.

[0040] Step S1-1: Use R software to read the phenotype file, genotype file, and map file. The genotype file must be in numeric format, using 0, 1, and 2 to represent the three genotypes aa, Aa, and AA, respectively. Check and, if necessary, adjust the order of individuals in the phenotype and genotype files so that it is consistent. The map file contains three columns of information: SNP name, chromosome number of the SNP, and physical position of the SNP on the chromosome.

[0041] Step S1-2: Use R software to save the genotype file read in step S1-1 as a genotype matrix in matrix format, with rows representing individuals, columns representing markers, and elements representing genotypes. Each row of the genotype matrix contains the genotype data of the corresponding individual. Calculate the mean of each column in the genotype matrix and use half of the column mean as the MAF (minor allele frequency) of the marker. If the MAF exceeds 0.5, replace the corresponding MAF with 1-MAF to ensure that the MAF ranges from 0 to 0.5. The phenotypic file includes the phenotypic data of each individual. Construct a training set, which includes the phenotypic data and corresponding genotype data of the individuals. Construct a validation set, which only includes the genotype data of the individuals.

[0042] Step S1-3: Use R software to save the map file read in step S1-1 as a data.frame, and calculate the LDsc (linkage disequilibrium score) of each SNP in combination with the genotype matrix. Taking the j-th single nucleotide polymorphism, $\mathrm{SNP}_j$, as an example, its linkage disequilibrium score $\xi_j$ is calculated as the corrected sum of the squared Pearson correlation coefficients between the genotype-matrix column of $\mathrm{SNP}_j$ and the columns of the other single nucleotide polymorphisms $\mathrm{SNP}_f$ located within 1 Mb upstream and downstream of $\mathrm{SNP}_j$ in the map file. This is denoted as formula (1), whose terms are the squared Pearson correlation coefficient between the genotype-matrix columns of $\mathrm{SNP}_j$ and $\mathrm{SNP}_f$, and N, the number of individuals.

[0043]

[0044] In this embodiment, the practical application of the genome prediction method described in this invention is illustrated using Duroc pig sample data as an example, and the classic GBLUP direct method and BayesR indirect method are used as benchmarks for comparison.

[0045] In step S1, genotype data (Table 1 shows the genotype data of Duroc pigs), phenotypic data (Table 2 shows the phenotypic data of Duroc pigs), and map files (Table 3 shows the map file of Duroc pigs) are obtained from a public dataset (http://gigadb.org/dataset/100894). Individuals with both phenotypic and genotype data are taken as the total population, from which the training and validation sets are constructed. Table 2 lists seven phenotypes of Duroc pigs: backfat thickness at 100 kg (BF), loin muscle depth at 100 kg (LMD), lean meat percentage at 100 kg (LMP), total number of teats (TTN), number of left teats (LTN), number of right teats (RTN), and feeding time per day (TPD).

[0046] Table 1

[0047]
ID     SNP1   SNP2   SNP3   …   SNP258662
0001   0      0      1      …   0
0002   0      0      2      …   2
0003   2      1      0      …   1
…      …      …      …      …   …
2770   2      1      0      …   2

[0048] Table 2

[0049]
ID     BF      LMD    LMP     TTN   LTN   RTN   TPD
0001   12.16   47.5   52.44   11    6     5     84.9648
0002   9.63    44.2   54.61   15    8     7     63.8484
0003   11.69   46.2   53.95   12    6     6     86.4532
…      …       …      …       …     …     …     …
2770   9.34    46.4   57.98   12    6     6     57.6142

[0050] Table 3

[0051]
SNP         CHR   POS
SNP1        1     29039
SNP2        1     41138
SNP3        1     49386
…           …     …
SNP258662   18    55971779

[0052] Step S2: Construct an SNP heritability model and train the SNP heritability model to obtain the optimal SNP heritability model;

[0053] Step S2-1: Based on the phenotypic and genotypic data of the training set read in step S1, calculate the kinship matrix G among all individuals from the genotype file using the VanRaden method. Perform a genome-wide association study (GWAS) on the training set using a single random mixed linear model (MLM) to obtain the log-likelihood value $l_0$ of the single random mixed model and the marginal effect estimates $\beta_1, \beta_2, \ldots, \beta_m$ of the SNPs, where m is the total number of SNPs;

[0054] Step S2-2: Combining the marginal effect estimates of the SNPs calculated in step S2-1 with the MAF and LDsc calculated in step S1, construct the SNP heritability model, denoted as formula (2). Taking $\mathrm{SNP}_j$ as an example, the model relates the estimated heritability of the j-th single nucleotide polymorphism $\mathrm{SNP}_j$; the marginal effect estimate of $\mathrm{SNP}_j$; $\xi_j$, the linkage disequilibrium score (LDsc) of $\mathrm{SNP}_j$; $p_j$, the minor allele frequency (MAF) of $\mathrm{SNP}_j$; and $\alpha$, the model hyperparameter;

[0055]

[0056] Step S2-3: Set the range of α to [-3, 1], and give an initial value α0.

[0057] The Brent method was used to optimize the hyperparameter α in the SNP heritability model, specifically including:

[0058] Step S2-4: Calculate the estimated heritability of all single nucleotide polymorphisms (SNPs) based on formula (2). Based on the characteristics of the VanRaden algorithm, the estimated heritability of all single nucleotide polymorphisms (SNPs) is assigned to a diagonal weight matrix W. A new kinship matrix M among all individuals is calculated using the genotype file. A single random mixture linear model is fitted to obtain the log-likelihood value l1. Based on the log-likelihood value l0 of the single random mixture model obtained in step S2-1, a chi-square statistic 2(l1-l0) is constructed, and a likelihood ratio test is performed to obtain the significance P-value P0.

[0059] Step S2-5: Update the model hyperparameter α according to the Brent method. Repeat the above calculation in step S2-4 for each update until the significance P value reaches a minimum. Take the model hyperparameter α at this time as the optimal model hyperparameter, and construct the optimal SNP heritability model based on the optimal model hyperparameter.
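The weighted kinship matrix M of step S2-4 is not written out in the text. A minimal sketch, assuming a weighted form of VanRaden's first method, $M = Z W Z^{T} / \left(2 \sum_j w_j p_j (1 - p_j)\right)$ with $Z$ the genotype matrix centered by twice the allele frequency, is shown below. The normalization, the function name, and the toy data are assumptions; with all weights equal to 1 the expression reduces to the ordinary VanRaden matrix G.

```python
def weighted_kinship(genotypes, weights):
    """Weighted VanRaden-style kinship matrix for a 0/1/2 genotype matrix.

    genotypes: rows are individuals, columns are SNPs.
    weights:   one non-negative weight per SNP (the diagonal of W).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Frequency of the allele counted by the 0/1/2 coding (not folded).
    p = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    # Center every genotype by twice the allele frequency.
    z = [[row[j] - 2.0 * p[j] for j in range(m)] for row in genotypes]
    scale = 2.0 * sum(weights[j] * p[j] * (1.0 - p[j]) for j in range(m))
    kinship = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            value = sum(weights[j] * z[a][j] * z[b][j] for j in range(m)) / scale
            kinship[a][b] = kinship[b][a] = value
    return kinship


# Hypothetical toy data: 3 individuals x 2 SNPs, unit weights.
toy_genotypes = [[0, 2], [1, 1], [2, 0]]
toy_kinship = weighted_kinship(toy_genotypes, weights=[1.0, 1.0])
```

In the stratified setting of step S3-2, the same function would be called once per layer with that layer's genotype columns and weights.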

[0060] Step S3: Based on the results of the heritability model established in step S2, the K-means clustering algorithm is combined with the likelihood ratio test to optimize the clustering of markers, so as to rank and stratify the SNPs.

[0061] Step S3-1: Based on the estimated heritability of all single nucleotide polymorphisms (SNPs) obtained from the optimal SNP heritability model in step S2, use the K-means clustering algorithm to cluster all SNPs using the estimated heritability as features, and divide all SNPs into K layers. In this embodiment, K is 20. The genotype data of all individuals in the corresponding layer constitute the layer genotype data.

[0062] Step S3-2: For the single nucleotide polymorphisms (SNPs) contained in each layer, according to the VanRaden algorithm, assign the estimated heritability corresponding to layers 1, 2, ..., K to the diagonal weight matrices $W_1, W_2, \ldots, W_K$ of the corresponding layers, and calculate the new kinship matrices $M_1, M_2, \ldots, M_K$ between individuals from the genotype data of layers 1, 2, ..., K, respectively;

[0063] Step S3-3: Using the new kinship matrices $M_1, M_2, \ldots, M_K$ obtained in step S3-2, fit a multi-random mixed model to obtain the log-likelihood value $l_K$. Based on $l_0$ obtained in step S2-1, construct the chi-square statistic $2(l_K - l_0)$ and perform the likelihood ratio test to obtain the corresponding significance P value $P_K$;

[0064] Step S3-4: K is traversed within the range of 1 to b. Steps S3-1 to S3-3 are repeated for the different values of K in parallel using multiple threads to obtain the P values $P_1, P_2, \ldots, P_b$ for all values of K. The K with the smallest P value is selected as the optimal number of strata.
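The parallel traversal over K can be sketched as below. The function `p_value_for_k` is a hypothetical placeholder for steps S3-1 to S3-3, and the P-values in the lookup table are invented purely to make the example runnable.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical P-values per candidate K; in practice each would come from
# clustering, building kinship matrices, fitting the model, and the LRT.
TOY_P_VALUES = {1: 3e-4, 2: 8e-6, 3: 2e-7, 4: 5e-6, 5: 1e-5}


def p_value_for_k(k):
    """Placeholder for running steps S3-1 to S3-3 with k strata."""
    return TOY_P_VALUES[k]


def select_optimal_k(candidate_ks, max_workers=4):
    """Evaluate every candidate K concurrently and return the K with the
    smallest P-value together with all P-values."""
    candidate_ks = list(candidate_ks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        p_values = list(pool.map(p_value_for_k, candidate_ks))
    best_index = min(range(len(candidate_ks)), key=p_values.__getitem__)
    return candidate_ks[best_index], dict(zip(candidate_ks, p_values))


optimal_k, p_by_k = select_optimal_k(range(1, 6))
```

Threads are used here to mirror the wording of the step; for CPU-bound model fitting in Python, a process pool or a numerical library that releases the interpreter lock would likely be needed to see a real speed-up.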

[0065] Step S4: Based on the optimal number of strata and the SNPs corresponding to each stratum determined in Step S3, obtain the stratified genotype data of all individuals in the dataset to be predicted, which consists of the genotype data of the corresponding stratum. Based on the optimal SNP heritability model determined in Step S2, calculate the estimated heritability of each SNP and assign it to the diagonal weight matrix of the corresponding stratum. Calculate the kinship matrix between individuals in the stratified genotype data of each stratum. Fit a multi-random mixture model to obtain the genomic estimated breeding value (GEBV) of each phenotype of all individuals in the dataset to be tested. The log-likelihood value and the genomic estimated breeding value are both outputs of the multi-random mixture model.
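The text does not write out the multi-random mixed model. One common form for a mixed linear model with K random genetic effects, given here only as an assumed illustration (the symbols $y$, $X$, $b$, $u_k$, $e$, $\sigma_k^2$, and $\sigma_e^2$ are not taken from the patent), is:

```latex
y = Xb + \sum_{k=1}^{K} u_k + e, \qquad
u_k \sim N\!\left(0,\; M_k \sigma_k^2\right), \qquad
e \sim N\!\left(0,\; I \sigma_e^2\right)
```

Under this assumed form, $y$ is the phenotype vector, $Xb$ the fixed effects, $M_k$ the kinship matrix of layer $k$, and the genomic estimated breeding value of an individual would be the sum of its predicted layer effects, $\sum_{k} \hat{u}_k$.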

[0066] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it can include the processes of the embodiments of the above methods.

[0067] To verify the effectiveness of the present invention, Duroc pig individuals with both phenotypic and genotypic data were used as the total population. The total population was randomly divided into 5 groups. The genotypic data of the individuals in the first group were selected as the validation set, and the phenotypic and genotypic data of the remaining 4 groups were used as the training set. The training set and validation set corresponding to one training and validation cycle were obtained. Similarly, the second, third, fourth, and fifth validation sets and the corresponding second, third, fourth, and fifth training sets were selected in sequence. The above steps were repeated a total of five times, resulting in a total of 25 training and validation sets corresponding to training and validation cycles.
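The repeated five-fold scheme described above can be sketched as follows. The function name, the fixed seed, and the use of 20 toy individual IDs are assumptions made for the example.

```python
import random


def repeated_k_fold(ids, n_folds=5, n_repeats=5, seed=0):
    """Return a list of (training_ids, validation_ids) pairs.

    Each repeat reshuffles the individuals, splits them into n_folds groups,
    and uses every group once as the validation set.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for i, validation in enumerate(folds):
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            splits.append((training, validation))
    return splits


# Hypothetical population of 20 individuals.
toy_splits = repeated_k_fold(range(1, 21))
```

With 5 folds and 5 repeats this yields the 25 training and validation pairs mentioned in the text.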

[0068] For each training and validation run, steps S2 and S3 are performed. Based on the optimal number of strata and the corresponding SNPs in each stratum determined in step S3, and the genotype data of the corresponding layer in the validation set, the estimated heritability of each layer's SNPs is calculated and assigned to the diagonal weight matrix of the corresponding layer based on the optimal SNP heritability model determined in step S2. The kinship matrix between individuals in the corresponding layer genotype data is also calculated. A multi-random mixture model is fitted to obtain the Genomic Estimated Breeding Value (GEBV) for each phenotype of all individuals in the validation set. Both the log-likelihood value and the GEBV are outputs of the multi-random mixture model. The Pearson correlation coefficient between the GEBV and the actual phenotype data of each individual in the validation set is calculated as the prediction accuracy. The prediction accuracy of the same phenotype for all individuals in the validation set after 25 training and validation runs is averaged.
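The accuracy measure in this paragraph, the Pearson correlation between GEBV and observed phenotype averaged over runs, can be sketched as follows. The function names and the toy values are illustrative only and are not results from the patent.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_accuracy(runs):
    """runs: list of (gebv, observed_phenotype) pairs, one per validation run."""
    accuracies = [pearson_correlation(g, y) for g, y in runs]
    return sum(accuracies) / len(accuracies)


# Made-up values for two runs of one phenotype.
toy_runs = [
    ([1.0, 2.0, 3.0, 4.0], [10.0, 12.0, 14.0, 16.0]),  # perfectly correlated
    ([1.0, 2.0, 3.0, 4.0], [16.0, 14.0, 12.0, 10.0]),  # perfectly anti-correlated
]
toy_mean = mean_accuracy(toy_runs)
```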

[0069] The prediction accuracy of the GBLUP and BayesR methods was evaluated on the individuals in the validation sets over the same 25 training and validation runs, and for each method the prediction accuracy for the same phenotype was averaged across runs. As shown in Figure 1, compared with GBLUP (genomic best linear unbiased prediction) and BayesR, the method described in this invention (MAPS) achieved good prediction results for all seven traits of Duroc pigs. In particular, the prediction accuracy of MAPS for the traits BF, LMP, LTN, RTN, TTN, and TPD was significantly higher than that of GBLUP and BayesR, which shows that MAPS performs well in genome prediction.

[0070] Example 2:

[0071] In this embodiment, a computer device is also provided, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.

[0072] Example 3:

[0073] In this embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps in the above-described method embodiments.

[0074] Example 4:

[0075] In this embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above-described method embodiments.

[0076] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.

Claims

1. A rapid and accurate genome prediction method based on a heritability model, characterized in that it includes the following steps: Step S1: Obtain phenotype files, genotype files, and marker map files; construct a training set, which includes genotype and phenotype data for each individual; and calculate the minor allele frequencies and linkage disequilibrium scores of the markers. Step S2: Construct an SNP heritability model and train the SNP heritability model to obtain the optimal SNP heritability model. Step S3: Calculate the estimated heritability of all single nucleotide polymorphisms (SNPs) based on the optimal SNP heritability model and divide all SNPs into K layers. Perform the likelihood ratio test under K layers to obtain the corresponding significance P value. Traverse the K values within the set range and select the K value corresponding to the smallest significance P value as the optimal number of layers. Step S4: Based on the optimal number of layers and the SNPs assigned to each layer in Step S3, obtain the layer genotype data of the dataset to be predicted, which is composed of the genotype data of all its individuals at the SNPs of the corresponding layer. Based on the optimal SNP heritability model determined in Step S2, calculate the estimated heritability of the SNPs in each layer and assign it to the diagonal weight matrix of the corresponding layer. Calculate the kinship matrix between individuals from the layer genotype data of each layer. By fitting a multi-random mixed model, obtain the genomic estimated breeding values of each phenotype for all individuals in the dataset to be predicted.
The SNP heritability model in step S2 is based on the following formula, whose terms are, for the j-th single nucleotide polymorphism, its estimated heritability, its marginal effect estimate, its linkage disequilibrium score $\xi_j$, and its minor allele frequency $p_j$, together with the model hyperparameter $\alpha$. The marginal effect estimate in step S2 is obtained based on the following steps: the kinship matrix G among all individuals is calculated using the genotype file, and genome-wide association analysis is performed on the training set using a single random mixed linear model to obtain the log-likelihood value $l_0$ of the single random mixed model and the marginal effect estimates of the single nucleotide polymorphisms (SNPs). The training of the SNP heritability model in step S2 includes: setting the range of the model hyperparameter $\alpha$ and setting $\alpha$ to the initial value $\alpha_0$; calculating the estimated heritability of all SNPs, assigning the estimated heritability of all SNPs to a diagonal weight matrix W, calculating the new kinship matrix M among all individuals using the genotype file, and fitting a single random mixed linear model to obtain the log-likelihood value $l_1$; constructing the chi-square statistic $2(l_1 - l_0)$ and performing a likelihood ratio test to obtain the significance P-value; and updating the model hyperparameter $\alpha$ using the Brent method, recalculating the significance P-value after each update until the significance P-value reaches its minimum. The corresponding model hyperparameter $\alpha$ is the optimal model hyperparameter, and the optimal SNP heritability model is constructed based on the optimal model hyperparameter.

2. The rapid and accurate genome prediction method based on a heritability model according to claim 1, characterized in that, in step S1, the genotype file uses 0, 1, and 2 to represent the three genotypes aa, Aa, and AA, respectively, and the map file contains the SNP name, the number of the chromosome on which the SNP is located, and the physical position of the SNP on that chromosome.
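
The data layout in claim 2 can be pictured with a short Python sketch. The names GENOTYPE_CODES, MapEntry, and encode_genotypes are hypothetical, and the on-disk file format (delimiters, headers) is not specified in the claim, so only the in-memory representation is shown.

```python
from dataclasses import dataclass

# Coding stated in claim 2: aa -> 0, Aa -> 1, AA -> 2.
GENOTYPE_CODES = {"aa": 0, "Aa": 1, "AA": 2}


@dataclass(frozen=True)
class MapEntry:
    """One row of the map file: SNP name, chromosome, physical position."""
    snp_name: str
    chromosome: str
    position_bp: int


def encode_genotypes(calls):
    """Convert genotype calls such as 'Aa' to their 0/1/2 codes."""
    return [GENOTYPE_CODES[call] for call in calls]
```

A call outside the three listed genotypes raises a KeyError here. How missing genotypes are handled is not described in the claim.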

3. The rapid and accurate genome prediction method based on a heritability model according to claim 1, characterized in that the minor allele frequencies in step S1 are calculated as follows: use R software to store the genotype file as a genotype matrix in which rows represent individuals, columns represent markers, and elements are genotypes, so that each row holds the genotype data of the corresponding individual. Calculate the mean of each column of the genotype matrix and take half of the column mean as the minor allele frequency (MAF) of the marker. If the resulting value exceeds 0.5, replace it with 1 - MAF.
The linkage disequilibrium score in step S1 is calculated by a formula in which the linkage disequilibrium score of a given SNP is obtained from the squared Pearson correlation coefficients between the genotype-matrix column of that SNP and the genotype-matrix columns of the other SNPs located within 1 Mb upstream and downstream of it according to the map file.
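
The two marker statistics in claim 3 can be sketched in Python as follows. The claim names R as the software used, so this is only an illustration. The MAF rule follows the claim directly. For the linkage disequilibrium score, the sketch assumes a plain sum of squared Pearson correlations over the other SNPs on the same chromosome within the 1 Mb window. The claim's formula is not reproduced in the text, so the exact form, including whether a constant term is added, is an assumption. All function names are hypothetical.

```python
def minor_allele_frequencies(genotypes):
    """genotypes: rows are individuals, columns are markers, values 0/1/2."""
    n = len(genotypes)
    mafs = []
    for j in range(len(genotypes[0])):
        freq = sum(row[j] for row in genotypes) / n / 2.0  # half the column mean
        mafs.append(1.0 - freq if freq > 0.5 else freq)
    return mafs


def squared_pearson(x, y):
    """Squared Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker, treated as uncorrelated (assumption)
    return sxy * sxy / (sxx * syy)


def ld_scores(genotypes, chromosomes, positions, window_bp=1_000_000):
    """Sum of r^2 with the other SNPs within window_bp on the same chromosome."""
    m = len(positions)
    cols = [[row[j] for row in genotypes] for j in range(m)]
    scores = []
    for j in range(m):
        total = 0.0
        for k in range(m):
            if k == j or chromosomes[k] != chromosomes[j]:
                continue
            if abs(positions[k] - positions[j]) <= window_bp:
                total += squared_pearson(cols[j], cols[k])
        scores.append(total)
    return scores
```

The double loop is quadratic in the number of markers. A real implementation would sort markers by position and scan only the neighbors inside the window.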

4. The rapid and accurate genome prediction method based on a heritability model according to claim 1, characterized in that the optimal number of strata in step S3 is determined as follows: use the K-means clustering algorithm to cluster all SNPs with their estimated heritability as the feature, dividing all SNPs into K strata. Assign the estimated heritability of each stratum to the diagonal weight matrix of that stratum, and calculate the new kinship matrix between individuals from the genotype data of each stratum. Using the new kinship matrices, fit a multi-random-effect mixed model to obtain the log-likelihood value. Construct a chi-square statistic and perform a likelihood ratio test to obtain the corresponding significance P-value. Iterate K over its range of values, calculating the significance P-value for each value of K, and select the K with the smallest significance P-value as the optimal number of strata.
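
The stratification loop in claim 4 can be sketched in Python under stated assumptions. The one-dimensional K-means below uses a deterministic quantile initialization, which the claim does not specify. The argument p_value_for_strata is a hypothetical callable that stands in for fitting the multi-random-effect mixed model and running the likelihood ratio test, neither of which is implemented here.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's K-means on scalar features. Returns (labels, centers)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start at evenly spaced quantiles (assumption).
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_labels == labels and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers


def select_best_k(heritabilities, k_values, p_value_for_strata):
    """Try each K and keep the one with the smallest significance P-value."""
    best = None
    for k in k_values:
        labels, _ = kmeans_1d(heritabilities, k)
        p = p_value_for_strata(labels, k)  # hypothetical model fit + test
        if best is None or p < best[1]:
            best = (k, p, labels)
    return best
```

Each label is the stratum index of the corresponding SNP, which is what step S4 needs in order to split the genotype data by stratum.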

5. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the prediction method according to any one of claims 1 to 4.

6. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the prediction method according to any one of claims 1 to 4.

7. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the prediction method according to any one of claims 1 to 4.

Citation Information

Patent Citations

CN115995262A
CN116377081A
