Method for preferentially selecting functional sites based on sequencing data, and applications thereof

By screening functional loci using biological prior information and employing linkage disequilibrium filtering, the problem of random distribution of genetic markers in genome-wide prediction was solved, improving breeding accuracy, reducing computational resource consumption, and enhancing breeding efficiency.

CN117334250B · Active · Publication Date: 2026-06-23 · CHINA AGRI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA AGRI UNIV
Filing Date
2022-06-24
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies, when using whole-genome predictive breeding, cannot effectively improve the accuracy of genome selection due to the random distribution of genetic markers, and they also consume a lot of computational resources and time.

Method used

Functional loci are screened using biological prior information, and the linkage disequilibrium method is used for filtering to selectively retain markers associated with the target trait, increase the marker density in important regions, and reduce the influence of noisy loci.

Benefits of technology

It improves the accuracy of genome selection, reduces computational resources and time consumption, and enhances breeding efficiency.



Abstract

The application relates to a method for screening functional sites based on sequencing data. The specific steps are as follows: 1. obtain the SNP sites of all samples through genome sequencing, randomly select a number of samples as a discovery population, and extract biological prior information; 2. delineate important regions of the genome according to the biological prior information obtained in step 1; 3. perform sliding-window linkage disequilibrium (LD) filtering on the whole genome, relaxing the LD filtering threshold or skipping filtering in the important regions to increase their marker density, and finally screen out a trait-specific marker set. Performing this selective LD site screening before breeding value estimation increases the signal-to-noise ratio of the sites, improves prediction accuracy, reduces the consumption of computing resources and time, and improves breeding efficiency.

Description

Technical Field

[0001] This invention relates to the fields of molecular genetics and agricultural animal breeding, specifically to a method for selecting functional sites based on sequencing data, applicable to whole-genome selection in various species.

Background Technology

[0002] Single nucleotide polymorphisms (SNPs) are currently the most important molecular genetic markers due to their abundant distribution throughout the genome, extremely high detection rate, and relatively high genetic stability, thus occupying a crucial position in molecular genetics. In the past decade, genome-wide prediction (GP) has utilized vast amounts of SNP data and has been widely applied to assess the genetic value of complex traits in plants and animals, demonstrating a significant advantage in accuracy over traditional pedigree-based breeding methods.

[0003] The accuracy of genomic prediction (GP) is influenced by various factors, such as reference population size, effective population size, marker density and quality, and the statistical model applied. With the continuous development of sequencing technology and improvements in bioinformatics algorithms, the number of genetic markers available for genomic prediction keeps increasing. However, many studies have confirmed that simply adding genetic markers does not increase the accuracy of genomic selection. This is mainly because the markers added by sequencing are randomly distributed across the genome, and not all of them contribute to the target phenotype. In addition, the genetic architecture of the economically important traits targeted in breeding varies, and the number, effect sizes, and distribution of QTLs are not uniform. Considering both aspects, screening loci from the richer and more diverse sequencing data to obtain phenotype-specific locus combinations, minimizing the influence of noisy loci on a given phenotype, and increasing the weight of QTL regions is an important strategy for further improving the accuracy of genomic selection and exploiting the advantages of sequencing data.

Summary of the Invention

[0004] In response to the needs of breeding practices and the application of sequencing data in genome selection, the present invention aims to provide a method for selectively filtering functional sites based on sequencing data and utilizing biological prior information, and to apply it to genome prediction and genome selection breeding applications.

[0005] A method for preferentially selecting functional sites based on sequencing data, comprising the following steps:

[0006] 1. Obtain the SNP sites of all samples through genome sequencing, and randomly select a number of samples from all samples as the discovery population to extract biological prior information;

[0007] Extracting biological prior information: perform a genome-wide association study (GWAS) on the discovery population, and extract biological prior information from the GWAS results or from database functional annotations. Specifically, this means identifying which loci are associated with the target trait or influence trait variation. Alternatively, biological prior information can be extracted by other methods, such as genome functional annotation and multi-omics data analysis.

[0008] 2. Delineate important regions of the genome based on the biological prior information obtained in step 1;

[0009] 3. Perform linkage disequilibrium (LD) filtering across the entire genome, relaxing the LD filtering threshold or skipping filtering in the important regions so that their marker density increases, and finally screen out a set of trait-specific markers.

[0010] In step 1, the SNP sites are obtained using methods based on low-depth sequencing, high-depth sequencing, or targeted genome sequencing.

[0011] In step 1, the optimal candidate site screening criteria are determined using different P-value gradients to screen out significant sites.

[0012] The significance P-value is obtained by fitting the effect of a specific SNP site with the mixed linear model of formula (3) and testing the estimated effect for significance:

[0013] $y = Xb + Z a_i + G u + e$ (3)

[0014] Where $y$ is the target phenotypic value vector, $b$ is the fixed effects vector, $a_i$ is the effect of the tested site, $u$ is the random polygenic effect, $e$ is the random residual, $X$ and $G$ are the incidence matrices for the fixed and random effects, respectively, and $Z$ is the genotype value, coded as 0, 1, and 2.
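As an illustration of how a per-site P-value can be produced, the following Python sketch regresses the phenotype on a single genotype vector coded 0, 1, 2. It is a deliberate simplification of formula (3), not the patent's model: it keeps only an intercept in place of Xb, omits the polygenic term Gu, and uses a normal approximation for the two-sided P-value. The function name `single_marker_test` and the toy data are hypothetical.

```python
from math import erfc, sqrt

def single_marker_test(y, z):
    """Return (effect, p_value) for regressing phenotype y on genotype z (0/1/2).

    Simplified stand-in for formula (3): intercept only, no polygenic term.
    """
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    szz = sum((g - mz) ** 2 for g in z)
    szy = sum((g - mz) * (v - my) for g, v in zip(z, y))
    a = szy / szz                                   # estimated marker effect a_i
    resid = [v - my - a * (g - mz) for g, v in zip(z, y)]
    sigma2 = sum(r * r for r in resid) / (n - 2)    # residual variance
    t = a / sqrt(sigma2 / szz)                      # effect divided by its standard error
    p = erfc(abs(t) / sqrt(2))                      # two-sided, normal approximation
    return a, p

# Toy data (hypothetical): each extra allele copy adds about 2 units.
z = [0, 1, 2] * 4
y = [10.1, 12.0, 13.9, 9.9, 12.1, 14.0, 10.0, 11.9, 14.1, 10.0, 12.0, 14.0]
effect, p_value = single_marker_test(y, z)
print(effect, p_value < 0.0001)
```

A site would then be called significant when its P-value falls below the chosen threshold. A full implementation would fit the fixed effects and the polygenic term as well, typically with dedicated mixed-model software.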

[0015] In step 3, linkage disequilibrium filtering is performed using the sliding window method. First, an $r^2$ threshold is set and sites whose $r^2$ exceeds the threshold are removed, while significant sites and sites linked to significant sites ($r^2 > 0.9$) are retained. Different $r^2$ gradients are tested to determine the optimal filtering criterion. $r^2$ indicates the degree of linkage disequilibrium between two sites.
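The filtering step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: $r^2$ is estimated as the squared Pearson correlation between genotype vectors coded 0, 1, 2 (a common estimator for unphased data), the window is counted in SNPs rather than base pairs, and the names `genotype_r2`, `selective_ld_prune`, `window`, and `linked_r2` are hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:           # monomorphic site: treat r^2 as 0
        return 0.0
    return cov * cov / (v1 * v2)

def selective_ld_prune(genos, significant, r2_threshold, window=50, linked_r2=0.9):
    """Return indices of SNPs kept after selective LD filtering.

    genos: per-SNP genotype vectors, ordered by genomic position.
    significant: indices of GWAS-significant SNPs, which are always kept.
    """
    n = len(genos)
    protected = set(significant)
    # Also protect SNPs tightly linked to a significant SNP inside the window.
    for s in significant:
        for j in range(max(0, s - window), min(n, s + window + 1)):
            if j != s and genotype_r2(genos[s], genos[j]) > linked_r2:
                protected.add(j)
    removed = set()
    for i in range(n):
        if i in removed:
            continue
        for j in range(i + 1, min(n, i + window + 1)):
            if j in removed or genotype_r2(genos[i], genos[j]) <= r2_threshold:
                continue
            if j not in protected:
                removed.add(j)       # drop the later, unprotected SNP
            elif i not in protected:
                removed.add(i)       # keep the protected SNP instead
                break
            # if both SNPs are protected, keep both
    return [k for k in range(n) if k not in removed]

genos = [
    [0, 1, 2, 0, 1, 2],   # SNP 0
    [0, 1, 2, 0, 1, 2],   # SNP 1, identical to SNP 0 (r^2 = 1)
    [0, 0, 1, 1, 2, 2],   # SNP 2, r^2 = 0.25 with SNP 0
]
print(selective_ld_prune(genos, significant=[], r2_threshold=0.5))   # [0, 2]
print(selective_ld_prune(genos, significant=[1], r2_threshold=0.5))  # [0, 1, 2]
```

With no significant site, the duplicate SNP is pruned. Once SNP 1 is marked significant, both it and its tightly linked neighbour survive, which is how marker density rises around trait-associated loci.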

[0016] The $r^2$ value (ranging from 0 to 1) represents the linkage disequilibrium between two loci, where 0 means that the two loci assort freely and 1 means that the two loci are completely linked. It is calculated by formula (2), with $D$ given by formula (1):

[0017] $D = P(AB) - P(A) \cdot P(B)$ (1)

[0018] $r^2 = \frac{D^2}{P(A)\,[1 - P(A)]\,P(B)\,[1 - P(B)]}$ (2)

[0019] Where $P(AB)$ is the frequency of the AB haplotype (allele A at one locus together with allele B at the other), and $P(A)$ and $P(B)$ are the frequencies of alleles A and B, respectively.
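A short worked example may help. The Python sketch below computes $D$ with formula (1) and then $r^2$, assuming formula (2) takes the standard form $r^2 = D^2 / (P(A)[1-P(A)]\,P(B)[1-P(B)])$ for two biallelic loci. The function name `ld_r2` and the example frequencies are hypothetical.

```python
def ld_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """Return r^2 between two biallelic loci from haplotype and allele frequencies."""
    d = p_ab - p_a * p_b                         # formula (1)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom                         # assumed standard form of formula (2)

# Complete linkage: allele A always occurs together with allele B.
print(ld_r2(p_ab=0.5, p_a=0.5, p_b=0.5))    # 1.0
# Free assortment: P(AB) equals P(A) * P(B).
print(ld_r2(p_ab=0.25, p_a=0.5, p_b=0.5))   # 0.0
```

The two calls reproduce the endpoints described in the text: 1 for complete linkage and 0 for free assortment.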

[0020] The above methods are applied in genome prediction.

[0021] The application of the above methods in genomic selection breeding.

[0022] The beneficial effects of this invention are:

[0023] The application of this invention is well-suited to practical breeding scenarios: In actual genomic selection applications, if sequencing data is used, tens of millions of SNP marker loci will be generated. Therefore, before estimating breeding values, selective linkage disequilibrium site screening can increase the signal-to-noise ratio of the loci and improve prediction accuracy. Furthermore, reducing the number of loci can significantly reduce computational resources and time consumption, thereby improving breeding efficiency.

Description of the Attached Figures

[0024] The present invention includes the following figures:

[0025] Figure 1 is a schematic diagram of the method for selectively filtering functional sites based on biological prior information and linkage disequilibrium.

[0026] Figure 2 is a diagram illustrating the accuracy of different parameter combinations.

Detailed Implementation

[0027] The present invention will now be described in detail with reference to embodiments and the accompanying drawings. It should be understood that the embodiments given below are for illustrative purposes only, and that the specification and drawings are only intended to clearly describe one embodiment and are not intended to limit the scope of the invention. The features, operations, or characteristics described in the specification can be combined in various suitable ways to form various implementations. Those skilled in the art can make various modifications and substitutions to the present invention without departing from its spirit and essence. Specific descriptions include, but are not limited to:

[0028] The present invention uses SNP sites obtained by low-depth sequencing, but SNP sites obtained by other detection methods, such as high-depth sequencing, targeted genome sequencing, SNP microarrays, or imputation from a reference haplotype panel, can also be used.

[0029] A schematic diagram of site selection is shown in Figure 1. After selective linkage-disequilibrium-weighted filtering, the density of sites located in major QTL regions is significantly higher than in other genomic regions, whereas the sites obtained from traditional SNP chips, sequencing, or reduced-representation genome sequencing are distributed relatively uniformly.

[0030] The following embodiments of the present invention use pig data as an example, which are only examples of use in specific situations and do not indicate that the present invention is species-limited. Theoretically, this method is applicable to all agricultural plant and animal species.

[0031] Implementation Examples: The implementation examples are used to specifically illustrate the methods described in this invention.

[0032] 1. Experimental Materials

[0033] The study used 11 million SNP loci obtained through genome sequencing of 3,549 Duroc pigs. All the study samples were male pigs, and the phenotypic record was the age at which they reached 100 kg body weight (AGE).

[0034] 2. Experimental Methods

[0035] 2.1 Genome-wide association analysis to screen functional loci

[0036] 1000 pigs were randomly selected from all samples as the locus discovery population. Genome-wide association analysis was performed on the AGE phenotype based on the discovery population. The optimal candidate locus selection criteria were determined by exploring different P-value gradients (see step 2.3 for details) and significant loci were selected.

[0037] 2.2 Selective Linkage Disequilibrium Filtering

[0038] Linkage disequilibrium filtering is performed using the sliding window method. First, an $r^2$ threshold is set and sites whose $r^2$ exceeds the threshold are removed, while the significant sites selected in step 2.1 and the sites linked to them ($r^2 > 0.9$) are retained. Different $r^2$ gradients are tested to determine the optimal filtering criterion.

[0039] 2.3 Accuracy assessment for different P-values and $r^2$ thresholds

[0040] The method for screening functional loci by selective linkage disequilibrium filtering based on biological prior information defines two main parameters, the P-value and $r^2$. Accuracy is affected by both significance (P-value) and linkage disequilibrium ($r^2$). This embodiment therefore evaluates 171 parameter combinations (9 P-values × 19 LD $r^2$ thresholds). The results are shown in Figure 2: the highest accuracy is achieved at a P-value of 0.0001 (vertical axis) and an $r^2$ of 0.35 (horizontal axis).
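The parameter search can be sketched as an exhaustive grid evaluation. In the Python example below, the grid values are assumptions: the text states only that 9 P-values and 19 $r^2$ thresholds were combined, so the specific lists, the function name `best_parameters`, and the `toy_accuracy` placeholder are hypothetical. The placeholder is not data from the embodiment. In practice `evaluate` would run the site screening and return the cross-validated prediction accuracy.

```python
from itertools import product
from math import log10

def best_parameters(p_values, r2_values, evaluate):
    """Evaluate every (P-value, r^2) pair; return (best_pair, best_score, n_combinations)."""
    results = {(p, r2): evaluate(p, r2) for p, r2 in product(p_values, r2_values)}
    best = max(results, key=results.get)
    return best, results[best], len(results)

# Hypothetical grids: 9 P-value thresholds and 19 r^2 thresholds (0.05 to 0.95).
p_grid = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]
r2_grid = [k / 20 for k in range(1, 20)]

def toy_accuracy(p, r2):
    """Placeholder score surface used only to make the sketch runnable."""
    return -(log10(p) + 4) ** 2 - (r2 - 0.35) ** 2

best, score, n_combos = best_parameters(p_grid, r2_grid, toy_accuracy)
print(n_combos, best)
```

The grid has 9 × 19 = 171 cells, matching the number of combinations reported above.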

[0041] 2.4 Comparison of the screened sites with traditional microarray results

[0042] The optimal results obtained by selective linkage disequilibrium filtering of functional sites based on biological prior information were compared with the results of two traditional microarrays. The comparison was based on the correlation coefficient between the predicted breeding values and the actual phenotypic records; a higher correlation indicates higher prediction accuracy. Each result was evaluated by 5-fold cross-validation to reduce accuracy assessment errors caused by sampling. The results showed that genomic prediction using the whole-genome markers selected by the method of this invention is significantly more accurate than prediction using the traditional microarrays.
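The accuracy measure described above, the correlation between predicted breeding values and phenotypic records under 5-fold cross-validation, can be sketched as follows. This is an illustrative outline rather than the embodiment's code: `fit_predict` is a hypothetical callback that trains any prediction model on the training animals and returns predictions for the test animals, and the toy predictor is a one-variable least-squares fit on a made-up marker score.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cross_validated_accuracy(phenotypes, fit_predict, k=5, seed=0):
    """Return (mean, standard deviation) of the per-fold prediction accuracy."""
    idx = list(range(len(phenotypes)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    accs = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        preds = fit_predict(train, test)          # predictions for the test animals
        accs.append(pearson(preds, [phenotypes[i] for i in test]))
    mean = sum(accs) / k
    sd = sqrt(sum((a - mean) ** 2 for a in accs) / (k - 1))
    return mean, sd

# Toy data (hypothetical): the phenotype is an exact linear function of a marker score.
score = [0.1 * i for i in range(20)]
pheno = [3 * s + 2 for s in score]

def toy_fit_predict(train, test):
    """Least-squares line fitted on the training animals, applied to the test animals."""
    mx = sum(score[i] for i in train) / len(train)
    my = sum(pheno[i] for i in train) / len(train)
    b = sum((score[i] - mx) * (pheno[i] - my) for i in train) / sum(
        (score[i] - mx) ** 2 for i in train)
    return [my + b * (score[i] - mx) for i in test]

mean_acc, sd_acc = cross_validated_accuracy(pheno, toy_fit_predict)
print(round(mean_acc, 6), round(sd_acc, 6))
```

Reporting the mean together with the spread across folds corresponds to the "accuracy ± deviation" format used in Table 1.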

[0043] Table 1. Prediction accuracy of the method for screening functional sites by selective linkage disequilibrium filtering based on biological prior information, compared with traditional microarrays.

[0044]
Site set                              Prediction accuracy
SMIC No. 1 chip                       0.318±0.04
Neogene pig 80k chip                  0.331±0.04
SLDP optimal parameter combination    0.389±0.04

[0045] The basic principle of this invention is to increase the marker density of specific genomic regions through selective linkage disequilibrium filtering, thereby increasing their weight in the prediction model. Two factors need to be considered when using this method. The first is the linkage disequilibrium filtering threshold, expressed as $r^2$ (0 to 1), which represents the linkage disequilibrium between two loci: 0 means that the two loci assort freely, while 1 means that the two loci are completely linked. The calculation is shown in formula (2). The second is the GWAS significance P-value, obtained by fitting the effect of a specific SNP locus with a mixed linear model (formula (3)) and testing the estimated effect for significance. The smaller the P-value, the stronger the association between the locus and the specific phenotype, and the more likely the locus is to influence the phenotype.

[0046] $D = P(AB) - P(A) \cdot P(B)$ (1)

[0047] $r^2 = \frac{D^2}{P(A)\,[1 - P(A)]\,P(B)\,[1 - P(B)]}$ (2)

[0048] $y = Xb + Z a_i + G u + e$ (3)

[0049] $D$ is obtained through formula (1), where $P(AB)$ is the frequency of the AB haplotype (allele A at one locus together with allele B at the other), and $P(A)$ and $P(B)$ are the frequencies of alleles A and B, respectively. In formula (3), $y$ is the target phenotypic value vector, $b$ is the fixed effect vector, $a_i$ is the effect of the tested site, $u$ is the random polygenic effect, $e$ is the random residual, $X$ and $G$ are the incidence matrices of the fixed and random effects, respectively, and $Z$ is the genotype value, coded as 0, 1, 2.

[0050] The above embodiments are only used to illustrate the present invention and are not intended to limit the present invention. Those skilled in the art can make various changes and modifications without departing from the essence and scope of the present invention. Therefore, all equivalent technical solutions also fall within the protection scope of the present invention.

[0051] The contents not described in detail in this specification are existing technologies known to those skilled in the art.

Claims

1. A method for preferentially selecting functional sites based on sequencing data, characterized by comprising the following steps: 1) obtain the SNP sites of all samples by genome sequencing, randomly select several samples from all samples as a discovery population, and extract biological prior information; 2) delineate important regions of the genome according to the biological prior information obtained in step 1); 3) perform sliding-window linkage disequilibrium filtering on the whole genome, relax the filtering threshold in the important regions or do not perform filtering there, so as to increase the marker density of the important regions, and finally screen out a trait-specific marker set; the biological prior information is extracted by genome-wide association analysis, database functional annotation, genome functional annotation, or multi-omics data analysis; different P-value gradients are used to determine the optimal candidate site screening criterion, and significant sites are screened out; the significance P-value is obtained by fitting the effect of a specific SNP site with the mixed linear model of formula (3): $y = Xb + Z a_i + G u + e$ (3); where $y$ is the target phenotype value vector, $b$ is the fixed effect vector, $a_i$ is the effect of the tested site, $u$ is the random polygenic effect, $e$ is the random residual, $X$ and $G$ are the incidence matrices of the fixed and random effects, respectively, and $Z$ is the genotype value, coded as 0, 1, 2.

2. The method for preferentially selecting functional sites based on sequencing data according to claim 1, wherein in step 1), the SNP sites are obtained by low-depth sequencing, high-depth sequencing, or targeted genome sequencing.

3. The method of claim 1, wherein: the sequencing data is obtained from a sample comprising a plurality of cells; and the functional site is a site of interest in the sample. In step 3), linkage disequilibrium filtering is performed based on the sliding window method: first, set an $r^2$ threshold and remove sites with $r^2$ greater than the threshold, while retaining significant sites and sites linked to significant sites; use different gradients of $r^2$ filtering to determine the optimal filtering standard; $r^2$ represents the degree of linkage disequilibrium between two sites.

4. The method for preferentially selecting functional sites based on sequencing data according to claim 3, wherein the $r^2$ value ranges from 0 to 1, 0 represents free combination between two loci, and 1 represents that two loci are in complete linkage, and the calculation formula is shown as formula (2): $D = P(AB) - P(A) \cdot P(B)$ (1); $r^2 = \frac{D^2}{P(A)\,[1 - P(A)]\,P(B)\,[1 - P(B)]}$ (2); wherein $P(AB)$ is the frequency of the AB haplotype (allele A at one locus together with allele B at the other), and $P(A)$ and $P(B)$ are the frequencies of alleles A and B, respectively.

5. The application of the method for selecting functional sites based on sequencing data according to any one of claims 1-4 in genome prediction.

6. The application of the method for selecting functional sites based on sequencing data according to any one of claims 1-4 in genome selection breeding.