Peanut quality trait selection breeding method based on whole-genome SNPs, and application thereof
High-quality SNP loci are screened by whole-genome resequencing and used to construct a hybrid deep learning model. This addresses the long cycles and low prediction accuracy of traditional peanut breeding methods, enabling efficient breeding and the cultivation of high-quality new peanut varieties.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CROP RES INST GUANGDONG ACAD OF AGRI SCI
- Filing Date
- 2025-09-22
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional peanut breeding methods suffer from long cycles, low prediction accuracy, and strong susceptibility to environmental error. Furthermore, existing whole-genome selection technologies yield low coefficients of determination when predicting across populations, making it difficult to meet the needs of commercial breeding.
The method integrates whole-genome resequencing, SNP optimization and marker screening, and hybrid deep learning model construction. High-quality SNP loci are screened through whole-genome resequencing and quality control, and a hybrid deep learning model is then constructed and its hyperparameters optimized, shortening the breeding cycle and improving prediction accuracy.
It has enabled accurate prediction of peanut quality traits and evaluation of breeding values, shortened the breeding cycle, improved prediction accuracy, increased breeding efficiency, and cultivated new high-quality peanut varieties with high oleic acid and high protein.
[Figure: abstract drawing, CN121171335B]
Description
Technical Field
[0001] This invention relates to the field of crop breeding technology, and specifically to a peanut quality trait selection breeding method based on whole-genome SNPs and its application.
Background Technology
[0002] Peanut is a globally important oilseed and cash crop, and its quality traits (such as oil content, protein content, and fatty acid composition) directly affect the economic returns of the industry and consumer health. With the development of molecular breeding technology, genome-wide selection (GS) has become a key means of overcoming the bottlenecks of traditional phenotypic selection, because it can account for the combined effects of multiple genes simultaneously. However, the high complexity of the peanut genome, together with interactions between minor-effect genes and the environment, leads to long cycles and low prediction accuracy in traditional breeding methods. It is therefore necessary to develop a genome-wide selection breeding technology that integrates high-precision genetic marker screening with model optimization strategies to achieve precise improvement of peanut quality traits.
[0003] Traditional peanut breeding techniques rely mainly on direct phenotypic selection or marker-assisted selection, which have several limitations. First, phenotypic selection is strongly affected by environmental error. For example, determining oleic acid content requires gas chromatography, which is costly and has poor repeatability, leading to significant deviations in breeding value evaluation. Second, marker-assisted selection can use only a few known QTLs and ignores the cumulative effect of minor-effect genes across the whole genome. For example, the oleic acid / linoleic acid ratio is regulated by at least 12 loci, so traditional methods explain only a limited share of the phenotypic variation. Third, the breeding cycle is lengthy, requiring multiple generations of selfing from hybridization to the selection of stable lines, and trait segregation in the offspring lowers selection efficiency. Although existing genome-wide selection technologies attempt to incorporate deep learning models, they do not integrate population structure correction or association contribution weighting, which results in generally low coefficients of determination in cross-population prediction and makes it difficult to meet the needs of commercial breeding.
[0004] Therefore, it is necessary to develop and apply a peanut quality trait selection breeding method based on whole-genome SNPs.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a peanut quality trait selection breeding method based on whole-genome SNPs and its application. The invention integrates whole-genome resequencing, SNP optimization and marker screening, and hybrid deep learning model construction to predict peanut quality traits and evaluate breeding values. Whole-genome resequencing and quality control yield an SNP variation map, and ten key quality traits, such as protein, oil content, and oleic acid, are measured, providing comprehensive genetic information for subsequent analysis. SNP optimization and marker screening select loci to form a feature marker set, avoiding interference from low-quality loci. By adjusting and training the hybrid deep learning model and optimizing its hyperparameters through ten-fold cross-validation, a whole-genome selection model is obtained that shortens the breeding cycle and reduces the prediction error rate, providing technical support for the improvement of peanut quality traits.
[0006] To solve the above-mentioned technical problems, this invention provides the following technical solution. In one aspect, a peanut quality trait selection breeding method based on whole-genome SNPs comprises the following steps:
[0007] Data acquisition: Multiple peanut germplasm samples are collected and planted. At germplasm maturity, three healthy plants are sampled, ten quality traits are measured, and the average values are used to construct a phenotypic matrix. At the same time, DNA is extracted from the young leaves of the plants and whole-genome resequencing is performed. After processing, the sequencing data are aligned to the peanut T2T reference genome to screen SNP sites and obtain an SNP variation map;
[0008] SNP optimization and marker screening: Based on the SNP variation map, the missing rate and minor allele frequency of each SNP site are calculated, high-quality SNP sites are screened and missing data are imputed, population structure is analyzed, linkage disequilibrium is calculated, and genome-wide association analysis is conducted to screen key SNP sites that form a feature marker set;
[0009] Model construction: With the feature marker set as input and the corresponding germplasm quality phenotypes as output, a hybrid deep learning model is constructed. During training, indicators based on association contribution are introduced to adjust the model, hyperparameters are optimized, and cross-validation is used to select the genome-wide selection model;
[0010] New line breeding: Backbone parents are selected by combining the phenotypic matrix, the SQI and AC indices, and germplasm genetic distance data, and are used to construct the breeding population. The genotypes of offspring individuals are obtained, the original breeding values are predicted by the whole-genome selection model, the corrected breeding values of the offspring individuals are calculated, and superior single plants are screened. Through multi-generation pyramiding selection, high-quality new peanut lines are obtained.
[0011] Furthermore, in the data acquisition process, the ten quality traits measured are protein, oil content, oleic acid, linoleic acid, oleic acid / linoleic acid ratio, stearic acid, palmitic acid, lysine, phenylalanine, and proline. Whole-genome resequencing is performed on a sequencing platform to obtain short reads. The reads undergo quality control: reads containing adapter sequences are removed, reads in which low-quality bases exceed 5% are filtered out, and duplicate reads are eliminated. The quality-controlled reads are then aligned to the peanut T2T reference genome, and SNPs are called using GATK software. Low-quality sites with a quality value <30 or a coverage depth <5 are filtered out to obtain the SNP variation map.
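The site-level filter in this step can be sketched as follows. This is only an illustration under stated assumptions: the `Variant` record and its field names are hypothetical, and the sketch assumes a site is discarded when either threshold is failed. In practice the filter would normally be applied with GATK or a VCF tool rather than hand-written code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal record for one called SNP site."""
    chrom: str
    pos: int
    qual: float  # variant quality value
    depth: int   # coverage depth at the site


MIN_QUAL = 30.0  # threshold stated in the text
MIN_DEPTH = 5    # threshold stated in the text


def passes_site_filter(v: Variant) -> bool:
    # Assumption: failing either criterion removes the site.
    return v.qual >= MIN_QUAL and v.depth >= MIN_DEPTH


def filter_variants(variants):
    """Keep only the sites that satisfy both thresholds."""
    return [v for v in variants if passes_site_filter(v)]


calls = [
    Variant("chr01", 1200, 45.0, 12),  # kept
    Variant("chr01", 3400, 28.5, 20),  # quality too low
    Variant("chr02", 560, 60.0, 4),    # depth too low
    Variant("chr02", 910, 30.0, 5),    # exactly on both thresholds, kept
]
kept = filter_variants(calls)
```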
[0012] Furthermore, in the SNP optimization and marker screening process, based on the SNP variation map, Plink software is used to calculate the missing rate and minor allele frequency of each SNP locus, and the comprehensive quality index (SQI) of each locus is calculated using the SNP comprehensive quality screening index formula to screen out high-quality SNP loci with an SQI ≥ 0.6. For loci that still have a small amount of missing data after screening, Beagle software is used to impute the missing genotypes, resulting in a complete SNP genotype data matrix. Subsequently, Admixture software is used for population structure analysis, the optimal population grouping is determined from the cross-validation error, and the population structure matrix is output. Plink software is used to calculate the linkage disequilibrium coefficient. TASSEL software is then used to conduct genome-wide association analysis and calculate the p-value of each SNP locus for its association with the ten quality traits, and the association contribution of each SNP locus to the target quality trait is calculated using the SNP-trait association contribution formula. Loci with an association contribution ≥ 5 for the target quality trait are selected to form the feature marker set.
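The two per-locus statistics used in this step (MR, the fraction of missing genotype calls, and MAF, the frequency of the less common allele) can be illustrated with a short sketch. It assumes a hypothetical encoding in which each genotype is the count of alternate alleles (0, 1, or 2) and `None` marks a missing call; Plink computes the same quantities from its own file formats.

```python
def missing_rate(genotypes):
    """Fraction of samples with no genotype call at this locus."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # Each diploid sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


# Ten samples at one locus, two of them missing (illustrative values).
site = [0, 1, 2, None, 0, 0, 1, None, 0, 0]
mr = missing_rate(site)
maf = minor_allele_frequency(site)
```

Here two of ten calls are missing, and the eight called samples carry four alternate alleles out of sixteen.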
[0013] Furthermore, in the SNP optimization and marker screening, the SNP comprehensive quality index is calculated using the SNP comprehensive quality screening index formula: SQI = 0.6 × (1 - MR) + 0.4 × MAF, where SQI is the SNP comprehensive quality index, with a value range of 0-1 and a higher value indicating better SNP quality; MR is the missing rate of the SNP; and MAF is the minor allele frequency of the SNP.
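As a worked illustration of the SQI formula and the SQI ≥ 0.6 cutoff, the following sketch scores three hypothetical loci. The locus names and their MR and MAF values are invented for the example.

```python
SQI_THRESHOLD = 0.6  # cutoff stated in the text


def sqi(mr: float, maf: float) -> float:
    """SQI = 0.6 * (1 - MR) + 0.4 * MAF, as given in the text."""
    if not 0.0 <= mr <= 1.0:
        raise ValueError("MR must lie in [0, 1]")
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    return 0.6 * (1.0 - mr) + 0.4 * maf


# Hypothetical loci: name -> (MR, MAF)
sites = {
    "snp_a": (0.02, 0.30),
    "snp_b": (0.20, 0.05),
    "snp_c": (0.00, 0.01),
}
scores = {name: sqi(mr, maf) for name, (mr, maf) in sites.items()}
high_quality = [name for name, s in scores.items() if s >= SQI_THRESHOLD]
```

Because a minor allele frequency cannot exceed 0.5, the index as written reaches at most 0.8, and a locus with no missing calls scores at least 0.6 whatever its MAF, as `snp_c` shows.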
[0014] Furthermore, in the SNP optimization and marker screening, the association contribution of each SNP locus to the target quality trait is calculated using the SNP-trait association contribution formula [formula not reproduced], where AC is the association contribution of the SNP locus to the target quality trait, P is the p-value of the association between the SNP locus and the ten quality traits, and the linkage disequilibrium coefficient measures the degree of genotypic association between two SNP loci. The closer the coefficient is to 1, the stronger the correlation between the two loci and the greater the interference; the closer it is to 0, the weaker the correlation and the smaller the interference.
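The linkage disequilibrium coefficient described here can be illustrated on its own, since the AC formula itself is not shown. The sketch below uses one common estimator, the squared Pearson correlation of allele counts between two loci, which lies between 0 and 1; whether the patent uses exactly this estimator is an assumption, as is the 0/1/2 genotype encoding with `None` for missing calls.

```python
def ld_r2(x, y):
    """Squared correlation of allele counts between two loci."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two samples called at both loci")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return (sxy * sxy) / (sxx * syy)


locus_1 = [0, 1, 2, 1, 0, 2]
locus_2 = [0, 1, 2, 1, 0, 2]   # identical pattern: complete LD
locus_3 = [0, 0, 2, 2]
locus_4 = [0, 2, 0, 2]         # unrelated pattern: no LD

r2_linked = ld_r2(locus_1, locus_2)
r2_unlinked = ld_r2(locus_3, locus_4)
```

Values near 1 flag pairs of loci that carry largely redundant information, which is the interference the paragraph refers to.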
[0015] Furthermore, in the model construction, a hybrid deep learning model comprising a VAE layer, a CNN layer, and a Transformer layer is constructed, with the feature marker set as input and the corresponding ten quality traits of the germplasm as output. In the VAE layer training stage, the average association contribution of the feature marker set (the average AC value of all key SNP loci) is introduced, and the noise compression loss value of the VAE layer is calculated using the noise compression loss formula. The VAE layer output is processed by the CNN layer to extract local linkage features and by the Transformer layer to capture genome-wide epistatic effects, and the VAE latent dimension, the CNN convolution kernel parameters, and the number of Transformer attention heads are iteratively adjusted. After each adjustment, the hybrid deep learning model is evaluated by ten-fold cross-validation: the coefficient of determination is recorded for each fold, and its mean and standard deviation are computed. The prediction accuracy correction coefficient is then calculated using the prediction accuracy correction coefficient formula, and the whole-genome selection model is obtained when this coefficient is ≥ 0.8.
[0016] Furthermore, in the model construction, the noise compression loss value of the VAE layer is calculated using the noise compression loss formula [formula not reproduced], where $L_{VAE}$ is the noise compression loss value; the mean squared error term is taken between the original genotype data $X$ and the data reconstructed by the VAE; $X$ is the original genotype data of the key SNP loci obtained through screening; and $AC_{avg}$ is the average association contribution of the feature marker set;
[0017] The prediction accuracy correction coefficient is calculated using the prediction accuracy correction coefficient formula [formula not reproduced], where $C_{acc}$ is the prediction accuracy correction coefficient, computed from the average coefficient of determination and from $CV_{std}$, the standard deviation of the $R^2$ values.
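The two inputs to the correction coefficient, the mean and the standard deviation of $R^2$ over ten folds, can be sketched as follows. The correction-coefficient formula itself is not shown, so the sketch stops at these two inputs. A one-variable least-squares line stands in for the hybrid model purely to make the example runnable, the data are synthetic, and contiguous unshuffled folds are a simplification.

```python
import statistics


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Stand-in model (NOT the hybrid network): simple linear regression."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def cross_validate(xs, ys, k=10):
    """Return (mean R^2, standard deviation of R^2) over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [slope * xs[i] + intercept for i in fold]
        scores.append(r_squared([ys[i] for i in fold], preds))
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic data: a linear trend plus a small deterministic wiggle.
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + ((i % 3) - 1) * 0.5 for i, x in enumerate(xs)]
mean_r2, std_r2 = cross_validate(xs, ys, k=10)
```

A low standard deviation across folds indicates that accuracy does not depend heavily on which germplasm happens to be held out.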
[0018] Furthermore, in the breeding of the new line, genetically complementary backbone parents are selected by combining the phenotypic matrix, the SQI and AC indices, and germplasm genetic distance data. The selection criteria are that the genetic distance between parents is greater than the population average genetic distance and that the parents perform well in different quality traits. A breeding population is constructed from the selected backbone parents. Once the offspring have grown, DNA is extracted from young leaves to obtain the genotype of each offspring individual. The offspring genotypes are matched against the feature marker set to determine the key SNP loci carried by each individual; the AC value of each locus is obtained using the SNP-trait association contribution formula, and the AC values are averaged to give the average association contribution of the key SNP loci carried by each individual. At the same time, the corresponding quality phenotype data of the backbone parents are extracted from the phenotypic matrix and averaged to give the parental average phenotypic value. The offspring genotypes are input into the whole-genome selection model to obtain the original predicted breeding value, and the corrected breeding value of each offspring individual is then calculated using the corrected breeding value formula. Superior individual plants are selected by the high-breeding-value criterion of a corrected breeding value ≥ 8 and are selfed or backcrossed to stabilize the superior traits. The steps of offspring genotyping, $GBV_{cal}$ calculation, superior plant screening, and selfing or backcrossing are repeated, and after 3-5 generations of pyramiding selection, high-quality new peanut lines with increased oil content, protein content, and oleic acid content are obtained.
[0019] Furthermore, in the breeding of the new line, the corrected breeding value of each offspring individual is calculated using the corrected breeding value formula: $GBV_{cal} = GBV_{pred} \times AC_{ind} + 0.1 \times P_{par}$, where $GBV_{cal}$ is the corrected breeding value of the offspring individual for the quality trait, $GBV_{pred}$ is the original breeding value predicted for that trait by the whole-genome selection model, $AC_{ind}$ is the average association contribution of the key SNP loci carried by the offspring individual, and $P_{par}$ is the average phenotypic value of the backbone parents for the corresponding quality trait.
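The corrected breeding value and the ≥ 8 selection rule can be worked through in a few lines. The formula and the cutoff come from the text; the individual identifiers and all input values below are invented for illustration, and the text does not state the scale or units of the inputs.

```python
GBV_THRESHOLD = 8.0  # high-breeding-value cutoff stated in the text


def corrected_breeding_value(gbv_pred: float, ac_ind: float, p_par: float) -> float:
    """GBV_cal = GBV_pred * AC_ind + 0.1 * P_par."""
    return gbv_pred * ac_ind + 0.1 * p_par


# Hypothetical offspring: id -> (GBV_pred, AC_ind, P_par)
offspring = {
    "F1-01": (1.2, 6.0, 20.0),
    "F1-02": (0.8, 5.5, 20.0),
    "F1-03": (1.1, 6.0, 20.0),
}
gbv_cal = {
    plant: corrected_breeding_value(*values) for plant, values in offspring.items()
}
selected = [plant for plant, value in gbv_cal.items() if value >= GBV_THRESHOLD]
```

With these inputs the corrected values are approximately 9.2, 6.4, and 8.6, so the first and third plants are carried forward.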
[0020] In another aspect, an application of genome-wide selection breeding for peanut quality traits comprises the following modules:
[0021] Data acquisition module: Multiple peanut germplasm samples are collected and planted. At germplasm maturity, three healthy plants are sampled, ten quality traits are measured, and the average values are used to construct a phenotypic matrix. At the same time, DNA is extracted from the young leaves of the plants and whole-genome resequencing is performed. After processing, the sequencing data are aligned to the peanut T2T reference genome to screen SNP sites and obtain an SNP variation map;
[0022] SNP optimization and marker screening module: Based on the SNP variation map, the missing rate and minor allele frequency of each SNP site are calculated, high-quality SNP sites are screened and missing data are imputed, population structure is analyzed, linkage disequilibrium is calculated, and genome-wide association analysis is conducted to screen key SNP sites that form a feature marker set;
[0023] Model building module: With the feature marker set as input and the corresponding germplasm quality phenotypes as output, a hybrid deep learning model is constructed. During training, indicators based on association contribution are introduced to adjust the model, hyperparameters are optimized, and cross-validation is used to select the genome-wide selection model;
[0024] Platform development module: A peanut smart breeding platform is developed using the Vue.js + Django framework. The platform embeds the whole-genome selection model and integrates data retrieval, downloading, visualization, breeding value prediction, parent pairing, and custom training functions;
[0025] New line breeding module: Backbone parents are selected by combining the phenotypic matrix, the SQI and AC indices, and germplasm genetic distance data, and are used to construct the breeding population. The genotypes of offspring individuals are obtained and uploaded to the peanut smart breeding platform to obtain the predicted breeding values, the corrected breeding values of the offspring individuals are calculated, and superior single plants are screened. Through multi-generation pyramiding selection, high-quality new peanut lines are obtained.
[0026] Compared with existing technologies, this peanut quality trait selection breeding method and its application based on whole-genome SNPs have the following beneficial effects:
[0027] I. This invention integrates whole-genome resequencing, SNP optimization marker screening, and hybrid deep learning model construction to achieve prediction of peanut quality traits and evaluation of breeding values. Through whole-genome resequencing and quality control processes, an SNP variation map is obtained, covering ten key quality traits such as protein, oil content, and oleic acid, providing comprehensive genetic information for subsequent analysis. The SNP optimization and marker screening process selects loci to form a feature marker set, avoiding interference from low-quality loci. By adjusting and training the hybrid deep learning model and optimizing hyperparameters through 10-fold cross-validation, a whole-genome selection model is obtained, shortening the breeding cycle and reducing the prediction error rate, thus providing technical support for the improvement of peanut quality traits.
[0028] II. This invention constructs a backbone parent selection strategy by combining the phenotypic matrix, the SQI and AC indices, and genetic distance data. Screening for genetic complementarity ensures the initial genetic diversity of the breeding population and lays the foundation for multi-generation pyramiding selection. In the offspring evaluation stage, the predicted values of the whole-genome selection model, the average association contribution of key SNP loci, and the average phenotypic value of the parents are combined to correct the breeding value, improving the accuracy with which individuals with high breeding values are identified. In addition, the platform development module integrates data retrieval, visualization, and breeding value prediction functions through the Vue.js + Django framework and can provide parent pairing suggestions in real time, addressing the data silos and decision lag of traditional breeding. After 3-5 generations of pyramiding selection, the oil content, protein content, and oleic acid / linoleic acid ratio of the new lines are significantly improved, achieving joint improvement of quality traits and providing a breeding paradigm for upgrading the peanut industry towards high oleic acid and high protein.
[0029] Other advantages, objectives and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination or study, or may be learned from the practice of the invention.
Brief Description of the Drawings
[0030] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0031] Figure 1 is a flowchart of the peanut quality trait selection breeding method based on whole-genome SNPs;
[0032] Figure 2 is a framework diagram of the application of genome-wide selection breeding for peanut quality traits.
Detailed Description
[0033] To further illustrate the technical means and effects of the present invention in achieving its intended purpose, the following detailed description of the specific implementation methods, structures, features, and effects of the present invention, in conjunction with the accompanying drawings and preferred embodiments, is provided below.
[0034] Example 1:
[0035] Data Acquisition: In the breeding of new high-oleic peanut lines, 150 peanut germplasm resources from different regions were collected and planted in experimental fields according to unified planting standards. Once the germplasm reached maturity (45-50 days after flowering), three healthy plants were randomly selected. Ten quality traits were measured, specifically protein content, oil content, oleic acid content, linoleic acid content, oleic acid / linoleic acid ratio, stearic acid content, palmitic acid content, lysine content, phenylalanine content, and proline content. Measurements were performed 3-4 times per plant, and the average values were used to construct a phenotypic matrix. Simultaneously, young leaves were harvested from these three plants and DNA was extracted. Whole-genome resequencing was performed on a sequencing platform to obtain short reads. The reads underwent quality control: reads containing adapter sequences were removed, reads with a low-quality base ratio >5% were filtered out, and duplicate reads were eliminated. The quality-controlled reads were then aligned to the peanut T2T reference genome, and SNPs were called using GATK software. Low-quality sites with a quality value <30 or a coverage depth <5 were filtered out, and finally an SNP variation map was obtained.
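The three read-level quality-control rules in this paragraph (adapter removal, the 5% low-quality-base limit, and duplicate removal) can be sketched as follows. Several details are assumptions, since the text does not give them: a base counts as low quality below Phred 20, the adapter string is a placeholder, and duplicates are taken to be reads with identical sequence.

```python
ADAPTER = "AGATCGGAAGAGC"  # placeholder; the text names no platform or adapter
MIN_PHRED = 20             # assumed cutoff for a "low-quality" base
MAX_LOW_FRACTION = 0.05    # 5% limit stated in the text


def low_quality_fraction(quals, min_phred=MIN_PHRED):
    """Share of bases in a read whose Phred score is below min_phred."""
    return sum(q < min_phred for q in quals) / len(quals)


def clean_reads(reads):
    """Apply adapter, low-quality, and duplicate filters in that order."""
    seen, kept = set(), []
    for seq, quals in reads:
        if ADAPTER in seq:
            continue  # adapter contamination
        if low_quality_fraction(quals) > MAX_LOW_FRACTION:
            continue  # too many low-quality bases
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((seq, quals))
    return kept


reads = [
    ("ACGTACGTACGTACGTACGT", [35] * 20),              # clean
    ("ACGTAGATCGGAAGAGCACG", [35] * 20),              # contains the adapter
    ("TTGCATTGCATTGCATTGCA", [35] * 18 + [10] * 2),   # 10% low-quality bases
    ("ACGTACGTACGTACGTACGT", [35] * 20),              # duplicate of the first
    ("GGCATGGCATGGCATGGCAT", [35] * 20),              # clean
]
kept_reads = clean_reads(reads)
```

Production pipelines normally delegate these steps to dedicated read-trimming and duplicate-marking tools; the sketch only makes the three rules concrete.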
[0036] SNP optimization and marker screening: Based on the obtained SNP variation map, the missing rate and minor allele frequency of each SNP locus were calculated using Plink software. The comprehensive quality index of each SNP locus was calculated using the SNP comprehensive quality screening index formula: SQI = 0.6 × (1 - MR) + 0.4 × MAF, where SQI is the comprehensive quality index of the SNP locus, ranging from 0 to 1, with higher values indicating better SNP quality; MR is the missing rate of the SNP locus; and MAF is the minor allele frequency of the SNP locus. High-quality SNP loci with an SQI ≥ 0.6 were screened out. For loci that still had a small amount of missing data after screening, Beagle software was used to impute the missing genotypes, resulting in a complete SNP genotype matrix. Admixture software was used for population structure analysis, the optimal population grouping was determined from the cross-validation error, and the population structure matrix was output. Plink software was used to calculate the linkage disequilibrium coefficients among the selected high-quality SNP loci. Finally, TASSEL software was used to conduct genome-wide association analysis, calculating the p-value of each SNP locus for its association with the ten quality traits. The SNP-trait association contribution formula [formula not reproduced] was used to calculate the association contribution of each SNP locus to the target quality trait (with a focus on oleic acid content), where AC is the association contribution of the SNP locus to the target quality trait, P is the p-value of the association between the SNP locus and the ten quality traits, and the linkage disequilibrium coefficient measures the degree of genotypic association between two SNP loci. The closer the coefficient is to 1, the stronger the association between the two loci and the greater the interference; the closer it is to 0, the weaker the association and the smaller the interference. Loci with AC ≥ 5 were selected to form a feature marker set for high-oleic peanut breeding, as shown in Figure 1.
[0037] Model Construction: With the selected feature marker set as input and the phenotypes of the ten quality traits of the 150 germplasm accessions (with a focus on oleic acid content) as output, a hybrid deep learning model containing a VAE layer, a CNN layer, and a Transformer layer is constructed. In the VAE layer training phase, the average association contribution of the feature marker set, i.e., the average AC value of the key SNP sites, is first calculated. The noise compression loss value of the VAE layer is then calculated using the noise compression loss formula to optimize the VAE layer parameters, achieving noise reduction and feature extraction for the genotype data. In the noise compression loss formula [formula not reproduced], $L_{VAE}$ is the noise compression loss value; the mean squared error term is taken between the original genotype data $X$ and the data reconstructed by the VAE; $X$ is the original genotype data of the key SNP loci obtained through screening; and $AC_{avg}$ is the average association contribution of the feature marker set;
[0038] The output of the VAE layer is fed into the CNN layer. The CNN convolution kernel size is set to 3×3, the number of kernels to 64, and the stride to 1, and local linkage features between SNP sites are extracted through convolution. The output of the CNN layer is then fed into the Transformer layer, which has 8 attention heads and a hidden dimension of 256 and captures the epistatic effects of SNP sites across the whole genome. The VAE latent dimension, the CNN convolution kernel parameters, and the number of Transformer attention heads are iteratively adjusted. After each adjustment, the hybrid deep learning model is evaluated using ten-fold cross-validation: the coefficient of determination is recorded for each fold, and its mean and standard deviation are computed. Finally, the prediction accuracy correction coefficient is calculated using the prediction accuracy correction coefficient formula [formula not reproduced], where $C_{acc}$ is the prediction accuracy correction coefficient, computed from the average coefficient of determination and from $CV_{std}$, the standard deviation of the $R^2$ values. When iterative optimization reaches $C_{acc}$ ≥ 0.8, training is stopped, and a genome-wide selection model suitable for high-oleic peanut breeding is obtained.
[0039] New line breeding: Based on the phenotypic matrix of 150 germplasms (with a focus on oleic acid content phenotypic data), SQI and AC values of each SNP locus (with a focus on AC values of oleic acid-related SNP loci), and genetic distance data between germplasms, genetically complementary backbone parents were selected. The selection criteria were: the genetic distance between parents was greater than the average genetic distance of 150 germplasms, and the paternal parent had an oleic acid content ≥78%, the maternal parent had a protein content ≥28%, and an oil content ≥52%.
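The parent selection criteria in this paragraph can be sketched as a simple filter over candidate crosses. The thresholds (oleic acid ≥ 78% for the paternal parent, protein ≥ 28% and oil ≥ 52% for the maternal parent, and pair distance above the population mean) come from the text; the four accessions, their trait values, and the distance table are invented for illustration.

```python
from itertools import combinations

# Hypothetical accessions with trait values in percent.
germplasm = {
    "G1": {"oleic": 80.0, "protein": 24.0, "oil": 50.0},
    "G2": {"oleic": 45.0, "protein": 29.0, "oil": 53.0},
    "G3": {"oleic": 79.0, "protein": 28.5, "oil": 52.5},
    "G4": {"oleic": 50.0, "protein": 26.0, "oil": 49.0},
}

# Hypothetical pairwise genetic distances.
distance = {
    frozenset(("G1", "G2")): 0.40,
    frozenset(("G1", "G3")): 0.10,
    frozenset(("G1", "G4")): 0.30,
    frozenset(("G2", "G3")): 0.35,
    frozenset(("G2", "G4")): 0.20,
    frozenset(("G3", "G4")): 0.15,
}


def mean_pairwise_distance(names, distance):
    """Average genetic distance over all pairs in the population."""
    pairs = list(combinations(names, 2))
    return sum(distance[frozenset(p)] for p in pairs) / len(pairs)


def candidate_crosses(germplasm, distance):
    """Return (paternal, maternal) pairs that meet every criterion."""
    threshold = mean_pairwise_distance(list(germplasm), distance)
    crosses = []
    for sire, s in germplasm.items():
        if s["oleic"] < 78.0:
            continue
        for dam, d in germplasm.items():
            if dam == sire or d["protein"] < 28.0 or d["oil"] < 52.0:
                continue
            if distance[frozenset((sire, dam))] > threshold:
                crosses.append((sire, dam))
    return crosses


crosses = candidate_crosses(germplasm, distance)
```

In this toy population the mean pairwise distance is 0.25, so G1 × G3 is rejected for being too closely related even though both parents meet their trait thresholds.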
[0040] The selected male and female parents were used to crossbreed to construct the F1 generation breeding population. When the F1 generation plants grew to maturity, the tender leaves of each plant were collected to extract DNA and obtain the genotype data of the F1 generation individuals.
[0041] The F1 generation genotype data were matched against the feature marker set composed of the screened key SNP loci to determine the key SNP loci carried by each F1 individual. The AC value of each carried locus was calculated using the SNP-trait association contribution formula, and the values were averaged to obtain the average association contribution of the key SNP loci carried by each F1 individual. The phenotypic data for the ten quality traits of the paternal and maternal parents were then extracted from the phenotypic matrix and averaged to obtain the parental average phenotypic value. The F1 generation genotype data were input into the constructed genome-wide selection model to obtain the original predicted breeding value of each F1 individual. The corrected breeding value of each F1 individual was then calculated using the corrected breeding value formula: $GBV_{cal} = GBV_{pred} \times AC_{ind} + 0.1 \times P_{par}$, where $GBV_{cal}$ is the corrected breeding value of the offspring individual for the quality trait, $GBV_{pred}$ is the original breeding value predicted for that trait by the genome-wide selection model, $AC_{ind}$ is the average association contribution of the key SNP loci carried by the offspring individual, and $P_{par}$ is the average phenotypic value of the backbone parents for the corresponding quality trait. Superior individual plants were selected by the high-breeding-value criterion of a corrected breeding value ≥ 8, and the selected superior F1 individuals were selfed to obtain the F2 population. The steps of genotyping, $GBV_{cal}$ calculation, superior plant screening, and selfing were repeated, and after three generations of pyramiding selection, a new high-oleic peanut variety with oleic acid content ≥80%, protein content ≥25%, and oil content ≥50% was obtained.
[0042] In summary, in the breeding scenario of high-oleic peanut varieties, peanut germplasm from different regions was first collected. After standardized planting, ten quality traits were measured to construct a phenotypic matrix. At the same time, DNA was extracted, resequencing was performed, and the data was processed to obtain SNP variation maps. Then, SNP sites were optimized and screened to obtain a feature marker set. A hybrid deep learning model was then constructed based on this set. After iterative optimization and cross-validation, a genome-wide selection model that meets the requirements was obtained. Finally, based on multi-dimensional data, backbone parents were selected to construct a breeding population. The genotypes of the offspring were obtained, and the corrected breeding values were calculated. After three generations of aggregate selection, a new high-oleic peanut variety was successfully bred.
[0043] Example 2:
[0044] Data Acquisition: In the breeding of new high-protein, high-oil peanut lines, 180 peanut germplasm resources from different regions were collected and planted in experimental fields according to standardized planting procedures. At germplasm maturity (48-53 days after flowering), three healthy plants were randomly selected to measure ten quality traits (protein, oil content, oleic acid, linoleic acid, oleic acid / linoleic acid ratio, stearic acid, palmitic acid, lysine, phenylalanine, and proline content). Measurements were taken 3-4 times per plant, and the average value was used to construct a phenotypic matrix. DNA was extracted from the tender leaves of the three plants. Whole-genome resequencing was performed on a sequencing platform to obtain short reads. The reads were then quality-controlled to remove reads containing adapter sequences, reads in which low-quality bases accounted for >5%, and duplicate reads. The quality-controlled reads were then aligned to the peanut T2T reference genome. SNPs were called using GATK software, and low-quality sites with quality values <30 or coverage depths <5 were filtered out to finally obtain the SNP variation map.
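Building the phenotypic matrix from replicated measurements can be sketched as follows. The text says only that the average is used, so the sketch assumes one reasonable reading: replicates are first averaged within each plant, and the three plant means are then averaged per accession. The accession name, the two traits shown, and all values are invented.

```python
from statistics import mean


def accession_trait_value(plant_measurements):
    """Average replicates within each plant, then average across plants."""
    return mean(mean(replicates) for replicates in plant_measurements)


def build_phenotype_matrix(raw, traits):
    """Return accession -> list of trait means, in the order given by traits."""
    return {
        accession: [accession_trait_value(per_trait[t]) for t in traits]
        for accession, per_trait in raw.items()
    }


traits = ["protein", "oil"]

# Hypothetical raw data: three plants per accession, 3-4 replicates per plant.
raw = {
    "acc1": {
        "protein": [[25.0, 26.0, 27.0], [24.0, 26.0, 25.0, 25.0], [27.0, 27.0, 27.0]],
        "oil": [[50.0, 52.0, 51.0], [49.0, 51.0, 50.0, 50.0], [52.0, 52.0, 52.0]],
    },
}
phenotype_matrix = build_phenotype_matrix(raw, traits)
```

Averaging within plants first keeps a plant with four replicates from outweighing one with three; pooling all replicates would be the alternative reading.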
[0045] SNP optimization and marker screening: Based on the SNP variation map, the missing rate (MR) and minor allele frequency (MAF) of each SNP locus were calculated using Plink software. The comprehensive quality index (SQI) of each SNP locus was then determined with the SNP comprehensive quality screening index formula: SQI = 0.6 × (1 - MR) + 0.4 × MAF. High-quality SNP loci with SQI ≥ 0.6 were retained.
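As a worked illustration of the screening index, the following Python sketch evaluates SQI = 0.6 × (1 - MR) + 0.4 × MAF and applies the SQI ≥ 0.6 cutoff. The function names and the example loci are hypothetical; in practice MR and MAF would come from the Plink output.

```python
from typing import Dict, List, Tuple

SQI_THRESHOLD = 0.6  # cutoff stated in the text


def sqi(missing_rate: float, maf: float) -> float:
    """Comprehensive quality index: SQI = 0.6 * (1 - MR) + 0.4 * MAF."""
    return 0.6 * (1.0 - missing_rate) + 0.4 * maf


def screen_loci(loci: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the IDs of loci whose SQI reaches the threshold.

    `loci` maps a locus ID to a (missing_rate, maf) pair.
    """
    return [name for name, (mr, maf) in loci.items() if sqi(mr, maf) >= SQI_THRESHOLD]


# Hypothetical values for illustration only.
example_loci = {
    "snpA": (0.05, 0.30),  # SQI = 0.57 + 0.12 = 0.69, kept
    "snpB": (0.20, 0.10),  # SQI = 0.48 + 0.04 = 0.52, dropped
    "snpC": (0.02, 0.45),  # SQI = 0.588 + 0.18 = 0.768, kept
}
high_quality = screen_loci(example_loci)
```

Because MAF cannot exceed 0.5, the index is bounded above by 0.8, and the missing rate carries most of the weight in the cutoff.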
[0046] Loci that still had missing genotypes after screening were imputed with Beagle software, yielding a complete SNP genotype data matrix. Admixture software was then used to analyze population structure; the optimal grouping was determined from the cross-validation error, and the population structure matrix was output. Plink software was used to calculate the linkage disequilibrium coefficients between the high-quality SNP loci. TASSEL software was used to perform genome-wide association analysis, calculating the p-value of the association between each SNP locus and the ten quality traits (with a focus on protein and oil content). The association contribution (AC) of each SNP locus to protein and oil content was then calculated with the SNP-trait association contribution formula, and loci with AC ≥ 5 were selected to form the feature marker set for breeding high-protein, high-oil peanuts.
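To make the linkage disequilibrium step concrete, the sketch below computes a pairwise r² as the squared Pearson correlation of allele counts. This is a common genotype-based estimate and is offered as an assumption about what the linkage disequilibrium coefficient represents here; the text does not specify the exact statistic. Genotypes are assumed to be coded as 0/1/2 allele counts with no missing values (that is, after imputation), and the function name `ld_r2` is hypothetical.

```python
from typing import Sequence


def ld_r2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("genotype vectors must have equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return (cov * cov) / (var_a * var_b)


# Hypothetical genotype vectors for six accessions.
snp1 = [0, 1, 2, 0, 1, 2]
snp2 = [2, 1, 0, 2, 1, 0]  # perfectly (inversely) correlated with snp1
r2_linked = ld_r2(snp1, snp2)
r2_unlinked = ld_r2([0, 0, 2, 2], [0, 2, 0, 2])
```

The sign of the correlation is discarded by squaring, so perfectly inverse genotypes still give r² = 1.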
[0047] Model construction: Using the feature marker set of key SNP loci as input and the phenotypes of the ten quality traits of the 180 germplasms (with a focus on protein and oil content) as output, a hybrid deep learning model comprising a VAE layer, a CNN layer, and a Transformer layer is constructed.
[0048] During the VAE layer training phase, the average association contribution of the feature marker set, i.e., the average AC value of the key SNP loci, is calculated. The noise compression loss of the VAE layer is then calculated with the noise compression loss formula, from the mean square error between the raw genotype data of the key SNP loci and the data reconstructed by the VAE together with this average association contribution, and is used to optimize the VAE layer parameters.
[0049] The output of the VAE layer is fed into a CNN layer (80 convolution kernels of size 3×3, stride 1) to extract local linkage features, and then into a Transformer layer (10 attention heads, hidden dimension 320) to capture genome-wide epistatic effects. The VAE latent dimension, the number of CNN kernels, and the number of Transformer attention heads are iteratively optimized. After each optimization, the model is evaluated using 10-fold cross-validation: the coefficient of determination of each round is calculated, and the average and standard deviation of these coefficients are analyzed. Finally, the prediction accuracy correction coefficient C_acc is calculated from the average coefficient of determination and its standard deviation using the prediction accuracy correction coefficient formula. When C_acc ≥ 0.8, a genome-wide selection model suitable for high-protein, high-oil peanuts is obtained, as shown in Figure 2.
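The cross-validation bookkeeping in this step can be sketched in plain Python as follows. The sketch only splits sample indices into ten folds, computes a coefficient of determination per fold, and summarizes the mean and standard deviation. It does not compute C_acc, because the correction formula is not reproduced in this text. The function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics
from typing import List, Sequence, Tuple


def kfold_indices(n_samples: int, k: int = 10) -> List[List[int]]:
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size.

    In practice the indices would usually be shuffled first.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot


def summarize_r2(fold_r2: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-fold R^2 values."""
    return statistics.mean(fold_r2), statistics.stdev(fold_r2)


folds = kfold_indices(180, k=10)  # 180 germplasms give 18 samples per fold
```

For each fold, the model would be trained on the other nine folds, and `r_squared` would be evaluated on the held-out fold before `summarize_r2` is called on the ten resulting values.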
[0050] New line breeding: Based on the phenotypic matrix of the 180 germplasms (focusing on protein and oil content), the SQI and AC values of the SNP loci, and genetic distance, backbone parents were selected: the genetic distance between the parents was greater than the average genetic distance, the male parent had a protein content ≥30%, and the female parent had an oil content ≥55%. The male and female parents were crossed to produce the F1 generation. DNA was extracted from young leaves of the F1 generation at maturity to obtain genotype data. The F1 genotype data were matched against the feature marker set of selected key SNP loci to determine the key SNPs carried by each F1 individual. The AC value of each carried locus was calculated using the SNP-trait association contribution formula, and these values were averaged to obtain the average association contribution AC_ind of the key SNP loci carried by each F1 individual. The phenotypic data of the ten quality traits for the male and female parents were extracted from the phenotypic matrix and averaged to obtain the parental mean P_par. The F1 genotype data were then input into the constructed genome-wide selection model to obtain the original predicted breeding value GBV_pred of each F1 individual. The corrected breeding value of each F1 individual was calculated with the offspring individual corrected breeding value formula: GBV_cal = GBV_pred × AC_ind + 0.1 × P_par. Individuals with GBV_cal ≥ 8 were selected as superior F1 plants and self-crossed to obtain the F2 population. The steps of offspring genotyping, GBV_cal calculation, screening of superior single plants, and self-pollination were then repeated, and after four generations of aggregate selection, a new high-quality peanut line with a protein content ≥28% and an oil content ≥54% was obtained.
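The corrected breeding value GBV_cal = GBV_pred × AC_ind + 0.1 × P_par and the GBV_cal ≥ 8 selection rule can be illustrated with a short Python sketch. The function names and the numerical values for the F1 individuals are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

GBV_THRESHOLD = 8.0  # selection cutoff stated in the text


def corrected_breeding_value(gbv_pred: float, ac_ind: float, p_par: float) -> float:
    """GBV_cal = GBV_pred * AC_ind + 0.1 * P_par."""
    return gbv_pred * ac_ind + 0.1 * p_par


def select_superior(individuals: Dict[str, Tuple[float, float]], p_par: float) -> List[str]:
    """Return the IDs of individuals whose corrected breeding value reaches the cutoff.

    `individuals` maps an ID to a (gbv_pred, ac_ind) pair; `p_par` is the
    parental mean phenotype shared by all offspring of the same cross.
    """
    return [
        name
        for name, (gbv_pred, ac_ind) in individuals.items()
        if corrected_breeding_value(gbv_pred, ac_ind, p_par) >= GBV_THRESHOLD
    ]


# Hypothetical F1 individuals: (predicted breeding value, mean AC of carried key SNPs).
f1 = {
    "F1-01": (1.2, 5.5),  # 1.2 * 5.5 + 0.1 * 30 = 9.6, selected
    "F1-02": (0.5, 5.0),  # 0.5 * 5.0 + 0.1 * 30 = 5.5, not selected
    "F1-03": (1.0, 6.0),  # 1.0 * 6.0 + 0.1 * 30 = 9.0, selected
}
superior = select_superior(f1, p_par=30.0)
```

The same function can be reapplied to each later generation after the offspring have been genotyped and passed through the prediction model.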
[0051] In summary, in the scenario of cultivating new peanut varieties with high protein and high oil content, 180 peanut germplasm accessions from different regions were collected and planted according to standardized planting procedures. A phenotypic matrix was constructed by measuring quality traits, and DNA was extracted and resequenced to obtain the SNP variation map. Subsequently, SNP loci were optimized and screened to form a feature marker set, and a hybrid deep learning model was constructed. After optimization and validation, a qualified genome-wide selection model was obtained. Then, suitable backbone parents were selected to construct a hybrid population, the genotypes of the offspring were obtained, and the corrected breeding values were calculated. Superior single plants were screened, and after four generations of aggregate selection, new peanut varieties with high protein and high oil content were cultivated.
[0052] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.
Claims
1. A peanut quality trait selection breeding method based on whole-genome SNPs, characterized in that the breeding method comprises the following steps:
Data acquisition: Multiple peanut germplasm samples were collected and planted. At germplasm maturity, three healthy plants were taken, ten quality traits were measured, and the average values were used to construct a phenotypic matrix. At the same time, DNA was extracted from the young leaves of the plants and whole-genome resequencing was performed. After processing, the sequencing data were aligned to the peanut T2T reference genome to screen SNP sites and obtain an SNP variation map.
SNP optimization and marker screening: Based on the SNP variation map, the missing rate and minor allele frequency of each SNP locus were calculated using Plink software. The comprehensive quality index of each SNP locus was then calculated using the formula SQI = 0.6 × (1 - MR) + 0.4 × MAF, where SQI is the comprehensive quality index of the SNP locus, MR is the missing rate of the SNP locus, and MAF is the minor allele frequency of the SNP locus. High-quality SNP loci with SQI ≥ 0.6 were screened. For loci with a small amount of missing data after screening, Beagle software was used for imputation, resulting in a complete SNP genotype data matrix. Subsequently, Admixture software was used for population structure analysis, the optimal population structure grouping was determined through cross-validation error, and the population structure matrix was output. The linkage disequilibrium coefficient was calculated using Plink software. Finally, TASSEL software was used to conduct genome-wide association analysis, calculating the p-value of the association between each SNP locus and the ten quality traits. The association contribution AC of each SNP locus to the target quality trait was calculated by the SNP-trait association contribution formula from the p-value of the association between the SNP locus and the ten quality traits and the linkage disequilibrium coefficient of the SNP locus, and SNP loci with AC ≥ 5 for the target quality trait were selected to form a feature marker set.
Model construction: Using the feature marker set as input and the corresponding ten quality traits of the germplasm as output, a hybrid deep learning model comprising a VAE layer, a CNN layer, and a Transformer layer is constructed. During the VAE layer training phase, the average association contribution of the feature marker set, i.e., the average AC of all key SNP loci, is introduced, and the noise compression loss of the VAE layer is calculated using the noise compression loss formula. After the VAE layer output is processed by the CNN layer to extract local linkage features and by the Transformer layer to capture genome-wide epistatic effects, the VAE latent dimension, the CNN convolution kernel parameters, and the number of Transformer attention heads are iteratively optimized. Simultaneously, the hybrid deep learning model is evaluated using ten-fold cross-validation to determine the coefficient of determination of each round; the average and the standard deviation of the coefficients of determination are analyzed, and the prediction accuracy correction coefficient is calculated using the prediction accuracy correction coefficient formula. When the prediction accuracy correction coefficient is ≥0.8, the genome-wide selection model is obtained. The noise compression loss formula gives the noise compression loss value in terms of the mean square error between the raw genotype data of the screened key SNP loci and the reconstructed output data of the VAE, and the average association contribution of the feature marker set. The prediction accuracy correction coefficient formula gives the prediction accuracy correction coefficient C_acc in terms of the average coefficient of determination and the standard deviation of the coefficients of determination.
New line development: Based on the phenotypic matrix, the SQI and AC indices, and germplasm genetic distance data, backbone parents were selected to construct a breeding population. The genotypes of the offspring individuals were obtained, and the original breeding values were obtained using the genome-wide selection model. The corrected breeding values of the offspring individuals were then calculated, and superior single plants were screened. Through multi-generation aggregate selection, high-quality new peanut lines were obtained. The corrected breeding value of an offspring individual was calculated using the formula GBV_cal = GBV_pred × AC_ind + 0.1 × P_par, where GBV_cal is the corrected breeding value of the offspring individual for the quality trait, GBV_pred is the original breeding value obtained by using the genome-wide selection model to predict the quality trait of the offspring individual, AC_ind is the average association contribution of the key SNP loci carried by the offspring individual, and P_par is the average phenotypic value of the backbone parents for the corresponding quality trait.
2. The peanut quality trait selection breeding method based on whole-genome SNPs according to claim 1, characterized in that, in the data acquisition step, the ten quality traits measured are protein, oil content, oleic acid, linoleic acid, oleic acid/linoleic acid ratio, stearic acid, palmitic acid, lysine, phenylalanine, and proline. Whole-genome resequencing was performed using a sequencing platform to obtain short-read sequence data. The short-read sequence data underwent quality control, which removed reads containing adapter sequences, reads in which low-quality bases accounted for more than 5%, and duplicate reads. The quality-controlled reads were then aligned to the peanut T2T reference genome. SNPs were detected using GATK software, and low-quality sites with a quality value <30 and a coverage depth <5 were filtered out to obtain the SNP variation map.
3. The peanut quality trait selection breeding method based on whole-genome SNPs according to claim 1, characterized in that, in the breeding of the new line, genetically complementary backbone parents were selected based on the phenotypic matrix, the SQI and AC indices, and germplasm genetic distance data. The selection criteria were that the genetic distance between the parents was greater than the population average genetic distance, and that each parent showed superior performance in different quality traits. A breeding population was constructed using the selected backbone parents. After the offspring individuals had grown, young leaves were collected to extract DNA and obtain the genotypes of the offspring individuals. The genotypes of the offspring individuals were matched with the feature marker set to determine the key SNP loci carried by each individual; the AC values of these loci were obtained using the SNP-trait association contribution formula and averaged to obtain the average association contribution of the key SNP loci carried by each offspring individual. Simultaneously, the corresponding quality phenotypic data of the backbone parents were extracted from the phenotypic matrix and averaged to obtain the average phenotypic value of the parents. The genotypes of the offspring individuals were input into the genome-wide selection model to obtain the original predicted breeding values, and the corrected breeding values of the offspring individuals were then calculated using the offspring individual corrected breeding value formula. Superior single plants were selected according to the standard of a corrected breeding value ≥8, and the superior single plants were self-crossed or backcrossed to stabilize the superior traits. The steps of obtaining offspring genotypes, calculating corrected breeding values, screening superior single plants, and self-crossing/backcrossing were repeated, and after 3-5 generations of aggregate selection, high-quality new peanut lines with increased oil content, protein, and oleic acid were obtained.
4. An application of the peanut quality trait selection breeding method based on whole-genome SNPs as described in any one of claims 1-3, characterized in that the application includes:
Data acquisition module: Collect multiple peanut germplasm samples for planting. At germplasm maturity, take 3 healthy plants, measure 10 quality traits, and take the average values to construct a phenotypic matrix. At the same time, extract DNA from the young leaves of the plants and perform whole-genome resequencing. After processing the sequencing data, align it to the peanut T2T reference genome, screen SNP sites, and obtain the SNP variation map.
SNP optimization and marker screening module: Based on the SNP variation map, calculate the missing rate and minor allele frequency of each SNP locus, screen high-quality SNP loci and impute missing data, analyze population structure, calculate linkage disequilibrium, and conduct genome-wide association analysis to screen key SNP loci and form a feature marker set.
Model building module: Using the feature marker set as input and the corresponding germplasm quality trait phenotypes as output, construct a hybrid deep learning model. During training, introduce association contribution-related indicators to adjust the model, optimize hyperparameters, and use cross-validation to screen and obtain a genome-wide selection model.
Platform development module: Use the Vue.js + Django framework to develop a peanut smart breeding platform that embeds the genome-wide selection model and integrates data retrieval, downloading, visualization, breeding value prediction, parent pairing, and custom training functions.
New line development module: Based on the phenotypic matrix, the SQI and AC indices, and germplasm genetic distance data, select backbone parents to construct a breeding population. Obtain the genotypes of the offspring individuals and upload them to the peanut smart breeding platform to obtain predicted breeding values. Calculate the corrected breeding values of the offspring individuals and select superior single plants. Through multi-generation aggregate selection, obtain high-quality new peanut lines.