A method, apparatus and medium for maize whole genome selection

By constructing an additive relation matrix A and a mixed linear model, combined with pedigree information and the rrBLUP ridge regression model, the problem of insufficient phenotypic data quality and evaluation indicators in existing whole-genome selection methods is solved, thereby improving the screening efficiency and accuracy of superior combinations in maize breeding.

CN122201422A · Pending · Publication Date: 2026-06-12 · MAIZE RES INST HEILONGJIANG ACAD OFAGRI SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: MAIZE RES INST HEILONGJIANG ACAD OFAGRI SCI
Filing Date: 2026-03-06
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing whole-genome selection methods in maize breeding suffer from insufficient phenotypic data quality and inaccurate model evaluation indicators, resulting in a low ability to identify top-performing combinations.

Method used

By constructing an additive relationship matrix A, combining pedigree information and a mixed linear model, phenotypic correction is performed. Then, using the rrBLUP ridge regression model and paired repeated K-fold cross-validation, the number of Top-N matches is dynamically determined, and the Top-N match number and RankSum index are calculated to optimize model evaluation.

Benefits of technology

It improves the efficiency and effectiveness of screening superior combinations in maize hybrid breeding, accurately identifies leading combinations, and enhances the predictive accuracy of the model and the relevance of evaluation indicators.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122201422A_ABST]

Abstract

The application discloses a maize whole-genome selection method, device and medium, and belongs to the field of agricultural molecular breeding and bioinformatics technology. The method obtains SNP genotypes, pedigrees and field phenotype data of hybrid combinations, constructs an additive relationship matrix A from the pedigree, and establishes the mixed linear equations of an animal model. After stripping the genetic redundancy effect, the sum of the additive genetic effect BLUP and the intercept is extracted as the corrected phenotype. The corrected phenotype is used as the response variable and the whole-genome SNPs as the independent variables. rrBLUP is used to establish a prediction model, and test set predictions are obtained under paired repeated K-fold cross-validation. The Top-N match count and the RankSum index are proposed for the breeding goal of selecting top-ranked combinations; they quantify the number of hits and the absolute ranking performance of the predicted top N combinations in the real field. On this basis the optimal model training strategy is selected and an optimized ranking of hybrid combinations is output. In this way, the selection efficiency and effectiveness for excellent combinations in maize hybrid breeding can be improved.

Description

Technical Field

[0001] This invention relates to the field of experimental equipment technology for environmental science and ecology, and in particular to a method, equipment and medium for whole genome selection of maize.

Background Technology

[0002] Genomic selection (GS) utilizes high-density molecular markers covering the entire genome to simultaneously estimate the effects of all markers, thereby predicting the breeding value of candidate individuals. It has been widely applied in breeding practices for major crops such as maize. However, existing GS methods still have the following three shortcomings in practical breeding applications:

[0003] First, evaluation metrics are disconnected from breeding decision-making objectives. Existing GS studies generally use the Pearson correlation coefficient as the evaluation metric for prediction accuracy. This metric reflects the linear fit of the model to the phenotypic values of the entire population. However, the core issue that breeders are truly concerned with in actual decision-making is whether the few elite combinations predicted by the model as optimal (i.e., the "top combinations") still perform well in real field trials. Two prediction models may have similar correlation coefficients across the entire population, yet differ significantly in their accuracy for the top combinations. Existing evaluation systems lack quantitative measures for the accuracy of the actual ranking of top samples, making it difficult to directly guide breeders' decisions on selecting elite materials. Although there are ranking evaluation metrics such as Precision@K in the field of information retrieval, their design goal is to judge the relevance of document retrieval, without considering the absolute deviation of the actual ranking position in the breeding scenario. That is, it is not only necessary to know "whether the predicted elite is indeed excellent," but also "where it ranks in the actual list."

[0004] Second, environmental noise in the phenotypic data is not adequately removed. In field trials, systematic spatial heterogeneity exists between plots due to factors such as soil fertility gradients, irrigation water distribution, and micro-topographical differences. Directly using the original field phenotypic mean as the response variable of the GS model means that the phenotypic signal is mixed with small-scale environmental variation, which weakens the capture of genetic signals and reduces the model's ability to identify true genetic potential.

[0005] Third, the known pedigree information of breeding populations is not fully utilized. In hybrid breeding, the pedigree data of parents is known and easily accessible, and the pedigree data completely records the genetic transmission pathways between individuals. By constructing an additive relationship matrix A using pedigree information, the additive genetic effects of individuals can be estimated more accurately within a mixed linear model framework, effectively separating genetic signals from environmental noise. Especially for cases with insufficient marker density or incomplete genotyping of some samples, pedigree information can provide additional genetic structure information, improving the accuracy of phenotypic correction.

[0006] In summary, there is an urgent need for a genome-wide selection method, equipment, and medium that can integrate pedigree information for phenotypic correction and use ranking indicators oriented towards the selection of top-ranked combinations for model evaluation, so as to improve the screening efficiency and effectiveness of superior combinations in maize hybrid breeding.

Summary of the Invention

[0007] This invention provides a method, device, and medium for whole-genome selection of maize, which addresses the shortcomings of existing whole-genome selection methods in terms of phenotypic data quality and model evaluation indicators, resulting in a low ability to identify the most critical superior combinations in breeding practice, and improves the screening efficiency and effectiveness of superior combinations in maize hybrid breeding.

[0008] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0009] This invention provides a method for whole-genome selection in maize, comprising:

[0010] S1: Obtain the whole genome SNP genotype matrix, field raw phenotypic data, and pedigree information including individual number, maternal parent number, and paternal parent number of the maize hybrid combination population;

[0011] S2: Construct an additive relationship matrix A based on pedigree information, establish a mixed linear model, estimate the variance components using the restricted maximum likelihood method, extract the additive genetic effect BLUP value and add it to the intercept estimate to obtain the corrected phenotype;

[0012] S3: Center the SNP genotype matrix using -1 / 0 / 1 encoding, use the corrected phenotype as the response variable and the SNP marker matrix as the independent variable, use the rrBLUP ridge regression model to solve for the marker effect vector and the intercept, and calculate the predicted values of the test set;

[0013] S4: Paired repeated K-fold cross-validation is used to train the model with the original phenotype and the corrected phenotype under the same data split and obtain the predicted values of the test set; the Top-N size is dynamically determined, and the Top-N match count (TopHits) and RankSum index are calculated;

[0014] S5: Perform paired statistical tests on the index results to verify the improvement effect of the corrected phenotype. Then, train the final rrBLUP model using the corrected phenotype of the whole sample and output the recommended list of maize hybrid combinations by sorting by GEBV.

[0015] Furthermore, in S2, the process of constructing the additive relationship matrix A includes: extracting all unique individuals in the pedigree and assigning them continuous integer codes, setting the parent numbers of the founding parents to 0, and recursively calculating the elements aij of matrix A using the pedigree matrix after integer coding, where aij represents the degree of additive genetic correlation between individual i and individual j.

[0016] Furthermore, the mixed linear model adopts an animal model form, wherein the model expression is $y = X\beta + Zu + \varepsilon$. Here u is the additive genetic effect, which follows $u \sim N(0, A\sigma_a^2)$. The genetic parameter of the model is specified as add_animal, an additive animal model. Pedigree information is provided as an encoded three-column matrix, and the additive genetic variance $\sigma_a^2$ and the residual variance $\sigma_e^2$ are estimated simultaneously using the restricted maximum likelihood method.

[0017] Furthermore, S2 also includes a correction step for spatial environmental effects, adding a spatial effect term s to the mixed linear model and fitting a two-dimensional autoregressive AR(1)×AR(1) correlation structure based on field row and column coordinates, thereby expanding the model to $y = X\beta + Zu + s + \varepsilon$. The corrected phenotype extracts only the sum of the BLUP value and the intercept, and the spatial effect term s is absorbed by the model and not included in the corrected phenotype; in S2, the corrected phenotype of individuals with a BLUP value of zero is set to the missing value NA to avoid the introduction of bias caused by the model convergence boundary or missing data.

[0018] Furthermore, in S3, the SNP genotype matrix is centered by shifting the original allele dose codes site by site to convert them into a centered numerical matrix with an expected mean of 0, so as to meet the linear assumptions and data centering requirements of the subsequent ridge regression model.

[0019] Furthermore, in S4, the paired repeated K-fold cross-validation consists of 50 random cycles, each with 5 folds; in each cycle, fold labels are first randomly generated, and within each fold, the original phenotype and the corrected phenotype share identical training and test set partitions; in S4, when the trait is of the "bigger is better" type, the phenotypic values and predicted values are ranked in descending order; when the trait is of the "smaller is better" type, they are ranked in ascending order.

[0020] Furthermore, in S5, the paired statistical test is a paired t-test, which examines whether the improvement of the corrected phenotype relative to the original phenotype reaches a statistically significant level for the three indicators Accuracy, TopHits, and RankSum.

[0021] Furthermore, this method can also be applied to other traits besides yield, including disease resistance indicators and growth period, wherein the growth period is ranked in ascending order, and the disease resistance indicator is ranked according to the scoring direction.

[0022] The present invention also provides a whole-genome selection device for maize, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method described in any of the above-mentioned embodiments.

[0023] The present invention also provides a computer-readable storage medium, characterized in that it stores a computer program thereon, which, when executed by a processor, implements the method described in any of the preceding claims.

[0024] Compared with the prior art, the technical solution disclosed in this invention has the following beneficial effects:

[0025] (1) By using the pedigree additive relationship matrix A, the genetic effects and environmental residuals in the mixed linear model are effectively separated. The corrected phenotype more realistically reflects the individual's additive genetic potential, providing higher quality response variables for the GS model and improving prediction accuracy.

[0026] (2) Two evaluation indicators, the Top-N match count (TopHits) and RankSum, are proposed for the breeding decision goal of selecting top-ranked combinations. TopHits directly measures how many of the top combinations are hit, while RankSum further measures where the predicted elites fall in the actual ranking. The two indicators complement each other to reflect the actual effectiveness of the model in screening elite materials, making up for the limitation of the traditional correlation coefficient, which only reflects the goodness of fit of the whole population.

[0027] (3) Paired repeated K-fold cross-validation design was adopted to compare different phenotypic strategies under strictly identical data partitioning, eliminating the interference of random partitioning differences on the results and providing a reliable statistical inference basis for method comparison.

[0028] (4) The number of Top-N is dynamically determined according to the size of the test set, which makes the evaluation index comparable between test sets of different sizes and enhances the universality of the method.

[0029] In summary, this invention addresses the shortcomings of existing whole-genome selection methods in terms of both phenotypic data quality and model evaluation indicators, which lead to a low ability to identify the most critical superior combinations in breeding practice. This invention improves the screening efficiency and effectiveness of superior combinations in maize hybrid breeding.

Attached Figure Description

[0030] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0031] Figure 1 is a schematic diagram of the whole-genome selection method for maize provided in an embodiment of the present invention;

[0032] Figure 2 is a schematic diagram illustrating the principle of the whole-genome selection method for maize provided in an embodiment of the present invention.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0034] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0035] This invention provides a method, device, and medium for whole-genome selection of maize, which can solve the problem that existing whole-genome selection methods are insufficient in terms of phenotypic data quality and model evaluation indicators, resulting in a low ability to identify the most critical superior combinations in breeding practice, and improve the screening efficiency and effect of superior combinations in maize hybrid breeding.

[0036] As shown in Figures 1 and 2, embodiments of the present invention provide a method for whole-genome selection of maize, comprising:

[0037] S1: Obtain the whole genome SNP genotype matrix, field raw phenotypic data, and pedigree information including individual ID, maternal parent ID, and paternal parent ID for the maize hybrid population; summarized as data acquisition and preprocessing, the specific steps are as follows:

[0038] Construct a maize hybrid combination experimental population. Obtain the following data from this population: (a) whole genome SNP genotype matrix, with rows representing samples and columns representing marker loci; (b) field raw phenotypic data, including observed values of target traits such as yield and disease resistance; (c) pedigree information, organized in a three-column format, namely individual number (trt), maternal parent number (mum), and paternal parent number (dad), with the parent numbers of the founding parents set to 0.

[0039] Perform data standardization on phenotypic data: force trait fields, except for the ID column, to be converted to numeric type; for potentially duplicate sample IDs, only retain the first occurrence record. Take the intersection of genotype and phenotype using the sample ID as the key and arrange them in a consistent order to avoid sample misalignment.

[0040] The genotype matrix is checked for its encoding: when the SNP encoding is 0 / 1 / 2, it is converted to -1 / 0 / 1 encoding by subtracting 1 from the entire matrix. Specifically, the rule is to check whether the maximum value of the genotype matrix is greater than 1; if so, 1 is subtracted from every element. The -1 / 0 / 1 encoding satisfies the requirement of the subsequent rrBLUP ridge regression model for marker matrix centering and improves numerical stability.
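The recoding rule in [0040] can be sketched in a few lines of Python. This is a minimal illustration only: the function name recode_genotypes is hypothetical, the matrix is a plain list of rows, and missing genotypes are assumed to be absent.

```python
def recode_genotypes(matrix):
    """Shift 0/1/2 allele-dose codes to -1/0/1 when the maximum code exceeds 1.

    Assumption for this sketch: no missing values in the matrix.
    """
    max_code = max(max(row) for row in matrix)
    if max_code > 1:
        # 0/1/2 coding detected: subtract 1 from every element
        return [[g - 1 for g in row] for row in matrix]
    # Already -1/0/1 (or 0/1 only): return an unchanged copy
    return [list(row) for row in matrix]


geno = [[0, 1, 2], [2, 2, 0]]
recoded = recode_genotypes(geno)
```

Note that a matrix containing only codes 0 and 1 would pass through unchanged under this rule, since the check relies solely on the maximum value.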

[0041] S2: Construct an additive relation matrix A based on pedigree information, establish a mixed linear model, estimate the variance components using the restricted maximum likelihood method, extract the additive genetic effect BLUP value, and add it to the intercept estimate to obtain the corrected phenotype; in summary: phenotypic correction based on the pedigree additive relation matrix, the specific steps are as follows:

[0042] (2.1) Pedigree encoding and A matrix construction

[0043] The specific steps for encoding individual IDs in the pedigree as integers are as follows:

[0044] (a) Extract all unique individual IDs that have appeared in the three columns of the pedigree (trt, dad, mum) (excluding missing parents with a code of 0) and merge them into a complete set of individuals;

[0045] (b) Assign consecutive integer numbers to each unique individual in sorted order, and establish a bidirectional mapping table recode from the original ID to the integer number;

[0046] (c) Replace the individual ID column in the three columns of pedigree data and the phenotypic data with the corresponding integer code, and uniformly code unknown parents as 0.
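Steps (a) to (c) can be sketched as follows. This is an illustrative Python version, not the patent's code: the function name encode_pedigree is hypothetical, rows are assumed to be (trt, mum, dad) tuples, and both the integer 0 and the string "0" are treated as the unknown-parent code.

```python
def encode_pedigree(pedigree):
    """Integer-code a three-column pedigree given as (trt, mum, dad) tuples.

    Returns the encoded pedigree plus forward and reverse mapping tables.
    """
    unknown = (0, "0")
    # (a) collect every unique ID appearing in any column, skipping unknown parents
    ids = sorted({x for row in pedigree for x in row if x not in unknown}, key=str)
    # (b) consecutive integers in sorted order, with a bidirectional mapping
    recode = {orig: i + 1 for i, orig in enumerate(ids)}
    decode = {i: orig for orig, i in recode.items()}
    # (c) replace IDs by their codes; unknown parents stay 0
    encoded = [tuple(recode.get(x, 0) for x in row) for row in pedigree]
    return encoded, recode, decode


ped = [("H1", "P1", "P2"), ("P1", 0, 0), ("P2", 0, 0)]
encoded, recode, decode = encode_pedigree(ped)
```

The decode table is what later allows model output keyed by integer codes to be mapped back to the original sample IDs.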

[0047] The encoded pedigree matrix is used to construct an additive relationship matrix A, where element $a_{ij}$ represents the degree of additive genetic relationship between individual i and individual j. The A matrix is calculated using a pedigree recursive algorithm.
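One widely used recursive scheme for building A is the tabular method, sketched below in Python. The source does not say which recursion it uses, so this is an assumption, as are the function name and the requirement that every parent is listed before its offspring.

```python
def additive_relationship(pedigree):
    """Tabular method for the additive relationship matrix A.

    pedigree: list of (ind, mum, dad), integer-coded, 0 = unknown parent.
    Assumption: parents appear earlier in the list than their offspring.
    """
    n = len(pedigree)
    pos = {ind: k for k, (ind, _, _) in enumerate(pedigree)}
    A = [[0.0] * n for _ in range(n)]
    for k, (ind, mum, dad) in enumerate(pedigree):
        m = pos.get(mum)  # None when the parent is unknown (code 0)
        d = pos.get(dad)
        # Diagonal: 1 + inbreeding coefficient = 1 + 0.5 * a(mum, dad)
        both_known = m is not None and d is not None
        A[k][k] = 1.0 + (0.5 * A[m][d] if both_known else 0.0)
        # Off-diagonal: average relationship of j with the two parents of k
        for j in range(k):
            a = 0.0
            if m is not None:
                a += 0.5 * A[j][m]
            if d is not None:
                a += 0.5 * A[j][d]
            A[k][j] = A[j][k] = a
    return A


# Two unrelated founders (1, 2) and two full-sib offspring (3, 4)
A = additive_relationship([(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 1, 2)])
```

In this example the founders are unrelated, each offspring has relationship 0.5 with each parent, and the two full sibs have relationship 0.5 with each other.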

[0048] (2.2) Construction and solution of mixed linear models

[0049] For each target trait, a mixed linear model in the form of an animal model is constructed as follows:

[0050] $y = X\beta + Zu + \varepsilon$

[0051] Where: y is an n×1 phenotypic observation vector; β is a fixed effect vector, which in this model only includes the population mean (intercept term), i.e., the formula object is constructed as y ~ 1; u is an additive genetic effect vector, which follows a multivariate normal distribution $u \sim N(0, A\sigma_a^2)$, its covariance matrix being jointly determined by the pedigree additive relationship matrix A and the additive genetic variance $\sigma_a^2$; ε is the residual vector, which follows $\varepsilon \sim N(0, I\sigma_e^2)$; X is the fixed effects incidence matrix, and Z is the random effects incidence matrix.

[0052] The model specifies the genetic parameter as an additive animal model (add_animal), and pedigree information is provided by an encoded three-column matrix pedn, with the individual ID column named trt. The variance components $\sigma_a^2$ and $\sigma_e^2$ are solved iteratively using the restricted maximum likelihood (REML) method. This yields both the fixed effects estimate and the best linear unbiased prediction (BLUP) of the random effects.

[0053] (2.3) Extraction of corrected phenotypes

[0054] Extract from model output:

[0055] (a) Estimated fixed effects intercept: $\hat{\mu}$;

[0056] (b) Additive genetic effect BLUPs $\hat{u}$: the predicted genetic effects for all individuals (including the original parents and offspring) are obtained from the model's random effect predictions.

[0057] Since the model output uses internal integer encoding, it needs to be restored to the original sample ID through the following steps:

[0058] (a) Identify all founding individuals in the pedigree using the founders.orig code;

[0059] (b) Extract the internal encoding map (map.codes) from the model object;

[0060] (c) Extract the BLUP value founders.PBVs corresponding to the founding parents through the mapping relationship;

[0061] (d) Use the recode mapping table to restore the integer encoding to the original sample ID.

[0062] For each sample, the corrected phenotype is calculated as follows:

[0063] $y_{\text{corr}} = \hat{\mu} + \hat{u}$

[0064] Where μ̂ is the intercept estimate and û is the additive genetic effect BLUP value for that individual. For individuals with a BLUP value exactly 0 (usually due to missing data or model convergence at the parameter space boundary), their corrected phenotype is set to NA (missing value) to avoid introducing systematic bias.
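The rule in [0064] can be sketched as below. The function name and the dictionary layout are hypothetical, and Python's None stands in for NA.

```python
def corrected_phenotypes(intercept, blups):
    """Corrected phenotype = intercept + BLUP; an exactly-zero BLUP becomes missing.

    blups: dict mapping sample ID -> additive genetic effect BLUP.
    None is used here as the stand-in for NA.
    """
    return {
        sid: (None if u == 0 else intercept + u)
        for sid, u in blups.items()
    }


result = corrected_phenotypes(10.0, {"H1": 0.5, "H2": 0.0, "H3": -1.25})
```

Here H2 is set to missing because its BLUP is exactly 0, while H1 and H3 receive 10.5 and 8.75 respectively.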

[0065] (2.4) Variance components and genetic parameter output

[0066] The model also outputs the following genetic parameters for quality assessment:

[0067] Additive genetic variance $\sigma_a^2$ and its standard error;

[0068] Residual variance $\sigma_e^2$ and its standard error;

[0069] Heritability $h^2 = \sigma_a^2 / (\sigma_a^2 + \sigma_e^2)$;

[0070] gamma ratio (the ratio of each variance component to the residual variance);

[0071] z.ratio (the ratio of the variance component to its standard error);

[0072] Constraint state judgment (when the absolute value of gamma is less than $10^{-6}$, mark it as Boundary; otherwise, mark it as Positive).
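The quality-assessment quantities listed above can be derived from the two variance components and their standard errors, as in this hedged Python sketch. The function name is hypothetical, and the heritability formula assumes the basic model with only additive and residual variance (no spatial term).

```python
def variance_summary(var_a, se_a, var_e, se_e, tol=1e-6):
    """Derive heritability, gamma, z.ratio and constraint state.

    Assumption: h2 = var_a / (var_a + var_e), i.e. the model has only
    an additive genetic and a residual variance component.
    """
    h2 = var_a / (var_a + var_e)
    rows = {}
    for name, var, se in (("additive", var_a, se_a), ("residual", var_e, se_e)):
        gamma = var / var_e  # ratio to the residual variance
        rows[name] = {
            "gamma": gamma,
            "z_ratio": var / se,  # component divided by its standard error
            "constraint": "Boundary" if abs(gamma) < tol else "Positive",
        }
    return h2, rows


h2, rows = variance_summary(var_a=3.0, se_a=1.0, var_e=1.0, se_e=0.5)
```

With these illustrative inputs the heritability is 0.75 and both components are marked Positive; an additive variance estimated at 0 would be marked Boundary.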

[0073] (2.5) Exception handling

[0074] Possible anomalies and handling strategies in actual calculations: When the model fails to converge, record the warning message and skip the trait, proceeding to the next trait; when the variance estimate is NA, it indicates that the model cannot identify the random effect, so skip the trait; all warnings and error messages are recorded in a unified text log file for subsequent diagnosis.

[0075] (2.6) Batch processing

[0076] The above process is executed sequentially on all target trait columns in the phenotypic file, merging the corrected BLUP value of each trait into the sample ID table as a new column. The final output is a data file containing the original IDs of all samples and the corrected phenotypic BLUP values for each trait.

[0077] (2.7) Incorporating spatial effects

[0078] Building upon pedigree correction, field spatial effects can be further incorporated into the model. When significant environmental gradients exist along the row and column directions in field experiments, a spatial effect term 's' is added to the mixed linear model, expanding the model to:

[0079] $y = X\beta + Zu + s + \varepsilon$

[0080] In this model, s is fitted using a two-dimensional autoregressive AR(1)×AR(1) correlation structure constructed based on field row and column coordinates (row and col). The spatial parameter is specified as the AR model, and the coordinates are provided by the row and col columns in the phenotypic data. At this time, the model simultaneously estimates pedigree genetic effects and spatial environmental effects. The corrected phenotype still only takes the sum of the intercept and the additive genetic effect BLUP. The spatial effect is absorbed by the model and is not included in the corrected phenotype, thus achieving dual separation of genetic signal and spatial noise.

[0081] In short, the process of constructing the additive relationship matrix A includes: extracting all unique individuals in the pedigree and assigning them continuous integer codes, setting the parent numbers of the founding parents to 0, and recursively calculating the elements aij of matrix A using the pedigree matrix after integer coding, where aij represents the degree of additive genetic correlation between individual i and individual j.

[0082] S2 further includes a correction step for spatial environmental effects, adding a spatial effect term s to the mixed linear model and fitting a two-dimensional autoregressive AR(1)×AR(1) correlation structure based on field row and column coordinates. The model is then extended to $y = X\beta + Zu + s + \varepsilon$. The corrected phenotype extracts only the sum of the BLUP value and the intercept, and the spatial effect term s is absorbed by the model and not included in the corrected phenotype; in S2, the corrected phenotype of individuals with a BLUP value of zero is set to the missing value NA to avoid the introduction of bias caused by the model convergence boundary or missing data.

[0083] S3: Center the SNP genotype matrix using -1 / 0 / 1 encoding. With the corrected phenotype as the response variable and the SNP marker matrix as the independent variable, use the rrBLUP ridge regression model to solve for the marker effect vector and the intercept, and calculate the predicted values for the test set. In summary, the rrBLUP genome-wide prediction model is constructed, and the specific steps are as follows:

[0084] The corrected phenotype obtained in S2, $y_{\text{corr}}$, is used as the response variable and the genome-wide SNP genotype matrix Z as the independent variable to establish a ridge regression BLUP (rrBLUP) prediction model:

[0085] $y_{\text{corr}} = \mathbf{1}\beta + Zg + e$

[0086] Where β is the intercept, g is the p×1 SNP marker effect vector, and e is the residual.

[0087] The mixed.solve function is used on the training set: given the training set phenotypic vector y_train and the training set genotype matrix Z_train, it outputs the marker effect estimate vector $\hat{g}$ (stored in the u component of the result object) and the intercept estimate $\hat{\beta}$ (stored in the beta component).

[0088] Calculate the predicted values on the test set: $\hat{y}_{\text{test}} = Z_{\text{test}}\hat{g} + \hat{\beta}$, that is, the matrix product of the test set genotype matrix and the marker effect vector plus the intercept.
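The prediction step in [0088] is a matrix-vector product plus the intercept. The Python sketch below shows only that step; it does not fit the model (the source does that with mixed.solve), and the function name and the numeric marker effects are invented for illustration.

```python
def predict(z_test, marker_effects, intercept):
    """Predicted value per test sample: intercept + sum_j z_ij * g_j."""
    return [
        intercept + sum(z * g for z, g in zip(row, marker_effects))
        for row in z_test
    ]


# Illustrative values only: two test samples, three markers coded -1/0/1
z_test = [[-1, 0, 1], [1, 1, 0]]
g_hat = [0.5, -0.25, 1.0]   # hypothetical marker effect estimates
beta_hat = 10.0             # hypothetical intercept estimate
y_pred = predict(z_test, g_hat, beta_hat)
```

For the first sample the marker contribution is -0.5 + 0 + 1.0 = 0.5, giving a prediction of 10.5; the second sample gives 10.25.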

[0089] In S3, the SNP genotype matrix is centered by shifting the original allele dose codes site by site to convert them into a centered numerical matrix with an expected mean of 0, so as to meet the linear assumptions and data centering requirements of the subsequent ridge regression model.

[0090] S4: Paired repeated K-fold cross-validation is used. The model is trained with both the original and corrected phenotypes under the same data split, and the predicted values on the test set are obtained. The Top-N size is dynamically determined, and the Top-N match count (TopHits) and RankSum metric are calculated. In summary, this step covers paired cross-validation and the calculation of the Top-N and RankSum metrics; the specific steps are as follows:

[0091] (4.1) Paired cross-validation design

[0092] To rigorously and fairly verify the effectiveness of the phenotypic correction strategy, paired repeated K-fold cross-validation was employed. In this embodiment, 50 random cycles (cycles = 50) were set, with each cycle consisting of 5 folds (n_folds = 5), resulting in a total of 250 training-test splits.

[0093] Key design points: In each iteration, fold labels are first generated by random sampling, and these labels remain fixed within the same iteration; in each fold, the original phenotype and the corrected phenotype share the exact same training set index and test set index, thereby achieving strict paired comparison—the two methods make predictions under exactly the same data partitioning conditions, and the observed differences only come from the phenotype correction itself, rather than the random fluctuations in data partitioning.
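The pairing idea can be sketched as a generator that produces each train/test split once, so that both phenotype variants are fitted on exactly the same indices. This is a Python illustration under assumptions: the function names and the seed parameter are hypothetical, and balanced fold labels are one reasonable way to realize "random fold labels".

```python
import random


def fold_labels(n_samples, n_folds, rng):
    """Balanced fold labels 0..n_folds-1 in random order."""
    labels = [i % n_folds for i in range(n_samples)]
    rng.shuffle(labels)
    return labels


def paired_splits(n_samples, cycles=50, n_folds=5, seed=0):
    """Yield (cycle, fold, train_idx, test_idx); labels are fixed within a cycle."""
    rng = random.Random(seed)
    for cycle in range(cycles):
        labels = fold_labels(n_samples, n_folds, rng)
        for fold in range(n_folds):
            test_idx = [i for i, f in enumerate(labels) if f == fold]
            train_idx = [i for i, f in enumerate(labels) if f != fold]
            yield cycle, fold, train_idx, test_idx
```

A caller would loop over paired_splits(n) and, inside each iteration, train one model on the raw phenotype and one on the corrected phenotype using the same train_idx, then evaluate both on the same test_idx. With the defaults this gives 50 × 5 = 250 paired splits.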

[0094] (4.2) Dynamic determination of the number of Top-N

[0095] In each fold, the number of Top-N samples is dynamically determined based on the actual number of samples in the current test set:

[0096] $K = \max\left(1, \lceil \text{n\_test} \times \text{top\_pct} \rceil\right)$

[0097] Where n_test is the number of samples in the current test set, and top_pct is the preset top proportion parameter (in this embodiment, the default is 0.10, i.e., 10%). The product is rounded up, and the larger of that value and 1 is taken, to ensure that at least one top sample is selected. This dynamic determination strategy is more reasonable than a fixed K value because the test set size may differ between folds.
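The Top-N size rule is a one-liner; the Python sketch below uses a hypothetical function name and mirrors the parameter names n_test and top_pct from the text.

```python
import math


def top_n_size(n_test, top_pct=0.10):
    """K = max(1, ceil(n_test * top_pct)): at least one top sample is always selected."""
    return max(1, math.ceil(n_test * top_pct))
```

For example, a test set of 47 samples gives ceil(4.7) = 5, while a test set of 5 samples gives ceil(0.5) = 1. Because the product is computed in floating point, sizes that are exact multiples of 1 / top_pct may be worth checking explicitly.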

[0098] (4.3) Calculation of evaluation indicators

[0099] In each fold, the following three indicators are calculated for both the original phenotypic model and the corrected phenotypic model:

[0100] Metric 1: Prediction Accuracy

[0101] Accuracy = cor(y_test, y_pred)

[0102] The Pearson correlation coefficient between the true phenotypic values and predicted values in the test set is used as a routine evaluation benchmark at the whole population level.

[0103] Indicator 2: Top-N match count (TopHits)

[0104] (a) Rank all samples in the test set by true phenotype and by predicted value, respectively. When the trait is of the "greater is better" type (e.g., yield, descending = TRUE), rank in descending order (the larger the value, the smaller the rank number, i.e., the better the rank); when the trait is of the "smaller is better" type (e.g., growth period or disease index, descending = FALSE), rank in ascending order.

[0105] (b) Extract the sample set P_predtopK of the top K predicted values and the sample set P_realtopK of the top K true phenotypes.

[0106] (c) Calculate the size of the intersection of the two sets:

[0107] $\text{TopHits} = \left| P_{\text{predtopK}} \cap P_{\text{realtopK}} \right|$

[0108] TopHits indicates how many of the K combinations predicted by the model to be optimal are actually among the K combinations that perform best in the actual field. The higher the TopHits, the stronger the model's ability to identify the top-performing combinations.
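TopHits can be sketched in Python as a set intersection of the two top-K lists. The function names and the dictionary layout (sample ID to value) are assumptions, and ties are broken by insertion order because Python's sort is stable.

```python
def top_k_ids(values, k, descending=True):
    """IDs of the k best samples; values maps sample ID -> number."""
    ordered = sorted(values, key=lambda s: values[s], reverse=descending)
    return ordered[:k]


def top_hits(y_true, y_pred, k, descending=True):
    """Size of the overlap between predicted top-k and true top-k."""
    pred_top = set(top_k_ids(y_pred, k, descending))
    real_top = set(top_k_ids(y_true, k, descending))
    return len(pred_top & real_top)


y_true = {"A": 9.1, "B": 8.7, "C": 8.2, "D": 7.9, "E": 7.0}
y_pred = {"A": 8.8, "B": 7.5, "C": 8.9, "D": 8.0, "E": 7.1}
hits = top_hits(y_true, y_pred, k=2)
```

In this toy example the true top 2 are A and B, the predicted top 2 are C and A, so one combination is hit.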

[0109] Indicator 3: Sum of True Ranks (RankSum) and Mean Rank (MeanRank)

[0110] For each sample i in P_predtopK, find its rank R_i in the true phenotypic ranking, and sum the true ranks of all predicted top samples:

[0111] $\text{RankSum} = \sum_{i \in P_{\text{predtopK}}} R_i$

[0112] The core significance of RankSum lies in the fact that TopHits only answers "how many were hit," while RankSum further answers "how far down the true ranking the predicted elites that were not hit actually fall." An ideal model should have its predicted top K combinations concentrated at the very top of the actual rankings, with RankSum approaching the theoretical minimum K(K+1) / 2. The smaller the RankSum, the better the model's predicted top combinations perform in the actual rankings.

[0113] Compared to Precision@K in the field of information retrieval, the core difference between TopHits and RankSum is that RankSum introduces an absolute measure of the true ranking position, which is a unique requirement of breeding decision-making. Breeders are not only concerned with "whether the selected combination is excellent" (whether it hits the target), but also with "how excellent it is" (what its ranking is), because in multi-year, multi-location validation trials with limited resources, the combination ranked 2nd and the combination ranked 50th have a fundamental difference in resource allocation priority.

[0114] (4.4) Results Collection

[0115] Each method generates one result record per fold, containing the method label (Raw or Corrected), cycle number, fold number, Accuracy, TopHits, and RankSum. All records are combined into a complete result data frame.

[0116] In step S4, the paired repeated K-fold cross-validation consists of 50 random cycles, each with 5 folds. In each cycle, fold labels are randomly generated, and within each fold, the original phenotype and the corrected phenotype share the same training and test sets. In step S4, when the trait is of the "greater is better" type, the phenotypic values and predicted values are ranked in descending order; when the trait is of the "smaller is better" type, they are ranked in ascending order.

[0117] S5: After performing paired statistical tests on the indicator results to verify the improvement effect of the corrected phenotype, the final rrBLUP model is trained using the full sample corrected phenotype. The optimal maize hybrid combination recommendation list is output by sorting by GEBV. In summary, the process involves statistical testing, model selection, and optimal combination recommendation. The specific steps are as follows:

[0118] (5.1) Paired statistical test

[0119] After the results of all cycles and folds are summarized, paired t-tests are performed between the original phenotype and the corrected phenotype on each of the three indicators:

[0120] (a) Accuracy paired test: t.test(Accuracy_Corrected, Accuracy_Raw, paired = TRUE)

[0121] (b) TopHits paired test: t.test(Hits_Corrected, Hits_Raw, paired = TRUE)

[0122] (c) RankSum paired test: t.test(RankSum_Corrected, RankSum_Raw, paired = TRUE)

[0123] The validity of the paired test stems from the pairing design in step 4.1: the two records (Raw and Corrected) in each fold share the same data partition, thus forming a natural pair.
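The paired test can be tried on synthetic data as sketched below. The numbers are simulated purely for illustration and are not results of the invention; only the `t.test(..., paired = TRUE)` call mirrors the text.

```r
# Synthetic fold-level TopHits for 250 records (50 cycles x 5 folds).
# All values are simulated for illustration; they are not experimental results.
set.seed(1)
n_records <- 250
hits_raw <- rpois(n_records, lambda = 6)
hits_corrected <- hits_raw + rbinom(n_records, size = 1, prob = 0.4)

# Record i of each vector must come from the same cycle and fold,
# otherwise the pairing assumption of the test is violated.
res <- t.test(hits_corrected, hits_raw, paired = TRUE)
print(res$parameter)   # degrees of freedom: number of pairs minus 1
print(res$p.value)
```

Because each pair shares one data partition, the test works on the per-fold differences, which removes the fold-to-fold variation that both methods have in common.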

[0124] (5.2) Model selection strategy

[0125] If statistical tests show that the corrected phenotype is significantly higher than the original phenotype in TopHits (p < 0.05) and significantly lower than the original phenotype in RankSum (p < 0.05), then the pedigree correction strategy is confirmed to be effective, and the corrected phenotype is used as the response variable in the final model training. If correction does not bring significant improvement, the original phenotype is used as a fallback. TopHits and RankSum are used as the primary decision indicators, and Accuracy is used as a secondary reference.
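The decision rule can be sketched as a small base R function. The function name `choose_phenotype`, the column names of the result data frame, and the simulated input are assumptions for this illustration.

```r
# Pick the phenotype source from fold-level results.
# Assumed columns: Method, Cycle, Fold, TopHits, RankSum (hypothetical names).
choose_phenotype <- function(results, alpha = 0.05) {
  raw <- results[results$Method == "Raw", ]
  corr <- results[results$Method == "Corrected", ]
  # Align the two methods so that row i refers to the same cycle and fold
  raw <- raw[order(raw$Cycle, raw$Fold), ]
  corr <- corr[order(corr$Cycle, corr$Fold), ]
  stopifnot(identical(raw$Cycle, corr$Cycle), identical(raw$Fold, corr$Fold))
  hits <- t.test(corr$TopHits, raw$TopHits, paired = TRUE)
  rsum <- t.test(corr$RankSum, raw$RankSum, paired = TRUE)
  better_hits <- hits$p.value < alpha && unname(hits$estimate) > 0
  better_rsum <- rsum$p.value < alpha && unname(rsum$estimate) < 0
  if (better_hits && better_rsum) "Corrected" else "Raw"
}

# Simulated results (illustration only): 10 cycles x 5 folds
set.seed(42)
grid <- expand.grid(Fold = 1:5, Cycle = 1:10)
raw_df <- data.frame(Method = "Raw", grid,
                     TopHits = rpois(50, 6), RankSum = rnorm(50, 400, 30))
corr_df <- data.frame(Method = "Corrected", grid,
                      TopHits = raw_df$TopHits + 2 + rbinom(50, 1, 0.5),
                      RankSum = raw_df$RankSum - rnorm(50, 40, 10))
results <- rbind(raw_df, corr_df)
print(choose_phenotype(results))
```

The sign checks matter because a two-sided p-value below 0.05 alone does not say whether the corrected phenotype is better or worse.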

[0126] (5.3) Final model training and combination recommendation

[0127] After determining the optimal phenotypic data source, the final rrBLUP model is trained using the corresponding phenotypic and genotypic data of all samples (without further dividing into training / test sets). The genomic estimated breeding value (GEBV) of all candidate hybrid combinations is calculated, and they are sorted according to the breeding target direction (yield in descending order, disease index in ascending order). A recommended ranking list of preferred hybrid combinations is then output.
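The final sorting step can be illustrated as follows. The helper name `rank_candidates` and the GEBV values are made up for this sketch.

```r
# Sort candidates by GEBV in the direction of the breeding target.
# `rank_candidates` is a hypothetical helper name; values are illustrative.
rank_candidates <- function(ids, gebv, descending = TRUE) {
  stopifnot(length(ids) == length(gebv))
  ord <- order(gebv, decreasing = descending)
  data.frame(Rank = seq_along(ord), ID = ids[ord], GEBV = gebv[ord])
}

ids <- c("H1", "H2", "H3", "H4")
gebv <- c(9.2, 10.1, 8.7, 9.9)
yield_list <- rank_candidates(ids, gebv, descending = TRUE)     # e.g. yield
disease_list <- rank_candidates(ids, gebv, descending = FALSE)  # e.g. disease index
print(yield_list)
```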

[0128] In S5, the paired statistical test is a paired t-test, which examines whether the improvement of the corrected phenotype relative to the original phenotype reaches a statistically significant level for the three indicators Accuracy, TopHits, and RankSum.

[0129] The above method can also be applied to other traits besides yield, including disease resistance indicators and growth period, wherein the growth period is ranked in ascending order and the disease resistance indicator is ranked according to the scoring direction.

[0130] Based on the same idea, this invention also provides a whole-genome selection device for maize, including a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the method described in any of the above-mentioned embodiments.

[0131] Based on the same idea, embodiments of the present invention also provide a computer-readable storage medium, characterized in that it stores a computer program thereon, which, when executed by a processor, implements the method described in any of the above-mentioned embodiments.

[0132] Specifically, Example 1: Genome-wide selection of maize hybrids for yield based on pedigree correction

[0133] (I) Data input and preprocessing

[0134] The experimental population in this embodiment was constructed by crossing 240 maize parent lines with 5 maize parent lines, resulting in a total of 848 maize hybrid combinations, which were planted in Heilongjiang Province.

[0135] The input data consists of three parts:

[0136] (a) Genotype matrix file (cx_M_full.rds), stored in R language RDS format, with row names as sample IDs and columns as SNP sites. In this embodiment, it contains tens of thousands of SNP markers;

[0137] (b) Field raw phenotypic file (BLUE.xlsx), containing multiple trait columns such as sample ID, maternal parent number (mum), paternal parent number (dad), and yield (yield);

[0138] (c) Pedigree information, extracted from the phenotypic file and organized into three columns: trt, dad, and mum.

[0139] Data preprocessing workflow: First, the trait columns in the phenotype file are forcibly converted to numeric type column by column (as.numeric(as.character(x))), and entries that cannot be converted are automatically set to NA; for duplicate sample IDs, only the first record is kept by using !duplicated(); the intersection of genotype and phenotype samples is obtained by using the sample ID as the key (intersect(rownames(Z_full), pheno_df$ID)), and both data sets are arranged in a consistent order (arrange(match(ID, common_ids))).
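The preprocessing workflow can be reproduced on toy data with base R alone. The sample IDs and values below are invented, and `match()` indexing is used here in place of the `arrange()` call mentioned above, so that the sketch needs no additional packages.

```r
# Toy stand-ins for the genotype matrix and phenotype table (made-up values)
Z_full <- matrix(c(0, 1, 2,
                   2, 2, 0,
                   1, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("H3", "H1", "H2"), c("snp1", "snp2", "snp3")))
pheno_df <- data.frame(ID = c("H1", "H2", "H2", "H4"),
                       yield = c("9.8", "n/a", "8.1", "7.7"),
                       stringsAsFactors = FALSE)

# Force the trait column to numeric; unparseable entries become NA
pheno_df$yield <- suppressWarnings(as.numeric(as.character(pheno_df$yield)))

# Keep only the first record of each duplicated sample ID
pheno_df <- pheno_df[!duplicated(pheno_df$ID), ]

# Intersect the IDs and put both objects in the same order
common_ids <- intersect(rownames(Z_full), pheno_df$ID)
pheno_df <- pheno_df[match(common_ids, pheno_df$ID), ]
Z <- Z_full[common_ids, , drop = FALSE]
print(pheno_df)
```

Note that the first H2 record is the unparseable one, so H2 ends up with NA even though a valid duplicate existed; keeping the first record is the rule stated in the text, and the toy data simply makes its consequence visible.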

[0140] Genotype coding check: Detect the maximum value of the genotype matrix. If max(Z_full) > 1, then execute Z_full = Z_full - 1 to complete the conversion from 0 / 1 / 2 to -1 / 0 / 1.
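The coding check amounts to a one-line conditional shift, sketched below. The function name `recode_genotypes` is an assumption for this example.

```r
# Shift 0/1/2 allele-dosage coding to -1/0/1 when the matrix is not yet shifted.
# `recode_genotypes` is a hypothetical helper name.
recode_genotypes <- function(Z) {
  if (max(Z, na.rm = TRUE) > 1) Z - 1 else Z
}

Z012 <- matrix(c(0, 1, 2,
                 2, 1, 0), nrow = 2, byrow = TRUE)
Z_centered <- recode_genotypes(Z012)
print(range(Z_centered))
```

One caveat of a max-based check: a 0 / 1 / 2 matrix that happens to contain no value of 2 would be left unshifted, so the check is only reliable when at least one homozygous alternative genotype is present.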

[0141] (II) Specific Implementation of Phenotypic Correction

[0142] 2.1 Model Fitting and Solution

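[0143] For each target trait, construct a mixed linear model equation that includes only the intercept as a fixed effect: $$y = \mathbf{1}\mu + Zu + \varepsilon, \qquad u \sim N(0, A\sigma_a^2), \qquad \varepsilon \sim N(0, I\sigma_e^2)$$ where μ is the intercept, u is the additive genetic effects vector, A is the additive relationship matrix, and $\sigma_a^2$ and $\sigma_e^2$ are the additive genetic variance and the residual variance.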

[0144] At the algorithm implementation level, the inverse of the individual additive relationship matrix A is used as prior information, and iterative calculations are performed using the restricted maximum likelihood method (REML) or an equivalent variance component estimation method. The algorithm module iteratively solves for the estimates of the additive genetic variance and the residual variance. During model solution, an anomaly detection mechanism is applied: if the variance components fail to converge or an estimate is missing, the system logs the event and automatically skips the abnormal trait, ensuring the continuity of the batch processing flow.

[0145] 2.2 BLUP Extraction and Corrected Phenotypic Calculation

[0146] After the model converges, the intercept estimate of the fixed effects and the additive genetic effect (BLUP) values of the random effects are extracted from the solution vector of the mixed linear equation system.
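For reference, with the intercept-only animal model written as y = 1μ + Zu + ε, the mixed linear equation system referred to above takes the standard Henderson form shown below. The notation (μ for the intercept, λ for the variance ratio) is chosen here for illustration; the variance components in λ are those estimated by REML.

```latex
\begin{bmatrix}
\mathbf{1}^{\top}\mathbf{1} & \mathbf{1}^{\top} Z \\
Z^{\top}\mathbf{1} & Z^{\top} Z + \lambda A^{-1}
\end{bmatrix}
\begin{bmatrix}
\hat{\mu} \\ \hat{u}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{1}^{\top} y \\ Z^{\top} y
\end{bmatrix},
\qquad
\lambda = \frac{\sigma_e^2}{\sigma_a^2}
```

The first element of the solution vector is the intercept estimate and the remaining elements are the BLUP values, one per individual in the pedigree.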

[0147] The integer codes used in the system's internal calculations are mapped back to the original sample IDs through a pre-stored hash mapping table. To avoid systematic errors, individuals whose BLUP value is exactly 0 (usually due to severe missing data or a convergence boundary) have their corrected phenotype set to the missing value NA. Finally, the intercept estimate is added to each individual's additive genetic effect BLUP value to obtain the corrected phenotype.
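The corrected-phenotype calculation can be sketched as below. The function name `corrected_phenotype` and the intercept and BLUP values are illustrative assumptions, not output of the fitted model.

```r
# Corrected phenotype = intercept + additive BLUP; exact-zero BLUPs become NA.
# `corrected_phenotype` is a hypothetical helper name; inputs are made up.
corrected_phenotype <- function(intercept, blup) {
  y_corr <- intercept + blup
  y_corr[which(blup == 0)] <- NA_real_
  y_corr
}

blup <- c(H1 = 0.42, H2 = 0, H3 = -0.31)
y_corr <- corrected_phenotype(intercept = 9.5, blup = blup)
print(y_corr)
```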

[0148] (III) rrBLUP whole-genome prediction and paired cross-validation

[0149] 3.1 Paired Cross-Validation Logic Design

[0150] To scientifically verify the effectiveness of the phenotypic correction strategy, a paired repeated K-fold cross-validation mechanism was adopted. In this embodiment, the number of cross-validation cycles was set to 50, with each cycle consisting of 5 folds (K=5).

[0151] The core partitioning logic is as follows: in each fold of cross-validation, 20% of the total data is allocated to the test set and the remaining 80% to the training set. Specifically, the outer loop randomly generates a fold label sequence containing the numbers 1 to 5; in the inner loop (fold k), samples with label k are assigned to the test set, and samples with any other label are assigned to the training set. This partitioning ensures that the model receives sufficient training samples, consistent with standard K-fold cross-validation practice.

[0152] Under the same data segmentation configuration, the original phenotypic vector and the corrected phenotypic vector were used as response variables and input into the genome-wide ridge regression prediction module (rrBLUP).
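The paired partitioning can be sketched in base R as follows, assuming the standard K-fold convention in which the samples labelled k are held out for testing in fold k. The helper name `make_fold_labels` and the seed are assumptions for this example.

```r
# Balanced random fold labels for one cross-validation cycle.
# `make_fold_labels` is a hypothetical helper name.
make_fold_labels <- function(n, k = 5) {
  sample(rep(seq_len(k), length.out = n))
}

set.seed(2026)
n <- 848          # number of hybrid combinations in Example 1
folds <- make_fold_labels(n, k = 5)

test_sizes <- integer(5)
for (fold in 1:5) {
  test_idx <- which(folds == fold)    # held-out fold, about 20% of samples
  train_idx <- which(folds != fold)   # remaining folds, about 80% of samples
  # Both the original and the corrected phenotype would be fitted on
  # train_idx and evaluated on test_idx, which is what makes the design paired.
  test_sizes[fold] <- length(test_idx)
}
print(test_sizes)
```

Generating the labels once per cycle and reusing them for both phenotype vectors is what allows the later paired t-test to treat the two records of a fold as a matched pair.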

[0153] 3.2 Model Training and Dynamic Calculation of Evaluation Indicators

[0154] On the training set, a mixed model algorithm is used to estimate the marker effect vector of the whole-genome SNPs and the intercept. The effect vector is then applied to the genotype matrix of the test set, and the predicted phenotypic values of the test set are obtained by matrix multiplication followed by addition of the intercept.
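The train-then-predict step can be illustrated with a closed-form ridge regression in base R. This is a stand-in, not the rrBLUP package itself: the penalty `lambda` is fixed by hand here, whereas rrBLUP estimates the variance components from the data. The function name `ridge_fit` and the simulated data are assumptions.

```r
# Ridge regression with an unpenalised intercept, solved in closed form.
# `ridge_fit` is a hypothetical helper; lambda is fixed for illustration.
ridge_fit <- function(Z, y, lambda) {
  X <- cbind(1, Z)                         # intercept column + markers
  P <- diag(c(0, rep(lambda, ncol(Z))))    # penalise marker effects only
  sol <- solve(crossprod(X) + P, crossprod(X, y))
  list(beta = sol[1], u = sol[-1])
}

# Simulated genotypes and phenotypes (illustration only)
set.seed(7)
n <- 50; p <- 20
Z <- matrix(sample(c(-1, 0, 1), n * p, replace = TRUE), nrow = n)
u_true <- rnorm(p, sd = 0.5)
y <- 10 + drop(Z %*% u_true) + rnorm(n)

train <- 1:40
test <- 41:50
fit <- ridge_fit(Z[train, ], y[train], lambda = 5)

# Test-set prediction: matrix multiplication, then add the intercept
y_hat <- fit$beta + drop(Z[test, ] %*% fit$u)
print(round(y_hat, 2))
```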

[0155] The size K of the top set is determined dynamically from the actual number of samples in the test set. The predicted values and the true phenotypic values are sorted separately, and the top K samples of each are extracted. The size of the intersection of the two sets gives the Top-N match number (TopHits). At the same time, the true rank of each predicted top sample is looked up and summed to give RankSum.

[0156] (IV) Statistical testing and visualization

[0157] Descriptive statistics: For all 250 fold-level results (50 cycles × 5 folds), grouped by Method, calculate the mean and standard deviation of Accuracy, TopHits, and RankSum.
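The grouped summary can be computed with `aggregate()` from base R, as sketched below. The result data frame is simulated and its column names are assumptions that follow the record fields described earlier.

```r
# Simulated fold-level results: 250 records per method (illustration only)
set.seed(3)
results <- data.frame(
  Method = rep(c("Raw", "Corrected"), each = 250),
  Accuracy = runif(500, 0.3, 0.7),
  TopHits = rpois(500, 6),
  RankSum = rnorm(500, 400, 30)
)

# Mean and standard deviation of each indicator, grouped by Method
summary_mean <- aggregate(cbind(Accuracy, TopHits, RankSum) ~ Method,
                          data = results, FUN = mean)
summary_sd <- aggregate(cbind(Accuracy, TopHits, RankSum) ~ Method,
                        data = results, FUN = sd)
print(summary_mean)
print(summary_sd)
```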

[0158] Paired t-test: Perform paired t-tests on the three indicators to test whether the improvement of the corrected phenotype relative to the original phenotype is statistically significant.

[0159] Visualization: Box plots are used to show the distribution differences between the original phenotypic data and the corrected phenotypic data on the three indicators RankSum (lower is better), Accuracy (higher is better), and TopHits (higher is better), with p-values marked in the plots.

[0160] Results output: Detailed results (indicator values for each method and each fold) and summary statistics are output as Excel files, and visualizations are saved in PNG format.

[0161] (V) Recommendation of the optimal combination

[0162] When the statistical test results confirm that the corrected phenotype is significantly better than the original phenotype in TopHits and RankSum, the final rrBLUP model is trained using the corrected phenotype and genotype data of all 848 samples, the GEBV of all hybrid combinations is calculated, and the recommended list of preferred combinations for yield traits is output in descending order.

[0163] Example 2: Phenotypic Correction Incorporating Both Pedigree and Spatial Effects

[0164] Based on Example 1, when the row and column coordinates (row, col) of each plot have been recorded in the field experiment and there is a significant spatial environmental gradient, pedigree genetic effects and spatial effects can be simultaneously incorporated into the mixed linear model:

[0165] mod <- remlf90(formuli, genetic = list(model = 'add_animal', pedigree = pedn, id = 'trt'), spatial = list(model = 'AR', coord = datn[, c('row', 'col')]), data = datn)

[0166] The `spatial` parameter specifies the AR model, with coordinates provided by the `row` and `col` columns. In this case, the model simultaneously estimates the additive genetic variance, the spatial variance, and the residual variance. The BLUP extraction and corrected phenotype calculation are completely consistent with those of Example 1. Spatial effects are absorbed by the model but are not included in the corrected phenotype.

[0167] This extended scheme can further improve the separation accuracy of genetic effects and environmental noise in experimental environments with significant spatial heterogeneity in the field.

[0168] Example 3: Comparison of other baseline correction methods (M0 model)

[0169] To verify the gain that pedigree information brings to phenotypic correction, this invention also includes a baseline model M0 without pedigree or spatial terms as a control:

[0170] mod <- remlf90(formuli, random = ~ trt, data = dat)

[0171] This model treats the individual effect trt as an independent and identically distributed random effect (without utilizing pedigree structure), so its covariance matrix is $I\sigma_u^2$ instead of $A\sigma_a^2$. By comparing the TopHits and RankSum performance of the M0 model and the pedigree model (Example 1) under the same cross-validation framework, the contribution of pedigree information to the ability to predict top combinations can be quantitatively evaluated.

[0172] The basic principles of the present invention have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in the present invention are merely examples and not limitations, and should not be considered as essential features of each embodiment of the present invention. Furthermore, the specific details disclosed above are for illustrative and facilitative purposes only, and are not limitations. These details do not limit the present invention to the necessity of employing the aforementioned specific details.

[0173] The block diagrams of components, apparatuses, devices, and systems involved in this invention are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to,” and are used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and / or,” and are used interchangeably with it unless the context clearly indicates otherwise. The term “such as” as used herein means “such as but not limited to,” and is used interchangeably with it.

[0174] It should also be noted that in the apparatus, device, and method of the present invention, the components or steps can be disassembled and / or recombined. These disassemblies and / or recombinations should be considered as equivalent solutions of the present invention.

[0175] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects without departing from the scope of the invention. Therefore, the invention is not intended to be limited to the aspects shown herein, but rather to be carried out within the widest scope consistent with the principles and novel features disclosed herein.

[0176] It should be understood that the qualifying terms "first", "second", "third", "fourth", "fifth" and "sixth" used in the description of the embodiments of the present invention are only used to more clearly illustrate the technical solutions and are not intended to limit the scope of protection of the present invention.

[0177] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the invention to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.

Claims

1. A method for whole-genome selection in maize, characterized in that, it includes: S1: Obtain the whole-genome SNP genotype matrix, raw field phenotypic data, and pedigree information including individual number, maternal parent number, and paternal parent number of the maize hybrid combination population; S2: Construct an additive relationship matrix A based on the pedigree information, establish a mixed linear model, estimate the variance components using the restricted maximum likelihood method, extract the additive genetic effect BLUP value and add it to the intercept estimate to obtain the corrected phenotype; S3: Recode the SNP genotype matrix to the centered -1 / 0 / 1 form, use the corrected phenotype as the response variable and the SNP marker matrix as the independent variable, use the rrBLUP ridge regression model to solve for the marker effect vector and the intercept, and calculate the predicted values of the test set; S4: Use paired repeated K-fold cross-validation to train the model with the original phenotype and the corrected phenotype under the same data split and obtain the predicted values of the test set; dynamically determine the Top-N size and calculate the Top-N match number and the RankSum index; S5: Perform paired statistical tests on the indicator results to verify the improvement effect of the corrected phenotype, then train the final rrBLUP model using the corrected phenotype of the full sample and output the recommended list of maize hybrid combinations sorted by GEBV.

2. The method for whole-genome selection of maize according to claim 1, characterized in that, In S2, the process of constructing the additive relationship matrix A includes: extracting all unique individuals in the pedigree and assigning them continuous integer codes, setting the parent numbers of the founding parents to 0, and recursively calculating the elements aij of matrix A using the pedigree matrix after integer coding, where aij represents the degree of additive genetic correlation between individual i and individual j.

3. The method for whole-genome selection of maize according to claim 1, characterized in that, The mixed linear model adopts the animal model form y = Xβ + Zu + ε, where y is the observation vector, X is the design matrix of the fixed effects, β is the fixed-effects vector, Z is the design matrix of the additive genetic effects, u is the additive genetic effects vector, and ε is the residual effects vector, wherein u follows the distribution $u \sim N(0, A\sigma_a^2)$, where A is the additive relationship matrix and $\sigma_a^2$ is the additive genetic variance; the genetic parameter is specified as the additive animal model `add_animal`, pedigree information is provided as an encoded three-column matrix, and the additive genetic variance $\sigma_a^2$ and the residual variance $\sigma_e^2$ are estimated simultaneously using restricted maximum likelihood estimation.

4. The method for whole-genome selection of maize according to claim 1, characterized in that, S2 further includes a correction step for spatial environmental effects, adding a spatial effect term s to the mixed linear model, and fitting a two-dimensional autoregressive AR(1)×AR(1) correlation structure based on field row and column coordinates. The model is then extended to y = Xβ + Zu + s + ε. The corrected phenotype consists only of the sum of the additive genetic effect BLUP value and the intercept, and the spatial effect term s is absorbed by the model and not included in the corrected phenotype; in S2, the corrected phenotype of individuals with a BLUP value of zero is set to the missing value NA to avoid bias caused by a model convergence boundary or missing data.

5. The method for whole-genome selection of maize according to claim 1, characterized in that, In S3, the SNP genotype matrix is centered by shifting the original allele dosage code site by site to convert it into a centered numerical matrix with an expected mean of 0, so as to meet the linear assumptions and data centering requirements of the subsequent ridge regression model.

6. The method for whole-genome selection of maize according to claim 1, characterized in that, In step S4, the paired repeated K-fold cross-validation consists of 50 random cycles, each with 5 folds. In each cycle, fold labels are randomly generated, and within each fold, the original phenotype and the corrected phenotype share the same training and test sets. In step S4, when the trait is of the "greater is better" type, the phenotypic values and predicted values are ranked in descending order; when the trait is of the "smaller is better" type, they are ranked in ascending order.

7. The method for whole-genome selection of maize according to claim 1, characterized in that, In S5, the paired statistical test is a paired t-test, which examines whether the improvement of the corrected phenotype relative to the original phenotype reaches a statistically significant level for the three indicators Accuracy, TopHits, and RankSum.

8. The method for whole-genome selection of maize according to any one of claims 1 to 7, characterized in that, It can also be applied to traits other than yield, including disease resistance indicators and growth period, wherein the growth period is ranked in ascending order and the disease resistance indicator is ranked according to the scoring direction.

9. A maize whole-genome selection device, comprising a memory and a processor, characterized in that, The memory stores a computer program, and when the processor executes the computer program, it implements the method described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that, It stores a computer program thereon, which, when executed by a processor, implements the method described in any one of claims 1 to 7.