Method for predicting heterosis in rice
By integrating genomic, transcriptomic, phenomic, and environmental omics data, and combining multi-model training and ensemble learning, the problems of time consumption and insufficient accuracy in predicting heterosis in rice have been solved, enabling a rapid and accurate breeding process.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIANGSU LIXIAHE REGION AGRI RES INST
- Filing Date
- 2026-01-19
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies for predicting heterosis in rice suffer from problems such as time consumption, blindness, and insufficient prediction accuracy. In particular, the genomic selection method fails to effectively utilize multi-omics information and is prone to model overfitting or underfitting, affecting the breeding process and accuracy.
A multi-omics feature matrix is constructed, multiple base learner models are trained on it, and their predictions are fused by an ensemble learning strategy such as stacking. Combined with cross-validation and hyperparameter optimization, this allows the heterosis of rice to be predicted.
The method improves prediction accuracy, shortens the breeding cycle, and reduces breeding costs, enabling rapid and accurate prediction of heterosis that is suitable for large-scale breeding.
Figure CN122201406A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of gene breeding, and in particular to a method for predicting heterosis in rice.
Background Technology
[0002] Rice is a staple food worldwide, and the effective utilization of heterosis is crucial for increasing yield. Traditionally, superior hybrid combinations have been discovered through breeders' experience and extensive field cross-breeding and screening, which is time-consuming and often inaccurate. While predicting heterosis through linkage markers based on molecular markers (AFLP, SSR, SNP) can provide genetic information, the accuracy of predictions is often insufficient to meet the needs of rapid breeding due to limitations in marker density and trait correlation.
[0003] Genomic selection (GS) transforms the breeding problem into a prediction problem based on whole-genome markers, enabling the estimation of population breeding values without large-scale phenotypic testing, significantly accelerating the breeding process. However, directly applying existing GS methods to heterosis prediction still has shortcomings: First, most methods only utilize genotype information, making it difficult to cover the interactions between multiple omics information such as transcriptome and phenotypic information; second, GWAS results or WGS pre-screening sites are often used in preprocessing, and this kind of "prior screening" may miss important minor or interacting sites, affecting downstream predictions; third, when using machine learning, default parameters are often used, making the model prone to overfitting or underfitting, limiting generalization ability.
[0004] With the advent of the "smart breeding" era, integrating multi-omics data and developing more accurate prediction algorithms has become an inevitable trend. Currently, there is no high-precision heterosis prediction method in this field that comprehensively utilizes multi-omics information and employs advanced machine learning and ensemble learning strategies to provide effective technical support for rice breeding practices.
Summary of the Invention
[0005] Purpose of the invention: The purpose of this invention is to provide a method for accurate prediction of heterosis in rice based on multi-omics data and machine learning.
[0006] Technical solution: The method for predicting heterosis in rice according to the present invention includes the following steps:
S1: obtain multi-omics data of parental rice and hybrid F1 generation rice, and construct a multi-omics feature matrix;
S2: define heterosis labels based on the actual phenotypic values of the parental rice and the hybrid F1 generation rice;
S3: using the multi-omics feature matrix obtained in step S1 as input and the heterosis labels obtained in step S2 as the prediction target, train multiple different base learner models;
S4: using the multi-omics feature matrix obtained in step S1 as input, obtain predicted values for the hybrid offspring of the parental rice from the base learner models obtained in step S3, and fuse them using an ensemble learning strategy to obtain the heterosis prediction result.
Among the hybrid F1 generation rice, each parental rice appears at least 6 times, and the multi-omics data includes multiple types of data, such as genomics data, transcriptomics data, environmental omics data, and phenomics data.
[0007] Preferably, the genomic data includes one or more of single nucleotide polymorphism (SNP) markers, insertion / deletion markers, and structural variation information; the transcriptomics data includes gene expression abundance information; the environmental omics data includes measurements of one or more of the following during the entire rice growth period: daily maximum temperature, daily minimum temperature, effective accumulated temperature, precipitation, soil moisture content, sunshine duration, photosynthetically active radiation, soil pH, soil salinity, soil organic matter content, and soil nitrogen, phosphorus, and potassium content; and the phenomics data includes measurements of one or more of the following: plant height, heading date, number of effective panicles, number of grains per panicle, thousand-grain weight, seed setting rate, and disease resistance.
[0008] Preferably, the step of constructing the multi-omics feature matrix in step S1 includes: preprocessing the multi-omics data of parental rice and hybrid F1 generation rice to extract features, calculating the hybrid combination omics features, and summarizing to construct the multi-omics feature matrix; The preprocessing methods include one or more of missing value imputation, outlier correction, and data standardization; the feature extraction methods include one or more of variance filtering, correlation coefficient-based filtering, mutual information method, and model-based feature importance ranking; and the methods for calculating hybrid combination omics features include one or more of calculating the mean, calculating the difference, and estimation based on a genetic model.
[0009] Preferably, the step of ranking the feature importance based on the model includes: using parental rice phenotypic data as the response variable and genomic and transcriptomic data as features, training a random forest model, and screening and extracting the variant sites and genes with the top 5-15% of feature importance.
[0010] Preferably, the genetic model in the genetic model-based estimation method is an additive model or a dominance model.
[0011] Preferably, the actual phenotypic measurement values in step S2 include one or more of the following: plant height, heading date, number of effective panicles, number of grains per panicle, thousand-grain weight, seed setting rate, and disease resistance; the heterosis label is mid-parent heterosis or super-parent heterosis.
[0012] Preferably, the multiple different base learner models mentioned in step S3 include at least two of the following: linear regression model, ridge regression model, support vector machine regression model, decision tree regression model, random forest regression model, gradient boosting decision tree regression model, and neural network model.
[0013] Preferably, when training the base learner model in step S3, the method further includes: using a cross-validation strategy to optimize the hyperparameters of the model.
[0014] Preferably, for linear regression models, ridge regression models, support vector machine regression models, decision tree regression models, random forest regression models, and gradient boosting decision tree regression models, the hyperparameter optimization method is any one of grid search, random search, Bayesian optimization, and evolutionary algorithm; for neural network models, the hyperparameter optimization method is gradient-based optimization.
[0015] Preferably, the fusion method of the ensemble learning strategy in step S4 is any one of simple averaging, weighted averaging, and stacking.
[0016] Preferably, when using the stacking method as an ensemble learning strategy, the steps include: taking the multi-omics feature matrix obtained in step S1 as input, and based on the base learner model obtained in step S3, obtaining the predicted value of hybrid F1 generation rice and establishing the predicted value feature matrix; taking the predicted value feature matrix as input and the heterosis label obtained in step S2 as the prediction target, training to obtain a meta-learner model; taking the multi-omics feature matrix obtained in step S1 as input, and based on the meta-learner model, obtaining the heterosis prediction result; The meta-learner model is either a linear model or a nonlinear model.
[0017] Preferably, the nonlinear model includes a decision tree regression model, a K-nearest neighbor regression model, and a single hidden layer neural network.
[0018] Beneficial effects: Compared with existing technologies, this invention has the following significant advantages: 1. By integrating multi-omics data, it fully utilizes genomic, transcriptomic, phenomic, and environmental omics data, improving the comprehensiveness of feature representation; 2. By adopting a multi-model training and ensemble learning strategy, it avoids the limitations of a single model and enhances the model's generalization ability and prediction accuracy; 3. Through cross-validation and hyperparameter optimization, it ensures the robustness and reliability of the model, making it suitable for large-scale breeding data; 4. This method can quickly and accurately predict heterosis in rice, significantly shortening the breeding cycle and reducing breeding costs, and has excellent industrial application value.
Attached Figure Description
[0019] Figure 1 is a flowchart of the method for predicting heterosis in rice; Figure 2 and Figure 3 are graphs showing the correlation statistics between the predicted values and the actual values of the test set.
Detailed Implementation
[0020] The technical solution of the present invention will be further described below.
[0021] Example 1:
[0022] S1. Acquisition and preprocessing of multi-omics data, and construction of the multi-omics feature matrix. In this embodiment, 300 rice backbone parent materials (including indica rice and japonica rice) and 1200 hybrid combinations prepared from them were selected as the research population. The 1200 combinations covered all 300 parents, and each parent appeared in at least 6 combinations.
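The coverage requirement above (every parent appears in at least 6 combinations) can be checked before any modelling. The following is a minimal sketch; the function names and the toy cross list are illustrative assumptions, not part of the patent.

```python
from collections import Counter


def parent_counts(crosses):
    """Count how many hybrid combinations each parent takes part in."""
    counts = Counter()
    for female, male in crosses:
        counts[female] += 1
        counts[male] += 1
    return counts


def underrepresented(crosses, parents, min_count=6):
    """Return the parents that appear in fewer than min_count combinations."""
    counts = parent_counts(crosses)
    return sorted(p for p in parents if counts[p] < min_count)


# Toy example (hypothetical parent names); a threshold of 2 keeps it small.
parents = ["P1", "P2", "P3", "P4"]
crosses = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
missing = underrepresented(crosses, parents, min_count=2)
```

With the embodiment's design, the same check with `min_count=6` over the 1200 combinations would be expected to return an empty list.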
[0023] 1. Data Acquisition: Genomic data: Based on third-generation resequencing data of rice backbone parent materials, 15,000 high-quality structural variation markers were obtained by aligning the sequences to the Nipponbare reference genome (IRGSP-1.0) and performing SV calling using the Minigraph-Cactus structural variation detection pipeline. These structural variation (SV) data are stored in the form of a genotype matrix.
[0024] Transcriptomics data: Leaf tissues of the rice backbone parents were collected at the peak tillering stage, and transcriptome sequencing was performed using RNA-seq technology. HISAT2 software was used to align the sequencing reads to the reference genome, and gene expression levels were calculated using StringTie software, expressed as FPKM values. Expression abundance data for 40,000 genes were ultimately obtained.
[0025] Environmental omics data: Temperature data were collected for the 15 days before the heading stage of the rice. Specifically, the daily maximum temperature and the nighttime maximum temperature of the planting area were recorded each day.
[0026] Phenomic data: The F1 generation of the 1200 hybrid combinations and the rice backbone parent materials were evaluated in field trials using a randomized block design with three replicates. Yield-related traits of the parents and the F1 offspring were measured, including yield per plant, number of effective panicles, number of grains per panicle, and thousand-grain weight. Each phenotypic value was the average of the three replicates.
[0027] 2. Data preprocessing includes: Missing value imputation: For genomic structural variation data, sites with a missing rate higher than 20% are removed, and the remaining missing sites are imputed based on the linkage disequilibrium principle; for transcriptomics data, missing values are imputed using the k-nearest neighbor algorithm (k=10); for environmental omics data, missing values are checked and imputed to ensure that a complete time series is formed, and the imputation method is time series interpolation.
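The imputation steps above can be sketched as follows, assuming NumPy, pandas, and scikit-learn are available. The toy matrices are made up, linear interpolation is only one possible choice of time series interpolation, and the linkage-disequilibrium-based imputation of the remaining SV gaps is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy SV genotype matrix (rows = parents, columns = SV sites); NaN = missing call.
sv = np.array([
    [0.0, 1.0, np.nan],
    [1.0, 2.0, np.nan],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, np.nan],
    [1.0, 2.0, 1.0],
])
missing_rate = np.isnan(sv).mean(axis=0)
sv_kept = sv[:, missing_rate <= 0.20]  # drop sites with a missing rate above 20%

# Toy expression matrix (rows = parents, columns = genes) with two gaps,
# filled by k-nearest-neighbour imputation with k = 10 as in the embodiment.
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=5.0, size=(30, 8))
expr[3, 2] = np.nan
expr[10, 5] = np.nan
expr_imputed = KNNImputer(n_neighbors=10).fit_transform(expr)

# Toy daily maximum temperature series with gaps, filled by linear interpolation
# (an assumed choice; the text only says "time series interpolation").
temps = pd.Series([31.0, np.nan, 33.0, 34.0, np.nan, np.nan, 31.0])
temps_filled = temps.interpolate(method="linear", limit_direction="both")
```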
[0028] Outlier correction: For phenomics data, the isolation forest algorithm is used to identify outliers, which are replaced with the mean value of the trait; for environmental omics data, thresholding or statistical methods are used to identify outlier records, which are replaced by interpolation.
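A minimal sketch of the phenomics outlier step, using scikit-learn's `IsolationForest`. The contamination rate, the helper name, and the decision to compute the replacement mean from the non-outlier values only are assumptions made for illustration; the text specifies only that outliers are replaced with the trait mean.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def replace_outliers_with_mean(values, contamination=0.05, random_state=0):
    """Flag outliers in one trait with an isolation forest and replace them
    with the mean of the remaining values (assumed reading of 'trait mean')."""
    values = np.asarray(values, dtype=float)
    forest = IsolationForest(contamination=contamination, random_state=random_state)
    flags = forest.fit_predict(values.reshape(-1, 1))  # -1 marks an outlier
    is_outlier = flags == -1
    cleaned = values.copy()
    cleaned[is_outlier] = values[~is_outlier].mean()
    return cleaned, is_outlier


# Toy yield-per-plant values with one obviously wrong record appended.
rng = np.random.default_rng(1)
yields = np.append(rng.normal(loc=25.0, scale=2.0, size=39), 250.0)
cleaned, is_outlier = replace_outliers_with_mean(yields)
```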
[0029] Data standardization: For genomic data, different genotypes are replaced with different numbers and min-max normalization is performed to scale the values to the [0,1] range; for transcriptomics data, log2(FPKM+1) transformation is performed to make its distribution closer to normal, and then Z-score standardization is performed; for phenomics data, Z-score standardization is performed directly.
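The three standardization rules can be written compactly with NumPy. This is an illustrative sketch: the 0/1/2 genotype coding and the toy values are assumptions, since the text only says that genotypes are replaced with numbers.

```python
import numpy as np


def min_max_scale(x):
    """Scale each column to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    low = x.min(axis=0)
    span = x.max(axis=0) - low
    span = np.where(span == 0, 1.0, span)  # constant columns stay at 0
    return (x - low) / span


def zscore(x):
    """Standardize each column to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    sd = x.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    return (x - x.mean(axis=0)) / sd


# Toy data: genotypes coded 0/1/2 (assumed coding) and FPKM values.
genotypes = np.array([[0, 2], [1, 0], [2, 1]])
fpkm = np.array([[0.0, 10.0], [3.0, 40.0], [15.0, 90.0]])

geno_scaled = min_max_scale(genotypes)
expr_scaled = zscore(np.log2(fpkm + 1.0))  # log2(FPKM + 1), then Z-score
```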
[0030] 3. Feature extraction and multi-omics feature matrix construction: Feature selection: To reduce dimensionality and noise, a model-based feature selection method was adopted. Using phenotypic data of rice backbone parent materials as the response variable and genomic and transcriptomic data as features, a random forest model was trained, and the top 10% of SV loci and genes (1500 SVs and 4000 genes) in terms of feature importance were selected for subsequent analysis.
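The model-based selection described above might look like the following sketch with scikit-learn. The helper name, the number of trees, and the synthetic data are assumptions; in the embodiment the same ranking would be applied to the SV block and the gene expression block so that 10% of each is kept.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def top_fraction_by_importance(X, y, fraction=0.10, random_state=0):
    """Train a random forest and return the column indices of the
    top `fraction` of features ranked by feature importance."""
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(X, y)
    n_keep = max(1, int(round(fraction * X.shape[1])))
    order = np.argsort(model.feature_importances_)[::-1]
    return np.sort(order[:n_keep])


# Synthetic example: 50 features, only feature 7 drives the phenotype.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = 3.0 * X[:, 7] + rng.normal(scale=0.1, size=120)
selected = top_fraction_by_importance(X, y)
X_selected = X[:, selected]
```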
[0031] For environmental omics data, statistical features, temporal features, and threshold features, including mean, accumulated temperature, and number of high-temperature days, were calculated from 15-day daily and nighttime maximum temperature sequences. Finally, a "sample × feature" matrix was constructed, with each row representing a rice sample and each column corresponding to an extracted temperature feature.
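As an illustration of the temperature features named above, the sketch below derives a mean, an accumulated temperature, and a count of high-temperature days from one sample's series. The base temperature (10 °C), the heat threshold (35 °C), and accumulating on the daily maximum are all assumptions; the patent does not state these values.

```python
import numpy as np


def temperature_features(day_max, night_max, base_temp=10.0, heat_threshold=35.0):
    """Summarize daily and nighttime maximum temperature series for one sample.

    base_temp and heat_threshold are illustrative assumptions.
    """
    day_max = np.asarray(day_max, dtype=float)
    night_max = np.asarray(night_max, dtype=float)
    return {
        "day_max_mean": float(day_max.mean()),
        "night_max_mean": float(night_max.mean()),
        # Accumulated temperature above the base, here computed from daily maxima.
        "accumulated_temp": float(np.clip(day_max - base_temp, 0.0, None).sum()),
        "hot_days": int((day_max > heat_threshold).sum()),
    }


# Toy three-day series (the embodiment uses 15 days).
features = temperature_features(day_max=[30.0, 36.0, 34.0], night_max=[24.0, 26.0, 25.0])
```

Stacking one such dictionary per sample, for example with `pandas.DataFrame`, gives the "sample × feature" matrix described in the text.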
[0032] Hybrid combination characteristic calculation: For any hybrid combination F1 generation based on rice backbone parent materials, its multi-omics characteristics are calculated from the corresponding characteristics of its parents.
[0033] This embodiment employs an additive model, in which each feature value of the F1 generation of a hybrid combination is the average of the corresponding feature values of its two parents. For genomic data, if the genotype code of parent P1 at a locus is a and that of parent P2 is b, then the feature value of the hybrid combination at that locus is (a+b) / 2. Gene expression levels and parental phenotypic values are combined in the same way. Ultimately, each F1 generation of a hybrid combination is represented by a multidimensional vector containing genomic, transcriptomic, and phenotypic features, which is then merged with the aforementioned environmental omics matrix to construct the multi-omics feature matrix. The rows of the matrix represent hybrid combinations, and the columns represent features.
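A minimal sketch of the additive-model feature calculation, in which each hybrid row is the element-wise mean of its two parents' feature vectors. The parent names, the feature values, and the function name are hypothetical.

```python
import numpy as np


def additive_hybrid_features(parent_features, crosses):
    """Build one row per hybrid combination as (P1 + P2) / 2.

    parent_features: dict mapping parent name -> 1-D feature vector
    crosses: list of (female, male) parent-name pairs
    """
    return np.vstack([
        (parent_features[female] + parent_features[male]) / 2.0
        for female, male in crosses
    ])


# Hypothetical parents with three already-standardized features each.
parent_features = {
    "P1": np.array([0.0, 1.0, 0.4]),
    "P2": np.array([1.0, 1.0, 0.8]),
    "P3": np.array([0.5, 0.0, 0.2]),
}
crosses = [("P1", "P2"), ("P2", "P3")]
hybrid_matrix = additive_hybrid_features(parent_features, crosses)
```

The environmental feature columns could then be appended with `np.hstack` to form the full multi-omics feature matrix.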
[0034] S2. Definition of heterosis phenotypic labels. In this embodiment, "yield per plant" is used as the phenotypic indicator of heterosis, and the label is defined as the mid-parent heterosis (MPH) of the F1 yield per plant relative to the average yield per plant of the two parents. The calculation formula is: MPH = (F1 - (P1 + P2) / 2) / ((P1 + P2) / 2) × 100%, where F1 is the measured yield per plant of the hybrid combination, and P1 and P2 are the measured yields per plant of the maternal and paternal parents, respectively. The MPH values of the 1200 hybrid combinations were calculated and used as the continuous phenotypic labels to be predicted.
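The MPH formula translates directly into code. The sketch below uses made-up yield values and returns the result in percent; the function name is an assumption.

```python
def mid_parent_heterosis(f1, p1, p2):
    """MPH = (F1 - (P1 + P2) / 2) / ((P1 + P2) / 2) x 100, in percent."""
    mid_parent = (p1 + p2) / 2.0
    if mid_parent == 0:
        raise ValueError("mid-parent value is zero; MPH is undefined")
    return (f1 - mid_parent) / mid_parent * 100.0


# Made-up example: the hybrid yields 30 g per plant, its parents 24 g and 26 g,
# so the mid-parent value is 25 g and the hybrid exceeds it by one fifth.
mph = mid_parent_heterosis(f1=30.0, p1=24.0, p2=26.0)
```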
[0035] S3. Multi-model training. The 1200 hybrid combinations were randomly divided into a training set (1000 combinations, covering all 300 parents, with each parent appearing in at least 6 combinations) and a test set (200 combinations). Using the multi-omics feature matrix of the training set as input, the following four base learner models were trained in parallel: (1) Random forest regression model: implemented using the Scikit-learn library in Python. Hyperparameters include the number of decision trees (n_estimators), maximum depth (max_depth), and minimum number of samples per leaf node.
[0036] (2) Gradient boosting decision tree regression model: implemented using the XGBoost library. Hyperparameters include learning rate, maximum depth, and subsample ratio.
[0037] (3) Support Vector Machine Regression Model: Implemented using the Scikit-learn library, employing a radial basis function (RBF) kernel. Hyperparameters include the penalty coefficient C and the kernel function coefficient gamma.
[0038] (4) Neural Network Model: A multilayer perceptron (MLP) was built using the Keras library. The network structure included: an input layer with the number of neurons equal to the feature dimension; two hidden layers, each with 128 neurons, using the ReLU activation function; and an output layer using the linear activation function. Hyperparameters included the learning rate, batch size, and number of training epochs.
[0039] During the training of the base learner models, hyperparameter optimization is performed: for random forest regression models, gradient boosting decision tree regression models, and support vector machine regression models, 5-fold cross-validation combined with Bayesian optimization is used; for neural network models, the Adam optimizer is used for gradient-based optimization to search for the optimal combination of hyperparameters. The optimization objective is to minimize the mean squared error (MSE) of cross-validation. Each base learner is retrained on the entire training set using the optimized hyperparameters to obtain the final model.
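The tuning loop for one base learner can be sketched with scikit-learn as below. Note the substitution: random search (`RandomizedSearchCV`) stands in for the Bayesian optimization used in the embodiment, since random search is one of the alternatives the description lists and needs no extra library. The search ranges and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training-set feature matrix and MPH labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + rng.normal(scale=0.2, size=100)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={           # illustrative ranges, not from the patent
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=5,
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_squared_error",   # maximizing this minimizes the CV MSE
    refit=True,                         # retrain on the whole training set
    random_state=0,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
best_cv_mse = -search.best_score_
```

With `refit=True`, `best_estimator_` is already retrained on the entire training set with the selected hyperparameters, which matches the final retraining step described above.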
[0040] S4. Ensemble prediction. Ensemble learning with the stacking method is used to obtain the predicted values: (1) Use the four trained base learners to perform 5-fold cross-validation prediction on the training set, obtaining 4 predicted values for each sample in the training set, and build a predicted value feature matrix (meta-features) from these predicted values.
[0041] (2) Using the predicted value feature matrix as input and the heterosis label obtained in step S2 as the prediction target, train a linear regression model as the meta-learner model; (3) following step S1 above and using the aforementioned 300 rice backbone materials as parents, set up 200 new hybrid combinations and construct their multi-omics feature matrix; (4) input the multi-omics feature matrix obtained in step (3) into the four base learners to obtain four predicted values for each of the new hybrid combinations, build the predicted value feature matrix for these combinations, and input it into the trained meta-learner model. The output is the final predicted heterosis value of each hybrid combination.
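Steps (1) to (4) can be sketched end to end with scikit-learn. This is an illustration under stated substitutions: `GradientBoostingRegressor` replaces the XGBoost model and `MLPRegressor` replaces the Keras network so that the example depends on one library only; default hyperparameters are used instead of tuned ones, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic stand-ins for the training matrix, MPH labels, and new combinations.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 12))
X_new = rng.normal(size=(20, 12))
weights = rng.normal(size=12)
y_train = X_train @ weights + rng.normal(scale=0.3, size=150)

base_learners = [
    RandomForestRegressor(n_estimators=100, random_state=0),
    GradientBoostingRegressor(random_state=0),   # stand-in for XGBoost
    SVR(kernel="rbf", C=1.0, gamma="scale"),
    MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0),
]

# (1) Out-of-fold predictions on the training set form the meta-features.
meta_train = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5) for model in base_learners
])

# (2) A linear regression meta-learner maps the 4 predictions to the label.
meta_learner = LinearRegression().fit(meta_train, y_train)

# (3)-(4) Refit each base learner on the full training set, predict the new
# combinations, and pass the 4 predictions through the meta-learner.
meta_new = np.column_stack([
    model.fit(X_train, y_train).predict(X_new) for model in base_learners
])
stacked_pred = meta_learner.predict(meta_new)
```

Using out-of-fold predictions in step (1) keeps the meta-learner from being trained on predictions the base learners made for samples they had already seen. scikit-learn's `StackingRegressor` wraps the same procedure in a single estimator.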
[0042] The 200 combinations were sorted in descending order based on the predicted values. The top 10% of the combinations were selected as the superior combinations for key pairing and field verification, which greatly reduced the workload of blind pairing and shortened the breeding cycle.
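The ranking step reduces to a descending sort followed by a cut at the top 10%. A minimal sketch with hypothetical combination names and made-up predicted values:

```python
import numpy as np


def top_combinations(names, predicted, fraction=0.10):
    """Return the names of the top `fraction` of combinations by predicted value."""
    order = np.argsort(predicted)[::-1]  # indices sorted by descending prediction
    n_top = max(1, int(round(fraction * len(names))))
    return [names[i] for i in order[:n_top]]


# Toy example with 20 combinations; the embodiment ranks 200 and keeps 20.
names = [f"cross_{i}" for i in range(20)]
predicted = np.linspace(0.0, 19.0, 20)
selected = top_combinations(names, predicted)
```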
[0043] The process of the above method for predicting heterosis in rice is shown in Figure 1.
[0044] Experimental Example 1: Method accuracy evaluation. 1. Obtain the final predicted values for the test set using the stacking method for ensemble learning: (1) Use the four trained base learners to perform 5-fold cross-validation prediction on the training set, obtaining 4 predicted values for each sample in the training set, and build a predicted value feature matrix (meta-features) from these predicted values.
[0045] (2) Using the predicted feature matrix as input and the heterosis label obtained in step S2 as the prediction target, train a linear regression model as a meta-learner model; (3) Input the test set data into four base learners to obtain four predicted values for each sample in the test set. Establish the feature matrix of the predicted values of the test set and input it into the trained meta-learner model. The output is the final predicted value of heterosis of the test set.
[0046] 2. Obtain the final predicted values for the test set using the simple averaging method: The test set data is input into four base learners to obtain four predicted values for each sample in the test set. The arithmetic mean of the prediction results of the four base learners is taken as the final predicted value of heterosis in the test set.
[0047] Calculate the correlation coefficient (r) and root mean square error (RMSE) between the predicted values and the true MPH values in the test set.
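The simple averaging baseline and the two evaluation metrics can be sketched with NumPy. The numbers are made up for illustration and are not the patent's results; the correlation coefficient is assumed to be Pearson's r.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up true MPH values and predictions from two base learners (one per row).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
base_preds = np.array([
    [12.0, 18.0, 33.0, 39.0],
    [10.0, 24.0, 29.0, 43.0],
])

simple_avg = base_preds.mean(axis=0)  # arithmetic mean across base learners
avg_r = pearson_r(y_true, simple_avg)
avg_rmse = rmse(y_true, simple_avg)
```

In this toy case the averaged prediction is offset from the truth by exactly 1 for every sample, which shows why both metrics are reported: the correlation is perfect while the RMSE is not zero.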
[0048] The results are shown in Figure 2 and Figure 3. With the stacking method for ensemble learning, the final predicted values on the test set reach a correlation coefficient of 0.59 and an RMSE of 0.38, which is significantly better than any single base learner (the best single model has a correlation coefficient of 0.52 and an RMSE of 0.4) and also better than the simple averaging method (correlation coefficient of 0.48 and RMSE of 0.46).
[0049] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this invention, and these modifications and substitutions should all be covered within the scope of protection of this invention. Therefore, the scope of protection of this invention should be determined by the scope of the claims.
Claims
1. A method for predicting heterosis in rice, characterized in that it includes the following steps: S1: obtain multi-omics data of parental rice and hybrid F1 generation rice, and construct a multi-omics feature matrix; S2: define heterosis labels based on the actual phenotypic values of the parental rice and the hybrid F1 generation rice; S3: using the multi-omics feature matrix obtained in step S1 as input and the heterosis labels obtained in step S2 as the prediction target, train multiple different base learner models; S4: using the multi-omics feature matrix obtained in step S1 as input, obtain predicted values for the hybrid offspring of the parental rice from the base learner models obtained in step S3, and fuse them using an ensemble learning strategy to obtain the heterosis prediction result. Among the hybrid F1 generation rice, each parental rice appears at least 6 times, and the multi-omics data includes multiple types of data, such as genomics data, transcriptomics data, environmental omics data, and phenomics data.
2. The method for predicting heterosis in rice according to claim 1, characterized in that, The genomic data includes one or more of single nucleotide polymorphism (SNP) variant markers, insertion / deletion variant markers, and structural variation information; the transcriptomics data includes gene expression abundance information; the environmental omics data includes measurements of one or more of the following during the entire rice growth period: maximum daily temperature, minimum daily temperature, effective accumulated temperature, precipitation, soil moisture content, sunshine duration, photosynthetically active radiation, soil pH, soil salinity, soil organic matter content, and soil nitrogen, phosphorus, and potassium content; the phenomics data includes measurements of one or more of the following: plant height, heading date, number of effective panicles, number of grains per panicle, thousand-grain weight, seed setting rate, and disease resistance.
3. The method for predicting heterosis in rice according to claim 1, characterized in that, Step S1, which involves constructing a multi-omics feature matrix, includes: preprocessing the multi-omics data of parental rice and hybrid F1 generation rice to extract features, calculating the hybrid combination omics features, and summarizing them to construct a multi-omics feature matrix. The preprocessing methods include one or more of missing value imputation, outlier correction, and data standardization; the feature extraction methods include one or more of variance filtering, correlation coefficient-based filtering, mutual information method, and model-based feature importance ranking; and the methods for calculating hybrid combination omics features include one or more of calculating the mean, calculating the difference, and estimation based on a genetic model.
4. The method for predicting heterosis in rice according to claim 2, characterized in that, The steps of the model-based feature importance ranking include: using parental rice phenotypic data as the response variable and genomic and transcriptomic data as features, training a random forest model, and screening and extracting the variant sites and genes with the top 5-15% feature importance.
5. The method for predicting heterosis in rice according to claim 1, characterized in that, The actual phenotypic values measured in step S2 include one or more of the following: plant height, heading date, number of effective panicles, number of grains per panicle, thousand-grain weight, seed setting rate, and disease resistance; the heterosis label is mid-parent heterosis or super-parent heterosis.
6. The method for predicting heterosis in rice according to claim 1, characterized in that, The multiple different base learner models mentioned in step S3 include at least two of the following: linear regression model, ridge regression model, support vector machine regression model, decision tree regression model, random forest regression model, gradient boosting decision tree regression model, and neural network model.
7. The method for predicting heterosis in rice according to claim 1, characterized in that, when training the base learner models in step S3, the method further includes: using a cross-validation strategy to optimize the hyperparameters of the models.
8. The method for predicting heterosis in rice according to claim 7, characterized in that, For linear regression models, ridge regression models, support vector machine regression models, decision tree regression models, random forest regression models, and gradient boosting decision tree regression models, the hyperparameter optimization method is any one of grid search, random search, Bayesian optimization, and evolutionary algorithm; for neural network models, the hyperparameter optimization method is gradient-based optimization.
9. The method for predicting heterosis in rice according to claim 1, characterized in that the fusion method of the ensemble learning strategy described in step S4 is any one of simple averaging, weighted averaging, and stacking.
10. The method for predicting heterosis in rice according to claim 9, characterized in that, When using the stacking method as an ensemble learning strategy, the steps include: taking the multi-omics feature matrix obtained in step S1 as input, and based on the base learner model obtained in step S3, obtaining the predicted values of hybrid F1 generation rice and establishing the predicted value feature matrix; taking the predicted value feature matrix as input and the heterosis label obtained in step S2 as the prediction target, training to obtain the meta-learner model; taking the multi-omics feature matrix obtained in step S1 as input, and based on the meta-learner model, obtaining the heterosis prediction result; The meta-learner model is either a linear model or a nonlinear model.