A method for screening key lipids during buckwheat storage based on lipidomics and machine learning
The method combines LASSO regression and random forest regression with GC-MS and LC-MS analysis to screen key lipid biomarkers during buckwheat storage. It addresses the insufficient screening capability of existing techniques and enables accurate, stable identification of key lipids during storage.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANXI UNIV
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-26
AI Technical Summary
Existing techniques struggle to accurately screen, from high-dimensional and highly correlated buckwheat lipidomics data, the key lipid biomarkers related to quality changes during storage. Traditional methods cannot handle multicollinearity or nonlinear relationships between variables, and their screening results are unstable.
The method combines GC-MS and LC-MS analysis with LASSO regression and random forest regression. The OPLS-DA algorithm is used to screen key volatile compounds and key lipid molecules, a LASSO regression model is constructed with 10-fold cross-validation, and a random forest regression model is then constructed to calculate an importance score for each lipid molecule.
The method enables precise identification of key lipid biomarkers during buckwheat storage, improves the stability and accuracy of the screening results, and provides technical support for buckwheat storage.
Figure CN122282984A_ABST
Description
Technical Field
[0001] This invention relates to the field of food quality testing and metabolic biomarker screening, and specifically to a method for screening key lipids during buckwheat storage based on lipidomics analysis and machine learning modeling.
Background Technology
[0002] Buckwheat, a traditional grain crop with both medicinal and edible uses, is highly favored for its rich content of protein, dietary fiber, flavonoids, and unique minerals. In recent years, with consumers' increasing demands for food flavor and quality, the quality evolution of buckwheat and its products during storage has become a research hotspot in the grain processing field.
[0003] Lipids are an important component of buckwheat. They serve as energy stores and also take part in biological processes such as cell membrane formation and signal transduction. More importantly, lipid metabolism is closely related to quality changes during buckwheat storage. Lipids undergo oxidative degradation through enzymatic pathways, involving enzymes such as lipases (LIP), lipoxygenases (LOX), and hydroperoxidases (POD), or through non-enzymatic pathways, generating small-molecule volatile compounds such as aldehydes, ketones, and alcohols. These substances contribute directly to the characteristic aroma of buckwheat but can also lead to quality deterioration. Identifying the key lipid molecules closely related to buckwheat storage quality is therefore crucial for revealing the molecular mechanisms of quality change, establishing quality prediction models, and guiding the scientific storage of buckwheat.
[0004] However, lipidomics data are characterized by high dimensionality, high correlation, and nonlinearity. A typical buckwheat lipidomics analysis can detect hundreds of lipid molecules, including phospholipids, triglycerides, diglycerides, free fatty acids, glycolipids, and other subclasses. Accurately screening key lipid biomarkers truly relevant to quality changes during storage from this massive dataset is a major technical challenge in this field.
[0005] Traditional methods for screening key substances rely mainly on univariate statistical analysis (such as t-tests and ANOVA) or simple correlation analysis. These methods have the following drawbacks: (1) they cannot handle multicollinearity between variables, although lipid molecules are often highly correlated; (2) they ignore interactions and nonlinear relationships between variables; (3) their screening results are unstable and easily affected by sample size and outliers; and (4) they struggle to reduce dimensionality and select features effectively in high-dimensional data.
[0006] Variable selection is a crucial step in building predictive models and identifying key biomarkers. Existing research indicates that variable selection methods can be broadly divided into two types: regression-based and machine-learning-based. LASSO regression (least absolute shrinkage and selection operator), a regularized regression method, shrinks coefficients and selects variables through a penalty function, which gives it particular advantages for high-dimensional, collinear data. Random forest regression, an ensemble learning method based on decision trees, can rank variables by importance and handle nonlinear relationships and variable interactions. Combining the two methods holds promise for overcoming the limitations of either method alone and achieving precise screening of key lipid molecules.
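For reference, the penalty-based selection described above is commonly written as the optimization problem below. This is the standard textbook form rather than a formula taken from the source, and the symbols are assumptions: n samples, response y_i (a volatile compound content), features x_ij (lipid abundances), intercept β_0, coefficients β_j, and penalty parameter λ. The 1/(2n) scaling is one common convention.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta_0,\,\beta}
    \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    \; + \; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Because the penalty is an absolute value rather than a square, a sufficiently large λ drives some coefficients exactly to zero, which is why the lipids with non-zero coefficients can be read off as the selected variables.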
[0007] To date, there have been no reports on the systematic screening of key lipid molecules during buckwheat storage by combining lipidomics analysis with LASSO regression and random forest regression.
Summary of the Invention
[0008] The technical problem to be solved by this invention is that existing techniques have insufficient ability to screen key lipid molecules during buckwheat storage and that traditional statistical methods struggle to process high-dimensional lipidomics data. The invention provides a method for screening key lipids during buckwheat storage based on lipidomics and machine learning, so as to achieve accurate identification of key lipid biomarkers.
[0009] To achieve the above objective, the invention provides a method for screening key lipids during buckwheat storage based on lipidomics and machine learning, comprising the following steps:
[0010] (1) Sample collection and preparation: Untreated and superheated-steam-treated buckwheat flour are subjected to accelerated oxidation storage in a stability test chamber at constant temperature and relative humidity for a set period of time.
[0011] (2) Acquisition of volatile compound data: The volatile compounds in the samples from step (1) are analyzed by GC-MS, the types and contents of volatile compounds in the buckwheat samples are identified and quantified, and the original dataset of volatile compounds is constructed.
[0012] (3) Screening of key volatile compounds: The OPLS-DA algorithm is used to process the original dataset of volatile compounds obtained in step (2), screen out key volatile compounds, and construct the original dataset of key volatile compounds.
[0013] (4) Lipidomics data acquisition: The samples collected in step (1) are analyzed by LC-MS to identify and quantify the types and contents of lipid molecules in the buckwheat samples and construct the original lipidomics dataset.
[0014] (5) Screening of key lipid molecules: Differential analysis is performed on the original lipidomics dataset from step (4) to obtain differentially expressed lipids. The OPLS-DA algorithm is used to process the differentially expressed lipid data and screen out key lipid molecules to construct the original dataset of key lipid molecules.
[0015] (6) Construction and application of the LASSO regression model: The key volatile compounds ranked highest by VIP in step (3) are selected as the response variables, and the key lipid molecule dataset screened by VIP in step (5) is used as the feature variables. The LASSO regression model is constructed, and the lipid molecules with non-zero coefficients in the model are extracted as the key lipid dataset related to the formation of key volatile compounds.
[0016] (7) Construction and application of the random forest regression model: The same response variables as in step (6) are used, and the key lipid dataset related to the formation of key volatile compounds from step (6) is used as the feature variables. A random forest regression model is constructed, the importance score of each lipid molecule is calculated, and the lipid molecules are ranked by importance score.
[0017] (8) Validation and evaluation: The random forest regression model constructed in step (7) is validated using the "hold-out method" to evaluate the correlation between key lipids and key volatile compounds and the stability of the screening results.
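The differential analysis in step (5) can be illustrated with a short filter. The sketch below is a minimal Python example under stated assumptions: it uses the thresholds given in the examples (FDR < 0.05, with fold change above 1.5 or below 0.67), and it assumes the FDR is obtained with the Benjamini-Hochberg procedure, which the source does not specify. The function names are hypothetical, and the p-values and fold changes in `records` are made up for illustration; only the lipid names are borrowed from the tables.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def differential_lipids(records, fdr_cut=0.05, fc_up=1.5, fc_down=0.67):
    """Keep lipids with FDR < fdr_cut and FC > fc_up or FC < fc_down.

    records: list of (lipid_name, raw_p_value, fold_change) tuples.
    """
    fdr = benjamini_hochberg([p for _, p, _ in records])
    return [
        lipid
        for (lipid, _, fc), q in zip(records, fdr)
        if q < fdr_cut and (fc > fc_up or fc < fc_down)
    ]


# Illustrative input: p-values and fold changes are NOT from the source.
records = [
    ("FA(18:1)", 0.001, 3.2),
    ("FA(18:2)", 0.004, 2.7),
    ("PC(18:1_18:2)", 0.020, 0.40),
    ("TG(16:0_18:1_18:2)", 0.030, 1.1),   # significant, but FC too close to 1
    ("DG(18:1_18:1)", 0.600, 1.8),        # large FC, but not significant
]

selected_lipids = differential_lipids(records)
print(selected_lipids)
```

Only lipids that pass both filters would go on to the OPLS-DA step, where the VIP criterion is applied.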
[0018] Preferably, the buckwheat sample in step (1) is whole-grain flour obtained by milling after dehulling, and includes two varieties: sweet buckwheat (common buckwheat) and bitter buckwheat (tartary buckwheat).
[0019] Preferably, the buckwheat samples subjected to superheated steam treatment in step (1) are sweet buckwheat and bitter buckwheat that have been subjected to superheated steam treatment respectively.
[0020] Preferably, the superheated steam treatment in step (1) is set to temperatures of 150, 170 and 190 °C, and the treatment time is 10 s.
[0021] Preferably, the volatile compounds in step (2) are analyzed on a GCMS-TQ8040 NX instrument equipped with an SH-PolarD chromatographic column. The volatile compounds are identified by comparison with the NIST 20 spectral library, and compounds with a similarity of ≥ 80% are retained in the original dataset of volatile compounds.
[0022] Preferably, the OPLS-DA algorithm in step (3) uses VIP > 1 as the screening criterion to establish the original dataset of key volatile compounds.
[0023] Preferably, the lipidomics data in step (4) are acquired by UPLC-Q-Exactive-Orbitrap MS, and the lipid molecules are separated by Syncronis C18 column.
[0024] Preferably, the identification and quantification results of lipid molecules in step (4) are obtained by using LipidSearch V5.0 software, and finally a raw lipidomics dataset is established.
[0025] Preferably, the differentially expressed lipids in step (5) are obtained by screening using the T-test corrected p-value (FDR) and fold change value (FC). After OPLS-DA analysis, lipid molecules with VIP > 1 are used as the original dataset of key lipid molecules.
[0026] Preferably, the response variables in step (6) are the quantitative results of the first two representative volatile compounds in the original dataset of key volatile compounds constructed in step (3).
[0027] Preferably, the feature variables in step (6) are selected from the original dataset of all key lipid molecules in step (5).
[0028] Preferably, the LASSO regression model established in step (6) uses 10-fold cross-validation to select the optimal penalty parameter λ, with the selection criterion being minimizing the mean squared error. Finally, a key lipid dataset related to the formation of key volatile compounds is established.
[0029] Preferably, the key parameter settings for the random forest regression model in step (7) include: the number of decision trees (ntree) is 500–2000, and the number of variables (mtry) randomly selected when splitting each node is selected through hyperparameter tuning. The importance of key lipids is ranked based on the degree of increase in model prediction error (%IncMSE).
[0030] Preferably, the "hold-out method" in step (8) involves dividing the dataset into a training set (70%) and a test set (30%), and evaluating the model prediction performance on the training set and the test set respectively.
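The preferred LASSO settings above (10-fold cross-validation, λ chosen to minimize the mean squared error, non-zero coefficients retained) can be sketched as follows. This is a minimal, dependency-free Python illustration, not the implementation used in the source. It assumes the objective (1/2n)·Σ(residual²) + λ·Σ|β|, solves it by coordinate descent, assumes the features are already on comparable scales, and runs on synthetic data in which only features 0 and 2 influence the response. All function names and the λ grid are hypothetical.

```python
import random


def soft_threshold(value, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_lasso(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*sum(residual^2) + lam*sum(|beta|)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    residual = [yi - intercept for yi in y]
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            # Add feature j's contribution back to get the partial residual.
            partial = [r + c * beta[j] for r, c in zip(residual, col)]
            rho = sum(c * r for c, r in zip(col, partial)) / n
            scale = sum(c * c for c in col) / n
            beta[j] = soft_threshold(rho, lam) / scale if scale > 0 else 0.0
            residual = [r - c * beta[j] for r, c in zip(partial, col)]
        # Absorb the mean residual into the (unpenalized) intercept.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, beta


def mse(X, y, intercept, beta):
    preds = [intercept + sum(x * b for x, b in zip(row, beta)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def cv_select_lambda(X, y, lambdas, k=10, seed=0):
    """Return the lambda with the lowest mean k-fold validation MSE."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    cv_error = {}
    for lam in lambdas:
        errors = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            b0, beta = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
            errors.append(mse([X[i] for i in fold], [y[i] for i in fold], b0, beta))
        cv_error[lam] = sum(errors) / k
    return min(cv_error, key=cv_error.get), cv_error


# Synthetic data: 40 samples, 5 features; only features 0 and 2 matter.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0, 0.1) for row in X]

best_lam, cv_error = cv_select_lambda(X, y, lambdas=[0.01, 0.1, 0.5, 1.0])
_, beta = fit_lasso(X, y, best_lam)
selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
print(best_lam, selected)
```

In this sketch the indices in `selected` play the role of the key lipid dataset passed on to the random forest step. In practice an established LASSO implementation with built-in cross-validation would normally be preferred over hand-written coordinate descent.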
[0031] The present invention has at least the following beneficial effects:
[0032] Using a small-scale sample set, this invention combines LASSO regression with random forest regression for the first time to screen key lipids during buckwheat storage, based on volatile compound analysis and lipidomics analysis, and the method is feasible and reliable. The established machine learning regression models provide insight into how volatiles form through lipid oxidation during buckwheat storage. The method provides technical support for lipid analysis during buckwheat storage and has strong practical and promotional value.
Attached Figure Description
[0033] Figure 1 Relationship between λ and mean squared error in the LASSO regression model for 2-Hydroxy-benzaldehyde in sweet buckwheat
[0034] Figure 2 Relationship between λ and mean squared error in the LASSO regression model for Hexanal in sweet buckwheat
[0035] Figure 3 LASSO regression coefficients of lipid molecules related to 2-Hydroxy-benzaldehyde in sweet buckwheat
[0036] Figure 4 LASSO regression coefficients of lipid molecules related to Hexanal in sweet buckwheat
[0037] Figure 5 Random forest regression importance ranking of lipid molecules related to 2-Hydroxy-benzaldehyde in sweet buckwheat
[0038] Figure 6 Random forest regression importance ranking of lipid molecules related to Hexanal in sweet buckwheat
[0039] Figure 7 Predictive performance of key lipids for 2-Hydroxy-benzaldehyde content in the random forest regression model for sweet buckwheat
[0040] Figure 8 Predictive performance of key lipids for Hexanal content in the random forest regression model for sweet buckwheat
[0041] Figure 9 Relationship between λ and mean squared error in the LASSO regression model for 2-Hydroxy-benzaldehyde in tartary buckwheat
[0042] Figure 10 Relationship between λ and mean squared error in the LASSO regression model for Hexanal in tartary buckwheat
[0043] Figure 11 LASSO regression coefficients of lipid molecules related to 2-Hydroxy-benzaldehyde in tartary buckwheat
[0044] Figure 12 LASSO regression coefficients of lipid molecules related to Hexanal in tartary buckwheat
[0045] Figure 13 Random forest regression importance ranking of lipid molecules related to 2-Hydroxy-benzaldehyde in tartary buckwheat
[0046] Figure 14 Random forest regression importance ranking of lipid molecules related to Hexanal in tartary buckwheat
[0047] Figure 15 Predictive performance of key lipids for 2-Hydroxy-benzaldehyde content in the random forest regression model for tartary buckwheat
[0048] Figure 16 Predictive performance of key lipids for Hexanal content in the random forest regression model for tartary buckwheat
Detailed Implementation
[0049] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. This embodiment particularly highlights the specific application of LASSO regression and random forest regression models in key lipid screening.
[0050] Example 1
[0051] This embodiment provides a method for screening key lipids during buckwheat storage based on lipidomics and machine learning. Taking sweet buckwheat as an example, the method specifically includes the following steps:
[0052] (1) Sample collection and preparation: Dehulled sweet buckwheat flour, both untreated and treated with superheated steam at 150, 170 or 190 °C for 10 s, was stored at 50 °C and 60% relative humidity. Samples stored for 16 weeks were collected (named C–16, C150–16, C170–16, C190–16), and samples without storage were used as controls (named C–0, C150–0, C170–0, C190–0).
[0053] (2) Acquisition of volatile compound data: The volatile compound composition of the sample in step (1) was analyzed using a GCMS-TQ8040 NX analyzer equipped with an SH-PolarD column. The types and contents of volatile compounds in the buckwheat sample were identified and quantified, and the original dataset of volatile compounds was constructed.
[0054] (3) Screening of key volatile compounds: The OPLS-DA algorithm was used to process the original dataset of volatile compounds obtained in step (2), volatile compounds with VIP > 1 were screened out, and the original dataset of key volatile compounds was constructed, as shown in Tables 1–4.
[0055] (4) Lipidomics data acquisition: The lipid molecules in the samples collected in step (1) were separated and identified using a UPLC-Q-Exactive-Orbitrap MS system equipped with a Syncronis C18 column. The lipid molecules were annotated using LipidSearch V5.0 software to construct the original lipidomics dataset.
[0056] (5) Screening of key lipid molecules: Differential analysis was performed on the original lipidomics dataset from step (4). Differentially expressed lipids were obtained using FDR < 0.05 together with FC > 1.5 or FC < 0.67 as screening criteria. The OPLS-DA algorithm was then used to screen lipids with VIP > 1, and the original dataset of key lipid molecules was constructed, as shown in Tables 5–8.
[0057] Table 1 Key Volatile Compounds of C–16 vs C–0
Volatile compounds | VIP
2-Hydroxy-benzaldehyde | 3.36334
Hexanal | 2.52331
Hexanoic acid | 2.3209
1-Pentanol | 2.01699
Benzaldehyde | 1.75511
1-Octanol | 1.49957
1-Hexanol | 1.45802
1-Butanol | 1.44574
Dihydro-5-pentyl-2(3H)-furanone | 1.38969
Octanoic acid | 1.38778
6,10-Dimethyl-5,9-undecadien-2-one | 1.36035
1-Heptanol | 1.34211
Acetophenone | 1.19002
Pentanal | 1.15764
1,3,5,7-Cyclooctatetraene | 1.11583
Tetradecane | 1.08308
Heptanoic acid | 1.07212
Benzyl alcohol | 1.00155

Table 2 Key Volatile Compounds of C150–16 vs C150–0
Volatile compounds | VIP
2-Hydroxy-benzaldehyde | 5.06098
Dibutyl phthalate | 2.15092
1-Octen-3-ol | 1.69803
Nonanal | 1.68989
Benzaldehyde | 1.53631
2,3-Dimethyl-nonane | 1.41035
1-Butanol | 1.39514
Styrene | 1.33844
(E)-2-Heptenal | 1.25448
1-Pentanol | 1.17317
Hexanoic acid | 1.12091
Undecane | 1.0662
(E)-2-Octen-1-ol | 1.02971
Trimethyl-pyrazine | 1.02208

Table 3 Key Volatile Compounds of C170–16 vs C170–0
Volatile compounds | VIP
2-Hydroxy-benzaldehyde | 4.77139
Hexanal | 3.3844
Nonanal | 2.76421
1-Octen-3-ol | 1.77667
2-Heptanone | 1.483
Benzaldehyde | 1.47071
2,3-Dimethyl-nonane | 1.4285
Dodecane | 1.29497
Dibutyl phthalate | 1.26492
1-Butanol | 1.25878
Heptanal | 1.2099
1,3,5,7-Cyclooctatetraene | 1.16975
6,10-Dimethyl-5,9-undecadien-2-one | 1.16023
1,2-Benzenedicarboxylic acid, butyl octyl ester | 1.0836
Octanal | 1.05061
Hexanoic acid | 1.02067
2-Propenoic acid, butyl ester | 1.01924
Hexanedioic acid, bis(2-ethylhexyl) ester | 1.00118

Table 4 Key Volatile Compounds of C190–16 vs C190–0
Volatile compounds | VIP
2-Hydroxy-benzaldehyde | 4.25351
Hexanal | 3.99944
Nonanal | 2.74251
1-Octen-3-ol | 1.5565
Styrene | 1.44583
2,3-Dimethyl-nonane | 1.41191
Dodecane | 1.3651
2-Heptanone | 1.31634
Heptanal | 1.24344
Benzaldehyde | 1.2414
Undecane | 1.16266
Octanal | 1.16068
1-Butanol | 1.14202
(E)-2-Octen-1-ol | 1.13661
Hexanedioic acid, bis(2-ethylhexyl) ester | 1.11723
3-Methyl-decane | 1.08322

Table 5 Key Lipids of C–16 vs C–0
Lipid | VIP
FA(18:1) | 7.01559
FA(18:2) | 6.57951
FA(16:0) | 4.14847
TG(16:0_18:2_18:2) | 3.9466
TG(16:0_18:1_18:2) | 3.84688
TG(18:1_18:1_18:1) | 3.66962
TG(18:1_18:2_18:2) | 3.60665
TG(18:1_18:1_18:2) | 3.55764
TG(16:0_18:1_18:1) | 3.38297
TG(18:1_18:2_18:3) | 2.5799
PC(18:1_18:2) | 2.54041
TG(18:1_18:1_20:1) | 2.44774
FA(20:1) | 2.31077
PC(18:1_18:1) | 2.30978
TG(18:1_20:1_18:2) | 2.29213
TG(16:0_20:0_18:2) | 2.07027
TG(16:0_16:0_18:2) | 2.06314
FA(22:0) | 1.80012
PC(18:2_18:2) | 1.7882
TG(18:1_18:1_18:3) | 1.76262
TG(16:0_18:2_18:3) | 1.7317
FA(20:0) | 1.68903
TG(22:0_18:1_18:2) | 1.66756
TG(16:0_16:0_18:1) | 1.66348
TG(22:0_18:2_18:2) | 1.65643
TG(20:0_18:1_18:1) | 1.63422
TG(20:1_18:2_18:2) | 1.57779
TG(22:0_18:1_18:1) | 1.39396
FA(24:0) | 1.35302
DG(18:1_18:1) | 1.29467
FA(18:3) | 1.25746
PC(16:0_18:1) | 1.24583
DG(18:1_18:2) | 1.19843
Nine further entries follow whose lipid names are missing from the source (the first begins "TG("); their VIP values, in order: 1.19808, 1.19745, 1.19063, 1.17864, 1.13808, 1.11992, 1.07157, 1.06988, 1.05974

Table 6 Key Lipids of C150–16 vs C150–0
Lipid | VIP
The lipid names of the first 20 entries are missing from the source; their VIP values, in order: 3.85346, 3.58342, 3.20101, 3.05543, 2.73271, 2.55854, 2.54638, 2.14377, 1.96704, 1.95287, 1.8862, 1.8653, 1.78729, 1.72565, 1.45727, 1.40437, 1.3562, 1.20528, 1.15073, 1.12992
PE(18:2_18:2) | 1.06804

Table 7 Key Lipids of C170–16 vs C170–0
Lipid | VIP
PC(18:1_18:2) | 4.44696
PC(18:1_18:1) | 3.63698
LPC(18:1) | 3.37492
DG(18:1_18:1) | 3.13521
DG(16:0_18:2) | 3.03504
PC(18:2_18:2) | 2.79435
TG(16:0_18:2_18:3) | 2.75056
LPC(18:2) | 2.72863
DG(18:2_18:2) | 2.46008
PC(16:0_18:1) | 2.29755
TG(O-15:2_3:0_18:1) | 2.25845
DG(18:1_18:2) | 2.24262
PC(16:0_18:2) | 2.05439
TG(O-15:1_3:0_18:1) | 1.87596
TG(O-15:2_3:0_18:2) | 1.75806
PE(18:1_18:2) | 1.49339
DG(20:1_18:2) | 1.40018
LPC(16:0) | 1.35201
DG(16:0_18:1) | 1.19652
PE(18:2_18:2) | 1.17505

Table 8 Key Lipids of C190–16 vs C190–0
Lipid | VIP
PC(18:1_18:2) | 3.94669
PC(18:1_18:1) | 3.9388
LPC(18:1) | 3.40529
DG(18:1_18:1) | 3.28699
PC(18:2_18:2) | 3.03294
DG(16:0_18:2) | 2.85261
LPC(18:2) | 2.50736
TG(O-15:2_3:0_18:1) | 2.36332
DG(18:1_18:2) | 2.35208
DG(18:2_18:2) | 2.28298
PC(16:0_18:1) | 2.16609
TG(O-15:1_3:0_18:1) | 1.91805
PC(16:0_18:2) | 1.82727
TG(O-15:2_3:0_18:2) | 1.61001
DG(20:1_18:2) | 1.51873
PE(18:1_18:2) | 1.3675
LPC(16:0) | 1.28369
PE(18:2_18:2) | 1.2122
DG(18:2_18:3) | 1.12628
DG(16:0_18:1) | 1.09151
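The VIP > 1 criterion and the ranking by VIP can be illustrated with a few lines of Python. The sketch below uses a subset of the values from Table 1 (C–16 vs C–0) plus one made-up entry below the threshold; the function name `screen_by_vip` is hypothetical. In this particular comparison the two highest-ranked compounds happen to be the two response variables used in step (6), although the ranking differs in the other comparisons.

```python
# VIP scores copied from Table 1 (C-16 vs C-0); only a subset is listed.
vip_scores = {
    "Hexanoic acid": 2.3209,
    "2-Hydroxy-benzaldehyde": 3.36334,
    "Benzyl alcohol": 1.00155,
    "Hexanal": 2.52331,
    "1-Pentanol": 2.01699,
    "Hypothetical compound X": 0.84,  # illustrative value, NOT from the source
}


def screen_by_vip(scores, threshold=1.0):
    """Return (name, VIP) pairs with VIP > threshold, highest VIP first."""
    kept = [(name, vip) for name, vip in scores.items() if vip > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


key_volatiles = screen_by_vip(vip_scores)
top_two = [name for name, _ in key_volatiles[:2]]
print(top_two)
```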
[0058] (6) Construction and application of the LASSO regression model: 2-Hydroxy-benzaldehyde and Hexanal from step (3) were selected as response variables, and the key lipid molecule dataset from step (5) was used as the feature variables to construct the LASSO regression model. Ten-fold cross-validation was used during model building to determine the optimal λ value, as shown in Figures 1 and 2. The model was then run with the optimized parameter, and the lipid molecules with non-zero coefficients in the LASSO regression model were extracted, as shown in Figures 3 and 4.
[0059] (7) Construction and application of the random forest regression model: The same response variables as in step (6) were used, and the key lipid dataset related to the formation of key volatile compounds obtained in step (6) was used as the feature variables to construct a random forest regression model. The model parameter values after hyperparameter tuning are shown in Table 9. Finally, the importance score of each lipid molecule was calculated and the lipid molecules were ranked by importance score, as shown in Figures 5 and 6.
[0060] Table 9 Random Forest Regression Hyperparameter Tuning Values
Volatile substances | ntree | mtry
2-Hydroxy-benzaldehyde | 1000 | 3
Hexanal | 1000 | 1
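The importance score used for the ranking (%IncMSE) is a permutation-based measure. One common way to write the underlying quantity is sketched below. The notation is an assumption rather than something given in the source: T is the number of trees (ntree), MSE_t is the out-of-bag mean squared error of tree t, and MSE_t^(j) is the same error after the values of lipid j have been randomly permuted in the out-of-bag samples.

```latex
\mathrm{IncMSE}_j
  = \frac{1}{T} \sum_{t=1}^{T}
    \Bigl( \mathrm{MSE}_t^{(j)} - \mathrm{MSE}_t \Bigr)
```

A lipid whose permutation leaves the error almost unchanged receives a score near zero, while a lipid the model relies on receives a large positive score. Random forest implementations typically report a scaled or normalized version of this average; the source does not state which scaling was used.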
[0061] (8) Validation and evaluation: Using the "hold-out method", the samples were divided into a 70% training set and a 30% test set to validate the random forest regression model constructed in step (7), evaluate the correlation between key lipids and key volatile compounds, and assess the stability of the screening results, as shown in Figures 7 and 8.
[0062] The results showed that, in sweet buckwheat samples, combining LASSO regression with random forest regression on lipidomics data gave good predictive ability for two marker substances formed during storage, 2-Hydroxy-benzaldehyde and Hexanal. Specifically, the random forest model built from the key lipids screened by the LASSO regression model achieved an R² of 0.9715 for 2-Hydroxy-benzaldehyde content and an R² of 0.6530 for Hexanal content.
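The hold-out evaluation and the R² metric reported above can be sketched in a few lines. The following is a minimal Python illustration, assuming a random 70/30 split of sample indices and the usual definition R² = 1 − SS_res / SS_tot. The function names are hypothetical, the regression model itself is omitted, and the measured and predicted values are made up for illustration.

```python
import random


def hold_out_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and return (train_idx, test_idx)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# 20 hypothetical samples split into 70% training and 30% test indices.
train_idx, test_idx = hold_out_split(20)
print(len(train_idx), len(test_idx))

# Made-up measured and predicted contents for some held-out samples.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(measured, predicted), 4))
```

Computing R² separately on the training indices and on the test indices, as the method describes, shows whether the model generalizes or merely fits the training samples.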
[0063] Example 2
[0064] This embodiment provides a method for screening key lipids during buckwheat storage based on lipidomics and machine learning. Taking bitter buckwheat as an example, the method specifically includes the following steps:
[0065] (1) Sample collection and preparation: Dehulled bitter buckwheat whole flour, both untreated and treated with superheated steam at 150, 170 or 190 °C for 10 s, was stored at 50 °C and 60% relative humidity. Samples stored for 16 weeks were collected (named T–16, T150–16, T170–16, T190–16), and samples without storage were used as controls (named T–0, T150–0, T170–0, T190–0).
[0066] (2) Acquisition of volatile compound data: The volatile compound composition of the sample in step (1) was analyzed using a GCMS-TQ8040 NX analyzer equipped with an SH-PolarD column. The types and contents of volatile compounds in the buckwheat sample were identified and quantified, and the original dataset of volatile compounds was constructed.
[0067] (3) Screening of key volatile compounds: The OPLS-DA algorithm was used to process the original dataset of volatile compounds obtained in step (2), volatile compounds with VIP > 1 were screened out, and the original dataset of key volatile compounds was constructed, as shown in Tables 9–12.
[0068] (4) Lipidomics data acquisition: The lipid molecules in the samples collected in step (1) were separated and identified using a UPLC-Q-Exactive-Orbitrap MS system equipped with a Syncronis C18 column. The lipid molecules were annotated using LipidSearch V5.0 software to construct the original lipidomics dataset.
[0069] (5) Screening of key lipid molecules: Differential analysis was performed on the original lipidomics dataset from step (4). Differentially expressed lipids were obtained using FDR < 0.05 together with FC > 1.5 or FC < 0.67 as screening criteria. The OPLS-DA algorithm was then used to screen lipids with VIP > 1, and the original dataset of key lipid molecules was constructed, as shown in Tables 13–16.
[0070] (6) Construction and application of the LASSO regression model: 2-Hydroxy-benzaldehyde and Hexanal from step (3) were selected as response variables, and the key lipid molecule dataset from step (5) was used as the feature variables to construct the LASSO regression model. Ten-fold cross-validation was used during model building to determine the optimal λ value, as shown in Figures 9 and 10. The model was then run with the optimized parameter, and the lipid molecules with non-zero coefficients in the LASSO regression model were extracted, as shown in Figures 11 and 12.
[0071] (7) Construction and application of the random forest regression model: The same response variables as in step (6) were used, and the key lipid dataset related to the formation of key volatile compounds obtained in step (6) was used as the feature variables to construct a random forest regression model. The model parameter values after hyperparameter tuning are shown in Table 17. Finally, the importance score of each lipid molecule was calculated and the lipid molecules were ranked by importance score, as shown in Figures 13 and 14.
[0072] (8) Validation and evaluation: Using the "hold-out method", the samples were divided into a 70% training set and a 30% test set to validate the random forest regression model constructed in step (7), evaluate the correlation between key lipids and key volatile compounds, and assess the stability of the screening results, as shown in Figures 15 and 16.
[0073] Table 9 Key Volatile Compounds of T–16 vs T–0
Volatile compounds | VIP
Hexanoic acid | 2.46002
2-Hydroxy-benzaldehyde | 2.08599
1-Butanol | 1.65882
Hexanal | 1.64485
Styrene | 1.37118
Octanoic acid | 1.3606
2-Heptanone | 1.3349
1-Pentanol | 1.32515
Benzaldehyde | 1.3165
2-Ethyl-1-hexanol | 1.22137
Acetophenone | 1.1891
1-Hexanol | 1.15922
6,10-Dimethyl-5,9-undecadien-2-one | 1.1329
Dodecane | 1.07965
Heptanoic acid | 1.07458
Pentadecane | 1.03047
1-Heptanol | 1.02575

Table 10 Key Volatile Compounds of T150–16 vs T150–0
Volatile compounds | VIP
Hexanal | 3.50195
Nonanal | 2.27132
2-Hydroxy-benzaldehyde | 2.0736
Dibutyl phthalate | 1.95435
2-Heptanone | 1.67868
1-Octen-3-ol | 1.42617
Benzaldehyde | 1.4123
Styrene | 1.38274
2-Ethyl-1-hexanol | 1.36626
1-Butanol | 1.18727
Pentadecane | 1.11758
3-Methyl-decane | 1.01541
2-Propenoic acid, butyl ester | 1.01255

Table 11 Key Volatile Compounds of T170–16 vs T170–0
Volatile compounds | VIP
2-Hydroxy-benzaldehyde | 2.3413
Dibutyl phthalate | 1.84433
Styrene | 1.63018
6,10-Dimethyl-5,9-undecadien-2-one | 1.59968
2-Heptanone | 1.57207
Hexanal | 1.50121
Hexanedioic acid, bis(2-ethylhexyl) ester | 1.44238
Benzaldehyde | 1.41995
Tridecane | 1.2647
3-Methyl-decane | 1.25167
1-Butanol | 1.25015
2,4-Di-tert-butylphenol | 1.23583
Tetradecane | 1.11869
Acetophenone | 1.07627
6-Methyl-5-hepten-2-one | 1.0711
2-Propenoic acid, butyl ester | 1.05498
2-Pentyl-furan | 1.03207
Acetic acid, heptyl ester | 1.02858

Table 12 Key Volatile Compounds of T190–16 vs T190–0
Volatile compounds | VIP
2-Hydroxy-benzaldehyde | 2.49354
Hexanal | 2.34657
Nonanal | 1.82243
Dibutyl phthalate | 1.73289
2-Heptanone | 1.61251
Benzaldehyde | 1.59964
1,3,5,7-Cyclooctatetraene | 1.46854
Acetic acid, heptyl ester | 1.35575
1-Butanol | 1.31598
6-Methyl-5-hepten-2-one | 1.20955
2-Ethyl-1-hexanol | 1.18558
Acetophenone | 1.10691
2-Propenoic acid, butyl ester | 1.08817
6,10-Dimethyl-5,9-undecadien-2-one | 1.06679
1-Heptanol | 1.0559
2-Undecenal | 1.04747

Table 13 Original dataset of key lipids of T–16 vs T–0
Lipid | VIP
FA(18:2) | 7.313
FA(18:1) | 7.14104
FA(16:0) | 4.00153
TG(18:1_18:1_18:2) | 3.91532
TG(16:0_18:1_18:2) | 3.83993
TG(18:1_18:2_18:2) | 3.81905
TG(16:0_18:2_18:2) | 3.59711
TG(18:1_18:1_18:1) | 3.52625
TG(18:1_18:2_18:3) | 3.14727
TG(16:0_18:1_18:1) | 3.12188
PC(18:1_18:2) | 2.50549
FA(20:1) | 2.21111
TG(18:1_18:1_20:1) | 2.18432
TG(18:1_20:1_18:2) | 2.10558
TG(16:0_16:0_18:2) | 2.0798
PC(18:1_18:1) | 2.00888
FA(18:0) | 1.93222
TG(16:0_20:0_18:2) | 1.88843
PC(18:2_18:2) | 1.84625
TG(18:1_18:1_18:3) | 1.75874
TG(22:0_18:2_18:2) | 1.65738
TG(16:0_18:2_18:3) | 1.62474
FA(22:0) | 1.59672
FA(20:0) | 1.51142
TG(20:1_18:2_18:2) | 1.46771
TG(16:0_16:0_18:1) | 1.41056
TG(20:0_18:1_18:1) | 1.34425
TG(22:0_18:1_18:2) | 1.33776
TG(18:2_16:0_18:3) | 1.25491
FA(18:3) | 1.21556
DG(18:2_18:2) | 1.19844
FA(24:0) | 1.16819
PC(16:0_18:2) | 1.14565
DG(18:1_18:1) | 1.13302
TG(O-15:2_3:0_18:1) | 1.09927
DG(18:1_18:2) | 1.09444
PC(16:0_18:1) | 1.08917
TG(22:0_18:1_18:1) | 1.065
TG(22:1_18:2_18:2) | 1.03749

Table 14 Original dataset of key lipids of T150–16 vs T150–0
Lipid | VIP
DG(18:1_18:1) | 4.2378
PC(18:1_18:2) | 4.06088
DG(18:2_18:2) | 3.54922
LPC(18:1) | 3.24035
DG(16:0_18:2) | 3.13388
PC(18:1_18:1) | 3.02088
PC(18:2_18:2) | 2.93452
TG(O-15:2_3:0_18:1) | 2.84966
DG(18:1_18:2) | 2.76257
LPC(18:2) | 2.75058
TG(O-15:2_3:0_18:2) | 2.35698
TG(O-15:1_3:0_18:1) | 2.0506
PC(16:0_18:2) | 1.86553
DG(20:1_18:2) | 1.79058
PE(18:1_18:2) | 1.55733
PC(16:0_18:1) | 1.52162
FA(20:1) | 1.4703
PE(18:2_18:2) | 1.36926
TG(18:1_18:1CHO_18:2) | 1.22557
LPC(16:0) | 1.19552
DG(16:0_18:1) | 1.19299
TG(21:4_18:1_17:2CHO) | 1.14522
DG(18:2_18:3) | 1.08631
DG(24:0_18:2) | 1.00406

Table 15 Original dataset of key lipids of T170–16 vs T170–0
Lipid | VIP
PC(18:1_18:2) | 4.01556
DG(18:1_18:1) | 3.55322
DG(18:2_18:2) | 3.3407
LPC(18:1) | 3.03078
DG(16:0_18:2) | 2.92245
PC(18:1_18:1) | 2.85144
PC(18:2_18:2) | 2.72737
LPC(18:2) | 2.62014
DG(18:1_18:2) | 2.41255
TG(O-15:2_3:0_18:1) | 2.39061
TG(O-15:2_3:0_18:2) | 2.12276
TG(O-15:1_3:0_18:1) | 1.86038
PC(16:0_18:2) | 1.82321
DG(20:1_18:2) | 1.5171
PC(16:0_18:1) | 1.48814
PE(18:1_18:2) | 1.46958
PE(18:2_18:2) | 1.25052
LPC(16:0) | 1.13661
DG(16:0_18:1) | 1.06228
DG(18:2_18:3) | 1.03895

Table 16 Original dataset of key lipids of T190–16 vs T190–0
Lipid | VIP
TG(18:1_18:2_18:3) | 4.77099
PC(18:1_18:2) | 4.42799
DG(18:1_18:1) | 3.60536
PC(18:1_18:1) | 3.12497
PC(18:2_18:2) | 2.96329
DG(16:0_18:2) | 2.96019
LPC(18:1) | 2.9318
DG(18:2_18:2) | 2.75421
LPC(18:2) | 2.55924
DG(18:1_18:2) | 2.51243
TG(O-15:2_3:0_18:1) | 2.47433
PC(16:0_18:2) | 1.97345
TG(O-15:1_3:0_18:1) | 1.87377
TG(O-15:2_3:0_18:2) | 1.76486
PC(16:0_18:1) | 1.66473
PE(18:1_18:2) | 1.55161
DG(20:1_18:2) | 1.4004
PE(18:2_18:2) | 1.28218
DG(16:0_18:1) | 1.18271

Table 17 Random Forest Regression Hyperparameter Tuning Values
Volatile substances | ntree | mtry
2-Hydroxy-benzaldehyde | 500 | 2
Hexanal | 1000 | 3
[0074] The results showed that, in bitter buckwheat samples, combining LASSO regression with random forest regression on lipidomics data gave good predictive ability for two marker substances formed during storage, 2-Hydroxy-benzaldehyde and Hexanal. Specifically, the random forest model built from the key lipids screened by the LASSO regression model achieved an R² of 0.9362 for 2-Hydroxy-benzaldehyde content and an R² of 0.6023 for Hexanal content.
[0075] In summary, a key lipid screening method based on lipidomics and machine learning can be applied to the screening of key lipids in buckwheat samples during storage, and provides some insights into the formation of volatiles during storage.
Claims
1. A method for screening key lipids during buckwheat storage based on lipidomics and machine learning, characterized in that it comprises the following steps:
(1) Sample collection and preparation: Untreated and superheated-steam-treated buckwheat flour are subjected to accelerated oxidation storage in a stability test chamber at constant temperature and relative humidity for a set period of time.
(2) Acquisition of volatile compound data: The volatile compounds in the samples from step (1) are analyzed by GC-MS, the types and contents of volatile compounds in the buckwheat samples are identified and quantified, and the original dataset of volatile compounds is constructed.
(3) Screening of key volatile compounds: The OPLS-DA algorithm is used to process the original dataset of volatile compounds obtained in step (2), screen out key volatile compounds, and construct the original dataset of key volatile compounds.
(4) Lipidomics data acquisition: The samples collected in step (1) are analyzed by LC-MS to identify and quantify the types and contents of lipid molecules in the buckwheat samples and construct the original lipidomics dataset.
(5) Screening of key lipid molecules: Differential analysis is performed on the original lipidomics dataset from step (4) to obtain differentially expressed lipids. The OPLS-DA algorithm is used to process the differentially expressed lipid data and screen out key lipid molecules to construct the original dataset of key lipid molecules.
(6) Construction and application of the LASSO regression model: The key volatile compounds ranked highest by VIP in step (3) are selected as the response variables, and the key lipid molecule dataset screened by VIP in step (5) is used as the feature variables. The LASSO regression model is constructed, and the lipid molecules with non-zero coefficients in the model are extracted as the key lipid dataset related to the formation of key volatile compounds.
(7) Construction and application of the random forest regression model: The same response variables as in step (6) are used, and the key lipid dataset related to the formation of key volatile compounds from step (6) is used as the feature variables. A random forest regression model is constructed, the importance score of each lipid molecule is calculated, and the lipid molecules are ranked by importance score.
(8) Validation and evaluation: The random forest regression model constructed in step (7) is validated using the "hold-out method" to evaluate the correlation between key lipids and key volatile compounds and the stability of the screening results.
2. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the buckwheat sample in step (1) is whole grain flour obtained by milling after dehulling, and includes two varieties: sweet buckwheat and bitter buckwheat.
3. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the superheated-steam-treated buckwheat samples in step (1) are sweet buckwheat and bitter buckwheat, each subjected to superheated steam treatment separately.
4. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the superheated steam treatment in step (1) is carried out at temperatures of 150, 170 and 190 °C, with a treatment time of 10 s at each temperature.
5. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the volatile compounds in step (2) are separated using a GCMS-TQ8040 NX system with an SH-PolarD column and identified by comparison with the NIST 20 spectral library, and volatile compounds with a similarity of ≥ 80% are retained as the original dataset of volatile compounds.
6. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the OPLS-DA algorithm in step (3) uses VIP > 1 as the screening criterion to establish the original dataset of key volatile compounds.
7. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the lipidomics data in step (4) are acquired by UPLC-Q-Exactive-Orbitrap MS, and the lipid molecules are separated on a Syncronis C18 column.
8. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the identification and quantification results of lipid molecules in step (4) are obtained using LipidSearch V5.0 software, from which the original lipidomics dataset is established.
9. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the differentially expressed lipids in step (5) are screened using the FDR-corrected p-value of a t-test and the fold change (FC), and, after OPLS-DA analysis, lipid molecules with VIP > 1 form the original dataset of key lipid molecules.
10. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the response variables selected in step (6) are the quantitative results of the first two representative volatile compounds in the original dataset of key volatile compounds constructed in step (3).
11. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the feature variables selected in step (6) comprise the entire original dataset of key lipid molecules from step (5).
12. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the LASSO regression model in step (6) is established using 10-fold cross-validation to select the optimal penalty parameter λ, the selection criterion being the minimum mean squared error, after which the key lipid dataset related to the formation of key volatile compounds is established.
13. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the key parameter settings for the random forest regression model in step (7) include: the number of decision trees (ntree) is 500–2000, and the number of variables randomly selected at each node split (mtry) is chosen through hyperparameter tuning; the importance of key lipids is ranked by the percentage increase in model prediction error (%IncMSE).
14. The method for screening key lipids in buckwheat storage based on lipidomics and machine learning as described in claim 1, characterized in that the "hold-out method" in step (8) involves dividing the dataset into a training set (70%) and a test set (30%), and evaluating the model prediction performance on the training set and the test set respectively.
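As an illustration only, the 70%/30% hold-out evaluation described in claim 14 can be sketched as follows. The single-feature least-squares line is a deliberately simple stand-in for the random forest regression model, and the helper names, random seeds and synthetic data are assumptions rather than part of the claimed method.

```python
import random

def holdout_split(n_samples, train_frac=0.7, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n_samples))
    return idx[:cut], idx[cut:]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Stand-in model (assumption): ordinary least squares on one feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def evaluate_holdout(x, y, train_frac=0.7, seed=0):
    """Fit on the training set only; report R^2 on training and test sets."""
    train, test = holdout_split(len(x), train_frac, seed)
    intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])

    def score(indices):
        pred = [intercept + slope * x[i] for i in indices]
        return r_squared([y[i] for i in indices], pred)

    return score(train), score(test)

# Synthetic stand-in data (hypothetical): one "lipid" abundance and a response.
rng = random.Random(1)
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 + rng.gauss(0, 0.3) for xi in x]
train_idx, test_idx = holdout_split(len(x))
train_r2, test_r2 = evaluate_holdout(x, y)
print(len(train_idx), len(test_idx), train_r2, test_r2)
```

Comparing the training-set and test-set R² in this way gives a rough indication of overfitting: a large gap between the two suggests that the model does not generalize well to unseen samples.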