A hydrogen storage alloy performance prediction method based on feature engineering and multi-model screening
By combining multi-dimensional feature construction with multi-model fusion, the invention addresses the unstable feature selection seen in existing techniques for predicting the performance of solid-state hydrogen storage alloys. It achieves high-precision, globally optimal prediction of hydrogen storage alloy performance and improves the stability and prediction accuracy of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAN TECH UNIV
- Filing Date
- 2026-05-12
- Publication Date
- 2026-06-26
AI Technical Summary
Existing machine learning methods for predicting the performance of solid hydrogen storage alloys suffer from incomplete database dimensions and a single feature selection model, leading to unstable feature selection, inability to guarantee global optimality, and susceptibility of prediction results to randomness in data partitioning and overfitting.
By constructing a multi-dimensional feature set, we use various methods such as random forest Gini importance, XGBoost gain importance, SHAP value and Boruta algorithm to evaluate feature importance. We combine Pearson correlation coefficient to remove redundant features and use Gaussian process regression, gradient boosting regression and random forest for multi-model cross-validation to identify the globally optimal feature subset.
It significantly improves the stability and prediction accuracy of feature selection, with an average determination coefficient R² of 0.725, which is 90.8% and 55.9% higher than existing technologies, respectively, ensuring the reliability and accuracy of prediction results.
Smart Images

Figure CN122290835A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of solid-state hydrogen storage technology, specifically relating to a method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening.
Background Technology
Currently, the design of solid-state hydrogen storage alloys (such as rare-earth AB5 type, titanium-based AB type, AB2 type, vanadium-based solid solution type, magnesium-based hydrogen storage alloys, etc.) mainly relies on traditional "trial and error" methods or empirical criteria, such as screening materials by adjusting element ratios and observing changes in hydrogen storage capacity. This approach has a long development cycle and high cost, and makes it difficult to explore the impact of high-order interactions between multiple elements on performance.
[0002] With the development of machine learning technology, researchers have begun to explore its application in the prediction and design of the performance of solid-state hydrogen storage alloys. For example, patent document CN118888056A discloses a machine-learning-based method for constructing the structure-property relationship of solid-state hydrogen storage materials. Structural property data of solid-state hydrogen storage materials are extracted from the Materials Project database, and the contribution of each feature to the target variable is calculated from SHAP plots to screen key features. After mathematical transformations such as squaring, cubing, and exponentiation are applied to the features, models such as linear regression and random forest are established. Finally, a structure-property relationship expression involving the hydrogen bond length, radius, and electronegativity of solid-state hydrogen storage materials is constructed. However, this method has several drawbacks. First, the database information is incomplete: the structural features extracted from the Materials Project database comprise only 16 features across 5 categories, namely chemical formula, bonding elements, shortest hydrogen bond length, space group, density, formation energy, volume, total energy, atomic energy, number of atomic sites, relative convex hull energy, total magnetization, atomic experimental radius, atomic covalent radius, atomic machine learning radius, and electronegativity. Second, the method relies solely on SHAP for feature selection and lacks cross-validation across multiple methods, so feature selection may be unstable owing to the bias of a single model. Third, model evaluation relies on a single training-test split, making the evaluation results susceptible to the randomness of data partitioning. Finally, the method does not perform predictive validation on target properties related to actual hydrogen storage capacity.
As another example, patent document CN116092606B discloses a machine-learning-based design method for high-capacity hydrogen storage alloys. It calculates the weighted average and variance of three descriptors (atomic radius, electronegativity, and valence electron number) and the weighted average of bulk modulus, with respect to atomic percentage and mass percentage respectively. A prediction model is established using the XGBoost algorithm, and the alloy composition is optimized using a genetic algorithm, achieving a relative error of only 0.54% between the actual and predicted hydrogen storage capacity. However, feature selection in this method still relies on a single model, which can easily lead to unstable feature selection, and the model's goodness of fit R² is only 0.28.
[0003] Furthermore, the low-frequency element removal method reported in the literature (Zhou P, Shen H, Xu N, et al. Ambient-condition hydrogen storage: accelerating the development of high-capacity hydrides via a quantitative interpretable machine learning framework[J]. Energy Storage Materials, 2025: 104789.) simply discards low-frequency elements, leading to the loss of key information. It shares the problems described above and ultimately overfits (R²=0.953 on the training set, R²=0.767 on the test set). Another method, which combines recursive feature elimination with direct modeling of explicit features (Zhou P, Xiao X, Zhu X, et al. Machine learning enabled customization of performance-oriented hydrogen storage materials for fuel cell systems[J]. Energy Storage Materials, 2023, 63: 102964.), relies on a greedy algorithm for feature selection and therefore cannot escape local optima or guarantee a global optimum. Moreover, its explicit features enter the model directly without any selection, introducing redundant information.
[0004] In summary, existing machine learning methods for developing high-performance solid-state hydrogen storage materials generally suffer from incomplete database dimensions, single feature selection models, and unstable feature selection, failing to guarantee globally optimal prediction results. There is an urgent need to develop an optimal feature set selection method based on multi-model fusion. Through multi-model cross-validation, such a method can find the globally optimal feature-model combination, comprehensively improving the stability, reliability, and prediction accuracy of feature selection.
Summary of the Invention
[0005] This invention provides a feature screening and performance prediction method for hydrogen storage alloys based on multi-model fusion to overcome the problems in existing solid hydrogen storage alloy machine learning prediction methods, such as limited feature construction dimensions, reliance on a single model for feature screening, and neglect of multicollinearity among features, which prevent the guarantee of global optimization.
[0006] To achieve the above objectives, the technical solution provided by this invention is: a method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening, comprising the following steps:
Step 1: Constructing the physicochemical feature set of the alloy: Collect compositional data of solid hydrogen storage alloys, including the target properties of the alloys and the corresponding alloy feature vectors, to form the original dataset. The physicochemical properties of each element are retrieved from the element feature library, including four newly introduced categories of element features: electronic structure, thermodynamics, physical and mechanical properties, and periodic law. For each alloy, the weighted average of each property is calculated based on the atomic percentage, which serves as the alloy's feature vector, forming an alloy-level feature matrix. The calculation formula is as follows:
$$\bar{P} = \sum_{i=1}^{n} c_i P_i$$
[0007] where $n$ is the number of element types in the alloy, $c_i$ is the atomic percentage of the $i$-th element (expressed as a fraction), and $P_i$ is the value of a specific physicochemical property of that element. The compositions of all alloys, the calculated feature vectors, and the set target performance are integrated into a single feature database file and then output.
Step 2: Multi-method feature importance analysis: First, load the feature database file generated in step one, using all feature columns from the feature database file as the input feature matrix $X$ and the target performance as the target variable $y$; then, fill missing values in the feature matrix $X$ with 0 and delete samples whose target variable $y$ is missing; finally, train multiple feature importance assessment methods on $X$ and $y$, and calculate the importance score of each feature to the prediction result under the different methods.
Step 3: Feature Union Extraction: Based on the importance scores of each feature under the different methods in step two, independently select the top K features in terms of importance in each method; take the union of the features selected by the four methods as the candidate feature set I.
Step 4: Feature selection based on importance-relevance joint criteria: First, extract the data columns corresponding to the features in the candidate feature set I from the feature database file in step one, fill missing values with 0, and calculate the Pearson correlation coefficient matrix R. Then, set a correlation threshold $\tau$, iterate through all feature pairs in the candidate feature set I, and check the absolute value $|r_{ij}|$ of their correlation coefficient in the correlation matrix: if $|r_{ij}|$ is below the threshold, both features of the pair are retained directly as candidate features; if $|r_{ij}|$ exceeds the threshold, the two features are further ranked by random forest importance, with the higher-scoring feature retained as a candidate feature and the lower-scoring one eliminated. Finally, after traversing all feature pairs, a new feature set II is obtained.
Step 5: Feature Subset Search and Multi-Model Cross-Validation.
Step 6: Determine the global optimal model and the optimal feature subset.
[0008] Furthermore, in step one above, the physicochemical property values include, but are not limited to: atomic number, relative atomic mass, atomic radius, covalent radius, electronegativity, total number of valence electrons, first ionization energy, melting point, density, and bulk modulus.
[0009] Furthermore, in step two above, multiple feature importance evaluation methods are used, including Gini importance of random forest, gain importance of XGBoost, SHAP value, and Boruta algorithm.
[0010] Furthermore, the specific steps of step five above are as follows: First, sort the features in the new feature set II obtained in step four in descending order of random forest importance score, and take the top M features as the feature pool III to be searched. Then, iterate through all possible non-empty subsets of feature pool III. For each feature subset, perform 5-fold cross-validation using three regression models and calculate each model's average coefficient of determination (R²) score and root mean square error on that feature subset. The three regression models are Gaussian process regression, gradient boosting regression, and random forest.
[0011] Furthermore, step six above specifically involves: based on the cross-validation results from step five, the model-feature subset combination with the highest R² score is selected as the optimal solution for predicting the target performance; if multiple combinations have similar R² scores, the one with fewer features or lower model complexity is preferred.
[0012] Furthermore, in step three above, the top K features in terms of importance are selected, with K preferably being 15 to 25.
[0013] Furthermore, in step four above, a correlation threshold is set, preferably in the range of 0.7 to 0.9.
[0014] Furthermore, in step five above, the first M features are selected as the feature pool to be searched, preferably M is 8 to 12.
[0015] Compared with the prior art, the beneficial effects of the present invention are: (1) This invention introduces four dimensions of solid hydrogen storage alloy characteristics, namely electronic structure, thermodynamics, physical mechanics and periodicity, to clarify the key role of electronic structure (such as the proportion of orbital valence electrons), thermodynamics (such as melting enthalpy, atomization enthalpy, etc.), and physical mechanics (such as bulk modulus, hardness, etc.) in predicting hydrogen storage capacity, and provides a more comprehensive idea for feature construction and feature selection basis for subsequent machine learning research and development of hydrogen storage alloys.
[0016] (2) This invention evaluates feature importance independently with four methods: random forest Gini importance, XGBoost gain importance, SHAP value, and Boruta algorithm. The union of the top-ranked features from each method is taken; after a second ranking by random forest importance, the less important feature in each highly correlated feature pair is removed using the Pearson correlation coefficient threshold. Then all non-empty subsets ($2^M - 1$ in total) are traversed and cross-validated using three regression models. Compared with existing technologies, this invention effectively identifies features that are important under different methods through cross-validation of multiple feature importance assessment methods, making the feature selection results more objective and reliable. At the same time, it eliminates multicollinearity among features through correlation-based redundancy removal, and avoids the risk of local optima by replacing the greedy algorithm with a full subset search, significantly improving the stability, reliability, and global optimality of feature selection.
[0017] (3) Using the optimal model and optimal feature subset obtained by the multi-dimensional feature construction and multi-model joint screening of this invention, the hydrogen release capacity of solid hydrogen storage alloys at room temperature is predicted, and the average coefficient of determination R² reaches 0.725. Compared with the 8 features in the patent with publication number CN116092606B (R²=0.380) and the 78 unfiltered full features (R²=0.465), the prediction accuracy of the present invention is improved by 90.8% and 55.9%, respectively, which verifies the significant advantage of the present invention in high-precision prediction.
Attached Figure Description
[0018] Figure 1 is an overall flowchart of the method of the present invention;
Figure 2 is a comparison chart of the feature importance rankings across multiple methods, in which Figure 2(a) is the feature importance graph of the XGBoost model, Figure 2(b) is the feature importance graph of the Random Forest model, Figure 2(c) is the feature importance graph of the SHAP method, and Figure 2(d) is the feature importance graph of the Boruta method;
Figure 3 shows the prediction performance for different models and feature subsets;
Figure 4 shows the prediction performance for different feature subsets.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of the embodiments of this invention will be described in more detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this invention. Any other embodiments obtained by those skilled in the art through similar improvements and adjustments based on the content of this invention without inventive effort are considered to be within the scope of protection of this invention.
[0020] This embodiment uses V-Ti-Fe hydrogen storage alloys as the object of analysis to describe in detail the method provided by the present invention.
[0021] Step 1: Constructing the physicochemical feature set of the alloy: The original dataset was constructed by collecting the target performance (solid-state hydrogen storage and dehydrogenation capacity) of 85 V-Ti-Fe hydrogen storage alloys at 20-30℃ from publicly available literature and experimental data.
[0022] The physicochemical properties of each element are read from the element feature library, including 78 physicochemical properties newly added in this invention, such as electronic structure, thermodynamics, physical and mechanical properties, and periodic law. The weighted average of each property is calculated based on atomic percentage, serving as the feature vector of the alloy and forming the alloy-level feature matrix. The calculation formula is as follows:
$$\bar{P} = \sum_{i=1}^{n} c_i P_i$$
[0023] where $n$ is the number of element types in the alloy, $c_i$ is the atomic percentage of the $i$-th element in the alloy (expressed as a fraction), and $P_i$ is the physicochemical property value of that element.
[0024] After calculation, a total of 85 samples and 78 features were obtained. The 85 alloy compositions, hydrogen release capacities and their respective 78 feature vectors were integrated into a feature database file and output.
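The composition-weighted averaging of Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `alloy_features`, the two-property library, and the example composition are assumptions made here, and the real element feature library holds 78 properties.

```python
# Minimal sketch of Step 1 (composition-weighted feature averaging).
# ELEMENT_PROPERTIES is a tiny stand-in for the element feature library;
# only atomic number and Pauling electronegativity are included.
ELEMENT_PROPERTIES = {
    "Ti": {"atomic_number": 22, "electronegativity": 1.54},
    "V": {"atomic_number": 23, "electronegativity": 1.63},
    "Fe": {"atomic_number": 26, "electronegativity": 1.83},
}


def alloy_features(composition, library=ELEMENT_PROPERTIES):
    """Return the composition-weighted average of every property.

    `composition` maps element symbol -> atomic percentage (or fraction);
    the amounts are normalised so that the weights sum to 1.
    """
    total = sum(composition.values())
    if total <= 0:
        raise ValueError("composition must contain positive amounts")
    property_names = sorted({p for el in composition for p in library[el]})
    return {
        name: sum(amount / total * library[el][name]
                  for el, amount in composition.items())
        for name in property_names
    }


# Hypothetical V-Ti-Fe composition in atomic percent (not from the dataset).
example = alloy_features({"Ti": 30, "V": 60, "Fe": 10})
print(example)
```

Normalising by the total lets the same function accept atomic percentages or fractions, which is a convenience choice of this sketch.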
[0025] Step 2: Multi-method feature importance analysis: First, load the feature database file generated in step one, using all feature columns from the feature database file as the input feature matrix $X$ (some features are shown in Table 1) and the target performance as the target variable $y$. In this embodiment, the target performance is the amount of hydrogen released at room temperature.
[0026] Table 1. Partial Characteristic Matrix of V-Ti-Fe Hydrogen Storage Alloys
[0027] Then, missing values in the feature matrix are filled with 0, and samples with missing target values are deleted.
[0028] Finally, four feature importance evaluation methods are trained on the feature matrix $X$ and target variable $y$, and the contribution of each feature to the prediction result under each method is calculated, i.e., an importance score is obtained for each of the 78 features. The four feature importance evaluation methods are: Gini importance of random forest, gain importance of XGBoost, SHAP value, and Boruta algorithm. The importance rankings under these four methods are shown in Figure 2, which presents the top 20 key features identified by each of the four methods on the Ti-V-based solid-state hydrogen storage alloy dataset. Figure 2(a) shows that the top 5 important features identified by XGBoost are density, atomic radius, s orbital valence electron ratio, first ionization energy, and covalent radius. Figure 2(b) shows that the top 5 important features identified by Random Forest are atomic radius, first ionization energy, enthalpy of fusion, melting point, and first electron affinity. Figure 2(c) shows that the top-ranked features identified by SHAP are atomic radius, first ionization energy, enthalpy of fusion, melting point, first electron affinity, and nuclear radius. Figure 2(d) shows that the top 5 important features identified by Boruta are atomic radius, enthalpy of fusion, first electron affinity, first ionization energy, and melting point. The newly introduced electronic structure, thermodynamic, and physical-mechanical features of solid-state hydrogen storage alloys thus make significant contributions to predicting their room-temperature hydrogen desorption performance.
[0029] The importance scores of some features are shown in Table 2. The importance of each feature is output as a file.
[0030] Table 2 Summary of Importance Scores for Some Features
[0031] Step 3: Feature Union Extraction and Correlation Analysis: Based on the importance scores of each feature under the four methods in step two, the top K features in terms of importance are independently selected for each method. In this embodiment, K=20. The union of the features selected by the four methods is taken as the candidate feature set I, resulting in a total of 25 candidate features.
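The union step can be illustrated with a short sketch. It assumes the four methods have already produced one score per feature; the scores below are invented toy values, and the helper names `top_k` and `candidate_union` are hypothetical. The embodiment uses K=20 on 78 features, whereas K=2 is used here only to keep the example readable.

```python
# Minimal sketch of Step 3: per-method top-K selection followed by a union.
def top_k(scores, k):
    """Names of the k highest-scoring features, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def candidate_union(scores_by_method, k):
    """Union of each method's top-k features, in first-seen order."""
    union = []
    for scores in scores_by_method.values():
        for name in top_k(scores, k):
            if name not in union:
                union.append(name)
    return union


# Toy importance scores (hypothetical, NOT results from the patent).
toy_scores = {
    "rf_gini": {"atomic_radius": 0.40, "density": 0.30,
                "melting_point": 0.20, "hardness": 0.10},
    "xgb_gain": {"atomic_radius": 0.30, "density": 0.50,
                 "melting_point": 0.05, "hardness": 0.15},
    "shap": {"atomic_radius": 0.35, "density": 0.20,
             "melting_point": 0.30, "hardness": 0.15},
    "boruta": {"atomic_radius": 0.50, "density": 0.30,
               "melting_point": 0.12, "hardness": 0.08},
}
candidate_set_I = candidate_union(toy_scores, k=2)
print(candidate_set_I)
```

Because the methods largely agree on the leading features, the union is much smaller than 4×K, which matches the embodiment's 25 candidates from four top-20 lists.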
[0032] Step 4: Feature selection based on importance-relevance joint criteria: First, extract the data columns corresponding to the features in the candidate feature set I from the feature database file in step one, fill missing values with 0, and calculate the Pearson correlation coefficient matrix R. Then, set the correlation threshold $\tau$, iterate through all feature pairs formed by the 25 candidate features in the candidate feature set I, call the df.corr() method to calculate the correlation coefficient between the features, and check the absolute value $|r_{ij}|$ of the correlation coefficient in the correlation matrix: if $|r_{ij}|$ is below the threshold, both features of the pair are retained directly as candidate features; if $|r_{ij}|$ exceeds the threshold, feature $i$ and feature $j$ are highly correlated, and the two are further ranked by random forest importance, with the higher-scoring feature retained as a candidate feature and the lower-scoring one eliminated.
[0033] After traversing all feature pairs, 10 redundant features are removed from the initial 25 candidate features, resulting in a new feature set II containing 15 features.
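The pairwise redundancy filter of Step 4 might look like the sketch below. Several points are assumptions rather than details from the text: the threshold value 0.8 (chosen from the stated preferred range of 0.7 to 0.9), the strict `>` comparison, the rule of skipping features that were already eliminated, and the use of precomputed importance scores in place of an actual random forest fit. The column data are toy values.

```python
# Minimal sketch of Step 4: drop the less important feature of each highly
# correlated pair. Pure-Python Pearson correlation stands in for df.corr().
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (norm_x * norm_y)


def drop_redundant(columns, importance, threshold=0.8):
    """Keep features whose highly correlated partners score lower."""
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue  # assumption: an eliminated feature is not re-compared
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(a if importance[a] < importance[b] else b)
    return [name for name in columns if name not in dropped]


# Toy data (hypothetical): covalent_radius nearly duplicates atomic_radius.
toy_columns = {
    "atomic_radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "covalent_radius": [2.0, 4.0, 6.0, 8.0, 10.5],
    "density": [5.0, 1.0, 4.0, 2.0, 3.0],
}
toy_importance = {"atomic_radius": 0.5, "covalent_radius": 0.2, "density": 0.3}
feature_set_II = drop_redundant(toy_columns, toy_importance)
print(feature_set_II)
```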
[0034] Step 5: Feature Subset Search and Multi-Model Cross-Validation: The features in the new feature set II obtained in step four are sorted in descending order of random forest importance score, and the top M features are taken as the feature pool III to be searched. In this embodiment, M=10.
[0035] Iterate through all possible non-empty subsets of feature pool III ($2^{10} - 1 = 1023$ feature subsets in total, representing all combinations of 1 to 10 features). For each feature subset, 5-fold cross-validation is performed using three regression models: Random Forest, Gaussian Process Regression, and Gradient Boosting. Each model's average coefficient of determination (R²) score and root mean square error on that feature subset are calculated.
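A minimal sketch of the exhaustive subset search with scikit-learn is given below. The data are synthetic stand-ins, the function name `search_subsets` is hypothetical, and the model hyperparameters (`alpha`, `n_estimators`, the default GPR kernel) are assumptions, since the text does not specify them. Three features are used instead of ten so that the example runs quickly.

```python
# Minimal sketch of Step 5: score every non-empty feature subset with three
# regressors using 5-fold cross-validation.
from itertools import combinations

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import KFold, cross_validate


def search_subsets(X, y, feature_names, models, n_splits=5, seed=0):
    """Return one result row per (feature subset, model) combination."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for size in range(1, len(feature_names) + 1):
        for idx in combinations(range(len(feature_names)), size):
            X_sub = X[:, list(idx)]
            for model_name, model in models.items():
                cv_out = cross_validate(
                    model, X_sub, y, cv=cv,
                    scoring=("r2", "neg_root_mean_squared_error"),
                )
                results.append({
                    "model": model_name,
                    "features": tuple(feature_names[i] for i in idx),
                    "r2": float(np.mean(cv_out["test_r2"])),
                    # scikit-learn reports RMSE negated; flip the sign back.
                    "rmse": float(-np.mean(
                        cv_out["test_neg_root_mean_squared_error"])),
                })
    return results


# Synthetic stand-in data (NOT the patent's dataset): 40 samples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng.normal(size=40)

models_demo = {
    "gpr": GaussianProcessRegressor(normalize_y=True, alpha=1e-2),
    "gbr": GradientBoostingRegressor(n_estimators=30, random_state=0),
    "rf": RandomForestRegressor(n_estimators=30, random_state=0),
}
results_demo = search_subsets(X_demo, y_demo, ["f1", "f2", "f3"], models_demo)
print(len(results_demo))  # 7 subsets x 3 models = 21 rows
```

The cost grows as $2^M - 1$ subsets times the number of models and folds, which is presumably why the pool is capped at around 10 features before the search.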
[0036] Step Six: Determining the Global Optimal Model and Optimal Feature Subset: Based on the cross-validation results from step five, the model-feature subset combination with the highest R² score is selected as the optimal solution for predicting the target performance. In this embodiment, when Gaussian process regression is used to predict the hydrogen release at 20-30℃ from the feature subset {atomic radius, density, nuclear radius}, the average R² reaches 0.725, which is the global optimum. This subset contains only 3 features, resulting in low model complexity and good generalization performance. Figure 3 presents the trends in the optimal R² of the three models (Gaussian Process Regression, Gradient Boosting, and Random Forest) for predicting solid-state hydrogen storage performance as the number of features varies.
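The selection rule, highest R² with a preference for fewer features when scores are similar, can be sketched as below. The tolerance `tol` that defines "similar" is an assumption (the text gives no numeric value), the function name `select_best` is hypothetical, and the result rows are toy values rather than results from the embodiment.

```python
# Minimal sketch of Step 6: choose the best model-feature combination,
# preferring fewer features among near-ties.
def select_best(results, tol=0.01):
    """Pick the row with the fewest features among those within tol of the top R2."""
    best_r2 = max(row["r2"] for row in results)
    near_best = [row for row in results if best_r2 - row["r2"] <= tol]
    # Fewer features first; among equal sizes, the higher R2 wins.
    return min(near_best, key=lambda row: (len(row["features"]), -row["r2"]))


# Toy cross-validation summary (hypothetical values).
toy_results = [
    {"model": "rf", "features": ("f1", "f2", "f3", "f4"), "r2": 0.705},
    {"model": "gpr", "features": ("f1", "f2"), "r2": 0.700},
    {"model": "gbr", "features": ("f1",), "r2": 0.500},
]
best = select_best(toy_results)
print(best["model"], best["features"])
```

Here the two-feature combination is chosen even though the four-feature one scores marginally higher, because the difference falls inside the tolerance.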
[0037] To verify the effectiveness of the feature selection and subset search method in this invention, a Gaussian process regression model was used to compare and evaluate the prediction performance of three different feature sets on the same dataset. First, the feature database file generated in step 1) was loaded, containing 85 samples. The following three feature sets were then constructed: Feature set A: The eight features disclosed in the patent document with authorization number CN116092606B; Feature set B: All 78 physicochemical features constructed in step 1), without any screening; Feature set C: The optimal feature subset determined in step 6), namely {atomic radius, density, atomic nucleus radius}.
[0038] A Gaussian process regression model was used for performance evaluation, with 5-fold cross-validation repeated 10 times. The average R² score and its standard deviation were calculated for each feature set. The evaluation results are shown in Figure 4, which compares the R² prediction performance of three feature subsets under the same Gaussian process regression model for the room-temperature hydrogen desorption capacity of solid-state hydrogen storage after 10 repeated 5-fold cross-validations. The three feature subsets are: the eight physicochemical features (combined features calculated based on atomic radius, electronegativity, total valence electrons, and bulk modulus) used in the patent document with authorization number CN116092606B; all physicochemical features covered in the feature library constructed in this invention; and the optimal combination of predictive features obtained through the feature selection process shown in Figure 3.
[0039] The final results show that the average R² of feature set A (the 8 features in the patent with publication number CN116092606B) is 0.380; the average R² of feature set B (78 unfiltered features) increases to 0.465; and the average R² of feature set C (the best 3 features) further improves to 0.725, the highest among the three, with the best stability as well.
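The 90.8% and 55.9% improvements quoted in this document follow directly from these three R² values when the gain is taken relative to each baseline. A short check, with `relative_gain` as a hypothetical helper name:

```python
# Relative improvement of feature set C (R2 = 0.725) over the two baselines.
def relative_gain(new, baseline):
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


gain_vs_a = relative_gain(0.725, 0.380)  # versus the 8 features of set A
gain_vs_b = relative_gain(0.725, 0.465)  # versus the 78 features of set B
print(f"{gain_vs_a:.1f}% {gain_vs_b:.1f}%")  # 90.8% 55.9%
```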
[0040] The above comparison results confirm that the feature engineering and screening method proposed in this invention can effectively identify key physicochemical features, and can significantly improve prediction accuracy and model stability while significantly reducing feature dimensionality.
[0041] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. A method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening, characterized in that it includes the following steps:
Step 1: Constructing the physicochemical feature set of the alloy: Collect compositional data of solid hydrogen storage alloys, including the target properties of the alloys and the corresponding alloy feature vectors, to form the original dataset; the physicochemical properties of each element are retrieved from the element feature library, including four newly introduced categories of element features: electronic structure, thermodynamics, physical and mechanical properties, and periodic law; for each alloy, the weighted average of each property is calculated based on the atomic percentage, which serves as the alloy's feature vector, forming an alloy-level feature matrix; the calculation formula is as follows:
$$\bar{P} = \sum_{i=1}^{n} c_i P_i$$
where $n$ is the number of element types in the alloy, $c_i$ is the atomic percentage of the $i$-th element (expressed as a fraction), and $P_i$ is the value of a specific physicochemical property of that element; the compositions of all alloys, the calculated feature vectors, and the set target performance are integrated into a single feature database file and then output;
Step 2: Multi-method feature importance analysis: First, load the feature database file generated in step one, using all feature columns from the feature database file as the input feature matrix $X$ and the target performance as the target variable $y$; then, fill missing values in the feature matrix $X$ with 0 and delete samples whose target variable $y$ is missing; finally, train multiple feature importance assessment methods on $X$ and $y$, and calculate the importance score of each feature to the prediction result under the different methods;
Step 3: Feature Union Extraction: Based on the importance scores of each feature under the different methods in step two, independently select the top K features in terms of importance in each method; take the union of the features selected by the four methods as the candidate feature set I;
Step 4: Feature selection based on importance-relevance joint criteria: First, extract the data columns corresponding to the features in the candidate feature set I from the feature database file in step one, fill missing values with 0, and calculate the Pearson correlation coefficient matrix R; then, set a correlation threshold $\tau$, iterate through all feature pairs in the candidate feature set I, and check the absolute value $|r_{ij}|$ of their correlation coefficient in the correlation matrix: if $|r_{ij}|$ is below the threshold, both features of the pair are retained directly as candidate features; if $|r_{ij}|$ exceeds the threshold, the two features are further ranked by random forest importance, with the higher-scoring feature retained as a candidate feature and the lower-scoring one eliminated; finally, after traversing all feature pairs, a new feature set II is obtained;
Step 5: Feature Subset Search and Multi-Model Cross-Validation;
Step 6: Determine the global optimal model and the optimal feature subset.
2. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: In step one, the physicochemical property values include, but are not limited to: atomic number, relative atomic mass, atomic radius, covalent radius, electronegativity, total number of valence electrons, first ionization energy, melting point, density, and bulk modulus.
3. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: In step two, multiple feature importance evaluation methods are used, including Gini importance of random forest, gain importance of XGBoost, SHAP value, and Boruta algorithm.
4. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: The specific steps of step five are as follows: First, sort the features in the new feature set II obtained in step four in descending order of random forest importance score, and take the top M features as the feature pool III to be searched. Then, iterate through all possible non-empty subsets of feature pool III. For each feature subset, perform 5-fold cross-validation using three regression models and calculate each model's average coefficient of determination (R²) score and root mean square error on that feature subset; the three regression models are Gaussian process regression, gradient boosting regression, and random forest.
5. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: Step six specifically involves: based on the cross-validation results from step five, the model-feature subset combination with the highest R² score is selected as the optimal solution for predicting the target performance; if multiple combinations have similar R² scores, the one with fewer features or lower model complexity is preferred.
6. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: In step three, the top K features in terms of importance are selected, with K preferably being 15 to 25.
7. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: In step four, a correlation threshold is set, preferably in the range of 0.7 to 0.9.
8. The method for predicting the performance of hydrogen storage alloys based on feature engineering and multi-model screening according to claim 1, characterized in that: In step five, the first M features are selected as the feature pool to be searched, preferably M is 8 to 12.