Credit risk prediction method and device based on adaptive stacking integrated model, electronic equipment and storage medium
By using an adaptive Stacking ensemble model, which leverages K-Means clustering and logistic regression weighted fusion, the problems of existing models in selecting base model combinations and class imbalance are solved, achieving high-precision and stable credit risk prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LIAOCHENG UNIV
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing credit risk prediction models struggle to adaptively select the optimal base model combination, handle class imbalance issues, and lack generalization ability during macroeconomic fluctuations.
An adaptive Stacking ensemble model is adopted, which generates a subset of datasets through K-Means clustering. It combines base models such as logistic regression, decision tree, support vector machine, random forest and convolutional neural network, and uses the validation set to select the optimal combination of base models. A two-stage Stacking architecture is constructed to perform weighted fusion with logistic regression as the meta-classifier.
It significantly improves the accuracy and robustness of credit risk prediction, can adaptively match the data characteristics of different risk subgroups, and enhances the model's generalization ability and prediction stability.
Figure CN122243628A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of financial risk assessment technology, specifically to a credit risk prediction method, device, electronic device, and storage medium based on an adaptive Stacking ensemble model.
Background Technology
[0002] Credit risk forecasting is a crucial aspect of financial risk management, and accurate forecasting is essential for maintaining financial market stability and optimizing resource allocation. With global economic integration, credit risk has become increasingly complex and dynamic, highlighting the significant limitations of traditional methods. Traditional statistical credit risk assessment methods (such as logistic regression and linear discriminant analysis) impose stringent assumptions on data distribution, struggle to handle complex nonlinear relationships, and have limited modeling capabilities. While single machine learning models (such as decision trees, support vector machines, and random forests) improve prediction accuracy to some extent, they are susceptible to data noise, class imbalance, and overfitting, resulting in significantly reduced generalization ability during macroeconomic fluctuations. Furthermore, existing ensemble learning methods (such as Bagging, Boosting, and Stacking) typically employ fixed combinations of base classifiers, making it difficult to adapt to the heterogeneity of customer groups with different risk characteristics, leading to insufficient sensitivity in identifying high-risk customers. Therefore, constructing a credit risk forecasting model that can adaptively select the optimal combination of base models, effectively handle class imbalance, and balance prediction accuracy with risk control capabilities has become a pressing technical challenge for those skilled in the art.
Summary of the Invention
[0003] This application provides a credit risk prediction method, apparatus, electronic device, and storage medium based on an adaptive Stacking ensemble model, which overcomes the shortcomings of existing technologies such as insufficient model adaptability, difficulty in handling class imbalance, and fixed-base classifier combinations.
[0004] The specific technical solution of this embodiment is as follows:
[0005] In a first aspect, embodiments of this application provide a credit risk prediction method based on an adaptive Stacking ensemble model, including:
[0006] S10. Obtain credit data and divide the obtained credit data into training set, validation set and test set;
[0007] S20. Cluster the default samples in the training set to generate several first subsets, cluster the non-default samples in the training set to generate several second subsets, and combine the first and second subsets in pairs to form several third subsets. Select logistic regression, decision tree, support vector machine, random forest and convolutional neural network as base models. Based on the non-empty combination of base models, use the validation set to calculate the prediction accuracy of each third subset, and select the base model combination with the highest accuracy as the optimal base model combination.
[0008] S301. For each optimal base model combination, generate the meta-feature matrix;
[0009] S302. Using a logistic regression model as a meta-classifier, train the meta-feature matrix by combining the original labels.
[0010] S303. Input the test set into each optimal base model combination to generate the predicted probability, which is used as the meta-feature. Then, perform weighted fusion through the trained logistic regression classifier to output the final credit risk prediction result.
[0011] In some embodiments, after acquiring the credit data but before dividing it into training, validation, and test sets, the following steps are also included:
[0012] S101. Standardize the acquired credit data and fill in missing values using the median.
[0013] In some embodiments, the number of first subsets generated by clustering the default samples in the training set is the same as the number of second subsets generated by clustering the non-default samples in the training set. This number is determined by the following step:
[0014] K10. Calculate the sum of squared intra-cluster errors (SSE) for different cluster numbers k. When the rate of decrease in SSE changes from steep to gradual, the k value corresponding to this inflection point is the optimal number of clusters. The formula for calculating SSE is:
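[0015] $$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$
[0016] Where X = {x1, x2, ..., xn} is a dataset containing n samples; C = {c1, c2, ..., ck} is the set of k cluster centers; ci represents the i-th cluster center; Si denotes the set of samples assigned to the i-th cluster; and k ≤ n.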
[0017] In some embodiments, after combining the first and second subsets pairwise to form several third subsets, and before calculating the prediction accuracy of each third subset using the validation set, the following steps are also included:
[0018] S201. Perform local oversampling on each third subset of data to increase the number of default samples within the third subset of data.
[0019] Local oversampling includes the following steps: randomly generating synthetic samples on the lines connecting minority class sample points and their neighbors; combining the datasets after clustering default samples and non-default samples into multiple data subsets; performing minority oversampling on each data subset containing default samples and non-default samples; and repeating the above process of generating synthetic samples until a predetermined number of oversampling samples is reached.
[0020] In some of these embodiments, the logistic regression model uses L1 regularization with a regularization strength of C = 0.1 × 10; the decision tree model uses information gain as the splitting criterion and has a maximum depth of 8; the random forest model has 100 trees; the support vector machine model uses a Gaussian kernel with a regularization parameter of C = 1.0; and the convolutional neural network model has a learning rate of 0.001, 32 convolutional kernels, a dropout value of 0.1, and no batch normalization is applied.
[0021] In some embodiments, generating the meta-feature matrix for each optimal base model combination includes the following steps:
[0022] S3011. Divide the training set into N mutually exclusive subsets;
[0023] S3012. Train the model using N-1 subsets in sequence, and generate prediction probabilities on the remaining 1 subset;
[0024] S3013. Stack the iterative prediction results to form a meta-feature matrix.
[0025] Secondly, embodiments of this application provide a credit risk prediction device based on an adaptive Stacking ensemble model, including a data acquisition and preprocessing module, a clustering and base model selection module, and an ensemble prediction module.
[0026] The data acquisition and preprocessing module is used to acquire credit data, preprocess the credit data, and divide it into training set, validation set and test set according to a preset ratio;
[0027] The clustering and base model selection module clusters default samples and non-default samples in the training set to generate multiple first and second subsets. The first and second subsets are then combined in pairs to form several third subsets. Logistic regression, decision tree, support vector machine, random forest, and convolutional neural network are selected as base models. Based on the non-empty combination of base models, the prediction accuracy of each third subset is calculated using the validation set. The base model combination with the highest accuracy is selected as the optimal base model combination.
[0028] The integrated prediction module generates a meta-feature matrix for each optimal base model combination; it uses a logistic regression model as a meta-classifier and combines the meta-feature matrix with the original labels for training; it inputs the test set into each optimal base model combination to generate prediction probabilities, which are used as meta-features, and then performs weighted fusion through the trained logistic regression classifier to output the final credit risk prediction result.
[0029] In some embodiments, the data acquisition and preprocessing module is also used to standardize the credit data, fill missing values with the median, and perform stratified random partitioning according to the ratio of training set, validation set, and test set = 6:2:2.
[0030] The clustering and base model selection module is also used to cluster default samples and non-default samples separately using the K-Means clustering algorithm, and determine the optimal number of clusters using the elbow rule; randomly combine the subsets of default samples and the subsets of non-default samples in pairs to form multiple data subsets; apply the SMOTE algorithm to perform local oversampling within each data subset; select logistic regression model, decision tree model, support vector machine model, random forest model, and convolutional neural network model as base models, generate 31 non-empty combinations through exhaustive search, and select the base model combination with the highest accuracy within each data subset;
[0031] The integrated prediction module is also used to perform 5-fold cross-validation on the selected optimal base model combination to generate a meta-feature matrix, which is then trained using a logistic regression model as the meta-classifier.
[0032] Thirdly, embodiments of this application provide an electronic device.
[0033] The electronic device includes a memory and a processor; the memory is used to store program code and transfer the program code to the processor;
[0034] The processor is used to execute the credit risk prediction method steps of any of the above embodiments according to the instructions in the program code.
[0035] Fourthly, embodiments of this application provide a computer-readable storage medium storing computer instructions. When the computer instructions are executed on an electronic device, the electronic device performs the steps of the credit risk prediction method according to any of the above embodiments.
[0036] Compared with the prior art, the embodiments of this application have the following beneficial effects:
[0037] The credit risk prediction method based on an adaptive Stacking ensemble model provided in this application achieves refined customer segmentation by clustering default and non-default samples separately using K-Means clustering. An exhaustive method is employed to dynamically optimize five heterogeneous base models, overcoming the limitations of fixed-base classifiers in traditional ensemble models. This allows the model to adaptively match the data characteristics of different risk subgroups, significantly improving prediction accuracy. A two-stage Stacking ensemble architecture is constructed: the first stage uses cross-validation to generate meta-features to suppress overfitting, and the second stage uses logistic regression as a meta-classifier for nonlinear fusion, further enhancing the model's generalization ability and robustness.
Attached Figure Description
[0038] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0039] Figure 1 is a flowchart illustrating a credit risk prediction method based on an adaptive Stacking ensemble model provided in some embodiments of this application;
[0040] Figure 2 is a schematic diagram of the structure of a credit risk prediction device based on an adaptive Stacking ensemble model provided in some embodiments of this application.
Detailed Implementation
[0041] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0042] The use of "applies to" or "configured to" in this application implies open and inclusive language, which does not exclude the applicability to or configuration to devices performing additional tasks or steps. Additionally, the use of "based on" implies openness and inclusivity, because processes, steps, calculations, or other actions "based on" one or more of the stated conditions or values may in practice be based on additional conditions or values beyond those stated.
[0043] In this application, the term "exemplary" is used to mean "used as an example, illustration, or description." Any embodiment described as "exemplary" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use this application. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that this application can be made without using these specific details. In other instances, well-known structures and processes are not described in detail to avoid obscuring the description of this application with unnecessary detail. Therefore, this application is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.
[0044] Firstly, referring to Figure 1, this application provides a credit risk prediction method based on an adaptive Stacking ensemble model, comprising the following steps:
[0045] S10. Obtain credit data and divide the obtained credit data into training set, validation set and test set.
[0046] In S10, when dividing credit data into training, validation and test sets, the ratio of the three can be set arbitrarily. For example, 7:2:1, 6:2:2, 5:3:2, etc. can be used.
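A minimal sketch of one way to perform such a split while keeping the default ratio the same in every part. The function name `stratified_split`, the rounding rule, and the toy label vector are illustrative assumptions and are not taken from the application; the class counts simply mirror those quoted later in Example 1.

```python
import numpy as np

def stratified_split(y, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return index arrays (train, val, test) that preserve class proportions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    parts = ([], [], [])
    for label in np.unique(y):
        # Shuffle the indices of one class, then cut them by the requested ratios.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(ratios[0] * len(idx)))
        n_val = int(round(ratios[1] * len(idx)))
        parts[0].append(idx[:n_train])
        parts[1].append(idx[n_train:n_train + n_val])
        parts[2].append(idx[n_train + n_val:])  # remainder goes to the test set
    return tuple(np.concatenate(p) for p in parts)

# Toy labels (assumption): 2995 non-default (0) and 50 default (1) records.
y = np.array([0] * 2995 + [1] * 50)
train_idx, val_idx, test_idx = stratified_split(y)
```

Splitting each class separately is what keeps the rare default class represented in all three parts; a plain random split could leave the validation or test set with very few defaults.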
[0047] S20. Cluster the defaulted samples in the training set to generate several first subsets, and cluster the non-defaulted samples in the training set to generate several second subsets. Combine the first and second subsets pairwise to form several third subsets. Select logistic regression, decision tree, support vector machine, random forest, and convolutional neural network as base models. Based on the non-empty combination of base models, calculate the prediction accuracy of each third subset using the validation set, and select the base model combination with the highest accuracy as the optimal base model combination.
[0048] In S20, the number of generated first subsets can be 1, 2, 3, or even more; similarly, the number of generated second subsets can also be 1, 2, 3, or even more. The number of generated first subsets and second subsets can be the same or different. Based on the number of generated first and second subsets, the number of third subsets can be determined. For example, if the number of first subsets is 2 and the number of second subsets is 4, then pairwise combinations of the first and second subsets will yield 2*4=8 third subsets. If both the number of first and second subsets are 3, then the number of third subsets is 3*3=9. Non-empty combinations of base models include single-model, dual-model, triple-model, quadruple-model, and quintuple-model combinations; there are 31 non-empty combinations of 5 base models. The prediction accuracy of each third subset is calculated using the validation set and determined by the AUC value. For each third subset, the base model combination with the highest prediction accuracy is obtained.
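The counting in this paragraph can be checked with a short sketch. The helper names and the cluster labels `D1`, `N1`, and so on are hypothetical placeholders used only for illustration.

```python
from itertools import combinations, product

BASE_MODELS = ["LR", "DT", "SVM", "RF", "CNN"]

def non_empty_combinations(models):
    """All non-empty subsets of the base models, smallest first."""
    return [c for r in range(1, len(models) + 1) for c in combinations(models, r)]

def pair_subsets(first_subsets, second_subsets):
    """Pair every default cluster with every non-default cluster."""
    return list(product(first_subsets, second_subsets))

combos = non_empty_combinations(BASE_MODELS)
third_subsets = pair_subsets(["D1", "D2"], ["N1", "N2", "N3", "N4"])
print(len(combos), len(third_subsets))  # 31 8
```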
[0049] S301. For each optimal base model combination, generate the meta-feature matrix.
[0050] S302. Using a logistic regression model as a meta-classifier, train the meta-feature matrix by combining the original labels.
[0051] S303. Input the test set into each optimal base model combination to generate the predicted probability, which is used as the meta-feature. Then, perform weighted fusion through the trained logistic regression classifier to output the final credit risk prediction result.
[0052] In the above embodiments, K-Means clustering is used to cluster default and non-default samples separately, achieving refined segmentation of the customer group. An exhaustive method is employed to dynamically optimize five heterogeneous base models, overcoming the limitations of traditional fixed-base classifiers in ensemble models. This allows the model to adaptively match the data characteristics of different risk subgroups, significantly improving prediction accuracy. A two-stage Stacking ensemble architecture is constructed. The first stage uses cross-validation to generate meta-features to suppress overfitting, while the second stage uses logistic regression as a meta-classifier for nonlinear fusion, further enhancing the model's generalization ability and robustness.
[0053] In some embodiments, in step S10, after acquiring the credit data and before dividing the acquired credit data into a training set, a validation set, and a test set, the following steps are also included:
[0054] S101. Standardize the acquired credit data and fill in missing values using the median.
[0055] In the above embodiments, the data is standardized by scaling the original data proportionally through mathematical transformations to eliminate the influence of differences in units, values, and scales between different features, thus converting the data to a uniform scale. Missing values are filled with the median to avoid information loss.
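As a rough illustration of this preprocessing, the sketch below fills missing values with the column median and then applies z-score standardization. The order of the two operations, the choice of z-score scaling, and the function name `impute_and_standardize` are assumptions; the application does not fix them.

```python
import numpy as np

def impute_and_standardize(X):
    """Fill NaNs with the column median, then z-score each column."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)      # medians computed while ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                    # guard against constant columns
    return (X - mean) / std

# Toy matrix (assumption): two features on very different scales, with gaps.
X_raw = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0], [np.nan, 300.0]])
X_clean = impute_and_standardize(X_raw)
```

In practice the medians, means, and standard deviations would usually be estimated on the training portion only and then reused for the validation and test portions.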
[0056] In some embodiments, the number of first subsets generated by clustering the default samples in the training set is the same as the number of second subsets generated by clustering the non-default samples in the training set. This number is determined by the following step:
[0057] K10. Calculate the sum of squared intra-cluster errors (SSE) for different cluster numbers k. When the rate of decrease in SSE changes from steep to gradual, the k value corresponding to this inflection point is the optimal number of clusters. The formula for calculating SSE is:
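[0058] $$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$
[0059] Where X = {x1, x2, ..., xn} is a dataset containing n samples; C = {c1, c2, ..., ck} is the set of k cluster centers; ci represents the i-th cluster center; Si denotes the set of samples assigned to the i-th cluster; and k ≤ n.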
[0060] By using the settings in the above embodiments, the elbow rule can be used to automatically determine the optimal number of clusters, reducing the subjectivity of manually setting the number of clusters, making the clustering results more consistent with the distribution characteristics of the data itself, and improving the rationality of subsequent base model selection.
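The elbow rule can be sketched as follows. The SSE helper follows the usual K-Means definition (squared distance of each sample to its assigned center). The inflection point is located here with a largest-second-difference heuristic, which is only one possible way to automate "steep to gradual" and is an assumption, as are the function names and the toy SSE values.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances from each sample to its assigned cluster center."""
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

def elbow_k(ks, sse_values):
    """Pick the k where the SSE curve bends most (largest second difference)."""
    s = np.asarray(sse_values, dtype=float)
    bend = s[:-2] - 2 * s[1:-1] + s[2:]   # second difference at interior k values
    return ks[1 + int(np.argmax(bend))]

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
print(sse(X, labels, centers))                            # 4.0

# Hypothetical SSE values for k = 2..5, made up for illustration only.
print(elbow_k([2, 3, 4, 5], [100.0, 40.0, 35.0, 32.0]))   # 3
```

Because the heuristic only scores interior points, the smallest and largest candidate k can never be selected; the candidate range should therefore extend one step beyond the values of interest.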
[0061] In some embodiments, after combining the first and second subsets pairwise to form several third subsets, and before calculating the prediction accuracy of each third subset using the validation set, the following steps are also included:
[0062] S201. Perform local oversampling on each third subset of data to increase the number of default samples within the third subset of data.
[0063] Local oversampling includes the following steps: randomly generating synthetic samples on the lines connecting minority class sample points and their neighbors; combining the datasets after clustering default samples and non-default samples into multiple data subsets; performing minority oversampling on each data subset containing default samples and non-default samples; and repeating the above process of generating synthetic samples until a predetermined number of oversampling samples is reached.
[0064] By using the above embodiments, the SMOTE algorithm is applied to perform local oversampling in each data subset, which increases the number of default samples in the subset and makes the ratio of default samples to non-default samples more balanced, thus alleviating the class imbalance problem.
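The interpolation step at the heart of SMOTE can be illustrated with a small NumPy sketch. This is a simplified stand-in rather than the reference SMOTE implementation; the function name `smote_like`, the neighbor count, and the toy points are assumptions.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)          # requires at least two minority samples
    # Pairwise distances; the diagonal is masked so a point is never its own neighbor.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(minority))            # pick a minority sample
        nb = minority[rng.choice(neighbors[i])]    # pick one of its k nearest neighbors
        gap = rng.random()                         # position on the connecting line, in [0, 1)
        synthetic[j] = minority[i] + gap * (nb - minority[i])
    return synthetic

# Toy default samples inside one data subset (assumption).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
```

Running the oversampling inside each data subset, rather than on the whole training set, means the synthetic defaults are interpolated only between defaults that K-Means already grouped together.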
[0065] In some of these embodiments, the logistic regression model uses L1 regularization with a regularization strength of C = 0.1 × 10; the decision tree model uses information gain as the splitting criterion and has a maximum depth of 8; the random forest model has 100 trees; the support vector machine model uses a Gaussian kernel with a regularization parameter of C = 1.0; and the convolutional neural network model has a learning rate of 0.001, 32 convolutional kernels, a dropout value of 0.1, and no batch normalization is applied.
[0066] Through the above embodiments, hyperparameters adapted to the credit risk prediction task are set for each base model. While ensuring that each base model fully explores the data features, overfitting of a single model is suppressed through regularization, Dropout and other methods, providing stable and reliable base prediction results for subsequent integration and fusion, and further ensuring the prediction stability and generalization ability of the final model.
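For orientation, the non-neural base models could be configured in scikit-learn roughly as below. This is a sketch under assumptions: the application does not name a library, `entropy` is used as the scikit-learn counterpart of information gain, the logistic regression `C` is a hypothetical placeholder because the value printed in the text is ambiguous, and the convolutional neural network is omitted because it would need a deep learning framework.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical placeholder: the source value for the LR regularization strength is unclear.
LR_C_PLACEHOLDER = 1.0

def make_base_models(seed=0):
    """Return unfitted base models keyed by the abbreviations used in the text."""
    return {
        # The liblinear solver supports the L1 penalty.
        "LR": LogisticRegression(penalty="l1", solver="liblinear", C=LR_C_PLACEHOLDER),
        # criterion="entropy" splits on information gain.
        "DT": DecisionTreeClassifier(criterion="entropy", max_depth=8, random_state=seed),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        # kernel="rbf" is the Gaussian kernel.
        "SVM": SVC(kernel="rbf", C=1.0),
    }

base_models = make_base_models()
```

Note that in scikit-learn `C` is the inverse of the regularization strength, and that an SVM needs probability estimates or a calibration step before its outputs can serve as prediction probabilities.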
[0067] In some embodiments, generating the meta-feature matrix for each optimal base model combination includes the following steps:
[0068] S3011. Divide the training set into N mutually exclusive subsets;
[0069] S3012. Train the model using N-1 subsets in sequence, and generate prediction probabilities on the remaining 1 subset;
[0070] S3013. Stack the iterative prediction results to form a meta-feature matrix.
[0071] By using the above-described embodiments and generating meta-features through K-fold cross-validation, overfitting can be effectively reduced while ensuring that the meta-features can fully reflect the predictive ability of the base model. This provides accurate and reliable input for the fusion training of the meta-classifier, further improving the generalization performance of the entire ensemble model.
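Steps S3011 to S3013 can be sketched with scikit-learn as follows. The use of stratified folds, the two stand-in base models, and the synthetic data are assumptions made for illustration; the application only requires N mutually exclusive subsets.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def out_of_fold_meta_features(models, X, y, n_splits=5, seed=0):
    """One column per base model; each row is predicted by a model that never saw it."""
    meta = np.zeros((len(X), len(models)))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, hold_idx in folds.split(X, y):
        for col, model in enumerate(models):
            # Train on N-1 folds, predict default probability on the held-out fold.
            fitted = clone(model).fit(X[train_idx], y[train_idx])
            meta[hold_idx, col] = fitted.predict_proba(X[hold_idx])[:, 1]
    return meta

# Synthetic data (assumption), used only to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

models = [LogisticRegression(), DecisionTreeClassifier(max_depth=3, random_state=0)]
Z = out_of_fold_meta_features(models, X, y)   # meta-feature matrix, shape (200, 2)
meta_clf = LogisticRegression().fit(Z, y)     # meta-classifier trained on meta-features
```

Because every row of `Z` comes from a model that was not trained on that row, the meta-classifier sees predictions of the same quality it will receive at test time, which is the overfitting safeguard this paragraph describes.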
[0072] Secondly, embodiments of this application provide a credit risk prediction device based on an adaptive Stacking ensemble model, including a data acquisition and preprocessing module, a clustering and base model selection module, and an ensemble prediction module.
[0073] The data acquisition and preprocessing module is used to acquire credit data, preprocess the credit data, and divide it into training set, validation set and test set according to a preset ratio;
[0074] The clustering and base model selection module clusters default samples and non-default samples in the training set to generate multiple first and second subsets. The first and second subsets are then combined in pairs to form several third subsets. Logistic regression, decision tree, support vector machine, random forest, and convolutional neural network are selected as base models. Based on the non-empty combination of base models, the prediction accuracy of each third subset is calculated using the validation set. The base model combination with the highest accuracy is selected as the optimal base model combination.
[0075] The integrated prediction module generates a meta-feature matrix for each optimal base model combination; it uses a logistic regression model as a meta-classifier and combines the meta-feature matrix with the original labels for training; it inputs the test set into each optimal base model combination to generate prediction probabilities, which are used as meta-features, and then performs weighted fusion through the trained logistic regression classifier to output the final credit risk prediction result.
[0076] Thirdly, embodiments of this application provide an electronic device, which includes: a memory and a processor; the memory is used to store program code and transmit the program code to the processor; the processor is used to execute the credit risk prediction method steps of any of the above embodiments according to the instructions in the program code.
[0077] Fourthly, embodiments of this application provide a computer-readable storage medium storing computer instructions. When the computer instructions are executed on an electronic device, the electronic device performs the credit risk prediction method steps described in any of the above embodiments.
[0078] Example 1
[0079] This embodiment provides a credit risk prediction method, which includes the following steps:
[0080] Data Acquisition and Preprocessing
[0081] The processing terminal (such as a server or computer) acquires credit data. In this embodiment, a total of 3045 customer records are included, of which 2995 are non-default samples and 50 are default samples.
[0082] Table 1. Statistical information of the accepted sample dataset
[0083]
[0084] Data preprocessing:
[0085] Continuous financial indicators were standardized to eliminate dimensional differences; missing values were filled with the median to avoid information loss. Then, stratified random sampling was used to partition the dataset according to a training set:validation set:test set ratio of 6:2:2, ensuring that the ratio of defaulting to non-defaulting samples in each subset was consistent with the original dataset. The partitioning results are shown in Table 2.
[0086] Table 2. Results of the partitioning of training, validation, and test sets.
[0087]
[0088] Clustering and Base Model Optimization
[0089] The K-Means algorithm was applied to cluster defaulted and non-defaulted samples in the training set. The optimal number of clusters was determined using the elbow rule: the sum of squared errors (SSE) within each cluster was calculated from k=2 to k=5. When k=3, the rate of decrease in SSE changed from steep to gradual, showing a clear inflection point; therefore, the optimal number of clusters was determined to be 3. Further verification using the silhouette coefficient showed that the silhouette coefficient reached 0.58 at k=3, higher than 0.42 at k=2 and 0.51 at k=4, confirming the reasonableness of the clustering.
[0090] The three subsets obtained by clustering the default samples and the three subsets obtained by clustering the non-default samples are combined pairwise (each default subset with each non-default subset) to form 3 × 3 = 9 data subsets. Within each data subset, the SMOTE algorithm is applied for local oversampling, increasing the number of default samples and balancing the ratio of default to non-default samples, thus mitigating the class imbalance problem.
[0091] Five models were selected as base models: Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), and Convolutional Neural Network (CNN). All non-empty combinations were generated using an exhaustive method, totaling 2^5 − 1 = 31 combinations (including single-model, dual-model, triple-model, quadruple-model, and quintuple-model combinations). Within each data subset, the prediction accuracy of each combination is calculated using the validation set (based on the AUC value). The base model combination with the highest accuracy is selected as the optimal base model combination for that subset, forming a heterogeneous base model pool.
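The exhaustive selection for one data subset might look like the sketch below. The application does not say how the predictions of a multi-model combination are merged before scoring, so averaging the predicted probabilities is an assumption here, as are the function names and the toy validation values.

```python
from itertools import combinations
import numpy as np

def auc(y_true, scores):
    """AUC as the chance that a random default outranks a random non-default (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))

def best_combination(val_probs, y_val):
    """val_probs maps model name -> validation probabilities; returns (best combo, its AUC)."""
    names = sorted(val_probs)
    best, best_auc = None, -1.0
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Assumption: a combination is scored on the mean of its members' probabilities.
            avg = np.mean([val_probs[m] for m in combo], axis=0)
            score = auc(y_val, avg)
            if score > best_auc:          # strict comparison keeps the smaller combo on ties
                best, best_auc = combo, score
    return best, best_auc

# Toy validation predictions for two models (hypothetical values).
y_val = [0, 0, 1, 1]
val_probs = {"LR": [0.1, 0.6, 0.4, 0.9], "DT": [0.5, 0.2, 0.7, 0.6]}
combo, score = best_combination(val_probs, y_val)
print(combo, score)  # ('DT',) 1.0
```

With five base models the loop visits all 31 combinations; repeating it for each of the nine data subsets yields one selected combination per subset.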
[0092] Two-Stage Stacking Integration
[0093] A two-stage Stacking ensemble model is constructed:
[0094] Stage 1: For each optimal base model combination, a meta-feature matrix is generated using 5-fold cross-validation. Specifically, the training set is divided into 5 mutually exclusive subsets, and the base models are trained sequentially using 4 of these subsets. Prediction probabilities are then generated on the remaining subset. The prediction results from the 5 iterations are stacked to form the meta-feature matrix.
[0095] Stage 2: A logistic regression (LR) model is used as the meta-classifier. It is trained on the meta-feature matrix together with the original labels and learns the nonlinear combination rules of the base model outputs.
[0096] The test set is input into each optimal base model combination to generate predicted probabilities, which are used as meta-features. These probabilities are then weighted and fused through a trained logistic regression meta-classifier to output the final credit risk prediction result.
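The weighted fusion performed by a trained logistic regression meta-classifier amounts to a sigmoid applied to a weighted sum of the base-model probabilities. The sketch below shows that computation with hypothetical weights, bias, and meta-features; in practice the weights come from training on the meta-feature matrix, and the 0.5 decision threshold is an assumption.

```python
import numpy as np

def fuse(meta_features, weights, bias):
    """Logistic-regression fusion: sigmoid of a weighted sum of base-model probabilities."""
    z = np.asarray(meta_features) @ np.asarray(weights) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Each row holds the default probabilities that three base models assign to one applicant.
meta_test = np.array([[0.9, 0.8, 0.7],
                      [0.1, 0.2, 0.1]])
weights = np.array([2.0, 1.5, 1.0])   # hypothetical learned weights
bias = -2.0                           # hypothetical learned intercept

p_default = fuse(meta_test, weights, bias)
labels = (p_default >= 0.5).astype(int)
print(labels)  # [1 0]
```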
[0097] Model Evaluation
[0098] The prediction results are evaluated using the following metrics: Accuracy, F1 score, AUC, Type I Error, Type II Error, and G-Mean. The formulas for each metric are as described above.
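The threshold-based metrics can be computed from the confusion matrix as sketched below. The definitions are assumptions, since the formulas are not reproduced in this text: default (label 1) is treated as the positive class, Type I error is taken as the share of non-default customers flagged as default, Type II error as the share of defaults that are missed, and G-Mean as the geometric mean of recall and specificity. AUC is left out because it requires scores rather than labels.

```python
import math

def evaluate(y_true, y_pred):
    """Confusion-matrix metrics with default (label 1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(a, b):
        return a / b if b else 0.0   # avoid division by zero on degenerate inputs

    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * precision * recall, precision + recall),
        "type_i_error": ratio(fp, fp + tn),    # non-defaults wrongly flagged (assumed definition)
        "type_ii_error": ratio(fn, fn + tp),   # defaults missed (assumed definition)
        "g_mean": math.sqrt(recall * specificity),
    }

# Toy labels and predictions (hypothetical).
metrics = evaluate([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
```

On heavily imbalanced credit data, accuracy alone can look high even when most defaults are missed, which is why the Type II error and G-Mean are reported alongside it.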
[0099] In this embodiment, the evaluation results obtained on the test set are shown in Table 3:
[0100] Table 3 Model Evaluation Results
[0101]
[0102] Compared with existing comparative models (including LGBM-GA-Stacking, TabNeT-Stacking, BS-Stacking, Stacking-MBCG, Stacking ML+DL, Bagging, XGBoost, Adaboost, RF, LR, DT, SVM, and CNN), the method in this embodiment performs best in all metrics, especially reducing the Type II error rate to 4.2%, which is significantly better than other models. The evaluation results of the 13 existing comparative models on the test set are shown in Table 4.
[0103] Table 4 Evaluation results of 13 existing comparative models
[0104]
[0105] Example 2: Robustness Test
[0106] To verify the model's generalization ability, cross-scenario validation was performed using the first-region and second-region credit datasets from the UCI Machine Learning Repository. The two datasets are described in Table 5.
[0107] Table 5 Description of the UCI Public Credit Dataset
[0108]
[0109] Given the missing customer rejection data in the UCI public credit dataset, this embodiment randomly partitions the datasets from the first and second regions to simulate the acceptance and rejection processes during credit approval. The datasets are divided into two subsets: a test set and a training set. Stratified random sampling is performed based on category labels, with the training set, validation set, and test set divided in a 6:2:2 ratio. A credit risk prediction model based on the adaptive Stacking ensemble model proposed in this invention is used on the training set to predict the default probability of all samples. The partitioning results of the public datasets from the first region are shown in Table 6, and the partitioning results of the public datasets from the second region are shown in Table 7.
[0110] Table 6. Division of Public Datasets in the First Region
[0111]
[0112] Table 7. Division of Public Datasets in the Second Region
[0113]
[0114] The data preprocessing and model building steps are the same as in the above embodiments, except that the number of clusters is adjusted according to the characteristics of each dataset.
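One way to adjust the number of clusters per dataset is the elbow rule on the intra-cluster sum of squared errors (SSE) referred to in the claims. The sketch below is only illustrative. It computes the SSE for a given clustering and picks the k at which the SSE curve bends most sharply. The "largest second difference" criterion, the function names, and the toy numbers are assumptions, since the embodiment does not state how the inflection point is located.

```python
def sse(points, centers, assign):
    """Sum of squared distances from each point to its assigned cluster center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(x, centers[a]))
        for x, a in zip(points, assign)
    )

def elbow_k(sse_by_k):
    """Pick the k where the SSE decrease changes most sharply.

    sse_by_k maps consecutive cluster counts k to their SSE. The bend at k is
    measured as (drop before k) minus (drop after k); this is an assumed heuristic.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Toy clustering (hypothetical): two tight groups with their centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
assign = [0, 0, 1, 1]
toy_sse = sse(points, centers, assign)

# Hypothetical SSE curve: steep decrease up to k = 3, nearly flat afterwards.
curve = {1: 100.0, 2: 60.0, 3: 20.0, 4: 17.0, 5: 15.0}
chosen_k = elbow_k(curve)
print(toy_sse, chosen_k)
```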
[0115] Table 8. Model Comparison Results Based on Public Datasets from the First Region
[0116]
[0117] On the first-region dataset, the model achieved an AUC of 0.915 and a Type II error rate of 12.8%; on the second-region dataset, the AUC reached 0.909 and the Type II error rate was 12.7%. The results demonstrate that the method of this invention maintains excellent predictive performance across data with different geographical locations, economic environments, and feature structures, exhibiting good generalization ability.
[0118] The predictive performance of the model for the first region's public dataset is shown in Table 8.
[0119] For the public dataset of the second region, the predictive performance of the model is shown in Table 9.
[0120] Table 9. Model Comparison Results Based on Public Datasets from the Second Region
[0121]
[0122] Example 3: Credit Risk Prediction Device
[0123] As shown in Figure 2, this embodiment provides a credit risk prediction device, including:
[0124] Data acquisition and preprocessing module 101: Used to acquire credit data, standardize the data, fill in missing values with the median, and divide the data into training set, validation set and test set in a 6:2:2 ratio.
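The preprocessing performed by module 101 can be illustrated for a single feature column. The sketch below fills missing values with the median of the observed values and then applies z-score standardization. Representing missing entries as `None`, using the population standard deviation, and the helper name `preprocess_column` are assumptions made for the example.

```python
from statistics import mean, median, pstdev

def preprocess_column(values):
    """Median-impute None entries, then z-score standardize one feature column."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    filled = [med if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)
    if sigma == 0:
        # Constant column: every standardized value is zero.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]

# Toy column (hypothetical) with one missing value.
col = [1.0, None, 3.0, 5.0]
z = preprocess_column(col)
print(z)
```

In this toy column the missing entry is replaced by the median 3.0, which equals the column mean after filling, so its standardized value is 0.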
[0125] Clustering and Base Model Selection Module 102: Used to cluster default and non-default samples separately using the K-Means algorithm, generate subsets, and combine them in pairs; to exhaustively enumerate combinations of the five base models (LR, DT, SVM, RF, CNN) within each subset and select the optimal base model combination based on validation-set accuracy; and to apply SMOTE oversampling within the subsets.
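The exhaustive search over base model combinations can be sketched as below. Five base models yield 2^5 - 1 = 31 non-empty combinations, and the combination with the highest validation accuracy is kept for each subset. The scoring callable and the toy accuracy values are placeholders for illustration only and are not results from the embodiment.

```python
from itertools import combinations

BASE_MODELS = ["LR", "DT", "SVM", "RF", "CNN"]

def nonempty_combinations(models):
    """All non-empty combinations of the base models, from size 1 to len(models)."""
    return [c for r in range(1, len(models) + 1) for c in combinations(models, r)]

def select_best(combos, validation_accuracy):
    """Return the combination with the highest validation accuracy.

    validation_accuracy is a callable mapping a combination to its accuracy
    on the validation set (hypothetical interface).
    """
    return max(combos, key=validation_accuracy)

combos = nonempty_combinations(BASE_MODELS)

# Placeholder scores for one data subset (made-up values, not measured results).
toy_scores = {("LR", "RF"): 0.91, ("LR", "RF", "CNN"): 0.93}
best = select_best(combos, lambda c: toy_scores.get(c, 0.5))
print(len(combos), best)
```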
[0126] Integrated prediction module 103: used to build a two-stage Stacking ensemble model. In the first stage, a meta-feature matrix is generated through 5-fold cross-validation. In the second stage, logistic regression is used as the meta-classifier to train and output the prediction results.
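The first stage of module 103 can be sketched as out-of-fold prediction: each base model is trained on four folds and predicts the held-out fold, so every training sample receives a prediction from a model that never saw it. The code below is a simplified illustration. The interleaved fold assignment, the `fit`/`predict_proba` interface, and the two toy models are assumptions standing in for the real base models.

```python
def out_of_fold_meta_features(X, y, model_factories, n_folds=5):
    """Build an n_samples x n_models matrix of out-of-fold default probabilities."""
    n = len(X)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]  # interleaved folds
    meta = [[0.0] * len(model_factories) for _ in range(n)]
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for j, make_model in enumerate(model_factories):
            model = make_model()  # fresh model for every fold
            model.fit(X_tr, y_tr)
            probs = model.predict_proba([X[i] for i in held_out])
            for i, p in zip(held_out, probs):
                meta[i][j] = p
    return meta

class PriorModel:
    """Toy stand-in: always predicts the default rate seen in training."""
    def fit(self, X, y):
        self.rate = sum(y) / len(y)
    def predict_proba(self, X):
        return [self.rate for _ in X]

class ThresholdModel:
    """Toy stand-in: high probability when feature 0 exceeds the training mean."""
    def fit(self, X, y):
        self.cut = sum(row[0] for row in X) / len(X)
    def predict_proba(self, X):
        return [0.9 if row[0] > self.cut else 0.1 for row in X]

X = [[float(i)] for i in range(10)]
y = [0] * 5 + [1] * 5
meta = out_of_fold_meta_features(X, y, [PriorModel, ThresholdModel])
print(meta[0], meta[9])
```

In the second stage, the rows of `meta` together with the original labels `y` would serve as the training data for the logistic regression meta-classifier.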
[0127] Model Evaluation Module 104: Used to evaluate prediction results using accuracy, F1 score, AUC, Type I error, Type II error, and G-Mean index.
[0128] The specific implementation methods of each module correspond to the method steps in Example 1.
[0129] Model accuracy evaluation includes the following metrics:
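[0130] The formula for calculating accuracy is: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$;
[0131] The formula for calculating the F1 score is: $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, where $\mathrm{Precision} = \frac{TP}{TP + FP}$ and $\mathrm{Recall} = \frac{TP}{TP + FN}$;
[0132] AUC refers to the area enclosed by the ROC curve and the coordinate axes, and is used to quantify the overall discriminative power of a model.
[0133] The formula for calculating Type I error is: $\mathrm{Type\ I\ Error} = \frac{FP}{FP + TN}$;
[0134] The formula for calculating Type II error is: $\mathrm{Type\ II\ Error} = \frac{FN}{FN + TP}$;
[0135] The formula for calculating G-Mean is: $\text{G-Mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$;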
[0136] Wherein, TP represents the number of samples that actually defaulted and were correctly predicted, TN represents the number of samples that actually did not default and were correctly predicted, FP represents the number of samples that actually did not default but were misjudged as defaulting, and FN represents the number of samples that actually defaulted but were misjudged as non-defaulting.
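The threshold-based metrics can be computed directly from the four confusion-matrix counts. The sketch below assumes the usual definitions with default as the positive class: Type I error is the share of non-default samples misjudged as defaulting, and Type II error is the share of default samples misjudged as non-defaulting. The counts are illustrative and are not results from the embodiment.

```python
from math import sqrt

def credit_metrics(tp, tn, fp, fn):
    """Accuracy, F1, Type I/II error and G-Mean from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate on default samples
    specificity = tn / (tn + fp)   # true negative rate on non-default samples
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "type_i_error": fp / (fp + tn),
        "type_ii_error": fn / (fn + tp),
        "g_mean": sqrt(recall * specificity),
    }

# Illustrative counts (hypothetical): 100 default and 100 non-default samples.
m = credit_metrics(tp=80, tn=90, fp=10, fn=20)
print(m)
```

AUC is omitted here because it depends on the ranking of predicted probabilities rather than on a single set of counts.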
[0137] Example 4: Electronic device and storage medium
[0138] This embodiment provides an electronic device, including a memory and a processor. The memory stores program code, and the processor executes the program code to implement the credit risk prediction method described in Embodiment 1 or 2.
[0139] This embodiment also provides a computer-readable storage medium storing computer instructions that, when executed on an electronic device, cause the electronic device to perform the method described in embodiment 1 or 2.
[0140] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A credit risk prediction method based on an adaptive Stacking ensemble model, characterized in that it comprises:
S10. Obtain credit data and divide the obtained credit data into a training set, a validation set, and a test set;
S20. Cluster the default samples in the training set to generate several first subsets, cluster the non-default samples in the training set to generate several second subsets, and combine the first and second subsets in pairs to form several third subsets. Select logistic regression, decision tree, support vector machine, random forest, and convolutional neural network as base models. For each non-empty combination of the base models, use the validation set to calculate the prediction accuracy on each third subset, and select the base model combination with the highest accuracy as the optimal base model combination;
S301. For each optimal base model combination, generate the meta-feature matrix;
S302. Using a logistic regression model as the meta-classifier, train it on the meta-feature matrix combined with the original labels;
S303. Input the test set into each optimal base model combination to generate predicted probabilities, which are used as meta-features. Then perform weighted fusion through the trained logistic regression classifier to output the final credit risk prediction result.
2. The credit risk prediction method based on the adaptive Stacking ensemble model according to claim 1, characterized in that, After acquiring credit data, but before dividing the acquired credit data into training, validation, and test sets, the following steps are also included: S101. Standardize the acquired credit data and fill in missing values using the median.
3. The credit risk prediction method based on the adaptive Stacking ensemble model according to claim 1, characterized in that, The number of default samples in the training set used to generate the first subset is the same as the number of non-default samples in the training set used to generate the second subset. The method for determining the number includes the following steps: K10. Calculate the sum of squared intra-cluster errors (SSE) for different cluster numbers k. When the rate of decrease in SSE changes from steep to gradual, the k value corresponding to this inflection point is the optimal number of clusters. The formula for calculating SSE is: $SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2$; Where X = {x1, x2, ..., xn} is a dataset containing n samples; C = {c1, c2, ..., ck} contains k clusters; C_i denotes the set of samples assigned to the i-th cluster; ci represents the i-th cluster center, k ≤ n.
4. The credit risk prediction method based on an adaptive Stacking ensemble model according to claim 1 or 3, characterized in that, After combining the first and second subsets pairwise to form several third subsets, and before calculating the prediction accuracy of each third subset using the validation set, the following steps are also included: S201. Perform local oversampling on each third subset of data to increase the number of default samples within the third subset of data. Local oversampling includes the following steps: randomly generating synthetic samples on the lines connecting minority class sample points and their neighbors; combining the datasets after clustering default samples and non-default samples into multiple data subsets; performing minority oversampling on each data subset containing default samples and non-default samples; and repeating the above process of generating synthetic samples until a predetermined number of oversampling samples is reached.
5. The credit risk prediction method based on the adaptive Stacking ensemble model according to claim 1, characterized in that, The logistic regression model uses L1 regularization with a regularization strength C = 0.1 × 10; the decision tree model uses information gain as the splitting criterion and has a maximum depth of 8; the random forest model has 100 trees; the support vector machine model uses a Gaussian kernel with a regularization parameter C = 1.0; the convolutional neural network model has a learning rate of 0.001, 32 convolutional kernels, a dropout value of 0.1, and no batch normalization is applied.
6. The credit risk prediction method based on the adaptive Stacking ensemble model according to claim 1, characterized in that, For each optimal combination of base models, generating the meta-feature matrix includes the following steps: S3011. Divide the training set into N mutually exclusive subsets; S3012. Train the model using N-1 subsets in sequence, and generate prediction probabilities on the remaining 1 subset; S3013. Stack the iterative prediction results to form a meta-feature matrix.
7. A credit risk prediction device based on an adaptive Stacking ensemble model, characterized in that it comprises:
A data acquisition and preprocessing module, used to acquire credit data, preprocess the credit data, and divide it into a training set, a validation set, and a test set according to a preset ratio;
A clustering and base model selection module, used to cluster the default samples and non-default samples in the training set to generate multiple first and second subsets, and to combine the first and second subsets pairwise to form several third subsets. Logistic regression, decision tree, support vector machine, random forest, and convolutional neural network are selected as base models. For each non-empty combination of the base models, the prediction accuracy on each third subset is calculated using the validation set, and the base model combination with the highest accuracy is selected as the optimal base model combination;
An integrated prediction module, used to generate a meta-feature matrix for each optimal base model combination. Using a logistic regression model as the meta-classifier, the meta-feature matrix is combined with the original labels for training. The test set is input into each optimal base model combination to generate predicted probabilities, which are used as meta-features. These probabilities are then weighted and fused through the trained logistic regression classifier to output the final credit risk prediction result.
8. The credit risk prediction device based on the adaptive Stacking ensemble model according to claim 7, characterized in that, The data acquisition and preprocessing module is also used to standardize credit data, fill missing values with the median, and perform stratified random partitioning according to the ratio of training set, validation set, and test set = 6:2:2. The clustering and base model selection module is also used to cluster default samples and non-default samples separately using the K-Means clustering algorithm, and determine the optimal number of clusters using the elbow rule; randomly combine the subsets of default samples and the subsets of non-default samples in pairs to form multiple data subsets; and apply the SMOTE algorithm to perform local oversampling within each data subset. Logistic regression, decision tree, support vector machine, random forest, and convolutional neural network are selected as base models, 31 non-empty combinations are generated by exhaustive search, and the base model combination with the highest accuracy is selected in each data subset. The integrated prediction module is also used to perform 5-fold cross-validation on the selected optimal base model combination to generate a meta-feature matrix, and train the logistic regression model as the meta-classifier.
9. An electronic device, characterized in that: The electronic device includes: a memory and a processor; the memory is used to store program code and transmit the program code to the processor; The processor is configured to execute the steps of the credit risk prediction method according to any one of claims 1-6, based on the instructions in the program code.
10. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores computer instructions, which, when executed on an electronic device, cause the electronic device to perform the steps of the credit risk prediction method according to any one of claims 1-6.