Microplastic prediction method based on fusion of feature engineering and meta-learning
By integrating feature engineering and meta-learning, and combining random forest and gradient boosting models, the problem of insufficient feature representation in microplastic remote sensing inversion is solved, thereby improving the accuracy and stability of microplastic prediction and making it suitable for microplastic detection in marine environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TIANJIN UNIV
- Filing Date
- 2025-07-22
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, single machine learning models are insufficient in feature representation in microplastic remote sensing inversion, making it difficult to capture complex interactions between features. They also have limited generalization ability and are sensitive to noise, failing to fully explore the correlation between prediction results of different models.
By combining feature engineering and meta-learning, a microplastic abundance detection base model is constructed through feature interaction using random forest and gradient boosting models. A meta-learner is then used to fuse the models, generating meta-features for microplastic detection. Iterative training is then performed to improve prediction accuracy.
By enhancing the model's ability to capture complex relationships through feature interaction and binning, the meta-learner integrates the advantages of multiple models, improving R2 on the test set by 10%-15%, enhancing robustness, reducing the risk of overfitting, and improving model stability and accuracy.
Figure CN120853757B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of machine learning technology and marine environment regression prediction technology, and in particular to a regression prediction method for marine microplastics that integrates feature engineering, random forest, gradient boosting tree and meta-learner.
Background Technology
[0002] Currently, remote sensing inversion technology for microplastics is still immature, with few case studies. In existing technologies, single machine learning models (such as random forests or gradient boosting trees) suffer from the following problems when processing high-dimensional, nonlinear data: 1. Insufficient feature representation ability: Traditional models rely on raw features, making it difficult to capture complex interactions between features. 2. Limited model generalization ability: The predictive performance of a single model is easily affected by hyperparameter selection and is sensitive to noise. 3. Insufficient utilization of meta-features: Existing methods do not fully explore the correlation between the prediction results of different models, making it difficult to further improve accuracy.
[0003] For example, some existing random forest-based prediction methods, while capable of handling high-dimensional data, fail to incorporate feature interaction terms, leading to insufficient modeling of nonlinear relationships. Furthermore, some existing techniques employ gradient boosting trees for prediction, but their fixed binning strategies do not dynamically adapt to the training set distribution, making them prone to bias on the test set. Current techniques for remote sensing inversion of microplastic abundance generally use a single model, which is susceptible to noise. Existing methods do not fully explore the correlations between the prediction results of different models.
[0004] How to improve feature engineering and model fusion strategies is the technical problem that this invention aims to solve.
Summary of the Invention
[0005] This invention aims to address the problem of insufficient feature representation capabilities in current technologies. It proposes a microplastic prediction method and system based on the fusion of feature engineering and meta-learning. By combining random forest and gradient boosting models, it captures the interaction relationships between features through feature interaction, thereby further improving the accuracy of microplastic abundance prediction.
[0006] This invention is achieved using the following technical solution:
[0007] This invention discloses a microplastic prediction method based on the fusion of feature engineering and meta-learning, the method comprising the following steps:
[0008] S1, obtain microplastic detection data, select and combine them through feature engineering, and extract microplastic detection features;
[0009] S2, Construct a microplastic abundance detection base model, input the microplastic detection features into the microplastic abundance detection base model, and calculate the microplastic abundance prediction result;
[0010] S3, the microplastic abundance detection base model is trained independently using the random forest algorithm and the gradient boosting algorithm respectively to obtain the corresponding microplastic abundance prediction random forest model and microplastic abundance prediction gradient boosting model.
[0011] S4. Construct a meta-learner, and fuse the microplastic abundance prediction random forest model and the microplastic abundance prediction gradient boosting model through the meta-learner to obtain new features for microplastic abundance detection. Perform inversion through cross-validation to obtain the prediction results of the microplastic abundance detection base model and generate microplastic detection meta-features.
[0012] S5. Construct a microplastic abundance detection meta-model, and use the microplastic detection meta-features to iteratively train the microplastic abundance detection meta-model to obtain optimized prediction results.
[0013] In some embodiments, S1 further includes:
[0014] Interaction terms are generated for microplastic detection features, and interval variable transformation is performed: The interaction terms generated according to the microplastic detection feature matrix are (B6+B7), B6, (B5+B6), B7, (B5+B7) and (B5+B7) / B4. B4-B7 are all 400nm-2200nm band data from environmental monitoring remote sensing data. B4 is the 545nm-565nm band, B5 is the 1230nm-1250nm band, B6 is the 1628-1652nm band, and B7 is the 2105nm-2135nm band. An interaction column of feature 1×feature 2 is generated, which is expanded to 7 dimensions.
[0015] The interval-type variables are dynamically binned, the optimal bin boundaries are calculated, the first column of features in the training set is selected and divided into five equal-width intervals, the values in the training set are assigned to the corresponding bins, and the bin boundaries are saved.
[0016] In some implementations, S3 further includes:
[0017] A random forest model is constructed, and the model is trained on the original sample dataset for microplastic detection. The importance of the input data features for microplastic detection is evaluated, and the features are selected as optimized input data features for microplastic detection based on their importance.
[0018] A gradient boosting model is constructed and trained on the original sample dataset for microplastic detection. The learner is trained using the least squares criterion, and the decision objective is to minimize the sum of squared residuals between the predicted and actual microplastic abundance detection results.
[0019] In some implementations, obtaining the prediction results and derived features of the microplastic abundance detection base model through cross-validation in step S4, and generating microplastic abundance detection meta-features, further includes:
[0020] The microplastic detection input data used to train the random forest model and the gradient boosting model is randomly and uniformly divided into 5 subsets. Five-fold cross-validation is performed to construct the meta-feature matrix [RF_oof, GB_oof, RF_oof×GB_oof, RF_oof², GB_oof²].
[0021] In some implementations, S5 further includes constructing a meta-model for microplastic abundance detection:
[0022] Standardize the meta-features and train the meta-model for microplastic abundance detection. The goal is to minimize the sum of squared residuals between the predicted and actual abundance results of microplastic detection, thereby obtaining the predicted microplastic abundance value.
[0023] In some implementations, S4 further includes constructing a meta-learner:
[0024] Predictions of the random forest model and the gradient boosting model for microplastic abundance detection are generated using 5-fold cross-validation and used as meta-features. Meta-feature standardization is performed, and the dimensions of the meta-features are expanded to include the original predicted values, the interaction term RF_oof×GB_oof, and the squared terms RF_oof² and GB_oof². The LSBoost algorithm is used to train the feature engineering meta-model, and the prediction results are iteratively optimized by combining a decision tree base learner.
[0025] In some implementations, the method further includes visualization: plotting scatter plots of the prediction results for the training and test sets.
[0026] In some implementations, the method further includes performing meta-learner evaluation:
[0027] The prediction results of the normalized training and test sets are reversed to restore the original range;
[0028] Calculate the coefficient of determination R², the root mean square error (RMSE), the mean absolute error (MAE), and the mean square error (MSE) between the predicted and actual abundance values for microplastic detection, and use them to evaluate the meta-model for detecting microplastic abundance.
[0029] Compared with existing technologies, the present invention can achieve the following beneficial technical effects:
[0030] 1) The model's ability to capture complex relationships is enhanced through feature interaction and binning; the meta-learner further integrates the advantages of multiple models, and experiments show that the test-set R² improves by 10%-15%, achieving an increase in accuracy.
[0031] 2) The model's R² and RMSE were checked while reducing the number of training samples. With training samples in the ranges 84-40, 40-16, and fewer than 16, it outperforms the traditional model. Therefore, this model is more stable than the traditional model and achieves a stability improvement.
[0032] 3) Optimizing parameters through grid search and employing cross-validation strategies effectively reduces the risk of overfitting, achieving enhanced robustness. Specifically, when noise is added to the model's training set, R² decreases from 0.95 to 0.92, a decrease of approximately 3.2%, while the RMSE increases from 0.63 particles/m³ to 0.67 particles/m³, an increase of approximately 6.3%. Meanwhile, the R² of the traditional model decreases by 13.9%, while its RMSE increases by 26.1%. Therefore, the robustness of this model is enhanced.
Attached Figure Description
[0033] Figure 1 is an overall flowchart of the microplastic prediction method based on the fusion of feature engineering and meta-learning of the present invention;
[0034] Figure 2 is a detailed implementation diagram of the overall process of the microplastic prediction method based on the fusion of feature engineering and meta-learning of the present invention;
[0035] Figure 3 is a technical roadmap for the microplastic prediction method based on the fusion of feature engineering and meta-learning of the present invention.
Detailed Implementation
[0036] The specific embodiments of the present invention will now be described in further detail with reference to the accompanying drawings.
[0037] Figure 1 illustrates the microplastic prediction method of the present invention based on the fusion of feature engineering and meta-learning, with the following specific steps:
[0038] Step 1: Perform data preprocessing to obtain the original sample dataset for microplastic detection; the original samples utilize remote sensing data from environmental monitoring in the 400-2200nm band, 545-565nm band, and 1230-1250nm band to retrieve the abundance of microplastics in seawater.
[0039] Step 1.1: Perform data cleaning, deleting samples that contain missing values or outliers, such as values more than 3 standard deviations from the mean;
[0040] Step 1.2: Perform normalization processing. Both input and output data are normalized to the [0,1] interval using the mapminmax function;
[0041] Step 1.3: Divide the dataset into training and test sets in an 8:2 ratio to ensure consistent distribution.
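The preprocessing in Steps 1.1 to 1.3 can be sketched roughly as follows. This is a minimal standard-library Python illustration, not the MATLAB `mapminmax` workflow named in the text. It assumes that the outlier rule means "more than 3 standard deviations from the column mean" and that the 8:2 split is a seeded random shuffle; the function names `remove_outliers`, `minmax_fit`, `minmax_apply` and `split_80_20` are hypothetical.

```python
import random
import statistics


def remove_outliers(rows, k=3.0):
    """Drop rows with a missing value (None) or with any column value
    more than k standard deviations away from that column's mean."""
    complete = [r for r in rows if all(v is not None for v in r)]
    columns = list(zip(*complete))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        r for r in complete
        if all(s == 0 or abs(v - m) <= k * s for v, m, s in zip(r, means, stds))
    ]


def minmax_fit(column):
    """Return the (min, max) pair needed to scale a column to [0, 1]."""
    return min(column), max(column)


def minmax_apply(column, lo, hi):
    """Scale values to [0, 1] using a previously fitted (lo, hi) pair."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def split_80_20(rows, seed=0):
    """Shuffle with a fixed seed and split into 80% training, 20% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Keeping the fitted `(lo, hi)` pair makes it possible to reverse the scaling later, as Step 5.1 requires.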
[0042] Step 2: Construct a microplastic abundance detection base model; details are as follows:
[0043] Step 2.1: Generate interaction terms to obtain the expression of the joint effect of microplastic detection features. The microplastic detection feature matrix P_train (6-dimensional) is used as model input. This matrix contains 6 features, namely feature 1 to feature 6. The interaction terms generated based on the microplastic detection feature matrix are (B6+B7), B6, (B5+B6), B7, (B5+B7), and (B5+B7) / B4. Here, B4-B7 are all band data in the 400nm-2200nm range from environmental monitoring remote sensing data: B4 is the 545nm-565nm band, B5 is the 1230nm-1250nm band, B6 is the 1628nm-1652nm band, and B7 is the 2105nm-2135nm band. Then, an interaction column of feature 1 × feature 2 is generated, expanding the matrix to 7 dimensions.
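As an illustration of Step 2.1, the sketch below builds the six band combinations for one sample and appends the product column. It assumes that B4 to B7 are per-sample band values, that B4 is non-zero, and that "feature 1 × feature 2" means the product of the first two columns, (B6+B7) and B6; the function name `build_features` is hypothetical.

```python
def build_features(b4, b5, b6, b7):
    """Return the 7-dimensional feature row for one sample.

    The first six entries follow the combinations listed in Step 2.1;
    the seventh is the product of the first two (an assumed reading of
    'feature 1 x feature 2').
    """
    base = [
        b6 + b7,          # feature 1
        b6,               # feature 2
        b5 + b6,          # feature 3
        b7,               # feature 4
        b5 + b7,          # feature 5
        (b5 + b7) / b4,   # feature 6 (b4 must be non-zero)
    ]
    return base + [base[0] * base[1]]


row = build_features(0.5, 0.25, 0.5, 0.25)
print(len(row))  # 7
```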
[0044] Step 2.2: Transform the microplastic detection features included in each generated interaction term into interval variables, and dynamically bin these interval variables. First, use the `histcounts` function to automatically calculate the bin boundaries: select the first column of features in the training set and divide it into five equal-width intervals. Then, use the `discretize` function to assign the values in the training set to the corresponding bins. For example, if the boundaries are [0, 2, 4, 6, 8, 10], the value 3.5 is assigned to the second bin, which covers the interval from 2 to 4. Finally, save the bin boundaries. When binning the test set, pass the saved boundaries as the `edges` argument to avoid introducing future information.
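The binning in Step 2.2 can be sketched in plain Python as follows. The edges are computed from the training column only and then reused for the test set. This is an illustration rather than the MATLAB `histcounts`/`discretize` calls named in the text: the function names `equal_width_edges` and `assign_bin` are hypothetical, and clipping out-of-range test values into the first or last bin is an assumption.

```python
from bisect import bisect_right


def equal_width_edges(train_values, n_bins=5):
    """Compute n_bins + 1 equal-width bin edges from training data only."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins)] + [hi]


def assign_bin(value, edges):
    """Return a 1-based bin index.

    Bin i covers [edges[i-1], edges[i]); the last bin also includes its
    right edge. Values outside the saved range are clipped to the first
    or last bin (an assumption, for test-set values unseen in training).
    """
    index = bisect_right(edges, value)
    return min(max(index, 1), len(edges) - 1)


edges = equal_width_edges([0, 10, 3.5])   # edges are 0, 2, 4, 6, 8, 10
print(assign_bin(3.5, edges))             # 2, matching the example in Step 2.2
```

Saving `edges` and passing it unchanged to `assign_bin` for test samples is what keeps test-set statistics out of the training pipeline.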
[0045] Step 3: Introduce a meta-learning fusion method to optimize the parameters of the original sample dataset for microplastic detection and to train the base model for microplastic abundance detection; the specific description is as follows:
[0046] Step 3.1: Train the microplastic abundance detection base model using a random forest (RF) model to obtain the corresponding microplastic abundance prediction random forest model. Construct the random forest model using the TreeBagger function, with the parameters defined as follows: the number of trees in the random forest is 500; the splitting criterion for each tree is set to 'regression', which minimizes the mean squared error (MSE) between the predicted abundance value and the true value; the feature importance score is set to 'OOBPredictorImportance', which uses the out-of-bag samples, that is, the samples not drawn when a tree is built (approximately 37% of the data), to evaluate the importance of the microplastic detection input data features, and features are selected as optimized microplastic detection input data features based on their importance; the number of leaf nodes for each tree is set to 5.
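The "approximately 37%" figure in Step 3.1 comes from bootstrap sampling: when n samples are drawn with replacement from n, each sample is left out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368. The short sketch below checks this numerically; the function names are hypothetical and it does not reproduce TreeBagger itself.

```python
import random


def expected_oob_fraction(n_samples):
    """Probability that a given sample is absent from one bootstrap draw."""
    return (1 - 1 / n_samples) ** n_samples


def simulated_oob_fraction(n_samples, seed=0):
    """Draw one bootstrap sample and measure the share of unused samples."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


print(round(expected_oob_fraction(1000), 3))  # 0.368
```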
[0047] Step 3.2: Train the original sample dataset for microplastic detection using the Gradient Boosting (GB) model to obtain a gradient boosting model for microplastic abundance prediction. Construct the gradient boosting model using the fitrensemble function. Key parameters are set as follows: 80 weak learners (decision trees); the training method is 'LSBoost', using the least squares criterion to train the learners; the decision objective is to minimize the sum of squared residuals between the predicted and actual abundance values of microplastic detection; the number of iterations is set to 50; and the learning rate is 0.1. This achieves enhanced convergence and generalization of the microplastic detection input data.
[0048] Step 3.3: Train the original microplastic detection sample dataset using the RF model and GB model, and generate abundance prediction values to form a meta-feature matrix. Randomly and uniformly divide the microplastic detection input dataset obtained in steps 3.1 and 3.2 into 5 subsets, use 5-fold cross-validation to generate the abundance prediction values of the RF model and GB model (RF_oof and GB_oof, respectively), and construct a 5-dimensional meta-feature matrix [RF_oof, GB_oof, RF_oof×GB_oof, RF_oof², GB_oof²].
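The construction of the meta-feature matrix in Step 3.3 can be sketched as below. To stay self-contained, the sketch replaces the RF and GB models with two trivial stand-in learners (`fit_mean` and `fit_line`); these stand-ins, and all function names, are hypothetical and are not the models used in the invention. Only the 5-fold out-of-fold mechanics and the 5-column expansion follow the text.

```python
import random


def kfold_indices(n, k=5, seed=0):
    """Randomly partition indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oof_predictions(X, y, fit, k=5, seed=0):
    """Predict each sample with a model trained on the other k-1 folds.

    fit(X_train, y_train) must return a callable predict(row).
    """
    oof = [None] * len(y)
    for fold in kfold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            oof[i] = model(X[i])
    return oof


def expand_meta(rf_oof, gb_oof):
    """Build rows [RF, GB, RF*GB, RF^2, GB^2] as described in Step 3.3."""
    return [[r, g, r * g, r * r, g * g] for r, g in zip(rf_oof, gb_oof)]


def fit_mean(X_train, y_train):
    """Stand-in learner 1: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda row: mean


def fit_line(X_train, y_train):
    """Stand-in learner 2: least-squares line on the first feature."""
    xs = [row[0] for row in X_train]
    mx, my = sum(xs) / len(xs), sum(y_train) / len(y_train)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (t - my) for x, t in zip(xs, y_train))
    slope = sxy / sxx if sxx else 0.0
    return lambda row: my + slope * (row[0] - mx)


X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
meta = expand_meta(oof_predictions(X, y, fit_mean), oof_predictions(X, y, fit_line))
```

Because every prediction comes from a model that never saw that sample, the meta-learner is trained on predictions that resemble what it will receive on unseen data.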
[0049] Step 4: Construct a meta-learner, train a meta-model using the meta-learner, and input the trained meta-model into the feature engineering model for microplastic detection and prediction; perform the following steps:
[0050] Step 4.1: Perform meta-feature standardization. Specifically, call the zscore function to standardize the meta-features, which will be used as the standardized meta-feature matrix.
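A minimal sketch of the standardization in Step 4.1, written in Python rather than with the `zscore` function named in the text. It assumes the sample standard deviation (division by n - 1) is used; the function names are hypothetical. The mean and standard deviation are returned so that the same values can be reused for test-set meta-features.

```python
import statistics


def zscore_fit(column):
    """Return the mean and sample standard deviation of a column."""
    return statistics.fmean(column), statistics.stdev(column)


def zscore_apply(column, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std if std else 0.0 for v in column]


mean, std = zscore_fit([1.0, 2.0, 3.0])
print(zscore_apply([1.0, 2.0, 3.0], mean, std))  # [-1.0, 0.0, 1.0]
```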
[0051] Step 4.2: Train the meta-learner model. Specifically, call the `fitrensemble` function to construct a least-squares boosting (LSBoost) ensemble model, using the meta-feature matrix constructed in Step 3.3 as the input features; set `Method = 'LSBoost'` to train the meta-learner using the least squares criterion, with the goal of minimizing the sum of squared residuals between the abundance predictions and the true values for microplastic detection; set the number of decision trees to 90; set the learning rate to 0.05 to prevent overfitting; set the maximum number of splits per tree to 15 to control complexity; and set the minimum number of samples per leaf node to 10.
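The idea behind least-squares boosting in Step 4.2 can be illustrated with the simplified sketch below: each round fits a small tree to the current residuals and adds a shrunken copy of its prediction. For brevity the sketch uses single-split stumps instead of trees with up to 15 splits and a minimum leaf size of 10, so it is a simplification and not a reimplementation of `fitrensemble`; all function names are hypothetical. The defaults of 90 rounds and a learning rate of 0.05 mirror the values in the text.

```python
def fit_stump(X, residuals):
    """Find the single split (feature, threshold) minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: predict the mean residual everywhere
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def lsboost_fit(X, y, n_rounds=90, learning_rate=0.05):
    """Boost stumps on residuals under the least-squares criterion."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, row)
                for p, row in zip(pred, X)]
    return base, stumps, learning_rate


def lsboost_predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

A small learning rate means each tree corrects only a fraction of the remaining error, which is why many rounds are needed and why the text describes it as a guard against overfitting.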
[0052] Step 4.3: Input the trained meta-learner model into the feature engineering model for microplastic detection and prediction. Specifically, call the predict function to obtain the microplastic abundance prediction value.
[0053] The method of this invention also includes model evaluation and visualization:
[0054] Step 5: Evaluate and visualize the feature engineering model;
[0055] Step 5.1: Perform data denormalization. Specifically, use the mapminmax function to reverse the prediction results of the normalized training set and test set back to the original range.
[0056] Step 5.2: Calculate R², RMSE, MAE, and MSE between the predicted and actual abundance values for microplastic detection to evaluate the feature engineering model. R² (coefficient of determination): measures the model's ability to explain the variance of the target variable; the closer the value is to 1, the better the model fits the data. MSE (mean squared error): the average of the squared differences between the predicted and actual abundance values, reflecting the absolute magnitude of the prediction error. RMSE (root mean squared error): the square root of the MSE, which puts the error in the same units as the target variable. MAE (mean absolute error): the average of the absolute differences between the predicted and actual abundance values.
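The four indicators in Step 5.2 follow their standard definitions, which can be sketched as below. The function name `regression_metrics` is hypothetical, and the sketch assumes the true values are not all identical (otherwise R² is undefined).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance term
    r2 = 1.0 - (mse * n) / ss_tot                       # 1 - SS_res / SS_tot
    return {"R2": r2, "MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics["MSE"], metrics["MAE"])  # 1.0 0.5
```

These metrics should be computed after the predictions have been mapped back to the original abundance range (Step 5.1), so that RMSE and MAE are expressed in the units of the measured abundance.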
[0057] Step 5.3: Draw a scatter plot of the prediction results for the training and test sets.
[0058] Figure 2 illustrates the technical route of the microplastic prediction method of the present invention based on the fusion of feature engineering and meta-learning. The specific details of the method are further described below with reference to Figure 2:
[0059] 1. Data Preprocessing and Feature Engineering
[0060] This step involves cleaning up null values in the dataset and preprocessing the data.
[0061] 1.1 Data cleaning: Remove samples containing missing or outlier values to ensure data quality.
[0062] 1.2 Feature Interaction: Generate product terms between features (e.g., feature 1 × feature 2) to enhance nonlinear expressive power.
[0063] 1.3 Binning Processing: Discretize the continuous features into bins and save the bin boundaries so that the same boundaries can be applied to the test set.
[0064] 2. Base Model Training and Optimization
[0065] This step uses the data from the first preprocessing step to train the model. The two models are trained independently to ensure the accuracy of the results.
[0066] 2.1 Random Forest Model: The minimum number of leaf nodes (3 / 5 / 10) and the number of trees (50 / 200 / 500) are optimized through grid search to minimize the RMSE of the training set. The prediction results of the random forest model are obtained after running the program.
[0067] 2.2 Gradient Boosting Tree Model: The Bagging method is adopted, with a learning cycle count of 80 to improve model stability. The gradient boosting prediction results are obtained after running the model.
[0068] 3. Meta-learner construction
[0069] The predictions from the random forest and gradient boosting models are used as new feature inputs to the meta-learner to train a least-squares boosting (LSBoost) ensemble model, further integrating the advantages of each model. Cross-validation generates the meta-features: 5-fold cross-validation is used to obtain the predictions of the base models on the training set and their derived features (product terms, squared terms). Finally, the meta-model is trained, iteratively optimizing the prediction results.
[0070] 3.1 Cross-validation: Abundance predictions (RF_oof, GB_oof) for RF and GB are generated through 5-fold cross-validation and used as meta-feature inputs.
[0071] 3.2 Meta-feature standardization: Z-score normalization is performed on the meta-features to eliminate the influence of dimensions.
[0072] 3.3 Extended Meta-Feature Dimensions: Includes original abundance predictions, interaction terms (RF_oof × GB_oof), and squared terms (RF_oof², GB_oof²).
[0073] 3.4 Meta-model training: The LSBoost algorithm is used in combination with a decision tree-based learner (maximum number of splits 15, minimum number of leaf nodes 10), and the prediction results are optimized through 90 learning cycles.
[0074] 4. Model Evaluation
[0075] Plot a scatter plot of the training and test sets, and calculate indicators such as R², MSE, and MAE to evaluate the accuracy of the model.
[0076] In summary, this invention combines random forest and gradient boosting models to construct a meta-learner, enhancing the utilization of meta-features. It fully explores the complex and subtle correlations between various models, resulting in significant improvements in accuracy and generalizability.
[0077] Based on an expert activation prediction mechanism with no training cost, the model built using existing data can accurately reflect the abundance of microplastics in the Bohai Sea. Experiments show that the model exhibits the same distribution characteristics as the actual microplastic abundance from daily to annual perspectives.
[0078] Validation Analysis: The trained model was applied to remote sensing imagery, and mean squared values were applied to obtain an inverted map of microplastic abundance distribution for a specific year and month. The results were consistent with the measured values. Errors only existed in localized areas, but the overall accuracy was within acceptable limits. Therefore, the model is suitable for spatiotemporal feature analysis and has strong applicability.
[0079] It should be noted that although the present invention has been shown and described with reference to specific exemplary embodiments thereof, those skilled in the art should understand that the present invention is not limited to the above embodiments, and all modifications to the present invention fall within the scope of protection of the present invention.
Claims
1. A microplastic prediction method based on the fusion of feature engineering and meta-learning, characterized in that, The method includes the following steps: S1, obtaining microplastic detection data, selecting and combining it through feature engineering, and extracting microplastic detection features; S1 further includes: generating interaction terms for the microplastic detection features and performing interval variable transformation: the interaction terms generated according to the microplastic detection feature matrix are (B6+B7), B6, (B5+B6), B7, (B5+B7) and (B5+B7) / B4, where B4-B7 are all 400nm-2200nm band data from environmental monitoring remote sensing data, B4 is the 545nm~565nm band, B5 is the 1230nm~1250nm band, B6 is the 1628~1652nm band, and B7 is the 2105nm~2135nm band, generating an interaction column of feature 1×feature 2, expanding to 7 dimensions; The interval-type variables are dynamically binned, the optimal bin boundaries are calculated, the first column of features in the training set is selected and divided into five equal-width intervals, the values in the training set are assigned to the corresponding bins, and the bin boundaries are saved. 
S2, construct a microplastic abundance detection base model, input the microplastic detection features into the microplastic abundance detection base model, and calculate the microplastic abundance prediction result;
S3, train the microplastic abundance detection base model independently using the random forest algorithm and the gradient boosting algorithm respectively to obtain the corresponding microplastic abundance prediction random forest model and microplastic abundance prediction gradient boosting model; S3 further includes: constructing a random forest model, training the model on the original sample dataset of microplastic detection using the random forest model, evaluating the importance of the microplastic detection input data features, and selecting the optimized microplastic detection input data features based on their importance; a gradient boosting model is constructed and trained on the original sample dataset for microplastic detection, the learner is trained using the least squares criterion, and the decision objective is to minimize the sum of squared residuals between the predicted and actual microplastic abundance detection results;
S4, construct a meta-learner model, and fuse the microplastic abundance prediction random forest model and the microplastic abundance prediction gradient boosting model through the meta-learner model to obtain new features for microplastic abundance detection; inversion is performed through cross-validation to obtain the prediction results of the microplastic abundance detection base model, generating microplastic detection meta-features; S4 further includes: generating a microplastic abundance detection random forest model and a microplastic abundance detection gradient boosting model through 5-fold cross-validation as meta-features; standardizing the meta-features and expanding their dimensions to include the original predicted value, the interaction term RF_oof × GB_oof, and the squared terms RF_oof² and GB_oof²; the feature engineering meta-model is trained using the LSBoost algorithm, and the prediction results are iteratively optimized by combining it with a decision tree-based learner;
S5, construct a microplastic abundance detection meta-model, and use the microplastic detection meta-features to iteratively train the microplastic abundance detection meta-model to obtain optimized prediction results.
2. The microplastic prediction method based on the fusion of feature engineering and meta-learning according to claim 1, characterized in that, S4, obtaining the prediction results and derived features of the microplastic abundance detection base model through cross-validation, and generating microplastic abundance detection meta-features, further includes: randomly and uniformly dividing the microplastic detection input data trained by the random forest model and gradient boosting model into 5 subsets, performing 5-fold cross-validation, and constructing the meta-feature matrix [RF_oof, GB_oof, RF_oof×GB_oof, RF_oof², GB_oof²].
3. The microplastic prediction method based on the fusion of feature engineering and meta-learning according to claim 1, characterized in that, S5 further includes constructing a meta-model for microplastic abundance detection: standardizing meta-features, training the meta-model for microplastic abundance detection, with the goal of minimizing the sum of squared residuals between the predicted abundance results and the actual results for microplastic detection, and obtaining the predicted microplastic abundance value.
4. The microplastic prediction method based on the fusion of feature engineering and meta-learning according to claim 1, characterized in that, The method further includes visualization: plotting scatter plots of the prediction results for the training and test sets.
5. The microplastic prediction method based on the fusion of feature engineering and meta-learning according to claim 1, characterized in that, the method further includes evaluating the meta-learner model: reverting the predictions from the normalized training and test sets back to the original range; and calculating the coefficient of determination R², the root mean square error (RMSE), the mean absolute error (MAE), and the mean square error (MSE) between the predicted abundance values and the true values for microplastic detection, which are used to evaluate the meta-model for detecting microplastic abundance.