Airborne sea surface height measurement method based on interpretable feature optimization

By combining PCA-SHAP feature optimization and XGBoost model, the problems of low accuracy and insufficient interpretability of GNSS-R sea surface altimetry were solved, achieving high-precision and interpretable sea surface altitude inversion and supporting underwater vehicle navigation.

CN122241040APending Publication Date: 2026-06-19QINGDAO HARBIN INSTITUTE OF TECHNOLOGY (WEIHAI)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
QINGDAO HARBIN INSTITUTE OF TECHNOLOGY (WEIHAI)
Filing Date
2026-01-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing GNSS-R sea surface altimetry technology suffers from low altimetry accuracy and insufficient model interpretability, making it difficult to meet the navigation needs of underwater vehicles.

Method used

A PCA-SHAP feature optimization method was adopted. By constructing a PCA-SHAP feature optimization process, principal component features with strong discriminative power were selected. A sea surface height inversion model was established in conjunction with XGBoost, and hyperparameter optimization was performed. Finally, sea surface height inversion was achieved on the test set.

Benefits of technology

It significantly improves the accuracy of sea surface altimetry, provides an interpretable model decision path, enhances the interpretability of inversion results, and provides technical support for future spaceborne GNSS-R altimetry missions.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122241040A_ABST
    Figure CN122241040A_ABST
Patent Text Reader

Abstract

This invention relates to the interdisciplinary fields of satellite altimetry and electronic signal science. Specifically, it is an airborne sea surface altimetry method based on interpretable feature optimization that can significantly improve altimetry accuracy. First, based on interferometric time-delay waveform data, a PCA-SHAP feature optimization process is constructed: Principal Component Analysis (PCA) is used to reduce the dimensionality and remove redundancy from the high-dimensional waveform data; SHAP values ​​are introduced to quantitatively evaluate the predictive contribution of each principal component in sea surface height inversion, and principal component features with strong discriminative power are selected. Second, an XGBoost sea surface height inversion model is established, and hyperparameter optimization is completed by combining grid search and five-fold cross-validation. Finally, sea surface height inversion is achieved on a test set. This invention provides traceable and verifiable interpretive evidence for improving inversion accuracy and provides necessary technical support for future spaceborne GNSS-R altimetry missions.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the interdisciplinary fields of satellite altimetry and electronic signal science, and more specifically to an airborne sea surface altimetry method based on interpretable feature optimization that can significantly improve altimetry accuracy. Background Technology

[0002] Sea surface height (SSH) is a key geophysical parameter characterizing ocean dynamic processes and sea level changes, and it has significant application value in ocean circulation research, global sea level change monitoring, and seabed topography inversion. High-precision SSH data helps to obtain high-resolution global ocean gravity fields, thereby providing a benchmark for underwater gravity-matched navigation and improving the navigation accuracy and stealth of underwater vehicles.

[0003] Currently, the conventional method for obtaining global ocean gravity fields is satellite altimetry. However, due to limitations in revisit time (approximately 10 days) and observation density (approximately 100 km × 100 km), the data has low spatiotemporal resolution on a global scale, making it difficult to meet the navigation needs of underwater vehicles. Global Navigation Satellite System Reflectometry (GNSS-R), a bistatic passive microwave remote sensing technique, inverts sea surface height by measuring the propagation delay difference between the direct GNSS signal and the sea surface reflected signal. This technology boasts advantages such as abundant signal sources, lightweight payload, high observation density, and short revisit period, providing a new technical approach for obtaining high spatiotemporal resolution sea surface height. GNSS-R can be flexibly deployed on various platforms, including shore-based, airborne, and spaceborne platforms, demonstrating potential not only in sea surface height measurement but also in other remote sensing fields such as soil moisture, sea wind, and sea ice detection.

[0004] However, current major spaceborne GNSS-R missions (such as the UK's TDS-1, the US's CYGNSS, and China's BuFeng-1) are mostly designed for wind field and sea state remote sensing, without systematic optimization for sea surface altimetry. Furthermore, limitations inherent in GNSS signals, such as limited bandwidth and weak reflection power, result in significantly lower altimetry accuracy compared to traditional radar altimeters. Airborne GNSS-R sea surface altimetry, as an important preliminary verification method for spaceborne systems, can provide experimental basis and prior knowledge for signal processing and system design, playing a crucial supporting role in advancing future spaceborne GNSS-R altimetry missions.

[0005] The core of GNSS-R sea surface altimetry is extracting the specular reflection delay of the reflected signal relative to the direct signal. Interferometric GNSS-R (iGNSS-R) technology, by directly cross-correlating the direct and reflected signals received through independent channels, can fully utilize the signal bandwidth and obtain waveforms with higher delay resolution, offering an accuracy advantage compared to traditional GNSS-R (cGNSS-R) methods that rely on local pseudocode reconstruction. However, sea surface roughness causes waveform broadening and distortion, resulting in multipath scattering components in the received signal, making it difficult to directly extract the pure specular delay. Therefore, "waveform re-tracking" techniques are needed to estimate the theoretical specular delay from the disturbed composite waveform. Traditional re-tracking methods (such as DER, HALF, FIT, etc.) typically extract only a single delay scalar from the waveform, failing to fully utilize the high-dimensional scattering information inherent in the waveform morphology, thus limiting the inversion accuracy.

[0006] Machine learning, with its powerful nonlinear fitting capabilities, can directly establish a nonlinear mapping relationship between the original waveform and sea surface height, fully mining the effective information in the observation data, thus alleviating the limitations of traditional re-tracking methods in inversion accuracy. Therefore, it has been introduced into GNSS-R sea surface height inversion tasks. However, existing machine learning methods still face two key challenges in GNSS-R sea surface height inversion: First, feature quality determines the upper limit of model performance, and existing methods mostly rely on empirical feature design, lacking highly discriminative feature optimization methods, resulting in insufficient feature information density and limited improvement in inversion accuracy. Second, while end-to-end deep learning models help improve inversion accuracy, their decision-making process lacks a clear explanation path, making it difficult to trace the root cause of accuracy improvement. This "black box" characteristic results in insufficient model interpretability, making it difficult to promote in practical applications. Therefore, a crucial problem that urgently needs to be solved is: how to fully utilize the high-dimensional information of GNSS-R continuous waveform morphology to improve inversion accuracy while constructing a clear and traceable model decision interpretation method, so that the machine learning inversion results have explicit interpretability. Current research lacks a solution to this problem. Summary of the Invention

[0007] This invention addresses the shortcomings and deficiencies of existing technologies by proposing an airborne sea surface altimetry method based on interpretable feature optimization, which can improve the interpretability of inversion results and significantly enhance altimetry accuracy.

[0008] This invention achieves its purpose through the following measures: An airborne sea surface altimetry method based on interpretable feature optimization is characterized by the following steps: First, a PCA-SHAP feature optimization process is constructed based on interferometric time-delay waveform data: Principal Component Analysis (PCA) is used to reduce the dimensionality and remove redundancy from the high-dimensional waveform data; SHAP values ​​are introduced to quantitatively evaluate the predictive contribution of each principal component in sea surface height inversion, and principal component features with strong discriminative power are selected; Second, an XGBoost sea surface height inversion model is established, and hyperparameter optimization is completed by combining grid search and five-fold cross-validation; Finally, sea surface height inversion is achieved on a test set. The selection of principal component features with strong discriminative power specifically includes the following: Step 1: Multi-scale waveform feature construction and extraction of classical time delay estimation for re-tracking. Based on the waveform power sequence, three progressive multi-scale feature sets are constructed: interferometric waveform features. IWF Traditional tracking points CRP Features and waveform leading edge features WLF All features are based on a standardized one-dimensional interferometric time-delay waveform power sequence: (1), in, d Indicates the number of waveform sampling points, i.e., the number of delay units. x i Indicates the first i The power values ​​at each waveform sampling point, including the characteristics of the interference waveform. IWF 401 sampling points of the waveform are retained. d =401: (2), Traditional tracking points CRP Delay estimation selects the half-power point τ half Peak point of the first derivative τ der Two time delay scalars, d =2, forming the CRP feature set, which is used as the input to the traditional re-tracking method. Its mathematical definition is: (3), (4), in, x ( τ ) is the time-delay-power waveform function. The peak power of the waveform. τ peak This is the peak time; (3) Waveform leading edge characteristics WLF Extract the waveform from its starting point to its peak point. τ max The complete leading edge region, d=201, forming a continuous power sequence to focus on leading-edge morphological information that is more sensitive to height changes: (5), Step 2: Feature selection and optimization based on PCA-SHAP IWF The feature set is the main object of feature optimization. An optimization method is employed in two stages: PCA dimensionality reduction and reconstruction, and Mean|SHAP| selection based on predicted contributions. First, the original high-dimensional waveform is mapped to a low-dimensional orthogonal space. Then, based on the predicted contributions, more discriminative features are selected from the principal components. Specifically, this includes: First, the original feature matrix of the standardized waveform is analyzed. X Perform principal component analysis (PCA) and retain the top performers with a cumulative variance contribution rate of at least 95%. p Each principal component, for IWF That p The value is 103: (6), (7), in, Z p =[ PC1 , PC2 ,…, PCj Principal component score matrix, V p For the corresponding load matrix, compression and reconstruction from the original waveform space to the orthogonal principal component space are achieved; Step 2-2: Introduce SHAP values ​​to measure the predictive contribution of principal components to the inversion output. Train the base model on the training data and calculate the SHAP values ​​of each principal component. M θ and its prediction function f(⋅) The input features are principal component vectors. Z =[ z 1 , z 2 ,…,z j ] T For the tree model XGBoost, TreeSHAP is used for computation, defining the first... P The SHAP values ​​of the principal components are: (8), Where ℱ={1,2,…, p} represents the entire feature set. Z S Indicates using only subsets S The model output at that time; based on the SHAP value, the first...j The predicted contribution of each principal component is its mean absolute SHAP value: (9), The principal component importance index sequence is obtained by sorting the components in descending order of predicted contribution: (10) in, i j Indicates ranking j The principal component numbering in the PCA intrinsic numbering system, for a given target dimension K ≤ p Based on this, select the previous K The principal components constitute the final feature subset: (11), The corresponding set of principal component indices is: (12).

[0009] This invention, during the training phase of the XGBoost model, adjusts the principal component subset size. K With model hyperparameters θ The grid search method is incorporated into the configuration, and the optimal configuration is selected through five-fold cross-validation. θ * , K * ), where cross-validation and parameter selection include: representing candidate configurations as ( θ , K Their combination constitutes the parameter space: (13) K The candidate set is defined as Where 103 represents the total number of principal components retained after PCA dimensionality reduction, a two-stage grid strategy of "coarse screening-fine screening" is adopted for K: first, a large step size is used to scan the global range to locate potential intervals, and then a fine search is performed within this interval with a step size ΔK=1 to determine the candidate optimal K; the selection criterion of "correlation constraint + error minimization" is used to find the optimal configuration. θ * , K * First, the average Pearson correlation coefficient of the 50% validation is required. Among the candidate configurations that satisfy this constraint, the mean of the 50% verification RMSE is used as the loss and minimized, as mathematically expressed below: (14) in, Z (k) For the firstk Principal component representation of training samples. The first determined by SHAP sort K Principal component indexes, M θ To use hyperparameters θ The model, Let be the loss function, take , y (k) val It is the first k Labels (SSH) of samples in the validation set.

[0010] The cross-validation process in this invention is as follows: Step (1): In each parameter space G Traverse the candidate configurations in the middle; Step (2): Perform five-fold cross-validation for each candidate configuration: at the... k Within the training fold, a temporary model is first trained using all principal components of the training fold. M θ (k) Based on this model, the SHAP importance of each principal component (TreeSHAP) is calculated, and the principal components are sorted in descending order of importance. Then, only the top [principal components] are retained. K The model is retrained using principal components, and its performance is evaluated on the validation fold. Step (3): Use the average RMSE on the 50% validation set as the primary performance metric; when the average RMSE difference among multiple configurations does not exceed 1%, or does not exceed the preset threshold of ΔRMSE, follow the principle of simplicity and prioritize the smaller RMSE. K ; In obtaining the optimal configuration θ * , K * After that, use all the training data to complete the final model training and save the reproducible feature processing flow, which specifically includes: Step a: Based on the standardized parameters saved during the feature engineering stage, calculate the PCA transformation matrix using the training set. V p The training samples are projected into the principal component space; Step b: Calculate the principal component prediction contributions on the training set, sort them in descending order, and fix the index set. ; Step c: Based on θ * and The final model is obtained through training. The model was trained using the XGBoost regression framework, with a fixed random seed to ensure reproducibility. No premature stopping occurred in the experiment; early stopping was not triggered / not effective, and the number of iterations was adjusted accordingly. θ * The relevant parameters are controlled and incorporated into the search or fixed settings, among which, θ This represents the set of model hyperparameters, including parameters related to tree structure complexity and regularization.

[0011] This invention maintains strict independence in the model's testing phase; test samples are only saved during the training phase. V p Projected onto the principal component space and indexed using a fixed index. After extracting a subset of features, the data is input to generate prediction results. This process ensures that the test data does not participate in any feature ranking or rule updates, thereby achieving unbiased generalization evaluation.

[0012] This invention only defines the calculable backtracking quantity and screening indicators, specifically divided into the following two parts: (1) Inverse mapping of principal component prediction contributions: PCA loading matrix V p The linear relationship between principal components and wave points is given. The predicted contribution of the sample on each principal component is distributed back to the wave points according to the loading weight. The first... i The first sample f The retrospective prediction contribution value of each waveform point is: (15), of which, For the first i The sample at the th j SHAP values ​​on each principal component; V f,j For load matrix elements, p The total number of principal components. Based on the above definition, the predicted contribution sequence of waveform points can be obtained, which can be used to characterize the decomposition of the predicted contribution of different sampling points to the model output; (2) Joint screening criteria for key waveform points: To establish a more robust screening criterion at the waveform point level, external statistical correlation is introduced and combined with internal predictive contribution. A comprehensive contribution index is defined to identify discriminative original features in waveform points. First, the distance correlation coefficient is used. Measuring the original waveform points x f sea ​​level reference value y Generalized dependencies between (tags): (16), among which, Representing variables xf and y The distance covariance measures the coordinated change in the joint distribution of the two. The distance variance describes the intrinsic dispersion of a single variable. The distance correlation coefficient is calculated from this. The value range is [0,1]. The larger the value, the stronger the statistical dependence. It can simultaneously characterize linear and nonlinear associations and pass the permutation test. p A value <0.05 is used to select a set of waveform points that are statistically significant. Distance correlation and permutation tests are only calculated and the selection rules are fixed in the training set and are not included in the test set stage. Based on this, the predicted contribution of waveform points is combined with statistical correlation to construct a comprehensive contribution index: (17) (18) in, Indicates the first j The combined contribution of each principal component to the waveform points corresponding to the principal components; Indicates the first f The waveform points are at N The average absolute prediction contribution over a sample.

[0013] The sea level height reference value in this invention adopts DTU 18. Sea level height after tidal correction ( SSH DTU ).

[0014] In this invention, the model accuracy evaluation metrics include: absolute accuracy is measured by the root mean square error (RMSE) and the mean absolute error (MAE) to determine the overall deviation; the smaller the value, the higher the accuracy of the inversion result. (19) (20) Trend consistency is assessed using the Pearson correlation coefficient (PCC) to evaluate the linear correlation between the predicted and reference values; a higher PCC value indicates stronger trend consistency. (twenty one), in, It is the number of data points. y i It is the first i The sea surface height reference value for each data point is the label value. It is the first i Predicted values ​​for each data point; Distribution similarity is based on the probability density function of the predicted value and the sea level reference value fitted by kernel density estimation (KDE). P DTU (x) and P Pred.Val(x) And the distribution difference is quantified by Kullback-Leibler divergence: (twenty two), in, and These represent the probability density estimates of sea surface height based on reference values ​​and model predictions, respectively. x The sea level height variable. This index is used to quantify the difference in probability distribution between the predicted result and the sea level height reference value; the smaller the value, the more consistent the predicted sequence is with the sea level height reference value. SSH DTU The closer the distributions are.

[0015] This invention validates the effectiveness of the model based on the sea surface height corrected by DTU18 tidal data. The results show that the RMSE, PCC, and KL of the XGBoost inversion results are 0.18m, 0.75, and 0.25, respectively. Compared with the traditional waveform retracking methods DER and HALF, the improvements in these three indicators are 63.27%, 66.67%, and 88.26% (relative to DER) and 51.35%, 47.06%, and 69.51% (relative to HALF), respectively. This invention can provide a traceable and verifiable explanation for improving inversion accuracy and provides necessary technical support for future spaceborne GNSS-R altimetry missions. Attached Figure Description

[0016] Figure 1 This is a flowchart of the present invention.

[0017] Figure 2 This is a flowchart of the training and testing process for the model pool in this invention.

[0018] Figure 3 It is the aircraft's flight path and corresponding... SSH DTU ;in Figure 3 (a) shows the aircraft's flight path; (b) shows the sea level reference value. SSH DTU .

[0019] Figure 4 This is a flowchart illustrating the formation of the one-dimensional interference delay power waveform in this invention.

[0020] Figure 5 This invention compares the performance metrics of heterogeneous models under different feature strategies (A / B / C), where the horizontal axis represents the regression model category and the vertical axis represents the corresponding index value. Figure 5 In the table, (a) represents RMSE; (b) represents PCC; (c) represents KL divergence; and (d) represents MAE.

[0021] Figure 6This invention compares the predicted sequences of the model under different feature strategies (A / B / C), including training and testing segments. The red line represents the sea surface height reference value, the green line represents the model prediction on the training set, and the blue line represents the model prediction on the testing set. This is used to compare the continuity of trend tracking and the magnitude response.

[0022] Figure 7 This is a comparison of the generalization performance of representative models (Bayesian Ridge and XGBoost) on test sets under different feature strategies (A / B / C) in this invention. Figure 7 (a) shows the scatter plot and regression relationship between the predicted values ​​and SSHDTU; (b) shows the probability density comparison between the predicted distribution and the SSHDTU distribution, with indicators such as RMSE, PCC, Slope and KL (the smaller the value, the closer the distribution is) given in the subplot.

[0023] Figure 8 This is a multi-index comparison of principal component discriminant evaluation in an embodiment of the present invention, wherein... Figure 8 (a) is the cumulative variance contribution rate (VCR); (b) is the distribution of AP / DC / MI and Mean|SHAP| with the principal component number (dashed line is the mean); (c) is the comparison of the top 20 principal components in terms of AP / DC / MI / VCR; (d) and (e) are the consistency of the significant principal component sets obtained by different criteria, corresponding to Jaccard and intersection number, respectively; (f) is the Mean|SHAP| of the top 10 principal components; (g) is the individual predictive contribution and cumulative predictive contribution sorted by Mean|SHAP|.

[0024] Figure 9 This is the SHAP decision pattern analysis of key principal components in this embodiment of the invention, wherein... Figure 9 (a) to (j) are SHAP dependency plots of the top 10 principal components in terms of prediction contribution (Mean|SHAP|). In the plots, the horizontal axis represents the value of the principal component, and the vertical axis represents its SHAP contribution to the prediction SSH. Positive values ​​indicate that the prediction is improved, and negative values ​​indicate that it is suppressed. (k) is the SHAP waterfall plot of the test sample, which shows a low prediction. (l) is the SHAP waterfall plot of the test sample, which shows a high prediction. The plots show the step-by-step superposition process of the contributions of each principal component. The remaining principal components are merged into "Other Features".

[0025] Figure 10 This is the distribution of waveform feature synthesis driving contribution in the embodiments of the present invention, wherein Figure 10(a) is a histogram of the composite driving contribution (CDC) of the waveform sampling points: the horizontal axis is the CDC value and the vertical axis is the number of sampling points; (b) is a scatter plot of the SHAP allocation value and distance correlation (DC) of the sampling points: the horizontal axis is the feature-allocated SHAP value assigned to the sampling point and the vertical axis is the DC of the sampling point and the reference SSH; the color of the point represents the CDC value of the sampling point; (c) is a bar chart of the CDC values ​​of the top 10 key sampling points; (d) is a magnified display of the top 10 key sampling points with the CDC extracted from (b); the coordinate meaning is the same as that of (b).

[0026] Figure 11 This refers to the contribution patterns of key waveform points and their positions within the waveform in embodiments of the present invention. Figure 11 Figures (a)–(c) are scatter plots of the top three waveform points (F201, F200, F199) in terms of Integrated Drive Contribution (CDC), showing the relationship between the power values ​​of the waveform points and the SHAP allocation contribution. The horizontal axis represents the power value of the interferometric waveform at the corresponding waveform point, and the vertical axis represents the SHAP contribution value allocated to that waveform point. Figure (d) is a schematic diagram of the position of the key waveform points in the interferometric time delay power waveform, where the left is the interferometric waveform and the right is the leading edge region of the interferometric waveform.

[0027] Figure 12 This invention relates to the traditional re-tracking single-point features and continuous waveform features in XGBoost inversion. SSH DTU A comparison of generalization performance in [the context of the data], among which Figure 12 (a) test set SSH DTU The sequence is compared with the predicted sequence, with the horizontal axis representing the test sample number n and the vertical axis representing SSH / m; (b) is SSH DTU Comparing the probability density distribution with the predicted values, the horizontal axis represents SSH / m, and the vertical axis represents the probability density; (c) is the prediction-actual scatter plot and the y=x reference line, with the horizontal axis representing... SSH DTU / m, the vertical axis is the predicted value / m, the subplot gives PCC, RMSE and MAE, the comparison input features include: CRP-HALF, CRP-DER, which are traditional re-tracking single-point delay features and IWF, WLF, which are continuous waveform features. Detailed Implementation

[0028] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0029] Unlike previous studies, this invention proposes a novel Explainable Machine Learning-based Feature Optimization (EML-FO) method for sea surface height inversion, aiming to improve inversion accuracy and interpretability. Specifically, the research includes: First, feature optimization and training / testing of the inversion model. Based on interferometric time-delay waveform data, a PCA-SHAP feature optimization process is constructed: Principal Component Analysis (PCA) is first used to reduce the dimensionality and remove redundancy from the high-dimensional waveform data; then, SHAP values ​​are introduced to quantitatively evaluate the predictive contribution of each principal component in sea surface height inversion, thereby selecting principal component features with stronger discriminative power. Based on this, an XGBoost sea surface height inversion model is established, and hyperparameter optimization is completed by combining grid search and five-fold cross-validation. Finally, sea surface height inversion is achieved on the test set. Second, effectiveness verification. Using the sea surface height reference value corrected by the DTU18 tides as a benchmark, the model inversion results are compared and verified to evaluate the prediction accuracy and stability of the method. Third, this invention conducts interpretability analysis and reverse tracing of the decision path of the XGBoost model. It explains and analyzes the prediction process of the XGBoost model, traces the sources of prediction contributions and accuracy gains, and reveals the contribution sources and mechanisms of key features (key principal components and original waveform features) to the prediction results. This research provides a new technical approach for the effective utilization of high-dimensional information from continuous waveforms and the enhancement of model interpretability in GNSS-R sea surface height inversion.

[0030] The following describes the construction of the novel EML-FO method in this invention: Feature quality is a core factor limiting the accuracy of GNSS-R sea surface height machine learning inversion. The lack of interpretability in the model's decision-making process makes it difficult to trace the root cause of accuracy improvement, thus hindering iterative optimization and in-depth research of the inversion method. This invention proposes a novel Explainable Machine Learning-based Feature Optimization (EML-FO) method for sea surface height inversion, aiming to improve both the accuracy and interpretability of the model inversion. Figure 1 As shown, the overall process of EML-FO is divided into the following four core modules: feature engineering, XGBoost model training and testing, reverse tracing path construction and inversion result evaluation.

[0031] The core objective of feature engineering is to extract information-dense and highly discriminative features from raw observation data, providing a high-quality data foundation for subsequent model building and interpretable analysis. This invention, based on airborne GNSS-R data, employs interferometric processing (iGNSS-R) technology to obtain interferometric waveforms with narrower main lobes and higher time delay resolution. Compared to traditional cGNSS-R waveforms, these waveforms can carry richer sea surface scattering information. To preserve as much of the high-dimensional information contained in the continuous waveform shape as possible, a standardized waveform power sequence is used as the core input. Compared to single-point time delay estimation derived from geometric relationships, continuous waveforms carry richer scattering response information in terms of shape, broadening, peak value, and trailing edge variations, providing more comprehensive input support for sea surface height inversion. Simultaneously, to maintain the independence of the feature system and reduce the complexity of attribution backtracking, this invention does not introduce external auxiliary variables such as wind speed, but only performs feature construction and screening based on the waveform power sequence.

[0032] (1) Construction of multi-scale waveform features and extraction of classical time delay estimation for re-tracking: To unify waveform information representation and support subsequent accuracy comparison and prediction contribution analysis, this invention constructs three progressive multi-scale feature sets based on waveform power sequences: Interferometric Waveform Features, IWF Conventional Retracking Points CRP Waveform Leading-edge Features WLF The design is based on the fact that the leading edge morphology is most sensitive to changes in sea level height, while the interferometric waveform carries more comprehensive scattering information. All features are based on a standardized one-dimensional interferometric time-delay waveform power sequence: (1), where, d Indicates the number of waveform sampling points (delay units). x i Indicates the first i The power value of each waveform sampling point.

[0033] (1) Characteristics of interference waveforms IWF ): IWF 401 sampling points of the waveform are retained. d =401), theoretically the most complete information, but also contains a lot of redundancy and noise: (2); (2) Traditional tracking points ( CRP Delay estimation: Traditional re-tracking methods (such as DER, HALF, FIT, etc.) typically extract a single delay scalar from the waveform. This invention selects the half-power point. τ half Peak point of the first derivative τ der Two time delay scalars ( d =2) This constitutes the CRP feature set, which is used as the input to the traditional re-tracking method. Its mathematical definition is: (3); (4), of which, x ( τ ) is the time-delay-power waveform function. The peak power of the waveform. τ peak This is the peak time.

[0034] (3) Waveform Leading-edge Features WLF ): Extract the waveform from its start point to its peak point. τ max Complete leading edge region ( d =201), forming a continuous power sequence. This is used to focus on leading-edge morphological information that is more sensitive to height changes. (5); (2) Feature selection and optimization based on PCA-SHAP: IWF The feature set is complete but has high dimensionality and redundancy, making it suitable as the main target for feature optimization. A two-stage optimization method is adopted: PCA dimensionality reduction and reconstruction, and prediction contribution value (Mean|SHAP|) selection. First, the original high-dimensional waveform is mapped to a low-dimensional orthogonal space; then, features with stronger discriminative power are selected from the principal components based on the predicted contribution. Firstly, the original feature matrix of the standardized waveform is... X Perform principal component analysis (PCA) and retain the top performers with a cumulative variance contribution rate of at least 95%. p Principal components (for IWF That p (Value 103) (6), (7), among which, Z p =[ PC1 , PC2 ,…, PCj Principal component score matrix, V p This is the corresponding load matrix. This step achieves compression and reconstruction from the original waveform space to the orthogonal principal component space.

[0035] Subsequently, the SHAP (Shapley Additive exPlanations) value is introduced to measure the predictive contribution of the principal components to the inversion output. A base model is trained on the training data, and the SHAP value of each principal component is calculated. The model is then... M θ and its prediction function f(⋅) The input features are principal component vectors. Z =[ z 1 ,z 2 ,…,z j ] T For the tree model (XGBoost), TreeSHAP is used for computation. Define the first... P The SHAP values ​​of the principal components are: (8), where ℱ={1,2,…, p} represents the entire feature set. Z S Indicates using only subsets S The model output at that time.

[0036] Based on the SHAP value, the first... j The predicted contribution of each principal component is its mean absolute SHAP value (Mean|SHAP|): (9), The principal component importance index sequence is obtained by sorting the components in descending order of predicted contribution: (10), among which, i j Indicates ranking j The principal component's numbering in the PCA's inherent numbering system. For a given target dimension K ≤ p Based on this, select the previous K The principal components constitute the final feature subset: (11), The corresponding set of principal component indices is: (12).

[0037] To ensure consistency in feature selection between the training and testing phases, this invention determines and fixes the index set during the training phase, and only performs deterministic operations of "PCA projection + feature extraction by index" during the testing phase, thus avoiding changes in feature selection on the test data from a process perspective.

[0038] The following section introduces the training and testing of the XGBoost model: This example illustrates the training, parameter selection, and testing process for the inversion model. Unlike approaches that completely separate "feature selection" and "model training," this invention incorporates the principal component subset size during the training phase of the XGBoost model. K With model hyperparameters θ The grid search method is incorporated into the configuration, and the optimal configuration is selected through five-fold cross-validation. θ * , K * To obtain stable generalization performance.

[0039] (1) Training phase: Cross-validation and parameter selection: The candidate configuration is represented as ( θ , K Their combination constitutes the parameter space: (13), among which, K The candidate set is defined as (103 represents the total number of principal components retained after PCA dimensionality reduction). To balance search efficiency and accuracy, this invention employs a two-stage grid strategy of "coarse screening-fine screening" for K: first, a large step size is used to scan the global range to locate potential intervals, and then a fine search is performed within these intervals with a step size ΔK=1 to determine the candidate optimal K.

[0040] This example uses the selection criterion of "correlation constraint + error minimization" to find the optimal configuration. θ * , K * First, we need to determine the average Pearson correlation coefficient for the 50% validation. Among the candidate configurations that satisfy this constraint, the mean of the 50% verification RMSE is used as the loss and minimized. Its mathematical expression is as follows: (14), among which, Z (k) For the first k Principal component representation of training samples. The first determined by SHAP sort K Principal component indexes, M θ To use hyperparameters θ The model, As the loss function, this invention takes , y (k) val It is the first k Labels (SSH) of samples in the validation set.

[0041] The cross-validation process in this example is as follows: 1) In each parameter space G Traverse the candidate configurations in the middle; 2) Perform five-fold cross-validation on each candidate configuration: in the... k Within the training fold, a temporary model is first trained using all principal components of the training fold. M θ (k) Based on this model, the SHAP importance of each principal component (TreeSHAP) is calculated, and the principal components are sorted in descending order of importance. Then, only the top [principal components] are retained. K The model is retrained using principal components, and its performance is evaluated on the validation fold. 3) Use the average RMSE on the 50% validation set as the primary performance metric; when the average RMSE difference among multiple configurations does not exceed 1% (or does not exceed the preset threshold of ΔRMSE), prioritize the smaller RMSE based on the principle of simplicity. K .

[0042] (2) The final model determination and testing process is as follows: After obtaining the optimal configuration ( θ * , K * After that, use all the training data to complete the final model training, and save the reproducible feature processing flow (such as...). Figure 2 (as shown) 1) Based on the standardized parameters saved during the feature engineering stage, the PCA transformation matrix is ​​calculated using the training set. V p The training samples are projected into the principal component space; 2) Calculate the principal component prediction contributions on the training set, sort them in descending order, and fix the index set. ; 3) Based on θ * and The final model is obtained through training. The model was trained using the XGBoost regression framework, with a fixed random seed to ensure reproducibility. No early stopping occurred in the experimental setup (early stopping was not triggered / not effective), and the number of iterations was determined by... θ * The relevant parameters are controlled and incorporated into the search or fixed settings. Among them, θ This represents the set of model hyperparameters, mainly including tree structure complexity and regularization-related parameters (such as max_depth, min_child_weight, subsample, colsample_bytree, learning_rate, n_estimators, and reg_alpha / reg_lambda, etc.).

[0043] The testing phase is strictly independent: test samples are only saved during the training phase. V p Projected onto the principal component space and indexed using a fixed index. After extracting a subset of features, the data is input to generate prediction results. This process ensures that the test data does not participate in any feature ranking or rule updates, thereby achieving unbiased generalization evaluation.

[0044] To achieve a traceable representation of model decision information from the principal component space back to the original waveform sampling points, the transmission relationship of principal component contribution to waveform sampling point contribution under the action of PCA load matrix is ​​established, and a screening method for key waveform delay units is given accordingly. This invention only defines calculable backtracking quantities and screening indicators to provide a consistent basis for subsequent results presentation and discussion. The specific content is divided into the following two parts: (1) Inverse mapping of principal component prediction contribution: PCA Load Matrix V p The linear relationship between principal components and wave points is given. This invention allocates the predicted contribution of a sample to each principal component back to the wave points according to the loading weight, defining the first... i The first sample f The retrospective prediction contribution value of each waveform point is: (15), of which, For the first i The sample at the th j SHAP values ​​on each principal component; V f,j For load matrix elements, p The total number of principal components. Based on the above definition, the predicted contribution sequence of waveform points can be obtained, which can be used to characterize the decomposition of the predicted contribution of different sampling points to the model output.

[0045] (2) Joint Screening Criteria for Key Waveform Points: Relying solely on the model's internal prediction contribution may be affected by high load noise or randomness. To form a more robust screening criterion at the waveform point level, external statistical correlation is introduced and combined with internal prediction contribution. A comprehensive contribution index is defined to identify discriminative original features in waveform points. First, the distance correlation coefficient is used. Measuring the original waveform points x f sea ​​level reference value y Generalized dependencies between (tags): (16), among which, Representing variables x f and yThe distance covariance measures the coordinated change in the joint distribution of the two. The distance variance describes the intrinsic dispersion of a single variable. The distance correlation coefficient is calculated from this. The value range is [0,1]. A larger value indicates a stronger statistical dependency and can simultaneously characterize both linear and nonlinear associations. This is further verified through a permutation test (…). p <0.05) Filter the set of waveform points that are statistically significant. Distance correlation and permutation tests are only calculated on the training set and the filtering rules are fixed; they are not used in the test set stage.

[0046] Based on this, the predicted contribution of waveform points is combined with statistical correlation to construct a comprehensive contribution index: (17) (18), among which, Indicates the first j The combined contribution of each principal component to the waveform points corresponding to the principal components; Indicates the first f The waveform points are at N The average absolute prediction contribution over a sample.

[0047] The evaluation metrics for the inversion results in this example are as follows: To verify the effectiveness of EML-FO sea level inversion, the model was evaluated from two aspects: prediction accuracy and feature-target correlation. The sea level reference value was adopted. DTU 18. Sea level height after tidal correction ( SSH DTU ).

[0048] Model accuracy evaluation metrics: Absolute accuracy is measured using root mean square error (RMSE) and mean absolute error (MAE) to determine overall bias. The smaller the value, the higher the accuracy of the inversion results. (19) (20); Trend consistency is assessed using the Pearson correlation coefficient (PCC) to evaluate the linear correlation between the predicted and reference values; a higher PCC value indicates stronger trend consistency. (21), among which, It is the number of data points. y i It is the first i Sea surface height reference value (label value) for each data point. It is the first i Predicted values ​​for each data point.

[0049] Distribution similarity is based on the probability density function of the predicted value and the sea level reference value fitted by kernel density estimation (KDE). P DTU (x) and P Pred.Val (x) And the distribution difference is quantified by Kullback-Leibler divergence: (22), among which, and These represent the probability density estimates of sea surface height based on reference values ​​and model predictions, respectively. x The sea level height variable. This index is used to quantify the difference in probability distribution between the predicted result and the sea level height reference value; the smaller the value, the more consistent the predicted sequence is with the sea level height reference value. SSH DTU The closer the distributions are.

[0050] In this example, the feature-label correlation index is as follows: To verify the effectiveness of PCA-SHAP feature selection, a five-dimensional evaluation system is established from two perspectives: external statistical correlation and internal model contribution. (1) Variance Contribution Ratio (VCR): Calculated by PCA, it reflects the intensity of variation of a feature under unsupervised conditions; (2) Linear correlation: The absolute value of the Pearson correlation coefficient (AP) is used to measure the linear dependence of the feature on the SSH. (3) Nonlinear correlation: The distance correlation coefficient (DC) is used to quantify the nonlinear dependence between the two. (4) Generalized correlation (two complementary measures): Distance Correlation Coefficient (DC): Quantifies the arbitrary (linear or nonlinear) dependence between a feature and the SSH, with a value range of [0,1], and is calculated based on distance covariance.

[0051] Mutual Information (MI): From an information theory perspective, it measures the generalized statistical dependency between features and labels (sea level reference values), capturing any shared information, including nonlinearities. (23), where X and Y represent the feature and target variables, respectively. Its joint probability density function, and This is the corresponding marginal probability density function.

[0052] (5) Intra-model prediction contribution: The absolute mean of SHAP is used to measure the actual contribution of each principal component to the model prediction.

[0053] The following section validates and interprets the novel EML-FO method: This example evaluates the EML-FO method from two aspects: "effectiveness" and "interpretability." First, the effectiveness and stability of the PCA-SHAP feature optimization strategy of EML-FO are verified through multi-model comparison and A / B / C feature strategy experiments. Then, using XGBoost as the core model, the prediction contribution of each principal component (Mean|SHAP|) is quantified based on SHAP, and the SHAP contribution at the principal component level is inversely mapped to waveform points along the PCA load matrix. This allows us to trace the contribution and reasons for the improvement in inversion accuracy in key time delay intervals, thereby enhancing... Data Preprocessing: The experimental data came from airborne GNSS-R observations over the Baltic Sea region on December 3, 2015. The flight altitude was approximately 3 km and the speed was approximately 50 m / s. The upward-looking RHCP antenna received the GPS L1 direct signal, and the downward-looking LHCP antenna received the reflected signal. The sampling rate was 19.42 MHz, and the observation duration was 28 minutes and 59 seconds (corresponding to the observation period (GPST): 10:52:42~11:21:41). To ensure the input quality for subsequent waveform generation and inversion modeling, the preprocessing mainly included: ① Removing non-stationary segments such as turns based on flight attitude to suppress Doppler broadening caused by platform motion; ② Filtering high-quality data according to the signal-to-noise ratio threshold (RHCP>40 dB-Hz, LHCP>35 dB-Hz); ③ Combining IGS precise ephemeris and airborne positioning and attitude information to correct propagation delay and Doppler frequency offset, generating a corrected delay-Doppler domain data product as input for subsequent interferometric waveform generation.

[0054] Sea surface height reference value matching: To evaluate the inversion accuracy, a sea surface height reference value needs to be matched for each observation point. This invention adopts the data published by the Technical University of Denmark. DTU18 Global mean sea level model ( DTU18MSS )and DTU16 Global Tidal Model ( DTU16Tide ) superimposed to generate sea level height reference values ​​( DTU-SSH , recorded as SSH DTU Its expression is: (24), among which, DTU 18 MSS Characterizing the static mean sea level, DTU 16 TideThis is for tidal correction. The matching process follows strict spatiotemporal constraints: temporally, all observation times are unified to the UTC system and aligned with the model's time series; spatially, bilinear interpolation is performed on the model grid data based on the latitude and longitude of the flight trajectory. To reduce spatiotemporal mismatch errors, the matching time window is controlled within ±5 minutes, and the spatial tolerance is less than half the model grid scale. To verify the model's generalization ability, the dataset is divided using the spatiotemporal isolation principle: ( Figure 3 The training set was taken from the D~C flight segment (GPST 11:09:02~11:20:40), consisting of 899 consecutive samples, corresponding to SSH DTU An upward trend; the test set was taken from the spatially separated A~B segments (GPST 10:58:21~11:08:41), totaling 421 samples, corresponding to SSH DTU A downward trend. This division ensures that training and testing data are independent of each other in terms of space, time, and trend, thus more rigorously testing the model's generalization ability.

[0055] In this example, the interference waveform is generated as follows: iGNSS-R technology is used to generate a high-resolution interference waveform, and the process is as follows: Figure 4 As shown: After filtering, down-conversion, and digitization of the two signals, the search time delay-Doppler range is compressed by combining precise ephemeris and navigation information. Cross-correlation, phase compensation, and coherent accumulation are then performed to generate a complex delay-Doppler map. The power delay-Doppler map is obtained by modulo squaring. Finally, integration along the Doppler dimension yields a one-dimensional interferometric time delay power waveform. This power sequence of the interferometric time delay waveform serves as the core input for subsequent feature engineering and model training. The interferometric waveform contains 401 sampling points, completely covering the time delay range of the reflected signal. To facilitate subsequent feature extraction, a mapping relationship between the waveform sampling point index and the time delay is established. Figure 4 (b) shows the interference waveform with normalized power values. Figure 4 (c) further magnifies the leading edge region sensitive to changes in sea level. This normalized waveform will serve as the original basis for subsequent feature construction. Based on this waveform, the following features will be constructed. IWF , CRP and WLF Three feature sets at different scales provide the underlying data.

[0056] Next, the performance of EML-FO will be evaluated: To verify the effect of EML-FO on improving the accuracy of sea level inversion, this example is based on the sea level reference value after DTU18 tidal correction (denoted as ). SSH DTUTo conduct comparative verification, a "dual experimental design" was adopted for evaluation: First, the performance gains brought about by feature optimization were examined in multi-model independent experiments to verify the robustness of the method; Second, the prediction results were directly compared through three sets of feature strategy control experiments (A / B / C) to verify the effectiveness of the performance improvement and highlight the decisive impact of feature optimization strategy on inversion performance.

[0057] The model pool primarily uses XGBoost, while BayesianRidge, Ridge, Huber, RandomForest, and AdaBoost are selected as control models to cover different regression algorithms such as linear regression, robust regression, and ensemble learning. All experiments were conducted under the same data partitioning and a unified training-testing process. Each model was independently trained and tested under the following three feature selection strategies: Group A (full feature baseline) retained the top 103 principal components with a cumulative variance contribution rate of no less than 95% after PCA dimensionality reduction, serving as a baseline input that was information-complete but contained a lot of redundancy and noise; Group B (unsupervised variance selection) retained only the top 10 principal components (PC1~PC10) with the highest variance contribution rate, representing the unsupervised selection strategy; Group C (EML~FO) adopted the PCA-SHAP feature optimization strategy proposed in this invention, ranking the principal components based on their SHAP contribution rate and selecting the top K principal components with the highest contribution rate as model input. Under the above design, if models with different structural principles all show consistent and significant performance improvement after adopting the same feature optimization strategy, the performance gain can be more reliably attributed to the feature optimization itself, rather than the accidental fitting of a particular model.

[0058] Figure 5 The prediction results of multiple models were compared in the three control experiments A, B, and C: Figure 5 The overall performance of each model in the experiment is presented from four dimensions: RMSE, MAE, PCC, and KL divergence. Figure 5 By comparing the predicted sequences of the training and testing segments, the model's accuracy can be visually verified. SSH DTU The continuity of trend tracking and the rationality of amplitude response.

[0059] according to Figure 5 It can be seen that Group C (EML-FO) achieved the best overall performance across all indicators: the lowest average inversion error (RMSE≈0.19 m, MAE≈0.15 m), and the best prediction accuracy. SSH DTU The trend consistency is the best (average PCC≈0.76), and the predicted distribution is consistent with... SSH DTUThe results are more accurate (with the smallest KL divergence). In contrast, Group A (all features) has a larger overall error and significant differences between different models; Group B (variance screening) lacks stability, with PCC and other indicators fluctuating significantly between models, and trend consistency failure occurs in XGBoost (PCC is negative). These results indicate that the EML-FO strategy can stably improve the accuracy of sea surface height inversion under different model structures.

[0060] More importantly, group C exhibits a significant cross-model "performance convergence" characteristic. For example... Figure 5 As shown, under strategy group C, the RMSE and PCC distributions of the six models are highly concentrated, with significantly less dispersion than groups A and B; this phenomenon is... Figure 6 The comparison of the predicted sequences provides intuitive confirmation: under only the C group condition, the prediction curves of each model can be compared between the training and test segments. SSH DTU It maintains continuous and reasonable trend tracking. This indicates that the features provided by EML-FO significantly improve the input information density, enabling different models to more fully utilize effective discriminative information to unleash their potential performance, thereby achieving similar high accuracy levels.

[0061] Figure 7 Further comparisons were made of the generalization performance of Bayesian Ridge (linear) and XGBoost (non-linear) on the test set under three feature strategies. Subfigure (a) characterizes point-to-point consistency using the predicted-true scatter plot, regression line, and RMSE / PCC / regression slope; subfigure (b) characterizes distribution consistency using probability density curves and KL divergence. The results show that both types of models in groups A and B deviate from the ideal line to varying degrees. Y = X The results showed insufficient consistency, with XGBoost exhibiting a significant negative correlation in group B (PCC = −0.31). In contrast, group C (EML-FO) achieved stable improvements on both representative models: Bayesian Ridge's RMSE decreased to 0.17, PCC increased to 0.85, the regression slope approached 1 (Slope≈0.99), and KL decreased to 0.08; XGBoost's RMSE decreased to 0.18, PCC increased to 0.75, the regression slope returned to positive (Slope≈0.50), and it maintained a small distributional difference (KL ≈ 0.25). The above quantitative results and the scatter points in the figure converge to... y = x And the density curve is more in line with SSH DTU The consistent phenomena indicate that EML-FO not only improves prediction accuracy but also improves trend consistency and distribution consistency simultaneously, thus providing cross-validation for the conclusion of improved accuracy.

[0062] The verification results of this example show that, under a unified data partitioning and training-testing process, EML-FO (Group C) not only significantly reduces absolute error indicators such as RMSE / MAE, but also simultaneously improves trend consistency indicators such as PCC and regression slope, as well as distribution consistency indicators such as KL. Furthermore, this performance gain exhibits consistent cross-model stability across six regression models with significantly different structures. Therefore, the improvement in inversion performance can be attributed to the PCA-SHAP feature optimization strategy provided by EML-FO, rather than accidental fitting or overfitting of a specific model. Based on this, this invention will trace back the key principal components and their corresponding waveform information to explain the source and contribution distribution of the accuracy gain.

[0063] Next, we conduct an EML-FO interpretability analysis: This example focuses on the interpretability of EML-FO, addressing three key questions: Which principal components are the main sources of discriminative information? How do key principal components influence prediction? Can model decisions be traced back to waveform regions with clear physical meaning? To this end, this invention first verifies the discriminative power of the selection results at the principal component level and compares different selection criteria (variance contribution, external statistical correlation, and prediction contribution). Then, using XGBoost as an example, we utilize SHAP to analyze the nonlinear action mode of key principal components. Finally, we construct a reverse tracing path along the PCA load matrix, mapping the prediction contribution at the principal component level to the original waveform delay sampling points (waveform points), thereby locating the key waveform regions driving the prediction and their contribution distribution.

[0064] Feature optimization effect analysis: In EML-FO, the interference waveform features are reduced in dimensionality by PCA to form the principal component space. To explain the accuracy gain brought by feature optimization, this example uses the 103 principal components retained by PCA in IWF as the object, and evaluates its effect on three types of complementary evidence. SSH DTU Inversion effectiveness: (i) Unsupervised variance explanatory power (VCR), reflecting the extent to which principal components explain the overall variation in the data; (ii) External statistical association (AP / DC / MI), measuring the relationship between principal components and... SSH DTU (iii) The correlation / dependency strength; (iii) the in-model prediction contribution (Mean|SHAP|), which directly reflects the effect of the principal components on the prediction output.

[0065] Figure 8 Three commonly used criteria for principal component selection were compared side-by-side: variance explained (VCR), statistical association with SSH (AP / DC / MI), and actual contribution to model prediction (SHAP). Figure 8(a) / (f) show that the variance contribution and the predictive contribution are not consistent. When sorted by VCR, the cumulative variance contribution of PC1~PC10 reaches 60.12%, but its cumulative predictive contribution is only 18.37%; conversely, when sorted by predictive contribution (Mean|SHAP|), the top 10 principal components can provide as much as 54.2% of the total predictive contribution. This comparison illustrates that relying solely on the variance criterion easily prioritizes the components that "explain the variance of the data," but may overlook the components that "explain the variance of the data." SSH DTU The most crucial discriminant information for inversion.

[0066] Combination Figure 8 (b) and Figure 8 (f) It was found that external statistical associations and predicted contributions exhibit a consistent "concentration interval" in terms of principal component indices: high-value principal components of AP / DC / MI are more distributed in the PC10~PC30 interval, and principal components with significant predicted contributions are also mainly concentrated in this interval. This is consistent with the screening results of EML-FO based on PCA-SHAP, indicating that this strategy can preferentially extract the principal components most sensitive to the output, thus providing a direct explanation for the stable improvement of subsequent inversion accuracy.

[0067] Decision Pattern Analysis of Key Principal Components: The aforementioned results demonstrate that the principal components selected by EML-FO (PCA-SHAP) possess stronger discriminative power. This example further answers the question of "how these principal components are utilized within the model": taking XGBoost as an example, based on the SHAP interpretive framework, the model's dependency structure and decision patterns are characterized at the principal component scale, thereby revealing its influence on... SSH DTU The internal logic of inversion.

[0068] Figure 9 (a)~(j) present the SHAP dependencies of the principal components with the highest prediction contribution (Mean|SHAP|), showing that the influence of key principal components on the output is generally nonlinear and interval-sensitive. For example, the SHAP value of PC15 exhibits an "S-shaped" change with its value, rising rapidly within a specific interval before saturating, indicating that its positive drive mainly occurs within a limited value range; PC17 shows a "V-shaped" response, reflecting that extreme values ​​amplify its influence on the prediction; PC31 may produce strong contributions at both high and low extreme values, suggesting that it carries information related to abnormal patterns. These phenomena indicate that the model does not linearly weight the principal components, but rather dynamically adjusts their marginal influence based on the values ​​of the principal components to capture waveform morphology and... SSH DTU Nonlinear mapping between them.

[0069] To verify the manifestation of this mechanism in individual prediction, Figure 9 Tables (k) to (l) present the SHAP waterfall plots for the test samples. Figure 9 Taking (k) as an example, the deviation of the predicted value from the global expected value is mainly formed by the superposition of a few key principal components, while the contributions of other principal components are small and scattered, presenting a sparse structure of "a few dominating and many fine-tuning". This structure is consistent with the ranking of prediction contributions at the global level, indicating that the model has formed a stable and interpretable decision path in the principal component space: the overall prediction is dominated by a small number of high-influence principal components, and the remaining components are only used for minor corrections.

[0070] Reverse tracing analysis of key waveform features: The aforementioned analysis at the principal component scale revealed that XGBoost exhibits a significant nonlinear dependence on a few key principal components, and the contribution structure is sparse, meaning that the model output is mainly dominated by a small number of high-influence components, with the remaining components only making minor corrections. However, principal components are essentially linear combinations of the original waveform delay sampling points, making it difficult to directly correspond to specific waveform positions and morphological features. To translate the interpretable results at the principal component level into an observable input space, this invention constructs a reverse tracing path of "principal component SHAP contribution back to waveform points": the sample-level SHAP contribution of the principal components is allocated to 401 waveform points according to PCA load, and a comprehensive driving contribution (CDC) is constructed by combining the distance correlation coefficient (DC) between the waveform points and SSHDTU. Based on this, the key delay units that play a core driving role in model prediction are located and their distribution characteristics are analyzed.

[0071] Figure 10 and Figure 11 The above traceability results and key features are presented: Figure 10 Based on the CDC, the importance of all waveform points is assessed and a set of key points is selected. Figure 11 Further characterize the "value-contribution" relationship of representative key waveform points and their corresponding positions in the original interference waveform. Based on Figure 10 The statistics in (a) to (c) show that the model decisions exhibit a significant concentration in the waveform space: among the 401 waveform points, approximately 95% of the waveform points have a CDC < 0.4, while the top 5% of key waveform points in terms of CDC contribute more than 58% cumulatively (cumulative contributions are calculated by summing in descending order of CDC). This result indicates that the driving information mainly originates from a few locally sensitive waveform points, rather than from the uniform superposition of the entire waveform.

[0072] Combination Figure 11As shown in (a)~(c), the contribution patterns of different key waveform points differ: waveform points located in the power peak neighborhood (such as F200 and F201) exhibit a relatively stable positive correlation between their allocated SHAP contribution and power variation; while waveform points located in the leading edge slope variation region (such as F199) show obvious interval dependence, only significantly affecting the output within a specific value range. This result indicates that the model does not rely on a single time delay point to complete the prediction, but rather characterizes the continuous morphological information of the leading edge through the joint response of adjacent waveform points.

[0073] Furthermore, the key waveform points have the same spatial orientation as the traditional re-tracking sensitive delay points. Figure 11 (d) shows that the key points are concentrated in the interval F198~F201, which covers the locations of the first derivative peak point (DER) and the half-power point (HALF). However, unlike traditional methods, the model of this invention utilizes a continuous sampling interval extending around this location rather than single-point delay estimation. In summary, the reverse tracing results show that the EML-FO guided model focuses its predictions on a small set of continuous waveform points at the leading edge of the waveform. Compared with traditional inversion methods based on single-point delay, the continuous morphology of the leading edge can provide richer and more stable discriminative information, thereby supporting improved inversion accuracy and enhancing the traceability and interpretability of the decision-making basis.

[0074] This invention focuses on the GNSS-R sea surface height (SSH) inversion problem, specifically comparing the differences in inversion results between EML-FO and traditional retracking inversion methods. EML-FO uses XGBoost as the regression model, constructing input features from continuous waveform information, and performs SSH inversion based on both the Waveform Leading Edge Feature Set (WLF) and Interferometric Waveform Feature Set (IWF). Traditional retracking inversion methods perform retracking processing on the same dataset, extracting the HALF and DER time delay estimates corresponding to Conventional Retracking Points (CRPs) to complete the inversion.

[0075] This invention quantitatively compares the inversion results of traditional re-tracking methods (CRP-HALF / CRP-DER) and XGBoost based on EML-FO, and evaluates the overall performance using three metrics: RMSE, PCC, and KL. The results show that EML-FO achieves significant and consistent performance gains across both re-tracking baselines: taking IWF as an example, its test set metrics are RMSE=0.18 m, PCC=0.75, and KL=0.25; compared to CRP-DER (0.49, 0.45, 2.13), these metrics are improved by 63.27%, 66.67%, and 88.26%, respectively; compared to CRP-HALF (0.37, 0.51, 0.82), these metrics are improved by 51.35%, 47.06%, and 69.51%, respectively. These comparisons demonstrate that EML-FO outperforms traditional re-tracking methods in key evaluation dimensions such as inversion accuracy, trend consistency, and distribution similarity.

[0076] This conclusion is in Figure 12 Consistent support was found in the visualization comparisons: in sequence comparisons, the EML-FO-driven XGBoost predicted SSH more continuously followed the changing trend of the reference SSH. Figure 12 (a)); In terms of point-to-point consistency, the scatter distribution is closer to the y=x reference line and has smaller dispersion ( Figure 12 (c)); At the statistical distribution level, the predicted probability density is closer to the reference distribution ( Figure 12 (b) Further comparison of the two types of continuous waveform inputs, IWF and WLF, shows that WLF is better in terms of correlation and distribution consistency (e.g., PCC increases from 0.75 to 0.81, and KL decreases from 0.25 to 0.07), indicating that the continuous shape of the leading edge carries more discriminative inversion information, providing more robust support for the improvement of accuracy.

[0077] Traditional retracking typically relies on a single sensitive time delay point (HALF or DER), making it more sensitive to feature point extraction errors and sea state disturbances. In contrast, EML-FO uses continuous waveform sequences as input and, through feature optimization, comprehensively utilizes the morphological information of multiple adjacent sampling points in a high-dimensional space, thereby enhancing the discriminative power and robustness of feature representation. Simultaneously, the significant reduction in KL divergence indicates that the method's improvement is not only reflected in the reduction of point errors but also in the output distribution more closely resembling the statistical structure of the reference SSH, demonstrating better output stability and statistical consistency.

Claims

1. An airborne sea surface altimetry method based on interpretable feature optimization, characterized in that, First, based on interferometric time-delay waveform data, a PCA-SHAP feature optimization process was constructed: Principal Component Analysis (PCA) was used to reduce the dimensionality and remove redundancy from the high-dimensional waveform data; SHAP values ​​were introduced to quantitatively evaluate the predictive contribution of each principal component in sea surface height inversion, and principal component features with strong discriminative power were selected. Second, a sea surface height inversion model was built using XGBoost, and hyperparameter optimization was completed by combining grid search and five-fold cross-validation. Finally, sea surface height inversion was achieved on the test set. The selection of principal component features with strong discriminative power specifically included the following: Step 1: Multi-scale waveform feature construction and extraction of classical time delay estimation for re-tracking. Based on the waveform power sequence, three progressive multi-scale feature sets are constructed: interferometric waveform features. IWF Traditional tracking points CRP Features and waveform leading edge features WLF All features are based on a standardized one-dimensional interferometric time-delay waveform power sequence: (1), in, d Indicates the number of waveform sampling points, i.e., the number of delay units. x i Indicates the first i The power values ​​at each waveform sampling point, including the characteristics of the interference waveform. IWF 401 sampling points of the waveform are retained. d =401: (2), Traditional tracking points CRP Delay estimation selects the half-power point τ half Peak point of the first derivative τ der Two time delay scalars, d =2, forming the CRP feature set, which is used as the input to the traditional re-tracking method. Its mathematical definition is: (3), (4), in, x ( τ ) is the time-delay-power waveform function. The peak power of the waveform. τ peak This is the peak time; (3) Waveform leading edge characteristics WLF Extract the waveform from its starting point to its peak point. τ max The complete leading edge region, d =201, forming a continuous power sequence to focus on leading-edge morphological information that is more sensitive to height changes: (5), Step 2: Feature selection and optimization based on PCA-SHAP IWF The feature set is the main object of feature optimization. An optimization method is employed in two stages: PCA dimensionality reduction and reconstruction, and Mean|SHAP| selection based on predicted contributions. First, the original high-dimensional waveform is mapped to a low-dimensional orthogonal space. Then, based on the predicted contributions, more discriminative features are selected from the principal components. Specifically, this includes: First, the original feature matrix of the standardized waveform is analyzed. X Perform principal component analysis (PCA) and retain the top performers with a cumulative variance contribution rate of at least 95%. p Each principal component, for IWF That p The value is 103: (6), (7), in, Z p =[ PC1 , PC2 ,…, PCj Principal component score matrix, V p For the corresponding load matrix, compression and reconstruction from the original waveform space to the orthogonal principal component space are achieved; Step 2-2: Introduce SHAP values ​​to measure the predictive contribution of principal components to the inversion output. Train the base model on the training data and calculate the SHAP values ​​of each principal component. M θ and its prediction function f(⋅) The input features are principal component vectors. Z =[ z 1 ,z 2 ,…, z j ] T For the tree model XGBoost, TreeSHAP is used for computation, defining the first... P The SHAP values ​​of the principal components are: (8), Where ℱ={1,2,…, p } represents the entire feature set. Z S Indicates using only subsets S The model output at that time; based on the SHAP value, the first... j The predicted contribution of each principal component is its mean absolute SHAP value: (9), The principal component importance index sequence is obtained by sorting the components in descending order of predicted contribution: (10), in, i j Indicates ranking j The principal component numbering in the PCA intrinsic numbering system, for a given target dimension K ≤ p Based on this, select the previous K The principal components constitute the final feature subset: (11), The corresponding set of principal component indices is: (12)。 2. The airborne sea surface altimetry method based on interpretable feature optimization according to claim 1, characterized in that, During the training phase of the XGBoost model, the principal component subset size is... K With model hyperparameters θ The grid search method is incorporated into the configuration, and the optimal configuration is selected through five-fold cross-validation. θ * , K * ), where cross-validation and parameter selection include: representing candidate configurations as ( θ , K Their combination constitutes the parameter space: (13) K The candidate set is defined as Where 103 represents the total number of principal components retained after PCA dimensionality reduction, a two-stage grid strategy of "coarse screening-fine screening" is adopted for K: first, a large step size is used to scan the global range to locate potential intervals, and then a fine search is performed within this interval with a step size ΔK=1 to determine the candidate optimal K; the selection criterion of "correlation constraint + error minimization" is used to find the optimal configuration. θ * , K * First, the average Pearson correlation coefficient of the 50% validation is required. Among the candidate configurations that satisfy this constraint, the mean of the 50% verification RMSE is used as the loss and minimized, as mathematically expressed below: (14), in, Z (k) For the first k Principal component representation of training samples. The first determined by SHAP sort K Principal component indexes, M θ To use hyperparameters θ The model, Let be the loss function, take , y (k) val It is the first k Labels (SSH) of samples in the validation set.

3. The airborne sea surface altimetry method based on interpretable feature optimization according to claim 2, characterized in that, The cross-validation process is as follows: Step (1): In each parameter space G Traverse the candidate configurations in the middle; Step (2): Perform five-fold cross-validation for each candidate configuration: at the... k Within the training fold, a temporary model is first trained using all principal components of the training fold. M θ (k) Based on this model, the SHAP importance of each principal component is calculated (TreeSHAP), and the principal components are sorted in descending order of importance. Then, only the top [principal components] are retained. K The model is retrained using principal components, and its performance is evaluated on the validation fold. Step (3): Use the average RMSE on the 50% validation set as the primary performance metric; when the average RMSE difference among multiple configurations does not exceed 1%, or does not exceed the preset threshold of ΔRMSE, follow the principle of simplicity and prioritize the smaller RMSE. K ; In obtaining the optimal configuration θ * , K * After that, use all the training data to complete the final model training and save the reproducible feature processing flow, which specifically includes: Step a: Based on the standardized parameters saved during the feature engineering stage, calculate the PCA transformation matrix using the training set. V p The training samples are projected into the principal component space; Step b: Calculate the principal component prediction contributions on the training set, sort them in descending order, and fix the index set. ; Step c: Based on θ * and The final model is obtained through training. The model was trained using the XGBoost regression framework, with a fixed random seed to ensure reproducibility. No early stopping occurred in the experiment; early stopping was not triggered / did not take effect. The number of iterations was adjusted accordingly. θ * The relevant parameters are controlled and incorporated into the search or fixed settings, among which, θ This represents the set of model hyperparameters, including parameters related to tree structure complexity and regularization.

4. The airborne sea surface altimetry method based on interpretable feature optimization according to claim 3, characterized in that, The testing phase of the model is strictly independent; test samples are only saved from the training phase. V p Projected onto the principal component space and indexed using a fixed index. After extracting a subset of features, input it to generate the prediction result.

5. The airborne sea surface altimetry method based on interpretable feature optimization according to claim 4, characterized in that, The predicted contributions of the samples on each principal component are allocated back to the waveform points according to the loading weights, and the first... i The first sample f The retrospective prediction contribution value of each waveform point is: (15), of which, For the first i The sample at the th j SHAP values ​​on each principal component; V f,j For load matrix elements, p The total number of principal components is used to obtain the predicted contribution sequence of waveform points, which is used to characterize the decomposition of the predicted contribution of different sampling points to the model output. Construct a comprehensive contribution index: (17), (18), in, Indicates the first j The combined contribution of each principal component to the waveform points corresponding to the principal components; Indicates the first f The waveform points are at N The average absolute prediction contribution over a sample.

6. The airborne sea surface altimetry method based on interpretable feature optimization according to claim 5, characterized in that, Sea level reference value adopted DTU 18. Sea level height after tidal correction SSH DTU .