A width learning tree species identification method based on multi-modal remote sensing data fusion

By employing a width learning method for multimodal remote sensing data fusion, superpixel segmentation and modal adaptive weighted fusion are dynamically guided, solving the problems of canopy-level segmentation discrepancies and low computational efficiency. This achieves a balance between accuracy and efficiency in tree species identification and is applicable to forest resource surveys and ecological environment monitoring.

CN122244514A · Pending · Publication Date: 2026-06-19 · GUANGDONG UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGDONG UNIV OF TECH
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing multimodal remote sensing tree species identification technologies, pixel-level feature learning does not fully exploit spatial spectral consistency, multimodal fusion lacks information interaction correction and modal differentiation contribution adaptation, ignores the value of historical data migration, and relies on complex deep learning frameworks, resulting in low computational efficiency and making it difficult to achieve a balance between identification accuracy and computational efficiency.

Method used

By collecting HSI, RGB, and LiDAR modal remote sensing data, preprocessing and spatial registration are performed to dynamically guide superpixel segmentation. Reliability weights are calculated for weighted voting fusion to generate superpixel-level segmentation maps. A spatial consistency fusion strategy is introduced to determine the overall modal reliability score. Modal adaptive dynamic weighted summation is performed to generate a fusion feature matrix, and superpixel-level width learning tree species classification is implemented.

Benefits of technology

It solves the problem of canopy segmentation and discretization, fully explores the complementary value of spectral, spatial and structural information, improves the robustness and accuracy of tree species identification, achieves a balance between identification accuracy and processing efficiency, and meets the needs of large-scale forest resource surveys and ecological environment monitoring.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122244514A_ABST

Abstract

This invention relates to a width-learning tree species identification method based on multimodal remote sensing data fusion, comprising: acquiring HSI, RGB, and LiDAR modal remote sensing data; preprocessing and spatially registering the HSI, RGB, and LiDAR modal remote sensing data to obtain multimodal remote sensing data; dynamically guiding superpixel segmentation of the multimodal remote sensing data to obtain superpixel segmentation results; calculating the reliability weights of each modality in the superpixel segmentation results and performing weighted voting fusion to generate a superpixel-level segmentation map, which is used to construct an 8-region adjacency graph, and introducing a spatial consistency fusion strategy to obtain modal spatial consistency features; determining the overall modal reliability score through the modal spatial consistency features, which is used to obtain the dynamic modal attention weights of each superpixel region; weighted summing of the dynamic modal attention weights to generate a fusion feature matrix; and implementing superpixel-level width-learning tree species classification to obtain a tree species classification map.

Description

Technical Field

[0001] This invention relates to the field of remote sensing image processing technology, and in particular to a width-learning tree species identification method based on multimodal remote sensing data fusion.

Background Technology

[0002] Multimodal remote sensing tree species identification technology is a vegetation classification technique developed based on remote sensing image processing technology. It serves as a key support for fields such as forest resource surveys and ecological environment monitoring. Its core principle is to compensate for the representational limitations of single-modal data by fusing spectral, spatial, and structural feature information from multi-source remote sensing data such as HSI, RGB, and LiDAR, thereby extracting the differentiated features of target tree species and achieving accurate tree species identification. This technology possesses strong feature complementarity, which can improve the reliability of tree species identification in complex scenarios, and has become an important means of tree species analysis in complex vegetation areas.

[0003] However, existing technologies still face numerous bottlenecks, hindering the accuracy and efficiency of tree species identification. In tree species classification methods based on multimodal feature fusion, feature learning for each modality is often pixel-level, rarely fully exploring spatial characteristics such as spectral consistency. Furthermore, the impact of spectral variability in spectral images on identification is poorly addressed, resulting in many discrete and discontinuous regions at the canopy level. In reality, the regional morphological features at the canopy scale of a single tree provide valuable regional geometric information for tree species classification; pixel-level analysis struggles to adapt to and effectively learn from different target objects within a scene. Moreover, for multimodal feature learning, existing solutions often involve directly superimposing features from two modalities into a deep learning network, or partially constructing a dual-branch architecture for feature fusion, followed by further superposition and fusion in subsequent stages. Regardless of the approach, the interaction correction between modalities and the contribution to individual differentiation are not fully explored. The learning process overly relies on the current data itself, failing to fully leverage the transfer value of historical data and neglecting the potential of a large amount of hyperspectral historical data to provide transferable knowledge and strengthen feature representation.

[0004] Furthermore, current mainstream solutions largely rely on deep learning frameworks (such as the Transformer network), whose complex network structure leads to low computational efficiency, making it difficult to meet the rapid processing needs of large-scale remote sensing data. While width learning has the advantage of lightweight operation, there is still no mature technical solution for tree species identification scenarios involving multimodal remote sensing data fusion. This makes it difficult to achieve a balance between recognition accuracy and computational efficiency, which has become a key obstacle to the implementation of the technology.

Summary of the Invention

[0005] To address the problems existing in the prior art, the purpose of this invention is to provide a width-learning tree species identification method based on multimodal remote sensing data fusion. This method solves the technical problems in existing multimodal remote sensing tree species identification technologies, such as pixel-level feature learning failing to fully exploit spatial spectral consistency, resulting in discrete canopy segmentation; multimodal fusion lacking sufficient information interaction correction and modal difference contribution adaptation; neglecting the value of historical data transfer; relying on complex deep learning frameworks leading to low computational efficiency; and the lack of a mature solution for lightweight width learning, making it difficult to balance identification accuracy and computational efficiency.

[0006] To achieve the above objectives, the present invention provides the following solution: a width-learning tree species identification method based on multimodal remote sensing data fusion, which includes: collecting HSI, RGB, and LiDAR modal remote sensing data, preprocessing and spatially registering them to obtain multimodal remote sensing data, and dynamically guiding superpixel segmentation of the multimodal remote sensing data to obtain superpixel segmentation results; calculating the reliability weights of each modality in the superpixel segmentation results and performing weighted voting fusion to generate a superpixel-level segmentation map, which is used to construct an 8-region adjacency graph, and introducing a spatial consistency fusion strategy to obtain modal spatial consistency features; determining the overall modal reliability score from the modal spatial consistency features, which is used to obtain the dynamic modal attention weights of each superpixel region; performing a weighted summation with the dynamic modal attention weights to generate a fusion feature matrix; and performing superpixel-level width-learning tree species classification to obtain a tree species classification map.

[0007] Optionally, acquiring the multimodal remote sensing data includes: The HSI, RGB, and LiDAR modal remote sensing data are preprocessed as follows: HSI modal remote sensing data undergoes spectral denoising using wavelet transform, and redundant bands are removed through band filtering while retaining core spectral discrimination information; RGB modal remote sensing data undergoes geometric correction using affine transform to correct image distortion, and Gaussian filtering is used to smooth noise and enhance canopy texture features; LiDAR modal remote sensing data undergoes statistical filtering to remove discrete noise points, and a cloth-simulation filtering algorithm is used to divide ground points and non-ground points, using the non-ground points to generate a canopy height model with target spatial resolution. Using any one of the preprocessed HSI, RGB, and LiDAR modal remote sensing data as a reference modality, the remaining data are mapped to the same coordinate system and generated as multimodal remote sensing data according to coordinate correlation.

[0008] Optionally, obtaining the superpixel segmentation result includes: The SLIC superpixel algorithm is used to perform superpixel segmentation on the RGB, HSI optical images and CHM raster images in the multimodal remote sensing data respectively; in the process of superpixel segmentation of the RGB and HSI optical images, the basic number of superpixels and the threshold coefficient for distinguishing between high vegetation areas and low vegetation areas are set. The mean local vegetation height is calculated using a canopy height model, and the quantile of the total vegetation height is statistically analyzed. The vegetation height type of the local area is then determined. Specifically, if the mean local vegetation height is not less than the quantile, it is identified as a high vegetation area; if the mean local vegetation height is less than the quantile, it is identified as a low vegetation area. Based on the base number of superpixels and the threshold coefficient, the number of superpixels is adaptively adjusted according to the vegetation height type of the local region to obtain the superpixel segmentation result.
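The height-guided adjustment described above can be sketched as follows. This is a minimal illustration, not the patented procedure: the function name, the two scaling coefficients, and the rule of multiplying the base number by a coefficient are assumptions, since the text only states that the number of superpixels is adjusted from a base number and a threshold coefficient.

```python
import numpy as np

def adaptive_superpixel_count(chm, window, base_count, high_coef, low_coef, q=0.75):
    """Classify one local window as high/low vegetation and pick a superpixel count.

    chm        : 2-D canopy height model array
    window     : (row_slice, col_slice) selecting the local area
    base_count : base number of superpixels
    high_coef, low_coef : hypothetical scaling coefficients (assumed, not from the source)
    q          : quantile of the overall vegetation height used as the threshold
    """
    threshold = np.quantile(chm, q)      # quantile of the total vegetation height
    local_mean = chm[window].mean()      # mean local vegetation height
    if local_mean >= threshold:          # "not less than the quantile" means high vegetation
        return "high", int(round(base_count * high_coef))
    return "low", int(round(base_count * low_coef))
```

The resulting count could then be passed as the requested number of segments to a SLIC implementation for that local area.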

[0009] Optionally, calculating the reliability weights of each modality in the superpixel segmentation result is defined by four formulas [formulas omitted in source] in terms of the following quantities: the reliability weight of each modality; the modality traversal index, which ranges over the set of all modalities participating in the fusion; a balance coefficient; the feature homogeneity score of the superpixel regions; the modality's own segmentation stability score; three coefficients that weight a spectral consistency measure based on the average spectral angle, a spatial shape consistency measure based on regional compactness, and a statistical uniformity measure based on the distribution entropy of regional features; the original superpixel segmentation result; and the segmentation result obtained under the t-th perturbation condition, where T is the number of perturbations. An intersection-over-union operator compares the original and perturbed segmentation results.
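One plausible form that is consistent with the quantities listed in this paragraph is given below. It is an assumption offered for orientation only: every symbol is introduced here for illustration, and the normalization in the first expression in particular could equally be a softmax.

```latex
w_m = \frac{R_m}{\sum_{k \in \mathcal{M}} R_k}, \qquad
R_m = \lambda\, H_m + (1-\lambda)\, S_m,
\\[4pt]
H_m = \alpha\, C^{\mathrm{spec}}_m + \beta\, C^{\mathrm{shape}}_m + \gamma\, C^{\mathrm{stat}}_m, \qquad
S_m = \frac{1}{T}\sum_{t=1}^{T} \mathrm{IoU}\!\left(P_m,\, P_m^{(t)}\right)
```

Here $w_m$ is the reliability weight of modality $m$, $\mathcal{M}$ the set of fused modalities, $\lambda$ the balance coefficient, $H_m$ the homogeneity score built from the spectral, shape, and statistical measures, and $S_m$ the stability score averaged over $T$ perturbed segmentations $P_m^{(t)}$ of the original result $P_m$.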

[0010] Optionally, obtaining the modal spatial consistency features includes: a spatial consistency fusion strategy is applied to the superpixel-level segmentation map and the 8-region adjacency graph, and the superpixel-level features of each modality are updated by neighbor weighting to obtain the modal spatial consistency features [formulas omitted in source]. The quantities involved are: the modal spatial consistency feature; the spatial fusion coefficient of modality m; for any superpixel region, the set of adjacent superpixels that have a spatial adjacency relationship with it; the similarity weights under modality m; the temperature coefficient; the initial superpixel-level average feature of the superpixel region under modality m; and the initial superpixel-level average feature of each adjacent superpixel region under modality m.
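A neighbor-weighted update of this kind can be sketched as below. The sketch assumes cosine similarity between mean features, a softmax over neighbors scaled by the temperature coefficient, and a convex blend controlled by the spatial fusion coefficient; none of these specific choices is confirmed by the text, and all names are hypothetical.

```python
import numpy as np

def spatial_consistency_update(features, neighbors, alpha, tau):
    """Blend each superpixel's mean feature with a similarity-weighted mean of its neighbors.

    features  : (n_superpixels, d) float array of initial mean features for one modality
    neighbors : dict mapping a superpixel index to a list of adjacent superpixel indices
    alpha     : spatial fusion coefficient in [0, 1] (assumed to act as a blend factor)
    tau       : temperature coefficient of the softmax over neighbors
    """
    updated = features.copy()
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue                                   # isolated region: keep its own feature
        nb = features[nbrs]                            # (k, d) neighbor features
        norms = np.linalg.norm(nb, axis=1) * np.linalg.norm(features[i]) + 1e-12
        sims = nb @ features[i] / norms                # cosine similarity (assumption)
        w = np.exp((sims - sims.max()) / tau)          # numerically stable softmax
        w /= w.sum()
        updated[i] = (1.0 - alpha) * features[i] + alpha * (w @ nb)
    return updated
```

A small temperature concentrates the weight on the most similar neighbor, while a large one approaches a plain average of the neighbors.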

[0011] Optionally, determining the overall modal reliability score includes: A local image patch is cropped from the HSI image with the superpixel region as the center, and the local image patch is input into the pre-trained large model HyperSIGMA to obtain a semantic discrimination vector, which is used to construct semantic correction coefficients for each modality feature. The intramodal feature reliability coefficient of each modal feature in the superpixel region is calculated based on the modal space consistency feature. The overall modality reliability score is obtained by using the semantic correction coefficient and the intramodal feature reliability coefficient.

[0012] Optionally, the overall modal reliability score is defined by three formulas [formulas omitted in source] in terms of the following quantities: the overall modal reliability score; the intramodal feature reliability coefficient; the semantic correction coefficient; a parameter that controls the magnitude of the semantic correction; the learnable parameters of modality m; the weight parameters that modality m applies to the shared semantic vector; and the semantic discrimination vector.

[0013] Optionally, generating the fused feature matrix includes: The overall modal reliability score is normalized using softmax to obtain the dynamic modal attention weights for each superpixel region. These weighted weights are then summed to generate multi-scale region features after modal adaptive dynamic weighted fusion. The superpixel-level multi-scale features are concatenated along the scale direction, and all superpixel regions are traversed. The multi-scale fused feature vectors corresponding to each superpixel region are stacked row by row to generate the fused feature matrix.
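The softmax normalization, weighted summation, and scale-wise concatenation described above can be sketched as follows. The sketch assumes that, at a given scale, all modalities have already been reduced to the same feature dimension so that they can be summed; the function names and array layouts are illustrative.

```python
import numpy as np

def fuse_modalities(scores, feats):
    """Softmax the reliability scores per superpixel and sum modality features with them.

    scores : (n_superpixels, n_modalities) overall modal reliability scores
    feats  : (n_modalities, n_superpixels, d) features of each modality at one scale
    Returns the attention weights and the fused (n_superpixels, d) features.
    """
    z = scores - scores.max(axis=1, keepdims=True)     # stabilize the exponentials
    weights = np.exp(z)
    weights /= weights.sum(axis=1, keepdims=True)      # rows sum to 1
    fused = np.einsum("nm,mnd->nd", weights, feats)    # weighted sum over modalities
    return weights, fused

def build_fusion_matrix(scores_per_scale, feats_per_scale):
    """Concatenate the fused features of every scale; one row per superpixel region."""
    fused = [fuse_modalities(s, f)[1] for s, f in zip(scores_per_scale, feats_per_scale)]
    return np.concatenate(fused, axis=1)
```

Each row of the returned matrix is the multi-scale fused feature vector of one superpixel region.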

[0014] Optionally, obtaining the tree species classification map includes: The fused feature matrix is input into multiple tree species classification sub-models to obtain superpixel-level prediction results. The sub-models are trained using a training set that consists of 1% of the labeled samples randomly selected from each tree species category. During sub-model training, category weights are calculated from the number of samples of each tree species category in the training set and then sunk down to the sample level to obtain the weight of each training sample. An adaptive Dropout mechanism applies an independent random mask, for each sub-model, to the subset of the fused feature matrix corresponding to the training set, generating the sub-model training input. After sample weighting, the output weights of each sub-model are solved by the generalized pseudo-inverse with regularization, yielding multiple trained tree species classification sub-models. In the prediction stage, the superpixel-level prediction results of each trained sub-model are obtained through a single forward propagation. All superpixel-level prediction results are integrated through a majority voting strategy to obtain the final superpixel-level classification result, from which a tree species classification map that is highly consistent with the actual spatial layout of the forest stand is generated.

[0015] The beneficial effects of this invention are as follows: This invention uses the CHM (canopy height model) to dynamically guide superpixel segmentation to adapt to vegetation height heterogeneity, enabling precise region division that closely matches the canopy outline of a single tree, effectively solving the problem of discrete canopy-level segmentation caused by existing pixel-level learning.

[0016] This invention combines a modality-adaptive dynamic weighted fusion strategy to dynamically learn the reliability weights of each modality, fully explore the complementary value of spectral-spatial-structural information, and make up for the shortcomings of existing multimodal fusion methods that lack information interaction correction and differentiated contribution considerations.

[0017] At the sample distribution adaptation level, this invention strengthens the learning of features of rare tree species in small samples by adaptively adjusting the number of tree species samples, sinking sample-level weights, and integrating multiple sub-models, thereby improving the robustness of tree species identification in imbalanced sample scenarios.

[0018] Furthermore, this invention does not require complex iterative training; classification can be completed in the prediction stage with only one forward propagation. Relying on the lightweight advantage of width learning, it breaks through the bottleneck of low efficiency in existing complex deep learning frameworks, achieving a balance between recognition accuracy and processing efficiency. It can quickly respond to the needs of practical application scenarios such as large-scale forest resource surveys, ecological environment monitoring, and vegetation carbon storage assessment.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a detailed flowchart of a width-learning tree species identification method based on multimodal remote sensing data fusion according to an embodiment of the present invention; Figure 2 is a schematic diagram of the overall steps of a width-learning tree species identification method based on multimodal remote sensing data fusion according to an embodiment of the present invention; Figure 3 is a tree species classification map according to an embodiment of the present invention.

Detailed Implementation

[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0022] This embodiment discloses a width-learning tree species identification method based on multimodal remote sensing data fusion, including: acquiring HSI, RGB, and LiDAR modal remote sensing data; preprocessing and spatially registering the HSI, RGB, and LiDAR modal remote sensing data to obtain multimodal remote sensing data; dynamically guiding superpixel segmentation of the multimodal remote sensing data to obtain superpixel segmentation results; calculating the reliability weights of each modality in the superpixel segmentation results and performing weighted voting fusion to generate a superpixel-level segmentation map, which is used to construct an 8-region adjacency graph, and introducing a spatial consistency fusion strategy to obtain modal spatial consistency features; determining the overall modal reliability score through the modal spatial consistency features, which is used to obtain the dynamic modal attention weights of each superpixel region; weighted summing of the dynamic modal attention weights to generate a fusion feature matrix; and implementing superpixel-level width-learning tree species classification to obtain a tree species classification map.

[0023] Specifically, this embodiment discloses a width-learning tree species identification method based on multimodal remote sensing data fusion, including the following specific technical steps:

Step 1: Acquisition and preprocessing of HSI, RGB, and LiDAR modal remote sensing data. Using the acquired datasets, perform differentiated preprocessing on each modality and complete spatial registration based on a unified world coordinate system.

Step 2: Superpixel segmentation dynamically guided by the CHM canopy height model generated from LiDAR data. Superpixel segmentation is performed on the preprocessed RGB and HSI optical images and the CHM raster image, respectively. For the RGB and HSI segmentation, the number of superpixels is adaptively adjusted according to the vegetation height type of each region.

Step 3: HSI, RGB, and LiDAR modal superpixel alignment and superpixel-level feature aggregation. The superpixel segmentation results of the RGB and LiDAR modalities are mapped to the same spatial size as HSI. Segmentation differences between modalities are eliminated by modal reliability weighted voting fusion. The average features of each superpixel region under each modality are calculated to aggregate features from the pixel level to the superpixel level.

Step 4: Spatial topology graph construction. Construct an 8-region adjacency graph and include the other superpixels that have spatial adjacency relationships with each superpixel in its adjacency set.

Step 5: Multi-scale feature pyramid construction. Introduce a spatial consistency fusion strategy, calculate the similarity weights between superpixels, and perform neighbor-weighted updates on the superpixel-level features of each modality. A feature pyramid is constructed using multi-scale gradient settings. A PCA-Scale operation is performed on the features of each modality at each scale to achieve feature scale adaptation and dimensionality reduction.

Step 6: Modality adaptive dynamic weighted fusion. Local image patches are cropped centered on each superpixel region and input into the pre-trained HyperSIGMA model for efficient extraction of hyperspectral semantic features. Variable weights are learned for each modality feature, and intramodal feature reliability coefficients are calculated. Semantic correction coefficients are constructed from the extracted semantic features. The two coefficients are fused to obtain the overall modality reliability score. Dynamic attention weights, adaptively adjusted according to the modality features, are generated through normalization. Based on these weights, the multimodal features at each scale are weighted and summed. The multi-scale features of each superpixel region are then concatenated, and finally the multi-scale fusion features of all superpixel regions are stacked to form a fusion feature matrix integrating multimodal data information, multi-scale feature representations, and the previously extracted hyperspectral semantic information.

Step 7: Classification based on a sample distribution-aware width learning system. Input the multi-scale cross-modal fusion features into the sample distribution-aware width learning system, and achieve superpixel-level width-learning tree species classification through adaptive adjustment of class weights and integration of multiple sub-models.
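The adjacency graph of Step 4 can be sketched as below. The sketch reads "8-region adjacency" as 8-neighborhood pixel connectivity, which is an interpretation rather than something the text states: two superpixels are treated as adjacent when any pair of their pixels touch horizontally, vertically, or diagonally. The function name is hypothetical.

```python
import numpy as np

def build_adjacency(labels):
    """Return {superpixel_id: set of adjacent ids} using 8-neighborhood connectivity.

    labels : 2-D integer array, one superpixel id per pixel
    """
    h, w = labels.shape
    adj = {int(l): set() for l in np.unique(labels)}
    # Four offsets cover all eight directions, because every found pair is stored both ways.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        rows_a, rows_b = slice(0, h - dr), slice(dr, h)
        if dc >= 0:
            cols_a, cols_b = slice(0, w - dc), slice(dc, w)
        else:
            cols_a, cols_b = slice(-dc, w), slice(0, w + dc)
        a, b = labels[rows_a, cols_a], labels[rows_b, cols_b]
        differ = a != b                                # pixel pairs that straddle a boundary
        for u, v in zip(a[differ].tolist(), b[differ].tolist()):
            adj[u].add(v)
            adj[v].add(u)
    return adj
```

The returned sets can serve directly as the adjacency sets used for the neighbor-weighted feature update in Step 5.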

[0024] Furthermore, the dynamic superpixel segmentation guided by the CHM canopy height model generated from LiDAR data in step two specifically includes: using the SLIC superpixel algorithm to perform superpixel segmentation on the preprocessed RGB and HSI optical images and the CHM raster image, respectively. In the superpixel segmentation process for RGB and HSI, the mean local vegetation height and the 0.75 quantile of vegetation height, both calculated from the CHM, are used together with the threshold coefficients for high and low vegetation areas to divide the scene into high vegetation areas and low vegetation areas. The number of superpixels is then adaptively adjusted according to the vegetation height type to complete the area division.

[0025] Furthermore, the HSI, RGB, and LiDAR modal superpixel alignment in step three specifically includes: eliminating segmentation differences between modalities through modal reliability weighted voting fusion: calculating the reliability weight of each modality based on the dual indicators of superpixel region feature homogeneity and modal self-segmentation stability. The superpixel region feature homogeneity score is used to quantify the consistency and integrity of the superpixel region in spectral features and spatial structure features under a single modality. The modal self-segmentation stability score quantifies the stability of the segmentation result of a single modality under different disturbance conditions by inputting multiple types of perturbations. After determining the modal weight by combining the two indicators, the multimodal superpixel segmentation results are then weighted and fused according to the weight to obtain a single unified superpixel-level segmentation map.
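A stability score of this kind can be sketched as below. The text does not specify how intersection-over-union is applied to whole segmentations, so the sketch assumes a region-wise definition: each original superpixel is matched to its best-overlapping region in the perturbed result, and the best IoU values are averaged. The function names are hypothetical.

```python
import numpy as np

def segmentation_iou(seg_a, seg_b):
    """Mean, over the regions of seg_a, of the best IoU with any region of seg_b."""
    best_ious = []
    for label_a in np.unique(seg_a):
        mask_a = seg_a == label_a
        best = 0.0
        for label_b in np.unique(seg_b[mask_a]):       # only regions that overlap mask_a
            mask_b = seg_b == label_b
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            best = max(best, inter / union)
        best_ious.append(best)
    return float(np.mean(best_ious))

def stability_score(original, perturbed_list):
    """Average agreement between the original segmentation and T perturbed ones."""
    return float(np.mean([segmentation_iou(original, p) for p in perturbed_list]))
```

A score close to 1 indicates that the modality's superpixel boundaries barely move under perturbation.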

[0026] Furthermore, the modality adaptive dynamic weighted fusion in step six specifically includes: cropping local image patches on the HSI image centered on each superpixel region, inputting them into the pre-trained large model HyperSIGMA to efficiently extract hyperspectral semantic features, and obtaining semantic discrimination vectors to provide strong representational support for subsequent tree species identification.

[0027] The modality adaptive dynamic weighted fusion in step six specifically includes: learning variable weights for each modality at the superpixel level; calculating feature reliability coefficients based on the intramodal features of the superpixel region; and setting a larger bias for the HSI modality to reflect its dominant position in spectral discrimination. Based on the semantic discrimination features extracted by the HyperSIGMA large model, and combined with the weight parameters of the shared semantic vectors corresponding to each modality, semantic correction coefficients are constructed for the features of each modality. Based on the intramodal feature reliability coefficients and semantic correction coefficients, the overall modality reliability score is obtained; and dynamic modality attention weights are generated for each superpixel region after normalization. Multimodal features are weighted and summed at each scale, and the multi-scale fusion features of all superpixel regions are concatenated along the scale direction and stacked row by row to form the final fusion feature matrix integrating multimodal data information, multi-scale feature representation, and the hyperspectral strong semantic information extracted above.

[0028] Furthermore, the sample distribution-aware width learning system classification in step seven specifically includes: calculating class weights based on the number of samples for each tree species category, assigning greater weights to categories with fewer samples. By sinking the class weights down to the sample level, the weight of each training sample is obtained.
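Sinking class weights to the sample level can be sketched as below. The inverse-frequency formula is an assumption chosen because it gives rarer categories larger weights, as the paragraph requires; the source does not state the exact formula, and the function name is hypothetical.

```python
import numpy as np

def sample_weights_from_labels(labels):
    """Compute inverse-frequency class weights and expand them to one weight per sample.

    labels : 1-D integer array of training labels
    Returns (class_weight_lookup, per_sample_weights).
    """
    classes, counts = np.unique(labels, return_counts=True)
    class_w = counts.sum() / (len(classes) * counts)   # fewer samples gives a larger weight
    lookup = dict(zip(classes.tolist(), class_w.tolist()))
    sample_w = np.array([lookup[c] for c in labels.tolist()])
    return lookup, sample_w
```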

[0029] Furthermore, the sample distribution-aware width learning system classification in step seven specifically includes: employing an adaptive Dropout mechanism to generate multiple weakly correlated sub-model inputs by applying a random mask to the fused feature matrix, thereby improving the overall generalization ability. After sample weighting, the output weights of each sub-model are optimized using a regularized generalized pseudo-inverse solution method for the features and labels. Each sub-model generates a superpixel-level prediction result, and the results of each sub-model are integrated through majority voting to obtain the final superpixel-level classification result.
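The training and voting scheme above can be sketched in a simplified form. This is not a full width learning system: the feature and enhancement node layers are omitted and only the output layer is shown, solved as sample-weighted ridge regression (a regularized pseudo-inverse). The element-wise mask with inverted scaling, the hyperparameter defaults, and all names are assumptions.

```python
import numpy as np

def fit_readout(A, Y, sample_w, reg):
    """Solve W = (A^T D A + reg*I)^-1 A^T D Y, with D the diagonal sample-weight matrix."""
    Aw = A * sample_w[:, None]
    return np.linalg.solve(A.T @ Aw + reg * np.eye(A.shape[1]), Aw.T @ Y)

def train_ensemble(X, y, n_classes, sample_w, n_models=5, keep_prob=0.8, reg=1e-2, seed=0):
    """Train several sub-models, each on an independently masked copy of X."""
    rng = np.random.default_rng(seed)
    Y = np.eye(n_classes)[y]                            # one-hot targets
    models = []
    for _ in range(n_models):
        mask = rng.random(X.shape) < keep_prob          # independent random mask per sub-model
        A = X * mask / keep_prob                        # inverted-dropout scaling (assumption)
        models.append(fit_readout(A, Y, sample_w, reg))
    return models

def predict_majority(models, X, n_classes):
    """One forward pass per sub-model, then a majority vote per sample."""
    idx = np.arange(X.shape[0])
    counts = np.zeros((n_classes, X.shape[0]), dtype=int)
    for W in models:
        counts[np.argmax(X @ W, axis=1), idx] += 1      # add this sub-model's vote
    return counts.argmax(axis=0)                        # ties go to the lowest class index
```

Because each output layer is obtained in closed form, no iterative gradient training is involved, which matches the efficiency argument made for width learning in the text.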

[0030] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0031] One embodiment provides a width-learning tree species identification method based on multimodal remote sensing data fusion. As shown in Figures 1-2, the specific steps are as follows: Step 101: Acquisition and preprocessing of HSI, RGB, and LiDAR modal remote sensing data: It should be noted that the multimodal remote sensing data mentioned in this embodiment can be acquired either by collecting data directly with mature remote sensing equipment or by using publicly available multimodal benchmark datasets. This embodiment preferably uses three publicly available datasets: SZUTreeData-R1, SZUTreeData-R2, and GDUT-SHyp3D. All three datasets integrate hyperspectral images (HSI), high-resolution RGB images, and LiDAR point cloud data, and provide accurate tree species labels.

[0032] Specifically, the multimodal remote sensing data preprocessing process in this embodiment is as follows: Wavelet transform is used to denoise the HSI data, effectively suppressing high-frequency noise interference. Redundant bands are then removed through band filtering while core spectral discrimination information is retained, achieving effective preprocessing of high-dimensional spectral data. Affine transformation is used to geometrically correct the RGB images and remove image distortion. Gaussian filtering is then applied to smooth noise and enhance canopy texture features, improving the separability of spatial structure information. Statistical filtering is used to remove discrete noise points from the LiDAR point cloud data. A cloth-simulation filtering algorithm is used to divide the point cloud into ground and non-ground points, and a canopy height model (CHM) with a spatial resolution of 0.5 m is generated from the non-ground points. Spatial registration of the three modalities is performed based on a unified world coordinate system. Using HSI as the reference modality, ground control points are used to map the RGB images and LiDAR point clouds to the same coordinate system, keeping the registration error within one pixel and ensuring spatial consistency of the data.

[0033] All three types of modal data contain explicit spatial location information: HSI and RGB images are located by pixel coordinates, while LiDAR point clouds are located by (x, y, z) three-dimensional coordinates. Accurate registration can be achieved through the coordinate mapping formulas $r = (Y_0 - Y)/gsd$ and $c = (X - X_0)/gsd$, where (X, Y) are the planar coordinates of a LiDAR point, $(X_0, Y_0)$ are the planar coordinates of the top-left corner of the HSI image in the world coordinate system, gsd is the ground spatial resolution of the HSI image, and r and c are the row and column numbers of the LiDAR point in the HSI image.
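The mapping from world coordinates to HSI row and column indices can be sketched as below. It assumes a north-up image whose row index grows as the northing decreases, and it floors the result to integer pixel indices; both the flooring and the names x0, y0 for the top-left corner are assumptions.

```python
import numpy as np

def world_to_pixel(x, y, x0, y0, gsd):
    """Map planar world coordinates to (row, col) indices of the reference HSI grid.

    x, y   : planar coordinates of LiDAR points (scalars or arrays)
    x0, y0 : world coordinates of the top-left corner of the HSI image
    gsd    : ground spatial resolution of the HSI image, in the same units as x and y
    """
    col = np.floor((np.asarray(x) - x0) / gsd).astype(int)
    row = np.floor((y0 - np.asarray(y)) / gsd).astype(int)   # rows increase downward
    return row, col
```

Points whose indices fall outside the image extent would need to be discarded before features are associated by coordinate.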

[0034] After spatial registration is completed, a multimodal remote sensing dataset is generated by associating the spectral information of HSI, the spatial texture information of RGB, and the structural information of LiDAR by coordinates. The dataset consists of three components: the dimensionality-reduced hyperspectral image, the corrected and enhanced RGB image, and the denoised LiDAR point cloud together with the generated CHM data.

[0035] Step 102: Dynamically guide superpixel segmentation using the canopy height model (CHM) generated from the LiDAR data. It should be noted that in step 102, based on the multimodal remote sensing data obtained in step 101, the CHM generated from LiDAR is used to dynamically guide superpixel segmentation, thereby realizing superpixel-level partitioning and structured representation of the multimodal data.

[0036] Specifically, step 102 in this embodiment proceeds as follows. The SLIC (Simple Linear Iterative Clustering) superpixel algorithm is used to perform superpixel segmentation separately on the preprocessed RGB and HSI optical images and on the CHM raster image. For each preprocessed modality $m$, $K_m$ denotes the number of superpixels after adaptation to the vegetation height type. For the RGB and HSI segmentation, a base number of superpixels $K_0$ is set, together with the threshold coefficients used to distinguish high-vegetation areas from low-vegetation areas. From the CHM generated from the LiDAR data, the local mean vegetation height $\bar{h}$ and the 0.75 quantile $Q_{0.75}$ of the global vegetation height are computed. When $\bar{h} \ge Q_{0.75}$, the area is treated as a high-vegetation area; otherwise it is treated as a low-vegetation area. For each preprocessed modality, $K_m$ is then adaptively adjusted according to the vegetation height type of the area, so that regions are divided according to vegetation height type.
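The height-type decision can be sketched in Python as follows. The text does not give the exact adjustment rule, so this sketch assumes the base number of superpixels is simply multiplied by one of two scale factors; `adaptive_superpixel_count`, `a_high`, `a_low`, and the sample values are hypothetical.

```python
import numpy as np

def adaptive_superpixel_count(chm, window, k_base, a_high, a_low, q=0.75):
    """Classify one local CHM window and scale the base superpixel count.

    window = (r0, r1, c0, c1) selects the local area.
    a_high / a_low are assumed scale factors for high / low vegetation.
    """
    threshold = np.quantile(chm, q)            # global 0.75 quantile of height
    r0, r1, c0, c1 = window
    local_mean = chm[r0:r1, c0:c1].mean()      # local mean vegetation height
    is_high = bool(local_mean >= threshold)
    k = int(round(k_base * (a_high if is_high else a_low)))
    return is_high, k

chm = np.array([[1.0, 1.0, 10.0, 12.0],
                [1.0, 2.0, 11.0, 13.0]])
tall = adaptive_superpixel_count(chm, (0, 2, 2, 4), k_base=200, a_high=1.5, a_low=0.5)
short = adaptive_superpixel_count(chm, (0, 2, 0, 2), k_base=200, a_high=1.5, a_low=0.5)
```

The resulting count could then be passed as the number of segments to a SLIC implementation for the corresponding area.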

[0037] Step 103: HSI, RGB, and LiDAR modal superpixel alignment and superpixel-level feature aggregation. Using HSI as the reference modality, the superpixel segmentation results of the RGB and LiDAR modalities are mapped to a spatial size consistent with HSI, and modal reliability-weighted voting fusion is used to eliminate segmentation differences between modalities.

First, for each modality $m$, a reliability weight $w_m$ is calculated from both the homogeneity of superpixel region features and the modality's own segmentation stability: the two scores are combined as $\lambda H_m + (1-\lambda) S_m$ and normalized over the set $\mathcal{M}$ of all modalities participating in the fusion, where $\lambda$ is the balancing coefficient.

The superpixel region feature homogeneity score $H_m$ quantifies the consistency and integrity of superpixel regions in terms of spectral and spatial structural features under a single modality. It is defined as $H_m = \alpha_1 C_m^{spec} + \alpha_2 C_m^{shape} + \alpha_3 C_m^{ent}$, where $\alpha_1$, $\alpha_2$, $\alpha_3$ are coefficients, $C_m^{spec}$ is a spectral consistency measure based on the average spectral angle, $C_m^{shape}$ is a spatial shape consistency measure based on regional compactness, and $C_m^{ent}$ is a statistical uniformity measure based on the distribution entropy of regional features. Each component is normalized before entering the weighted calculation, and the stronger the homogeneity of regional features, the larger $H_m$.

The modality self-segmentation stability score $S_m$ quantifies the stability of a single modal segmentation result under different perturbation conditions, measured by the consistency between the original superpixel segmentation result and the results obtained after multiple perturbations: $S_m = \frac{1}{T}\sum_{t=1}^{T} \mathrm{IoU}(P_m, P_m^{(t)})$, where $P_m$ is the original superpixel segmentation result, $P_m^{(t)}$ is the segmentation result obtained under the $t$-th perturbation condition, and $T$ is the number of perturbations. The Intersection over Union (IoU) operator calculates the proportion of overlap between the original and perturbed segmentation results; a higher average IoU indicates a more stable segmentation under perturbation.

The perturbation conditions include segmentation algorithm parameter perturbation, input data noise perturbation, and segmentation initialization perturbation. Parameter perturbation applies small random fluctuations to the compactness parameter and the number of initial cluster centers of the SLIC superpixel algorithm; noise perturbation adds Gaussian random noise to the input image; and initialization perturbation adds a small random offset to the coordinates of the initial cluster centers. These three types of perturbation represent three interference scenarios in practical remote sensing data processing: deviations in algorithm parameter settings, random noise introduced during data acquisition and preprocessing, and differences in algorithm initialization. Together they evaluate how well the modal segmentation results resist different interference factors and how stable the results are.

After the reliability weights of each modality are obtained, the multimodal superpixel segmentation results are fused by weighted voting into a single unified superpixel-level segmentation map, which completes the multimodal superpixel alignment. Based on the unified superpixel-level segmentation map, the average feature of any superpixel region $R_k$ under each modality is calculated as $f_k^m = \frac{1}{N_k}\sum_{p \in R_k} x_m(p)$, where $N_k$ is the number of pixels within region $R_k$ and $x_m(p)$ is the feature vector of modality $m$ at pixel position $p$. This aggregates features from the pixel level to the superpixel level and provides the core input for the spatially consistent neighborhood-weighted update in step 105.
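The stability score, the reliability weights, and the superpixel-level feature averaging of step 103 can be sketched in Python as follows. Several details are assumptions made only for illustration: the IoU between two segmentations is taken as the mean best-match IoU of each original region, the reliability weights are normalized by their sum, and all function names and sample arrays are hypothetical.

```python
import numpy as np

def segmentation_iou(seg_a, seg_b):
    """Mean over regions of seg_a of the best IoU with any region of seg_b."""
    scores = []
    for a in np.unique(seg_a):
        mask_a = seg_a == a
        best = 0.0
        for b in np.unique(seg_b[mask_a]):          # only overlapping regions
            mask_b = seg_b == b
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            best = max(best, inter / union)
        scores.append(best)
    return float(np.mean(scores))

def stability_score(original, perturbed):
    """Average IoU between the original and each perturbed segmentation."""
    return float(np.mean([segmentation_iou(original, p) for p in perturbed]))

def reliability_weights(homogeneity, stability, lam=0.5):
    """Combine the two scores per modality; sum-normalization is an assumption."""
    score = lam * np.asarray(homogeneity) + (1.0 - lam) * np.asarray(stability)
    return score / score.sum()

def superpixel_mean_features(features, labels, n_regions):
    """Average (H, W, D) pixel features inside each region of an (H, W) label map."""
    flat_f = features.reshape(-1, features.shape[-1])
    flat_l = labels.ravel()
    sums = np.zeros((n_regions, flat_f.shape[1]))
    np.add.at(sums, flat_l, flat_f)                  # per-region feature sums
    counts = np.bincount(flat_l, minlength=n_regions)
    return sums / np.maximum(counts, 1)[:, None]

original = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
shifted = np.array([[0, 0, 0, 1], [0, 0, 0, 1]])
s = stability_score(original, [original, shifted])
w = reliability_weights([0.8, 0.6, 0.7], [0.9, 0.7, 0.5])
pixels = np.tile(np.arange(4.0), (2, 1))[:, :, None]   # feature = column index
means = superpixel_mean_features(pixels, original, n_regions=2)
```

In practice the perturbed segmentations would come from rerunning SLIC with perturbed parameters, added Gaussian noise, or shifted initial cluster centers.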

[0038] Step 104: Constructing the spatial topology graph. In step 104, an 8-neighborhood (8-region) adjacency graph is constructed on the unified superpixel-level segmentation map obtained in step 103 to capture the spatial relationships between superpixels: for any superpixel region $R_k$, its adjacency set $\mathcal{N}(k)$ contains all superpixels that are spatially adjacent to $R_k$.
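A minimal Python sketch of the adjacency construction is given below. It assumes the unified segmentation is available as an integer label map and treats two superpixels as adjacent when any of their pixels touch horizontally, vertically, or diagonally; the function name `build_adjacency` and the sample label map are hypothetical.

```python
import numpy as np

def build_adjacency(labels):
    """Return {region id: set of 8-connected neighboring region ids}."""
    adj = {int(k): set() for k in np.unique(labels)}
    h, w = labels.shape
    # Compare every pixel with its right, down, down-right and down-left
    # neighbor; the remaining four directions are covered by symmetry.
    for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        if dc >= 0:
            a = labels[0:h - dr, 0:w - dc]
            b = labels[dr:h, dc:w]
        else:
            a = labels[0:h - dr, -dc:w]
            b = labels[dr:h, 0:w + dc]
        diff = a != b
        for u, v in zip(a[diff].tolist(), b[diff].tolist()):
            adj[u].add(v)
            adj[v].add(u)
    return adj

labels = np.array([[0, 1],
                   [2, 3]])
adj = build_adjacency(labels)
```

In this small example regions 0 and 3 touch only at a corner, so they are neighbors under 8-connectivity but would not be under 4-connectivity.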

[0039] Step 105: Construction of the multi-scale feature pyramid. In step 105, a spatial consistency fusion strategy is introduced on the unified superpixel-level segmentation map obtained in step 103 and the superpixel adjacency sets obtained in step 104. For each modality $m$, the superpixel-level features are updated by neighbor weighting: $\tilde{f}_k^m = (1-\beta_m)\, f_k^m + \beta_m \sum_{j \in \mathcal{N}(k)} a_{kj}^m f_j^m$, where $\beta_m \in [0,1]$ is the spatial fusion coefficient of modality $m$ and $a_{kj}^m$ are the similarity weights under modality $m$, computed from cosine similarity as $a_{kj}^m = \exp\!\big(\cos(f_k^m, f_j^m)/\tau\big) \big/ \sum_{j' \in \mathcal{N}(k)} \exp\!\big(\cos(f_k^m, f_{j'}^m)/\tau\big)$, where $\tau$ is a temperature coefficient that controls the smoothness of the neighborhood weight distribution. Based on the intra-modal spatial consistency features $\tilde{f}_k^m$, a multi-scale feature pyramid is constructed: The pyramid scale adopts... A PCA-Scale operation is performed on the features of each modality $m$ at each scale $s$, which provides feature scale adaptation and dimensionality reduction and supplies the core input for the modality-adaptive dynamic weighted fusion at each scale in step 106.
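The neighbor-weighted update can be sketched in Python as follows. This is a sketch under stated assumptions: the updated feature is taken as a convex combination of the superpixel's own feature and a similarity-weighted average of its neighbors, and the similarity weights are a temperature-scaled softmax over cosine similarities; `spatial_consistency_update`, `beta`, `tau`, and the sample data are hypothetical names and values.

```python
import numpy as np

def spatial_consistency_update(feats, adj, beta=0.5, tau=0.1):
    """Blend each superpixel feature with its neighbors (one modality).

    feats: (K, D) superpixel-level mean features.
    adj:   dict mapping region id -> iterable of neighboring region ids.
    """
    feats = np.asarray(feats, dtype=float)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)       # unit vectors for cosine
    updated = feats.copy()
    for k, neighbors in adj.items():
        nb = sorted(neighbors)
        if not nb:
            continue                              # isolated region: unchanged
        logits = (unit[nb] @ unit[k]) / tau       # cosine similarity / temperature
        a = np.exp(logits - logits.max())         # numerically stable softmax
        a /= a.sum()
        updated[k] = (1.0 - beta) * feats[k] + beta * (a @ feats[nb])
    return updated

feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
adj = {0: {1, 2}, 1: {0}, 2: {0}}
updated = spatial_consistency_update(feats, adj, beta=0.5, tau=0.1)
```

A smaller `tau` concentrates the weight on the most similar neighbor, while a larger `tau` approaches a plain neighborhood average.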

[0040] Step 106: Modal adaptive dynamic weighted fusion. It should be noted that in step 106 this embodiment draws on the idea of transfer learning. For each superpixel region $R_k$, a local image patch $P_k$ centered on the region is cropped from the HSI image and fed into HyperSIGMA, a pre-trained large-scale model specifically designed for hyperspectral data. This model, pre-trained on massive amounts of hyperspectral data, integrates spectral and spatial features and has learned rich ground-cover discrimination knowledge and strong representation capabilities. Its pre-trained weights are reused and adapted to the remote sensing tree species identification scenario to efficiently extract hyperspectral semantic discrimination features, yielding the semantic discrimination vector $z_k = \Phi(P_k)$, where $\Phi$ denotes the feature extraction operator of the HyperSIGMA model. The vector $z_k$ encodes high-level discriminative semantic information of the hyperspectral data and provides strong representational support for subsequent tree species identification.

Based on the modal spatial consistency features obtained in step 105, a variable weight is learned at the superpixel level for the features of each modality $m$. For superpixel region $R_k$, the intra-modal feature reliability coefficient $r_k^m$ of each modality is calculated from its intra-modal features using $\theta_m$ and $b_m$, the learnable parameters of modality $m$; a larger bias parameter is set for the HSI modality to reflect its dominant role in spectral discrimination. Based on the semantic discrimination vector $z_k$ output by the HyperSIGMA model, a semantic correction coefficient $c_k^m$ is constructed for each modality feature using $v_m$, the weight parameters of the shared semantic vector corresponding to modality $m$. Combining the intra-modal feature reliability coefficient $r_k^m$ with the semantic correction coefficient $c_k^m$ gives the overall modal reliability score $s_k^m$, in which the hyperparameter $\gamma \in [0,1]$ controls the magnitude of semantic correction.

Softmax normalization of the overall modal reliability score over the modalities yields the dynamic modal attention weights of each superpixel region, $\alpha_k^m = \exp(s_k^m) \big/ \sum_{m'} \exp(s_k^{m'})$, which adapt to changes in $r_k^m$ and $c_k^m$. At each scale $s$, the multimodal features are weighted and summed with the dynamic modal attention weights, $F_k^{(s)} = \sum_m \alpha_k^m \hat{f}_k^{m,s}$, where $\hat{f}_k^{m,s}$ is the PCA-Scale feature of modality $m$ at scale $s$, generating superpixel-level multi-scale features after modality-adaptive dynamic weighted fusion. For each superpixel region $R_k$, the fused features obtained at the different scales are concatenated along the scale direction to obtain the multi-scale fused feature vector $F_k$. Traversing all superpixel regions and stacking their multi-scale fused feature vectors row by row forms the final fused feature matrix $F$. This matrix integrates the multimodal data information, the multi-scale feature representation, and the strong hyperspectral semantic information extracted above.
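The softmax weighting and multi-scale concatenation of step 106 can be sketched in Python as follows. The sketch takes the overall modal reliability scores as given inputs, since their exact formulas are not reproduced here, and assumes that after PCA-Scale all modalities share the same feature dimension at a given scale; `fuse_modalities`, `fuse_multiscale`, and the sample arrays are hypothetical.

```python
import numpy as np

def fuse_modalities(feats, scores):
    """Softmax the scores over modalities and take the weighted feature sum.

    feats:  (M, K, D) features of M modalities for K superpixels at one scale.
    scores: (M, K) overall modal reliability scores.
    Returns the attention weights (M, K) and the fused features (K, D).
    """
    z = scores - scores.max(axis=0, keepdims=True)     # stable softmax
    alpha = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    fused = (alpha[:, :, None] * feats).sum(axis=0)
    return alpha, fused

def fuse_multiscale(feats_by_scale, scores):
    """Fuse each scale, then concatenate along the feature (scale) direction."""
    return np.concatenate(
        [fuse_modalities(f, scores)[1] for f in feats_by_scale], axis=1)

# Three modalities (HSI, RGB, LiDAR), two superpixels, two scales
rng = np.random.default_rng(0)
feats_s1 = rng.normal(size=(3, 2, 4))
feats_s2 = rng.normal(size=(3, 2, 2))
scores = np.array([[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]])  # HSI favored for k=0
alpha, fused_s1 = fuse_modalities(feats_s1, scores)
F = fuse_multiscale([feats_s1, feats_s2], scores)         # one row per superpixel
```

Because the weights are computed per superpixel, a modality can dominate in one region and contribute little in another.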

[0041] Step 107: Classification using a sample-distribution-aware width learning (broad learning) system. It should be noted that in step 107, taking the multi-scale cross-modal fused feature matrix obtained in step 106 as input, this embodiment achieves superpixel-level width learning tree species classification by combining sample-distribution-aware weight adjustment with an ensemble of multiple sub-models, thereby addressing the loss of robustness in tree species identification caused by sample imbalance.

[0042] More specifically, step 107 proceeds as follows. From each tree species category, 1% of the labeled samples are randomly selected as the training set, and the remaining samples are used as the test set. Class weights $\rho_c$ are calculated from the number of samples in each tree species category, where $N_c$ is the number of samples of the $c$-th tree species and $N_{\max}$ is the largest number of samples across all classes. Assigning greater weights to classes with fewer samples makes the model pay more attention to these small classes during training. The class weights are then passed down to the sample level, giving the weight of each training sample, $\omega_i = \rho_{y_i}$, where $y_i$ is the true class of sample $i$.

To improve the model's robustness to local noise regions, an adaptive Dropout mechanism is employed during training. For each sub-model $e$, a random mask $M_e$ is applied to the subset $F_{train}$ of the fused feature matrix that corresponds to the training samples, generating the training input $X_e = F_{train} \odot M_e$, where $\odot$ denotes element-wise multiplication. The sub-models are parallel base classifiers built on the same learning framework with identical structure; each receives independently masked input features and solves for its weights independently. The only difference between the sub-models is the random mask applied to their input features: setting some features to zero with probability $p$ keeps the sub-models weakly correlated in the feature space, and ensemble modeling then improves classification robustness and the overall generalization ability of the model.

After sample weighting, the features and labels of each sub-model are $\tilde{X}_e$ and $\tilde{Y}_e$, where $Y$ is the one-hot encoding of the labels. The weights of the $e$-th sub-model are obtained by solving the generalized pseudo-inverse with regularization, $W_e = (\tilde{X}_e^{\top}\tilde{X}_e + \eta I)^{-1}\tilde{X}_e^{\top}\tilde{Y}_e$, where $\eta$ is the regularization coefficient and $I$ is the identity matrix. The regularization term $\eta I$ alleviates the matrix singularity caused by imbalanced samples and improves the stability of parameter solving.

The prediction phase requires no iterative training and is completed with a single forward propagation. Let the multimodal fused feature vector of the $k$-th superpixel be $F_k$. Each sub-model takes the class index with the highest score as its superpixel-level prediction, $\hat{y}_k^{(e)} = \arg\max_c (F_k W_e)_c$. The results of the sub-models are integrated through majority voting to obtain the final superpixel-level classification result, $\hat{y}_k = \arg\max_c \sum_{e=1}^{E} \mathbb{1}(\hat{y}_k^{(e)} = c)$, where $\mathbb{1}(\cdot)$ is the indicator function and $E$ is the total number of sub-models in the ensemble. Finally, based on the final superpixel-level classification results, a tree species classification map that is highly consistent with the actual spatial layout of the forest stand is generated, as shown in Figure 3.
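The training and voting procedure of step 107 can be sketched in Python as follows. This is a minimal sketch, not the embodiment's implementation: it assumes class weights of the form largest class count divided by class count, applies the sample weights through square-root scaling of features and labels (weighted least squares), uses a fixed mask probability, and solves the regularized system directly on the fused features. All function names, parameters, and the synthetic data are hypothetical.

```python
import numpy as np

def train_submodels(X, y, n_classes, n_models=5, p=0.2, eta=1e-2, seed=0):
    """Train masked, sample-weighted ridge sub-models; returns a list of W_e."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y, minlength=n_classes)
    class_w = counts.max() / np.maximum(counts, 1)   # assumed form N_max / N_c
    sqrt_w = np.sqrt(class_w[y])[:, None]            # per-sample weights
    Y = np.eye(n_classes)[y]                         # one-hot labels
    weights = []
    for _ in range(n_models):
        mask = rng.random(X.shape) >= p              # zero a feature with prob. p
        Xe = (X * mask) * sqrt_w
        Ye = Y * sqrt_w
        gram = Xe.T @ Xe + eta * np.eye(X.shape[1])  # regularized normal matrix
        weights.append(np.linalg.solve(gram, Xe.T @ Ye))
    return weights

def predict_majority(X, weights, n_classes):
    """One forward pass per sub-model, then majority voting per superpixel."""
    idx = np.arange(X.shape[0])
    votes = np.zeros((n_classes, X.shape[0]), dtype=int)
    for W in weights:
        votes[np.argmax(X @ W, axis=1), idx] += 1
    return votes.argmax(axis=0)

# Synthetic, imbalanced two-class data (40 vs 8 samples)
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 0.1, (40, 6)) + np.array([1, 1, 1, 0, 0, 0])
X1 = rng.normal(0.0, 0.1, (8, 6)) + np.array([0, 0, 0, 1, 1, 1])
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 8)
weights = train_submodels(X, y, n_classes=2)
pred = predict_majority(X, weights, n_classes=2)
```

Because each sub-model has a closed-form solution, training cost is dominated by one linear solve per sub-model rather than by iterative optimization.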

[0043] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A width learning tree species identification method based on multimodal remote sensing data fusion, characterized in that it comprises: collecting HSI, RGB, and LiDAR modal remote sensing data, preprocessing and spatially registering the HSI, RGB, and LiDAR modal remote sensing data to obtain multimodal remote sensing data, and dynamically guiding superpixel segmentation of the multimodal remote sensing data to obtain superpixel segmentation results; calculating the reliability weight of each modality in the superpixel segmentation results and performing weighted voting fusion to generate a superpixel-level segmentation map, constructing an 8-region adjacency graph from the superpixel-level segmentation map, and introducing a spatial consistency fusion strategy to obtain modal spatial consistency features; determining an overall modal reliability score from the modal spatial consistency features, obtaining the dynamic modal attention weights of each superpixel region from the overall modal reliability score, and performing weighted summation based on the dynamic modal attention weights to generate a fused feature matrix; and performing superpixel-level width learning tree species classification to obtain a tree species classification map.

2. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that acquiring the multimodal remote sensing data includes: preprocessing the HSI, RGB, and LiDAR modal remote sensing data, wherein the HSI modal remote sensing data undergo spectral denoising using a wavelet transform, and redundant bands are removed through band filtering while retaining core spectral discrimination information; the RGB modal remote sensing data undergo geometric correction using an affine transformation to correct image distortion, and Gaussian filtering is used to smooth noise and enhance canopy texture features; and the LiDAR modal remote sensing data undergo statistical filtering to remove discrete noise points, a cloth-simulation filtering algorithm is used to separate ground points from non-ground points, and the non-ground points are used to generate a canopy height model with a target spatial resolution; and using any one of the preprocessed HSI, RGB, and LiDAR modal remote sensing data as a reference modality, mapping the remaining data to the same coordinate system and associating them by coordinates to generate the multimodal remote sensing data.

3. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that obtaining the superpixel segmentation results includes: using the SLIC superpixel algorithm to perform superpixel segmentation separately on the RGB and HSI optical images and the CHM raster image in the multimodal remote sensing data; during superpixel segmentation of the RGB and HSI optical images, setting a base number of superpixels and a threshold coefficient for distinguishing high-vegetation areas from low-vegetation areas; calculating the local mean vegetation height from the canopy height model and computing a quantile of the global vegetation height; determining the vegetation height type of the local area, wherein the area is identified as a high-vegetation area if the local mean vegetation height is not less than the quantile and as a low-vegetation area if the local mean vegetation height is less than the quantile; and, based on the base number of superpixels and the threshold coefficient, adaptively adjusting the number of superpixels according to the vegetation height type of the local area to obtain the superpixel segmentation results.

4. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that calculating the reliability weight of each modality in the superpixel segmentation results includes: obtaining the reliability weight $w_m$ by combining the two scores as $\lambda H_m + (1-\lambda) S_m$ and normalizing over all modalities $m' \in \mathcal{M}$; $H_m = \alpha_1 C_m^{spec} + \alpha_2 C_m^{shape} + \alpha_3 C_m^{ent}$; $S_m = \frac{1}{T}\sum_{t=1}^{T} \mathrm{IoU}(P_m, P_m^{(t)})$; where $w_m$ is the reliability weight, $m'$ is the modality traversal index, $\mathcal{M}$ is the set of all modalities participating in the fusion, $\lambda$ is the balancing coefficient, $H_m$ is the superpixel region feature homogeneity score, $S_m$ is the modality self-segmentation stability score, $\alpha_1$, $\alpha_2$, $\alpha_3$ are coefficients, $C_m^{spec}$ is a spectral consistency measure based on the average spectral angle, $C_m^{shape}$ is a spatial shape consistency measure based on regional compactness, $C_m^{ent}$ is a statistical uniformity measure based on the distribution entropy of regional features, $P_m$ is the original superpixel segmentation result, $P_m^{(t)}$ is the segmentation result obtained under the $t$-th perturbation condition, $T$ is the number of perturbations, and $\mathrm{IoU}(\cdot,\cdot)$ is the operator that calculates the intersection-over-union ratio.

5. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that obtaining the modal spatial consistency features includes: introducing a spatial consistency fusion strategy on the superpixel-level segmentation map and the 8-region adjacency graph, and updating the superpixel-level features of each modality by neighbor weighting to obtain the modal spatial consistency features: $\tilde{f}_k^m = (1-\beta_m)\, f_k^m + \beta_m \sum_{j \in \mathcal{N}(k)} a_{kj}^m f_j^m$; $a_{kj}^m = \exp\!\big(\cos(f_k^m, f_j^m)/\tau\big) \big/ \sum_{j' \in \mathcal{N}(k)} \exp\!\big(\cos(f_k^m, f_{j'}^m)/\tau\big)$; where $\tilde{f}_k^m$ is the modal spatial consistency feature, $\beta_m$ is the spatial fusion coefficient of modality $m$, $\mathcal{N}(k)$ is the set of all superpixels spatially adjacent to superpixel region $R_k$, $a_{kj}^m$ are the similarity weights under modality $m$, $\tau$ is the temperature coefficient, $f_k^m$ is the initial superpixel-level average feature of superpixel region $R_k$ under modality $m$, and $f_j^m$ is the initial superpixel-level average feature, under modality $m$, of a superpixel region $R_j$ adjacent to $R_k$.

6. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that determining the overall modal reliability score includes: cropping a local image patch from the HSI image centered on the superpixel region, and inputting the local image patch into the pre-trained large model HyperSIGMA to obtain a semantic discrimination vector, which is used to construct a semantic correction coefficient for each modality feature; calculating the intra-modal feature reliability coefficient of each modality feature in the superpixel region based on the modal spatial consistency features; and obtaining the overall modal reliability score from the semantic correction coefficient and the intra-modal feature reliability coefficient.

7. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 6, characterized in that obtaining the overall modal reliability score includes: computing the overall modal reliability score $s_k^m$ from the intra-modal feature reliability coefficient $r_k^m$ and the semantic correction coefficient $c_k^m$, where $\gamma$ is the hyperparameter controlling the magnitude of semantic correction, $\theta_m$ and $b_m$ are the learnable parameters of modality $m$ used to compute $r_k^m$, $v_m$ denotes the weight parameters of the shared semantic vector for modality $m$ used to compute $c_k^m$, and $z_k$ is the semantic discrimination vector.

8. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that generating the fused feature matrix includes: applying softmax normalization to the overall modal reliability score to obtain the dynamic modal attention weights of each superpixel region, and performing weighted summation with these weights to generate superpixel-level multi-scale features after modal adaptive dynamic weighted fusion; concatenating the superpixel-level multi-scale features along the scale direction; and traversing all superpixel regions and stacking the multi-scale fused feature vectors of the superpixel regions row by row to generate the fused feature matrix.

9. The method for width-learning tree species identification based on multimodal remote sensing data fusion according to claim 1, characterized in that obtaining the tree species classification map includes: inputting the fused feature matrix into multiple tree species classification sub-models to obtain superpixel-level prediction results, the tree species classification sub-models being trained on a training set that consists of 1% of the labeled samples randomly selected from each tree species category; during sub-model training, calculating class weights from the number of samples of each tree species category in the training set and passing them down to the sample level to obtain the weight of each training sample; using an adaptive Dropout mechanism to apply an independent random mask, for each sub-model, to the subset of the fused feature matrix corresponding to the training set, thereby generating the sub-model training input; after sample weighting, solving the output weights of each sub-model by the generalized pseudo-inverse with regularization to obtain multiple trained tree species classification sub-models; in the prediction stage, obtaining the superpixel-level prediction results of each trained tree species classification sub-model through a single forward propagation; integrating all superpixel-level prediction results through a majority voting strategy to obtain the final superpixel-level classification result; and, based on the final superpixel-level classification result, generating a tree species classification map that is highly consistent with the actual spatial layout of the forest stand.