Target echo classification method and system based on multi-modal data fusion
The target echo classification method based on multimodal data fusion utilizes BiLSTM and Transformer networks to remove redundant features and combines GRU for feature fusion, which solves the problems of decreased classification accuracy and performance fluctuation in existing technologies, and achieves higher classification accuracy and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SUZHOU HUOLING TECHNOLOGY CO LTD
- Filing Date
- 2025-05-12
- Publication Date
- 2026-06-26
AI Technical Summary
Existing target echo classification methods based on multimodal data fusion suffer from decreased classification accuracy and limited generalization ability in noisy environments, and their performance fluctuates greatly when the distribution of modal data changes.
A target echo classification method based on multimodal data fusion is adopted. Data is collected by millimeter-wave radar, lidar and infrared sensors. Lightweight feature extraction and location data labeling are performed. BiLSTM is used to model temporal dependencies and remove the features with the lowest weight. Cross-attention modal complementarity is performed through a two-stream Transformer network. Classification is performed by combining GRU gated recurrent units and fully connected layers.
It improves the accuracy and stability of classification, reduces the model's sensitivity to noise, maintains computational efficiency and information integrity, and adapts to feature fusion in different environments.
Figure CN120524418B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computational electromagnetics, specifically a target echo classification method and system based on multimodal data fusion.
Background Technology
[0002] Target echoes refer to the signals returned after electromagnetic waves emitted by radar and other detection equipment illuminate a target and are scattered by the target. These echo signals carry the target's physical characteristics (such as shape, size, and material), motion state (such as speed and direction), and environmental interference information (such as clutter and noise). In complex detection scenarios, echo signals from different targets may overlap or interfere with each other, making it difficult for traditional single-modal classification methods (such as those relying solely on radio frequency or time-domain features) to meet high-precision requirements. Target echo classification systems based on multi-modal data fusion improve the accuracy and robustness of target echo classification by integrating multi-source heterogeneous data (such as radio frequency signals, time-domain waveforms, polarization features, and spatial location information).
[0003] Existing target echo classification schemes based on multimodal data fusion directly employ cross-modal fusion classification methods. However, if cross-modal fusion without feature removal is used, such as directly concatenating all modal features, redundant or noisy features may be introduced. This can lead to decreased classification accuracy and limited generalization ability in noisy environments. Furthermore, existing technologies may only use a single fusion path, failing to balance the integrity of global information with the focus on key features, resulting in significant performance fluctuations when the modal data distribution changes.
Summary of the Invention
[0004] The present invention aims to solve at least one of the technical problems existing in the prior art; to this end, the present invention proposes a target echo classification method and system based on multimodal data fusion to solve the technical problems of decreased classification accuracy, limited generalization ability, and large performance fluctuation when the modal data distribution changes.
[0005] To address the aforementioned problems, a first aspect of the present invention provides a target echo classification method and system based on multimodal data fusion, comprising the following steps:
[0006] Data acquisition nodes are set up in the radar signal acquisition area. The data acquisition nodes acquire raw echo data through millimeter-wave radar, aerosol distribution data through lidar, and target thermal radiation data through infrared sensors.
[0007] The radar signal acquisition area is divided into grids, and processing nodes are set in the grids. Lightweight feature extraction is performed on the data acquisition nodes, and the location data of the corresponding nodes is added as data labels.
[0008] Based on data from authorized weather stations, the processing node extracts weather parameters for the data acquisition nodes, adds the weather parameters to the data labels, and sends them to the cloud processing platform along with the extracted feature data.
[0009] The cloud processing platform performs cross-modal attention fusion on radar echo features, aerosol distribution features, and infrared features to obtain cross-modal fused features;
[0010] Temporal dependencies are modeled using BiLSTM, and the weights of radar echo features, aerosol distribution features, and infrared features are analyzed. The feature with the lowest weight among the radar echo features, aerosol distribution features, and infrared features is removed. The remaining two features are then fused through a two-stream Transformer network, using cross-attention for modal complementarity, to obtain cross-fused features.
[0011] Based on cross-modal fusion features and cross-fusion features, a classifier with gated recurrent units and fully connected layers outputs the probability distribution of the classification.
[0012] Optionally, in one example of the above aspects, lightweight feature extraction is performed on the data acquisition nodes, and the location data of the corresponding nodes is added as data labels, including the following steps:
[0013] The raw echo data acquired by the millimeter-wave radar is used to generate a time-frequency spectrum map through STFT, and the time-frequency features of the time-frequency spectrum map are extracted through MobileNetV3-Small to output 256-dimensional radar echo features.
[0014] The LiDAR acquires aerosol distribution images and the infrared sensor acquires target thermal radiation images. Histogram equalization is performed, and image spatial features are extracted using ShuffleNetV2, outputting 256-dimensional feature data.
[0015] The point cloud scanned by the LiDAR is projected onto the WGS84 coordinate system to generate the latitude and longitude coordinates of the corresponding nodes, and these coordinates are attached as data labels to the features obtained by lightweight feature extraction.
[0016] Optionally, in one example of the above aspects, a processing node is set up in the grid, and the processing node extracts weather parameters for the data acquisition nodes based on authorized weather station data, including the following steps:
[0017] By merging and re-segmenting the grid, the number of data acquisition nodes in each grid is controlled to be within a preset range, and processing nodes are set in the final generated grid;
[0018] The processing node obtains data from authorized weather stations, extracts the weather parameters for the data acquisition nodes, including temperature, humidity, wind speed, and air pressure, and normalizes the collected data into a [-1,1] vector;
[0019] The normalized weather parameter vector is aligned with the data timestamps collected by the data acquisition nodes, and a 64-dimensional environmental coding vector is generated through an MLP network.
[0020] Optionally, in one example of the above aspects, the cloud processing platform performs cross-modal attention fusion on radar echo features, aerosol distribution features, and infrared features to obtain cross-modal fused features, including the following steps:
[0021] The radar echo feature Fr, aerosol distribution feature Fa, infrared feature Ft, and environmental coding vector Fe are concatenated together to construct the joint feature tensor Fjoint = [Fr, Fa, Ft, Fe].
[0022] Adaptive weights are calculated with three-way cross-attention among the radar echo feature Fr, aerosol distribution feature Fa, and infrared feature Ft, for modality pairs (i,j) with i,j∈{r,a,t};
[0023] The attention weights αij are calculated using a modality-specific learnable parameter matrix and a scaling factor that prevents gradient vanishing;
[0024] The gating mechanism introduces environmental coding to control the flow of information, g = σ(Fe·Wg); Wg is the gating parameter matrix, and σ is the Sigmoid activation function;
[0025] Calculate the cross-modal fusion feature Ffusion = g·∑(i,j) αij + (1-g)·LayerNorm(Fjoint).
[0026] Optionally, in one example of the above aspects, the weights of radar echo features, aerosol distribution features, and infrared features are analyzed by modeling time-series dependencies using BiLSTM, including the following steps:
[0027] The radar echo features, aerosol distribution features, and infrared features are concatenated to form the feature vector Xt.
[0028] The feature sequence is modeled using BiLSTM, which includes two LSTM units: one is a forward LSTM from the beginning to the end of the sequence: htf = LSTMf(Xt, h(t-1)f), and the other is a backward LSTM from the end to the beginning of the sequence: htb = LSTMb(Xt, h(t+1)b).
[0029] The forward LSTM and backward LSTM perform bidirectional hidden state fusion ht = [htf; htb], where ht is the final hidden state at time step t, containing information from the forward and backward LSTMs;
[0030] An attention mechanism is applied to the output of BiLSTM to calculate the attention weights of radar echo features, aerosol distribution features, and infrared features, which are then assigned as the weights of radar echo features, aerosol distribution features, and infrared features, respectively.
[0031] Optionally, in one example of the above aspects, an attention mechanism is applied to the output of the BiLSTM to calculate the attention weights for radar echo features, aerosol distribution features, and infrared features as follows:
[0032] Ar=softmax(Wa·tanh(Wh·Fr+bh)) / softmax(Wa·tanh(Wh·ht+bh))
[0033] Aa=softmax(Wa·tanh(Wh·Fa+bh)) / softmax(Wa·tanh(Wh·ht+bh))
[0034] At=softmax(Wa·tanh(Wh·Ft+bh)) / softmax(Wa·tanh(Wh·ht+bh))
[0035] Where: Ar, Aa, and At are the attention weights for radar echo features, aerosol distribution features, and infrared features, respectively; Fr, Fa, and Ft represent the radar echo feature vector, aerosol distribution feature vector, and infrared feature vector, respectively; Wa and Wh are learnable weight matrices; bh is the bias vector; and tanh is the activation function used to introduce nonlinearity.
[0036] Optionally, in one example of the above aspects, the remaining two features are fused through a two-stream Transformer network, using cross-attention for modal complementarity, to obtain cross-fused features, including the following steps:
[0037] The feature with the lowest weight among radar echo features, aerosol distribution features, and infrared features is removed, and the remaining two features are set as x and y;
[0038] Construct independent Transformer branches to process the two selected modal features Fx and Fy;
[0039] Establish a bidirectional feature interaction channel between the two modalities:
[0040]
[0041] Where Attnx→y is the attention weight of the feature interaction channel from feature x to feature y, Attny→x is the attention weight of the feature interaction channel from feature y to feature x, T is the temporal length, d is the feature dimension, and Wq, Wk, and Wv are modality-specific learnable parameters.
[0042] The attention outputs of the bidirectional feature interaction channels are aggregated to obtain the aggregated feature vector:
[0043] Fx′=LayerNorm(Fx+Attnx→y),
[0044] Fy′=LayerNorm(Fy+Attny→x);
[0045] The aggregated feature vectors are concatenated to obtain the cross-fusion feature Fcross=[Fx′⊕Fy′], where ⊕ represents the concatenation operation.
[0046] Optionally, in one example of the above aspects, based on cross-modal fusion features and cross-fusion features, a classifier based on a gated recurrent unit (GRU) with a fully connected layer outputs a probability distribution for classification, including the following steps:
[0047] The cross-modal fusion feature Ffusion and the cross-fusion feature Fcross are concatenated to obtain a new feature vector Fcombined = [Ffusion; Fcross];
[0048] A GRU performs temporal modeling on Fcombined to capture the temporal dependencies between features, yielding the hidden state hT at the last time step.
[0049] By inputting hT into a fully connected layer and a softmax function, the probability distribution for classification is obtained.
[0050] According to another aspect of this disclosure, a target echo classification system based on multimodal data fusion is provided, which uses the target echo classification method based on multimodal data fusion as described above to achieve target echo classification.
[0051] Compared with the prior art, the beneficial effects of the present invention are:
[0052] This invention utilizes BiLSTM to model temporal dependencies. BiLSTM captures the temporal dependencies of features through forward / backward LSTM and optimizes them using gradient descent, with weights adaptively adjusted over time. Removing the features with the lowest weights helps reduce the computational cost of the model, and considering only two highly correlated features helps to better demonstrate modal complementarity.
[0053] This invention preserves global information through cross-modal fusion without feature removal, while cross-fusion after feature removal focuses on key discriminative features. The combination of these two approaches balances information integrity and computational efficiency. The contribution of different modal features to the classification results is dynamically adjusted through the hidden states of the GRU. If data for a certain modality is missing, the fusion path without feature removal can still utilize information from the other modalities, while the fusion path with feature removal can rely on key features from the remaining modalities. This balance between retaining redundant features and selecting key features reduces the model's sensitivity to specific noise patterns and improves classification stability.
Attached Figure Description
[0054] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0055] Figure 1 is a schematic diagram of the system framework of the present invention;
[0056] Figure 2 is a schematic diagram of the method for adding data labels in the present invention;
[0057] Figure 3 is a schematic diagram of analyzing the cross-modal fusion features and cross-fusion features and outputting the classification probability distribution in the present invention.
Detailed Implementation
[0058] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0059] Referring to Figure 1, a first aspect of the present invention provides a target echo classification method and system based on multimodal data fusion, comprising the following steps:
[0060] Data acquisition nodes are set up in the radar signal acquisition area. The data acquisition nodes acquire raw echo data through millimeter-wave radar, aerosol distribution data through LiDAR, and target thermal radiation data through infrared sensors.
[0061] The radar signal acquisition area is divided into grids, and processing nodes are set in the grids. Lightweight feature extraction is performed on the data acquisition nodes, and the location data of the corresponding nodes is added as data labels.
[0062] Based on data from authorized weather stations, the processing node extracts weather parameters for the data acquisition nodes, adds the weather parameters to the data labels, and sends them to the cloud processing platform along with the extracted feature data.
[0063] The cloud processing platform performs cross-modal attention fusion on radar echo features, aerosol distribution features, and infrared features to obtain cross-modal fused features;
[0064] Temporal dependencies are modeled using BiLSTM, and the weights of radar echo features, aerosol distribution features, and infrared features are analyzed. The feature with the lowest weight among the radar echo features, aerosol distribution features, and infrared features is removed. The remaining two features are then fused through a two-stream Transformer network, using cross-attention for modal complementarity, to obtain cross-fused features.
[0065] Based on cross-modal fusion features and cross-fusion features, a classifier based on GRU gated recurrent units and fully connected layers outputs the probability distribution of the classification.
[0066] Specifically, in this embodiment, millimeter-wave radar nodes are deployed, with FMCW millimeter-wave radars (e.g., 24GHz / 77GHz) positioned at the edge or key locations of the acquisition area to cover the target region. Raw I / Q echo data (including range, velocity, and angle information) is acquired, and the sampling rate must meet the Nyquist criterion (e.g., sampling rate ≥ 1GHz when bandwidth is 500MHz). Data from multiple nodes is synchronized via PTP (Precision Time Protocol) or GPS clock.
[0067] Deploy LiDAR nodes, co-located or staggered with millimeter-wave radar, selecting mechanical or solid-state LiDARs (such as 16-line Velodyne or Ouster OS1), with a vertical angular resolution ≤0.3°. Collect aerosol (PM2.5 / PM10) distribution data.
[0068] Deploy infrared sensor nodes and arrange uncooled infrared arrays (such as FLIR Lepton 3.5) with a resolution of ≥160×120 above the high points of the area or the path of dynamic targets to collect infrared image data.
[0069] Perform grid generation, dividing the area into a 3D grid according to sensor resolution (e.g., a LiDAR horizontal angular resolution of 0.1° corresponds to a spacing of approximately 0.17m at a range of 100m). For example, a 100m×100m area is divided into 1m×1m×1m voxels, with each voxel associated with a unique ID.
[0070] Set up proxy processing nodes. In each sub-region, such as a 10×10 grid, a central proxy processing node is set up to coordinate 5-10 surrounding data acquisition nodes. The proxy processing nodes perform data aggregation, time synchronization, and lightweight feature fusion.
[0071] Position encoding is performed, with the grid ID embedded as a tag in the data header, such as [Grid_ID:X12_Y34_Z2,Timestamp:1620000000]. The LiDAR's ENU coordinates are transformed to the radar's local coordinate system using a seven-parameter method. Synchronization is achieved using the NTPv4 protocol, with a time error between nodes ≤50μs.
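The grid indexing and position tag described above can be sketched as follows. This is a minimal illustration, assuming 1m voxels indexed from a local origin; the function names `voxel_index` and `make_tag` are hypothetical, and only the tag layout follows the example given in the text.

```python
import math

def voxel_index(x_m, y_m, z_m, voxel_size_m=1.0):
    """Map local coordinates in meters to integer voxel indices (origin assumed at 0, 0, 0)."""
    return tuple(math.floor(v / voxel_size_m) for v in (x_m, y_m, z_m))

def make_tag(index, timestamp_s):
    """Format a data-header tag in the layout used by the example in the text."""
    ix, iy, iz = index
    return f"[Grid_ID:X{ix}_Y{iy}_Z{iz},Timestamp:{timestamp_s}]"

# Hypothetical node position inside the 100m x 100m area.
tag = make_tag(voxel_index(12.4, 34.9, 2.0), 1620000000)
print(tag)
```

The coordinate transformation and clock synchronization steps are outside the scope of this sketch.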
[0072] Obtain authorized weather station data, integrate weather station data sources, and include the following data types:
[0073] Basic parameters: temperature (°C), humidity (%), air pressure (hPa), wind speed (m / s), wind direction (°);
[0074] Advanced parameters: visibility (km), precipitation intensity (mm / h), cloud height (m);
[0075] Real-time API: Connects to authorized open interfaces (such as RESTful API) of meteorological bureaus, NOAA, etc., and pulls data every 5 minutes.
[0076] Perform data quality verification and reasonableness checks, including: temperature range [-50℃, 60℃], humidity [0%, 100%], and mark outliers as NaN.
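A reasonableness check of this kind might look like the sketch below. The temperature and humidity limits are the ones stated above; the record layout and the name `mark_outliers` are assumptions made for illustration.

```python
import math

# Plausibility ranges taken from the text; other parameters would need their own limits.
VALID_RANGES = {"temperature_c": (-50.0, 60.0), "humidity_pct": (0.0, 100.0)}

def mark_outliers(record):
    """Return a copy of the record with out-of-range values replaced by NaN."""
    cleaned = dict(record)
    for key, (low, high) in VALID_RANGES.items():
        value = cleaned.get(key)
        if value is not None and not (low <= value <= high):
            cleaned[key] = math.nan
    return cleaned

# Hypothetical reading with an implausible temperature.
sample = {"temperature_c": 72.5, "humidity_pct": 40.0}
checked = mark_outliers(sample)
```

Returning a copy keeps the raw reading available for later auditing.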
[0077] Spatial mapping is performed by using Kriging interpolation to spread weather station data across the grid area, generating a spatially continuous weather field. For example, data from weather stations A (100m elevation) and B (200m elevation) are corrected using an elevation-temperature gradient (0.6℃ / 100m) to obtain the temperature at a grid center at an elevation of 150m.
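The elevation correction in this example can be worked through numerically. The sketch below assumes that temperature decreases with height at the stated gradient, uses hypothetical station readings, and replaces the Kriging weights with a plain average for brevity.

```python
LAPSE_RATE_C_PER_M = 0.6 / 100.0  # gradient from the text: 0.6 degrees C per 100 m

def correct_to_elevation(temp_c, station_elev_m, target_elev_m):
    """Shift a station temperature to the target elevation (cooler when higher)."""
    return temp_c - LAPSE_RATE_C_PER_M * (target_elev_m - station_elev_m)

# Hypothetical readings: station at 100 m reads 20.0 C, station at 200 m reads 19.0 C.
temp_a = correct_to_elevation(20.0, 100.0, 150.0)
temp_b = correct_to_elevation(19.0, 200.0, 150.0)

# A plain average stands in for the Kriging weights, which are not modeled here.
grid_temp = (temp_a + temp_b) / 2.0
```

With these readings, both stations are first brought to the 150m level and only then combined, so the elevation difference does not bias the interpolated value.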
[0078] Time synchronization is performed by aligning the timestamps of meteorological data to the sensor data frames (such as the 100ms period of millimeter-wave radar) and reducing timescale differences by averaging through a sliding window (window = 5 frames).
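One way to picture this alignment is shown below. The sketch assumes that each sensor frame takes the most recent weather sample at or before its timestamp and that the 5-frame window is a trailing average; both choices and the function names are illustrative assumptions.

```python
def align_to_frames(weather_samples, frame_times):
    """Assign each sensor frame the latest weather sample at or before its timestamp."""
    aligned, j = [], 0
    for t in frame_times:
        while j + 1 < len(weather_samples) and weather_samples[j + 1][0] <= t:
            j += 1
        aligned.append(weather_samples[j][1])
    return aligned

def sliding_mean(values, window=5):
    """Trailing moving average; early frames average over the frames available so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical data: weather samples as (time_s, temperature), radar frames every 100 ms.
weather = [(0.0, 20.0), (0.3, 21.0)]
frames = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
per_frame = align_to_frames(weather, frames)
smoothed = sliding_mean(per_frame, window=5)
```

The moving average turns the step change at 0.3 s into a gradual ramp across the following frames.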
[0079] Extract the values of the basic and advanced parameters for the data acquisition nodes, add the extracted values to the data labels, and send them to the cloud processing platform along with the extracted feature data.
[0080] Through a cross-modal attention mechanism, the weights of radar echo features, aerosol distribution features, and infrared features can be dynamically adjusted. For example, in rainy or foggy weather, the thermal radiation gradient of infrared features and the micro-Doppler features of radar echoes may be given higher weights, while the weight of aerosol distribution features may be reduced in low-visibility scenarios, achieving environment-adaptive feature enhancement. Cross-modal attention calculates the correlation between different modal features through an attention matrix. For example, the offset between the target trajectory detected by radar and the infrared thermal radiation center can be correlated through attention weights, thereby uncovering implicit physical laws such as "local temperature anomalies caused by drone propeller rotation," improving feature interpretability.
[0081] Noise characteristics differ significantly across modes, such as radar clutter and infrared noise. Cross-modal attention can suppress single-mode noise through feature alignment. For example, when a radar echo produces a false alarm due to strong ground clutter, if the infrared feature indicates that there is no heat source in the area, the echo feature can be suppressed through attention weighting, thereby reducing the false detection rate.
[0082] Temporal dependencies are modeled using BiLSTM, and the weights of radar echo features, aerosol distribution features, and infrared features are analyzed. The feature with the lowest weight among the radar echo features, aerosol distribution features, and infrared features is removed. The remaining two features are then fused through a two-stream Transformer network, using cross-attention for modal complementarity, to obtain cross-fused features.
[0083] The process is as follows: Data acquisition → BiLSTM temporal modeling → Feature removal → Two-stream Transformer fusion → Output. BiLSTM captures the temporal dependencies of features using forward / backward LSTMs, and optimizes through gradient descent, with weights adaptively adjusted over time. Removing the features with the lowest weights helps reduce the model's computational cost, and considering only two highly correlated features helps to better demonstrate modal complementarity.
[0084] Based on cross-modal fusion features and cross-fusion features, a classifier based on GRU gated recurrent units and fully connected layers outputs the probability distribution of the classification.
[0085] Target echo signals typically exhibit temporal characteristics (such as the time-varying nature of pulse sequences). GRU dynamically captures long-term and short-term dependencies through update and reset gates, making it more effective than traditional RNNs. Cross-modal fusion without feature removal preserves global information, while cross-fusion after feature removal focuses on key discriminative features; the combination of both balances information integrity and computational efficiency. The contribution of different modal features to the classification results is dynamically adjusted through the hidden states of GRU. If data for a certain modality is missing (e.g., radar is obstructed), the fusion path without feature removal can still utilize information from the other modalities, while the fusion path with feature removal can rely on key features of the remaining modalities. This balance between redundant feature retention and key feature selection reduces the model's sensitivity to specific noise patterns and improves classification stability.
[0086] In one embodiment of the present invention, lightweight feature extraction is performed on the data acquisition nodes, and the location data of the corresponding nodes is added as data labels, including the following steps:
[0087] The millimeter-wave radar acquires raw echo data, from which a time-frequency spectrum map is generated through STFT. The time-frequency features of the spectrum map are extracted through MobileNetV3-Small, and 256-dimensional radar echo features are output.
[0088] The LiDAR acquires aerosol distribution images and the infrared sensor acquires target thermal radiation images. Histogram equalization is performed, and image spatial features are extracted using ShuffleNetV2, outputting 256-dimensional feature data.
[0089] The point cloud scanned by the LiDAR is projected onto the WGS84 coordinate system to generate the latitude and longitude coordinates of the corresponding nodes, and these coordinates are attached as data labels to the features obtained by lightweight feature extraction.
[0090] In this embodiment, the radar echo data is used to generate a time-frequency spectrum (128×128 resolution) through STFT and normalized to [0,1]. The time-frequency features are then extracted using MobileNetV3-Small.
[0091] Aerosol distribution images and infrared images are contrast-enhanced by histogram equalization and cropped to 224×224 ROI regions, and their spatial features are extracted using ShuffleNetV2.
[0092] Meteorological data is normalized to a [-1,1] vector and aligned with the radar / infrared data timestamps.
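For the radar branch of the preprocessing above, the STFT and [0,1] normalization can be sketched in plain Python. The example assumes a real-valued test signal, a Hann window, and a 16-sample frame rather than the 128×128 map of this embodiment; the feature extraction network is not modeled.

```python
import cmath
import math

def stft_magnitude(signal, frame_len=16, hop=8):
    """Magnitude spectrogram: rows are frames, columns are bins 0..frame_len//2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

def normalize_01(spec):
    """Min-max scale the whole map to [0, 1]."""
    lo = min(min(row) for row in spec)
    hi = max(max(row) for row in spec)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in spec]

# Hypothetical test tone centered on bin 4 of a 16-sample frame.
signal = [math.cos(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = normalize_01(stft_magnitude(signal))
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

A production pipeline would use an FFT routine and complex I / Q input; the direct DFT here only keeps the example dependency-free.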
[0093] In one embodiment of the present invention, a processing node is set in the grid, and the processing node extracts weather parameters for the data acquisition nodes based on authorized weather station data, including the following steps:
[0094] By merging and re-segmenting the grid, the number of data acquisition nodes in each grid is controlled to be within a preset range, and processing nodes are set in the final generated grid;
[0095] The processing node obtains data from authorized weather stations, extracts the weather parameters for the data acquisition nodes, including temperature, humidity, wind speed, and air pressure, and normalizes the collected data into a [-1,1] vector;
[0096] The normalized weather parameter vector is aligned with the data timestamps collected by the data acquisition nodes, and a 64-dimensional environmental coding vector is generated through an MLP network.
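The normalization step above can be illustrated with simple min-max scaling. In this sketch the temperature and humidity ranges reuse the plausibility limits given earlier in the description, while the wind speed and air pressure ranges are assumed values; the MLP that produces the 64-dimensional environmental coding vector is not modeled.

```python
# Ranges used for scaling; wind speed and pressure limits are assumptions.
RANGES = {
    "temperature_c": (-50.0, 60.0),
    "humidity_pct": (0.0, 100.0),
    "wind_speed_ms": (0.0, 60.0),
    "pressure_hpa": (870.0, 1085.0),
}

def normalize_weather(record):
    """Scale each parameter linearly so that its range maps onto [-1, 1]."""
    vector = []
    for key, (low, high) in RANGES.items():
        scaled = 2.0 * (record[key] - low) / (high - low) - 1.0
        vector.append(max(-1.0, min(1.0, scaled)))  # clip values outside the range
    return vector

# Hypothetical reading for one data acquisition node.
vec = normalize_weather({"temperature_c": 5.0, "humidity_pct": 100.0,
                         "wind_speed_ms": 0.0, "pressure_hpa": 977.5})
```

Mid-range values map to 0, and the range limits map to -1 and 1.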
[0097] In one embodiment of the present invention, the cloud processing platform performs cross-modal attention fusion on radar echo features, aerosol distribution features, and infrared features to obtain cross-modal fused features, including the following steps:
[0098] The radar echo feature Fr, aerosol distribution feature Fa, infrared feature Ft, and environmental coding vector Fe are concatenated together to construct the joint feature tensor Fjoint = [Fr, Fa, Ft, Fe].
[0099] Adaptive weights are calculated with three-way cross-attention among the radar echo feature Fr, aerosol distribution feature Fa, and infrared feature Ft, for modality pairs (i,j) with i,j∈{r,a,t};
[0100] The attention weights αij are calculated using a modality-specific learnable parameter matrix and a scaling factor that prevents gradient vanishing;
[0101] The gating mechanism introduces environmental coding to control the flow of information, g = σ(Fe·Wg); Wg is the gating parameter matrix, and σ is the Sigmoid activation function;
[0102] Calculate the cross-modal fusion feature Ffusion = g·∑(i,j) αij + (1-g)·LayerNorm(Fjoint).
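The gating step can be sketched as below. Because the text does not state the tensor shapes, the example assumes that the aggregated attention term (called `attended` here) has already been brought to the same length as Fjoint and that Wg maps Fe to that length; these shapes and all names are illustrative assumptions, not the claimed implementation.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def gate(f_env, w_gate):
    """g = sigmoid(Fe · Wg); w_gate has shape (len(f_env), output_dim)."""
    dim = len(w_gate[0])
    logits = [sum(f_env[i] * w_gate[i][j] for i in range(len(f_env))) for j in range(dim)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def gated_fusion(attended, f_joint, f_env, w_gate):
    """Ffusion = g * attended + (1 - g) * LayerNorm(Fjoint), element-wise."""
    g = gate(f_env, w_gate)
    normed = layer_norm(f_joint)
    return [gk * a + (1.0 - gk) * n for gk, a, n in zip(g, attended, normed)]

# Toy sizes (hypothetical): 2-dim environment code, 4-dim fused vectors.
f_env = [0.5, -0.5]
w_gate = [[0.0] * 4, [0.0] * 4]   # zero gate weights give g = 0.5 everywhere
attended = [1.0, 1.0, 1.0, 1.0]
f_joint = [1.0, 2.0, 3.0, 4.0]
fused = gated_fusion(attended, f_joint, f_env, w_gate)
```

With a trained Wg, the environmental code would push g toward 1 (favoring the attention term) or toward 0 (favoring the normalized joint features).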
[0103] In one embodiment of the present invention, the weights of radar echo features, aerosol distribution features, and infrared features are analyzed by modeling time-series dependencies using BiLSTM, including the following steps:
[0104] The radar echo features, aerosol distribution features, and infrared features are concatenated to form the feature vector Xt.
[0105] The feature sequence is modeled using BiLSTM, which includes two LSTM units: one is a forward LSTM from the beginning to the end of the sequence: htf = LSTMf(Xt, h(t-1)f), and the other is a backward LSTM from the end to the beginning of the sequence: htb = LSTMb(Xt, h(t+1)b).
[0106] The forward LSTM and backward LSTM perform bidirectional hidden state fusion ht = [htf; htb], where ht is the final hidden state at time step t, containing information from the forward and backward LSTMs;
[0107] An attention mechanism is applied to the output of BiLSTM to calculate the attention weights of radar echo features, aerosol distribution features, and infrared features, which are then assigned as the weights of radar echo features, aerosol distribution features, and infrared features, respectively.
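The bidirectional modeling in the steps above can be sketched with a small hand-written LSTM. The weights are random placeholders, the sizes are toy values, and the function names are hypothetical; the attention stage that follows the BiLSTM is not included.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def init_params(input_size, hidden, seed):
    """Random illustrative weights; a trained model would supply these."""
    rng = random.Random(seed)
    W = {name: [[rng.uniform(-0.5, 0.5) for _ in range(hidden + input_size)]
                for _ in range(hidden)] for name in "ifog"}
    b = {name: [0.0] * hidden for name in "ifog"}
    return W, b

def lstm_step(x, h, c, W, b):
    """One LSTM step on the concatenated vector [h_{t-1}, x_t]."""
    z = h + x
    def pre(name):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W[name], b[name])]
    i = [sigmoid(v) for v in pre("i")]      # input gate
    f = [sigmoid(v) for v in pre("f")]      # forget gate
    o = [sigmoid(v) for v in pre("o")]      # output gate
    g = [math.tanh(v) for v in pre("g")]    # candidate cell state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def run_lstm(seq, W, b, hidden):
    h, c, out = [0.0] * hidden, [0.0] * hidden, []
    for x in seq:
        h, c = lstm_step(x, h, c, W, b)
        out.append(h)
    return out

def bilstm(seq, params_f, params_b, hidden):
    """Return ht = [htf; htb] for every time step."""
    fwd = run_lstm(seq, *params_f, hidden)
    bwd = run_lstm(seq[::-1], *params_b, hidden)[::-1]
    return [hf + hb for hf, hb in zip(fwd, bwd)]

# Hypothetical 3-step sequence of 3-dim concatenated features Xt.
seq = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
params_f = init_params(3, 2, seed=0)
params_b = init_params(3, 2, seed=1)
states = bilstm(seq, params_f, params_b, 2)
```

Each entry of `states` holds the forward half followed by the backward half, so the state at the first time step already carries information from the end of the sequence.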
[0108] In one embodiment of the present invention, an attention mechanism is applied to the output of the BiLSTM to calculate the attention weights for radar echo features, aerosol distribution features, and infrared features as follows:
[0109] Ar=softmax(Wa·tanh(Wh·Fr+bh)) / softmax(Wa·tanh(Wh·ht+bh))
[0110] Aa=softmax(Wa·tanh(Wh·Fa+bh)) / softmax(Wa·tanh(Wh·ht+bh))
[0111] At=softmax(Wa·tanh(Wh·Ft+bh)) / softmax(Wa·tanh(Wh·ht+bh))
[0112] Where: Ar, Aa, and At are the attention weights for radar echo features, aerosol distribution features, and infrared features, respectively; Fr, Fa, and Ft represent the radar echo feature vector, aerosol distribution feature vector, and infrared feature vector, respectively; Wa and Wh are learnable weight matrices; bh is the bias vector; and tanh is the activation function used to introduce nonlinearity.
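Once the attention weights Ar, Aa, and At are available, choosing the modality to discard reduces to a minimum search. The sketch below uses hypothetical scalar weights and a hypothetical helper name `drop_lowest`.

```python
def drop_lowest(weights):
    """Return the removed modality and the names of the two that are kept."""
    lowest = min(weights, key=weights.get)
    kept = [name for name in weights if name != lowest]
    return lowest, kept

# Hypothetical attention weights for one scene.
weights = {"radar": 0.46, "aerosol": 0.17, "infrared": 0.37}
removed, (x_name, y_name) = drop_lowest(weights)
```

The two retained names correspond to the modalities that the two-stream Transformer processes as x and y.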
[0113] In one embodiment of the present invention, the remaining two features are fused through a two-stream Transformer network, using cross-attention for modal complementarity, to obtain cross-fused features, including the following steps:
[0114] The feature with the lowest weight among radar echo features, aerosol distribution features, and infrared features is removed, and the remaining two features are set as x and y;
[0115] Construct independent Transformer branches to process the two selected modal features Fx and Fy;
[0116] Establish a bidirectional feature interaction channel between the two modalities:
[0117]
[0118] Where Attnx→y is the attention weight of the feature interaction channel from feature x to feature y, Attny→x is the attention weight of the feature interaction channel from feature y to feature x, T is the temporal length, d is the feature dimension, and Wq, Wk, and Wv are modality-specific learnable parameters.
[0119] The attention outputs of the bidirectional feature interaction channels are aggregated to obtain the aggregated feature vector:
[0120] Fx′=LayerNorm(Fx+Attnx→y),
[0121] Fy′=LayerNorm(Fy+Attny→x);
[0122] The aggregated feature vectors are concatenated to obtain the cross-fusion feature Fcross=[Fx′⊕Fy′], where ⊕ represents the concatenation operation.
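The bidirectional interaction, residual LayerNorm, and concatenation above can be sketched as follows. The attention formula itself is not reproduced in the text, so this example assumes standard scaled dot-product attention with queries from one modality and keys and values from the other; it also shares one set of Wq, Wk, Wv across both directions for brevity, whereas the text describes them as modality-specific.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def layer_norm_rows(m, eps=1e-5):
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def cross_attention(f_query, f_context, wq, wk, wv):
    """Assumed form: softmax(Q K^T / sqrt(d)) V with Q from one modality, K and V from the other."""
    q, k, v = matmul(f_query, wq), matmul(f_context, wk), matmul(f_context, wv)
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d) for kr in k] for qr in q]
    return matmul(softmax_rows(scores), v)

def fuse(fx, fy, wq, wk, wv):
    attn_xy = cross_attention(fx, fy, wq, wk, wv)
    attn_yx = cross_attention(fy, fx, wq, wk, wv)
    fx_new = layer_norm_rows([[a + b for a, b in zip(r, s)] for r, s in zip(fx, attn_xy)])
    fy_new = layer_norm_rows([[a + b for a, b in zip(r, s)] for r, s in zip(fy, attn_yx)])
    return [rx + ry for rx, ry in zip(fx_new, fy_new)]  # concatenate per time step

# Toy inputs (hypothetical): T = 3 time steps, d = 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
fx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fy = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
f_cross = fuse(fx, fy, identity, identity, identity)
```

Each row of `f_cross` has length 2d, matching the concatenation of Fx′ and Fy′ at that time step.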
[0123] In one embodiment of the present invention, based on cross-modal fusion features and cross-fusion features, a classifier with a gated recurrent unit (GRU) and a fully connected layer outputs a probability distribution for classification, including the following steps:
[0124] The cross-modal fusion feature Fcross and the cross-fusion feature Fcross are concatenated to obtain a new feature vector Fcombined = [Fcross; Fcross];
[0125] A GRU performs temporal modeling on Fcombined to capture the temporal dependencies between features, and the hidden state hT at the last time step is obtained.
[0126] The calculation process for a GRU cell is as follows:
[0127] Update gate: zt = σ(Wz[ht-1,xt] + bz)
[0128] Reset gate: rt=σ(Wr[ht-1,xt]+br)
[0129] Candidate hidden state: h~t=tanh(Wh[rt⊙ht-1,xt]+bh)
[0130] Hidden state update: ht=(1-zt)⊙ht-1+zt⊙h~t
[0131] Where: xt is the input feature at time step t, ht-1 is the hidden state at the previous time step, Wz, Wr and Wh are learnable weight matrices, bz, br and bh are bias vectors, σ is the sigmoid activation function, and ⊙ is element-wise multiplication.
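The four GRU equations in [0127] to [0130] can be written out directly. The following is a minimal sketch in plain Python, assuming each weight matrix acts on the concatenation [ht-1, xt] exactly as written above; the dictionary layout of the parameters, the zero initial hidden state, and the function names `gru_step` and `gru_last_hidden` are illustrative assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, vec, b):
    """Compute W @ vec + b for a matrix given as a list of rows."""
    return [sum(wi * vi for wi, vi in zip(row, vec)) + bi for row, bi in zip(w, b)]


def gru_step(h_prev, x_t, p):
    """One GRU cell update following the equations in [0127]-[0130]."""
    hx = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    z = [sigmoid(v) for v in affine(p["Wz"], hx, p["bz"])]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], hx, p["br"])]  # reset gate
    rh_x = [ri * hi for ri, hi in zip(r, h_prev)] + x_t     # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(v) for v in affine(p["Wh"], rh_x, p["bh"])]
    # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, element-wise
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


def gru_last_hidden(xs, p, hidden_size):
    """Run the cell over a sequence and return h_T (initial state assumed zero)."""
    h = [0.0] * hidden_size
    for x_t in xs:
        h = gru_step(h, x_t, p)
    return h
```

For a hidden size H and input size D, each of Wz, Wr, and Wh has H rows and H+D columns in this layout.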
[0132] By inputting hT into a fully connected layer and a softmax function, the probability distribution for classification is obtained.
[0133] The last hidden state hT of the GRU is classified through a fully connected layer, with the formula: o = WohT + bo;
[0134] The probability distribution of the classification is obtained by using the softmax function, and the formula is as follows:
[0135] P(y=a)=softmax(o)a=exp(oa) / ∑n exp(on)
[0136] Where: Wo is the weight matrix of the fully connected layer, bo is the bias vector of the fully connected layer, o is the output vector of the fully connected layer, and P(y=a) is the probability of class a.
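The classification head in [0132] to [0136] amounts to one affine map followed by a softmax. The sketch below is a plain-Python illustration; the function name `classify` and the subtraction of the maximum logit (a common numerical-stability step that leaves the probabilities unchanged) are assumptions, not details stated in this text.

```python
import math


def classify(h_T, w_o, b_o):
    """Return P(y=a) for every class a, given the last GRU hidden state h_T."""
    # Fully connected layer: o = Wo hT + bo
    o = [sum(w * h for w, h in zip(row, h_T)) + b for row, b in zip(w_o, b_o)]
    # Softmax: exp(o_a) / sum_n exp(o_n), shifted by max(o) for stability
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]
```

The predicted class is the index of the largest entry of the returned list.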
[0137] In this embodiment, the cross-modal fusion features are obtained from the radar echo features, aerosol distribution features, infrared features, and environmental coding; two of the radar echo features, aerosol distribution features, and infrared features are fused through cross-attention modal complementarity in a two-stream Transformer network to obtain the cross-fusion features; and, based on the cross-modal fusion features and the cross-fusion features, a classifier built from a gated recurrent unit (GRU) and a fully connected layer outputs the probability distribution of the classification.
[0138] The temporal dependencies between features are captured by the computation of GRU units, and the hidden state hT at the last time step is obtained.
[0139] The last hidden state hT of the GRU is classified through a fully connected layer, and the probability distribution of the classification is obtained by using the softmax function.
[0140] In another embodiment of the present invention, a target echo classification system based on multimodal data fusion is provided, which uses the target echo classification method based on multimodal data fusion as described above to achieve target echo classification.
[0141] The above embodiments are only used to illustrate the technical methods of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical methods of the present invention without departing from the spirit and scope of the technical methods of the present invention.
Claims
1. A target echo classification method based on multimodal data fusion, characterized in that it includes the following steps: Data acquisition nodes are set up in the radar signal acquisition area. The data acquisition nodes acquire raw echo data through millimeter-wave radar, aerosol distribution data through lidar, and target thermal radiation data through infrared sensors. The radar signal acquisition area is divided into grids, and processing nodes are set in the grids. Lightweight feature extraction is performed on the data acquisition nodes, and the location data of the corresponding nodes is added as data labels. The processing node extracts weather parameters from the data acquisition nodes based on the data from the authorized weather station, adds the weather parameters to the data labels, and sends them to the cloud processing platform along with the extracted feature data. The cloud processing platform performs cross-modal attention fusion on radar echo features, aerosol distribution features, and infrared features to obtain cross-modal fusion features; Temporal dependencies are modeled using BiLSTM, and the weights of radar echo features, aerosol distribution features, and infrared features are analyzed. The feature with the lowest weight among radar echo features, aerosol distribution features, and infrared features is removed. The remaining two features are then fused through a two-stream Transformer network by cross-attention modal complementarity to obtain cross-fusion features. Based on the cross-modal fusion features and the cross-fusion features, a classifier with gated recurrent units and fully connected layers outputs the probability distribution of the classification.
2. The target echo classification method based on multimodal data fusion according to claim 1, characterized in that, Lightweight feature extraction is performed on the data acquisition nodes, and the location data of the corresponding nodes is added as data labels, including the following steps: The millimeter-wave radar acquires raw echo data and generates a time-frequency spectrogram through STFT. The time-frequency features of the spectrogram are extracted through MobileNetV3-Small, and 256-dimensional radar echo features are output. The lidar acquires aerosol distribution images and the infrared sensor acquires target thermal radiation images. Histogram equalization is performed, and image spatial features are extracted using ShuffleNetV2, outputting 256-dimensional feature data. The point cloud scanned by the lidar is projected onto the WGS84 coordinate system to generate the latitude and longitude coordinates of the corresponding nodes, and the latitude and longitude coordinates are used as the data labels for the lightweight feature extraction.
3. The target echo classification method based on multimodal data fusion according to claim 1, characterized in that, Processing nodes are set in the grids, and the processing nodes extract weather parameters from the data acquisition nodes based on authorized weather station data, including the following steps: By merging and re-segmenting the grids, the number of data acquisition nodes in each grid is controlled to be within a preset range, and processing nodes are set in the finally generated grids; The processing node obtains data from authorized weather stations, extracts weather parameters from data acquisition nodes, including temperature, humidity, wind speed, and air pressure, and normalizes the collected data into a vector with values in [-1,1]; The normalized weather parameter vector is aligned with the timestamps of the data collected by the data acquisition nodes, and a 64-dimensional environmental coding vector is generated through an MLP network.
4. The target echo classification method based on multimodal data fusion according to claim 1, characterized in that, The cloud processing platform performs cross-modal attention fusion on radar echo features, aerosol distribution features, and infrared features to obtain cross-modal fused features, including the following steps: The radar echo feature Fr, aerosol distribution feature Fa, infrared feature Ft, and environmental coding vector Fe are concatenated together to construct a joint feature tensor Fjoint=[Fr,Fa,Ft,Fe]. Adaptive weight calculation is performed, and the three-way cross-attention of radar echo feature Fr, aerosol distribution feature Fa, and infrared feature Ft is set as: Qi=Fi Kj=Fj Vj=Fj ,(i,j)∈{r,a,t}; Calculate attention weights ,in, , , The learnable parameter matrix of the modes, Scaling factor to prevent gradient vanishing; The gating mechanism introduces environmental coding to control the flow of information g=σ(Fe) Wg); Wg is the gating parameter matrix, σ is the Sigmoid activation function, and Fe is the environment coding vector; Calculate cross-modal fusion features , where Fjoint is the joint feature tensor.
5. The target echo classification method based on multimodal data fusion according to claim 1, characterized in that, The temporal dependencies are modeled using BiLSTM, and the weights of radar echo features, aerosol distribution features, and infrared features are analyzed, including the following steps: Radar echo features, aerosol distribution features, and infrared features are concatenated to form a spliced feature vector Xt. The feature sequence is modeled using BiLSTM, which includes two LSTM units: one is a forward LSTM from the beginning to the end of the sequence: htf=LSTMf(Xt,h(t-1)f), and the other is a backward LSTM from the end to the beginning of the sequence: htb=LSTMb(Xt,h(t+1)b). The forward LSTM and backward LSTM perform bidirectional hidden state fusion ht=[htf;htb], where ht is the final hidden state at time step t, which contains information from the forward and backward LSTMs; An attention mechanism is applied to the output of BiLSTM to calculate the attention weights of radar echo features, aerosol distribution features, and infrared features, which are then assigned as the weights of radar echo features, aerosol distribution features, and infrared features, respectively.
6. The target echo classification method based on multimodal data fusion according to claim 5, characterized in that, An attention mechanism is applied to the output of the BiLSTM to calculate the attention weights for radar echo features, aerosol distribution features, and infrared features, where Ar, Aa, and At are the attention weights for radar echo features, aerosol distribution features, and infrared features, respectively; Fr, Fa, and Ft represent the radar echo feature vector, aerosol distribution feature vector, and infrared feature vector, respectively; Wa and Wh are learnable weight matrices; bh is the bias vector; and tanh is the activation function used to introduce nonlinearity.
7. The target echo classification method based on multimodal data fusion according to claim 1, characterized in that, Using a two-stream Transformer network, the remaining two types of features are fused through cross-attention modal complementarity to obtain cross-fusion features, including the following steps: The feature with the lowest weight among radar echo features, aerosol distribution features, and infrared features is removed, and the remaining two features are set as x and y; Construct independent Transformer branches to process the two selected modal features Fx and Fy; Establish a bidirectional feature interaction channel between the two modalities, where Attnx→y is the attention weight of the feature interaction channel from feature x to feature y, Attny→x is the attention weight of the feature interaction channel from feature y to feature x, T is the temporal length, d is the feature dimension, and Wq, Wk, and Wv are modality-specific learnable parameters; The attention outputs of the bidirectional feature interaction channels are aggregated to obtain the aggregated feature vectors: Fx′=LayerNorm(Fx+Attnx→y), Fy′=LayerNorm(Fy+Attny→x); The aggregated feature vectors are concatenated to obtain the cross-fusion feature Fcross=[Fx′⊕Fy′], where ⊕ represents the concatenation operation.
8. The target echo classification method based on multimodal data fusion according to claim 1, characterized in that, Based on cross-modal fusion features and cross-fusion features, a classifier based on gated recurrent units (GRU) and fully connected layers outputs the probability distribution of the classification, including the following steps: The cross-modal fusion feature Fcross and the cross-fusion feature Fcross are concatenated to obtain a new feature vector Fcombined=[Fcross;Fcross]; A GRU performs temporal modeling on Fcombined to capture the temporal dependencies between features and obtain the hidden state hT at the last time step; hT is input into a fully connected layer and a softmax function to obtain the probability distribution of the classification.
9. A target echo classification system based on multimodal data fusion, characterized in that, The system uses the target echo classification method based on multimodal data fusion as described in any one of claims 1-8 to achieve target echo classification.