A time-frequency dual-channel feature fusion passive sonar target recognition method and system
The TFCA-Net deep learning model, which integrates time-frequency dual-channel features, solves the problem of low accuracy in passive sonar target recognition in complex marine environments, achieving high robustness and high accuracy in target recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- THE 715TH RES INST OF CHINA SHIPBUILDING IND CORP
- Filing Date
- 2026-04-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from low target recognition accuracy and poor time-frequency feature fusion in complex marine environments, making it difficult to meet the requirements for high robustness and high precision.
The deep learning model TFCA-Net, which uses time-frequency dual-channel feature fusion, employs an end-to-end architecture of dual input, dual branch, cross-domain fusion, and classification output. It utilizes stacked bidirectional long short-term memory networks and improved deep convolutional networks to extract time-domain and frequency-domain features, and achieves adaptive weighting and residual connections through a cross-domain attention fusion module to deeply explore the intrinsic correlation between time-frequency features.
It significantly improves recognition accuracy and robustness in complex marine environments. On the ShipsEar dataset, accuracy is 10.5 percentage points higher than that of the single time-domain branch model and 5.7 percentage points higher than that of the single frequency-domain branch model, and the model's robustness and generalization ability in noisy environments are enhanced.
Figure CN122241439A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of underwater acoustic signal processing and underwater target recognition technology, specifically to a passive sonar target recognition method and system that integrates time-frequency dual-channel features.
Background Technology
[0002] Passive sonar technology detects, locates, and identifies underwater targets by receiving and analyzing the acoustic signals emitted by these targets, such as ships, submarines, and underwater organisms. Due to its advantages such as strong concealment and long range, it has extremely important application value in fields such as marine monitoring, maritime supervision, and marine resource exploration.
[0003] However, underwater acoustic signals are easily affected by environmental noise, ship noise, water flow noise, and multipath propagation effects during propagation, resulting in severe nonlinear distortion of the received target signal. Traditional identification methods based on hand-crafted features lose much of their robustness and generalization ability under highly dynamic, strong-noise conditions.
[0004] In recent years, deep learning technology has been widely used in the field of underwater acoustic target recognition due to its powerful automatic feature extraction and representation capabilities. Existing technical solutions can be mainly divided into single-input temporal models, single-input convolutional models, and simple time-frequency fusion models.
[0005] Single-input time-series models directly use a one-dimensional time-domain signal as input and employ recurrent neural networks (RNNs) or their variants, such as long short-term memory networks (LSTMs), to extract temporal dependencies from the signal. The technical steps include: preprocessing the underwater acoustic signal by denoising and normalization; inputting the preprocessed one-dimensional time-domain sequence into an LSTM network to extract temporal feature vectors; and finally outputting the target category through fully connected layers and a softmax classifier. However, single-input time-series models have limited feature representation capabilities, processing only time-domain sequences and failing to capture spatial features such as harmonic distribution and frequency band energy in the frequency domain, leading to low accuracy in identifying targets with significantly different spectral structures (such as cargo ships and submarines).
[0006] Single-input convolutional models first convert a one-dimensional time-domain signal into a two-dimensional time-frequency image using a Short-Time Fourier Transform (STFT), and then use a Convolutional Neural Network (CNN) to extract local structural features of the image. The technical steps include: performing an STFT on the time-domain signal to obtain a time-frequency matrix; feeding this matrix as input into a deep CNN model such as ResNet or Inception to extract spectral feature vectors; and finally, outputting the result from a classification layer. However, convolutional models only process the frequency domain matrix, ignoring temporal dynamics, transient impacts, and other temporal patterns in the time domain, resulting in poor robustness to targets with changing motion states (such as a patrol boat accelerating / decelerating). Therefore, single-input models cannot meet the requirements of complex scenarios.
[0007] To address the shortcomings of the two methods mentioned above, some existing technologies attempt to fuse time-domain and frequency-domain features in a simple time-frequency fusion model. The steps are as follows: feature vectors are extracted from the time-domain signal and time-frequency plot using independent time-series and convolutional models respectively; then, shallow strategies such as "feature concatenation" or fixed-weight "weighted summation" are used to fuse the two feature vectors into a new feature vector; finally, the fused feature is fed into a classifier. However, the simple time-frequency fusion model suffers from low cross-domain information utilization. Because it uses "simple concatenation / fixed weighting," the inherent correlation between time-frequency features is not quantified, resulting in redundant information in the fused feature and a lack of enhancement of key complementary information, leading to a fusion effect inferior to expectations.
[0008] In summary, existing technologies generally suffer from limitations in feature representation capabilities and low efficiency in cross-domain information fusion, making it difficult to achieve robust and accurate passive sonar target identification in complex marine environments. Therefore, how to fully explore and deeply fuse the temporal dynamic features and frequency domain structural features of underwater acoustic signals has become a pressing technical challenge that needs to be addressed in this field.
Summary of the Invention
[0009] This invention provides a passive sonar target recognition method and system based on time-frequency dual-channel feature fusion, aiming to solve at least one of the problems in the prior art, namely, the low accuracy of passive sonar target recognition and the poor effect of time-frequency feature fusion in complex marine environments.
[0010] To achieve this objective, the present invention provides a passive sonar target identification method, system, and model based on time-frequency dual-channel feature fusion.
[0011] First, this invention proposes a deep learning model, TFCA-Net (Time-Frequency Cross-Attention Network), for time-frequency dual-channel feature fusion. This model adopts an end-to-end architecture of "dual input-dual branch-cross-domain fusion-classification output". Specifically, the model includes: a data input layer for receiving a one-dimensional time-domain sequence and a two-dimensional time-frequency matrix; a time-domain feature extraction branch using a stacked bidirectional long short-term memory network (BiLSTM) and a self-attention mechanism to extract deep temporal dynamic features of the signal; a frequency-domain feature extraction branch using a deep convolutional network based on Inception-v3 to extract the spectral spatial structure features of the signal; a cross-domain attention fusion module for calculating the correlation matrix between time-frequency features, generating adaptive fusion weights, and performing weighted and residual fusion; and a classification output layer for outputting the final target class probability.
[0012] Secondly, this invention proposes a cross-domain attention fusion mechanism, the steps of which are as follows:
[0013] (1) Calculate the dot-product correlation matrix $M_{corr}$ between the time-domain feature vector $F_T$ and the frequency-domain feature vector $F_F$, to quantify the strength of their correlation in the feature space;
[0014] (2) Through a learnable single-layer neural network with the Sigmoid function as the activation, map the correlation matrix to a pair of adaptive weights $\alpha$ and $\beta$, satisfying $\beta = 1 - \alpha$, to achieve dynamic weighting;
[0015] (3) Perform weighted summation for initial fusion: $\alpha \cdot F_T + \beta \cdot F_F$;
[0016] (4) At the same time, concatenate $F_T$ and $F_F$ and extract features from the result using a fully connected layer and a non-linear activation function (ReLU);
[0017] (5) Finally, the weighted summation result is added to the above combined features through residual connection to obtain the final fused feature vector. This mechanism effectively explores the intrinsic correlation of time and frequency features, realizes deep complementarity, and overcomes the problems of information redundancy and low utilization caused by traditional simple splicing or fixed weighting.
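Read together, steps (1) to (5) can be written compactly as follows. This is a hedged reconstruction rather than the formulas as filed: the names $W_\alpha$ and $W_f$ for the learnable weights are illustrative assumptions, and any normalization of the dot product is left out because this summary does not state one.

```latex
\begin{aligned}
M_{corr} &= F_T^{\top} F_F \\
\alpha &= \mathrm{Sigmoid}\left(W_\alpha M_{corr} + b_\alpha\right), \qquad \beta = 1 - \alpha \\
F_{fusion} &= \alpha F_T + \beta F_F + \mathrm{ReLU}\left(W_f \,\mathrm{Concat}(F_T, F_F) + b_f\right)
\end{aligned}
```

Because $\beta = 1 - \alpha$, the weighted term is a convex combination of the two branch features, while the ReLU term contributes the nonlinear interaction that is added through the residual connection.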
[0018] Compared with the prior art, the beneficial effects of the present invention are:
[0019] 1. This invention fully utilizes complementary features through parallel extraction and deep collaboration of temporal dynamic information and frequency domain structural information. Experiments demonstrate that on the publicly available ShipsEar and DeepShip underwater acoustic target datasets, the model of this invention (TFCA-Net) significantly outperforms models using only the temporal or frequency domain branches in terms of recognition accuracy. Specifically, on the ShipsEar dataset, the accuracy is improved by 10.5 percentage points compared to the single temporal branch model and by 5.7 percentage points compared to the single frequency domain branch model; on the DeepShip dataset, the accuracy is improved by 8.9 percentage points compared to the single temporal branch model and by 5.5 percentage points compared to the single frequency domain branch model.
[0020] 2. Because it utilizes both time and frequency domain information, when the features of one domain are severely disturbed by environmental noise, the model can still obtain effective information from the other, relatively robust domain to make decisions, thereby improving the overall robustness and generalization ability of the model in complex marine environments.
Attached Figure Description
[0021] Figure 1 is a schematic diagram of the network structure of the deep learning model for time-frequency dual-channel feature fusion in this invention.
Detailed Implementation
[0022] The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more clearly understand how to practice the present invention. Although the present invention has been described in conjunction with its preferred embodiments, these embodiments are merely illustrative and not intended to limit the scope of the invention.
[0023] Example 1
[0024] 1. Overall structure of the model
[0025] This invention proposes a deep learning model for time-frequency dual-channel feature fusion, which adopts an end-to-end structure of "dual input - dual branch - cross-domain fusion - classification output". Figure 1 shows the overall structure of the network. The input layer receives two types of preprocessed data: a normalized one-dimensional time-domain sequence $X_T$ (where $L$ is the time series length) and a two-dimensional time-frequency matrix $X_F$ (where $H$ and $W$ are the matrix height and width, respectively). The time-domain branch and the frequency-domain branch operate in parallel, extracting temporal and spectral features respectively and generating a time-domain feature vector $F_T$ and a frequency-domain feature vector $F_F$ of the same dimension ($D$ is the unified feature dimension). The cross-domain attention fusion module deeply mines the intrinsic relationship between the features of the two branches, achieves information complementarity through adaptive weighting, and outputs a comprehensive feature vector. Finally, the classification output layer maps the features to category probabilities, achieving passive sonar target recognition.
[0026] 2. Data Preprocessing
[0027] The present invention performs the following preprocessing operations on the raw passive sonar signal:
[0028] 2.1 Standardization Processing: First, the original signal is segmented into frames, and each frame undergoes preliminary noise suppression using wavelet thresholding or spectral subtraction to obtain a relatively clean time-domain signal $x(t)$. The frame length and frame shift can be set according to the actual sampling rate. Min-Max normalization is then used to linearly map the amplitude of the entire time-domain signal sequence to the interval $[0, 1]$, calculated using the following formula:
$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
where $x$ is the original signal amplitude, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of the signal amplitude, respectively. This operation eliminates the influence of differences in radiation intensity between different targets, which helps stabilize model training.
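The Min-Max step can be illustrated with a short, dependency-free Python sketch. The function name `min_max_normalize` and the handling of a constant signal (returning zeros to avoid division by zero) are assumptions made for illustration; the text does not specify that edge case.

```python
def min_max_normalize(signal):
    """Linearly map amplitudes to [0, 1] using the sequence's own min and max."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # Assumption: a constant signal would divide by zero, so return zeros.
        return [0.0 for _ in signal]
    span = hi - lo
    return [(x - lo) / span for x in signal]


# Hypothetical four-sample signal used only to show the mapping.
x_t = min_max_normalize([-2.0, 0.0, 1.0, 2.0])
print(x_t)
```

The smallest amplitude maps to 0 and the largest to 1, so two targets with very different radiated levels end up on the same scale.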
[0029] 2.2 Frequency Domain Transformation: Perform a short-time Fourier transform (STFT) on the normalized time-domain signal to generate a two-dimensional time-frequency matrix. The matrix dimension is adapted to the feature extraction requirements of the subsequent convolution branch.
[0030] After the above preprocessing, a one-dimensional time-domain sequence $X_T$ and a two-dimensional time-frequency matrix $X_F$ are obtained, which serve as the inputs to the model. $L$ is the time series length, determined based on the total signal length and model design; $H$ is the frequency dimension; and $W$ is the number of time frames.
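As a rough illustration of how the time-frequency matrix could be produced, the sketch below computes a log-magnitude STFT with only the Python standard library. The Hann window, the tiny frame length and hop, the `eps` offset inside the logarithm, and the function name `stft_log_magnitude` are all assumptions chosen for readability; a practical implementation would use an FFT routine and parameters matched to the sampling rate.

```python
import cmath
import math


def stft_log_magnitude(signal, frame_len=8, hop=4, eps=1e-10):
    """Return an H x W matrix: H frequency bins (rows), W time frames (columns)."""
    # Hann window (an assumption) to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # one-sided spectrum of a real signal
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT for clarity; real code would call an FFT.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(math.log(abs(coeff) + eps))
        columns.append(spectrum)
    # Transpose so that rows are frequency bins and columns are time frames.
    return [list(row) for row in zip(*columns)]


# A tone at one quarter of the sampling rate falls in bin frame_len / 4 = 2.
tone = [math.sin(2 * math.pi * 0.25 * n) for n in range(32)]
x_f = stft_log_magnitude(tone)
print(len(x_f), len(x_f[0]))
```

Here the matrix has H = 5 rows and W = 7 columns, and every column peaks in the row that corresponds to the tone's frequency.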
[0031] 3. Temporal Feature Extraction Branch
[0032] The temporal feature extraction branch adopts a "2-layer stacked BiLSTM + self-attention mechanism" structure to accurately capture the dynamic patterns and key segments of the temporal signal:
[0033] 3.1 BiLSTM Layer: The forward LSTM reads the sequence from start to end and captures dependencies on past samples, while the backward LSTM reads it from end to start and captures dependencies on future samples. The outputs of the two directions are concatenated at each time step. Stacking two such layers further abstracts and strengthens the temporal features. For an input sequence of length $L$, the final output of this layer is a temporal feature matrix covering all $L$ time steps, in which each time step carries contextual information from both before and after that moment in the sequence.
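The bidirectional idea can be shown without a full LSTM. In the sketch below a running sum stands in for the LSTM cell (a deliberate simplification, not the patented layer), so that it is easy to see which samples each direction has seen at every time step. The helper names are hypothetical.

```python
def run_direction(sequence, cell, reverse=False):
    """Apply a recurrent cell along the sequence in one direction."""
    order = list(reversed(sequence)) if reverse else list(sequence)
    state, outputs = 0.0, []
    for x in order:
        state = cell(state, x)
        outputs.append(state)
    # Re-align backward outputs with the original time order.
    return outputs[::-1] if reverse else outputs


def bidirectional(sequence, cell):
    """Pair the forward and backward state at every time step."""
    fwd = run_direction(sequence, cell)
    bwd = run_direction(sequence, cell, reverse=True)
    return list(zip(fwd, bwd))


def running_sum(state, x):
    # Stand-in for an LSTM cell: the state simply accumulates the inputs.
    return state + x


features = bidirectional([1.0, 2.0, 3.0], running_sum)
print(features)
```

At the middle step the forward state has accumulated the first two samples and the backward state the last two, which is the "context before and after each moment" that the concatenated BiLSTM output provides.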
[0034] 3.2 Self-Attention Layer: The purpose of this layer is to allow the model to focus on the most critical time segments for the recognition task. It strengthens key time segments and suppresses redundant information through weight allocation. The calculation process is as follows:
[0035]
[0036] where the projection matrices are learnable and $d_k = 256$ is the query / key vector dimension. LayerNorm denotes layer normalization, which stabilizes training. The layer ultimately outputs the temporal feature vector $F_T$.
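A minimal sketch of one common form of this layer, scaled dot-product self-attention followed by layer normalization, is given below in plain Python. It is an assumption that the layer follows this standard form; the mean pooling over time used to obtain a single vector, the matrix names `w_q`, `w_k`, `w_v`, and the toy sizes are likewise illustrative and not taken from the text.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def self_attention(h, w_q, w_k, w_v):
    """h is an L x d matrix of BiLSTM outputs, one row per time step."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # attention over time steps
    attended = [layer_norm(row) for row in matmul(weights, v)]
    # Assumption: average over time to obtain one temporal feature vector.
    pooled = [sum(col) / len(col) for col in zip(*attended)]
    return weights, pooled


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, f_t = self_attention(h, identity, identity, identity)
print(f_t)
```

Each row of `weights` sums to 1 and says how strongly one time step attends to every other, which is how key segments are emphasized and redundant ones suppressed.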
[0037] 4. Frequency Domain Feature Extraction Branch
[0038] This branch extracts deep spectral spatial features from the two-dimensional time-frequency matrix $X_F$.
[0039] The frequency domain feature extraction branch uses a structure similar to the Inception-v3 network. Based on the actual input shape, the parameters of some layers are modified to ensure that the final output dimension is consistent with the time domain features. Specifically, based on the Inception-v3 architecture, the input is adapted to a single-channel two-dimensional time-frequency matrix, and global pooling layers and fully connected layers are added at the end of the network to finally output a frequency domain feature vector. This lays the foundation for cross-domain integration.
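The tail of the frequency branch, global pooling followed by a fully connected layer, can be sketched as below. Average pooling is assumed here (the text only says "global pooling"), and the feature maps, weights, and function names are made-up toy values used to show how an arbitrary stack of feature maps is reduced to a fixed-length vector.

```python
def global_average_pool(feature_maps):
    """Collapse each channel (an H x W map) to its mean value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]


def fully_connected(vec, weights, bias):
    """One dense layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


# Two hypothetical 2 x 2 feature maps from the last convolutional stage.
maps = [[[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 0.0], [2.0, 2.0]]]
pooled = global_average_pool(maps)
# Project the pooled vector to the unified feature dimension (3 here for brevity).
f_f = fully_connected(pooled,
                      [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                      [0.0, 0.0, 0.5])
print(pooled, f_f)
```

Because pooling removes the spatial dimensions, the output length depends only on the dense layer, which is what lets the branch match the dimension of the time-domain features.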
[0040] 5. Cross-domain attention fusion module
[0041] The cross-domain attention fusion module achieves deep complementarity and integration of the temporal feature vector $F_T$ and the frequency-domain feature vector $F_F$.
[0042] 5.1 Correlation Matrix Calculation: Quantifying the Correlation Strength of Time-Frequency Features:
[0043]
[0044] where $D = 256$ is the feature dimension. For features in vector form, $M_{corr}$ is a scalar that represents the overall correlation between the time-domain features and the frequency-domain features.
[0045] 5.2 Attention Weight Normalization: Adaptive weights are generated through Sigmoid activation.
[0046]
[0047] where $b_\alpha$ is the bias term that accompanies the weight parameter, and $\alpha$ and $\beta$ are the adaptive weights of the time-domain and frequency-domain features, respectively.
[0048] 5.3 Feature Fusion: Combining weighted summation and residual connection, the original feature information is preserved and cross-domain complementarity is enhanced. The fusion formula is as follows:
[0049]
[0050] where $b_f$ is the bias term that accompanies the fusion weight matrix, Concat is the feature concatenation operation, and ReLU is the rectified linear activation function. The final output is the comprehensive feature vector $F_{fusion}$.
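The three fusion sub-steps (correlation, adaptive weights, weighted sum plus residual projection) can be traced in a small Python sketch. This is an illustration under stated assumptions, not the filed formulas: `w_alpha` and `w_f` are hypothetical names for the learnable weights, and `scale` is a placeholder for whatever normalization by the feature dimension the original formula applies, defaulting to none.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_domain_fusion(f_t, f_f, w_alpha, b_alpha, w_f, b_f, scale=1.0):
    """Fuse two equal-length feature vectors with adaptive weights."""
    # 5.1: scalar dot-product correlation (scale is an assumed placeholder).
    m_corr = sum(a * b for a, b in zip(f_t, f_f)) / scale
    # 5.2: adaptive weights from a single sigmoid unit, with beta = 1 - alpha.
    alpha = sigmoid(w_alpha * m_corr + b_alpha)
    beta = 1.0 - alpha
    # 5.3a: weighted summation of the two branches.
    weighted = [alpha * a + beta * b for a, b in zip(f_t, f_f)]
    # 5.3b: ReLU(W_f * Concat(F_T, F_F) + b_f), added as a residual term.
    concat = list(f_t) + list(f_f)
    projected = [max(0.0, sum(w * c for w, c in zip(row, concat)) + b)
                 for row, b in zip(w_f, b_f)]
    fused = [x + p for x, p in zip(weighted, projected)]
    return fused, alpha, beta


# Orthogonal toy features: the correlation is 0, so alpha = sigmoid(0) = 0.5.
zero_w_f = [[0.0] * 4, [0.0] * 4]
fused, alpha, beta = cross_domain_fusion(
    [1.0, 0.0], [0.0, 1.0], w_alpha=1.0, b_alpha=0.0,
    w_f=zero_w_f, b_f=[0.0, 0.0])
print(fused, alpha, beta)
```

With the projection weights zeroed, the output reduces to the plain weighted sum; training the projection lets the residual term add cross-domain interactions that a fixed weighting cannot express.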
[0051] 6. Classification Output Layer
[0052] Class probability mapping is achieved through two fully connected layers and a softmax function:
[0053]
[0054] where the first fully connected layer and the output layer each have their own weight matrix and bias term, and the output layer produces $C$ values ($C$ is the total number of target categories). A Dropout layer (with a dropout rate of 0.3) suppresses overfitting, and the Softmax activation function outputs the class probability distribution. The category with the highest probability is the recognition result.
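A compact sketch of such a classification head is shown below. The ReLU between the two fully connected layers, the inverted-dropout scaling, and all names and toy weights are assumptions; only the two dense layers, the 0.3 dropout rate, and the softmax output come from the text.

```python
import math
import random


def dense(vec, weights, bias):
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def dropout(vec, rate, training, rng):
    if not training:
        return list(vec)  # dropout is disabled at inference time
    keep = 1.0 - rate
    # Inverted dropout: rescale kept units so the expected value is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(f_fusion, w1, b1, w2, b2, rate=0.3, training=False, rng=None):
    rng = rng or random.Random(0)
    hidden = [max(0.0, h) for h in dense(f_fusion, w1, b1)]  # assumed ReLU
    hidden = dropout(hidden, rate, training, rng)
    probs = softmax(dense(hidden, w2, b2))
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted


# Toy weights for a 2-dimensional fused feature and C = 3 classes.
probs, predicted = classify(
    [1.0, 2.0],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], b2=[0.0, 0.0, 0.0])
print(probs, predicted)
```

The predicted class is simply the index of the largest probability, matching the rule that the highest-probability category is the recognition result.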
[0055] 7. Model Training
[0056] The model training follows the standard neural network training process, with some key parameters as follows:
[0057] The training batch size was set to 32. The optimizer was AdamW, with an initial learning rate of 0.001 and a weight decay coefficient of $1\times10^{-5}$. The loss function was categorical cross-entropy, which is suitable for multi-class classification tasks.
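For readers unfamiliar with these settings, the sketch below shows the cross-entropy loss for one sample and a single AdamW update on one scalar parameter. The learning rate of 0.001 comes from the text and the weight decay is taken here as 1e-5; the values of `beta1`, `beta2`, and `eps` are commonly used defaults and are assumptions, as are the function names.

```python
import math


def cross_entropy(probs, label):
    """Categorical cross-entropy for one sample with an integer class label."""
    return -math.log(probs[label])


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-5):
    """One AdamW update of a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    theta = theta - lr * weight_decay * theta
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


loss = cross_entropy([0.25, 0.75], label=1)
theta, m, v = adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
print(loss, theta)
```

On the first step the bias-corrected moments make the update almost exactly one learning rate in size, with the weight decay contributing only a tiny additional shrinkage.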
[0058] The dataset was split into training, validation, and test sets in a 6:2:2 ratio. The total number of training epochs was set to 50, with an early stopping strategy (Patience=8) enabled. The validation set loss was used as the monitoring metric. If the validation set loss did not decrease for 8 consecutive training epochs, training was stopped, and the model parameters with the minimum validation set loss were restored as the final model.
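The early stopping rule described above can be sketched as a small function over a list of per-epoch validation losses. The function name and the made-up loss curve are illustrative; the epoch limit of 50 and the patience of 8 follow the text.

```python
def early_stopping_epoch(val_losses, max_epochs=50, patience=8):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss, best_epoch = float("inf"), None
    wait, last_epoch = 0, 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # New minimum: remember it (real code would also save the weights).
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, last_epoch


# Hypothetical curve: improves for 3 epochs, then plateaus at a worse value.
losses = [1.0, 0.8, 0.7] + [0.9] * 20
best_epoch, last_epoch = early_stopping_epoch(losses)
print(best_epoch, last_epoch)
```

Training halts eight epochs after the last improvement, and the parameters saved at the best epoch are the ones restored as the final model.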
[0059] After training, the passive sonar signal to be identified is input into the model after undergoing the same preprocessing to obtain the target category recognition result.
[0060] Example 2
[0061] In addition to the implementation methods described in Example 1, the present invention can also be implemented through other alternative solutions.
[0062] Example 1 uses BiLSTM, which can be replaced with other types of recurrent neural networks or temporal convolutional networks. For example, gated recurrent units (GRUs) can be used instead of LSTM, which have fewer parameters and faster training speeds in certain scenarios. Similarly, the number of stacked BiLSTM layers can be increased to 3 or 4 layers to extract higher-level temporal abstract features.
[0063] Example 1 uses the modified Inception-v3, but it can also be replaced with other structures, such as ResNet, DenseNet or the lighter MobileNet, to adapt to different computing resource constraints or real-time requirements. The key is that this branch must be able to effectively process two-dimensional time-frequency images and output feature vectors of fixed dimensions.
[0064] When calculating the correlation matrix, the dot product is an efficient choice for features in vector form. If $F_T$ and $F_F$ are in matrix form, fully connected layers can be used instead to model the interactions between time-frequency features in a more complex way. The activation function of the nonlinear projection branch can also be replaced by other variants, such as Leaky ReLU or ELU.
[0065] All of the above alternative solutions fall within the protection scope of this invention.
[0066] Explanation of relevant technical terms:
[0067] STFT (Short Time Fourier Transform): A signal processing method that divides a time-domain signal into segments with a fixed window length, performs a Fourier transform on each segment, and obtains a two-dimensional "time-frequency" matrix.
[0068] BiLSTM (Bidirectional Long Short-Term Memory Network): A variant of recurrent neural network (RNN) containing two LSTM branches, forward and backward, which can simultaneously capture the future and historical dependencies of time series data;
[0069] Inception-v3: A deep convolutional neural network proposed by Google, based on the multi-scale feature fusion design of the "Inception module", it is one of the classic basic models for computer vision tasks such as image recognition and object detection;
[0070] Dropout: A regularization method that randomly masks some neurons during training to suppress model overfitting. The dropout rate is the proportion of neurons that are masked (e.g., 0.3 means 30% of neurons are masked).
[0071] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.
Claims
1. A passive sonar target recognition method based on time-frequency dual-channel feature fusion, characterized in that it includes the following steps:
Step 1: Obtain the original passive sonar signal, preprocess it, and synchronously generate a one-dimensional time-domain sequence $X_T$ and a two-dimensional time-frequency matrix $X_F$;
Step 2: Construct a time-frequency dual-channel feature fusion deep learning model, the model including:
a time-domain feature extraction branch configured to receive the one-dimensional time-domain sequence $X_T$ and extract its time-domain feature vector $F_T$;
a frequency-domain feature extraction branch configured to receive the two-dimensional time-frequency matrix $X_F$ and extract its frequency-domain feature vector $F_F$;
a cross-domain attention fusion module configured to receive the time-domain feature vector $F_T$ and the frequency-domain feature vector $F_F$, calculate the correlation between the two, dynamically generate adaptive fusion weights based on the correlation, and output a fused feature vector $F_{fusion}$ by means of weighted summation combined with nonlinear residual mapping;
a classification output layer for receiving the fused feature vector $F_{fusion}$ and outputting a target class probability distribution;
Step 3: Train the model built in Step 2 using a labeled training dataset and optimize the model parameters;
Step 4: Preprocess the passive sonar signal to be identified as in Step 1, input it into the trained model, and perform forward calculation to obtain the target category identification result.
2. The passive sonar target recognition method based on time-frequency dual-channel feature fusion according to claim 1, characterized in that: the time-domain feature extraction branch comprises at least one stacked bidirectional long short-term memory network layer and one self-attention layer; the bidirectional long short-term memory network layer is used for capturing the forward and backward temporal dependencies of a signal and outputting a temporal feature matrix; and the self-attention layer is used for weighting the temporal feature matrix to highlight key temporal segments and finally output the time-domain feature vector $F_T$.
3. The passive sonar target recognition method based on time-frequency dual-channel feature fusion according to claim 1, characterized in that: the frequency-domain feature extraction branch is based on a deep convolutional neural network adapted to single-channel input; this deep convolutional neural network is an improved version of Inception-v3, in which the last layer is replaced by a global pooling layer and a fully connected layer to output the frequency-domain feature vector $F_F$ with a preset uniform dimension.
4. The passive sonar target recognition method based on time-frequency dual-channel feature fusion according to claim 1, characterized in that the cross-domain attention fusion module performs the following specific computational steps:
(a) calculate the dot-product correlation score $M_{corr}$ between the time-domain feature vector $F_T$ and the frequency-domain feature vector $F_F$, where $D$ is the feature dimension of $F_T$ and $F_F$;
(b) generate adaptive weights through Sigmoid activation, where $b_\alpha$ is the bias term that accompanies the weight parameter, and $\alpha$ and $\beta$ are the adaptive weights of the time-domain and frequency-domain features, respectively;
(c) obtain the fused feature vector $F_{fusion}$ by combining the weighted sum with a nonlinear projection of the concatenated features, where $b_f$ is the bias term that accompanies the fusion weight matrix, Concat is the feature concatenation operation, and ReLU is the rectified linear activation function.
5. The passive sonar target recognition method based on time-frequency dual-channel feature fusion according to claim 1, characterized in that the preprocessing in Step 1 includes: denoising the original passive sonar signal and applying Min-Max amplitude normalization to obtain the one-dimensional time-domain sequence $X_T$; and performing a short-time Fourier transform on the normalized time-domain signal, extracting the amplitude spectrum, and logarithmically transforming it to generate the two-dimensional time-frequency matrix $X_F$.
6. A passive sonar target intelligent identification system based on time-frequency dual-channel feature fusion, characterized in that it includes:
a data preprocessing module, used to preprocess the input raw passive sonar signal and output a one-dimensional time-domain sequence $X_T$ and a two-dimensional time-frequency matrix $X_F$;
a time-domain feature extraction module, whose input is connected to the data preprocessing module, used to extract the temporal features of the one-dimensional time-domain sequence $X_T$ and output the time-domain feature vector $F_T$;
a frequency-domain feature extraction module, whose input is connected to the data preprocessing module, used to extract the frequency-domain features of the two-dimensional time-frequency matrix $X_F$ and output the frequency-domain feature vector $F_F$;
a cross-domain attention fusion module, whose input is connected to the outputs of the time-domain feature extraction module and the frequency-domain feature extraction module, used to calculate the correlation between $F_T$ and $F_F$, fuse the two dynamically, and output the fused feature vector $F_{fusion}$;
a classification and recognition module, whose input is connected to the output of the cross-domain attention fusion module, used to classify the fused feature vector $F_{fusion}$ and output the final target category identification result.
7. The system according to claim 6, characterized in that the cross-domain attention fusion module further includes:
a correlation calculation unit, used to calculate the dot-product correlation score between $F_T$ and $F_F$;
a weight generation unit, used to map the correlation score through a Sigmoid activation function to generate the adaptive weights $\alpha$ and $\beta$;
a feature fusion unit, used to perform the fusion operation that yields the fused feature vector $F_{fusion}$.