A method and system for ship target identification based on underwater acoustic signals
By combining multi-view temporal signal representation and multi-scale convolutional blocks, multi-scale features of ship radiation signals are extracted, solving the problems of reliance on manual judgment and noise interference in underwater acoustic target recognition, and achieving high-accuracy recognition in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-09-15
- Publication Date
- 2026-06-30
AI Technical Summary
Existing underwater acoustic target identification methods rely on manual judgment, are affected by subjective factors and underwater environmental noise, and a single feature is insufficient to describe the unique characteristics of the ship's radiated signal, especially in complex environments where the identification accuracy is low.
Multi-view time-domain signal representation is converted into time-spectrum features. Multi-scale features of ship radiation signals are extracted through multi-scale convolutional blocks and adaptive channel attention blocks to suppress noise interference and improve recognition accuracy.
It improves the accuracy and robustness of ship target identification, and can effectively identify different types of ships in complex underwater environments.
Figure CN117454240B_ABST
Description
Technical Field
[0001] This invention relates to the field of underwater acoustic signal recognition technology, and in particular to a method and system for ship target recognition based on underwater acoustic signals.
Background Technology
[0002] Underwater acoustic target recognition (UATR) is an important research direction and technical problem in the field of underwater acoustic signal processing, especially in long-range target detection and decision information transmission. Sound waves are currently the only effective information carrier for long-distance underwater propagation, and ships moving in the ocean inevitably emit acoustic signals. Ship radiation signals typically reflect information such as hull structure, propeller structure, engine power, and ship condition, serving as crucial identification information for ship type assessment and recognition. However, traditional UATR methods have relied on the manual judgment of trained sonar operators. The accuracy of recognition is affected not only by subjective factors such as psychology and physiology but also by objective conditions such as harsh underwater environments and low signal strength. Therefore, exploring accurate and reliable automatic underwater acoustic target recognition methods has significant practical implications.
[0003] Currently, many solutions have been proposed to address the problem of automatic underwater acoustic target identification. However, current technologies typically still face two major challenges that urgently need to be addressed:
[0004] (1) Due to factors such as ship size, speed and propulsion system, different ship radiation signals usually have distinguishable characteristics. Existing UATR feature extraction methods usually focus on processing a specific type of feature. However, a single feature is usually insufficient to fully describe the unique characteristics of ship radiation signals, especially when solving UATR problems under complex environmental conditions.
[0005] (2) Due to the complexity of the marine environment and the interference of background noise, the existing UATR feature extraction methods are greatly affected by noise, which affects the accuracy of ship target identification.
Summary of the Invention
[0006] This invention aims to at least solve one of the technical problems existing in the prior art. To this end, this invention proposes a method, system, device, and storage medium for ship target identification based on underwater acoustic signals. This can improve the accuracy of ship identification.
[0007] According to a first aspect of the present invention, a ship target identification method based on underwater acoustic signals includes the following steps:
[0008] Extract the multi-view time-domain signal representation of the ship's underwater acoustic signal, and convert the multi-view time-domain signal representation into time-spectral features;
[0009] The temporal spectral features are input into the first multi-scale convolutional block in a series of concatenated multi-scale convolutional blocks to obtain the final features output by the last multi-scale convolutional block. The final features are then input into an average pooling layer and a fully connected layer to obtain the ship target recognition result output by the fully connected layer. Any two multi-scale convolutional blocks have the same structure, and each multi-scale convolutional block inputs the temporal spectral features or the features output by the previous concatenated multi-scale convolutional block in parallel into a set of residual convolutional blocks with different kernels to obtain a multi-scale feature map output by each residual convolutional block. The multi-scale feature maps of each residual convolutional block are merged to obtain a merged feature map. The merged feature map is then input into an adaptive channel attention block to obtain the intermediate features or the final features output by the adaptive channel attention block. The intermediate features serve as the features input to the next concatenated multi-scale convolutional block.
[0010] The ship target identification method according to embodiments of the present invention has at least the following beneficial effects:
[0011] First, a multi-view time-domain signal representation of the underwater acoustic signal is extracted, with multiple representations of the same data extracted, each focusing on different frequency components, thus enriching the feature representation. Second, time-spectral features of different frequency components are extracted and applied to subsequent model recognition. Then, in model recognition, multiple consecutive attention-based multi-scale convolutional blocks are deployed to learn key features related to the underwater acoustic signals of different ship categories. Specifically, a set of parallel residual convolutional blocks with different kernels is used to capture multi-scale features to further learn spectral spatial discrimination features at different scales. Then, an adaptive channel attention module is used to highlight the dominant part in the global features and suppress interference noise. Finally, the ship category in the underwater acoustic signal is accurately and efficiently output based on the model.
[0012] According to some embodiments of the present invention, any one of the residual convolutional blocks includes multiple concatenated two-dimensional convolutional layers, the concatenated two-dimensional convolutional layers are connected by 1x1 convolutional layers, any two of the two-dimensional convolutional layers have the same convolutional kernel, and each of the two-dimensional convolutional layers is followed by a batch normalization and rectified linear unit; any one of the residual convolutional blocks inputs the temporal spectral features or the features output by the previous concatenated multi-scale convolutional block into the first of the multiple concatenated two-dimensional convolutional layers to obtain the multi-scale feature map output by the last two-dimensional convolutional layer.
[0013] According to some embodiments of the present invention, the adaptive channel attention block uses average pooling to aggregate the global statistical features of each channel in the merged feature map, and generates an intermediate tensor by reshaping. The intermediate tensor is input into an MLP module with two fully connected layers to capture the nonlinear relationship between channels and generate attention weights for all channels. The attention weights are multiplied element-wise with the merged feature map to output the intermediate feature or the final feature.
[0014] According to some embodiments of the present invention, the multi-scale convolutional block includes three parallel residual convolutional blocks and the kernel sizes of the three parallel residual convolutional blocks are 3, 5 and 7.
[0015] According to some embodiments of the present invention, the extraction of multi-view time-domain signal representation of underwater acoustic signals and the conversion of the multi-view time-domain signal representation into time-spectral features include:
[0016] A multi-view time-domain signal representation of underwater acoustic signals is extracted using a set of bandpass filters; each of the bandpass filters has a non-overlapping frequency band.
[0017] The multi-view time-domain signal representation is converted into time-spectral features using a Mel filter.
[0018] According to some embodiments of the present invention, before the multi-view time-domain signal representation of the underwater acoustic signal is extracted using a set of bandpass filters, the ship target identification method based on the underwater acoustic signal further includes:
[0019] Gaussian noise is added to the underwater acoustic signal.
[0020] According to some embodiments of the present invention, before converting the multi-view time-domain signal representation into time-spectral features using a Mel filter, the ship target identification method based on underwater acoustic signals further includes:
[0021] The SpecAugment method is used to augment the time-spectral features.
[0022] According to a second aspect of the present invention, a ship target identification system based on underwater acoustic signals includes:
[0023] The feature extraction unit is used to extract the multi-view time-domain signal representation of the ship's underwater acoustic signal and convert the multi-view time-domain signal representation into time-spectral features;
[0024] The target recognition unit is configured to input the temporal spectral features into the first multi-scale convolutional block of a series of concatenated multi-scale convolutional blocks to obtain the final features output by the last multi-scale convolutional block, and input the final features into an average pooling layer and a fully connected layer to obtain the ship target recognition result output by the fully connected layer; wherein any two multi-scale convolutional blocks have the same structure, and any one of the multi-scale convolutional blocks inputs the temporal spectral features or the features output by the previous concatenated multi-scale convolutional block in parallel into a set of residual convolutional blocks with different kernels to obtain the multi-scale feature map output by each residual convolutional block, merge the multi-scale feature maps of each to obtain a merged feature map, and input the merged feature map into an adaptive channel attention block to obtain the intermediate features or the final features output by the adaptive channel attention block, wherein the intermediate features are used as the features input to the next concatenated multi-scale convolutional block.
[0025] An electronic device according to a third aspect of the present invention includes at least one control processor and a memory for communicatively connecting to the at least one control processor; the memory stores instructions executable by the at least one control processor, the instructions being executed by the at least one control processor to enable the at least one control processor to perform the above-described ship target identification method based on underwater acoustic signals.
[0026] According to a fourth aspect of the present invention, a computer-readable storage medium stores computer-executable instructions for causing a computer to perform the above-described ship target identification method based on underwater acoustic signals.
[0027] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the invention.
Attached Figure Description
[0028] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments taken in conjunction with the following drawings, in which:
[0029] Figure 1 is a flowchart of a ship target identification method based on underwater acoustic signals provided in an embodiment of the present invention;
[0030] Figure 2 is an architecture diagram of a ship target identification method based on underwater acoustic signals provided in another embodiment of the present invention;
[0031] Figure 3 is an architecture diagram of a multi-scale convolutional network based on an attention mechanism provided in an embodiment of the present invention;
[0032] Figure 4 is an architecture diagram of a residual convolutional block provided in an embodiment of the present invention;
[0033] Figure 5 is an architecture diagram of an adaptive channel attention block provided in an embodiment of the present invention;
[0034] Figure 6 is a schematic diagram illustrating the accuracy, precision, recall, and F1 score of different feature extraction methods trained on the ShipsEar dataset according to an embodiment of the present invention;
[0035] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0036] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0037] In the description of this invention, the use of terms such as "first," "second," etc., is for the purpose of distinguishing technical features only and should not be construed as indicating or implying relative importance, or implicitly indicating the number of technical features indicated, or implicitly indicating the order of the technical features indicated.
[0038] In the description of this invention, it should be understood that the orientation descriptions, such as up, down, etc., are based on the orientation or positional relationship shown in the drawings and are only for the convenience of describing this invention and simplifying the description, and are not intended to indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this invention.
[0039] In the description of this invention, it should be noted that, unless otherwise explicitly defined, terms such as "setting," "installation," and "connection" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this invention in conjunction with the specific content of the technical solution.
[0040] Background Introduction
[0041] Underwater acoustic target recognition is an important research direction and technical problem in the field of underwater acoustic signal processing, especially in long-range target detection and decision information transmission. Sound waves are currently the only effective information carrier for long-distance underwater propagation, and ships moving in the ocean inevitably emit acoustic signals. Ship radiation signals typically reflect information such as hull structure, propeller structure, engine power, and ship condition, serving as crucial identification information for ship type assessment and recognition. However, traditional UATR methods have relied on the manual judgment of trained sonar operators. Recognition accuracy is affected not only by subjective factors such as psychology and physiology but also by objective conditions such as harsh underwater environments and low signal strength.
[0042] Currently, many solutions have been proposed to address the problem of automatic underwater acoustic target identification. For example, the basic components of an underwater acoustic target identification system include information acquisition, preprocessing, feature extraction, and a classifier. However, current technologies typically face two remaining challenges that urgently need to be addressed:
[0043] (1) Due to factors such as ship size, speed, and propulsion system, different ship radiation signals typically possess distinguishable characteristics. These characteristics usually manifest as different spectral spatial energy distribution structures in the time-frequency domain, which can serve as important criteria for identifying ship targets. However, most existing technologies input the entire spectral features into the classifier model, which may lead to over-reliance on the high-energy low-frequency band during training, while ignoring the contribution of the lower-energy high-frequency band (the low-frequency components of underwater acoustic signals are significantly stronger than the high-frequency components). Furthermore, existing UATR feature extraction methods typically focus on processing specific types of features; however, a single feature is usually insufficient to comprehensively describe the unique characteristics of ship radiation signals, especially when addressing UATR problems under complex environmental conditions.
[0044] (2) Due to the complexity of the marine environment and the interference of background noise, the existing UATR feature extraction methods are greatly affected by noise, which affects the accuracy of ship target identification.
[0045] First Embodiment
[0046] Referring to Figure 1, one embodiment of this application provides a ship target identification method based on underwater acoustic signals. The method includes the following steps:
[0047] Step S101: Extract the multi-view time-domain signal representation of the ship's underwater acoustic signal and convert the multi-view time-domain signal representation into time-spectrum features.
[0048] Step S102: Input the time-spectrum features into the first multi-scale convolutional block of a series of concatenated multi-scale convolutional blocks to obtain the final features output by the last multi-scale convolutional block. Then input the final features into the average pooling layer and the fully connected layer to obtain the ship target recognition result output by the fully connected layer. Here, any two multi-scale convolutional blocks have the same structure. Each multi-scale convolutional block inputs the time-spectrum features, or the features output by the previous concatenated multi-scale convolutional block, in parallel into a set of residual convolutional blocks with different kernels to obtain the multi-scale feature map output by each residual convolutional block; merges these multi-scale feature maps to obtain the merged feature map; and inputs the merged feature map into the adaptive channel attention block to obtain the intermediate or final features output by the adaptive channel attention block. The intermediate features are used as the features input to the next concatenated multi-scale convolutional block.
[0049] In step S101, the multi-view time-domain signal representation of the ship's underwater acoustic signal is extracted, and the multi-view time-domain signal representation is converted into time-spectrum features, including:
[0050] Step S1011: Use a set of bandpass filters to extract the multi-view time-domain signal representation of the underwater acoustic signal; each bandpass filter has a non-overlapping frequency band.
[0051] Step S1012: Use a Mel filter to convert the multi-view time-domain signal representation into time-spectrum features.
[0052] In step S1011, inspired by the multi-view data representation technique using spectral filtering, this embodiment proposes a bandpass filter bank as a preprocessing step to extract a subset of frequency components from the original ship acoustic signal, which is then represented as multi-view data and used for subsequent feature extraction.
[0053] It is known that the audio signals radiated by different types of ships exhibit different energy patterns within specific narrow frequency bands, presenting different frequency energy distribution structures, which can often serve as important criteria for target identification. However, most existing methods directly process the entire spectral characteristics of underwater acoustic signals, and these methods are often trained to focus more on the low-frequency bands with higher energy, while neglecting the contribution of high-frequency bands to target identification. Therefore, in order to locate the discriminative information between different types of ships, a multi-view representation of broadband ship radiation signals is adopted in the feature acquisition stage, with each view representing a narrow-band local acoustic signal.
[0054] In one embodiment of this application, since the discernible information of ship radiation signals is often concentrated in the low-frequency part, the filter bank consists of multiple bandpass filters with different cutoff frequencies.
[0055] Existing UATR feature extraction methods typically focus on isolating a specific type of feature. However, a single feature is often insufficient to comprehensively describe the unique characteristics of ship-radiated signals, especially when addressing UATR problems under complex environmental conditions. Therefore, in step S102, this embodiment uses multi-scale convolutional blocks with different kernel sizes to extract valuable parts of the features from different angles, thereby enriching the expressive power of the features. To prepare the filtered multi-view representation as model input for target recognition, the multi-view representation is further converted into time-frequency features. For this time-frequency conversion, this embodiment employs a Mel filter bank, which simulates the auditory characteristics of the human ear and is widely used in fields such as speech recognition and underwater target recognition.
[0056] In step S102, the model includes multiple concatenated multi-scale convolutional blocks, average pooling layers, and fully connected layers. Each multi-scale convolutional block has the same network structure. The purpose of setting multiple multi-scale convolutional blocks is to further extract features at different scales and highlight the main components in the global features. The output of the previous multi-scale convolutional block serves as the input to the next multi-scale convolutional block. A batch normalization layer is set before the first multi-scale convolutional block to address the imbalanced data distribution in each view feature. After the output features of the last multi-scale convolutional block, the features are fed into an average pooling layer for pooling. Finally, the features are aggregated through fully connected layers for target category classification. This process includes two fully connected layers.
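The tail of the model described above (average pooling followed by two fully connected layers) can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the ReLU between the two layers, the softmax output, the layer sizes, and all names (`classification_head`, `w1`, `w2`, and so on) are assumptions, and the weights are random rather than trained.

```python
import numpy as np

def classification_head(features, w1, b1, w2, b2):
    """Map final features of shape (C, H, W) to class probabilities.

    Global average pooling collapses each channel to one value, then two
    fully connected layers produce the class scores. The ReLU and softmax
    are assumptions; the source only states that two FC layers are used.
    """
    pooled = features.mean(axis=(1, 2))           # average pooling -> (C,)
    hidden = np.maximum(w1 @ pooled + b1, 0.0)    # first FC layer + ReLU
    logits = w2 @ hidden + b2                     # second FC layer
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_channels, n_hidden, n_classes = 32, 16, 5       # hypothetical sizes
final_features = rng.normal(size=(n_channels, 8, 8))
probs = classification_head(
    final_features,
    rng.normal(size=(n_hidden, n_channels)), np.zeros(n_hidden),
    rng.normal(size=(n_classes, n_hidden)), np.zeros(n_classes),
)
predicted_class = int(np.argmax(probs))
```

The predicted ship category is simply the index of the largest probability.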
[0057] A multi-scale convolutional block consists of a set of residual convolutional blocks and an adaptive channel attention block. The residual convolutional blocks in the set have different kernels and are arranged side by side, so that features are input into each residual convolutional block in parallel. The features output by each residual convolutional block are merged and then input into the adaptive channel attention block. The features output by the adaptive channel attention block are also the output of that multi-scale convolutional block.
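The data flow of one multi-scale convolutional block (parallel branches, channel-wise merge, attention) can be sketched as below. To keep the example short, each branch is a fixed k x k averaging filter standing in for a learned residual convolutional block, and the attention is a simple per-channel sigmoid weight. Both stand-ins, and the names `box_branch`, `multi_scale_block`, and `mean_sigmoid_attention`, are illustrative assumptions rather than the patented layers.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_branch(kernel_size):
    """Stand-in branch: a k x k averaging filter applied to every channel."""
    def apply(x):
        pad = kernel_size // 2
        padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
        windows = sliding_window_view(padded, (kernel_size, kernel_size), axis=(1, 2))
        return windows.mean(axis=(-2, -1))             # same (C, H, W) shape as x
    return apply

def mean_sigmoid_attention(merged):
    """Stand-in attention: one weight per channel derived from its global mean."""
    weights = 1.0 / (1.0 + np.exp(-merged.mean(axis=(1, 2))))
    return merged * weights[:, None, None]

def multi_scale_block(x, branches, attention):
    """Run all branches on the same input, merge on the channel axis, reweight."""
    branch_maps = [branch(x) for branch in branches]   # parallel branches
    merged = np.concatenate(branch_maps, axis=0)       # merged feature map
    return attention(merged)

x = np.random.default_rng(0).normal(size=(4, 16, 16))  # (channels, mel bins, frames)
branches = [box_branch(k) for k in (3, 5, 7)]          # kernel sizes from the source
out = multi_scale_block(x, branches, mean_sigmoid_attention)
```

With three branches and four input channels, the merged map has twelve channels; chaining several such blocks feeds each output into the next block.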
[0058] The first multi-scale convolutional block performs the following steps: inputting the temporal spectral features into a set of residual convolutional blocks with different kernels to obtain the multi-scale feature map output by each residual convolutional block, merging the multi-scale feature maps of each to obtain the merged feature map, inputting the merged feature map into the adaptive channel attention block to obtain the intermediate features output by the adaptive channel attention block, and using the intermediate features as the input to the next multi-scale convolutional block.
[0059] The last multi-scale convolutional block performs the following steps: the intermediate features output by the preceding concatenated multi-scale convolutional block are input in parallel into a set of residual convolutional blocks with different kernels to obtain the multi-scale feature map output by each residual convolutional block; the multi-scale feature maps of each block are merged to obtain the merged feature map; the merged feature map is input into the adaptive channel attention block to obtain the final feature output by the adaptive channel attention block; and the final feature is input into the average pooling layer.
[0060] Large-scale convolutions have a wider receptive field and stronger semantic information representation capabilities, but lower feature map resolution. Conversely, small-scale convolutions have a smaller receptive field, excel at representing detailed information, and have higher feature map resolution. Therefore, the multi-scale convolutional block in this embodiment consists of a set of residual convolutional layers with different kernel sizes and an adaptive channel attention module. On one hand, the multi-scale residual convolutional layers extract features with different receptive fields, thereby learning the discriminative features of underwater acoustic signals from as many different perspectives as possible. On the other hand, the channel attention module locates the most valuable parts from multiple features and adaptively enhances their weights.
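The relationship between kernel size and receptive field can be made concrete with the standard receptive-field recurrence for stacked convolutions. The sketch below assumes two stride-1 layers per branch purely for illustration (the source only says "multiple" layers); `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers given (kernel_size, stride) pairs."""
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride                      # strides spread later layers further apart
    return field

# Two stride-1 layers per branch is an assumption made for this example.
fields = {k: receptive_field([(k, 1), (k, 1)]) for k in (3, 5, 7)}
print(fields)  # {3: 5, 5: 9, 7: 13}
```

Under these assumptions the 7x7 branch sees a 13x13 patch of the time-frequency map while the 3x3 branch sees only 5x5, which is the trade-off between context and detail described above.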
[0061] In some embodiments of this application, any residual convolutional block includes multiple concatenated two-dimensional convolutional layers. The concatenated two-dimensional convolutional layers are connected by 1x1 convolutional layers. The convolutional kernels of any two two-dimensional convolutional layers are the same, and each two-dimensional convolutional layer is followed by a batch normalization and rectified linear unit. Any residual convolutional block inputs the time-spectral features or the features output by the previous concatenated multi-scale convolutional block into the first two-dimensional convolutional layer among the multiple concatenated two-dimensional convolutional layers to obtain the multi-scale feature map output by the last two-dimensional convolutional layer.
[0062] The first 2D convolutional layer is used to increase the feature dimension while reducing the size of the feature map. Additionally, a 1x1 convolutional layer is added in the channel concatenation section to maintain the consistency of the residual input. Each 2D convolutional layer is followed by batch normalization (BN) and rectified linear units (ReLU). The channel concatenation operation is used to fuse the output features of each residual 2D convolutional module. The combined features are then fed into the subsequent channel attention module after passing through a 1×1 convolutional layer.
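A single residual convolutional block of this kind can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: two convolutional layers, a stride of 2 in the first layer to shrink the feature map, a per-sample normalisation standing in for batch normalization, and the shortcut added after the last ReLU. The function names and all sizes are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, kernels, stride=1):
    """Zero-padded 2D cross-correlation. x: (C_in, H, W); kernels: (C_out, C_in, k, k)."""
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))   # (C_in, H, W, k, k)
    out = np.einsum("chwij,ocij->ohw", windows, kernels)
    return out[:, ::stride, ::stride]

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation over spatial axes (single-sample stand-in for BN)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, k1, k2, k_shortcut, stride=2):
    out = relu(batch_norm(conv2d(x, k1, stride=stride)))   # more channels, smaller map
    out = relu(batch_norm(conv2d(out, k2)))                # same kernel size as layer 1
    shortcut = conv2d(x, k_shortcut, stride=stride)        # 1x1 conv matches the shape
    return out + shortcut

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16, 16))
out = residual_block(
    x,
    rng.normal(size=(8, 2, 5, 5)),    # first 5x5 layer: 2 -> 8 channels
    rng.normal(size=(8, 8, 5, 5)),    # second 5x5 layer: 8 -> 8 channels
    rng.normal(size=(8, 2, 1, 1)),    # 1x1 shortcut: 2 -> 8 channels
)
```

The 1x1 shortcut is what allows the addition: without it, the 2-channel 16x16 input could not be added to the 8-channel 8x8 output of the main path.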
[0063] In some embodiments of this application, the adaptive channel attention block uses average pooling to aggregate the global statistical features of each channel in the merged feature map, and generates an intermediate tensor by reshaping. The intermediate tensor is then input into an MLP module with two fully connected layers to capture the nonlinear relationships between channels and generate attention weights for all channels. The attention weights are then multiplied element-wise with the merged feature map to output intermediate or final features.
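The attention computation described above (average pooling, a two-layer MLP, element-wise reweighting) can be sketched as follows. The ReLU and sigmoid activations and the channel reduction ratio of 4 are assumptions borrowed from common squeeze-and-excitation designs; the source does not specify them. The name `channel_attention` and the random weights are illustrative only.

```python
import numpy as np

def channel_attention(feature_map, w1, b1, w2, b2):
    """Reweight the channels of a (C, H, W) feature map.

    Returns the reweighted map and the per-channel attention weights.
    """
    squeezed = feature_map.mean(axis=(1, 2))               # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)           # first FC layer + ReLU (assumed)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))    # second FC layer + sigmoid (assumed)
    return feature_map * weights[:, None, None], weights

rng = np.random.default_rng(0)
n_channels, reduction = 12, 4                              # hypothetical sizes
merged = rng.normal(size=(n_channels, 16, 16))
reweighted, weights = channel_attention(
    merged,
    rng.normal(size=(n_channels // reduction, n_channels)), np.zeros(n_channels // reduction),
    rng.normal(size=(n_channels, n_channels // reduction)), np.zeros(n_channels),
)
```

Channels whose weight is close to 1 pass through almost unchanged, while channels with a weight near 0 are suppressed.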
[0064] To capture more discriminative information from global features, we draw on the successful application of channel attention in computer vision. Step S102 employs an adaptive channel attention mechanism, which adaptively assigns different weights to all channel features by capturing the relationships between features from different channels. The core idea of the channel attention mechanism is to enhance important channels and suppress unimportant channels, thereby improving the discriminative power of the features. In this way, more weight is allocated to the dominant features in recognition, which helps to suppress interfering noise and improve the recognition accuracy and robustness of our model.
[0065] This embodiment first extracts a multi-view time-domain signal representation of the underwater acoustic signal, extracting multiple representations of the same data, each focusing on different frequency components, thus enriching the feature representation. Secondly, it extracts time-spectral features of different frequency components and applies them to subsequent model recognition. Then, in model recognition, it employs multiple consecutive attention-based multi-scale convolutional blocks to learn key features related to the underwater acoustic signals of different ship categories. Specifically, it uses a set of parallel residual convolutional blocks with different kernels to capture multi-scale features, further learning spectral spatial discrimination features at different scales. Then, an adaptive channel attention module is used to highlight the dominant components in the global features and suppress interference noise. Finally, based on the model, it accurately and efficiently outputs the ship category in the underwater acoustic signal.
[0066] In some embodiments of this application, step S101 further includes data enhancement, specifically including the following enhancement steps:
[0067] Before step S1011, which uses a set of bandpass filters to extract the multi-view time-domain signal representation of the underwater acoustic signal, the method further includes adding Gaussian noise to the underwater acoustic signal.
[0068] Before step S1012, which uses a Mel filter to convert the multi-view time-domain signal representation into time-spectrum features, the method further includes using the SpecAugment method to perform data augmentation on the time-spectrum features.
[0069] Underwater target recognition is extremely challenging due to the complexity of the underwater environment, the difficulty in acquiring sufficient underwater acoustic signals, and the inevitable interference from background noise and reverberation. Data augmentation is a crucial technique in recognition tasks, helping to increase the quantity and diversity of training data and improve the model's generalization ability and recognition performance. In the above embodiments, different data augmentation strategies were employed for time-domain signals and time-frequency domain features. Specifically, Gaussian noise was randomly added to the original waveform signal to generate data with different signal-to-noise ratios (SNR), simulating various noise intensities commonly found in underwater environments. For time-spectrum features, the SpecAugment method (a simple automatic speech recognition data augmentation method) was used to augment the Mel-spectrum features, which has proven effective in audio recognition tasks. Time-masking and frequency-masking are two of the most representative strategies in SpecAugment.
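The two SpecAugment masking strategies mentioned above can be sketched as follows: one random band of Mel bins and one random span of frames are overwritten with a constant. The mask widths, the mask value of 0, and the use of a single mask of each kind are assumptions for illustration; `spec_augment` is a hypothetical helper name.

```python
import numpy as np

def spec_augment(mel, rng, max_freq_mask=8, max_time_mask=10, mask_value=0.0):
    """Apply one frequency mask and one time mask to a (n_mels, n_frames) array."""
    augmented = mel.copy()                           # leave the input untouched
    n_mels, n_frames = augmented.shape

    width_f = rng.integers(0, max_freq_mask + 1)     # mask width in Mel bins
    start_f = rng.integers(0, n_mels - width_f + 1)
    augmented[start_f:start_f + width_f, :] = mask_value    # frequency masking

    width_t = rng.integers(0, max_time_mask + 1)     # mask width in frames
    start_t = rng.integers(0, n_frames - width_t + 1)
    augmented[:, start_t:start_t + width_t] = mask_value    # time masking

    return augmented

rng = np.random.default_rng(0)
mel = np.ones((64, 61))                              # dummy Mel spectrogram
augmented = spec_augment(mel, rng)
```

Because the masks are drawn anew on every call, the same recording yields a different training example each epoch.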
[0070] Second Embodiment
[0071] Referring to Figures 2 to 5, one embodiment of this application provides a ship target identification method based on underwater acoustic signals, the method comprising:
[0072] Step S201: Feature extraction stage and feature enhancement stage;
[0073] Random Gaussian noise was added to the original underwater acoustic signal of the ship to generate data with different signal-to-noise ratios (SNR) to simulate various noise intensities common in underwater environments.
[0074] By performing spectral filtering on the original ship acoustic signal (composed of the ship's radiated acoustic signal, noise, etc.), a multi-view representation of the ship's radiated acoustic signal (including mechanical vibration sound, propeller sound, and hydrodynamic sound) is obtained. Let the original underwater acoustic signal be x(m) ∈ R^(1×M) (m ∈ 1, ..., M), with corresponding category label y ∈ {0, 1, ..., N_c - 1}, where M is the number of time points in the signal and N_c is the total number of different categories.
[0075] The multi-view representation features are obtained by applying a bandpass filter bank to the original ship radiation signal x, that is, by spectral filtering. This filter bank contains N_b time-domain bandpass filters. Specifically, the spectral filtering process can be described as follows:
[0076]
[0077] where the operator in the expression above denotes a bandpass filtering operation. After the filtering operation, each view representation focuses on a specific frequency band, providing a finer-grained basis for identifying different types of ships.
[0078] A filter bank of N_b = 5 bandpass filters was used, with non-overlapping frequency bands whose bandwidths increase across the range from 10 to 8000 Hz (10-500, 500-1000, 1000-2000, 2000-4000, and 4000-8000 Hz). The filters decompose each raw acoustic signal into 5 different representations.
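The decomposition into five band-limited views can be sketched as follows. This application uses time-domain bandpass filters whose design is not specified here, so the sketch swaps in an idealized brick-wall filter implemented with a naive DFT; the function names and the half-open band convention (low <= f < high) are assumptions chosen so that adjacent bands do not overlap.

```python
import cmath
import math

# Band edges in Hz, taken from the five bands listed above.
BANDS_HZ = [(10, 500), (500, 1000), (1000, 2000), (2000, 4000), (4000, 8000)]


def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for short signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def multi_view(x, sample_rate, bands=BANDS_HZ):
    """Split x into one band-limited view per band (idealized brick-wall filter)."""
    n = len(x)
    spectrum = dft(x)
    views = []
    for low, high in bands:
        kept = []
        for k in range(n):
            # Bins above n/2 represent negative frequencies; fold them back.
            freq = min(k, n - k) * sample_rate / n
            kept.append(spectrum[k] if low <= freq < high else 0j)
        views.append(idft_real(kept))
    return views


sample_rate, n = 16000, 160  # 100 Hz per DFT bin
x = [math.sin(2 * math.pi * 300 * t / sample_rate)
     + math.sin(2 * math.pi * 3000 * t / sample_rate) for t in range(n)]
views = multi_view(x, sample_rate)
energies = [sum(v * v for v in view) for view in views]
```

In this toy input, the 300 Hz tone ends up in the first view and the 3000 Hz tone in the fourth, while the remaining views carry essentially no energy. A practical implementation would more likely use a designed FIR or IIR filter per band.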
[0079] The generated view is processed using a Mel filter bank to extract Mel spectrogram features of interest for different frequency components.
[0080] The SpecAugment method is used to augment the features of the Mel spectrogram.
[0081] The specific relationship between the Mel frequency f_mel and the actual frequency f can be expressed as follows:
[0082]
[0083] The Mel frequency scale f_mel corresponds to a logarithmic distribution of the actual frequency f: as the frequency increases, the filter bandwidth gradually widens. The Mel-spectrum features corresponding to the multi-view data representation can be described as X ∈ R^(C×F×T), where C is the number of channels (set to N_b), F is the number of Mel filters, and T is the number of time frames.
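As an illustration of the Mel scale, the sketch below uses the widely used conversion f_mel = 2595 · log10(1 + f / 700). The equation of this application is not reproduced in the text, so this particular form, the function names, and the 0-8000 Hz range are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale (common 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies in Hz of filters spaced uniformly on the Mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


# 80 filters, matching the number of Mel filters used in the experiments.
centers = mel_center_frequencies(80, 0.0, 8000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Because the centres are evenly spaced in Mel but not in Hz, the entries of gaps grow with frequency, which is the widening of filter bandwidth described above.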
[0084] Step S202, Feature Recognition Stage;
[0085] First, a multi-scale convolutional network based on an attention mechanism is set up. The network structure includes, in sequence, a batch normalization layer (Batch Norm in Figure 3), multiple multi-scale convolutional blocks (four), a pooling layer (Avgpooling in Figure 3), and two consecutive fully connected layers (Linear Layer in Figure 3). The multi-scale convolutional blocks serve as the backbone network: they take the Mel-spectrum features X ∈ R^(C×F×T) as input, extract features at different scales, and highlight the main components in the global features. Finally, the two consecutive fully connected layers aggregate the spectral features and produce the final category output. As shown in Figure 2, AMSC Block denotes a multi-scale convolutional block, and multiple multi-scale convolutional blocks are connected in series.
[0086] The multi-scale convolutional block consists of a set of residual convolutional blocks with different kernel sizes and an adaptive channel attention block. On one hand, the residual convolutional blocks extract features with different receptive fields, thereby learning the discriminative features of underwater acoustic signals from as many different perspectives as possible. On the other hand, the adaptive channel attention block locates the most valuable parts from multiple features and adaptively enhances their weights. In the multi-scale convolutional network of the embodiment, the multi-scale convolutional block is repeated four times, and each block consists of three parallel residual convolutional blocks with kernel sizes of 3, 5, and 7. All residual convolutional blocks within each multi-scale convolutional block have the same number of output channels, and the number of channels corresponding to the four different multi-scale convolutional blocks are 16, 32, 64, and 128, respectively.
[0087] As shown in Figure 4, the residual convolutional block consists of three 2D convolutional layers with the same kernel size. The first convolutional layer increases the feature dimension while reducing the feature map size; therefore, its stride is (2,2). The strides of the next two convolutional layers are (1,1). Additionally, a 1×1 convolutional layer is added in the channel concatenation part to maintain the consistency of the residual input. All convolutional layers in the residual convolutional block are followed by batch normalization (BN) and a rectified linear unit (ReLU). The channel concatenation operation fuses the output features of each convolutional layer. After passing through the 1×1 convolutional layer, the combined features are fed into the subsequent channel attention module (Channel Attention in Figure 3).
[0088] Consider an input feature D of shape B×C′×F×T, where B is the batch size and C′=3C represents the number of channels in the combined feature. First, average pooling is applied to aggregate the global statistical features of each channel, and reshaping is used to generate an intermediate tensor of shape B×C′. Then, the aggregated features are input into an MLP module with two fully connected layers to capture non-linear inter-channel relationships and generate attention weights ω ∈ R^(B×C′) for all channels. Finally, the attention weights are multiplied element-wise with the original input features D to recalibrate the corresponding channels. In this way, greater weights are assigned to the dominant features in the recognition process, which helps to suppress interfering noise and improve the model's recognition accuracy and robustness.
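The effect of the stride-(2,2) layers on feature map size can be traced with simple arithmetic. The sketch assumes a padding of kernel // 2 in every convolution (not stated in this application) so that the three parallel branches produce maps of the same size and can be concatenated; the function names are hypothetical, and the reported channel count is that of the concatenated feature (three branches times the per-branch output channels).

```python
def conv_out(size, kernel, stride):
    """Output length of a convolution with assumed padding of kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(freq_bins, time_frames,
                 block_channels=(16, 32, 64, 128), kernels=(3, 5, 7)):
    """Return (concatenated channels, F, T) after each multi-scale block."""
    shapes = []
    f, t = freq_bins, time_frames
    for channels in block_channels:
        # Only the first layer of each branch has stride 2; the two stride-1
        # layers that follow keep the size unchanged under this padding.
        branch_sizes = {(conv_out(f, k, 2), conv_out(t, k, 2)) for k in kernels}
        if len(branch_sizes) != 1:
            raise ValueError("branches disagree on output size")
        f, t = branch_sizes.pop()
        shapes.append((len(kernels) * channels, f, t))
    return shapes


shapes = trace_shapes(80, 192)
```

For an 80×192 Mel-spectrum input, each block halves both the frequency and the time dimension under these assumptions, so the four blocks reduce the map from 80×192 to 5×12.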
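The channel attention computation described above can be sketched for a single sample (the batch dimension is omitted). The ReLU between the two fully connected layers, the sigmoid that maps the outputs to weights, the absence of bias terms, and the hidden width are all assumptions; this application only states that an MLP with two fully connected layers produces the weights.

```python
import math
import random


def channel_attention(features, w1, w2):
    """Recalibrate channels of `features` (shape C' x F x T as nested lists).

    w1 has shape H x C' and w2 has shape C' x H (hypothetical MLP weights).
    Returns the rescaled features and the per-channel attention weights.
    """
    # Global average pooling: one statistic per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in features]
    # First fully connected layer followed by ReLU (assumed activation).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second fully connected layer followed by sigmoid (assumed activation).
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Element-wise rescaling of every value in each channel.
    scaled = [[[v * a for v in row] for row in ch]
              for ch, a in zip(features, weights)]
    return scaled, weights


rng = random.Random(0)
n_channels, n_hidden, n_freq, n_time = 6, 2, 2, 3
features = [[[rng.uniform(-1, 1) for _ in range(n_time)]
             for _ in range(n_freq)] for _ in range(n_channels)]
w1 = [[rng.uniform(-1, 1) for _ in range(n_channels)] for _ in range(n_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_channels)]
scaled, weights = channel_attention(features, w1, w2)
```

Each channel is multiplied by a single scalar between 0 and 1, so channels that the MLP scores highly are preserved while the others are attenuated.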
[0089] This embodiment proposes two multi-dimensional feature extraction strategies to enrich feature representation. Specifically, a bandpass filter bank is applied to the original underwater acoustic signal to extract multi-view data representations of different frequency components of interest. Then, a multi-scale convolution strategy is applied to the extracted spectral features to further learn spectral spatial discrimination features at different scales. On the other hand, a channel attention mechanism is added after the multi-scale convolutional layer to select and highlight the dominant part from the global features and suppress interference noise. Finally, an attention-based multi-scale convolutional network is trained to learn key features related to the identification of different categories of ship radiation signals.
[0090] Third Embodiment
[0091] Referring to Figures 6 and 7, one embodiment of this application provides a set of experimental schemes based on the second embodiment described above, as follows:
[0092] 1. Experimental data setup;
[0093] The experiments were evaluated on two publicly available underwater acoustic datasets: ShipsEar and DeepShip.
[0094] ShipsEar: This dataset contains ship radiated acoustic data and pure background noise data, including 90 audio tracks from 11 types of ships and various natural environmental noises. To address the data imbalance between different ship types, all records were reclassified into five distinct categories based on ship size.
[0095] DeepShip: This dataset contains 47 hours and 4 minutes of real underwater target recordings from 256 different vessels across four categories. The data recordings were sampled at a rate of 32 kHz.
[0096] For each dataset, all acoustic signals were resampled to 16kHz and trimmed to 3-second segments. The segmented audio segments were then randomly divided into training, validation, and test sets, comprising 70%, 15%, and 15% of the data, respectively.
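The segmentation and splitting described above can be sketched as follows, assuming non-overlapping 3-second segments, discarding any trailing remainder, and shuffling at the segment level with a fixed seed. These choices and the function names are assumptions; this application does not state whether segments overlap or how the remainder is handled.

```python
import random


def segment(signal, sample_rate=16000, seconds=3):
    """Cut a signal into non-overlapping fixed-length segments."""
    seg_len = sample_rate * seconds
    return [signal[i:i + seg_len]
            for i in range(0, len(signal) - seg_len + 1, seg_len)]


def split(items, percentages=(70, 15, 15), seed=0):
    """Randomly split items into training, validation, and test lists."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percentages[0] // 100
    n_val = len(shuffled) * percentages[1] // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# A 10-second recording at 16 kHz yields three full 3-second segments.
segments = segment([0.0] * 160000)
train, val, test = split(list(range(100)))
```

At 16 kHz, each 3-second segment contains 48,000 samples. Integer percentages are used in split to avoid floating-point rounding when computing the subset sizes.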
[0097] 2. Experimental environment setup;
[0098] For the input features, Mel-spectrum features were extracted from the time-domain audio using a window length of 512 samples (32 ms) and a frame shift of 256 samples (16 ms), with a total of 80 Mel filters. The attention-based multi-scale convolutional network model (hereinafter referred to as this network model) was implemented in PyTorch and optimized with the Adam optimizer. All models were trained for 100 epochs on an NVIDIA GeForce RTX 3090 GPU and a Core i9-12900KF CPU, with an initial learning rate of 2e-4 and a batch size of 32. The CosineAnnealingLR schedule was used to adjust the learning rate during training, with the minimum learning rate set to 1e-7. For data augmentation, the time mask was set to 15 and the frequency mask to 25. The model was trained by minimizing the cross-entropy loss. The loss function is shown below:
[0099]
[0100] where N and N_c are the number of samples and the number of categories, respectively, with i = 1, ..., N and c = 0, ..., N_c - 1. y_ic ∈ {0, 1} is an indicator function that equals 1 if the true class of sample i is c and 0 otherwise. The variable p_ic is the predicted probability that sample i belongs to category c.
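Using the symbols defined above, the standard cross-entropy loss can be written as L = -(1/N) · Σ_i Σ_c y_ic · log(p_ic). The equation image of this application is not reproduced in the text, so the 1/N normalization is an assumption. The sketch below computes this loss and, separately, the closed form of a cosine annealing learning-rate schedule with the stated initial and minimum learning rates; the period t_max = 100 (equal to the number of epochs) and the function names are also assumptions.

```python
import math


def cross_entropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class.

    probabilities[i][c] is p_ic; labels[i] is the true class of sample i.
    Because y_ic is 1 only for the true class, the inner sum over c reduces
    to a single term per sample.
    """
    total = 0.0
    for p_i, true_class in zip(probabilities, labels):
        total -= math.log(p_i[true_class])
    return total / len(labels)


def cosine_annealing_lr(epoch, base_lr=2e-4, min_lr=1e-7, t_max=100):
    """Learning rate at a given epoch under cosine annealing (closed form)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / t_max))


# A model that is maximally uncertain over 5 classes has loss log(5).
uniform_loss = cross_entropy([[0.2] * 5] * 4, [0, 1, 2, 3])
schedule = [cosine_annealing_lr(e) for e in range(101)]
```

Under these assumptions the learning rate starts at 2e-4, decreases monotonically, and reaches 1e-7 at epoch 100.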
[0101] 3. Experimental evaluation indicators;
[0102] To evaluate the classification performance of the proposed solutions, accuracy, precision, recall, and F1 score are reported as objective metrics. Higher values correspond to better performance. Classification accuracy can be defined as follows:
[0103]
[0104] For each category c (c ∈ {0, ..., N_c - 1}), precision, recall, and F1 score are calculated using the following equations:
[0105]
[0106]
[0107]
[0108] Here, n_ij (where i = 0, ..., N_c - 1 and j = 0, ..., N_c - 1) is the number of samples of category i predicted as category j. To obtain the overall precision, recall, and F1 score, these metrics are averaged over all N_c classes.
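The metrics can be computed directly from the confusion counts n_ij. The sketch below uses the standard definitions (accuracy as the fraction of correctly classified samples, per-class precision and recall from column and row sums, F1 as their harmonic mean) followed by macro-averaging over classes. The equations of this application are not reproduced in the text, so agreement with them is assumed; the function name and the toy confusion matrix are hypothetical.

```python
def classification_metrics(n):
    """Compute metrics from a confusion matrix.

    n[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(n)
    total = sum(sum(row) for row in n)
    accuracy = sum(n[c][c] for c in range(k)) / total
    precision, recall, f1 = [], [], []
    for c in range(k):
        predicted_c = sum(n[i][c] for i in range(k))  # column sum
        actual_c = sum(n[c])                          # row sum
        p = n[c][c] / predicted_c if predicted_c else 0.0
        r = n[c][c] / actual_c if actual_c else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_precision": sum(precision) / k,
        "macro_recall": sum(recall) / k,
        "macro_f1": sum(f1) / k,
    }


# Toy two-class example: 10 samples per class.
metrics = classification_metrics([[8, 2], [1, 9]])
```

Classes with no predicted or no actual samples are assigned a score of 0 here to avoid division by zero; other conventions are possible.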
[0109] 4. Experimental results;
[0110] (1) Multi-view representation;
[0111] This section demonstrates through experiments the effectiveness of the multi-view representation strategy proposed in this application for enhancing the performance of the UATR system. In the feature extraction stage, the original ship radiation signal is first decomposed into five different data representations by a set of bandpass filters with non-overlapping frequency bands. These multi-view representations cover the frequency ranges of 0-500Hz, 500-1000Hz, 1000-2000Hz, 2000-4000Hz, and 4000-8000Hz. To investigate the contribution of different view features to identifying various ship categories and to demonstrate the superiority of the multi-view representation strategy, sub-band features, full-band features, and multi-view features were evaluated separately. Experimental results on the ShipsEar dataset are shown in the table below.
[0112]
[0113]
[0114] Table 1
[0115] For single-view representations, low-frequency features (0-500Hz) not only outperformed other features in terms of overall accuracy (92.66%), but also achieved the best performance in identifying all categories of ships. This result indicates that identifiable information from ship radiated signals tends to be concentrated in the low-frequency range. Furthermore, by comparing experimental results from different view representations, it can be observed that different frequency band features exhibit specific advantages in identifying different types of ships. For example, the highest frequency features (4000-8000Hz) are more effective in identifying category D, while features with a frequency distribution between 2000-4000Hz are more accurate in identifying category C, and features in the 1000-2000Hz range have a relative advantage in identifying background noise data (category E). Moreover, although full-band features have a higher overall accuracy (95.87%) than sub-band features, the lowest frequency features (0-500Hz) are more favorable for identifying category D. This suggests that while the model can obtain global information from broadband features, it may overlook certain fine-grained components within the features. To address this limitation, this application integrates features from different frequency bands. Experimental results show that the multi-view representation strategy produces competitive recognition performance and achieves the best accuracy results across all categories.
[0116] (2) Evaluation of feature extraction methods;
[0117] The filtered multi-view representation needs to be further converted into time-spectrum features before being input into this network. To verify the impact of different time-spectrum features on model performance, a set of experiments was conducted comparing Mel spectrum, STFT, and MFCC features. For the ShipsEar dataset, the feature dimensions of Mel spectrum, STFT, and MFCC are (80, 192), (257, 192), and (13, 192), respectively, where the first value is the resolution of the frequency dimension and the second value is the number of time frames. The accuracy, precision, recall, and F1 score of the different feature extraction methods trained on the ShipsEar dataset are shown in Figure 6.
[0118] As can be seen from Figure 6, when the Mel spectrum or the STFT spectrum is used as the input feature of this network, the model's results on all metrics are significantly better than those obtained with MFCC features. Specifically, the Mel spectrum features with this network achieved the best results on the ShipsEar dataset, with an accuracy of 98.17%, precision of 98.5%, recall of 97.78%, and F1 score of 98.14%.
[0119] (3) Ablation test;
[0120] To evaluate the effectiveness of various components in the proposed network, two sets of comparative experiments were conducted on the ShipsEar and DeepShip datasets to assess the performance of different techniques. The proposed network was used as the baseline model. Subsequently, experiments were conducted by removing data augmentation mechanisms and adaptive channel attention blocks from the proposed network to determine the contribution of these methods to model performance improvements. The ablation study results on the ShipsEar and DeepShip datasets are shown in the table below.
[0121]
[0122] Table 2
[0123] Without any data augmentation, the network experienced a 0.62% decrease in accuracy, a 1.34% decrease in precision, a 0.61% decrease in recall, and a 0.98% decrease in F1 score on the ShipsEar dataset. On the DeepShip dataset, an average performance decrease of 1.38% was observed across all evaluation metrics. Therefore, data augmentation strategies have proven effective in improving recognition performance.
[0124] Furthermore, the channel attention module was removed from our network model to investigate its impact on model performance. Compared to the baseline, the simplified network model showed significant reductions in accuracy, precision, recall, and F1 score on the ShipsEar dataset, by 2.76%, 3.47%, 3.16%, and 3.32%, respectively. On the DeepShip dataset, the adaptive channel attention block achieved an average improvement of 2.08% across all metrics. These observations confirm that the adaptive channel attention mechanism proposed in our network model is crucial for improving recognition capabilities.
[0125] (4) Comparison with benchmark models;
[0126] The analysis results demonstrate the superiority of this network over representative DNN-based methods in terms of UATR. The models compared include:
[0127] EfficientNet-b0: A simple yet efficient convolutional neural network (CNN) model that has demonstrated outstanding performance in various computer vision tasks;
[0128] CRNN: A hybrid architecture that combines CNN for local feature extraction and LSTM for capturing relevant features;
[0129] MbNet-V2: A streamlined architecture for object detection using depthwise convolutions;
[0130] UATR-transformer: A transformer-based underwater acoustic target signal recognition network;
[0131] For a fair comparison, all the models mentioned above were modified to take one-dimensional Mel spectral features as input and were trained from scratch on the same ShipsEar and DeepShip datasets.
[0132]
[0133]
[0134] Table 3
[0135] The above demonstrates that our network achieved better evaluation scores than all other techniques on both datasets, obtaining recognition accuracies of 98.2% and 98.4%, respectively. This further proves that our network has excellent generalization ability.
[0136] (5) Evaluation on data at different noise levels;
[0137] To demonstrate the robustness of this network, the trained network was re-evaluated on the ShipsEar and DeepShip datasets using different levels of Gaussian noise. Noise levels were measured using signal-to-noise ratio (SNR). The raw acoustic signals from both datasets were used as clean signals in this experiment. For evaluation purposes, the signals were combined with Gaussian noise to synthesize noise samples at seven different noise levels (-15dB, -10dB, -5dB, 0dB, 5dB, 10dB, and 15dB).
[0138] For the ShipsEar dataset, our network achieved an overall accuracy of 95.87% across all classes at a signal-to-noise ratio (SNR) of 0 dB. Furthermore, it can be observed that classification accuracy generally improves gradually with increasing SNR, particularly at low SNR levels (from -15 dB to 0 dB). Additionally, due to the distinguishable spectral characteristics between marine ambient noise and ship-radiated audio, our network consistently achieved the highest recognition rate for class E in all evaluation scenarios. Compared to the ShipsEar dataset, the DeepShip dataset has a relatively larger sample size and exhibits more pronounced regularity across test sets at different noise levels. Recognition accuracy is affected by both class sample size and strong noise interference. Specifically, the total number of samples for classes A and B in the ShipsEar dataset and classes B and D in the DeepShip dataset is relatively small compared to other target classes. At low SNR levels (≤0 dB), the recognition accuracy for these classes drops sharply as the SNR decreases.
[0139] One embodiment of this application provides a ship target identification system based on underwater acoustic signals. This system includes a feature extraction unit 1100 and a target identification unit 1200, as detailed below:
[0140] The feature extraction unit 1100 is used to extract the multi-view time-domain signal representation of the ship's underwater acoustic signal and convert the multi-view time-domain signal representation into time-spectrum features.
[0141] The target recognition unit 1200 is used to input temporal spectral features into the first multi-scale convolutional block of a series of concatenated multi-scale convolutional blocks to obtain the final features output by the last multi-scale convolutional block. The final features are then input into an average pooling layer and a fully connected layer to obtain the ship target recognition result output by the fully connected layer. Any two multi-scale convolutional blocks have the same structure, and any one of the multi-scale convolutional blocks inputs the temporal spectral features or the features output by the previous concatenated multi-scale convolutional block in parallel into a set of residual convolutional blocks with different kernels to obtain the multi-scale feature map output by each residual convolutional block. The multi-scale feature maps of each block are merged to obtain a merged feature map, which is then input into an adaptive channel attention block to obtain the intermediate or final features output by the adaptive channel attention block. The intermediate features serve as the features input to the next concatenated multi-scale convolutional block.
[0142] This system embodiment and the above method embodiment are based on the same inventive concept. Therefore, the content of the above method embodiment also applies to this system embodiment and will not be repeated here.
[0143] Referring to Figure 7, this application also provides an electronic device, which includes:
[0144] At least one memory;
[0145] At least one processor;
[0146] At least one program;
[0147] The program is stored in memory, and the processor executes at least one program to implement the above-described ship target identification method based on underwater acoustic signals.
[0148] This electronic device can be any smart terminal, including mobile phones, tablets, personal digital assistants (PDAs), and in-vehicle computers.
[0149] The electronic devices according to embodiments of this application will now be described in detail.
[0150] The processor can be implemented using a general-purpose central processing unit (CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this disclosure.
[0151] The memory can be implemented in the form of read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory and called by the processor to execute the ship target identification method based on underwater acoustic signals according to the embodiments of this disclosure.
[0152] Input / output interfaces are used to implement information input and output;
[0153] The communication interface is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0154] A bus is used to transfer information between various components of a device, such as processors, memory, input / output interfaces, and communication interfaces.
[0155] The processor, memory, input / output interfaces, and communication interfaces communicate with each other within the device via a bus.
[0156] This disclosure also provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the above-described ship target identification method based on underwater acoustic signals.
[0157] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0158] The embodiments described in this disclosure are for the purpose of more clearly illustrating the technical solutions of this disclosure and do not constitute a limitation on the technical solutions provided by this disclosure. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by this disclosure are also applicable to similar technical problems.
[0159] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this disclosure, and may include more or fewer steps than shown, or combine certain steps, or different steps.
[0160] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0161] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0162] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0163] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0164] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0165] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0166] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0167] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0168] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A method for ship target identification based on underwater acoustic signals, characterized in that the method includes the following steps: extracting a multi-view time-domain signal representation of the ship's underwater acoustic signal, and converting the multi-view time-domain signal representation into time-spectrum features; inputting the time-spectrum features into the first multi-scale convolutional block in a series of concatenated multi-scale convolutional blocks to obtain the final features output by the last multi-scale convolutional block; and inputting the final features into an average pooling layer and a fully connected layer to obtain the ship target recognition result output by the fully connected layer; wherein any two multi-scale convolutional blocks have the same structure, and each multi-scale convolutional block inputs the time-spectrum features or the features output by the previous concatenated multi-scale convolutional block in parallel into a set of residual convolutional blocks with different kernels to obtain a multi-scale feature map output by each residual convolutional block; the multi-scale feature maps of the residual convolutional blocks are merged to obtain a merged feature map; the merged feature map is input into an adaptive channel attention block to obtain the intermediate features or the final features output by the adaptive channel attention block; and the intermediate features serve as the features input to the next concatenated multi-scale convolutional block.
2. The ship target identification method based on underwater acoustic signals according to claim 1, characterized in that each residual convolutional block includes multiple concatenated two-dimensional convolutional layers; the concatenated two-dimensional convolutional layers are connected by 1×1 convolutional layers; the convolutional kernels of any two two-dimensional convolutional layers are the same, and each two-dimensional convolutional layer is followed by batch normalization and a rectified linear unit; and each residual convolutional block inputs the time-spectrum features or the features output by the previous concatenated multi-scale convolutional block into the first of the multiple concatenated two-dimensional convolutional layers to obtain the multi-scale feature map output by the last two-dimensional convolutional layer.
3. The ship target identification method based on underwater acoustic signals according to claim 2, characterized in that the adaptive channel attention block uses average pooling to aggregate the global statistical features of each channel in the merged feature map and reshapes the result into an intermediate tensor; the intermediate tensor is input into a multilayer perceptron (MLP) module with two fully connected layers to capture the non-linear relationships between channels and generate attention weights for all channels; and the attention weights are multiplied element-wise with the merged feature map to output the intermediate features or the final features.
4. The ship target identification method based on underwater acoustic signals according to claim 1, characterized in that each multi-scale convolutional block comprises three parallel residual convolutional blocks with kernel sizes of 3, 5, and 7, respectively.
5. The ship target identification method based on underwater acoustic signals according to claim 1, characterized in that extracting the multi-view time-domain signal representation of the ship's underwater acoustic signal and converting the multi-view time-domain signal representation into time-spectral features comprises: extracting the multi-view time-domain signal representation of the underwater acoustic signal using a set of bandpass filters, wherein the frequency bands of the bandpass filters do not overlap one another; and converting the multi-view time-domain signal representation into time-spectral features using a Mel filter.
6. The ship target identification method based on underwater acoustic signals according to claim 5, characterized in that, before the set of bandpass filters is used to extract the multi-view time-domain signal representation of the underwater acoustic signal, the method further comprises: adding Gaussian noise to the underwater acoustic signal.
7. The ship target identification method based on underwater acoustic signals according to claim 5, characterized in that, before the Mel filter is used to convert the multi-view time-domain signal representation into time-spectral features, the method further comprises: augmenting the time-spectral features using the SpecAugment method.
8. A ship target identification system based on underwater acoustic signals, characterized in that the system comprises: a feature extraction unit configured to extract a multi-view time-domain signal representation of a ship's underwater acoustic signal and convert the multi-view time-domain signal representation into time-spectral features; and a target recognition unit configured to input the time-spectral features into the first multi-scale convolutional block of a series of cascaded multi-scale convolutional blocks to obtain the final features output by the last multi-scale convolutional block, and to input the final features into an average pooling layer and a fully connected layer to obtain the ship target recognition result output by the fully connected layer; wherein any two multi-scale convolutional blocks have the same structure, and each multi-scale convolutional block inputs the time-spectral features, or the features output by the preceding cascaded multi-scale convolutional block, in parallel into a set of residual convolutional blocks with different convolution kernels to obtain the multi-scale feature map output by each residual convolutional block, merges the multi-scale feature maps of the residual convolutional blocks to obtain a merged feature map, and inputs the merged feature map into an adaptive channel attention block to obtain the intermediate features or the final features output by the adaptive channel attention block, the intermediate features serving as the features input to the next cascaded multi-scale convolutional block.
9. An electronic device, characterized in that it comprises at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, and the instructions, when executed by the at least one control processor, enable the at least one control processor to perform the ship target identification method based on underwater acoustic signals according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions for causing a computer to perform the ship target identification method based on underwater acoustic signals according to any one of claims 1 to 7.
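As an illustration of the adaptive channel attention block recited in claim 3, the following is a minimal pure-Python sketch and not the patented implementation. The claim specifies only average pooling, a reshape, an MLP with two fully connected layers, and element-wise multiplication. The ReLU and sigmoid activations, the hidden width, and the names `channel_attention`, `w1`, `b1`, `w2`, and `b2` are assumptions made for this example.

```python
import math


def channel_attention(feature_map, w1, b1, w2, b2):
    """Re-weight the channels of a merged feature map.

    feature_map: list of C channels, each an H x W list of lists.
    w1, b1: first fully connected layer (hidden x C weights, hidden biases).
    w2, b2: second fully connected layer (C x hidden weights, C biases).
    Activations (ReLU, sigmoid) are assumptions; the claim does not name them.
    """
    # Average pooling: one global statistic per channel. The resulting flat
    # list of length C plays the role of the reshaped intermediate tensor.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

    # First fully connected layer followed by ReLU (assumed).
    hidden = [
        max(0.0, sum(w * x for w, x in zip(weights_row, pooled)) + bias)
        for weights_row, bias in zip(w1, b1)
    ]

    # Second fully connected layer followed by sigmoid (assumed), giving one
    # attention weight in (0, 1) per channel.
    attention = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(weights_row, hidden)) + bias)))
        for weights_row, bias in zip(w2, b2)
    ]

    # Element-wise multiplication: every value in channel c is scaled by
    # that channel's attention weight.
    scaled = [
        [[value * a for value in row] for row in channel]
        for channel, a in zip(feature_map, attention)
    ]
    return scaled, attention
```

In a trained network the weights `w1`, `b1`, `w2`, and `b2` would be learned; here they are plain arguments so that the data flow from pooling to re-weighting stays visible.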
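The multi-view extraction in claim 5 relies on bandpass filters whose frequency bands do not overlap. The sketch below illustrates that idea with ideal (brick-wall) filters implemented by masking a naive discrete Fourier transform. This is a stand-in chosen for brevity: the claims do not state the filter design, and the Mel conversion step is omitted. The function names and the `band_edges_hz` parameter are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short examples)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_into_views(signal, sample_rate, band_edges_hz):
    """Return one time-domain view per band [lo, hi) in Hz.

    Consecutive entries of band_edges_hz define adjacent, non-overlapping
    bands, so each frequency bin contributes to at most one view.
    """
    n = len(signal)
    spectrum = dft(signal)
    views = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        masked = []
        for k, coeff in enumerate(spectrum):
            # Bin k and bin n - k carry the same physical frequency
            # (positive and negative halves of the spectrum).
            freq = min(k, n - k) * sample_rate / n
            masked.append(coeff if lo <= freq < hi else 0)
        views.append(idft(masked))
    return views
```

Because the bands partition the spectrum, summing all views reproduces the input signal whenever the band edges cover the range from 0 Hz to just above the Nyquist frequency. A practical system would more likely use a designed FIR or IIR filter bank.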
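Claim 6 adds Gaussian noise to the underwater acoustic signal before the bandpass filtering. A minimal sketch of one common way to do this is given below. The claim says only that Gaussian noise is added; scaling the noise to a target signal-to-noise ratio, the `snr_db` parameter, and the function name `add_gaussian_noise` are assumptions for illustration.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with zero-mean Gaussian noise added.

    snr_db is the desired ratio of signal power to noise power in decibels
    (an assumed parameterization, not taken from the claims).
    """
    rng = rng or random.Random()

    # Mean power of the clean signal.
    signal_power = sum(v * v for v in signal) / len(signal)

    # Noise power that yields the requested SNR: SNR = 10 * log10(Ps / Pn).
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)

    # Add an independent Gaussian sample to every point of the signal.
    return [v + rng.gauss(0.0, sigma) for v in signal]
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which is convenient when comparing training runs.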