Multimodal deep learning method for ocean ship noise classification based on bilinear fusion

By employing a bilinear fusion strategy to perform fine-grained interaction between waveform and spectral features, the problem of incomplete modal feature information in existing technologies is solved, achieving efficient ship noise classification and improving classification accuracy and robustness.

CN121963786B · Active · Publication Date: 2026-06-19 · Laoshan National Laboratory (崂山国家实验室)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: Laoshan National Laboratory (崂山国家实验室)
Filing Date: 2026-04-02
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, single-modality methods for marine vessel noise classification either lose detail during time-frequency conversion (spectrogram-based methods) or are susceptible to environmental noise interference (waveform-based methods), and they cannot fully exploit the complex correlations between waveform features and spectral features.

Method used

A bilinear fusion strategy is adopted. Parallel waveform and spectral feature extraction networks are constructed. A weighted sum over the outer product of the dimensionality-reduced waveform feature vector and spectral feature vector then establishes a second-order combination relationship between the two, providing fine-grained interaction that is followed by nonlinear processing.

Benefits of technology

It enhances the expressive power and discriminative power of fused features, overcomes the shortcomings of incomplete information in single modalities, and improves the accuracy and robustness of ship noise classification.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to a multimodal deep learning-based method for classifying marine vessel noise using bilinear fusion, belonging to the field of acoustic signal processing technology. This method includes data preprocessing, constructing a two-stream feature extraction network, bilinear feature fusion, and a classification step. The bilinear feature fusion step includes: reducing the dimensionality of waveform feature vectors and spectral feature vectors respectively; calculating the weighted sum of the outer products of the reduced waveform and spectral feature vectors through bilinear interaction to obtain a fused feature vector; and then performing nonlinear processing on this fused feature vector to obtain the final fused feature vector. This method achieves fine-grained interaction between waveform and spectral features through bilinear fusion, enabling more thorough mining of complementary information between the two modes and improving the expressive power and discriminative power of the fused features.

Description

Technical Field

[0001] This invention belongs to the field of acoustic signal processing technology, specifically relating to a multimodal deep learning method for classifying marine vessel noise based on bilinear fusion.

Background Technology

[0002] Underwater acoustic target recognition technology analyzes ship radiated noise to determine target type, and it has significant value in marine environmental monitoring, maritime safety management, and related engineering applications. With the development of deep learning, recognition methods based on convolutional neural networks have become mainstream. Based on the modal differences of the input data, existing methods are mainly divided into two categories: spectrogram-based methods and original waveform-based methods. The former converts a one-dimensional signal into a two-dimensional time-frequency diagram (such as a Mel spectrogram) for feature extraction, effectively representing time-frequency information; the latter directly processes the original waveform, preserving the signal's fine temporal structure. However, single-modal input has limitations: spectrogram-based methods lose some original temporal details during time-frequency conversion, while waveform-based methods are susceptible to environmental noise interference.

[0003] To combine the advantages of both modalities, existing technologies have proposed multimodal fusion strategies. For example, CN118585889B discloses a method combining temporal and visual neural networks, which fuses the decision vectors obtained from MFCC features and Mel spectrogram features through weighted averaging. CN119669909A further proposes fusing waveform and spectrogram features with a multi-head attention mechanism. Although these fusion methods improve recognition performance to some extent, existing fusion strategies (such as weighted averaging or attention mechanisms) are mostly shallow or late fusion. They focus mainly on macroscopic alignment between features and fail to fully explore the complex, fine-grained correlations between the two heterogeneous feature spaces. Such deep interactions between features are crucial for distinguishing ship targets with similar acoustic properties.

[0004] Therefore, how to design a fusion mechanism that can achieve deep interaction between waveform features and spectral features while maintaining the efficiency of the model remains a problem that needs to be solved by current technology.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a multimodal deep learning-based marine vessel noise classification method based on bilinear fusion. This method achieves fine-grained interaction between waveform and spectral features through bilinear fusion. It performs a weighted outer product calculation on the dimensionality-reduced waveform and spectral feature vectors, modeling a second-order combination relationship between the elements of the two feature vectors. Compared to traditional shallow fusion methods such as feature concatenation or weighted averaging, this method can more fully exploit the complementary information of the two modalities, improving the expressive power and discriminative power of the fused features.

[0006] This invention provides a multimodal deep learning-based method for classifying marine vessel noise based on bilinear fusion, comprising the following steps:

[0007] Data preprocessing: Obtain the raw audio data of ship noise, resample and segment it, and extract the corresponding waveform signals and Mel spectrograms respectively;

[0008] Constructing a dual-stream feature extraction network: parallel waveform feature extraction and spectral feature extraction networks are constructed. The waveform signal is input into the waveform feature extraction network to obtain the waveform feature vector $\mathbf{f}_w$. The Mel spectrogram is input into the spectral feature extraction network to obtain the spectral feature vector $\mathbf{f}_s$;

[0009] Bilinear feature fusion: the waveform feature vector $\mathbf{f}_w$ and the spectral feature vector $\mathbf{f}_s$ are each reduced in dimensionality. Through bilinear interaction, the weighted sum over the outer product of the dimensionality-reduced waveform feature vector $\tilde{\mathbf{f}}_w$ and the dimensionality-reduced spectral feature vector $\tilde{\mathbf{f}}_s$ is calculated to obtain the fused feature vector $\mathbf{z}$. Nonlinear processing is then applied to $\mathbf{z}$ to obtain the final fused feature vector $\hat{\mathbf{z}}$;

[0010] Classification: the final fused feature vector $\hat{\mathbf{z}}$ is mapped to the ship category space to obtain the probability distribution over ship types, and the classification result is output.

[0011] This technical solution achieves fine-grained interaction between waveform features and spectral features through bilinear fusion. It calculates a weighted sum over the outer product of the dimensionality-reduced waveform feature vector and spectral feature vector to model the second-order combination relationship between the elements of the two feature vectors. Compared with traditional shallow fusion methods such as feature concatenation or weighted averaging, it can more fully exploit the complementary information of the two modalities and improve the expressive power and discriminative power of the fused features.

[0012] In some embodiments, during the dual-stream feature extraction step,

[0013] The waveform feature extraction network employs multiple cascaded convolutional modules combining one-dimensional convolution, batch normalization, rectified linear units (ReLU), and pooling to abstract the waveform signal features layer by layer, and outputs a 256-dimensional waveform feature vector $\mathbf{f}_w$ after adaptive pooling;

[0014] The spectral feature extraction network employs multiple cascaded convolutional modules combining two-dimensional convolution, batch normalization, rectified linear units (ReLU), and pooling to abstract features layer by layer from the Mel spectrogram, and outputs a 256-dimensional spectral feature vector $\mathbf{f}_s$ after adaptive pooling.

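[0015] In some embodiments, the dimensionality reduction in the bilinear feature fusion step includes: as shown in equation (1), projecting the waveform feature vector $\mathbf{f}_w$ and the spectral feature vector $\mathbf{f}_s$ to 128 dimensions through fully connected layers to obtain the 128-dimensional waveform feature vector $\tilde{\mathbf{f}}_w$ and the 128-dimensional spectral feature vector $\tilde{\mathbf{f}}_s$. The expression for equation (1) is:

[0016] $\tilde{\mathbf{f}}_w = \mathbf{W}_w \mathbf{f}_w + \mathbf{b}_w, \qquad \tilde{\mathbf{f}}_s = \mathbf{W}_s \mathbf{f}_s + \mathbf{b}_s$ (1);

[0017] In equation (1), $\mathbf{W}_w, \mathbf{W}_s \in \mathbb{R}^{128 \times 256}$ are the projection matrices, $\mathbf{b}_w, \mathbf{b}_s \in \mathbb{R}^{128}$ are the corresponding bias terms, and $\mathbb{R}$ represents the space of real numbers.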

[0018] In some embodiments, in the bilinear feature fusion step, the fused feature vector $\mathbf{z}$ is calculated by constructing a three-dimensional bilinear weight tensor $\mathcal{W}$ and a bias term $\mathbf{b}$. For each output dimension $k$ of the fused feature, $z_k$ is calculated as shown in equation (2):

[0019] $z_k = \sum_{i}\sum_{j} \mathcal{W}_{ijk}\,\tilde{f}_{w,i}\,\tilde{f}_{s,j} + b_k$ (2);

[0020] In equation (2), $\mathcal{W}$ is a learnable bilinear weight tensor whose element $\mathcal{W}_{ijk}$ is a scalar that quantifies the contribution of the interaction between the $i$-th element of the projected waveform feature vector and the $j$-th element of the projected spectral feature vector to the $k$-th dimension of the fused feature vector; $\mathbf{b}$ is a learnable bias vector with elements $b_k$; and $\tilde{f}_{w,i}$ and $\tilde{f}_{s,j}$ are the $i$-th element of the projected waveform feature vector $\tilde{\mathbf{f}}_w$ and the $j$-th element of the projected spectral feature vector $\tilde{\mathbf{f}}_s$, respectively.

[0021] In some embodiments, the bilinear feature fusion step further includes: applying a ReLU activation function to the fused feature vector $\mathbf{z}$ for nonlinear processing to obtain the final fused feature vector $\hat{\mathbf{z}}$.

[0022] In some embodiments, the classification step uses the Softmax function to establish a probability distribution model for ship types, as shown in equation (3):

[0023] $\mathbf{p} = \mathrm{Softmax}(\mathbf{W}_c \hat{\mathbf{z}} + \mathbf{b}_c)$ (3);

[0024] In equation (3), $\mathbf{p}$ is the predicted probability distribution vector, $\hat{\mathbf{z}}$ is the final fused feature vector, $\mathbf{W}_c$ is the classification weight matrix, and $\mathbf{b}_c$ is the classification bias term. The Softmax function ensures that all elements of $\mathbf{p}$ are non-negative and sum to 1.

[0025] In some embodiments, the multimodal deep learning marine vessel noise classification method based on bilinear fusion further includes training and optimization steps, which include: training the vessel type probability distribution model using a label-smoothing cross-entropy loss function and optimizing it using an optimizer. The expression of the cross-entropy function is shown in equation (4):

[0026] $\mathcal{L} = -\sum_{c=1}^{C} \left[ (1 - \varepsilon)\, y_c + \frac{\varepsilon}{C} \right] \log p_c$ (4);

[0027] In equation (4), $\mathcal{L}$ is the scalar loss value, $\varepsilon$ is the label smoothing factor, $C$ is the total number of ship categories, $\mathbf{y}$ is the one-hot vector of the true label (1 for the correct category and 0 for the rest) with elements $y_c$, and $\mathbf{p}$ is the probability distribution predicted by the model with elements $p_c$.

[0028] In some embodiments, the optimization method using an optimizer includes: using the AdamW optimizer in conjunction with the OneCycleLR learning rate scheduling strategy for optimization.

[0029] In some embodiments, the data preprocessing step specifically includes: converting the sampling frequency of the ship noise audio signal to 16 kHz; dividing the resampled audio signal into segments of a preset fixed duration, and zero-padding segments shorter than the preset duration, to obtain waveform signals of 80,000 sampling points each; and extracting the Mel spectrogram of each audio segment with the FFT size set to 1024, the hop length set to 256, and the number of Mel frequency bands set to 64, to obtain a Mel spectrogram with dimensions [1, 64, 313].

[0030] In some embodiments, the data preprocessing step is followed by a data augmentation step, which includes: performing frequency domain masking on the Mel spectrogram with a preset probability, performing time domain masking on the Mel spectrogram with a preset probability, or adding Gaussian noise to the waveform signal with a preset probability.

[0031] Based on the above scheme, the multimodal deep learning marine vessel noise classification method based on bilinear fusion in this embodiment of the invention achieves fine-grained interaction between waveform features and spectral features through bilinear fusion. It performs a weighted sum of the outer product of the dimensionality-reduced waveform feature vector and spectral feature vector, modeling a second-order combination relationship between the elements of the two feature vectors. Compared with traditional shallow fusion methods such as feature concatenation or weighted averaging, this method can more fully explore the complementary information of the two modes, improving the expressive power and discriminative power of the fused features. Furthermore, by constructing parallel waveform feature extraction networks and spectral feature extraction networks, complementary multimodal features are extracted from the time-domain waveform signal and the frequency-domain Mel spectrum, respectively, overcoming the deficiency of incomplete information from a single mode. Dimensionality reduction processing before bilinear interaction effectively controls the number of parameters and computational complexity of the bilinear weight tensor. Nonlinear processing of the fused features further enhances the discriminative power of the features. In summary, this invention achieves deep interactive fusion of multimodal features while ensuring computational feasibility, effectively improving the accuracy of vessel noise classification.

Attached Figure Description

[0032] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0033] Figure 1 is a flowchart of the multimodal deep learning marine vessel noise classification method based on bilinear fusion described in this invention;

[0034] Figure 2 is a flowchart of the steps for constructing the dual-stream feature extraction network in Example 1 of the present invention;

[0035] Figure 3 is a bar chart comparing performance in Example 2;

[0036] Figure 4 is a line graph comparing performance in Example 2;

[0037] Figure 5 shows the comparison curves of the validation set loss values in Example 2;

[0038] Figure 6 is a graph comparing training time consumption in Example 2.

Detailed Implementation

[0039] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0040] As shown in Figure 1, in one embodiment of the multimodal deep learning marine vessel noise classification method based on bilinear fusion of the present invention, the method includes data preprocessing, construction of a two-stream feature extraction network, bilinear feature fusion, and classification steps. The data preprocessing step includes: acquiring the raw audio data of the vessel noise, resampling and segmenting it, and extracting the corresponding waveform signal and Mel spectrogram respectively. The two-stream feature extraction network construction step includes: constructing a parallel waveform feature extraction network and spectral feature extraction network, inputting the waveform signal into the waveform feature extraction network to obtain the waveform feature vector $\mathbf{f}_w$, and inputting the Mel spectrogram into the spectral feature extraction network to obtain the spectral feature vector $\mathbf{f}_s$. The bilinear feature fusion step includes: reducing the dimensionality of the waveform feature vector $\mathbf{f}_w$ and the spectral feature vector $\mathbf{f}_s$ separately; calculating, through bilinear interaction, the weighted sum over the outer product of the dimensionality-reduced waveform feature vector $\tilde{\mathbf{f}}_w$ and the dimensionality-reduced spectral feature vector $\tilde{\mathbf{f}}_s$ to obtain the fused feature vector $\mathbf{z}$; and applying nonlinear processing to $\mathbf{z}$ to obtain the final fused feature vector $\hat{\mathbf{z}}$. The classification step includes: mapping the final fused feature vector $\hat{\mathbf{z}}$ to the ship category space to obtain the probability distribution over ship types, and outputting the classification result.

[0041] In the above illustrative embodiments, the multimodal deep learning-based marine vessel noise classification method based on bilinear fusion in this invention achieves fine-grained interaction between waveform features and spectral features through bilinear fusion. It performs a weighted sum of the outer product of the dimensionality-reduced waveform feature vector and spectral feature vector, modeling a second-order combination relationship between the elements of the two feature vectors. Compared to traditional shallow fusion methods such as feature concatenation or weighted averaging, this method can more fully mine the complementary information of the two modes, improving the expressive power and discriminative power of the fused features. Furthermore, by constructing parallel waveform feature extraction networks and spectral feature extraction networks, complementary multimodal features are extracted from the time-domain waveform signal and the frequency-domain Mel spectrum, respectively, overcoming the deficiency of incomplete information from a single mode. Dimensionality reduction processing before bilinear interaction effectively controls the number of parameters and computational complexity of the bilinear weight tensor. Nonlinear processing of the fused features further enhances the discriminative power of the features. In summary, this invention achieves deep interactive fusion of multimodal features while ensuring computational feasibility, effectively improving the accuracy of vessel noise classification.
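To make the parameter-control argument concrete, the following sketch counts the learnable entries of the bilinear map with and without the 256-to-128 projection. The 256-dimensional fused output is taken from Example 1; the function names are illustrative and do not come from the patent.

```python
def bilinear_tensor_params(dim_a: int, dim_b: int, dim_out: int) -> int:
    """Weights plus biases of a bilinear map R^a x R^b -> R^out."""
    return dim_a * dim_b * dim_out + dim_out


def projection_params(dim_in: int, dim_out: int) -> int:
    """Weights plus biases of one fully connected projection layer."""
    return dim_in * dim_out + dim_out


# Bilinear interaction applied directly to the two 256-dimensional branch outputs.
direct = bilinear_tensor_params(256, 256, 256)

# Two 256 -> 128 projections followed by a 128 x 128 x 256 bilinear tensor.
reduced = 2 * projection_params(256, 128) + bilinear_tensor_params(128, 128, 256)

print(f"direct:  {direct:,}")
print(f"reduced: {reduced:,}")
print(f"ratio:   {direct / reduced:.2f}")
```

Under these assumptions, halving both input dimensions cuts the fusion stage to roughly a quarter of its direct size, since the tensor grows with the product of the two input dimensions.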

[0042] In some embodiments, as shown in Figure 2, in the two-stream feature extraction step,

[0043] The waveform feature extraction network employs multiple cascaded convolutional modules combining one-dimensional convolution, batch normalization, rectified linear units (ReLU), and pooling to abstract the waveform signal features layer by layer, and outputs a 256-dimensional waveform feature vector $\mathbf{f}_w$ after adaptive pooling;

[0044] The spectral feature extraction network employs multiple cascaded convolutional modules combining two-dimensional convolution, batch normalization, rectified linear units (ReLU), and pooling to abstract features layer by layer from the Mel spectrogram, and outputs a 256-dimensional spectral feature vector $\mathbf{f}_s$ after adaptive pooling.

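[0045] In some embodiments, the dimensionality reduction in the bilinear feature fusion step includes: as shown in equation (1), projecting the waveform feature vector $\mathbf{f}_w$ and the spectral feature vector $\mathbf{f}_s$ to 128 dimensions through fully connected layers to obtain the 128-dimensional waveform feature vector $\tilde{\mathbf{f}}_w$ and the 128-dimensional spectral feature vector $\tilde{\mathbf{f}}_s$. The expression for equation (1) is:

[0046] $\tilde{\mathbf{f}}_w = \mathbf{W}_w \mathbf{f}_w + \mathbf{b}_w, \qquad \tilde{\mathbf{f}}_s = \mathbf{W}_s \mathbf{f}_s + \mathbf{b}_s$ (1);

[0047] In equation (1), $\mathbf{W}_w, \mathbf{W}_s \in \mathbb{R}^{128 \times 256}$ are the projection matrices, $\mathbf{b}_w, \mathbf{b}_s \in \mathbb{R}^{128}$ are the corresponding bias terms, and $\mathbb{R}$ represents the space of real numbers.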

[0048] In some embodiments, in the bilinear feature fusion step, the fused feature vector $\mathbf{z}$ is calculated by constructing a three-dimensional bilinear weight tensor $\mathcal{W}$ and a bias term $\mathbf{b}$. For each output dimension $k$ of the fused feature, $z_k$ is calculated as shown in equation (2):

[0049] $z_k = \sum_{i}\sum_{j} \mathcal{W}_{ijk}\,\tilde{f}_{w,i}\,\tilde{f}_{s,j} + b_k$ (2);

[0050] In equation (2), $\mathcal{W}$ is a learnable bilinear weight tensor whose element $\mathcal{W}_{ijk}$ is a scalar that quantifies the contribution of the interaction between the $i$-th element of the projected waveform feature vector and the $j$-th element of the projected spectral feature vector to the $k$-th dimension of the fused feature vector; $\mathbf{b}$ is a learnable bias vector with elements $b_k$; and $\tilde{f}_{w,i}$ and $\tilde{f}_{s,j}$ are the $i$-th element of the projected waveform feature vector $\tilde{\mathbf{f}}_w$ and the $j$-th element of the projected spectral feature vector $\tilde{\mathbf{f}}_s$, respectively.
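Equation (2) can also be read one output dimension at a time. Writing $\tilde{\mathbf{f}}_w$ and $\tilde{\mathbf{f}}_s$ for the projected waveform and spectral feature vectors, fixing $k$ selects one matrix slice of the weight tensor, and the $k$-th fused feature is a bilinear form in the two vectors, or equivalently a weighted sum over their outer product. The slice symbol $\mathbf{W}^{(k)}$ is introduced here only for illustration and is not notation from the patent.

```latex
\[
z_k
= \sum_{i}\sum_{j} \mathcal{W}_{ijk}\,\tilde{f}_{w,i}\,\tilde{f}_{s,j} + b_k
= \tilde{\mathbf{f}}_w^{\top}\,\mathbf{W}^{(k)}\,\tilde{\mathbf{f}}_s + b_k
= \big\langle \mathbf{W}^{(k)},\; \tilde{\mathbf{f}}_w \tilde{\mathbf{f}}_s^{\top} \big\rangle_F + b_k,
\qquad
\big[\mathbf{W}^{(k)}\big]_{ij} = \mathcal{W}_{ijk}.
\]
```

The last form makes the "outer product weighted sum" wording explicit: every pairwise product of a waveform element and a spectral element receives its own weight for every output dimension.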

[0051] In some embodiments, the bilinear feature fusion step further includes: applying a ReLU activation function to the fused feature vector $\mathbf{z}$ for nonlinear processing to obtain the final fused feature vector $\hat{\mathbf{z}}$.

[0052] In some embodiments, the classification step uses the Softmax function to establish a probability distribution model for ship types, as shown in equation (3):

[0053] $\mathbf{p} = \mathrm{Softmax}(\mathbf{W}_c \hat{\mathbf{z}} + \mathbf{b}_c)$ (3);

[0054] In equation (3), $\mathbf{p}$ is the predicted probability distribution vector, $\hat{\mathbf{z}}$ is the final fused feature vector, $\mathbf{W}_c$ is the classification weight matrix, and $\mathbf{b}_c$ is the classification bias term. The Softmax function ensures that all elements of $\mathbf{p}$ are non-negative and sum to 1.

[0055] In some embodiments, as shown in Figure 1, the multimodal deep learning marine vessel noise classification method based on bilinear fusion also includes training and optimization steps. The training and optimization steps include: training the vessel type probability distribution model using a label-smoothing cross-entropy loss function, and optimizing it using an optimizer. The expression of the cross-entropy function is shown in equation (4):

[0056] $\mathcal{L} = -\sum_{c=1}^{C} \left[ (1 - \varepsilon)\, y_c + \frac{\varepsilon}{C} \right] \log p_c$ (4);

[0057] In equation (4), $\mathcal{L}$ is the scalar loss value, $\varepsilon$ is the label smoothing factor, $C$ is the total number of ship categories, $\mathbf{y}$ is the one-hot vector of the true label (1 for the correct category and 0 for the rest) with elements $y_c$, and $\mathbf{p}$ is the probability distribution predicted by the model with elements $p_c$.

[0058] In some embodiments, the optimization method using an optimizer includes: using the AdamW optimizer in conjunction with the OneCycleLR learning rate scheduling strategy for optimization.

[0059] In some embodiments, the data preprocessing step specifically includes: converting the sampling frequency of the ship noise audio signal to 16 kHz; dividing the resampled audio signal into segments of a preset fixed duration, and zero-padding segments shorter than the preset duration, to obtain waveform signals of 80,000 sampling points each; and extracting the Mel spectrogram of each audio segment with the FFT size set to 1024, the hop length set to 256, and the number of Mel frequency bands set to 64, to obtain a Mel spectrogram with dimensions [1, 64, 313].
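The stated dimensions are consistent with one another. The check below assumes a segment duration of 5 s (the value used in Example 1) and a short-time Fourier transform that pads the signal at both ends so that frames are centred, which is a common default but is not stated in the source:

```latex
\[
N = 16\,000\ \text{Hz} \times 5\ \text{s} = 80\,000\ \text{samples},
\qquad
T = 1 + \left\lfloor \frac{N}{\text{hop}} \right\rfloor
  = 1 + \left\lfloor \frac{80\,000}{256} \right\rfloor
  = 1 + 312 = 313\ \text{frames}.
\]
```

With one channel and 64 Mel bands, this gives the [1, 64, 313] shape quoted above.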

[0060] It should be noted that one-dimensional waveform features and two-dimensional spectral features can also be obtained by different processing methods, and are not limited to the original waveform signal and Mel spectral features.

[0061] In some embodiments, the data preprocessing step is followed by a data augmentation step, which includes: performing frequency domain masking on the Mel spectrum with a preset probability, performing time domain masking on the Mel spectrum with a preset probability, or adding Gaussian noise to the waveform signal with a preset probability.

[0062] Example 1

[0063] As shown in Figure 1, the following example describes in detail the multimodal deep learning marine vessel noise classification method based on bilinear fusion provided by the present invention.

[0064] Step 1: Data Preprocessing

[0065] 1.1 Obtain the raw audio data of ship noise and convert the sampling frequency of the audio signal to 16 kHz, reducing the computational burden while preserving the characteristic information in the audio signal.

[0066] 1.2 The resampled audio signal is divided into segments with a fixed duration of 5 seconds. For segments with a duration of less than 5 seconds, zero padding is performed to obtain a waveform signal with 80,000 sampling points. After segmentation, all samples in the dataset have a duration of 5 seconds, ensuring the consistency of the input data dimension.

[0067] 1.3 Extract the Mel spectrogram of each audio segment with the FFT size set to 1024, the hop length set to 256, and the number of Mel frequency bands set to 64, to obtain a Mel spectrogram with dimensions [1, 64, 313].

[0068] 1.4 The waveform signal is standardized to have a mean of 0 and a standard deviation of 1; the spectrogram is logarithmically scaled to enhance contrast.
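Steps 1.2 and 1.4 can be sketched in plain Python as follows. This is a minimal illustration rather than the patent's implementation: the function names, the synthetic input tone, and the small `eps` guard against division by zero are all assumptions, and resampling and Mel extraction are left out.

```python
import math

SAMPLE_RATE = 16_000                           # Hz, from step 1.1
SEGMENT_SECONDS = 5                            # s, from step 1.2
SEGMENT_LEN = SAMPLE_RATE * SEGMENT_SECONDS    # 80,000 samples


def split_into_segments(signal, segment_len=SEGMENT_LEN):
    """Cut a 1-D signal into fixed-length segments, zero-padding the last one."""
    segments = []
    for start in range(0, len(signal), segment_len):
        chunk = list(signal[start:start + segment_len])
        chunk.extend([0.0] * (segment_len - len(chunk)))  # no-op for full chunks
        segments.append(chunk)
    return segments


def standardize(segment, eps=1e-8):
    """Shift and scale a segment to zero mean and unit standard deviation."""
    n = len(segment)
    mean = sum(segment) / n
    var = sum((x - mean) ** 2 for x in segment) / n
    std = math.sqrt(var)
    return [(x - mean) / (std + eps) for x in segment]


# Synthetic 12-second tone standing in for a resampled recording.
signal = [math.sin(2 * math.pi * 50 * t / SAMPLE_RATE) for t in range(12 * SAMPLE_RATE)]
segments = [standardize(s) for s in split_into_segments(signal)]
print(len(segments), len(segments[0]))
```

The 12-second input yields two full segments and one zero-padded segment, each 80,000 samples long, so every sample fed to the waveform branch has the same dimension.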

[0069] Step 2: Construct a two-stream feature extraction network

[0070] As shown in Figure 2, a parallel waveform feature extraction network and spectral feature extraction network are constructed to extract features from the waveform and spectral modalities, respectively:

[0071] 2.1 Waveform Feature Extraction Network (WaveNet Branch)

[0072] The first waveform convolution module consists of a one-dimensional convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) layer, and a pooling layer connected in sequence. It takes the waveform signal as input and performs preliminary downsampling and shallow feature extraction on the input signal.

[0073] Three cascaded waveform convolutional blocks follow, each consisting of a one-dimensional convolutional layer, a batch normalization layer, and a ReLU layer connected in sequence. The three blocks have different network parameters and perform multi-scale, layer-by-layer abstraction on the output of the first waveform convolution module to extract deeper temporal features.

[0074] The waveform output module consists of an adaptive pooling layer and a flattening layer connected in sequence. It takes the output of the last waveform convolutional block as input and outputs the 256-dimensional waveform feature vector $\mathbf{f}_w$.

[0075] 2.2 Spectral Feature Extraction Network (MelNet Branch)

[0076] The first spectral convolution module consists of a two-dimensional convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) layer, and a pooling layer connected in sequence. It takes the Mel spectrogram as input and performs preliminary feature extraction and spatial dimension compression on the input spectrogram.

[0077] Four cascaded spectral convolutional blocks follow, each consisting of a two-dimensional convolutional layer, a batch normalization layer, a ReLU layer, and a pooling layer connected in sequence. The four blocks have different network parameters and perform multi-scale, layer-by-layer abstraction on the output of the first spectral convolution module to capture local patterns and structural information in the frequency domain.

[0078] The spectral output module consists of an adaptive pooling layer and a flattening layer connected in sequence. It takes the output of the last spectral convolutional block as input and outputs the 256-dimensional spectral feature vector $\mathbf{f}_s$.

[0079] Step 3: Bilinear Feature Fusion

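[0080] 3.1 To reduce computational complexity, the waveform feature vector $\mathbf{f}_w$ and the spectral feature vector $\mathbf{f}_s$ are each projected to 128 dimensions through a fully connected layer, yielding the 128-dimensional waveform feature vector $\tilde{\mathbf{f}}_w$ and the 128-dimensional spectral feature vector $\tilde{\mathbf{f}}_s$. The expression for equation (1) is:

[0081] $\tilde{\mathbf{f}}_w = \mathbf{W}_w \mathbf{f}_w + \mathbf{b}_w, \qquad \tilde{\mathbf{f}}_s = \mathbf{W}_s \mathbf{f}_s + \mathbf{b}_s$ (1);

[0082] In equation (1), $\mathbf{W}_w, \mathbf{W}_s \in \mathbb{R}^{128 \times 256}$ are the projection matrices, $\mathbf{b}_w, \mathbf{b}_s \in \mathbb{R}^{128}$ are the corresponding bias terms, and $\mathbb{R}$ represents the space of real numbers.

[0083] 3.2 Construct a three-dimensional bilinear weight tensor $\mathcal{W} \in \mathbb{R}^{128 \times 128 \times 256}$ and a bias term $\mathbf{b} \in \mathbb{R}^{256}$, and calculate the weighted sum over the outer product of the dimensionality-reduced waveform feature vector $\tilde{\mathbf{f}}_w$ and the dimensionality-reduced spectral feature vector $\tilde{\mathbf{f}}_s$. The outer product is vectorized and projected to a lower dimension, establishing a fine-grained interaction model between the two modal features and yielding a 256-dimensional fused feature vector $\mathbf{z}$, as shown in equation (5), where $\mathcal{W}$ acts on the vectorized outer product as a $256 \times 128^2$ matrix:

[0084] $\mathbf{z} = \mathcal{W}\,\mathrm{vec}\left(\tilde{\mathbf{f}}_w \otimes \tilde{\mathbf{f}}_s\right) + \mathbf{b}$ (5);

[0085] For each output dimension $k$ of the fused feature, $z_k$ is calculated as shown in equation (2):

[0086] $z_k = \sum_{i=1}^{128}\sum_{j=1}^{128} \mathcal{W}_{ijk}\,\tilde{f}_{w,i}\,\tilde{f}_{s,j} + b_k$ (2);

[0087] In equation (2), $\mathcal{W}$ is a learnable bilinear weight tensor whose element $\mathcal{W}_{ijk}$ is a scalar that quantifies the contribution of the interaction between the $i$-th element of the projected waveform feature vector and the $j$-th element of the projected spectral feature vector to the $k$-th dimension of the fused feature vector; $\mathbf{b}$ is a learnable bias vector with elements $b_k$; and $\tilde{f}_{w,i}$ and $\tilde{f}_{s,j}$ are the $i$-th element of the projected waveform feature vector $\tilde{\mathbf{f}}_w$ and the $j$-th element of the projected spectral feature vector $\tilde{\mathbf{f}}_s$, respectively.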

[0088] 3.3 Apply the ReLU activation function to the fused feature vector $\mathbf{z}$ for nonlinear processing to obtain the final fused feature vector $\hat{\mathbf{z}}$, as shown in equation (6):

[0089] $\hat{\mathbf{z}} = \mathrm{ReLU}(\mathbf{z})$ (6).
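Step 3 as a whole can be sketched with NumPy as below. The random arrays merely stand in for learned parameters and for the two branch outputs, and all variable names are illustrative assumptions. The sketch also checks that the element-wise form of equation (2) agrees with the "vectorized outer product" reading, assuming the weight tensor is flattened in row-major order.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_RED, D_OUT = 256, 128, 256   # dimensions stated in Example 1

# Randomly initialised stand-ins for learned parameters.
W_w = rng.standard_normal((D_RED, D_IN)) * 0.01            # waveform projection
W_s = rng.standard_normal((D_RED, D_IN)) * 0.01            # spectral projection
b_w = np.zeros(D_RED)
b_s = np.zeros(D_RED)
W_bil = rng.standard_normal((D_RED, D_RED, D_OUT)) * 0.01  # bilinear weight tensor
b = np.zeros(D_OUT)


def bilinear_fuse(f_w, f_s):
    """Project both 256-d vectors to 128-d, apply the bilinear map, then ReLU."""
    fw_red = W_w @ f_w + b_w
    fs_red = W_s @ f_s + b_s
    # z[k] = sum_i sum_j W_bil[i, j, k] * fw_red[i] * fs_red[j] + b[k]
    z = np.einsum("ijk,i,j->k", W_bil, fw_red, fs_red) + b
    return np.maximum(z, 0.0), z, fw_red, fs_red


f_w = rng.standard_normal(D_IN)   # placeholder waveform-branch output
f_s = rng.standard_normal(D_IN)   # placeholder spectral-branch output
z_hat, z, fw_red, fs_red = bilinear_fuse(f_w, f_s)

# Same result via the vectorised outer product times the flattened tensor.
outer = np.outer(fw_red, fs_red).reshape(-1)               # length 128 * 128
z_flat = W_bil.reshape(D_RED * D_RED, D_OUT).T @ outer + b
print(z_hat.shape, np.allclose(z, z_flat))
```

In a training framework the tensor contraction would normally be handled by a built-in bilinear layer; the explicit `einsum` is used here only because it mirrors the index notation of equation (2).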

[0090] Step 4: Classification

[0091] The final fused feature vector $\hat{\mathbf{z}}$ is mapped to the ship category space, the Softmax function is used to establish a probability distribution model for ship types, and the classification result is output. The expression of equation (3) is:

[0092] $\mathbf{p} = \mathrm{Softmax}(\mathbf{W}_c \hat{\mathbf{z}} + \mathbf{b}_c)$ (3);

[0093] In equation (3), $\mathbf{p}$ is the predicted probability distribution vector, $\mathbf{W}_c$ is the classification weight matrix, and $\mathbf{b}_c$ is the classification bias term. The Softmax function ensures that all elements of $\mathbf{p}$ are non-negative and sum to 1.
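A toy version of the classification step is sketched below. The three-dimensional feature, the weights, and the biases are made-up numbers chosen for readability (the real fused feature has 256 dimensions), and the four class names follow the DeepShip categories used in Example 2.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z_hat, W_c, b_c):
    """Return class probabilities Softmax(W_c z_hat + b_c)."""
    logits = [sum(w * z for w, z in zip(row, z_hat)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)


# Toy example: a 3-dimensional fused feature and four ship classes.
classes = ["cargo", "passenger", "tanker", "tug"]
z_hat = [0.5, 0.0, 1.2]
W_c = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.1, 0.1, 0.1], [-0.3, 0.2, 0.5]]
b_c = [0.0, 0.1, -0.1, 0.0]

p = classify(z_hat, W_c, b_c)
print(classes[p.index(max(p))], round(sum(p), 6))
```

The predicted class is the index of the largest probability; subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits.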

[0094] Step 5: Training and Optimization

[0095] 5.1 The ship type probability distribution model is trained using the label-smoothing cross-entropy loss function. The expression of the cross-entropy function is shown in equation (4):

[0096] $\mathcal{L} = -\sum_{c=1}^{C} \left[ (1 - \varepsilon)\, y_c + \frac{\varepsilon}{C} \right] \log p_c$ (4);

[0097] In equation (4), $\mathcal{L}$ is the scalar loss value; $\varepsilon$ is the label smoothing factor, which is used to soften the one-hot distribution of the true labels and reduce overfitting; $C$ is the total number of ship categories; $\mathbf{y}$ is the one-hot vector of the true label (1 for the correct category and 0 for the rest) with elements $y_c$; and $\mathbf{p}$ is the probability distribution predicted by the model with elements $p_c$.
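The effect of label smoothing can be seen on a single prediction. The sketch below uses the common formulation in which the smoothed target is (1 - eps) times the one-hot label plus eps / C spread uniformly over all classes; this form, the example probabilities, and the value eps = 0.1 are assumptions, since the source text does not preserve them.

```python
import math


def label_smoothing_ce(probs, true_class, eps):
    """Cross-entropy against the smoothed target (1 - eps) * one_hot + eps / C."""
    num_classes = len(probs)
    loss = 0.0
    for c, p_c in enumerate(probs):
        y_c = 1.0 if c == true_class else 0.0
        target = (1.0 - eps) * y_c + eps / num_classes
        loss -= target * math.log(p_c)
    return loss


probs = [0.7, 0.1, 0.1, 0.1]   # made-up prediction over four classes
plain = label_smoothing_ce(probs, true_class=0, eps=0.0)   # ordinary cross-entropy
smooth = label_smoothing_ce(probs, true_class=0, eps=0.1)  # eps = 0.1 is an assumed value
print(round(plain, 4), round(smooth, 4))
```

With smoothing, the low probabilities assigned to the wrong classes also contribute to the loss, so the model is penalized for pushing them toward zero and is discouraged from becoming over-confident.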

[0098] 5.2 The AdamW optimizer is used in conjunction with the OneCycleLR learning rate scheduling strategy for optimization;

[0099] The parameters of the AdamW optimizer are set as follows: learning rate is set to 1e-4, weight decay is set to 1e-5, first-order momentum decay coefficient β1=0.9, and second-order momentum decay coefficient β2=0.999.

[0100] In conjunction with the OneCycleLR learning rate scheduling strategy, the learning rate is linearly warmed up to 1×10⁻⁴ within the first 10% of training steps; in the remaining 90% of training steps, the learning rate is gradually decayed to 1×10⁻⁶ using a cosine annealing strategy;
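The shape of this schedule can be sketched by hand, reading the peak rate as 1×10⁻⁴ (the AdamW learning rate above) and the final rate as 1×10⁻⁶. This is not PyTorch's `OneCycleLR` implementation, whose warm-up shape and starting rate depend on its arguments; the starting rate of 0 and the function name are assumptions.

```python
import math

PEAK_LR = 1e-4      # peak learning rate (matches the AdamW setting)
FINAL_LR = 1e-6     # learning rate at the end of training
WARMUP_FRAC = 0.1   # first 10% of steps are warm-up


def lr_at(step, total_steps, start_lr=0.0):
    """Linear warm-up to PEAK_LR, then cosine annealing down to FINAL_LR.

    `start_lr` (the rate at step 0) is not given in the source; 0.0 is assumed.
    """
    warmup_steps = max(1, int(WARMUP_FRAC * total_steps))
    if step < warmup_steps:
        return start_lr + (PEAK_LR - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 50, 100, 550, 999):
    print(s, f"{lr_at(s, total):.2e}")
```

The rate rises linearly over the first 100 of 1,000 steps, peaks at step 100, and then follows a half cosine down to the final value at the last step.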

[0101] During training, the batch size was set to 32 and the number of training epochs to 100. Automatic mixed precision (AMP) training was used to accelerate training and reduce memory usage. Gradient clipping was also employed, limiting the maximum gradient norm to 1.0 to improve training stability. Furthermore, an early stopping mechanism was introduced that automatically stops training when the validation set accuracy shows no significant improvement for 15 consecutive epochs.
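Two of these training safeguards are simple enough to sketch without a deep learning framework. The helper names, the flat-list representation of gradients, and the `min_delta` threshold for what counts as a significant improvement are assumptions for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta   # threshold for a "significant" gain (assumed)
        self.best = -math.inf
        self.bad_epochs = 0

    def step(self, val_acc):
        """Record one epoch's accuracy; return True if training should stop."""
        if val_acc > self.best + self.min_delta:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


print(clip_by_global_norm([3.0, 4.0]))   # norm 5 is scaled down to norm 1

stopper = EarlyStopping(patience=3)       # short patience just for the demo
history = [0.60, 0.65, 0.64, 0.65, 0.63, 0.70]
stopped_at = next((i for i, acc in enumerate(history) if stopper.step(acc)), None)
print(stopped_at)
```

In the demo, accuracy fails to exceed the best value of 0.65 for three epochs in a row, so training stops before the later improvement to 0.70 is reached; a patience of 15, as in the text, makes such premature stops less likely.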

[0102] Preferably, after the data preprocessing step and before the step of constructing the dual-stream feature extraction network, a data augmentation step is also included to improve robustness. The data augmentation step includes frequency-domain masking augmentation, time-domain masking augmentation, or random noise augmentation. Specifically: frequency-domain masking augmentation randomly masks 15 consecutive Mel frequency bands with a 50% probability to enhance robustness to frequency changes; time-domain masking augmentation randomly masks 35 consecutive time frames with a 50% probability to enhance robustness to local time changes; and random noise augmentation adds Gaussian noise to the waveform signal with a 30% probability, at a signal-to-noise ratio of approximately 46 dB.
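The three augmentations can be sketched with NumPy as follows. Applying them independently in one call, masking with zeros, and deriving the noise level from the mean power of the segment are assumptions; the source gives only the mask widths, the probabilities, and the approximate signal-to-noise ratio.

```python
import numpy as np


def augment(mel, wave, rng, freq_width=15, time_width=35,
            p_freq=0.5, p_time=0.5, p_noise=0.3, snr_db=46.0):
    """Apply the three augmentations independently, each with its own probability."""
    mel, wave = mel.copy(), wave.copy()
    n_mels, n_frames = mel.shape[-2], mel.shape[-1]

    if rng.random() < p_freq:                       # frequency masking
        f0 = rng.integers(0, n_mels - freq_width + 1)
        mel[..., f0:f0 + freq_width, :] = 0.0       # mask value 0 is an assumption

    if rng.random() < p_time:                       # time masking
        t0 = rng.integers(0, n_frames - time_width + 1)
        mel[..., :, t0:t0 + time_width] = 0.0

    if rng.random() < p_noise:                      # additive Gaussian noise
        signal_power = np.mean(wave ** 2)
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        wave = wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)

    return mel, wave


rng = np.random.default_rng(0)
mel = np.ones((1, 64, 313))           # placeholder Mel spectrogram
wave = rng.standard_normal(80_000)    # placeholder waveform segment
aug_mel, aug_wave = augment(mel, wave, rng)
print(aug_mel.shape, aug_wave.shape)
```

Both outputs keep their input shapes, so the augmented pair can be fed to the two branches unchanged. At 46 dB the noise power is about 40,000 times smaller than the signal power, which makes this a mild perturbation.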

[0103] Example 2

[0104] To verify the effectiveness of the method of the present invention, this example applies the multimodal deep learning marine vessel noise classification method based on bilinear fusion provided in Example 1 to the dataset described below, to illustrate the beneficial effects of the classification method provided by the present invention.

[0105] We conducted a ship noise classification experiment using the DeepShip public dataset. This dataset contains four categories of ship radiated noise: cargo ships, passenger ships, tankers, and tugboats.

[0106] During the experiment, the dataset was randomly divided into training, validation, and test sets at a ratio of 70%, 20%, and 10%, respectively, and 5-fold cross-validation was used to evaluate model performance. To verify the advantages of the bilinear fusion method proposed in this invention, several comparison methods were set up: single-modal convolutional neural networks, a feature concatenation fusion method, an attention fusion method, a Transformer fusion method, and the bilinear fusion method proposed in this invention. The experimental results are shown in Figures 3-6 and analyzed as follows:

[0107] As shown in Figure 3, the accuracy and F1 score of the present invention (Fusion_Bilinear) are compared with those of the single-modal WaveNet and MelNet and of multimodal concatenation fusion (Fusion_Concat), attention fusion (Fusion_Attention), and Transformer fusion (Fusion_Transformer). The figure shows that the multimodal fusion methods perform significantly better than the traditional single-modal neural networks, and that the bilinear fusion multimodal neural network achieves the highest accuracy.

[0108] As shown in Figure 4, five-fold cross-validation was used to compare the mean and standard deviation of the accuracy of the present invention (Fusion_Bilinear) with those of the single-modal WaveNet and MelNet and of multimodal concatenation fusion (Fusion_Concat), attention fusion (Fusion_Attention), and Transformer fusion (Fusion_Transformer). This verifies the superiority of the bilinear fusion method, which shows the best consistency and accuracy across multiple experiments.

[0109] As shown in Figure 5, the validation set loss values during training are compared between the present invention (Fusion_Bilinear) and the single-modal WaveNet and MelNet as well as multimodal concatenation fusion (Fusion_Concat), attention fusion (Fusion_Attention), and Transformer fusion (Fusion_Transformer). The loss curves of the experiments further verify the superiority of the bilinear fusion method, which performs best in terms of consistency and accuracy across multiple experiments.

[0110] As shown in Figure 6, the time required for each training epoch is compared between the present invention (Fusion_Bilinear) and the single-modal WaveNet and MelNet as well as multimodal concatenation fusion (Fusion_Concat), attention fusion (Fusion_Attention), and Transformer fusion (Fusion_Transformer). The figure shows that the multimodal algorithm with the bilinear fusion strategy achieves the best accuracy with relatively low time consumption: it reaches the highest accuracy among the compared methods without requiring the most training time.

[0111] In summary, the multimodal deep learning-based marine vessel noise classification method based on bilinear fusion provided in this invention adds second-order information interaction between modalities by employing a bilinear fusion strategy. Compared with strategies such as feature concatenation and attention fusion, this method increases the dimensionality and hierarchy of feature extraction. Although this raises the computational cost to some extent, it achieves deeper feature extraction and therefore the best performance among the compared methods. This invention achieves a classification accuracy of 99.4% on the public dataset, significantly outperforming traditional methods and existing fusion strategies. It offers high accuracy, strong robustness, and good generalization ability, and can be widely applied in research related to marine environmental monitoring, shipping safety supervision, and military reconnaissance, providing a new approach to performance improvement.

[0112] From the above description of several embodiments of the multimodal deep learning-based marine vessel noise classification method based on bilinear fusion of the present invention, it can be seen that these embodiments have at least one or more of the following advantages:

[0113] 1. The multimodal deep learning marine vessel noise classification method based on bilinear fusion provided by this invention achieves fine-grained interaction between waveform features and spectral features through bilinear fusion. It computes the weighted sum of the outer product of the dimensionality-reduced waveform feature vector and spectral feature vector to model the second-order combination relationship between the elements of the two feature vectors. Compared with shallow fusion methods such as traditional feature concatenation or weighted averaging, it more fully exploits the complementary information of the two modalities and improves the expressive power and discriminative power of the fused features.

[0114] 2. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion provided by this invention extracts complementary multimodal features from time-domain waveform signals and frequency-domain Mel spectrograms by constructing parallel waveform feature extraction networks and spectral feature extraction networks, respectively, thus overcoming the deficiency of incomplete information in single-modal systems.

[0115] 3. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion provided by this invention achieves deep interactive fusion of multimodal features while ensuring computational feasibility, effectively improving the accuracy of vessel noise classification.
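
The outer-product weighted sum referred to in advantage 1 can be sketched as follows for illustration. The function name, the nested-list layout of the weight tensor (`weight[i][j][k]`), and the use of ReLU as the nonlinear processing are choices made for this sketch rather than a definitive implementation.

```python
def bilinear_fusion(f_w, f_s, weight, bias):
    """Fuse two feature vectors through a bilinear form followed by ReLU.

    f_w: dimensionality-reduced waveform feature vector.
    f_s: dimensionality-reduced spectral feature vector.
    weight: three-dimensional tensor indexed as weight[i][j][k].
    bias: bias vector with one entry per output dimension k.
    """
    fused = []
    for k in range(len(bias)):
        z_k = bias[k]
        for i, x in enumerate(f_w):
            for j, y in enumerate(f_s):
                # Each pair (i, j) of the outer product contributes with its own weight.
                z_k += weight[i][j][k] * x * y
        fused.append(max(0.0, z_k))  # ReLU nonlinearity
    return fused
```

Every pairwise product of a waveform element and a spectral element has its own weight for each output dimension, which is what distinguishes this fusion from concatenation or weighted averaging. It is also the source of the additional computational cost: with two 128-dimensional inputs, the weight tensor holds 128 × 128 entries per output dimension.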

[0116] Finally, it should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0117] The above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them; although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications can still be made to the specific implementation of the present invention or equivalent substitutions can be made to some technical features without departing from the spirit of the technical solutions of the present invention, and all such modifications and substitutions should be covered within the scope of the technical solutions claimed in the present invention.

Claims

1. A multimodal deep learning-based marine vessel noise classification method based on bilinear fusion, characterized in that it includes the following steps: Data preprocessing: obtain the raw audio data of ship noise, resample and segment it, and extract the corresponding waveform signals and Mel spectrograms respectively; Constructing a dual-stream feature extraction network: construct parallel waveform feature extraction and spectral feature extraction networks; input the waveform signal into the waveform feature extraction network to obtain the waveform feature vector $f_w$; input the Mel spectrogram into the spectral feature extraction network to obtain the spectral feature vector $f_s$; Bilinear feature fusion: reduce the dimensionality of the waveform feature vector $f_w$ and the spectral feature vector $f_s$ separately; through bilinear interaction, calculate the weighted sum of the outer product of the dimensionality-reduced waveform feature vector $f'_w$ and the dimensionality-reduced spectral feature vector $f'_s$ to obtain the fused feature vector $z$; then apply nonlinear processing to the fused feature vector $z$ to obtain the final fused feature vector $z'$; the fused feature vector $z$ is calculated by constructing a three-dimensional bilinear weight tensor $W$ and a bias term $b$, where, for each output dimension $k$ of the fused feature, $z_k$ is calculated as shown in equation (2): $z_k = \sum_{i} \sum_{j} W_{ijk} \, f'_{w,i} \, f'_{s,j} + b_k$ (2); in equation (2), $W$ is a learnable bilinear weight tensor whose element $W_{ijk}$ is a scalar that quantifies the contribution of the interaction between the $i$-th element of the projected waveform feature and the $j$-th element of the projected spectral feature to the $k$-th dimension of the fused feature vector; $b$ is a learnable bias vector; $f'_{w,i}$ and $f'_{s,j}$ are the $i$-th element of the projected waveform feature vector and the $j$-th element of the projected spectral feature vector, respectively; Classification: map the final fused feature vector $z'$ to the ship category space to obtain the probability distribution of ship types, and output the classification result.

2. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 1, characterized in that, in the step of constructing the dual-stream feature extraction network, the waveform feature extraction network employs multiple cascaded convolutional modules combining one-dimensional convolution, batch normalization, rectified linear (ReLU) activation, and pooling to abstract the waveform signal features layer by layer, and outputs a 256-dimensional waveform feature vector $f_w$ after adaptive pooling; the spectral feature extraction network employs multiple cascaded convolutional modules combining two-dimensional convolution, batch normalization, rectified linear (ReLU) activation, and pooling to abstract features layer by layer from the Mel spectrogram, and outputs a 256-dimensional spectral feature vector $f_s$ after adaptive pooling.

3. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 2, characterized in that, in the bilinear feature fusion step, the dimensionality reduction method includes: as shown in equation (1), projecting the waveform feature vector $f_w$ and the spectral feature vector $f_s$ to 128 dimensions through fully connected layers to obtain a 128-dimensional waveform feature vector $f'_w$ and a 128-dimensional spectral feature vector $f'_s$; the expression of equation (1) is: $f'_w = W_w f_w + b_w, \quad f'_s = W_s f_s + b_s$ (1); in equation (1), $W_w, W_s \in \mathbb{R}^{128 \times 256}$ are the projection matrices, $b_w, b_s \in \mathbb{R}^{128}$ are the corresponding bias terms, and $\mathbb{R}$ represents the space of real numbers.

4. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 1, characterized in that the bilinear feature fusion step also includes: applying the ReLU activation function to the fused feature vector $z$ for nonlinear processing to obtain the final fused feature vector $z'$.

5. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 4, characterized in that, in the classification step, the Softmax function is used to establish a probability distribution model for ship types, as shown in equation (3): $\hat{y} = \mathrm{Softmax}(W_c z' + b_c)$ (3); in equation (3), $\hat{y}$ is the predicted probability distribution vector, $W_c$ is the classification weight matrix, and $b_c$ is the classification bias term; the Softmax function ensures that all elements of $\hat{y}$ are non-negative and sum to 1.

6. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 1 or 5, characterized in that it also includes a training and optimization step, which includes: training the ship type probability distribution model using a label-smoothing cross-entropy loss function and optimizing it using an optimizer; the expression of the cross-entropy loss function is shown in equation (4): $L = -\sum_{i=1}^{C} \left[ (1-\varepsilon)\, y_i + \frac{\varepsilon}{C} \right] \log \hat{y}_i$ (4); in equation (4), $L$ is the scalar loss value, $\varepsilon$ is the label smoothing factor, $C$ is the total number of ship categories, $y$ is the one-hot vector of the true label, with 1 for the correct category and 0 for the rest, and $\hat{y}$ is the probability distribution predicted by the model.

7. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 6, characterized in that the optimization using the optimizer includes: employing the AdamW optimizer in conjunction with the OneCycleLR learning rate scheduling strategy.

8. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 1, characterized in that the data preprocessing step specifically includes: converting the sampling frequency of the ship noise audio signal to 16 kHz; dividing the resampled audio signal into segments of a preset fixed duration, and zero-padding segments shorter than the preset duration, to obtain waveform signals with 80,000 sampling points; and extracting the Mel spectrogram of each audio segment with the number of FFT points set to 1024, the hop length set to 256, and the number of Mel frequency bands set to 64, to obtain a Mel spectrogram with dimensions [1, 64, 313].

9. The multimodal deep learning-based marine vessel noise classification method based on bilinear fusion according to claim 1 or 8, characterized in that the data preprocessing step is followed by a data augmentation step, which includes: performing frequency-domain masking on the Mel spectrogram with a preset probability, performing time-domain masking on the Mel spectrogram with a preset probability, or adding Gaussian noise to the waveform signal with a preset probability.